Searching text in Django models

25 Feb 2021 9 minutes

The need to perform text searches is something that often pops up in web applications, so much so that there are many projects dedicated solely to solving this problem. In this post I will cover some approaches to perform text searches in your own Django app without resorting to database-specific functionality or external searching solutions. At the end of this post you’ll have a flexible search solution that can order search results by relevance.

Let’s assume that you have the following model that you want to allow users to search through:

class Product(models.Model):
    name = models.CharField(max_length=100)
    description = models.TextField()

Throughout this post we’ll be implementing our search logic in a function named search, which takes one string argument representing the search query. Ultimately this function will also sort the search results, but we’ll get to that later on.

def search(query):
    ...
    return products

Finally, we assume that these are the products currently loaded in our database:

ID  Name            Description
1   Flour           Add sugar and you’ve got a cake
2   Brown Sugar     World’s best brown sugar
3   Icing Sugar     Also known as frosting or powdered sugar
4   Glucose Liquid  Clear sugar syrup
5   Cupcake         Because usual sized cakes are too expensive

Disclaimer: I’m not a baker, so don’t take any baking advice from this post.

Approach 1: Just plain __icontains

This is the easiest approach, and works fine when you’re searching very specific single-word terms (e.g. part numbers, barcodes, etc.). The search function might look something like this:

from django.db.models import Q

def search(query):
    return Product.objects.filter(
        Q(name__icontains=query) | Q(description__icontains=query)
    )

This approach however quickly falls apart when using multi-term searches or any deviations from the original text. For instance “sugar brown” will return no results. This filter will also not scale well when there are millions of records to search through, so we’ll need to do better than this.
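
To see why a single substring filter misses reordered terms, the sketch below imitates the __icontains check in plain Python against the sample products. The PRODUCTS list and the icontains_search helper are stand-ins invented for illustration; they are not part of Django.

```python
# Plain-Python stand-in for the sample table; not a Django model.
PRODUCTS = [
    (1, "Flour", "Add sugar and you've got a cake"),
    (2, "Brown Sugar", "World's best brown sugar"),
    (3, "Icing Sugar", "Also known as frosting or powdered sugar"),
    (4, "Glucose Liquid", "Clear sugar syrup"),
    (5, "Cupcake", "Because usual sized cakes are too expensive"),
]


def icontains_search(query):
    """Return the ids of products whose name or description contains the
    whole query as one substring, ignoring case."""
    needle = query.lower()
    return [
        pid for pid, name, description in PRODUCTS
        if needle in name.lower() or needle in description.lower()
    ]


print(icontains_search("brown sugar"))  # [2]
print(icontains_search("sugar brown"))  # []
```

The query is treated as one indivisible string, so swapping the two words is enough to lose the match.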

Approach 2: Commutative __icontains

This approach is a slight improvement from approach 1, where the order of the search terms doesn’t matter anymore. The basic approach goes something like this:

def search(query):
    # build the filter term-by-term
    q = Q()
    for term in query.split():
        q &= Q(name__icontains=term) | Q(description__icontains=term)

    # perform the query
    return Product.objects.filter(q)

At least this time “sugar brown” will also match our “Brown Sugar” product as expected, but this still suffers from the inefficiency of approach 1. It is also still too inflexible for general searching: searching for “worlds best brown sugar” will not match the “Brown Sugar” product, since “worlds” is missing the apostrophe. At this point __icontains can no longer help us and we need to find a better approach.
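
The same behaviour can be checked without a database. The following is a rough plain-Python imitation of the term-by-term filter; PRODUCTS and term_search are hypothetical names used only for this sketch.

```python
# Two of the sample products are enough to show the behaviour.
PRODUCTS = [
    (1, "Flour", "Add sugar and you've got a cake"),
    (2, "Brown Sugar", "World's best brown sugar"),
]


def term_search(query):
    """Return the ids of products where every whitespace-separated term
    appears in the name or the description, ignoring case."""
    terms = query.lower().split()
    return [
        pid for pid, name, description in PRODUCTS
        if all(t in name.lower() or t in description.lower() for t in terms)
    ]


print(term_search("sugar brown"))              # [2]
print(term_search("worlds best brown sugar"))  # []
```

The second query fails only because “worlds” is not a substring of “world's”, which is the kind of mismatch the keyword index is meant to absorb.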

Approach 3: Keyword index

We now introduce a new Keyword model - this model will store interesting keywords that we will be extracting from our Product models. Then, instead of searching our Products directly we will be searching through our keywords instead.

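from django.db import models
from django.db.models import Index, UniqueConstraint

class Keyword(models.Model):
    # related_name makes keyword_set usable both as the reverse accessor
    # (product.keyword_set) and in query lookups (keyword_set__...); without
    # it the query lookups would have to be spelled keyword__... instead
    product = models.ForeignKey(
        Product, on_delete=models.CASCADE, related_name='keyword_set',
    )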
    keyword = models.CharField(max_length=50)

    class Meta:
        indexes = [Index(fields=['keyword'])]
        constraints = [
            UniqueConstraint(fields=['keyword', 'product'], name='unique_keyword'),
        ]

Note that we’ve added an index on the keyword field of the model. This makes the search more efficient (you can read more about indexes in the Django docs).

To generate these keywords we will also be introducing two new functions: one is to generate keywords from a piece of text and the other is to save these keywords for a given product.

To generate our keywords we’ll split the text into a list of tokens and pre-process these tokens a little bit by removing all non-alphanumeric characters and making all the tokens lowercase:

# given a string generates a set of keywords for searching.
def generate_keywords(text):
    # split the text into a list of lowercase tokens
    tokens = [s.lower() for s in text.split()]

    # remove non alpha-numeric characters from tokens (isalnum keeps digits,
    # isalpha would strip them)
    tokens = [''.join(c for c in token if c.isalnum()) for token in tokens]

    # remove empty and duplicate tokens and limit length of all tokens to 50
    # (max keyword length for Keyword model)
    tokens = set(t[:50] for t in tokens if t)

    return tokens
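
As a quick check, here is the function restated so that the snippet runs on its own (using str.isalnum, in line with the “non-alphanumeric” description above), followed by a couple of sample calls.

```python
def generate_keywords(text):
    # split the text into a list of lowercase tokens
    tokens = [s.lower() for s in text.split()]

    # keep only letters and digits in every token
    tokens = [''.join(c for c in token if c.isalnum()) for token in tokens]

    # drop empty and duplicate tokens, truncate to 50 characters
    return set(t[:50] for t in tokens if t)


print(sorted(generate_keywords("World’s best brown sugar")))
# ['best', 'brown', 'sugar', 'worlds']

# the apostrophe and the capitalisation no longer matter
print(generate_keywords("world’s") == generate_keywords("Worlds"))  # True
```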

We then extend our original Product model with a new function to generate keywords whenever we save the Product:

class Product(models.Model):
    ...

    def generate_index(self):
        # generate product keywords
        name_keywords = generate_keywords(self.name)
        description_keywords = generate_keywords(self.description)

        keywords = name_keywords.union(description_keywords)

        # remove all outdated keywords
        self.keyword_set.exclude(keyword__in=keywords).delete()

        # create the new keywords
        current = set(self.keyword_set.values_list('keyword', flat=True))
        create = keywords - current

        Keyword.objects.bulk_create([
            Keyword(keyword=keyword, product=self) for keyword in create
        ])

    def save(self, *args, **kwargs):
        # do the usual saving
        super().save(*args, **kwargs)

        # generate new keywords
        self.generate_index()

Finally, we update our search function to now search through the Keywords instead of the Products directly:

from django.db.models import Count

def search(query):
    # generate keywords to search for
    keywords = generate_keywords(query)

    products = Product.objects.filter(
        # find all products with matching keywords
        keyword_set__keyword__in=keywords,
    ).annotate(
        # count how many keyword matches each product has
        keyword_matches=Count('keyword_set'),
    ).filter(
        # filter for products that match all of the keywords
        keyword_matches__gte=len(keywords),
    )

    return products

Note that the search function uses the same generate_keywords function to transform the search query into a set of keywords to search for. To filter the products, we annotate each product with a Count of how many keyword matches it had (see the Django documentation on annotations). If the number of matches is greater than or equal to the number of search keywords, we’ve got a hit (all the keywords matched).
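
The counting logic can be illustrated without a database. The sketch below assumes a hand-written in-memory index (INDEX) in place of the Keyword table, and keyword_search is a hypothetical helper; it only mirrors the “number of matches >= number of query keywords” rule.

```python
# Hypothetical stand-in for the Keyword table: product id -> stored keywords.
INDEX = {
    1: {"flour", "add", "sugar", "and", "youve", "got", "a", "cake"},
    2: {"brown", "sugar", "worlds", "best"},
    3: {"icing", "sugar", "also", "known", "as", "frosting", "or", "powdered"},
    4: {"glucose", "liquid", "clear", "sugar", "syrup"},
    5: {"cupcake", "because", "usual", "sized", "cakes", "are", "too", "expensive"},
}


def keyword_search(query_keywords):
    """Return ids of products that contain every query keyword."""
    if not query_keywords:
        return []
    hits = []
    for pid, stored in INDEX.items():
        # count how many of the query keywords this product has
        matches = len(stored & query_keywords)
        if matches >= len(query_keywords):
            hits.append(pid)
    return hits


print(keyword_search({"sugar", "brown"}))                    # [2]
print(keyword_search({"worlds", "best", "brown", "sugar"}))  # [2]
print(keyword_search({"sugar"}))                             # [1, 2, 3, 4]
```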

This method solves two things. Firstly, it is a lot more efficient than the __icontains method since all the keywords are indexed and pre-processed beforehand, and by using annotations we make sure that all the heavy lifting is done by the database (as it should be). Secondly, and most importantly, the whole search process is now a lot more flexible: the search is already insensitive to non-alphanumeric characters (i.e. searching for “world’s” or “worlds” will return the same results), but we can go even further.

Approach 4: Keyword stemming

The previous approach is already working well, but now we’ve run into a problem. A hungry customer searched for “cupcakes” but found no results because we only have a “cupcake” for sale. This is where stemming comes in. Stemming is the process whereby inflections are removed from words, e.g. “running” -> “run”, “runs” -> “run” and so forth.

If we made use of keyword stemming the “cupcakes” search would have matched our “cupcake” product. Heck, even “cupcaking” would work. To do stemming we’ll make use of the nltk library, a Python library for doing natural-language processing.

To perform stemming on a word, you can use nltk in the following manner:

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# all of these will print "cupcak"
print(stemmer.stem("cupcake"))
print(stemmer.stem("cupcakes"))
print(stemmer.stem("cupcaking"))

We’ll be using the PorterStemmer for this post due to its simplicity and speed, but nltk supports quite a few stemming algorithms. We now update our generate_keywords function to perform stemming on all tokens that it thinks are words (you might need to adjust this logic depending on your application):

# set up the stemmer
stemmer = PorterStemmer()

def generate_keywords(text):
    # split the text into a list of lowercase tokens
    tokens = [s.lower() for s in text.split()]

    # remove non alpha-numeric characters from tokens (isalnum keeps digits,
    # so tokens such as part numbers survive for the check below)
    tokens = [''.join(c for c in token if c.isalnum()) for token in tokens]

    # split the tokens into words and "other" - a token is considered a word if
    # it doesn't contain any digits
    words = []
    other = []
    for token in tokens:
        (other if any(s.isdigit() for s in token) else words).append(token)

    # apply stemming to words
    words = [stemmer.stem(w) for w in words]

    # remove empty and duplicate tokens and limit length of all tokens to 50
    # (max keyword length for Keyword model)
    tokens = set(t[:50] for t in words + other if t)

    return tokens

Stemming is language specific: if you want to apply stemming to non-English keywords you will probably need to use another stemming algorithm.
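
To show how the word/other split behaves without installing nltk, the sketch below swaps in a deliberately naive stemmer. toy_stem is an invented stand-in (it only strips one trailing “s”) and is not how the Porter algorithm works; the stem parameter is also an assumption added for this illustration.

```python
def toy_stem(word):
    # Naive stand-in for PorterStemmer().stem: strip a single trailing "s".
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


def generate_keywords(text, stem=toy_stem):
    tokens = [s.lower() for s in text.split()]
    tokens = [''.join(c for c in token if c.isalnum()) for token in tokens]

    # tokens containing a digit are treated as "other" and left unstemmed
    words, other = [], []
    for token in tokens:
        (other if any(c.isdigit() for c in token) else words).append(token)

    words = [stem(w) for w in words]
    return set(t[:50] for t in words + other if t)


print(sorted(generate_keywords("Cupcakes, size XL-200s")))
# ['cupcake', 'size', 'xl200s']
```

The token “xl200s” keeps its trailing “s” because it contains digits and therefore skips stemming. Passing stem=PorterStemmer().stem should give the behaviour described in the post.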

The searching process is now much more flexible than when we started out and even those people that search for “cupcaking” will now be satisfied. We only have one problem left to solve. If a customer searches for “sugar”, they will be shown the following results:

ID  Name            Description
1   Flour           Add sugar and you’ve got a cake
2   Brown Sugar     World’s best brown sugar
3   Icing Sugar     Also known as frosting or powdered sugar
4   Glucose Liquid  Clear sugar syrup
Flour? They don’t care about that - yet it is the very first result even though it only had the “sugar” term in its description. The final problem to solve is making sure that the customer sees relevant results first.

Approach 5: Scoring search results

To order our search results we will need to assign some score to each result so that more relevant results are shown first. To do that we extend our original Keyword class and add an importance field - the more important a keyword is the higher its importance value will be:

class Keyword(models.Model):
    ...
    importance = models.FloatField()

We then also update our generate_index function in our Product model to assign the importance to keywords:

class Product(models.Model):
    ...

    def generate_index(self):
        # generate product keywords
        name_keywords = generate_keywords(self.name)
        description_keywords = generate_keywords(self.description)

        # remove duplicate keywords
        description_keywords -= name_keywords

        # assign importance to keywords
        pairs = [
            (1.1, name_keywords),
            (0.5, description_keywords),
        ]

        # remove all outdated keywords (we now need to take keywords whose
        # importance has changed into account)
        q = Q()
        for importance, keywords in pairs:
            for keyword in keywords:
                q |= Q(
                    keyword=keyword,
                    importance__range=(importance - 0.05, importance + 0.05),
                )

        self.keyword_set.exclude(q).delete()

        # create the new keywords
        current = set(self.keyword_set.values_list('keyword', flat=True))
        create = name_keywords.union(description_keywords) - current

        # named new_keywords so that it doesn't shadow django.db.models
        new_keywords = []
        for importance, keywords in pairs:
            new_keywords += [Keyword(
                product=self,
                keyword=keyword,
                importance=importance,
            ) for keyword in keywords.intersection(create)]

        Keyword.objects.bulk_create(new_keywords)

The importance values I chose here are rather arbitrary, but they roughly encode that one keyword match in the product name (1.1) is worth more than two keyword matches in the description (2 × 0.5 = 1.0).
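
The weighting can be sketched on its own, outside the model. weigh_keywords and the two constants are hypothetical names for this illustration; the values are the ones used in generate_index above.

```python
NAME_IMPORTANCE = 1.1
DESCRIPTION_IMPORTANCE = 0.5


def weigh_keywords(name_keywords, description_keywords):
    """Map every keyword to its importance. A keyword that appears in both
    the name and the description gets the (higher) name importance."""
    weights = {k: DESCRIPTION_IMPORTANCE for k in description_keywords}
    weights.update({k: NAME_IMPORTANCE for k in name_keywords})
    return weights


weights = weigh_keywords(
    {"brown", "sugar"},
    {"worlds", "best", "brown", "sugar"},
)
print(weights["sugar"], weights["best"])  # 1.1 0.5

# one name match outweighs two description matches
print(NAME_IMPORTANCE > 2 * DESCRIPTION_IMPORTANCE)  # True
```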

Finally we update our search function as well:

from django.db.models import Sum

def search(query):
    # generate keywords to search for
    keywords = generate_keywords(query)

    products = Product.objects.filter(
        # find all products with matching keywords
        keyword_set__keyword__in=keywords,
    ).annotate(
        # calculate how well the product matches the search
        search_score=Sum('keyword_set__importance'),
    ).order_by(
        # order the products according to how well they match the search
        '-search_score',
    )

    return products

We calculate how well each product matches the search by summing the importance of every keyword it matches, and then sort the Products by that score. Note that we no longer check whether a Product matches all of the keywords, since Products that match all of the keywords will probably rank first.

Now if a customer searches for “brown sugar” the results will look something like this:

ID  Name            Description
2   Brown Sugar     World’s best brown sugar
3   Icing Sugar     Also known as frosting or powdered sugar
1   Flour           Add sugar and you’ve got a cake
4   Glucose Liquid  Clear sugar syrup
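
This ordering can be reproduced by hand. The sketch below assumes an abbreviated, hand-written index (only a few keywords per product, no stemming); INDEX and scored_search are illustrative names, not part of the code above.

```python
# Hypothetical keyword -> importance index per product id (abbreviated).
INDEX = {
    1: {"flour": 1.1, "sugar": 0.5, "cake": 0.5},
    2: {"brown": 1.1, "sugar": 1.1, "best": 0.5},
    3: {"icing": 1.1, "sugar": 1.1, "frosting": 0.5},
    4: {"glucose": 1.1, "liquid": 1.1, "sugar": 0.5},
    5: {"cupcake": 1.1, "cakes": 0.5},
}


def scored_search(query_keywords):
    """Return product ids ordered by the summed importance of the query
    keywords they match; products with no match are left out."""
    scores = {}
    for pid, weights in INDEX.items():
        matched = [weights[k] for k in query_keywords if k in weights]
        if matched:
            scores[pid] = sum(matched)
    # highest score first; sorted() is stable, so ties keep index order
    return sorted(scores, key=lambda pid: -scores[pid])


print(scored_search({"brown", "sugar"}))  # [2, 3, 1, 4]
```

Products 1 and 4 both score 0.5 here. In this sketch the tie falls back to insertion order, while a database makes no such promise unless a second ordering field is added.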

Further steps and additional reading

We don’t need to stop here: we can add spellchecking, translations or anything else we can fit into the generate_keywords function. Django also has some text searching functions built in, but these are PostgreSQL-specific (see the full text search section of the Django PostgreSQL documentation).


© 2024 Kobus van Schoor