# Searching text in Django models
The need to perform text searches is something that often pops up in web applications, so much so that there are many projects dedicated solely to solving this problem. In this post I will cover some approaches to perform text searches in your own Django app without resorting to database-specific functionality or external searching solutions. At the end of this post you’ll have a flexible search solution that can order search results by relevance.
Let’s assume that you have the following model that you want to allow users to search through:
```python
class Product(models.Model):
    name = models.CharField(max_length=100)
    description = models.TextField()
```
Throughout this post we’ll be implementing our search in a function named `search`, which takes one string argument representing the search query. Ultimately this function will also sort the search results, but we’ll get to that later on.
```python
def search(query):
    ...
    return products
```
Finally, assume that these are the products currently loaded in our database:
| ID | Name | Description |
|---|---|---|
| 1 | Flour | Add sugar and you’ve got a cake |
| 2 | Brown Sugar | World’s best brown sugar |
| 3 | Icing Sugar | Also known as frosting or powdered sugar |
| 4 | Glucose Liquid | Clear sugar syrup |
| 5 | Cupcake | Because usual sized cakes are too expensive |
Disclaimer: I’m not a baker, so don’t take any baking advice from this post.
## Approach 1: Just plain `__icontains`
This is the easiest approach, and works fine when you’re searching very specific single-word terms (e.g. part numbers, barcodes, etc.). The search function might look something like this:
```python
from django.db.models import Q

def search(query):
    return Product.objects.filter(
        Q(name__icontains=query) | Q(description__icontains=query)
    )
```
However, this approach quickly falls apart with multi-term searches or any deviation from the original text. For instance, “sugar brown” will return no results. This filter will also not scale well when there are millions of records to search through, so we’ll need to do better than this.
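The failure is easy to reproduce outside of Django, since `__icontains` boils down to a case-insensitive substring test. A minimal sketch in plain Python (no database; `icontains` here is an illustrative stand-in, not a Django API):

```python
def icontains(haystack, needle):
    # case-insensitive substring test, mimicking Django's __icontains lookup
    return needle.lower() in haystack.lower()

description = "World's best brown sugar"
# the query matches as a literal substring...
print(icontains(description, "brown sugar"))  # True
# ...but reordering the terms breaks the match completely
print(icontains(description, "sugar brown"))  # False
```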
## Approach 2: Commutative `__icontains`
This approach is a slight improvement on approach 1: the order of the search terms no longer matters. The basic approach goes something like this:
```python
from django.db.models import Q

def search(query):
    # build the filter term-by-term
    q = Q()
    for term in query.split():
        q &= Q(name__icontains=term) | Q(description__icontains=term)
    # perform the query
    return Product.objects.filter(q)
```
At least this time “sugar brown” will also match our “Brown Sugar” product as expected, but this still suffers from the inefficiency of approach 1. It is also still too inflexible for general searching: searching for “worlds best brown sugar” will not match the “Brown Sugar” product, since “worlds” is missing its apostrophe. At this point `__icontains` can no longer help us and we will need to find a better approach.
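The behaviour of this approach can likewise be sketched in plain Python: every term must appear somewhere in the text, so term order no longer matters, but a missing apostrophe still breaks the match (the `matches` helper below is illustrative, not part of Django):

```python
def matches(text, query):
    # every search term must appear somewhere in the text, mirroring the
    # AND-ed __icontains filters built up term-by-term above
    text = text.lower()
    return all(term in text for term in query.lower().split())

description = "World's best brown sugar"
print(matches(description, "sugar brown"))  # True: order no longer matters
print(matches(description, "worlds best"))  # False: "worlds" lacks the apostrophe
```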
## Approach 3: Keyword index
We now introduce a new `Keyword` model. This model will store interesting keywords that we extract from our `Product` models; then, instead of searching our `Product`s directly, we will search through the keywords instead.
```python
from django.db import models
from django.db.models import Index, UniqueConstraint

class Keyword(models.Model):
    # related_name makes `keyword_set` usable both as the reverse accessor
    # and in query lookups further down
    product = models.ForeignKey(
        Product, on_delete=models.CASCADE, related_name='keyword_set',
    )
    keyword = models.CharField(max_length=50)

    class Meta:
        indexes = [Index(fields=['keyword'])]
        constraints = [
            UniqueConstraint(fields=['keyword', 'product'], name='unique_keyword'),
        ]
```
Note that we’ve added an index on the `keyword` field; this will help make the search more efficient (you can read up more on Django `Index`es in the docs).
To generate these keywords we will also be introducing two new functions: one is to generate keywords from a piece of text and the other is to save these keywords for a given product.
To generate our keywords we’ll split the text into a list of tokens and pre-process these tokens a little bit by removing all non-alphanumeric characters and making all the tokens lowercase:
```python
# given a string, generates a set of keywords for searching
def generate_keywords(text):
    # split the text into a list of lowercase tokens
    tokens = [s.lower() for s in text.split()]
    # remove non-alphanumeric characters from tokens
    tokens = [''.join(c for c in token if c.isalnum()) for token in tokens]
    # remove empty and duplicate tokens and limit the length of all tokens
    # to 50 (the max keyword length for the Keyword model)
    tokens = set(t[:50] for t in tokens if t)
    return tokens
```
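Condensing the function above into a self-contained snippet, this is what it produces for one of our product descriptions (note that the apostrophe in “World’s” is stripped away):

```python
def generate_keywords(text):
    # lowercase, strip non-alphanumeric characters, dedupe, cap at 50 chars
    tokens = [s.lower() for s in text.split()]
    tokens = [''.join(c for c in token if c.isalnum()) for token in tokens]
    return set(t[:50] for t in tokens if t)

print(sorted(generate_keywords("World's best brown sugar")))
# ['best', 'brown', 'sugar', 'worlds']
```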
We then extend our original `Product` model with a new method that generates keywords whenever we save the `Product`:
```python
class Product(models.Model):
    ...

    def generate_index(self):
        # generate product keywords
        name_keywords = generate_keywords(self.name)
        description_keywords = generate_keywords(self.description)
        keywords = name_keywords.union(description_keywords)
        # remove all outdated keywords
        self.keyword_set.exclude(keyword__in=keywords).delete()
        # create the new keywords
        current = set(self.keyword_set.values_list('keyword', flat=True))
        create = keywords - current
        Keyword.objects.bulk_create([
            Keyword(keyword=keyword, product=self) for keyword in create
        ])

    def save(self, *args, **kwargs):
        # do the usual saving
        super().save(*args, **kwargs)
        # generate new keywords
        self.generate_index()
```
Finally, we update our search function to search through the `Keyword`s instead of the `Product`s directly:
```python
from django.db.models import Count

def search(query):
    # generate keywords to search for
    keywords = generate_keywords(query)
    products = Product.objects.filter(
        # find all products with matching keywords
        keyword_set__keyword__in=keywords,
    ).annotate(
        # count how many keyword matches each product has
        keyword_matches=Count('keyword_set'),
    ).filter(
        # filter for products that match all of the keywords
        keyword_matches__gte=len(keywords),
    )
    return products
```
Note that the `search` function uses the same `generate_keywords` function to transform the search query into a set of keywords to search for. To filter the products, we annotate each product (see the Django documentation on annotations) with a `Count` of how many keyword matches it had. If the number of matches is greater than or equal to the number of search keywords, we’ve got a hit (all the keywords matched).
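The counting logic is worth internalising, since it is what turns several single-keyword matches into an all-keywords match. Here is the same idea simulated in plain Python over a hypothetical in-memory index (product id mapped to its keyword set; the data mirrors our example products):

```python
# hypothetical in-memory keyword index: product id -> set of keywords
index = {
    1: {"flour", "add", "sugar", "and", "youve", "got", "a", "cake"},
    2: {"brown", "sugar", "worlds", "best"},
}

def search(index, query_keywords):
    # a product is a hit when its number of matching keywords is at least
    # the number of query keywords, i.e. every query keyword matched
    return [
        product_id
        for product_id, keywords in index.items()
        if len(keywords & query_keywords) >= len(query_keywords)
    ]

print(search(index, {"brown", "sugar"}))  # [2]: Flour only matches "sugar"
```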
This method solves two things. Firstly, it is a lot more efficient than the `__icontains` method, since all the keywords are indexed and pre-processed beforehand; by using annotations we also make sure that all the heavy lifting is done by the database (as it should be). Secondly, and most importantly, we’ve now made the whole search process a lot more flexible. The search is already insensitive to non-alphanumeric characters (i.e. searching for “world’s” or “worlds” will return the same results), but we can go even further.
## Approach 4: Keyword stemming
The previous approach is already working well, but now we’ve run into a problem: a hungry customer searched for “cupcakes” but found no results, because we only have a “cupcake” for sale. This is where stemming comes in. Stemming is the process whereby inflections are removed from words, e.g. “running” -> “run”, “runs” -> “run” and so forth.
If we made use of keyword stemming, the “cupcakes” search would have matched our “cupcake” product. Heck, even “cupcaking” would work. To do stemming we’ll make use of the `nltk` library, a Python library for natural-language processing.
To perform stemming on a word, you can use `nltk` in the following manner:
```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# all of these will print "cupcak"
print(stemmer.stem("cupcake"))
print(stemmer.stem("cupcakes"))
print(stemmer.stem("cupcaking"))
```
We’ll be using the `PorterStemmer` for this post due to its simplicity and speed, but `nltk` supports quite a few stemming algorithms. We now update our `generate_keywords` function to perform stemming on all tokens that it thinks are words (you might need to adjust this logic depending on your application):
```python
from nltk.stem import PorterStemmer

# set up the stemmer
stemmer = PorterStemmer()

def generate_keywords(text):
    # split the text into a list of lowercase tokens
    tokens = [s.lower() for s in text.split()]
    # remove non-alphanumeric characters from tokens
    tokens = [''.join(c for c in token if c.isalnum()) for token in tokens]
    # split the tokens into words and "other" - a token is considered a word
    # if it doesn't contain any digits
    words = []
    other = []
    for token in tokens:
        (other if any(s.isdigit() for s in token) else words).append(token)
    # apply stemming to words
    words = [stemmer.stem(w) for w in words]
    # remove empty and duplicate tokens and limit the length of all tokens
    # to 50 (the max keyword length for the Keyword model)
    tokens = set(t[:50] for t in words + other if t)
    return tokens
```
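To see why stemming makes the index more forgiving, here is a toy suffix-stripping stemmer. This is purely illustrative and far cruder than the Porter algorithm that `nltk` implements, but it shows how inflected forms collapse to a shared stem:

```python
def toy_stem(word):
    # crude suffix stripping for illustration only; nltk's PorterStemmer
    # implements the real, far more careful algorithm
    for suffix in ("ing", "es", "e", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# all three forms collapse to the same stem, so they index identically
print(toy_stem("cupcake"), toy_stem("cupcakes"), toy_stem("cupcaking"))
# cupcak cupcak cupcak
```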
The searching process is now much more flexible than when we started out and even those people that search for “cupcaking” will now be satisfied. We only have one problem left to solve. If a customer searches for “sugar”, they will be shown the following results:
| ID | Name | Description |
|---|---|---|
| 1 | Flour | Add sugar and you’ve got a cake |
| 2 | Brown Sugar | World’s best brown sugar |
| 3 | Icing Sugar | Also known as frosting or powdered sugar |
| 4 | Glucose Liquid | Clear sugar syrup |
Flour? They don’t care about that, yet it is the very first result even though it only had the “sugar” term in its description. The final problem to solve is making sure that the customer sees the most relevant results first.
## Approach 5: Scoring search results
To order our search results we will need to assign a score to each result so that more relevant results are shown first. To do that we extend our original `Keyword` class and add an `importance` field; the more important a keyword is, the higher its `importance` value will be:
```python
class Keyword(models.Model):
    ...
    importance = models.FloatField()
```
We then also update the `generate_index` method in our `Product` model to assign the importance to keywords:
```python
from django.db.models import Q

class Product(models.Model):
    ...

    def generate_index(self):
        # generate product keywords
        name_keywords = generate_keywords(self.name)
        description_keywords = generate_keywords(self.description)
        # remove duplicate keywords
        description_keywords -= name_keywords
        # assign importance to keywords
        pairs = [
            (1.1, name_keywords),
            (0.5, description_keywords),
        ]
        # remove all outdated keywords (we now need to take keywords whose
        # importance has changed into account)
        q = Q()
        for importance, keywords in pairs:
            for keyword in keywords:
                q |= Q(
                    keyword=keyword,
                    importance__range=(importance - 0.05, importance + 0.05),
                )
        self.keyword_set.exclude(q).delete()
        # create the new keywords
        current = set(self.keyword_set.values_list('keyword', flat=True))
        create = name_keywords.union(description_keywords) - current
        models = []
        for importance, keywords in pairs:
            models += [Keyword(
                product=self,
                keyword=keyword,
                importance=importance,
            ) for keyword in keywords.intersection(create)]
        Keyword.objects.bulk_create(models)
```
The importance values I chose here were rather arbitrary, but it roughly correlates to one keyword match in the product name being more important than two keyword matches in the description.
Finally we update our search function as well:
```python
from django.db.models import Sum

def search(query):
    # generate keywords to search for
    keywords = generate_keywords(query)
    products = Product.objects.filter(
        # find all products with matching keywords
        keyword_set__keyword__in=keywords,
    ).annotate(
        # calculate how well the product matches the search
        search_score=Sum('keyword_set__importance'),
    ).order_by(
        # order the products according to how well they match the search
        '-search_score',
    )
    return products
```
We calculate how well each product matches the search by summing the `importance` of every keyword it matches, and then sort the `Product`s according to that score. Note that we no longer check whether a `Product` matches all of the keywords, since the `Product`s that match all of the keywords will most likely come out on top anyway.
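The scoring behaviour can again be simulated in plain Python over a hypothetical in-memory index, this time mapping each product’s keywords to their importance:

```python
# hypothetical in-memory index: product id -> {keyword: importance}
index = {
    1: {"flour": 1.1, "add": 0.5, "sugar": 0.5, "cake": 0.5},
    2: {"brown": 1.1, "sugar": 1.1, "worlds": 0.5, "best": 0.5},
}

def search(index, query_keywords):
    # sum the importance of every matched keyword (mirroring the Sum()
    # annotation) and order the hits by descending score
    scores = {
        product_id: sum(
            importance
            for keyword, importance in keywords.items()
            if keyword in query_keywords
        )
        for product_id, keywords in index.items()
    }
    return sorted(
        (pid for pid, score in scores.items() if score > 0),
        key=lambda pid: -scores[pid],
    )

print(search(index, {"brown", "sugar"}))  # [2, 1]
```

“Brown Sugar” matches both query keywords in its name (score 2.2) and therefore outranks “Flour”, which only matches “sugar” in its description (score 0.5).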
Now if a customer searches for “brown sugar” the results will look something like this:
| ID | Name | Description |
|---|---|---|
| 2 | Brown Sugar | World’s best brown sugar |
| 3 | Icing Sugar | Also known as frosting or powdered sugar |
| 1 | Flour | Add sugar and you’ve got a cake |
| 4 | Glucose Liquid | Clear sugar syrup |
## Further steps and additional reading
We don’t need to stop here: we can add spell-checking, translations, or anything else we can fit into the `generate_keywords` function. Django also has some text-search functions built in, but these are PostgreSQL-specific (you can check them out in the docs).