Django Haystack Lessons

By Hamish Downer on 11 June 2014

Haystack etc

We've recently been pushing Haystack hard and learning as we go, so we thought we'd share some of what we learnt along the way.

Pages with CMS PlaceHolders

On a recent project using Django CMS we hit some bugs when indexing models that included placeholders. We wanted the content of the plugins attached to each placeholder to go into the search index template (that is, the text Haystack searches). So we wanted to put the following in the search index template:

{% for placeholder in object.placeholders.all %}
  {% get_placeholder_content %}
  {{ content|strip_tags_keeping_spaces }}
{% endfor %}

So we needed to create a template tag that would produce the content. However we found that some of the plugin types wouldn't render in the search index context, so we needed to exclude them. So our template tag ended up looking like:

from django import template
from searchlist.models import SearchList
from taglist.models import TagList

register = template.Library()
ERROR_MSG = "%r tag takes no arguments; it reads 'placeholder' from the context"

class GetPlaceholderContent(template.Node):

    def render(self, context):
        placeholder_name = template.Variable("placeholder")
        placeholder = placeholder_name.resolve(context)
        plugins = placeholder.get_plugins()
        content = []

        for plugin in plugins:
            if plugin.plugin_type not in ("TagListPlugin", "SearchListPlugin"):
                content.append(plugin.render_plugin(context, placeholder))
        context['content'] = " ".join(content)

        return ""

@register.tag
def get_placeholder_content(parser, token):
    parts = token.split_contents()
    if len(parts) != 1:
        raise template.TemplateSyntaxError(ERROR_MSG % token.contents.split()[0])
    return GetPlaceholderContent()

We also needed to create a template filter that would strip the HTML tags so that the search index wouldn't contain them. That looked like:

import re

from django.utils.encoding import force_unicode

@register.filter
def strip_tags_keeping_spaces(value):
    """
    Usage: {{ a_string|strip_tags_keeping_spaces }}
    This filter will remove html and keep spacing so "<p>hello</p><p>bye</p>"
    becomes "hello bye". It also removes script tags entirely and collapses
    multiple spaces.
    """
    # Non-greedy match, so text between two separate script blocks is kept
    value = re.sub(r'<script[^>]*>[\s\S]*?</script>', ' ', force_unicode(value))
    # Replace every remaining tag with a space, then collapse runs of whitespace
    value = re.sub(r'<[^>]*?>', ' ', value)
    value = re.sub(r'\s{2,}', ' ', value)
    value = value.strip()

    return value
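
To see what the three substitutions do, here is a minimal sketch of the same idea as a plain function, without Django. It assumes the input is already a text string, and it uses a non-greedy script pattern so that text between two separate script blocks survives:

```python
import re


def strip_tags_keeping_spaces(value):
    # 1. Drop each <script>...</script> block, contents included (non-greedy).
    value = re.sub(r'<script[^>]*>[\s\S]*?</script>', ' ', value)
    # 2. Replace every remaining tag with a single space.
    value = re.sub(r'<[^>]*?>', ' ', value)
    # 3. Collapse runs of whitespace and trim the ends.
    value = re.sub(r'\s{2,}', ' ', value)
    return value.strip()


plain = strip_tags_keeping_spaces("<p>hello</p><p>bye</p>")
mixed = strip_tags_keeping_spaces(
    "<p>a</p><script>x()</script><p>b</p><script>y()</script><p>c</p>"
)
```

Here plain comes out as "hello bye" and mixed as "a b c". A greedy script pattern would also swallow the "b" that sits between the two script blocks.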

ManyToManyField

We also had a bug when indexing a ManyToManyField: we were putting the related manager straight into the search index template, when we needed to iterate over the individual items. So if we had a ManyToManyField called regions we'd originally put in:

{{ object.regions }}

whereas we really needed

{% for region in object.regions.all %}
    {{ region }}
{% endfor %}

Sorting by Title

Our client wanted to have sorting by title as an option. They wanted the sort to ignore some words at the start ("A", "The", ...) and to ignore punctuation. So the following would count as sorted correctly:

  • The ABC of Stuff
  • Bananas are not the only fruit
  • "Chocolate tonight?", a study of desserts
  • A Study of Starters
  • Zebras!

However, we still needed to tokenise and search the titles normally, so we indexed the title twice. In the Haystack search_indexes.py file we had:

from haystack.indexes import RealTimeSearchIndex, CharField
from lib.helper import prepare_for_sort

class MyModelIndex(RealTimeSearchIndex):
    text = CharField(document=True, use_template=True)
    title = CharField(model_attr='title')
    # For sorting
    title_sort = CharField()
    # other fields
    # ...

    def prepare_title_sort(self, obj):
        return prepare_for_sort(obj.title)

The prepare_for_sort function looks like:

import regex as re

STOPWORDS = ("a", "the")
# Each filter matches a stopword at the start of the string, or one with a space on either side
WORD_FILTERS = [re.compile("^{0} | {0} ".format(word)) for word in STOPWORDS]
# This will match all unicode punctuation characters
PUNCTUATION_FILTER = re.compile(r'\p{P}+')

def prepare_for_sort(value):
    clean = ""
    if value:
        clean = value.lower()
        for word_filter in WORD_FILTERS:
            clean = " ".join(word_filter.split(clean))
        clean = PUNCTUATION_FILTER.sub("", clean).strip()
    return clean

Note that we're using the third-party regex library (originally proposed as an eventual replacement for re in Python) rather than the standard re module, as it allows us to use Unicode character properties. In this case we're using \p{P}+ as the regular expression, which matches any run of Unicode characters classed as punctuation. Our original implementation only matched ASCII punctuation, which meant that some Unicode quotation marks slipped through (U+201C and U+201D in particular: “ and ”).
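
To check the behaviour against the example titles, here is a minimal sketch that runs with only the standard library. It is a variant under stated assumptions, not the project's code: it swaps the regex module's \p{P} for a unicodedata category check (in a hypothetical strip_punctuation helper), which should remove the same punctuation characters.

```python
import re
import unicodedata

STOPWORDS = ("a", "the")
WORD_FILTERS = [re.compile("^{0} | {0} ".format(word)) for word in STOPWORDS]


def strip_punctuation(text):
    # Hypothetical helper: drop characters whose Unicode general category
    # starts with "P" (punctuation), standing in for regex's \p{P}.
    return "".join(
        ch for ch in text if not unicodedata.category(ch).startswith("P")
    )


def prepare_for_sort(value):
    clean = ""
    if value:
        clean = value.lower()
        for word_filter in WORD_FILTERS:
            clean = " ".join(word_filter.split(clean))
        clean = strip_punctuation(clean).strip()
    return clean


titles = [
    "Zebras!",
    "A Study of Starters",
    "\u201cChocolate tonight?\u201d, a study of desserts",
    "Bananas are not the only fruit",
    "The ABC of Stuff",
]
# Sort on the cleaned key, but keep the original titles for display.
sorted_titles = sorted(titles, key=prepare_for_sort)
```

Sorting on the cleaned key puts the titles in the order shown in the list above, with the curly quotes around "Chocolate tonight?" ignored.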

This gave us the two fields we wanted. We then needed to tell the search backend to treat them differently - title should be tokenised and searched for, while title_sort should be stored verbatim. On this particular project we were using Solr as the back end. So in the schema we have:

<field name="title" type="text" indexed="true" stored="true" multiValued="false" />
...
<field name="title_sort" type="string" indexed="true" stored="true" multiValued="false" />
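
With title_sort stored as an unanalysed string, results can then be ordered on it. A minimal sketch, assuming sqs is a Haystack SearchQuerySet; sort_by_title is a hypothetical helper name, not part of Haystack:

```python
def sort_by_title(sqs):
    # `sqs` is assumed to be a haystack SearchQuerySet (or anything with a
    # compatible order_by method). Ordering uses the verbatim title_sort
    # field, while searching still goes through the tokenised fields.
    return sqs.order_by("title_sort")
```

Passing "-title_sort" instead would reverse the order.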

Synonyms

Again on the project using Solr, we were having trouble because the US and UK spellings of words were giving different results, e.g. behavior/behaviour or randomise/randomize. With Solr we just needed to provide a list of synonyms in a file called synonyms.txt and it worked out of the box.
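
As an illustration of the file format, each line of synonyms.txt can hold a comma-separated group of equivalent terms. The entries below are just the examples from this post, and this assumes the text field type in your schema applies Solr's synonym filter:

```text
behavior, behaviour
randomise, randomize
```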

Unexpected changes upgrading to Haystack 2.x

We hit one surprise when upgrading to Haystack 2.x: the = sign changed meaning. Before it meant "exact match"; afterwards it meant "contains". For example, say we had a list of countries that includes "Netherlands" and "Netherlands Antilles" and a query:

SearchQuerySet().filter(country="Netherlands")

This query now returns both countries instead of just the European one as it did previously. If you want the previous behaviour, then you need to be explicit and use __exact.
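
As a minimal sketch of the explicit form, assuming sqs is a Haystack 2.x SearchQuerySet (the helper name countries_exactly is hypothetical, not part of Haystack):

```python
def countries_exactly(sqs, name):
    # `sqs` is assumed to be a haystack SearchQuerySet.
    # country__exact asks for an exact match, whereas a plain country=name
    # filter means "contains" in Haystack 2.x.
    return sqs.filter(country__exact=name)
```

Calling countries_exactly(SearchQuerySet(), "Netherlands") should then leave out "Netherlands Antilles" again.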

This information is not found in the otherwise excellent section "Migrating From Haystack 1.X to Haystack 2.X". Instead it is documented as a warning in the section on the SearchQuerySet API. If you are migrating from 1.x, it may be a good idea to go through the whole documentation and check at least the sections marked as warnings.