ElasticSearch Bliss with ElasticUtils

While Django Haystack remains the go-to recommendation for adding simple search indexing to your Django sites, you can quickly outgrow the simplified "bag of text" data model it uses to unify the various supported indexing engines. Depending on your use case, you eventually need to customize the tokenizers, scoring, spelling correction, autocomplete, etc. These more advanced use cases no longer work uniformly across backends, forcing you to leave the comfort of Haystack for the native APIs of your chosen indexer. ElasticUtils is a newer project from the fine folks at Mozilla that exposes many of the rich capabilities of ElasticSearch through a more elegant, Pythonic interface. Much like an ORM can simplify the process of generating SQL for your database queries, ElasticUtils provides a streamlined interface for generating search queries for the ElasticSearch Query DSL.

Why ElasticSearch

Why ElasticSearch could really be its own blog post. It's an immensely powerful search indexing system built on top of the rock-solid Lucene library. It features a considerable number of built-in query and filter types that allow a great range of search conditions. In addition, it has many customizable tokenizers, analyzers and filters to transform your documents into readily identified search results. It's most frequently compared to SOLR (which is also Lucene based), which provides similar features. Comparatively, the REST-based API of ElasticSearch shifts much of the data definition and configuration to the client, eliminating the burdensome and more rigid system administration challenges of SOLR. Finally, ElasticSearch (as the name gives away) was built with clustering in mind from the start, providing a pathway to sharding your dataset relatively seamlessly for scalability as your system grows.

Having said that, ElasticSearch is a powerful but intricate search system. Don't expect to be able to bolt it on to a project for a simple full-text search; there are likely easier solutions on the market. There are a lot of knobs and parameters, and you're going to need to make an investment to fully understand the ElasticSearch architecture. ElasticUtils doesn't really attempt to abstract away those details in the way that Django Haystack may. It does have rather sensible defaults, though, that make getting started more of a weekend project than a month-long sabbatical.

Ironically, ElasticSearch's flexibility has opened it up to non-traditional indexing use cases such as log management, analytics and data mining.

ElasticUtils in Action

ElasticSearch allows creating rich, complex search queries using a RESTful API. Beyond the basic "find documents that match the given terms" search, you can filter your searches by categories or other criteria, and summarize the results by those criteria (usually called faceting, as in tag clouds or drill-downs). It offers considerable functionality for developing powerful and specialized search applications.

The challenge with many rich, expressive query languages is how to generate these expressions reliably without a significant maintenance cost to the development team. Enter ElasticUtils, which provides access to most of ElasticSearch's features, but using a chainable queryset-like expression API for generating search parameters.

As an example, here's an actual (relatively simple) query generated from our search system looking for products related to "nike air max" (a line of shoes):

{
    "filter": {
        "term": {
            "is_visible": true
        }
    },
    "query": {
        "query_string": {
            "default_field": "_all",
            "query": "nike air max"
        }
    }
}

This query takes our user-supplied input, "nike air max", generates a query_string query against the _all field (which is a combination of all the indexed fields), and filters the results to those that are marked visible (is_visible = true).

As you can see, ElasticSearch queries are a highly nested JSON structure. Having to manage all these nodes and keep them in order can be complex and error prone.
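To make the maintenance cost concrete, here is a minimal sketch of assembling that same request body by hand in Python. The function name build_search_body and its parameters are hypothetical, introduced only for illustration; the dictionary it returns mirrors the JSON above.

```python
import json


def build_search_body(user_query, visible_only=True):
    """Assemble the raw ElasticSearch request body shown above by hand."""
    body = {
        "query": {
            "query_string": {
                "default_field": "_all",
                "query": user_query,
            }
        }
    }
    if visible_only:
        # Every optional clause is another branch that edits the nested dict.
        body["filter"] = {"term": {"is_visible": True}}
    return body


body = build_search_body("nike air max")
print(json.dumps(body, indent=4, sort_keys=True))
```

Even with a single optional filter, the function already has to know exactly where in the tree each clause belongs.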

Contrast the same query using ElasticUtils:

from elasticutils import S

query = S().filter(is_visible=True).query(_all__query_string='nike air max')

Notice how we can construct the same nested structure, in a chainable fashion that allows us to augment or enhance our existing queries? What if we wanted to restrict our query just to products in our shoes category?

query = query.filter(category="shoes")

Django Integration

While ElasticUtils is a general Python library, it comes with some Django integration out of the box. First, if you use the Django-ized versions of the API constructs (S, F, etc.), they will leverage the Django settings system for configuration such as ElasticSearch server URLs.

Second, ElasticUtils has pre-defined Celery tasks for indexing and un-indexing your Django models. This makes it really straightforward to hook up a post_save signal handler to your models to have them re-indexed in real time when updated (vs. the traditional overnight batch re-index).

The one missing out-of-the-box component for Django is a management command for indexing your data. There's an incomplete pull-request that's unfortunately been sidelined for now. Hopefully the team behind ElasticUtils can come together and finish it off. In the meantime, you can easily roll your own in less than 100 lines of code.
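The heart of such a command is little more than walking the primary keys in batches and handing each batch to an indexing function. A minimal sketch of that loop follows. The names chunked, reindex and index_fn are hypothetical (in a real command, index_fn might wrap one of the Celery indexing tasks mentioned above), and the Django command class around it is left out.

```python
from itertools import islice


def chunked(ids, size):
    """Yield successive lists of at most `size` ids from any iterable."""
    iterator = iter(ids)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def reindex(ids, index_fn, chunk_size=100):
    """Call `index_fn` once per batch and return how many ids were submitted."""
    total = 0
    for batch in chunked(ids, chunk_size):
        index_fn(batch)  # hypothetical hook standing in for the real indexer
        total += len(batch)
    return total


# Stand-in for a real indexer: just record the batches it receives.
seen = []
submitted = reindex(range(1, 8), seen.append, chunk_size=3)
print(submitted, seen)
```

Batching keeps memory use flat on large tables and gives you a natural place to report progress.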

Indexing your Data

Assembling your Mappings

To define your indexes in ElasticUtils, you define a MappingType class, much like you'd define a Model in the Django ORM. This allows you to provide the necessary parameters to ElasticSearch to define the properties of the index and provide helper methods to perform the indexing and construct the searches.

The amount of boilerplate code required depends on whether you're using the base MappingType or the Django-ized version. The Django version, for example, leverages the Django ORM to retrieve the documents to index.


from elasticutils.contrib.django import Indexable, MappingType
from example.product.models import Product


class ProductMapping(MappingType, Indexable):
    @classmethod
    def get_index(cls):
        return 'products'

    @classmethod
    def get_mapping_type_name(cls):
        return "product"

    @classmethod
    def get_model(cls):
        return Product

    @classmethod
    def get_mapping(cls):
        return {
            # TBD
        }

    @classmethod
    def extract_document(cls, obj_id, obj=None):
        return {
            # TBD
        }

There are a couple of methods with TBDs (extract_document(), get_mapping()) in there that we'll revisit individually, but you can see much of this is scaffolding not unlike the Meta class on Django models (but using classmethods rather than a nested inner class). For the most part, the output of these configuration methods is simply converted to JSON and delivered to ElasticSearch unmodified.

Defining your Data Mapping

The mapping configuration ultimately defines how ElasticSearch stores and indexes your data. It needs to know what fields are available in a given index, and parameters on how to consume and process those fields. If you've ever used Apache SOLR, you'll be relieved how much easier this configuration is compared to the XML-hell that you've surely experienced.

For our example Product index, let's create some common fields that might be associated with a product listing. This isn't designed to be a full tutorial on ElasticSearch, so we'll have to skip over some of the more complex configuration options around analyzers, tokenizers and other indexing controls.

@classmethod
def get_mapping(cls):
    return {
        'properties': {
            'name': {
                'type': 'string',
            },
            'description': {
                'type': 'string',
                'analyzer': 'snowball',
            },
            'sku': {
                'type': 'string',
                'index': 'not_analyzed',
            },
            'price': {
                'type': 'integer',
                'index': 'not_analyzed',
            },
            'category': {
                'type': 'string',
                'index': 'not_analyzed',
            },
        }
    }

Most of our fields are designated as type string, while our price field is an integer (it will be the number of cents, so price_in_dollars * 100). You'll notice that many of the fields have the attribute index: not_analyzed. This is an instruction to ElasticSearch not to run the field through an analyzer and to index the value as is. Normally, ElasticSearch would split the field into terms based on word boundaries (see the standard analyzer, which is the default). But for many of the fields, we want the value to be indexed exactly as we delivered it (such as sku, which should be treated as a single term). For description, we've used the snowball analyzer, which provides additional filtering around stop words and stemming. You can also create your own analyzers for more custom indexing of a given field.
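The difference is easiest to see side by side. The sketch below is a rough stand-in, not ElasticSearch's actual analysis chain: analyze_standard_like only lowercases and splits on non-word characters, and both function names (and the sample SKU) are made up for this illustration.

```python
import re


def analyze_standard_like(value):
    """Very rough imitation of an analyzed field: lowercase, split into words."""
    return re.findall(r"\w+", value.lower())


def index_not_analyzed(value):
    """A not_analyzed field is indexed as one exact, unmodified term."""
    return [value]


sku = "NK-AM90-BLK"
print(analyze_standard_like(sku))  # several small terms
print(index_not_analyzed(sku))     # one exact term
```

An exact-match lookup for the full SKU can only succeed against the second form, which is why identifier-like fields are usually left un-analyzed.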

This is just the tip of the iceberg. There's a multitude of configuration parameters for each field in your mapping.

Extracting Fields during Indexing

After defining the index mapping fields, let's implement the other unimplemented method, extract_document(), which captures and prepares your data for the relevant fields. This gives you control over extracting and transforming your data prior to ingestion by ElasticSearch.


    @classmethod
    def extract_document(cls, obj_id, obj=None):
        if obj is None:
            obj = cls.get_model().objects.get(pk=obj_id)

        return {
            'id': obj.pk,
            'name': obj.name,
            'description': obj.description,
            'sku': obj.sku,
            'price': int(obj.price * 100),  # Convert to cents
            'category': obj.category.name.lower(),   # assuming `category` is a foreign key
        }

Here we've extracted several attributes of our model, as well as the product category from a related model.
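One detail worth guarding: int(obj.price * 100) truncates, so it is only exact if price is a Decimal (as a Django DecimalField would supply). With a float, 19.99 * 100 lands just below 1999 and truncates to 1998. A small helper, sketched here under the hypothetical name to_cents, sidesteps that:

```python
from decimal import Decimal, ROUND_HALF_UP


def to_cents(price):
    """Convert a dollar amount to an integer number of cents without float drift."""
    # Going through str() keeps a float like 19.99 from carrying its binary error.
    amount = price if isinstance(price, Decimal) else Decimal(str(price))
    return int((amount * 100).quantize(Decimal("1"), rounding=ROUND_HALF_UP))


print(to_cents(Decimal("19.99")))
print(to_cents(19.99))
```

Whether half-cents should round up or follow some other rule is a business decision; ROUND_HALF_UP is just an assumption for this sketch.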

Create Indexes

With all your mappings configured, you can now create or update your indexes in ElasticSearch. We simply generate a settings configuration based on the ElasticSearch options and ElasticUtils does the work for us.


es = ProductMapping.get_es()

# Nest the mapping under its type name, as the create-index API expects
settings = {
    'mappings': {
        ProductMapping.get_mapping_type_name(): ProductMapping.get_mapping(),
    },
}

# TBD: Add index settings here such as custom analyzers

es.create_index(ProductMapping.get_index(), settings=settings)
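For reference, the create-index API in ElasticSearch versions of this era expects a body with each mapping nested under mappings by type name and any custom analyzers under settings.analysis. A sketch of assembling that body as a plain dictionary follows; the helper name build_index_body and the folded_text analyzer are hypothetical, chosen only for illustration.

```python
def build_index_body(type_name, mapping, analyzers=None):
    """Assemble a create-index request body from a mapping and optional analyzers."""
    body = {"mappings": {type_name: mapping}}
    if analyzers:
        body["settings"] = {"analysis": {"analyzer": analyzers}}
    return body


# Hypothetical custom analyzer: standard tokenizer, lowercased, accents folded.
analyzers = {
    "folded_text": {
        "type": "custom",
        "tokenizer": "standard",
        "filter": ["lowercase", "asciifolding"],
    }
}
mapping = {"properties": {"name": {"type": "string", "analyzer": "folded_text"}}}

body = build_index_body("product", mapping, analyzers)
print(sorted(body))
```

A field opts in to the custom analyzer by naming it in its own mapping entry, as the name field does here.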

Finally, let's leverage the built-in celery tasks to perform the index on our model data. To index a model, you simply provide its primary key to the index_objects task, which as we've defined above in extract_document(), will pull it from the database and extract the fields for indexing.

from elasticutils.contrib.django import tasks


model_objs = ProductMapping.get_model().objects.all()
model_ids = list(model_objs.values_list('pk', flat=True))
tasks.index_objects(ProductMapping, model_ids)

Searching your Index

Now that we've indexed our data, let's walk through a couple of example searches that demonstrate the richness of the ElasticSearch and ElasticUtils query expression API.

A couple of notes on the ElasticUtils S class that handles your search queries: S objects mimic many of the attributes of Django's querysets.

They're chainable, so you can build up your search parameters through successive calls to add filters, queries and other parameters. Likewise, you can build up a common base search and reuse it for a number of sub-queries that share those common parameters.
They're lazy, so search is only performed when you try to act on the data such as iterating, or calling specific methods like count().
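To illustrate those two properties in isolation, here is a toy class that behaves the same way. It is not ElasticUtils code: MiniSearch and fake_backend are hypothetical names, and the "backend" merely records requests instead of talking to a server.

```python
class MiniSearch:
    """Toy stand-in for a chainable, lazy search object (not ElasticUtils itself)."""

    def __init__(self, backend, filters=()):
        self._backend = backend      # callable that "runs" the search
        self._filters = tuple(filters)
        self._results = None         # filled in on first evaluation only

    def filter(self, **kwargs):
        # Return a new object so the original can be reused as a base query.
        return MiniSearch(self._backend, self._filters + tuple(sorted(kwargs.items())))

    def __iter__(self):
        if self._results is None:    # lazy: hit the backend on first use
            self._results = self._backend(dict(self._filters))
        return iter(self._results)


calls = []


def fake_backend(filters):
    """Hypothetical backend that records each request instead of calling a server."""
    calls.append(filters)
    return ["doc-for-%s" % sorted(filters.items())]


base = MiniSearch(fake_backend).filter(is_visible=True)
shoes = base.filter(category="shoes")
hats = base.filter(category="hats")
print(len(calls))      # nothing has been executed yet
list(shoes)
list(shoes)            # served from the cached results, no second request
print(len(calls))
```

Because filter() returns a fresh object each time, base can seed any number of sub-queries without one leaking filters into another.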

Search for Matching Terms

Let's start with the obvious use case: we want to find all products whose name contains the key search terms. With Lucene-based indexes, you can usually choose between a rich user-supplied query language (like a Google search), constructing your search parameters programmatically, or a combination of both. If you're looking to give your users a free-form input (again, like Google) to specify their search parameters with operators and other modifiers (for example: nike AND "air max" OR category:shoes), you're likely going to want to use ElasticSearch's query string query. On the other hand, if your interface is more data oriented (say, a bunch of checkboxes and sliders for controlling the search parameters), you'll want more exact search controls from ElasticSearch's many other queries, including the match, range and bool queries.

ElasticUtils takes a cue from Django's querysets and allows you to specify the type of query using a double underscore qualifier. So you can search a given field using a type of query like: [field_name]__[query_type].

s = ProductMapping.search()
# Boolean operators in the query string syntax must be upper case
results = s.query(name__query_string='nike "air max" OR category:shoes')
for result in results:
    pass  # process the results
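To show how the double underscore convention can map onto the query DSL, here is a simplified sketch of the translation. It is not how ElasticUtils is implemented: split_lookup and to_query are hypothetical helpers, and treating a bare field name as a term query is an assumption of this sketch.

```python
def split_lookup(key, default_type="term"):
    """Split 'name__prefix' into ('name', 'prefix'); bare names get the default."""
    field, sep, query_type = key.rpartition("__")
    if not sep:
        return key, default_type
    return field, query_type


def to_query(key, value):
    """Build a simplified single-clause query body for one keyword argument."""
    field, query_type = split_lookup(key)
    if query_type == "query_string":
        # query_string names its target field differently from most queries
        return {"query_string": {"default_field": field, "query": value}}
    return {query_type: {field: value}}


print(to_query("_all__query_string", "nike air max"))
print(to_query("name__prefix", "nik"))
print(to_query("name", "nike"))
```

Splitting on the last double underscore is what lets a field such as _all, which itself starts with an underscore, pass through intact.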

Instead of a free-form query string, we could build up a search programmatically using another query type. ElasticSearch provides a prefix query that offers a simple way to autocomplete search terms that begin with the entered search phrase. There are more efficient methods using ngrams, but this will suffice for our example.

s = ProductMapping.search()
# Returns all results that start with 'nik' in the `shoes` category
results = s.query(name__prefix='nik').filter(category='shoes')
for result in results:
    pass  # process the results

Faceted Search Results

One of the powerful features of ElasticSearch is the ability to quickly generate facets based on your indexed fields. If you're unfamiliar with facets, think of a tag cloud, or a list of categories (such as brands on an e-commerce site) that allow you to drill-down through a list of results to find your intended document.

In our example mapping, we have a category field that would be a nice facet to help our users select which category they're looking for within our search results. Let's first search for "nike" and then display a list of categories that have matches for that product name.

s = ProductMapping.search()
# Add facet results for `category`
query = s.query(name='nike').facet('category')
facets = query.facet_counts()
# Let's iterate the category facets
for facet in facets['category']:
    category = facet['term']
    count = facet['count']
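Conceptually, a term facet is a per-value tally over the matching documents. The sketch below computes one in plain Python over made-up hits, returning the same term/count shape that the loop above reads. ElasticSearch does this server-side across the whole index; term_facet is a hypothetical helper for illustration only.

```python
from collections import Counter


def term_facet(documents, field):
    """Count how many documents carry each value of `field`, most common first."""
    counts = Counter(doc[field] for doc in documents if field in doc)
    return [{"term": term, "count": count} for term, count in counts.most_common()]


# Hypothetical search hits for "nike".
hits = [
    {"name": "nike air max", "category": "shoes"},
    {"name": "nike pegasus", "category": "shoes"},
    {"name": "nike dri-fit tee", "category": "shirts"},
]
for facet in term_facet(hits, "category"):
    print(facet["term"], facet["count"])
```

Rendering those pairs as links that add a category filter to the current search is all a basic drill-down needs.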

Counts

Sometimes, you don't necessarily need to retrieve the search results, but just want to display the number of results. Like the Django ORM, you can call .count() to retrieve the number of results for a given query.

s = ProductMapping.search()
# Counts the results that start with 'nik' in the `shoes` category
num_results = s.query(name__prefix='nik').filter(category='shoes').count()

Again, this is just a sampler. There are numerous ways to search in ElasticSearch and these are but a few simple examples. As your queries grow in complexity, the chainability of the ElasticUtils search API keeps everything manageable.

Conclusion

There you have it: a method to build complex search systems using the powerful ElasticSearch system in a very Pythonic manner. If you're familiar with the common Django ORM patterns, it's even better.

Hopefully this gives you an overview of the ElasticUtils Python library as a powerful interface to the ElasticSearch indexing and retrieval system. Much like the Django ORM makes working with databases easier compared with raw SQL, ElasticUtils makes performing complex search queries easier than building your own ElasticSearch REST calls.

This was really just an introduction; we only showed simple field extraction and text searching. Much of the power of ElasticSearch is realized when you start implementing more complex patterns like faceting, range queries, autocomplete, spell correction and suggestions. If there's interest (leave me a comment), I'll work on follow-up posts on how to construct those patterns on top of ElasticUtils.

Caveats

I wrote this guide based on ElasticUtils v0.80. The development trunk recently switched from the underlying pyelasticsearch library to the officially blessed elasticsearch-py Python bindings. While ElasticUtils for the most part masks the need to understand the underlying library, esoteric or management operations may require direct library access.