Practical Guide to Grouping Results with Elasticsearch

  • Jose Raymundo Cruz
  • Jun 06, 2016
Engineering

Elasticsearch is a massively distributed search engine built on top of Lucene. It has a very clean and straightforward JSON API for indexing data and for searching/retrieving that data. But behind this API there are a lot of features that can help you improve and manipulate search results to do whatever your application requires.

In this post, I’d like to talk about aggregations and how I used them to return bucketed search results ordered by score. I’ll start by explaining the context and what we wanted, then show the implementation, and finally offer some insight into what each part does. Let’s get to it.

Context: The Problem

We have a recruitment platform that has a denormalized table for job applications. The platform is a Rails app, and we use elasticsearch-rails as the client library to talk to Elasticsearch. The only thing a candidate’s job applications have in common is the email address; everything else might be different. Let’s see an example:

// Job applications
| Email        | Name          | Phone        | ... more fields
| jon@hot.com  | Jon Doe       | 222-333-4444 | ...
| jon@hot.com  | Jon Doe       | 222-333-5555 | ...
| jane@doe.com | Jane Doe      | 555-555-4444 | ...
| jane@doe.com | Jane Doe      | 555-555-4941 | ...
| jane@rit.edu | Jane Jackson  | 999-999-9999 | ...
| jane@doe.com | Jane Smith    | 555-555-2333 | ...

Elasticsearch was configured to have all the fields be searchable, so your search query will get matched against any field. Let’s put in some (imaginary) numbers to represent how the match will sort the results.

  • Search query matches on name: 2 points
  • Search query matches on email: 1 point

Based on those numbers, for the search query “Doe” we see the following results:

| jane@doe.com | Jane Doe     | 555-555-4444 // 3 points
| jane@doe.com | Jane Doe     | 555-555-4941 // 3 points
| jane@doe.com | Jane Smith   | 555-555-2333 // 3 points
| jon@hot.com  | Jon Doe      | 222-333-4444 // 1 point
| jon@hot.com  | Jon Doe      | 222-333-5555 // 1 point
| jane@rit.edu | Jane Jackson | 999-999-9999 // 0 points

Jane Doe sits on top because “Doe” was hit twice, on the name and on the email. But if you were really looking for “Jon Doe”, you have to scroll down to the 4th result, even though the first 3 results all belong to the same candidate. And if Jane had applied 15 times, a single candidate would fill the entire first page of results, which we don’t want. So we decided to group results by email and collapse each group into a single row that can be expanded when needed.
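As a rough illustration of that grouping idea, here is a plain-Ruby sketch that works on in-memory rows instead of Elasticsearch. The `Application` struct and the `collapse_by_email` helper are hypothetical names made up for this example, and the scores are the imaginary points from the list above.

```ruby
# Hypothetical in-memory rows, using the imaginary points from the example above.
Application = Struct.new(:email, :name, :phone, :score)

APPLICATIONS = [
  Application.new("jane@doe.com", "Jane Doe",     "555-555-4444", 3),
  Application.new("jane@doe.com", "Jane Doe",     "555-555-4941", 3),
  Application.new("jane@doe.com", "Jane Smith",   "555-555-2333", 3),
  Application.new("jon@hot.com",  "Jon Doe",      "222-333-4444", 1),
  Application.new("jon@hot.com",  "Jon Doe",      "222-333-5555", 1),
  Application.new("jane@rit.edu", "Jane Jackson", "999-999-9999", 0)
]

# Group rows by email, then order the groups by their best score (highest first).
def collapse_by_email(applications)
  applications
    .group_by(&:email)
    .map do |email, rows|
      { email: email,
        max_score: rows.map(&:score).max,
        rows: rows.sort_by { |row| -row.score } }
    end
    .sort_by { |group| -group[:max_score] }
end

collapse_by_email(APPLICATIONS).each do |group|
  puts "#{group[:email]} (#{group[:rows].size} applications, best score #{group[:max_score]})"
end
```

Doing this in application code means loading and scoring every row yourself, which is exactly the work we would rather hand to the search engine.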

These are the results we want:

Now the results are collapsed into only 3 rows, one per candidate. This yields a smaller result set and better results for our users, since they can now search with partial names and find a range of different people. If you want the other applications for a given candidate, you just expand the row and they show up. For instance, clicking Click to expand on the first result, Jon Stewart, shows you the nested results for that candidate, allowing you to pick the specific instance you are looking for.

In comes the Elasticsearch terms aggregation, a feature that lets Elasticsearch group results based on a specific field of the model. Using the terms aggregation in combination with a couple of sub-aggregations, top hits and max, we were able to group by email address and sort the buckets by the max score per bucket. This is the final search query (using the elasticsearch-rails API):

class CandidateQuery
  def self.query(keywords)
    {
      query: {
        multi_match: {
          query: keywords,
          fields: ["name^2", "email", "phone"]
        }
      },
      aggs: {
        by_email: { # Top level aggregation: Group by email
          terms: {
            field: "email_raw",
            size: 10,
            # Order results by sub-aggregation named 'max_score'
            order: { max_score: "desc" } 
          },
          aggs: { # Sub-aggregations
            # Include the top 15 hits on each bucket in the results
            by_top_hit: { top_hits: { size: 15 } },
            
            # Track the highest score of any member of this bucket
            max_score: { max: { lang: "expression", script: "_score" } }
          }
        }
      }
    }
  end
end

This query yields the following JSON results for our example database and the search query “Doe”:

{
  "aggregations": {
    "by_email": {
      "buckets": [
        {
          "key": "jane@doe.com",
          "max_score": {
            "value": 2.538935899734497
          },
          "by_top_hit": {
            "hits": {
              "max_score": 2.538936,
              "hits": [
                {
                  "_score": 2.538936,
                  "_source": { "name": "Jane Doe", "phone": "555-555-4444" }
                },
                {
                  "_score": 2.375705,
                  "_source": { "name": "Jane Doe", "phone": "555-555-4941" }
                },
                {
                  "_score": 2.340123,
                  "_source": { "name": "Jane Smith", "phone": "555-555-2333" }
                }
              ]
            }
          }
        },
        {
          "key": "jon@hot.com",
          "max_score": {
            "value": 0.11838431656360626
          },
          "by_top_hit": {
            "hits": {
              "max_score": 0.11838432,
              "hits": [
                {
                  "_score": 0.11838432,
                  "_source": { "name": "Jon Doe", "phone": "222-333-4444" }
                },
                {
                  "_score": 0.11838432,
                  "_source": { "name": "Jon Doe", "phone": "222-333-5555" }
                }
              ]
            }
          }
        }
      ]
    }
  }
}

I’ve omitted irrelevant fields from the results for brevity (in reality, the response is much bigger). As you can see, that’s what we want: results grouped by candidate email and ordered by the top score of each bucket.
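To show how a response of this shape might be turned into collapsed rows for rendering, here is a minimal sketch. It assumes the response has already been parsed into a plain Ruby hash with string keys; the `collapsed_rows` helper and the `headline`/`nested` keys are hypothetical names, and the sample JSON is a trimmed copy of the response above with shortened scores.

```ruby
require "json"

# Trimmed-down response in the same shape as the JSON above (scores shortened).
RESPONSE_JSON = <<~JSON
  {
    "aggregations": {
      "by_email": {
        "buckets": [
          { "key": "jane@doe.com",
            "max_score": { "value": 2.54 },
            "by_top_hit": { "hits": { "hits": [
              { "_score": 2.54, "_source": { "name": "Jane Doe", "phone": "555-555-4444" } },
              { "_score": 2.34, "_source": { "name": "Jane Smith", "phone": "555-555-2333" } }
            ] } } },
          { "key": "jon@hot.com",
            "max_score": { "value": 0.12 },
            "by_top_hit": { "hits": { "hits": [
              { "_score": 0.12, "_source": { "name": "Jon Doe", "phone": "222-333-4444" } }
            ] } } }
        ]
      }
    }
  }
JSON

# Hypothetical helper: one row per bucket, with the best hit as the headline
# and the remaining hits kept for the expandable part of the row.
def collapsed_rows(response)
  buckets = response.dig("aggregations", "by_email", "buckets") || []
  buckets.map do |bucket|
    hits = bucket.dig("by_top_hit", "hits", "hits") || []
    sources = hits.map { |hit| hit["_source"] }
    { email: bucket["key"],
      score: bucket.dig("max_score", "value"),
      headline: sources.first,
      nested: sources.drop(1) }
  end
end

rows = collapsed_rows(JSON.parse(RESPONSE_JSON))
rows.each { |row| puts "#{row[:email]}: #{row[:headline]["name"]} (+#{row[:nested].size} more)" }
```

Because the buckets arrive already ordered by max score and the hits inside each bucket are ordered by score, the helper only reshapes the data and never re-sorts it.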

Now let’s dissect the query and map it to the result to understand what’s going on. Let’s begin with the query.

Query

query: {
  multi_match: {
    query: keywords,
    fields: ["name^2", "email", "phone"]
  }
},

This is the actual query we run against the search engine. Here we specify the keywords and which fields to search. We can also boost some fields so that any result with a hit on that field receives a higher score (in this example we boost the name field by 2, giving it more weight than email and phone). For more information, see multi_match queries.
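The `field^boost` strings can be generated rather than typed by hand. The sketch below is only an illustration: `boosted_fields` and `match_clause` are hypothetical helpers, not part of elasticsearch-rails, and they simply build the same hash shown above.

```ruby
# Hypothetical helper: turn a { field => boost } hash into multi_match field strings.
# A boost of 1 is the default, so those fields are emitted without a suffix.
def boosted_fields(boosts)
  boosts.map do |field, boost|
    boost == 1 ? field.to_s : "#{field}^#{boost}"
  end
end

# Build the query clause for the given keywords and boosts.
def match_clause(keywords, boosts)
  { multi_match: { query: keywords, fields: boosted_fields(boosts) } }
end

clause = match_clause("Doe", { name: 2, email: 1, phone: 1 })
puts clause.inspect
```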

Top-Level Aggregation: Aggregate by Email

aggs: {
  by_email: { # Top level aggregation: Group by email
    terms: {
      field: "email_raw",
      size: 10,
      # Order results by sub-aggregation named 'max_score'
      order: { max_score: "desc" } 
    },

This is the top-level aggregation we are using. Elasticsearch allows you to bucket results based on a field (or term) using what it calls a terms aggregation. Here the aggregation uses the field email_raw to group results together. email_raw is an indexed field that stores the email as-is (as opposed to the plain email field, which stores a tokenized version of the email, i.e. [“jon”, “doe”, “com”, etc.]). Any results with the same email are placed in the same bucket. We also specify how the buckets should be ordered: by max_score, a sub-aggregation computed at the bucket level.
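The post does not show how email_raw is defined, so the following is only a guess at what such a mapping could look like, written as a plain Ruby hash. It assumes the older "string" field type with `index: "not_analyzed"` (Elasticsearch 1.x/2.x); newer versions use the `keyword` type for the same purpose. `EMAIL_MAPPING` and `indexed_document` are hypothetical names.

```ruby
# Assumed mapping fragment (Elasticsearch 1.x/2.x "string" syntax). The real
# mapping is not shown in the post, so these settings are an illustration only.
EMAIL_MAPPING = {
  properties: {
    # Analyzed: broken into tokens at index time, used for full-text matching.
    email:     { type: "string" },
    # Not analyzed: indexed as one exact term, used for grouping in the terms aggregation.
    email_raw: { type: "string", index: "not_analyzed" }
  }
}

# Hypothetical helper: copy the email into email_raw before indexing a document.
def indexed_document(application)
  application.merge(email_raw: application[:email])
end

puts indexed_document({ email: "jon@hot.com", name: "Jon Doe" }).inspect
```

Grouping on the analyzed email field would bucket by individual tokens rather than by whole addresses, which is why a separate untokenized copy is needed.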

Sub-Aggregations: Top_Hits and Max

aggs: { # Sub-aggregations
  # Include the top 15 hits on each bucket in the results
  by_top_hit: { top_hits: { size: 15 } },
  
  # Track the highest score of any member of this bucket
  max_score: { max: { lang: "expression", script: "_score" } }
}

Finally, we look into top_hits and max. Notice that these two aggregations are nested inside the terms aggregation, so they will operate on a bucket level. By this, I mean that terms aggregation will first group all results by email and create a bucket per email, and then top_hits and max will operate in each independent bucket.

Let’s start with max. This aggregation looks at the score of each document added to the bucket and keeps the highest one. The top-level terms aggregation uses that value to order the buckets; in other words, we order the buckets by the highest score each bucket contains. If you take a look at the results shown above, you’ll notice a max_score field on each bucket. The max aggregation is the one creating that field.

The top_hits aggregation is what actually keeps the documents in each bucket. Without top_hits, each bucket would only have the key (the email), the max score, and the document count (how many documents are in the bucket), but not the documents themselves. Using top_hits in combination with terms, we get both the buckets and the top documents in each bucket (up to 15 here, per the size setting). You can see the output of top_hits in the results above under the field “by_top_hit”.

Composition of the Result

Here is an image that tries to show the composition of the results:

Final Thoughts

There are always compromises in any solution. In our case, the complexity of the code increased and the flexibility of the search query decreased. But I think the benefits outweigh the downsides: more accurate results and great performance, since all the heavy lifting is done by Elasticsearch (which is really fast) and our systems have little work to do (just render the results).


Jose Raymundo Cruz joined AlphaSights in July 2015 and serves as a Software Engineer in our New York office.