Introduction to
Apache Lucene & Elasticsearch

By Ivaylo Pavlov (20 April 2019)

What is Apache Lucene?

  • High-performance text search engine library
  • Written entirely in Java
  • Cross-platform and versatile
  • Open source

High-performance Indexing

  • More than 150 GiB/hour
  • Minimal RAM requirements: 1MiB
  • Fast incremental indexing
  • Index size is 20%-30% of indexed text

Powerful Search Algorithms

  • Ranked searching
  • Multifaceted queries support
  • Fielded search & Any-field Sorting
  • Multi-index search
  • Concurrent update & search
  • Memory efficient & typo-tolerant
  • Pluggable Ranking Models
  • Configurable Storage Engine

Overview of Lucene Indexing Mechanics

Lucene's Inverted Index Demystified

  • We saw what the inverted index looks like in the Mechanics slide
  • The frequency table is used to build the relevance score
  • Lucene merge-sorts (n*log n) its indices as documents are added.
    • Factor of 3: a merge is triggered for every 3 indices
    • The latest documents are in small indices, older ones in large ones
    • Allows fast search as you index, as it's a single seek per term
    • Has ranked and sorted results, cached bit filters and sort keys
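
A minimal sketch of the idea behind an inverted index with a frequency table, assuming plain whitespace tokenization. The function name `build_inverted_index` and the sample documents are made up for illustration; Lucene's real on-disk structures are far more elaborate.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to {doc_id: how often the term occurs in that doc}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index


# Hypothetical sample documents.
docs = {
    1: "quick brown fox",
    2: "lazy brown dog",
    3: "quick quick fox",
}
index = build_inverted_index(docs)

print(sorted(index["quick"].items()))  # [(1, 1), (3, 2)]
print(sorted(index["brown"]))          # [1, 2]
```

Looking up a term is a single dictionary access, and the per-document frequencies are the raw material a ranking model would use for scoring.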

Overview of Lucene Analyzers

Analyzer - A collection of text filters

Parsing the sentence "A Quick Brown Fox jumped over the Lazy@Dog.com"

Standard Analyzer (default) - removes stop words, lowercases, tokenizes, recognizes emails and URLs
[quick] [brown] [fox] [jumped] [over] [lazy@dog.com]
Simple Analyzer - lowercases and tokenizes
[a] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dog] [com]
Stop Analyzer - lowercases, tokenizes, splits by non-letter characters, removes stop words
[quick] [brown] [fox] [jumped] [over] [lazy] [dog] [com]

Whitespace Analyzer - splits by whitespace characters
Keyword Analyzer - entire sentence is a single token
Language Analyzer - understands English, French and Spanish, the most sophisticated of all
Custom-defined Analyzer - user-defined set of text filters
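
The analyzers described in this section can be imitated with a few lines of Python. This is a toy sketch, not Lucene's implementation: the function names are invented, the stop-word list is a tiny illustrative one, and letters are restricted to ASCII for brevity.

```python
import re

# Tiny illustrative stop-word list (an assumption, not Lucene's default set).
STOP_WORDS = {"a", "an", "the"}


def whitespace_analyzer(text):
    """Split on whitespace only; case and punctuation are kept."""
    return text.split()


def keyword_analyzer(text):
    """Treat the entire input as one token."""
    return [text]


def simple_analyzer(text):
    """Lowercase, then split on anything that is not an ASCII letter."""
    return re.findall(r"[a-z]+", text.lower())


def stop_analyzer(text):
    """Simple analyzer followed by stop-word removal."""
    return [t for t in simple_analyzer(text) if t not in STOP_WORDS]


sentence = "A Quick Brown Fox jumped over the Lazy@Dog.com"
print(simple_analyzer(sentence))
print(stop_analyzer(sentence))
print(whitespace_analyzer(sentence))
```

Each analyzer is just a chain of small text filters, which is also how a custom-defined analyzer would be composed.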

N-Gram Tokenizer & Lucene Data types

Supported data types:

  • Text
    • Keyword (String) / Not Analyzed
    • Text / Analyzed
  • Binary
  • Integer
  • Long (used for Dates)
  • Float
  • Double

The n-gram tokenizer is perfect for auto-complete! Its index amortizes to 30% of total parsed data, or so people on the web say...

Parsing "quick" with the n-gram tokenizer will generate tokens like:

Length 1 (unigram): [ q, u, i, c, k ]
Length 2 (bigram): [ qu, ui, ic, ck ]
Length 3 (trigram): [ qui, uic, ick ]
Length 4 (four-gram): [ quic, uick ]
Length 5 (five-gram): [ quick ]
What is Elasticsearch?

  • JSON-based distributed search server built on top of Lucene
  • Exposes a REST API
  • Schemaless - types automagically defined at index time
  • Horizontally scalable and pluggable
  • Near real-time (NRT)
  • Adds distributed-system features such as
    • Support for queues, thread pools, node monitoring & management APIs
    • High availability via shard replication across nodes
    • Query DSL (write queries in JSON instead of raw Lucene syntax)
  • Middle layer of the ELK Stack

Intended use

  • Instant full-featured search for web applications
  • Out-of-the-box auto-completion support
  • Out-of-the-box fuzzy search - think Google's "Did you mean" feature

Elasticsearch Cluster Diagram

[Diagram: the client talks over HTTP to a router, which fronts the nodes of the Elasticsearch cluster]

Elasticsearch Lingo Explained

Field - Named key in a document, think column name in a SQL database

Term - Value for a field

Document - Individual record, a collection of fields

Index - The "schemaless" list for the collection of documents

Primary shard - Independent Lucene index, the only shard accepting writes to its documents
Replica shard - Duplicate shard for faster retrieval and high availability of the data

Data node - Holds data shards and performs CRUD operations, search and aggregations
Master node - Only node that can modify the cluster, index & shard configurations
Ingest node - Node that applies an ingest pipeline for document enrichment before indexing
Coordinating node - The node that receives a request, forwards it to the relevant nodes and merges the results

Machine Learning node - If X-Pack is installed, at least one ML node is required to use the Machine Learning features in Kibana.

Elasticsearch High-level Request Flow Overview

  1. A request comes from the client over HTTP and hits the Elasticsearch cluster router
  2. The router checks:
    • If storing, send to the node with the primary shard, based on the default formula:
      Shard # = Hash(Routing) % Total Primary Shards
    • If retrieving, send to any available data node in the cluster
  3. The picked data node becomes the coordinating node and is responsible for splitting the request, forwarding it to the other nodes and performing aggregations. The inter-node communication happens over the internal transport layer.
  4. The coordinating node receives the partial responses from the other nodes and aggregates them
  5. The coordinating node sends the results back to the router
  6. The router responds to the client with the results
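
The routing formula from step 2 can be sketched as follows. This is an illustration under assumptions: `pick_shard` is a made-up helper, the document ids are invented, and `zlib.crc32` is only a stand-in hash (Elasticsearch's actual hash function differs).

```python
import zlib


def pick_shard(routing, total_primary_shards):
    """Shard # = Hash(Routing) % Total Primary Shards."""
    # crc32 is a stand-in hash used for illustration only.
    return zlib.crc32(routing.encode("utf-8")) % total_primary_shards


# Hypothetical document ids used as routing values.
doc_ids = [f"doc-{i}" for i in range(1, 9)]

with_5 = {d: pick_shard(d, 5) for d in doc_ids}
with_6 = {d: pick_shard(d, 6) for d in doc_ids}
moved = [d for d in doc_ids if with_5[d] != with_6[d]]

print(with_5)
print(f"{len(moved)} of {len(doc_ids)} documents would map to a different shard")
```

Comparing the two mappings shows why the number of primary shards is fixed at index creation: changing the modulus sends existing documents to different shards, so they could no longer be found where they were written.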

How is Elasticsearch so outrageously fast?

Duplicates data across multiple n-gram indices, trading disk space for speed

Inverted indices behave like hash maps, with roughly O(1) term lookup assuming good distribution

It keeps as much as possible in memory

Multi-tiered Caching

  • Request level (excludes queries on date ranges and queries with a preset "size")
  • Data level: frequently hit Lucene indices

Querying Elasticsearch

Provides a rich query DSL

  • query
  • filter
  • aggregations
    • top hits
    • counts
  • _source
  • size

Nested AND, OR and NOT syntax

  • bool
  • must
  • must_not
  • should
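
A hedged sketch of how the DSL keywords above fit together in one search request body. The helper `bool_query` and the field names (`title`, `status`, `tags`) are hypothetical; only the DSL keys themselves (`query`, `bool`, `must`, `must_not`, `should`, `filter`, `_source`, `size`, `aggs`) come from Elasticsearch.

```python
import json


def bool_query(must=None, must_not=None, should=None, filter_clauses=None):
    """Build a bool query, leaving out clause types that are empty."""
    clauses = {
        "must": must,              # AND
        "must_not": must_not,      # NOT
        "should": should,          # OR
        "filter": filter_clauses,  # AND, without scoring
    }
    return {"bool": {name: c for name, c in clauses.items() if c}}


body = {
    "query": bool_query(
        must=[{"match": {"title": "lucene"}}],
        must_not=[{"term": {"status": "draft"}}],
        should=[
            {"match": {"tags": "search"}},
            {"match": {"tags": "indexing"}},
        ],
    ),
    "_source": ["title", "tags"],  # return only these fields
    "size": 10,                    # at most 10 hits
    "aggs": {"per_tag": {"terms": {"field": "tags"}}},  # counts per tag
}

print(json.dumps(body, indent=2))
```

Since a `bool` query is itself a query clause, it can be placed inside another `must`, `should` or `must_not` list, which is how nested AND/OR/NOT expressions are written.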

How to improve querying performance

  • Use filters if the relevance score is not needed, to save on scoring calculations
  • Avoid script-based filters
  • Search by rounded dates; avoid a bare "now", which makes queries hard to cache
  • Don't go overboard with nodes and shards: joining results across many shards carries a penalty
  • Avoid sparse records
  • For large result sets, use a scroll query (which keeps the results in heap) rather than large "size" queries
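
As an illustration of the first and third tips, here is a sketch of a request body that uses filter context together with a rounded date. The helper `recent_filter_query` and the field name `created_at` are assumptions made for the example; `now-7d/d` is Elasticsearch date math meaning "seven days ago, rounded down to the day".

```python
def recent_filter_query(field, days):
    """Filter-only query for documents from the last `days` days."""
    return {
        "query": {
            "bool": {
                # Filter context: no relevance score is computed.
                "filter": [
                    # "/d" rounds to the day, so the bound stays stable
                    # for the whole day instead of changing every millisecond.
                    {"range": {field: {"gte": f"now-{days}d/d"}}}
                ]
            }
        }
    }


body = recent_filter_query("created_at", 7)
print(body)
```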

How to improve indexing performance

  • Predefine the index mapping (schema)
  • Index documents in bulk
  • Have dedicated ingest and machine learning nodes if possible
  • Split your data into date-based indices, which reduces Lucene index maintenance
  • Increase the index buffer size (> 512MiB)
  • Use auto-generated ids
  • Increase or unset the index refresh interval (default is 1 sec)
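
A small sketch of what "index in bulk" with auto-generated ids looks like on the wire, assuming the newline-delimited JSON format of the bulk endpoint. The helper `bulk_body`, the index name and the sample documents are invented for illustration; actually sending the body over HTTP is left out.

```python
import json


def bulk_body(index_name, docs):
    """Build a newline-delimited bulk body: one action line per document."""
    lines = []
    for doc in docs:
        # No "_id" in the action, so Elasticsearch generates one.
        lines.append(json.dumps({"index": {"_index": index_name}}))
        lines.append(json.dumps(doc))
    # The bulk body must end with a newline.
    return "\n".join(lines) + "\n"


# Hypothetical date-based index name and documents.
body = bulk_body(
    "logs-2019.04.20",
    [{"level": "info", "msg": "started"}, {"level": "warn", "msg": "slow query"}],
)
print(body)
```

The date in the index name also reflects the "split your data into date-based indices" tip: old data can then be dropped by deleting a whole index.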

Elasticsearch Plugins

  1. Mapper Attachment - Ingest PDFs, DOCX and others (based on Apache Tika)
  2. Ingest Attachment - Very similar to mapper attachment (based on Apache Tika)
  3. BigDesk Plugins - Provides live charts and stats for Elasticsearch cluster
  4. FsCrawler - Index documents directly over SSH

As Elasticsearch is newer, its plugin ecosystem is smaller than Apache Solr's

Visualization in Kibana

FAQ

  1. I deleted a document, but my index size didn't change
    • Elasticsearch only marks the document for deletion; it is not physically removed until a Lucene index merge is triggered.
  2. I added a document, but my index size didn't change
    • A previous delete didn't trigger an index merge, but the add did, so the size evened out after the two operations.
  3. I want to rename a field or change its data type after index creation
    • This cannot be updated in place; the index has to be re-created and the data re-indexed.
  4. I want to change the number of shards in the cluster after index creation
    • Because routing takes the hash modulo the number of primary shards, the index has to be recreated.
  5. Is Elasticsearch good for sparse data?
    • No, it pre-allocates space based on the current index schema, so null values take disk space. The bigger the gaps between populated fields, the less efficient the index is and the more space it takes.
  6. Does Elasticsearch/Lucene have an update operation?
    • Yes and no: an update is actually a delete followed by an add; partial re-indexing doesn't exist at the moment.
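
Answers 1, 2 and 6 can be made concrete with a toy model. This is purely illustrative: the class `ToyIndex` is invented here and only mimics the behaviour described above (deletes are marks, updates are delete plus add, merges reclaim space); it says nothing about Lucene's real data structures.

```python
class ToyIndex:
    """Toy append-only index with mark-for-delete semantics."""

    def __init__(self):
        self.stored = {}    # slot -> (doc_id, doc), includes deleted docs
        self.live = {}      # doc_id -> slot of the current live version
        self.next_slot = 0

    def add(self, doc_id, doc):
        if doc_id in self.live:
            self.delete(doc_id)  # an update is a delete followed by an add
        self.stored[self.next_slot] = (doc_id, doc)
        self.live[doc_id] = self.next_slot
        self.next_slot += 1

    def delete(self, doc_id):
        # Only marks the document as deleted; the stored entry stays.
        self.live.pop(doc_id, None)

    def merge(self):
        # A merge rewrites the index and drops everything marked deleted.
        live_slots = set(self.live.values())
        self.stored = {s: d for s, d in self.stored.items() if s in live_slots}

    def size(self):
        return len(self.stored)


idx = ToyIndex()
idx.add("a", {"title": "first"})
idx.add("b", {"title": "second"})
idx.delete("a")
print(idx.size())  # 2: the delete did not shrink the index
idx.add("b", {"title": "second, updated"})
print(idx.size())  # 3: the old version of "b" is still stored
idx.merge()
print(idx.size())  # 1: only the live version of "b" remains
```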

Sources

The documentation for Elasticsearch is absolutely brilliant. It has brief and to the point explanations with plenty of examples. Absolute treasure when it comes to writing DSL queries. Definitely worth going through the "Getting Started" section.

Thank you for listening!