Introduction to
Apache Lucene & Elasticsearch

By Ivaylo Pavlov (20 April 2019)

What is Apache Lucene?

High-performance text search engine library
Written entirely in Java
Cross-platform and versatile
Open source

Powerful Search Algos

High-performance Indexing

More than 150 GiB/hour
Minimal RAM requirements: 1MiB
Fast incremental indexing
Index size is 20%-30% of indexed text

Ranked searching
Multifaceted queries support
Fielded search & Any-field Sorting
Multi-index search
Concurrent update & search
Memory efficient & typo-tolerant
Pluggable Ranking Models
Configurable Storage Engine

Overview of Lucene Indexing Mechanics

Lucene's Inverted Index Demystified

We saw what the inverted index looks like in the Mechanics slide
The frequency table is used to build the relevance score
Lucene merges indices (segments) as documents are added, much like a merge sort (n log n).
- Merge factor of 3: every 3 indices of the same size are merged into a larger one
- Latest documents are in small indices, older ones in large ones
- Allows fast search as you index, as it's a single seek per term
- Has ranked and sorted results, cached bit filters and sort keys
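
The bullets above can be sketched in a few lines of Python. This is a toy model, not Lucene's implementation: the names `build_segment`, `merge_segments` and `add_segment` are made up for illustration, and the merge factor of 3 is simply the number quoted on the slide. Each segment maps a term to the documents containing it and the term frequency in each.

```python
from collections import Counter, defaultdict

MERGE_FACTOR = 3  # value quoted on the slide; an assumption, not a Lucene constant


def build_segment(docs):
    """Build a tiny inverted index: term -> {doc_id: term frequency}."""
    postings = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, freq in Counter(text.lower().split()).items():
            postings[term][doc_id] = freq
    return dict(postings)


def merge_segments(segments):
    """Combine the postings of several segments into one larger segment."""
    merged = defaultdict(dict)
    for segment in segments:
        for term, postings in segment.items():
            merged[term].update(postings)
    return dict(merged)


def add_segment(levels, segment):
    """Add a new segment at level 0 and cascade merges upward.

    levels[i] holds segments produced by i rounds of merging, so recent
    documents sit in small segments and older ones in large segments.
    """
    level = 0
    levels.setdefault(level, []).append(segment)
    while len(levels[level]) == MERGE_FACTOR:
        merged = merge_segments(levels[level])
        levels[level] = []
        level += 1
        levels.setdefault(level, []).append(merged)
    return levels


levels = {}
for i in range(9):
    add_segment(levels, build_segment({i: f"doc number {i} about lucene"}))

# Nine single-document segments collapse into one segment at level 2.
print({lvl: len(segs) for lvl, segs in levels.items()})  # {0: 0, 1: 0, 2: 1}
```

A search for a term is then one dictionary lookup per segment, which is why keeping the number of segments small matters.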

Overview of Lucene Analyzers

Standard Analyzer (default) - removes stop words, lowercases, tokenizes, recognizes emails and URLs
[quick] [brown] [fox] [jumped] [over] [lazy@dog.com]
Simple Analyzer - lowercases and tokenizes
[a] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dog] [com]
Stop Analyzer - lowercases, tokenizes, splits by non-letter characters, removes stop words
[quick] [brown] [fox] [jumped] [over] [lazy] [dog] [com]

Whitespace Analyzer - splits by whitespace characters
Keyword Analyzer - entire sentence is a single token
Language Analyzer - understands English, French and Spanish, the most sophisticated of all
Custom-defined Analyzer - user-defined set of text filters

Analyzer - A collection of text filters

Parsing the sentence "A Quick Brown Fox jumped over the Lazy@Dog.com"
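
The simpler analyzers on this slide can be approximated in plain Python. This is only a sketch of the behaviour described above, not Lucene's code: the function names are invented for illustration and `STOP_WORDS` is a small illustrative subset, not Lucene's actual stop word list.

```python
import re

SENTENCE = "A Quick Brown Fox jumped over the Lazy@Dog.com"
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is"}  # illustrative subset


def simple_analyzer(text):
    """Lowercase, then split on anything that is not a letter."""
    return re.findall(r"[a-z]+", text.lower())


def stop_analyzer(text):
    """Same as the simple analyzer, with stop words removed."""
    return [token for token in simple_analyzer(text) if token not in STOP_WORDS]


def whitespace_analyzer(text):
    """Split on whitespace only; case and punctuation are kept."""
    return text.split()


def keyword_analyzer(text):
    """Treat the entire input as a single token."""
    return [text]


for analyzer in (simple_analyzer, stop_analyzer, whitespace_analyzer, keyword_analyzer):
    print(analyzer.__name__, analyzer(SENTENCE))
```

The first two calls print the same token lists as the Simple and Stop Analyzer examples above.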

N-Gram Tokenizer & Lucene Data types

Perfect for auto-complete! Amortizes to 30% of total parsed data, or so people on the web say...

Parsing "quick" with the n-gram tokenizer generates tokens like:

Length 1 (unigram): [ q, u, i, c, k ]

Length 2 (bigram): [ qu, ui, ic, ck ]

Length 3 (trigram): [ qui, uic, ick ]

Length 4 (four-gram): [ quic, uick ]

Length 5 (five-gram): [ quick ]

Supported data types:

Text
- Keyword (String) / Not Analyzed
- Text / Analyzed
Binary
Integer
Long (used for Dates)
Float
Double
What is Elasticsearch?

JSON-based distributed web server built on top of Lucene
Exposes REST API
Schemaless - types automagically defined at index time
Horizontally scalable and Pluggable
Near real-time (NRT)
Adds distributed-system features such as
- Support for queues, thread-pool, node monitoring & management APIs
- High availability via shard replication across nodes
- DSL query language (Write in JSON instead of raw Lucene syntax)
Middle layer of the ELK Stack
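
To give a feel for what "types automagically defined at index time" means, here is a simplified sketch of inferring a mapping from the first JSON document seen. It is an assumption-laden approximation: `infer_field_type` and `infer_mapping` are hypothetical helpers, and the real dynamic mapping has more rules (for example date detection on strings and keyword sub-fields) that are left out here.

```python
def infer_field_type(value):
    """Guess a field type from a JSON value (simplified sketch)."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "float"
    if isinstance(value, str):
        return "text"
    if isinstance(value, dict):
        return "object"
    raise TypeError(f"unsupported value in this sketch: {value!r}")


def infer_mapping(document):
    """Map every top-level field of a document to an inferred type."""
    return {field: infer_field_type(value) for field, value in document.items()}


doc = {"title": "Intro to Lucene", "views": 1024, "rating": 4.5, "published": True}
print(infer_mapping(doc))
```

Once a field has been given a type this way, later documents have to fit it, which is why predefining the mapping is often the safer choice.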

Intended use

Instant full-featured search for web applications
Out of the box auto-completion support
Out of the box Fuzzy search - think Google's "Did you mean" feature
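
Fuzzy matching of this kind is built on edit distance: how many single-character changes turn one word into another. The sketch below uses plain Levenshtein distance over a small vocabulary; `edit_distance` and `did_you_mean` are illustrative names, and the limit of 2 edits is an assumption chosen for the example.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


def did_you_mean(word, vocabulary, max_edits=2):
    """Return vocabulary terms within max_edits of word, closest first."""
    scored = sorted((edit_distance(word, term), term) for term in vocabulary)
    return [term for distance, term in scored if distance <= max_edits]


print(did_you_mean("elasticsaerch", ["elasticsearch", "lucene", "kibana"]))
```

A real engine does not compare the query against every term like this; the sketch only shows the distance measure that decides what counts as a near match.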

Elasticsearch Cluster Diagram

[Diagram: the Client talks to the cluster Router over HTTP]

Elasticsearch Lingo Explained

Field - Named key in a document, think column name in a SQL database

Term - Value for a field

Document - Individual record, a collection of fields

Index - The "schemaless" list for the collection of documents

Primary shard - Independent Lucene index, the only shard that accepts writes to its documents
Replica shard - Duplicate shard for faster retrieval and high-availability of the data

Data node - Holds data shards and performs CRUD operations, search and aggregations
Master node - Only node that can modify the cluster, index & shard configurations
Ingest node - Node that applies ingest pipeline for document enrichment before indexing
Coordinating node - The node that handles a given request: it forwards the query to the relevant nodes and merges their results

Machine Learning node - If X-Pack is installed, to use Machine Learning features in Kibana, a minimum of one ML node is required. (Read on how it differs?)

Elasticsearch High-level Request Flow Overview

A request comes from the client over HTTP and hits the Elasticsearch cluster router
The router checks:
- If storing, send to node with primary shard based on default formula:
  Shard # = Hash (Routing) % Total Primary Shards
- If retrieving, send to any available data node in the cluster
The picked data node becomes the coordinating node and is responsible for splitting the request, forwarding it to the other nodes and performing aggregations. Inter-node communication happens over the internal transport layer.
The coordinating node receives the partial responses from the other nodes and aggregates them
The coordinating node sends the results back to the router
The router responds to the client with the results
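
The two halves of this flow, routing a write to a shard and merging partial search results, can be sketched as follows. Assumptions to note: `zlib.crc32` is only a stand-in for whatever hash function the cluster really uses, and `shard_for` and `merge_top_hits` are hypothetical helper names.

```python
import heapq
import zlib


def shard_for(routing, num_primary_shards):
    """Shard # = Hash(Routing) % Total Primary Shards (crc32 as a stand-in hash)."""
    return zlib.crc32(routing.encode("utf-8")) % num_primary_shards


def merge_top_hits(partial_results, size):
    """Merge per-shard lists of (score, doc_id) into the global top `size` hits."""
    all_hits = (hit for shard_hits in partial_results for hit in shard_hits)
    return heapq.nlargest(size, all_hits)


# Writing: the routing value (by default the document id) picks the primary shard.
print(shard_for("doc-42", 5))

# Reading: each shard returns its own best hits, the coordinating node merges them.
partials = [
    [(0.90, "a"), (0.20, "b")],  # shard 0
    [(0.70, "c")],               # shard 1
    [(0.95, "d")],               # shard 2
]
print(merge_top_hits(partials, size=2))  # [(0.95, 'd'), (0.9, 'a')]
```

Because the shard number depends on the total number of primary shards, changing that total would send existing documents to different shards.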

How is Elasticsearch so outrageously fast?

Duplicates data across multiple n-gram indices, trading disk space for speed

Inverted indices work like hashmaps: a term lookup is close to O(1), assuming good distribution

It keeps as much as possible in-memory

Multi-tiered Caching

Request level: (excludes queries on date ranges and with preset "size")
Data level: Frequently hit Lucene indices

Querying Elasticsearch

Nested AND, OR and NOT syntax

query
filter
aggregations
- top hits
- counts
source
size

bool
must
must_not
should

Provides a rich query DSL
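
As an illustration of how the pieces on this slide fit together, here is a search body built as a Python dictionary and serialized to JSON. The field names (`title`, `status`, `tags`, `published`, `author`) and the aggregation name `per_author` are hypothetical; only the DSL keywords come from the slide.

```python
import json

search_body = {
    "query": {
        "bool": {
            "must": [{"match": {"title": "lucene"}}],       # AND
            "must_not": [{"term": {"status": "draft"}}],    # NOT
            "should": [{"match": {"tags": "search"}}],      # OR, boosts the score
            "filter": [{"range": {"published": {"gte": "2019-01-01"}}}],
        }
    },
    "aggs": {"per_author": {"terms": {"field": "author"}}},  # counts per author
    "_source": ["title", "author"],                          # fields to return
    "size": 10,                                              # number of hits
}

print(json.dumps(search_body, indent=2))
```

The resulting JSON is what would be sent as the body of a search request against an index.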

How to improve querying performance

Use filters when the relevance score is not needed, to save on scoring calculations
Avoid script-based filters
Search by rounded dates, avoid using "now"
Don't go overboard with nodes and shards: joining results from many shards carries a penalty
Avoid sparse records
Use a scroll query for large result sets rather than a large "size" query, which has to hold the results in heap
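
The first and third tips can be combined in one small sketch: put the conditions in a filter clause so no score is computed, and pass a fixed date instead of "now" so that repeated requests have an identical body. `filter_only_query` is a hypothetical helper and the field names in the usage line are made up.

```python
import json
from datetime import date


def filter_only_query(field, value, date_field, since):
    """Build a bool query that only filters, with a fixed lower date bound."""
    return {
        "query": {
            "bool": {
                "filter": [
                    {"term": {field: value}},
                    {"range": {date_field: {"gte": since.isoformat()}}},
                ]
            }
        }
    }


body = filter_only_query("status", "published", "created", date(2019, 4, 20))
print(json.dumps(body, indent=2))
```

Two calls with the same arguments produce byte-identical JSON, which is what gives a request-level cache a chance to hit.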

How to improve indexing performance

Predefine index mapping (schema)
Index documents in bulk
Have dedicated ingestion and machine learning nodes if possible
Break your data up by date to reduce Lucene index maintenance
Increase the index buffer size (> 512MiB)
Use auto-generated ids
Increase or disable the index refresh interval (default is 1 sec)
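
Bulk indexing and the refresh interval can be illustrated with a sketch of the request bodies involved. Assumptions: `bulk_body` is a hypothetical helper, the index name `articles` is made up, the action lines omit an id so that ids are auto-generated, and the exact action metadata accepted may differ between Elasticsearch versions.

```python
import json


def bulk_body(index_name, documents):
    """Build a newline-delimited JSON body: one action line, then one document line."""
    lines = []
    for document in documents:
        lines.append(json.dumps({"index": {"_index": index_name}}))  # no id: auto-generated
        lines.append(json.dumps(document))
    return "\n".join(lines) + "\n"  # the body must end with a newline


# Index settings that lengthen the refresh interval during a heavy import.
refresh_settings = {"index": {"refresh_interval": "30s"}}

docs = [{"title": "first"}, {"title": "second"}, {"title": "third"}]
print(bulk_body("articles", docs))
print(json.dumps(refresh_settings))
```

Sending three documents in one request replaces three round trips with one, which is where most of the indexing speed-up comes from.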

Elasticsearch Plugins

Mapper Attachment - Ingest PDFs, DOCX and others (based on Apache Tika)
Ingest Attachment - Very similar to mapper attachment (based on Apache Tika)
BigDesk Plugins - Provides live charts and stats for Elasticsearch cluster
FsCrawler - index documents directly over SSH

As Elasticsearch is newer, its plugin ecosystem is smaller than Apache Solr's

Visualization in Kibana

FAQ

I deleted a document, my index size didn't change
- The document is only marked for deletion; it is not physically removed until a Lucene indices merge is triggered.
I added a document, my index size didn't change
- A previous delete didn't trigger an indices merge, but the add did, so the size evened out after the two operations.
I want to rename a field or change the data type after index creation
- This cannot be changed in place; the index has to be re-created and the data re-indexed.
I want to change the number of shards in the cluster after index creation
- Because the router picks a shard with a hash modulo the number of primary shards, the index has to be re-created.
Is Elasticsearch good for sparse data?
- No, it pre-allocates space based on the current index schema, so null values take disk space. The bigger the gaps between populated fields, the less efficient the index is and the more space it takes.
Does Elasticsearch/Lucene have an update operation?
- Yes and no: an update is actually a delete followed by an add; partial re-indexing doesn't exist at the moment.
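
The delete, add and update answers above can be tied together with a toy model of an append-only index. This is a sketch for intuition only: `ToyIndex` is a made-up class, and real merges are triggered by the engine's own policy rather than by an explicit call.

```python
class ToyIndex:
    """Append-only store where deletes are tombstones until a merge runs."""

    def __init__(self):
        self.entries = []        # append-only list of (doc_id, document)
        self.tombstones = set()  # positions in self.entries marked as deleted

    def add(self, doc_id, document):
        self.entries.append((doc_id, document))

    def delete(self, doc_id):
        # Mark matching entries as deleted; nothing is removed yet.
        for position, (entry_id, _) in enumerate(self.entries):
            if entry_id == doc_id:
                self.tombstones.add(position)

    def update(self, doc_id, document):
        # An update is a delete followed by an add.
        self.delete(doc_id)
        self.add(doc_id, document)

    def merge(self):
        # Rewrite the entries without the tombstoned ones.
        self.entries = [entry for position, entry in enumerate(self.entries)
                        if position not in self.tombstones]
        self.tombstones.clear()

    def stored_entries(self):
        return len(self.entries)

    def live_docs(self):
        return {doc_id: document
                for position, (doc_id, document) in enumerate(self.entries)
                if position not in self.tombstones}


index = ToyIndex()
index.add(1, {"title": "old"})
index.add(2, {"title": "keep"})
index.delete(1)
print(index.stored_entries(), len(index.live_docs()))  # 2 1: size unchanged after delete
index.update(2, {"title": "kept, edited"})
print(index.stored_entries(), len(index.live_docs()))  # 3 1: update appended a new copy
index.merge()
print(index.stored_entries(), len(index.live_docs()))  # 1 1: merge reclaimed the space
```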

Sources

Elasticsearch Reference Guide

Apache Lucene Documentation

Europython 2014: Elasticsearch from the bottom up

Apache Lucene: Then & Now

The documentation for Elasticsearch is absolutely brilliant. It has brief and to the point explanations with plenty of examples. Absolute treasure when it comes to writing DSL queries. Definitely worth going through the "Getting Started" section.

Thank you for listening!

https://www.ivaylopavlov.com
