Elasticsearch Deep Dive

What is Elasticsearch?

Elasticsearch is an open-source, distributed search and analytics engine built on top of Apache Lucene. It offers advanced full-text search capabilities and near real-time data processing, making it well suited to applications that require fast, scalable search, retrieval, and analysis of large volumes of structured and unstructured data.

Key Features:

Full-Text Search: Advanced text-based queries, including autocomplete and fuzzy search.
Blazing-Fast Performance: Utilizes an inverted index for quick data retrieval.
Scalability: Add nodes to scale horizontally.
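
To make the inverted index mentioned above concrete, here is a minimal sketch in plain Python. It is a toy model, not how Lucene stores its index: the function names and sample documents are made up for illustration, tokens are produced by simple whitespace splitting, and the search requires every query token to be present.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercase token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every token of the query."""
    token_sets = [index.get(token, set()) for token in query.lower().split()]
    if not token_sets:
        return set()
    return set.intersection(*token_sets)


# Hypothetical sample data, used only for this illustration.
docs = {
    1: "red running shoes",
    2: "blue walking shoes",
    3: "running shorts",
}
index = build_inverted_index(docs)
print(sorted(search(index, "running shoes")))  # [1]
```

A lookup touches only the entries for the query tokens instead of scanning every document, which is why retrieval stays quick as the number of documents grows.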

Core Concepts in Elasticsearch

1. Index

* Elasticsearch index = relational database table
* Document in Elasticsearch = row in a relational table
* Field in a document = column in a relational table

Why Should You Use Elasticsearch?

There are many reasons why Elasticsearch has become so popular, especially for businesses that deal with lots of data. Here are a few:

1. Speed and Scalability

Elasticsearch is known for its speed. It can search large volumes of data and return results in near real time. Whether you’re looking for a specific product or analyzing customer feedback, Elasticsearch keeps queries fast.

Example: Imagine an online store with thousands of products. If you search for "running shoes," Elasticsearch will return the relevant products in less than a second, even though there are so many products to sift through.

Elasticsearch also scales horizontally. As your data grows, you can add nodes to the cluster and spread an index's shards across them, so query performance holds up at larger volumes.

2. Full-Text Search

One of Elasticsearch’s best features is full-text search. Text is analyzed into normalized tokens both when it is indexed and when it is queried, so a query can match a document even if the casing or the surrounding words differ.

Example: Let’s say you search for “Iphone 12.” Even if the product is listed as “Apple iPhone 12,” both strings are lowercased and split into tokens, so the query tokens "iphone" and "12" match the product and it appears in your results.
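
The matching in this example comes from text analysis. The sketch below is a rough approximation of that idea in plain Python, assuming an analyzer that only lowercases and splits on non-alphanumeric characters; the function names are hypothetical, and real analyzers do more (Unicode handling, optional stemming, synonyms, and so on).

```python
import re


def analyze(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def matches_all_terms(query, field_value):
    """True when every query token also appears among the field's tokens."""
    return set(analyze(query)) <= set(analyze(field_value))


print(analyze("Iphone 12"))        # ['iphone', '12']
print(analyze("Apple iPhone 12"))  # ['apple', 'iphone', '12']
print(matches_all_terms("Iphone 12", "Apple iPhone 12"))  # True
```

Because both sides are reduced to the same tokens, the difference in capitalization and the extra word "Apple" do not prevent a match.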

3. Schema Flexibility

Unlike traditional databases that require you to define a rigid structure up front, Elasticsearch is often described as schema-less. With dynamic mapping, it infers a type for each new field the first time it sees one, so you can index documents of different shapes (text, numbers, dates, nested objects) without declaring every field in advance. Explicit mappings are still worth defining when you need control over how a field is indexed.

Example: You can store user comments, product reviews, image metadata, and transaction records in the same system. Elasticsearch can handle all these different types of data, making it much easier to work with diverse data sources.
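
As a rough illustration of how field types can be inferred from the first document, here is a simplified sketch in plain Python. It mirrors the default dynamic mapping rules only loosely: the date check is a simplified stand-in for Elasticsearch's date detection, arrays are not handled, and the `active` field is a hypothetical addition to the sample document.

```python
import re

# Simplified stand-in for date detection: accepts strings that start with yyyy-MM-dd.
DATE_PATTERN = re.compile(r"^\d{4}-\d{2}-\d{2}")


def infer_field_type(value):
    """Guess a field type from a JSON value, loosely following dynamic mapping."""
    if isinstance(value, bool):  # checked before int because bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "float"
    if isinstance(value, dict):
        return "object"
    if isinstance(value, str):
        # Non-date strings become text (Elasticsearch also adds a keyword sub-field).
        return "date" if DATE_PATTERN.match(value) else "text"
    return None  # null and unsupported values add no field in this sketch


def infer_mapping(document):
    """Infer a type for every top-level field of a document."""
    return {field: infer_field_type(value) for field, value in document.items()}


doc = {
    "id": 12,
    "fullName": "Daniel Waters",
    "dateOfBirth": "1990-01-01",
    "active": True,
}
print(infer_mapping(doc))
# {'id': 'long', 'fullName': 'text', 'dateOfBirth': 'date', 'active': 'boolean'}
```

The keyword sub-field noted in the comment is why the aggregation examples later in this article group on `fullName.keyword` rather than `fullName`.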

4. Real-Time Data Processing

Another reason to use Elasticsearch is near real-time data processing: a newly indexed document becomes searchable after the next index refresh, which by default happens about once per second. This is essential if you need up-to-the-minute data, such as live search results or activity monitoring.

Example: On an e-commerce site, when a customer purchases a product, the application can update the product's inventory in Elasticsearch, and the change is reflected in search results about a second later.

5. It's Part of a Larger Ecosystem

Elasticsearch is often used as part of the Elastic Stack (also known as ELK Stack), which includes tools for managing and visualizing data:

Logstash: Collects and processes data.
Kibana: Visualizes data in interactive dashboards.
Beats: Lightweight data shippers for sending data to Elasticsearch.

This ecosystem is especially useful for businesses that need to monitor large amounts of data, like website traffic, server logs, or application performance.

When Should You Use Elasticsearch?

Elasticsearch is well suited to certain situations, especially when you're working with large, dynamic datasets. Here are some specific scenarios where it shines:

1. Real-Time Search

If you have a website, app, or service where users need to search through data in real time, Elasticsearch is a great choice.

Example: A customer searching for “wireless headphones” on an online store, with thousands of products listed, will see the results almost instantly.

2. Log and Event Analytics

For businesses that track user activity or system logs, Elasticsearch helps by storing and searching large volumes of log data. It lets you easily analyze the data to find patterns or identify issues.

Example: A company could use Elasticsearch to analyze log files from its servers, looking for unusual patterns that may indicate a security breach or a system error.

3. Metrics and Monitoring

If you need to monitor key business metrics (like website visits or user activity), Elasticsearch can handle large volumes of time-series data, such as user sign-ins or transaction counts.

Example: An online platform could track the number of active users every minute and display that data on an interactive dashboard using Kibana.

4. Recommendation Systems

Many businesses use Elasticsearch to build recommendation engines. By analyzing users’ behavior or past purchases, Elasticsearch can help suggest products or content that they may be interested in.

Example: After a user buys a laptop, an Elasticsearch-backed recommender might suggest accessories like laptop bags or mouse pads based on the user’s past behavior and preferences.

5. Full-Text Search for Large Datasets

If you have a large amount of text (like articles, documents, or social media posts) and need to search through it, Elasticsearch is ideal. It’s built specifically for text-heavy data.

Example: A news website can index articles and let users search for relevant news stories, even if they don’t know the exact title or keywords.

Elasticsearch Query Examples for Aadhar Data

Basic Query: Retrieve All Documents

This query fetches all documents from the aadhar index:

GET /aadhar/_search
{
  "query": {
    "match_all": {}
  }
}

Conditional Queries

Match Query

Retrieve all records where gender is "male":

GET /aadhar/_search
{
  "query": {
    "match": {
      "gender": "male"
    }
  }
}

Term Query

Find the document whose id is exactly 15 (a term query matches the exact value and does not analyze it):

GET /aadhar/_search
{
  "query": {
    "term": {
      "id": 15
    }
  }
}

Boolean Query with `must`

Search for records where fullName is "Akarsh Khatri" and aadhar is 912900983267:

GET /aadhar/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "fullName": "Akarsh Khatri" } },
        { "term": { "aadhar": 912900983267 } }
      ]
    }
  }
}
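
When queries like the one above are assembled in application code, small helper functions keep the nesting manageable. The following is a minimal sketch in plain Python that only builds the request body as a dictionary; the helper names are hypothetical, and sending the body to a cluster is left out.

```python
import json


def match(field, text):
    """Build a match clause (analyzed, full-text matching)."""
    return {"match": {field: text}}


def term(field, value):
    """Build a term clause (exact value matching)."""
    return {"term": {field: value}}


def bool_must(*clauses):
    """Combine clauses so that a document has to satisfy all of them."""
    return {"query": {"bool": {"must": list(clauses)}}}


body = bool_must(
    match("fullName", "Akarsh Khatri"),
    term("aadhar", 912900983267),
)
print(json.dumps(body, indent=2))
```

The printed JSON has the same structure as the request body shown above.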

Aggregations

Count and Group By `fullName`

Group records by fullName and count occurrences:

GET /aadhar/_search
{
  "size": 0,
  "aggs": {
    "by_fullName": {
      "terms": {
        "field": "fullName.keyword"
      }
    }
  }
}
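
Conceptually, a terms aggregation is a "group by and count". The sketch below reproduces that behavior over an in-memory list in plain Python so the shape of the result is easier to picture. The function name and sample records are hypothetical, and the tie-breaking by key is an assumption about the default ordering.

```python
from collections import Counter


def terms_aggregation(docs, field, size=10):
    """Count documents per distinct field value, most frequent first."""
    counts = Counter(doc[field] for doc in docs if field in doc)
    # Order by descending count; ties are broken by key (assumed default).
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [{"key": key, "doc_count": count} for key, count in ordered[:size]]


# Hypothetical records, used only for this illustration.
docs = [
    {"fullName": "Akarsh Khatri"},
    {"fullName": "Daniel Waters"},
    {"fullName": "Akarsh Khatri"},
]
print(terms_aggregation(docs, "fullName"))
# [{'key': 'Akarsh Khatri', 'doc_count': 2}, {'key': 'Daniel Waters', 'doc_count': 1}]
```

In the real response, buckets of this shape appear under `aggregations.by_fullName.buckets`, and only the top 10 terms are returned unless the aggregation sets its own `size`.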

Range Query

Retrieve all documents where dateOfBirth is greater than or equal to 1990-01-01:

GET /aadhar/_search
{
  "query": {
    "range": {
      "dateOfBirth": {
        "gte": "1990-01-01T00:00:00"
      }
    }
  }
}

Sum Aggregation

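Calculate the total of phoneNumber field values (this requires phoneNumber to be mapped as a numeric field, and is shown only to illustrate the syntax):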

GET /aadhar/_search
{
  "size": 0,
  "aggs": {
    "total_phone_number": {
      "sum": {
        "field": "phoneNumber"
      }
    }
  }
}

Multiple-Level Grouping

Group records by gender and year of dateOfBirth:

GET /aadhar/_search
{
  "size": 0,
  "aggs": {
    "composite_group": {
      "composite": {
        "sources": [
          { "gender": { "terms": { "field": "gender.keyword" } } },
          { "dob": { "date_histogram": { "field": "dateOfBirth", "calendar_interval": "year" } } }
        ]
      }
    }
  }
}

Pagination

Standard Pagination

Fetch 10 records, starting from the 11th document:

GET /aadhar/_search
{
  "from": 10,
  "size": 10,
  "query": {
    "match_all": {}
  }
}
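
The `from` value is simply the number of documents to skip. A small helper can translate a 1-based page number into these parameters; the sketch below is plain Python with a hypothetical function name. It also checks the default `index.max_result_window` limit of 10,000, beyond which `from` + `size` requests are rejected unless the setting is changed.

```python
def page_to_from_size(page, page_size=10, max_result_window=10_000):
    """Translate a 1-based page number into from/size search parameters."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be positive")
    offset = (page - 1) * page_size
    if offset + page_size > max_result_window:
        raise ValueError("from + size exceeds the result window; use search_after")
    return {"from": offset, "size": page_size}


print(page_to_from_size(1))  # {'from': 0, 'size': 10}
print(page_to_from_size(2))  # {'from': 10, 'size': 10}
```

Page 2 with a page size of 10 yields the `from` and `size` values used in the request above.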

Pagination for Large Datasets

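Efficiently handle large datasets with search_after. The request sorts by a field and passes the sort values of the last hit from the previous page, so the next page starts right after it (here, after the document whose id is 1000):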

GET /aadhar/_search
{
  "size": 10,
  "query": {
    "match_all": {}
  },
  "sort": [
    { "id": "asc" }
  ],
  "search_after": [1000]
}
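
The loop that drives search_after can be sketched without a cluster. In the plain Python example below, `fake_search` is a hypothetical stand-in for the real search call: it sorts an in-memory list by `id`, honors `search_after` and `size`, and returns hits that carry a `sort` array the way real hits do when a sort is specified.

```python
def fake_search(docs, body):
    """Stand-in for a search request: sort by id, apply search_after and size."""
    ordered = sorted(docs, key=lambda doc: doc["id"])
    after = body.get("search_after")
    if after is not None:
        ordered = [doc for doc in ordered if doc["id"] > after[0]]
    page = ordered[: body["size"]]
    return [{"_source": doc, "sort": [doc["id"]]} for doc in page]


def iterate_all(docs, page_size):
    """Yield every document by feeding the last hit's sort values into the next request."""
    body = {"size": page_size, "query": {"match_all": {}}, "sort": [{"id": "asc"}]}
    while True:
        hits = fake_search(docs, body)
        if not hits:
            break
        for hit in hits:
            yield hit["_source"]
        body["search_after"] = hits[-1]["sort"]


# Hypothetical records, used only for this illustration.
docs = [{"id": i} for i in range(1, 8)]
print([doc["id"] for doc in iterate_all(docs, page_size=3)])  # [1, 2, 3, 4, 5, 6, 7]
```

Each request only needs the previous page's last sort values, so the cost of fetching a page does not grow with how deep into the results you are. The sort field should be unique per document; otherwise documents that share a sort value can be skipped at page boundaries.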

Update Operations

Update a Specific Document

Update the phoneNumber field for the document with id 16:

POST /aadhar/_update/16
{
  "doc": {
    "phoneNumber": 9876543210
  }
}
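
A partial update merges the fields under `doc` into the stored document rather than replacing it. The sketch below models that merge in plain Python: nested objects are merged recursively, while other values (including arrays) are overwritten. The function name and the stored record are hypothetical.

```python
def merge_partial(source, partial):
    """Return a copy of source with the partial document merged into it."""
    merged = dict(source)
    for key, value in partial.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Nested objects are merged key by key.
            merged[key] = merge_partial(merged[key], value)
        else:
            # Scalars and arrays replace the stored value.
            merged[key] = value
    return merged


# Hypothetical stored document, used only for this illustration.
stored = {"id": 16, "fullName": "Example Person", "phoneNumber": 9000000000}
updated = merge_partial(stored, {"phoneNumber": 9876543210})
print(updated)
# {'id': 16, 'fullName': 'Example Person', 'phoneNumber': 9876543210}
```

Fields that are not mentioned in the partial document, such as `fullName` here, keep their stored values.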

Bulk Update by Query

Update the address field for documents where aadhar is 145721561989:

POST /aadhar/_update_by_query
{
  "query": {
    "term": {
      "aadhar": 145721561989
    }
  },
  "script": {
    "source": "ctx._source.address = 'hyderabad, hitech City, ts State'",
    "lang": "painless"
  }
}

Insert a Document

Add a new document with id 12 (if a document with that id already exists, it is replaced entirely):

PUT /aadhar/_doc/12
{
  "id": 12,
  "fullName": "Daniel Waters",
  "phoneNumber": 9876543210,
  "address": "Updated Address"
}