Elasticsearch Deep Dive
What is Elasticsearch?
Elasticsearch is an open-source, distributed search and analytics engine built on top of Apache Lucene. It offers advanced full-text search and near-real-time data processing, which makes it well suited to applications that need fast, scalable search, retrieval, and analysis of large volumes of structured and unstructured data.
Key Features:
Full-Text Search: Advanced text-based queries, including autocomplete and fuzzy search.
Fast Performance: Uses an inverted index, which maps each term to the documents that contain it, for quick data retrieval.
Scalability: Add nodes to scale horizontally.
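To make the inverted-index idea concrete, here is a minimal Python sketch. It is not how Lucene stores its index on disk. The function names build_inverted_index and search and the sample products are hypothetical, and tokenization is reduced to lowercasing and splitting on whitespace.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercase token to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return the IDs of documents that contain every token in the query."""
    tokens = query.lower().split()
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


# Hypothetical sample data for illustration only.
products = {
    1: "Trail running shoes",
    2: "Leather office shoes",
    3: "Running socks",
}
product_index = build_inverted_index(products)
print(search(product_index, "running shoes"))  # {1}
```

Because each lookup goes straight from a term to its documents, the cost of a query depends on the number of matching terms rather than on scanning every document.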
Core Concepts in Elasticsearch
1. Index
An index is a collection of documents. As a rough analogy to relational databases:
* Index in Elasticsearch = Table in a relational database
* Document in Elasticsearch = Row in a relational table
* Field in a document = Column in a relational table
Why Should You Use Elasticsearch?
There are many reasons why Elasticsearch has become so popular, especially for businesses that deal with lots of data. Here are a few:
1. Speed and Scalability
Elasticsearch is known for its speed. It can quickly search through tons of data and provide results in real-time. Whether you’re looking for a specific product or analyzing customer feedback, Elasticsearch makes it fast.
Example: Imagine an online store with thousands of products. If you search for "running shoes," Elasticsearch will return the relevant products in less than a second, even though there are so many products to sift through.
Elasticsearch is also scalable. An index is split into shards that are distributed across nodes, so as your data grows you can add nodes to spread the load.
2. Full-Text Search
One of Elasticsearch’s best features is full-text search. Text is analyzed into individual terms (for example, split into words and lowercased) at both index time and search time, so a query can find a document even when the wording or capitalization does not match exactly.
Example: Let’s say you search for “Iphone 12.” Even if the product is listed as “Apple iPhone 12,” both are reduced to the same lowercase terms, so the product shows up in your results.
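The "Iphone 12" example works because the query and the stored text are both broken into normalized terms before they are compared. The sketch below assumes a simplified analyzer (lowercase, then split on anything that is not an ASCII letter or digit). It only approximates Elasticsearch's standard analyzer, and the names analyze and matches_all_terms are hypothetical.

```python
import re


def analyze(text):
    """Simplified analyzer: lowercase, then split on non-alphanumeric characters."""
    return [token for token in re.split(r"[^0-9a-z]+", text.lower()) if token]


def matches_all_terms(query, field_value):
    """True if every analyzed query term appears among the analyzed field terms."""
    return set(analyze(query)) <= set(analyze(field_value))


print(analyze("Apple iPhone 12"))                         # ['apple', 'iphone', '12']
print(matches_all_terms("Iphone 12", "Apple iPhone 12"))  # True
```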
3. Schema Flexibility
Unlike traditional databases that require you to define a rigid structure up front, Elasticsearch does not force you to declare a schema first. With dynamic mapping, it infers a type for each new field (such as text, number, or date) from the first document that contains it. You can still define explicit mappings when you need more control.
Example: You can store user comments, product reviews, image metadata, and transaction records in the same cluster. Elasticsearch can handle all these different kinds of JSON documents, making it much easier to work with diverse data sources.
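As a rough illustration of how field types can be inferred from incoming JSON, here is a small Python sketch. It loosely follows Elasticsearch's default dynamic mapping rules but is only an approximation: date detection, arrays, and nulls are left out, and the names infer_field_type and infer_mapping are hypothetical.

```python
def infer_field_type(value):
    """Guess a mapping type for a JSON value, loosely following dynamic mapping defaults."""
    # bool must be checked before int because bool is a subclass of int in Python.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "float"
    if isinstance(value, str):
        return "text"  # Elasticsearch also adds a .keyword sub-field by default
    if isinstance(value, dict):
        return "object"
    return "unknown"  # arrays, nulls, and date detection are not handled in this sketch


def infer_mapping(document):
    """Infer a type for every top-level field of a document."""
    return {field: infer_field_type(value) for field, value in document.items()}


doc = {"id": 12, "fullName": "Daniel Waters", "phoneNumber": 9876543210}
print(infer_mapping(doc))
# {'id': 'long', 'fullName': 'text', 'phoneNumber': 'long'}
```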
4. Real-Time Data Processing
Another reason to use Elasticsearch is near-real-time data processing: a newly indexed document becomes searchable after the next index refresh, which happens every second by default. This is essential if you need up-to-the-minute data, such as live search results or activity monitoring.
Example: In an e-commerce site, when a customer purchases a product, the application can update the inventory document in Elasticsearch, and the change shows up in search results about a second later.
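Strictly speaking, Elasticsearch search is near real-time: a write becomes visible to searches only after the index is refreshed. The toy model below sketches that behavior in Python. The TinyIndex class and its fields are hypothetical and ignore everything else a real shard does.

```python
class TinyIndex:
    """Toy model of near-real-time search: writes are buffered until a refresh."""

    def __init__(self):
        self._buffer = []      # indexed but not yet searchable
        self._searchable = []  # visible to searches

    def index(self, doc):
        self._buffer.append(doc)

    def refresh(self):
        """Make buffered documents searchable (Elasticsearch does this periodically)."""
        self._searchable.extend(self._buffer)
        self._buffer.clear()

    def search(self, field, value):
        return [doc for doc in self._searchable if doc.get(field) == value]


store = TinyIndex()
store.index({"sku": "A1", "stock": 4})
print(store.search("sku", "A1"))  # [] because no refresh has happened yet
store.refresh()
print(store.search("sku", "A1"))  # [{'sku': 'A1', 'stock': 4}]
```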
5. It's Part of a Larger Ecosystem
Elasticsearch is often used as part of the Elastic Stack (also known as ELK Stack), which includes tools for managing and visualizing data:
Logstash: Collects and processes data.
Kibana: Visualizes data in interactive dashboards.
Beats: Lightweight data shippers for sending data to Elasticsearch.
This ecosystem is especially useful for businesses that need to monitor large amounts of data, like website traffic, server logs, or application performance.
When Should You Use Elasticsearch?
Elasticsearch is perfect for certain situations, especially when you're working with large, dynamic datasets. Here are some specific scenarios where it shines:
1. Real-Time Search
If you have a website, app, or service where users need to search through data in real time, Elasticsearch is a great choice.
Example: A customer searching for “wireless headphones” on an online store, with thousands of products listed, will see the results almost instantly.
2. Log and Event Analytics
For businesses that track user activity or system logs, Elasticsearch helps by storing and searching large volumes of log data. It lets you easily analyze the data to find patterns or identify issues.
Example: A company could use Elasticsearch to analyze log files from its servers, looking for unusual patterns that may indicate a security breach or a system error.
3. Metrics and Monitoring
If you need to monitor key business metrics (like website visits or user activity), Elasticsearch can handle the large volume of time-series data, such as tracking user sign-ins or transaction counts.
Example: An online platform could track the number of active users every minute and display that data on an interactive dashboard using Kibana.
4. Recommendation Systems
Many businesses use Elasticsearch to build recommendation engines. By analyzing users’ behavior or past purchases, Elasticsearch can help suggest products or content that they may be interested in.
Example: After a user buys a laptop, a system built on Elasticsearch might recommend accessories like laptop bags or mouse pads based on the user’s past behavior and preferences.
5. Full-Text Search for Large Datasets
If you have a large amount of text (like articles, documents, or social media posts) and need to search through it, Elasticsearch is ideal. It’s built specifically for text-heavy data.
Example: A news website can index articles and let users search for relevant news stories, even if they don’t know the exact title or keywords.
Elasticsearch Query Examples for Aadhar Data
Basic Query: Retrieve All Documents
This query matches every document in the aadhar index. By default, only the first 10 hits are returned:
GET /aadhar/_search
{
  "query": {
    "match_all": {}
  }
}
Conditional Queries
Match Query
Retrieve all records where gender is "male":
GET /aadhar/_search
{
  "query": {
    "match": {
      "gender": "male"
    }
  }
}
Term Query
Find the document where id is 15:
GET /aadhar/_search
{
  "query": {
    "term": {
      "id": 15
    }
  }
}
Boolean Query with must
Search for records where fullName matches "Akarsh Khatri" and aadhar is 912900983267. Every clause inside must has to match:
GET /aadhar/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "fullName": "Akarsh Khatri" } },
        { "term": { "aadhar": 912900983267 } }
      ]
    }
  }
}
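When requests like this are built in application code, the body is usually assembled as a nested dictionary and serialized to JSON. The sketch below shows one way to do that in plain Python. The helper name bool_must is hypothetical, and no Elasticsearch client is involved.

```python
import json


def bool_must(*clauses):
    """Wrap query clauses in a bool/must query: every clause has to match."""
    return {"query": {"bool": {"must": list(clauses)}}}


body = bool_must(
    {"match": {"fullName": "Akarsh Khatri"}},
    {"term": {"aadhar": 912900983267}},
)
print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the body of a GET /aadhar/_search request.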
Aggregations
Count and Group By fullName
Group records by fullName and count occurrences:
GET /aadhar/_search
{
  "size": 0,
  "aggs": {
    "by_fullName": {
      "terms": {
        "field": "fullName.keyword"
      }
    }
  }
}
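Conceptually, a terms aggregation is a group-by with a count. The Python sketch below reproduces that logic on a few hypothetical in-memory records and shapes the result like the key/doc_count buckets in the aggregation response.

```python
from collections import Counter

# Hypothetical records for illustration only.
records = [
    {"fullName": "Akarsh Khatri"},
    {"fullName": "Daniel Waters"},
    {"fullName": "Akarsh Khatri"},
]

counts = Counter(record["fullName"] for record in records)
# Shape the result like the "buckets" list in a terms aggregation response.
buckets = [{"key": name, "doc_count": n} for name, n in counts.most_common()]
print(buckets)
# [{'key': 'Akarsh Khatri', 'doc_count': 2}, {'key': 'Daniel Waters', 'doc_count': 1}]
```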
Range Query
Retrieve all documents where dateOfBirth is greater than or equal to 1990-01-01:
GET /aadhar/_search
{
  "query": {
    "range": {
      "dateOfBirth": {
        "gte": "1990-01-01T00:00:00"
      }
    }
  }
}
Sum Aggregation
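Calculate the total of the phoneNumber field values. Summing phone numbers is not meaningful in practice; the same syntax applies to any numeric field:
GET /aadhar/_search
{
  "size": 0,
  "aggs": {
    "total_phone_number": {
      "sum": {
        "field": "phoneNumber"
      }
    }
  }
}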
Multiple-Level Grouping
Group records by gender and year of dateOfBirth:
GET /aadhar/_search
{
  "size": 0,
  "aggs": {
    "composite_group": {
      "composite": {
        "sources": [
          { "gender": { "terms": { "field": "gender.keyword" } } },
          { "dob": { "date_histogram": { "field": "dateOfBirth", "calendar_interval": "year" } } }
        ]
      }
    }
  }
}
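The composite aggregation produces one bucket per distinct combination of its sources. The Python sketch below mimics that on hypothetical records by counting (gender, birth year) pairs. In the real response, the date bucket is reported as a timestamp rather than a bare year.

```python
from collections import Counter
from datetime import date

# Hypothetical records for illustration only.
records = [
    {"gender": "male", "dateOfBirth": date(1990, 5, 17)},
    {"gender": "female", "dateOfBirth": date(1990, 11, 2)},
    {"gender": "male", "dateOfBirth": date(1990, 1, 9)},
    {"gender": "male", "dateOfBirth": date(1985, 3, 30)},
]

# One bucket per distinct (gender, birth year) combination.
counts = Counter((r["gender"], r["dateOfBirth"].year) for r in records)
for (gender, year), doc_count in sorted(counts.items()):
    print(gender, year, doc_count)
# female 1990 1
# male 1985 1
# male 1990 2
```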
Pagination
Standard Pagination
Fetch 10 records, starting from the 11th document:
GET /aadhar/_search
{
  "from": 10,
  "size": 10,
  "query": {
    "match_all": {}
  }
}
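The from value is simply the number of hits to skip. A small helper, sketched here in Python under the assumption of 1-based page numbers, shows the arithmetic. The name page_params is hypothetical.

```python
def page_params(page, size=10):
    """Translate a 1-based page number into from/size values for a search request."""
    if page < 1 or size < 1:
        raise ValueError("page and size must be positive")
    return {"from": (page - 1) * size, "size": size}


print(page_params(2))  # {'from': 10, 'size': 10}
```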
Pagination for Large Datasets
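By default, from and size cannot page past the first 10,000 hits (the index.max_result_window setting). To page through larger result sets efficiently, use search_after and pass the sort values of the last hit from the previous page:
GET /aadhar/_search
{
  "size": 10,
  "query": {
    "match_all": {}
  },
  "sort": [
    { "id": "asc" }
  ],
  "search_after": [1000]
}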
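The search_after pattern is a loop: fetch a page, remember the sort value of its last hit, and pass that value into the next request. The Python sketch below simulates the loop against an in-memory list instead of a cluster. The functions fetch_page and scan_all are hypothetical stand-ins, and the sort key is assumed to be a unique id.

```python
def fetch_page(docs, size, search_after=None):
    """Stand-in for one search request sorted by id ascending."""
    ordered = sorted(docs, key=lambda d: d["id"])
    if search_after is not None:
        ordered = [d for d in ordered if d["id"] > search_after]
    return ordered[:size]


def scan_all(docs, size):
    """Page through every document by feeding the last sort value back in."""
    seen, cursor = [], None
    while True:
        page = fetch_page(docs, size, cursor)
        if not page:
            return seen
        seen.extend(page)
        cursor = page[-1]["id"]  # sort value of the last hit on this page


docs = [{"id": i} for i in range(1, 26)]
print(len(scan_all(docs, size=10)))  # 25
```

Because each request starts right after a known sort value, no hits have to be skipped, which is what makes this approach cheaper than a large from offset.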
Update Operations
Update a Specific Document
Update the phoneNumber field for the document with id 16:
POST /aadhar/_update/16
{
  "doc": {
    "phoneNumber": 9876543210
  }
}
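A partial update merges the fields under doc into the existing document and leaves the other fields untouched. The Python sketch below shows that merge for top-level fields only. Nested objects are not handled here, and the function name and sample document are hypothetical.

```python
def apply_partial_update(source, partial_doc):
    """Return a copy of source with the top-level fields of partial_doc merged in."""
    updated = dict(source)
    updated.update(partial_doc)
    return updated


# Hypothetical existing document for illustration only.
existing = {"id": 16, "fullName": "Example Person", "phoneNumber": 9000000000}
print(apply_partial_update(existing, {"phoneNumber": 9876543210}))
# {'id': 16, 'fullName': 'Example Person', 'phoneNumber': 9876543210}
```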
Bulk Update by Query
Update the address field for documents where aadhar is 145721561989:
POST /aadhar/_update_by_query
{
  "query": {
    "term": {
      "aadhar": 145721561989
    }
  },
  "script": {
    "source": "ctx._source.address = 'hyderabad, hitech City, ts State'",
    "lang": "painless"
  }
}
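In effect, an update by query runs the script once for every document that matches the query. The Python sketch below models that on hypothetical in-memory documents. The function update_by_query here is a local stand-in, not a client call.

```python
def update_by_query(docs, matches, script):
    """Apply script to every document for which matches(doc) is true; return the count."""
    updated = 0
    for doc in docs:
        if matches(doc):
            script(doc)
            updated += 1
    return updated


def set_address(doc):
    """Plays the role of the Painless script in the request."""
    doc["address"] = "hyderabad, hitech City, ts State"


# Hypothetical documents for illustration only.
docs = [
    {"aadhar": 145721561989, "address": "old address"},
    {"aadhar": 912900983267, "address": "other address"},
]
updated = update_by_query(docs, lambda d: d["aadhar"] == 145721561989, set_address)
print(updated)  # 1
```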
Insert a Document
Add a new document with specific fields:
PUT /aadhar/_doc/12
{
  "id": 12,
  "fullName": "Daniel Waters",
  "phoneNumber": 9876543210,
  "address": "Updated Address"
}