Niko's Project Corner

I encountered an interesting question on Stack Overflow about fuzzy searching of hashes in Elasticsearch and decided to give it a go. Elasticsearch has native support for fuzzy text searches, but for performance reasons it only supports an edit distance of up to 2. In this context the maximum allowed distance was eight, so an alternative solution was needed. One was found in locality-sensitive hashing (LSH).

The situation is explained in the original question and the proposed solution is explained in my answer. In the stated problem millions of images were hashed into 64 bits, and given a hash we want to find "almost every" image which has a Hamming distance of 8 or less to it. Not too many false matches should be returned either. The hashing method used is also mentioned in Wikipedia's article on locality-sensitive hashing.
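As a reminder of what is being searched for, here is a minimal Python sketch of the Hamming distance between two 64-bit hashes. The function names and the default threshold argument are only illustrative and do not come from the original question or answer.

```python
MASK_64 = 0xFFFFFFFFFFFFFFFF  # keep only the lowest 64 bits


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two 64-bit hashes differ."""
    return bin((a ^ b) & MASK_64).count("1")


def is_match(a: int, b: int, max_distance: int = 8) -> bool:
    """True if the hashes are within the allowed Hamming distance."""
    return hamming_distance(a, b) <= max_distance
```

Comparing a query hash against every stored hash like this is exact but scales linearly with the number of images, which is the cost the LSH samples are meant to avoid.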

When an image is stored to ES, 1024 samples are taken and stored along with the image id and the original hash. Each sample consists of 16 randomly (but consistently) selected bits, which fit in Java's short data type. These values are indexed in Lucene's search index but not stored along with the documents, which minimizes the index size on disk. Also _source and _all were disabled. Under this set-up 1 million documents in 4 shards with 1024 samples take about 6.7 GB of disk space. Without these optimizations the size was 17.2 GB per index, and queries on many indexes took about 67% longer.
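The sampling step could be sketched in Python roughly as follows. This is not the code used for the measurements: the seed value, the function names and the choice of 16 distinct bit positions per sample are assumptions made for illustration. The key point is that the bit positions are generated once from a fixed seed, so every hash is sampled in the same way.

```python
import random

HASH_BITS = 64      # length of the image hash
SAMPLE_BITS = 16    # bits per sample, fits in a 16-bit integer
NUM_SAMPLES = 1024  # samples per image


def make_sample_positions(seed=42, num_samples=NUM_SAMPLES):
    """Pick the bit positions of every sample, repeatably (seed is an assumption)."""
    rng = random.Random(seed)
    return [rng.sample(range(HASH_BITS), SAMPLE_BITS) for _ in range(num_samples)]


def take_samples(hash_value, positions):
    """Pack the selected bits of a hash into one 16-bit value per sample."""
    samples = []
    for bit_positions in positions:
        value = 0
        for i, pos in enumerate(bit_positions):
            value |= ((hash_value >> pos) & 1) << i
        # Unsigned here; a Java short would hold the same 16 bits as a signed value.
        samples.append(value)
    return samples


def count_matching_samples(a, b, positions):
    """Number of samples on which two hashes agree exactly."""
    return sum(x == y for x, y in zip(take_samples(a, positions),
                                      take_samples(b, positions)))
```

Two hashes that differ in only a few bits agree on many samples, while unrelated hashes agree on almost none, so the number of matching samples can be used as a proxy for the Hamming distance.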

The percentiles of query times (1st, 25th, 50th, 75th and 99th) are shown in Figure 1 on a log scale. Times increase significantly once all the data in the query context no longer fits in memory (24 GB) and has to be read from two SSDs instead. On 3 million documents the median response time was about 250 ms, but for 6 million documents it grew to 6000 ms! The memory requirement could be lowered by taking fewer LSH samples, but this would result in a poorer precision vs. recall tradeoff.
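To get a feeling for this tradeoff, the expected number of matching samples can be estimated with a small Python sketch. It assumes that each sample consists of 16 distinct bit positions chosen uniformly at random. In that case a sample matches exactly when it avoids all of the differing bits. The function names are illustrative.

```python
from math import comb


def sample_match_probability(distance, hash_bits=64, sample_bits=16):
    """Probability that one sample avoids all `distance` differing bits."""
    return comb(hash_bits - distance, sample_bits) / comb(hash_bits, sample_bits)


def expected_matches(distance, num_samples=1024):
    """Expected number of matching samples for two hashes at a given distance."""
    return num_samples * sample_match_probability(distance)


if __name__ == "__main__":
    for d in (0, 2, 4, 8, 12, 16):
        print(d, round(expected_matches(d), 1))
```

The probability falls quickly as the distance grows. With fewer samples the expected counts shrink proportionally, which makes it harder to separate true matches from random agreement.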

Figure 1: Query times for a varying number of documents (log scale on time).


LinkedIn	(Niko Nyrhilä)
GitHub	(nikonyrh)
Stackoverflow	(nikonyrh)

Very fuzzy searching with Elasticsearch
