
Niko's Project Corner

Very fuzzy searching with Elasticsearch

Description Fuzzy searching of 64-bit image hashes with Elasticsearch and locality-sensitive hashing.
Languages Python
Tags Elasticsearch, Databases, GitHub, Stack Overflow
Duration Fall 2015
Modified 21st October 2015
GitHub nikonyrh/stackoverflow-scripts

I encountered an interesting question on Stack Overflow about fuzzy searching of hashes in Elasticsearch and decided to give it a go. Elasticsearch has native support for fuzzy text searches, but for performance reasons it only supports an edit distance of up to 2. In this context the maximum allowed distance was eight, so an alternative solution was needed. One was found in locality-sensitive hashing.
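For reference, this is roughly what Elasticsearch's built-in fuzzy query looks like, written here as a Python dict; the field name and the hex value are made-up examples. The fuzziness parameter is capped at 2, which is why it cannot cover a Hamming distance of up to 8.

```python
# Elasticsearch's built-in fuzzy query as a Python dict. The "hash" field
# name and the hex value are made-up examples; "fuzziness" cannot exceed 2.
native_fuzzy_query = {
    'query': {
        'fuzzy': {
            'hash': {
                'value': '0123456789abcdef',  # 64-bit hash as a hex string
                'fuzziness': 2,               # maximum supported edit distance
            }
        }
    }
}
```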

The situation is explained in the original question and the proposed solution in my answer. In the stated problem millions of images were hashed into 64 bits, and given a hash we want to find "almost every" image whose hash is within a Hamming distance of 8, without returning too many false matches. The hashing method used is also mentioned in the Wikipedia article on locality-sensitive hashing.
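The matching criterion itself is trivial to express in Python; the hash values below are made up just to illustrate the check.

```python
def hamming_distance(hash_a, hash_b):
    """Number of differing bits between two 64-bit integer hashes."""
    return bin(hash_a ^ hash_b).count('1')

# Two images are considered a match when their hashes differ in at most 8 bits.
a = 0x0123456789ABCDEF
b = a ^ 0b10110001                  # flip four of the bits
print(hamming_distance(a, b) <= 8)  # True
```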

When an image is stored to ES, 1024 samples are taken and stored along with the image id and the original hash. Each sample consists of 16 randomly (but consistently) selected bits, so it fits into Java's short data type. These values are indexed by Lucene but not stored along with the documents, which minimizes the index size on disk; _source and _all were disabled as well. Under this set-up 1 million documents in 4 shards with 1024 samples each take about 6.7 GB of disk space. Without these optimizations the index size was 17.2 GB and queries over many indexes took about 67% longer.
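The actual scripts are in the linked GitHub repository; the sketch below only illustrates the sampling idea in Python. The fixed seed, the lsh_* field names and the make_document() helper are my own assumptions for illustration, not the repository's code.

```python
import random

HASH_BITS = 64        # each image hash is 64 bits
N_SAMPLES = 1024      # LSH samples taken per image
BITS_PER_SAMPLE = 16  # each sample fits into a 16-bit value (Java's short)

# Bit positions are chosen randomly but consistently: a fixed seed guarantees
# that indexing and querying use exactly the same projections.
rng = random.Random(42)
SAMPLE_BITS = [rng.sample(range(HASH_BITS), BITS_PER_SAMPLE)
               for _ in range(N_SAMPLES)]

def lsh_samples(image_hash):
    """Project a 64-bit hash into 1024 16-bit sample values."""
    samples = []
    for bits in SAMPLE_BITS:
        value = 0
        for i, bit in enumerate(bits):
            value |= ((image_hash >> bit) & 1) << i
        samples.append(value)
    return samples

def make_document(image_id, image_hash):
    """Build the document to be indexed: id, original hash and the samples.
    The lsh_* fields are indexed but not stored (and _source is disabled)."""
    doc = {'id': image_id, 'hash': image_hash}
    for i, value in enumerate(lsh_samples(image_hash)):
        doc['lsh_%d' % i] = value
    return doc
```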

Query time percentiles (1st, 25th, 50th, 75th and 99th) are shown in Figure 1 on a log scale. Times increase significantly once all the data in the query context no longer fits in memory (24 GB) and has to be read from two SSDs instead. For 3 million documents the median response time was about 250 ms, but for 6 million documents it grew to 6000 ms! The memory requirement could be lowered by taking fewer LSH samples, but that would result in a poorer precision vs. recall tradeoff.
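The post does not spell out the query itself, but one plausible shape, given the indexing scheme above, is a bool query with one should clause per sample and a minimum_should_match threshold. The index name, the lsh_* field names and the 5% threshold below are assumptions for illustration only; the samples argument is the list of 1024 16-bit values computed with the same projections as at indexing time.

```python
from elasticsearch import Elasticsearch

def fuzzy_hash_query(es, samples, index='images', min_match='5%'):
    """Find documents sharing 'enough' LSH samples with the query hash.
    Documents that share many samples are very likely within the wanted
    Hamming distance; min_match tunes the precision vs. recall tradeoff."""
    should = [{'term': {'lsh_%d' % i: value}}
              for i, value in enumerate(samples)]
    body = {
        'query': {
            'bool': {
                'should': should,
                'minimum_should_match': min_match,
            }
        }
    }
    return es.search(index=index, body=body)

# Usage, reusing lsh_samples() from the earlier sketch:
# es = Elasticsearch()
# hits = fuzzy_hash_query(es, lsh_samples(0x0123456789ABCDEF))
```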

Figure 1: Query times for varying amounts of data (log scale on time).

Related blog posts:

BenchmarkTaxiridesEsSql
CljTaxirides
FuzzySearchCuda
HierarchicalSchemaEs
ServerMon