background

Niko's Project Corner

Language Python and tag Elasticsearch at other sites


An efficient schema for hierarchical data on Elasticsearch

(20th November 2016)

Many busi­nesses gen­er­ate rich datasets from which valu­able in­sights can be dis­cov­ered. A ba­sic start­ing point is to an­alyze sep­arate events such as item sales, tourist at­trac­tion vis­its or movies seen. From these a time se­ries (to­tal sales / item / day, to­tal vis­its / tourist spot / week) or ba­sic met­rics (his­togram of movie rat­ings) can be ag­gre­gated. Things get a lot more in­ter­est­ing when in­di­vid­ual data points can be linked to­gether by a com­mon id, such as items be­ing bought in the same bas­ket or by the same house hold (iden­ti­fied by a loy­alty card), the spots vis­ited by a tourist group through out their jour­ney or movie rat­ings given by a speci­fic user. This richer data can be used to build rec­om­men­da­tion en­gi­nes, iden­tify sub­sti­tute prod­ucts or ser­vices and do clus­ter­ing anal­ysis. This ar­ti­cle de­scribes a schema for Elas­tic­search which sup­ports ef­fi­cient fil­ter­ing and ag­gre­ga­tions, and is au­to­mat­ically com­pat­ible with new data val­ues.

Languages: Python
Tags: Business Intelligence Databases Elasticsearch

Very fuzzy searching with Elasticsearch

(21st October 2015)

I en­coun­tered an in­ter­est­ing ques­tion at Stack Over­flow about fuzzy search­ing of hashes on Elas­tic­search and de­cided to give it a go. It has na­tive sup­port for fuzzy text searches but due to per­for­mance rea­sons it only sup­ports an edit dis­tance up-to 2. In this con­text the max­imum al­lowed dis­tance was eight so an al­ter­na­tive so­lu­tion was needed. A so­lu­tion was found from lo­cal­ity-sen­si­tive hash­ing.

Languages: Python
Tags: Elasticsearch Databases GitHub Stack Overflow
GitHub: nikonyrh/stackoverflow-scripts