Niko's Project Corner

An efficient schema for hierarchical data on Elasticsearch

Description Supporting arbitrary nesting and fast queries
Languages Python
Tags Busi­ness In­tel­li­gence
Duration Fall 2016
Modified 20th November 2016

Many busi­nesses gen­er­ate rich datasets from which valu­able in­sights can be dis­cov­ered. A ba­sic start­ing point is to an­alyze sep­arate events such as item sales, tourist at­trac­tion vis­its or movies seen. From these a time se­ries (to­tal sales / item / day, to­tal vis­its / tourist spot / week) or ba­sic met­rics (his­togram of movie rat­ings) can be ag­gre­gated. Things get a lot more in­ter­est­ing when in­di­vid­ual data points can be linked to­gether by a com­mon id, such as items be­ing bought in the same bas­ket or by the same house hold (iden­ti­fied by a loy­alty card), the spots vis­ited by a tourist group through out their jour­ney or movie rat­ings given by a speci­fic user. This richer data can be used to build rec­om­men­da­tion en­gi­nes, iden­tify sub­sti­tute prod­ucts or ser­vices and do clus­ter­ing anal­ysis. This ar­ti­cle de­scribes a schema for Elas­tic­search which sup­ports ef­fi­cient fil­ter­ing and ag­gre­ga­tions, and is au­to­mat­ically com­pat­ible with new data val­ues.

It is true that even mod­er­ately large datasets may have a de­cent per­for­mance on a sin­gle tra­di­tional SQL server with mod­ern tech­nolo­gies such as colum­nar stor­age en­gi­nes. How­ever I find hor­izon­tally scal­able so­lu­tions more in­ter­est­ing as they don't put a hard limit on the amount of data or users you can han­dle at once, and ex­pen­sive ver­ti­cal scal­ing isn't the only op­tion. Also you can re­duce the amount of raw data by a few or­ders of mag­ni­tude by sam­pling tech­niques but then your an­swers won't be ex­act any­more, and ide­ally you'd han­dle and in­di­cate this un­cer­tainty on your re­port­ing tools some­how. Also in SQL you don't need to think about the ex­pected queries so much in ad­vance as you can flex­ibly JOIN data from dif­fer­ent ta­bles on a ad-hoc man­ner. Also the data size on disk could be smaller as you don't need to de-nor­mal­ize data on doc­uments as you have to do with most NoSQL so­lu­tions.

Many ques­tions can be an­swered by just stor­ing a set of nu­mer­ical ids into a field, whereas some ques­tions re­quire ad­di­tional data such as to­tal amount of money or time spent on a speci­fic item. The first dataset can ap­ply fil­ters like "tourists who vis­ited Paris" but the for­mer can fil­ter for "tourists who spent more than two days in Paris". Also the first one can only ag­gre­gate "most fre­quently vis­ited restau­rants of tourists who vis­ited a mu­seum in Paris", but the sec­ond one can ag­gre­gate "restau­rants with most money spent by tourists who stayed in Paris for just one day". The first one is a lot sim­pler to im­ple­ment and query in Elas­tic­search as it di­rectly sup­ports mul­ti­ple val­ues on a field, the sec­ond one re­quires nested doc­uments such as {"lo­ca­tion_id": 123, "lo­ca­tion_type": "city", "money_spent": 123.4, "time_spent": 17.5}. Note that this data model does not record the or­der in which the cities were vis­ited, but can eas­ily be han­dled by hav­ing prev_lo­ca­tion_id and next_lo­ca­tion_id fields. On some do­mains a graph database could be more suit­able op­tion than oth­ers.

Nat­urally you should store top-level ag­gre­gates di­rectly to the root doc­ument, such as to­tal_money_spent and to­tal_time_spent. They have min­imal ef­fect on data size but greatly sim­plify fil­ter­ing and ag­gre­ga­tions. Also at the time of writ­ing Kibana does not seem to sup­port nested ag­gre­ga­tions (dis­cus­sion at Github). I would also store other types of in­for­ma­tion such as the num­ber of cities and POIs vis­ited, and fields which can as­sumed to be static such as num­ber of tourists and their na­tion­al­ities.

An other im­por­tant topic is how to model hi­er­ar­chi­cal na­ture of the data. For ex­am­ple we have the ge­ograph­ical di­men­sion with the lo­ca­tion it­self (ide­ally an in­te­ger id), its street, re­gion of city, city, re­gion of coun­try, coun­try, re­gion of con­ti­nent and the con­ti­nent it­self. A straight-for­ward op­tion is a pure co­or­di­nates-based one, which has im­plicit re­gional hi­er­ar­chy as de­fined by the geo­hash pre­fix ag­gre­ga­tion. It can create cool heat-maps at desired resolution but loses the explicit information on data's hierarchy. On some use cases a pie diagram with two or more levels of hierarchy might be an useful representation, for example to see the percentage of time spent on most popular cities, and for each city the percentage of time spent on each type of attraction. This is made possible by simply adding parent_location_id to each sub-document.

Related blog posts: