background

Niko's Project Corner

Elasticsearch at other sites


JGit blame for fun and profit(?)

(2nd April 2018)

Soft­ware pro­jects are typ­ically "tracked" on a ver­sion con­trol sys­tem (VCS). Each "ver­sion" of the code is called a "com­mit", which does not only store file con­tents, but also plenty of meta­data. This cre­ates a very rich set of data, and in the age of open source there are thou­sands of pro­jects to study. A few ex­am­ples are Git of The­seus and Gi­ten­tial, but by fo­cus­ing on "git blame" (see who has com­mit­ted each line on each file) I hope to bring some­thing new to the table. In short I have an­alyzed how source code gets re­placed by newer code, track­ing the top­ics of who, when and why, and how old the code was.

Languages: Clojure
Tags: Git Elasticsearch

Benchmarking Elasticsearch and MS SQL on NYC Taxis

(7th May 2017)

The NYC Taxi dataset has been used on quite many bench­marks (for ex­am­ple by Mark Litwintschik), per­haps be­cause it has a quite rich set of columns but their mean­ing is mostly triv­ial to un­der­stand. I de­vel­oped a Clo­jure pro­ject which gen­er­ates Elas­tic­search and SQL queries with three dif­fer­ent tem­plates for fil­ters and four dif­fer­ent tem­plates of ag­gre­ga­tions. This should give a de­cent in­di­ca­tion of these databases per­for­mance un­der a typ­ical work­load, al­though this test did not run queries con­cur­rently and it does not mix dif­fer­ent query types when the bench­mark is run­ning. How­ever bench­marks are al­ways tricky to de­sign and ex­ecute prop­erly so I'm sure there is room for im­prove­ments. In this pro­ject the tested database en­gi­nes were Elas­tic­search 5.2.2 (with Or­acle JVM 1.8.0_121) and MS SQL Server 2014.

Languages: Clojure
Tags: GitHub Databases Elasticsearch SQL
GitHub: nikonyrh/nyc-taxi-data

Analyzing NYC Taxi dataset with Elasticsearch and Kibana

(19th March 2017)

The NYC taxi­cab dataset has seen lots of love from many data sci­en­tists such as Todd W. Schei­der and Mark Litwintschik. I de­cided to give it a go while learn­ing Clo­jure, as I sus­pected that it might be a good lan­guage for ETL jobs. This ar­ti­cle de­scribes how I loaded the dataset, nor­mal­ized its con­ven­tions and columns, con­verted from CSV to JSON and stored them to Elas­tic­search.

Languages: Clojure
Tags: GitHub JVM Elasticsearch Databases Business Intelligence Kibana
GitHub: nikonyrh/nyc-taxi-data

An efficient schema for hierarchical data on Elasticsearch

(20th November 2016)

Many busi­nesses gen­er­ate rich datasets from which valu­able in­sights can be dis­cov­ered. A ba­sic start­ing point is to an­alyze sep­arate events such as item sales, tourist at­trac­tion vis­its or movies seen. From these a time se­ries (to­tal sales / item / day, to­tal vis­its / tourist spot / week) or ba­sic met­rics (his­togram of movie rat­ings) can be ag­gre­gated. Things get a lot more in­ter­est­ing when in­di­vid­ual data points can be linked to­gether by a com­mon id, such as items be­ing bought in the same bas­ket or by the same house hold (iden­ti­fied by a loy­alty card), the spots vis­ited by a tourist group through out their jour­ney or movie rat­ings given by a speci­fic user. This richer data can be used to build rec­om­men­da­tion en­gi­nes, iden­tify sub­sti­tute prod­ucts or ser­vices and do clus­ter­ing anal­ysis. This ar­ti­cle de­scribes a schema for Elas­tic­search which sup­ports ef­fi­cient fil­ter­ing and ag­gre­ga­tions, and is au­to­mat­ically com­pat­ible with new data val­ues.

Languages: Python
Tags: Business Intelligence Databases Elasticsearch

Very fuzzy searching with Elasticsearch

(21st October 2015)

I en­coun­tered an in­ter­est­ing ques­tion at Stack Over­flow about fuzzy search­ing of hashes on Elas­tic­search and de­cided to give it a go. It has na­tive sup­port for fuzzy text searches but due to per­for­mance rea­sons it only sup­ports an edit dis­tance up-to 2. In this con­text the max­imum al­lowed dis­tance was eight so an al­ter­na­tive so­lu­tion was needed. A so­lu­tion was found from lo­cal­ity-sen­si­tive hash­ing.

Languages: Python
Tags: Elasticsearch Databases GitHub Stack Overflow
GitHub: nikonyrh/stackoverflow-scripts

Server monitoring and analytics

(26th April 2014)

There al­ready ex­ists many server mon­itor­ing and log­ging sys­tems, but I was in­ter­ested to de­velop and de­ploy my own. It was also a good chance to learn about Elas­tic­Search's ag­gre­ga­tion queries (new in v1.0.0). Orig­inally Elas­tic­Search was de­signed to provide scal­able doc­ument based stor­age and ef­fi­cient search, but now it is gain­ing more ca­pa­bil­ities. The pro­ject con­sists of a cron job which pushes new met­rics to Elas­tic­Search, a REST­ful JSON API to query statis­tics on recorded num­bers and plot the re­sults in a browser (based on High­Charts).

Languages: PHP
Tags: Elasticsearch Databases