Niko's Project Corner

Databases at other sites

Benchmarking Elasticsearch and MS SQL on NYC Taxis

(7th May 2017)

The NYC Taxi dataset has been used on quite many bench­marks (for ex­am­ple by Mark Litwintschik), per­haps be­cause it has a quite rich set of columns but their mean­ing is mostly triv­ial to un­der­stand. I de­vel­oped a Clo­jure pro­ject which gen­er­ates Elas­tic­search and SQL queries with three dif­fer­ent tem­plates for fil­ters and four dif­fer­ent tem­plates of ag­gre­ga­tions. This should give a de­cent in­di­ca­tion of these databases per­for­mance un­der a typ­ical work­load, al­though this test did not run queries con­cur­rently and it does not mix dif­fer­ent query types when the bench­mark is run­ning. How­ever bench­marks are al­ways tricky to de­sign and ex­ecute prop­erly so I'm sure there is room for im­prove­ments. In this pro­ject the tested database en­gi­nes were Elas­tic­search 5.2.2 (with Or­acle JVM 1.8.0_121) and MS SQL Server 2014.

Languages: Clojure
Tags: GitHub Databases Elasticsearch SQL
GitHub: nikonyrh/nyc-taxi-data

Analyzing NYC Taxi dataset with Elasticsearch and Kibana

(19th March 2017)

The NYC taxi­cab dataset has seen lots of love from many data sci­en­tists such as Todd W. Schei­der and Mark Litwintschik. I de­cided to give it a go while learn­ing Clo­jure, as I sus­pected that it might be a good lan­guage for ETL jobs. This ar­ti­cle de­scribes how I loaded the dataset, nor­mal­ized its con­ven­tions and columns, con­verted from CSV to JSON and stored them to Elas­tic­search.

Languages: Clojure
Tags: GitHub JVM Elasticsearch Databases Business Intelligence Kibana
GitHub: nikonyrh/nyc-taxi-data

An efficient schema for hierarchical data on Elasticsearch

(20th November 2016)

Many busi­nesses gen­er­ate rich datasets from which valu­able in­sights can be dis­cov­ered. A ba­sic start­ing point is to an­alyze sep­arate events such as item sales, tourist at­trac­tion vis­its or movies seen. From these a time se­ries (to­tal sales / item / day, to­tal vis­its / tourist spot / week) or ba­sic met­rics (his­togram of movie rat­ings) can be ag­gre­gated. Things get a lot more in­ter­est­ing when in­di­vid­ual data points can be linked to­gether by a com­mon id, such as items be­ing bought in the same bas­ket or by the same house hold (iden­ti­fied by a loy­alty card), the spots vis­ited by a tourist group through out their jour­ney or movie rat­ings given by a speci­fic user. This richer data can be used to build rec­om­men­da­tion en­gi­nes, iden­tify sub­sti­tute prod­ucts or ser­vices and do clus­ter­ing anal­ysis. This ar­ti­cle de­scribes a schema for Elas­tic­search which sup­ports ef­fi­cient fil­ter­ing and ag­gre­ga­tions, and is au­to­mat­ically com­pat­ible with new data val­ues.

Languages: Python
Tags: Business Intelligence Databases Elasticsearch

Caching and perf. monitoring with Redis and Python

(10th October 2016)

When im­ple­ment­ing real-time APIs most of the time server load can greatly be re­duced by caching fre­quently ac­cessed and rarely mod­ified data, or re-us­able cal­cu­la­tion re­sults. Luck­ily Python has sev­eral fea­tures which make it easy to add new con­structs and wrap­pers to the lan­guage, for ex­am­ple thanks to *args, **kwargs func­tion ar­gu­ments, first-class func­tions, dec­ora­tors and so fort. Thus it doesn't take too much ef­fort to im­ple­ment a @cached dec­ora­tor with business-specific logic on cache invalidation. Redis is the perfect fit for the job thanks to its high performance, binary-friendly key-value store with TTL and different data eviction policies and support for other data structures which make it trivial to store additional key metrics there.

Languages: Python
Tags: Databases Redis

Service discovery with Docker, Consul and Registrator

(28th August 2016)

Tra­di­tion­ally com­put­ers were named and not eas­ily re­placed in the event it broke down. Server soft­ware was lis­ten­ing on a hard-coded port, and to link pieces to­gether these ma­chine names and ser­vice ports were hard-coded into other soft­ware's con­fig­ura­tion files. Now in the era of cloud com­put­ing and ser­vice ori­ented ar­chi­tec­ture this is no longer an ad­equate so­lu­tion, thus elas­tic scal­ing and ser­vice dis­cov­ery are be­com­ing the norm. One easy so­lu­tion is to com­bine the pow­ers of Docker, Con­sul and Reg­is­tra­tor.

Languages: Bash
Tags: Architecture Docker Databases Nginx
GitHub: nikonyrh/docker-scripts

Very fuzzy searching with CUDA

(2nd November 2015)

This is an al­ter­na­tive an­swer to the ques­tion I en­coun­tered at Stack Over­flow about fuzzy search­ing of hashes on Elas­tic­search. My orig­inal an­swer used lo­cal­ity-sen­si­tive hash­ing. Su­pe­rior speed and sim­ple im­ple­men­ta­tion were gained by us­ing nVidia's CUDA via Thrust li­brary.

Languages: C++ CUDA
Tags: Thrust Databases GitHub Stack Overflow
GitHub: nikonyrh/stackoverflow-scripts

Very fuzzy searching with Elasticsearch

(21st October 2015)

I en­coun­tered an in­ter­est­ing ques­tion at Stack Over­flow about fuzzy search­ing of hashes on Elas­tic­search and de­cided to give it a go. It has na­tive sup­port for fuzzy text searches but due to per­for­mance rea­sons it only sup­ports an edit dis­tance up-to 2. In this con­text the max­imum al­lowed dis­tance was eight so an al­ter­na­tive so­lu­tion was needed. A so­lu­tion was found from lo­cal­ity-sen­si­tive hash­ing.

Languages: Python
Tags: Elasticsearch Databases GitHub Stack Overflow
GitHub: nikonyrh/stackoverflow-scripts

Server monitoring and analytics

(26th April 2014)

There al­ready ex­ists many server mon­itor­ing and log­ging sys­tems, but I was in­ter­ested to de­velop and de­ploy my own. It was also a good chance to learn about Elas­tic­Search's ag­gre­ga­tion queries (new in v1.0.0). Orig­inally Elas­tic­Search was de­signed to provide scal­able doc­ument based stor­age and ef­fi­cient search, but now it is gain­ing more ca­pa­bil­ities. The pro­ject con­sists of a cron job which pushes new met­rics to Elas­tic­Search, a REST­ful JSON API to query statis­tics on recorded num­bers and plot the re­sults in a browser (based on High­Charts).

Languages: PHP
Tags: Elasticsearch Databases

Efficient in-memory analytical database

(1st December 2013)

Tra­di­tional databases such as MySQL are not de­signed to per­form well in an­alyt­ical queries, which re­quires ac­cess to pos­si­bly all of the rows on se­lected columns. This re­sults in a full table scan and it can­not ben­efit from any in­dexes. Column-ori­ented en­gi­nes try to cir­cum­vent this is­sue, but I went one step deeper and made the stor­age column value ori­ented, sim­ilar to an in­verted in­dex. This re­sults in 2 — 10× speedup from op­ti­mized colum­nar so­lu­tions and 80× the speed of MySQL.

Languages: C++
Tags: FastCGI SQL Databases