Niko's Project Corner

GitHub at other sites

Benchmarking Elasticsearch and MS SQL on NYC Taxis

(7th May 2017)

The NYC Taxi dataset has been used in quite a few benchmarks (for example by Mark Litwintschik), perhaps because it has a fairly rich set of columns whose meaning is mostly trivial to understand. I developed a Clojure project which generates Elasticsearch and SQL queries with three different templates for filters and four different templates for aggregations. This should give a decent indication of these databases' performance under a typical workload, although this test did not run queries concurrently and it does not mix different query types while the benchmark is running. However, benchmarks are always tricky to design and execute properly, so I'm sure there is room for improvement. In this project the tested database engines were Elasticsearch 5.2.2 (with Oracle JVM 1.8.0_121) and MS SQL Server 2014.

Languages: Clojure
Tags: GitHub Databases Elasticsearch SQL
GitHub: nikonyrh/nyc-taxi-data

Analyzing NYC Taxi dataset with Elasticsearch and Kibana

(19th March 2017)

The NYC taxicab dataset has seen lots of love from many data scientists such as Todd W. Schneider and Mark Litwintschik. I decided to give it a go while learning Clojure, as I suspected that it might be a good language for ETL jobs. This article describes how I loaded the dataset, normalized its conventions and columns, converted the rows from CSV to JSON and stored them in Elasticsearch.

Languages: Clojure
Tags: GitHub JVM Elasticsearch Databases Business Intelligence Kibana
GitHub: nikonyrh/nyc-taxi-data

Mustache templates in Clojure

(25th January 2017)

Mustache is a well-known template system with implementations in most popular languages. At its core it is logicless, so the same templates can be used directly in other projects. For example, I am planning to port this blog engine from PHP to Clojure, but I only need to replace the LaTeX parsing and HTML generation parts; I should be able to use the existing Mustache templates without any modifications. To learn Clojure programming I decided not to use the recommended library but instead to implement my own.

Languages: Clojure
Tags: Blog GitHub JVM
GitHub: nikonyrh/mustache-clj

English hyphenation algorithm in Clojure

(17th August 2016)

This is nothing that spectacular (as if anything on my blog is), but I still wanted to describe the outline of the project of porting the hyphenation algorithm from PHP to Clojure. The implementation is only about 80 lines of code + comments + 20 lines of unit tests. For comparison, the original PHP abomination is about 160 LoC, although it is a bit bloated by implementing the pattern search via a trie data structure instead of using the strpos function.

Languages: Clojure
Tags: Hyphenation Blog GitHub JVM
GitHub: nikonyrh/hyphenator-clj

Finnish Invoice Template

(7th August 2016)

After finishing a paid project, ideally a formal-looking invoice would be sent to the client. I know there are many commercial products available, but I couldn't find a good open source alternative, especially one with the standard Finnish formatting. I was happy to find jheusala/finnish-invoice-template on GitHub, which had all of the tricky LaTeX stuff done.

Languages: LaTeX
Tags: GitHub Entrepreneurship
GitHub: nikonyrh/finnish-invoice-template

Nginx docker image for easy file access via HTTP

(16th April 2016)

Often I find myself having an SSH connection to a remote server, and I'd like to retrieve some files to my own machine. Common methods for this include a Windows/Samba share, SSHFS and uploading to the cloud (which isn't trivial to do via plain cURL). Here an easy-to-use alternative is described: a single-line command to load and run a Docker image which contains a pre-configured Nginx instance. The files can then be accessed via plain HTTP at the user-assigned port (assuming a firewall isn't blocking it).

Languages: Bash
Tags: Docker Spark Nginx GitHub
GitHub: nikonyrh/docker-scripts
DockerHub: nikonyrh/nginx_bridge

Scalable analytics with Docker, Spark and Python

(23rd December 2015)

Tra­di­tion­ally data sci­en­tists in­stalled soft­ware pack­ages di­rectly to their ma­chi­nes, wrote code, trained mod­els, saved re­sults to lo­cal files and ap­plied mod­els to new data in batch pro­cess­ing style. New data-driven prod­ucts re­quire rapid de­vel­op­ment of new mod­els, scal­able train­ing and easy in­te­gra­tion to other as­pects of the busi­ness. Here I am propos­ing one (per­haps al­ready well-known) cloud-ready ar­chi­tec­ture to meet these re­quire­ments.

Languages: Bash Python
Tags: Architecture Docker Spark Nginx GitHub JVM
GitHub: nikonyrh/docker-scripts

Very fuzzy searching with CUDA

(2nd November 2015)

This is an alternative answer to the question I encountered on Stack Overflow about fuzzy searching of hashes in Elasticsearch. My original answer used locality-sensitive hashing. Superior speed and a simple implementation were gained by using NVIDIA's CUDA via the Thrust library.

Languages: C++ CUDA
Tags: Thrust Databases GitHub Stack Overflow
GitHub: nikonyrh/stackoverflow-scripts

Very fuzzy searching with Elasticsearch

(21st October 2015)

I encountered an interesting question on Stack Overflow about fuzzy searching of hashes in Elasticsearch and decided to give it a go. Elasticsearch has native support for fuzzy text searches, but for performance reasons it only supports an edit distance of up to 2. In this context the maximum allowed distance was eight, so an alternative solution was needed. Locality-sensitive hashing provided one.
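To make the idea concrete, here is a minimal sketch and not the code from the post. It assumes the hashes are 64-bit integers compared by Hamming distance (the teaser does not state the width or the metric), and it uses a deterministic banding variant of locality-sensitive hashing rather than random bit sampling. With a maximum distance of eight, splitting each hash into nine disjoint bands guarantees that any stored hash within range shares at least one identical band with the query. The names `band_keys`, `build_index` and `search` are made up for this illustration.

```python
from collections import defaultdict

BITS = 64             # assumed hash width, not stated in the post
MAX_DIST = 8          # maximum allowed distance mentioned in the text
BANDS = MAX_DIST + 1  # pigeonhole: 8 differing bits cannot touch all 9 bands


def band_keys(h):
    """Split a hash into BANDS disjoint bit ranges, tagged with the band index."""
    keys = []
    for i in range(BANDS):
        lo = i * BITS // BANDS
        hi = (i + 1) * BITS // BANDS
        mask = (1 << (hi - lo)) - 1
        keys.append((i, (h >> lo) & mask))
    return keys


def build_index(hashes):
    """Map every (band index, band value) pair to the hashes containing it."""
    index = defaultdict(set)
    for h in hashes:
        for key in band_keys(h):
            index[key].add(h)
    return index


def search(index, query):
    """Collect candidates sharing any band, then verify the exact distance."""
    candidates = set()
    for key in band_keys(query):
        candidates |= index.get(key, set())
    return sorted(h for h in candidates
                  if bin(h ^ query).count("1") <= MAX_DIST)
```

Each band lookup is an exact match, which is the kind of query a search engine handles well; the exact distance check afterwards removes false positives among the candidates.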

Languages: Python
Tags: Elasticsearch Databases GitHub Stack Overflow
GitHub: nikonyrh/stackoverflow-scripts

Anonymous and secure information storing and sharing

(25th April 2015)

Nowadays encryption is standard practice on the web when data is in transit, and there are even a few services which offer client-side encryption and thus are truly end-to-end. Nevertheless, for some reason they all require you to create an account by providing your email and password, although this is not strictly necessary for storing and sharing data. In this system the document id, encryption key and HMAC key are generated ad hoc on the client and only the minimal necessary information is revealed to the server. A live demo should be available at noknowl

Languages: PHP
Tags: GitHub Encryption
GitHub: nikonyrh/noknowledgenotes

Automated image capturing + API

(10th April 2015)

Out of interest in nature observation, computer vision, image processing and so forth, I developed an automated system to capture one photo per minute and store it on disk. The project also has Bash and PHP scripts coordinating external tools such as montage for image stitching and mencoder for video generation. PHP also provides an HTTP API for image generation and file size statistics.

Languages: Bash PHP
Tags: GitHub
GitHub: nikonyrh/webcammon

English hyphenation algorithm in PHP

(10th July 2013)

A good presentation about hyphenation in HTML documents can be seen here, but it is oriented towards the client side (JavaScript). Basically you shouldn't use justified text unless it is hyphenated, because long words will cause huge spaces between words when the line is stretched to the whole width of the element. I found a few PHP scripts such as phpHyphenator 1.5, but typically they weren't implemented as a single stand-alone PHP class. Since the underlying algorithm is fairly simple, I decided to write it from scratch.
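As a rough illustration of why the algorithm is simple, below is a minimal Python sketch of pattern-based hyphenation in the style of Liang's algorithm used by TeX. The teaser does not name the algorithm, so this is an assumption and not the PHP class itself. The handful of patterns are of the kind found in TeX's English pattern file and are only enough for the one example word; the function names and the `min_left` / `min_right` limits are made up for this sketch.

```python
import re

# Tiny illustrative pattern subset; a real pattern file has thousands of entries.
PATTERNS = ["hy3ph", "he2n", "hena4", "hen5at", "1na", "n2at", "1tio", "2io", "o2n"]


def parse_pattern(pattern):
    """Split 'hen5at' into its letters and the weight standing before each letter."""
    letters = re.sub(r"\d", "", pattern)
    weights = [0] * (len(letters) + 1)
    pos = 0
    for ch in pattern:
        if ch.isdigit():
            weights[pos] = int(ch)
        else:
            pos += 1
    return letters, weights


def hyphenate(word, patterns=PATTERNS, min_left=2, min_right=3):
    table = dict(parse_pattern(p) for p in patterns)
    padded = "." + word.lower() + "."   # dots mark the word boundaries
    scores = [0] * (len(padded) + 1)    # scores[i] is the gap before padded[i]
    # Every substring that matches a pattern contributes its weights;
    # the largest weight seen at each gap wins.
    for start in range(len(padded)):
        for end in range(start + 1, len(padded) + 1):
            weights = table.get(padded[start:end])
            if weights is not None:
                for offset, w in enumerate(weights):
                    scores[start + offset] = max(scores[start + offset], w)
    # An odd final weight allows a break, an even one forbids it.
    parts, last = [], 0
    for k in range(min_left, len(word) - min_right + 1):
        if scores[k + 1] % 2 == 1:
            parts.append(word[last:k])
            last = k
    parts.append(word[last:])
    return "-".join(parts)


print(hyphenate("hyphenation"))  # hy-phen-ation
```

The brute-force substring loop keeps the sketch short; looking the patterns up through a trie would avoid testing every substring.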

Languages: PHP
Tags: Hyphenation Blog GitHub
GitHub: nikonyrh/hyphenator-php