
Niko's Project Corner

Chess video search engine

(13th June 2021)

YouTube has quite good search functionality based on video titles, descriptions and maybe even subtitles, but it doesn't go into the actual video contents or provide accurate timestamps for users' searches. A YouTuber, "Agadmator", has a very popular channel (1.1 million subscribers, 454 million video views at the time of writing) which showcases major chess games from past and recent tournaments and online games. Here a search engine is introduced which analyzes the videos, recognizes chess pieces and builds a database of all of the positions on the board, ready to be searched. It keeps track of the exact timestamps of the videos in which the queried position occurs, so it is able to provide direct links to relevant videos.
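The position database can be pictured as an inverted index from board positions to video timestamps. The following is only a minimal sketch, not the post's actual implementation: it assumes positions are already recognized and encoded as FEN piece-placement strings, and the function names and video ids (`build_index`, `search`, `vidA`, `vidB`) are hypothetical.

```python
from collections import defaultdict

def build_index(observations):
    """Map each position to {video_id: earliest second it was seen}.

    `observations` is an iterable of (video_id, seconds, position) tuples,
    where `position` is assumed to be a FEN piece-placement string.
    """
    index = defaultdict(dict)
    for video_id, seconds, position in observations:
        first_seen = index[position].get(video_id)
        # Keep only the earliest timestamp per video for each position.
        if first_seen is None or seconds < first_seen:
            index[position][video_id] = seconds
    return index

def search(index, position):
    """Return direct links (sorted by video id) to where the position occurs."""
    hits = index.get(position, {})
    return [
        f"https://www.youtube.com/watch?v={video_id}&t={seconds}s"
        for video_id, seconds in sorted(hits.items())
    ]

START = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR"
AFTER_E4 = "rnbqkbnr/pppppppp/8/8/4P3/8/PPPP1PPP/RNBQKBNR"

observations = [
    ("vidA", 10, START),
    ("vidA", 95, AFTER_E4),
    ("vidB", 42, AFTER_E4),
    ("vidA", 300, AFTER_E4),  # same position shown again later in the video
]
index = build_index(observations)
links = search(index, AFTER_E4)
```

Storing only the earliest timestamp per video is a design choice made for this sketch; a real index might keep every occurrence.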

Languages: Python Keras Clojure
Tags: Computer Vision Data Structures

Satellite crash course

(10th May 2021)

Since hearing the news of a falling Chinese rocket booster, Long March 5B, and being reminded that Earth's surface is about 70% ocean, I got interested in how the orbital parameters affect the odds of crashing into ocean vs. ground. I was nearly finished with the project when I realized that Earth is rotating under the satellite, thus invalidating all the results! This doesn't take orbits' eccentricity into account either, but I've heard that atmospheric drag has a tendency of reducing it to zero as the orbit falls. Anyway, I found the various "straight" paths around the globe interesting and decided to publish these results regardless. In conclusion, there are paths which spend only 9% on top of land (including lakes) and 91% on top of ocean, or up to 57% on top of land and 43% on top of ocean.

Languages: Python
Tags: Applied mathematics

JGit blame for fun and profit(?)

(2nd April 2018)

Software projects are typically "tracked" on a version control system (VCS). Each "version" of the code is called a "commit", which stores not only file contents but also plenty of metadata. This creates a very rich set of data, and in the age of open source there are thousands of projects to study. A few examples are Git of Theseus and Gitential, but by focusing on "git blame" (see who has committed each line of each file) I hope to bring something new to the table. In short, I have analyzed how source code gets replaced by newer code, tracking the topics of who, when and why, and how old the code was.

Languages: Clojure
Tags: Git Elasticsearch

Benchmarking Elasticsearch and MS SQL on NYC Taxis

(7th May 2017)

The NYC Taxi dataset has been used in quite a few benchmarks (for example by Mark Litwintschik), perhaps because it has a rich set of columns whose meaning is mostly trivial to understand. I developed a Clojure project which generates Elasticsearch and SQL queries with three different templates for filters and four different templates for aggregations. This should give a decent indication of these databases' performance under a typical workload, although this test did not run queries concurrently and it does not mix different query types while the benchmark is running. However, benchmarks are always tricky to design and execute properly, so I'm sure there is room for improvement. In this project the tested database engines were Elasticsearch 5.2.2 (with Oracle JVM 1.8.0_121) and MS SQL Server 2014.

Languages: Clojure
Tags: GitHub Databases Elasticsearch SQL
GitHub: nikonyrh/nyc-taxi-data

Analyzing NYC Taxi dataset with Elasticsearch and Kibana

(19th March 2017)

The NYC taxicab dataset has seen lots of love from many data scientists such as Todd W. Schneider and Mark Litwintschik. I decided to give it a go while learning Clojure, as I suspected that it might be a good language for ETL jobs. This article describes how I loaded the dataset, normalized its conventions and columns, converted the rows from CSV to JSON and stored them in Elasticsearch.

Languages: Clojure
Tags: GitHub JVM Elasticsearch Databases Business Intelligence Kibana
GitHub: nikonyrh/nyc-taxi-data

Mustache templates in Clojure

(25th January 2017)

Mustache is a well-known template system with implementations in most popular languages. At its core it is logicless, so the same templates can be directly used on other projects. For example, I am planning to port this blog engine from PHP to Clojure, but I only need to replace the LaTeX parsing and HTML generation parts; I should be able to use the existing Mustache templates without any modifications. To learn Clojure programming I decided not to use the recommended library but instead implement my own.
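To illustrate what "logicless" means in practice, here is a minimal sketch of Mustache-style variable substitution. It is not the Clojure implementation from the post and it covers only plain `{{name}}` tags (no sections, partials or unescaped `{{{name}}}` tags); the `render` function name is hypothetical.

```python
import html
import re

# Matches {{ name }} tags; sections, partials and {{{raw}}} tags are not handled.
TAG = re.compile(r"\{\{\s*([\w.]+)\s*\}\}")

def render(template, context):
    """Replace each {{name}} tag with the HTML-escaped value from `context`.

    Missing keys render as an empty string, as in Mustache.
    """
    def substitute(match):
        value = context.get(match.group(1), "")
        return html.escape(str(value))
    return TAG.sub(substitute, template)

greeting = render("Hello {{name}}, you have {{ count }} new posts!",
                  {"name": "<Niko>", "count": 3})
```

Because the template contains no logic of its own, the same template text could be fed to a renderer written in any language, which is the portability the paragraph above relies on.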

Languages: Clojure
Tags: Blog GitHub JVM
GitHub: nikonyrh/mustache-clj

An efficient schema for hierarchical data on Elasticsearch

(20th November 2016)

Many businesses generate rich datasets from which valuable insights can be discovered. A basic starting point is to analyze separate events such as item sales, tourist attraction visits or movies seen. From these a time series (total sales / item / day, total visits / tourist spot / week) or basic metrics (histogram of movie ratings) can be aggregated. Things get a lot more interesting when individual data points can be linked together by a common id, such as items being bought in the same basket or by the same household (identified by a loyalty card), the spots visited by a tourist group throughout their journey or movie ratings given by a specific user. This richer data can be used to build recommendation engines, identify substitute products or services and do clustering analysis. This article describes a schema for Elasticsearch which supports efficient filtering and aggregations, and is automatically compatible with new data values.

Languages: Python
Tags: Business Intelligence Databases Elasticsearch

Caching and perf. monitoring with Redis and Python

(10th October 2016)

When implementing real-time APIs, most of the time server load can be greatly reduced by caching frequently accessed and rarely modified data, or re-usable calculation results. Luckily Python has several features which make it easy to add new constructs and wrappers to the language, for example *args and **kwargs function arguments, first-class functions, decorators and so forth. Thus it doesn't take too much effort to implement a @cached decorator with business-specific logic on cache invalidation. Redis is the perfect fit for the job: it is a high-performance, binary-friendly key-value store with TTLs and different data eviction policies, and its support for other data structures makes it trivial to store additional key metrics there.
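The general shape of such a decorator could look like the sketch below. It is not the post's code: to stay self-contained it uses a hypothetical in-memory `DictStore` whose `get` and `setex` methods are modelled on the Redis GET and SETEX commands, and it assumes function arguments have a stable `repr` so they can be used in the cache key.

```python
import functools
import pickle
import time

class DictStore:
    """In-memory stand-in for a Redis client; entries expire after `ttl` seconds."""

    def __init__(self, clock=time.monotonic):
        self._data = {}
        self._clock = clock

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._data[key]  # lazily drop expired entries
            return None
        return value

    def setex(self, key, ttl, value):
        self._data[key] = (value, self._clock() + ttl)

def cached(store, ttl=60):
    """Cache a function's results in `store` for `ttl` seconds."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Assumes arguments have a stable repr().
            key = f"{func.__module__}.{func.__qualname__}:{args!r}:{sorted(kwargs.items())!r}"
            hit = store.get(key)
            if hit is not None:
                return pickle.loads(hit)
            result = func(*args, **kwargs)
            # Pickling keeps the stored value binary, and lets a cached None
            # be told apart from a cache miss.
            store.setex(key, ttl, pickle.dumps(result))
            return result
        return wrapper
    return decorator

now = [0.0]                      # fake clock so expiry can be shown without sleeping
store = DictStore(clock=lambda: now[0])
calls = []

@cached(store, ttl=10)
def square(x):
    calls.append(x)              # record real invocations
    return x * x

first = square(4)                # computed and stored
second = square(4)               # served from the cache
now[0] = 11.0                    # move past the TTL
third = square(4)                # computed again
```

Any business-specific invalidation logic would be added on top of this, for example by deleting keys when the underlying data changes.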

Languages: Python
Tags: Databases Redis

Service discovery with Docker, Consul and Registrator

(28th August 2016)

Traditionally computers were named and not easily replaced in the event they broke down. Server software was listening on a hard-coded port, and to link pieces together these machine names and service ports were hard-coded into other software's configuration files. Now in the era of cloud computing and service-oriented architecture this is no longer an adequate solution, thus elastic scaling and service discovery are becoming the norm. One easy solution is to combine the powers of Docker, Consul and Registrator.

Languages: Bash
Tags: Architecture Docker Databases Nginx
GitHub: nikonyrh/docker-scripts

English hyphenation algorithm in Clojure

(17th August 2016)

This is nothing that spectacular (as if anything on my blog is), but I still wanted to describe the outline of the project of porting the hyphenation algorithm from PHP to Clojure. The implementation is only about 80 lines of code + comments + 20 lines of unit tests. For comparison, the original PHP abomination is about 160 LoC, although it is a bit bloated by implementing the pattern search via a trie data structure instead of using the strpos function.

Languages: Clojure
Tags: Hyphenation Blog GitHub JVM
GitHub: nikonyrh/hyphenator-clj

Finnish Invoice Template

(7th August 2016)

After finishing a paid project, ideally a formal-looking invoice would be sent to the client. I know there are many commercial products available, but I couldn't find a good open source alternative, especially one with the standard Finnish formatting. I was happy to find jheusala/finnish-invoice-template on GitHub, which had all of the tricky LaTeX stuff done.

Languages: LaTeX
Tags: GitHub Entrepreneurship
GitHub: nikonyrh/finnish-invoice-template

Hash-based commitment schemes without a computer

(7th May 2016)

An interesting question was posted to crypto.stackexchange.com: "Is there a simple hash function that one can compute without a computer?" Here are three proposed algorithms based on Zobrist hashing, RC4 and A5/1. These should be reasonably secure even against attacks with a calculator, except the one based on Zobrist hashing (but I don't know how to prove or disprove this claim). These constructs are especially well suited for commitment schemes.
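For readers unfamiliar with Zobrist hashing, the general idea can be sketched as follows. This shows generic Zobrist hashing only, not the pen-and-paper variant proposed in the post; the `make_table` and `zobrist_hash` names, the 32-bit width and the seeded random table are all assumptions made for illustration.

```python
import random

def make_table(length, alphabet, bits=32, seed=0):
    """Build one random number per (position, symbol) pair."""
    rng = random.Random(seed)  # fixed seed so both parties could share the table
    return [{ch: rng.getrandbits(bits) for ch in alphabet} for _ in range(length)]

def zobrist_hash(message, table):
    """XOR together the table entries selected by each character's position."""
    h = 0
    for i, ch in enumerate(message):
        h ^= table[i][ch]
    return h

ALPHABET = "abcdefghijklmnopqrstuvwxyz"
table = make_table(length=8, alphabet=ALPHABET)

h_old = zobrist_hash("rook", table)
h_new = zobrist_hash("book", table)

# Changing one symbol only requires XORing out the old entry and XORing in the new one.
h_updated = h_old ^ table[0]["r"] ^ table[0]["b"]
```

The incremental update at the end shows that the hash is linear under XOR, which is convenient for manual computation but is also a structural property an attacker can exploit.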

Languages: Pseudo
Tags: Encryption Stack Overflow

Simulating gravitational field near a torus

(19th April 2016)

There are many games with a strong emphasis on gravity, and at times even multi-body trajectory simulations. However, they all seem to be based on spherical geometry (as planets are shaped by gravity), but other shapes should create interesting trajectories. As a torus has rotational symmetry, its gravity field can be modelled on a 2D cross-section. In this project the torus' field is estimated in 3D, projected to 2D and interpolation functions are fitted. The space- and time-efficient model could be used in a game to do physics simulation in real time.

Languages: Matlab
Tags: Applied mathematics

Nginx docker image for easy file access via HTTP

(16th April 2016)

Often I find myself having an SSH connection to a remote server, and I'd like to retrieve some files to my own machine. Common methods for this include a Windows/Samba share, SSHFS and uploading to the cloud (which isn't trivial to do via plain cURL). Here an easy-to-use alternative is described: a single-line command to load and run a Docker image which contains a pre-configured Nginx instance. Then files can be accessed via plain HTTP at the user-assigned port (assuming a firewall isn't blocking it).

Languages: Bash
Tags: Docker Spark Nginx GitHub
GitHub: nikonyrh/docker-scripts
DockerHub: nikonyrh/nginx_bridge

Scalable analytics with Docker, Spark and Python

(23rd December 2015)

Traditionally data scientists installed software packages directly on their machines, wrote code, trained models, saved results to local files and applied models to new data in batch processing style. New data-driven products require rapid development of new models, scalable training and easy integration with other aspects of the business. Here I am proposing one (perhaps already well-known) cloud-ready architecture to meet these requirements.

Languages: Bash Python
Tags: Architecture Docker Spark Nginx GitHub JVM
GitHub: nikonyrh/docker-scripts
