Niko's Project Corner


Benchmarking Elasticsearch and MS SQL on NYC Taxis

(7th May 2017)

The NYC Taxi dataset has been used in quite a few benchmarks (for example by Mark Litwintschik), perhaps because it has a fairly rich set of columns whose meaning is mostly trivial to understand. I developed a Clojure project which generates Elasticsearch and SQL queries from three different filter templates and four different aggregation templates. This should give a decent indication of these databases' performance under a typical workload, although the test did not run queries concurrently and did not mix different query types while the benchmark was running. Benchmarks are always tricky to design and execute properly, so I'm sure there is room for improvement. The tested database engines were Elasticsearch 5.2.2 (with Oracle JVM 1.8.0_121) and MS SQL Server 2014.
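As a rough illustration of the idea (in Python rather than Clojure, and not the project's actual code), combining filter templates with aggregation templates could look like the sketch below. All field names, template names and parameter values are hypothetical; the real dataset columns and templates may differ.

```python
import itertools
import json

# Hypothetical filter templates; field names are assumptions for illustration.
FILTERS = {
    "time_range": {"range": {"pickup_dt": {"gte": "2015-01-01", "lt": "2015-02-01"}}},
    "passengers": {"term": {"passenger_count": 2}},
    "distance": {"range": {"trip_distance": {"gte": 5}}},
}

# Hypothetical aggregation templates in Elasticsearch 5.x query DSL.
AGGREGATIONS = {
    "trips_per_day": {"date_histogram": {"field": "pickup_dt", "interval": "day"}},
    "avg_fare": {"avg": {"field": "fare_amount"}},
    "fare_percentiles": {"percentiles": {"field": "fare_amount"}},
    "by_payment_type": {"terms": {"field": "payment_type"}},
}


def build_queries():
    """Return one query body per (filter, aggregation) combination."""
    queries = {}
    for (f_name, flt), (a_name, agg) in itertools.product(
        FILTERS.items(), AGGREGATIONS.items()
    ):
        queries[(f_name, a_name)] = {
            "size": 0,  # only aggregation results are needed, not hits
            "query": {"bool": {"filter": [flt]}},
            "aggs": {a_name: agg},
        }
    return queries


if __name__ == "__main__":
    for key, body in build_queries().items():
        print(key, json.dumps(body))
```

Three filters times four aggregations gives twelve query bodies, each of which could be timed separately; the SQL side would need an equivalent set of string templates.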

Languages:	Clojure
Tags:	GitHub Databases Elasticsearch SQL
GitHub:	nikonyrh/nyc-taxi-data

Analyzing NYC Taxi dataset with Elasticsearch and Kibana

(19th March 2017)

The NYC taxicab dataset has seen lots of love from many data scientists such as Todd W. Schneider and Mark Litwintschik. I decided to give it a go while learning Clojure, as I suspected that it might be a good language for ETL jobs. This article describes how I loaded the dataset, normalized its conventions and columns, converted it from CSV to JSON and stored the documents in Elasticsearch.

Languages:	Clojure
Tags:	GitHub JVM Elasticsearch Databases Business Intelligence Kibana
GitHub:	nikonyrh/nyc-taxi-data

Mustache templates in Clojure

(25th January 2017)

Mustache is a well-known template system with implementations in most popular languages. At its core it is logicless, so the same templates can be used directly in other projects. For example, I am planning to port this blog engine from PHP to Clojure, but I only need to replace the LaTeX parsing and HTML generation parts; I should be able to use the existing Mustache templates without any modifications. To learn Clojure programming I decided not to use the recommended library but to implement my own instead.
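To give a feel for how little logic a Mustache renderer needs, here is a minimal Python sketch that supports only escaped variables (`{{name}}`) and sections (`{{#name}}...{{/name}}`). It is not the mustache-clj implementation and covers only a small subset of the Mustache spec: no partials, no inverted sections, no unescaped tags, and no nested sections with the same name.

```python
import html
import re

# Either a whole section {{#name}}...{{/name}} or a single variable {{name}}.
TOKEN = re.compile(r"\{\{#(\w+)\}\}(.*?)\{\{/\1\}\}|\{\{(\w+)\}\}", re.S)


def render(template, context):
    """Render a tiny Mustache subset against a dict context."""

    def replace(match):
        section, body, variable = match.groups()
        if section is None:
            # Plain variable: HTML-escape the value, missing keys render as "".
            return html.escape(str(context.get(variable, "")))
        value = context.get(section)
        if not value:
            return ""  # falsy or missing section is skipped
        if isinstance(value, list):
            # Render the body once per item; items are assumed to be dicts.
            return "".join(render(body, {**context, **item}) for item in value)
        return render(body, context)

    return TOKEN.sub(replace, template)


if __name__ == "__main__":
    template = "<h1>{{title}}</h1>\n{{#posts}}<li>{{name}}</li>\n{{/posts}}"
    print(render(template, {"title": "Blog", "posts": [{"name": "A"}, {"name": "B"}]}))
```

Because the template contains no logic of its own, the same template text could in principle be fed to any conforming implementation in another language.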

Languages:	Clojure
Tags:	Blog GitHub JVM
GitHub:	nikonyrh/mustache-clj

English hyphenation algorithm in Clojure

(17th August 2016)

This is nothing that spectacular (as if anything on my blog is), but I still wanted to describe the outline of the project of porting the hyphenation algorithm from PHP to Clojure. The implementation is only about 80 lines of code + comments + 20 lines of unit tests. For comparison, the original PHP abomination is about 160 LoC, although it is a bit bloated by implementing the pattern search via a trie data structure instead of using the strpos function.

Languages:	Clojure
Tags:	Hyphenation Blog GitHub JVM
GitHub:	nikonyrh/hyphenator-clj

Finnish Invoice Template

(7th August 2016)

After finishing a paid project, ideally a formal-looking invoice would be sent to the client. I know there are many commercial products available, but I couldn't find a good open source alternative, especially with the standard Finnish formatting. I was happy to find jheusala/finnish-invoice-template on GitHub, which had all of the tricky LaTeX stuff done.

Languages:	LaTeX
Tags:	GitHub Entrepreneurship
GitHub:	nikonyrh/finnish-invoice-template

Nginx docker image for easy file access via HTTP

(16th April 2016)

Often I find myself having an SSH connection to a remote server, and I'd like to retrieve some files to my own machine. Common methods for this include a Windows/Samba share, SSHFS and uploading to the cloud (which isn't trivial to do via plain cURL). Here an easy-to-use alternative is described: a single-line command to load and run a Docker image which contains a pre-configured Nginx instance. The files can then be accessed via plain HTTP at the user-assigned port (assuming a firewall isn't blocking it).

Languages:	Bash
Tags:	Docker Spark Nginx GitHub
GitHub:	nikonyrh/docker-scripts
DockerHub:	nikonyrh/nginx_bridge

Scalable analytics with Docker, Spark and Python

(23rd December 2015)

Traditionally data scientists installed software packages directly to their machines, wrote code, trained models, saved results to local files and applied models to new data in batch processing style. New data-driven products require rapid development of new models, scalable training and easy integration to other aspects of the business. Here I am proposing one (perhaps already well-known) cloud-ready architecture to meet these requirements.

Languages:	Bash Python
Tags:	Architecture Docker Spark Nginx GitHub JVM
GitHub:	nikonyrh/docker-scripts

Very fuzzy searching with CUDA

(2nd November 2015)

This is an alternative answer to the question I encountered at Stack Overflow about fuzzy searching of hashes on Elasticsearch. My original answer used locality-sensitive hashing. Superior speed and a simpler implementation were gained by using NVIDIA's CUDA via the Thrust library.

Languages:	C++ CUDA
Tags:	Thrust Databases GitHub Stack Overflow
GitHub:	nikonyrh/stackoverflow-scripts

Very fuzzy searching with Elasticsearch

(21st October 2015)

I encountered an interesting question at Stack Overflow about fuzzy searching of hashes on Elasticsearch and decided to give it a go. Elasticsearch has native support for fuzzy text searches, but for performance reasons it only supports an edit distance of up to 2. In this context the maximum allowed distance was eight, so an alternative solution was needed. One was found in locality-sensitive hashing.
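The general idea behind locality-sensitive hashing for bit strings can be sketched with bit sampling: each table keys a hash by a random subset of its bits, so two hashes that differ in only a few bits are likely to collide in at least one table. The Python sketch below is a generic illustration, not the code from the Stack Overflow answer; the 64-bit hash width, the number of tables and the bits per key are all assumptions.

```python
import random
from collections import defaultdict

BITS = 64  # assumed hash width


def hamming(a, b):
    """Number of differing bits between two integers."""
    return bin(a ^ b).count("1")


class BitSamplingIndex:
    """Bit-sampling LSH index for Hamming distance (illustrative sketch)."""

    def __init__(self, n_tables=16, bits_per_key=12, seed=0):
        rng = random.Random(seed)
        # Each table looks at its own random subset of bit positions.
        self.positions = [rng.sample(range(BITS), bits_per_key) for _ in range(n_tables)]
        self.tables = [defaultdict(set) for _ in range(n_tables)]

    @staticmethod
    def _key(value, positions):
        return tuple((value >> p) & 1 for p in positions)

    def add(self, value):
        for positions, table in zip(self.positions, self.tables):
            table[self._key(value, positions)].add(value)

    def search(self, query, max_distance=8):
        # Collect candidates that share a key in any table, then verify exactly.
        candidates = set()
        for positions, table in zip(self.positions, self.tables):
            candidates |= table.get(self._key(query, positions), set())
        return sorted(v for v in candidates if hamming(v, query) <= max_distance)


if __name__ == "__main__":
    index = BitSamplingIndex()
    rng = random.Random(1)
    for _ in range(1000):
        index.add(rng.getrandbits(BITS))
    print(index.search(rng.getrandbits(BITS)))
```

The search is approximate: a true neighbour is missed if it differs from the query in at least one sampled bit of every table, so more tables or fewer bits per key trade speed for recall. The final exact Hamming check guarantees that nothing beyond the maximum distance is returned.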

Languages:	Python
Tags:	Elasticsearch Databases GitHub Stack Overflow
GitHub:	nikonyrh/stackoverflow-scripts

Anonymous and secure information storing and sharing

(25th April 2015)

Nowadays encryption is standard practice on the web when data is in transit, and there are even a few services which offer client-side encryption and thus are truly end-to-end. Nevertheless, for some reason they all require you to create an account by providing your email and password, although this is not strictly necessary for storing and sharing data. In this system the document id, encryption key and HMAC key are generated ad hoc on the client, and only the minimal necessary information is revealed to the server. A live demo should be available at noknowledgenotes.nikonyrh.org.
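The client-side part of such a scheme can be sketched with Python's standard library. This is an illustration of the general pattern only, not the noknowledgenotes code: the key sizes, the id format and the choice of HMAC-SHA256 are assumptions, and the encryption step itself is left out because the standard library has no cipher for it.

```python
import hashlib
import hmac
import secrets


def new_document_secrets():
    """Generate a random document id and two independent keys on the client."""
    return {
        "doc_id": secrets.token_urlsafe(16),   # assumed id format
        "enc_key": secrets.token_bytes(32),    # assumed 256-bit encryption key
        "hmac_key": secrets.token_bytes(32),   # assumed 256-bit HMAC key
    }


def sign(hmac_key, ciphertext):
    """Authentication tag over the ciphertext (HMAC-SHA256, an assumption)."""
    return hmac.new(hmac_key, ciphertext, hashlib.sha256).hexdigest()


def verify(hmac_key, ciphertext, tag):
    """Constant-time check that the ciphertext was not modified."""
    return hmac.compare_digest(sign(hmac_key, ciphertext), tag)


if __name__ == "__main__":
    doc = new_document_secrets()
    ciphertext = b"...bytes produced by the client-side cipher..."
    tag = sign(doc["hmac_key"], ciphertext)
    print(doc["doc_id"], tag, verify(doc["hmac_key"], ciphertext, tag))
```

In this sketch only the document id, the ciphertext and the tag would ever need to leave the client; both keys stay with whoever holds the sharing link.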

Languages:	PHP
Tags:	GitHub Encryption
GitHub:	nikonyrh/noknowledgenotes

Automated image capturing + API

(10th April 2015)

Out of interest in nature observation, computer vision, image processing and so forth, I developed an automated system to capture one photo per minute and store it on a disk. The project also has Bash and PHP scripts coordinating external tools such as montage for image stitching and mencoder for video generation. PHP also provides an HTTP API for image generation and file size statistics.

Languages:	Bash PHP
Tags:	GitHub
GitHub:	nikonyrh/webcammon

English hyphenation algorithm in PHP

(10th July 2013)

A good presentation about hyphenation in HTML documents can be seen here, but it is client-side (JavaScript) oriented. Basically you shouldn't use justified text unless it is hyphenated, because long words will cause huge spaces between words when the line is stretched to the whole width of the element. I found a few PHP scripts such as phpHyphenator 1.5, but typically they weren't implemented as a single stand-alone PHP class. Since the underlying algorithm is fairly simple, I decided to write it from scratch.
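The abstract does not spell out the algorithm, so the following Python sketch assumes the pattern-based approach known from TeX (Liang's algorithm): patterns are letter sequences with digits between letters, the highest digit wins at each position, and odd values permit a break. The pattern list is a small illustrative set chosen for one word, not a real pattern file, and substrings are looked up in a dict instead of a trie.

```python
import re


def parse_pattern(pattern):
    """Split a pattern like 'hen5at' into its letters and per-gap weights."""
    letters = re.sub(r"\d", "", pattern)
    weights = [0] * (len(letters) + 1)  # weights[i] is the gap before letters[i]
    pos = 0
    for ch in pattern:
        if ch.isdigit():
            weights[pos] = int(ch)
        else:
            pos += 1
    return letters, weights


def hyphenate(word, patterns, min_left=2, min_right=2):
    """Return the pieces of `word` split at the permitted break points."""
    table = dict(parse_pattern(p) for p in patterns)
    padded = "." + word.lower() + "."  # dots mark the word boundaries
    scores = [0] * (len(padded) + 1)
    for start in range(len(padded)):
        for end in range(start + 1, len(padded) + 1):
            weights = table.get(padded[start:end])
            if weights:
                for offset, weight in enumerate(weights):
                    scores[start + offset] = max(scores[start + offset], weight)
    pieces, last = [], 0
    for j in range(min_left, len(word) - min_right + 1):
        if scores[j + 1] % 2 == 1:  # odd score allows a break before word[j]
            pieces.append(word[last:j])
            last = j
    pieces.append(word[last:])
    return pieces


if __name__ == "__main__":
    demo_patterns = ["hy3ph", "he2n", "hena4", "hen5at", "1na", "n2at", "1tio", "2io", "o2n"]
    print("-".join(hyphenate("hyphenation", demo_patterns)))  # hy-phen-ation
```

Even values act as vetoes: in the demo set, `hena4` suppresses the break that `1tio` alone would allow before the "t".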

Languages:	PHP
Tags:	Hyphenation Blog GitHub Data Structures
GitHub:	nikonyrh/hyphenator-php
