(2nd April 2018)
|
Software projects are typically "tracked" on a version control system (VCS). Each "version" of the code is called a "commit", which stores not only file contents but also plenty of metadata. This creates a very rich set of data, and in the age of open source there are thousands of projects to study. A few existing examples are Git of Theseus and Gitential, but by focusing on "git blame" (which shows who last committed each line of each file) I hope to bring something new to the table. In short, I have analyzed how source code gets replaced by newer code: who replaced it, when and why, and how old the code was.
|
|
(7th May 2017)
|
The NYC Taxi dataset has been used in quite a few benchmarks (for example by Mark Litwintschik), perhaps because it has a fairly rich set of columns whose meaning is mostly trivial to understand. I developed a Clojure project which generates Elasticsearch and SQL queries from three different templates for filters and four different templates for aggregations. This should give a decent indication of these databases' performance under a typical workload, although this test did not run queries concurrently and does not mix different query types while the benchmark is running. However, benchmarks are always tricky to design and execute properly, so I'm sure there is room for improvement. In this project the tested database engines were Elasticsearch 5.2.2 (with Oracle JVM 1.8.0_121) and MS SQL Server 2014.
|
|
(19th March 2017)
|
The NYC taxicab dataset has seen lots of love from many data scientists such as Todd W. Schneider and Mark Litwintschik. I decided to give it a go while learning Clojure, as I suspected that it might be a good language for ETL jobs. This article describes how I loaded the dataset, normalized its conventions and columns, converted it from CSV to JSON and stored it in Elasticsearch.
|
|
(25th January 2017)
|
Mustache is a well-known template system with implementations in most popular languages. At its core it is logic-less, so the same templates can be used directly in other projects. For example, I am planning to port this blog engine from PHP to Clojure, but I only need to replace the LaTeX parsing and HTML generation parts; I should be able to use the existing Mustache templates without any modifications. To learn Clojure programming I decided not to use the recommended library but to implement my own instead.
|
|
(20th November 2016)
|
Many businesses generate rich datasets from which valuable insights can be discovered. A basic starting point is to analyze separate events such as item sales, tourist attraction visits or movies seen. From these a time series (total sales / item / day, total visits / tourist spot / week) or basic metrics (histogram of movie ratings) can be aggregated. Things get a lot more interesting when individual data points can be linked together by a common id, such as items being bought in the same basket or by the same household (identified by a loyalty card), the spots visited by a tourist group throughout their journey or movie ratings given by a specific user. This richer data can be used to build recommendation engines, identify substitute products or services and do clustering analysis. This article describes a schema for Elasticsearch which supports efficient filtering and aggregations, and is automatically compatible with new data values.
|
|
(10th October 2016)
|
When implementing real-time APIs, server load can usually be reduced greatly by caching frequently accessed and rarely modified data, or re-usable calculation results. Luckily Python has several features which make it easy to add new constructs and wrappers to the language, for example *args and **kwargs function arguments, first-class functions, decorators and so forth. Thus it doesn't take too much effort to implement a @cached decorator with business-specific logic for cache invalidation. Redis is a perfect fit for the job: it is a high-performance, binary-friendly key-value store with TTLs and different data eviction policies, and its support for other data structures makes it trivial to store additional key metrics there.
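As a rough illustration of the idea, here is a minimal sketch of a TTL-based `@cached` decorator. It is not the implementation from the article: `InMemoryStore` is a hypothetical stand-in that mimics only the `get` and `setex` calls of a Redis client so that the example runs without a server, the key format is an assumption, and business-specific invalidation is left out.

```python
import functools
import pickle
import time


class InMemoryStore:
    """Hypothetical stand-in for a Redis client: get() and setex() only."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        item = self._data.get(key)
        if item is None:
            return None
        expires_at, value = item
        if time.monotonic() >= expires_at:
            # Expired entries are dropped lazily on access.
            del self._data[key]
            return None
        return value

    def setex(self, key, ttl, value):
        # Same argument order as Redis SETEX: key, TTL in seconds, value.
        self._data[key] = (time.monotonic() + ttl, value)


def cached(store, ttl=60):
    """Cache a function's results in `store` for `ttl` seconds."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Assumed key format: function name plus the repr of its arguments.
            key = "cache:%s:%r:%r" % (func.__qualname__, args, sorted(kwargs.items()))
            hit = store.get(key)
            if hit is not None:
                return pickle.loads(hit)
            result = func(*args, **kwargs)
            # Results are pickled to bytes, which a binary-safe store can hold.
            store.setex(key, ttl, pickle.dumps(result))
            return result
        return wrapper
    return decorator


store = InMemoryStore()
calls = []  # records the arguments that actually reached the function


@cached(store, ttl=30)
def slow_square(x):
    calls.append(x)
    return x * x
```

Calling `slow_square(4)` twice runs the function body only once; the second call is served from the store. Passing a real Redis client instead of `InMemoryStore` should work as long as it exposes `get` and `setex` with the same argument order.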
|
|
(28th August 2016)
|
Traditionally computers were individually named and not easily replaced when one broke down. Server software listened on a hard-coded port, and to link the pieces together these machine names and service ports were hard-coded into other software's configuration files. Now, in the era of cloud computing and service-oriented architecture, this is no longer an adequate solution, and elastic scaling and service discovery are becoming the norm. One easy solution is to combine the powers of Docker, Consul and Registrator.
|
|
(17th August 2016)
|
This is nothing spectacular (as if anything on my blog is), but I still wanted to describe the outline of the project of porting the hyphenation algorithm from PHP to Clojure. The implementation is only about 80 lines of code + comments + 20 lines of unit tests. For comparison, the original PHP abomination is about 160 lines of code, although it is a bit bloated by implementing the pattern search via a trie data structure instead of using the strpos function.
|
|
(7th August 2016)
|
After finishing a paid project, ideally a formal-looking invoice would be sent to the client. I know there are many commercial products available, but I couldn't find a good open source alternative, especially one with the standard Finnish formatting. I was happy to find jheusala/finnish-invoice-template on GitHub, which had all of the tricky LaTeX stuff done.
|
|
(7th May 2016)
|
An interesting question was posted to crypto.stackexchange.com: "Is there a simple hash function that one can compute without a computer?" Here are three proposed algorithms based on Zobrist hashing, RC4 and A5/1. These should be reasonably secure even against attacks with a calculator, except the one based on Zobrist hashing (but I don't know how to prove or disprove this claim). These constructs are especially well suited for commitment schemes.
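For readers unfamiliar with Zobrist hashing, the following is a minimal generic sketch, not the pen-and-paper algorithm proposed in the article. It assumes a table with one random number per (position, symbol) pair; the names `make_table` and `zobrist_hash`, the 32-bit width and the seeded generator are all illustrative choices.

```python
import random


def make_table(length, alphabet, bits=32, seed=0):
    """One random `bits`-bit number for every (position, symbol) pair."""
    rng = random.Random(seed)  # seeded only to make the example repeatable
    return [{ch: rng.getrandbits(bits) for ch in alphabet} for _ in range(length)]


def zobrist_hash(message, table):
    """XOR together the table entries selected by each symbol of the message."""
    h = 0
    for position, ch in enumerate(message):
        h ^= table[position][ch]
    return h


table = make_table(8, "abc")
h_abc = zobrist_hash("abc", table)
# Changing one symbol changes the hash by a fixed XOR difference.
h_acc = h_abc ^ table[1]["b"] ^ table[1]["c"]
```

Because the hash is just an XOR of table entries, replacing one symbol shifts the result by a value that depends only on that position and the two symbols involved. This linearity is what makes a plain Zobrist construct easy to compute by hand, and it is also a plausible reason to be cautious about its security.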
|
|
(19th April 2016)
|
There are many games with a strong emphasis on gravity, and at times even multi-body trajectory simulations. However, they all seem to be based on spherical geometry (as planets are shaped by gravity), whereas other shapes should create interesting trajectories. As a torus has rotational symmetry, its gravity field can be modelled on a 2D cross-section. In this project the torus's field is estimated in 3D, projected to 2D, and interpolation functions are fitted to it. The space- and time-efficient model could be used in a game to do physics simulation in real time.
|
|
(16th April 2016)
|
Often I find myself having an SSH connection to a remote server, and I'd like to retrieve some files to my own machine. Common methods for this include a Windows/Samba share, SSHFS and uploading to the cloud (which isn't trivial to do via plain cURL). Here an easy-to-use alternative is described: a single-line command to load and run a Docker image which contains a pre-configured Nginx instance. The files can then be accessed via plain HTTP at the user-assigned port (assuming a firewall isn't blocking it).
|
|
(23rd December 2015)
|
Traditionally data scientists installed software packages directly on their machines, wrote code, trained models, saved results to local files and applied models to new data in a batch-processing style. New data-driven products require rapid development of new models, scalable training and easy integration with other aspects of the business. Here I am proposing one (perhaps already well-known) cloud-ready architecture to meet these requirements.
|
|
(2nd November 2015)
|
This is an alternative answer to the question I encountered at Stack Overflow about fuzzy searching of hashes on Elasticsearch. My original answer used locality-sensitive hashing. This time, superior speed and a simple implementation were achieved by using NVIDIA's CUDA via the Thrust library.
|
|
(21st October 2015)
|
I encountered an interesting question at Stack Overflow about fuzzy searching of hashes on Elasticsearch and decided to give it a go. Elasticsearch has native support for fuzzy text searches, but for performance reasons it only supports an edit distance of up to 2. In this context the maximum allowed distance was eight, so an alternative solution was needed. A solution was found in locality-sensitive hashing.
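To give a feel for the technique, below is a small sketch of one common locality-sensitive hashing scheme for Hamming distance, bit sampling. It is an assumption that this resembles the approach in the article; the function names, the number of tables and the bits per table are illustrative, and plain dictionaries stand in for the Elasticsearch index.

```python
import random


def hamming(a, b):
    """Number of differing bits between two integer hashes."""
    return bin(a ^ b).count("1")


def make_masks(n_bits, n_tables, bits_per_table, seed=0):
    """Each mask selects a random subset of bit positions (one per table)."""
    rng = random.Random(seed)
    masks = []
    for _ in range(n_tables):
        positions = rng.sample(range(n_bits), bits_per_table)
        masks.append(sum(1 << p for p in positions))
    return masks


def build_index(hashes, masks):
    """One bucket table per mask, keyed by the sampled bits of each hash."""
    index = [dict() for _ in masks]
    for h in hashes:
        for table, mask in zip(index, masks):
            table.setdefault(h & mask, []).append(h)
    return index


def search(query, index, masks, max_distance):
    """Collect hashes sharing at least one bucket, then verify exactly."""
    candidates = set()
    for table, mask in zip(index, masks):
        candidates.update(table.get(query & mask, ()))
    return sorted(h for h in candidates if hamming(query, h) <= max_distance)


rng = random.Random(1)
hashes = [rng.getrandbits(64) for _ in range(200)]
masks = make_masks(n_bits=64, n_tables=16, bits_per_table=8)
index = build_index(hashes, masks)
results = search(hashes[0], index, masks, max_distance=8)
```

Two hashes that differ in only a few bits are likely to agree on all sampled bits of at least one mask, so they land in a shared bucket. The scheme is probabilistic: the final exact check ensures that nothing beyond `max_distance` is returned, but a close match can be missed if it happens to differ from the query in every table's sampled bits.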
|
|