Niko's Project Corner

Python at other sites

Chess video search engine

(13th June 2021)

YouTube has quite good search functionality based on video titles, descriptions and maybe even subtitles, but it doesn't look into the actual video contents or provide accurate timestamps for users' searches. A YouTuber, "Agadmator", has a very popular channel (1.1 million subscribers, 454 million video views at the time of writing) which showcases major chess games from past and recent tournaments and online games. Here a search engine is introduced which analyzes the videos, recognizes chess pieces and builds a database of all of the positions on the board, ready to be searched. It keeps track of the exact timestamps of the videos in which the queried position occurs, so it is able to provide direct links to relevant videos.
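The position database described above can be pictured as an inverted index from board positions to video timestamps. The following is only a minimal sketch and not the article's actual implementation: the `PositionIndex` class, the use of the FEN piece-placement string as the key and the video id are all assumptions made for illustration.

```python
from collections import defaultdict


class PositionIndex:
    """Hypothetical in-memory index: board position -> (video id, seconds)."""

    def __init__(self):
        self._hits = defaultdict(list)

    def add(self, placement, video_id, seconds):
        # `placement` is assumed to be the piece-placement field of a FEN string.
        self._hits[placement].append((video_id, seconds))

    def search(self, placement):
        # Build direct links which start playback at the stored timestamp.
        hits = sorted(self._hits.get(placement, []))
        return [f"https://www.youtube.com/watch?v={vid}&t={sec}s" for vid, sec in hits]


# Position after 1. e4, seen 75 seconds into a made-up video id.
after_e4 = "rnbqkbnr/pppppppp/8/8/4P3/8/PPPP1PPP/RNBQKBNR"
index = PositionIndex()
index.add(after_e4, "VIDEO_ID", 75)
links = index.search(after_e4)
```

A real system would also need the computer vision step which turns video frames into such position strings; that part is out of scope for this sketch.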

Languages: Python Keras Clojure
Tags: Computer Vision Data Structures

Satellite crash course

(10th May 2021)

Since hearing the news of a falling Chinese rocket booster, Long March 5B, and being reminded that Earth's surface is about 70% ocean, I got interested in how the orbital parameters affect the odds of crashing into ocean vs. ground. I was nearly finished with the project when I realized that Earth is rotating under the satellite, thus invalidating all the results! This doesn't take orbits' eccentricity into account either, but I've heard that atmospheric drag has a tendency of reducing it to zero as the orbit falls. Anyway, I found the various "straight" paths around the globe interesting and decided to publish these results regardless. In conclusion, there are paths which spend only 9% of their length over land (including lakes) and 91% over ocean, and others which spend up to 57% over land and 43% over ocean.
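A "straight" path of this kind is a great circle, which is fully described by its inclination and the longitude where it crosses the equator. Below is a rough sketch of how such a path could be sampled and scored, assuming a non-rotating spherical Earth as in the text. The function names are made up for this example, and the `is_land` predicate is a placeholder: a real analysis needs an actual land/ocean mask, which is not included here.

```python
import math


def ground_track(inclination_deg, node_lon_deg=0.0, samples=360):
    """Sample (lat, lon) points in degrees along a great circle."""
    inc = math.radians(inclination_deg)
    node = math.radians(node_lon_deg)
    points = []
    for k in range(samples):
        u = 2.0 * math.pi * k / samples  # angle travelled since crossing the equator
        lat = math.asin(math.sin(inc) * math.sin(u))
        lon = node + math.atan2(math.cos(inc) * math.sin(u), math.cos(u))
        lon_deg = (math.degrees(lon) + 180.0) % 360.0 - 180.0  # wrap to [-180, 180)
        points.append((math.degrees(lat), lon_deg))
    return points


def land_fraction(points, is_land):
    """Fraction of sampled points for which is_land(lat, lon) is true."""
    return sum(1 for lat, lon in points if is_land(lat, lon)) / len(points)


# Toy predicate (NOT real geography): pretend the northern hemisphere is land.
track = ground_track(inclination_deg=41.5)
fraction = land_fraction(track, lambda lat, lon: lat > 0)
```

With the toy predicate the fraction comes out at about one half for any inclined path, which is a handy sanity check before plugging in real map data.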

Languages: Python
Tags: Applied mathematics

An efficient schema for hierarchical data on Elasticsearch

(20th November 2016)

Many businesses generate rich datasets from which valuable insights can be discovered. A basic starting point is to analyze separate events such as item sales, tourist attraction visits or movies seen. From these, time series (total sales / item / day, total visits / tourist spot / week) or basic metrics (histogram of movie ratings) can be aggregated. Things get a lot more interesting when individual data points can be linked together by a common id, such as items being bought in the same basket or by the same household (identified by a loyalty card), the spots visited by a tourist group throughout their journey or movie ratings given by a specific user. This richer data can be used to build recommendation engines, identify substitute products or services and do clustering analysis. This article describes a schema for Elasticsearch which supports efficient filtering and aggregations, and is automatically compatible with new data values.

Languages: Python
Tags: Business Intelligence Databases Elasticsearch

Caching and perf. monitoring with Redis and Python

(10th October 2016)

When implementing real-time APIs, server load can usually be greatly reduced by caching frequently accessed and rarely modified data, or reusable calculation results. Luckily Python has several features which make it easy to add new constructs and wrappers to the language, for example *args and **kwargs function arguments, first-class functions, decorators and so forth. Thus it doesn't take too much effort to implement a @cached decorator with business-specific logic on cache invalidation. Redis is the perfect fit for the job: it is a high-performance, binary-friendly key-value store with TTL and different data eviction policies, and its support for other data structures makes it trivial to store additional key metrics there.
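As a rough illustration of the idea, here is a minimal sketch of a @cached decorator with a TTL and hit/miss counters. It is not the article's implementation: to stay runnable without a server it uses a small in-memory `InMemoryStore` class as a hypothetical stand-in for a Redis client, and all names and the key format are assumptions.

```python
import functools
import pickle
import time


class InMemoryStore:
    """Hypothetical stand-in for a Redis client: key-value store with TTL."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if expires_at is not None and time.monotonic() >= expires_at:
            del self._data[key]  # expired entries behave like cache misses
            return None
        return value

    def set(self, key, value, ttl=None):
        expires_at = None if ttl is None else time.monotonic() + ttl
        self._data[key] = (value, expires_at)

    def incr(self, key):
        value, expires_at = self._data.get(key, (0, None))
        self._data[key] = (value + 1, expires_at)
        return value + 1


def cached(store, ttl=60):
    """Cache pickled results in `store` for `ttl` seconds, counting hits and misses."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Binary cache key built from the function identity and its arguments.
            key = pickle.dumps((func.__module__, func.__qualname__,
                                args, sorted(kwargs.items())))
            hit = store.get(key)
            if hit is not None:
                store.incr(f"hits:{func.__name__}")
                return pickle.loads(hit)
            store.incr(f"misses:{func.__name__}")
            result = func(*args, **kwargs)
            store.set(key, pickle.dumps(result), ttl=ttl)
            return result
        return wrapper
    return decorator


store = InMemoryStore()
calls = []


@cached(store, ttl=30)
def slow_square(x):
    calls.append(x)  # records how many times the real function ran
    return x * x


slow_square(4)  # first call computes the result
slow_square(4)  # second call is served from the cache
```

Storing pickled values means even a `None` result is cached as bytes, so a miss can be told apart from a cached `None`. The business-specific invalidation logic mentioned above is left out of this sketch.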

Languages: Python
Tags: Databases Redis

Scalable analytics with Docker, Spark and Python

(23rd December 2015)

Traditionally data scientists installed software packages directly on their machines, wrote code, trained models, saved results to local files and applied models to new data in batch-processing style. New data-driven products require rapid development of new models, scalable training and easy integration with other aspects of the business. Here I am proposing one (perhaps already well-known) cloud-ready architecture to meet these requirements.

Languages: Bash Python
Tags: Architecture Docker Spark Nginx GitHub JVM
GitHub: nikonyrh/docker-scripts

Very fuzzy searching with Elasticsearch

(21st October 2015)

I encountered an interesting question on Stack Overflow about fuzzy searching of hashes on Elasticsearch and decided to give it a go. Elasticsearch has native support for fuzzy text searches, but for performance reasons it only supports an edit distance of up to 2. In this context the maximum allowed distance was eight, so an alternative solution was needed. One was found in locality-sensitive hashing.
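The summary doesn't spell out the exact scheme, so the following is only a sketch of one simple LSH-style approach, under the assumption that the hashes are 64-bit integers compared by Hamming distance. Each hash is split into nine bands of bits; if two hashes differ in at most eight bits, at least one band must be identical (pigeonhole principle), so exact matches on band values find every candidate. All names here are hypothetical.

```python
from collections import defaultdict

BITS = 64
BANDS = 9  # max distance + 1 bands guarantee at least one identical band


def band_keys(value, bits=BITS, bands=BANDS):
    """Split an integer hash into `bands` contiguous groups of bits."""
    keys = []
    start = 0
    for i in range(bands):
        # Spread the bits as evenly as possible over the bands.
        width = bits // bands + (1 if i < bits % bands else 0)
        keys.append((i, (value >> start) & ((1 << width) - 1)))
        start += width
    return keys


def hamming(a, b):
    """Number of differing bits between two integers."""
    return bin(a ^ b).count("1")


class FuzzyHashIndex:
    """Hypothetical in-memory index; buckets are keyed by (band number, band bits)."""

    def __init__(self):
        self._buckets = defaultdict(set)

    def add(self, value):
        for key in band_keys(value):
            self._buckets[key].add(value)

    def search(self, query, max_distance=8):
        # The recall guarantee only holds for max_distance <= BANDS - 1.
        candidates = set()
        for key in band_keys(query):
            candidates |= self._buckets.get(key, set())
        # Verify candidates exactly, since sharing one band does not imply closeness.
        return sorted(v for v in candidates if hamming(query, v) <= max_distance)
```

In a search engine the band keys could presumably be indexed as exact-match terms, with the final distance check done on the returned candidates, but that mapping is an assumption and not something the summary confirms.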

Languages: Python
Tags: Elasticsearch Databases GitHub Stack Overflow
GitHub: nikonyrh/stackoverflow-scripts