Analyzing NYC Taxi dataset with Elasticsearch and Kibana(19th March 2017) |
|||||||
The NYC taxicab dataset has seen lots of love from many data scientists such as Todd W. Scheider and Mark Litwintschik. I decided to give it a go while learning Clojure, as I suspected that it might be a good language for ETL jobs. This article describes how I loaded the dataset, normalized its conventions and columns, converted from CSV to JSON and stored them to Elasticsearch.
|
|
||||||
An efficient schema for hierarchical data on Elasticsearch(20th November 2016) |
|||||||
Many businesses generate rich datasets from which valuable insights can be discovered. A basic starting point is to analyze separate events such as item sales, tourist attraction visits or movies seen. From these a time series (total sales / item / day, total visits / tourist spot / week) or basic metrics (histogram of movie ratings) can be aggregated. Things get a lot more interesting when individual data points can be linked together by a common id, such as items being bought in the same basket or by the same house hold (identified by a loyalty card), the spots visited by a tourist group through out their journey or movie ratings given by a specific user. This richer data can be used to build recommendation engines, identify substitute products or services and do clustering analysis. This article describes a schema for Elasticsearch which supports efficient filtering and aggregations, and is automatically compatible with new data values.
|
|
Home |
Home | (Home page) |
About | (About me) |
Platform | (About this blog) |
(Niko Nyrhilä) | |
GitHub | (nikonyrh) |
Stackoverflow | (nikonyrh) |
Bruteforcing Countdown numbe... | (2023 Apr) |
Cheating at Bananagrams with... | (2023 Apr) |
Introduction to Stable Diffu... | (2022 Nov) |
Matching puzzle pieces together | (2022 Jul) |
Single channel speech / musi... | (2022 Feb) |
Computer Vision | (13) |
GitHub | (12) |
Databases | (9) |
Elasticsearch | (6) |
FFT | (5) |
Rendering | (5) |
Applied mathematics | (4) |
Python | (13) |
C++ | (11) |
Matlab | (10) |
Keras | (6) |
Clojure | (6) |
Bash | (6) |
PHP | (6) |
Matl | Pyth | C++ | Cloj | Bash | Kera | |
Comput | 6 | 6 | 3 | 1 | 0 | 5 |
GitHub | 0 | 2 | 1 | 4 | 3 | 0 |
Databa | 0 | 3 | 2 | 2 | 1 | 0 |
Render | 3 | 0 | 3 | 0 | 0 | 0 |
Nginx | 0 | 1 | 0 | 0 | 4 | 0 |
Autoen | 0 | 3 | 0 | 1 | 0 | 2 |
Elasti | 0 | 2 | 0 | 3 | 0 | 0 |
FFT | 3 | 1 | 1 | 0 | 0 | 1 |
Data S | 2 | 1 | 2 | 1 | 0 | 1 |
JVM | 0 | 1 | 0 | 3 | 1 | 0 |
Docker | 0 | 1 | 0 | 0 | 3 | 0 |
FastCG | 0 | 0 | 3 | 0 | 0 | 0 |
Applie | 2 | 2 | 0 | 0 | 0 | 0 |
Field | 2 | 0 | 2 | 0 | 0 | 0 |
Omnidi | 2 | 0 | 2 | 0 | 0 | 0 |
Affine | 2 | 0 | 2 | 0 | 0 | 0 |
Master | 1 | 0 | 2 | 0 | 0 | 0 |
Archit | 0 | 1 | 0 | 0 | 2 | 0 |
Visual | 1 | 0 | 2 | 0 | 0 | 0 |
Spark | 0 | 1 | 0 | 0 | 2 | 0 |
Blog | 0 | 0 | 0 | 2 | 0 | 0 |
Hyphen | 0 | 0 | 0 | 2 | 0 | 0 |
Stack | 0 | 1 | 1 | 0 | 0 | 0 |
SQL | 0 | 0 | 1 | 1 | 0 | 0 |
Busine | 0 | 1 | 0 | 1 | 0 | 0 |
Signal | 0 | 1 | 0 | 0 | 0 | 1 |
Encryp | 0 | 0 | 0 | 0 | 1 | 0 |
Git | 0 | 0 | 0 | 1 | 0 | 0 |
Stable | 0 | 1 | 0 | 0 | 0 | 0 |
Redis | 0 | 1 | 0 | 0 | 0 | 0 |
Thrust | 0 | 0 | 1 | 0 | 0 | 0 |
Kibana | 0 | 0 | 0 | 1 | 0 | 0 |
Astron | 1 | 0 | 0 | 0 | 0 | 0 |
Mustac | 0 | 0 | 1 | 0 | 0 | 0 |
NAT | 0 | 0 | 0 | 0 | 1 | 0 |
jQuery | 0 | 0 | 1 | 0 | 0 | 0 |
SSH | 0 | 0 | 0 | 0 | 1 | 0 |
Happyh | 0 | 0 | 1 | 0 | 0 | 0 |
Backup | 0 | 0 | 0 | 0 | 1 | 0 |
Pthrea | 0 | 0 | 1 | 0 | 0 | 0 |
AWS | 0 | 0 | 0 | 0 | 1 | 0 |
SIFT | 0 | 0 | 1 | 0 | 0 | 0 |
SURF | 0 | 0 | 1 | 0 | 0 | 0 |
Conjug | 0 | 0 | 1 | 0 | 0 | 0 |
Kalman | 0 | 0 | 1 | 0 | 0 | 0 |
Partic | 0 | 0 | 1 | 0 | 0 | 0 |
Gradie | 0 | 0 | 1 | 0 | 0 | 0 |
Simult | 0 | 0 | 1 | 0 | 0 | 0 |
Roboti | 0 | 0 | 1 | 0 | 0 | 0 |
Princi | 1 | 0 | 0 | 0 | 0 | 0 |
Receiv | 1 | 0 | 0 | 0 | 0 | 0 |
Linear | 1 | 0 | 0 | 0 | 0 | 0 |
Suppor | 1 | 0 | 0 | 0 | 0 | 0 |
Machin | 1 | 0 | 0 | 0 | 0 | 0 |
Discre | 1 | 0 | 0 | 0 | 0 | 0 |