Niko's Project Corner

Efficient in-memory analytical database

Description In-memory database for analytical queries
Languages C++
Tags FastCGI
Duration Fall 2013
Modified 1st December 2013

Traditional databases such as MySQL are not designed to perform well in analytical queries, which require access to possibly all of the rows of the selected columns. This results in a full table scan which cannot benefit from any indexes. Column-oriented engines try to circumvent this issue, but I went one step deeper and made the storage column-value oriented, similar to an inverted index. This results in a 2–10× speedup over optimized columnar solutions and 80× the speed of MySQL.

Final benchmark results are presented in Figure 1; their details are discussed later. It presents aggregated query times for nine different queries of OlapCalc (this project) and five other database products. OlapCalc supports both single-threaded and multi-threaded execution, and on the tested hardware multi-threading gave a 3× speed boost. Some database products support only a single execution thread (MySQL), whereas others would support multi-threading but have it disabled in the free community edition (e.g. InfoBright). MonetDB is multi-threaded and is designed to utilize CPU cache-aware algorithms. Nevertheless OlapCalc is 2.1× as fast (100 ms vs. 210 ms). Further analysis and memory usage figures follow later.

Figure 1: Geometric means of query run-times for OlapCalc and five other DB products (log-scale).

The read-only database is implemented as a background FastCGI process which loads the data from disk at startup. The current environment is 64-bit Linux inside a virtual machine, in which the data reading and on-the-fly indexing speed is ≈14 Mb/s, or 1 Gb in just over a minute. The resulting memory usage is 25–35% of the raw file size, depending on the number of unique values in the columns. The inverted index can be further reduced by ≈40% by storing it in a compressed format. However, the on-the-fly decompression of the required columns increases the query time to 3.8× the original.

The system provides a simple HTTP API which returns calculation results as a JSON string, so it is easy to integrate with any front-end UI or graphing library. At the current stage there is no support for SQL queries, but adding it wouldn't be too difficult. The current implementation supports queries which group by a list of columns and apply a single aggregation function (sum, avg, median, ...) to a single column.
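As a purely hypothetical illustration (the actual field names and response shape aren't documented here), a "group by city and year, average the price fact" query might return something along these lines:

```json
{
  "groups": [
    {"city": "Helsinki", "year": 2012, "avg": 245000.0},
    {"city": "Tampere",  "year": 2012, "avg": 187500.0}
  ]
}
```

A flat list of group keys plus one aggregate per group is trivial for any charting front-end to consume, which is the point of the JSON API.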

The data is stored purely in RAM, so there is no need for disk access unless data has been swapped to disk. On big datasets (50 Gb and beyond) this will become a problem even when compression is utilized. This can be avoided either by pre-aggregating the data at some level, or by partitioning the data into smaller subsets and merging the results from multiple machines. Luckily RAM is getting cheaper year by year, and 64-bit operating systems have plenty of room for growth in available memory size.

Currently the only requirement for input files is that their column names follow a certain convention. Column names need to be suffixed with either _id, _txt or _date to indicate their type. If a text-type column doesn't have a pre-defined id column, the id is auto-generated on the fly in alphabetical order. The numerical values (from which aggregates are calculated) are identified by the _fact suffix.

The underlying storage principle is actually very simple: for each unique value of each column, store the list of row numbers which have that value. In effect this is an inverted index on each column, the structure typically used by search engines. To implement a "group by" one only needs to merge two sorted lists of integers, which is a well-studied problem. Standard approaches are the sort-merge join and the hash join. In some cases the hash join is faster, but it requires either storing a hash table for all columns (which uses extra memory) or building it on demand (which takes extra time).

To utilize the efficient bit-wise AND operator in joins, a set of row numbers is stored in a 64-bit integer by using the top 21 bits for the "address" and 43 bits for the "content". The address of the nth row number is floor(n / 43), and its location within the block is mod(n, 43). The merge algorithm utilizes the standard merge join to find the correct address and bit-wise AND to determine which rows within the block exist in both columns. This logic is applied iteratively when there are more than two columns. The addressing scheme is illustrated in Figure 2 with a reduced number of bits visualized. The storage is very compact when the content is mostly ones, which occurs on low-cardinality columns and when the data is ordered by the column before insertion. If not, most of the bits will be zero and more blocks are required for storage. This is countered by applying an extra compression step to those columns which benefit the most.
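A minimal sketch of the 21/43-bit block encoding described above (the helper names are illustrative, not taken from the project):

```cpp
#include <cstdint>
#include <vector>

// Top 21 bits: block address (row / 43). Low 43 bits: a bitmap of
// which of the 43 rows inside that block are present.
constexpr int CONTENT_BITS = 43;

uint64_t make_block(uint64_t address, uint64_t content) {
    return (address << CONTENT_BITS) | content;
}
uint64_t block_address(uint64_t block) { return block >> CONTENT_BITS; }
uint64_t block_content(uint64_t block) {
    return block & ((1ULL << CONTENT_BITS) - 1);
}

// Append a row number to a block list (rows arrive in ascending order,
// so only the last block ever needs to be extended).
void add_row(std::vector<uint64_t>& blocks, uint64_t row) {
    uint64_t addr = row / CONTENT_BITS;
    uint64_t bit  = 1ULL << (row % CONTENT_BITS);
    if (!blocks.empty() && block_address(blocks.back()) == addr)
        blocks.back() |= bit;
    else
        blocks.push_back(make_block(addr, bit));
}

// Build the block list for one column value from its sorted row list.
std::vector<uint64_t> build_index(const std::vector<uint64_t>& rows) {
    std::vector<uint64_t> blocks;
    for (uint64_t r : rows) add_row(blocks, r);
    return blocks;
}
```

With this layout, intersecting two blocks at the same address is a single bit-wise AND of their low 43 bits.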

Figure 2: Example binary contents of four "row number blocks" with 21 bits for the address (on the left) and 64 - 21 = 43 bits for the content (on the right).

The number of bits used for the block address is a tradeoff between efficiency, memory waste on sparse columns and the total number of unique addresses. When n bits are used for the address and 64 - n for the content, the total number of rows that can be supported is f(n) = 2^n · (64 - n). For example f(16) → 3.1 million, f(21) → 90 million and f(24) → 671 million. The number 21 was chosen because 90 million is not too restrictive a limit and 64 = 21 · 3 + 1, which means that three addresses can be stored in a single 64-bit integer, together with an extra bit for storing a flag value. If more than 90 million rows need to be processed, the data can be partitioned into multiple chunks which are processed independently and finally merged.
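The capacity tradeoff is easy to verify with a one-line helper (illustrative only):

```cpp
#include <cstdint>

// Total rows addressable with n address bits and 64 - n content bits
// per block: f(n) = 2^n * (64 - n).
uint64_t capacity(int n) {
    return (1ULL << n) * static_cast<uint64_t>(64 - n);
}
```

capacity(16) gives about 3.1 million, capacity(21) about 90 million and capacity(24) about 671 million, matching the figures above.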

During the merge the next matching block address is found by an exponential search strategy: the step length is doubled until the found value is greater than the value being searched for. Once this occurs it is known that the target value must lie within the current interval, on which a binary search is executed. Once the range is narrowed down to a relatively short length it is scanned by linear search, because this is more cache-friendly and thus gives some performance boost. This algorithm is O(log(d)), where d is the number of elements from the current location to the target value. These steps are shown in Figure 3. Exponentially growing initial steps are shown in red. The searched value is indicated by the green circle. After it is bypassed, the current interval is searched by binary search, shown in blue. Once the interval has shrunk enough it is scanned by linear search.
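A sketch of the three phases, assuming a cutoff of 8 elements for the final linear scan (the project's actual threshold isn't stated):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Exponential ("galloping") search: returns the index of the first
// element >= target, scanning forward from 'from' in a sorted vector.
size_t gallop(const std::vector<uint64_t>& v, size_t from, uint64_t target) {
    size_t step = 1, lo = from, hi = from;
    while (hi < v.size() && v[hi] < target) {   // 1. exponential phase
        lo = hi;
        hi += step;
        step *= 2;                              // double the step length
    }
    if (hi > v.size()) hi = v.size();
    while (hi - lo > 8) {                       // 2. binary phase
        size_t mid = lo + (hi - lo) / 2;
        if (v[mid] < target) lo = mid + 1; else hi = mid;
    }
    while (lo < hi && v[lo] < target) ++lo;     // 3. cache-friendly linear scan
    return lo;
}
```

The cost depends on the distance d to the target rather than the list length, which is exactly what the merge loop needs when matching addresses are usually nearby.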

Figure 3: Steps of the exponential search algorithm to find values in an ordered list. First the exponentially growing search steps are taken (in red), after which the found interval is searched by binary search (in blue) and linear search (the final green arrow).

The merge algorithm produces a list of JSON output tokens, some of which are constant strings such as column names and opening tags, while others are placeholders for aggregated values. Initially these placeholders contain the list of row numbers which will be used to determine the aggregated value, stored in the previously explained block storage format. Then the measure type and the aggregated column are chosen, and all placeholders' values are updated to store the result of the aggregation function. After this is done the output string can be generated. Further complexity arises from on-demand decompression and from multi-threading each of these steps.

The data compression is based on Elias gamma coding, which is a variable-length code that is efficient at storing small integers. Storing the number n takes 2·⌊log₂ n⌋ + 1 bits, which is still significant for bigger numbers. To avoid this problem, two different strategies are applied (one for block addresses and another for block contents). For block addresses one can just store the deltas between consecutive addresses, which are much better suited to this type of code. For block contents a run-length encoding is used, and the run lengths are stored as Elias gamma codes as well. Typically this reduces the memory usage by 40% but increases the processing time to 3× the original. Perhaps some other compression scheme would be faster to decompress.
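A minimal sketch of the encoder, emitting bits as a string for clarity (the project's actual bit-level representation may differ): a positive integer n is written as ⌊log₂ n⌋ zero bits followed by the binary form of n itself.

```cpp
#include <cstdint>
#include <string>

// Elias gamma code of n >= 1: floor(log2 n) zeros, then n in binary
// (MSB first). Total length is 2*floor(log2 n) + 1 bits.
std::string elias_gamma(uint64_t n) {
    int bits = 0;
    for (uint64_t t = n; t > 1; t >>= 1) ++bits;  // bits = floor(log2 n)
    std::string out(bits, '0');                   // unary length prefix
    for (int i = bits; i >= 0; --i)               // binary payload
        out += ((n >> i) & 1) ? '1' : '0';
    return out;
}
```

Small values dominate both the address deltas and the run lengths, which is why this code compresses them well: 1 costs a single bit, while larger values grow only logarithmically.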

The benchmark dataset is 1 Gb of property listings data (≈ 6 million rows), which can be compressed down to 70 Mb with aggressive settings on 7zip. So far I've run the benchmarking suite on MySQL, InfoBright, InfiniDB and MonetDB. They have quite different implementations and design goals, but all the tested queries are of the form "SELECT x, y, z, avg(a) FROM table GROUP BY x, y, z ORDER BY x, y, z" without any WHERE clauses. The environment is 64-bit Linux running on an Intel Core i7-3630QM @ 2.4 GHz with 6 of the 8 cores enabled for the virtual machine, with 4 Gb of RAM. Some care had to be taken to avoid surprising caching effects. The benchmark is meant to measure the actual calculation performance, not caching or pre-calculation efficiency. Ideally the system would require minimal configuration and just adapt to whichever data it is fed.

The first step in the benchmark is to load the test data into all of these database engines. Generally they all support bulk-loading of CSV files, but there were some problems with Chinese characters in a column of people's names. There were also some differences across vendors in which SQL syntax to use, or whether a proprietary batch loading tool was supplied. For MonetDB I used "COPY INTO table FROM STDIN" and InfiniDB has its own cpimport tool. OlapCalc reads data directly from a set of CSV files, and the others supported "LOAD DATA LOCAL INFILE 'filename' INTO table FIELDS TERMINATED BY ';'". Load times and memory usage are shown in Table 1.

Table 1: Program data load statistics.

Engine             Load time (s)  Disk (Mb)  RAM (Mb)
MySQL InnoDB       183.8          ???        168.6
OlapCalc (compr.)  81.9           0.0        231.5

Run-times of individual queries are shown in Figure 4. The aggregated values in the middle column are the same as those in Figure 1. The detailed timings on the left give a closer view of each product's performance; notably OlapCalc's graph is very distinct from the others. OlapCalc's slowest query took 33× as long as its fastest query (16× for the compressed format), whereas the other products' ratios were 1.6–4. Also query #8 was relatively faster in the other systems than in OlapCalc.

Figure 4: Run-times of all 9 queries (on the left), their aggregated values (in the middle) and speed comparison to the fastest project (on the right).

These run-times were aggregated into a single number by calculating the geometric mean. It has the nice property that a 1% increase in any of the results has the same impact on the final score; the arithmetic mean would give a disproportionate weight to queries with longer execution times. These means are plotted in the center of Figure 4. The fastest project is multi-threaded OlapCalc, with a geometric mean of 100 milliseconds. This was chosen as the baseline to which the others were compared. The second fastest was MonetDB with 2.2× the runtime, followed by InfiniDB (10.1×), InfoBright (13×) and MySQL (80×). Applying compression in OlapCalc makes the runtime 5× as long.
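The aggregation can be sketched as follows (computed via logarithms for numerical stability; illustrative only):

```cpp
#include <cmath>
#include <vector>

// Geometric mean of query times: exp of the mean of the logs, so a 1%
// change in any single time moves the score by the same relative amount.
double geometric_mean(const std::vector<double>& times) {
    double log_sum = 0.0;
    for (double t : times) log_sum += std::log(t);
    return std::exp(log_sum / times.size());
}
```

For example, one 1 ms query and one 100 ms query average to 10 ms, whereas the arithmetic mean (50.5 ms) would be dominated by the slow query.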

Future extensions for OlapCalc are support for ad-hoc arithmetical queries (like "avg(a - b)"), new aggregation types (at least "COUNT(DISTINCT abc)") and SQL parsing. A daemon based on the MySQL core would allow standard connectors to be used from other programming languages. Additionally, currently only "dimension columns" are compressed, whereas the facts (such as price and area) are stored uncompressed as 32-bit floats. Also additional comparisons should be done, for example with PostgreSQL, LucidDB and Vertica. The use of bigger datasets could be enabled by memory-mapped files.
