
Niko's Project Corner

Real-time interest point tracking

Description Interest point detection and tracking with FFTW and CUDA
Languages C++
FFTW
CUDA
Tags Computer Vision
FFT
Omnidirectional Camera
Master's Thesis
Visual Odometry
Field Of View
SIFT
SURF
Duration Spring 2012 – Fall 2012
Modified 15th July 2013

As mentioned in another article about omnidirectional cameras, the main topic of my Master's Thesis was real-time interest point extraction and tracking on an omnidirectional image in a challenging forest environment. I found OpenCV's routines mostly rather slow and single-threaded, so I ended up implementing everything myself to gain more control over the data flow and the dependencies between threads. The implemented code simultaneously uses four threads on the CPU and a few hundred on the GPU, executing interest point extraction and matching at 27 fps (37 ms/frame) for a 1800 × 360 pixel (≈0.65 Mpix) panoramic image.

In visual odometry, interest point extraction and matching is a crucial step, so it is no surprise that it is a widely investigated topic. Some methods are optimized for speed whereas others are optimized for the most accurate results. There are also several different types of errors (interest point drift, incorrect matches etc.), and methods differ in how much they rely on a priori knowledge extracted from previous frames. Additional sensors such as an Inertial Measurement Unit (IMU) might be available, and the most computationally efficient methods rely on accurate a priori information and reliable IMU readings to get very good initial guesses of where interest points should be found in the next frame. However, such methods are less general and more difficult to apply in new situations. In addition, many visual odometry systems use stereo vision instead of a single camera, which introduces its own difficulties but also advantages.

In my thesis and in this article I focus only on monocular vision algorithms with no a priori knowledge of the interest points, the environment or the motion between frames. I also won't describe standard interest point algorithms such as SIFT or SURF, because they are well described on Wikipedia and are difficult to apply robustly in real-time scenarios: they are computationally intensive to extract and match, and they don't have many adjustable parameters with which to ease the computational cost. Of course there are numerous other algorithms as well, but typically they have quite poor scale and rotation invariance.

The algorithm I came up with starts by converting the observed circular omnidirectional image into an equiangular panorama of the desired resolution, such as 1800 × 360 pixels. This aspect ratio corresponds approximately to the field of view of the camera, which in this case was 360° × (11 + 62)°. These dimensions have small prime factor decompositions (2³ × 3² × 5² and 2³ × 3² × 5), which is important for an efficient FFT calculation. The original circular image and the resulting panorama can be seen in Figure 1.

omni_example
Figure 1: Example captured omnidirectional image, and the generated panoramic view.
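To make the unwrapping step concrete, below is a minimal sketch of how the circular image could be mapped into the panorama. The function name unwrapPanorama and the linear radius mapping are my own simplifications: the real mapping should follow the mirror's calibration, and bilinear interpolation would be used instead of nearest-neighbor sampling.

```cpp
#include <cmath>
#include <cstdint>
#include <vector>

// Minimal sketch: unwrap a circular omnidirectional image into an
// 1800x360 equiangular panorama. Assumes a known image center (cx, cy)
// and inner/outer radii; a real implementation would follow the
// calibrated mirror profile instead of this linear radius mapping.
std::vector<uint8_t> unwrapPanorama(const std::vector<uint8_t>& src,
                                    int srcW, int srcH,
                                    float cx, float cy,
                                    float rInner, float rOuter,
                                    int panoW = 1800, int panoH = 360)
{
    const float PI = 3.14159265358979f;
    std::vector<uint8_t> pano(size_t(panoW) * panoH);
    for (int v = 0; v < panoH; ++v) {
        // Map the panorama row to a radius between the two circles.
        float r = rInner + (rOuter - rInner) * (float(v) / (panoH - 1));
        for (int u = 0; u < panoW; ++u) {
            // Map the panorama column to an angle around the mirror axis.
            float theta = 2.0f * PI * u / panoW;
            int x = int(std::lround(cx + r * std::cos(theta)));
            int y = int(std::lround(cy + r * std::sin(theta)));
            if (x >= 0 && x < srcW && y >= 0 && y < srcH)
                pano[size_t(v) * panoW + u] = src[size_t(y) * srcW + x];
        }
    }
    return pano;
}
```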

The interest point extraction is based on the standard Laplacian of Gaussian (LoG) blob detector. Its benefits include excellent scale and rotation invariance, the ability to robustly divide interest points into two classes (dark and bright blobs), very stable responses, and the fact that such blobs are plentiful in natural environments. Perhaps the only downside is that it also gives unstable responses along edges, but these are easy to detect and filter out. One simple alternative would have been one of the numerous corner detectors, but they yield only one class of detected points, are prone to image blur, and corners are harder to find in nature.
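For reference, the LoG kernel and its scale-normalized response (which makes responses comparable across scales) are:

```latex
\nabla^2 G(x,y;\sigma) \;=\; -\frac{1}{\pi\sigma^4}
  \left(1 - \frac{x^2 + y^2}{2\sigma^2}\right)
  e^{-\frac{x^2 + y^2}{2\sigma^2}},
\qquad
\mathrm{LoG}_{\mathrm{norm}}(x,y;\sigma) \;=\; \sigma^2\,\nabla^2 G(x,y;\sigma)
```

The sign of the filtered response is what separates dark blobs from bright ones, which is where the two interest point classes come from.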

The SIFT detector approximates the Laplacian of Gaussian via the Difference of Gaussians (DoG), which is based on calculating differences between images that have been blurred with Gaussian kernels at different scales. Another fast detector, CenSurE (Center Surround Extremas for Realtime Feature Detection and Matching), approximates the same response by spending effort on calculating slanted integral images, which can then be used to calculate convolutions over arbitrarily large areas in constant time, but it also has its own trade-offs.
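The relation behind the DoG approximation (given e.g. in Lowe's SIFT paper) is that the difference of two Gaussians at nearby scales σ and kσ is approximately the scale-normalized LoG, up to the constant factor (k − 1):

```latex
D(x,y;\sigma) \;=\; \bigl(G(x,y;k\sigma) - G(x,y;\sigma)\bigr) * I(x,y)
\;\approx\; (k-1)\,\sigma^2\,\nabla^2 G * I(x,y)
```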

The extraction algorithm was implemented with a quad-core processor in mind, and this is reflected in all of its steps. Most of the time is spent processing the most detailed level of the image pyramid; the smaller pyramid levels aren't that significant. The first two steps are shown in Figure 2. In the first step the LoG is calculated at four different scales in parallel, utilizing the FFTW library. The LoG could be implemented as a standard convolution over the image, but that would make it suitable only for small kernels. Instead, when the convolution is calculated via a forward Fourier transform, an element-wise multiplication and an inverse Fourier transform, the runtime doesn't depend on the chosen scale σ. Scaling the response by a constant also comes for ''free'', because the applied mask can be pre-multiplied by the desired amount.

extraction
Figure 2: The first two steps of the extraction, and how the task is executed in parallel in four threads.
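The frequency-domain filtering of one scale could look roughly like the sketch below. It assumes a precomputed frequency-domain LoG mask maskF (which can already include the constant response scaling mentioned above); in the real implementation the FFTW plans are created once and reused, and one such pipeline runs in each of the four threads.

```cpp
#include <fftw3.h>
#include <algorithm>

// Rough sketch: convolve one pyramid level with a LoG kernel by
// multiplying in the frequency domain. 'maskF' is the precomputed FFT
// of the (padded) LoG kernel; changing sigma only changes the mask, not
// the runtime. Plans are created here for brevity only.
void logFilterFFT(const float* image, float* response,
                  const fftwf_complex* maskF, int W, int H)
{
    const int NC = H * (W / 2 + 1);  // size of the real-to-complex spectrum
    float* buf = (float*)fftwf_malloc(sizeof(float) * W * H);
    fftwf_complex* spec = (fftwf_complex*)fftwf_malloc(sizeof(fftwf_complex) * NC);

    fftwf_plan fwd = fftwf_plan_dft_r2c_2d(H, W, buf, spec, FFTW_ESTIMATE);
    fftwf_plan inv = fftwf_plan_dft_c2r_2d(H, W, spec, buf, FFTW_ESTIMATE);

    std::copy(image, image + W * H, buf);
    fftwf_execute(fwd);                          // image -> spectrum

    // Element-wise complex multiplication with the LoG mask spectrum.
    for (int i = 0; i < NC; ++i) {
        float re = spec[i][0] * maskF[i][0] - spec[i][1] * maskF[i][1];
        float im = spec[i][0] * maskF[i][1] + spec[i][1] * maskF[i][0];
        spec[i][0] = re;
        spec[i][1] = im;
    }

    fftwf_execute(inv);                          // spectrum -> response
    const float norm = 1.0f / (W * H);           // FFTW transforms are unnormalized
    for (int i = 0; i < W * H; ++i)
        response[i] = buf[i] * norm;

    fftwf_destroy_plan(fwd);
    fftwf_destroy_plan(inv);
    fftwf_free(spec);
    fftwf_free(buf);
}
```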

Once the LoG responses have been calculated, the local scale-space minima and maxima are searched for. Only the middle two layers can be searched, and even they have a margin at their borders; these margins are indicated by the pink color in Figure 2. Each pixel is compared against its 5 + 8 + 5 neighbors (indicated in gray) to determine whether all of them are either smaller or larger than the pixel's value. The pixel is first compared against its neighbors within the same scale, to better utilize the processor's caches. To keep all four cores busy, each core is responsible for finding local extrema in one half of the image. The found extrema still need to be filtered for edge responses and for points with too low local contrast.
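The extremum test for a single candidate pixel could look roughly as follows. I read the 5-neighbor pattern in the adjacent scales as the center plus its 4-connected neighbors, which is an assumption on my part; the same-scale comparisons are done first, as described above.

```cpp
#include <vector>

// Sketch of the scale-space extremum test for one candidate pixel.
// 'layers' holds the LoG responses of four scales, row-major W x H each;
// only the two middle layers (s = 1, 2) are searched, and (x, y) is
// assumed to lie inside the border margins.
bool isExtremum(const std::vector<std::vector<float>>& layers,
                int W, int s, int x, int y)
{
    const float v = layers[s][y * W + x];
    bool isMax = true, isMin = true;
    auto check = [&](float n) { isMax &= (v > n); isMin &= (v < n); };

    // Same-scale 8-neighborhood first, to make better use of the caches.
    for (int dy = -1; dy <= 1; ++dy)
        for (int dx = -1; dx <= 1; ++dx)
            if (dx || dy)
                check(layers[s][(y + dy) * W + (x + dx)]);
    if (!isMax && !isMin) return false;

    // Center + 4-neighborhood in the scales above and below (5 + 5).
    static const int off[5][2] = {{0,0},{-1,0},{1,0},{0,-1},{0,1}};
    for (int ds = -1; ds <= 1; ds += 2)
        for (const auto& o : off)
            check(layers[s + ds][(y + o[1]) * W + (x + o[0])]);

    return isMax || isMin;
}
```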

Once interest points have been found, descriptors can be extracted. Unfortunately my novel algorithm hasn't been published yet, so I won't reveal all the details here. I can say, however, that the descriptors are extremely fast to calculate, affine-invariant(!), highly robust, and that their matching fits well into the CUDA programming model. The number of descriptors is 10× to 20× the number of interest points, but a very good heuristic exists to efficiently ignore most of the implausible matches at little extra cost. This holds mostly when a relatively high frame rate video feed is analyzed.

In the implemented tracking system, the GPU is used to match frames 1 and 2 while the CPU is busy extracting features from frame 3. Thus the frame rate is determined by the slowest component of the system. This depends on the image resolution and various other parameters, but typically the two are quite balanced.
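The overlap can be sketched schematically like this. Frame, Features, extractFeatures and matchOnGPU are hypothetical stand-ins for the real routines; the point is only that extraction of the newest frame runs concurrently with matching of the two previous ones.

```cpp
#include <future>
#include <utility>
#include <vector>

// Schematic of the CPU/GPU overlap; the types and functions below are
// placeholders for the real implementation.
struct Frame {};
struct Features {};
struct Matches {};

Features extractFeatures(const Frame&) { return {}; }                  // CPU side, 4 threads internally
Matches matchOnGPU(const Features&, const Features&) { return {}; }    // CUDA kernels + result copy

void trackSequence(const std::vector<Frame>& frames)
{
    Features prev = extractFeatures(frames[0]);
    Features curr = extractFeatures(frames[1]);
    for (size_t i = 2; i < frames.size(); ++i) {
        // The GPU matches the two frames that are already extracted...
        auto matching = std::async(std::launch::async,
                                   matchOnGPU, std::cref(prev), std::cref(curr));
        // ...while the CPU extracts interest points from the next frame.
        Features next = extractFeatures(frames[i]);
        Matches m = matching.get();   // the frame rate is set by the slower stage
        (void)m;                      // ...feed the matches to the tracking stage
        prev = std::move(curr);
        curr = std::move(next);
    }
}
```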

I have run extensive comparisons of this algorithm against the popular SURF, and found that my algorithm achieves considerably lower error rates in different matching scenarios. The dataset used was the well-known Leuven set, which contains a good variety of image distortions such as blur, viewpoint change and zoom + rotation. Example pairs of images can be seen in Figure 3.

image_examples
Figure 3: Example pairs of images from the Leuven image set.

Since the descriptor matching is based on a voting scheme, it proved to be very robust in the matching task. Two different ways of matching were evaluated. The first one matches each interest point to its closest candidate, and the other one is threshold matching, where a point is considered to match any point for which the similarity threshold is exceeded. The first criterion is evaluated in terms of ''1 − precision'' versus the correct positive matching rate, and the other one in terms of false positive rate versus false negative rate. In the second graph EER stands for ''equal error rate'', the point at which the probability of a false negative equals the probability of a false positive.
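As a concrete example of the second criterion, the EER can be read from the threshold-matching results roughly as in the sketch below, assuming the similarity scores of correct and incorrect pairs (as determined by the dataset's ground truth) have been collected into two lists.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Sketch: sweep the similarity threshold over the score lists and return
// the equal error rate, i.e. the point where the false negative rate is
// closest to the false positive rate. A pair is accepted when its score
// is at least the threshold.
float equalErrorRate(std::vector<float> correct, std::vector<float> incorrect)
{
    std::sort(correct.begin(), correct.end());
    std::sort(incorrect.begin(), incorrect.end());
    float bestGap = 1.0f, eer = 1.0f;
    for (float t : correct) {   // candidate thresholds
        // False negatives: correct pairs whose score falls below t.
        float fnr = float(std::lower_bound(correct.begin(), correct.end(), t)
                          - correct.begin()) / correct.size();
        // False positives: incorrect pairs whose score reaches t.
        float fpr = float(incorrect.end()
                          - std::lower_bound(incorrect.begin(), incorrect.end(), t))
                    / incorrect.size();
        if (std::fabs(fnr - fpr) < bestGap) {
            bestGap = std::fabs(fnr - fpr);
            eer = 0.5f * (fnr + fpr);
        }
    }
    return eer;
}
```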

boat_roc
Figure 4: Receiver operating characteristic curves for the boat series.

The performance for the boat sequence is shown in Figure 4. In all scenarios the novel algorithm achieved better accuracy than SURF, especially for the images with the least difference in scale. The graffiti sequence gave similar results, as can be seen in Figure 5. I hope that in the future I'll get this algorithm properly published, so that it can also be reviewed by other peers in the scientific community.

graffiti_roc
Figure 5: Receiver operating characteristic curves for the graffiti series.

Related blog posts:

Bananagrams1
StableDiffusionBasics
Puzzles
SpeechMusicSeparation
VideoClustering