Niko's Project Corner

CUDA realtime rendering engine

Description Basic realtime CUDA rendering engine with relfective surfaces.
Languages C++
Tags Ren­der­ing
Duration Spring 2013
Modified 9th July 2013

So far I've writ­ten a ba­sic ren­der­ing en­gine which uses Nvidia's CUDA (Com­pute Uni­fied De­vice Ar­chi­tec­ture) which can ren­der re­flec­tive sur­faces with en­vi­ron­men­tal map­ping and anti-alias­ing and mo­tion blur at 200 fps with min­imal us­age of 3rd party li­braries such as OpenGL. This let me fully im­ple­ment the cross-plat­form ren­der­ing pipeline from data trans­fer to pixel-level RGB cal­cu­la­tions, all in C-like syn­tax.

An ex­am­ple ren­dered frame can be seen in Fig­ure 1. Cubes aren't tex­tured, but that would be fairly easy to im­ple­ment since CUDA sup­ports fil­tered tex­ture lookups. How­ever I'm likely to im­ple­ment pro­ce­du­ral vol­umet­ric tex­tures, be­cause the game­play lets the user to ar­bi­trary slice the ob­ject by draw­ing the cut­ting plane by mouse. If the ob­jects were tex­tured, then I'd need to im­ple­ment also gen­er­ation of UV co­or­di­nates and tex­ture bitmaps.

Figure 1: Ex­am­ple ren­der­ing of two cubes hav­ing re­flec­tive un-even sur­faces, with an­tialias­ing and vi­gnetting. Cur­rent im­ple­men­ta­tion is un­able to make ob­jects vis­ible in sur­face re­flec­tions.

The an­tialias­ing is im­ple­mented by hav­ing the CUDA thread to su­per­sam­ple each pixel, and store the av­er­aged re­sult. These re­sults can be seen in Fig­ure 2. In ad­di­tion the used sam­pling pat­tern is mir­rored be­tween frames and the im­age is av­er­aged with the pre­vi­ous im­age, so in prac­tice the 4x sam­pling pro­duces re­sults which are sim­ilar to a more com­pu­ta­tion­ally de­mand­ing 8x sam­pling. This has the ad­di­tional ad­van­tage of hav­ing the pos­si­bil­ity of in­ter­po­lat­ing the sub-sam­pling di­rec­tion based on cam­era mo­tion since the pre­vi­ous frame to sim­ulate mo­tion blur on the back­ground. This is demon­strated in Fig­ure 3. This com­bined with very high fram­er­ate re­sults in a very smooth user ex­pe­ri­ence.

Figure 2: Dis­play­ing the spa­tial an­tialiased ren­der­ing with 1x (with­out), 4x and 8x su­per­sam­pling.
Figure 3: Dis­play­ing the spa­tio-tem­po­ral an­tialiased ren­der­ing with 1x, 4x and 8x su­per­sam­pling.

CUDA threads op­er­ate in blocks of 16 × 16 = 256 threads in to­tal, one thread / screen pixel. If the out­put res­olu­tion is 1280 × 720, there are 128016 × 72016 = 80 × 45 = 3600 blocks to render. The number of active blocks depends on the hardware, but the used hardware (Nvidia GeForce GT 650M, maximum of 2048 threads) achieved a high occypancy quite easily. This is illustraded in Figure 4. If the thread block has to check ray-triangle collisions with only a single triangle, then it is encoded in blue, otherwise in red. Interestingly this can be solved quite efficiently in CPU as a pre-step for the rendering, by projecting the triangle corners to screen pixels and doing the checks purely in 2D.

Figure 4: Dur­ing ren­der­ing the screen is par­ti­tioned into in­de­pended 16 × 16 pix­els blocks, which cor­re­spond to CUDA's thread blocks. Each block has 256 threads, one for each pixel. When an­tialias­ing is used, each thread is it­er­ated mul­ti­ple times to de­ter­mine the av­er­age color for the pixel. Pix­els out­side bound­ing box can be ren­dered with less ef­fort, be­cause only the back­ground needs to be ren­dered.

If the mod­els con­sist of large poly­gons, then each thread block needs to do only a few ray-tri­an­gle in­ter­sec­tion tests. Also if that block does not con­tain any ob­jects, the block doesn't need to ren­der any­thing but the back­ground, ei­ther by pro­ce­du­ral cal­cu­la­tions or by a sim­ple tex­ture lookup. The ren­der­ing is im­ple­mented as first ren­der­ing just plain back­ground for the whole screen, and then have sec­ond pass on the re­gion which con­tains ren­der­able ob­jects.

Ide­ally the ren­dered out­put would be dis­played at the screen di­rectly us­ing OpenGL, but un­four­tu­nately I haven't been able to get this work­ing yet. The cur­rent work-around is to copy the re­sult­ing im­age from GPU to a SDL sur­face (in CPU's RAM), and then use a sur­face blit op­er­ation to dis­play it on the screen. This adds a bit ex­tra over­head, but could be eas­ily refac­tored by some­one who is more fa­mil­iar with CUDA/OpenGL in­te­gra­tion, and there is even an ar­ti­cle ''OpenGL In­ter­op­er­abil­ity with CUDA''.

The slic­ing pro­cess can be seen in Fig­ure 5. By spec­ify­ing two points in the screen space, based on vir­tual cam­era's pa­ram­eter it can be de­ter­mined that in which di­rec­tion in world co­or­di­nates the two points are point­ing at. By tak­ing a cross pro­duct of these vec­tors, we get the nor­mal of the split­ting plane. By know­ing that the plane goes through the vir­tual cam­era's fo­cal point, we have full knowl­edge of its equa­tion. Then all ver­tices of the mod­els can be checked if its end­points are in dif­fer­ent sides of the plane or not. If they are, the whole tri­an­gle is flagged as be­ing af­fected by the split­ting ac­tion.

Figure 5: As shown on the left, the user can de­fine a split­ting plane, along which the ob­ject will be sliced and have its mesh re­fined. Af­fected tri­an­gles are high­lighted in red. A re­sult of mul­ti­lple slic­ings is shown on the right. Dur­ing the game­play these holes would be au­to­mat­ically cov­ered.

The ac­tual split­ting works by re­fin­ing each tri­an­gle into three sub-tri­an­gles, one of which will be deleted by the split­ting ac­tion. Af­ter all tri­an­gles have been re­fined, those edges which only have one linked tri­an­gle form the perime­ter of the hole. At this point a proper tri­an­gu­la­tion needs to be au­to­mat­ically gen­er­ated to fill the hole, and this should try to max­imize the min­imum an­gle of all re­sult­ing tri­an­gles.

The out­come of this al­go­rithm is a valid mesh, but typ­ically it has many re­dun­dant ver­tices which could be pruned. How­ever so far I haven't been able to im­ple­ment the pro­ce­dure of re­fin­ing the mesh while it­er­at­ing through it, be­cause data struc­tures be­come tem­porar­ily in­con­sis­tent. Maybe an eas­ier so­lu­tion would be to copy the old struc­ture into a tem­po­rary vari­able, and then re-cre­ate the op­ti­mized mesh from scratch. With­out this op­ti­miza­tion the mesh be­comes un­nec­es­sar­ily com­plex af­ter 4 to 6 split­ting ac­tions.

Related blog posts: