Niko's Project Corner

So far I've written a basic rendering engine on Nvidia's CUDA (Compute Unified Device Architecture). It renders reflective surfaces with environment mapping, anti-aliasing and motion blur at 200 fps, with minimal use of third-party libraries such as OpenGL. This let me implement the whole cross-platform rendering pipeline myself, from data transfer to pixel-level RGB calculations, all in C-like syntax.

An example rendered frame can be seen in Figure 1. The cubes aren't textured, but that would be fairly easy to implement since CUDA supports filtered texture lookups. However, I'm more likely to implement procedural volumetric textures, because the gameplay lets the user slice an object arbitrarily by drawing the cutting plane with the mouse. If the objects used bitmap textures, I'd also need to implement the generation of UV coordinates and texture bitmaps.

Figure 1: Example rendering of two cubes with reflective, uneven surfaces, with antialiasing and vignetting. The current implementation cannot show objects in surface reflections.

Antialiasing is implemented by having each CUDA thread supersample its pixel and store the averaged result. The results can be seen in Figure 2. In addition, the sampling pattern is mirrored between frames and each image is averaged with the previous one, so in practice 4x sampling produces results similar to the more computationally demanding 8x sampling. An additional advantage is that the sub-sampling direction can be interpolated based on the camera's motion since the previous frame, which simulates motion blur on the background. This is demonstrated in Figure 3. Combined with the very high frame rate, this results in a very smooth user experience.

Figure 2: Spatially antialiased rendering with 1x (none), 4x and 8x supersampling.

Figure 3: Spatio-temporally antialiased rendering with 1x, 4x and 8x supersampling.
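The mirrored-pattern idea can be sketched on the CPU in plain C++. This is only an illustration under stated assumptions: the offsets in `kPattern`, the horizontal mirroring on odd frames, the 50/50 blend with the previous frame and the names `sampleOffset` and `renderPixel` are all hypothetical, since the actual pattern and blend weights used by the engine aren't given here. Motion blur is left out.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Sub-pixel offset in pixels, relative to the pixel centre.
struct Offset { float dx, dy; };

// Hypothetical 4-sample rotated-grid pattern (an assumption, not the engine's).
constexpr std::array<Offset, 4> kPattern = {{
    {-0.375f, -0.125f}, {0.125f, -0.375f}, {0.375f, 0.125f}, {-0.125f, 0.375f}
}};

// Offset of sample i on a given frame. Odd frames mirror the pattern
// horizontally, so two consecutive frames cover 8 distinct positions.
Offset sampleOffset(std::size_t i, unsigned frame) {
    assert(i < kPattern.size());
    Offset o = kPattern[i];
    if (frame % 2 == 1) o.dx = -o.dx;
    return o;
}

// Supersample one pixel with shade(x, y), average the samples, and blend
// the result equally with the value stored for the previous frame.
template <typename Shade>
float renderPixel(float x, float y, unsigned frame, float previous, Shade shade) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < kPattern.size(); ++i) {
        const Offset o = sampleOffset(i, frame);
        sum += shade(x + o.dx, y + o.dy);
    }
    const float current = sum / static_cast<float>(kPattern.size());
    return 0.5f * (current + previous);
}
```

In a CUDA version each thread would run the body of `renderPixel` for its own pixel, with `previous` read from the last frame's buffer.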

CUDA threads operate in blocks of 16 × 16 = 256 threads, one thread per screen pixel. If the output resolution is 1280 × 720, there are 1280/16 × 720/16 = 80 × 45 = 3600 blocks to render. The number of simultaneously active blocks depends on the hardware, but the hardware used (Nvidia GeForce GT 650M, maximum of 2048 threads) achieved high occupancy quite easily. The partitioning is illustrated in Figure 4. If a thread block has to check ray-triangle collisions against only a single triangle, it is colored blue, otherwise red. Interestingly, which triangles each block has to check can be determined quite efficiently on the CPU as a pre-step for the rendering, by projecting the triangle corners to screen pixels and doing the checks purely in 2D.

Figure 4: During rendering the screen is partitioned into independent 16 × 16 pixel blocks, which correspond to CUDA's thread blocks. Each block has 256 threads, one for each pixel. When antialiasing is used, each thread iterates multiple times to determine the average color for its pixel. Pixels outside the bounding box can be rendered with less effort, because only the background needs to be rendered.
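The block arithmetic is easy to check in a few lines of plain C++. The names `gridFor`, `locate` and `divUp` are made up for this sketch. The rounding-up division is an assumption that only matters for resolutions which aren't multiples of 16.

```cpp
#include <cassert>

constexpr int kBlockSize = 16;  // 16 x 16 = 256 threads per block

struct Grid { int blocksX, blocksY; };
struct PixelLocation { int blockX, blockY, threadX, threadY; };

// Integer division rounding up, so partial blocks at the edges are counted.
constexpr int divUp(int a, int b) { return (a + b - 1) / b; }

// Number of thread blocks needed to cover a width x height image.
Grid gridFor(int width, int height) {
    assert(width > 0 && height > 0);
    return { divUp(width, kBlockSize), divUp(height, kBlockSize) };
}

// Which block and which thread inside it handles pixel (px, py). This is
// the inverse of the usual blockIdx * blockDim + threadIdx mapping.
PixelLocation locate(int px, int py) {
    assert(px >= 0 && py >= 0);
    return { px / kBlockSize, py / kBlockSize, px % kBlockSize, py % kBlockSize };
}
```

For 1280 × 720 this gives 80 × 45 blocks, matching the 3600 above.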

If the models consist of large polygons, each thread block needs to do only a few ray-triangle intersection tests. Also, if a block does not contain any objects, it doesn't need to render anything but the background, either by procedural calculations or by a simple texture lookup. The rendering is implemented by first rendering just the plain background for the whole screen, and then running a second pass over the region which contains renderable objects.
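A rough C++ sketch of the 2D pre-step on the CPU is below. It assumes the triangle corners have already been projected to screen pixels. It also uses each triangle's bounding box instead of an exact triangle-block overlap test, which is a simplification of mine and may count a few blocks too many. The names `Triangle2` and `countTrianglesPerBlock` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct Point2 { float x, y; };            // corner already projected to pixels
struct Triangle2 { Point2 a, b, c; };

constexpr int kBlock = 16;                // block edge length in pixels

// For each 16 x 16 block (row-major order), count the triangles whose 2D
// bounding box touches it. Blocks left at zero only need the background.
std::vector<int> countTrianglesPerBlock(const std::vector<Triangle2>& tris,
                                        int width, int height) {
    assert(width > 0 && height > 0);
    const int blocksX = (width + kBlock - 1) / kBlock;
    const int blocksY = (height + kBlock - 1) / kBlock;
    std::vector<int> counts(blocksX * blocksY, 0);

    for (const Triangle2& t : tris) {
        const float minX = std::min({t.a.x, t.b.x, t.c.x});
        const float maxX = std::max({t.a.x, t.b.x, t.c.x});
        const float minY = std::min({t.a.y, t.b.y, t.c.y});
        const float maxY = std::max({t.a.y, t.b.y, t.c.y});
        // Skip triangles that are completely off-screen.
        if (maxX < 0 || maxY < 0 || minX >= width || minY >= height) continue;

        // Clamp to the screen, then convert pixel range to block range.
        const int x0 = static_cast<int>(std::max(minX, 0.0f)) / kBlock;
        const int x1 = static_cast<int>(std::min(maxX, width - 1.0f)) / kBlock;
        const int y0 = static_cast<int>(std::max(minY, 0.0f)) / kBlock;
        const int y1 = static_cast<int>(std::min(maxY, height - 1.0f)) / kBlock;
        for (int by = y0; by <= y1; ++by)
            for (int bx = x0; bx <= x1; ++bx)
                ++counts[by * blocksX + bx];
    }
    return counts;
}
```

The bounding box of all non-zero blocks would then give the region for the second rendering pass.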

Ideally the rendered output would be displayed on the screen directly using OpenGL, but unfortunately I haven't been able to get this working yet. The current work-around is to copy the resulting image from the GPU to an SDL surface (in the CPU's RAM), and then use a surface blit operation to display it on the screen. This adds a bit of extra overhead, but it could be easily refactored by someone who is more familiar with CUDA/OpenGL integration, and there is even an article "OpenGL Interoperability with CUDA".

The slicing process can be seen in Figure 5. The user specifies two points in screen space, and the virtual camera's parameters determine the world-space direction in which each point lies. The cross product of these two direction vectors gives the normal of the splitting plane. Since the plane also passes through the virtual camera's focal point, its equation is fully known. Each edge of the models can then be checked to see whether its endpoints are on different sides of the plane. If they are, the whole triangle is flagged as being affected by the splitting action.

Figure 5: As shown on the left, the user can define a splitting plane, along which the object will be sliced and have its mesh refined. Affected triangles are highlighted in red. The result of multiple slicings is shown on the right. During gameplay these holes would be automatically covered.
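The plane construction and the side test can be sketched in plain C++ as follows. The two ray directions are assumed to be already un-projected into world coordinates, and all names here (`slicingPlane`, `signedSide`, `isSplit`) are made up for illustration. Checking the three corners of a triangle is equivalent to checking its three edges.

```cpp
#include <array>
#include <cassert>

struct Vec3 { double x, y, z; };

Vec3 operator-(Vec3 a, Vec3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
double dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }
Vec3 cross(Vec3 a, Vec3 b) {
    return {a.y * b.z - a.z * b.y, a.z * b.x - a.x * b.z, a.x * b.y - a.y * b.x};
}

// Plane stored as a normal and one point on the plane.
struct Plane { Vec3 normal; Vec3 point; };

// rayA and rayB are the world-space directions through the two clicked
// screen points; the plane contains both rays and the camera position.
Plane slicingPlane(Vec3 cameraPos, Vec3 rayA, Vec3 rayB) {
    const Vec3 n = cross(rayA, rayB);
    assert(dot(n, n) > 0.0);  // the two rays must not be parallel
    return {n, cameraPos};
}

// Positive on one side of the plane, negative on the other, zero on it.
double signedSide(const Plane& p, Vec3 v) { return dot(p.normal, v - p.point); }

// A triangle is affected if its corners lie strictly on both sides.
bool isSplit(const Plane& p, const std::array<Vec3, 3>& tri) {
    bool positive = false, negative = false;
    for (const Vec3& v : tri) {
        const double s = signedSide(p, v);
        positive = positive || s > 0.0;
        negative = negative || s < 0.0;
    }
    return positive && negative;
}
```

The normal isn't normalized here, because only the sign of the distance matters for flagging triangles.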

The actual splitting works by refining each triangle into three sub-triangles, one of which will be deleted by the splitting action. After all triangles have been refined, those edges which only have one linked triangle form the perimeter of the hole. At this point a proper triangulation needs to be automatically generated to fill the hole, and this should try to maximize the minimum angle of all resulting triangles.
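Finding the hole's perimeter can be sketched like this, assuming the mesh is stored as a plain list of triangles with integer vertex indices. That is a hypothetical layout chosen for the example, not necessarily the engine's linked structure. The triangulation of the hole itself isn't covered.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <map>
#include <utility>
#include <vector>

using Triangle = std::array<int, 3>;  // three vertex indices
using Edge = std::pair<int, int>;     // (smaller index, larger index)

// Returns the edges that belong to exactly one triangle. In an otherwise
// closed mesh these form the perimeter of the hole left by a slice.
std::vector<Edge> boundaryEdges(const std::vector<Triangle>& tris) {
    std::map<Edge, int> uses;
    for (const Triangle& t : tris) {
        for (int i = 0; i < 3; ++i) {
            const int a = t[i];
            const int b = t[(i + 1) % 3];
            assert(a != b);  // degenerate triangles are not expected
            // Order the indices so both directions map to the same key.
            ++uses[Edge(std::min(a, b), std::max(a, b))];
        }
    }
    std::vector<Edge> result;
    for (const auto& entry : uses)
        if (entry.second == 1) result.push_back(entry.first);
    return result;
}
```

A closed tetrahedron gives an empty result, and removing one of its faces gives the three edges of that face.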

The outcome of this algorithm is a valid mesh, but it typically has many redundant vertices which could be pruned. However, so far I haven't been able to implement the procedure of refining the mesh while iterating through it, because the data structures become temporarily inconsistent. Maybe an easier solution would be to copy the old structure into a temporary variable, and then re-create the optimized mesh from scratch. Without this optimization the mesh becomes unnecessarily complex after 4 to 6 splitting actions.



CUDA realtime rendering engine
