Sunday, 25 January 2015

Experiment #2: C++11 multithreaded raytracer

Hello everyone! A few weeks ago I read The C++ Programming Language, 4th Edition, to refresh my knowledge of C++11, since I'm not using it at work. I was particularly curious about the built-in threading support. I thought that features like packaged tasks and lambda functions, plus a few concepts borrowed from the functional programming paradigm, should make it relatively easy to write truly cross-platform multithreaded applications. Of course you have less control: things such as lightweight threads and mutexes, setting thread affinity, or thread pools are not available, so if you are looking for maximum performance these facilities are probably not optimal. Keep in mind that most of the efficiency of a multithreaded application depends on how you design it: the less read-write data is shared across threads, the fewer locks/mutexes you need and the less time your threads spend waiting to acquire the resources they need.

I decided to make a little multithreaded test application, and since I have a passion for graphics programming I went for a simple multithreaded raytracer, also because it is the kind of problem that can be easily parallelized. It features the following:
  • colored planes and spheres
  • multiple colored lights
  • shadows
That's it! No specular highlights, no reflections, no textures and no anti-aliasing. They are actually very easy to implement and maybe I will extend the demo in the future, but for now I preferred to keep the line count to a minimum so the code stays readable in this post. The ray test routines come straight from Real-Time Collision Detection by Christer Ericson.
our simple raytracer output
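
For reference, the classic ray-sphere test from Ericson's book boils down to solving a quadratic. Here is a minimal sketch of it (the names are illustrative and not necessarily the demo's exact code):

#include <cmath>

struct Vec3 { float x, y, z; };

static float dot(const Vec3& a, const Vec3& b) { return a.x*b.x + a.y*b.y + a.z*b.z; }
static Vec3  sub(const Vec3& a, const Vec3& b) { return Vec3{a.x - b.x, a.y - b.y, a.z - b.z}; }

struct Sphere { Vec3 center; float radius; };

// Intersects the ray p + t*d (d normalized, t >= 0) with a sphere and returns
// the smallest non-negative hit distance in t.
bool intersectRaySphere(const Vec3& p, const Vec3& d, const Sphere& s, float& t)
{
    Vec3  m = sub(p, s.center);
    float b = dot(m, d);
    float c = dot(m, m) - s.radius * s.radius;

    // Ray origin outside the sphere (c > 0) and pointing away from it (b > 0): no hit.
    if (c > 0.0f && b > 0.0f) return false;

    float discr = b * b - c;
    if (discr < 0.0f) return false;   // the ray misses the sphere

    t = -b - std::sqrt(discr);        // smallest root of the quadratic
    if (t < 0.0f) t = 0.0f;           // origin inside the sphere: clamp to the origin
    return true;
}

The same routine can double as the shadow test: a point is in shadow if a ray from it towards a light hits any object before reaching the light.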

At first, I implemented a single-threaded version. Converting it to multithreading literally took about 20 more lines of code. Basically I had to decide what to feed to each thread: my choice was to divide the screen into 2 rows and a number of columns equal to half the number of cores of the host CPU, and to feed each portion of the screen to a thread. This is not optimal, since some portions of the screen are "easy" to compute while others require more work, so the bottleneck is the thread that has to do more raycasts than the others. A better strategy would be a pool of threads, each assigned a fairly small portion of the screen as a job, with the main thread feeding them new jobs until the whole image is done. That would have required more work and the use of std::thread, std::packaged_task, std::promise and std::future.

My program simply runs the code that assigns a color to the screen buffer inside a lambda function, which is passed as an argument to std::async. I then store an array of futures and keep checking whether all the threads have finished. No thread synchronization mechanism is needed, since the shared structures the threads use are read-only, and the only one they write to (the color buffer) is written at separate locations by each thread.
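
A minimal sketch of that std::async pattern (heavily simplified, with illustrative names rather than the demo's actual types) could look like this:

#include <cstdint>
#include <future>
#include <thread>
#include <vector>

struct Color { std::uint8_t r, g, b; };

// Stand-in for the real per-pixel work (primary ray, lighting, shadows);
// here it just produces a gradient so the sketch is self-contained.
Color tracePixel(int x, int y, int width, int height)
{
    return Color{ static_cast<std::uint8_t>(255 * x / width),
                  static_cast<std::uint8_t>(255 * y / height),
                  64 };
}

// framebuffer must hold width * height pixels.
void renderMultithreaded(std::vector<Color>& framebuffer, int width, int height)
{
    // 2 rows of tiles and half as many columns as the CPU has hardware threads.
    const int cores = static_cast<int>(std::thread::hardware_concurrency());
    const int rows  = 2;
    const int cols  = cores > 1 ? cores / 2 : 1;
    const int tileW = width  / cols;
    const int tileH = height / rows;

    std::vector<std::future<void>> tasks;

    for (int ty = 0; ty < rows; ++ty)
    {
        for (int tx = 0; tx < cols; ++tx)
        {
            const int x0 = tx * tileW, x1 = (tx == cols - 1) ? width  : x0 + tileW;
            const int y0 = ty * tileH, y1 = (ty == rows - 1) ? height : y0 + tileH;

            // Each task writes only its own pixels, so no locking is needed.
            tasks.push_back(std::async(std::launch::async,
                [&framebuffer, x0, x1, y0, y1, width, height]()
                {
                    for (int y = y0; y < y1; ++y)
                        for (int x = x0; x < x1; ++x)
                            framebuffer[y * width + x] = tracePixel(x, y, width, height);
                }));
        }
    }

    // Wait for every tile to finish before using the framebuffer.
    for (auto& f : tasks)
        f.wait();
}

Because the scene data is read-only and every task writes to a disjoint range of the buffer, the futures are the only synchronization point.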
This is the code:

You can find the single-threaded version here
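
Just to illustrate the work-queue idea mentioned above (which the demo does not implement), here is a rough sketch of a pool of workers pulling small tiles until none are left; for brevity it uses an atomic tile counter instead of std::packaged_task:

#include <algorithm>
#include <atomic>
#include <thread>
#include <vector>

// Hypothetical helper: renders the pixels belonging to one small tile.
void renderTile(int tileIndex)
{
    (void)tileIndex; // cast the rays for this tile here
}

void renderWithWorkerPool(int tileCount)
{
    std::atomic<int> nextTile(0);
    const unsigned workers = std::max(1u, std::thread::hardware_concurrency());

    std::vector<std::thread> pool;
    for (unsigned i = 0; i < workers; ++i)
    {
        // Each worker keeps grabbing the next unprocessed tile, so cheap and
        // expensive tiles balance out across threads automatically.
        pool.emplace_back([&nextTile, tileCount]()
        {
            for (int t = nextTile.fetch_add(1); t < tileCount; t = nextTile.fetch_add(1))
                renderTile(t);
        });
    }

    for (auto& w : pool)
        w.join();
}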

I deliberately used only POD structures with a data-oriented approach, so don't criticize my class design.
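
For a rough idea of what that means here (again, the names are made up for illustration), the primitives can be plain aggregates and the scene little more than flat arrays of them:

#include <vector>

// Plain-old-data primitives: no constructors, no virtual functions,
// just the fields the ray tests and the shading need.
struct Vec3   { float x, y, z; };
struct Plane  { Vec3 normal;   float d;      Vec3 color; };
struct Sphere { Vec3 center;   float radius; Vec3 color; };
struct Light  { Vec3 position; Vec3 color; };

// The scene is read-only during rendering, so every thread can share it freely.
struct Scene
{
    std::vector<Plane>  planes;
    std::vector<Sphere> spheres;
    std::vector<Light>  lights;
};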
How does it perform?
Creating a thread has its cost, and the speed also depends on how the code is written and on the CPU architecture. Even if you don't have race conditions or synchronization issues, your algorithm could spend most of its time fetching and writing data from RAM, which is very slow compared to the cache.
This is a histogram of the performance of the single-threaded vs multithreaded algorithm, tested on my CPU, an AMD FX-8320 (8 cores).
When the amount of data to process becomes significant, we reach a peak of 5x the speed of the single-threaded version, not bad at all!


