Sunday, 25 January 2015

Experiment #2 : c++11 multithreaded raytracer

Hello everyone! A few weeks ago I got a read at The C++ Programming Language 4th Edition in order to refresh my knowledge about c++11 , since I'm not using it at work. I was particularly curious about the built-in threading support. I thought that features like packaged task and lambda function, plus applying concept from functional programming paradigm should make relatively easy to make truly cross platform multi-thraded applications. Of course you have less control over it, thing such as lightweight thread or mutex and even set affinity or thread pools, so If you are looking for maximum performance probably they are not optimal. Keep in mind that most of efficiency of a multithreaded application depends on how you design your application...less read-write data is shared across thread means less lock/mutex and less waiting time for you thread to acquire the resource it need. I decided to make a little multithreaded test application and since I've a passion for graphic programmer I decided to go for a multithreaded simple raytracer application, also because is a kind of problem that can be easily it features the following:
  • colored planes and sphere
  • multiple coloured lights
  • shadows
That's it! no specular, no reflections, no texure and no anti-aliasig. They are actually very easy to implement and maybe I will extend the demo in future, but for now I preferred to keep the code lines at minimum in order to be better readable in this post. The ray test routines are word of wisdom from real time collision detection by Christer Ericson.
our simple raytracer output

At first, I implemented a single thead version. To convert in multithread I had literally to had 20 lines more of  code. basically I had to decide what to feed to each thread, my option was to divide the screen in in 2 rows and and a number of columns equal to the half of the number of cores the host cpu has and to feed each portion of screen to a thread. This is not optimal since there can be portions of the screen that are "easy" to compute while other ones require more work. So our bottleneck here is the thread that have to do more raycast than others. A better strategy would be having a pool of thread, and assign each of them a job a fairly little portion of the screen that our main thread will keep feeding them until they are finished. This would have required more work and the use of std::thread , std::packaged_task and std::promise and std::future. My program simply run the code assign a color to the screen buffer in a lambda function, which is given as argument to std::async. We then store an array of futures and we keep  checking whether or not all threads are finished. We don't need any thread synchronization mechanism since most shared structure the thread uses are read only and the only one you write to (the color buffer) is written in separate location by each thread. 
this is the code:
#include <memory>
#include <stdio.h>
#include <math.h>
#include <vector>
#include <algorithm>
#include <chrono>
#include <thread>
#include <future>
// some constant
static const int img_width = 3840; //4k image
static const int img_height = 2160;
const int max_scene_size = 16;
static const float eps = 0.00001f;
// basic vector math stuff
union Vec3
{
float data[3];
struct
{
float x, y, z;
};
float lenght() const
{
return sqrt(x*x + y*y + z*z);
}
Vec3 operator/(float f)const
{
return Vec3{ x / f, y / f, z / f };
}
Vec3 operator*(float f)const
{
return Vec3{ x * f, y * f, z * f };
}
Vec3 operator+(const Vec3 &other)const
{
return Vec3{ x + other.x, y + other.y, z + other.z };
}
Vec3 operator-(const Vec3 &other)const
{
return Vec3{ x - other.x, y - other.y, z - other.z };
}
Vec3 operator-()const
{
return Vec3{ -x, -y, -z };
}
Vec3 operator*(const Vec3 &other)const
{
return Vec3{ x * other.x, y * other.y, z * other.z };
}
Vec3 normalize() const
{
return *this / lenght();
}
float dot(const Vec3 &v) const
{
return (x * v.x + y * v.y + z * v.z);
}
Vec3 average(const Vec3 &v) const
{
return{ (x + v.x) * 0.5f, (y + v.y) * 0.5f, (z + v.z) * 0.5f };
}
Vec3 cross(const Vec3 &v)const
{
//xyzzy
Vec3 res;
res.x = (y * v.z) - (z * v.y);
res.y = (z * v.x) - (x * v.z);
res.z = (x * v.y) - (y * v.x);
return res;
}
// todo this really make sense only for color
Vec3 saturate()const
{
float sum = x + y + z;
float excess = sum - 3.0f;
Vec3 res = { x, y, z };
if (excess > 0.0f)
{
res.x = x + excess * (x / sum);
res.y = y + excess * (y / sum);
res.z = z + excess * (z / sum);
}
if (res.x > 1.0) res.x = 1.0f;
if (res.y > 1.0) res.y = 1.0f;
if (res.z > 1.0) res.z = 1.0f;
return res;
}
};
struct Ray
{
Vec3 origin;
Vec3 direction;
};
struct Camera
{
Vec3 position;
Vec3 direction;
Vec3 right;
Vec3 up;
};
// a color is similar to a vec3
using Color = Vec3;
template <int SIZE>
struct Light
{
Ray light_world[SIZE];
Color color[SIZE];
int current_size;
Light() : current_size{ 0 }{}
void add_light(const Ray& pos, const Color& col)
{
if (current_size < SIZE)
{
light_world[current_size] = pos;
color[current_size] = col;
++current_size;
}
}
};
union ColorBMP
{
unsigned char data[3];
struct
{
unsigned char r, g, b;
};
};
//data oriented structure
template <int SIZE>
struct Objects
{
enum object_types
{
SPHERE,
PLANE,
};
Vec3 center[SIZE];
Color color[SIZE];
float radius[SIZE];
object_types types[SIZE];
int current_size ;
Objects()
:current_size{0}{}
void add_sphere(const Vec3 &cen, const Color &col, float rad)
{
if (current_size < SIZE)
{
center[current_size] = cen;
color[current_size] = col;
radius[current_size] = rad;
types[current_size] = SPHERE;
++current_size;
}
}
void add_Plane(const Vec3 &normal, const Color &col, float distance)
{
if (current_size < SIZE)
{
center[current_size] = normal;
color[current_size] = col;
radius[current_size] = distance;
types[current_size] = PLANE;
++current_size;
}
}
Color GetColor(int idx) const
{
return color[idx];
}
Vec3 GetNormal(const Vec3 &vec, int idx) const
{
if (types[idx] == SPHERE)
{
return (vec - center[idx]).normalize();
}
else if(types[idx] == PLANE)
{
return center[idx];
}
return {.0f,.0f,.0f};
}
float ray_intersection(const Ray & ray, int idx) const
{
if (types[idx] == SPHERE)
{
Vec3 m(ray.origin - center[idx]);
float b = (m).dot(ray.direction);
float c = m.dot(m) - radius[idx] * radius[idx];
if (c > 0.0f && b > 0.0f)
{
return -1.0;
}
float discr = b*b - c;
if (discr < 0.0)
{
return -1.0;
}
float t = -b - sqrt(discr);
if (t < 0.0f)
{
return -1.0f;
}
return t;
}
else if (types[idx] == PLANE)
{
float t = (radius[idx] - center[idx].dot(ray.origin)) / center[idx].dot(ray.direction);
if (t >= 0.0)
{
return t;
}
return -1.0;
}
return 0.0f;
}
};
//bmp stuff
unsigned char header[54];
void InitHeader(int w, int h)
{
// todo check paddign and if the way to write these bites is platform dependend.
unsigned int size = (w * h * 3) + sizeof(header);
unsigned int sizedata = (w * h * 3);
header[0] = 'B';
header[1] = 'M';
header[10] = 54;
unsigned char *pos = &header[2];
unsigned char *info = &header[14];
*info = 40;
//24bpp
info[12] = 1;
info[14] = 24;
info += 4;
// write the size header
pos[0] = (unsigned char)(size);
pos[1] = (unsigned char)(size >> 8);
pos[2] = (unsigned char)(size >> 16);
pos[3] = (unsigned char)(size >> 24);
pos += sizeof(int);
// write the h and h in the info header
info[0] = (unsigned char)(w);
info[1] = (unsigned char)(w >> 8);
info[2] = (unsigned char)(w >> 16);
info[3] = (unsigned char)(w >> 24);
info += sizeof(int);
info[0] = (unsigned char)(h);
info[1] = (unsigned char)(h >> 8);
info[2] = (unsigned char)(h >> 16);
info[3] = (unsigned char)(h >> 24);
info += sizeof(int);
info += 8;
info[0] = (unsigned char)(sizedata);
info[1] = (unsigned char)(sizedata >> 8);
info[2] = (unsigned char)(sizedata >> 16);
info[3] = (unsigned char)(sizedata >> 24);
info += sizeof(int);
unsigned int ppm = 72;
info[0] = (unsigned char)(ppm);
info[1] = (unsigned char)(ppm >> 8);
info[2] = (unsigned char)(ppm >> 16);
info[3] = (unsigned char)(ppm >> 24);
info += sizeof(ppm);
info[0] = (unsigned char)(ppm);
info[1] = (unsigned char)(ppm >> 8);
info[2] = (unsigned char)(ppm >> 16);
info[3] = (unsigned char)(ppm >> 24);
info += sizeof(ppm);
}
void writebmp(const char * filename, ColorBMP *data, int w, int h)
{
FILE *f = fopen(filename, "wb");
InitHeader(w, h);
fwrite(header, sizeof(header), 1, f);
fwrite(data, w*h*sizeof(ColorBMP), 1, f);
fclose(f);
}
Color compute_color(const Vec3 &intersection, int closest, const Light<max_scene_size> &lights, const Objects<max_scene_size> &objects, std::vector<std::pair<float, int>> &intersections, float ambientfactor)
{
Vec3 normal = objects.GetNormal(intersection,closest);
Color color = objects.GetColor(closest);
intersections.clear();
Color finalcolor = color * ambientfactor;
for ( int i = 0; i < lights.current_size; ++i)
{
Vec3 light_dir = (lights.light_world[i].origin - intersection).normalize();
float dot = light_dir.dot(normal);
if (dot > 0.0f)
{
bool shadow = false;
Vec3 ligth_distance = (lights.light_world[i].origin - intersection);
float distance = ligth_distance.lenght();
//fire a ray from intersection position to the light to check to check if the object is shadowed
Ray shadow_ray = { intersection, (lights.light_world[i].origin - intersection).normalize() };
for ( int j = 0; j < objects.current_size; ++j)
{
intersections.push_back(std::pair<float, int>(objects.ray_intersection(shadow_ray,j), j));
}
for (unsigned int k = 0; k < intersections.size(); ++k)
{
if (intersections[k].first> eps && intersections[k].first <= distance)
{
shadow = true;
break;
}
}
if (!shadow)
{
finalcolor = finalcolor + ((color * lights.color[i]) * dot);
}
}
}
return finalcolor.saturate();
}
// do the actual job
void process_image(int w, int h)
{
ColorBMP* data = new ColorBMP[w* h];
memset(data, 0, w* h * sizeof(ColorBMP));
Objects<max_scene_size> object_container;
Light<max_scene_size> light_container;
Vec3 pos = { 6.0f, 1.5f, -3.0f };
Vec3 look = { .0f, .0f, .0f };
Vec3 Y = { .0f, 1.0f, .0f };
Vec3 camdir = (look - pos).normalize();
Vec3 camright = Y.cross(camdir).normalize();
Vec3 camup = camright.cross(camdir).normalize();
//we have a camera
Camera camera = { pos, camdir, camright, camup };
//add a light
light_container.add_light({ { -7.0f, 10.0f, -10.0f }, { 0.0f, 0.0f, 0.0f } }, { 1.0f, 1.0f, 1.0f });
//add a couple of sphered and a plane
object_container.add_sphere({ 0.0f, 0.0f, 0.0f }, { 0.0f, 1.0f, 0.0f }, 1.0f);
object_container.add_sphere({ 3.0f, -0.5f, 0.0f }, { 1.0f, 1.0f, 0.0f }, 0.5f);
object_container.add_Plane(Y, { 0.0f, 0.0f, 1.0f }, { -1.0f });
std::chrono::time_point<std::chrono::system_clock> start, end;
start = std::chrono::system_clock::now();
float aspect = (static_cast<float>(w) / static_cast<float>(h));
int numthread = std::max<int>(2, std::thread::hardware_concurrency());
// must be even
numthread % 2 == 0 ? numthread : ++numthread;
std::vector< std::future<void> > futures;
futures.reserve(numthread);
// split w and h
int splitw = w / (numthread/2);
int splith = h/2;
int start_x = 0;
int start_y = 0;
int end_x = splitw;
int end_y = splith;
for (int thread_idx = 0; thread_idx < numthread; ++thread_idx)
{
//spawn numthread threads with labmdas
std::future<void> threads = std::async(std::launch::async, [=, &light_container, &object_container, &data](){
std::vector<std::pair<float, int>>intersections;
intersections.reserve(1024);
float xinc, yincr;
for (int i = start_x; i < end_x; ++i)
{
for (int j = start_y; j < end_y; ++j)
{
int pixelpos = i + j * w;
// decide how to offset the camera ray according to the screen ratio
if (aspect > 1.0)
{
xinc = ((i + 0.5f) / w) * aspect - (((w - h) / static_cast<float>(h)) *0.5f);
yincr = ((h - j) + 0.5f) / h;
}
else if (aspect < 1.0)
{
xinc = (i + 0.5f) / w;
yincr = ((h - j + 0.5f) / h) / aspect - (((h - w) / static_cast<float>(w)) *0.5f);
}
else
{
xinc = (i + 0.5f) / w;
yincr = ((h - j + 0.5f) / h);
}
// starts to fire rays
Vec3 camera_pos = camera.position;
Vec3 camera_dir = (camera.direction + (camera.right * (xinc - 0.5f)) + (camup *(yincr - 0.5f))).normalize();
Ray camera_ray = { camera_pos, camera_dir };
// store the intersection between the camera and all the objects in scene.
for ( int k = 0; k < object_container.current_size; ++k)
{
float intersection = object_container.ray_intersection(camera_ray,k);
if (intersection > 0.0f)
intersections.push_back(std::pair<float, int>(intersection, k));
}
// find the nearest
std::sort(intersections.begin(), intersections.end(), [](std::pair<float, int> a, std::pair<float, int> b){return b.first > a.first; });
if (!intersections.empty())
{
// get the intersection point
Vec3 intersection = camera_pos + (camera_dir * intersections[0].first);
Color color = compute_color(intersection, intersections[0].second, light_container, object_container, intersections, 0.2f);
data[pixelpos].r = (unsigned char)(color.x * 255);
data[pixelpos].g = (unsigned char)(color.y * 255);
data[pixelpos].b = (unsigned char)(color.z * 255);
}
intersections.clear();
}
}
});
futures.push_back(std::move(threads));
start_x = start_x + splitw;
end_x = end_x + splitw;
if (end_x > w)
{
start_x = 0;
end_x = splitw;
start_y = start_y + splith;
end_y = start_y + splith;
}
}
// wait till all the threads are done and wait 2 ms
while (!futures.empty())
{
if (futures.back().wait_for(std::chrono::milliseconds(2)) == std::future_status::ready)
{
futures.pop_back();
}
}
end = std::chrono::system_clock::now();
std::chrono::duration<double> elapsed_seconds = end - start;
printf("rendering took %fs\n", elapsed_seconds);
writebmp("test.bmp", data, img_width, img_height);
delete[] data;
system("pause");
}
void draw_scene()
{
process_image(img_width, img_height);
}
int main()
{
draw_scene();
return 0;
}

you can find instead the single thread version here

I deliberately used only POD structure with a data oriented approach, so don't criticize my class design.
How does it perform?
Creating a thread have its cost, also the speed depends on how the code is written and on the CPU architecture. Even if you don't have race condition an synchronization issues, your algorithm could spent most of the time fetching and writing data from RAM that is very slow compared to cache.
This is an histogram of the perfomance of single thread vs multithread algorithm tested on my cpu, an AMD FX-8320, an 8 core cpu.
when the data to process became significant we reach a peak of 5x faster than a a single threaded option, not bad at all!



Monday, 12 January 2015

Experiment #1: simple c++/html5 cross platform 2d engine

Some time ago, when I read that epic ported citadel demo on html5 , I decided to have an in depth  look on how they did that and give it a try. The company I was working for, playerthree  make also html5 web enabled games, so for me the possibility to write a game for the web and for native mobile device in c++ seems just a dream.
Epic were using emscripten, a cross compiler from c++ to javascript  that promise that everything that can be compiled by CLANG, it can be ported in javascript. In fact the main principle behind this tool is that bitcode generate by llvm can be easily translate  in some sort of assembler virtual machine running on javascript. In particular , it uses a javascript language subset known as asmjs. It sounds like this solution would bloat a lot of code by translate object  files in some sort of ASCII version of compiled assembly code but turn out that since asmjs can be easily mapped to machine code, javascript engine can perform ahead of time compilation (AOT) and gain a massive performance gain. The asmjs guys declare that there is only a 2x performance slowdown when compared to  native code on some javascript engine.
The whole toolbox you need for emscripten can be directly installed and configured using EMSDK . Once installed you only need to recompile your project using emcc command, the same way you'd compile your application with a normal compiler.
So, back to us I decided to write a small 2d engine to see if I was able to compile with emscripten, it support the following features:
  • Sprite batching
  • Texture atlas (Texture Packer and glideros format)
  • Interpolators
  • Timed callbacks
  • font support (glideros format)
  • touch input
  • OpenGL 2.0
  • basic memory management 
  • basic containers
  • audio (only on windows at the moment)
it support windows, iOS and Android, plus html thanks to emscripten, it doesn't depend on many library, just glfw for rendering window management and input on windows side and libpng.
On top of that I did a game, during a gamejam organized by playerthree to test if a full game could be done... and it worked. The web version even run on iOS8 and chrome mobile browsers!




in "the only way is up" you have to make your way from the bottom of the ocean to the surface. Simply move your mouse/finger to make the character strafe. You can play the web version here.
All the game design, graphics and audio were made by Gravy,  playerthree creative director and co-founder.  
If you are interested, I've made the engine and the game code available on github. Apology for the game code since is something made during an 8 hour long jam, and if you have any question , feel free to ask.

Sunday, 4 January 2015

Greetings

Hello everyone and welcome to Dr. Glitschen's blog. 
Please allow me to introduce myself: I'm a professional video game programmer living in UK for almost 4 years now. English is not my native language so I hope you will excuse me if sometimes I make some mistakes. 
My principal interest are new technologies, programming, playing videogames and board games, cooking, sleep and 80's stuff in general. Sometimes I'm lazy. The idea to have a blog came to me at least a couple of years ago, but as I just told you, I'm lazy :D. With the beginning of 2015 among with a new job, I thought that this was the right moment to do it and , of course , I'm already a few days late :D I will talk mainly about programming ideas, graphic programming and c++ in general, but I don't exclude it will evolve in something different in future and I'll try to post often and keep the blog updated. Hope you will enjoy the ride and you will participate numerously and actively in my post. Oh, I almost forgot... happy 2015!