Project Idea:
Mandelbrot fractal using CUDA and C++.
See the project on GitHub at the GitHub repo.
Problem Statement:
Using CUDA with the C++ API, I want to draw a Mandelbrot fractal and display it to the user in the easiest way possible.
- Use CUDA
- Use C++
- Use X11 to draw, since Linux is the easiest dev environment (sketched below).
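For the display side, here is a minimal sketch of what pushing the finished iteration buffer to an X11 window could look like. This is an illustration only, not the project's actual drawing code: it assumes a 32-bit ZPixmap/TrueColor visual, and the function name show_buffer and the grayscale mapping are invented for the sketch.

// Hypothetical display sketch (not the project's actual X11 code).
// Assumes a 32-bit ZPixmap / TrueColor visual; maps iteration counts to grayscale.
#include <X11/Xlib.h>
#include <X11/Xutil.h>
#include <cstdint>
#include <cstdlib>

void show_buffer(const int* mbr, int width, int height, int iter_max) {
    Display* dpy = XOpenDisplay(nullptr);
    if(!dpy) return;
    int screen = DefaultScreen(dpy);
    Window win = XCreateSimpleWindow(dpy, DefaultRootWindow(dpy), 0, 0, width, height,
                                     0, BlackPixel(dpy, screen), BlackPixel(dpy, screen));
    XSelectInput(dpy, win, ExposureMask | KeyPressMask);
    XMapWindow(dpy, win);

    // Pack one 32-bit BGRX pixel per iteration count.
    char* pixels = static_cast<char*>(std::malloc(static_cast<size_t>(width) * height * 4));
    for(long i = 0; i < static_cast<long>(width) * height; i++) {
        char v = static_cast<char>(255L * mbr[i] / iter_max);
        pixels[4 * i + 0] = v;   // blue
        pixels[4 * i + 1] = v;   // green
        pixels[4 * i + 2] = v;   // red
        pixels[4 * i + 3] = 0;   // unused
    }
    XImage* img = XCreateImage(dpy, DefaultVisual(dpy, screen), DefaultDepth(dpy, screen),
                               ZPixmap, 0, pixels, width, height, 32, 0);

    // Redraw on Expose, quit on any key press.
    XEvent ev;
    for(;;) {
        XNextEvent(dpy, &ev);
        if(ev.type == Expose)
            XPutImage(dpy, win, DefaultGC(dpy, screen), img, 0, 0, 0, 0, width, height);
        if(ev.type == KeyPress) break;
    }
    XDestroyImage(img);   // also frees `pixels`
    XCloseDisplay(dpy);
}

Building something like this just needs linking against X11 (e.g. -lX11); the bit-depth and byte-order assumptions would have to be checked against the actual visual in a real program.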
Results:
The code can be seen at the previous link or in the C++ file; the offending portions for the single- and multithreaded code are:
__global__ void mandelbrot_single_thread_gpu(int n, int* mbr) {
    for(int i = 0; i < n; i++) {
        // Index from the top left, using row-major order/x-major order
        uint32_t x_index = (i % WIDTH);
        uint32_t y_index = (i / WIDTH);
        float calculated_position_x = (x_index / (0.5 * WIDTH)) - 1.5;
        float calculated_position_y = (y_index / (0.5 * HEIGHT)) - 1.0;
        cuFloatComplex z_0 = make_cuFloatComplex(calculated_position_x, calculated_position_y);
        cuFloatComplex z_n = make_cuFloatComplex(calculated_position_x, calculated_position_y);
        int m_i = 0;
        for(m_i = 0; m_i < ITER_MAX; m_i++) {
            z_n = cuCaddf(cuCmulf(z_n, z_n), z_0);
            if(cuCabsf(z_n) > 2.0) {
                break;
            }
        }
        mbr[i] = m_i;
    }
}

__global__ void
mandelbrot_multi_thread_gpu(int n, int* mbr, uint16_t thread_blocks, uint16_t threads) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int section_size = (n) / (thread_blocks * threads);
    for(int i = tid * section_size; i < (tid + 1) * section_size; i++) {
        // Index from the top left, using row-major order/x-major order
        uint32_t x_index = (i % WIDTH);
        uint32_t y_index = (i / WIDTH);
        float calculated_position_x = (x_index / (0.5 * WIDTH)) - 1.5;
        float calculated_position_y = (y_index / (0.5 * HEIGHT)) - 1.0;
        cuFloatComplex z_0 = make_cuFloatComplex(calculated_position_x, calculated_position_y);
        cuFloatComplex z_n = make_cuFloatComplex(calculated_position_x, calculated_position_y);
        int m_i = 0;
        for(m_i = 0; m_i < ITER_MAX; m_i++) {
            z_n = cuCaddf(cuCmulf(z_n, z_n), z_0);
            if(cuCabsf(z_n) > 2.0) {
                break;
            }
        }
        mbr[i] = m_i;
    }
}

// ...

int main(void) {
    // ...
    mandelbrot_single_thread_gpu<<<1, 1>>>(N, mbr);

    uint16_t thread_blocks = 4;
    uint16_t threads = 256;
    mandelbrot_multi_thread_gpu<<<thread_blocks, threads>>>(N, mbr, thread_blocks, threads);
    // ...
}
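The host-side setup is elided in the listing above. For context, here is a minimal sketch of what it might look like; the only CUDA calls shown are ones that also appear in the nvprof output below (cudaMallocManaged, cudaDeviceSynchronize), and everything else is an assumption.

// Hypothetical host-side setup, NOT the project's actual main(); it only fills in
// enough context around the kernel launches above. Assumes WIDTH, HEIGHT and
// ITER_MAX are #defined alongside the kernels (the results below use 2048x2048).
#include <cuda_runtime.h>
#include <cstdint>

int main(void) {
    const int N = WIDTH * HEIGHT;               // one entry per pixel
    int* mbr = nullptr;
    cudaMallocManaged(&mbr, N * sizeof(int));   // unified memory, shows up in the nvprof output

    uint16_t thread_blocks = 4;
    uint16_t threads = 256;
    mandelbrot_multi_thread_gpu<<<thread_blocks, threads>>>(N, mbr, thread_blocks, threads);
    cudaDeviceSynchronize();                    // wait for the kernel before reading mbr on the host

    // ... hand mbr (one iteration count per pixel) to the drawing code ...

    cudaFree(mbr);
    return 0;
}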
And the most salient section of the multithreaded code where I actually break up the work is:
__global__ void
mandelbrot_multi_thread_gpu(int n, int* mbr, uint16_t thread_blocks, uint16_t threads) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int section_size = (n) / (thread_blocks * threads);
    for(int i = tid * section_size; i < (tid + 1) * section_size; i++) {
        // Mandelbrot logic
        ...
    }
}
The snippet above is where the work is divvied up between the threads: each thread takes a contiguous section of the pixel buffer based on its thread index. This looks like a number of horizontal stripes along the entire render.
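To make the striping concrete with the numbers used in the rest of the post (a 2048x2048 grid and $4 \times 256 = 1024$ threads): section_size $= (2048 \times 2048) / 1024 = 4096$ pixels per thread, which is exactly two full rows of 2048 pixels, so each stripe is two pixels tall.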
The rendered result:
On a 2048x2048 px grid, using the basic single-threaded approach on the GPU:
==42795== Profiling application: ./build/mandelFrac
==42795== Profiling result:
            Type  Time(%)      Time  Calls       Avg       Min       Max  Name
 GPU activities:  100.00%  208.700s      1  208.700s  208.700s  208.700s  mandelbrot_single_thread_gpu(int, int*)
       API calls:   99.75%  208.701s      1  208.701s  208.701s  208.701s  cudaDeviceSynchronize
                     0.25%  519.95ms      1  519.95ms  519.95ms  519.95ms  cudaMallocManaged
...
Now for the multithreaded version with 4 thread blocks and 256 threads.
==43499== Profiling application: ./build/mandelFrac
==43499== Profiling result:
            Type  Time(%)      Time  Calls       Avg       Min       Max  Name
 GPU activities:  100.00%  371.59ms      1  371.59ms  371.59ms  371.59ms  mandelbrot_multi_thread_gpu(int, int*, unsigned short, unsigned short)
       API calls:   56.89%  493.62ms      1  493.62ms  493.62ms  493.62ms  cudaMallocManaged
                    42.84%  371.70ms      1  371.70ms  371.70ms  371.70ms  cudaDeviceSynchronize
...
The speedup is wild. I get a 56,160% speedup!!
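That figure falls straight out of the two kernel times above: $208.700\,\text{s} / 371.59\,\text{ms} \approx 561.6$, which is where the roughly 56,160% comes from.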
Work Division:
Since I decided to divide the work up in this way, we can also have a look at how it was divided by rendering out the thread ID number in some color scheme.
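One straightforward way to produce such a render (a sketch of the idea; the kernel name is made up and this is not necessarily how the actual image was generated) is a variant of the kernel above that writes the thread ID instead of the iteration count:

__global__ void
thread_id_visualization_gpu(int n, int* mbr, uint16_t thread_blocks, uint16_t threads) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int section_size = (n) / (thread_blocks * threads);
    for(int i = tid * section_size; i < (tid + 1) * section_size; i++) {
        // Record which thread owned this pixel; mapping the ID to a color
        // (e.g. tid % 256 for an 8-bit channel) can happen on the host.
        mbr[i] = tid;
    }
}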
In an exaggerated form, with fewer workers/threads, this would look like:
Whereas the actual work division tested looked more like:
The number of workers is not fully representable here, as this is an 8-bit image (only $2^8 = 256$ distinct values) and I had $4 \times 256 = 1024$ thread IDs.