Project Idea:
Mandelbrot fractal using CUDA and C++.
See the project on GitHub at the GitHub repo.
Problem Statement:
Using CUDA with the C++ API, I want to draw a Mandelbrot fractal and display it to the user in the easiest way possible.
- Use CUDA
- Use C++
- Use X11 to draw, since Linux is the easiest dev environment (sketched below).
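For the display side, here is a minimal sketch of what pushing the finished iteration buffer to an X11 window could look like. This is an illustration only, not the project's actual drawing code: it assumes a 32-bit ZPixmap/TrueColor visual, and the function name show_buffer and the grayscale mapping are invented for the sketch.

// Hypothetical display sketch (not the project's actual X11 code).
// Assumes a 32-bit ZPixmap / TrueColor visual; maps iteration counts to grayscale.
#include <X11/Xlib.h>
#include <X11/Xutil.h>
#include <cstdint>
#include <cstdlib>

void show_buffer(const int* mbr, int width, int height, int iter_max) {
    Display* dpy = XOpenDisplay(nullptr);
    if(!dpy) return;
    int screen = DefaultScreen(dpy);
    Window win = XCreateSimpleWindow(dpy, DefaultRootWindow(dpy), 0, 0, width, height,
                                     0, BlackPixel(dpy, screen), BlackPixel(dpy, screen));
    XSelectInput(dpy, win, ExposureMask | KeyPressMask);
    XMapWindow(dpy, win);

    // Pack one 32-bit BGRX pixel per iteration count.
    char* pixels = static_cast<char*>(std::malloc(static_cast<size_t>(width) * height * 4));
    for(long i = 0; i < static_cast<long>(width) * height; i++) {
        char v = static_cast<char>(255L * mbr[i] / iter_max);
        pixels[4 * i + 0] = v;   // blue
        pixels[4 * i + 1] = v;   // green
        pixels[4 * i + 2] = v;   // red
        pixels[4 * i + 3] = 0;   // unused
    }
    XImage* img = XCreateImage(dpy, DefaultVisual(dpy, screen), DefaultDepth(dpy, screen),
                               ZPixmap, 0, pixels, width, height, 32, 0);

    // Redraw on Expose, quit on any key press.
    XEvent ev;
    for(;;) {
        XNextEvent(dpy, &ev);
        if(ev.type == Expose)
            XPutImage(dpy, win, DefaultGC(dpy, screen), img, 0, 0, 0, 0, width, height);
        if(ev.type == KeyPress) break;
    }
    XDestroyImage(img);   // also frees `pixels`
    XCloseDisplay(dpy);
}

Building something like this just needs linking against X11 (e.g. -lX11); the bit-depth and byte-order assumptions would have to be checked against the actual visual in a real program.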
Results:
The code can be seen at the previous link or in the C++ file; the offending portions for the single- and multithreaded code are:
__global__ void mandelbrot_single_thread_gpu(int n, int* mbr) {
    for(int i = 0; i < n; i++) {
        // Index from the top left, using row-major order/x-major order
        uint32_t x_index = (i % WIDTH);
        uint32_t y_index = (i / WIDTH);
        float calculated_position_x = (x_index / (0.5 * WIDTH)) - 1.5;
        float calculated_position_y = (y_index / (0.5 * HEIGHT)) - 1.0;
        cuFloatComplex z_0 = make_cuFloatComplex(calculated_position_x, calculated_position_y);
        cuFloatComplex z_n = make_cuFloatComplex(calculated_position_x, calculated_position_y);
        int m_i = 0;
        for(m_i = 0; m_i < ITER_MAX; m_i++) {
            z_n = cuCaddf(cuCmulf(z_n, z_n), z_0);
            if(cuCabsf(z_n) > 2.0) {
                break;
            }
        }
        mbr[i] = m_i;
    }
}

__global__ void
mandelbrot_multi_thread_gpu(int n, int* mbr, uint16_t thread_blocks, uint16_t threads) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int section_size = (n) / (thread_blocks * threads);
    for(int i = tid * section_size; i < (tid + 1) * section_size; i++) {
        // Index from the top left, using row-major order/x-major order
        uint32_t x_index = (i % WIDTH);
        uint32_t y_index = (i / WIDTH);
        float calculated_position_x = (x_index / (0.5 * WIDTH)) - 1.5;
        float calculated_position_y = (y_index / (0.5 * HEIGHT)) - 1.0;
        cuFloatComplex z_0 = make_cuFloatComplex(calculated_position_x, calculated_position_y);
        cuFloatComplex z_n = make_cuFloatComplex(calculated_position_x, calculated_position_y);
        int m_i = 0;
        for(m_i = 0; m_i < ITER_MAX; m_i++) {
            z_n = cuCaddf(cuCmulf(z_n, z_n), z_0);
            if(cuCabsf(z_n) > 2.0) {
                break;
            }
        }
        mbr[i] = m_i;
    }
}

// ...

int main(void) {
    // ...
    mandelbrot_single_thread_gpu<<<1, 1>>>(N, mbr);

    uint16_t thread_blocks = 4;
    uint16_t threads = 256;
    mandelbrot_multi_thread_gpu<<<thread_blocks, threads>>>(N, mbr, thread_blocks, threads);
    // ...
}
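The host-side setup is elided in the listing above. For context, here is a minimal sketch of what it might look like; the only CUDA calls shown are ones that also appear in the nvprof output below (cudaMallocManaged, cudaDeviceSynchronize), and everything else is an assumption.

// Hypothetical host-side setup, NOT the project's actual main(); it only fills in
// enough context around the kernel launches above. Assumes WIDTH, HEIGHT and
// ITER_MAX are #defined alongside the kernels (the results below use 2048x2048).
#include <cuda_runtime.h>
#include <cstdint>

int main(void) {
    const int N = WIDTH * HEIGHT;               // one entry per pixel
    int* mbr = nullptr;
    cudaMallocManaged(&mbr, N * sizeof(int));   // unified memory, shows up in the nvprof output

    uint16_t thread_blocks = 4;
    uint16_t threads = 256;
    mandelbrot_multi_thread_gpu<<<thread_blocks, threads>>>(N, mbr, thread_blocks, threads);
    cudaDeviceSynchronize();                    // wait for the kernel before reading mbr on the host

    // ... hand mbr (one iteration count per pixel) to the drawing code ...

    cudaFree(mbr);
    return 0;
}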
And the most salient section of the multithreaded code where I actually break up the work is:
__global__ void
mandelbrot_multi_thread_gpu(int n, int* mbr, uint16_t thread_blocks, uint16_t threads) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int section_size = (n) / (thread_blocks * threads);
    for(int i = tid * section_size; i < (tid + 1) * section_size; i++) {
        // Mandelbrot logic
        ...
    }
}
The snippet above is where the work is divvied up between the threads: each thread takes a contiguous section of the pixel buffer based on its thread index. This looks like a number of horizontal stripes along the entire render.
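To make the striping concrete with the numbers used in the rest of the post (a 2048x2048 grid and $4 \times 256 = 1024$ threads): section_size $= (2048 \times 2048) / 1024 = 4096$ pixels per thread, which is exactly two full rows of 2048 pixels, so each stripe is two pixels tall.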
The rendered result:
On a 2048x2048 px grid, using the basic single-threaded approach on the GPU:
==42795== Profiling application: ./build/mandelFrac
==42795== Profiling result:
            Type  Time(%)      Time  Calls       Avg       Min       Max  Name
 GPU activities:  100.00%  208.700s      1  208.700s  208.700s  208.700s  mandelbrot_single_thread_gpu(int, int*)
       API calls:   99.75%  208.701s      1  208.701s  208.701s  208.701s  cudaDeviceSynchronize
                     0.25%  519.95ms      1  519.95ms  519.95ms  519.95ms  cudaMallocManaged
...
Now for the multithreaded version with 4 thread blocks and 256 threads.
==43499== Profiling application: ./build/mandelFrac
==43499== Profiling result:
            Type  Time(%)      Time  Calls       Avg       Min       Max  Name
 GPU activities:  100.00%  371.59ms      1  371.59ms  371.59ms  371.59ms  mandelbrot_multi_thread_gpu(int, int*, unsigned short, unsigned short)
       API calls:   56.89%  493.62ms      1  493.62ms  493.62ms  493.62ms  cudaMallocManaged
                    42.84%  371.70ms      1  371.70ms  371.70ms  371.70ms  cudaDeviceSynchronize
...
The speedup is wild. I get a 56,160% speedup!!
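That figure falls straight out of the two kernel times above: $208.700\,\text{s} / 371.59\,\text{ms} \approx 561.6$, which is where the roughly 56,160% comes from.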
Work Division:
Since I decided to divide the work up in this way, we can also have a look at how it was divided by rendering out the thread ID number in some color scheme.
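One straightforward way to produce such a render (a sketch of the idea; the kernel name is made up and this is not necessarily how the actual image was generated) is a variant of the kernel above that writes the thread ID instead of the iteration count:

__global__ void
thread_id_visualization_gpu(int n, int* mbr, uint16_t thread_blocks, uint16_t threads) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int section_size = (n) / (thread_blocks * threads);
    for(int i = tid * section_size; i < (tid + 1) * section_size; i++) {
        // Record which thread owned this pixel; mapping the ID to a color
        // (e.g. tid % 256 for an 8-bit channel) can happen on the host.
        mbr[i] = tid;
    }
}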
In an exaggerated form, with fewer workers/threads, this would look like:
Whereas the actual work division tested looked more like:
The number of workers is not fully representable here, as this is an 8-bit image (only $2^8 = 256$ distinct values) and I had $4 \times 256 = 1024$ thread IDs.