
Vulkan is great. It provides a cross-platform API for writing applications that use the GPU for graphics and general-purpose compute. It was designed from the ground up to be a modern API, but using it can be quite difficult, so you had better know what you’re doing if you plan to use Vulkan for your application.

Vulkan provides both graphics and compute APIs. In this post I will focus on the compute part, as I’m still not very familiar with the graphics side.

I had a difficult time searching for simple Vulkan compute samples as the official Khronos Vulkan-Samples only include more elaborate examples. My goal was to write the minimal amount of code to get a compute shader running using Vulkan.

I found two very useful resources that do more or less what I wanted: a post by Neil and a sample by Slava.

Neil’s post is great because it goes straight into the code, but it uses the Vulkan C API, which is very verbose. I wanted to use vulkan.hpp, as it provides a nice C++ interface to Vulkan. The Vulkan C++ bindings are almost a one-to-one match with the C API, so in theory I could just port Neil’s code to the Vulkan C++ header, but I found that this task was not that simple. Slava’s sample does just that; however, as with most Vulkan samples out there, there is infrastructure around the code to make it easier to write, namely classes with a lot of methods that abstract some of the resource management required to use Vulkan.

Despite that, both links proved to be really useful when writing my own sample.

The goal is simple: run a compute shader that squares the numbers from an input buffer and stores the results in an output buffer, i.e., run the equivalent of the following code, but on the GPU, using Vulkan:

std::vector<int> Input, Output;
Output.resize(Input.size());
for (size_t I = 0; I < Input.size(); ++I)
{
    Output[I] = Input[I] * Input[I];
}

So let’s start writing this program.

Infrastructure

I’m assuming you have the Vulkan SDK installed on your machine. For this program I’m going to use CMake for the build and HLSL for the shader.

The following CMake file can be used to build the program. I chose to compile the shader to SPIR-V ahead of time to keep things simple on the C++ side. Using add_custom_command and add_custom_target you can create a CMake target that compiles the Square.hlsl shader when building the project.

cmake_minimum_required(VERSION 3.16)

project(VulkanCompute)

find_package(Vulkan REQUIRED)

add_custom_command(
    OUTPUT "${CMAKE_BINARY_DIR}/Square.spv"
    COMMAND $ENV{VK_SDK_PATH}/Bin/dxc -T cs_6_0 -E "Main" -spirv -fvk-use-dx-layout -fspv-target-env=vulkan1.1 -Fo "${CMAKE_BINARY_DIR}/Square.spv" "Square.hlsl"
    DEPENDS "Square.hlsl"
    WORKING_DIRECTORY ${CMAKE_CURRENT_SOURCE_DIR}
    COMMENT "Buiding Shaders"
)
add_custom_target(ComputeShader DEPENDS "${CMAKE_BINARY_DIR}/Square.spv")

add_executable(VulkanCompute "main.cpp")
target_link_libraries(VulkanCompute PRIVATE Vulkan::Vulkan)
add_dependencies(VulkanCompute ComputeShader)

Preamble

To use the Vulkan C++ header, one just needs to

#include <vulkan/vulkan.hpp>
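
The rest of the code also uses a few facilities from the standard library (std::vector, std::find_if, std::ifstream, std::cout), so if you are following along in a single main.cpp you will also want something like:

#include <algorithm>    // std::find_if
#include <fstream>      // std::ifstream
#include <iostream>     // std::cout
#include <iterator>     // std::distance
#include <vector>       // std::vector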

Vulkan Instance - vk::Instance

A Vulkan application starts with a vk::Instance, so let’s create one:

vk::ApplicationInfo AppInfo{
    "VulkanCompute",      // Application Name
    1,                    // Application Version
    nullptr,              // Engine Name or nullptr
    0,                    // Engine Version
    VK_API_VERSION_1_1    // Vulkan API version
};

const std::vector<const char*> Layers = { "VK_LAYER_KHRONOS_validation" };
vk::InstanceCreateInfo InstanceCreateInfo(vk::InstanceCreateFlags(), // Flags
                                          &AppInfo,                  // Application Info
                                          Layers,                    // Layers
                                          {});                       // Extensions
vk::Instance Instance = vk::createInstance(InstanceCreateInfo);

Here I’m enabling the VK_LAYER_KHRONOS_validation layer so we can have some help from Vulkan in case something goes wrong.
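
Instance creation will fail if the validation layer is not installed on the system (it ships with the Vulkan SDK), so a more defensive version would check for it first. Here is a minimal sketch of such a check, not part of the main sample, assuming <cstring> is available for std::strcmp:

std::vector<const char*> EnabledLayers;
for (const vk::LayerProperties& LayerProps : vk::enumerateInstanceLayerProperties())
{
    // layerName is a fixed-size char array wrapper; data() gives a C string.
    if (std::strcmp(LayerProps.layerName.data(), "VK_LAYER_KHRONOS_validation") == 0)
    {
        EnabledLayers.push_back("VK_LAYER_KHRONOS_validation");
        break;
    }
}
// EnabledLayers can then be passed to vk::InstanceCreateInfo instead of Layers.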

Enumerating the Physical Devices - vk::PhysicalDevice

A vk::PhysicalDevice represents, as the name suggests, the physical piece of hardware we can use to run our application. From the selected physical device we create a logical device, vk::Device, which is what we use to interact with it:

vk::PhysicalDevice PhysicalDevice = Instance.enumeratePhysicalDevices().front();
vk::PhysicalDeviceProperties DeviceProps = PhysicalDevice.getProperties();
std::cout << "Device Name    : " << DeviceProps.deviceName << std::endl;
const uint32_t ApiVersion = DeviceProps.apiVersion;
std::cout << "Vulkan Version : " << VK_VERSION_MAJOR(ApiVersion) << "." << VK_VERSION_MINOR(ApiVersion) << "." << VK_VERSION_PATCH(ApiVersion);
vk::PhysicalDeviceLimits DeviceLimits = DeviceProps.limits;
std::cout << "Max Compute Shared Memory Size: " << DeviceLimits.maxComputeSharedMemorySize / 1024 << " KB" << std::endl;

Here I’m just printing some information from the first physical device available in the machine.
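
If the machine has more than one device, taking the first one may hand you an integrated GPU when you wanted the discrete one. Here is a quick sketch of how the selection could be made more deliberate, preferring a discrete GPU over just calling front():

// Prefer a discrete GPU if one is available; otherwise fall back to the first device.
std::vector<vk::PhysicalDevice> Devices = Instance.enumeratePhysicalDevices();
auto DeviceIt = std::find_if(Devices.begin(), Devices.end(), [](const vk::PhysicalDevice& Dev)
{
    return Dev.getProperties().deviceType == vk::PhysicalDeviceType::eDiscreteGpu;
});
vk::PhysicalDevice PhysicalDevice = (DeviceIt != Devices.end()) ? *DeviceIt : Devices.front();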

Queue Family Index - vk::QueueFamilyProperties

We need a vk::Queue to which we can submit the work to be done by the device, hopefully a GPU in this case. Note that extra code is required to make sure the selected device is the one you want when more than one is available (see the sketch above).

Vulkan supports different types of queues, so we need to query which queue family we need to create a queue suitable for compute work:

std::vector<vk::QueueFamilyProperties> QueueFamilyProps = PhysicalDevice.getQueueFamilyProperties();
auto PropIt = std::find_if(QueueFamilyProps.begin(), QueueFamilyProps.end(), [](const vk::QueueFamilyProperties& Prop)
{
    return Prop.queueFlags & vk::QueueFlagBits::eCompute;
});
const uint32_t ComputeQueueFamilyIndex = std::distance(QueueFamilyProps.begin(), PropIt);
std::cout << "Compute Queue Family Index: " << ComputeQueueFamilyIndex << std::endl;

This selects a queue family that has compute capabilities. It is equivalent to searching for a queue family with the VK_QUEUE_COMPUTE_BIT flag set using the C API.
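
One caveat: if no queue family with compute support exists, PropIt will be QueueFamilyProps.end() and std::distance will silently produce an out-of-range index, so it is worth adding a small check, something like:

if (PropIt == QueueFamilyProps.end())
{
    std::cout << "No compute queue family found" << std::endl;
    return -1; // assuming this code lives directly in main()
}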

Vulkan Device - vk::Device

Creating a device requires a vk::DeviceQueueCreateInfo and a vk::DeviceCreateInfo:

// Vulkan requires a priority (between 0.0 and 1.0) for each queue we create.
const float QueuePriority = 1.0f;
vk::DeviceQueueCreateInfo DeviceQueueCreateInfo(vk::DeviceQueueCreateFlags(),   // Flags
                                                ComputeQueueFamilyIndex,        // Queue Family Index
                                                1,                              // Number of Queues
                                                &QueuePriority);                // Queue Priorities
vk::DeviceCreateInfo DeviceCreateInfo(vk::DeviceCreateFlags(),    // Flags
                                      DeviceQueueCreateInfo);     // Device Queue Create Info struct
vk::Device Device = PhysicalDevice.createDevice(DeviceCreateInfo);

Allocating Memory

Allocating memory in Vulkan is a pain. There are libraries that facilitate this task, like AMD’s Vulkan Memory Allocator, but using them is out of scope for this post; besides, I wanted to see how to do it manually first.

Vulkan separates buffers from the memory that backs them. Buffers in Vulkan are just a view into a piece of memory that you also need to manage manually. Allocating memory is therefore split into three parts:

  1. Create the required buffers for the application
  2. Allocate the memory to back the buffers
  3. Bind the buffers to the memory

This separation allows programmers to fine-tune memory usage, for example by allocating one large chunk of memory to back several buffers. For the sake of simplicity, each vk::Buffer will have its own vk::DeviceMemory associated with it.

Creating the buffers - vk::Buffer

I’m going to create two buffers with 10 elements each:

const uint32_t NumElements = 10;
const uint32_t BufferSize = NumElements * sizeof(int32_t);

vk::BufferCreateInfo BufferCreateInfo{
    vk::BufferCreateFlags(),                    // Flags
    BufferSize,                                 // Size
    vk::BufferUsageFlagBits::eStorageBuffer,    // Usage
    vk::SharingMode::eExclusive,                // Sharing mode
    1,                                          // Number of queue family indices
    &ComputeQueueFamilyIndex                    // List of queue family indices
};
vk::Buffer InBuffer = Device.createBuffer(BufferCreateInfo);
vk::Buffer OutBuffer = Device.createBuffer(BufferCreateInfo);

Allocating memory

To allocate memory in Vulkan we first need to find out what kind of memory is required to back the buffers we just created. The vk::Device provides a member function, vk::Device::getBufferMemoryRequirements, that returns a vk::MemoryRequirements object telling us how much memory each buffer needs and which memory types it accepts:

vk::MemoryRequirements InBufferMemoryRequirements = Device.getBufferMemoryRequirements(InBuffer);
vk::MemoryRequirements OutBufferMemoryRequirements = Device.getBufferMemoryRequirements(OutBuffer);

With this information on hand we can query Vulkan for the memory type required to allocate memory that is visible from the host, i.e., memory that can be mapped on the host side:

vk::PhysicalDeviceMemoryProperties MemoryProperties = PhysicalDevice.getMemoryProperties();

uint32_t MemoryTypeIndex = uint32_t(~0);
vk::DeviceSize MemoryHeapSize = uint32_t(~0);
for (uint32_t CurrentMemoryTypeIndex = 0; CurrentMemoryTypeIndex < MemoryProperties.memoryTypeCount; ++CurrentMemoryTypeIndex)
{
    vk::MemoryType MemoryType = MemoryProperties.memoryTypes[CurrentMemoryTypeIndex];
    if ((vk::MemoryPropertyFlagBits::eHostVisible & MemoryType.propertyFlags) &&
        (vk::MemoryPropertyFlagBits::eHostCoherent & MemoryType.propertyFlags))
    {
        MemoryHeapSize = MemoryProperties.memoryHeaps[MemoryType.heapIndex].size;
        MemoryTypeIndex = CurrentMemoryTypeIndex;
        break;
    }
}

std::cout << "Memory Type Index: " << MemoryTypeIndex << std::endl;
std::cout << "Memory Heap Size : " << MemoryHeapSize / 1024 / 1024 / 1024 << " GB" << std::endl;

And finally we can ask the device to allocate the required memory for our buffers:

vk::MemoryAllocateInfo InBufferMemoryAllocateInfo(InBufferMemoryRequirements.size, MemoryTypeIndex);
vk::MemoryAllocateInfo OutBufferMemoryAllocateInfo(OutBufferMemoryRequirements.size, MemoryTypeIndex);
vk::DeviceMemory InBufferMemory = Device.allocateMemory(InBufferMemoryAllocateInfo);
vk::DeviceMemory OutBufferMemory = Device.allocateMemory(OutBufferMemoryAllocateInfo);

The last step of the memory allocation part is to get a mapped pointer to this memory that can be used to copy data from the host to the device. For this simple example I’m just setting the value of each element to its index:

int32_t* InBufferPtr = static_cast<int32_t*>(Device.mapMemory(InBufferMemory, 0, BufferSize));
for (int32_t I = 0; I < NumElements; ++I)
{
    InBufferPtr[I] = I;
}
Device.unmapMemory(InBufferMemory);

Binding Buffers to Memory

Finally we can bind the buffers to the allocated memory:

Device.bindBufferMemory(InBuffer, InBufferMemory, 0);
Device.bindBufferMemory(OutBuffer, OutBufferMemory, 0);

And that concludes the memory allocation part of the program.

Creating the Compute Pipeline - vk::Pipeline

The next part of the process is to create the compute pipeline that will run the compute shader on the GPU. Let’s start with the HLSL shader we are going to run:

[[vk::binding(0, 0)]] RWStructuredBuffer<int> InBuffer;
[[vk::binding(1, 0)]] RWStructuredBuffer<int> OutBuffer;

[numthreads(1, 1, 1)]
void Main(uint3 DTid : SV_DispatchThreadID)
{
    OutBuffer[DTid.x] = InBuffer[DTid.x] * InBuffer[DTid.x];
}

The only difference from a regular HLSL shader is the annotation on each buffer. [[vk::binding(1, 0)]] means that the buffer will use binding 1 from descriptor set 0. Descriptor sets can be thought of as a way to tell Vulkan how the buffers we defined in the previous section are going to be passed to the pipeline. If you have an existing HLSL shader that you want to use but can’t change its source code to add the bindings, you can still set the bindings on the command line via dxc’s -fvk-bind-globals argument.

With the CMake setup defined in the beginning of the post, there will be a file called Square.spv in the build folder for the project. This file contains the SPIR-V bytecode representing this shader. The goal here is to load this shader as the compute stage of our pipeline.

Shader Module - vk::ShaderModule

We start by reading the contents of the SPIR-V file and creating a vk::ShaderModule:

std::vector<char> ShaderContents;
if (std::ifstream ShaderFile{ "Square.spv", std::ios::binary | std::ios::ate })
{
    const size_t FileSize = ShaderFile.tellg();
    ShaderFile.seekg(0);
    ShaderContents.resize(FileSize, '\0');
    ShaderFile.read(ShaderContents.data(), FileSize);
}

vk::ShaderModuleCreateInfo ShaderModuleCreateInfo(
    vk::ShaderModuleCreateFlags(),                                // Flags
    ShaderContents.size(),                                        // Code size
    reinterpret_cast<const uint32_t*>(ShaderContents.data()));    // Code
vk::ShaderModule ShaderModule = Device.createShaderModule(ShaderModuleCreateInfo);

Descriptor Set Layout - vk::DescriptorSetLayout

Next we define the vk::DescriptorSetLayout. This object tells Vulkan the layout of the data to be passed into the pipeline. Note that this is not the actual descriptor set, it is just its layout, which got me confused when I first saw it. The actual descriptor set is represented by a vk::DescriptorSet and needs to be allocated from a descriptor pool. Vulkan is complicated, but you have control!

Let’s define the vk::DescriptorSetLayout:

const std::vector<vk::DescriptorSetLayoutBinding> DescriptorSetLayoutBinding = {
    {0, vk::DescriptorType::eStorageBuffer, 1, vk::ShaderStageFlagBits::eCompute},
    {1, vk::DescriptorType::eStorageBuffer, 1, vk::ShaderStageFlagBits::eCompute}
};
vk::DescriptorSetLayoutCreateInfo DescriptorSetLayoutCreateInfo(
    vk::DescriptorSetLayoutCreateFlags(),
    DescriptorSetLayoutBinding);
vk::DescriptorSetLayout DescriptorSetLayout = Device.createDescriptorSetLayout(DescriptorSetLayoutCreateInfo);

The vk::DescriptorSetLayout is specified using a series of vk::DescriptorSetLayoutBinding objects, each of which assigns a binding index to a buffer in the pipeline. With the vk::DescriptorSetLayout created we can move on to creating the layout of our compute pipeline. Yes, another layout, not the actual thing!

Pipeline Layout - vk::PipelineLayout

vk::PipelineLayoutCreateInfo PipelineLayoutCreateInfo(vk::PipelineLayoutCreateFlags(), DescriptorSetLayout);
vk::PipelineLayout PipelineLayout = Device.createPipelineLayout(PipelineLayoutCreateInfo);
vk::PipelineCache PipelineCache = Device.createPipelineCache(vk::PipelineCacheCreateInfo());

Pipeline - vk::Pipeline

Now we can finally create the compute pipeline:

vk::PipelineShaderStageCreateInfo PipelineShaderCreateInfo(
    vk::PipelineShaderStageCreateFlags(),  // Flags
    vk::ShaderStageFlagBits::eCompute,     // Stage
    ShaderModule,                          // Shader Module
    "Main");                               // Shader Entry Point
vk::ComputePipelineCreateInfo ComputePipelineCreateInfo(
    vk::PipelineCreateFlags(),    // Flags
    PipelineShaderCreateInfo,     // Shader Create Info struct
    PipelineLayout);              // Pipeline Layout
vk::Pipeline ComputePipeline = Device.createComputePipeline(PipelineCache, ComputePipelineCreateInfo);
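
One thing to watch out for: depending on the version of vulkan.hpp you have, vk::Device::createComputePipeline may return a vk::ResultValue<vk::Pipeline> rather than a plain vk::Pipeline, since pipeline creation can report non-error result codes. If your header does that, the call becomes something like:

// Newer vulkan.hpp: unwrap the ResultValue (and ideally check its .result member).
vk::Pipeline ComputePipeline =
    Device.createComputePipeline(PipelineCache, ComputePipelineCreateInfo).value;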

Creating the vk::DescriptorSet

Descriptor sets must be allocated from a vk::DescriptorPool, so we need to create one first:

vk::DescriptorPoolSize DescriptorPoolSize(vk::DescriptorType::eStorageBuffer, 2);
vk::DescriptorPoolCreateInfo DescriptorPoolCreateInfo(vk::DescriptorPoolCreateFlags(), 1, DescriptorPoolSize);
vk::DescriptorPool DescriptorPool = Device.createDescriptorPool(DescriptorPoolCreateInfo);

Now we can finally allocate the descriptor set and update it to point at our buffers:

vk::DescriptorSetAllocateInfo DescriptorSetAllocInfo(DescriptorPool, 1, &DescriptorSetLayout);
const std::vector<vk::DescriptorSet> DescriptorSets = Device.allocateDescriptorSets(DescriptorSetAllocInfo);
vk::DescriptorSet DescriptorSet = DescriptorSets.front();
vk::DescriptorBufferInfo InBufferInfo(InBuffer, 0, NumElements * sizeof(int32_t));
vk::DescriptorBufferInfo OutBufferInfo(OutBuffer, 0, NumElements * sizeof(int32_t));

const std::vector<vk::WriteDescriptorSet> WriteDescriptorSets = {
    {DescriptorSet, 0, 0, 1, vk::DescriptorType::eStorageBuffer, nullptr, &InBufferInfo},
    {DescriptorSet, 1, 0, 1, vk::DescriptorType::eStorageBuffer, nullptr, &OutBufferInfo},
};
Device.updateDescriptorSets(WriteDescriptorSets, {});

Submitting the work to the GPU

Command Pool - vk::CommandPool

To actually run this shader on the GPU we need to submit work to a vk::Queue. The work is described by commands recorded into one or more vk::CommandBuffers, and command buffers must be allocated from a vk::CommandPool, so we need to create one first:

vk::CommandPoolCreateInfo CommandPoolCreateInfo(vk::CommandPoolCreateFlags(), ComputeQueueFamilyIndex);
vk::CommandPool CommandPool = Device.createCommandPool(CommandPoolCreateInfo);

Command Buffers - vk::CommandBuffer

Now we can use the command pool to allocate one or more command buffers:

vk::CommandBufferAllocateInfo CommandBufferAllocInfo(
    CommandPool,                         // Command Pool
    vk::CommandBufferLevel::ePrimary,    // Level
    1);                                  // Num Command Buffers
const std::vector<vk::CommandBuffer> CmdBuffers = Device.allocateCommandBuffers(CommandBufferAllocInfo);
vk::CommandBuffer CmdBuffer = CmdBuffers.front();

Recording Commands

We can now record commands into the vk::CommandBuffer object. To run the compute shader we need to bind the pipeline and the descriptor set, and then record a vk::CommandBuffer::dispatch call:

vk::CommandBufferBeginInfo CmdBufferBeginInfo(vk::CommandBufferUsageFlagBits::eOneTimeSubmit);
CmdBuffer.begin(CmdBufferBeginInfo);
CmdBuffer.bindPipeline(vk::PipelineBindPoint::eCompute, ComputePipeline);
CmdBuffer.bindDescriptorSets(vk::PipelineBindPoint::eCompute,    // Bind point
                                PipelineLayout,                  // Pipeline Layout
                                0,                               // First descriptor set
                                { DescriptorSet },               // List of descriptor sets
                                {});                             // Dynamic offsets
CmdBuffer.dispatch(NumElements, 1, 1);
CmdBuffer.end();

The vk::CommandBuffer::dispatch function takes the number of workgroups to launch in each dimension, not the number of individual threads. Since the shader uses [numthreads(1, 1, 1)], each workgroup is a single thread, so here we are effectively launching one thread per element.
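
If you later bump the group size in the shader, the dispatch has to account for it. For example, assuming a hypothetical [numthreads(64, 1, 1)] variant of the shader, the host side would launch enough groups to cover all elements:

// Hypothetical variant: 64 threads per group, round the group count up so every
// element is covered (the shader would then also need a bounds check on DTid.x).
const uint32_t GroupSize = 64;
const uint32_t NumGroups = (NumElements + GroupSize - 1) / GroupSize;
CmdBuffer.dispatch(NumGroups, 1, 1);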

With the vk::CommandBuffer recorded we can finally submit the work to the GPU. We first get the vk::Queue from the vk::Device using the queue family index retrieved earlier and create a vk::Fence. The fence is a mechanism we can use to wait for the compute shader to complete. After waiting we can read back the results of our computation:

vk::Queue Queue = Device.getQueue(ComputeQueueFamilyIndex, 0);
vk::Fence Fence = Device.createFence(vk::FenceCreateInfo());

vk::SubmitInfo SubmitInfo(0,                // Num Wait Semaphores
                            nullptr,        // Wait Semaphores
                            nullptr,        // Pipeline Stage Flags
                            1,              // Num Command Buffers
                            &CmdBuffer);    // List of command buffers
Queue.submit({ SubmitInfo }, Fence);
Device.waitForFences({ Fence },             // List of fences
                        true,               // Wait All
                        uint64_t(-1));      // Timeout

The final step is to map the buffers and read back the results. For this example we are just going to print the input and output values in the terminal:

InBufferPtr = static_cast<int32_t*>(Device.mapMemory(InBufferMemory, 0, BufferSize));
for (uint32_t I = 0; I < NumElements; ++I)
{
    std::cout << InBufferPtr[I] << " ";
}
std::cout << std::endl;
Device.unmapMemory(InBufferMemory);

int32_t* OutBufferPtr = static_cast<int32_t*>(Device.mapMemory(OutBufferMemory, 0, BufferSize));
for (uint32_t I = 0; I < NumElements; ++I)
{
    std::cout << OutBufferPtr[I] << " ";
}
std::cout << std::endl;
Device.unmapMemory(OutBufferMemory);
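
If everything works, the inputs and their squares should show up in the terminal, something like:

0 1 2 3 4 5 6 7 8 9
0 1 4 9 16 25 36 49 64 81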

Cleaning

We need to manually destroy the resources used by our program, or the validation layer will shout at us:

Device.resetCommandPool(CommandPool, vk::CommandPoolResetFlags());
Device.destroyFence(Fence);
Device.destroyDescriptorSetLayout(DescriptorSetLayout);
Device.destroyPipelineLayout(PipelineLayout);
Device.destroyPipelineCache(PipelineCache);
Device.destroyShaderModule(ShaderModule);
Device.destroyPipeline(ComputePipeline);
Device.destroyDescriptorPool(DescriptorPool);
Device.destroyCommandPool(CommandPool);
Device.freeMemory(InBufferMemory);
Device.freeMemory(OutBufferMemory);
Device.destroyBuffer(InBuffer);
Device.destroyBuffer(OutBuffer);
Device.destroy();
Instance.destroy();

The Vulkan C++ header also has a set of RAII wrapper objects (in the vk::raii namespace) that can be used to avoid manually cleaning up resources, for example:

vk::raii::Context  context;
// makeInstance comes from the utility code shipped with the Vulkan-Hpp samples (the vk::raii::su namespace).
vk::raii::Instance instance = vk::raii::su::makeInstance( context, AppName, EngineName );
// enumerate the physicalDevices
vk::raii::PhysicalDevices physicalDevices( instance );

These objects release their Vulkan resources upon destruction.

Conclusion

If you made it here, well done, you can now use Vulkan to square a few numbers on the GPU =). Hopefully this will be useful in getting started with Vulkan to run compute workloads on your GPU.

This code is pretty much in the order required to build your application, so you should be able to copy and paste it into your main function and have some numbers printed in the terminal. You can also find this code in this Github Repo, ready to be built and executed.

See you next time!