Vulkan is great. It provides a cross-platform API for writing applications that use the GPU for graphics and general-purpose compute. Designed from the ground up to be a modern API, Vulkan can be quite difficult to use, so you had better know what you're doing if you plan to build your application on it.
Vulkan provides both a graphics and a compute API. In this post I will focus on the compute part, as I'm still not very familiar with the graphics side of it.
I had a difficult time searching for simple Vulkan compute samples as the official Khronos Vulkan-Samples only include more elaborate examples. My goal was to write the minimal amount of code to get a compute shader running using Vulkan.
I found two very useful resources that do more or less what I wanted:
- A Simple Vulkan Compute Example by Neil Henning
- Vulkan Compute Example by Slava Savenko
Neil's post is great as it goes straight into the code, but it uses the C API from Vulkan, so it's very verbose. I wanted to use vulkan.hpp as it provides a nice C++ interface to Vulkan. The Vulkan C++ bindings are almost a one-to-one match with the C API, so in theory I could just port Neil's code to the Vulkan C++ header, but I found the task was not that simple. Slava's sample did just that; however, as with most Vulkan samples out there, there is an infrastructure around the code to make it easier to write, namely classes with a lot of methods to abstract some of the resource management required to use Vulkan.
Despite that, both links proved to be really useful when writing my own sample.
The goal is simple: run a compute shader that squares the numbers from an input buffer and stores the results in an output buffer, i.e., run the equivalent of the following code, but on the GPU, using Vulkan:
std::vector<int> Input{ /* filled with some data */ };
std::vector<int> Output(Input.size());
for (size_t I = 0; I < Input.size(); ++I)
{
Output[I] = Input[I] * Input[I];
}
So let’s start writing this program.
Infrastructure
I’m assuming you have the Vulkan SDK installed on your machine. For this program I’m going to use CMake and HLSL for the shader part.
The following CMake file can be used to build the program. I chose to compile the shader
to SPIR-V ahead of time to make things simple in the C++ side. Using add_custom_target
you can create a CMake target that will compile the Square.hlsl
shader when building
the project.
cmake_minimum_required(VERSION 3.16)
project(VulkanCompute)
find_package(Vulkan REQUIRED)
add_custom_command(
OUTPUT "${CMAKE_BINARY_DIR}/Square.spv"
COMMAND $ENV{VK_SDK_PATH}/Bin/dxc -T cs_6_0 -E "Main" -spirv -fvk-use-dx-layout -fspv-target-env=vulkan1.1 -Fo "${CMAKE_BINARY_DIR}/Square.spv" "Square.hlsl"
DEPENDS "Square.hlsl"
WORKING_DIRECTORY ${CMAKE_CURRENT_SOURCE_DIR}
COMMENT "Buiding Shaders"
)
add_custom_target(ComputeShader DEPENDS "${CMAKE_BINARY_DIR}/Square.spv")
add_executable(VulkanCompute "main.cpp")
target_link_libraries(VulkanCompute PRIVATE Vulkan::Vulkan)
add_dependencies(VulkanCompute ComputeShader)
Preamble
To use the Vulkan C++ header one just needs to
#include <vulkan/vulkan.hpp>
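The snippets in the rest of this post also use a few standard library headers, so it's worth including those up front as well (plus <cassert> for a couple of sanity checks we'll add along the way):
#include <algorithm> // std::find_if, std::distance
#include <cassert>   // assert
#include <fstream>   // std::ifstream, to read the SPIR-V file
#include <iostream>  // std::cout
#include <vector>    // std::vector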
Vulkan Instance - vk::Instance
A Vulkan application starts with a vk::Instance, so let's create one:
vk::ApplicationInfo AppInfo{
"VulkanCompute", // Application Name
1, // Application Version
nullptr, // Engine Name or nullptr
0, // Engine Version
VK_API_VERSION_1_1 // Vulkan API version
};
const std::vector<const char*> Layers = { "VK_LAYER_KHRONOS_validation" };
vk::InstanceCreateInfo InstanceCreateInfo(vk::InstanceCreateFlags(), // Flags
&AppInfo, // Application Info
Layers, // Layers
{}); // Extensions
vk::Instance Instance = vk::createInstance(InstanceCreateInfo);
Here I’m enabling the VK_LAYER_KHRONOS_validation
layer so we can have some help from
Vulkan in case something goes wrong.
Enumerating the Physical Devices - vk::PhysicalDevice
A vk::PhysicalDevice represents, as the name suggests, the physical piece of hardware that we can use to run our application. We need to select a physical device, from which we can create a logical device, a vk::Device, that we use to interact with it:
vk::PhysicalDevice PhysicalDevice = Instance.enumeratePhysicalDevices().front();
vk::PhysicalDeviceProperties DeviceProps = PhysicalDevice.getProperties();
std::cout << "Device Name : " << DeviceProps.deviceName << std::endl;
const uint32_t ApiVersion = DeviceProps.apiVersion;
std::cout << "Vulkan Version : " << VK_VERSION_MAJOR(ApiVersion) << "." << VK_VERSION_MINOR(ApiVersion) << "." << VK_VERSION_PATCH(ApiVersion);
vk::PhysicalDeviceLimits DeviceLimits = DeviceProps.limits;
std::cout << "Max Compute Shared Memory Size: " << DeviceLimits.maxComputeSharedMemorySize / 1024 << " KB" << std::endl;
Here I'm just printing some information about the first physical device available on the machine.
Queue Family Index - vk::QueueFamilyProperties
We need a vk::Queue to submit work to be done by the device, hopefully a GPU in this case. Note that extra code is required to make sure the selected device is the one you want, in case there is more than one available.
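For illustration, here is a minimal sketch of such a selection, preferring a discrete GPU and falling back to the first device. This is only a sketch; the rest of this post just keeps using the first device, as above, and SelectedDevice is a hypothetical name:
std::vector<vk::PhysicalDevice> Devices = Instance.enumeratePhysicalDevices();
auto DeviceIt = std::find_if(Devices.begin(), Devices.end(), [](const vk::PhysicalDevice& Dev)
{
    // Prefer a dedicated GPU over integrated or software implementations
    return Dev.getProperties().deviceType == vk::PhysicalDeviceType::eDiscreteGpu;
});
vk::PhysicalDevice SelectedDevice = (DeviceIt != Devices.end()) ? *DeviceIt : Devices.front();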
Vulkan supports different types of queues, so we need to query which queue family we need to create a queue suitable for compute work:
std::vector<vk::QueueFamilyProperties> QueueFamilyProps = PhysicalDevice.getQueueFamilyProperties();
auto PropIt = std::find_if(QueueFamilyProps.begin(), QueueFamilyProps.end(), [](const vk::QueueFamilyProperties& Prop)
{
return Prop.queueFlags & vk::QueueFlagBits::eCompute;
});
const uint32_t ComputeQueueFamilyIndex = std::distance(QueueFamilyProps.begin(), PropIt);
std::cout << "Compute Queue Family Index: " << ComputeQueueFamilyIndex << std::endl;
This will select a queue that has compute capabilities. This is equivalent to searching
for a queue with the VK_QUEUE_COMPUTE_BIT
flag using the C API.
Vulkan Device - vk::Device
Creating a device requires a vk::DeviceQueueCreateInfo and a vk::DeviceCreateInfo:
vk::DeviceQueueCreateInfo DeviceQueueCreateInfo(vk::DeviceQueueCreateFlags(), // Flags
ComputeQueueFamilyIndex, // Queue Family Index
1); // Number of Queues
vk::DeviceCreateInfo DeviceCreateInfo(vk::DeviceCreateFlags(), // Flags
DeviceQueueCreateInfo); // Device Queue Create Info struct
vk::Device Device = PhysicalDevice.createDevice(DeviceCreateInfo);
Allocating Memory
Allocating memory in Vulkan is a pain. There are libraries that facilitate this task, like AMD's Vulkan Memory Allocator, but using these libraries is out of scope for this post; besides, I wanted to see how to do it manually first.
Vulkan separates buffers from the memory that backs them. Buffers in Vulkan are just a view into a piece of memory that you also need to configure manually. Allocating memory is therefore split into three parts:
- Create the required buffers for the application
- Allocate the memory to back the buffers
- Bind the buffers to the memory
This separation allows programmers to fine-tune memory usage, for example by allocating one large chunk of memory for several buffers (see the sketch at the end of this section). For the sake of simplicity, each vk::Buffer here will have its own vk::DeviceMemory associated with it:
Creating the buffers - vk::Buffer
I’m going to create two buffers with 10 elements each:
const uint32_t NumElements = 10;
const uint32_t BufferSize = NumElements * sizeof(int32_t);
vk::BufferCreateInfo BufferCreateInfo{
vk::BufferCreateFlags(), // Flags
BufferSize, // Size
vk::BufferUsageFlagBits::eStorageBuffer, // Usage
vk::SharingMode::eExclusive, // Sharing mode
1, // Number of queue family indices
&ComputeQueueFamilyIndex // List of queue family indices
};
vk::Buffer InBuffer = Device.createBuffer(BufferCreateInfo);
vk::Buffer OutBuffer = Device.createBuffer(BufferCreateInfo);
Allocating memory
To allocate memory in Vulkan we first need to find the type of memory required to back the buffers we created. The vk::Device provides a member function, vk::Device::getBufferMemoryRequirements, that returns a vk::MemoryRequirements object telling us how much memory each buffer needs and which memory types it may use:
vk::MemoryRequirements InBufferMemoryRequirements = Device.getBufferMemoryRequirements(InBuffer);
vk::MemoryRequirements OutBufferMemoryRequirements = Device.getBufferMemoryRequirements(OutBuffer);
With this information on hand we can query Vulkan for the memory type required to allocate memory that is visible from the host, i.e., memory that can be mapped on the host side:
vk::PhysicalDeviceMemoryProperties MemoryProperties = PhysicalDevice.getMemoryProperties();
uint32_t MemoryTypeIndex = uint32_t(~0);
vk::DeviceSize MemoryHeapSize = uint64_t(~0);
for (uint32_t CurrentMemoryTypeIndex = 0; CurrentMemoryTypeIndex < MemoryProperties.memoryTypeCount; ++CurrentMemoryTypeIndex)
{
vk::MemoryType MemoryType = MemoryProperties.memoryTypes[CurrentMemoryTypeIndex];
if ((vk::MemoryPropertyFlagBits::eHostVisible & MemoryType.propertyFlags) &&
(vk::MemoryPropertyFlagBits::eHostCoherent & MemoryType.propertyFlags))
{
MemoryHeapSize = MemoryProperties.memoryHeaps[MemoryType.heapIndex].size;
MemoryTypeIndex = CurrentMemoryTypeIndex;
break;
}
}
std::cout << "Memory Type Index: " << MemoryTypeIndex << std::endl;
std::cout << "Memory Heap Size : " << MemoryHeapSize / 1024 / 1024 / 1024 << " GB" << std::endl;
And finally we can ask the device to allocate the required memory for our buffers:
vk::MemoryAllocateInfo InBufferMemoryAllocateInfo(InBufferMemoryRequirements.size, MemoryTypeIndex);
vk::MemoryAllocateInfo OutBufferMemoryAllocateInfo(OutBufferMemoryRequirements.size, MemoryTypeIndex);
vk::DeviceMemory InBufferMemory = Device.allocateMemory(InBufferMemoryAllocateInfo);
vk::DeviceMemory OutBufferMemory = Device.allocateMemory(OutBufferMemoryAllocateInfo);
The last step of the memory allocation part is to get a mapped pointer to this memory that can be used to copy data from the host to the device. For this simple example I'm just setting the value of each element to its index:
int32_t* InBufferPtr = static_cast<int32_t*>(Device.mapMemory(InBufferMemory, 0, BufferSize));
for (int32_t I = 0; I < static_cast<int32_t>(NumElements); ++I)
{
InBufferPtr[I] = I;
}
Device.unmapMemory(InBufferMemory);
Binding Buffers to Memory
Finally we can bind the buffers to the allocated memory:
Device.bindBufferMemory(InBuffer, InBufferMemory, 0);
Device.bindBufferMemory(OutBuffer, OutBufferMemory, 0);
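As an aside, this is roughly what sub-allocating a single chunk of memory for two buffers could look like. This is only a sketch, with hypothetical BufferA and BufferB, and it skips the memoryTypeBits check for brevity:
vk::MemoryRequirements ReqA = Device.getBufferMemoryRequirements(BufferA);
vk::MemoryRequirements ReqB = Device.getBufferMemoryRequirements(BufferB);
// Place BufferB after BufferA, rounding its offset up to the required alignment
const vk::DeviceSize OffsetB = (ReqA.size + ReqB.alignment - 1) & ~(ReqB.alignment - 1);
vk::MemoryAllocateInfo SharedAllocInfo(OffsetB + ReqB.size, MemoryTypeIndex);
vk::DeviceMemory SharedMemory = Device.allocateMemory(SharedAllocInfo);
Device.bindBufferMemory(BufferA, SharedMemory, 0);
Device.bindBufferMemory(BufferB, SharedMemory, OffsetB);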
And that concludes the memory allocation part of the program.
Creating the Compute Pipeline - vk::Pipeline
The next part of the process is to create the compute pipeline that will be used to run the compute shader on the GPU. Let's start with the HLSL shader we are going to run:
[[vk::binding(0, 0)]] RWStructuredBuffer<int> InBuffer;
[[vk::binding(1, 0)]] RWStructuredBuffer<int> OutBuffer;
[numthreads(1, 1, 1)]
void Main(uint3 DTid : SV_DispatchThreadID)
{
OutBuffer[DTid.x] = InBuffer[DTid.x] * InBuffer[DTid.x];
}
The only difference from a regular HLSL shader is the annotation on each buffer. [[vk::binding(1, 0)]] means that a buffer will use binding 1 from descriptor set 0. Descriptor sets can be thought of as a way to tell Vulkan how the buffers we defined in the previous section are going to be passed to the pipeline. If you have an existing HLSL shader that you want to use but can't change its source code to add the bindings, you can still set the bindings on the command line via dxc's -fvk-bind-globals argument.
With the CMake setup defined at the beginning of the post, there will be a file called Square.spv in the build folder for the project. This file contains the SPIR-V bytecode representing this shader. The goal here is to load this shader as the compute stage of our pipeline.
Shader Module - vk::ShaderModule
We start by reading the contents of the SPIR-V file and creating a vk::ShaderModule:
std::vector<char> ShaderContents;
if (std::ifstream ShaderFile{ "Square.spv", std::ios::binary | std::ios::ate })
{
const size_t FileSize = ShaderFile.tellg();
ShaderFile.seekg(0);
ShaderContents.resize(FileSize, '\0');
ShaderFile.read(ShaderContents.data(), FileSize);
}
vk::ShaderModuleCreateInfo ShaderModuleCreateInfo(
vk::ShaderModuleCreateFlags(), // Flags
ShaderContents.size(), // Code size
reinterpret_cast<const uint32_t*>(ShaderContents.data())); // Code
vk::ShaderModule ShaderModule = Device.createShaderModule(ShaderModuleCreateInfo);
Descriptor Set Layout - vk::DescriptorSetLayout
Next we define the vk::DescriptorSetLayout. This object tells Vulkan the layout of the data to be passed to the pipeline. Note that this is not the actual descriptor set, it is just its layout. This got me confused when I first saw it. The actual descriptor set is represented by a vk::DescriptorSet and it needs to be allocated from a descriptor pool. Vulkan is complicated, but you have control!
Let's define the vk::DescriptorSetLayout:
const std::vector<vk::DescriptorSetLayoutBinding> DescriptorSetLayoutBinding = {
{0, vk::DescriptorType::eStorageBuffer, 1, vk::ShaderStageFlagBits::eCompute},
{1, vk::DescriptorType::eStorageBuffer, 1, vk::ShaderStageFlagBits::eCompute}
};
vk::DescriptorSetLayoutCreateInfo DescriptorSetLayoutCreateInfo(
vk::DescriptorSetLayoutCreateFlags(),
DescriptorSetLayoutBinding);
vk::DescriptorSetLayout DescriptorSetLayout = Device.createDescriptorSetLayout(DescriptorSetLayoutCreateInfo);
The vk::DescriptorSetLayout is specified using a series of vk::DescriptorSetLayoutBinding objects. Each binding assigns an index to a buffer in the pipeline. With the vk::DescriptorSetLayout created we can move on to creating the layout of our compute pipeline. Yes, another layout, not the actual thing!
Pipeline Layout - vk::PipelineLayout
vk::PipelineLayoutCreateInfo PipelineLayoutCreateInfo(vk::PipelineLayoutCreateFlags(), DescriptorSetLayout);
vk::PipelineLayout PipelineLayout = Device.createPipelineLayout(PipelineLayoutCreateInfo);
vk::PipelineCache PipelineCache = Device.createPipelineCache(vk::PipelineCacheCreateInfo());
Pipeline - vk::Pipeline
Now we can finally create the compute pipeline:
vk::PipelineShaderStageCreateInfo PipelineShaderCreateInfo(
vk::PipelineShaderStageCreateFlags(), // Flags
vk::ShaderStageFlagBits::eCompute, // Stage
ShaderModule, // Shader Module
"Main"); // Shader Entry Point
vk::ComputePipelineCreateInfo ComputePipelineCreateInfo(
vk::PipelineCreateFlags(), // Flags
PipelineShaderCreateInfo, // Shader Create Info struct
PipelineLayout); // Pipeline Layout
// Note: with recent versions of vulkan.hpp, createComputePipeline returns a vk::ResultValue, hence the .value
vk::Pipeline ComputePipeline = Device.createComputePipeline(PipelineCache, ComputePipelineCreateInfo).value;
Creating the vk::DescriptorSet
Descriptor sets must be allocated from a vk::DescriptorPool, so we need to create one first:
vk::DescriptorPoolSize DescriptorPoolSize(vk::DescriptorType::eStorageBuffer, 2);
vk::DescriptorPoolCreateInfo DescriptorPoolCreateInfo(vk::DescriptorPoolCreateFlags(), 1, DescriptorPoolSize);
vk::DescriptorPool DescriptorPool = Device.createDescriptorPool(DescriptorPoolCreateInfo);
Now we can finally allocate the descriptor set and update it to point to our buffers:
vk::DescriptorSetAllocateInfo DescriptorSetAllocInfo(DescriptorPool, 1, &DescriptorSetLayout);
const std::vector<vk::DescriptorSet> DescriptorSets = Device.allocateDescriptorSets(DescriptorSetAllocInfo);
vk::DescriptorSet DescriptorSet = DescriptorSets.front();
vk::DescriptorBufferInfo InBufferInfo(InBuffer, 0, NumElements * sizeof(int32_t));
vk::DescriptorBufferInfo OutBufferInfo(OutBuffer, 0, NumElements * sizeof(int32_t));
const std::vector<vk::WriteDescriptorSet> WriteDescriptorSets = {
{DescriptorSet, 0, 0, 1, vk::DescriptorType::eStorageBuffer, nullptr, &InBufferInfo},
{DescriptorSet, 1, 0, 1, vk::DescriptorType::eStorageBuffer, nullptr, &OutBufferInfo},
};
Device.updateDescriptorSets(WriteDescriptorSets, {});
Submitting the work to the GPU
Command Pool - vk::CommandPool
To actually run this shader on the GPU we need to submit work to a vk::Queue. We give the queue commands stored in one or more vk::CommandBuffers. Command buffers must be allocated from a vk::CommandPool, so we need to create one first:
vk::CommandPoolCreateInfo CommandPoolCreateInfo(vk::CommandPoolCreateFlags(), ComputeQueueFamilyIndex);
vk::CommandPool CommandPool = Device.createCommandPool(CommandPoolCreateInfo);
Command Buffers - vk::CommandBuffer
Now we can use the command pool to allocate one or more command buffers:
vk::CommandBufferAllocateInfo CommandBufferAllocInfo(
CommandPool, // Command Pool
vk::CommandBufferLevel::ePrimary, // Level
1); // Num Command Buffers
const std::vector<vk::CommandBuffer> CmdBuffers = Device.allocateCommandBuffers(CommandBufferAllocInfo);
vk::CommandBuffer CmdBuffer = CmdBuffers.front();
Recording Commands
We can now record commands into the vk::CommandBuffer object. To run the compute shader we need to bind the pipeline and the descriptor sets, and record a vk::CommandBuffer::dispatch call:
vk::CommandBufferBeginInfo CmdBufferBeginInfo(vk::CommandBufferUsageFlagBits::eOneTimeSubmit);
CmdBuffer.begin(CmdBufferBeginInfo);
CmdBuffer.bindPipeline(vk::PipelineBindPoint::eCompute, ComputePipeline);
CmdBuffer.bindDescriptorSets(vk::PipelineBindPoint::eCompute, // Bind point
PipelineLayout, // Pipeline Layout
0, // First descriptor set
{ DescriptorSet }, // List of descriptor sets
{}); // Dynamic offsets
CmdBuffer.dispatch(NumElements, 1, 1);
CmdBuffer.end();
The vk::CommandBuffer::dispatch function takes the number of workgroups to launch in each dimension. Since our shader declares [numthreads(1, 1, 1)], each workgroup contains a single thread, so here we are launching one thread per element.
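If the shader instead used a larger workgroup, say [numthreads(64, 1, 1)], we would launch enough workgroups to cover all elements. A hypothetical variation, not used in this post:
const uint32_t GroupSize = 64; // Must match numthreads in the shader
const uint32_t GroupCount = (NumElements + GroupSize - 1) / GroupSize; // Round up
CmdBuffer.dispatch(GroupCount, 1, 1);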
With the vk::CommandBuffer recorded we can finally submit the work to the GPU. We first get the vk::Queue from the vk::Device using the queue family index retrieved earlier, and we create a vk::Fence. The fence is a mechanism we can use to wait for the compute shader to complete. After waiting we can read the results of our computation:
vk::Queue Queue = Device.getQueue(ComputeQueueFamilyIndex, 0);
vk::Fence Fence = Device.createFence(vk::FenceCreateInfo());
vk::SubmitInfo SubmitInfo(0, // Num Wait Semaphores
nullptr, // Wait Semaphores
nullptr, // Pipeline Stage Flags
1, // Num Command Buffers
&CmdBuffer); // List of command buffers
Queue.submit({ SubmitInfo }, Fence);
vk::Result WaitResult = Device.waitForFences({ Fence }, // List of fences
                                             true,      // Wait All
                                             uint64_t(-1)); // Timeout
assert(WaitResult == vk::Result::eSuccess); // The wait can also return eTimeout
The final step is to map the buffers and read back the results. For this example we are just going to print the values in the terminal:
InBufferPtr = static_cast<int32_t*>(Device.mapMemory(InBufferMemory, 0, BufferSize));
for (uint32_t I = 0; I < NumElements; ++I)
{
std::cout << InBufferPtr[I] << " ";
}
std::cout << std::endl;
Device.unmapMemory(InBufferMemory);
int32_t* OutBufferPtr = static_cast<int32_t*>(Device.mapMemory(OutBufferMemory, 0, BufferSize));
for (uint32_t I = 0; I < NumElements; ++I)
{
std::cout << OutBufferPtr[I] << " ";
}
std::cout << std::endl;
Device.unmapMemory(OutBufferMemory);
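If you'd rather have the program verify the results instead of eyeballing them, a small check could be placed just before the unmapMemory call above. A sketch, using the assert from <cassert>:
// Each output element should be the square of its index
for (uint32_t I = 0; I < NumElements; ++I)
{
    assert(OutBufferPtr[I] == static_cast<int32_t>(I * I));
}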
Cleaning
We need to manually destroy the resources used by the program, or the validation layer will shout at us:
Device.resetCommandPool(CommandPool, vk::CommandPoolResetFlags());
Device.destroyFence(Fence);
Device.destroyDescriptorSetLayout(DescriptorSetLayout);
Device.destroyPipelineLayout(PipelineLayout);
Device.destroyPipelineCache(PipelineCache);
Device.destroyShaderModule(ShaderModule);
Device.destroyPipeline(ComputePipeline);
Device.destroyDescriptorPool(DescriptorPool);
Device.destroyCommandPool(CommandPool);
Device.freeMemory(InBufferMemory);
Device.freeMemory(OutBufferMemory);
Device.destroyBuffer(InBuffer);
Device.destroyBuffer(OutBuffer);
Device.destroy();
Instance.destroy();
The Vulkan C++ header also has a set of raii objects that can be used to avoid cleaning up resources manually, for example:
// The raii objects live in a separate header: #include <vulkan/vulkan_raii.hpp>
vk::raii::Context Context;
vk::raii::Instance Instance(Context, InstanceCreateInfo);
// Enumerate the physical devices
vk::raii::PhysicalDevices PhysicalDevices(Instance);
So these objects will clean up their resources upon destruction.
Conclusion
If you are here, well done, you can now use Vulkan to square a few numbers on the GPU =). Hopefully this will be useful to you when starting to use Vulkan to run compute workloads on your GPU.
This code is pretty much in the order required to build your application, so you should be able to copy and paste it into your main function and have some numbers printed in the terminal. You can also find this code in this Github Repo, ready to be built and executed.
See you next time!