Classes

struct Array
    Represents a buffer of values on the GPU.
struct Bindings
    Represents an ordered collection of WGPUBuffers (wrapped as tensors, non-overlapping views, or arrays) for the purpose of binding them to a kernel operation to make them accessible to the GPU kernel.
struct CallbackData
    Used for on-done callback data for asynchronous operations such as kernel launching.
struct Context
    Represents a GPU context; aggregates the WebGPU API handles to interact with the GPU, including the instance, adapter, device, and queue.
struct CopyData
    Staging buffer and callback data for copying data between the GPU and CPU.
struct Kernel
    Represents handles + metadata for a reusable kernel on the GPU. The struct members can be divided into "consumed upon dispatch" (commandBuffer) and reusable ahead-of-time setup (all other members).
struct KernelCode
    KernelCode is the representation of WGSL GPU code with template substitutions applied. It is a type around the code string with additional metadata for workgroup size and precision, since these are specified in the WGSL code. Additionally, label and entryPoint are used by createKernel() to specify the label and entry point of the kernel.
struct KernelPool
    A pool of kernels to manage GPU resources. For simple use cases this is instantiated as a member in the Context struct, although it is possible to have multiple resource pools of kernels in more complex scenarios.
struct Logger
    Logger struct for logging messages. stream: the stream to log to. buffer: a buffer to store the formatted message. level: the log level to log messages at.
struct NoParam
    NoParam is a no-op type used to indicate that a kernel does not have any parameters.
struct Shape
    Represents the shape of a tensor.
struct Tensor
    Represents a tensor on the GPU, which is a buffer of values with a shape.
struct TensorPool
    Represents a pool of tensors to manage GPU resources. The pool is responsible for managing the lifetime of the tensors and freeing them when the pool is destroyed.
struct TensorView
    Represents a non-owning view into a tensor, specifying an offset and a subspan. This is useful for specifying a slice of a tensor on the GPU without copying the data.
Enumerations

enum NumType { kf16, kf32 }
enum LogLevel { kError = 0, kWarn = 1, kInfo = 2, kTrace = 3 }
Functions

size_t size(const Shape &shape)
    Returns the number of elements in a tensor with the given shape, which is equal to the product of the dimensions.
template<std::size_t N>
Bindings(std::array<Tensor, N>) -> Bindings<N>
    Deduction guide for Bindings.
template<typename... Args>
Bindings(Args...) -> Bindings<sizeof...(Args)>
size_t sizeBytes(const NumType &type)
    Returns the number of bytes of a number type.
std::string toString(NumType type)
    Converts NumType to string.
std::string toString(const Shape &shape)
    Converts Shape to string. The string formatting is meant to be slotted into WGSL code (hence no additional parentheses or brackets).
std::string toString(size_t value)
    Converts size_t to string. Wraps std::to_string for consistency, instead of having to remember to switch between std::to_string and toString depending on the type.
void replaceAll(std::string &str, const std::string &from, const std::string &to)
    Simple in-place string replacement helper function for substituting placeholders in a WGSL string template.
void replaceAll(std::string &str, const std::vector<std::pair<std::string, std::string>> &reps)
    Overload of the string replacement helper function to replace multiple substrings in a string with multiple replacements.
bool operator<(const Kernel &lhs, const Kernel &rhs)
    Operator implementation to make the Kernel type hashable.
void processEvents(const WGPUInstance &instance)
Tensor createTensor(TensorPool &pool, WGPUDevice &device, const Shape &shape, NumType dtype, WGPUBufferUsageFlags usage = WGPUBufferUsage_Storage | WGPUBufferUsage_CopyDst | WGPUBufferUsage_CopySrc)
    Tensor factory function to create a tensor (a Tensor type is simply an Array with an N-dimensional Shape specification) on the GPU. The tensor is created with the given shape, data type, and usage flags, added to the TensorPool, and returned.
Tensor createTensor(Context &ctx, const Shape &shape, NumType dtype)
    Overload of the tensor factory function to instantiate a tensor on the GPU with a given shape and data type.
Tensor createTensor(Context &ctx, const Shape &shape, NumType dtype, float *data)
    Overload of the tensor factory function to instantiate a tensor on the GPU with a given shape and data type. This overload also takes initial float* data to populate the tensor with.
Tensor createTensor(Context &ctx, const Shape &shape, NumType dtype, half *data)
    Overload of the tensor factory function to instantiate a tensor on the GPU with a given shape and data type. This overload also takes initial half* data to populate the tensor with.
void FreeTensor(TensorPool &pool, Tensor tensor)
    Frees a tensor resource and updates the tensor pool.
void check(bool condition, const char *message, const char *file="unkown", int line=-1)
    Checks a condition and logs an error message if the condition is false. In debug mode, it will also exit the program with an error code.
Context createContext(const WGPUInstanceDescriptor &desc = {}, const WGPURequestAdapterOptions &adapterOpts = {}, const WGPUDeviceDescriptor &devDescriptor = {})
    Factory function to create a GPU context, which aggregates the WebGPU API handles to interact with the GPU, including the instance, adapter, device, and queue.
void wait(Context &ctx, std::future<void> &future)
void toCPU(Context &ctx, Tensor &tensor, void *data, size_t bufferSize, CopyData &op)
    Copies data from a GPU buffer to CPU memory.
void toCPU(Context &ctx, Tensor &tensor, void *data, size_t bufferSize)
    Overload of the toCPU function that copies data from a GPU buffer to the CPU, initializing a staging buffer and promise/future for the operation for you.
template<size_t N>
void toCPU(Context &ctx, Tensor &tensor, std::array<float, N> &data)
    Overload of the toCPU function to copy data from a GPU buffer to CPU memory for an array of floats instead of a pointer to a float buffer.
void toGPU(Context &ctx, const void *data, WGPUBuffer buffer, size_t size)
    Copies data from CPU memory to a GPU buffer. The toGPU overloads are effectively a convenience wrapper around the WebGPU API call wgpuQueueWriteBuffer.
void toGPU(Context &ctx, const float *data, Tensor &tensor)
    Overload of the toGPU function to copy data from CPU memory to the GPU, taking a Tensor instance instead of a WGPUBuffer instance.
void toGPU(Context &ctx, const half *data, Tensor &tensor)
template<typename Params>
void toGPU(Context &ctx, Params &params, Kernel &op)
void resetCommandBuffer(WGPUDevice &device, Kernel &op)
    Resets the command buffer in preparation for a kernel dispatch. Since command buffers are consumed upon submission, this function is used both in the initial kernel creation and every time the kernel is to be reused for a dispatch.
size_t cdiv(size_t n, size_t d)
    Ceiling division.
Shape cdiv(Shape total, Shape group)
    cdiv for shape specification. Mostly useful for evenly dividing the total number of threads by workgroup size dimensions.
Kernel createKernel(Context &ctx, const KernelCode &code, const Tensor *dataBindings, size_t numTensors, const size_t *viewOffsets, const Shape &nWorkgroups, const void *params = nullptr, size_t paramsSize = 0)
    A factory function to create a kernel on the GPU. The kernel is created with the given WGSL code, input tensors, output tensor, and optional parameters.
template<typename ParamsType = NoParam, size_t numInputs>
Kernel createKernel(Context &ctx, const KernelCode &code, const Bindings<numInputs> &dataBindings, const Shape &nWorkgroups, const ParamsType &params = ParamsType{})
    Overload which wraps the createKernel factory function to create a kernel on the GPU. This overload takes a static collection of input tensors instead of a pointer, and a statically determined ParamsType instead of casting params to a void pointer.
void dispatchKernel(Context &ctx, Kernel &kernel, std::promise<void> &promise)
    Asynchronously submits a kernel to the GPU queue for execution. It also sets up a callback to notify when the kernel has finished executing by setting the value of the promise in the kernel instance argument.
template<typename numtype>
std::string show(const numtype *a, size_t rows, size_t cols, const std::string &name = "")
    Show a 2D array as a string, base implementation.
template<typename numtype, size_t rows, size_t cols>
std::string show(const std::array<numtype, rows * cols> &a, const std::string &name = "")
    Overload of show() for std::array.
template<size_t rows, size_t cols>
std::string show(const std::array<float, rows * cols> &a, const std::string &name = "")
    Overload of show() for float std::array.
void range(float *input, size_t N, float start = 0.0, float step = 1.0)
    Populate the array with a range of values. This is mostly for testing purposes.
template<size_t N>
void range(std::array<float, N> &input, float start = 0.0, float step = 1.0)
    Overload of range() for std::array.
void randint(float *a, size_t N, std::mt19937 &gen, int min = -1, int max = 1)
    Populate the array with random integers.
template<typename numtype, size_t size>
void randint(std::array<numtype, size> &a, std::mt19937 &gen, int min = -1, int max = 1)
    Overload of randint() for std::array.
void randn(float *a, size_t N, std::mt19937 &gen, float mean = 0.0, float std = 1.0)
    Populate the array with random floats, generated from a Gaussian distribution.
template<size_t size>
void randn(std::array<float, size> &a, std::mt19937 &gen, float mean = 0.0, float std = 1.0)
    Overload of randn() for std::array.
void eye(float *a, size_t N)
    Populate a square matrix with the identity matrix.
void transpose(float *input, float *output, size_t M, size_t N)
    Transpose a matrix.
void flip(float *a, size_t R, size_t C, bool horizontal = true)
    Flip a matrix horizontally or vertically.
bool isclose(float *a, float *b, size_t n, float tol = 1e-3)
    Determine if the values of two arrays are close to each other.
void LOG(Logger &logger, int level, const char *message, ...)
    Log a message to the logger. If NDEBUG is defined in a source file or as a compiler flag, this is a no-op.
void setLogLevel(int level)
    Set the log level of the default logger.

Variables

template<typename T>
constexpr bool IsNoParam = std::is_same_v<T, NoParam>
static constexpr int kShowMaxRows = 8
static constexpr int kShowMaxCols = 8
static const char *kLevelStr[] = {"error", "warn", "info", "trace"}
static Logger kDefLog = {stdout, "", kInfo}
    Default logger for logging messages to stdout at the info level. The output stream and logging level for the default logger can be globally changed on a per-program basis.
enum gpu::LogLevel |
enum gpu::NumType |
gpu::Bindings(Args...) -> Bindings<sizeof...(Args)>
gpu::Bindings(std::array<Tensor, N>) -> Bindings<N>
Deduction guide for Bindings.
Checks a condition and logs an error message if the condition is false. In debug mode, it will also exit the program with an error code.
[in] condition: The condition to check.
[in] message: The error message to log if the condition is false.
[in] file: The source file where the check is performed.
[in] line: The line number in the source file where the check is performed.
Definition at line 646 of file gpu.h.
Factory function to create a GPU context, which aggregates WebGPU API handles to interact with the GPU including the instance, adapter, device, and queue.
The function takes optional descriptor parameters for the instance descriptor, adapter request options, and device descriptor, which are passed through to the WebGPU API calls to create the instance, adapter, and device.
If Dawn is used, it also sets up an error callback for device loss.
[in] desc: Instance descriptor for the WebGPU instance (optional)
[in] adapterOpts: Adapter request options for the WebGPU adapter (optional)
[in] devDescriptor: Device descriptor for the WebGPU device (optional)
Definition at line 678 of file gpu.h.
Kernel gpu::createKernel(Context &ctx, const KernelCode &code, const Bindings<numInputs> &dataBindings, const Shape &nWorkgroups, const ParamsType &params = ParamsType{})
Overload which wraps the createKernel factory function to create a kernel on the GPU. This overload takes a static collection of input tensors instead of a pointer, and a statically determined ParamsType instead of casting params to a void pointer.
[in] ctx: Context instance to manage the kernel
[in] code: WGSL code for the kernel
[in] dataBindings: A Bindings of tensors whose GPU buffers are bound to the kernel as inputs and outputs.
[in] nWorkgroups: Number of workgroups in the x, y, z grid; must be a Shape of rank == 3.
[in] params: Optional parameters for the kernel. If the kernel does not have any parameters, use NoParam.
Definition at line 1163 of file gpu.h.
A factory function to create a kernel on the GPU. The kernel is created with the given WGSL code, input tensors, output tensor, and optional parameters.
Note that the values of the input tensors are not used here, only the reference handles to the underlying buffers as well as the size of the buffers.
[in] ctx: Context instance to manage the kernel
[in] code: WGSL code for the kernel
[in] dataBindings: Pointer to a span of tensors bound to the kernel
[in] numTensors: Number of tensors in the dataBindings span
[in] viewOffsets: Pointer to an array of view offsets for the input tensors
[in] nWorkgroups: Shape of the workgroup
[in] params: Optional parameters for the kernel. If the kernel does not have any parameters, use NoParam. This is cast as void* to allow arbitrary types to be passed as parameters.
[in] paramsSize: Size of the parameters buffer in bytes.
Definition at line 1007 of file gpu.h.
Overload of the tensor factory function to instantiate a tensor on the GPU with a given shape and data type.
Instead of taking the TensorPool and raw WebGPU API WGPUDevice and WGPUBufferUsageFlags arguments, this is a convenience wrapper around the core createTensor function which has default usage flags for a storage buffer, and takes in the broader Context instance instead of the narrower TensorPool object.
[in] ctx: Context instance to manage the tensor
[in] shape: Shape of the tensor
[in] dtype: Data type of the tensor (e.g. kf32)
Definition at line 530 of file gpu.h.
Overload of the tensor factory function to instantiate a tensor on the GPU with a given shape and data type. This overload also takes initial float* data to populate the tensor with.
The data is assumed to be of size equal to the product of the dimensions in the shape, and is copied to the GPU buffer.
[in] ctx: Context instance to manage the tensor
[in] shape: Shape of the tensor
[in] dtype: Data type of the tensor (e.g. kf32)
[in] data: Initial data to populate the tensor with
Definition at line 552 of file gpu.h.
Overload of the tensor factory function to instantiate a tensor on the GPU with a given shape and data type. This overload also takes initial half* data to populate the tensor with.
The data is assumed to be of size equal to the product of the dimensions in the shape, and is copied to the GPU buffer.
[in] ctx: Context instance to manage the tensor
[in] shape: Shape of the tensor
[in] dtype: Data type of the tensor (e.g. kf32)
[in] data: Initial data to populate the tensor with
Definition at line 582 of file gpu.h.
Tensor factory function to create a tensor (a Tensor type is simply an Array with an N-dimensional Shape specification) on the GPU. The tensor is created with the given shape, data type, and usage flags, added to the TensorPool, and returned.
This is the core implementation which takes the minimal set of parameters in terms of the raw WebGPU API, and is used by the other createTensor overloads which provide more ergonomic interfaces.
[in] pool: TensorPool instance to manage the tensor
[in] device: WGPUDevice instance to create the tensor on
[in] shape: Shape of the tensor
[in] dtype: Data type of the tensor (e.g. kf32)
[in] usage: Usage flags for the tensor buffer
Definition at line 491 of file gpu.h.
Asynchronously submits a kernel to the GPU queue for execution. It also sets up a callback to notify when the kernel has finished executing by setting the value of the promise in the kernel instance argument.
dispatchKernel does not wait for the kernel to finish executing and returns immediately. The caller can wait for the kernel to finish executing by calling wait() on the future in the kernel instance.
[in] ctx: Context instance to manage the kernel, from which the queue for the GPU is obtained
[in] kernel: Kernel instance to dispatch
[in] promise: Promise to set when the kernel has finished executing
Definition at line 1200 of file gpu.h.
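Taken together, createContext, createTensor, createKernel, dispatchKernel, wait, and toCPU form the typical round trip for running a kernel. The sketch below is illustrative only, composed from the signatures documented on this page; the WGSL source, the {{workgroupSize}} placeholder, and the choice of 256 threads per workgroup are assumptions for illustration, and the snippet requires the gpu.h header and a WebGPU runtime to build.

```cpp
#include <array>
#include <future>
#include "gpu.h"  // gpu.cpp header providing the API documented on this page

using namespace gpu;

// Hypothetical element-wise WGSL kernel; binding layout and the
// {{workgroupSize}} placeholder are assumptions, not part of the documented API.
static const char *kDouble = R"(
@group(0) @binding(0) var<storage, read_write> inp: array<f32>;
@group(0) @binding(1) var<storage, read_write> out: array<f32>;
@compute @workgroup_size({{workgroupSize}})
fn main(@builtin(global_invocation_id) gid: vec3<u32>) {
    let i: u32 = gid.x;
    if (i < arrayLength(&inp)) {
        out[i] = 2.0 * inp[i];
    }
}
)";

int main() {
  static constexpr size_t N = 1024;
  Context ctx = createContext();
  std::array<float, N> inputArr, outputArr;
  range(inputArr);  // fill with 0, 1, 2, ... for testing
  Tensor input = createTensor(ctx, Shape{N}, kf32, inputArr.data());
  Tensor output = createTensor(ctx, Shape{N}, kf32);
  std::promise<void> promise;
  std::future<void> future = promise.get_future();
  Kernel op = createKernel(ctx, {kDouble, /* workgroup size */ 256, kf32},
                           Bindings{input, output},
                           /* nWorkgroups, rank 3 */ {cdiv(N, 256), 1, 1});
  dispatchKernel(ctx, op, promise);  // returns immediately
  wait(ctx, future);                 // block until the promise is fulfilled
  toCPU(ctx, output, outputArr);     // copy result back via a staging buffer
  return 0;
}
```

To reuse `op` for another dispatch, call resetCommandBuffer first, since the command buffer is consumed on submission.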
Populate a square matrix with the identity matrix.
a: The array to populate.
N: The number of rows and columns in the square matrix.
Definition at line 231 of file array_utils.h.
Flip a matrix horizontally or vertically.
a: The matrix to flip.
R: The number of rows in the matrix.
C: The number of columns in the matrix.
horizontal: Whether to flip horizontally (true) or vertically (false).
Definition at line 264 of file array_utils.h.
Frees a tensor resource and updates the tensor pool.
Only needed if the use case requires manually managing resource lifetimes of GPU tensors. For simple use cases, the TensorPool destructor will automatically free all tensors.
[in] pool: TensorPool instance to manage the tensor
[in] tensor: Tensor instance to free
Definition at line 608 of file gpu.h.
bool gpu::isclose(float *a, float *b, size_t n, float tol = 1e-3)
Determine if the values of two arrays are close to each other.
a: The first array.
b: The second array.
n: The number of elements in the arrays.
tol: The tolerance for closeness.
Definition at line 288 of file array_utils.h.
Log a message to the logger. If NDEBUG is defined in a source file or as a compiler flag, this is a no-op.
logger: The logger to log to.
level: The log level of the message.
message: The message to log.
Definition at line 34 of file logging.h.
Operator implementation to make the Kernel type hashable.
void gpu::randint(float *a, size_t N, std::mt19937 &gen, int min = -1, int max = 1)
Populate the array with random integers.
a: The array to populate.
N: The number of elements in the array.
gen: The random number generator.
min: The minimum value for the random integers.
max: The maximum value for the random integers.
Definition at line 171 of file array_utils.h.
void gpu::randint(std::array<numtype, size> &a, std::mt19937 &gen, int min = -1, int max = 1)
Overload of randint() for std::array.
a: The array to populate.
gen: The random number generator.
min: The minimum value for the random integers.
max: The maximum value for the random integers.
Definition at line 186 of file array_utils.h.
void gpu::randn(float *a, size_t N, std::mt19937 &gen, float mean = 0.0, float std = 1.0)
Populate the array with random floats, generated from a Gaussian distribution.
a: The array to populate.
N: The number of elements in the array.
gen: The random number generator.
mean: The mean of the Gaussian distribution.
std: The standard deviation of the Gaussian distribution.
Definition at line 202 of file array_utils.h.
void gpu::randn(std::array<float, size> &a, std::mt19937 &gen, float mean = 0.0, float std = 1.0)
Overload of randn() for std::array.
a: The array to populate.
gen: The random number generator.
mean: The mean of the Gaussian distribution.
std: The standard deviation of the Gaussian distribution.
Definition at line 218 of file array_utils.h.
void gpu::range(float *input, size_t N, float start = 0.0, float step = 1.0)
Populate the array with a range of values. This is mostly for testing purposes.
input: The array to populate.
N: The number of elements in the array.
start: The starting value.
step: The step size.
Definition at line 139 of file array_utils.h.
void gpu::range(std::array<float, N> &input, float start = 0.0, float step = 1.0)
Overload of range() for std::array.
input: The array to populate.
start: The starting value.
step: The step size.
Definition at line 155 of file array_utils.h.
Simple in-place string replacement helper function for substituting placeholders in a WGSL string template.
Note this is not meant to be used in performance-critical code paths; it should be used ahead of time, before any performance-critical codepath, to preprocess WGSL code strings.
[in] str: String to mutate with substitution replacements.
[in] from: Substring to replace
[in] to: Substring to replace with
Definition at line 256 of file gpu.h.
Overload of the string replacement helper function to replace multiple substrings in a string with multiple replacements.
[in] str: String to mutate with substitution replacements.
[in] reps: Vector of pairs of substrings to replace and their replacements.
Resets the command buffer in preparation for a kernel dispatch. Since command buffers are consumed upon submission, this function is used both in the initial kernel creation and every time the kernel is to be reused for a dispatch.
[in] device: WGPUDevice instance to manage the operation
[in] op: Kernel instance representing the kernel to reset
Definition at line 937 of file gpu.h.
void gpu::setLogLevel(int level)
Set the log level of the default logger.
level: The log level to set.
Definition at line 70 of file logging.h.
std::string gpu::show(const numtype *a, size_t rows, size_t cols, const std::string &name = "")
Show a 2D array as a string, base implementation.
a: The array to show.
rows: The number of rows in the array.
cols: The number of columns in the array.
name: The name of the array to show.
Definition at line 43 of file array_utils.h.
std::string gpu::show(const std::array<float, rows * cols> &a, const std::string &name = "")
Overload of show() for float std::array.
a: The array to show.
name: The name of the array to show.
Definition at line 126 of file array_utils.h.
std::string gpu::show(const std::array<numtype, rows * cols> &a, const std::string &name = "")
Overload of show() for std::array.
a: The array to show.
name: The name of the array to show.
Definition at line 108 of file array_utils.h.
Returns the number of elements in a tensor with the given shape, which is equal to the product of the dimensions.
[in] shape: Shape of the tensor
void gpu::toCPU(Context &ctx, Tensor &tensor, std::array<float, N> &data)
Overload of the toCPU function to copy data from a GPU buffer to CPU memory for an array of floats instead of a pointer to a float buffer.
Overload of the toCPU function to copy data from a GPU buffer to CPU but initializes a staging buffer and promise/future for the operation for you.
For simple use cases, this overload is recommended as it abstracts away the staging buffer and promise/future management. For more custom use cases where the staging buffer is initialized ahead of time, use the other overload.
[in] ctx: Context instance to manage the operation
[in] tensor: Tensor instance representing the GPU buffer to copy from
[in] bufferSize: Size of the data buffer in bytes
[out] data: Pointer to the CPU memory to copy the data to
Definition at line 834 of file gpu.h.
Copies data from a GPU buffer to CPU memory.
[in] ctx: Context instance to manage the operation
[in] tensor: Tensor instance representing the GPU buffer to copy from
[out] data: Pointer to the CPU memory to copy the data to
[in] bufferSize: Size of the data buffer in bytes
[in] op: CopyData instance (staging buffer and callback data) to manage the operation
Definition at line 789 of file gpu.h.
Overload of the toGPU function to copy data from CPU memory to the GPU, taking a Tensor instance instead of a WGPUBuffer instance.
Copies data from CPU memory to a GPU buffer. The toGPU overloads are effectively a convenience wrapper around the WebGPU API call wgpuQueueWriteBuffer.
[in] ctx: Context instance to manage the operation
[in] data: Pointer to the CPU memory to copy from
[in] buffer: WGPUBuffer instance representing the GPU buffer to copy to
[in] size: Size of the data buffer in bytes
Definition at line 915 of file gpu.h.
Converts Shape to string. The string formatting is meant to be slotted into WGSL code (hence no additional parentheses or brackets).
Transpose a matrix.
input: The input matrix.
output: The output matrix.
M: The number of rows in the input matrix.
N: The number of columns in the input matrix.
Definition at line 249 of file array_utils.h.
kShowMaxRows and kShowMaxCols: defined at lines 26 and 27 of file array_utils.h.