|On this page…|
The purpose of GPU computing in MATLAB is to speed up your applications. This topic discusses fundamental concepts and practices that can help you achieve better performance on the GPU, such as the configuration of the GPU hardware and best practices within your code. It discusses the trade-off between implementation difficulty and performance, and describes the criteria you might use to choose between using gpuArray functions, arrayfun, MEX-files, or CUDA kernels. Finally, it describes how to accurately measure performance on the GPU.
When converting MATLAB code to run on the GPU, it is best to start with MATLAB code that already performs well. While the GPU and CPU have different performance characteristics, the general guidelines for writing good MATLAB code also help you write good MATLAB code for the GPU. The first step is almost always to profile your CPU code. The lines of code that the profiler shows taking the most time on the CPU will likely be ones that you must concentrate on when you code for the GPU.
It is easiest to start converting your code using MATLAB built-in functions that support gpuArray data. These functions take gpuArray inputs, perform calculations on the GPU, and return gpuArray outputs. A list of the MATLAB functions that support gpuArray data is found in Run Built-In Functions on a GPU. In general these functions support the same arguments and data types as standard MATLAB functions that are calculated in the CPU. Any limitations in these overloaded functions for gpuArrays are described in their command-line help (e.g., help gpuArray/qr).
If all the functions that you want to use are supported on the GPU, running code on the GPU may be as simple as calling gpuArray to transfer input data to the GPU, and calling gather to retrieve the output data from the GPU when finished. In many cases, you might need to vectorize your code, replacing looped scalar operations with MATLAB matrix and vector operations. While vectorizing is generally a good practice on the CPU, it is usually critical for achieving high performance on the GPU. For more information, see Vectorize for Improved GPU Performance.
It is possible that even after converting inputs to gpuArrays and vectorizing your code, there are operations in your algorithm that are either not built-in functions, or that are not fast enough to meet your application's requirements. In such situations you have three main options: use arrayfun to precompile element-wise parts of your application, make use of GPU library functions, or write a custom CUDA kernel.
If you have a purely element-wise function, you can improve its performance by calling it with arrayfun. The arrayfun function on the GPU turns an element-wise MATLAB function into a custom CUDA kernel, thus reducing the overhead of performing the operation. Often, there is a subset of your application that can be used with arrayfun even if the entire application cannot be. The example Improve Performance of Element-wise MATLAB Functions on the GPU using ARRAYFUN shows the basic concepts of this approach; and the example Using ARRAYFUN for Monte-Carlo Simulations shows how this can be done in simulations for a finance application.
MATLAB provides an extensive library of GPU-enabled functions in Parallel Computing Toolbox™, Image Processing Toolbox™, Signal Processing Toolbox™, and other products. However, there are many libraries of additional functions that do not have direct built-in analogs in MATLAB's GPU support. Examples include the NVIDIA Performance Primitives library and the CURAND library, which are included in the CUDA toolkit that ships with MATLAB. If you need to call a function in one of these libraries, you can do so using the GPU MEX interface. This interface allows you to extract the pointers to the device data from MATLAB gpuArrays so that you can pass these pointers to GPU functions. You can convert the returned values into gpuArrays for return to MATLAB. For more information see Run MEX-Functions Containing CUDA Code.
Finally, you have the option of writing a custom CUDA kernel for the operation that you need. Such kernels can be directly integrated into MATLAB using the CUDAKernel object.
The example Illustrating Three Approaches to GPU Computing: The Mandelbrot Set shows how to implement a simple calculation using three of the approaches mentioned in this section. This example begins with MATLAB code that is easily converted to run on the GPU, rewrites the code to use arrayfun for element-wise operations, and finally shows how to integrate a custom CUDA kernel for the same operation.
Alternately, you can write a CUDA kernel as part of a MEX-file and call it using the CUDA Runtime API inside the MEX-file. Either of these approaches might let you work with low-level features of the GPU, such as shared memory and texture memory, that are not directly available in MATLAB code. For more details, see the example Accessing Advanced CUDA Features Using MEX.
In general you can achieve the best performance when your GPU is dedicated to computing. It is usually not practical to use the same GPU device for both computations and graphics, because of the amount of memory taken up for problems of reasonable size and the constant use of the device by the system for graphics. If possible, obtain a separate device for graphics. Details of configuring your device for compute or graphics depend on the operating system and driver version.
On Windows® systems, a GPU device can be in one of two modes: Windows Display Driver Model (WDDM) or Tesla Compute Cluster (TCC) mode. For best performance, any devices used for computing should be in TCC mode. Consult NVIDIA documentation for more details.
NVIDIA's highest-performance compute devices, the Tesla line, support error correcting codes (ECC) when reading and writing GPU memory. The purpose of ECC is to correct for occasional bit-errors that occur normally when reading or writing dynamic memory. One technique to improve performance is to turn off ECC to increase the achievable memory bandwidth. While the hardware can be configured this way, MathWorks does not recommend this practice. The potential loss of accuracy due to silent errors can be more harmful than the performance benefit.
This topic describes general techniques that help you achieve better performance on the GPU. Some of these tips apply when writing MATLAB code for the CPU as well.
Data in MATLAB arrays is stored in column-major order. Therefore, it is beneficial to operate along the first or column dimension of your array. If one dimension of your data is significantly longer than others, you might achieve better performance if you make that the first dimension. Similarly, if you frequently operate along a particular dimension, it is usually best to have it as the first dimension. In some cases, if consecutive operations target different dimensions of an array, it might be beneficial to transpose or permute the array between these operations.
GPUs achieve high performance by calculating many results in parallel. Thus, matrix and higher-dimensional array operations typically perform much better than operations on vectors or scalars. You can achieve better performance by rewriting your loops to make use of higher-dimensional operations. The process of revising loop-based, scalar-oriented code to use MATLAB matrix and vector operations is called vectorization. For more details, see Using Vectorization.
By default, all operations in MATLAB are performed in double-precision floating-point arithmetic. However, most operations support a variety of data types, including integer and single-precision floating-point. Today's GPUs and CPUs typically have much higher throughput when performing single-precision operations, and single-precision floating-point data occupies less memory. If your application's accuracy requirements allow the use of single-precision floating-point, it can greatly improve the performance of your MATLAB code.
The GPU sits at the end of a data transfer mechanism known as the PCI bus. While this bus is an efficient, high-bandwidth way to transfer data from the PC host memory to various extension cards, it is still much slower than the overall bandwidth to the global memory of the GPU device or of the CPU (for more details, see the example Measuring GPU Performance). In addition, transfers from the GPU device to MATLAB host memory cause MATLAB to wait for all pending operations on the device to complete before executing any other statements. This can significantly hurt the performance of your application. In general, you should limit the number of times you transfer data between the MATLAB workspace and the GPU. If you can transfer data to the GPU once at the start of your application, perform all the calculations you can on the GPU, and then transfer the results back into MATLAB at the end, that generally results in the best performance. Similarly, it helps to create data directly on the GPU, either using static methods of gpuArray (e.g., gpuArray.zeros) or the 'like' syntax of functions such as zeros (e.g., zeros(N,'like',g) for a gpuArray g).
The best way to measure performance on the GPU is to use gputimeit. This function takes as input a function handle with no input arguments, and returns the measured execution time of that function. It takes care of such benchmarking considerations as repeating the timed operation to get better resolution, executing the function before measurement to avoid initialization overhead, and subtracting out the overhead of the timing function. Also, gputimeit ensures that all operations on the GPU have completed before the final timing.
For example, consider measuring the time taken to compute the lu factorization of a random matrix A of size N. You can do this by defining a function that does the lu factorization and passing the function handle to gputimeit:
A = gpuArray.rand(N); fh = @() lu(A); gputimeit(fh,2); % 2nd arg indicates number of outputs
If you cannot use gputimeit, you can measure performance with tic and toc. However, to get accurate timing on the GPU, you must wait for operations to complete before calling toc(). There are two ways to do this. You can call gather on the final GPU output before calling toc(): this forces all computations to complete before the time measurement is taken. Alternately, you can use the wait() function, which takes a gpuDevice as its input. For example, if you wanted to measure the time taken to compute the lu factorization of a matrix A using tic, toc, and wait, you can do it as follows:
gd = gpuDevice(); tic(); [l,u] = lu(A); wait(gd); tLU = toc();
Treat with caution any results from the MATLAB profiler when GPU operations are involved. The profiler shows only the time spent by the CPU and does not indicate execution time on the GPU. The best way to tell what is happening when profiling GPU code is to place a wait call after each GPU operation or each section of interest in the code. Typically, the wait appears to take a significant amount of time. The time taken by the wait is actually the execution time of the GPU operations that occur prior to the wait in the program.
This example shows you how to improve performance by running a function on the GPU instead of the CPU, and by vectorizing the calculations.
Consider a function that performs fast convolution on the columns of a matrix. Fast convolution, which is a common operation in signal processing applications, transforms each column of data from the time domain to the frequency domain, multiplies it by the transform of a filter vector, transforms back to the time domain, and stores the result in an output matrix.
function y = fastConvolution(data,filter) [m,n] = size(data); % Zero-pad filter to the column length of data, and transform filter_f = fft(filter,m); % Create an array of zeros of the same size and class as data y = zeros(m,n,'like',data); % Transform each column of data for ix = 1:n af = fft(data(:,ix)); y(:,ix) = ifft(af .* filter_f); end end
Execute this function in the CPU on data of a particular size, and measure the execution time using the MATLAB timeit function. The timeit function takes care of common benchmarking considerations, such as accounting for startup and overhead.
a = complex(randn(4096,100),randn(4096,100)); % Data input b = randn(16,1); % Filter input c = fastConvolution(a,b); % Calculate output ctime = timeit(@()fastConvolution(a,b)); % Measure CPU time disp(['Execution time on CPU = ',num2str(ctime)]);
On a sample machine, this code displays the output:
Execution time on CPU = 0.019335
Now execute this function on the GPU. You can do this easily by changing the input data to be gpuArrays rather than normal MATLAB arrays. The 'like' syntax used when creating the output inside the function ensures that y will be a gpuArray if data is a gpuArray.
ga = gpuArray(a); % Move data to GPU gb = gpuArray(b); % Move filter to GPU gc = fastConvolution(ga,gb); % Calculate on GPU gtime = gputimeit(@()fastConvolution(ga,gb)); % Measure GPU time gerr = max(max(abs(gather(gc)-c))); % Calculate error disp(['Execution time on GPU = ',num2str(gtime)]); disp(['Maximum absolute error = ',num2str(gerr)]);
On the same machine, this code displays the output:
Execution time on CPU = 0.019335 Execution time on GPU = 0.027235 Maximum absolute error = 1.1374e-14
Unfortunately, the GPU is slower than the CPU for this problem. The reason is that the for-loop is executing the FFT, multiplication, and inverse FFT operations on individual columns of length 4096. The best way to increase the performance is to vectorize the code, so that a single MATLAB function call performs more calculation. The FFT and IFFT operations are easy to vectorize: fft(A) computes the FFT of each column of a matrix A. You can perform a multiply of the filter with every column in a matrix at once using the MATLAB binary scalar expansion function bsxfun. The vectorized function looks like this:
function y = fastConvolution_v2(data,filter) m = size(data,1); % Zero-pad filter to the length of data, and transform filter_f = fft(filter,m); % Transform each column of the input af = fft(data); % Multiply each column by filter and compute inverse transform y = ifft(bsxfun(@times,af,filter_f)); end
Perform the same experiment using the vectorized function:
a = complex(randn(4096,100),randn(4096,100)); % Data input b = randn(16,1); % Filter input c = fastConvolution_v2(a,b); % Calculate output ctime = timeit(@()fastConvolution_v2(a,b)); % Measure CPU time disp(['Execution time on CPU = ',num2str(ctime)]); ga = gpuArray(a); % Move data to GPU gb = gpuArray(b); % Move filter to GPU gc = fastConvolution_v2(ga, gb); % Calculate on GPU gtime = gputimeit(@()fastConvolution_v2(ga,gb));% Measure GPU time gerr = max(max(abs(gather(gc)-c))); % Calculate error disp(['Execution time on GPU = ',num2str(gtime)]); disp(['Maximum absolute error = ',num2str(gerr)]);
Execution time on CPU = 0.010393 Execution time on GPU = 0.0020537 Maximum absolute error = 1.1374e-14
In conclusion, vectorizing the code helps both the CPU and GPU versions to run faster. However, vectorization helps the GPU version much more than the CPU. The improved CPU version is nearly twice as fast as the original; the improved GPU version is 13 times faster than the original. The GPU code went from being 40% slower than the CPU in the original version, to about five times faster in the revised version.