GPU计算
通用图形处理器(GPGPU)
GPGPU
图形处理器(graphics processing unit,GPU))使显卡减少了对中央处理器的依赖,并分担了部分原本是由中央处理器所担当的工作,尤其是在进行三维绘图运算时,功效更加明显。
通用图形处理器(General-purpose computing on graphics processing units,GPGPU),利用处理图形任务的图形处理器来计算原本由中央处理器处理的通用计算任务,这些通用计算常常与图形处理没有任何关系。
https://en.wikipedia.org/wiki/General-purpose_computing_on_graphics_processing_units
GPU methods
- Map
The map operation simply applies the given function (the kernel) to every element in the stream. A simple example is multiplying each value in the stream by a constant (increasing the brightness of an image). The map operation is simple to implement on the GPU. The programmer generates a fragment for each pixel on screen and applies a fragment program to each one. The result stream of the same size is stored in the output buffer.
- Reduce
Some computations require calculating a smaller stream (possibly a stream of only 1 element) from a larger stream. This is called a reduction of the stream. Generally, a reduction can be performed in multiple steps. The results from the prior step are used as the input for the current step and the range over which the operation is applied is reduced until only one stream element remains.
- Stream filtering
Stream filtering is essentially a non-uniform reduction. Filtering involves removing items from the stream based on some criteria.
- Scan
The scan operation, also termed parallel prefix sum, takes in a vector (stream) of data elements and an (arbitrary) associative binary function '+' with an identity element 'i'. If the input is [a0, a1, a2, a3, ...], an exclusive scan produces the output [i, a0, a0 + a1, a0 + a1 + a2, ...], while an inclusive scan produces the output [a0, a0 + a1, a0 + a1 + a2, a0 + a1 + a2 + a3, ...] and does not require an identity to exist. While at first glance the operation may seem inherently serial, efficient parallel scan algorithms are possible and have been implemented on graphics processing units. The scan operation has uses in e.g., quicksort and sparse matrix-vector multiplication.
- Scatter
The scatter operation is most naturally defined on the vertex processor. The vertex processor is able to adjust the position of the vertex, which allows the programmer to control where information is deposited on the grid. Other extensions are also possible, such as controlling how large an area the vertex affects.
The fragment processor cannot perform a direct scatter operation because the location of each fragment on the grid is fixed at the time of the fragment's creation and cannot be altered by the programmer. However, a logical scatter operation may sometimes be recast or implemented with another gather step. A scatter implementation would first emit both an output value and an output address. An immediately following gather operation uses address comparisons to see whether the output value maps to the current output slot.
In dedicated compute kernels, scatter can be performed by indexed writes.
- Gather
Gather is the reverse of scatter, after scatter reorders elements according to a map, gather can restore the order of the elements according to the map scatter used. In dedicated compute kernels, gather may be performed by indexed reads. In other shaders, it is performed with texture-lookups.
- Sort
The sort operation transforms an unordered set of elements into an ordered set of elements. The most common implementation on GPUs is using radix sort for integer and floating point data and coarse-grained merge sort and fine-grained sorting networks for general comparable data.
- Search
The search operation allows the programmer to find a given element within the stream, or possibly find neighbors of a specified element. The GPU is not used to speed up the search for an individual element, but instead is used to run multiple searches in parallel.[citation needed] Mostly the search method used is binary search on sorted elements.