大主宰天蚕土豆,盗墓笔记小说全集,有声读物

C.1. Introduction

Cooperative Groups 是 CUDA 9 中引入的 CUDA 編程模型的擴展，用于組織通信線程組。協作組允許開發人員表達線程通信的粒度，幫助他們表達更豐富、更有效的并行分解。

從歷史上看，CUDA 編程模型為同步協作線程提供了一個單一、簡單的構造：線程塊的所有線程之間的屏障，如使用__syncthreads()內部函數實現的那樣。但是，程序員希望以其他粒度定義和同步線程組，以“集體”組范圍功能接口的形式實現更高的性能、設計靈活性和軟件重用。為了表達更廣泛的并行交互模式，許多面向性能的程序員已經求助于編寫自己的臨時和不安全的原語來同步單個 warp 中的線程，或者跨運行在單個 GPU 上的線程塊集。雖然實現的性能改進通常很有價值，但這導致了越來越多的脆弱代碼集合，隨著時間的推移和跨 GPU 架構的不同，這些代碼的編寫、調整和維護成本很高。合作組通過提供安全且面向未來的機制來啟用高性能代碼來解決這個問題。

C.2. What’s New in CUDA 11.0

使用網格范圍的組不再需要單獨編譯，并且同步該組的速度現在提高了30%。此外，我們在最新的 Windows 平臺上啟用了協作啟動，并在 MPS 下運行時增加了對它們的支持。
grid_group現在可以轉換為thread_group。
線程塊切片和合并組的新集合：reduce和memcpy_async。
線程塊切片和合并組的新分區操作：labeled_pa??rtition和binary_partition。
新的 API，meta_group_rank和meta_group_size，它們提供有關導致創建該組的分區的信息。
線程塊tile現在可以在類型中編碼其父級，這允許對發出的代碼進行更好的編譯時優化。
接口更改：grid_group必須在聲明時使用this_grid()構造。默認構造函數被刪除。

注意：在此版本中，我們正朝著要求 C++11 提供新功能的方向發展。在未來的版本中，所有現有 API 都需要這樣做。

C.3. Programming Model Concept

協作組編程模型描述了 CUDA 線程塊內和跨線程塊的同步模式。它為應用程序提供了定義它們自己的線程組的方法，以及同步它們的接口。它還提供了強制執行某些限制的新啟動 API，因此可以保證同步正常工作。這些原語在 CUDA 內啟用了新的協作并行模式，包括生產者-消費者并行、機會并行和整個網格的全局同步。

合作組編程模型由以下元素組成：

表示協作線程組的數據類型；
獲取由 CUDA 啟動 API 定義的隱式組的操作（例如，線程塊）；
將現有群體劃分為新群體的集體；
用于數據移動和操作的集體算法（例如memcpy_async、reduce、scan）；
同步組內所有線程的操作；
檢查組屬性的操作；
公開低級別、特定于組且通常是硬件加速的操作的集合。

協作組中的主要概念是對象命名作為其中一部分的線程集的對象。這種將組表示為一等程序對象的方式改進了軟件組合，因為集合函數可以接收表示參與線程組的顯式對象。該對象還明確了程序員的意圖，從而消除了不合理的架構假設，這些假設會導致代碼脆弱、對編譯器優化的不良限制以及與新一代 GPU 的更好兼容性。

為了編寫高效的代碼，最好使用專門的組（通用會失去很多編譯時優化），并通過引用打算以某種協作方式使用這些線程的函數來傳遞這些組對象。

合作組需要 CUDA 9.0 或更高版本。要使用合作組，請包含頭文件：

// Primary header is compatible with pre-C++11, collective algorithm headers require C++11
#include 
// Optionally include for memcpy_async() collective
#include 
// Optionally include for reduce() collective
#include 
// Optionally include for inclusive_scan() and exclusive_scan() collectives
#include

并使用合作組命名空間：

using namespace cooperative_groups;
// Alternatively use an alias to avoid polluting the namespace with collective algorithms
namespace cg = cooperative_groups;

可以使用 nvcc 以正常方式編譯代碼，但是如果您希望使用memcpy_async、reduce或scan功能并且您的主機編譯器的默認不是 C++11 或更高版本，那么您必須添加--std=c++11到命令行。

C.3.1. Composition Example

為了說明組的概念，此示例嘗試執行塊范圍的求和。以前，編寫此代碼時對實現存在隱藏的約束：

__device__ int sum(int *x, int n) {
    // ...
    __syncthreads();
    return total;
}

__global__ void parallel_kernel(float *x) {
    // ...
    // Entire thread block must call sum
    sum(x, n);
}

線程塊中的所有線程都必須到達__syncthreads()屏障，但是，對于可能想要使用sum(...)的開發人員來說，這個約束是隱藏的。對于合作組，更好的編寫方式是：

__device__ int sum(const thread_block& g, int *x, int n) {
    // ...
    g.sync()
    return total;
}

__global__ void parallel_kernel(...) {
    // ...
    // Entire thread block must call sum
    thread_block tb = this_thread_block();
    sum(tb, x, n);
    // ...
}

C.4. Group Types

C.4.1. Implicit Groups

隱式組代表內核的啟動配置。不管你的內核是如何編寫的，它總是有一定數量的線程、塊和塊尺寸、單個網格和網格尺寸。另外，如果使用多設備協同啟動API，它可以有多個網格（每個設備一個網格）。這些組為分解為更細粒度的組提供了起點，這些組通常是硬件加速的，并且更專門針對開發人員正在解決的問題。

盡管您可以在代碼中的任何位置創建隱式組，但這樣做很危險。為隱式組創建句柄是一項集體操作——組中的所有線程都必須參與。如果組是在并非所有線程都到達的條件分支中創建的，則可能導致死鎖或數據損壞。出于這個原因，建議您預先為隱式組創建一個句柄（盡可能早，在任何分支發生之前）并在整個內核中使用該句柄。出于同樣的原因，必須在聲明時初始化組句柄（沒有默認構造函數），并且不鼓勵復制構造它們。

C.4.1.1. Thread Block Group

任何 CUDA 程序員都已經熟悉某一組線程：線程塊。 Cooperative Groups 擴展引入了一個新的數據類型thread_block，以在內核中明確表示這個概念。

class thread_block;

thread_block g = this_thread_block();

公開成員函數：

示例:

/// Loading an integer from global into shared memory
__global__ void kernel(int *globalInput) {
    __shared__ int x;
    thread_block g = this_thread_block();
    // Choose a leader in the thread block
    if (g.thread_rank() == 0) {
        // load from global into shared for all threads to work with
        x = (*globalInput);
    }
    // After loading data into shared memory, you want to synchronize
    // if all threads in your thread block need to see it
    g.sync(); // equivalent to __syncthreads();
}

注意：組中的所有線程都必須參與集體操作，否則行為未定義。

相關：thread_block數據類型派生自更通用的thread_group數據類型，可用于表示更廣泛的組類。

C.4.1.2. Grid Group

該組對象表示在單個網格中啟動的所有線程。除了sync()之外的 API 始終可用，但要能夠跨網格同步，您需要使用協作啟動 API。

class grid_group;
grid_group g = this_grid();

公開成員函數：

C.4.1.3. Multi Grid Group

該組對象表示跨設備協作組啟動的所有設備啟動的所有線程。與grid.group不同，所有 API 都要求您使用適當的啟動 API。

class multi_grid_group;

通過一下方式構建:

// Kernel must be launched with the cooperative multi-device API
multi_grid_group g = this_multi_grid();

公開成員函數：

C.4.2. Explicit Groups

C.4.2.1. Thread Block Tile

tile組的模板版本，其中模板參數用于指定tile的大小 – 在編譯時已知這一點，有可能實現更優化的執行。

template 
class thread_block_tile;

通過以下構建:

template 
_CG_QUALIFIER thread_block_tile tiled_partition(const ParentT& g),>

Size必須是 2 的冪且小于或等于 32。

ParentT是從其中劃分該組的父類型。它是自動推斷的，但是 void 的值會將此信息存儲在組句柄中而不是類型中。

公開成員函數：

注意：

shfl、shfl_up、shfl_down 和 shfl_xor函數在使用 C++11 或更高版本編譯時接受任何類型的對象。這意味著只要滿足以下約束，就可以對非整數類型進行shuffle ：

符合普通可復制的條件，即is_trivially_copyable::value == true
sizeof(T) <= 32

示例:

/// The following code will create two sets of tiled groups, of size 32 and 4 respectively:
/// The latter has the provenance encoded in the type, while the first stores it in the handle
thread_block block = this_thread_block();
thread_block_tile<32> tile32 = tiled_partition<32>(block);
thread_block_tile<4, thread_block> tile4 = tiled_partition<4>(block);

注意：這里使用的是 thread_block_tile 模板化數據結構，并且組的大小作為模板參數而不是參數傳遞給 tiled_partition 調用。

C.4.2.1.1. Warp-Synchronous Code Pattern

開發人員可能擁有他們之前對 warp 大小做出隱含假設并圍繞該數字進行編碼的 warp 同步代碼。現在這需要明確指定。

__global__ void cooperative_kernel(...) {
    // obtain default "current thread block" group
    thread_block my_block = this_thread_block();

    // subdivide into 32-thread, tiled subgroups
    // Tiled subgroups evenly partition a parent group into
    // adjacent sets of threads - in this case each one warp in size
    auto my_tile = tiled_partition<32>(my_block);

    // This operation will be performed by only the
    // first 32-thread tile of each block
    if (my_tile.meta_group_rank() == 0) {
        // ...
        my_tile.sync();
    }
}

C.4.2.1.2. Single thread group

可以從 this_thread 函數中獲取代表當前線程的組：

thread_block_tile<1> this_thread();

以下memcpy_asyncAPI 使用thread_group將int元素從源復制到目標：

#include 
#include 

cooperative_groups::memcpy_async(cooperative_groups::this_thread(), dest, src, sizeof(int));

可以在使用?cuda::pipeline的單階段異步數據拷貝和使用?cuda::pipeline的多階段異步數據拷貝部分中找到使用this_thread執行異步復制的更詳細示例。

C.4.2.1.3. Thread Block Tile of size larger than 32

使用cooperative_groups::experimental命名空間中的新API 可以獲得大小為64、128、256 或512的thread_block_tile。要使用它，_CG_ABI_EXPERIMENTAL必須在源代碼中定義。在分區之前，必須為thread_block_tile保留少量內存。這可以使用必須駐留在共享或全局內存中的cooperative_groups::experimental::block_tile_memory結構模板來完成。

template 
struct block_tile_memory;

TileCommunicationSize確定為集體操作保留多少內存。如果對大于指定通信大小的大小類型執行此類操作，則集合可能涉及多次傳輸并需要更長的時間才能完成。

MaxBlockSize指定當前線程塊中的最大線程數。此參數可用于最小化僅以較小線程數啟動的內核中block_tile_memory的共享內存使用量。

然后這個block_tile_memory需要被傳遞到cooperative_groups::experimental::this_thread_block，允許將生成的thread_block劃分為大小大于32的tile。this_thread_block接受block_tile_memory參數的重載是一個集體操作，必須與所有線程一起調用線程塊。返回的線程塊可以使用experimental::tiled_partition函數模板進行分區，該模板接受與常規tiled_partition相同的參數。

#define _CG_ABI_EXPERIMENTAL // enable experimental API

__global__ void cooperative_kernel(...) {
    // reserve shared memory for thread_block_tile usage.
    __shared__ experimental::block_tile_memory<4, 256> shared;
    thread_block thb = experimental::this_thread_block(shared);

    auto tile = experimental::tiled_partition<128>(thb);

    // ...
}

公開成員函數:

C.4.2.2. Coalesced Groups

在 CUDA 的 SIMT 架構中，在硬件級別，多處理器以 32 個一組的線程執行線程，稱為 warp。如果應用程序代碼中存在依賴于數據的條件分支，使得 warp 中的線程發散，那么 warp 會串行執行每個分支，禁用不在該路徑上的線程。在路徑上保持活動的線程稱為合并。協作組具有發現和創建包含所有合并線程的組的功能。

通過coalesced_threads()構造組句柄是伺機的(opportunistic)。它在那個時間點返回一組活動線程，并且不保證返回哪些線程（只要它們是活動的）或者它們在整個執行過程中保持合并（它們將被重新組合在一起以執行一個集合，但之后可以再次發散）。

class coalesced_group;

通過以下重構:

coalesced_group active = coalesced_threads();

公開成員函數:

注意：`shfl、shfl_up 和 shfl_down`函數在使用 C++11 或更高版本編譯時接受任何類型的對象。這意味著只要滿足以下約束，就可以對非整數類型進行洗牌：

符合普通可復制的條件，即is_trivially_copyable::value == true
sizeof(T) <= 32

示例:

/// Consider a situation whereby there is a branch in the
/// code in which only the 2nd, 4th and 8th threads in each warp are
/// active. The coalesced_threads() call, placed in that branch, will create (for each
/// warp) a group, active, that has three threads (with
/// ranks 0-2 inclusive).
__global__ void kernel(int *globalInput) {
    // Lets say globalInput says that threads 2, 4, 8 should handle the data
    if (threadIdx.x == *globalInput) {
        coalesced_group active = coalesced_threads();
        // active contains 0-2 inclusive
        active.sync();
    }
}

C.4.2.2.1. Discovery Pattern

通常，開發人員需要使用當前活動的線程集。不對存在的線程做任何假設，而是開發人員使用碰巧存在的線程。這可以在以下“在warp中跨線程聚合原子增量”示例中看到（使用正確的 CUDA 9.0 內在函數集編寫）：

{
    unsigned int writemask = __activemask();
    unsigned int total = __popc(writemask);
    unsigned int prefix = __popc(writemask & __lanemask_lt());
    // Find the lowest-numbered active lane
    int elected_lane = __ffs(writemask) - 1;
    int base_offset = 0;
    if (prefix == 0) {
        base_offset = atomicAdd(p, total);
    }
    base_offset = __shfl_sync(writemask, base_offset, elected_lane);
    int thread_offset = prefix + base_offset;
    return thread_offset;
}

這可以用Cooperative Groups重寫如下：

{
    cg::coalesced_group g = cg::coalesced_threads();
    int prev;
    if (g.thread_rank() == 0) {
        prev = atomicAdd(p, g.num_threads());
    }
    prev = g.thread_rank() + g.shfl(prev, 0);
    return prev;
}

C.5. Group Partitioning

C.5.1. tiled_partition

template 
thread_block_tile tiled_partition(const ParentT& g);

thread_group tiled_partition(const thread_group& parent, unsigned int tilesz);,>

tiled_partition方法是一種集體操作，它將父組劃分為一維、行主序的子組平鋪。總共將創建((size(parent)/tilesz)子組，因此父組大小必須能被Size整除。允許的父組是thread_block或thread_block_tile。

該實現可能導致調用線程在恢復執行之前等待，直到父組的所有成員都調用了該操作。功能僅限于本地硬件大小，1/2/4/8/16/32和cg::size(parent)必須大于size參數。cooperative_groups::experimental命名空間的實驗版本支持64/128/256/512大小。

Codegen 要求：計算能力 3.5 最低，C++11 用于大于 32 的size

示例:

/// The following code will create a 32-thread tile
thread_block block = this_thread_block();
thread_block_tile<32> tile32 = tiled_partition<32>(block);

我們可以將這些組中的每一個分成更小的組，每個組的大小為 4 個線程：

auto tile4 = tiled_partition<4>(tile32);
// or using a general group
// thread_group tile4 = tiled_partition(tile32, 4);

例如，如果我們要包含以下代碼行：

if (tile4.thread_rank()==0) printf(“Hello from tile4 rank 0\n”);

那么該語句將由塊中的每四個線程打?。好總€ tile4 組中排名為 0 的線程，它們對應于塊組中排名為 0、4、8、12.. 的那些線程。

C.5.2. labeled_partition

coalesced_group labeled_partition(const coalesced_group& g, int label);
template 
coalesced_group labeled_partition(const thread_block_tile& g, int label);

labeled_partition方法是一種集體操作，它將父組劃分為一維子組，線程在這些子組中合并。該實現將評估條件標簽并將具有相同標簽值的線程分配到同一組中。

該實現可能會導致調用線程在恢復執行之前等待直到父組的所有成員都調用了該操作。

注意：此功能仍在評估中，將來可能會略有變化。

Codegen 要求：計算能力 7.0 最低，C++11

C.5.3. binary_partition

coalesced_group binary_partition(const coalesced_group& g, bool pred);
template 
coalesced_group binary_partition(const thread_block_tile& g, bool pred);

binary_partition()方法是一種集體操作，它將父組劃分為一維子組，線程在其中合并。該實現將評估predicate并將具有相同值的線程分配到同一組中。這是labeled_partition()的一種特殊形式，其中label只能是0 或1。

該實現可能會導致調用線程在恢復執行之前等待直到父組的所有成員都調用了該操作。

注意：此功能仍在評估中，將來可能會略有變化。

Codegen 要求：計算能力 7.0 最低，C++11

示例:

/// This example divides a 32-sized tile into a group with odd
/// numbers and a group with even numbers
_global__ void oddEven(int *inputArr) {
    cg::thread_block cta = cg::this_thread_block();
    cg::thread_block_tile<32> tile32 = cg::tiled_partition<32>(cta);

    // inputArr contains random integers
    int elem = inputArr[cta.thread_rank()];
    // after this, tile32 is split into 2 groups,
    // a subtile where elem&1 is true and one where its false
    auto subtile = cg::binary_partition(tile32, (elem & 1));
}

C.6. Group Collectives

C.6.1. Synchronization

C.6.1.1. sync

cooperative_groups::sync(T& group);

sync同步組中指定的線程。T可以是任何現有的組類型，因為它們都支持同步。如果組是grid_group或multi_grid_group，則內核必須已使用適當的協作啟動 API 啟動。

C.6.2. Data Transfer

C.6.2.1. memcpy_async

memcpy_async是一個組范圍的集體memcpy，它利用硬件加速支持從全局到共享內存的非阻塞內存事務。給定組中命名的一組線程，memcpy_async將通過單個管道階段傳輸指定數量的字節或輸入類型的元素。此外，為了在使用memcpy_asyncAPI 時獲得最佳性能，共享內存和全局內存都需要 16 字節對齊。需要注意的是，雖然在一般情況下這是一個memcpy，但只有當源(source)是全局內存而目標是共享內存并且兩者都可以通過 16、8 或 4 字節對齊來尋址時，它才是異步的。異步復制的數據只能在調用wait或wait_prior之后讀取，這表明相應階段已完成將數據移動到共享內存。

必須等待所有未完成的請求可能會失去一些靈活性（但會變得簡單）。為了有效地重疊數據傳輸和執行，重要的是能夠在等待和操作請求N時啟動N+1 memcpy_async請求。為此，請使用memcpy_async并使用基于集體階段的wait_priorAPI 等待它.有關詳細信息，請參閱wait 和 wait_prior。

用法1:

template 
void memcpy_async(
  const TyGroup &group,
  TyElem *__restrict__ _dst,
  const TyElem *__restrict__ _src,
  const TyShape &shape
);

執行shape字節的拷貝

用法2:

template 
void memcpy_async(
  const TyGroup &group,
  TyElem *__restrict__ dst,
  const TyDstLayout &dstLayout,
  const TyElem *__restrict__ src,
  const TySrcLayout &srcLayout
);

執行min(dstLayout, srcLayout)元素的拷貝。如果布局的類型為cuda::aligned_size_t，則兩者必須指定相同的對齊方式。

勘誤表

CUDA 11.1 中引入的具有 src 和 dst 輸入布局的memcpy_asyncAPI 期望布局以元素而不是字節形式提供。元素類型是從TyElem推斷出來的，大小為sizeof(TyElem)。如果使用cuda::aligned_size_t類型作為布局，指定的元素個數乘以sizeof(TyElem)必須是 N 的倍數，建議使用std::byte或char作為元素類型。

如果副本的指定形狀或布局是cuda::aligned_size_t類型，則將保證至少為min(16, N)。在這種情況下，dst 和 src 指針都需要與 N 個字節對齊，并且復制的字節數需要是 N 的倍數。

Codegen 要求：最低計算能力 3.5，異步計算能力 8.0，C++11

需要包含collaborative_groups/memcpy_async.h頭文件。

示例:

/// This example streams elementsPerThreadBlock worth of data from global memory
/// into a limited sized shared memory (elementsInShared) block to operate on.
#include 
#include 

namespace cg = cooperative_groups;

__global__ void kernel(int* global_data) {
    cg::thread_block tb = cg::this_thread_block();
    const size_t elementsPerThreadBlock = 16 * 1024;
    const size_t elementsInShared = 128;
    __shared__ int local_smem[elementsInShared];

    size_t copy_count;
    size_t index = 0;
    while (index < elementsPerThreadBlock) {
        cg::memcpy_async(tb, local_smem, elementsInShared, global_data + index, elementsPerThreadBlock - index);
        copy_count = min(elementsInShared, elementsPerThreadBlock - index);
        cg::wait(tb);
        // Work with local_smem
        index += copy_count;
    }
}

C.6.2.2. wait and wait_prior

template 
void wait(TyGroup & group);

template 
void wair_prior(TyGroup & group);

wait和wait_prior集合同步指定的線程和線程塊，直到所有未完成的memcpy_async請求（在等待的情況下）或第一個NumStages（在 wait_prior 的情況下）完成。

Codegen 要求：最低計算能力 3.5，異步計算能力 8.0，C++11

需要包含collaborative_groups/memcpy_async.h 頭文件。

示例:

/// This example streams elementsPerThreadBlock worth of data from global memory
/// into a limited sized shared memory (elementsInShared) block to operate on in
/// multiple (two) stages. As stage N is kicked off, we can wait on and operate on stage N-1.
#include 
#include 

namespace cg = cooperative_groups;

__global__ void kernel(int* global_data) {
    cg::thread_block tb = cg::this_thread_block();
    const size_t elementsPerThreadBlock = 16 * 1024 + 64;
    const size_t elementsInShared = 128;
    __align__(16) __shared__ int local_smem[2][elementsInShared];
    int stage = 0;
    // First kick off an extra request
    size_t copy_count = elementsInShared;
    size_t index = copy_count;
    cg::memcpy_async(tb, local_smem[stage], elementsInShared, global_data, elementsPerThreadBlock - index);
    while (index < elementsPerThreadBlock) {
        // Now we kick off the next request...
        cg::memcpy_async(tb, local_smem[stage ^ 1], elementsInShared, global_data + index, elementsPerThreadBlock - index);
        // ... but we wait on the one before it
        cg::wait_prior<1>(tb);

        // Its now available and we can work with local_smem[stage] here
        // (...)
        //

        // Calculate the amount fo data that was actually copied, for the next iteration.
        copy_count = min(elementsInShared, elementsPerThreadBlock - index);
        index += copy_count;

        // A cg::sync(tb) might be needed here depending on whether
        // the work done with local_smem[stage] can release threads to race ahead or not
        // Wrap to the next stage
        stage ^= 1;
    }
    cg::wait(tb);
    // The last local_smem[stage] can be handled here

C.6.3. Data manipulation

C.6.3.1. reduce

template 
auto reduce(const TyGroup& group, TyArg&& val, TyOp&& op) -> decltype(op(val, val));

reduce對傳入的組中指定的每個線程提供的數據執行歸約操作。這利用硬件加速（在計算 80 及更高的設備上）進行算術加法、最小或最大操作以及邏輯 AND、OR、或 XOR，以及在老一代硬件上提供軟件替代支持(fallback)。只有 4B 類型由硬件加速。

group：有效的組類型是coalesced_group和thread_block_tile。

val：滿足以下要求的任何類型：

符合普通可復制的條件，即is_trivially_copyable::value == true
sizeof(TyArg) <= 32
對給定的函數對象具有合適的算術或比較運算符。

op：將提供具有整數類型的硬件加速的有效函數對象是plus()、less()、greater()、bit_and()、bit_xor()、bit_or()。這些必須構造，因此需要TyVal模板參數，即plus()。Reduce還支持可以使用operator()調用的lambda和其他函數對象

Codegen 要求：計算能力 3.5 最低，計算能力 8.0 用于硬件加速，C++11。

需要包含collaborative_groups/reduce.h 頭文件。

示例:

#include 
#include 
namespace cg=cooperative_groups;

/// The following example accepts input in *A and outputs a result into *sum
/// It spreads the data within the block, one element per thread
#define blocksz 256
__global__ void block_reduce(const int *A, int *sum) {
    __shared__ int reduction_s[blocksz];

    cg::thread_block cta = cg::this_thread_block();
    cg::thread_block_tile<32> tile = cg::tiled_partition<32>(cta);

    const int tid = cta.thread_rank();
    int beta = A[tid];
    // reduce across the tile
    // cg::plus allows cg::reduce() to know it can use hardware acceleration for addition
    reduction_s[tid] = cg::reduce(tile, beta, cg::plus());
    // synchronize the block so all data is ready
    cg::sync(cta);
    // single leader accumulates the result
    if (cta.thread_rank() == 0) {
        beta = 0;
        for (int i = 0; i < blocksz; i += tile.num_threads()) {
            beta += reduction_s[i];
        }
        sum[blockIdx.x] = beta;
    }

C.6.3.2. Reduce Operators

下面是一些可以用reduce完成的基本操作的函數對象的原型

namespace cooperative_groups {
  template 
  struct cg::plus;

  template 
  struct cg::less;

  template 
  struct cg::greater;

  template 
  struct cg::bit_and;

  template 
  struct cg::bit_xor;

  template 
  struct cg::bit_or;
}

Reduce僅限于在編譯時可用于實現的信息。因此，為了利用 CC 8.0 中引入的內在函數，cg::命名空間公開了幾個鏡像硬件的功能對象。這些對象看起來與 C++ STL 中呈現的對象相似，除了less/greater。與 STL 有任何差異的原因在于，這些函數對象旨在實際反映硬件內聯函數的操作。

功能說明：

cg::plus：接受兩個值并使用operator +返回兩者之和。
cg::less: 接受兩個值并使用operator 返回較小的值。這不同之處在于返回較低的值而不是布爾值。


	
		cg::greater：接受兩個值并使用operator <返回較大的值。 這不同之處在于返回更大的值而不是布爾值。
	
		cg::bit_and：接受兩個值并返回operator &的結果。
	
		cg::bit_xor：接受兩個值并返回operator ^的結果。
	
		cg::bit_or：接受兩個值并返回operator |的結果。

示例:


{
    // cg::plus is specialized within cg::reduce and calls __reduce_add_sync(...) on CC 8.0+
    cg::reduce(tile, (int)val, cg::plus());

    // cg::plus fails to match with an accelerator and instead performs a standard shuffle based reduction
    cg::reduce(tile, (float)val, cg::plus());

    // While individual components of a vector are supported, reduce will not use hardware intrinsics for the following
    // It will also be necessary to define a corresponding operator for vector and any custom types that may be used
    int4 vec = {...};
    cg::reduce(tile, vec, cg::plus())

    // Finally lambdas and other function objects cannot be inspected for dispatch
    // and will instead perform shuffle based reductions using the provided function object.
    cg::reduce(tile, (int)val, [](int l, int r) -> int {return l + r;});
}

	C.6.3.3. inclusive_scan and exclusive_scan
template 
auto inclusive_scan(const TyGroup& group, TyVal&& val, TyFn&& op) -> decltype(op(val, val));

template 
TyVal inclusive_scan(const TyGroup& group, TyVal&& val);

template 
auto exclusive_scan(const TyGroup& group, TyVal&& val, TyFn&& op) -> decltype(op(val, val));

template 
TyVal exclusive_scan(const TyGroup& group, TyVal&& val);

	inclusive_scan和exclusive_scan對傳入組中指定的每個線程提供的數據執行掃描操作。在exclusive_scan的情況下，每個線程的結果是減少thread_rank低于該線程的線程的數據。inclusive_scan的結果還包括調用線程中的歸約數據。

	group：有效的組類型是coalesced_group和thread_block_tile。

	val：滿足以下要求的任何類型：

		符合普通可復制的條件，即is_trivially_copyable::value == true
	
		sizeof(TyArg) <= 32
	
		對給定的函數對象具有合適的算術或比較運算符。

	op：為了方便而定義的函數對象有reduce Operators中描述的plus()、less()、greater()、bit_and()、bit_xor()、bit_or()。這些必須構造，因此需要TyVal模板參數，即plus()。inclusive_scan和exclusive_scan還支持可以使用operator()調用的lambdas和其他函數對象

	Codegen 要求：計算能力 3.5 最低，C++11。

	需要包含collaborative_groups/scan.h 頭文件。

	示例:
#include 
#include 
#include 
namespace cg = cooperative_groups;

__global__ void kernel() {
    auto thread_block = cg::this_thread_block();
    auto tile = cg::tiled_partition<8>(thread_block);
    unsigned int val = cg::inclusive_scan(tile, tile.thread_rank());
    printf("%u: %u\n", tile.thread_rank(), val);
}

/*  prints for each group:
    0: 0
    1: 1
    2: 3
    3: 6
    4: 10
    5: 15
    6: 21
    7: 28
*/

	使用 Exclusive_scan 進行動態緩沖區空間分配的示例：
#include 
#include 
namespace cg = cooperative_groups;

// Buffer partitioning is static to make the example easier to follow,
// but any arbitrary dynamic allocation scheme can be implemented by replacing this function.
__device__ int calculate_buffer_space_needed(cg::thread_block_tile<32>& tile) {
    return tile.thread_rank() % 2 + 1;
}

__device__ int my_thread_data(int i) {
    return i;
}

__global__ void kernel() {
    __shared__ int buffer_used;
    extern __shared__ int buffer[];
    auto thread_block = cg::this_thread_block();
    auto tile = cg::tiled_partition<32>(thread_block);

    buffer_used = 0;
    thread_block.sync();

    // each thread calculates buffer size it needs and its offset within the allocation
    int buf_needed = calculate_buffer_space_needed(tile);
    int buf_offset = cg::exclusive_scan(tile, buf_needed);

    // last thread in the tile allocates buffer space with an atomic operation
    int alloc_offset = 0;
    if (tile.thread_rank() == tile.num_threads() - 1) {
        alloc_offset = atomicAdd(&buffer_used, buf_offset + buf_needed);
    }
    // that thread shares the allocation start with other threads in the tile
    alloc_offset = tile.shfl(alloc_offset, tile.num_threads() - 1);
    buf_offset += alloc_offset;

    // each thread fill its part of the buffer with thread specific data
    for (int i = 0 ; i < buf_needed ; ++i) {
        buffer[buf_offset + i] = my_thread_data(i);
    }

    // buffer is {0, 0, 1, 0, 0, 1 ...};
}

	C.7. Grid Synchronization

	在引入協作組(Cooperative Groups)之前，CUDA 編程模型只允許在內核完成邊界的線程塊之間進行同步。內核邊界帶有隱含的狀態失效，以及潛在的性能影響。

	例如，在某些用例中，應用程序具有大量小內核，每個內核代表處理pipeline中的一個階段。當前的 CUDA 編程模型需要這些內核的存在，以確保在一個pipeline階段上運行的線程塊在下一個pipeline階段上運行的線程塊準備好使用數據之前產生數據。在這種情況下，提供全局線程間塊同步的能力將允許將應用程序重組為具有持久線程塊，當給定階段完成時，這些線程塊能夠在設備上同步。

	要從內核中跨網格同步，您只需使用grid.sync()功能：
grid_group grid = this_grid();
grid.sync();

	并且在啟動內核時，有必要使用cudaLaunchCooperativeKernelCUDA 運行時啟動 API 或 CUDA 驅動程序等價物，而不是 <<<…>>> 執行配置語法。

	例子：

	為了保證線程塊在 GPU 上的共同駐留，需要仔細考慮啟動的塊數。 例如，可以按如下方式啟動與 SM 一樣多的塊：
int device = 0;
cudaDeviceProp deviceProp;
cudaGetDeviceProperties(&deviceProp, dev);
// initialize, then launch
cudaLaunchCooperativeKernel((void*)my_kernel, deviceProp.multiProcessorCount, numThreads, args);

	或者，您可以通過使用占用計算器(occupancy calculator)計算每個 SM 可以同時容納多少塊來最大化暴露的并行度，如下所示：
/// This will launch a grid that can maximally fill the GPU, on the default stream with kernel arguments
int numBlocksPerSm = 0;
 // Number of threads my_kernel will be launched with
int numThreads = 128;
cudaDeviceProp deviceProp;
cudaGetDeviceProperties(&deviceProp, dev);
cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocksPerSm, my_kernel, numThreads, 0);
// launch
void *kernelArgs[] = { /* add kernel args */ };
dim3 dimBlock(numThreads, 1, 1);
dim3 dimGrid(deviceProp.multiProcessorCount*numBlocksPerSm, 1, 1);
cudaLaunchCooperativeKernel((void*)my_kernel, dimGrid, dimBlock, kernelArgs);

	最好先通過查詢設備屬性cudaDevAttrCooperativeLaunch來確保設備支持協作啟動：
int dev = 0;
int supportsCoopLaunch = 0;
cudaDeviceGetAttribute(&supportsCoopLaunch, cudaDevAttrCooperativeLaunch, dev);

	如果設備 0 支持該屬性，則將supportsCoopLaunch設置為 1。僅支持計算能力為 6.0 及更高版本的設備。 此外，您需要在以下任何一個上運行：

		沒有 MPS 的 Linux 平臺
	
		具有 MPS 和計算能力 7.0 或更高版本的設備上的 Linux 平臺
	
		最新的 Windows 平臺

	C.8. Multi-Device Synchronization

	為了通過協作組啟用跨多個設備的同步，需要使用cudaLaunchCooperativeKernelMultiDeviceCUDA API。這與現有的 CUDA API 有很大不同，它將允許單個主機線程跨多個設備啟動內核。除了cudaLaunchCooperativeKernel做出的約束和保證之外，這個 API 還具有額外的語義：

		此 API 將確保啟動是原子的，即如果 API 調用成功，則提供的線程塊數將在所有指定設備上啟動。
	
		通過此 API 啟動的功能必須相同。驅動程序在這方面沒有進行明確的檢查，因為這在很大程度上是不可行的。由應用程序來確保這一點。
	
		提供的cudaLaunchParams中沒有兩個條目可以映射到同一設備。
	
		本次發布所針對的所有設備都必須具有相同的計算能力——主要版本和次要版本。
	
		每個網格的塊大小、網格大小和共享內存量在所有設備上必須相同。請注意，這意味著每個設備可以啟動的最大塊數將受到 SM 數量最少的設備的限制。
	
		擁有正在啟動的 CUfunction 的模塊中存在的任何用戶定義的device、constant或managed設備全局變量都在每個設備上獨立實例化。用戶負責適當地初始化此類設備全局變量。

	棄用通知：cudaLaunchCooperativeKernelMultiDevice 已在 CUDA 11.3 中針對所有設備棄用。在多設備共軛梯度樣本中可以找到替代方法的示例。

	多設備同步的最佳性能是通過cuCtxEnablePeerAccess或cudaDeviceEnablePeerAccess為所有參與設備啟用對等訪問來實現的。

	啟動參數應使用結構數組（每個設備一個）定義，并使用cudaLaunchCooperativeKernelMultiDevice啟動

	Example:
cudaDeviceProp deviceProp;
cudaGetDeviceCount(&numGpus);

// Per device launch parameters
cudaLaunchParams *launchParams = (cudaLaunchParams*)malloc(sizeof(cudaLaunchParams) * numGpus);
cudaStream_t *streams = (cudaStream_t*)malloc(sizeof(cudaStream_t) * numGpus);

// The kernel arguments are copied over during launch
// Its also possible to have individual copies of kernel arguments per device, but
// the signature and name of the function/kernel must be the same.
void *kernelArgs[] = { /* Add kernel arguments */ };

for (int i = 0; i < numGpus; i++) {
    cudaSetDevice(i);
    // Per device stream, but its also possible to use the default NULL stream of each device
    cudaStreamCreate(&streams[i]);
    // Loop over other devices and cudaDeviceEnablePeerAccess to get a faster barrier implementation
}
// Since all devices must be of the same compute capability and have the same launch configuration
// it is sufficient to query device 0 here
cudaGetDeviceProperties(&deviceProp[i], 0);
dim3 dimBlock(numThreads, 1, 1);
dim3 dimGrid(deviceProp.multiProcessorCount, 1, 1);
for (int i = 0; i < numGpus; i++) {
    launchParamsList[i].func = (void*)my_kernel;
    launchParamsList[i].gridDim = dimGrid;
    launchParamsList[i].blockDim = dimBlock;
    launchParamsList[i].sharedMem = 0;
    launchParamsList[i].stream = streams[i];
    launchParamsList[i].args = kernelArgs;
}
cudaLaunchCooperativeKernelMultiDevice(launchParams, numGpus);

	此外，與網格范圍的同步一樣，生成的設備代碼看起來非常相似：
multi_grid_group multi_grid = this_multi_grid();
multi_grid.sync();

	但是，需要通過將-rdc=true傳遞給 nvcc 來單獨編譯代碼。

	最好先通過查詢設備屬性cudaDevAttrCooperativeMultiDeviceLaunch來確保設備支持多設備協作啟動：
int dev = 0;
int supportsMdCoopLaunch = 0;
cudaDeviceGetAttribute(&supportsMdCoopLaunch, cudaDevAttrCooperativeMultiDeviceLaunch, dev);

	如果設備 0 支持該屬性，則將 supportsMdCoopLaunch 設置為 1。僅支持計算能力為 6.0 及更高版本的設備。 此外，您需要在 Linux 平臺（無 MPS）或當前版本的 Windows 上運行，并且設備處于 TCC 模式。

	關于作者

	Ken He 是 NVIDIA 企業級開發者社區經理 & 高級講師，擁有多年的 GPU 和人工智能開發經驗。自 2017 年加入 NVIDIA 開發者社區以來，完成過上百場培訓，幫助上萬個開發者了解人工智能和 GPU 編程開發。在計算機視覺，高性能計算領域完成過多個獨立項目。并且，在機器人和無人機領域，有過豐富的研發經驗。對于圖像識別，目標的檢測與跟蹤完成過多種解決方案。曾經參與 GPU 版氣象模式GRAPES，是其主要研發者。

	審核編輯：郭婷


                    
                        聲明：本文內容及配圖由入駐作者撰寫或者入駐合作網站授權轉載。文章觀點僅代表作者本人，不代表電子發燒友網立場。文章及其配圖僅供工程師學習之用，如有內容侵權或者其他違規問題，請聯系本站處理。
                        舉報投訴
                    
                    
                            
                                gpu
                                
                                    
                                        
                                            
                                                gpu
                                            
                                            +關注
                                        
                                        
                                        
                                            
                                                關注
                                                28
                                            
                                            
                                                文章
                                                4761
                                            
                                            
                                                瀏覽量
                                                129141
                                            
                                        
                                    
                                
                            

                                API
                                
                                    
                                        
                                            
                                                API
                                            
                                            +關注
                                        
                                        
                                        
                                            
                                                關注
                                                2
                                            
                                            
                                                文章
                                                1507
                                            
                                            
                                                瀏覽量
                                                62219
                                            
                                        
                                    
                                
                            

                                CUDA
                                
                                    
                                        
                                            
                                                CUDA
                                            
                                            +關注
                                        
                                        
                                        
                                            
                                                關注
                                                0
                                            
                                            
                                                文章
                                                121
                                            
                                            
                                                瀏覽量
                                                13648


                
                    
                        收藏
                        人收藏

                        
                            
                                
                                
                                    
                                        掃一掃，分享給好友
                                        
                                        復制鏈接分享
                                    
                                
                            
                        
                    
                    
                    
                

                
                
                    評論
                


                
                                            
                        發布評論請先 登錄
                
                
                
                    相關推薦
                                                
    
        
            GPU編程的平臺模型、執行模型、內存模型及編程模型
    
    GPU編程--OpenCL四大模型
    
                發表于 04-29 07:40        
                    

    

                            
    
        
            并行編程模型有什么優勢
    
    并行編程模型是并行計算，尤其是并行軟件的基礎，也是并行硬件系統的導向，在面臨多核新挑戰的情況下，什么樣的并行編程模型在未來能成為主流，還很難說。至少到目前，還處于百家爭鳴的時代，很多
    
                發表于 07-11 08:03        
                    

    

                            
    
        
            “協作機器人”如何快速處理傳感器數據
    
    無論是傳統的工業機器人系統，還是當今最先進的協作機器人（Cobot），它們都要依靠可生成大量高度可變數據的傳感器。這些數據有助于構建更佳的機器學習（ML）和人工智能（AI）模型。而機器人依靠這些模型變得“自主”，可在動態的現實環
    
                發表于 08-05 07:08        
                    

    

                            
    
        
            深度融合模型的特點
    
    深度融合模型的特點，背景深度學習模型在訓練完成之后，部署并應用在生產環境的這一步至關重要，畢竟訓練出來的模型不能只接受一些公開數據集和榜單的檢驗，還需要在真正的業務場景下創造價值，不能
    
                發表于 07-16 06:08        
                    

    

                            
    
        
            ARM體系結構與編程模型
    
    ARM體系結構與編程模型
              
    
                發表于 12-20 14:47        
                ?35次下載    

    

                            
    
        
            協作對等組的接入控制和聯合授權機制
    
    本文提出P2P 環境中協作對等組的接入控制機制。在完全非結構化的對等組中，本文采用分布式委托授權機制，以解決集中式授權的單點失效問題。本文基于信任管理語言RT，提出一
    
                發表于 08-17 08:39        
                ?7次下載    

    

                            
    
        
            基于Agent技術的決策模型協作問題研究
    
    本文通過對模型進行Agent封裝，以及模型之間的協作，將決策算法選擇和算法中系數的確定問題通過評價模型的協作來解決，改變了傳統決策者憑經驗選
    
                發表于 09-01 10:54        
                ?12次下載    

    

                            
    
        
            計算機輔助審計的多Agent協作模型研究
    
    文章提出了一種基于計算機輔助審計的多Agent 系統模型，分析了該模型中各模塊的組成特點，并著重探討了系統中各Agent 的功能、Agent 之間的協作關系以及審計Agent 的內部結
    
                發表于 09-23 10:52        
                ?4次下載    

    

                            
    
        
            多移動agent協作規劃模型
    
    以agent負載能耗均衡度和網絡總能耗為指標構建多移動agent協作規劃模型，為了盡可能延長網絡生存周期，給出基于網絡覆蓋率的節點休眠機制，在滿足WSN網絡覆蓋率要求的同時，采用較少節點處于工作狀態
    
                發表于 01-17 10:03        
                ?0次下載    

    

                            
    
        
            如何使用云霧協作模型進行任務分配詳細方法說明
    
    針對在云 霧協作下實現移動用戶任務請求的合理分配與調度的問題，提出了一種基于云霧協作模型的任務分配算法一IGA。 首先，采用混合編碼的方式對個體進行編碼，并采用隨機的方式產生初始種群;其次設定服務商
    
                發表于 04-22 16:48        
                ?7次下載    

    

                        
                            
    
        
            如何使用云霧協作模型實現任務分配的方法說明
    
    針對在云霧協作下實現移動用戶任務請求的合理分配與調度的問題，提出了一種基于云霧協作模型的任務分配算法——IGA.首先，采用混合編碼的方式對個體進行編碼，并采用隨機的方式產生初始種群;其次設定服務商
    
                發表于 03-03 15:34        
                ?14次下載    

    

                        
                            
    
        
            CUDA簡介: CUDA編程模型概述
    
    在 CUDA 編程模型中，線程是進行計算或內存操作的最低抽象級別。 從基于 NVIDIA Ampere GPU 架構的設備開始，CUDA 編程模型通過異步
    
        
            	
                                    發表于 04-20 17:16        
        ?3025次閱讀            

    

                        
                            
    
        
            介紹CUDA編程模型及CUDA線程體系
    
    CUDA 編程模型主要有三個關鍵抽象：層級的線程組，共享內存和柵同步（barrier synchronization）。
    
        
            	
                                    發表于 05-19 11:32        
        ?1934次閱讀            

    

                        
                            
    
        
            SCP線程模型特點
    
    。 SCP線程模型的特點： ?軟實時調度。 ? 支持具有等優先級線程的單線程和多線程環境（無搶占）。 ? 支持協作調度，如符合CMSIS的RTX RTOS。 ? 不支持多處理器。 ? 通過框架定義的線程API獨立于直接RTOS調
    
        
            	
                                    發表于 11-02 17:07        
        ?560次閱讀            

    

                        
                            
    
        
            接口芯片的編程模型方法是什么
    
    接口芯片的編程模型方法是一個復雜的話題，涉及到硬件設計、軟件編程、通信協議等多個方面。 1. 接口芯片概述 接口芯片是用來連接不同硬件設備或系統的一種集成電路。它們可以是通用的，如USB、HDMI
    
        
            	
                                    發表于 09-30 11:30        
        ?261次閱讀