【CUDA】Warp shuffleの__shfl()サンプル

warp shuffleは，kepler世代のcc3.x以上から使える，
shared memoryを用いずに，warp内のthread間で値を交換することができる機能です．

GPGPUでは，shared memoryをいじるのが当然なのですが，それをせずにさらに高速化することができるということで，使えるようになっておきたい機能です．

関数は4つ用意されていて，

__shfl__
__shfl_up
__shfl_down
__shfl_xor

ですが，programming guideにもそれぞれの簡単なサンプルコードが載っていないので，少し書いてみました．

__global__ void shfl_test(int arg){
	int laneId = threadIdx.x & 0x1f;
	int value;

	// laneId:0x00
	if(laneId == 0){
		value = arg; 
	}else{
		value = 0;
	}

	// dst: laneId:0x1f(32)
	// src: laneId:0x00(0)
	if(laneId == 3){
		value = __shfl(value, 0);
		printf("thread:%d, value: %d\n", threadIdx.x, value);
	}

	printf("thread:%d, value: %d\n", threadIdx.x, value);
}

int main(){
	shfl_test<<<1, 32>>>(1234);
	cudaDeviceSynchronize();
	return 0;
}

残念ながらこのコードはうまく動きません．
CUDA programming guideによると，

The __shfl() intrinsics permit exchanging of a variable between threads within a warp without use of shared memory. The exchange occurs simultaneously for all active threads within the warp, ...
and
Threads may only read data from another thread which is actively participating in the __shfl() command. If the target thread is inactive, the retrieved value is undefined.

warp shuffleでは，shuffleコマンドに参加しているアクティブなスレッドからしか，値を読むことができないのです．
laneIdが3のスレッド以外のスレッドはアクティブでないので，読み込むことができません．

if文の分岐をなくして、以下のようにすると、全スレッドが参加することができて動きます。

value = __shfl(value, 0);
printf("thread:%d, value: %d\n", threadIdx.x, value);

これがwarp内のスレッド全てに値をコピーするbroadcastのサンプルです。
しかしこういう制約があると、使いづらいような。。

warp shuffleに関しては，
https://people.maths.ox.ac.uk/gilesm/cuda/lecs/lec4.pdf
の資料が参考になります．