Amount of local memory per CUDA thread

Question

Mar 02, 2015, 01:56 PM

I read in the NVIDIA documentation (http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#features-and-technical-specification, Table 12) that the maximum amount of local memory per thread for my GPU (GTX 580, compute capability 2.0) is 512 KB.

I tried, without success, to verify this limit on Linux with CUDA 6.5.

Here is the code I used (its only purpose is to test the local memory limit; it performs no useful computation):

#include <stdlib.h>  // exit
#include <stdio.h>

#define MEMSIZE 65000  // 65000 -> out of memory, 60000 -> ok

inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort=false)
{
    if (code != cudaSuccess) 
    {
        fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
        if( abort )
            exit(code);
    }
}

inline void gpuCheckKernelExecutionError( const char *file, int line)
{
    gpuAssert( cudaPeekAtLastError(), file, line);
    gpuAssert( cudaDeviceSynchronize(), file, line);    
}


__global__ void kernel_test_private(char *output)
{
    int c = blockIdx.x*blockDim.x + threadIdx.x; // absolute col
    int r = blockIdx.y*blockDim.y + threadIdx.y; // absolute row

    char tmp[MEMSIZE];
    for( int i = 0; i < MEMSIZE; i++)
        tmp[i] = 4*r + c; // dummy computation in local mem
    for( int i = 0; i < MEMSIZE; i++)
        output[i] = tmp[i];
}

int main( void)
{
    printf( "MEMSIZE=%d bytes.\n", MEMSIZE);

    // allocate memory
    char output[MEMSIZE];
    char *gpuOutput;
    cudaMalloc( (void**) &gpuOutput, MEMSIZE);

    // run kernel
    dim3 dimBlock( 1, 1);
    dim3 dimGrid( 1, 1);
    kernel_test_private<<<dimGrid, dimBlock>>>(gpuOutput);
    gpuCheckKernelExecutionError( __FILE__, __LINE__);

    // transfer data from GPU memory to CPU memory
    cudaMemcpy( output, gpuOutput, MEMSIZE, cudaMemcpyDeviceToHost);

    // release resources
    cudaFree(gpuOutput);
    cudaDeviceReset();

    return 0;
}

And the compile command line:

nvcc -o cuda_test_private_memory -Xptxas -v -O2 --compiler-options -Wall cuda_test_private_memory.cu

Compilation succeeds and reports:

ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function '_Z19kernel_test_privatePc' for 'sm_20'
ptxas info    : Function properties for _Z19kernel_test_privatePc
    65000 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 21 registers, 40 bytes cmem[0]

At run time on the GTX 580, I get an "out of memory" error once I reach 65000 bytes per thread. Here is the exact console output of the program:

MEMSIZE=65000 bytes.
GPUassert: out of memory cuda_test_private_memory.cu 48

I also ran the test on a GTX 770 (on Linux with CUDA 6.5). It ran without error for MEMSIZE = 200000, but the "out of memory" error occurred at run time for MEMSIZE = 250000.

How can this behavior be explained? Am I doing something wrong?
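
One hypothesis that could be checked against these numbers, offered as an assumption rather than confirmed behavior: the runtime may reserve local memory for every thread that could be resident on the device, not only for the single thread that is launched. The sketch below only does the arithmetic for that model. The helper name `estimated_local_reservation` is hypothetical, and the hardware figures (16 multiprocessors and 1536 resident threads per multiprocessor for a GTX 580) are assumptions that do not appear in the post; on a real system they could be read from `cudaGetDeviceProperties` (fields `multiProcessorCount` and `maxThreadsPerMultiProcessor`).

```cpp
#include <cstdint>

// Hypothetical helper, not part of the original test program.
// ASSUMPTION: the runtime backs the per-thread stack frame for every thread
// that could be resident on the device at once, so the reservation is
//   multiprocessor count * max resident threads per multiprocessor * bytes per thread.
constexpr std::uint64_t estimated_local_reservation(std::uint64_t sm_count,
                                                    std::uint64_t max_threads_per_sm,
                                                    std::uint64_t bytes_per_thread)
{
    return sm_count * max_threads_per_sm * bytes_per_thread;
}

// ASSUMPTION: GTX 580 figures (16 multiprocessors, 1536 resident threads each).
constexpr std::uint64_t kSmCount = 16;
constexpr std::uint64_t kMaxThreadsPerSm = 1536;

// Compile-time checks of the arithmetic for the two MEMSIZE values in the post.
static_assert(estimated_local_reservation(kSmCount, kMaxThreadsPerSm, 60000)
                  == 1474560000ULL, "60000 bytes per thread");
static_assert(estimated_local_reservation(kSmCount, kMaxThreadsPerSm, 65000)
                  == 1597440000ULL, "65000 bytes per thread");
```

Under this assumption, 60000 bytes per thread would come to about 1406 MiB and 65000 bytes to about 1523 MiB. If the card has 1536 MiB of memory (the post does not say), the second value would leave almost no room for anything else, which would be consistent with the observed failure. Comparing the estimate with the free memory reported by `cudaMemGetInfo` before the launch would be one way to test the hypothesis.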