In welcher Situation würde der AVX2 Anweisungen schneller sammeln, als die Daten einzeln zu laden?

Question

Jul 15, 2014, 01:02 PM

optimization assembly vectorization avx2

In welcher Situation würde der AVX2 Anweisungen schneller sammeln, als die Daten einzeln zu laden?

Ich habe die Verwendung der neuen Sammelanweisungen des AVX2-Anweisungssatzes untersucht. Insbesondere habe ich mich für ein Benchmarking eines einfachen Problems entschieden, bei dem ein Gleitkomma-Array permutiert und einem anderen hinzugefügt wird. In c kann dies implementiert werden als

void vectortest(double * a,double * b,unsigned int * ind,unsigned int N)
{
    int i;
    for(i=0;i<N;++i)
    {
        a[i]+=b[ind[i]];
    }
}

Ich kompiliere diese Funktion mit g ++ -O3 -march = native. Jetzt implementiere ich dies in der Montage auf drei Arten. Der Einfachheit halber gehe ich davon aus, dass die Länge der Arrays N durch vier teilbar ist. Die einfache, nicht vektorisierte Implementierung:

align 4
global vectortest_asm
vectortest_asm:
        ;;  double * a = rdi                                                                                                                                                                                                                                   
        ;;  double * b = rsi                                                                                                                                                                                                                                   
        ;;  unsigned int * ind = rdx                                                                                                                                                                                                                           
        ;;  unsigned int N = rcx                                                                                                                                                                                                                               

        push rax
        xor rax,rax

loop:   sub rcx, 1
        mov eax, [rdx+rcx*4]    ;eax = ind[rcx]                                                                                                                                                                                                                
        vmovq xmm0, [rdi+rcx*8]         ;xmm0 = a[rcx]                                                                                                                                                                                                         
        vaddsd xmm0, [rsi+rax*8]        ;xmm1 += b[rax] ( and b[rax] = b[eax] = b[ind[rcx]])                                                                                                                                                                   
        vmovq [rdi+rcx*8], xmm0
        cmp rcx, 0
        jne loop

        pop rax

        ret

Die Schleife wird ohne die Anweisung gather vektorisiert:

loop:   sub rcx, 4

        mov eax,[rdx+rcx*4]    ;first load the values from array b to xmm1-xmm4
        vmovq xmm1,[rsi+rax*8]
        mov eax,[rdx+rcx*4+4]
        vmovq xmm2,[rsi+rax*8]

        mov eax,[rdx+rcx*4+8]
        vmovq xmm3,[rsi+rax*8]
        mov eax,[rdx+rcx*4+12]
        vmovq xmm4,[rsi+rax*8]

        vmovlhps xmm1,xmm2     ;now collect them all to ymm1
        vmovlhps xmm3,xmm4
        vinsertf128 ymm1,ymm1,xmm3,1

        vaddpd ymm1, ymm1, [rdi+rcx*8]
        vmovupd [rdi+rcx*8], ymm1

        cmp rcx, 0
        jne loop

Und zum Schluss eine Implementierung mit vgatherdpd:

loop:   sub rcx, 4               
        vmovdqu xmm2,[rdx+4*rcx]           ;load the offsets from array ind to xmm2
        vpcmpeqw ymm3,ymm3                 ;set ymm3 to all ones, since it acts as the mask in vgatherdpd                                                                                                                                                                 
        vgatherdpd ymm1,[rsi+8*xmm2],ymm3  ;now gather the elements from array b to ymm1

        vaddpd ymm1, ymm1, [rdi+rcx*8]
        vmovupd [rdi+rcx*8], ymm1

        cmp rcx, 0
        jne loop

Ich teste diese Funktionen auf einer Maschine mit einer Haswell-CPU (Xeon E3-1245 v3). Einige typische Ergebnisse sind (Zeiten in Sekunden):

Array length 100, function called 100000000 times.
Gcc version: 6.67439
Nonvectorized assembly implementation: 6.64713
Vectorized without gather: 4.88616
Vectorized with gather: 9.32949

Array length 1000, function called 10000000 times.
Gcc version: 5.48479
Nonvectorized assembly implementation: 5.56681
Vectorized without gather: 4.70103
Vectorized with gather: 8.94149

Array length 10000, function called 1000000 times.
Gcc version: 7.35433
Nonvectorized assembly implementation: 7.66528
Vectorized without gather: 7.92428
Vectorized with gather: 8.873

Die gcc- und die nonvectorized-Assembly-Version liegen sehr nahe beieinander. (Ich habe auch die Assembly-Ausgabe von gcc überprüft, die meiner handcodierten Version ziemlich ähnlich ist.) Die Vektorisierung bietet einige Vorteile für kleine Arrays, ist jedoch für große Arrays langsamer. Die große Überraschung (zumindest für mich) ist, dass die Version, die vgatherpdp verwendet, so langsam ist. Meine Frage ist also, warum? Mache ich hier etwas Dummes?Kann jemand ein Beispiel nennen, bei dem die Sammelanweisung tatsächlich einen Leistungsvorteil gegenüber der Ausführung mehrerer Ladevorgänge bietet? Wenn nicht, wozu dient dann eine solche Anweisung?

Der Testcode mit einem Makefile für g ++ und nasm steht unter zur Verfügunghttps://github.com/vanhala/vectortest.git für den Fall, dass jemand dies ausprobieren möchte.