Why is std::fill(0) slower than std::fill(1)?

Question

Mar 02, 2017, 04:04 PM

c++ compiler-optimization memset performance x86

I have observed on one system that std::fill on a large std::vector<int> was significantly and consistently slower when setting a constant value 0 than when setting a constant value 1 or a dynamic value:

5.8 GiB/s vs 7.5 GiB/s

However, the results are different for smaller data sizes, where fill(0) is faster.

With more than one thread, at a data size of 4 GiB, fill(1) shows a steeper slope but reaches a much lower peak than fill(0) (51 GiB/s vs 90 GiB/s).

This raises the secondary question of why the peak bandwidth of fill(1) is so much lower.

The test system for this was a dual-socket Intel Xeon E5-2680 v3 set to 2.5 GHz (via /sys/cpufreq) with 8x16 GiB DDR4-2133. I tested with GCC 6.1.0 (-O3) and the Intel compiler 17.0.1 (-fast); both give identical results. GOMP_CPU_AFFINITY=0,12,1,13,2,14,3,15,4,16,5,17,6,18,7,19,8,20,9,21,10,22,11,23 was set. STREAM add with 24 threads reaches 85 GiB/s on this system.

I was able to reproduce this effect on a different dual-socket Haswell server system, but not on any other architecture. For example, on Sandy Bridge EP the memory performance is identical, while in cache fill(0) is much faster.

Here is the code to reproduce:

#include <algorithm>
#include <cstdlib>
#include <iostream>
#include <omp.h>
#include <vector>

using value = int;
using vector = std::vector<value>;

constexpr size_t write_size = 8ll * 1024 * 1024 * 1024;
constexpr size_t max_data_size = 4ll * 1024 * 1024 * 1024;

void __attribute__((noinline)) fill0(vector& v) {
    std::fill(v.begin(), v.end(), 0);
}

void __attribute__((noinline)) fill1(vector& v) {
    std::fill(v.begin(), v.end(), 1);
}

void bench(size_t data_size, int nthreads) {
#pragma omp parallel num_threads(nthreads)
    {
        vector v(data_size / (sizeof(value) * nthreads));
        auto repeat = write_size / data_size;
#pragma omp barrier
        auto t0 = omp_get_wtime();
        for (auto r = 0; r < repeat; r++)
            fill0(v);
#pragma omp barrier
        auto t1 = omp_get_wtime();
        for (auto r = 0; r < repeat; r++)
            fill1(v);
#pragma omp barrier
        auto t2 = omp_get_wtime();
#pragma omp master
        std::cout << data_size << ", " << nthreads << ", " << write_size / (t1 - t0) << ", "
                  << write_size / (t2 - t1) << "\n";
    }
}

int main(int argc, const char* argv[]) {
    std::cout << "size,nthreads,fill0,fill1\n";
    for (size_t bytes = 1024; bytes <= max_data_size; bytes *= 2) {
        bench(bytes, 1);
    }
    for (size_t bytes = 1024; bytes <= max_data_size; bytes *= 2) {
        bench(bytes, omp_get_max_threads());
    }
    for (int nthreads = 1; nthreads <= omp_get_max_threads(); nthreads++) {
        bench(max_data_size, nthreads);
    }
}

The results presented were compiled with g++ fillbench.cpp -O3 -o fillbench_gcc -fopenmp.
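For anyone comparing their own numbers with the figures quoted in this question: the benchmark prints write_size / elapsed, which is bytes per second, while the figures are in GiB/s. Note also that data_size is the total across all threads, so the per-thread vector shrinks as the thread count grows. The helper names below are hypothetical (they are not part of the benchmark), and the sketch assumes a 4-byte int and a 64-bit size_t, as on the x86-64 systems described.

```cpp
#include <cassert>
#include <cstddef>

// 1 GiB = 2^30 bytes.
constexpr double bytes_per_gib = 1024.0 * 1024.0 * 1024.0;

// Convert a bytes-per-second value, as printed by the benchmark, to GiB/s.
inline double to_gib_per_s(double bytes_per_second) {
    return bytes_per_second / bytes_per_gib;
}

// Bandwidth in GiB/s from total bytes written and elapsed seconds.
inline double bandwidth_gib_per_s(std::size_t bytes_written, double seconds) {
    assert(seconds > 0.0);
    return to_gib_per_s(static_cast<double>(bytes_written) / seconds);
}

// Elements each thread fills; mirrors data_size / (sizeof(value) * nthreads)
// in bench(), with value assumed to be int.
inline std::size_t elements_per_thread(std::size_t data_size, std::size_t nthreads) {
    assert(nthreads > 0);
    return data_size / (sizeof(int) * nthreads);
}

// Number of fill passes per measurement; mirrors write_size / data_size.
inline std::size_t repeat_count(std::size_t write_size, std::size_t data_size) {
    assert(data_size > 0);
    return write_size / data_size;
}
```

With write_size at 8 GiB and data_size at 4 GiB, each measurement consists of only two passes over the data; at the smallest size of 1024 bytes it consists of 8388608 passes over a vector of 256 elements.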