Método eficiente para contar casos abiertos al momento de la presentación de cada caso en un conjunto de datos grande

Question

Dec 12, 2015, 10:39 PM

performance dplyr r vectorization data.table

Método eficiente para contar casos abiertos al momento de la presentación de cada caso en un conjunto de datos grande

En un conjunto de datos grande (~ 1M casos), cada caso tiene un "creado" y un "censurado"dateTime. Quiero contar el número de otros casos que estaban abiertos en el momento en que se creó cada caso. Los casos están abiertos entre su "creado" y "censurado"dataTimes.

Varias soluciones funcionan bien en conjuntos de datos pequeños (<100,000 casos), pero el tiempo de cálculo crece exponencialmente. Mi estimación es que el tiempo de cálculo aumenta como una función 3n ^ 2. Por n = 100,000 casos, el tiempo de cálculo es> 20 minutos en mi servidor con núcleos de 6 * 4GHz y 64GB de RAM. Incluso con bibliotecas multinúcleo, en el mejor de los casos, podría reducir el tiempo en un factor de 8 o 10. No es suficiente para manejar ~ 1M de casos.

Estoy buscando un método más eficiente para hacer este cálculo. A continuación, he proporcionado una función que le permite crear fácilmente grandes cantidades de "creado" y "censurado"dateTime pares junto con dos soluciones probadas hasta ahora, usandodplyr ydata.table bibliotecas Los tiempos se informan al usuario por simplicidad. Simplemente puede cambiar la variable "CASE_COUNT" en la parte superior para volver a ejecutar y ver los tiempos nuevamente y comparar fácilmente el tiempo de otras soluciones que tenga que sugerir.

Actualizaré la publicación original con otras soluciones para dar el crédito adecuado a sus autores. Gracias de antemano por su ayuda con esto!

# Load libraries used in this example
library(dplyr);
library(data.table);
# Not on CRAN. See: http://bioconductor.org/packages/release/bioc/html/IRanges.html
library(IRanges);

# Set seed for reproducibility 
set.seed(123)

# Set number of cases & date range variables
CASE_COUNT  <<- 1000;
RANGE_START <- as.POSIXct("2000-01-01 00:00:00", 
                          format="%Y-%m-%d %H:%M:%S", 
                          tz="UTC", origin="1970-01-01");
RANGE_END   <- as.POSIXct("2012-01-01 00:00:00", 
                          format="%Y-%m-%d %H:%M:%S", 
                          tz="UTC", origin="1970-01-01");

# Select which solutions you want to run in this test           
RUN_SOLUTION_1 <- TRUE;     # dplyr::summarize() + comparisons
RUN_SOLUTION_2 <- TRUE;     # data.table:foverlaps()
RUN_SOLUTION_3 <- TRUE;     # data.table aggregation + comparisons
RUN_SOLUTION_4 <- TRUE;     # IRanges::IRanges + countOverlaps()
RUN_SOLUTION_5 <- TRUE;     # data.table::frank()

# Function to generate random creation & censor dateTime pairs
# The censor time always has to be after the creation time
# Credit to @DirkEddelbuettel for this smart function
# (https://stackoverflow.com/users/143305/dirk-eddelbuettel)

generate_cases_table <- function(n = CASE_COUNT, start_val=RANGE_START, end_val=RANGE_END) {
    # Measure duration between start_val & end_val
    duration <- as.numeric(difftime(end_val, start_val, unit="secs"));

    # Select random values in duration to create start_offset
    start_offset   <- runif(n, 0, duration);

    # Calculate the creation time list
    created_list  <- start_offset + start_val;

    # Calculate acceptable time range for censored values
    # since they must always be after their respective creation value
    censored_range <- as.numeric(difftime(RANGE_END, created_list, unit="secs"));

    # Select random values in duration to create end_offset
    creation_to_censored_times <- runif(n, 0, censored_range);

    censored_list <- created_list + creation_to_censored_times;

    # Create and return a data.table with creation & censor values
    # calculated from start or end with random offsets
    return_table  <- data.table(id       = 1:n,
                                created  = created_list,
                                censored = censored_list);

    return(return_table);
}

# Create the data table with the desired number of cases specified by CASE_COUNT above
cases_table <- generate_cases_table();

solution_1_function <- function (cases_table) { 
    # SOLUTION 1: Using dplyr::summarize:

    # Group by id to set parameters for summarize() function 
    cases_table_grouped <- group_by(cases_table, id);

    # Count the instances where other cases were created before
    # and censored after each case using vectorized sum() within summarize()

    cases_table_summary <- summarize(cases_table_grouped, 
                           open_cases_at_creation = sum((cases_table$created  < created & 
                                                         cases_table$censored > created)));
    solution_1_table <<- as.data.table(cases_table_summary, key="id");        
} # End solution_1_function

solution_2_function <- function (cases_table) {
    # SOLUTION 2: Using data.table::foverlaps:

    # Adapted from solution provided by @Davidarenburg
    # (https://stackoverflow.com/users/3001626/david-arenburg)

    # The foverlaps() solution tends to crash R with large case counts
    # I suspect it has to do with memory assignment of the very large objects
    # It maxes RAM on my system (64GB) before crashing, possibly attempting
    # to write beyond its assigned memory limits.
    # I'll submit a reproduceable bug to the data.table team since
    # foverlaps() is pretty new and known to be occasionally unstable

    if (CASE_COUNT > 50000) {
        stop("The foverlaps() solution tends to crash R with large case counts. Not running.");
    }

    setDT(cases_table)[, created_dupe := created];
    setkey(cases_table, created, censored);

    foverlaps_table  <- foverlaps(cases_table[,c("id","created","created_dupe"), with=FALSE],
                                  cases_table[,c("id","created","censored"),    with=FALSE], 
                                  by.x=c("created","created_dupe"))[order(i.id),.N-1,by=i.id];

    foverlaps_table  <- dplyr::rename(foverlaps_table, id=i.id, open_cases_at_creation=V1);

    solution_2_table <<- as.data.table(foverlaps_table, key="id");
} # End solution_2_function

solution_3_function <- function (cases_table) {    
    # SOLUTION 3: Using data.table aggregation instead of dplyr::summarize

    # Idea suggested by @jangorecki
    # (https://stackoverflow.com/users/2490497/jangorecki)

    # Count the instances where other cases were created before
    # and censored after each case using vectorized sum() with data.table aggregation

    cases_table_aggregated <- cases_table[order(id), sum((cases_table$created  < created & 
                                                     cases_table$censored > created)),by=id];   

    solution_3_table <<- as.data.table(dplyr::rename(cases_table_aggregated, open_cases_at_creation=V1), key="id");

} # End solution_3_function

solution_4_function <- function (cases_table) { 
    # SOLUTION 4: Using IRanges package

    # Adapted from solution suggested by @alexis_laz
    # (https://stackoverflow.com/users/2414948/alexis-laz)

    # The IRanges package generates ranges efficiently, intended for genome sequencing
    # but working perfectly well on this data, since POSIXct values are numeric-representable
    solution_4_table <<- data.table(id      = cases_table$id,
                     open_cases_at_creation = countOverlaps(IRanges(cases_table$created, 
                                                                    cases_table$created), 
                                                            IRanges(cases_table$created, 
                                                                    cases_table$censored))-1, key="id");

} # End solution_4_function

solution_5_function <- function (cases_table) {
    # SOLUTION 5: Using data.table::frank()

    # Adapted from solution suggested by @danas.zuokas
    # (https://stackoverflow.com/users/1249481/danas-zuokas)

    n <- CASE_COUNT;

    # For every case compute the number of other cases
    # with `created` less than `created` of other cases
    r1 <- data.table::frank(c(cases_table[, created], cases_table[, created]), ties.method = 'first')[1:n];

    # For every case compute the number of other cases
    # with `censored` less than `created`
    r2 <- data.table::frank(c(cases_table[, created], cases_table[, censored]), ties.method = 'first')[1:n];

    solution_5_table <<- data.table(id      = cases_table$id,
                     open_cases_at_creation = r1 - r2, key="id");

} # End solution_5_function;

# Execute user specified functions;
if (RUN_SOLUTION_1)
    solution_1_timing <- system.time(solution_1_function(cases_table)); 
if (RUN_SOLUTION_2) {
    solution_2_timing <- try(system.time(solution_2_function(cases_table)));
    cases_table <- select(cases_table, -created_dupe);
}
if (RUN_SOLUTION_3)
    solution_3_timing <- system.time(solution_3_function(cases_table)); 
if (RUN_SOLUTION_4)
    solution_4_timing <- system.time(solution_4_function(cases_table));
if (RUN_SOLUTION_5)
    solution_5_timing <- system.time(solution_5_function(cases_table));         

# Check generated tables for comparison
if (RUN_SOLUTION_1 && RUN_SOLUTION_2 && class(solution_2_timing)!="try-error") {
    same_check1_2 <- all(solution_1_table$open_cases_at_creation == solution_2_table$open_cases_at_creation);
} else {same_check1_2 <- TRUE;}
if (RUN_SOLUTION_1 && RUN_SOLUTION_3) {
    same_check1_3 <- all(solution_1_table$open_cases_at_creation == solution_3_table$open_cases_at_creation);
} else {same_check1_3 <- TRUE;}
if (RUN_SOLUTION_1 && RUN_SOLUTION_4) {
    same_check1_4 <- all(solution_1_table$open_cases_at_creation == solution_4_table$open_cases_at_creation);
} else {same_check1_4 <- TRUE;}
if (RUN_SOLUTION_1 && RUN_SOLUTION_5) {
    same_check1_5 <- all(solution_1_table$open_cases_at_creation == solution_5_table$open_cases_at_creation);
} else {same_check1_5 <- TRUE;}
if (RUN_SOLUTION_2 && RUN_SOLUTION_3 && class(solution_2_timing)!="try-error") {
    same_check2_3 <- all(solution_2_table$open_cases_at_creation == solution_3_table$open_cases_at_creation);
} else {same_check2_3 <- TRUE;}
if (RUN_SOLUTION_2 && RUN_SOLUTION_4 && class(solution_2_timing)!="try-error") {
    same_check2_4 <- all(solution_2_table$open_cases_at_creation == solution_4_table$open_cases_at_creation);
} else {same_check2_4 <- TRUE;}
if (RUN_SOLUTION_2 && RUN_SOLUTION_5 && class(solution_2_timing)!="try-error") {
    same_check2_5 <- all(solution_2_table$open_cases_at_creation == solution_5_table$open_cases_at_creation);
} else {same_check2_5 <- TRUE;}
if (RUN_SOLUTION_3 && RUN_SOLUTION_4) {
    same_check3_4 <- all(solution_3_table$open_cases_at_creation == solution_4_table$open_cases_at_creation);
} else {same_check3_4 <- TRUE;}
if (RUN_SOLUTION_3 && RUN_SOLUTION_5) {
    same_check3_5 <- all(solution_3_table$open_cases_at_creation == solution_5_table$open_cases_at_creation);
} else {same_check3_5 <- TRUE;}
if (RUN_SOLUTION_4 && RUN_SOLUTION_5) {
    same_check4_5 <- all(solution_4_table$open_cases_at_creation == solution_5_table$open_cases_at_creation);
} else {same_check4_5 <- TRUE;}


same_check    <- all(same_check1_2, same_check1_3, same_check1_4, same_check1_5,
                     same_check2_3, same_check2_4, same_check2_5, same_check3_4,
                     same_check3_5, same_check4_5);

# Report summary of results to user
cat("This execution was for", CASE_COUNT, "cases.\n",
    "It is", same_check, "that all solutions match.\n");
if (RUN_SOLUTION_1)
    cat("The dplyr::summarize() solution took", solution_1_timing[3], "seconds.\n");
if (RUN_SOLUTION_2 && class(solution_2_timing)!="try-error")
    cat("The data.table::foverlaps() solution took", solution_2_timing[3], "seconds.\n");
if (RUN_SOLUTION_3)
    cat("The data.table aggregation solution took", solution_3_timing[3], "seconds.\n");
if (RUN_SOLUTION_4)
    cat("The IRanges solution solution took", solution_4_timing[3], "seconds.\n");
if (RUN_SOLUTION_5)
    cat("The data.table:frank() solution solution took", solution_5_timing[3], "seconds.\n\n");

losdata.table::foverlaps() la solución es más rápida para menos casos (<5000 más o menos; depende de la aleatoriedad además de n ya que utiliza la búsqueda binaria para optimizar). losdplyr::summarize() la solución es más rápida para más casos (> 5,000 más o menos). Mucho más allá de 100,000, ninguna de las soluciones es viable ya que ambas son demasiado lentas.

EDITAR: se agregó una tercera solución basada en la idea sugerida por @jangorecki que utilizadata.table agregación en lugar dedplyr::summarize(), y de lo contrario es similar a ladplyr solución. Para hasta alrededor de 50,000 casos, es la solución más rápida. Más de 50,000 casos, eldplyr::summarize() La solución es un poco más rápida, pero no mucho. Lamentablemente, para los casos de 1M todavía no es práctico.

EDIT2: se agregó una cuarta solución adaptada de la solución sugerida por @alexis_laz que utiliza elIRanges paquete y sucountOverlaps función. Es significativamente más rápido que las otras 3 soluciones. Con 50,000 casos, fue casi un 400% más rápido que las soluciones 1 y 3.

EDIT3: Se modificó la función de generación de casos para ejercer adecuadamente la condición "censurada". Gracias a @jangorecki por captar la limitación de la versión anterior.

EDITAR4: reescrito para permitir al usuario seleccionar qué soluciones ejecutar y usarsystem.time() para la comparación de tiempos con la recolección de basura antes de cada ejecución para tiempos más precisos (según la observación astuta de @ jangorecki) - También se agregaron algunas verificaciones de condición para casos de accidentes

EDIT5: Se agregó una quinta solución adaptada de la solución sugerida por @ danas.zuokas usandorank(). Mi experimentación sugiere que siempre es al menos un orden de magnitud más lento que las otras soluciones. En 10,000 casos, toma 44 segundos vs. 3.5 segundos paradplyr::summarize y 0.36 segundos paraIRanges soluciones

EDICIÓN FINAL: He realizado ligeras modificaciones a la solución 5 sugerida por @ danas.zuokas, y haciendo coincidir la observación de @Khashaa sobre los tipos. He establecido el tipoas.numeric en eldataTime función de generación que acelera drásticamenterank como opera enintegers odoubles en lugar dedateTime objetos (aumenta la velocidad de otras funciones también, pero no tan drásticamente). Con algunas pruebas, ajusteties.method='first' produce resultados consistentes con la intención.data.table::frank es más rápido que ambosbase::rank yIRanges::rank. bit64::rank es más rápido, pero parece manejar los lazos de manera diferente adata.table::frank y no puedo lograr que los maneje como lo deseo. Una vezbit64 se carga, enmascara una gran cantidad de tipos y funciones, cambiando los resultados dedata.table::frank por el camino. Las razones específicas por las cuales están más allá del alcance de esta pregunta.

NOTA POSTERIOR: Resulta quedata.table::frank manejasPOSIXct dateTimes eficientemente, mientras que ningunobase::rank niIRanges::rank parece que. Como tal, incluso elas.numeric (oas.integer) la configuración de tipo no es necesaria condata.table::frank y no hay pérdida de precisión de la conversión, por lo que hay menosties.method discrepancias ¡Gracias a todos los que contribuyeron! ¡Aprendí mucho! ¡Muy apreciado! :) El crédito se incluirá en mi código fuente.

NOTA FINAL: Esta pregunta es una versión refinada y clarificada, con un código de ejemplo más fácil de usar y más legible, deMétodo más eficiente para contar casos abiertos a partir del tiempo de creación de cada caso - Lo he separado aquí para no abrumar la publicación original con demasiadas ediciones y para simplificar la creación de grandes cantidades dedataTime pares en el código de ejemplo. De esa manera, no tienes que trabajar tan duro para responder. ¡Gracias de nuevo!