Scala: Spark SQL to_date(unix_timestamp) returning NULL

Question

Nov 05, 2016, 12:24 AM

scala spark-csv spark-dataframe apache-spark apache-spark-sql

Spark Version: spark-2.0.1-bin-hadoop2.7 Scala: 2.11.8

I am loading a raw CSV into a DataFrame. In the CSV, although the column is supposed to be in date format, the values are written as 20161025 instead of 2016-10-25. The parameter date_format holds a list of the names of the columns that need to be converted to yyyy-mm-dd format.
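
To make the two representations concrete, here is a minimal sketch outside Spark, assuming the stored values always follow the compact year-month-day layout seen in 20161025. The object name CompactDateSketch and the helper toIsoDate are hypothetical and used only for illustration; they are not part of the code in this question.

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import scala.util.Try

object CompactDateSketch {
  // Pattern describing the text as stored in the CSV (assumed layout):
  // 4-digit year, 2-digit month, 2-digit day, no separators.
  private val compact = DateTimeFormatter.ofPattern("yyyyMMdd")

  // Returns the ISO form (yyyy-MM-dd), or None when the text
  // does not match the compact pattern.
  def toIsoDate(raw: String): Option[String] =
    Try(LocalDate.parse(raw, compact)).toOption.map(_.toString)
}
```

The pattern describes how the stored text looks, not how the result should look: CompactDateSketch.toIsoDate("20161025") yields Some("2016-10-25"), while a value that is already hyphenated does not match the compact pattern and yields None.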

In the following code, I first load the date column as StringType via schema, and then check whether date_format is non-empty, that is, whether there are columns that need to be converted from String to Date. If so, I cast each column using unix_timestamp and to_date. However, in csv_df.show(), the returned rows are all null.

def read_csv(csv_source:String, delimiter:String, is_first_line_header:Boolean, 
    schema:StructType, date_format:List[String]): DataFrame = {
    println("|||| Reading CSV Input ||||")

    var csv_df = sqlContext.read
        .format("com.databricks.spark.csv")
        .schema(schema)
        .option("header", is_first_line_header)
        .option("delimiter", delimiter)
        .load(csv_source)
    println("|||| Successfully read CSV. Number of rows -> " + csv_df.count() + " ||||")
    if(date_format.length > 0) {
        for (i <- 0 until date_format.length) {
            csv_df = csv_df.select(to_date(unix_timestamp(
                csv_df(date_format(i)), "yyyy-MM-dd").cast("timestamp")))
            csv_df.show()
        }
    }
    csv_df
}

Top 20 rows returned:

+-------------------------------------------------------------------------+
|to_date(CAST(unix_timestamp(prom_price_date, YYYY-MM-DD) AS TIMESTAMP))|
+-------------------------------------------------------------------------+
|                                                                     null|
|                                                                     null|
|                                                                     null|
|                                                                     null|
|                                                                     null|
|                                                                     null|
|                                                                     null|
|                                                                     null|
|                                                                     null|
|                                                                     null|
|                                                                     null|
|                                                                     null|
|                                                                     null|
|                                                                     null|
|                                                                     null|
|                                                                     null|
|                                                                     null|
|                                                                     null|
|                                                                     null|
|                                                                     null|
+-------------------------------------------------------------------------+

Why am I getting all null?
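
One possible explanation, offered as a sketch rather than a confirmed diagnosis: the Spark documentation describes unix_timestamp(column, pattern) as returning null when the string cannot be parsed with the given pattern, and the pattern passed here describes the desired output (hyphenated) rather than the stored input (20161025). The sketch below assumes the input layout is yyyyMMdd. The object DateColumnSketch, the helper parseDateColumns, and the parameter inputPattern are hypothetical names introduced for illustration.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, to_date, unix_timestamp}

object DateColumnSketch {
  // Hypothetical helper, not part of the original read_csv.
  // inputPattern must describe the text as stored in the CSV
  // (assumed here to be yyyyMMdd, e.g. 20161025).
  def parseDateColumns(df: DataFrame,
                       dateColumns: Seq[String],
                       inputPattern: String = "yyyyMMdd"): DataFrame =
    dateColumns.foldLeft(df) { (acc, name) =>
      // withColumn with an existing name replaces that column in place
      // and keeps every other column of the DataFrame.
      acc.withColumn(
        name,
        to_date(unix_timestamp(col(name), inputPattern).cast("timestamp")))
    }
}
```

Under these assumptions, the loop in read_csv could be replaced by DateColumnSketch.parseDateColumns(csv_df, date_format). Using withColumn instead of select is a deliberate choice in this sketch: select with a single expression would reduce the DataFrame to that one column on each iteration.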