Spark Scala - java.util.NoSuchElementException & Data Cleaning

Question

Jul 06, 2016, 07:37 PM

nosuchelementexception stanford-nlp apache-spark scala

I have had a similar problem before, but I am looking for a generalizable answer. I am using spark-corenlp to get sentiment scores for emails. Sometimes sentiment() crashes on some input (maybe it is too long, maybe it contains an unexpected character). It does not tell me that it crashed on those rows; it just returns the Column sentiment('email). So when I try to show() beyond a certain point or save() my DataFrame, I get a java.util.NoSuchElementException, because sentiment() must have returned nothing for that row.

My initial code loads the data and applies sentiment() as shown in the spark-corenlp API:

import org.apache.spark.sql.types.{StringType, StructField, StructType}
import com.databricks.spark.corenlp.functions._ // provides sentiment()
import sqlContext.implicits._                   // enables the 'column syntax

val customSchema = StructType(Array(
  StructField("contactId", StringType, true),
  StructField("email", StringType, true)
))

// Load dataframe
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("delimiter", "\t")          // Delimiter is tab
  .option("parserLib", "UNIVOCITY")   // Parser, which deals better with the email formatting
  .schema(customSchema)               // Schema of the table
  .load("emails")                     // Input file

// Add sentiment analysis output to dataframe
val sent = df.select('contactId, sentiment('email).as('sentiment))

I tried filtering out null and NaN values:

val sentFiltered = sent
  .filter('sentiment.isNotNull)
  .filter(!'sentiment.isNaN)
  .filter(col("sentiment").between(0, 4))

I even tried doing it through a SQL query:

sent.registerTempTable("sent")
val test = sqlContext.sql("SELECT * FROM sent WHERE sentiment IS NOT NULL")

I don't know which input is causing spark-corenlp to crash. How can I find out? Otherwise, how can I filter these non-existent values out of col("sentiment")? Or should I try to catch the exception and ignore the row? Is that even possible?
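
To make the last two ideas concrete, here is a minimal sketch in plain Scala of catching the exception per row, either to skip the row or to record which input failed. Filtering the sentiment column afterwards cannot help here, since the exception is thrown while that column is being computed, before any filter sees a value. Everything named below is hypothetical: scoreSentiment is a toy stand-in that fails on blank text, not the spark-corenlp implementation, and safeScore and failures are illustrative helpers.

```scala
import scala.util.{Failure, Success, Try}

object SafeSentiment {
  // Hypothetical stand-in for the real sentiment call (an assumption, not
  // spark-corenlp code). It throws NoSuchElementException when the text
  // contains no sentences, e.g. for empty or whitespace-only input.
  def scoreSentiment(text: String): Int = {
    val sentences = text.split("[.!?]").map(_.trim).filter(_.nonEmpty).toList
    sentences.head.length % 5 // .head on an empty list throws
  }

  // Catch the exception and turn a failing row into a missing score.
  def safeScore(text: String): Option[Int] =
    Try(scoreSentiment(text)).toOption

  // Keep the id of every failing row together with the exception,
  // to find out which inputs cause the crash.
  def failures(rows: Seq[(String, String)]): Seq[(String, Throwable)] =
    rows.flatMap { case (id, email) =>
      Try(scoreSentiment(email)) match {
        case Failure(e) => Some(id -> e)
        case Success(_) => None
      }
    }
}
```

In Spark, the same Try wrapper could in principle live inside a function passed to org.apache.spark.sql.functions.udf, where an Option[Int] result becomes a nullable column that isNotNull can then filter. Because sentiment() in spark-corenlp is itself a column function, that would presumably mean calling CoreNLP directly inside the custom UDF, which is an assumption this sketch does not cover.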