Reading nested JSON in Google Dataflow / Apache Beam

Question

Feb 01, 2017, 05:00 PM

It is possible to read flat (non-nested) JSON files on Cloud Storage with Dataflow via:

p.apply("read logfiles", TextIO.Read.from("gs://bucket/*").withCoder(TableRowJsonCoder.of()));

If I just want to write those records to BigQuery with minimal filtering, I can do so with a DoFn like this one:

private static class Formatter extends DoFn<TableRow, TableRow> {

    @Override
    public void processElement(ProcessContext c) throws Exception {

        // .clone() since input is immutable
        TableRow output = c.element().clone();

        // remove misleading timestamp field
        output.remove("@timestamp");

        // set timestamp field by using the element's timestamp
        output.set("timestamp", c.timestamp().toString());

        c.output(output);
    }
}

However, I don't know how to access the nested fields in the JSON file this way.

If the TableRow contains a RECORD named r, is it possible to access its keys/values without further serialization/deserialization?

If I need to serialize/deserialize myself with the Jackson library, does it make more sense to use the standard Coder of TextIO.Read instead of TableRowJsonCoder, to regain some of the performance I lose this way?

EDIT

The files are newline-delimited and look like this:

{"@timestamp":"2015-x", "message":"bla", "r":{"analyzed":"blub", "query": {"where":"9999"}}}
{"@timestamp":"2015-x", "message":"blub", "r":{"analyzed":"bla", "query": {"where":"1111"}}}
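
To make the first question concrete, here is a minimal sketch of the kind of nested access I have in mind, written against plain java.util.Map so it runs without the Dataflow SDK. TableRow is itself a Map<String, Object>, but that the nested record r is also decoded as a Map is an assumption I have not verified. The class NestedAccess and its methods getNested and sampleRow are hypothetical names for illustration, not SDK APIs, and the row is built by hand to mirror the first sample record rather than being parsed from the file.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration; not part of the Dataflow SDK.
final class NestedAccess {

    // Walks a path of keys through nested maps.
    // Returns null if a step is missing or an intermediate value is not a Map.
    static Object getNested(Map<String, Object> row, String... path) {
        Object current = row;
        for (String key : path) {
            if (!(current instanceof Map)) {
                return null;
            }
            current = ((Map<?, ?>) current).get(key);
        }
        return current;
    }

    // Builds by hand the structure of the first sample record
    // (assumption: nested JSON objects are represented as Maps).
    static Map<String, Object> sampleRow() {
        Map<String, Object> query = new LinkedHashMap<>();
        query.put("where", "9999");

        Map<String, Object> r = new LinkedHashMap<>();
        r.put("analyzed", "blub");
        r.put("query", query);

        Map<String, Object> row = new LinkedHashMap<>();
        row.put("@timestamp", "2015-x");
        row.put("message", "bla");
        row.put("r", r);
        return row;
    }

    public static void main(String[] args) {
        Map<String, Object> row = sampleRow();
        System.out.println(getNested(row, "r", "analyzed"));       // blub
        System.out.println(getNested(row, "r", "query", "where")); // 9999
        System.out.println(getNested(row, "r", "missing", "x"));   // null
    }
}
```

If the assumption holds, the same call inside the DoFn would be something like getNested(c.element(), "r", "query", "where"), with no extra serialization step.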