Pyparsing: parsing nested semi-JSON plain-text data into a list

Question

Dec 19, 2013, 09:32 PM

I have a bunch of nested data in a format that loosely resembles JSON:

company="My Company"
phone="555-5555"
people=
{
    person=
    {
        name="Bob"
        location="Seattle"
        settings=
        {
            size=1
            color="red"
        }
    }
    person=
    {
        name="Joe"
        location="Seattle"
        settings=
        {
            size=2
            color="blue"
        }
    }
}
places=
{
    ...
}

There are many different parameters at different levels of depth; this is only a very small subset.

It may also be worth noting that whenever a new sub-array is created, there is always an equals sign, followed by a line break, followed by the opening brace (as seen above).

Is there a simple looping or recursion technique for converting this data into a system-friendly format such as arrays or JSON? I want to avoid hard-coding the property names. I am looking for something that will work in Python, Java, or PHP. Pseudocode is fine, too.

I appreciate any help.

EDIT: I discovered the Pyparsing library for Python, and it looks like it could be a big help. I can't find any examples of how to use Pyparsing to parse nested structures of unknown depth. Can anyone shed light on Pyparsing in terms of the data I described above?

EDIT 2: Okay, here is a working solution in Pyparsing:

from pyparsing import (Dict, Forward, Group, OneOrMore, Regex, Suppress,
                       Word, ZeroOrMore, alphanums, alphas, quotedString,
                       removeQuotes)

def parse_file(fileName):
    # read the input text file, closing the handle when done
    with open(fileName, "r") as f:
        inputText = f.read()

    # define the elements of our data pattern
    name = Word(alphas, alphanums + "_")
    EQ, LBRACE, RBRACE = map(Suppress, "={}")
    value = Forward()  # placeholder that lets values contain nested structs
    entry = Group(name + EQ + value)  # the basic name-value pair

    # define data types that might be in the values
    real = Regex(r"[+-]?\d+\.\d*").setParseAction(lambda t: float(t[0]))
    integer = Regex(r"[+-]?\d+").setParseAction(lambda t: int(t[0]))
    quotedString.setParseAction(removeQuotes)

    # declare the overall structure of a nested data element
    struct = Dict(LBRACE + ZeroOrMore(entry) + RBRACE)  # entries become named results

    # declare the types a value can hold: string, struct, real, or int
    # (real is tried before integer so "1.5" is not cut off at the dot)
    value <<= (quotedString | struct | real | integer)

    # parse our input text; note that dump() returns a formatted string, not a dict
    result = Dict(OneOrMore(entry)).parseString(inputText)
    return result.dump()

This works, but when I try to write the results to a file with json.dump(result), the contents of the file are wrapped in double quotes. There are also \n characters between many of the data pairs. I tried suppressing them in the code above with LineEnd().suppress(), but I must not be using it correctly.
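
A likely explanation, assuming the value handed to json.dump is the string that parse_file returns from result.dump(): the json module serializes a Python str as a single JSON string literal, so the whole text ends up quoted and its line breaks are escaped as \n. The sketch below uses only the standard library; dumped_text is a made-up stand-in, not real Pyparsing output.

```python
import json

# Hypothetical stand-in for the multi-line string returned by result.dump().
dumped_text = "[['company', 'My Company']]\n- company: 'My Company'"

# A str is serialized as one JSON string literal: quoted, newlines escaped.
as_json_string = json.dumps(dumped_text)

# A dict is serialized as a JSON object, which is what was wanted here.
as_json_object = json.dumps({"company": "My Company"})

print(as_json_string)
print(as_json_object)
```

If that is what is happening, the newlines come from the dump() text itself rather than from the grammar, which would explain why suppressing LineEnd() made no difference.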

Okay, I came up with a final solution that actually transforms this data into a JSON-compatible dict, as I originally wanted. It first uses Pyparsing to convert the data into a series of nested lists, then loops through the lists and transforms them into JSON. This lets me get around the problem where Pyparsing's asDict() method could not handle an object that has two properties with the same name. To determine whether a list is a plain list or a property/value pair, the prependPropertyToken method adds the string __property__ in front of property names when Pyparsing detects them.
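
To illustrate the duplicate-name problem in isolation: a plain dict keeps only the last value stored under a key, so two person entries collapse into one. The sketch below shows that loss and one way to merge repeated names into a list instead. merge_pairs is a hypothetical helper written for this illustration; it is not part of Pyparsing.

```python
def merge_pairs(pairs):
    """Build a dict from (name, value) pairs, collecting repeated names into a list."""
    merged = {}
    for name, value in pairs:
        if name not in merged:
            merged[name] = value            # first occurrence: store as-is
        elif isinstance(merged[name], list):
            merged[name].append(value)      # third or later occurrence: extend the list
        else:
            merged[name] = [merged[name], value]  # second occurrence: start a list
    return merged

people = [("person", {"name": "Bob"}), ("person", {"name": "Joe"})]

print(dict(people))         # plain dict: the second "person" replaces the first
print(merge_pairs(people))  # merged: both entries survive in a list
```

One caveat with this kind of merge: if the first value stored under a name is already a list, a later duplicate is appended to that list rather than placed next to it, so the result can be ambiguous.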

    def parse_file(self, fileName):
        # needs at module level: import json, plus from pyparsing import
        # CaselessKeyword, Dict, Forward, Group, OneOrMore, Regex, Suppress, Word,
        # alphanums, dictOf, quotedString, removeQuotes, replaceWith, restOfLine

        # read the input text file, closing the handle when done
        with open(fileName, "r") as f:
            inputText = f.read()

        # define data types that might be in the values
        real = Regex(r"[+-]?\d+\.\d*").setParseAction(lambda t: float(t[0]))
        integer = Regex(r"[+-]?\d+").setParseAction(lambda t: int(t[0]))
        yes = CaselessKeyword("yes").setParseAction(replaceWith(True))
        no = CaselessKeyword("no").setParseAction(replaceWith(False))
        quotedString.setParseAction(removeQuotes)
        unquotedString = Word(alphanums + "_-?\"")
        comment = Suppress("#") + Suppress(restOfLine)
        EQ, LBRACE, RBRACE = map(Suppress, "={}")

        data = (real | integer | yes | no | quotedString | unquotedString)

        # define structures (Forward placeholders allow recursive nesting)
        value = Forward()
        obj = Forward()  # renamed from "object" to avoid shadowing the builtin

        dataList = Group(OneOrMore(data))
        simpleArray = (LBRACE + dataList + RBRACE)

        propertyName = Word(alphanums + "_-.").setParseAction(self.prependPropertyToken)
        prop = dictOf(propertyName + EQ, value)
        properties = Dict(prop)

        obj <<= (LBRACE + properties + RBRACE)
        value <<= (data | obj | simpleArray)

        dataset = properties.ignore(comment)

        # parse it
        result = dataset.parseString(inputText)

        # turn the nested lists into a dict and serialize it as JSON
        resultDict = self.convert_to_dict(result.asList())
        return json.dumps(resultDict)



    def convert_to_dict(self, inputList):
        converted = {}  # renamed from "dict" to avoid shadowing the builtin
        for item in inputList:
            # determine the key and value to be inserted into the dict
            dictval = None
            key = None

            if isinstance(item, list):
                try:
                    key = item[0].replace("__property__", "")
                    if isinstance(item[1], list):
                        try:
                            if item[1][0].startswith("__property__"):
                                # nested object: recurse over the whole group
                                # (the leading name string is skipped, as it is not a list)
                                dictval = self.convert_to_dict(item)
                            else:
                                dictval = item[1]
                        except AttributeError:
                            # first element is not a string, so this is a simple array
                            dictval = item[1]
                    else:
                        dictval = item[1]
                except IndexError:
                    dictval = None
            # insert the value, or merge it with the values already stored at this key
            if key:
                if key in converted:
                    if isinstance(converted[key], list):
                        converted[key].append(dictval)
                    else:
                        converted[key] = [converted[key], dictval]
                else:
                    converted[key] = dictval
        return converted



    def prependPropertyToken(self, t):
        # tag property names so convert_to_dict can tell pairs from plain lists
        return "__property__" + t[0]