Pyparsing: Parsing semi-JSON nested plaintext data to a list

Question

Dec 19, 2013, 09:32 PM


I have a bunch of nested data in a format that loosely resembles JSON:

company="My Company"
phone="555-5555"
people=
{
    person=
    {
        name="Bob"
        location="Seattle"
        settings=
        {
            size=1
            color="red"
        }
    }
    person=
    {
        name="Joe"
        location="Seattle"
        settings=
        {
            size=2
            color="blue"
        }
    }
}
places=
{
    ...
}

There are many different parameters with varying levels of depth; this is just a very small subset.

It may also be worth noting that whenever a new sub-array is created, there is always an equals sign followed by a line break followed by the opening brace (as seen above).

Is there any simple looping or recursion technique for converting this data into a system-friendly data format such as arrays or JSON? I want to avoid hard-coding the names of properties. I am looking for something that will work in Python, Java, or PHP. Pseudocode is fine, too.

I appreciate any help.
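
For reference, the kind of recursion asked about here can be sketched with the standard library alone. This is only a minimal sketch, assuming the format is exactly as in the sample (name=value entries, quoted strings without escaped quotes, and braces for nesting). The names TOKEN, tokenize, to_scalar and parse_entries are hypothetical and exist only for this illustration. Entries are returned as a list of (name, value) pairs instead of a dictionary so that repeated names such as person are preserved.

```python
import re

# A token is a quoted string, one of the marks = { }, or a bare word.
TOKEN = re.compile(r'"(?P<string>[^"]*)"|(?P<punct>[={}])|(?P<bare>[^\s={}"]+)')


def tokenize(text):
    """Return (kind, text) tuples; whitespace and line breaks are skipped."""
    return [(m.lastgroup, m.group(m.lastgroup)) for m in TOKEN.finditer(text)]


def to_scalar(kind, text):
    """Keep quoted strings as-is; convert bare integers and reals to numbers."""
    if kind == "bare" and re.fullmatch(r"[+-]?\d+", text):
        return int(text)
    if kind == "bare" and re.fullmatch(r"[+-]?\d+\.\d*", text):
        return float(text)
    return text


def parse_entries(tokens, pos=0, nested=False):
    """Parse name=value entries from tokens[pos:]; return (pairs, next_pos)."""
    pairs = []
    while pos < len(tokens):
        if tokens[pos] == ("punct", "}"):
            if not nested:
                raise ValueError("unexpected '}'")
            return pairs, pos + 1
        if tokens[pos][0] != "bare" or tokens[pos + 1:pos + 2] != [("punct", "=")]:
            raise ValueError("expected 'name=' at token %d" % pos)
        name = tokens[pos][1]
        pos += 2
        if pos >= len(tokens):
            raise ValueError("missing value for %r" % name)
        kind, text = tokens[pos]
        if (kind, text) == ("punct", "{"):
            # A brace opens a nested block: recurse, whatever the depth.
            value, pos = parse_entries(tokens, pos + 1, nested=True)
        elif kind == "punct":
            raise ValueError("unexpected %r after %r=" % (text, name))
        else:
            value, pos = to_scalar(kind, text), pos + 1
        pairs.append((name, value))
    if nested:
        raise ValueError("missing '}'")
    return pairs, pos


sample = '''
company="My Company"
people=
{
    person=
    {
        name="Bob"
        settings=
        {
            size=1
            color="red"
        }
    }
    person=
    {
        name="Joe"
        settings=
        {
            size=2
            color="blue"
        }
    }
}
'''

pairs, _ = parse_entries(tokenize(sample))
print(pairs)
```

Because the line break between the equals sign and the opening brace is ordinary whitespace to the tokenizer, it needs no special handling in this sketch.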

EDIT: I discovered the Pyparsing library for Python and it looks like it could be a big help. I can't find any examples of how to use Pyparsing to parse nested structures of unknown depth. Can anyone shed light on Pyparsing in terms of the data I described above?

EDIT 2: Okay, here is a working solution in Pyparsing:

from pyparsing import (Dict, Forward, Group, OneOrMore, Regex, Suppress, Word,
                       ZeroOrMore, alphanums, alphas, quotedString, removeQuotes)

def parse_file(fileName):
    # get the input text file
    with open(fileName, "r") as file:
        inputText = file.read()

    # define the elements of our data pattern
    name = Word(alphas, alphanums + "_")
    EQ, LBRACE, RBRACE = map(Suppress, "={}")
    value = Forward()  # this tells pyparsing that values can be recursive
    entry = Group(name + EQ + value)  # this is the basic name-value pair

    # define data types that might be in the values
    real = Regex(r"[+-]?\d+\.\d*").setParseAction(lambda x: float(x[0]))
    integer = Regex(r"[+-]?\d+").setParseAction(lambda x: int(x[0]))
    quotedString.setParseAction(removeQuotes)

    # declare the overall structure of a nested data element
    struct = Dict(LBRACE + ZeroOrMore(entry) + RBRACE)  # we will turn the output into a Dictionary

    # declare the types that might be contained in our data value - string, real, int, or the struct we declared
    value << (quotedString | struct | real | integer)

    # parse our input text; note that dump() returns a formatted string, not a dictionary
    result = Dict(OneOrMore(entry)).parseString(inputText)
    return result.dump()

This works, but when I try to write the results to a file with json.dump(result), the contents of the file are wrapped in double quotes. Also, there are \n characters between many of the data pairs. I tried suppressing them in the code above with LineEnd().suppress(), but I must not be using it correctly.
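
One plausible cause of the quoting, assuming the value passed to json.dump is the string returned by result.dump(): pyparsing's dump() produces a formatted string, and JSON-encoding any string wraps it in double quotes and escapes its line breaks as \n. The sketch below reproduces that effect with the standard library only. The text stored in dumped is a made-up stand-in, not real pyparsing output.

```python
import json

# Hypothetical stand-in for a multi-line string such as dump() returns.
dumped = "- company: My Company\n- phone: 555-5555"

# Encoding a str yields a single quoted JSON string with escaped newlines.
as_string = json.dumps(dumped)

# Encoding an actual dict yields a JSON object instead.
as_object = json.dumps({"company": "My Company", "phone": "555-5555"})

print(as_string)
print(as_object)
```

If that is what is happening, the \n characters come from the JSON encoding of the string rather than from the grammar, which would explain why suppressing line ends in the parser has no effect on them.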

Okay, I came up with a final solution that actually transforms this data into a JSON-friendly dict, as I originally wanted. It first uses Pyparsing to convert the data into a series of nested lists, then loops through the list and transforms it into JSON. This lets me get around the problem that Pyparsing's asDict() method could not handle the case where the same object has two properties with the same name. To determine whether a list is a plain list or a property/value pair, the prependPropertyToken method adds the string __property__ in front of property names when Pyparsing detects them.

import json

from pyparsing import (CaselessKeyword, Dict, Forward, Group, OneOrMore, Regex,
                       Suppress, Word, alphanums, dictOf, quotedString,
                       removeQuotes, replaceWith, restOfLine)


class DataFileParser:  # hypothetical class name; only the methods were posted

    def parse_file(self, fileName):
        # get the input text file
        with open(fileName, "r") as file:
            inputText = file.read()

        # define data types that might be in the values
        real = Regex(r"[+-]?\d+\.\d*").setParseAction(lambda x: float(x[0]))
        integer = Regex(r"[+-]?\d+").setParseAction(lambda x: int(x[0]))
        yes = CaselessKeyword("yes").setParseAction(replaceWith(True))
        no = CaselessKeyword("no").setParseAction(replaceWith(False))
        quotedString.setParseAction(removeQuotes)
        unquotedString = Word(alphanums + "_-?\"")
        comment = Suppress("#") + Suppress(restOfLine)
        EQ, LBRACE, RBRACE = map(Suppress, "={}")

        data = (real | integer | yes | no | quotedString | unquotedString)

        # define structures (names chosen to avoid shadowing the built-ins object and property)
        value = Forward()
        obj = Forward()

        dataList = Group(OneOrMore(data))
        simpleArray = (LBRACE + dataList + RBRACE)

        propertyName = Word(alphanums + "_-.").setParseAction(self.prependPropertyToken)
        prop = dictOf(propertyName + EQ, value)
        properties = Dict(prop)

        obj << (LBRACE + properties + RBRACE)
        value << (data | obj | simpleArray)

        dataset = properties.ignore(comment)

        # parse it
        result = dataset.parseString(inputText)

        # turn it into a JSON-like object
        converted = self.convert_to_dict(result.asList())
        return json.dumps(converted)

    def convert_to_dict(self, inputList):
        converted = {}
        for item in inputList:
            # determine the key and value to be inserted into the dict
            dictval = None
            key = None

            if isinstance(item, list):
                try:
                    key = item[0].replace("__property__", "")
                    if isinstance(item[1], list):
                        try:
                            if item[1][0].startswith("__property__"):
                                # nested object: its entries follow the key inside item
                                dictval = self.convert_to_dict(item)
                            else:
                                dictval = item[1]
                        except AttributeError:
                            # first element is not a string, so this is a plain list
                            dictval = item[1]
                    else:
                        dictval = item[1]
                except IndexError:
                    dictval = None

            # determine whether to insert the value into the key or to merge the value with existing values at this key
            if key:
                if key in converted:
                    if isinstance(converted[key], list):
                        converted[key].append(dictval)
                    else:
                        converted[key] = [converted[key], dictval]
                else:
                    converted[key] = dictval
        return converted

    def prependPropertyToken(self, t):
        return "__property__" + t[0]
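
The merging of repeated property names can also be looked at in isolation. The sketch below is not the code above: it assumes a simplified, hand-written input in which each property is a (name, value) tuple and a nested object is a non-empty list of such tuples, so no __property__ marker is needed. The names is_object and merge_pairs are hypothetical.

```python
import json


def is_object(value):
    """A nested object is modelled as a non-empty list of (name, value) tuples."""
    return (isinstance(value, list) and len(value) > 0
            and all(isinstance(v, tuple) and len(v) == 2 for v in value))


def merge_pairs(pairs):
    """Build a dict from pairs, collecting values of repeated names into a list."""
    merged = {}
    repeated = set()  # names already seen more than once at this level
    for name, value in pairs:
        if is_object(value):
            value = merge_pairs(value)  # recurse into nested objects
        if name not in merged:
            merged[name] = value
        elif name in repeated:
            merged[name].append(value)
        else:
            merged[name] = [merged[name], value]
            repeated.add(name)
    return merged


# Hand-written sample data mirroring part of the input shown in the question.
people = [
    ("person", [("name", "Bob"), ("settings", [("size", 1), ("color", "red")])]),
    ("person", [("name", "Joe"), ("settings", [("size", 2), ("color", "blue")])]),
]
sample = [("company", "My Company"), ("sizes", [1, 2, 3]), ("people", people)]

result = merge_pairs(sample)
print(json.dumps(result, indent=2))

# A plain dict keeps only the last of two entries with the same name.
overwritten = dict(people)
```

Tracking repeated names in a set, instead of checking whether the stored value is a list, keeps a plain array such as sizes from being mistaken for a collection of merged duplicates.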