Python - Parsing Unicode characters

Question

Feb 26, 2010, 04:52 AM

I tried using w = Word(printables), but it does not work. How should I write the specification for this? 'w' is meant to handle Hindi characters (UTF-8).

The code defines the grammar and parses the input accordingly.

671.assess  :: अहसास  ::2
x=number + "." + src + "::" + w + "::" + number + "." + number

When the input contains only English characters it works, so the code is correct for ASCII input but not for Unicode input.

I mean that the code works when we have something of the form 671.assess :: ahsaas :: 2

That is, it parses words written in English (ASCII) characters, but I am not sure how to parse and then print characters in Unicode. I need this for English-Hindi word alignment.
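To illustrate why the ASCII case works and the Hindi case does not, here is a minimal sketch, assuming Python 3 and only the standard library. As far as I can tell, pyparsing's `printables` is built from the non-whitespace characters of `string.printable`, which is ASCII only. The Devanagari range U+0900..U+097F and the variable names are assumptions made for illustration.

```python
import string

# ASCII printable, non-whitespace characters: the kind of set that
# pyparsing's `printables` appears to be built from.
ascii_printables = "".join(
    c for c in string.printable if c not in string.whitespace
)

# Assumption: the Hindi text uses the Unicode Devanagari block U+0900..U+097F.
devanagari = "".join(chr(cp) for cp in range(0x0900, 0x0980))

word = "अहसास"

# True only if every character of the word is in the allowed set.
in_ascii = all(ch in ascii_printables for ch in word)
in_extended = all(ch in ascii_printables + devanagari for ch in word)

print(in_ascii, in_extended)
```

A combined string such as `ascii_printables + devanagari` could presumably be passed to `Word(...)` in place of `printables`. Under Python 2 the same idea would need unicode strings (for example `unichr` instead of `chr`) and input decoded from UTF-8.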

The Python code looks like this:

# -*- coding: utf-8 -*-
from pyparsing import Literal, Word, Optional, nums, alphas, ZeroOrMore, printables, Group, alphas8bit
# grammar 
src = Word(printables)
trans =  Word(printables)
number = Word(nums)
x=number + "." + src + "::" + trans + "::" + number + "." + number
#parsing for eng-dict
efiledata = open('b1aop_or_not_word.txt').read()
eresults = x.scanString(efiledata)  # scanString yields (tokens, start, end), which the loop below expects
edict1 = {}
edict2 = {}
counter=0
xx=list()
for result in eresults:
  trans=""#translation string
  ew=""#english word
  xx=result[0]
  ew=xx[2]
  trans=xx[4]   
  edict1 = { ew:trans }
  edict2.update(edict1)
print len(edict2) #no of entries in the english dictionary
print "edict2 has been created"
print "english dictionary" , edict2 

#parsing for hin-dict
hfiledata = open('b1aop_or_not_word.txt').read()
hresults = x.scanString(hfiledata)
hdict1 = {}
hdict2 = {}
counter=0
for result in hresults:
  trans=""#translation string
  hw=""#hin word
  xx=result[0]  
  hw=xx[2]
  trans=xx[4]
  #print trans
  hdict1 = { trans:hw }
  hdict2.update(hdict1)

print len(hdict2) #no of entries in the hindi dictionary
print"hdict2 has been created"
print "hindi dictionary" , hdict2
'''
#######################################################################################################################

def translate(d, ow, hinlist):
    if ow in d.keys():#ow=old word d=dict
        print ow , "exists in the dictionary keys"
        transes = d[ow]
        transes = transes.split()
        print "possible transes for" , ow , " = ", transes
        for word in transes:
            if word in hinlist:
                print "trans for" , ow , " = ", word
                return word
        return None
    else:
        print ow , "absent"
        return None

f = open('bidir','w')
#lines = ["'\
#5# 10 # and better performance in business in turn benefits consumers .  # 0 0 0 0 0 0 0 0 0 0 \
#5# 11 # vHyaapaar mEmn bEhtr kaam upbhOkHtaaomn kE lIe laabhpHrdd hOtaa hAI .  # 0 0 0 0 0 0 0 0 0 0 0 \
#'"]
data=open('bi_full_2','rb').read()
lines = data.split('!@#$%')
loc=0
for line in lines:
    eng, hin = [subline.split(' # ')
                for subline in line.strip('\n').split('\n')]

    for transdict, source, dest in [(edict2, eng, hin),
                                    (hdict2, hin, eng)]:
        sourcethings = source[2].split()
        for word in source[1].split():
            tl = dest[1].split()
            otherword = translate(transdict, word, tl)
            loc = source[1].split().index(word)
            if otherword is not None:
                otherword = otherword.strip()
                print word, ' <-> ', otherword, 'meaning=good'
                if otherword in dest[1].split():
                    print word, ' <-> ', otherword, 'trans=good'
                    sourcethings[loc] = str(
                        dest[1].split().index(otherword) + 1)

        source[2] = ' '.join(sourcethings)

    eng = ' # '.join(eng)
    hin = ' # '.join(hin)
    f.write(eng+'\n'+hin+'\n\n\n')
f.close()
'''

If an example input sentence pair in the source file is:

1# 5 # modern markets : confident consumers  # 0 0 0 0 0 
1# 6 # AddhUnIk baajaar : AshHvsHt upbhOkHtaa .  # 0 0 0 0 0 0 
!@#$%

the output would look like this:

1# 5 # modern markets : confident consumers  # 1 2 3 4 5 
1# 6 # AddhUnIk baajaar : AshHvsHt upbhOkHtaa .  # 1 2 3 4 5 0 
!@#$%

Explanation of the output: this achieves a bidirectional alignment. It means that the first word of the English sentence, 'modern', corresponds to the first word of the Hindi sentence, 'AddhUnIk', and vice versa. Punctuation characters are also treated as words here, since they are an integral part of the bidirectional mapping. So the Hindi token '.' has a null alignment (0): it maps to nothing in the English sentence, because the English sentence has no full stop. The third line of the output is simply a delimiter, used when working on a series of sentences for which a bidirectional mapping is wanted.
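The alignment described above can be sketched roughly as follows, assuming Python 3. The `align` helper and the toy dictionary are hypothetical and exist only to reproduce the example pair. This is not the code from the post.

```python
def align(source_words, target_words, dictionary):
    """Return the 1-based target position for each source word, 0 if unaligned."""
    positions = []
    for word in source_words:
        translation = dictionary.get(word)
        if translation in target_words:
            positions.append(target_words.index(translation) + 1)
        else:
            positions.append(0)
    return positions


eng = "modern markets : confident consumers".split()
hin = "AddhUnIk baajaar : AshHvsHt upbhOkHtaa .".split()

# Hypothetical toy dictionary, for illustration only.
eng_to_hin = {
    "modern": "AddhUnIk",
    "markets": "baajaar",
    ":": ":",
    "confident": "AshHvsHt",
    "consumers": "upbhOkHtaa",
}
hin_to_eng = {hindi: english for english, hindi in eng_to_hin.items()}

eng_alignment = align(eng, hin, eng_to_hin)
hin_alignment = align(hin, eng, hin_to_eng)

print(" ".join(map(str, eng_alignment)))
print(" ".join(map(str, hin_alignment)))
```

The final '.' on the Hindi side has no entry in the dictionary, so it receives 0, which matches the null alignment in the example.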

What change should I make so that this works when the Hindi sentences are in Unicode (UTF-8)?
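One possible direction, given here only as a minimal sketch and assuming Python 3: decode the file as UTF-8 while reading it, so that the parser sees text rather than raw bytes. The sample file name and the regular expression are illustrative assumptions. The regular expression is swapped in for the pyparsing grammar to keep the example self-contained and is not the original grammar.

```python
import io
import os
import re
import tempfile

# One dictionary entry in the format shown earlier in the post.
sample = "671.assess  :: अहसास  ::2\n"

# Write the sample as UTF-8 so the example is self-contained
# (hypothetical file name).
path = os.path.join(tempfile.mkdtemp(), "sample_dict.txt")
with io.open(path, "w", encoding="utf-8") as handle:
    handle.write(sample)

# Decode while reading: the result is text, not raw bytes.
with io.open(path, encoding="utf-8") as handle:
    text = handle.read()

# \S+ matches any run of non-whitespace characters, including Devanagari.
entry = re.compile(r"(\d+)\.(\S+)\s*::\s*(\S+)\s*::\s*(\d+)")
match = entry.search(text)
number, source_word, translation, count = match.groups()

print(source_word, translation)
```

`io.open` with an `encoding` argument also exists in Python 2.6 and later, where it returns unicode strings. The grammar would then have to accept those characters as well, for example through a wider character set passed to `Word`.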