How to tweak the NLTK sentence tokenizer

Question

Dec 31, 2012, 12:59 AM

I'm using NLTK to analyze a few classic texts, and I'm running into trouble tokenizing the text by sentence. For example, here's what I get for a snippet from Moby Dick:

import nltk
sent_tokenize = nltk.data.load('tokenizers/punkt/english.pickle')

'''
(Chapter 16)
"A clam for supper? a cold clam; is THAT what you mean, Mrs. Hussey?" says I, "but
that's a rather cold and clammy reception in the winter time, ain't it, Mrs. Hussey?"
'''
sample = '"A clam for supper? a cold clam; is THAT what you mean, Mrs. Hussey?" says I, "but that\'s a rather cold and clammy reception in the winter time, ain\'t it, Mrs. Hussey?"'

print("\n-----\n".join(sent_tokenize.tokenize(sample)))
'''
OUTPUT
"A clam for supper?
-----
a cold clam; is THAT what you mean, Mrs.
-----
Hussey?
-----
" says I, "but that's a rather cold and clammy reception in the winter time, ain't it, Mrs.
-----
Hussey?
-----
"
'''

I don't expect perfection here, considering that Melville's syntax is a bit dated, but NLTK ought to be able to handle terminal double quotes and titles like "Mrs." Since the tokenizer is the result of an unsupervised training algorithm, however, I can't figure out how to tinker with it.

Does anyone have recommendations for a better sentence tokenizer? I'd prefer a simple heuristic that I can hack over having to train my own parser.
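
To make the request concrete, here is a minimal sketch of the kind of simple, hackable heuristic described above. It is not a replacement for Punkt and does not use NLTK at all, only the standard `re` module. The names `ABBREVIATIONS`, `BOUNDARY`, and `split_sentences` are hypothetical, and the abbreviation list is an assumption that would need extending for real texts. The idea: treat `.`, `?`, or `!` (plus any closing quotes) as a sentence boundary only when the next word starts with a capital letter or digit and the word before a period is not a listed abbreviation.

```python
import re

# Hypothetical abbreviation list (an assumption); extend it for the text at hand.
ABBREVIATIONS = {"mr", "mrs", "ms", "dr", "prof", "st", "capt"}

# Candidate boundary: sentence-ending punctuation, optional closing
# quotes or brackets, then whitespace.
BOUNDARY = re.compile(r"""[.?!]+["')\]]*\s+""")


def split_sentences(text):
    """Split text into sentences using two simple rules."""
    sentences = []
    start = 0
    for match in BOUNDARY.finditer(text):
        rest = text[match.end():]
        # Rule 1: the next sentence must begin with an uppercase letter or
        # a digit, optionally preceded by an opening quote or bracket.
        if not re.match(r"""["'(\[]*[A-Z0-9]""", rest):
            continue
        # Rule 2: a period right after a known abbreviation ("Mrs.") is
        # not a boundary.
        if match.group().startswith("."):
            words = text[start:match.start()].split()
            last_word = words[-1].strip("\"'([").lower() if words else ""
            if last_word in ABBREVIATIONS:
                continue
        sentences.append(text[start:match.end()].strip())
        start = match.end()
    # Whatever follows the last accepted boundary is the final sentence.
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


for sentence in split_sentences(
    'It was Mrs. Hussey. "Clam or cod?" she repeated. We said clam.'
):
    print(sentence)
```

On the Moby Dick snippet this keeps the whole quotation as one sentence, since neither "a cold clam" nor "says I" starts with a capital and "Mrs." is in the abbreviation list. The trade-off is that a sentence that really does end in a listed abbreviation, or a new sentence that starts in lowercase, will not be split.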