Why is the perplexity with a padded vocabulary infinite for an nltk.lm bigram model?
I am testing the perplexity measure for a language model on a text:
import nltk
from nltk.lm.preprocessing import padded_everygram_pipeline
from nltk.lm import MLE, Laplace
from nltk.lm import Vocabulary

train_sentences = nltk.sent_tokenize(train_text)
test_sentences = nltk.sent_tokenize(test_text)

train_tokenized_text = [list(map(str.lower, nltk.tokenize.word_tokenize(sent)))
                        for sent in train_sentences]
test_tokenized_text = [list(map(str.lower, nltk.tokenize.word_tokenize(sent)))
                       for sent in test_sentences]

vocab = Vocabulary(nltk.tokenize.word_tokenize(train_text), unk_cutoff=1)

n = 2
print(train_tokenized_text)
print(len(train_tokenized_text))
train_data, padded_vocab = padded_everygram_pipeline(n, train_tokenized_text)
# print(list(vocab), "\n >>>>", list(padded_vocab))

model = MLE(n)  # train a bigram (2-gram) maximum likelihood estimation model
# model.fit(train_data, padded_vocab)
model.fit(train_data, vocab)

sentences = test_sentences
print("len: ", len(sentences))
print("per all", model.perplexity(test_text))
When I use vocab in model.fit(train_data, vocab), the perplexity printed by print("per all", model.perplexity(test_text)) is a finite number (30.2), but if I use padded_vocab, which contains the additional <s> and </s> tokens, it prints inf.
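For reference, this is roughly how I would prepare the test side with the same padding as the training pipeline and build bigrams before calling model.perplexity, in case it expects n-grams rather than raw text. Using pad_both_ends and nltk.util.bigrams this way is my assumption about the expected input, not something I have confirmed:

from nltk.lm.preprocessing import pad_both_ends
from nltk.util import bigrams

# Pad every tokenized test sentence with <s> / </s>, the same way
# padded_everygram_pipeline pads the training sentences, then turn
# each padded sentence into bigrams.
test_bigrams = [list(bigrams(pad_both_ends(sent, n=2)))
                for sent in test_tokenized_text]

# Per-sentence perplexity (assumption: perplexity takes an iterable of n-grams).
for i, sent_ngrams in enumerate(test_bigrams):
    print(i, model.perplexity(sent_ngrams))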