Transform text into a vector

Question

Jul 25, 2018, 09:33 AM

python information-retrieval nlp python-3.x text-processing

I have a dictionary of words and the frequency of each word.

{'cxampphtdocsemployeesphp': 1,
'emptiness': 1, 
'encodingundefinedconversionerror': 1, 
'msbuildexe': 2,
'e5': 1, 
'lnk4049': 1,
'specifierqualifierlist': 2, .... }

Now I want to create a bag-of-words model using this dictionary. I don't want to use a standard library function; I want to implement the algorithm myself.

The algorithm is:

1. Find the N most popular words in the dictionary and number them. This gives a dictionary of the most popular words.
2. For each title in the dictionary, create a zero vector with dimension N.
3. For each text in the corpus, iterate over the words that are in the dictionary and increase the corresponding coordinate by 1.
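The first step (picking the N most popular words and numbering them) could be sketched roughly like this. The names `build_words_to_index` and `word_counts` are hypothetical, and breaking ties between equally frequent words alphabetically is an assumption made only so the result is deterministic:

```python
def build_words_to_index(word_counts, n):
    """Return {word: index} for the n most frequent words in word_counts."""
    # Sort by descending frequency, then alphabetically so that ties
    # are resolved the same way on every run (an assumed tie-break rule).
    ranked = sorted(word_counts.items(), key=lambda item: (-item[1], item[0]))
    # Number the first n words 0, 1, 2, ... in ranked order.
    return {word: index for index, (word, _count) in enumerate(ranked[:n])}


# Sample entries from the frequency dictionary (the elided part is left out).
word_counts = {
    'cxampphtdocsemployeesphp': 1,
    'emptiness': 1,
    'encodingundefinedconversionerror': 1,
    'msbuildexe': 2,
    'e5': 1,
    'lnk4049': 1,
    'specifierqualifierlist': 2,
}

words_to_index = build_words_to_index(word_counts, 3)
print(words_to_index)
```

With N = 3 this keeps the two words that occur twice plus the alphabetically first word that occurs once.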

I have a text that I will turn into a vector using a function.

The function would look like this:

def my_bag_of_words(text, words_to_index, dict_size):
    """
    text: a string
    words_to_index: a dict mapping each dictionary word to its index
    dict_size: size of the dictionary

    return a vector which is a bag-of-words representation of 'text'
    """


Let's say we have N = 4 and the list of the most popular words is:

['hi', 'you', 'me', 'are']

Then we need to number them, for example, like this:

{'hi': 0, 'you': 1, 'me': 2, 'are': 3}

And we have the text that we want to transform into a vector:
'hi how are you'

For this text we create a corresponding zero vector:
[0, 0, 0, 0]

Then we iterate over all the words, and if a word is in the dictionary, we increase the value at the corresponding position in the vector:
'hi':  [1, 0, 0, 0]
'how': [1, 0, 0, 0] # word 'how' is not in our dictionary
'are': [1, 0, 0, 1]
'you': [1, 1, 0, 1]

The resulting vector will be:
[1, 1, 0, 1]
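
The walkthrough above could be turned into code along these lines. This is only a minimal sketch: it assumes the text is already cleaned and can be split on whitespace, and it returns a plain Python list rather than a NumPy array, neither of which is stated in the task.

```python
def my_bag_of_words(text, words_to_index, dict_size):
    """Return a bag-of-words vector (list of counts) for 'text'."""
    # Start from a zero vector with one slot per dictionary word.
    vector = [0] * dict_size
    # Assumption: words are separated by whitespace.
    for word in text.split():
        index = words_to_index.get(word)
        # Skip words missing from the dictionary (such as 'how') and
        # any index that would fall outside the vector.
        if index is not None and index < dict_size:
            vector[index] += 1
    return vector


words_to_index = {'hi': 0, 'you': 1, 'me': 2, 'are': 3}
print(my_bag_of_words('hi how are you', words_to_index, 4))  # [1, 1, 0, 1]
```

A word that appears more than once increases its coordinate each time, so 'hi hi me' would give [2, 0, 1, 0].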

Any help implementing this would be really appreciated. I'm using Python for the implementation.

Thanks

Neel