Fast information gain computation

Question

Aug 23, 2014, 03:24 PM

performance scikit-learn python machine-learning feature-selection

I need to compute information gain scores for >100,000 features in >10,000 documents for text classification. The code below works fine, but on the full dataset it is very slow: it takes more than an hour on a laptop. The dataset is 20newsgroup and I am using scikit-learn; the chi2 function provided in scikit-learn runs extremely fast.

Any idea how to compute information gain faster for a dataset like this?

import numpy as np


def information_gain(x, y):
    # x: sparse document-term matrix (documents x features); y: integer class labels.

    def _entropy(values):
        # Shannon entropy (natural log) of the class labels in `values`.
        counts = np.bincount(values)
        probs = counts[np.nonzero(counts)] / float(len(values))
        return - np.sum(probs * np.log(probs))

    def _information_gain(feature, y):
        # `feature` is a 1 x n_documents sparse row; [1] picks the documents where it is set.
        feature_set_indices = np.nonzero(feature)[1]
        # `i not in <array>` is a linear scan, repeated once per document.
        feature_not_set_indices = [i for i in feature_range if i not in feature_set_indices]
        entropy_x_set = _entropy(y[feature_set_indices])
        entropy_x_not_set = _entropy(y[feature_not_set_indices])

        return entropy_before - (((len(feature_set_indices) / float(feature_size)) * entropy_x_set)
                                 + ((len(feature_not_set_indices) / float(feature_size)) * entropy_x_not_set))

    feature_size = x.shape[0]  # number of documents, despite the name
    feature_range = range(0, feature_size)
    entropy_before = _entropy(y)
    information_gain_scores = []

    # One pass per feature (a row of x.T).
    for feature in x.T:
        information_gain_scores.append(_information_gain(feature, y))
    return information_gain_scores, []
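
For reference, the quantity this function computes per feature can be written out as a formula. The symbols are my own notation and do not come from any library: n is the number of documents, n_set and n_not_set are the numbers of documents where the feature is non-zero and zero, and p_c is the fraction of documents with class c.

```latex
H(Y) = -\sum_{c} p_c \ln p_c,
\qquad
IG(Y; x) = H(Y) - \left[ \frac{n_{\text{set}}}{n}\, H(Y \mid x \neq 0)
                       + \frac{n_{\text{not set}}}{n}\, H(Y \mid x = 0) \right]
```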

EDIT:

I merged the inner functions and ran cProfile as follows (on a dataset limited to ~15,000 features and ~1,000 documents):

cProfile.runctx(
    """for feature in x.T:
    feature_set_indices = np.nonzero(feature)[1]
    feature_not_set_indices = [i for i in feature_range if i not in feature_set_indices]

    values = y[feature_set_indices]
    counts = np.bincount(values)
    probs = counts[np.nonzero(counts)] / float(len(values))
    entropy_x_set = - np.sum(probs * np.log(probs))

    values = y[feature_not_set_indices]
    counts = np.bincount(values)
    probs = counts[np.nonzero(counts)] / float(len(values))
    entropy_x_not_set = - np.sum(probs * np.log(probs))

    result = entropy_before - (((len(feature_set_indices) / float(feature_size)) * entropy_x_set)
                             + ((len(feature_not_set_indices) / float(feature_size)) * entropy_x_not_set))
    information_gain_scores.append(result)""",
    globals(), locals())

Top 20 results by tottime:

ncalls   tottime  percall  cumtime  percall  filename:lineno(function)
1        60.27    60.27    65.48    65.48    <string>:1(<module>)
16171    1.362    0        2.801    0        csr.py:313(_get_row_slice)
16171    0.523    0        0.892    0        coo.py:201(_check)
16173    0.394    0        0.89     0        compressed.py:101(check_format)
210235   0.297    0        0.297    0        {numpy.core.multiarray.array}
16173    0.287    0        0.331    0        compressed.py:631(prune)
16171    0.197    0        1.529    0        compressed.py:534(tocoo)
16173    0.165    0        1.263    0        compressed.py:20(__init__)
16171    0.139    0        1.669    0        base.py:415(nonzero)
16171    0.124    0        1.201    0        coo.py:111(__init__)
32342    0.123    0        0.123    0        {method 'max' of 'numpy.ndarray' objects}
48513    0.117    0        0.218    0        sputils.py:93(isintlike)
32342    0.114    0        0.114    0        {method 'sum' of 'numpy.ndarray' objects}
16171    0.106    0        3.081    0        csr.py:186(__getitem__)
32342    0.105    0        0.105    0        {numpy.lib._compiled_base.bincount}
32344    0.09     0        0.094    0        base.py:59(set_shape)
210227   0.088    0        0.088    0        {isinstance}
48513    0.081    0        1.777    0        fromnumeric.py:1129(nonzero)
32342    0.078    0        0.078    0        {method 'min' of 'numpy.ndarray' objects}
97032    0.066    0        0.153    0        numeric.py:167(asarray)

Among the function calls, most of the time appears to be spent in _get_row_slice. I assume the first row covers the whole block I passed to cProfile.runctx, but I do not understand why there is such a large gap between the first row (tottime=60.27) and the second (tottime=1.362). Where was the difference spent? Is there a way to find that out with cProfile?
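
A possible explanation for the gap: cProfile records time per function call, and tottime excludes time spent in profiled sub-calls. Statements that run directly in the profiled string, such as a loop with an `i not in ...` membership test, are not function calls, so their cost is charged to the `<module>` row itself. The sketch below illustrates this on a toy workload; `membership_scan` and `profile_report` are hypothetical helper names made up for this illustration, and the workload is only a stand-in for the real loop.

```python
import cProfile
import io
import pstats


def membership_scan(n):
    """Toy stand-in for the slow step: a linear `not in` scan per element."""
    hits = list(range(0, n, 2))
    return [i for i in range(n) if i not in hits]


def profile_report(statement, namespace):
    """Profile `statement` and return the stats table (sorted by tottime) as text."""
    profiler = cProfile.Profile()
    profiler.runctx(statement, namespace, namespace)
    stream = io.StringIO()
    pstats.Stats(profiler, stream=stream).sort_stats("tottime").print_stats()
    return stream.getvalue()


# Same work written inline: there is no function row to charge it to,
# so it ends up in the tottime of the <module> row.
INLINE = """
hits = list(range(0, n, 2))
result = []
for i in range(n):
    if i not in hits:
        result.append(i)
"""

inline_report = profile_report(INLINE, {"n": 2000})

# Same work wrapped in a function: it now gets its own row in the table.
wrapped_report = profile_report(
    "result = membership_scan(n)",
    {"n": 2000, "membership_scan": membership_scan},
)

if __name__ == "__main__":
    print(inline_report)
    print(wrapped_report)
```

Under that assumption, wrapping each step of the profiled block in a small function would make the hot spot show up as its own row instead of disappearing into `<string>:1(<module>)`.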

Basically, it looks like the problem lies in the sparse matrix operations (slicing, fetching elements). The solution would probably be to compute information gain with matrix algebra (the way chi2 is implemented in scikit-learn). But I have no idea how to express this computation in terms of matrix operations... Does anyone have an idea?
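
One way the matrix formulation could look, as a minimal sketch rather than a tested replacement: a single sparse product between a one-hot class matrix and the binarized document-term matrix gives the per-class counts of documents containing each feature, and both conditional entropies follow from those counts. The name `information_gain_vectorized` and the helper `_column_entropy` are made up for this sketch. It assumes `x` is a SciPy sparse matrix of shape (documents, features) and `y` holds one class label per document. It returns only the scores, without the empty p-value list of the original function.

```python
import numpy as np
from scipy import sparse


def _column_entropy(counts):
    """Entropy (natural log) of each column of a (classes x columns) count array."""
    totals = counts.sum(axis=0)
    # Columns with a zero total keep probability 0 and contribute entropy 0.
    probs = np.divide(counts, totals, out=np.zeros_like(counts), where=totals > 0)
    log_probs = np.log(probs, out=np.zeros_like(probs), where=probs > 0)
    return -(probs * log_probs).sum(axis=0)


def information_gain_vectorized(x, y):
    """Information gain of each 'feature is non-zero' indicator with respect to y."""
    x = sparse.csr_matrix(x)
    n_docs = x.shape[0]

    # Map labels to 0..n_classes-1 and build a sparse one-hot matrix (docs x classes).
    classes, y_idx = np.unique(np.asarray(y), return_inverse=True)
    n_classes = len(classes)
    onehot = sparse.csr_matrix(
        (np.ones(n_docs), (np.arange(n_docs), y_idx)), shape=(n_docs, n_classes)
    )

    # Binarize: 1.0 wherever the feature occurs in the document.
    present = (x != 0).astype(np.float64)

    # counts_set[c, j] = number of documents of class c in which feature j is set.
    counts_set = onehot.T.dot(present).toarray()
    class_totals = np.bincount(y_idx, minlength=n_classes).astype(np.float64)
    counts_not_set = class_totals[:, None] - counts_set

    n_set = counts_set.sum(axis=0)
    n_not_set = n_docs - n_set

    entropy_before = _column_entropy(class_totals[:, None])[0]
    entropy_set = _column_entropy(counts_set)
    entropy_not_set = _column_entropy(counts_not_set)

    return entropy_before - (
        (n_set / n_docs) * entropy_set + (n_not_set / n_docs) * entropy_not_set
    )
```

The only dense arrays here have shape (classes, features), so with 20 classes and around 100,000 features they should stay small; the per-feature Python loop and the row slicing are gone entirely.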