Incremental PCA for Big Data

Question

Jul 15, 2015, 01:00 PM

I just tried using IncrementalPCA from sklearn.decomposition, but it raised a MemoryError, just like PCA and RandomizedPCA did before it. My problem is that the matrix I am trying to load is too large to fit into RAM. Right now it is stored in an hdf5 database as a dataset of shape ~(1000000, 1000), so I have 1,000,000,000 float32 values. I thought IncrementalPCA loads the data in batches, but apparently it tries to load the entire dataset, which does not help. How is this library meant to be used? Is the hdf5 format the problem?

from sklearn.decomposition import IncrementalPCA
import h5py

db = h5py.File("db.h5","r")
data = db["data"]
IncrementalPCA(n_components=10, batch_size=1).fit(data)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/software/anaconda/2.3.0/lib/python2.7/site-packages/sklearn/decomposition/incremental_pca.py", line 165, in fit
    X = check_array(X, dtype=np.float)
  File "/software/anaconda/2.3.0/lib/python2.7/site-packages/sklearn/utils/validation.py", line 337, in check_array
    array = np.atleast_2d(array)
  File "/software/anaconda/2.3.0/lib/python2.7/site-packages/numpy/core/shape_base.py", line 99, in atleast_2d
    ary = asanyarray(ary)
  File "/software/anaconda/2.3.0/lib/python2.7/site-packages/numpy/core/numeric.py", line 514, in asanyarray
    return array(a, dtype, copy=False, order=order, subok=True)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper (-------src-dir-------/h5py/_objects.c:2458)
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper (-------src-dir-------/h5py/_objects.c:2415)
  File "/software/anaconda/2.3.0/lib/python2.7/site-packages/h5py/_hl/dataset.py", line 640, in __array__
    arr = numpy.empty(self.shape, dtype=self.dtype if dtype is None else dtype)
MemoryError

Thanks for the help.
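
For reference, the traceback shows fit() handing the h5py dataset to check_array, which converts it to a single NumPy array, so the whole dataset is materialized regardless of batch_size. One possible workaround, sketched below under assumptions, is to slice the dataset manually and call partial_fit on each slice. The helper name fit_ipca_in_chunks and the chunk_size value are hypothetical and not part of sklearn; the sketch only assumes that the data object has a .shape and supports row slicing, as an h5py dataset does.

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA


def fit_ipca_in_chunks(data, n_components=10, chunk_size=1000):
    """Fit IncrementalPCA on a sliceable 2-D dataset, one chunk of rows at a time.

    `data` only needs a `.shape` and row slicing (e.g. an h5py dataset),
    so at most `chunk_size` rows are held in memory per step.
    This helper is a hypothetical illustration, not an sklearn API.
    """
    n_samples = data.shape[0]
    ipca = IncrementalPCA(n_components=n_components)
    for start in range(0, n_samples, chunk_size):
        stop = min(start + chunk_size, n_samples)
        if stop - start < n_components:
            # Skip a trailing chunk with fewer rows than n_components,
            # since a batch that small may be rejected by partial_fit.
            break
        # Slicing reads just these rows; convert them to a float array.
        chunk = np.asarray(data[start:stop], dtype=np.float64)
        ipca.partial_fit(chunk)
    return ipca


# Possible usage with the file from the question (not executed here):
# import h5py
# with h5py.File("db.h5", "r") as db:
#     ipca = fit_ipca_in_chunks(db["data"], n_components=10, chunk_size=1000)
```

Note that each chunk should contain at least n_components rows, so batch_size=1 with n_components=10 as in the original call would likely be a problem of its own. Transforming the data afterwards would need the same chunked loop with ipca.transform.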