Processing data by dynamically adding columns to a Python pandas DataFrame

Question

May 10, 2014, 12:40 PM

I have the following problem. Suppose this is my CSV:

id f1 f2 f3
1  4  5  5
1  3  1  0
1  7  4  4
1  4  3  1
1  1  4  6
2  2  6  0
..........

So I have rows that can be grouped by id. I want to create a CSV like the one below as output.

f1 f2 f3 f1_n f2_n f3_n f1_n_n f2_n_n f3_n_n f1_t f2_t f3_t
4  5  5  3    1    0    7      4      4      1    4    6

So I want to be able to choose the number of rows that I convert into columns (always starting with the first row of an id). In this case I packed 3 rows. I then also skip one or more rows (in this case only one) and take the final columns from the last row of the same id group. For various reasons, I want to use a DataFrame.
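The reshaping described above can be checked on the sample rows for id 1. This is only a small sketch: the names `raw`, `features`, `head`, `tail` and `row` are made up for illustration, and `model = 3` matches the example.

```python
import io
import pandas as pd

# Sample rows from the question (only the complete group id == 1).
raw = """id f1 f2 f3
1  4  5  5
1  3  1  0
1  7  4  4
1  4  3  1
1  1  4  6
"""
df = pd.read_csv(io.StringIO(raw), sep=r"\s+")

model = 3                          # rows to pack, counted from the top of the id
features = ["f1", "f2", "f3"]      # illustrative subset of the real columns

group = df[df["id"] == 1]
# First `model` rows, flattened row by row into one long list.
head = group[features].iloc[:model].to_numpy().ravel().tolist()
# Last row of the group supplies the *_t target values.
tail = group[features].iloc[-1].tolist()

# Column names: f1, f2, f3, f1_n, ..., f1_n_n, ..., then f1_t, f2_t, f3_t.
columns = [f + "_n" * k for k in range(model) for f in features]
columns += [f + "_t" for f in features]

row = pd.DataFrame([head + tail], columns=columns)
print(row.to_string(index=False))
```

The fourth row of the group (4 3 1) is the one that gets skipped here, because it is neither among the first three rows nor the last one.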

After struggling for 3-4 hours, I found the solution given below, but it is very slow. I have about 700,000 rows, and there may be about 70,000 groups of ids. With model = 3, the code below takes almost an hour on my 4 GB, 4-core Lenovo. I need to go to model = maybe 10 or 15. I am still a beginner in Python, and I am sure several changes could make this faster. Can someone explain how I can improve the code?
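One likely contributor to the run time is growing a DataFrame one row at a time, since every step copies all rows collected so far. A common alternative, sketched here under that assumption, is to collect plain Python lists and build the frame once at the end. The helper name `build_rows_at_once` and its arguments are hypothetical.

```python
import pandas as pd

def build_rows_at_once(records, columns):
    """Collect plain lists first, then create the DataFrame a single time."""
    rows = []
    for rec in records:
        rows.append(list(rec))      # appending to a Python list is cheap
    # One DataFrame construction instead of one per record.
    return pd.DataFrame(rows, columns=columns)

frame = build_rows_at_once([(4, 5, 5), (3, 1, 0)], ["f1", "f2", "f3"])
print(frame)
```

With roughly 70,000 groups this replaces 70,000 frame copies with a single construction, although I have not measured how much of the hour that accounts for.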

Thanks a lot.

model: the number of rows to grab

import pandas as pd

# train data frame from reading the csv
train = pd.read_csv(filename)

# Get groups of rows with same id
csv_by_id = train.groupby('id')

modelTarget = ['f1_t', 'f2_t', 'f3_t']  # a list, so the column order is fixed

# modelFeatures is a list of features I am interested in the csv.
# The csv actually has hundreds
modelFeatures = ['f1', 'f2', 'f3']

coreFeatures = list(modelFeatures) # cloning 


selectedFeatures = list(modelFeatures) # cloning

newFeatures = list(selectedFeatures) # cloning

finalFeatures = list(selectedFeatures) # cloning

# Now create the column list depending on the number of rows I will grab from
for x in range(2,model+1):
    newFeatures = [s + '_n' for s in newFeatures]
    finalFeatures = finalFeatures + newFeatures

# This is the final column list for my one row in the final data frame
selectedFeatures = finalFeatures + list(modelTarget) 

# Empty dataframe which I want to populate
model_data = pd.DataFrame(columns=selectedFeatures)

for id_group in csv_by_id:
    #id_group is a tuple with first element as the id itself and second one a dataframe with the rows of a group
    group_data = id_group[1] 

    # hmm - can this be better? I am picking up the rows which I need from the first row onwards
    df = group_data[coreFeatures].iloc[:model]

    # initialize a list
    tmp = [] 

    # now keep adding the column values into the list
    for index, row in df.iterrows(): 
        tmp = tmp + list(row)


    # Wow, this one below surely should have something better.
    # So I am picking up the feature column values from the last row of the group of rows for a particular id
    targetValues = group_data[coreFeatures].iloc[-1:].values

    # Think this can be done easier too ? . Basically adding the values to the tmp list again
    tmp = tmp + list(targetValues.flatten()) 

    # converting the list to a dict.
    tmpDict = dict(zip(selectedFeatures, tmp))

    # then the dict to a dataframe (the index must be a list, not a set).
    tmpDf = pd.DataFrame(tmpDict, index=[1])

    # I just could not find a better way of adding a dict or list directly into a dataframe. 
    # And I went through lots and lots of blogs on this topic, including some in StackOverflow.

    # finally I add the frame to my main frame
    # (pd.concat, because DataFrame.append was removed in pandas 2.0)
    model_data = pd.concat([model_data, tmpDf])

# and write it
model_data.to_csv(wd+'model_data' + str(model) + '.csv',index=False)
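
The per-group loop above might also be replaced by a groupby-based reshape. The following is a sketch, not a drop-in answer: `pack_rows`, `id_col` and the temporary `_pos` column are hypothetical names, and the last two rows of id 2 in `sample` are invented so that the second group is complete.

```python
import pandas as pd

def pack_rows(train, features, model, id_col="id"):
    """Hypothetical helper: one output row per id, built without a Python loop."""
    # 0-based position of every row inside its id group.
    pos = train.groupby(id_col).cumcount()
    keep = pos < model
    head = train.loc[keep, [id_col] + features].assign(_pos=pos[keep].to_numpy())
    # Pivot the positions into columns: (feature, position) pairs.
    wide = head.set_index([id_col, "_pos"])[features].unstack("_pos")
    # Order the columns row by row, then rename them f1, f1_n, f1_n_n, ...
    order = pd.MultiIndex.from_tuples(
        [(f, k) for k in range(model) for f in features]
    )
    wide = wide.reindex(columns=order)
    wide.columns = [f + "_n" * k for k in range(model) for f in features]
    # Strictly the last row of each group supplies the *_t target columns.
    last = train.groupby(id_col).tail(1).set_index(id_col)[features]
    last.columns = [f + "_t" for f in features]
    return wide.join(last)

sample = pd.DataFrame({
    "id": [1, 1, 1, 1, 1, 2, 2, 2],
    "f1": [4, 3, 7, 4, 1, 2, 1, 9],
    "f2": [5, 1, 4, 3, 4, 6, 1, 8],
    "f3": [5, 0, 4, 1, 6, 0, 1, 7],
})
result = pack_rows(sample, ["f1", "f2", "f3"], model=3)
print(result)
```

The result is indexed by id, so `result.to_csv(..., index=False)` would drop the id as in the original output. Groups with fewer than `model` rows should end up with NaN in the missing positions, which differs from the loop version. I would expect this to be considerably faster on 700,000 rows, but I have not timed it on the full data.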