Merge pandas dataframes on the nearest timestamp

Question

Aug 06, 2016, 09:15 PM


I want to merge two dataframes on three columns: email, subject, and timestamp. The timestamps differ between the two dataframes, so I need to identify the closest matching timestamp for each combination of email and subject.

Below is a reproducible example that uses a closest-match function suggested for this question.

import numpy as np
import pandas as pd
from io import StringIO  # pandas.io.parsers no longer exposes StringIO

def find_closest_date(timepoint, time_series, add_time_delta_column=True):
    # takes a pd.Timestamp() instance and a pd.Series with dates in it
    # calcs the delta between `timepoint` and each date in `time_series`
    # returns the closest date and optionally its time delta
    deltas = np.abs(time_series - timepoint)
    # idxmin() returns the index label of the smallest delta
    idx_closest_date = deltas.idxmin()
    # .ix has been removed from pandas; .loc does the label-based lookup
    res = {"closest_date": time_series.loc[idx_closest_date]}
    idx = ['closest_date']
    if add_time_delta_column:
        res["closest_delta"] = deltas.loc[idx_closest_date]
        idx.append('closest_delta')
    return pd.Series(res, index=idx)


a = """timestamp,email,subject
2016-07-01 10:17:00,[email protected],subject3
2016-07-01 02:01:02,[email protected],welcome
2016-07-01 14:45:04,[email protected],subject3
2016-07-01 08:14:02,[email protected],subject2
2016-07-01 16:26:35,[email protected],subject4
2016-07-01 10:17:00,[email protected],subject3
2016-07-01 02:01:02,[email protected],welcome
2016-07-01 14:45:04,[email protected],subject3
2016-07-01 08:14:02,[email protected],subject2
2016-07-01 16:26:35,[email protected],subject4
"""

b = """timestamp,email,subject,clicks,var1
2016-07-01 02:01:14,[email protected],welcome,1,1
2016-07-01 08:15:48,[email protected],subject2,2,2
2016-07-01 10:17:39,[email protected],subject3,1,7
2016-07-01 14:46:01,[email protected],subject3,1,2
2016-07-01 16:27:28,[email protected],subject4,1,2
2016-07-01 10:17:05,[email protected],subject3,0,0
2016-07-01 02:01:03,[email protected],welcome,0,0
2016-07-01 14:45:05,[email protected],subject3,0,0
2016-07-01 08:16:00,[email protected],subject2,0,0
2016-07-01 17:00:00,[email protected],subject4,0,0
"""

Note that for [email protected] the closest matching timestamp is 10:17:39, whereas for [email protected] it is 10:17:05.

a = """timestamp,email,subject
2016-07-01 10:17:00,[email protected],subject3
2016-07-01 10:17:00,[email protected],subject3
"""

b = """timestamp,email,subject,clicks,var1
2016-07-01 10:17:39,[email protected],subject3,1,7
2016-07-01 10:17:05,[email protected],subject3,0,0
"""
df1 = pd.read_csv(StringIO(a), parse_dates=['timestamp'])
df2 = pd.read_csv(StringIO(b), parse_dates=['timestamp'])

df1[['closest', 'time_bt_x_and_y']] = df1.timestamp.apply(find_closest_date, args=[df2.timestamp])
df1

df3 = pd.merge(df1, df2, left_on=['email','subject','closest'], right_on=['email','subject','timestamp'],how='left')

df3
          timestamp_x              email   subject             closest  time_bt_x_and_y         timestamp_y  clicks  var1
  2016-07-01 10:17:00  [email protected]  subject3 2016-07-01 10:17:05         00:00:05                 NaT     NaN   NaN
  2016-07-01 02:01:02  [email protected]   welcome 2016-07-01 02:01:03         00:00:01                 NaT     NaN   NaN
  2016-07-01 14:45:04  [email protected]  subject3 2016-07-01 14:45:05         00:00:01                 NaT     NaN   NaN
  2016-07-01 08:14:02  [email protected]  subject2 2016-07-01 08:15:48         00:01:46 2016-07-01 08:15:48     2.0   2.0
  2016-07-01 16:26:35  [email protected]  subject4 2016-07-01 16:27:28         00:00:53 2016-07-01 16:27:28     1.0   2.0
  2016-07-01 10:17:00  [email protected]  subject3 2016-07-01 10:17:05         00:00:05 2016-07-01 10:17:05     0.0   0.0
  2016-07-01 02:01:02  [email protected]   welcome 2016-07-01 02:01:03         00:00:01 2016-07-01 02:01:03     0.0   0.0
  2016-07-01 14:45:04  [email protected]  subject3 2016-07-01 14:45:05         00:00:01 2016-07-01 14:45:05     0.0   0.0
  2016-07-01 08:14:02  [email protected]  subject2 2016-07-01 08:15:48         00:01:46                 NaT     NaN   NaN
  2016-07-01 16:26:35  [email protected]  subject4 2016-07-01 16:27:28         00:00:53                 NaT     NaN   NaN

The result is wrong, mainly because the closest date is wrong: the lookup does not take email and subject into account.

The expected result is

It would help to modify the function so that it returns the closest timestamp for a given email and subject.

df1.groupby(['email','subject'])['timestamp'].apply(find_closest_date, args=[df1.timestamp])

But this raises an error, because the function is not defined for a group object. What is the best way to do this?
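
One possible approach, sketched here rather than taken from the original post, is `pd.merge_asof` with `by=['email', 'subject']` and `direction='nearest'`, which restricts the nearest-timestamp search to rows sharing the same email and subject. The addresses `a@example.com` and `b@example.com` are hypothetical stand-ins, since the real ones are not recoverable from the post, and the `closest` and `time_bt_x_and_y` columns are rebuilt only to mirror the names used above.

```python
from io import StringIO

import pandas as pd

# Hypothetical addresses; the originals are not recoverable from the post.
a = """timestamp,email,subject
2016-07-01 10:17:00,a@example.com,subject3
2016-07-01 10:17:00,b@example.com,subject3
"""

b = """timestamp,email,subject,clicks,var1
2016-07-01 10:17:39,a@example.com,subject3,1,7
2016-07-01 10:17:05,b@example.com,subject3,0,0
"""

df1 = pd.read_csv(StringIO(a), parse_dates=["timestamp"])
df2 = pd.read_csv(StringIO(b), parse_dates=["timestamp"])

# merge_asof keeps only the left-hand `on` column, so copy the right-hand
# timestamp into its own column before merging.
df2 = df2.assign(closest=df2["timestamp"])

# Both frames must be sorted by the `on` column.
merged = pd.merge_asof(
    df1.sort_values("timestamp"),
    df2.sort_values("timestamp"),
    on="timestamp",
    by=["email", "subject"],   # match only within the same email and subject
    direction="nearest",       # nearest in either direction, not just backward
)

# Absolute distance between the left timestamp and the matched one.
merged["time_bt_x_and_y"] = (merged["closest"] - merged["timestamp"]).abs()
print(merged)
```

If matches that are too far apart should be rejected, `merge_asof` also accepts a `tolerance` argument such as `pd.Timedelta("5min")`; rows without a match inside that window get missing values in the right-hand columns.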
