Temporal Binning in Pandas

I would like to perform something similar to an SQL GROUP BY or R's aggregate in pandas. I have a bunch of rows with irregular timestamps, and I would like to create temporal bins and count the number of rows falling into each bin. I can't quite see how to use resample to do this.
Example Rows
Time, Val
05.33, XYZ
05.45, ABC
07.13, DEF
Example Output
05.00-06.00, 2
06.00-07.00, 0
07.00-08.00, 1

If you are indexing on another value, you can use a groupby statement on the timestamp.
In [1]: dft = pd.DataFrame({'A': ['spam', 'eggs', 'spam', 'eggs'] * 6,
                            'B': np.random.randn(24),
                            'C': [np.random.choice(pd.date_range(datetime.datetime(2013,1,1,0,0,0),
                                                                 datetime.datetime(2013,1,2,0,0,0),
                                                                 freq='T'))
                                  for i in range(24)]})
In [2]: dft['B'].groupby(dft['C'].apply(lambda x: x.hour)).agg(pd.Series.nunique)
Out[2]:
C
2 1
4 1
6 1
7 1
9 1
10 2
11 1
12 4
14 1
15 2
16 1
18 3
19 1
20 1
21 1
22 1
23 1
dtype: float64
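As a side note, on recent pandas the same grouping is usually spelled with the .dt accessor, which is equivalent to the apply above:

dft['B'].groupby(dft['C'].dt.hour).nunique()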
If you're indexing on timestamps, then you can use resample.
In [3]: dft2 = pd.DataFrame({'A': ['spam', 'eggs', 'spam', 'eggs'] * 6,
                             'B': np.random.randn(24)},
                            index=[np.random.choice(pd.date_range(datetime.datetime(2013,1,1,0,0,0),
                                                                  datetime.datetime(2013,1,2,0,0,0),
                                                                  freq='T'))
                                   for i in range(24)])
In [4]: dft2.resample('H').nunique()   # pandas < 0.18: dft2.resample('H', how=pd.Series.nunique)
Out[4]:
A B
2013-01-01 01:00:00 1 1
2013-01-01 02:00:00 0 0
2013-01-01 03:00:00 0 0
2013-01-01 04:00:00 0 0
2013-01-01 05:00:00 2 2
2013-01-01 06:00:00 2 3
2013-01-01 07:00:00 1 2
2013-01-01 08:00:00 2 2
2013-01-01 09:00:00 1 1
2013-01-01 10:00:00 2 3
2013-01-01 11:00:00 1 1
2013-01-01 12:00:00 1 2
2013-01-01 13:00:00 0 0
2013-01-01 14:00:00 1 1
2013-01-01 15:00:00 0 0
2013-01-01 16:00:00 1 1
2013-01-01 17:00:00 1 2
2013-01-01 18:00:00 0 0
2013-01-01 19:00:00 0 0
2013-01-01 20:00:00 2 2
2013-01-01 21:00:00 1 1
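For the counting use case in the original question, a minimal sketch (assuming the example times are parsed onto an arbitrary common date) would be to count rows per hourly bin with resample('H').size(), which keeps empty bins as 0:

import pandas as pd

# the example rows from the question, placed on an arbitrary day
df = pd.DataFrame({'Val': ['XYZ', 'ABC', 'DEF']},
                  index=pd.to_datetime(['2013-01-01 05:33',
                                        '2013-01-01 05:45',
                                        '2013-01-01 07:13']))

print(df.resample('H').size())
# 2013-01-01 05:00:00    2
# 2013-01-01 06:00:00    0
# 2013-01-01 07:00:00    1
# Freq: H, dtype: int64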

Related

How to repeat same datetime data for different interval

I have this dataset
Value1 Value2
2000-01-01 12:00:00 1 2
2000-01-02 12:00:00 3 4
2000-01-03 12:00:00 5 6
I want to repeat the same data but at 4 different time intervals for example
Value1 Value2
2000-01-01 00:00:00 1 2
2000-01-01 06:00:00 1 2
2000-01-01 12:00:00 1 2
2000-01-01 18:00:00 1 2
2000-01-02 00:00:00 3 4
2000-01-02 06:00:00 3 4
2000-01-02 12:00:00 3 4
2000-01-02 18:00:00 3 4
and so on.
Although your dates are contiguous, this solution will also work for non-contiguous dates: generate a series of the expanded times, then join on it.
import pandas as pd
import io

df = pd.read_csv(io.StringIO("""Value1  Value2
2000-01-01 12:00:00       1       2
2000-01-02 12:00:00       3       4
2000-01-03 12:00:00       5       6"""), sep=r"\s\s+", engine="python")
df = df.set_index(pd.to_datetime(df.index))

# build the four 6-hourly timestamps for each day, join, then re-index
df = df.join(
    pd.Series(df.index, index=df.index)
    .rename("expanded")
    .dt.date.apply(lambda d: pd.date_range(d, freq="6H", periods=4))
    .explode()
).set_index("expanded")
df
                     Value1  Value2
expanded
2000-01-01 00:00:00       1       2
2000-01-01 06:00:00       1       2
2000-01-01 12:00:00       1       2
2000-01-01 18:00:00       1       2
2000-01-02 00:00:00       3       4
2000-01-02 06:00:00       3       4
2000-01-02 12:00:00       3       4
2000-01-02 18:00:00       3       4
2000-01-03 00:00:00       5       6
2000-01-03 06:00:00       5       6
2000-01-03 12:00:00       5       6
2000-01-03 18:00:00       5       6
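An alternative sketch, assuming df is the original three-row frame with its DatetimeIndex and that every day should get exactly the 0/6/12/18 hour copies, repeats each row four times and rebuilds the index directly:

import numpy as np
import pandas as pd

# repeat each row 4 times, then offset each day's copies by 0/6/12/18 hours
out = df.reindex(df.index.repeat(4))
out.index = (out.index.floor("D")
             + pd.to_timedelta(np.tile([0, 6, 12, 18], len(df)), unit="h"))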

How to use groupby() with between_time()?

I have a DataFrame and want to multiply all values in column a for a certain day with the value of a at 06:00:00 of that day. If there is no 06:00:00 entry, that day should stay unchanged.
The code below unfortunately gives an error.
How do I have to correct this code / replace it with any working solution?
import pandas as pd
import numpy as np

start = pd.Timestamp('2000-01-01')
end = pd.Timestamp('2000-01-03')
t = np.linspace(start.value, end.value, 9)
datetime1 = pd.to_datetime(t)
df = pd.DataFrame({'a': [1, 3, 4, 5, 6, 7, 8, 9, 14]})
df['date'] = datetime1
print(df)

def myF(x):
    y = x.set_index('date').between_time('05:59', '06:01').a
    return y

toMultiplyWith = df.groupby(df.date.dt.floor('D')).transform(myF)
This prints:
a date
0 1 2000-01-01 00:00:00
1 3 2000-01-01 06:00:00
2 4 2000-01-01 12:00:00
3 5 2000-01-01 18:00:00
4 6 2000-01-02 00:00:00
5 7 2000-01-02 06:00:00
6 8 2000-01-02 12:00:00
7 9 2000-01-02 18:00:00
8 14 2000-01-03 00:00:00
....
AttributeError: ("'Series' object has no attribute 'set_index'", 'occurred at index a')
You should change this line:
toMultiplyWith = df.groupby(df.date.dt.floor('D')).transform(myF)
to this:
toMultiplyWith = df.groupby(df.date.dt.floor('D')).apply(myF)
Using .apply instead of .transform will give you the desired result.
apply is the right choice here since it implicitly passes all the columns of each group as a DataFrame to the custom function, while transform passes each column separately as a Series.
To read more about the difference between the two methods, consider this answer.
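A tiny illustration of that difference, on hypothetical toy data:

import pandas as pd

df = pd.DataFrame({'g': ['x', 'x', 'y'], 'a': [1, 2, 3], 'b': [4, 5, 6]})

# transform calls the function once per column, so each call sees a Series,
# and Series-only behaviour applies (no set_index, no other columns visible)
print(df.groupby('g').transform(lambda s: type(s).__name__))

# apply passes each group as a whole DataFrame, so DataFrame methods
# like set_index and between_time are available inside the function
print(df.groupby('g').apply(lambda g: type(g).__name__))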
If you want to stick with the between_time(...) function, this would be the way to do it:
df = df.set_index('date')
mask = df.between_time('05:59', '06:01').index
df.loc[mask, 'a'] = df.loc[mask, 'a'] ** 2 # the operation you want to perform
df.reset_index(inplace=True)
Outputs:
date a
0 2000-01-01 00:00:00 1
1 2000-01-01 06:00:00 9
2 2000-01-01 12:00:00 4
3 2000-01-01 18:00:00 5
4 2000-01-02 00:00:00 6
5 2000-01-02 06:00:00 49
6 2000-01-02 12:00:00 8
7 2000-01-02 18:00:00 9
8 2000-01-03 00:00:00 14
If I got your goal right, you can use apply to return a dataframe with the same amount of rows as the original dataframe (simulating a transform):
def myF(grp):
    time = grp.date.dt.strftime('%T')
    target_idx = time == '06:00:00'
    if target_idx.any():
        grp.loc[~target_idx, 'a_sum'] = grp.loc[~target_idx, 'a'].values * grp.loc[target_idx, 'a'].values
    else:
        grp.loc[~target_idx, 'a_sum'] = np.nan
    return grp

df.groupby(df.date.dt.floor('D')).apply(myF)
Output:
a date a_sum
0 1 2000-01-01 00:00:00 3.0
1 3 2000-01-01 06:00:00 NaN
2 4 2000-01-01 12:00:00 12.0
3 5 2000-01-01 18:00:00 15.0
4 6 2000-01-02 00:00:00 42.0
5 7 2000-01-02 06:00:00 NaN
6 8 2000-01-02 12:00:00 56.0
7 9 2000-01-02 18:00:00 63.0
8 14 2000-01-03 00:00:00 NaN
See that, for each day, each value with a time other than 06:00:00 is multiplied by the value at 06:00:00. It returns NaN for the 06:00:00 values themselves, as well as for groups without that time.

Count rows with multiple criteria in pandas

I have a pandas dataframe with "user_ID", "datetime" and "action_type" columns like it is shown below and I want to get the last column (the last column = desired output) by performing some calculations:
data = {'user_id': list('ddabdacddaaa'),
        'datetime': pd.date_range("20201001", periods=12, freq='H'),
        'action_type': list('XXXWZWKOOXWX'),
        'as_if_X_calculated': list('121021022223')}
df = pd.DataFrame(data)
df
user_id datetime action_type as_if_X_calculated
0 d 2020-10-01 00:00:00 X 1
1 d 2020-10-01 01:00:00 X 2
2 a 2020-10-01 02:00:00 X 1
3 b 2020-10-01 03:00:00 W 0
4 d 2020-10-01 04:00:00 Z 2
5 a 2020-10-01 05:00:00 W 1
6 c 2020-10-01 06:00:00 K 0
7 d 2020-10-01 07:00:00 O 2
8 d 2020-10-01 08:00:00 O 2
9 a 2020-10-01 09:00:00 X 2
10 a 2020-10-01 10:00:00 W 2
11 a 2020-10-01 11:00:00 X 3
So the last column shows how many times the user has performed an action X at the time of the current record. If we see a user "a", his results will be like 1-1-2-2-3 in chronological order. So how can I calculate the number of action X for the given user that happened at the time of the record or earlier?
P.S. In Excel it would look like =countifs(A:A; A2; B:B; "<="&B2; C:C; "X") (Column A = "user_id")
If your dataframe is sorted by datetime, you can create a temporary column for the condition on action_type and use an expanding window sum:
df.sort_values('datetime', inplace=True)
df['dummy'] = df.action_type == 'X'
df['X_calculated'] = (df.groupby('user_id')['dummy']
                        .expanding().sum()
                        .reset_index(level=0, drop=True)
                        .astype('int'))
df.sort_index(inplace=True)
print(df.drop(columns='dummy'))   # df.drop('dummy', 1) is deprecated

assert df.as_if_X_calculated.astype('int').equals(df.X_calculated), 'X_calculated is not equal'
Out:
user_id datetime action_type as_if_X_calculated X_calculated
0 d 2020-10-01 00:00:00 X 1 1
1 d 2020-10-01 01:00:00 X 2 2
2 a 2020-10-01 02:00:00 X 1 1
3 b 2020-10-01 03:00:00 W 0 0
4 d 2020-10-01 04:00:00 Z 2 2
5 a 2020-10-01 05:00:00 W 1 1
6 c 2020-10-01 06:00:00 K 0 0
7 d 2020-10-01 07:00:00 O 2 2
8 d 2020-10-01 08:00:00 O 2 2
9 a 2020-10-01 09:00:00 X 2 2
10 a 2020-10-01 10:00:00 W 2 2
11 a 2020-10-01 11:00:00 X 3 3
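Since only a running count per user is needed, a shorter equivalent sketch (again assuming the frame is sorted by datetime) is a grouped cumulative sum over the boolean mask:

df['X_calculated'] = (df.action_type == 'X').groupby(df.user_id).cumsum().astype(int)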

Sort csv-data while reading, using pandas

I have a csv-file with entries like this:
1,2014 1 1 0 1,5
2,2014 1 1 0 1,5
3,2014 1 1 0 1,5
4,2014 1 1 0 1,6
5,2014 1 1 0 1,6
6,2014 1 1 0 1,12
7,2014 1 1 0 1,17
8,2014 5 7 1 5,4
The first column is the ID, the second the arrival date (the last entry, for example, is May 7, 1:05 a.m.) and the last column is the duration of the work (in minutes).
Currently, I read in the data using pandas and the following function:
import pandas as pd

def convert_data(csv_path):
    store = pd.HDFStore(data_file)
    print('Loading CSV File')
    df = pd.read_csv(csv_path, parse_dates=True)
    print('CSV File Loaded, Converting Dates/Times')
    df['Arrival_time'] = map(convert_time, df['Arrival_time'])
    df['Rel_time'] = (df['Arrival_time'] - REF.timestamp)/60.0
    print('Conversion Complete')
    store['orders'] = df
My question is: How can I sort the entries according to their duration, but considering the arrival-date? So, I'd like to sort the csv-entries according to "arrival-date + duration". How is this possible?
Thanks for any hint! Best regards, Stan.
OK, the following shows how you can convert the datetimes and then how to add the minutes:
In [79]:
df['Arrival_Date'] = pd.to_datetime(df['Arrival_Date'], format='%Y %m %d %H %M')
df
Out[79]:
ID Arrival_Date Duration
0 1 2014-01-01 00:01:00 5
1 2 2014-01-01 00:01:00 5
2 3 2014-01-01 00:01:00 5
3 4 2014-01-01 00:01:00 6
4 5 2014-01-01 00:01:00 6
5 6 2014-01-01 00:01:00 12
6 7 2014-01-01 00:01:00 17
7 8 2014-05-07 01:05:00 4
In [80]:
import datetime as dt
df['Arrival_and_Duration'] = df['Arrival_Date'] + df['Duration'].apply(lambda x: dt.timedelta(minutes=int(x)))
df
Out[80]:
ID Arrival_Date Duration Arrival_and_Duration
0 1 2014-01-01 00:01:00 5 2014-01-01 00:06:00
1 2 2014-01-01 00:01:00 5 2014-01-01 00:06:00
2 3 2014-01-01 00:01:00 5 2014-01-01 00:06:00
3 4 2014-01-01 00:01:00 6 2014-01-01 00:07:00
4 5 2014-01-01 00:01:00 6 2014-01-01 00:07:00
5 6 2014-01-01 00:01:00 12 2014-01-01 00:13:00
6 7 2014-01-01 00:01:00 17 2014-01-01 00:18:00
7 8 2014-05-07 01:05:00 4 2014-05-07 01:09:00
In [81]:
df.sort_values(by='Arrival_and_Duration')   # df.sort(columns=...) in pandas < 0.17
Out[81]:
ID Arrival_Date Duration Arrival_and_Duration
0 1 2014-01-01 00:01:00 5 2014-01-01 00:06:00
1 2 2014-01-01 00:01:00 5 2014-01-01 00:06:00
2 3 2014-01-01 00:01:00 5 2014-01-01 00:06:00
3 4 2014-01-01 00:01:00 6 2014-01-01 00:07:00
4 5 2014-01-01 00:01:00 6 2014-01-01 00:07:00
5 6 2014-01-01 00:01:00 12 2014-01-01 00:13:00
6 7 2014-01-01 00:01:00 17 2014-01-01 00:18:00
7 8 2014-05-07 01:05:00 4 2014-05-07 01:09:00
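As a side note, the per-row timedelta construction can be vectorized with pd.to_timedelta, which should be considerably faster on large frames; a sketch of the equivalent:

import pandas as pd

# vectorized equivalent of the apply/timedelta line above
df['Arrival_and_Duration'] = df['Arrival_Date'] + pd.to_timedelta(df['Duration'], unit='m')
df = df.sort_values(by='Arrival_and_Duration')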

Unable to convert to datetime using pd.to_datetime

I am trying to read a csv file and convert it to a dataframe to be used as a time series.
The csv file is of this type:
#Date Time CO_T1_AHU.01_CC_CTRV_CHW__SIG_STAT
0 NaN NaN %
1 NaN NaN Cooling Coil Hydronic Valve Position
2 2014-01-01 00:00:00 0
3 2014-01-01 01:00:00 0
4 2014-01-01 02:00:00 0
5 2014-01-01 03:00:00 0
6 2014-01-01 04:00:00 0
I read the file using:
df = pd.read_csv ('filepath/file.csv', sep=';', parse_dates = [[0,1]])
producing this result:
#Date_Time FCO_T1_AHU.01_CC_CTRV_CHW__SIG_STAT
0 nan nan %
1 nan nan Cooling Coil Hydronic Valve Position
2 2014-01-01 00:00:00 0
3 2014-01-01 01:00:00 0
4 2014-01-01 02:00:00 0
5 2014-01-01 03:00:00 0
6 2014-01-01 04:00:00 0
To continue, I convert the strings to datetime and use that column as the index:
pd.to_datetime(df.values[:,0])
df.set_index([df.columns[0]], inplace=True)
so i get this:
FCO_T1_AHU.01_CC_CTRV_CHW__SIG_STAT
#Date_Time
nan nan %
nan nan Cooling Coil Hydronic Valve Position
2014-01-01 00:00:00 0
2014-01-01 01:00:00 0
2014-01-01 02:00:00 0
2014-01-01 03:00:00 0
2014-01-01 04:00:00 0
However, pd.to_datetime is unable to convert the column to datetime. Is there a way of finding out what the error is?
Many thanks.
Luis
The string entry 'nan nan' cannot be converted using to_datetime, so replace these with an empty string so that they can now be converted to NaT:
In [122]:
df['Date_Time'].replace('nan nan', '', inplace=True)
df
Out[122]:
Date_Time index CO_T1_AHU.01_CC_CTRV_CHW__SIG_STAT
0 0 %
1 1 Cooling Coil Hydronic Valve Position
2 2014-01-01 00:00:00 2 0
3 2014-01-01 01:00:00 3 0
4 2014-01-01 02:00:00 4 0
5 2014-01-01 03:00:00 5 0
6 2014-01-01 04:00:00 6 0
In [124]:
df['Date_Time'] = pd.to_datetime(df['Date_Time'])
df
Out[124]:
Date_Time index CO_T1_AHU.01_CC_CTRV_CHW__SIG_STAT
0 NaT 0 %
1 NaT 1 Cooling Coil Hydronic Valve Position
2 2014-01-01 00:00:00 2 0
3 2014-01-01 01:00:00 3 0
4 2014-01-01 02:00:00 4 0
5 2014-01-01 03:00:00 5 0
6 2014-01-01 04:00:00 6 0
UPDATE
Actually, if you just set errors='coerce' (spelled coerce=True in older pandas versions), it converts fine:
df['Date_Time'] = pd.to_datetime(df['Date_Time'], errors='coerce')
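Alternatively, the two metadata rows (the units row and the description row) can be skipped at read time, so the dates parse in a single pass. A sketch, assuming the file layout shown above:

import pandas as pd

# skip the '%' and description rows entirely, combine and parse the
# Date/Time columns while reading, and use the result as the index
df = pd.read_csv('filepath/file.csv', sep=';', skiprows=[1, 2],
                 parse_dates=[[0, 1]], index_col=0)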
