Issue with Groupby function in Pandas - python

I am trying to take a dataframe, which has timestamps and various other fields, group by the timestamps rounded to the nearest minute, and take the average of another field. I'm getting the error: No numeric types to aggregate.
I am rounding the timestamp column like such:
df['time'] = df['time'].dt.round('1min')
The aggregation column is of the form: 0 days 00:00:00.054000
df3 = (
    df
    .groupby([df['time']])['total_diff'].mean()
    .unstack(fill_value=0)
    .reset_index()
)
I realize the total_diff column is a time delta field, but I would have thought this would still be considered numerical?
My ideal output would have the following columns: Rounded timestamp, number of records that are grouped in that timestamp, average total_diff by rounded timestamp. How can I achieve this?
EDIT
Example row:
[index, id, time, total_diff]
[400, 5fdfe9242c2fb0da04928d55, 2020-12-21 00:16:00, 0 days 00:00:00.055000]
[401, 5fdfe9242c2fb0da04928d56, 2020-12-21 00:16:00, 0 days 00:00:00.01000]
[402, 5fdfe9242c2fb0da04928d57, 2020-12-21 00:15:00, 0 days 00:00:00.05000]
The time column is not unique. I want to group by the time column, count the number of rows that are grouped into each time bucket, and produce an average of the total_diff for each time bucket.
Desired outcome:
[time, count, avg_total_diff]
[2020-12-21 00:16:00, 2, .0325]

By default, DataFrame.groupby.mean has numeric_only=True, and "numeric" only considers int, bool, and float. To also work with timedelta64[ns] you must set this to False.
Sample Data
import pandas as pd
df = pd.DataFrame(pd.date_range('2010-01-01', freq='2T', periods=6))
df[1] = df[0].diff().bfill()
# 0 1
#0 2010-01-01 00:00:00 0 days 00:02:00
#1 2010-01-01 00:02:00 0 days 00:02:00
#2 2010-01-01 00:04:00 0 days 00:02:00
#3 2010-01-01 00:06:00 0 days 00:02:00
#4 2010-01-01 00:08:00 0 days 00:02:00
#5 2010-01-01 00:10:00 0 days 00:02:00
df.dtypes
#0 datetime64[ns]
#1 timedelta64[ns]
#dtype: object
Code
df.groupby(df[0].round('5T'))[1].mean()
#DataError: No numeric types to aggregate
df.groupby(df[0].round('5T'))[1].mean(numeric_only=False)
#0
#2010-01-01 00:00:00 0 days 00:02:00
#2010-01-01 00:05:00 0 days 00:02:00
#2010-01-01 00:10:00 0 days 00:02:00
#Name: 1, dtype: timedelta64[ns]
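To also get the number of records per bucket, as the original question asks, you can combine size with the timedelta mean; a minimal sketch on the sample data above (the 0/1 column labels come from that sample):
grouped = df.groupby(df[0].round('5T'))[1]
out = pd.DataFrame({
    'count': grouped.size(),                             # records per rounded bucket
    'avg_total_diff': grouped.mean(numeric_only=False),  # timedelta average
}).reset_index()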

Related

replace dataframe values based on another dataframe

I have a pandas dataframe which is structured as follows:
timestamp y
0 2020-01-01 00:00:00 336.0
1 2020-01-01 00:15:00 544.0
2 2020-01-01 00:30:00 736.0
3 2020-01-01 00:45:00 924.0
4 2020-01-01 01:00:00 1260.0
...
The timestamp column is a datetime data type
and I have another dataframe with the following structure:
y
timestamp
00:00:00 625.076923
00:15:00 628.461538
00:30:00 557.692308
00:45:00 501.692308
01:00:00 494.615385
...
In this case, the time is the pandas datetime index.
Now what I want to do is replace the values in the first dataframe where the time field matches, i.e. the time of the day matches the second dataset.
IIUC, your first dataframe's (df1) timestamp is of datetime type, and your second dataframe (df2) has an index holding only the time of day, not the date.
Then you can do:
df1['y'] = df1['timestamp'].dt.time.map(df2['y'])
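A minimal sketch of why this works, on hypothetical data mirroring the question: .dt.time yields datetime.time objects, and Series.map looks each one up in df2's index:
import datetime as dt
import pandas as pd

df1 = pd.DataFrame({
    'timestamp': pd.to_datetime(['2020-01-01 00:00:00', '2020-01-01 00:15:00']),
    'y': [336.0, 544.0],
})
# df2 indexed by time of day, as assumed above
df2 = pd.DataFrame({'y': [625.076923, 628.461538]},
                   index=[dt.time(0, 0), dt.time(0, 15)])

df1['y'] = df1['timestamp'].dt.time.map(df2['y'])  # per-row index lookup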
I wouldn't be surprised if there is a better way, but you can accomplish this by getting the tables into a form where they can be merged on the time. Assuming your dataframes will be df and df2.
df['time'] = df['timestamp'].dt.time
df2 = df2.reset_index()
df2['timestamp'] = pd.to_datetime(df2['timestamp']).dt.time
df_combined = pd.merge(df,df2,left_on='time',right_on='timestamp')
df_combined
timestamp_x y_x time timestamp_y y_y
0 2020-01-01 00:00:00 336.0 00:00:00 00:00:00 625.076923
1 2020-01-01 00:15:00 544.0 00:15:00 00:15:00 628.461538
2 2020-01-01 00:30:00 736.0 00:30:00 00:30:00 557.692308
3 2020-01-01 00:45:00 924.0 00:45:00 00:45:00 501.692308
4 2020-01-01 01:00:00 1260.0 01:00:00 01:00:00 494.615385
# This clearly has more than you need, so just keep what you want and rename things back.
df_combined = df_combined[['timestamp_x','y_y']]
df_combined = df_combined.rename(columns={'timestamp_x':'timestamp','y_y':'y'})
New answer I like way better: actually using .map()
Still need to get df2 to have the time column to match on.
df2 = df2.reset_index()
df2['timestamp'] = pd.to_datetime(df2['timestamp']).dt.time
df['y'] = df['timestamp'].dt.time.map(dict(zip(df2['timestamp'], df2['y'])))
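Note that map() leaves NaN for any time of day missing from df2, while the default inner merge above silently drops those rows, so pick whichever behavior you need.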

Why are there different results for pandas groupby+resample on an appended dataframe

I want to groupby and resample a dataframe I have. I group by int_var and bool_var, and then I resample per 1Min to fill in any missing minutes in the dataset. This works perfectly fine for the base dataframe A:
date bool_var int_var
2021-01-01 00:03:00 True 1
2021-01-01 00:06:00 False 6
2021-01-01 00:06:00 True 6
The result then becomes something like this:
int_var bool_var date
1 True 2021-01-01 00:03:00 1
2021-01-01 00:04:00 0
2021-01-01 00:05:00 0
2021-01-01 00:06:00 0
6 True 2021-01-01 00:03:00 0
2021-01-01 00:04:00 0
2021-01-01 00:05:00 0
2021-01-01 00:06:00 1
6 False 2021-01-01 00:03:00 0
2021-01-01 00:04:00 0
2021-01-01 00:05:00 0
2021-01-01 00:06:00 1
This is exactly what I want. However, as you can see the data starts a bit after midnight, and I want those minutes from midnight to be in there as well. So I append a row for each bool_var / int_var combination at 2021-01-01 00:00:00, to make sure the resampling starts from there.
rows = []
for ...:  # one midnight row per bool_var/int_var combination
    rows.append(...)
extra_rows_df = pd.DataFrame(rows, columns=['date', 'bool_var', 'int_var'])
B = pd.concat([A, extra_rows_df], ignore_index=True)
The resulting dataframe B appears to be correct, and in the same format as dataframe A:
date bool_var int_var
2021-01-01 00:00:00 True 1
2021-01-01 00:03:00 True 1
2021-01-01 00:00:00 False 6
2021-01-01 00:06:00 False 6
2021-01-01 00:00:00 True 6
2021-01-01 00:06:00 True 6
However, if I run the exact same groupby and resample command on dataframe B, my results are all weird:
date 2021-01-01 00:00:00 ... 2021-12-31 23:59:00
int_var bool_var 1 ... 1
1 True
6 True
False
It is like each date suddenly became a column instead of being listed for each grouping.
TL;DR: use stack().
I figured it out. In dataframe A, every bool_var / int_var group has different datetime values; here (1, True) started with 00:03, but some other group, e.g. (2, True) could start with an entry at 01:14. Once I filled out dataframe A so that each group had an entry at 00:00 in dataframe B, and I resampled to fill in each minute, every group had each datetime. In this way, all those datetimes could become columns since they apply to each group.
The solution is to use stack() on this final result to move the dates back into the row index.
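A minimal sketch of the mechanism, on hypothetical data: once every group's resampled Series shares the same index, groupby().apply() broadcasts the dates into columns, and stack() moves them back into the row index:
import pandas as pd

B = pd.DataFrame({
    'date': pd.to_datetime(['2021-01-01 00:00:00', '2021-01-01 00:03:00',
                            '2021-01-01 00:00:00', '2021-01-01 00:03:00']),
    'bool_var': [True, True, False, False],
    'int_var': [1, 1, 6, 6],
})

wide = (B.groupby(['int_var', 'bool_var'])
         .apply(lambda g: g.resample('1Min', on='date').size()))
print(wide)          # both groups span 00:00-00:03, so the dates appear as columns

long = wide.stack()  # back to one row per (int_var, bool_var, minute)
print(long)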

Python - Pandas, group by time intervals

Having the following DF:
group_id timestamp
A 2020-09-29 06:00:00 UTC
A 2020-09-29 08:00:00 UTC
A 2020-09-30 09:00:00 UTC
B 2020-09-01 04:00:00 UTC
B 2020-09-01 06:00:00 UTC
I would like to count the deltas between consecutive records within each group, pooled across all groups (deltas never cross a group boundary). Result for the above example:
delta count
2 2
25 1
Explanation: In group A the deltas are
06:00:00 -> 08:00:00 (2 hours)
08:00:00 -> 09:00:00 on the next day (25 hours)
And in group B:
04:00:00 -> 06:00:00 (2 hours)
How can I achieve this using Python Pandas?
Use DataFrameGroupBy.diff for the differences per group, convert to seconds with Series.dt.total_seconds, divide by 3600 for hours, and last count the values with Series.value_counts, converting the Series to a two-column DataFrame:
df1 = (df.groupby("group_id")['timestamp']
         .diff()
         .dt.total_seconds()
         .div(3600)
         .value_counts()
         .rename_axis('delta')
         .reset_index(name='count'))
print (df1)
delta count
0 2.0 2
1 25.0 1
Code
df_out = df.groupby("group_id").diff().groupby("timestamp").size()
# convert to dataframe
df_out = df_out.to_frame().reset_index().rename(columns={"timestamp": "delta", 0: "count"})
Result
print(df_out)
delta count
0 0 days 02:00:00 2
1 1 days 01:00:00 1
The NaTs (missing values) produced by the groupby-diff are ignored automatically.
To represent the timedelta in hours, just call the total_seconds() method.
df_out["delta"] = df_out["delta"].dt.total_seconds() / 3600
print(df_out)
delta count
0 2.0 2
1 25.0 1

Delete all (hourly) day entries per row based on a daily table in python

I have a dataframe with a datetime64[ns] column, where I have data on an hourly basis:
Datum Values
2020-01-01 00:00:00 1
2020-01-01 01:00:00 10
....
2020-02-28 00:00:00 5
2020-03-01 00:00:00 4
and another table with closing days, also in a datetime64[ns] column, but holding only the day:
Dates
2020-02-28
2020-02-29
....
How can I delete all rows in the first dataframe df whose day occurs in the second dataframe Dates? So that df is:
2020-01-01 00:00:00 1
2020-01-01 01:00:00 10
....
2020-03-01 00:00:00 4
Use Series.dt.floor to set the times to 0, so it is possible to filter by Series.isin with an inverted mask in boolean indexing:
df['Datum'] = pd.to_datetime(df['Datum'])
df1['Dates'] = pd.to_datetime(df1['Dates'])
df = df[~df['Datum'].dt.floor('d').isin(df1['Dates'])]
print (df)
Datum Values
0 2020-01-01 00:00:00 1
1 2020-01-01 01:00:00 10
3 2020-03-01 00:00:00 4
EDIT: For a flag column, convert the mask to integers by Series.view or Series.astype:
df['flag'] = df['Datum'].dt.floor('d').isin(df1['Dates']).view('i1')
#alternative
#df['flag'] = df['Datum'].dt.floor('d').isin(df1['Dates']).astype('int')
print (df)
Datum Values flag
0 2020-01-01 00:00:00 1 0
1 2020-01-01 01:00:00 10 0
2 2020-02-28 00:00:00 5 1
3 2020-03-01 00:00:00 4 0
Putting your added comment into consideration:
import numpy as np

# Build a regex alternation from the dates in df1
c = "|".join(df1.Dates.astype(str))

# Coerce Datum to datetime
df['Datum'] = pd.to_datetime(df['Datum'])

# Extract the date part of Datum as a string column
df.set_index(df['Datum'], inplace=True)
df['Dates'] = df.index.date.astype(str)

# Boolean mask: dates present in both frames
m = df.Dates.str.contains(c)

# Mark inclusive dates as 0 and exclusive as 1
df['drop'] = np.where(m, 0, 1)

# Drop the helper column and reset the index
df = df.reset_index(drop=True).drop(columns=['Dates'])

Dropping time from datetime (<M8[ns]) in Pandas

So I have a 'Date' column in my data frame where the dates have a format like this:
0 1998-08-26 04:00:00
If I only want the year, month, and day, how do I drop the trivial hour?
The quickest way is to use DatetimeIndex's normalize (you first need to make the column a DatetimeIndex):
In [11]: df = pd.DataFrame({"t": pd.date_range('2014-01-01', periods=5, freq='H')})
In [12]: df
Out[12]:
t
0 2014-01-01 00:00:00
1 2014-01-01 01:00:00
2 2014-01-01 02:00:00
3 2014-01-01 03:00:00
4 2014-01-01 04:00:00
In [13]: pd.DatetimeIndex(df.t).normalize()
Out[13]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2014-01-01, ..., 2014-01-01]
Length: 5, Freq: None, Timezone: None
In [14]: df['date'] = pd.DatetimeIndex(df.t).normalize()
In [15]: df
Out[15]:
t date
0 2014-01-01 00:00:00 2014-01-01
1 2014-01-01 01:00:00 2014-01-01
2 2014-01-01 02:00:00 2014-01-01
3 2014-01-01 03:00:00 2014-01-01
4 2014-01-01 04:00:00 2014-01-01
DatetimeIndex also has some other useful attributes, e.g. .year, .month, .day.
From 0.15 there'll be a dt accessor, so you can access this (and other methods) with:
df.t.dt.normalize()
# equivalent to
pd.DatetimeIndex(df.t).normalize()
Another option
df['my_date_column'].dt.date
Would give
0 2019-06-15
1 2019-06-15
2 2019-06-15
3 2019-06-15
4 2019-06-15
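Note that .dt.date returns Python datetime.date objects, so the column's dtype becomes object, whereas normalize() keeps the datetime64[ns] dtype with the time set to midnight.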
Another possibility is using str.split:
df['Date'] = df['Date'].str.split(' ',expand=True)[0]
This splits the 'Date' column into two columns, labeled 0 and 1, using the whitespace between the date and time as the split indicator.
Column 0 of the returned dataframe contains the date, and column 1 contains the time.
The line above then assigns column 0, just the date, back to the 'Date' column of your original dataframe. (This assumes the column is still stored as strings; the .str accessor does not work on a datetime64 column.)
At read_csv, with date_parser:
to_date = lambda times: [t[0:10] for t in times]
df = pd.read_csv('input.csv',
                 parse_dates={'date': ['time']},
                 date_parser=to_date,
                 index_col='date')
