How to calculate the mean value across groups of data - python

I have this pandas DataFrame df:
Station DateTime Record
A 2017-01-01 00:00:00 20
A 2017-01-01 01:00:00 22
A 2017-01-01 02:00:00 20
A 2017-01-01 03:00:00 18
B 2017-01-01 00:00:00 22
B 2017-01-01 01:00:00 24
I want to estimate the average Record per DateTime (basically per hour) across stations A and B. If either A or B has no record for some DateTime, the Record value for that station should be treated as 0.
It can be assumed that every DateTime is present for at least one station.
This is the expected result:
DateTime Avg_Record
2017-01-01 00:00:00 21
2017-01-01 01:00:00 23
2017-01-01 02:00:00 10
2017-01-01 03:00:00 9

Here is a solution:
g = df.groupby('DateTime')['Record']
df_out = g.mean()
m = g.count() == 1
df_out.loc[m] = df_out.loc[m] / 2
df_out = df_out.reset_index()
Or an uglier one-liner:
df = df.groupby('DateTime')['Record'].apply(
    lambda x: x.mean() if x.size == 2 else x.values[0] / 2
).reset_index()
Proof:
import pandas as pd
data = '''\
Station DateTime Record
A 2017-01-01T00:00:00 20
A 2017-01-01T01:00:00 22
A 2017-01-01T02:00:00 20
A 2017-01-01T03:00:00 18
B 2017-01-01T01:00:00 22
B 2017-01-01T02:00:00 24'''
from io import StringIO  # pd.compat.StringIO was removed in newer pandas versions
fileobj = StringIO(data)
df = pd.read_csv(fileobj, sep=r'\s+', parse_dates=['DateTime'])
# Create a grouper and get the mean
g = df.groupby('DateTime')['Record']
df_out = g.mean()
# Divide by 2 where only 1 record exists
m = g.count() == 1
df_out.loc[m] = df_out.loc[m] / 2
# Reset index to get a dataframe format again
df_out = df_out.reset_index()
print(df_out)
Returns:
DateTime Record
0 2017-01-01 00:00:00 10.0
1 2017-01-01 01:00:00 22.0
2 2017-01-01 02:00:00 22.0
3 2017-01-01 03:00:00 9.0
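For reference, a more general alternative that is not tied to exactly two stations: reshape so each station becomes a column, fill missing entries with 0, and take the row-wise mean (a sketch, assuming each Station/DateTime pair occurs at most once):
df_out = (df.set_index(['DateTime', 'Station'])['Record']
            .unstack(fill_value=0)   # one column per station, missing hours become 0
            .mean(axis=1)            # average across all stations
            .rename('Avg_Record')
            .reset_index())
print(df_out)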

Related

Joining and comparing values of one df with first value of second df and then cumulative comparison

I have two dataframes, organized by date, such as:
df1
id date time sum
abc 15/03/2020 01:00:00 15
abc 15/03/2020 02:00:00 25
abc 15/03/2020 04:00:00 10
xyz 15/03/2020 12:00:00 30
xyz 15/03/2020 03:00:00 20
df2
id date sum_last
abc 14/03/2020 10
xyz 14/03/2020 20
I want to create a Flag column in df1: if a row's sum value is greater than the previous sum row (within the same id), the flag is 1, otherwise 0. The first sum row of an id (e.g. the value 15) should not be NaN; it should instead be compared with the sum_last value in df2 for the same id, which belongs to the previous date (14th March 2020). So the output will be:
id date time sum Flag
abc 15/03/2020 01:00:00 15 1
abc 15/03/2020 02:00:00 25 1
abc 15/03/2020 04:00:00 10 0
xyz 15/03/2020 12:00:00 30 1
xyz 15/03/2020 03:00:00 20 0
Can anyone help me join these two dataframes and get this exact result, comparing df2's value with the first sum value per id in df1? Thanks in advance.
Use:
print (df1)
id date time sum sum1
0 abc 15/03/2020 01:00:00 15 10
1 abc 15/03/2020 02:00:00 25 10
2 abc 15/03/2020 04:00:00 10 10
3 xyz 15/03/2020 12:00:00 30 10
4 xyz 15/03/2020 03:00:00 20 10
print (df2)
id date sum_last sum1_last
0 abc 15/03/2020 10 0
1 xyz 14/03/2020 20 100
#columns for processing
cols = ['sum','sum1']
#column names in df2
new = [x + '_last' for x in cols]
#dictionary to rename df2 columns to match df1.columns
d = dict(zip(new, cols))
print (d)
#set id to index
df1 = df1.set_index('id')
df2 = df2.set_index('id')
#shift per id and replace each group's first NaN with the value from df2
df = df1.groupby('id')[cols].shift().fillna(df2.rename(columns=d)[cols])
print (df)
sum sum1
id
abc 10.0 0.0
abc 15.0 10.0
abc 25.0 10.0
xyz 20.0 100.0
xyz 30.0 10.0
#compare with the previous values and add the flag columns to df1
df1 = pd.concat([df1, df1[cols].gt(df[cols]).astype(int).add_prefix('flag_')], axis=1)
print (df1)
date time sum sum1 flag_sum flag_sum1
id
abc 15/03/2020 01:00:00 15 10 1 1
abc 15/03/2020 02:00:00 25 10 1 0
abc 15/03/2020 04:00:00 10 10 0 0
xyz 15/03/2020 12:00:00 30 10 1 0
xyz 15/03/2020 03:00:00 20 10 0 0
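If only the single Flag column from the question is needed, the same idea reduces to a shift per id whose first NaN is filled from df2 (a minimal sketch, assuming df1 and df2 exactly as posted in the question, with id still a regular column):
# previous sum within each id; the first row per id is NaN
prev = df1.groupby('id')['sum'].shift()
# fill that first NaN with the matching sum_last from df2
prev = prev.fillna(df1['id'].map(df2.set_index('id')['sum_last']))
# 1 if the current sum is greater than the previous one, else 0
df1['Flag'] = df1['sum'].gt(prev).astype(int)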

Delete all (hourly) day entries per row based on a daily table in python

I have a dataframe with a datetime64[ns] column holding data on an hourly basis:
Datum Values
2020-01-01 00:00:00 1
2020-01-01 01:00:00 10
....
2020-02-28 00:00:00 5
2020-03-01 00:00:00 4
and another table with closing days, also in a datetime64[ns] column, but holding only dates (no time component):
Dates
2020-02-28
2020-02-29
....
How can I delete all rows in the first dataframe df whose day occurs in the second dataframe Dates? So that df becomes:
2020-01-01 00:00:00 1
2020-01-01 01:00:00 10
....
2020-03-01 00:00:00 4
Use Series.dt.floor to set the times to midnight, then filter with Series.isin and an inverted mask in boolean indexing:
df['Datum'] = pd.to_datetime(df['Datum'])
df1['Dates'] = pd.to_datetime(df1['Dates'])
df = df[~df['Datum'].dt.floor('d').isin(df1['Dates'])]
print (df)
Datum Values
0 2020-01-01 00:00:00 1
1 2020-01-01 01:00:00 10
3 2020-03-01 00:00:00 4
EDIT: For a flag column, convert the mask to integers with Series.view or Series.astype:
df['flag'] = df['Datum'].dt.floor('d').isin(df1['Dates']).view('i1')
#alternative
#df['flag'] = df['Datum'].dt.floor('d').isin(df1['Dates']).astype('int')
print (df)
Datum Values flag
0 2020-01-01 00:00:00 1 0
1 2020-01-01 01:00:00 10 0
2 2020-02-28 00:00:00 5 1
3 2020-03-01 00:00:00 4 0
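For reference, Series.dt.normalize does the same thing as flooring to day precision, so an equivalent filter (a sketch, using the same df and df1 as above) is:
#normalize keeps the datetime dtype but sets the time part to 00:00:00
df = df[~df['Datum'].dt.normalize().isin(df1['Dates'])]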
Taking your added comment into consideration:
Build a regex string of the Dates in df1
import numpy as np
c = "|".join(df1.Dates.values)
Coerce Datum to datetime
df['Datum'] = pd.to_datetime(df['Datum'])
Extract Datum as a string Dates column
df.set_index(df['Datum'], inplace=True)
df['Dates'] = df.index.date.astype(str)
Boolean-select dates that appear in both
m = df.Dates.str.contains(c)
Mark closing dates as 0 and the rest as 1
df['drop'] = np.where(m, 0, 1)
Drop the unwanted rows and the helper columns
df = df[df['drop'].eq(1)].reset_index(drop=True).drop(columns=['Dates', 'drop'])

Group datetime column with hourly granularity by days

How can I group the following data frame (with an hourly granularity in the date column)
import pandas as pd
import numpy as np
np.random.seed(42)
date_rng = pd.date_range(start='1/1/2018', end='1/03/2018', freq='H')
df = pd.DataFrame(date_rng, columns=['date'])
df['data'] = np.random.randint(0,100,size=(len(date_rng)))
print(df.head())
date data
0 2018-01-01 00:00:00 51
1 2018-01-01 01:00:00 92
2 2018-01-01 02:00:00 14
3 2018-01-01 03:00:00 71
4 2018-01-01 04:00:00 60
by day, to calculate min and max values per day?
Use DataFrame.resample:
print(df.resample('d', on='date')['data'].agg(['min','max']))
min max
date
2018-01-01 1 99
2018-01-02 2 91
2018-01-03 72 72
You can also specify column names:
df1 = df.resample('d', on='date')['data'].agg([('min_data', 'min'),('max_data','max')])
print (df1)
min_data max_data
date
2018-01-01 1 99
2018-01-02 2 91
2018-01-03 72 72
Another solution with Grouper:
df1 = (df.groupby(pd.Grouper(freq='d', key='date'))['data']
         .agg([('min_data', 'min'), ('max_data', 'max')]))
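For reference, grouping by the floored dates gives the same daily aggregation here (a minimal sketch on the same sample frame):
df1 = df.groupby(df['date'].dt.floor('d'))['data'].agg(['min', 'max'])
print(df1)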

Resample pandas dataframe and interpolate missing values for timeseries data

I need to resample timeseries data and interpolate missing values in 15 min intervals over the course of an hour. Each ID should have four rows of data per hour.
In:
ID Time Value
1 1/1/2019 12:17 3
1 1/1/2019 12:44 2
2 1/1/2019 12:02 5
2 1/1/2019 12:28 7
Out:
ID Time Value
1 2019-01-01 12:00:00 3.0
1 2019-01-01 12:15:00 3.0
1 2019-01-01 12:30:00 2.0
1 2019-01-01 12:45:00 2.0
2 2019-01-01 12:00:00 5.0
2 2019-01-01 12:15:00 7.0
2 2019-01-01 12:30:00 7.0
2 2019-01-01 12:45:00 7.0
I wrote a function to do this, however efficiency goes down drastically when trying to process a larger dataset.
Is there a more efficient way to do this?
import datetime
import pandas as pd
data = pd.DataFrame({'ID': [1,1,2,2],
                     'Time': ['1/1/2019 12:17','1/1/2019 12:44','1/1/2019 12:02','1/1/2019 12:28'],
                     'Value': [3,2,5,7]})
def clean_dataset(data):
    ids = data.drop_duplicates(subset='ID')
    data['Time'] = pd.to_datetime(data['Time'])
    data['Time'] = data['Time'].apply(
        lambda dt: datetime.datetime(dt.year, dt.month, dt.day, dt.hour, 15*(dt.minute // 15)))
    data = data.drop_duplicates(subset=['Time','ID']).reset_index(drop=True)
    df = pd.DataFrame(columns=['Time','ID','Value'])
    for i in range(ids.shape[0]):
        times = pd.DataFrame(pd.date_range('1/1/2019 12:00','1/1/2019 13:00',freq='15min'), columns=['Time'])
        id_data = data[data['ID']==ids.iloc[i]['ID']]
        clean_data = times.join(id_data.set_index('Time'), on='Time')
        clean_data = clean_data.interpolate(method='linear', limit_direction='both')
        clean_data.drop(clean_data.tail(1).index, inplace=True)
        df = df.append(clean_data)
    return df
clean_dataset(data)
Linear interpolation does become slow with a large data set. Having a loop in your code is also responsible for a large part of the slowdown. Anything that can be removed from the loop and pre-computed will help increase efficiency. For example, if you pre-define the data frame that you use to initialize times, the code becomes 14% more efficient:
times_template = pd.DataFrame(pd.date_range('1/1/2019 12:00','1/1/2019 13:00',freq='15min'),columns=['Time'])
for i in range(ids.shape[0]):
    times = times_template.copy()
Profiling your code confirms that the interpolation takes the longest amount of time (22.7%), followed by the join (13.1%), the append (7.71%), and then the drop (7.67%) commands.
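To reproduce this kind of measurement yourself, the standard library profiler is enough (a sketch; sorting by cumulative time highlights the slowest calls):
import cProfile
# prints a per-function timing breakdown for one call of the function above
cProfile.run('clean_dataset(data)', sort='cumtime')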
You can use:
#round datetimes by 15 minutes
data['Time'] = pd.to_datetime(data['Time'])
minutes = pd.to_timedelta(15*(data['Time'].dt.minute // 15), unit='min')
data['Time'] = data['Time'].dt.floor('H') + minutes
#change date range for 4 values (to `12:45`)
rng = pd.date_range('1/1/2019 12:00','1/1/2019 12:45',freq='15min')
#create MultiIndex and reindex
mux = pd.MultiIndex.from_product([data['ID'].unique(), rng], names=['ID','Time'])
data = data.set_index(['ID','Time']).reindex(mux).reset_index()
#interpolate per group
data['Value'] = (data.groupby('ID')['Value']
                     .apply(lambda x: x.interpolate(method='linear', limit_direction='both')))
print (data)
ID Time Value
0 1 2019-01-01 12:00:00 3.0
1 1 2019-01-01 12:15:00 3.0
2 1 2019-01-01 12:30:00 2.0
3 1 2019-01-01 12:45:00 2.0
4 2 2019-01-01 12:00:00 5.0
5 2 2019-01-01 12:15:00 7.0
6 2 2019-01-01 12:30:00 7.0
7 2 2019-01-01 12:45:00 7.0
If the range cannot be changed (it must end at 13:00):
data['Time'] = pd.to_datetime(data['Time'])
minutes = pd.to_timedelta(15*(data['Time'].dt.minute // 15), unit='min')
data['Time'] = data['Time'].dt.floor('H') + minutes
#end in 13:00
rng = pd.date_range('1/1/2019 12:00','1/1/2019 13:00',freq='15min')
mux = pd.MultiIndex.from_product([data['ID'].unique(), rng], names=['ID','Time'])
data = data.set_index(['ID','Time']).reindex(mux).reset_index()
data['Value'] = (data.groupby('ID')['Value']
                     .apply(lambda x: x.interpolate(method='linear', limit_direction='both')))
#remove last row per groups
data = data[data['ID'].duplicated(keep='last')]
print (data)
ID Time Value
0 1 2019-01-01 12:00:00 3.0
1 1 2019-01-01 12:15:00 3.0
2 1 2019-01-01 12:30:00 2.0
3 1 2019-01-01 12:45:00 2.0
5 2 2019-01-01 12:00:00 5.0
6 2 2019-01-01 12:15:00 7.0
7 2 2019-01-01 12:30:00 7.0
8 2 2019-01-01 12:45:00 7.0
EDIT:
Another solution with merge and a left join instead of reindex:
from itertools import product
#round datetimes by 15 minutes
data['Time'] = pd.to_datetime(data['Time'])
minutes = pd.to_timedelta(15*(data['Time'].dt.minute // 15), unit='min')
data['Time'] = data['Time'].dt.floor('H') + minutes
#change date range for 4 values (to `12:45`)
rng = pd.date_range('1/1/2019 12:00','1/1/2019 12:45',freq='15min')
#create helper DataFrame and merge with left join
df = pd.DataFrame(list(product(data['ID'].unique(), rng)), columns=['ID','Time'])
print (df)
ID Time
0 1 2019-01-01 12:00:00
1 1 2019-01-01 12:15:00
2 1 2019-01-01 12:30:00
3 1 2019-01-01 12:45:00
4 2 2019-01-01 12:00:00
5 2 2019-01-01 12:15:00
6 2 2019-01-01 12:30:00
7 2 2019-01-01 12:45:00
data = df.merge(data, how='left')
#interpolate per group
data['Value'] = (data.groupby('ID')['Value']
                     .apply(lambda x: x.interpolate(method='linear', limit_direction='both')))
print (data)
ID Time Value
0 1 2019-01-01 12:00:00 3.0
1 1 2019-01-01 12:15:00 3.0
2 1 2019-01-01 12:30:00 2.0
3 1 2019-01-01 12:45:00 2.0
4 2 2019-01-01 12:00:00 5.0
5 2 2019-01-01 12:15:00 7.0
6 2 2019-01-01 12:30:00 7.0
7 2 2019-01-01 12:45:00 7.0
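To confirm that the reindex-based and merge-based variants produce the same frame, pandas' testing helper can be used (a sketch; data_reindex and data_merge are hypothetical names for the two results):
from pandas.testing import assert_frame_equal
# raises an AssertionError if the two results differ
assert_frame_equal(data_reindex, data_merge)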

Group pandas rows into pairs then find timedelta

I have a dataframe where I need to group the TX/RX column into pairs, and then put these into a new dataframe with a new index and the timedelta between them as values.
df = pd.DataFrame()
df['time1'] = pd.date_range('2018-01-01', periods=6, freq='H')
df['time2'] = pd.date_range('2018-01-01', periods=6, freq='1H1min')
df['id'] = ids
df['val'] = vals
time1 time2 id val
0 2018-01-01 00:00:00 2018-01-01 00:00:00 1 A
1 2018-01-01 01:00:00 2018-01-01 01:01:00 2 B
2 2018-01-01 02:00:00 2018-01-01 02:02:00 3 A
3 2018-01-01 03:00:00 2018-01-01 03:03:00 4 B
4 2018-01-01 04:00:00 2018-01-01 04:04:00 5 A
5 2018-01-01 05:00:00 2018-01-01 05:05:00 6 B
needs to be...
index timedelta A B
0 1 1 2
1 1 3 4
2 1 5 6
I think that pivot_tables or stack/unstack is probably the best way to go about this, but I'm not entirely sure how...
I believe you need:
df = pd.DataFrame()
df['time1'] = pd.date_range('2018-01-01', periods=6, freq='H')
df['time2'] = df['time1'] + pd.to_timedelta([60,60,120,120,180,180], 's')
df['id'] = range(1,7)
df['val'] = ['A','B'] * 3
df['t'] = df['time2'] - df['time1']
print (df)
time1 time2 id val t
0 2018-01-01 00:00:00 2018-01-01 00:01:00 1 A 00:01:00
1 2018-01-01 01:00:00 2018-01-01 01:01:00 2 B 00:01:00
2 2018-01-01 02:00:00 2018-01-01 02:02:00 3 A 00:02:00
3 2018-01-01 03:00:00 2018-01-01 03:02:00 4 B 00:02:00
4 2018-01-01 04:00:00 2018-01-01 04:03:00 5 A 00:03:00
5 2018-01-01 05:00:00 2018-01-01 05:03:00 6 B 00:03:00
#if necessary convert to seconds
#df['t'] = (df['time2'] - df['time1']).dt.total_seconds()
df = df.pivot(index='t', columns='val', values='id').reset_index().rename_axis(None, axis=1)
#if necessary aggregate values
#df = (df.pivot_table(index='t',columns='val',values='id', aggfunc='mean')
# .reset_index().rename_axis(None, axis=1))
print (df)
t A B
0 00:01:00 1 2
1 00:02:00 3 4
2 00:03:00 5 6
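If a numeric timedelta is preferred in the output (closer to the expected result in the question), the difference can be converted to minutes before pivoting; a self-contained sketch on the same sample data, using a hypothetical minutes column:
import pandas as pd
df = pd.DataFrame()
df['time1'] = pd.date_range('2018-01-01', periods=6, freq='H')
df['time2'] = df['time1'] + pd.to_timedelta([60, 60, 120, 120, 180, 180], unit='s')
df['id'] = range(1, 7)
df['val'] = ['A', 'B'] * 3
# express the gap as whole minutes instead of a Timedelta
df['minutes'] = (df['time2'] - df['time1']).dt.total_seconds() / 60
out = (df.pivot(index='minutes', columns='val', values='id')
         .reset_index()
         .rename_axis(None, axis=1))
print(out)
   minutes  A  B
0      1.0  1  2
1      2.0  3  4
2      3.0  5  6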
