How can I group the following data frame (with an hourly granularity in the date column)
import pandas as pd
import numpy as np
np.random.seed(42)
date_rng = pd.date_range(start='1/1/2018', end='1/03/2018', freq='H')
df = pd.DataFrame(date_rng, columns=['date'])
df['data'] = np.random.randint(0,100,size=(len(date_rng)))
print(df.head())
date data
0 2018-01-01 00:00:00 51
1 2018-01-01 01:00:00 92
2 2018-01-01 02:00:00 14
3 2018-01-01 03:00:00 71
4 2018-01-01 04:00:00 60
by day, to calculate min and max values per day?
Use DataFrame.resample:
print(df.resample('d', on='date')['data'].agg(['min','max']))
min max
date
2018-01-01 1 99
2018-01-02 2 91
2018-01-03 72 72
You can also specify column names:
df1 = df.resample('d', on='date')['data'].agg([('min_data', 'min'),('max_data','max')])
print (df1)
min_data max_data
date
2018-01-01 1 99
2018-01-02 2 91
2018-01-03 72 72
Another solution with Grouper:
df1 = (df.groupby(pd.Grouper(freq='d', key='date'))['data']
         .agg([('min_data', 'min'), ('max_data', 'max')]))
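On newer pandas versions (0.25+), the keyword form of named aggregation should give the same result; a minimal sketch reusing the df defined above:
df1 = df.resample('d', on='date')['data'].agg(min_data='min', max_data='max')
print (df1)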
I have 3 resampled pandas dataframes using the same data indexed by datetime.
Each dataframe is resampled using a different timeframe (e.g. 30 min / 60 min / 240 min).
Two of the dataframes have resampled correctly, with the datetimes aligned, because they have an equal number of rows (20), but the 3rd dataframe only has 12 rows because there isn't enough data to create 20 rows when resampling to 240 min.
How can I adjust the 240min dataframe so the datetimes are aligned with the other 2 dataframes?
For example, every 2nd row in the 30min dataframe equals the corresponding row in the 60min dataframe, and every 4th row in the 60min dataframe should equal the corresponding row in the 240min dataframe, but this is not the case because the 240min dataframe has resampled the datetimes differently due to there not being enough data to create 20 rows.
If you're just trying to align the different datasets to one index, you can use pd.concat.
import pandas as pd
periods = int(12.5 * 240)
index = pd.date_range(start='1/1/2018', periods=periods, freq="min")
data = pd.DataFrame(list(range(periods)), index=index)
df1 = data.resample('30min').asfreq()
df2 = data.resample('60min').asfreq()
df3 = data.resample('240min').asfreq()
df4 = pd.concat([df1, df2, df3], axis=1)
print(df4)
Output:
                        0       0       0
2018-01-01 00:00:00     0     0.0     0.0
2018-01-01 00:30:00    30     NaN     NaN
2018-01-01 01:00:00    60    60.0     NaN
2018-01-01 01:30:00    90     NaN     NaN
2018-01-01 02:00:00   120   120.0     NaN
...                   ...     ...     ...
2018-01-02 23:30:00  2850     NaN     NaN
2018-01-03 00:00:00  2880  2880.0  2880.0
2018-01-03 00:30:00  2910     NaN     NaN
2018-01-03 01:00:00  2940  2940.0     NaN
2018-01-03 01:30:00  2970     NaN     NaN
I have the following time series data of temperature readings:
DT Temperature
01/01/2019 0:00 41
01/01/2019 1:00 42
01/01/2019 2:00 44
......
01/01/2019 23:00 41
01/02/2019 0:00 44
I am trying to write a function that compares the hourly change in temperature for a given day. Any change greater than 3 should increment a quickChange counter. Something like this:
def countChange(day):
    for dt in day:
        if dt+1 - dt > 3: quickChange = quickChange + 1
I can call the function for a day, e.g. countChange(df.loc['2018-01-01'])
Use Series.diff, compare with 3, and count the True values with sum:
np.random.seed(2019)
rng = (pd.date_range('2018-01-01', periods=10, freq='H').tolist() +
pd.date_range('2018-01-02', periods=10, freq='H').tolist())
df = pd.DataFrame({'Temperature': np.random.randint(100, size=20)}, index=rng)
print (df)
Temperature
2018-01-01 00:00:00 72
2018-01-01 01:00:00 31
2018-01-01 02:00:00 37
2018-01-01 03:00:00 88
2018-01-01 04:00:00 62
2018-01-01 05:00:00 24
2018-01-01 06:00:00 29
2018-01-01 07:00:00 15
2018-01-01 08:00:00 12
2018-01-01 09:00:00 16
2018-01-02 00:00:00 48
2018-01-02 01:00:00 71
2018-01-02 02:00:00 83
2018-01-02 03:00:00 12
2018-01-02 04:00:00 80
2018-01-02 05:00:00 50
2018-01-02 06:00:00 95
2018-01-02 07:00:00 5
2018-01-02 08:00:00 24
2018-01-02 09:00:00 28
# if necessary, create a DatetimeIndex when DT is still a column
df = df.set_index("DT")

def countChange(day):
    return (day['Temperature'].diff() > 3).sum()
print (countChange(df.loc['2018-01-01']))
4
print (countChange(df.loc['2018-01-02']))
9
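To get the count for every day at once instead of one call per day, grouping by the normalized date should work (a sketch reusing the df built above):
per_day = (df.groupby(df.index.normalize())['Temperature']
             .apply(lambda s: (s.diff() > 3).sum()))
print (per_day)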
Try pandas.DataFrame.diff:
df = pd.DataFrame({'dt': ["01/01/2019 0:00", "01/01/2019 1:00", "01/01/2019 2:00", "01/01/2019 23:00", "01/02/2019 0:00"],
                   'Temperature': [41, 42, 44, 41, 44]})
df["dt"] = pd.to_datetime(df["dt"])  # parse so sorting and .loc date slicing work
df = df.sort_values("dt")
df = df.set_index("dt")

def countChange(df):
    df["diff"] = df["Temperature"].diff()
    return df.loc[df["diff"] > 3, "diff"].count()

quickchange = countChange(df.loc["2019-01-01"])
I have this pandas DataFrame df:
Station DateTime Record
A 2017-01-01 00:00:00 20
A 2017-01-01 01:00:00 22
A 2017-01-01 02:00:00 20
A 2017-01-01 03:00:00 18
B 2017-01-01 00:00:00 22
B 2017-01-01 01:00:00 24
I want to estimate the average Record per DateTime (basically per hour) across stations A and B. If either A or B have no record for some DateTime, then the Record value should be considered as 0 for this station.
It can be assumed that DateTime is available for all hours for at least one Station.
This is the expected result:
DateTime Avg_Record
2017-01-01 00:00:00 21
2017-01-01 01:00:00 23
2017-01-01 02:00:00 10
2017-01-01 03:00:00 9
Here is a solution:
g = df.groupby('DateTime')['Record']
df_out = g.mean()
m = g.count() == 1
df_out.loc[m] = df_out.loc[m] / 2
df_out = df_out.reset_index()
Or an uglier one-liner:
df = df.groupby('DateTime')['Record'].apply(
    lambda x: x.mean() if x.size == 2 else x.values[0] / 2
).reset_index()
Proof:
import pandas as pd
from io import StringIO
data = '''\
Station DateTime Record
A 2017-01-01T00:00:00 20
A 2017-01-01T01:00:00 22
A 2017-01-01T02:00:00 20
A 2017-01-01T03:00:00 18
B 2017-01-01T01:00:00 22
B 2017-01-01T02:00:00 24'''
fileobj = StringIO(data)
df = pd.read_csv(fileobj, sep=r'\s+', parse_dates=['DateTime'])
# Create a grouper and get the mean
g = df.groupby('DateTime')['Record']
df_out = g.mean()
# Divide by 2 where only 1 input exist
m = g.count() == 1
df_out.loc[m] = df_out.loc[m] / 2
# Reset index to get a dataframe format again
df_out = df_out.reset_index()
print(df_out)
Returns:
DateTime Record
0 2017-01-01 00:00:00 10.0
1 2017-01-01 01:00:00 22.0
2 2017-01-01 02:00:00 22.0
3 2017-01-01 03:00:00 9.0
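Since a missing station counts as 0 in the average, dividing the per-hour sum by the total number of stations gives the same result; a minimal sketch reusing the df from the proof above (assuming every station appears at least once in df):
n_stations = df['Station'].nunique()
df_out = (df.groupby('DateTime')['Record'].sum() / n_stations).reset_index(name='Avg_Record')
print(df_out)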
My data set contains data by day and hour:
time slot hr_slot location_point
2019-01-21 00:00:00 0 34
2019-01-21 01:00:00 1 564
2019-01-21 02:00:00 2 448
2019-01-21 03:00:00 3 46
.
.
.
.
2019-01-21 23:00:00 23 78
2019-01-22 00:00:00 0 34
2019-01-22 01:00:00 1 165
2019-01-22 02:00:00 2 65
2019-01-22 03:00:00 3 156
.
.
.
.
2019-01-22 23:00:00 23 78
The data set contains 7 days, i.e. 7*24 rows. How do I plot a graph for the dataset above, with:
hr_slot on the X axis: (0-23 hours)
location_point on the Y axis: (location_point)
and each day in a different color on the graph: (Day1: color1, Day2: color2, ...)
Consider pivoting your data first:
# Create normalized date column
df['date'] = df['time slot'].dt.date.astype(str)
# Pivot
piv = df.pivot(index='hr_slot', columns='date', values='location_point')
piv.plot()
Update
To filter which dates are plotted, use loc or iloc:
# Exclude first and last day
piv.iloc[:, 1:-1].plot()
# Include specific dates only
piv.loc[:, ['2019-01-21', '2019-01-22']].plot()
An alternative approach, using pandas.crosstab:
(pd.crosstab(df['hr_slot'],
df['time slot'].dt.date,
values=df['location_point'],
aggfunc='sum')
.plot())
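To get the requested axis labels, they can be set on the Axes object returned by plot (a sketch, assuming matplotlib is available as the pandas plotting backend):
import matplotlib.pyplot as plt
ax = piv.plot()
ax.set_xlabel('hr_slot')
ax.set_ylabel('location_point')
plt.show()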
I have two DataFrames that look like this
start_date end_date
1 2018-01-01 2018-01-31
2 2018-01-15 2018-02-28
3 2018-01-31 2018-03-15
4 2018-01-07 2018-04-30
value
2018-01-01 1
2018-01-02 4
2018-01-03 2
2018-01-04 10
2018-01-05 0
... ...
2018-12-28 1
2018-12-29 7
2018-12-30 9
2018-12-31 5
I'm trying to add a new column to the first DataFrame that contains the summed values of the second DataFrame, filtered by start_date and end_date. Something like
start_date end_date total_value
1 2018-01-01 2018-01-31 47 # Where 47 is the sum of values between 2018-01-01 and 2018-01-31, inclusive
2 2018-01-15 2018-02-28 82
3 2018-01-31 2018-03-15 116
4 2018-01-07 2018-04-30 253
I think I can do this with apply (basically just filter and sum the second DataFrame by start_date and end_date and return the sum), but I'm wondering if there's a neat pandas-esque solution instead.
NEW ANSWER
I'm using the OP's data, and it needs to be massaged slightly:
df2 = df2.asfreq('D').fillna(0, downcast='infer')
Then we do the cumsum thing with an added shift.
s = df2.value.cumsum()
starts = df1.start_date.map(s.shift().fillna(0, downcast='infer'))
ends = df1.end_date.map(s)
df1.assign(total_value=ends - starts)
start_date end_date total_value
1 2018-01-01 2018-01-31 17
2 2018-01-15 2018-02-28 0
3 2018-01-31 2018-03-15 0
4 2018-01-07 2018-04-30 0
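For a small number of ranges, a plain comprehension in the spirit of the OP's apply idea is also easy to verify against the cumsum version (a sketch, assuming df2 is indexed by date as in the question; .loc slicing is inclusive of both endpoints):
df1['total_value'] = [df2.loc[s:e, 'value'].sum()
                      for s, e in zip(df1.start_date, df1.end_date)]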
OLD ANSWER
Cool, but inaccurate: this is the sum of numbers after the start date. In order to include the start date, I have to use shift; see above.
You can use cumsum and take differences.
df1.assign(
    total_value=df1.applymap(df2.cumsum().value.get).eval('end_date - start_date'))
start_date end_date total_value
1 2018-01-01 2018-01-31 145
2 2018-01-15 2018-02-28 229
3 2018-01-31 2018-03-15 212
4 2018-01-07 2018-04-30 535
Setup
np.random.seed([3, 1415])
min_date = df1.values.min()
max_date = df1.values.max()
tidx = pd.date_range(min_date, max_date)
df2 = pd.DataFrame(dict(value=np.random.randint(10, size=len(tidx))), tidx)
Setup
df2.reset_index(inplace=True)
Create your conditions using a loop and zip (it's important that output matches the index of your df1):
conditions = [df2['index'].between(i, j) for i, j in zip(df1.start_date, df1.end_date)]
output = df1.index
Use np.select, then groupby:
tmp = df2.assign(flag=np.select(conditions, output, np.nan))
tmp = tmp.dropna().groupby('flag').value.sum()
Finally merge:
df1.merge(tmp.to_frame(), left_index=True, right_index=True)
Output:
start_date end_date value
1.0 2018-01-01 2018-01-31 17
Note that this is an O(m*n) method; create a new key for the merge:
df1['Newkey']=1
df2['Newkey']=1
df2.reset_index(inplace=True)
mergefilterdf = df1.merge(df2).loc[lambda x: (x['start_date'] <= x['index']) &
                                             (x['end_date'] >= x['index'])]
mergefilterdf.groupby(['start_date','end_date']).value.sum()
Out[331]:
start_date end_date
2018-01-01 2018-01-31 17
Name: value, dtype: int64
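To attach those sums back to df1 as a new column, a left merge on the date pair should work (a sketch continuing from the code above):
sums = mergefilterdf.groupby(['start_date', 'end_date'], as_index=False).value.sum()
df1 = (df1.merge(sums, on=['start_date', 'end_date'], how='left')
          .rename(columns={'value': 'total_value'}))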