I am trying to find the cleanest, most pandastic way to create a new column that holds the minimum value from one column, placed in the same row as the maximum value of another column. The rest of the values can be NaN, as I will be interpolating.
import datetime
import numpy as np
import pandas as pd

rng = pd.date_range(start=datetime.date(2020,8,1), end=datetime.date(2020,8,3), freq='H')
df = pd.DataFrame(rng, columns=['date'])
df.index=pd.to_datetime(df['date'])
df.drop(['date'],axis=1,inplace=True)
df['val0']=np.random.randint(0,50,49)
df['val1']=np.random.randint(0,50,49)
One realization of df (cut and paste for reproducibility):
val0 val1
date
2020-08-01 00:00:00 17 4
2020-08-01 01:00:00 89 0
2020-08-01 02:00:00 85 48
2020-08-01 03:00:00 83 13
2020-08-01 04:00:00 56 65
2020-08-01 05:00:00 48 31
2020-08-01 06:00:00 55 11
2020-08-01 07:00:00 15 87
2020-08-01 08:00:00 92 70
2020-08-01 09:00:00 95 57
2020-08-01 10:00:00 68 79
2020-08-01 11:00:00 87 7
2020-08-01 12:00:00 43 15
2020-08-01 13:00:00 23 4
2020-08-01 14:00:00 68 13
2020-08-01 15:00:00 68 63
2020-08-01 16:00:00 28 86
2020-08-01 17:00:00 12 40
2020-08-01 18:00:00 51 20
2020-08-01 19:00:00 20 48
2020-08-01 20:00:00 79 78
2020-08-01 21:00:00 67 89
2020-08-01 22:00:00 46 52
2020-08-01 23:00:00 7 47
2020-08-02 00:00:00 14 73
2020-08-02 01:00:00 70 30
2020-08-02 02:00:00 2 39
2020-08-02 03:00:00 65 81
2020-08-02 04:00:00 65 8
2020-08-02 05:00:00 83 60
2020-08-02 06:00:00 1 64
2020-08-02 07:00:00 13 63
2020-08-02 08:00:00 45 78
2020-08-02 09:00:00 83 7
2020-08-02 10:00:00 75 0
2020-08-02 11:00:00 52 3
2020-08-02 12:00:00 59 34
2020-08-02 13:00:00 54 57
2020-08-02 14:00:00 90 66
2020-08-02 15:00:00 82 56
2020-08-02 16:00:00 9 2
2020-08-02 17:00:00 5 51
2020-08-02 18:00:00 67 96
2020-08-02 19:00:00 18 77
2020-08-02 20:00:00 28 89
2020-08-02 21:00:00 96 53
2020-08-02 22:00:00 28 46
2020-08-02 23:00:00 41 87
2020-08-03 00:00:00 26 47
Now I find idxmin and idxmax:
minidx=df.groupby(pd.Grouper(freq='D')).idxmin()
maxidx=df.groupby(pd.Grouper(freq='D')).idxmax()
minidx:
val0 val1
date
2020-08-01 2020-08-01 23:00:00 2020-08-01 01:00:00
2020-08-02 2020-08-02 06:00:00 2020-08-02 10:00:00
2020-08-03 2020-08-03 00:00:00 2020-08-03 00:00:00
maxidx:
val0 val1
date
2020-08-01 2020-08-01 09:00:00 2020-08-01 21:00:00
2020-08-02 2020-08-02 21:00:00 2020-08-02 18:00:00
2020-08-03 2020-08-03 00:00:00 2020-08-03 00:00:00
In this case, I would like to put the minimum daily value of val0 (7), located at 2020-08-01 23:00:00, into a new column at 2020-08-01 21:00:00 (i.e. adjacent to 89, the daily max of val1), and do the same for all other dates, so the 'new' value at 2020-08-02 18:00:00 will be 1 (the daily minimum, which occurs at 2020-08-02 06:00:00).
I tried the following, but I just get a bunch of NaNs:
df.loc[maxidx['val1'].values,'new']=df.loc[minidx['val0'].values,'val0']
If I just set it to an int (df.loc[maxidx['val1'].values,'new']=6), I get the int in the places I want the new values. The values I want are given by df.loc[minidx['val0'].values,'val0'], but I can't seem to get them into the dataframe.
minidx['val0'].values and maxidx['val1'].values are arrays of the same size with elements of type numpy.datetime64, and they are all generated from the same dataframe, so all of these timestamps exist in df.index (df.index.values).
Is there an obvious reason this isn't working? Thanks
The simplest solution I have found is to loop through the idxmin and idxmax:
for v0, v1 in zip(minidx['val0'].values, maxidx['val1'].values):
    df.loc[v1, 'new'] = df.loc[v0, 'val0']
This gives me what I want, but doesn't seem very pandastic, so any other suggestions to accomplish the same thing would be great.
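As for why the original assignment yields NaNs: during .loc assignment pandas aligns the right-hand side on its index, and the labels at minidx['val0'] never match the target rows at maxidx['val1'], so every aligned value comes up missing. Stripping the index off the right-hand side sidesteps the alignment; a minimal sketch of the vectorized version:
df.loc[maxidx['val1'].values, 'new'] = df.loc[minidx['val0'].values, 'val0'].values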
IIUC, you can do this using NamedAgg:
df.groupby(pd.Grouper(freq='D')).agg(
    val0_min_time=('val0', 'idxmin'),
    val0_min_value=('val0', 'min'),
    val0_max_time=('val0', 'idxmax'),
    val0_max_value=('val0', 'max'),
    val1_min_time=('val1', 'idxmin'),
    val1_min_value=('val1', 'min'),
    val1_max_time=('val1', 'idxmax'),
    val1_max_value=('val1', 'max'),
)
Output:
val0_min_time val0_min_value val0_max_time val0_max_value val1_min_time val1_min_value val1_max_time val1_max_value
date
2020-08-01 2020-08-01 23:00:00 7 2020-08-01 09:00:00 95 2020-08-01 01:00:00 0 2020-08-01 21:00:00 89
2020-08-02 2020-08-02 06:00:00 1 2020-08-02 21:00:00 96 2020-08-02 10:00:00 0 2020-08-02 18:00:00 96
2020-08-03 2020-08-03 00:00:00 26 2020-08-03 00:00:00 26 2020-08-03 00:00:00 47 2020-08-03 00:00:00 47
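From that result it is a short hop to the 'new' column the question asks for; a minimal sketch (the aggregation names are mine), placing each day's val0 minimum at the timestamp of that day's val1 maximum:
agg = df.groupby(pd.Grouper(freq='D')).agg(val0_min=('val0', 'min'),
                                           val1_max_time=('val1', 'idxmax'))
df.loc[agg['val1_max_time'], 'new'] = agg['val0_min'].values
# every other row of 'new' stays NaN, ready for interpolation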
Related
Say I have the following:
>>> numpy.random.seed(42)
>>> df = pandas.DataFrame(numpy.random.randint(0, 100, 19), columns=['val'], index=pandas.date_range('2021-03-01', '2021-03-04', freq='4H'))
>>> df
val
2021-03-01 00:00:00 51
2021-03-01 04:00:00 92
2021-03-01 08:00:00 14
2021-03-01 12:00:00 71
2021-03-01 16:00:00 60
2021-03-01 20:00:00 20
2021-03-02 00:00:00 82
2021-03-02 04:00:00 86
2021-03-02 08:00:00 74
2021-03-02 12:00:00 74
2021-03-02 16:00:00 87
2021-03-02 20:00:00 99
2021-03-03 00:00:00 23
2021-03-03 04:00:00 2
2021-03-03 08:00:00 21
2021-03-03 12:00:00 52
2021-03-03 16:00:00 1
2021-03-03 20:00:00 87
2021-03-04 00:00:00 29
>>> df.groupby(pandas.Grouper(freq='1D')).quantile(0.95, interpolation='higher')
val
2021-03-01 92
2021-03-02 99
2021-03-03 87
2021-03-04 29
How can I also get the indices where quantiles are located within each group? I.e. my desired output is:
val idx
2021-03-01 92 2021-03-01 04:00:00
2021-03-02 99 2021-03-02 20:00:00
2021-03-03 87 2021-03-03 20:00:00
2021-03-04 29 2021-03-04 00:00:00
Instead of computing the quantile directly, calculate the rank within each group and flag which values are >= your quantile (this works because you use interpolation='higher'). Then sort the DataFrame, keep only the flagged rows, and take the first row within each group. Assigning the index to a column beforehand carries it through.
# Flag rows whose within-day percentile rank reaches 0.95
m = df.resample('D')['val'].rank(method='dense', pct=True).ge(0.95)
# Keep the original timestamps as a column, filter, and sort by value
df1 = df.assign(index=df.index)[m].sort_values('val')
# After the ascending sort, the first flagged row per day is the 'higher' 95th-percentile value
df1.groupby(df1.index.normalize()).first()
val index
2021-03-01 92 2021-03-01 04:00:00
2021-03-02 99 2021-03-02 20:00:00
2021-03-03 87 2021-03-03 20:00:00
2021-03-04 29 2021-03-04 00:00:00
One option is to use groupby().transform:
q95 = (df.groupby(pd.Grouper(freq='1D'))['val']
.transform('quantile', q=0.95, interpolation='higher')
)
df[df['val'] == q95]
Output:
val
2021-03-01 04:00:00 92
2021-03-02 20:00:00 99
2021-03-03 20:00:00 87
2021-03-04 00:00:00 29
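One caveat: if several rows in a day tie at the q95 value, the boolean filter keeps all of them. A sketch of a per-day dedupe, assuming the first occurrence is the one wanted:
df[df['val'] == q95].groupby(pd.Grouper(freq='1D')).head(1)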
I have a dataframe that looks like this:
zone Datetime Demand
48 2020-08-02 00:00:00 14292.550740
48 2020-08-02 01:00:00 14243.490740
48 2020-08-02 02:00:00 9130.840744
48 2020-08-02 03:00:00 10483.510740
48 2020-08-02 04:00:00 10014.970740
I want to resample (sum) the demand values according to another df's index, which looks like:
2020-08-02 03:00:00
2020-08-02 06:00:00
2020-08-02 07:00:00
2020-08-02 10:00:00
What is the best way to handle that?
I believe you need merge_asof:
print (df2)
a
2020-08-02 03:00:00 1
2020-08-02 06:00:00 2
2020-08-02 07:00:00 3
2020-08-02 10:00:00 4
df1['Datetime'] = pd.to_datetime(df1['Datetime'])
df2.index = pd.to_datetime(df2.index)
df = pd.merge_asof(df1,
df2.rename_axis('date2').reset_index(),
left_on='Datetime',
right_on='date2',
direction='forward'
)
print (df)
zone Datetime Demand date2 a
0 48 2020-08-02 00:00:00 14292.550740 2020-08-02 03:00:00 1
1 48 2020-08-02 01:00:00 14243.490740 2020-08-02 03:00:00 1
2 48 2020-08-02 02:00:00 9130.840744 2020-08-02 03:00:00 1
3 48 2020-08-02 03:00:00 10483.510740 2020-08-02 03:00:00 1
4 48 2020-08-02 04:00:00 10014.970740 2020-08-02 06:00:00 2
And then aggregate with sum, e.g. if you need to group by both columns:
df = df.groupby(['zone','date2'], as_index=False)['Demand'].sum()
print (df)
zone date2 Demand
0 48 2020-08-02 03:00:00 48150.392964
1 48 2020-08-02 06:00:00 10014.970740
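For reference, a NumPy-flavoured sketch of the same forward matching via searchsorted (variable names are mine); note it clips timestamps past the last boundary into the last bin, whereas merge_asof would leave them as NaT:
import numpy as np
edges = df2.index.sort_values()
pos = edges.searchsorted(df1['Datetime'], side='left')  # first boundary >= each timestamp
df1['date2'] = edges[np.minimum(pos, len(edges) - 1)]   # clip overshoot to the last boundary
df1.groupby(['zone', 'date2'], as_index=False)['Demand'].sum()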
I want to create missing records from a time series of % humidity.
datetime humidite
0 2019-07-09 08:30:00 87
1 2019-07-09 11:00:00 87
2 2019-07-09 17:30:00 82
3 2019-07-09 23:30:00 80
4 2019-07-11 06:15:00 79
5 2019-07-19 14:30:00 39
6 2019-07-21 00:00:00 80
I tried to index with the existing datetimes (the result at this step is OK):
humdt["datetime"] = pd.to_datetime(humdt["datetime"])
humdt = humdt.set_index("datetime")
humidite
datetime
2019-07-09 08:30:00 87
2019-07-09 11:00:00 87
2019-07-09 17:30:00 82
2019-07-09 23:30:00 80
2019-07-11 06:15:00 79
2019-07-19 14:30:00 39
Then reindex at 15-minute frequency (my target frequency):
humdt.resample("15min").asfreq()
humidite
datetime
2019-06-26 10:00:00 34.0
2019-06-26 10:15:00 33.0
2019-06-26 10:30:00 32.0
2019-06-26 10:45:00 31.0
2019-06-26 11:00:00 30.0
2019-06-26 11:15:00 29.0
As a result, I get the wrong starting time and values; only the frequency is respected. Can you help me please? I also tried merging a range of datetimes, defined as my expected records, with my data, and it doesn't work. Thank you!!!
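For reference, a minimal sketch of what this should produce on the data shown; one thing to hedge on is that resample returns a new frame rather than working in place, and the printed output above (starting 2019-06-26) looks like it came from a different dataframe than the one shown:
humdt = humdt.resample("15min").asfreq()  # assign the result back
humdt.head(4)
#                      humidite
# datetime
# 2019-07-09 08:30:00      87.0
# 2019-07-09 08:45:00       NaN
# 2019-07-09 09:00:00       NaN
# 2019-07-09 09:15:00       NaN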
Sorry for the badly phrased question; currently only the first hour of each day is updated with the holiday.
e.g.
2013-01-01 00:00:00 - New Years Day
2013-01-01 01:00:00 - None
2013-01-01 02:00:00 - None
I would like to apply the holiday to all rows that share its date using pandas (Python). What would be the most efficient method to do this? There are a number of other holidays to apply as well.
Thank you in advance!
Using a library called holidays together with pandas apply could be a great solution to your problem. Here is a short, self-contained example:
import pandas as pd
import holidays
us_holidays = holidays.UnitedStates()
# Create a sample DataFrame. You can just use your own
data = pd.DataFrame(pd.date_range('2020-01-01', '2020-01-30'), columns=['date'])
data['holiday'] = data['date'].apply(lambda x: us_holidays.get(x))
print(data)
Output
date holiday
0 2020-01-01 New Year's Day
1 2020-01-02 None
2 2020-01-03 None
3 2020-01-04 None
4 2020-01-05 None
5 2020-01-06 None
6 2020-01-07 None
7 2020-01-08 None
8 2020-01-09 None
9 2020-01-10 None
10 2020-01-11 None
11 2020-01-12 None
12 2020-01-13 None
13 2020-01-14 None
14 2020-01-15 None
15 2020-01-16 None
16 2020-01-17 None
17 2020-01-18 None
18 2020-01-19 None
19 2020-01-20 Martin Luther King, Jr. Day
20 2020-01-21 None
21 2020-01-22 None
22 2020-01-23 None
23 2020-01-24 None
24 2020-01-25 None
25 2020-01-26 None
26 2020-01-27 None
27 2020-01-28 None
28 2020-01-29 None
29 2020-01-30 None
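The same lookup also covers the hourly case in the question: the holidays package keys on the calendar date, so mapping it over an hourly index labels every hour of a holiday. A sketch reusing us_holidays from above:
hourly = pd.DataFrame(index=pd.date_range('2020-01-01', periods=48, freq='H'))
hourly['holiday'] = hourly.index.map(lambda ts: us_holidays.get(ts))
# all 24 hours of 2020-01-01 read "New Year's Day"; 2020-01-02 reads None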
IIUC, you have only the first hour of a day listed with a holiday. Here is a small sample of a dataframe with two months of data and three holidays on three separate days.
import pandas as pd
import numpy as np
df = pd.DataFrame({'temp':np.random.randint(50,110, 60*24)}, index=pd.date_range('2013-01-01', periods=(60*24), freq='H'))
df['Holiday'] = np.nan
df.loc['2013-01-01 00:00:00', 'Holiday'] = 'New Years Day'
df.loc['2013-02-02 00:00:00', 'Holiday'] = 'Groundhog Day'
df.loc['2013-02-14 00:00:00', 'Holiday'] = "Valentine's Day"
Now, let's use groupby with the calendar date from the DatetimeIndex and ffill (grouping by the full date rather than df.index.day keeps the fill from leaking between months that share a day number, e.g. Feb 1 inheriting New Year's Day):
df['Holiday'] = df.groupby(df.index.date)['Holiday'].ffill()
Let's look at a few records:
print(df.head(40))
print(df.loc['2013-02-02'])
print(df.loc['2013-02-13':'2013-02-15'])
Output:
temp Holiday
2013-01-01 00:00:00 51 New Years Day
2013-01-01 01:00:00 71 New Years Day
2013-01-01 02:00:00 61 New Years Day
2013-01-01 03:00:00 90 New Years Day
2013-01-01 04:00:00 77 New Years Day
2013-01-01 05:00:00 69 New Years Day
2013-01-01 06:00:00 50 New Years Day
2013-01-01 07:00:00 99 New Years Day
2013-01-01 08:00:00 86 New Years Day
2013-01-01 09:00:00 72 New Years Day
2013-01-01 10:00:00 89 New Years Day
2013-01-01 11:00:00 62 New Years Day
2013-01-01 12:00:00 53 New Years Day
2013-01-01 13:00:00 91 New Years Day
2013-01-01 14:00:00 51 New Years Day
2013-01-01 15:00:00 93 New Years Day
2013-01-01 16:00:00 97 New Years Day
2013-01-01 17:00:00 83 New Years Day
2013-01-01 18:00:00 87 New Years Day
2013-01-01 19:00:00 58 New Years Day
2013-01-01 20:00:00 84 New Years Day
2013-01-01 21:00:00 92 New Years Day
2013-01-01 22:00:00 106 New Years Day
2013-01-01 23:00:00 104 New Years Day
2013-01-02 00:00:00 78 NaN
2013-01-02 01:00:00 104 NaN
2013-01-02 02:00:00 96 NaN
2013-01-02 03:00:00 103 NaN
2013-01-02 04:00:00 60 NaN
2013-01-02 05:00:00 87 NaN
2013-01-02 06:00:00 108 NaN
2013-01-02 07:00:00 85 NaN
2013-01-02 08:00:00 67 NaN
2013-01-02 09:00:00 61 NaN
2013-01-02 10:00:00 91 NaN
2013-01-02 11:00:00 79 NaN
2013-01-02 12:00:00 99 NaN
2013-01-02 13:00:00 82 NaN
2013-01-02 14:00:00 75 NaN
2013-01-02 15:00:00 90 NaN
temp Holiday
2013-02-02 00:00:00 82 Groundhog Day
2013-02-02 01:00:00 58 Groundhog Day
2013-02-02 02:00:00 102 Groundhog Day
2013-02-02 03:00:00 90 Groundhog Day
2013-02-02 04:00:00 79 Groundhog Day
2013-02-02 05:00:00 50 Groundhog Day
2013-02-02 06:00:00 50 Groundhog Day
2013-02-02 07:00:00 83 Groundhog Day
2013-02-02 08:00:00 80 Groundhog Day
2013-02-02 09:00:00 50 Groundhog Day
2013-02-02 10:00:00 52 Groundhog Day
2013-02-02 11:00:00 69 Groundhog Day
2013-02-02 12:00:00 100 Groundhog Day
2013-02-02 13:00:00 61 Groundhog Day
2013-02-02 14:00:00 62 Groundhog Day
2013-02-02 15:00:00 76 Groundhog Day
2013-02-02 16:00:00 83 Groundhog Day
2013-02-02 17:00:00 109 Groundhog Day
2013-02-02 18:00:00 109 Groundhog Day
2013-02-02 19:00:00 81 Groundhog Day
2013-02-02 20:00:00 52 Groundhog Day
2013-02-02 21:00:00 108 Groundhog Day
2013-02-02 22:00:00 68 Groundhog Day
2013-02-02 23:00:00 75 Groundhog Day
temp Holiday
2013-02-13 00:00:00 93 NaN
2013-02-13 01:00:00 93 NaN
2013-02-13 02:00:00 74 NaN
2013-02-13 03:00:00 97 NaN
2013-02-13 04:00:00 58 NaN
2013-02-13 05:00:00 103 NaN
2013-02-13 06:00:00 79 NaN
2013-02-13 07:00:00 65 NaN
2013-02-13 08:00:00 72 NaN
2013-02-13 09:00:00 100 NaN
2013-02-13 10:00:00 66 NaN
2013-02-13 11:00:00 60 NaN
2013-02-13 12:00:00 95 NaN
2013-02-13 13:00:00 51 NaN
2013-02-13 14:00:00 71 NaN
2013-02-13 15:00:00 58 NaN
2013-02-13 16:00:00 58 NaN
2013-02-13 17:00:00 98 NaN
2013-02-13 18:00:00 61 NaN
2013-02-13 19:00:00 63 NaN
2013-02-13 20:00:00 57 NaN
2013-02-13 21:00:00 102 NaN
2013-02-13 22:00:00 69 NaN
2013-02-13 23:00:00 86 NaN
2013-02-14 00:00:00 94 Valentine's Day
2013-02-14 01:00:00 64 Valentine's Day
2013-02-14 02:00:00 62 Valentine's Day
2013-02-14 03:00:00 59 Valentine's Day
2013-02-14 04:00:00 93 Valentine's Day
2013-02-14 05:00:00 99 Valentine's Day
2013-02-14 06:00:00 64 Valentine's Day
2013-02-14 07:00:00 80 Valentine's Day
2013-02-14 08:00:00 89 Valentine's Day
2013-02-14 09:00:00 96 Valentine's Day
2013-02-14 10:00:00 60 Valentine's Day
2013-02-14 11:00:00 76 Valentine's Day
2013-02-14 12:00:00 82 Valentine's Day
2013-02-14 13:00:00 65 Valentine's Day
2013-02-14 14:00:00 90 Valentine's Day
2013-02-14 15:00:00 62 Valentine's Day
2013-02-14 16:00:00 64 Valentine's Day
2013-02-14 17:00:00 98 Valentine's Day
2013-02-14 18:00:00 52 Valentine's Day
2013-02-14 19:00:00 72 Valentine's Day
2013-02-14 20:00:00 108 Valentine's Day
2013-02-14 21:00:00 85 Valentine's Day
2013-02-14 22:00:00 87 Valentine's Day
2013-02-14 23:00:00 62 Valentine's Day
2013-02-15 00:00:00 106 NaN
2013-02-15 01:00:00 82 NaN
2013-02-15 02:00:00 77 NaN
2013-02-15 03:00:00 52 NaN
2013-02-15 04:00:00 94 NaN
2013-02-15 05:00:00 71 NaN
2013-02-15 06:00:00 95 NaN
2013-02-15 07:00:00 96 NaN
2013-02-15 08:00:00 71 NaN
2013-02-15 09:00:00 69 NaN
2013-02-15 10:00:00 85 NaN
2013-02-15 11:00:00 92 NaN
2013-02-15 12:00:00 106 NaN
2013-02-15 13:00:00 77 NaN
2013-02-15 14:00:00 65 NaN
2013-02-15 15:00:00 104 NaN
2013-02-15 16:00:00 98 NaN
2013-02-15 17:00:00 107 NaN
2013-02-15 18:00:00 106 NaN
2013-02-15 19:00:00 67 NaN
2013-02-15 20:00:00 59 NaN
2013-02-15 21:00:00 81 NaN
2013-02-15 22:00:00 56 NaN
2013-02-15 23:00:00 75 NaN
Note: In this dataframe your datetime column is in the index.
You can try using the apply method: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html
The input to this is the function you want applied to each row, and in this case "axis" should be 1 so that the function is applied to each row rather than to each column.
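For concreteness, a sketch of the row-wise lookup this hints at, assuming the us_holidays mapping from the earlier answer and the datetime in the index:
df['Holiday'] = df.apply(lambda row: us_holidays.get(row.name), axis=1)  # row.name is the index label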
I have a dataframe with a datetime index; df.head(6) gives:
NUMBERES PRICE
DEAL_TIME
2015-03-02 12:40:03 5 25
2015-03-04 14:52:57 7 23
2015-03-03 08:10:09 10 43
2015-03-02 20:18:24 5 37
2015-03-05 07:50:55 4 61
2015-03-02 09:08:17 1 17
The dataframe includes the data of one week. Now I need to count records within each time period of the day. If the time period were 1 hour, I know the following method would work:
df_grouped = df.groupby(df.index.hour).count()
But I don't know what to do when the time period is half an hour. How can I achieve this?
UPDATE:
I was told that this question is similar to How to group DataFrame by a period of time?
But I have tried the methods mentioned there. Maybe it's my fault for not stating it clearly: 'DEAL_TIME' ranges from '2015-03-02 00:00:00' to '2015-03-08 23:59:59'. If I use pd.TimeGrouper(freq='30Min') or resample(), the resulting time periods range from '2015-03-02 00:30' to '2015-03-08 23:30'. But what I want is a series like below:
COUNT
DEAL_TIME
00:00:00 53
00:30:00 49
01:00:00 31
01:30:00 22
02:00:00 1
02:30:00 24
03:00:00 27
03:30:00 41
04:00:00 41
04:30:00 76
05:00:00 33
05:30:00 16
06:00:00 15
06:30:00 4
07:00:00 60
07:30:00 85
08:00:00 3
08:30:00 37
09:00:00 18
09:30:00 29
10:00:00 31
10:30:00 67
11:00:00 35
11:30:00 60
12:00:00 95
12:30:00 37
13:00:00 30
13:30:00 62
14:00:00 58
14:30:00 44
15:00:00 45
15:30:00 35
16:00:00 94
16:30:00 56
17:00:00 64
17:30:00 43
18:00:00 60
18:30:00 52
19:00:00 14
19:30:00 9
20:00:00 31
20:30:00 71
21:00:00 21
21:30:00 32
22:00:00 61
22:30:00 35
23:00:00 14
23:30:00 21
In other words, the time period should be independent of the date.
You need a 30-minute time grouper for this (pd.TimeGrouper in older pandas; it was removed in pandas 1.0, so use pd.Grouper now):
grouper = pd.Grouper(freq="30T")
You also need to remove the 'date' part from the index:
df.index = df.index - df.index.normalize()  # keep only the time of day, as a timedelta
Now, you can group by time alone:
df.groupby(grouper).count()
You can find the somewhat obscure Grouper documentation here: pandas resample documentation (it's actually the resample documentation, but both features follow the same rules).
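In current pandas the whole thing can also be done without mutating the index at all; a sketch:
tod = (df.index - df.index.normalize()).floor('30T')  # time of day, binned to 30 minutes
df.groupby(tod).count()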
In pandas, the most common way to group by time is to use the .resample() function. Since v0.18.0 this function is two-stage: df.resample('M') creates an object to which we can apply other functions (mean, count, sum, etc.). The code snippet looks like:
df.resample('M').count()
You can refer here for examples.
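A minimal sketch of the two-stage pattern on made-up data (note that plain resample keeps the date part, which is why it alone does not collapse all days onto one time-of-day axis, as the question requires):
s = pd.Series(1, index=pd.date_range('2015-03-02', periods=96, freq='30min'))
s.resample('D').count()  # stage one builds the resampler; count() is the second stage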