Pandas (Python): How to apply values to similar rows?

Sorry for the badly phrased question. Currently only the first hour of each day is tagged with the holiday, e.g.
2013-01-01 00:00:00 - New Years Day
2013-01-01 01:00:00 - None
2013-01-01 02:00:00 - None
I would like to apply the same holiday to every row that falls on that date using Pandas (Python).
What would be the most efficient way to apply the holiday to all rows of the same date? There are a number of other holidays to apply as well.
Thank you in advance!
Screenshot of CSV in question

Using the holidays library together with pandas apply could be a great solution to your problem. Here is a short, self-contained example:
import pandas as pd
import holidays
us_holidays = holidays.UnitedStates()
# Create a sample DataFrame. You can just use your own
data = pd.DataFrame(pd.date_range('2020-01-01', '2020-01-30'), columns=['date'])
data['holiday'] = data['date'].apply(lambda x: us_holidays.get(x))
print(data)
Output
date holiday
0 2020-01-01 New Year's Day
1 2020-01-02 None
2 2020-01-03 None
3 2020-01-04 None
4 2020-01-05 None
5 2020-01-06 None
6 2020-01-07 None
7 2020-01-08 None
8 2020-01-09 None
9 2020-01-10 None
10 2020-01-11 None
11 2020-01-12 None
12 2020-01-13 None
13 2020-01-14 None
14 2020-01-15 None
15 2020-01-16 None
16 2020-01-17 None
17 2020-01-18 None
18 2020-01-19 None
19 2020-01-20 Martin Luther King, Jr. Day
20 2020-01-21 None
21 2020-01-22 None
22 2020-01-23 None
23 2020-01-24 None
24 2020-01-25 None
25 2020-01-26 None
26 2020-01-27 None
27 2020-01-28 None
28 2020-01-29 None
29 2020-01-30 None
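Since the lookup keys on the calendar date, the same approach should also work directly on hourly data like yours, so every row of a given day gets the holiday name. A minimal sketch (the column name date is assumed, as above):
import pandas as pd
import holidays

us_holidays = holidays.UnitedStates()

# Hourly timestamps: the holidays lookup normalizes each datetime to its
# calendar date, so all 24 rows of 2013-01-01 receive "New Year's Day".
hourly = pd.DataFrame(pd.date_range('2013-01-01', periods=48, freq='H'),
                      columns=['date'])
hourly['holiday'] = hourly['date'].apply(us_holidays.get)
print(hourly.head(3))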

IIUC, you have only the first hour of a day listed with a holiday. Here is a small sample of a dataframe with two months of data and three holidays on three separate days.
import pandas as pd
import numpy as np
df = pd.DataFrame({'temp':np.random.randint(50,110, 60*24)}, index=pd.date_range('2013-01-01', periods=(60*24), freq='H'))
df['Holiday'] = np.nan
df.loc['2013-01-01 00:00:00', 'Holiday'] = 'New Years Day'
df.loc['2013-02-02 00:00:00', 'Holiday'] = 'Groundhog Day'
df.loc['2013-02-14 00:00:00', 'Holiday'] = "Valentine's Day"
Now, let's group by the calendar date from the DatetimeIndex and forward-fill within each day (grouping by df.index.day alone would lump together equal day numbers from different months):
df['Holiday'] = df.groupby(df.index.date)['Holiday'].ffill()
Let's look at a few records:
print(df.head(40))
print(df.loc['2013-02-02'])
print(df.loc['2013-02-13':'2013-02-15'])
Output:
temp Holiday
2013-01-01 00:00:00 51 New Years Day
2013-01-01 01:00:00 71 New Years Day
2013-01-01 02:00:00 61 New Years Day
2013-01-01 03:00:00 90 New Years Day
2013-01-01 04:00:00 77 New Years Day
2013-01-01 05:00:00 69 New Years Day
2013-01-01 06:00:00 50 New Years Day
2013-01-01 07:00:00 99 New Years Day
2013-01-01 08:00:00 86 New Years Day
2013-01-01 09:00:00 72 New Years Day
2013-01-01 10:00:00 89 New Years Day
2013-01-01 11:00:00 62 New Years Day
2013-01-01 12:00:00 53 New Years Day
2013-01-01 13:00:00 91 New Years Day
2013-01-01 14:00:00 51 New Years Day
2013-01-01 15:00:00 93 New Years Day
2013-01-01 16:00:00 97 New Years Day
2013-01-01 17:00:00 83 New Years Day
2013-01-01 18:00:00 87 New Years Day
2013-01-01 19:00:00 58 New Years Day
2013-01-01 20:00:00 84 New Years Day
2013-01-01 21:00:00 92 New Years Day
2013-01-01 22:00:00 106 New Years Day
2013-01-01 23:00:00 104 New Years Day
2013-01-02 00:00:00 78 NaN
2013-01-02 01:00:00 104 NaN
2013-01-02 02:00:00 96 NaN
2013-01-02 03:00:00 103 NaN
2013-01-02 04:00:00 60 NaN
2013-01-02 05:00:00 87 NaN
2013-01-02 06:00:00 108 NaN
2013-01-02 07:00:00 85 NaN
2013-01-02 08:00:00 67 NaN
2013-01-02 09:00:00 61 NaN
2013-01-02 10:00:00 91 NaN
2013-01-02 11:00:00 79 NaN
2013-01-02 12:00:00 99 NaN
2013-01-02 13:00:00 82 NaN
2013-01-02 14:00:00 75 NaN
2013-01-02 15:00:00 90 NaN
temp Holiday
2013-02-02 00:00:00 82 Groundhog Day
2013-02-02 01:00:00 58 Groundhog Day
2013-02-02 02:00:00 102 Groundhog Day
2013-02-02 03:00:00 90 Groundhog Day
2013-02-02 04:00:00 79 Groundhog Day
2013-02-02 05:00:00 50 Groundhog Day
2013-02-02 06:00:00 50 Groundhog Day
2013-02-02 07:00:00 83 Groundhog Day
2013-02-02 08:00:00 80 Groundhog Day
2013-02-02 09:00:00 50 Groundhog Day
2013-02-02 10:00:00 52 Groundhog Day
2013-02-02 11:00:00 69 Groundhog Day
2013-02-02 12:00:00 100 Groundhog Day
2013-02-02 13:00:00 61 Groundhog Day
2013-02-02 14:00:00 62 Groundhog Day
2013-02-02 15:00:00 76 Groundhog Day
2013-02-02 16:00:00 83 Groundhog Day
2013-02-02 17:00:00 109 Groundhog Day
2013-02-02 18:00:00 109 Groundhog Day
2013-02-02 19:00:00 81 Groundhog Day
2013-02-02 20:00:00 52 Groundhog Day
2013-02-02 21:00:00 108 Groundhog Day
2013-02-02 22:00:00 68 Groundhog Day
2013-02-02 23:00:00 75 Groundhog Day
temp Holiday
2013-02-13 00:00:00 93 NaN
2013-02-13 01:00:00 93 NaN
2013-02-13 02:00:00 74 NaN
2013-02-13 03:00:00 97 NaN
2013-02-13 04:00:00 58 NaN
2013-02-13 05:00:00 103 NaN
2013-02-13 06:00:00 79 NaN
2013-02-13 07:00:00 65 NaN
2013-02-13 08:00:00 72 NaN
2013-02-13 09:00:00 100 NaN
2013-02-13 10:00:00 66 NaN
2013-02-13 11:00:00 60 NaN
2013-02-13 12:00:00 95 NaN
2013-02-13 13:00:00 51 NaN
2013-02-13 14:00:00 71 NaN
2013-02-13 15:00:00 58 NaN
2013-02-13 16:00:00 58 NaN
2013-02-13 17:00:00 98 NaN
2013-02-13 18:00:00 61 NaN
2013-02-13 19:00:00 63 NaN
2013-02-13 20:00:00 57 NaN
2013-02-13 21:00:00 102 NaN
2013-02-13 22:00:00 69 NaN
2013-02-13 23:00:00 86 NaN
2013-02-14 00:00:00 94 Valentine's Day
2013-02-14 01:00:00 64 Valentine's Day
2013-02-14 02:00:00 62 Valentine's Day
2013-02-14 03:00:00 59 Valentine's Day
2013-02-14 04:00:00 93 Valentine's Day
2013-02-14 05:00:00 99 Valentine's Day
2013-02-14 06:00:00 64 Valentine's Day
2013-02-14 07:00:00 80 Valentine's Day
2013-02-14 08:00:00 89 Valentine's Day
2013-02-14 09:00:00 96 Valentine's Day
2013-02-14 10:00:00 60 Valentine's Day
2013-02-14 11:00:00 76 Valentine's Day
2013-02-14 12:00:00 82 Valentine's Day
2013-02-14 13:00:00 65 Valentine's Day
2013-02-14 14:00:00 90 Valentine's Day
2013-02-14 15:00:00 62 Valentine's Day
2013-02-14 16:00:00 64 Valentine's Day
2013-02-14 17:00:00 98 Valentine's Day
2013-02-14 18:00:00 52 Valentine's Day
2013-02-14 19:00:00 72 Valentine's Day
2013-02-14 20:00:00 108 Valentine's Day
2013-02-14 21:00:00 85 Valentine's Day
2013-02-14 22:00:00 87 Valentine's Day
2013-02-14 23:00:00 62 Valentine's Day
2013-02-15 00:00:00 106 NaN
2013-02-15 01:00:00 82 NaN
2013-02-15 02:00:00 77 NaN
2013-02-15 03:00:00 52 NaN
2013-02-15 04:00:00 94 NaN
2013-02-15 05:00:00 71 NaN
2013-02-15 06:00:00 95 NaN
2013-02-15 07:00:00 96 NaN
2013-02-15 08:00:00 71 NaN
2013-02-15 09:00:00 69 NaN
2013-02-15 10:00:00 85 NaN
2013-02-15 11:00:00 92 NaN
2013-02-15 12:00:00 106 NaN
2013-02-15 13:00:00 77 NaN
2013-02-15 14:00:00 65 NaN
2013-02-15 15:00:00 104 NaN
2013-02-15 16:00:00 98 NaN
2013-02-15 17:00:00 107 NaN
2013-02-15 18:00:00 106 NaN
2013-02-15 19:00:00 67 NaN
2013-02-15 20:00:00 59 NaN
2013-02-15 21:00:00 81 NaN
2013-02-15 22:00:00 56 NaN
2013-02-15 23:00:00 75 NaN
Note: In this dataframe your datetime column is in the index.
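If the datetimes live in an ordinary column rather than the index, the same grouped forward-fill should work on the column's calendar date; a short sketch (datetime is an assumed column name for illustration):
# 'datetime' is an assumed column name; group on its calendar date
df['Holiday'] = df.groupby(df['datetime'].dt.date)['Holiday'].ffill()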

You can try using the apply method: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html
The input to this is the function you want applied to each row, and in this case axis should be 1 so that the function is applied to each row rather than to each column.
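For illustration, a minimal row-wise sketch of that idea (hypothetical column names; the lookup table is built from the rows that already carry a holiday):
import pandas as pd

df = pd.DataFrame({'date': pd.to_datetime(['2013-01-01 00:00:00',
                                           '2013-01-01 01:00:00',
                                           '2013-01-01 02:00:00']),
                   'holiday': ['New Years Day', None, None]})

# Build a calendar-date -> holiday-name lookup from the labelled rows,
# then apply it row-wise (axis=1) to fill every hour of the same day.
named = df[df['holiday'].notna()]
lookup = dict(zip(named['date'].dt.date, named['holiday']))
df['holiday'] = df.apply(lambda row: lookup.get(row['date'].date()), axis=1)
print(df)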

Related

Using idxmax and idxmin to change values in different rows

I am trying to find the cleanest, most pandastic way to create a new column that has the minimum values from one column in the same row as the maximum values in another column. The rest of the values can be NaN, as I will be interpolating.
import datetime
import numpy as np
import pandas as pd

rng = pd.date_range(start=datetime.date(2020,8,1), end=datetime.date(2020,8,3), freq='H')
df = pd.DataFrame(rng, columns=['date'])
df.index=pd.to_datetime(df['date'])
df.drop(['date'],axis=1,inplace=True)
df['val0']=np.random.randint(0,50,49)
df['val1']=np.random.randint(0,50,49)
One realization of df (cut and paste for reproducibility):
val0 val1
date
2020-08-01 00:00:00 17 4
2020-08-01 01:00:00 89 0
2020-08-01 02:00:00 85 48
2020-08-01 03:00:00 83 13
2020-08-01 04:00:00 56 65
2020-08-01 05:00:00 48 31
2020-08-01 06:00:00 55 11
2020-08-01 07:00:00 15 87
2020-08-01 08:00:00 92 70
2020-08-01 09:00:00 95 57
2020-08-01 10:00:00 68 79
2020-08-01 11:00:00 87 7
2020-08-01 12:00:00 43 15
2020-08-01 13:00:00 23 4
2020-08-01 14:00:00 68 13
2020-08-01 15:00:00 68 63
2020-08-01 16:00:00 28 86
2020-08-01 17:00:00 12 40
2020-08-01 18:00:00 51 20
2020-08-01 19:00:00 20 48
2020-08-01 20:00:00 79 78
2020-08-01 21:00:00 67 89
2020-08-01 22:00:00 46 52
2020-08-01 23:00:00 7 47
2020-08-02 00:00:00 14 73
2020-08-02 01:00:00 70 30
2020-08-02 02:00:00 2 39
2020-08-02 03:00:00 65 81
2020-08-02 04:00:00 65 8
2020-08-02 05:00:00 83 60
2020-08-02 06:00:00 1 64
2020-08-02 07:00:00 13 63
2020-08-02 08:00:00 45 78
2020-08-02 09:00:00 83 7
2020-08-02 10:00:00 75 0
2020-08-02 11:00:00 52 3
2020-08-02 12:00:00 59 34
2020-08-02 13:00:00 54 57
2020-08-02 14:00:00 90 66
2020-08-02 15:00:00 82 56
2020-08-02 16:00:00 9 2
2020-08-02 17:00:00 5 51
2020-08-02 18:00:00 67 96
2020-08-02 19:00:00 18 77
2020-08-02 20:00:00 28 89
2020-08-02 21:00:00 96 53
2020-08-02 22:00:00 28 46
2020-08-02 23:00:00 41 87
2020-08-03 00:00:00 26 47
Now I find idxmin and idxmax:
minidx=df.groupby(pd.Grouper(freq='D')).idxmin()
maxidx=df.groupby(pd.Grouper(freq='D')).idxmax()
minidx:
val0 val1
date
2020-08-01 2020-08-01 23:00:00 2020-08-01 01:00:00
2020-08-02 2020-08-02 06:00:00 2020-08-02 10:00:00
2020-08-03 2020-08-03 00:00:00 2020-08-03 00:00:00
maxidx:
val0 val1
date
2020-08-01 2020-08-01 09:00:00 2020-08-01 21:00:00
2020-08-02 2020-08-02 21:00:00 2020-08-02 18:00:00
2020-08-03 2020-08-03 00:00:00 2020-08-03 00:00:00
In this case, I would like to put the minimum daily value (7) located at 2020-08-01 23:00:00 into a new column at 2020-08-01 21:00:00 (i.e. adjacent to 89, the daily max of val1), and do the same for all other dates so the 'new' value on 2020-08-02 18:00:00 will be 1 (i.e. the minimum daily value occurring on 2020-08-02 06:00:00).
I tried the following, but I just get a bunch of nans:
df.loc[maxidx['val1'].values,'new']=df.loc[minidx['val0'].values,'val0']
If I just set it to an int (df.loc[maxidx['val1'].values,'new']=6), I get the int in the places I want the new values. The values I want are give by df.loc[minidx['val0'].values,'val0'], but I can't seem to get them into the dataframe.
minidx['val0'].values and maxidx['val1'].values are arrays of the same size with elements of type numpy.datetime64, and they are all generated from the same dataframe so maxidx and minidx should exist in df.index (df.index.values).
Is there an obvious reason this isn't working? Thanks
The simplest solution I have found is to loop through the idxmin and idxmax:
for v0, v1 in zip(minidx['val0'].values, maxidx['val1'].values):
    df.loc[v1, 'new'] = df.loc[v0, 'val0']
This gives me what I want, but doesn't seem very pandastic, so any other suggestions to accomplish the same thing would be great.
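A likely reason the original assignment produced NaNs is index alignment: the right-hand side of df.loc[...] = series is aligned on the series' own index labels (the minidx timestamps), none of which match the target maxidx rows. Stripping the index should sidestep that; a sketch:
# The RHS Series is aligned on its own labels before assignment;
# converting it to a plain array makes the assignment positional.
df.loc[maxidx['val1'].values, 'new'] = df.loc[minidx['val0'].values, 'val0'].to_numpy()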
IIUC, you can do this using NamedAgg:
df.groupby(pd.Grouper(freq='D')).agg(val0_min_time=('val0', 'idxmin'),
                                     val0_min_value=('val0', 'min'),
                                     val0_max_time=('val0', 'idxmax'),
                                     val0_max_value=('val0', 'max'),
                                     val1_min_time=('val1', 'idxmin'),
                                     val1_min_value=('val1', 'min'),
                                     val1_max_time=('val1', 'idxmax'),
                                     val1_max_value=('val1', 'max'))
Output:
val0_min_time val0_min_value val0_max_time val0_max_value val1_min_time val1_min_value val1_max_time val1_max_value
date
2020-08-01 2020-08-01 23:00:00 7 2020-08-01 09:00:00 95 2020-08-01 01:00:00 0 2020-08-01 21:00:00 89
2020-08-02 2020-08-02 06:00:00 1 2020-08-02 21:00:00 96 2020-08-02 10:00:00 0 2020-08-02 18:00:00 96
2020-08-03 2020-08-03 00:00:00 26 2020-08-03 00:00:00 26 2020-08-03 00:00:00 47 2020-08-03 00:00:00 47
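To finish the asker's goal from that aggregate, one could scatter each day's val0 minimum to the hour of that day's val1 maximum; a short sketch under the same assumptions:
agg = df.groupby(pd.Grouper(freq='D')).agg(val0_min_value=('val0', 'min'),
                                           val1_max_time=('val1', 'idxmax'))
# Place each day's minimum of val0 at the timestamp of that day's val1 maximum.
df.loc[agg['val1_max_time'], 'new'] = agg['val0_min_value'].to_numpy()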

Getting wrong datetime after resampling

I want to create the missing records in a time series of % humidity.
datetime humidite
0 2019-07-09 08:30:00 87
1 2019-07-09 11:00:00 87
2 2019-07-09 17:30:00 82
3 2019-07-09 23:30:00 80
4 2019-07-11 06:15:00 79
5 2019-07-19 14:30:00 39
6 2019-07-21 00:00:00 80
I tried to index with the existing datetimes (the result at this step is OK):
humdt["datetime"] = pd.to_datetime(humdt["datetime"])
humdt = humdt.set_index("datetime")
humidite
datetime
2019-07-09 08:30:00 87
2019-07-09 11:00:00 87
2019-07-09 17:30:00 82
2019-07-09 23:30:00 80
2019-07-11 06:15:00 79
2019-07-19 14:30:00 39
Then I reindex at a 15-minute frequency (my target frequency):
humdt.resample("15min").asfreq()
humidite
datetime
2019-06-26 10:00:00 34.0
2019-06-26 10:15:00 33.0
2019-06-26 10:30:00 32.0
2019-06-26 10:45:00 31.0
2019-06-26 11:00:00 30.0
2019-06-26 11:15:00 29.0
As a result, I get the wrong starting time and values; only the frequency is respected.
Can you help me, please? I also tried merging a range of datetimes, defined as my expected records, with my data, and that didn't work either. Thank you!
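For reference, a self-contained sketch of the intended workflow on the sample shown above. On this data the resampled index does start at the first timestamp (2019-07-09 08:30:00), which suggests the surprising output came from resampling a different object, or from not assigning the result back:
import pandas as pd

humdt = pd.DataFrame({
    'datetime': ['2019-07-09 08:30:00', '2019-07-09 11:00:00',
                 '2019-07-09 17:30:00', '2019-07-09 23:30:00',
                 '2019-07-11 06:15:00', '2019-07-19 14:30:00',
                 '2019-07-21 00:00:00'],
    'humidite': [87, 87, 82, 80, 79, 39, 80],
})
humdt['datetime'] = pd.to_datetime(humdt['datetime'])
humdt = humdt.set_index('datetime')

# resample returns a new object; assign it back to keep the result
humdt = humdt.resample('15min').asfreq()
print(humdt.head())  # index starts at 2019-07-09 08:30:00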

How do I display a subset of a pandas dataframe?

I have a dataframe df that contains datetimes for every hour of a day between 2003-02-12 to 2017-06-30 and I want to delete all datetimes between 24th Dec and 1st Jan of EVERY year.
An extract of my data frame is:
...
7505,2003-12-23 17:00:00
7506,2003-12-23 18:00:00
7507,2003-12-23 19:00:00
7508,2003-12-23 20:00:00
7509,2003-12-23 21:00:00
7510,2003-12-23 22:00:00
7511,2003-12-23 23:00:00
7512,2003-12-24 00:00:00
7513,2003-12-24 01:00:00
7514,2003-12-24 02:00:00
7515,2003-12-24 03:00:00
7516,2003-12-24 04:00:00
7517,2003-12-24 05:00:00
7518,2003-12-24 06:00:00
...
7723,2004-01-01 19:00:00
7724,2004-01-01 20:00:00
7725,2004-01-01 21:00:00
7726,2004-01-01 22:00:00
7727,2004-01-01 23:00:00
7728,2004-01-02 00:00:00
7729,2004-01-02 01:00:00
7730,2004-01-02 02:00:00
7731,2004-01-02 03:00:00
7732,2004-01-02 04:00:00
7733,2004-01-02 05:00:00
7734,2004-01-02 06:00:00
7735,2004-01-02 07:00:00
...
and my expected output is:
...
7505,2003-12-23 17:00:00
7506,2003-12-23 18:00:00
7507,2003-12-23 19:00:00
7508,2003-12-23 20:00:00
7509,2003-12-23 21:00:00
7510,2003-12-23 22:00:00
7511,2003-12-23 23:00:00
...
7728,2004-01-02 00:00:00
7729,2004-01-02 01:00:00
7730,2004-01-02 02:00:00
7731,2004-01-02 03:00:00
7732,2004-01-02 04:00:00
7733,2004-01-02 05:00:00
7734,2004-01-02 06:00:00
7735,2004-01-02 07:00:00
...
Sample dataframe:
dates
0 2003-12-23 23:00:00
1 2003-12-24 05:00:00
2 2004-12-27 05:00:00
3 2003-12-13 23:00:00
4 2002-12-23 23:00:00
5 2004-01-01 05:00:00
6 2014-12-24 05:00:00
Solution:
If you want to exclude these dates for every year, extract the month and day first:
df['month'] = df['dates'].dt.month
df['day'] = df['dates'].dt.day
And now put the condition check:
dec_days = [24, 25, 26, 27, 28, 29, 30, 31]
## if the month is dec, then check for these dates
## if the month is jan, then just check for the day to be 1 like below
df = df[~(((df.month == 12) & (df.day.isin(dec_days))) | ((df.month == 1) & (df.day == 1)))]
Sample output:
dates month day
0 2003-12-23 23:00:00 12 23
3 2003-12-13 23:00:00 12 13
4 2002-12-23 23:00:00 12 23
This takes advantage of the fact that datetime-strings in the form mm-dd are sortable. Read everything in from the CSV file then filter for the dates you want:
df = pd.read_csv('...', parse_dates=['DateTime'])
s = df['DateTime'].dt.strftime('%m-%d')
excluded = (s == '01-01') | (s >= '12-24') # Jan 1 or >= Dec 24
df[~excluded]
You can try dropping rows conditionally, either with a pattern match on the date string or by parsing the date and comparing:
datesIdontLike = df[df['colname'] == <stringPattern>].index
newDF = df.drop(datesIdontLike)  # note: df.drop(..., inplace=True) modifies df and returns None
Check this out: https://thispointer.com/python-pandas-how-to-drop-rows-in-dataframe-by-conditions-on-column-values/
(If you have issues, let me know.)
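As a concrete sketch of that idea (assuming the datetime column is called DateTime, as in the previous answer):
# Month-day strings for the window to drop: Dec 24-31 plus Jan 1
bad_days = ['12-%02d' % d for d in range(24, 32)] + ['01-01']
datesIdontLike = df[df['DateTime'].dt.strftime('%m-%d').isin(bad_days)].index
newDF = df.drop(datesIdontLike)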
You can use pandas and boolean filtering with strftime
# version 0.23.4
import pandas as pd
# make df
df = pd.DataFrame(pd.date_range('20181223', '20190103', freq='H'), columns=['date'])
# string format the date to only include the month and day
# then set it strictly less than '12-24' AND greater than or equal to `01-02`
df = df.loc[
    (df.date.dt.strftime('%m-%d') < '12-24') &
    (df.date.dt.strftime('%m-%d') >= '01-02')
].copy()
print(df)
date
0 2018-12-23 00:00:00
1 2018-12-23 01:00:00
2 2018-12-23 02:00:00
3 2018-12-23 03:00:00
4 2018-12-23 04:00:00
5 2018-12-23 05:00:00
6 2018-12-23 06:00:00
7 2018-12-23 07:00:00
8 2018-12-23 08:00:00
9 2018-12-23 09:00:00
10 2018-12-23 10:00:00
11 2018-12-23 11:00:00
12 2018-12-23 12:00:00
13 2018-12-23 13:00:00
14 2018-12-23 14:00:00
15 2018-12-23 15:00:00
16 2018-12-23 16:00:00
17 2018-12-23 17:00:00
18 2018-12-23 18:00:00
19 2018-12-23 19:00:00
20 2018-12-23 20:00:00
21 2018-12-23 21:00:00
22 2018-12-23 22:00:00
23 2018-12-23 23:00:00
240 2019-01-02 00:00:00
241 2019-01-02 01:00:00
242 2019-01-02 02:00:00
243 2019-01-02 03:00:00
244 2019-01-02 04:00:00
245 2019-01-02 05:00:00
246 2019-01-02 06:00:00
247 2019-01-02 07:00:00
248 2019-01-02 08:00:00
249 2019-01-02 09:00:00
250 2019-01-02 10:00:00
251 2019-01-02 11:00:00
252 2019-01-02 12:00:00
253 2019-01-02 13:00:00
254 2019-01-02 14:00:00
255 2019-01-02 15:00:00
256 2019-01-02 16:00:00
257 2019-01-02 17:00:00
258 2019-01-02 18:00:00
259 2019-01-02 19:00:00
260 2019-01-02 20:00:00
261 2019-01-02 21:00:00
262 2019-01-02 22:00:00
263 2019-01-02 23:00:00
264 2019-01-03 00:00:00
This will work with multiple years because we are only filtering on the month and day.
# change range to include 2017
df = pd.DataFrame(pd.date_range('20171223', '20190103', freq='H'), columns=['date'])
df = df.loc[
    (df.date.dt.strftime('%m-%d') < '12-24') &
    (df.date.dt.strftime('%m-%d') >= '01-02')
].copy()
print(df)
date
0 2017-12-23 00:00:00
1 2017-12-23 01:00:00
2 2017-12-23 02:00:00
3 2017-12-23 03:00:00
4 2017-12-23 04:00:00
5 2017-12-23 05:00:00
6 2017-12-23 06:00:00
7 2017-12-23 07:00:00
8 2017-12-23 08:00:00
9 2017-12-23 09:00:00
10 2017-12-23 10:00:00
11 2017-12-23 11:00:00
12 2017-12-23 12:00:00
13 2017-12-23 13:00:00
14 2017-12-23 14:00:00
15 2017-12-23 15:00:00
16 2017-12-23 16:00:00
17 2017-12-23 17:00:00
18 2017-12-23 18:00:00
19 2017-12-23 19:00:00
20 2017-12-23 20:00:00
21 2017-12-23 21:00:00
22 2017-12-23 22:00:00
23 2017-12-23 23:00:00
240 2018-01-02 00:00:00
241 2018-01-02 01:00:00
242 2018-01-02 02:00:00
243 2018-01-02 03:00:00
244 2018-01-02 04:00:00
245 2018-01-02 05:00:00
... ...
8779 2018-12-23 19:00:00
8780 2018-12-23 20:00:00
8781 2018-12-23 21:00:00
8782 2018-12-23 22:00:00
8783 2018-12-23 23:00:00
9000 2019-01-02 00:00:00
9001 2019-01-02 01:00:00
9002 2019-01-02 02:00:00
9003 2019-01-02 03:00:00
9004 2019-01-02 04:00:00
9005 2019-01-02 05:00:00
9006 2019-01-02 06:00:00
9007 2019-01-02 07:00:00
9008 2019-01-02 08:00:00
9009 2019-01-02 09:00:00
9010 2019-01-02 10:00:00
9011 2019-01-02 11:00:00
9012 2019-01-02 12:00:00
9013 2019-01-02 13:00:00
9014 2019-01-02 14:00:00
9015 2019-01-02 15:00:00
9016 2019-01-02 16:00:00
9017 2019-01-02 17:00:00
9018 2019-01-02 18:00:00
9019 2019-01-02 19:00:00
9020 2019-01-02 20:00:00
9021 2019-01-02 21:00:00
9022 2019-01-02 22:00:00
9023 2019-01-02 23:00:00
9024 2019-01-03 00:00:00
Since you want this to happen for every year, we can first define a series where we replace the year with a static value (2000, for example). Let date be the column that stores the date; we can generate such a column as:
dt = pd.to_datetime({'year': 2000, 'month': df['date'].dt.month, 'day': df['date'].dt.day})
For the given sample data, we get:
>>> dt
0 2000-12-23
1 2000-12-23
2 2000-12-23
3 2000-12-23
4 2000-12-23
5 2000-12-23
6 2000-12-23
7 2000-12-24
8 2000-12-24
9 2000-12-24
10 2000-12-24
11 2000-12-24
12 2000-12-24
13 2000-12-24
14 2000-01-01
15 2000-01-01
16 2000-01-01
17 2000-01-01
18 2000-01-01
19 2000-01-02
20 2000-01-02
21 2000-01-02
22 2000-01-02
23 2000-01-02
24 2000-01-02
25 2000-01-02
26 2000-01-02
dtype: datetime64[ns]
Next we can filter the rows, like:
from datetime import date
df[(dt >= date(2000,1,2)) & (dt < date(2000,12,24))]
This gives us the following data for your sample data:
>>> df[(dt >= date(2000,1,2)) & (dt < date(2000,12,24))]
id dt
0 7505 2003-12-23 17:00:00
1 7506 2003-12-23 18:00:00
2 7507 2003-12-23 19:00:00
3 7508 2003-12-23 20:00:00
4 7509 2003-12-23 21:00:00
5 7510 2003-12-23 22:00:00
6 7511 2003-12-23 23:00:00
19 7728 2004-01-02 00:00:00
20 7729 2004-01-02 01:00:00
21 7730 2004-01-02 02:00:00
22 7731 2004-01-02 03:00:00
23 7732 2004-01-02 04:00:00
24 7733 2004-01-02 05:00:00
25 7734 2004-01-02 06:00:00
26 7735 2004-01-02 07:00:00
So regardless what the year is, we will only consider dates between the 2nd of January and the 23rd of December (both inclusive).

Group data by time of the day

I have a dataframe with a datetime index: df.head(6)
NUMBERES PRICE
DEAL_TIME
2015-03-02 12:40:03 5 25
2015-03-04 14:52:57 7 23
2015-03-03 08:10:09 10 43
2015-03-02 20:18:24 5 37
2015-03-05 07:50:55 4 61
2015-03-02 09:08:17 1 17
The dataframe covers one week of data. Now I need to count the records per time period of the day. If the time period is one hour, I know the following method works:
df_grouped = df.groupby(df.index.hour).count()
But I don't know what to do when the time period is half an hour. How can I achieve this?
UPDATE:
I was told that this question is similar to How to group DataFrame by a period of time?
But I had tried the methods mentioned. Maybe it's my fault that I didn't say it clearly. 'DEAL_TIME' ranges from '2015-03-02 00:00:00' to '2015-03-08 23:59:59'. If I use pd.TimeGrouper(freq='30Min') or resample(), the time periods would range from '2015-03-02 00:30' to '2015-03-08 23:30'. But what I want is a series like below:
COUNT
DEAL_TIME
00:00:00 53
00:30:00 49
01:00:00 31
01:30:00 22
02:00:00 1
02:30:00 24
03:00:00 27
03:30:00 41
04:00:00 41
04:30:00 76
05:00:00 33
05:30:00 16
06:00:00 15
06:30:00 4
07:00:00 60
07:30:00 85
08:00:00 3
08:30:00 37
09:00:00 18
09:30:00 29
10:00:00 31
10:30:00 67
11:00:00 35
11:30:00 60
12:00:00 95
12:30:00 37
13:00:00 30
13:30:00 62
14:00:00 58
14:30:00 44
15:00:00 45
15:30:00 35
16:00:00 94
16:30:00 56
17:00:00 64
17:30:00 43
18:00:00 60
18:30:00 52
19:00:00 14
19:30:00 9
20:00:00 31
20:30:00 71
21:00:00 21
21:30:00 32
22:00:00 61
22:30:00 35
23:00:00 14
23:30:00 21
In other words, the time period should be irrelevant to the date.
You need a 30-minute time grouper for this:
grouper = pd.Grouper(freq='30min')  # pd.TimeGrouper(freq='30T') in pandas < 0.21
You also need to remove the 'date' part from the index so only the time of day remains:
df.index = df.index - df.index.normalize()
Now, you can group by time alone:
df.groupby(grouper).count()
You can find the somewhat obscure grouper documentation here: pandas resample documentation (it's actually the resample documentation, but both features use the same rules).
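Putting those pieces together, a self-contained sketch with made-up data (flooring the timedeltas is an equivalent alternative to the grouper):
import numpy as np
import pandas as pd

# Fake week of deals at irregular-looking times
df = pd.DataFrame(
    {'NUMBERES': np.random.randint(1, 10, 1000),
     'PRICE': np.random.randint(10, 100, 1000)},
    index=pd.date_range('2015-03-02', periods=1000, freq='11min', name='DEAL_TIME'),
)

tod = df.index - df.index.normalize()            # time of day, date removed
counts = df.groupby(tod.floor('30min')).size()   # 30-minute bins across all days
print(counts.head())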
In pandas, the most common way to group by time is to use the .resample() function.
Since v0.18.0 this function is two-stage: df.resample('M') creates an object to which we can apply other functions (mean, count, sum, etc.).
The code snippet looks like:
df.resample('M').count()
See the resample documentation for examples.

How can I slice a dataframe by timestamp when the timestamp isn't the index?

How can I split my pandas dataframe using its timestamp column?
I got the following prices when I call df30m:
Timestamp Open High Low Close Volume
0 2016-05-01 19:30:00 449.80 450.13 449.80 449.90 74.1760
1 2016-05-01 20:00:00 449.90 450.27 449.90 450.07 63.5840
2 2016-05-01 20:30:00 450.12 451.00 450.02 450.51 64.1080
3 2016-05-01 21:00:00 450.51 452.05 450.50 451.22 75.7390
4 2016-05-01 21:30:00 451.21 451.64 450.81 450.87 71.1190
5 2016-05-01 22:00:00 450.87 452.05 450.87 451.07 73.8430
6 2016-05-01 22:30:00 451.09 451.70 450.91 450.91 68.1490
7 2016-05-01 23:00:00 450.91 450.98 449.97 450.61 84.5430
8 2016-05-01 23:30:00 450.61 451.50 450.55 451.45 111.2370
9 2016-05-02 00:00:00 451.47 452.31 450.69 451.19 190.0750
10 2016-05-02 00:30:00 451.20 451.68 450.45 450.82 186.0930
11 2016-05-02 01:00:00 450.83 451.64 450.65 450.73 112.4630
12 2016-05-02 01:30:00 450.73 451.10 450.31 450.56 137.7530
13 2016-05-02 02:00:00 450.56 452.01 449.98 450.27 151.6140
14 2016-05-02 02:30:00 450.27 451.30 450.23 451.11 99.5490
15 2016-05-02 03:00:00 451.29 451.29 450.17 450.33 178.9860
16 2016-05-02 03:30:00 450.44 451.20 450.44 450.75 65.1480
17 2016-05-02 04:00:00 450.79 451.20 450.75 451.00 78.0430
18 2016-05-02 04:30:00 451.00 451.11 450.85 451.11 64.7250
19 2016-05-02 05:00:00 451.11 451.64 451.00 451.12 73.4840
20 2016-05-02 05:30:00 451.12 451.83 450.67 451.33 94.1950
21 2016-05-02 06:00:00 451.35 451.37 450.17 450.18 227.7480
22 2016-05-02 06:30:00 450.18 450.43 450.17 450.17 83.0270
23 2016-05-02 07:00:00 450.17 450.43 448.90 449.41 170.4950
24 2016-05-02 07:30:00 449.38 450.00 448.56 448.56 243.0420
25 2016-05-02 08:00:00 448.67 448.67 446.21 448.00 525.7090
26 2016-05-02 08:30:00 448.12 448.49 445.00 445.00 673.5810
27 2016-05-02 09:00:00 445.00 445.51 440.11 444.20 1392.9049
28 2016-05-02 09:30:00 444.24 444.36 440.11 442.00 438.6860
29 2016-05-02 10:00:00 441.91 443.20 440.05 442.24 400.5850
... ... ... ... ... ... ...
1651 2016-06-05 05:00:00 578.74 579.00 577.92 578.39 93.6980
1652 2016-06-05 05:30:00 578.40 578.48 574.52 575.26 98.1580
1653 2016-06-05 06:00:00 575.24 576.02 572.47 574.06 126.8620
1654 2016-06-05 06:30:00 574.06 576.35 574.06 576.34 125.4120
1655 2016-06-05 07:00:00 576.34 576.34 574.73 575.83 34.8070
1656 2016-06-05 07:30:00 575.84 576.27 574.91 575.58 74.8180
1657 2016-06-05 08:00:00 575.58 578.57 575.58 578.36 123.2560
1658 2016-06-05 08:30:00 578.23 578.47 576.18 577.25 43.6590
1659 2016-06-05 09:00:00 577.20 578.85 576.70 577.27 95.3900
1660 2016-06-05 09:30:00 577.36 578.18 576.70 576.70 51.0250
1661 2016-06-05 10:00:00 576.70 576.70 574.55 575.39 101.0590
1662 2016-06-05 10:30:00 575.41 576.44 575.18 576.44 86.4340
1663 2016-06-05 11:00:00 576.50 577.89 576.50 577.80 113.0600
1664 2016-06-05 11:30:00 577.80 578.10 576.03 576.98 57.5050
1665 2016-06-05 12:00:00 576.98 577.55 576.59 577.54 56.1070
1666 2016-06-05 12:30:00 577.54 583.00 570.93 572.82 872.8200
1667 2016-06-05 13:00:00 572.94 573.19 569.64 572.50 310.0020
1668 2016-06-05 13:30:00 572.50 574.37 572.50 574.09 59.3410
1669 2016-06-05 14:00:00 574.09 574.19 571.51 572.98 155.4310
1670 2016-06-05 14:30:00 572.98 573.57 572.02 573.47 76.9270
1671 2016-06-05 15:00:00 573.62 575.10 572.97 573.37 59.1430
1672 2016-06-05 15:30:00 573.37 574.39 573.37 574.38 77.3270
1673 2016-06-05 16:00:00 574.39 575.59 574.38 575.59 52.0150
1674 2016-06-05 16:30:00 575.00 575.59 574.50 575.00 66.9300
1675 2016-06-05 17:00:00 575.00 576.83 574.38 576.60 50.2990
1676 2016-06-05 17:30:00 576.60 577.50 575.50 576.86 104.5200
1677 2016-06-05 18:00:00 576.86 577.21 575.44 575.80 55.3270
1678 2016-06-05 18:30:00 575.77 575.80 574.52 574.77 78.7760
1679 2016-06-05 19:00:00 574.73 575.18 572.52 574.47 126.4300
1680 2016-06-05 19:30:00 574.49 574.87 573.80 574.32 10.4930
As you can see, it contains the last 35 days in 30-minute intervals.
I want to manipulate this price history over different time windows.
So, as a beginner example, I would like to fetch only the data from the last day.
How can I filter this dataframe to show only the last day's data?
This is what I've tried:
import datetime
d0 = datetime.datetime.today()
d1 = datetime.datetime.today() - datetime.timedelta(days=1)
print(d0)  # 2016-06-05 17:10:37.633824
print(d1)  # 2016-06-04 17:10:37.633967
df_1d = df30m['Timestamp'] > d1
print(df_1d)
This returns a pandas Series of True/False values:
0 False
1 False
2 False
3 False
4 False
...
1676 True
1677 True
1678 True
1679 True
1680 True
I've also tried the between_time() method:
df_1d = df30m.between_time(d0, d1)
But I got the following error message:
TypeError: Index must be DatetimeIndex
Please, can anyone show me a pythonic way to slice my dataframe?
You can use loc to index your data. Do you know if your timestamps are datetime.datetime objects or pandas Timestamps?
df30m.loc[(df30m.Timestamp <= d0) & (df30m.Timestamp >= d1)]
You can set the index to the Timestamp column and then index as follows:
df.set_index('Timestamp', inplace=True)
df[d1:d0]
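A follow-up sketch: with Timestamp as the index, the "last day" can also be taken relative to the data itself rather than the wall clock, which is useful when the data doesn't extend to today:
import pandas as pd

df = df30m.set_index('Timestamp')
# last day relative to the newest row in the data
last_day = df[df.index >= df.index.max() - pd.Timedelta(days=1)]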
