Pandas Dataframe.resample('MS') - python

I wonder if somebody can help me understand where I may be going wrong. I have a dataframe containing data (part numbers, sales per month, year) with missing months, and I am trying to use dataframe.resample('MS').asfreq() to identify the missing months and insert them into my list. I performed a groupby beforehand to ensure the data is collected together prior to the resample.
My code is:
# Select the relevant columns; .copy() avoids a SettingWithCopyWarning on the next line
df2 = df1[['Part_ID', 'Order_Qty', 'Extended_Price', 'Month', 'Month1', 'Year']].copy()
# Build a month-start timestamp and use it as the index
df2['Month1'] = pd.to_datetime(df2.Month.astype(str) + '-01-' + df2.Year.astype(str))
df2 = df2.set_index(pd.DatetimeIndex(df2['Month1']))
# Group by part and year, then resample each group to month-start frequency
df2 = df2.groupby([df2['Part_ID'], df2['Year']])
df2 = df2['Month1'].resample('MS').asfreq()
df3 = df2.to_frame()
print(df3)
df4 = df3.reset_index()
After execution some months have been added, but some are missing. Can anybody please explain why?
The output after resample is:
Part_ID Year Month1
08095601/2 2014 2014-07-01 2014-07-01
2014-08-01 2014-08-01
2014-09-01 2014-09-01
2014-10-01 2014-10-01
2014-11-01 2014-11-01
2015 2015-01-01 2015-01-01
2015-02-01 2015-02-01
2015-03-01 2015-03-01
2015-04-01 2015-04-01
2015-05-01 2015-05-01
2015-06-01 2015-06-01
08095601/5 2014 2014-07-01 2014-07-01
2014-08-01 2014-08-01
2014-09-01 2014-09-01
...
ZZSSL 2007 2007-10-01 2007-10-01
2007-11-01 NaT
2007-12-01 2007-12-01
2008 2008-01-01 2008-01-01
2008-02-01 2008-02-01
2008-03-01 2008-03-01
2008-04-01 2008-04-01
2008-05-01 2008-05-01
2008-06-01 2008-06-01
2008-07-01 NaT
2008-08-01 NaT
2008-09-01 2008-09-01
2008-10-01 2008-10-01
2009 2009-01-01 2009-01-01
2009-02-01 2009-02-01
2009-03-01 2009-03-01
2009-04-01 2009-04-01
2009-05-01 2009-05-01
2009-06-01 2009-06-01
2009-07-01 2009-07-01
bracket 2014 2014-07-01 2014-07-01
2014-08-01 2014-08-01
2014-09-01 2014-09-01
2014-10-01 2014-10-01
2014-11-01 2014-11-01
2014-12-01 2014-12-01
2015 2015-01-01 2015-01-01
2015-02-01 NaT
2015-03-01 NaT
2015-04-01 2015-04-01
As you can see, for part 08095601/2 in 2014 there is no month 12, and for part ZZSSL there is no month 11 or 12 in 2008, whereas bracket has correctly had months 2 and 3 of 2015 inserted.
Any pointers please.
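No answer is recorded above, but the likely cause (a pandas behaviour, not something stated in the thread) is that resample only generates rows between each group's first and last observed timestamp, so months after the last sale of a year (2014-12 for 08095601/2, 2008-11 and 2008-12 for ZZSSL) are never created. A minimal sketch of the behaviour using a hypothetical two-row series:
import pandas as pd

# Two sales dates in 2014; resample only spans Jul..Nov, the observed range
s = pd.Series(1, index=pd.to_datetime(['2014-07-01', '2014-11-01']))
print(s.resample('MS').asfreq())   # stops at 2014-11-01, so no December row
# One way to pad the year out instead: reindex against a full-year range
full_year = pd.date_range('2014-01-01', '2014-12-01', freq='MS')
print(s.reindex(full_year))        # NaN rows appear through 2014-12-01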

Related

How to create 2-3 hrs from multiple columns

I have a df with multiple time series that looks like this (df_observ):
id     Date       start time  Action 1 time  end of action2  end Time  Observ1 time  observ2 time  observ1 value  observ2 value
indv1  3-2017     00:00:00    02:40:00       04:25:00        04:38:00  00:05:00      01:45:00      57             111
indv2  11-6-2019  00:00:00    00:46:00       02:16:00        02:40:00  01:01:00      02:37:00      68             113
indv2  13-4-2017  00:00:00    02:22:00       04:25:00        04:38:00  00:05:00      03:10:06      82             125
indv3  23-5-2022  00:00:00    01:34:00       02:22:00        03:34:00  02:24:00      03:25:00      67             101
indv4  8-11-2021  00:00:00    00:05:00       03:16:00        03:52:14  01:01:00      02:11:00      63             108
All time series are measured relative to the start time. Is there a way to plot the observed value changes at the different time points?
Thanks!
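No answer was posted for this one; as one hedged possibility (column names taken from the table above, everything else assumed), the observation-time columns could be converted to timedeltas and each individual's values plotted against elapsed time:
import pandas as pd
import matplotlib.pyplot as plt

# Convert the HH:MM:SS strings to elapsed time since 'start time'
for col in ['Observ1 time', 'observ2 time']:
    df_observ[col] = pd.to_timedelta(df_observ[col])

fig, ax = plt.subplots()
for _, row in df_observ.iterrows():
    # Plot each individual's two observations against elapsed hours
    hours = [row['Observ1 time'].total_seconds() / 3600,
             row['observ2 time'].total_seconds() / 3600]
    values = [row['observ1 value'], row['observ2 value']]
    ax.plot(hours, values, marker='o', label=row['id'])
ax.set_xlabel('hours since start')
ax.set_ylabel('observed value')
ax.legend()
plt.show()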

Pandas (Python): How to apply values to similar row?

Sorry for the badly phrased question; currently only the first hour is updated with the holiday.
e.g.
2013-01-01 00:00:00 - New Years Day
2013-01-01 01:00:00 - None
2013-01-01 02:00:00 - None
I would like to apply similar holidays to the same date using Pandas (Python).
What would be the most efficient method to apply the holiday to the same dates, there are a number of other holidays to apply as well?
Thank you in advance!
Screenshot of CSV in question
Using a library called holidays together with pandas apply could be a great solution to your problem. Here is a short, self-contained example:
import pandas as pd
import holidays
us_holidays = holidays.UnitedStates()
# Create a sample DataFrame. You can just use your own
data = pd.DataFrame(pd.date_range('2020-01-01', '2020-01-30'), columns=['date'])
data['holiday'] = data['date'].apply(lambda x: us_holidays.get(x))  # None when the date is not a holiday
print(data)
Output
date holiday
0 2020-01-01 New Year's Day
1 2020-01-02 None
2 2020-01-03 None
3 2020-01-04 None
4 2020-01-05 None
5 2020-01-06 None
6 2020-01-07 None
7 2020-01-08 None
8 2020-01-09 None
9 2020-01-10 None
10 2020-01-11 None
11 2020-01-12 None
12 2020-01-13 None
13 2020-01-14 None
14 2020-01-15 None
15 2020-01-16 None
16 2020-01-17 None
17 2020-01-18 None
18 2020-01-19 None
19 2020-01-20 Martin Luther King, Jr. Day
20 2020-01-21 None
21 2020-01-22 None
22 2020-01-23 None
23 2020-01-24 None
24 2020-01-25 None
25 2020-01-26 None
26 2020-01-27 None
27 2020-01-28 None
28 2020-01-29 None
29 2020-01-30 None
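If the goal is then to spread these day-level names across hourly rows, one hedged possibility (df_hourly and its 'ts' column are hypothetical, not from the answer) is to map each timestamp's date through the same lookup:
df_hourly['holiday'] = df_hourly['ts'].dt.date.map(us_holidays.get)  # date-level lookup per row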
IIUC, you have only the first hour of a day listed with a holiday. Here is a small sample of a dataframe with two months of data and three holidays on three separate days.
import pandas as pd
import numpy as np
# Two months of hourly data; the holiday name appears only on a day's first hour
df = pd.DataFrame({'temp': np.random.randint(50, 110, 60*24)},
                  index=pd.date_range('2013-01-01', periods=(60*24), freq='H'))
df['Holiday'] = np.nan
df.loc['2013-01-01 00:00:00', 'Holiday'] = 'New Years Day'
df.loc['2013-02-02 00:00:00', 'Holiday'] = 'Groundhog Day'
df.loc['2013-02-14 00:00:00', 'Holiday'] = "Valentine's Day"
Now, let's use groupby with the date from the DatetimeIndex and ffill (grouping by .date rather than .day, since .day would put e.g. Jan 1 and Feb 1 in the same group and let the fill leak across months):
df['Holiday'] = df.groupby(df.index.date)['Holiday'].ffill()
Let's look at a few records:
print(df.head(40))
print(df['2013-02-02'])
print(df['2013-02-13':'2013-02-15'])
Output:
temp Holiday
2013-01-01 00:00:00 51 New Years Day
2013-01-01 01:00:00 71 New Years Day
2013-01-01 02:00:00 61 New Years Day
2013-01-01 03:00:00 90 New Years Day
2013-01-01 04:00:00 77 New Years Day
2013-01-01 05:00:00 69 New Years Day
2013-01-01 06:00:00 50 New Years Day
2013-01-01 07:00:00 99 New Years Day
2013-01-01 08:00:00 86 New Years Day
2013-01-01 09:00:00 72 New Years Day
2013-01-01 10:00:00 89 New Years Day
2013-01-01 11:00:00 62 New Years Day
2013-01-01 12:00:00 53 New Years Day
2013-01-01 13:00:00 91 New Years Day
2013-01-01 14:00:00 51 New Years Day
2013-01-01 15:00:00 93 New Years Day
2013-01-01 16:00:00 97 New Years Day
2013-01-01 17:00:00 83 New Years Day
2013-01-01 18:00:00 87 New Years Day
2013-01-01 19:00:00 58 New Years Day
2013-01-01 20:00:00 84 New Years Day
2013-01-01 21:00:00 92 New Years Day
2013-01-01 22:00:00 106 New Years Day
2013-01-01 23:00:00 104 New Years Day
2013-01-02 00:00:00 78 NaN
2013-01-02 01:00:00 104 NaN
2013-01-02 02:00:00 96 NaN
2013-01-02 03:00:00 103 NaN
2013-01-02 04:00:00 60 NaN
2013-01-02 05:00:00 87 NaN
2013-01-02 06:00:00 108 NaN
2013-01-02 07:00:00 85 NaN
2013-01-02 08:00:00 67 NaN
2013-01-02 09:00:00 61 NaN
2013-01-02 10:00:00 91 NaN
2013-01-02 11:00:00 79 NaN
2013-01-02 12:00:00 99 NaN
2013-01-02 13:00:00 82 NaN
2013-01-02 14:00:00 75 NaN
2013-01-02 15:00:00 90 NaN
temp Holiday
2013-02-02 00:00:00 82 Groundhog Day
2013-02-02 01:00:00 58 Groundhog Day
2013-02-02 02:00:00 102 Groundhog Day
2013-02-02 03:00:00 90 Groundhog Day
2013-02-02 04:00:00 79 Groundhog Day
2013-02-02 05:00:00 50 Groundhog Day
2013-02-02 06:00:00 50 Groundhog Day
2013-02-02 07:00:00 83 Groundhog Day
2013-02-02 08:00:00 80 Groundhog Day
2013-02-02 09:00:00 50 Groundhog Day
2013-02-02 10:00:00 52 Groundhog Day
2013-02-02 11:00:00 69 Groundhog Day
2013-02-02 12:00:00 100 Groundhog Day
2013-02-02 13:00:00 61 Groundhog Day
2013-02-02 14:00:00 62 Groundhog Day
2013-02-02 15:00:00 76 Groundhog Day
2013-02-02 16:00:00 83 Groundhog Day
2013-02-02 17:00:00 109 Groundhog Day
2013-02-02 18:00:00 109 Groundhog Day
2013-02-02 19:00:00 81 Groundhog Day
2013-02-02 20:00:00 52 Groundhog Day
2013-02-02 21:00:00 108 Groundhog Day
2013-02-02 22:00:00 68 Groundhog Day
2013-02-02 23:00:00 75 Groundhog Day
temp Holiday
2013-02-13 00:00:00 93 NaN
2013-02-13 01:00:00 93 NaN
2013-02-13 02:00:00 74 NaN
2013-02-13 03:00:00 97 NaN
2013-02-13 04:00:00 58 NaN
2013-02-13 05:00:00 103 NaN
2013-02-13 06:00:00 79 NaN
2013-02-13 07:00:00 65 NaN
2013-02-13 08:00:00 72 NaN
2013-02-13 09:00:00 100 NaN
2013-02-13 10:00:00 66 NaN
2013-02-13 11:00:00 60 NaN
2013-02-13 12:00:00 95 NaN
2013-02-13 13:00:00 51 NaN
2013-02-13 14:00:00 71 NaN
2013-02-13 15:00:00 58 NaN
2013-02-13 16:00:00 58 NaN
2013-02-13 17:00:00 98 NaN
2013-02-13 18:00:00 61 NaN
2013-02-13 19:00:00 63 NaN
2013-02-13 20:00:00 57 NaN
2013-02-13 21:00:00 102 NaN
2013-02-13 22:00:00 69 NaN
2013-02-13 23:00:00 86 NaN
2013-02-14 00:00:00 94 Valentine's Day
2013-02-14 01:00:00 64 Valentine's Day
2013-02-14 02:00:00 62 Valentine's Day
2013-02-14 03:00:00 59 Valentine's Day
2013-02-14 04:00:00 93 Valentine's Day
2013-02-14 05:00:00 99 Valentine's Day
2013-02-14 06:00:00 64 Valentine's Day
2013-02-14 07:00:00 80 Valentine's Day
2013-02-14 08:00:00 89 Valentine's Day
2013-02-14 09:00:00 96 Valentine's Day
2013-02-14 10:00:00 60 Valentine's Day
2013-02-14 11:00:00 76 Valentine's Day
2013-02-14 12:00:00 82 Valentine's Day
2013-02-14 13:00:00 65 Valentine's Day
2013-02-14 14:00:00 90 Valentine's Day
2013-02-14 15:00:00 62 Valentine's Day
2013-02-14 16:00:00 64 Valentine's Day
2013-02-14 17:00:00 98 Valentine's Day
2013-02-14 18:00:00 52 Valentine's Day
2013-02-14 19:00:00 72 Valentine's Day
2013-02-14 20:00:00 108 Valentine's Day
2013-02-14 21:00:00 85 Valentine's Day
2013-02-14 22:00:00 87 Valentine's Day
2013-02-14 23:00:00 62 Valentine's Day
2013-02-15 00:00:00 106 NaN
2013-02-15 01:00:00 82 NaN
2013-02-15 02:00:00 77 NaN
2013-02-15 03:00:00 52 NaN
2013-02-15 04:00:00 94 NaN
2013-02-15 05:00:00 71 NaN
2013-02-15 06:00:00 95 NaN
2013-02-15 07:00:00 96 NaN
2013-02-15 08:00:00 71 NaN
2013-02-15 09:00:00 69 NaN
2013-02-15 10:00:00 85 NaN
2013-02-15 11:00:00 92 NaN
2013-02-15 12:00:00 106 NaN
2013-02-15 13:00:00 77 NaN
2013-02-15 14:00:00 65 NaN
2013-02-15 15:00:00 104 NaN
2013-02-15 16:00:00 98 NaN
2013-02-15 17:00:00 107 NaN
2013-02-15 18:00:00 106 NaN
2013-02-15 19:00:00 67 NaN
2013-02-15 20:00:00 59 NaN
2013-02-15 21:00:00 81 NaN
2013-02-15 22:00:00 56 NaN
2013-02-15 23:00:00 75 NaN
Note: In this dataframe your datetime column is in the index.
You can try using the apply method: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html
The input to this is the function you want applied to each row, and in this case "axis" should be 1 so that it is applied to each row (axis=0 would apply it to each column instead).
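As a minimal, hypothetical sketch of that row-wise idea against the hourly frame built above (the by_date mapping is mine, not from the answer):
# Map each date that already carries a name, then look rows up by date.
# axis=1 passes one row at a time; row.name is that row's timestamp index.
by_date = {ts.date(): name for ts, name in df['Holiday'].dropna().items()}
df['Holiday'] = df.apply(lambda row: by_date.get(row.name.date()), axis=1)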

De-Cumulating Time Series Data in a Pandas Series

I have several monthly, datetime-indexed cumulative Pandas series which I would like to de-cumulate so I can just get the values for the specific months themselves.
So, for each year, Jan is Jan, Feb is Jan + Feb, Mar is Jan + Feb + Mar and so on, until the next year that starts at Jan again.
To be awkward, some of these series start in Feb instead.
Here's an example series:
2016-02-29 112.3
2016-03-31 243.0
2016-04-30 360.1
2016-05-31 479.5
2016-06-30 643.0
2016-07-31 757.6
2016-08-31 874.5
2016-09-30 1051.8
2016-10-31 1203.4
2016-11-30 1358.3
2016-12-31 1573.5
2017-01-31 75.0
2017-02-28 140.5
2017-03-31 290.4
2017-04-30 416.6
2017-05-31 548.2
2017-06-30 746.6
2017-07-31 863.5
2017-08-31 985.4
2017-09-30 1160.1
2017-10-31 1302.5
2017-11-30 1465.7
2017-12-31 1694.1
2018-01-31 74.0
2018-02-28 146.3
2018-03-31 300.9
2018-04-30 421.9
2018-05-31 564.1
2018-06-30 771.4
I thought one way to do this would be to use df.diff() to get most of the differences for everything but Jan, replace the incorrect Jan values with NaN then do a df.update(original df) to fill in the NaNs with the correct values.
I'm having trouble replacing the Jan data with NaNs. Would anyone be able to help with this, or suggest another solution, please?
I would solve this with groupby + diff + fillna:
df.asfreq('M').groupby(pd.Grouper(freq='Y')).diff().fillna(df)
Value
2016-02-29 112.3
2016-03-31 130.7
2016-04-30 117.1
2016-05-31 119.4
2016-06-30 163.5
2016-07-31 114.6
2016-08-31 116.9
2016-09-30 177.3
2016-10-31 151.6
2016-11-30 154.9
2016-12-31 215.2
2017-01-31 75.0
2017-02-28 65.5
2017-03-31 149.9
2017-04-30 126.2
2017-05-31 131.6
2017-06-30 198.4
2017-07-31 116.9
2017-08-31 121.9
2017-09-30 174.7
2017-10-31 142.4
2017-11-30 163.2
2017-12-31 228.4
2018-01-31 74.0
2018-02-28 72.3
2018-03-31 154.6
2018-04-30 121.0
2018-05-31 142.2
2018-06-30 207.3
This assumes the index holds the dates and that "Value" is a float column.
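A quick self-contained check of the one-liner's logic on the first few values from the question (run here as a plain Series rather than a "Value" column):
import pandas as pd

idx = pd.date_range('2016-02-29', periods=5, freq='M')   # month-end dates
cum = pd.Series([112.3, 243.0, 360.1, 479.5, 643.0], index=idx)
# diff() within each year, then fillna() restores each year's first value
monthly = cum.groupby(pd.Grouper(freq='Y')).diff().fillna(cum)
print(monthly)   # 112.3, 130.7, 117.1, 119.4, 163.5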

Pandas rolling mean for Series returns NaN

Why do I receive NaN for the rolling mean? Here's the code and its output. Initially I thought my data was wrong, but a simple .mean() works OK.
print(df_train.head())
y_hat_avg['mean'] = df_train['pickups'].mean()
print(y_hat_avg.head())
y_hat_avg['moving_avg_forecast'] = df_train['pickups'].rolling(1).mean()
print(y_hat_avg.head())
Added some data:
pickups
date
2014-04-01 00:00:00 12
2014-04-01 01:00:00 5
2014-04-01 02:00:00 2
2014-04-01 03:00:00 4
2014-04-01 04:00:00 3
pickups mean
date
2014-08-01 00:00:00 19 47.25888
2014-08-01 01:00:00 26 47.25888
2014-08-01 02:00:00 9 47.25888
2014-08-01 03:00:00 4 47.25888
2014-08-01 04:00:00 11 47.25888
pickups mean moving_avg_forecast
date
2014-08-01 00:00:00 19 47.25888 NaN
2014-08-01 01:00:00 26 47.25888 NaN
2014-08-01 02:00:00 9 47.25888 NaN
2014-08-01 03:00:00 4 47.25888 NaN
2014-08-01 04:00:00 11 47.25888 NaN
df_train.index = pd.RangeIndex(len(df_train.index)) fixed the problem for me.
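A likely explanation, not spelled out in the answer: assigning a Series into a DataFrame aligns on index labels, and df_train's dates (April) don't overlap y_hat_avg's (August), so every aligned value becomes NaN; the plain .mean() worked because a scalar is broadcast without alignment. Assuming the two objects have the same length and row order, alignment can also be bypassed by assigning raw values:
y_hat_avg['moving_avg_forecast'] = df_train['pickups'].rolling(1).mean().to_numpy()  # positional, no index alignment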

Group data by time of the day

I have a dataframe with a datetime index. df.head(6):
NUMBERES PRICE
DEAL_TIME
2015-03-02 12:40:03 5 25
2015-03-04 14:52:57 7 23
2015-03-03 08:10:09 10 43
2015-03-02 20:18:24 5 37
2015-03-05 07:50:55 4 61
2015-03-02 09:08:17 1 17
The dataframe contains one week of data. Now I need to count the rows by time period of the day. If the time period is one hour, I know the following method works:
df_grouped = df.groupby(df.index.hour).count()
But I don't know what to do when the time period is half an hour. How can I achieve this?
UPDATE:
I was told that this question is similar to How to group DataFrame by a period of time?
But I had tried the methods mentioned there. Maybe it's my fault for not saying it clearly: 'DEAL_TIME' ranges from '2015-03-02 00:00:00' to '2015-03-08 23:59:59'. If I use pd.TimeGrouper(freq='30Min') or resample(), the time periods range from '2015-03-02 00:30' to '2015-03-08 23:30'. But what I want is a series like the one below:
COUNT
DEAL_TIME
00:00:00 53
00:30:00 49
01:00:00 31
01:30:00 22
02:00:00 1
02:30:00 24
03:00:00 27
03:30:00 41
04:00:00 41
04:30:00 76
05:00:00 33
05:30:00 16
06:00:00 15
06:30:00 4
07:00:00 60
07:30:00 85
08:00:00 3
08:30:00 37
09:00:00 18
09:30:00 29
10:00:00 31
10:30:00 67
11:00:00 35
11:30:00 60
12:00:00 95
12:30:00 37
13:00:00 30
13:30:00 62
14:00:00 58
14:30:00 44
15:00:00 45
15:30:00 35
16:00:00 94
16:30:00 56
17:00:00 64
17:30:00 43
18:00:00 60
18:30:00 52
19:00:00 14
19:30:00 9
20:00:00 31
20:30:00 71
21:00:00 21
21:30:00 32
22:00:00 61
22:30:00 35
23:00:00 14
23:30:00 21
In other words, the time period should be irrelevant to the date.
You need a 30-minute time grouper for this:
grouper = pd.TimeGrouper(freq="30T")  # in modern pandas, pd.Grouper(freq="30T")
You also need to remove the date part from the index; note that after reset_index() the column carries the index's name, 'DEAL_TIME':
df.index = df.reset_index()['DEAL_TIME'].apply(lambda x: x - pd.Timestamp(x.date()))
Now, you can group by time alone:
df.groupby(grouper).count()
You can find the somewhat obscure TimeGrouper documentation here: pandas resample documentation (it is actually the resample documentation, but both features use the same rules).
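Putting both steps together, a runnable sketch on synthetic data shaped like the question's (column names copied from the post; floor() on the time-of-day stands in for the grouper, which also avoids the deprecated TimeGrouper):
import pandas as pd
import numpy as np

rng = pd.date_range('2015-03-02', '2015-03-08 23:59:59', freq='7min')
df = pd.DataFrame({'NUMBERES': np.random.randint(1, 11, len(rng)),
                   'PRICE': np.random.randint(10, 70, len(rng))},
                  index=rng.rename('DEAL_TIME'))
tod = df.index - pd.to_datetime(df.index.date)   # time of day as a Timedelta
print(df.groupby(tod.floor('30T')).count())      # 48 half-hour bins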
In pandas, the most common way to group by time is to use the .resample() function. In v0.18.0 this function is two-stage: df.resample('M') creates an object to which we can apply other functions (mean, count, sum, etc.). The code snippet will be like:
df.resample('M').count()
You can refer here for an example.
