Resample data to add missing hour values - python

I'm working with a DataFrame that looks like this:
trans_id amount month day hour
2018-08-18 12:59:59+00:00 1 46 8 18 12
2018-08-26 01:56:55+00:00 2 20 8 26 1
I intend to get the average 'amount' for each hour. I use the following code to do that:
df2 = df.groupby(['month', 'day', 'day_name', 'hour'], as_index = False)['amount'].sum()
That gives me the total amount for each month/day/day_name/hour combination, which is fine. But when I count the hours for each day, they do not all come to 24 as expected, presumably because some (month, day, day_name, hour) combinations have no transactions.
My question is: how do I get all 24 hours for each day, regardless of whether they have records or not?
Thanks

Use Series.unstack with DataFrame.stack:
df2 = (df.groupby(['month', 'day', 'day_name', 'hour'])['amount']
.sum()
.unstack(fill_value=0)
.stack()
.reset_index())
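One caveat: unstack only creates a column for an hour that occurs at least once somewhere in the data, so an hour that never appears on any day is still missing afterwards. A minimal sketch of an alternative that reindexes against an explicit 0-23 range is shown here. The sample rows and the names totals, days and full_index are made up for illustration, and day_name is left out for brevity.

```python
import pandas as pd

# Hypothetical sample data shaped like the question's frame.
idx = pd.to_datetime([
    "2018-08-18 12:59:59+00:00",
    "2018-08-26 01:56:55+00:00",
    "2018-08-26 01:10:00+00:00",
])
df = pd.DataFrame({"amount": [46, 20, 5]}, index=idx)
df["month"] = df.index.month
df["day"] = df.index.day
df["hour"] = df.index.hour

# Total amount per (month, day, hour) combination that actually occurs.
totals = df.groupby(["month", "day", "hour"])["amount"].sum()

# Build every (month, day, hour) combination for the days present in the data.
days = totals.index.droplevel("hour").unique()
full_index = pd.MultiIndex.from_tuples(
    [(m, d, h) for (m, d) in days for h in range(24)],
    names=["month", "day", "hour"],
)

# Missing hours get a total of 0.
df2 = totals.reindex(full_index, fill_value=0).reset_index()
print(df2.groupby(["month", "day"])["hour"].count())
```

Every day in df2 then has exactly 24 rows, whether or not a given hour had transactions.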

I hope not to be wrong, but you can try this:
df2 = df.resample('1H').sum().copy()
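This resamples your dataset into hourly bins between the first and last timestamp and sums the values in each bin. Hours without any transactions show up with a sum of 0 (pass min_count=1 to sum if you want NaN instead). Note that it sums every numeric column, including month, day and hour, so you probably want to select 'amount' first.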
Late but hope it helps.
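A small sketch of what the hourly resample produces, assuming the frame has a DatetimeIndex as in the question. The two sample rows are invented, and "60min" is used as the frequency string only because the spelling of the hourly alias differs between pandas versions.

```python
import pandas as pd

# Hypothetical transactions three hours apart on the same day.
idx = pd.to_datetime(["2018-08-18 12:59:59+00:00", "2018-08-18 15:10:00+00:00"])
df = pd.DataFrame({"amount": [46, 20]}, index=idx)

# One bin per hour between the first and the last timestamp.
hourly = df["amount"].resample("60min").sum()
print(hourly)

# Empty bins become NaN instead of 0 when min_count=1 is passed.
hourly_nan = df["amount"].resample("60min").sum(min_count=1)
print(hourly_nan)
```

The bins only span from the first to the last observation, so the hours before 12:00 and after 15:00 on that day are not created by resample alone.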

Related

Creating year week based on date with different start date

I have a df
date
2021-03-12
2021-03-17
...
2022-05-21
2022-08-17
I am trying to add a column year_week, but my year-week numbering starts at 2021-06-28, which is the Monday of the week that contains the first day of July.
I tried:
df['date'] = pd.to_datetime(df['date'])
df['year_week'] = (df['date'] - timedelta(days=datetime(2021, 6, 24).timetuple()
.tm_yday)).dt.isocalendar().week
I played around with the timedelta days values so that the 2021-06-28 has a value of 1.
But then I ran into problems with dates before my start date and dates more than one year after it:
2021-03-12 has a value of 38
2022-08-17 has a value of 8
So it looks like the valid period is from 2021-06-28 + 1 year.
date year_week
2021-03-12 38 # LY38
2021-03-17 39 # LY39
2021-06-28 1 # correct
...
2022-05-21 47 # correct
2022-08-17 8 # NY8
Is there a way to get around this? Because I aggregate the data by year week, the past and upcoming dates give me incorrect results. I would like negative week numbers for the days before 2021-06-28, or a label such as LY38 denoting week 38 of the previous year, and correspondingly week numbers above 52, or a label such as NY8 denoting the 8th week of the next year.
Here is one way; I added two dates that are more than a year away from the start date. Take isocalendar() of the difference between the date column and the day of year of your start date, then pick the label format according to how the resulting ISO year compares with the year of the start date. np.select handles the different result formats.
import numpy as np
import pandas as pd

# dummy dataframe
df = pd.DataFrame(
    {'date': ['2020-03-12', '2021-03-12', '2021-03-17', '2021-06-28',
              '2022-05-21', '2022-08-17', '2023-08-17']
    }
)
# define start date
d = pd.to_datetime('2021-6-24')
# subtract the start date's day of year from each date
s = (pd.to_datetime(df['date']) - pd.Timedelta(days=d.day_of_year)
).dt.isocalendar()
# get the difference in year
m = (s['year'].astype('int32') - d.year)
# all condition of result depending on year difference
conds = [m.eq(0), m.eq(-1), m.eq(1), m.lt(-1), m.gt(1)]
choices = ['', 'LY','NY',(m+1).astype(str)+'LY', '+'+(m-1).astype(str)+'NY']
# create the column
df['res'] = np.select(conds, choices) + s['week'].astype(str)
print(df)
date res
0 2020-03-12 -1LY38
1 2021-03-12 LY38
2 2021-03-17 LY39
3 2021-06-28 1
4 2022-05-21 47
5 2022-08-17 NY8
6 2023-08-17 +1NY8
I think pandas period_range can be of some help:
pd.Series(pd.period_range("6/28/2017", freq="W", periods=n_weeks))  # n_weeks: the number of weeks you want
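If plain signed week numbers are acceptable (the question mentions negative values and values above 52 as an option), a simpler sketch is to count whole weeks elapsed since the start date. This is a different technique from the isocalendar approach above; the names start and year_week and the sample dates are only illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {"date": ["2021-03-12", "2021-03-17", "2021-06-28", "2022-05-21", "2022-08-17"]}
)
df["date"] = pd.to_datetime(df["date"])

# Assumed start of the custom year: week 1 begins on this Monday.
start = pd.Timestamp("2021-06-28")

# Whole weeks since the start date (floor division), shifted so the first week is 1.
# Dates before the start get zero or negative numbers, dates more than a year
# later get numbers above 52.
df["year_week"] = (df["date"] - start).dt.days // 7 + 1
print(df)
```

Here 2021-06-28 gets 1 and 2022-05-21 gets 47, matching the values marked as correct in the question, while 2022-08-17 gets 60 (week 8 of the following year) and 2021-03-12 gets -15.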

How to extract hours from a pandas.datetime?

I have a pandas dataframe to which I applied pandas.to_datetime. Now I want to extract the hours/minutes/seconds from each timestamp. I used df.index.day to get the days, and now I want to know whether there are different hours in my index.
For example, if I have two dates d1 = 2020-01-01 00:00:00 and d2 = 2020-01-02 00:00:00, I can't assume I should apply a smoothing operator by hour, because that makes no sense.
So what I want to know is: how do I know whether a day has different hours, minutes or seconds?
Thank you in advance
I think you should use df[index].dt provided by pandas.
You can extract day, week, hour, minute, second by using it.
Please see this.
dir(df[index].dt)
Here is an example.
import pandas as pd
df = pd.DataFrame([["2020-01-01 06:31:00"], ["2020-03-12 10:21:09"]])
print(df)
df['time'] = pd.to_datetime(df[0])  # the frame was built without column names, so the column is 0
df['dates'] = df['time'].dt.date
df['hour'] = df['time'].dt.hour
df['minute'] = df['time'].dt.minute
df['second'] = df['time'].dt.second
Now your df should look like this.
0 time dates hour minute second
0 2020-01-01 06:31:00 2020-01-01 06:31:00 2020-01-01 6 31 0
1 2020-03-12 10:21:09 2020-03-12 10:21:09 2020-03-12 10 21 9
If d1 and d2 are datetime or Timestamp objects, you can get the hour, minute and second using attributes hour , minute and second
print(d1.hour,d1.minute,d1.second)
print(d2.hour,d2.minute,d2.second)
Similarly, year, month and day can also be extracted.
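To answer the original question more directly, one possible check is to compare a DatetimeIndex with its normalized (midnight) version. This is only a sketch: the helper name has_intraday_times and the sample indexes are hypothetical.

```python
import pandas as pd

def has_intraday_times(index):
    """Return True if any timestamp in the DatetimeIndex is not exactly midnight."""
    return bool((index != index.normalize()).any())

daily = pd.to_datetime(["2020-01-01 00:00:00", "2020-01-02 00:00:00"])
mixed = pd.to_datetime(
    ["2020-01-01 06:31:00", "2020-01-01 10:21:09", "2020-01-02 00:00:00"]
)

print(has_intraday_times(daily))  # only midnight timestamps
print(has_intraday_times(mixed))  # contains hours, minutes and seconds

# Number of distinct times of day per calendar date.
times_per_day = pd.Series(mixed.time, index=mixed).groupby(mixed.normalize()).nunique()
print(times_per_day)
```

A date with more than one distinct time in times_per_day has sub-daily observations, which is where hourly smoothing could make sense.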

Week on Week, Day on Day and YoY Calculations on Pandas DF

I have the following dataframe which is at a day level:
BillDate S2Rate
4 2019-06-04 4686.5
3 2019-06-03 1557.5
2 2019-05-21 10073.5
1 2019-05-19 6501.5
0 2019-05-18 1378.0
I want to calculate WoW percentage, WoW increase or decrease using this data. How do I do this?
Also how do I replicate this for a YoY and Day on Day?
You should use resample. Then you can use functions like pct_change and diff to get the differences:
# df["BillDate"] = pd.to_datetime(df["BillDate"])
week_over_week = df.set_index("BillDate").resample("W").sum()
week_over_week_pct = week_over_week.pct_change()
week_over_week_increase = week_over_week.diff()
You can replace the parameter for resample with "D" for day over day, "Y" for year over year and many other options for more complex time ranges.
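As a worked sketch, here is the same idea applied to the five rows from the question. The variable names weekly, wow_change and wow_pct are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "BillDate": ["2019-06-04", "2019-06-03", "2019-05-21", "2019-05-19", "2019-05-18"],
    "S2Rate": [4686.5, 1557.5, 10073.5, 6501.5, 1378.0],
})
df["BillDate"] = pd.to_datetime(df["BillDate"])

# Weekly totals; each week is labelled with the Sunday that ends it.
weekly = df.set_index("BillDate").sort_index()["S2Rate"].resample("W").sum()

# Absolute and percentage change versus the previous week.
wow_change = weekly.diff()
wow_pct = weekly.pct_change() * 100

print(pd.DataFrame({"total": weekly, "change": wow_change, "pct": wow_pct}))
```

Weeks with no bills sum to 0, so the percentage change is -100 for such a week and infinite for the week after it; you may want to drop or mask those rows before reporting.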
Set BillDate as index after coercing it to a datetime
df.set_index(pd.to_datetime(df['BillDate']), inplace=True)
df
Get rid of BillDate from columns now that you moved it to index
df.drop(columns=['BillDate'], inplace=True)
Resample to required period, calculate sum and percentage change
df.resample('W')['S2Rate'].sum().pct_change().to_frame()
Please note that resample labels each bin with the last date of the period by default:
'W' - labels the week with its Sunday
'M' - labels the month with its last day

Pandas DatetimeIndex and to_datetime discrepancies when calculate (format) the same date

I've got a simple task of creating consecutive days and doing some calculations on them.
I did it using:
date = pd.DatetimeIndex(start='2019-01-01', end='2019-01-10',freq='D')
df = pd.DataFrame([date, date.week, date.dayofweek], index=['Date','Week', 'DOW']).T
df
and now I want to calculate back the date from week and day of week using:
df['Date2'] = pd.to_datetime('2019' + df['Week'].map(str) + df['DOW'].map(str), format='%Y%W%w')
The result I get is:
As I understand it, DatetimeIndex uses a different method of calculating the week number: 1 Jan 2019 should be Week=0 and DOW=2, which is what I get when I run pd.to_datetime('201902', format='%Y%W%w'): Timestamp('2019-01-01 00:00:00')
Similar questions were asked here and here, but in both of them the discrepancy came from different time zones, and I don't use time zones here.
Thanks for help!
According to the documentation https://github.com/d3/d3-time-format#api-reference,
it appears %W is Monday-based week whereas %w is Sunday-based weekday.
I ran the code below to get back the expected result:
date = pd.DatetimeIndex(start='2019-01-01', end='2019-01-10',freq='D')
df = pd.DataFrame([date, date.week, date.weekday_name, date.dayofweek], index=['Date','Week', 'Weekday', 'DOW']).T
df['Week'] = df['Week'] - 1
df['Date2'] = pd.to_datetime('2019' + df['Week'].map(str) + df['Weekday'].map(str), format='%Y%W%A', box=True)
Notice that 2018-12-31 is in the first week of year 2019
Date Week Weekday DOW Date2
0 2018-12-31 00:00:00 0 Monday 0 2018-12-31
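The code above relies on pandas APIs from the time of the answer. As a hedged alternative for newer versions, the sketch below round-trips dates through ISO year, week and day instead of %W and %w, so no manual week offset is needed. It assumes pandas 1.1+ (for DatetimeIndex.isocalendar) and Python 3.8+ (for date.fromisocalendar); the column names are illustrative.

```python
from datetime import date

import pandas as pd

dates = pd.date_range("2018-12-31", "2019-01-10", freq="D")

# ISO calendar components: ISO year, ISO week number, ISO weekday (1 = Monday).
iso = dates.isocalendar()
df = pd.DataFrame({
    "Date": dates,
    "ISOYear": iso["year"].astype(int).to_numpy(),
    "ISOWeek": iso["week"].astype(int).to_numpy(),
    "ISODay": iso["day"].astype(int).to_numpy(),
})

# Rebuild each date from its ISO components.
df["Date2"] = pd.to_datetime([
    date.fromisocalendar(int(y), int(w), int(d))
    for y, w, d in zip(df["ISOYear"], df["ISOWeek"], df["ISODay"])
])
print(df)
```

Because the ISO year is carried along with the week, 2018-12-31 (ISO week 1 of 2019) is reconstructed correctly without subtracting 1 from the week.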

Extracting date components in pandas series

I have problems with transforming a Pandas dataframe column with dates to a number.
import matplotlib.dates
import datetime
for x in arsenalchelsea['Datum']:
    year = int(x[:4])
    month = int(x[5:7])
    day = int(x[8:10])
    hour = int(x[11:13])
    minute = int(x[14:16])
    sec = int(x[17:19])
    arsenalchelsea['floatdate']=date2num(datetime.datetime(year, month, day, hour, minute, sec))
arsenalchelsea
I want to make a new column in my dataframe with the dates as numbers, because I want to make a line graph later with the date on the x-axis.
This is the format of the date:
2017-11-29 14:06:45
Does anyone have a solution for this problem?
Slicing strings to get date components is bad practice. You should convert to datetime and extract directly.
In this case, it seems you can just use pd.to_datetime, but below I also demonstrate how you can extract the various components once you have performed the conversion.
import pandas as pd

df = pd.DataFrame({'Date': ['2017-01-15 14:55:42', '2017-11-10 12:15:21', '2017-12-05 22:05:45']})
df['Date'] = pd.to_datetime(df['Date'])
df[['year', 'month', 'day', 'hour', 'minute', 'sec']] = \
    df['Date'].apply(lambda x: (x.year, x.month, x.day, x.hour, x.minute, x.second)).apply(pd.Series)
Result:
Date year month day hour minute sec
0 2017-01-15 14:55:42 2017 1 15 14 55 42
1 2017-11-10 12:15:21 2017 11 10 12 15 21
2 2017-12-05 22:05:45 2017 12 5 22 5 45
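The same components can also be read through the vectorized .dt accessor, which avoids the row-wise apply. The sketch below additionally derives a plain numeric column (seconds since 1970-01-01) in case a number is really needed; the column name epoch_seconds is made up for illustration. matplotlib can usually plot a datetime column on the x-axis directly, so the numeric conversion is often unnecessary.

```python
import pandas as pd

df = pd.DataFrame(
    {"Date": ["2017-01-15 14:55:42", "2017-11-10 12:15:21", "2017-12-05 22:05:45"]}
)
df["Date"] = pd.to_datetime(df["Date"])

# Vectorized extraction of each component via the .dt accessor.
for part in ["year", "month", "day", "hour", "minute", "second"]:
    df[part] = getattr(df["Date"].dt, part)

# Numeric representation: whole seconds elapsed since the Unix epoch.
df["epoch_seconds"] = (df["Date"] - pd.Timestamp("1970-01-01")) // pd.Timedelta(seconds=1)
print(df)
```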
