I have the following dataframe which is at a day level:
BillDate S2Rate
4 2019-06-04 4686.5
3 2019-06-03 1557.5
2 2019-05-21 10073.5
1 2019-05-19 6501.5
0 2019-05-18 1378.0
I want to calculate WoW percentage, WoW increase or decrease using this data. How do I do this?
Also, how do I replicate this for YoY and day-on-day?
You should use resample. Then you can use functions like pct_change and diff to get the differences:
df["BillDate"] = pd.to_datetime(df["BillDate"])  # make sure BillDate is a datetime first
week_over_week = df.set_index("BillDate").resample("W").sum()
week_over_week_pct = week_over_week.pct_change()
week_over_week_increase = week_over_week.diff()
You can replace the parameter for resample with "D" for day over day, "Y" for year over year and many other options for more complex time ranges.
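Applied to the sample frame from the question, that looks like this (a minimal sketch; only the column names come from the question):

```python
import pandas as pd

df = pd.DataFrame({
    "BillDate": ["2019-05-18", "2019-05-19", "2019-05-21",
                 "2019-06-03", "2019-06-04"],
    "S2Rate": [1378.0, 6501.5, 10073.5, 1557.5, 4686.5],
})
df["BillDate"] = pd.to_datetime(df["BillDate"])

# weekly totals, then WoW deltas
weekly = df.set_index("BillDate").resample("W")["S2Rate"].sum()
wow_pct = weekly.pct_change()   # week-over-week fractional change
wow_inc = weekly.diff()         # week-over-week absolute increase/decrease
```

Note that pct_change returns a fraction, so multiply by 100 if you want a percentage.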
Set BillDate as index after coercing it to a datetime
df.set_index(pd.to_datetime(df['BillDate']), inplace=True)
df
Get rid of BillDate from columns now that you moved it to index
df.drop(columns=['BillDate'], inplace=True)
Resample to required period, calculate sum and percentage change
df.resample('W')['S2Rate'].sum().pct_change().to_frame()
Please note that resample labels each bin with the last date in the period:
'W' - sets the date to the Sunday ending the week
'M' - sets the date to the last date in the month
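A quick toy example to see the 'W' labelling in action:

```python
import pandas as pd

# 2019-06-03 and 2019-06-04 fall in one week, 2019-06-10 in the next
s = pd.Series([1, 2, 3],
              index=pd.to_datetime(["2019-06-03", "2019-06-04", "2019-06-10"]))

weekly = s.resample("W").sum()
# each weekly bin is labelled with the Sunday that ends it
print(weekly)
```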
Im working with a df that looks like this :
trans_id amount month day hour
2018-08-18 12:59:59+00:00 1 46 8 18 12
2018-08-26 01:56:55+00:00 2 20 8 26 1
I intend to get the average 'amount' at each hour. I use the following code to do that:
df2 = df.groupby(['month', 'day', 'day_name', 'hour'], as_index = False)['amount'].sum()
That gives me the total amount for each (month, day, day_name, hour) combination, which is OK. But when I count the hours for each day, they don't all have the expected 24 rows. I imagine that is because some transactions don't exist at a specific (month, day, day_name, hour).
My question is: how do I get all 24 hours for every day, whether or not they have records?
Thanks
Use Series.unstack with DataFrame.stack:
df2 = (df.groupby(['month', 'day', 'day_name', 'hour'])['amount']
.sum()
.unstack(fill_value=0)
.stack()
.reset_index())
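On a tiny made-up frame (just day/hour here for brevity), the effect of the unstack/stack round trip looks like this. One caveat: it only fills in hours that occur somewhere in the data:

```python
import pandas as pd

df = pd.DataFrame({"day": [1, 1, 2],
                   "hour": [0, 1, 0],
                   "amount": [10, 20, 30]})

df2 = (df.groupby(["day", "hour"])["amount"]
         .sum()
         .unstack(fill_value=0)   # hours become columns, missing combos -> 0
         .stack()
         .rename("amount")
         .reset_index())
# day 2 / hour 1 now exists with amount 0
```

If some hour never appears anywhere in the data, you can add `.reindex(columns=range(24), fill_value=0)` right after the unstack to guarantee all 24 columns.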
I hope not to be wrong, but you can try this:
df2 = df.resample('1H').sum().copy()
This will resample your dataset to one row per hour, from the first to the last timestamp, and sum the values. Hours with no transactions still get a row (sum fills them with 0; pass min_count=1 to sum if you want NaN instead).
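A sketch on the two sample rows from the question (assuming a datetime index; lowercase "1h" is the alias current pandas prefers):

```python
import pandas as pd

s = pd.Series([46, 20],
              index=pd.to_datetime(["2018-08-18 12:59:59",
                                    "2018-08-26 01:56:55"]),
              name="amount")

# one row per hour between the first and last timestamp;
# hours with no transactions sum to 0
hourly = s.resample("1h").sum()
```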
Late but hope it helps.
I am working on time-series data, where I have two columns date and quantity. The date is day wise. I want to add all the quantity for a month and convert it into a single date.
date is my index column
Example
quantity
date
2018-01-03 30
2018-01-05 45
2018-01-19 30
2018-02-09 10
2018-02-19 20
Output :
quantity
date
2018-01-01 105
2018-02-01 30
Thanks in advance!!
You can downsample to combine the data for each month, chaining the sum method to add it up.
df.resample("MS").sum()
Note that "MS" labels each month at its start (2018-01-01), matching the desired output; "M" would label it with the month end (2018-01-31).
Check out the pandas user guide on resampling here.
You'll need to make sure your index is in datetime format for this to work. So first do: df.index = pd.to_datetime(df.index). Hat tip to sammywemmy for the same advice in the comments.
You can also use groupby to get the same result.
df.index = pd.to_datetime(df.index)
df.groupby(df.index.strftime('%Y-%m-01')).sum()
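Run on the sample data, the groupby version reproduces the desired first-of-month labels:

```python
import pandas as pd

df = pd.DataFrame(
    {"quantity": [30, 45, 30, 10, 20]},
    index=pd.to_datetime(["2018-01-03", "2018-01-05", "2018-01-19",
                          "2018-02-09", "2018-02-19"]),
)
df.index.name = "date"

# group all days of a month under its first day and sum
monthly = df.groupby(df.index.strftime("%Y-%m-01")).sum()
```

Here the labels are strings; df.resample("MS").sum() gives the same totals with real Timestamps at each month start.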
I have two time series, df1
day cnt
2020-03-01 135006282
2020-03-02 145184482
2020-03-03 146361872
2020-03-04 147702306
2020-03-05 148242336
and df2:
day cnt
2017-03-01 149104078
2017-03-02 149781629
2017-03-03 151963252
2017-03-04 147384922
2017-03-05 143466746
The problem is that the sensors I'm measuring are sensitive to the day of the week, so on Sunday, for instance, they will produce less cnt. Now I need to compare the time series over 2 different years, 2017 and 2020, but to do that I have to align (March, in this case) to the matching day of the week, and plot them accordingly. How do I "shift" the data to make the series comparable?
The ISO calendar represents a date as a tuple (year, weeknumber, weekday). In pandas these are available through the dt accessor as year, isocalendar().week (weekofyear is deprecated in newer pandas) and weekday. So assuming that the day column actually contains Timestamps (convert it first with to_datetime if it does not), you could do:
df1['Y'] = df1.day.dt.year
df1['W'] = df1.day.dt.isocalendar().week
df1['D'] = df1.day.dt.weekday
Then you could align the dataframes on the W and D columns
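For example, an inner merge on those helper columns lines up matching (week, weekday) pairs across the two years. A sketch using the sample rows (with this short window only 2020-03-01 and 2017-03-05, both Sundays of ISO week 9, line up):

```python
import pandas as pd

df1 = pd.DataFrame({"day": pd.to_datetime(["2020-03-01", "2020-03-02", "2020-03-03",
                                           "2020-03-04", "2020-03-05"]),
                    "cnt": [135006282, 145184482, 146361872, 147702306, 148242336]})
df2 = pd.DataFrame({"day": pd.to_datetime(["2017-03-01", "2017-03-02", "2017-03-03",
                                           "2017-03-04", "2017-03-05"]),
                    "cnt": [149104078, 149781629, 151963252, 147384922, 143466746]})

for df in (df1, df2):
    df["W"] = df["day"].dt.isocalendar().week   # replaces deprecated weekofyear
    df["D"] = df["day"].dt.weekday

aligned = df1.merge(df2, on=["W", "D"], suffixes=("_2020", "_2017"))
```

Near year boundaries the ISO week of the same calendar date can belong to a different ISO year, so include isocalendar().year in the keys if your data crosses January.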
March 2017 started on a Wednesday.
March 2020 started on a Sunday.
So, delete the last 3 days of March 2017.
And delete the first Sunday, Monday and Tuesday from March 2020.
This way you have comparable days.
df1['cnt2020'] = df1['cnt']
df2['cnt2017'] = df2['cnt']
df1 = df1.iloc[3:, 2].reset_index(drop=True)
df2 = df2.iloc[:-3, 2].reset_index(drop=True)
Since you don't want to plot the date, but want the months to align, make a new dataframe with both columns and an index column. This way you will have 3 columns: index (0-27), 2017 and 2020. The index represents each day's position within the aligned period.
new_df = pd.concat([df1, df2], axis=1)
If you also want to plot the days of the week on the x axis, check out this link to learn how to get the day of the week from a date, and then change the x-tick labels.
Sorry for the written step-by-step; if it all sounds confusing, I can type the whole code later for you.
I have a multivariate time series array. The timeseries is currently aggregated in 10 second intervals:
**Time**
2016-01-11 17:00:00
2016-01-11 17:00:10
2016-01-11 17:00:20
I want to resample so that I get a 5 hour timeframe per day (it doesn't matter how the time is shown in the data frame, it just matters that it's being aggregated properly). I am resampling by the mean values.
**Time**
2016-01-11 10:00:00-15:00:00
2016-01-12 10:00:00-15:00:00
2016-01-13 10:00:00-15:00:00
How would one do this?
First I would filter the time period I want and groupby day:
# mask the hours we want
hours = df.index.hour
mask = (hours >= 10) & (hours <= 14)
# groupby
df[mask].groupby(df[mask].index.floor('D')).mean()
Toy data:
Times = pd.date_range('2016-01-11', '2016-01-14', freq='10s')
np.random.seed(1)
df = pd.DataFrame({'Time': Times,
'Value': np.random.randint(1,10, len(Times))})
gives:
Value
Time
2016-01-11 4.993333
2016-01-12 5.030556
2016-01-13 5.012778
df.groupby([df['Time'].dt.month, df['Time'].dt.day]).apply(lambda x: x.set_index('Time').resample('5H').mean())
You will have to group by the month and day of your Time column first, then resample the Time column in 5H (5 hour) bins, followed by .mean(), which takes the mean of your other columns.
The reason for the groupby is that you don't want 5-hour intervals across every hour of every day, only for the times present on each day. As long as each day's times span no more than 5 hours, you will get only one interval per day.
I've got a simple task of creating consecutive days and doing some calculations on them.
I did it using:
date = pd.date_range(start='2019-01-01', end='2019-01-10', freq='D')
df = pd.DataFrame([date, date.week, date.dayofweek], index=['Date','Week', 'DOW']).T
df
and now I want to calculate back the date from week and day of week using:
df['Date2'] = pd.to_datetime('2019' + df['Week'].map(str) + df['DOW'].map(str), format='%Y%W%w')
The result I get is:
As I understand it, DatetimeIndex has a different method of calculating the week number: 1 Jan 2019 should be Week=0 and DOW=2, and indeed running pd.to_datetime('201902', format='%Y%W%w') gives Timestamp('2019-01-01 00:00:00').
Similar questions were asked here and here, but for both of them the discrepancy came from different time zones, and here I don't use them.
Thanks for the help!
According to the documentation https://github.com/d3/d3-time-format#api-reference,
it appears %W is Monday-based week whereas %w is Sunday-based weekday.
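A quick way to check this (pandas follows Python's strptime here; week 0 covers the days before the first Monday of the year):

```python
import pandas as pd

# year 2019, week 0, weekday 2 (%w counts Sunday as 0, so 2 = Tuesday)
ts = pd.to_datetime("201902", format="%Y%W%w")
print(ts)  # Timestamp('2019-01-01 00:00:00')
```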
I ran the code below to get back the expected result:
date = pd.date_range(start='2019-01-01', end='2019-01-10', freq='D')
df = pd.DataFrame({'Date': date,
                   'Week': date.isocalendar().week.astype(int).to_numpy(),  # replaces deprecated date.week
                   'Weekday': date.day_name(),  # replaces deprecated date.weekday_name
                   'DOW': date.dayofweek})
df['Week'] = df['Week'] - 1
df['Date2'] = pd.to_datetime('2019' + df['Week'].map(str) + df['Weekday'].map(str), format='%Y%W%A')
Notice that 2018-12-31 is in the first week of year 2019
Date Week Weekday DOW Date2
0 2018-12-31 00:00:00 0 Monday 0 2018-12-31