A CSV file with data on orders (for meals to be delivered) was provided; the document comprises the following columns:
dateTime,restaurant,address,zippcodeFrom,zippcodeTo,dist,tm
The format for dateTime is: YYYY-MM-DD HH:MM:ss
I'd personally prefer to use MS Excel and apply the FFT (fast Fourier transform) to forecast based on the time-series data. However, this is a Python course, and the file is too large for MS Excel.
Getting the average number of orders per day of the week would be a start. But if I try the aggregation function, it sums the orders of all Mondays together.
How can I retrieve either the average number of orders per weekday, or the total number of Mondays (so I can divide the total number of orders on all Mondays by the number of Mondays)? Subsequently we have to do the same for the average total travel time (tm in the CSV file) for delivery.
The challenge: multiple orders per day result in multiple lines of data for each day.
(The next thing is to get some kind of forecast hourly...)
What would be the best way to solve this?
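A minimal pandas sketch of one approach, assuming the column names from the header above (the file name orders.csv is hypothetical):

import pandas as pd

df = pd.read_csv('orders.csv', parse_dates=['dateTime'])
# one CSV row = one order, so count rows per calendar day
daily = df.set_index('dateTime').resample('D').size()
# average daily order count per weekday (Monday=0 .. Sunday=6)
avg_orders = daily.groupby(daily.index.dayofweek).mean()
# same idea for the mean travel time per weekday
avg_tm = df.groupby(df['dateTime'].dt.dayofweek)['tm'].mean()

resample('D') fills in days with zero orders, so the weekday means divide by the full count of Mondays in the date range, not just the Mondays that happen to appear in the file.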
I have a data frame with a date column and a sales volume column.
There are days that I need to set sales volumes to zero and then distribute those volumes equally to the volumes of the next five non-weekend days.
So if I have a volume of 100 on a Monday, each of the next five business days gets its volume + (100/5).
I've tried some workarounds using date_range, but haven't had any success. I'm very lost on how to do this since I'm not so good with datetime methods.
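A sketch of one way to do this, assuming a date-indexed frame with a 'volume' column and a hypothetical list dates_to_zero naming the days to clear:

import pandas as pd

def redistribute(df, dates_to_zero):
    for day in pd.to_datetime(dates_to_zero):
        vol = df.loc[day, 'volume']
        df.loc[day, 'volume'] = 0
        # bdate_range skips weekends, so these are the next five business days
        targets = pd.bdate_range(day + pd.Timedelta(days=1), periods=5)
        df.loc[df.index.isin(targets), 'volume'] += vol / 5
    return df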
I am new to Python and am using it to analyse climate data in NetCDF. I want to calculate the total precipitation for each season in each year and then average these seasonal totals across the time period (i.e. an average for DJF over all years in the file, an average for MAM, etc.).
Here is what I thought to do:
import xarray as xr
import matplotlib.pyplot as plt

fn1 = 'cru_fixed.nc'
ds1 = xr.open_dataset(fn1)
# group the monthly values by season and average them over time
ds1_season = ds1['pre'].groupby('time.season').mean('time')
# Then plot each season
ds1_season.plot(col='season')
plt.show()
The original file contains monthly totals of precipitation. This is calculating an average for each season and I need the sum of Dec, Jan and Feb and the sum of Mar, Apr, May etc. for each season in each year. How do I sum and then average over the years?
If I'm not mistaken, you need to first resample your data to get the sum of each season in a DataArray, then average these sums over multiple years.
To resample:
sum_of_seasons = ds1['pre'].resample(time='Q').sum(dim="time")
resample is an operator to upsample or downsample time series; it uses the time offsets of pandas.
However, be careful to choose the right offset: it defines which months are included in each season. Depending on your needs, you may want to use "Q", "QS", or an anchored offset like "QS-DEC".
To have the same splitting as "time.season", the offset is "QS-DEC" I believe.
Then to group over multiple years, same as you did above:
result = sum_of_seasons.groupby('time.season').mean('time')
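Putting the two steps together (a sketch reusing the file and variable names from the question):

import xarray as xr

ds1 = xr.open_dataset('cru_fixed.nc')
# seasonal totals, with seasons anchored so DJF starts in December
sum_of_seasons = ds1['pre'].resample(time='QS-DEC').sum(dim='time')
# average each season's totals across all years
result = sum_of_seasons.groupby('time.season').mean('time')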
So for argument's sake, here is an example of auto_arima for daily data:

from pmdarima import auto_arima

auto_arima(df['orders'], seasonal=True, m=7)
Now in that example, after running a seasonal decomposition that showed weekly seasonality, I "think" you select m=7? Is this correct, given that the seasonality is weekly?
My first question is as follows: if seasonality is monthly, do you use 12? If it is annual, do you use 1? And is there ever a reason to select 365 for daily data?
Secondly, if the data you are given is already weekly, e.g.

date         weekly tot
2021/01/01   10,000
2021/01/07   15,000
2021/01/14   9,000
and so on...
and you do the seasonal decomposition, would m=1 be used for weekly, m=4 for monthly, and m=52 for annual seasonality?
Finally, if it's monthly, like so:

date         monthly tot
2020/01/01   10,000
2020/02/01   15,000
2020/03/01   9,000
and so on...
and you do the seasonal decomposition, would m=1 be used for monthly and m=12 for annual seasonality?
Any help would be greatly appreciated, I just want to be able to confidently select the right criteria.
A season is a recurring pattern in your data, and m is the length of that season. m in this case is not a code or anything, but simply the length:
Imagine the weather: if you had the weekly average temperature, it would rise in the summer and fall in the winter. Since the length of one "season" is a year, or 52 weeks, you set m to 52.
If you had a repeating pattern every quarter, then m would be 13, since a quarter is roughly 13 weeks (52/4). It always depends on your data and your use case.
To your questions:
If seasonality is Monthly do you use 12?
If the pattern you are looking for repeats every 12 months, then yes; if it repeats every 3 months, it would be 3, and so on.
If it is Annually do you use 1?
A seasonality of 1 does not really make sense, since it would mean that you have a repeating pattern in every single data point.
And is there ever a reason to select 365 for daily?
If your data is daily and the pattern repeats every 365 days (meaning every year) then yes (you need to remember that every fourth year has 366 days though).
I hope you get the concept behind seasonality and m so you can answer the rest.
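To tie this back to your examples, a hypothetical sketch using pmdarima's auto_arima (the column names weekly_tot and monthly_tot stand in for the tables above):

from pmdarima import auto_arima

# weekly data whose pattern repeats once a year -> m=52
model_weekly = auto_arima(df['weekly_tot'], seasonal=True, m=52)
# monthly data whose pattern repeats once a year -> m=12
model_monthly = auto_arima(df['monthly_tot'], seasonal=True, m=12)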
I have two time series giving the electricity demand at one-hour resolution and at five-minute resolution, and I am trying to find the maximum difference between them. The one-hour data has 8760 rows (hourly for a year) and the 5-minute data has 104,722 rows (5-minute intervals for a year).
The only method I can think of is to expand the hourly data to 5-minute resolution, repeating each hourly value 12 times, and then take the maximum of the difference of the two data sets.
If this technique is the way to go, is there an easy way to convert my hourly data to 5-minute resolution by repeating each hourly value 12 times?
For your reference, I posted a plot of this data for one day.
P.S. I am using Python to do this task.
NumPy's .repeat() function
You can change your hourly data into 5-minute data by using NumPy's repeat function:

import numpy as np

# each hourly value covers twelve 5-minute slots
np.repeat(hourly_data, 12)
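From there, the maximum difference follows directly (a sketch where five_min_data is a hypothetical array holding the 5-minute series; it assumes the two arrays align and have equal length, which the row counts in the question suggest checking first):

max_diff = np.max(np.abs(np.repeat(hourly_data, 12) - five_min_data))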
I would strongly recommend against converting the hourly data into five-minute data. If the data in both cases refers to the mean load of those time ranges, you'll be looking at more accurate data if you group the five-minute intervals into hourly datasets. You'd get more granularity the way you're talking about, but the granularity is not based on accurate data, so you're not actually getting more value from it. If you aggregate the five-minute chunks into hourly chunks and compare the series that way, you can be more confident in the trustworthiness of your results.
In order to group them together to get that result, you can define a function like the following and use the apply method:
from datetime import datetime as dt

def to_hour(date):
    # zero out minutes and seconds to bucket each timestamp into its hour
    date = date.strftime("%Y-%m-%d %H:00:00")
    date = dt.strptime(date, "%Y-%m-%d %H:%M:%S")
    return date

df['Aggregated_Datetime'] = df['Original_Datetime'].apply(lambda x: to_hour(x))
# the last line was cut off in the original; 'Real-Time Load' is an assumed column name
df.groupby('Aggregated_Datetime')['Real-Time Load'].agg('mean')
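As a side note, pandas can do the same aggregation in one step with resample (a sketch assuming 'Original_Datetime' is a datetime column and the 5-minute values sit in a hypothetical 'load' column):

hourly_mean = df.set_index('Original_Datetime')['load'].resample('H').mean()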
I am trying to calculate night-time averages of a dataframe, except that what I need is a mix between a daily average and an hour-range average.
More specifically, I have a dataframe storing day and night hours, and I want to use it as a boolean key to calculate night-time averages of another dataframe.
I cannot use daily averages because each night spreads over two calendar days, and I cannot use an hour range either because the hours change with the seasons.
Thanks for your help!
Dariush.
Based on the comments received, here is what I am looking for; see the spreadsheet below. I need to calculate the average of 'Value' during nighttime using the nighttime flag, and then repeat that average for all timestamps until the following night, at which time the average is updated and repeated until the next nighttime flag.
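A minimal sketch of one approach, assuming df has a 'Value' column and a boolean 'Night' flag column (both names hypothetical, standing in for the spreadsheet columns):

import pandas as pd

night = df['Night'].astype(bool)
# give each night its own id; the id only increments when a night starts,
# so a night spanning midnight stays in one group
night_id = (night & ~night.shift(fill_value=False)).cumsum()
# average 'Value' within each night
night_means = df['Value'][night].groupby(night_id[night]).mean()
# every row carries the id of the most recent night, so mapping the means
# repeats each night's average until the next night begins
df['Night_Avg'] = night_id.map(night_means)

Rows before the first night map to NaN; every later row holds the latest night's average, updated as each new night starts.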