I am trying to forecast daily profit using time series analysis, but the daily profit is not only recorded unevenly, some of the data is also missing.
Raw Data:

Date         Revenue
2020/1/19    10$
2020/1/20    7$
2020/1/25    14$
2020/1/29    18$
2020/2/1     12$
2020/2/2     17$
2020/2/9     28$
The above table is an example of the kind of data I have. Profit is not recorded daily, so the dates between 2020/1/20 and 2020/1/24 do not exist. On top of that, say the profit recorded during the period between 2020/2/3 and 2020/2/8 went missing from the database. I would like to recover this missing data and use time series analysis to predict the profit from 2020/2/9 onward.
My approach was to first aggregate the profit every 6 days, since I have to recover the profit between 2020/2/3 and 2020/2/8. So my cleaned data will look something like this:
Date                     Revenue
2020/1/16 ~ 2020/1/21    17$
2020/1/22 ~ 2020/1/27    14$
2020/1/28 ~ 2020/2/2     47$
2020/2/3 ~ 2020/2/8      ? (to predict)
After applying this to a time series model, I would like to further predict the profit from 2020/2/9 onward.
This is my general idea, but as a Python beginner using the pandas library, I have trouble executing it. Could you please show me how to aggregate the profit every 6 days so that the data looks like the table above?
The easiest way is to use the pandas resample function.
Provided you have a DatetimeIndex, resampling to aggregate profits every 6 days is as simple as your_dataframe.resample('6D').sum().
You can do all sorts of resampling (end of month, end of quarter, beginning of week, every hour, minute, second, ...). Check the full documentation if you're interested: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.resample.html?highlight=resample#pandas.DataFrame.resample
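For illustration, a minimal sketch with the example data from the question (by default the 6-day bins start at the first date in the index; the answers below show how to anchor them at 2020/1/16):

import pandas as pd

# Example data from the question.
df = pd.DataFrame({'Date': ['2020/1/19', '2020/1/20', '2020/1/25', '2020/1/29',
                            '2020/2/1', '2020/2/2', '2020/2/9'],
                   'Revenue': [10, 7, 14, 18, 12, 17, 28]})
df['Date'] = pd.to_datetime(df['Date'])
df = df.set_index('Date')

# Sum revenue within each 6-day bin.
print(df.resample('6D').sum())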
I suggest using a combination of .rolling, pd.date_range, and .reindex
Say your DataFrame is df, with proper datetime indexing:
import pandas as pd

df = pd.DataFrame([['2020/1/19', 10],
                   ['2020/1/20', 7],
                   ['2020/1/25', 14],
                   ['2020/1/29', 18],
                   ['2020/2/1', 12],
                   ['2020/2/2', 17],
                   ['2020/2/9', 28]], columns=['Date', 'Revenue'])
df['Date'] = pd.to_datetime(df['Date'])
df.set_index('Date', inplace=True)
The first step is to 'fill in' the missing days with dummy, zero revenue. We can use pd.date_range to get an index with evenly spaced dates from 2020/1/16 to 2020/2/8, and then .reindex to bring this into the main df DataFrame:
evenly_spaced_idx = pd.date_range(start='2020/1/16',end='2020/2/8',freq='1d')
df = df.reindex(evenly_spaced_idx, fill_value=0)
Now we can take a rolling sum over each 6-day period. We're not interested in every day's six-day revenue total, though, only every 6th day's:
summary_df = df.rolling('6d').sum().iloc[5::6, :]
The last step is to format summary_df so that it clearly states the date range each row refers to.
# Each window ends on the index date and spans 6 days inclusive, so the start is 5 days earlier.
summary_df['Start Date'] = summary_df.index - pd.Timedelta('5d')
summary_df['End Date'] = summary_df.index
summary_df.reset_index(drop=True,inplace=True)
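Printing the result should reproduce the question's 6-day totals (17, 14, 47, and 0 for the missing 2020/2/3 ~ 2020/2/8 window):

# Expected Revenue per window on the example data: 17, 14, 47, 0.
print(summary_df[['Start Date', 'End Date', 'Revenue']])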
You can use resample for this.
Make sure the "Date" column is of datetime type.
>>> df = pd.DataFrame([["2020/1/19" ,10],
... ["2020/1/20" ,7],
... ["2020/1/25" ,14],
... ["2020/1/29" ,18],
... ["2020/2/1" ,12],
... ["2020/2/2" ,17],
... ["2020/2/9" ,28]], columns=['Date', 'Revenue'])
>>> df['Date'] = pd.to_datetime(df.Date)
For pandas < 1.1.0
>>> df.set_index('Date').resample('6D', base=3).sum()
Revenue
Date
2020-01-16 17
2020-01-22 14
2020-01-28 47
2020-02-03 0
2020-02-09 28
For pandas >= 1.1.0
>>> df.set_index('Date').resample('6D', origin='2020-01-16').sum()
Revenue
Date
2020-01-16 17
2020-01-22 14
2020-01-28 47
2020-02-03 0
2020-02-09 28
Could someone give me a tip on how to use pandas groupby to find similar "days" in a time series dataset?
For example, my data is a building's electrical power and weather data (averaged daily values). I am attempting to see whether pandas groupby can be used to find the days most similar, in both electrical power usage and weather, to one specific date in the time stamp: July 25th, 2019.
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/bbartling/Data/master/stackoverflow_groupby_question.csv', parse_dates=True)
df['Date']=pd.to_datetime(df['Date'], utc=True)
df.set_index('Date', inplace=True)
df_daily_avg = df.resample('D').mean()
What I am trying to find are, say, the top 10 or 15 days in this dataset most similar to the averaged temperature on July 25th, which is:
july_25_temp_avg = df_daily_avg.loc['2019-07-25'].Temperature_C
22.047916666666676
And averaged building power which is:
july_25_power_avg = df_daily_avg.loc['2019-07-25'].kW
52.658333333333324
If I use groupby, like the line below, it strips away the timestamp index.
july25_most_similar = df_daily_avg.groupby(['kW','Temperature_C'],as_index=False).Temperature_C.mean()
It returns the following, where it seems the most similar days are at the bottom:
kW Temperature_C
0 9.316667 17.256250
1 9.433333 14.979167
2 9.616667 13.933333
3 9.683333 19.822917
4 10.116667 24.606250
... ... ...
360 58.741667 21.816667
361 61.250000 23.839583
362 61.633333 25.204167
363 62.483333 25.970833
364 63.808333 25.300000
Any tips on returning the timestamps/days most similar to July 25th's temperature and power would be greatly appreciated.
Also, if it is possible to use more criteria than just Temperature_C, could an additional answer show how to use more weather data? For example, the averaged power on July 25th plus more weather columns (beyond just Temperature_C) like Wind_Speed_m_s, Relative_Humidity, Pressure_mbar, and DHI_DNI?
I think I would take this approach:
indx = df_daily_avg.sub(df_daily_avg.loc['2019-07-25']).abs()\
.sort_values(['Temperature_C', 'kW']).head(10).index.normalize()
df[df.index.normalize().isin(indx)]
Take the difference from July 25th, take the absolute value, then keep the top ten days sorted on 'Temperature_C' and 'kW' (or some other metric that ranks the two).
Then take those indices, normalize them to dates, and select the rows of the original dataframe that match the retrieved index.
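To bring in more weather columns, as the question asks, one option (a sketch extending the same idea, not part of the original answer; the column names are assumed from the question) is to rank days by a single distance computed across several standardized columns:

# Hypothetical multi-column ranking: standardize the columns, then rank by Euclidean distance from July 25th.
cols = ['kW', 'Temperature_C', 'Wind_Speed_m_s', 'Relative_Humidity', 'Pressure_mbar']
scaled = (df_daily_avg[cols] - df_daily_avg[cols].mean()) / df_daily_avg[cols].std()
dist = ((scaled - scaled.loc['2019-07-25']) ** 2).sum(axis=1) ** 0.5
most_similar_days = dist.sort_values().head(10).index
print(df_daily_avg.loc[most_similar_days])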
Say I have a dataset at a daily scale, but not all days have valid data. In other words, some days are missing from the data. I want to compute the summer season mean from the dataset, and I want to remove any month that has fewer than 20 days of valid data.
How do I achieve this (in a pythonic fashion)?
Say my dataframe (df) is like this:
DATE VAR
1900-01-01 123
1900-01-02 456
1900-01-10 789
...
I know how to compute the count:
df_count = df.resample('MS').count()
I also know how to compute the summer season mean:
df_summer = df.resample('Q-NOV').mean()
You can use df_count to filter out the months that have fewer than 20 days of valid data, and then compute the summer season mean with your formula.
df_count = df.resample('MS').count()
# Keep only the months with at least 20 days of valid data.
relevant_month = df_count[df_count['VAR'] >= 20].index.to_period('M')
df_summer = df[df.index.to_period('M').isin(relevant_month)].resample('Q-NOV').mean()
I assume you store the dates in the index. If the date or time is stored in a different column, change df.index.to_period('M') to df.columnName.dt.to_period('M'). I also don't know the exact format of your time column (date or datetime), so you might need to adjust that part accordingly. This is just the general idea.
I have a dataframe which looks like the one below:
Year  Birthday  OnsetDate
5               2018/1/1
5               2018/2/2
Now I subtract the Year column from the OnsetDate column:
df['Birthday'] = df['OnsetDate'] - pd.to_timedelta(df['Year'], unit='Y')
but the resulting Birthday column is mixed with a time component, like below:
Birthday
2013/12/31 18:54:00
2013/1/30 18:54:00
The output above is just dummy data; what I am focused on is that the time component makes the date inaccurate after the operation. What is the solution to avoid the time being generated, so that I can get an accurate date?
Second question: I merge the above dataframe into another dataframe.
new.update(df)
and the Birthday column of the 'new' dataframe becomes like this:
Birthday
1164394440000000000
1165949640000000000
What actually caused this, and what is the solution?
For the first question, you should know that pd.to_timedelta does not produce a whole calendar year. If you print it, you can see that 1 year = 365 days 05:49:12.
print(pd.to_timedelta(1, unit='Y'))
365 days 05:49:12
If you want to avoid the time being generated, you can use DateOffset.
from pandas.tseries.offsets import DateOffset
df['Year'] = df['Year'].apply(lambda x: DateOffset(years=x))
df['Birthday'] = df['OnsetDate'] - df['Year']
Year OnsetDate Birthday
0 <DateOffset: years=5> 2018-01-01 2013-01-01
1 <DateOffset: years=5> 2018-02-02 2013-02-02
As for the second question, it is caused by the dtype of the column; you can use pd.to_datetime to solve it.
new['Birthday'] = pd.to_datetime(new['Birthday'])
I have a large pandas dataframe that has hourly data associated with it. I then want to parse that into "monthly" data that sums the hourly data. However, the months aren't necessarily calendar months; they typically start in the middle of one month and end in the middle of the next month.
I could build a list of the "months" that each of these date ranges fall into and loop through it, but I would think there is a much better way to do this via pandas.
Here's my current code; the last line throws an error and is the crux of the question:
import numpy as np
import pandas as pd

dates = pd.Series(pd.date_range('1/1/2015 00:00','3/31/2015 23:45',freq='1H'))
nums = np.random.randint(0,100,dates.count())
df = pd.DataFrame({'date':dates, 'num':nums})
month = pd.DataFrame({'start':['1/4/2015 00:00','1/24/2015 00:00'], 'end':['1/23/2015 23:00','2/23/2015 23:00']})
month['start'] = pd.to_datetime(month['start'])
month['end'] = pd.to_datetime(month['end'])
month['num'] = df['num'][(df['date'] >= month['start']) & (df['date'] <= month['end'])].sum()
I would want an output similar to:
start end num
0 2015-01-04 2015-01-23 23:00:00 33,251
1 2015-01-24 2015-02-23 23:00:00 39,652
but of course, I'm not getting that.
pd.merge_asof is only available with pandas 0.19+.
combination of pd.merge_asof + query + groupby
pd.merge_asof(df, month, left_on='date', right_on='start') \
.query('date <= end').groupby(['start', 'end']).num.sum().reset_index()
explanation
pd.merge_asof
From docs
For each row in the left DataFrame, we select the last row in the right DataFrame whose ‘on’ key is less than or equal to the left’s key. Both DataFrames must be sorted by the key.
But this only takes into account the start date.
query
I take care of end date with query since I now conveniently have end in my dataframe after pd.merge_asof
groupby
I trust this part is obvious.
Maybe you can offset by a number of days and then convert to a period.
# create data
dates = pd.Series(pd.date_range('1/1/2015 00:00','3/31/2015 23:45',freq='1H'))
nums = np.random.randint(0,100,dates.count())
df = pd.DataFrame({'date':dates, 'num':nums})
# offset days and then create period
df['periods'] = (df.date + pd.tseries.offsets.Day(23)).dt.to_period('M')
# group and sum
df.groupby('periods')['num'].sum()
Output
periods
2015-01 10051
2015-02 34229
2015-03 37311
2015-04 26655
You can then shift the dates back and make new columns
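A sketch of what that shifting back might look like (the start/end column names here are assumptions for illustration, not from the original answer):

# Undo the 23-day offset to recover the start and end of each custom "month".
monthly = df.groupby('periods')['num'].sum().reset_index()
monthly['start'] = monthly['periods'].dt.to_timestamp() - pd.tseries.offsets.Day(23)
monthly['end'] = monthly['periods'].dt.to_timestamp(how='end') - pd.tseries.offsets.Day(23)
print(monthly[['start', 'end', 'num']])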
I am trying to use pandas to compute daily climatology. My code is:
import random
import pandas as pd

dates = pd.date_range('1950-01-01', '1953-12-31', freq='D')
rand_data = [int(1000*random.random()) for i in range(len(dates))]
cum_data = pd.Series(rand_data, index=dates)
cum_data.to_csv('test.csv', sep="\t")
cum_data is the series of daily values indexed by dates from 1st Jan 1950 to 31st Dec 1953. I want to create a new vector of length 365, with the first element containing the average of rand_data for January 1st of 1950, 1951, 1952 and 1953. And so on for the second element...
Any suggestions how I can do this using pandas?
You can group by the day of the year, and then calculate the mean for these groups:
cum_data.groupby(cum_data.index.dayofyear).mean()
However, you have to be aware of leap years, which will cause problems with this approach. As an alternative, you can also group by the month and the day:
In [13]: cum_data.groupby([cum_data.index.month, cum_data.index.day]).mean()
Out[13]:
1 1 462.25
2 631.00
3 615.50
4 496.00
...
12 28 378.25
29 427.75
30 528.50
31 678.50
Length: 366, dtype: float64
Hoping it can be of some help, I want to post my solution for getting a climatology series with the same index and length as the original time series.
I use joris' solution to get a "model climatology" of 365/366 elements, then I build my desired series by taking values from this model climatology and the time index from my original time series.
This way, things like leap years are automatically taken care of.
# I start with my time series named 'serData'.
# I apply joris' solution to it, getting a 'model climatology' of length 365 or 366.
serClimModel = serData.groupby([serData.index.month, serData.index.day]).mean()
# Now I build the climatology series, taking values from serClimModel depending on the index of serData.
serClimatology = serClimModel.loc[list(zip(serData.index.month, serData.index.day))]
# Now serClimatology has a time index like this: (1, 1) ... (12, 31).
# So, as a final step, I take as time index the one of serData.
serClimatology.index = serData.index
@joris: thanks, your answer was just what I needed to use pandas to calculate daily climatologies, but you stopped short of the final step: re-mapping the (month, day) index back to an index of day of the year for all years, including leap years, i.e. 1 through 366. So I thought I'd share my solution for other users. 1950 through 1953 is 4 years with one leap year, 1952. Note that since random values are used, each run will give different results.
...
from datetime import date
doy = []
doy_mean = []
doy_size = []
for name, group in cum_data.groupby([cum_data.index.month, cum_data.index.day]):
    (mo, dy) = name
    # Note: can use any leap year here.
    yrday = (date(1952, mo, dy)).timetuple().tm_yday
    doy.append(yrday)
    doy_mean.append(group.mean())
    doy_size.append(group.count())
    # Note: useful climatology stats are also available via group.describe() returned as dict
    # desc = group.describe()
    # desc["mean"], desc["min"], desc["max"], std, quartiles, etc.
    # we lose the counts here.
new_cum_data = pd.Series(doy_mean, index=doy)
print(new_cum_data.loc[366])
>> 634.5
pd_dict = {}
pd_dict["mean"] = doy_mean
pd_dict["size"] = doy_size
cum_data_df = pd.DataFrame(data=pd_dict, index=doy)
print(cum_data_df.loc[366])
>> mean 634.5
>> size 4.0
>> Name: 366, dtype: float64
# and just to check Feb 29
print(cum_data_df.loc[60])
>> mean 343
>> size 1
>> Name: 60, dtype: float64
Grouping by month and day is a good solution. However, the cleaner idea of groupby(dayofyear) is still possible if you use an xarray CFTimeIndex instead of a pandas.DatetimeIndex, i.e.:
Delete Feb 29 by using
rand_data=rand_data[~((rand_data.index.month==2) & (rand_data.index.day==29))]
Replace the index of the above data with an xarray CFTimeIndex, i.e.:
index = xarray.cftime_range('1950-01-01', '1953-12-31', freq='D', calendar = 'noleap')
index = index[~((index.month==2)&(index.day==29))]
rand_data.index = index
Now, for both non-leap and leap years, the 60th day of year is March 1st, and the total number of days in a year is 365, so grouping by dayofyear correctly calculates the climatological daily mean.
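A minimal sketch putting these steps together (assuming the daily values live in a pandas Series like cum_data from the question; with the 'noleap' calendar the new index has no Feb 29, so that filter is skipped here):

import xarray as xr

# 1. Drop Feb 29 from the original daily series.
data = cum_data[~((cum_data.index.month == 2) & (cum_data.index.day == 29))]

# 2. Rebuild the time axis on a no-leap calendar of the same length (4 * 365 days).
noleap_time = xr.cftime_range('1950-01-01', '1953-12-31', freq='D', calendar='noleap')
da = xr.DataArray(data.values, coords={'time': noleap_time}, dims='time')

# 3. Group by day of year: exactly 365 groups, and day 60 is always March 1st.
climatology = da.groupby('time.dayofyear').mean()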