Generate daily time series data from monthly usage with Python

I have about two years of monthly gas usage for a city and want to generate daily usage such that the daily values sum to the monthly total while keeping the time-series shape, but I don't know how to do that.
Here is my data Link [1]

The following code sample demonstrates date and data interpolation using pandas.
The following steps are taken:
Read the provided dataset into a DataFrame.
Calculate a cumulative sum of the usage data.
Set the date as the DataFrame's index, to facilitate resampling.
Resample the dates to a daily frequency.
Calculate the daily usage.
Example code:
import pandas as pd

# Read the CSV and convert dates to datetime objects.
path = '~/Downloads/usage.csv'
df = pd.read_csv(path,
                 header=0,
                 names=['date', 'gas_usage'],
                 converters={'date': pd.to_datetime})

# Calculate a cumulative sum to be interpolated.
df['gas_usage_c'] = df['gas_usage'].cumsum()

# Move the date to the index, for resampling.
df.set_index('date', inplace=True)

# Resample the data to a daily ('D') frequency and interpolate between points.
df2 = df.resample('D').interpolate('time')

# Calculate the daily usage.
df2['daily_usage'] = df2['gas_usage_c'].diff()
Sample output of df2:
gas_usage gas_usage_c daily_usage
date
2016-03-20 3.989903e+07 3.989903e+07 NaN
2016-03-21 3.932781e+07 4.061487e+07 7.158445e+05
2016-03-22 3.875659e+07 4.133072e+07 7.158445e+05
... ... ...
2018-02-18 4.899380e+07 7.967041e+08 1.598856e+06
2018-02-19 4.847973e+07 7.983029e+08 1.598856e+06
2018-02-20 4.796567e+07 7.999018e+08 1.598856e+06
[703 rows x 3 columns]
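Since the original requirement is for the daily values to sum back to the monthly figures, a quick check is worth running. This is only a sketch, assuming the df and df2 objects built above; because diff() telescopes, the daily total should equal the cumulative total minus the first monthly reading.
# Sketch of a sanity check, assuming df and df2 from the code above.
# Compare the interpolated daily total with the monthly total excluding the
# first reading (diff() leaves the first interpolated day as NaN).
monthly_total_after_first = df['gas_usage'].sum() - df['gas_usage'].iloc[0]
daily_total = df2['daily_usage'].sum()
print(monthly_total_after_first, daily_total)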
Visual confirmation
I've included two simple graphs to illustrate the dataset alignment and interpolation.
Plotting code:
For completeness, the rough plotting code is included below.
from plotly.offline import plot

plot({'data': [{'x': df.index,
                'y': df['gas_usage'],
                'type': 'bar'}],
      'layout': {'title': 'Original',
                 'template': 'plotly_dark'}})

plot({'data': [{'x': df2.index,
                'y': df2['daily_usage'],
                'type': 'bar'}],
      'layout': {'title': 'Interpolated',
                 'template': 'plotly_dark'}})

Related

pandas groupby to return dates in a dataset

Could someone give me a tip on how to use pandas groupby to find similar "days" in a time series dataset?
For example, my data is a building's electrical power and weather data (averaged daily values). I am attempting to see if pandas groupby can be used to find days that are similar, in both electrical power usage and weather, to one particular date in the time stamp: July 25th, 2019.
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/bbartling/Data/master/stackoverflow_groupby_question.csv', parse_dates=True)
df['Date']=pd.to_datetime(df['Date'], utc=True)
df.set_index('Date', inplace=True)
df_daily_avg = df.resample('D').mean()
What I am trying to find is something like the top 10 or 15 days in this dataset most similar to July 25th, whose averaged temperature is:
july_25_temp_avg = df_daily_avg.loc['2019-07-25'].Temperature_C
22.047916666666676
And whose averaged building power is:
july_25_power_avg = df_daily_avg.loc['2019-07-25'].kW
52.658333333333324
If I use groupby, something like the line below, it strips away the time stamp index.
july25_most_similar = df_daily_avg.groupby(['kW','Temperature_C'],as_index=False).Temperature_C.mean()
It returns the following, where it seems like the most similar days are at the bottom:
kW Temperature_C
0 9.316667 17.256250
1 9.433333 14.979167
2 9.616667 13.933333
3 9.683333 19.822917
4 10.116667 24.606250
... ... ...
360 58.741667 21.816667
361 61.250000 23.839583
362 61.633333 25.204167
363 62.483333 25.970833
364 63.808333 25.300000
Any tips on how to return the timestamps/days most similar to July 25th's temperature and power would be greatly appreciated.
Also, if it is possible to use more criteria than just Temperature_C, could an additional answer show how to use more weather data? For example, the averaged power on July 25th together with more weather columns (beyond just Temperature_C) like Wind_Speed_m_s, Relative_Humidity, Temperature_C, Pressure_mbar and DHI_DNI?
I think I would take this approach:
indx = df_daily_avg.sub(df_daily_avg.loc['2019-07-25']).abs()\
           .sort_values(['Temperature_C', 'kW']).head(10).index.normalize()
df[df.index.normalize().isin(indx)]
Use sub to take the difference from July 25th and take the abs, then get the top ten days sorted on 'Temperature_C' and 'kW', or some other metric that ranks the two.
Then take those index values, normalize them to dates, and determine which rows in the original DataFrame match the retrieved index.
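To fold in more weather columns, as the question asks, one possibility is to normalize each column and rank days by their overall distance to July 25th. This is a rough sketch rather than part of the original answer; the column names are taken from the question, and the min-max scaling is an assumption.
# Sketch: rank days by distance to 2019-07-25 across several columns.
# Column names come from the question; the min-max scaling is an assumption.
cols = ['kW', 'Wind_Speed_m_s', 'Relative_Humidity',
        'Temperature_C', 'Pressure_mbar', 'DHI_DNI']
scaled = (df_daily_avg[cols] - df_daily_avg[cols].min()) / \
         (df_daily_avg[cols].max() - df_daily_avg[cols].min())
distance = scaled.sub(scaled.loc['2019-07-25']).pow(2).sum(axis=1).pow(0.5)
most_similar = distance.sort_values().head(11)  # includes July 25th itself
print(most_similar)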

How to aggregate irregularly sampled data for Time Series Analysis

I am trying to forecast daily profit using time series analysis, but daily profit is not only recorded unevenly, but some of the data is missing.
Raw Data:
Date         Revenue
2020/1/19    10$
2020/1/20    7$
2020/1/25    14$
2020/1/29    18$
2020/2/1     12$
2020/2/2     17$
2020/2/9     28$
The above table is an example of the kind of data I have. Profit is not recorded daily, so dates between 2020/1/20 and 2020/1/24 do not exist. On top of that, say the profit recorded during the period between 2020/2/3 and 2020/2/8 went missing in the database. I would like to recover this missing data and use time series analysis to predict the profit from 2020/2/9 onwards.
My approach was to first aggregate the profit every 6 days, since I have to recover the profit between 2020/2/3 and 2020/2/8. My cleaned data would then look something like this:
Date                     Revenue
2020/1/16 ~ 2020/1/21    17$
2020/1/22 ~ 2020/1/27    14$
2020/1/28 ~ 2020/2/2     47$
2020/2/3 ~ 2020/2/8      ? (to predict)
After applying this to a time series model, I would like to further predict the profit from 2020/2/9 onwards.
This is my general idea, but as a beginner at Python and the pandas library, I have trouble executing it. Could you please show me how to aggregate the profit every 6 days so that the data looks like the table above?
The easiest way is to use the pandas resample function.
Provided you have an index of datetime type, aggregating profits every 6 days is as simple as your_dataframe.resample('6D').sum().
You can do all sorts of resampling (end of month, end of quarter, beginning of week, every hour, minute, second, ...). Check the full documentation if you're interested: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.resample.html?highlight=resample#pandas.DataFrame.resample
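As a minimal sketch, here is that one-liner applied to the question's sample data, with the dates parsed and set as the index first (note that without an explicit origin the 6-day bins start at the first date in the data):
import pandas as pd

# Sketch, using the sample data from the question.
df = pd.DataFrame({'Date': ['2020/1/19', '2020/1/20', '2020/1/25', '2020/1/29',
                            '2020/2/1', '2020/2/2', '2020/2/9'],
                   'Revenue': [10, 7, 14, 18, 12, 17, 28]})
df['Date'] = pd.to_datetime(df['Date'])

# Aggregate revenue over consecutive 6-day bins.
print(df.set_index('Date').resample('6D').sum())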
I suggest using a combination of .rolling, pd.date_range, and .reindex
Say your DataFrame is df, with proper datetime indexing:
import pandas as pd

df = pd.DataFrame([['2020/1/19', 10],
                   ['2020/1/20', 7],
                   ['2020/1/25', 14],
                   ['2020/1/29', 18],
                   ['2020/2/1', 12],
                   ['2020/2/2', 17],
                   ['2020/2/9', 28]], columns=['Date', 'Revenue'])
df['Date'] = pd.to_datetime(df['Date'])
df.set_index('Date', inplace=True)
The first step is to 'fill in' the missing days with dummy, zero revenue. We can use pd.date_range to get an index with evenly spaced dates from 2020/1/16 to 2020/2/8, and then .reindex to bring this into the main df DataFrame:
evenly_spaced_idx = pd.date_range(start='2020/1/16',end='2020/2/8',freq='1d')
df = df.reindex(evenly_spaced_idx, fill_value=0)
Now we can take a rolling sum over each 6-day period. We're not interested in every day's 6-day revenue total, though, only every 6th day's:
summary_df = df.rolling('6d').sum().iloc[5::6, :]
The last thing to do with summary_df is to format it so that it clearly states the date range to which each row refers.
summary_df['Start Date'] = summary_df.index - pd.Timedelta('5d')  # the 6-day window covers the label day and the 5 days before it
summary_df['End Date'] = summary_df.index
summary_df.reset_index(drop=True, inplace=True)
You can use resample for this.
Make sure to have the "Date" column as datetime type.
>>> df = pd.DataFrame([["2020/1/19", 10],
...                    ["2020/1/20", 7],
...                    ["2020/1/25", 14],
...                    ["2020/1/29", 18],
...                    ["2020/2/1", 12],
...                    ["2020/2/2", 17],
...                    ["2020/2/9", 28]], columns=['Date', 'Revenue'])
>>> df['Date'] = pd.to_datetime(df.Date)
For pandas < 1.1.0
>>> df.set_index('Date').resample('6D', base=3).sum()
Revenue
Date
2020-01-16 17
2020-01-22 14
2020-01-28 47
2020-02-03 0
2020-02-09 28
For pandas >= 1.1.0
>>> df.set_index('Date').resample('6D', origin='2020-01-16').sum()
Revenue
Date
2020-01-16 17
2020-01-22 14
2020-01-28 47
2020-02-03 0
2020-02-09 28

Produce daily forecasts from monthly averages using Python Pandas

I have daily data going back years. I first want to see the monthly average of these values, and then project that monthly average forward as a forecast for the next few years, for which I have written the code below.
For example, my forecast for the next few Januarys will be the average of the last few Januarys, and the same for February, March, etc. Over the past few years my January number is 51.8111, so for every day of every January in my forecast period I want the value to be 51.8111 (i.e. moving from monthly to daily granularity).
My question is: my code seems a bit long winded and, with the loop, could potentially be slow. For my own learning, what is a better way of taking daily data, averaging it by a time period, and then projecting that time period out? I was looking at the map and apply functions within pandas, but couldn't quite work it out.
import datetime as dt
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

np.random.seed(0)

# create random dataframe of daily values
df = pd.DataFrame(np.random.randint(low=0, high=100, size=2317),
                  columns=['value'],
                  index=pd.date_range(start='2014-01-01', end=dt.date.today()-dt.timedelta(days=1), freq='D'))

# gain average by month over entire date range
df_by_month = df.groupby(df.index.month).mean()

# create new dataframe with date range for forecast
df_forecast = pd.DataFrame(index=pd.date_range(start=dt.date.today(), periods=1095, freq='D'))
df_forecast['value'] = 0

# project forward the monthly average to each day
for val in df_forecast.index:
    df_forecast.loc[val]['value'] = df_by_month.loc[val.month]

# create new dataframe joining together the historical value and forecast
df_complete = df.append(df_forecast)
I think you need Index.map, mapping the months to the 'value' column of df_by_month:
# create new dataframe with date range for forecast
df_forecast = pd.DataFrame(index=pd.date_range(start=dt.date.today(), periods=1095, freq='D'))
df_forecast['value'] = df_forecast.index.month.map(df_by_month['value'])
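For completeness, a small sketch of joining the history with the mapped forecast (assuming the df, df_by_month and df_forecast objects from the question); pd.concat is shown here because DataFrame.append has since been deprecated:
# Sketch, assuming df, df_by_month and df_forecast from above.
df_forecast['value'] = df_forecast.index.month.map(df_by_month['value'])

# Join the historical values and the forecast into one frame.
df_complete = pd.concat([df, df_forecast])
print(df_complete.tail())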

Problem using Groupby in Python for date time. How to make a bar plot with Month/Year?

I have the following data set:
df
OrderDate Total_Charged
7/9/2017 5
7/9/2017 5
7/20/2017 10
8/20/2017 6
9/20/2019 1
...
I want to make a bar plot with month/year on the x-axis and the total charged per month/year, i.e. summed over month and year. First I want to group by month and year, and then make the plot. However, I get an error on the first step:
df["OrderDate"]=pd.to_datetime(df['OrderDate'])
monthly_orders=df.groupby([(df.index.year),(df.index.month)]).sum()["Total_Charged"]
Got following error:
AttributeError: 'RangeIndex' object has no attribute 'year'
What am I doing wrong (what does the error mean), and how can I fix it?
Not sure why you're grouping by the index there. If you want to group by year and month respectively, you could do the following:
df["OrderDate"]=pd.to_datetime(df['OrderDate'])
df.groupby([df.OrderDate.dt.year, df.OrderDate.dt.month]).sum().plot.bar()
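Since the question also asks for Month/Year tick labels on the bar plot, one variation (a sketch, not part of the original answer) is to group by a monthly Period instead of the (year, month) pair, which yields labels like '2017-07':
# Sketch: group by a monthly Period so the bar labels read like '2017-07'.
df["OrderDate"] = pd.to_datetime(df["OrderDate"])
monthly_orders = df.groupby(df.OrderDate.dt.to_period('M'))["Total_Charged"].sum()
monthly_orders.plot.bar()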
pandas.DataFrame.resample
This is a versatile option, that easily implements aggregation over various time ranges (e.g. weekly, daily, quarterly, etc)
Code:
A more expansive dataset:
This code block sets up the sample dataset.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from datetime import datetime, timedelta

# list of dates
first_date = datetime(2017, 1, 1)
last_date = datetime(2019, 9, 20)
x = 4
list_of_dates = [date for date in np.arange(first_date, last_date, timedelta(days=x)).astype(datetime)]

df = pd.DataFrame({'OrderDate': list_of_dates,
                   'Total_Charged': [np.random.randint(10) for _ in range(len(list_of_dates))]})
Using resample for Monthly Sum:
requires a datetime index
df.OrderDate = pd.to_datetime(df.OrderDate)
df.set_index('OrderDate', inplace=True)
monthly_sums = df.resample('M').sum()
monthly_sums.plot.bar(figsize=(8, 6))
plt.show()
An example with Quarterly Avg:
this shows the versatility of resample compared to groupby
Quarterly would not be easily implemented with groupby
quarterly_avg = df.resample('Q').mean()
quarterly_avg.plot.bar(figsize=(8, 6))
plt.show()

How do you pull WEEKLY historical data from yahoo finance?

import datetime
import pandas as pd
import pandas.io.data

sp = pd.io.data.get_data_yahoo('^IXIC',
                               start=datetime.datetime(1972, 1, 3),
                               end=datetime.datetime(2010, 1, 3))
I have used the above example, but that just pulls DAILY data into a dataframe when I would like to pull weekly. It doesn't seem like get_data_yahoo has a parameter that lets you choose between daily, weekly, or monthly, like the options made available on Yahoo itself. Are there any other packages or ideas you know of that might facilitate this?
You can downsample using the asfreq method:
sp = sp.asfreq('W-FRI', method='pad')
The pad method will propagate the last valid observation forward.
Using resample (as #tshauck has shown) is another possibility.
Use asfreq if you want to guarantee that the values in your downsample are values found in the original data set. Use resample if you wish to aggregate groups of rows from the original data set (for example, by taking a mean). reindex might introduce NaN values if the original data set does not have a value on the date specified by the reindex -- though (as #behzad.nouri points out) you could use method=pad to propagate last observations here as well.
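A tiny sketch of that distinction, assuming a daily sp frame is already loaded (the 'Close' column name is an assumption):
# Sketch contrasting asfreq and resample on a daily DataFrame sp.
# asfreq keeps an existing row for each Friday (padding backwards if needed);
# resample aggregates all rows that fall inside each weekly bin.
weekly_last = sp.asfreq('W-FRI', method='pad')
weekly_mean = sp['Close'].resample('W-FRI').mean()  # 'Close' column is assumed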
If you check the latest pandas source code on GitHub, you will see that an interval param is included in the latest master branch. You can manually modify your local copy by overwriting the same data.py under your site-packages/pandas/io folder.
You can always reindex to your desired frequency:
sp.reindex(pd.date_range(start=sp.index.min(),
                         end=sp.index.max(),
                         freq='W-WED'))  # weekly, Wednesdays
Edit: you may add method='ffill' to forward-fill NaN values.
As a suggestion, take Wednesdays, because they tend to have the fewest missing values (i.e. fewer NYSE holidays fall on a Wednesday). I think Yahoo weekly data gives the stock price each Monday, which is the worst weekly frequency based on S&P data from 2000 onwards:
import datetime as dt
import pandas.io.data as web

sp = web.DataReader("^GSPC", "yahoo", start=dt.date(2000, 1, 1))
weekday = {0: 'MON', 1: 'TUE', 2: 'WED', 3: 'THU', 4: 'FRI'}
sp['weekday'] = list(map(weekday.get, sp.index.dayofweek))
sp.weekday.value_counts()
output:
WED 722
TUE 717
THU 707
FRI 705
MON 659
One option would be to mask on the day of week you want.
sp[sp.index.dayofweek == 0]
Another option would be to resample.
sp.resample('W', how='mean')
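In recent pandas versions the how argument has been removed from resample, so the equivalent call is written as a method chain (a minor modernisation, not part of the original answer):
# Equivalent resample call for current pandas versions.
sp.resample('W').mean()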
That's how I convert daily to weekly price data:
import datetime
import pandas as pd
import pandas_datareader.data as web

start = datetime.datetime(1972, 1, 3)
end = datetime.datetime(2010, 1, 3)
stock_d = web.DataReader('^IXIC', 'yahoo', start, end)

def week_open(array_like):
    return array_like[0]

def week_close(array_like):
    return array_like[-1]

stock_w = stock_d.resample('W',
                           how={'Open': week_open,
                                'High': 'max',
                                'Low': 'min',
                                'Close': week_close,
                                'Volume': 'sum'},
                           loffset=pd.offsets.timedelta(days=-6))
stock_w = stock_w[['Open', 'High', 'Low', 'Close', 'Volume']]
more info:
https://pandas-datareader.readthedocs.io/en/latest/remote_data.html#yahoo-finance
https://gist.github.com/prithwi/339f87bf9c3c37bb3188
