Python - saving ARIMA forecast data to file - python

Hi and thanks in advance,
So I'm trying to write my forecast data to a file from my plot which is using an ARIMA forecast. How would I be able to do so, so that I could access the saved forecast data later?
Here is my code:
import pandas
from matplotlib import pyplot
from statsmodels.tsa.arima_model import ARIMA  # legacy ARIMA class; fit(disp=..., method='css') below belongs to this older API
series = pandas.read_csv('Quantity.csv',header=0,parse_dates=[0])
series.columns = ['Date','Quantity']
series.set_index(['Date'],inplace=True)
model = ARIMA(series['Quantity'].astype(float), order=(2,0,1))
fittedModel = model.fit(disp=0,method='css')
stepsAhead = 10
forecastArray = fittedModel.forecast(steps=stepsAhead)
for i in range(stepsAhead):
    series.loc[len(series)] = forecastArray[0][i]
series.plot()
pyplot.show()
Here is the data I used to plot with if needed:
Date Quantity
2010/01/01 1358
2010/07/02 0
2010/08/03 0
2011/02/04 0
2011/11/05 0
2011/12/06 274
2012/06/07 1074
2012/08/30 2223
2013/04/16 0
2013/03/18 1753
2014/02/22 345
2014/01/27 24
2015/12/15 652
2015/09/28 275
2016/05/04 124
2017/11/07 75
2017/09/22 32
2017/04/04 12
Thank you.

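import os
series.to_csv(os.path.join(yourPath, 'Forecasts.csv'))
Because your loop appends the forecast values to series, this writes the original data together with the forecasts to a CSV file, which you can load again later with pandas.read_csv.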
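If only the forecast itself needs to be kept, it can be written to its own file and read back later. The following is a minimal sketch, not the asker's actual output: the forecast numbers are placeholders, the monthly dates are an assumed horizon, and the file is written to a temporary directory.

```python
import os
import tempfile

import pandas as pd

# Placeholder numbers standing in for the ARIMA forecast; not real model output.
forecast_values = [310.0, 295.5, 301.2]

# Hypothetical future dates for the forecast horizon (monthly steps are an assumption).
forecast_index = pd.date_range("2018-01-01", periods=len(forecast_values), freq="MS")
forecast = pd.DataFrame({"Quantity": forecast_values}, index=forecast_index)
forecast.index.name = "Date"

# Write only the forecast, then read it back as a date-indexed frame.
out_path = os.path.join(tempfile.gettempdir(), "Forecasts.csv")
forecast.to_csv(out_path)
restored = pd.read_csv(out_path, index_col="Date", parse_dates=["Date"])
print(restored)
```

Keeping the dates as the index means the restored frame can be plotted or joined to the original series without further conversion.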

Related

How to plot multiple daily time series, aligned at specified trigger times?

The Problem:
I have a dataframe df that looks like this:
value msg_type
date
2022-03-15 08:15:10+00:00 122 None
2022-03-15 08:25:10+00:00 125 None
2022-03-15 08:30:10+00:00 126 None
2022-03-15 08:30:26.542134+00:00 127 ANNOUNCEMENT
2022-03-15 08:35:10+00:00 128 None
2022-03-15 08:40:10+00:00 122 None
2022-03-15 08:45:09+00:00 127 None
2022-03-15 08:50:09+00:00 133 None
2022-03-15 08:55:09+00:00 134 None
....
2022-03-16 09:30:09+00:00 132 None
2022-03-16 09:30:13.234425+00:00 135 ANNOUNCEMENT
2022-03-16 09:35:09+00:00 130 None
2022-03-16 09:40:09+00:00 134 None
2022-03-16 09:45:09+00:00 135 None
2022-03-16 09:50:09+00:00 134 None
The value data occurs in roughly 5 minute intervals, but messages can occur at any time. I am trying to plot one line of values per day, where the x-axis ranges from t=-2 hours to t=+8 hours, and the ANNOUNCEMENT occurs at t=0 (see image below).
So, for example, if an ANNOUNCEMENT occurs at 8:30AM on 3/15 and again at 9:30AM on 3/16, there should be two lines:
one line for 3/15 that plots data from 6:30AM to 4:30PM, and
one line for 3/16 that plots data from 7:30AM to 5:30PM,
both sharing the same x-axis ranging from -2 to +8, with ANNOUNCEMENT at t=0.
What I've Tried:
I am able to do this currently by finding the index position of an announcement (e.g. say it occurs at row 298 -> announcement_index = 298), generating an array of 120 numbers from -24 to 96 (representing 10 hours at 5 minutes per number -> x = np.arange(-24, 96, 1)), then plotting
sns.lineplot(x, y=df['value'].iloc[announcement_index-24:announcement_index+96])
While this does mostly work (see image below), I suspect it's not the correct way to go about it. Specifically, trying to add more info to the plot (like a different set of 'value' markers) at specific times is difficult because I would need to convert the timestamp into this arbitrary 24-96 value range.
How can I make this same plot but by utilizing the datetime index instead? Thank you so much!
Assuming the index has already been converted to_datetime, create an IntervalArray from -2H to +8H of the index:
dl, dr = -2, 8
left = df.index + pd.Timedelta(f'{dl}H')
right = df.index + pd.Timedelta(f'{dr}H')
df['interval'] = pd.arrays.IntervalArray.from_arrays(left, right)
Then for each ANNOUNCEMENT, plot the window from interval.left to interval.right:
Set the x-axis as seconds since ANNOUNCEMENT
Set the labels as hours since ANNOUNCEMENT
fig, ax = plt.subplots()
for ann in df.loc[df['msg_type'] == 'ANNOUNCEMENT'].itertuples():
    window = df.loc[ann.interval.left:ann.interval.right] # extract interval.left to interval.right
    window.index -= ann.Index # compute time since announcement
    window.index = window.index.total_seconds() # convert to seconds since announcement
    window.plot(ax=ax, y='value', label=ann.Index.date())
deltas = np.arange(dl, dr + 1)
ax.set(xticks=deltas * 3600, xticklabels=deltas) # set tick labels to hours since announcement
ax.legend()
Here is the output with a smaller window -1H to +2H just so we can see the small sample data more clearly (full code below):
Full code:
import io
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
s = '''
date,value,msg_type
2022-03-15 08:15:10+00:00,122,None
2022-03-15 08:25:10+00:00,125,None
2022-03-15 08:30:10+00:00,126,None
2022-03-15 08:30:26.542134+00:00,127,ANNOUNCEMENT
2022-03-15 08:35:10+00:00,128,None
2022-03-15 08:40:10+00:00,122,None
2022-03-15 08:45:09+00:00,127,None
2022-03-15 08:50:09+00:00,133,None
2022-03-15 08:55:09+00:00,134,None
2022-03-16 09:30:09+00:00,132,None
2022-03-16 09:30:13.234425+00:00,135,ANNOUNCEMENT
2022-03-16 09:35:09+00:00,130,None
2022-03-16 09:40:09+00:00,134,None
2022-03-16 09:45:09+00:00,135,None
2022-03-16 09:50:09+00:00,134,None
'''
df = pd.read_csv(io.StringIO(s), index_col=0, parse_dates=['date'])
# create intervals from -1H to +2H of the index
dl, dr = -1, 2
left = df.index + pd.Timedelta(f'{dl}H')
right = df.index + pd.Timedelta(f'{dr}H')
df['interval'] = pd.arrays.IntervalArray.from_arrays(left, right)
# plot each announcement's interval.left to interval.right
fig, ax = plt.subplots()
for ann in df.loc[df['msg_type'] == 'ANNOUNCEMENT'].itertuples():
    window = df.loc[ann.interval.left:ann.interval.right] # extract interval.left to interval.right
    window.index -= ann.Index # compute time since announcement
    window.index = window.index.total_seconds() # convert to seconds since announcement
    window.plot(ax=ax, y='value', label=ann.Index.date())
deltas = np.arange(dl, dr + 1)
ax.set(xticks=deltas * 3600, xticklabels=deltas) # set tick labels to hours since announcement
ax.grid()
ax.legend()
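The same windowing idea can be expressed without the interval column by slicing around each announcement time and converting the index straight to hours. This is only a sketch under assumptions: hours_since is a hypothetical helper name, and the sample rows are made up with a uniform timestamp format.

```python
import pandas as pd


def hours_since(df, event_time, before_h=2, after_h=8):
    """Return the rows around event_time, indexed by hours since the event."""
    start = event_time - pd.Timedelta(hours=before_h)
    end = event_time + pd.Timedelta(hours=after_h)
    window = df.loc[start:end].copy()  # requires a sorted DatetimeIndex
    window.index = (window.index - event_time).total_seconds() / 3600
    return window


# Made-up sample data in the same shape as the question's frame.
times = pd.to_datetime(
    ["2022-03-15 07:00:00", "2022-03-15 08:15:00", "2022-03-15 08:30:00",
     "2022-03-15 08:45:00", "2022-03-15 11:00:00"],
    utc=True,
)
df = pd.DataFrame(
    {"value": [120, 122, 127, 127, 140],
     "msg_type": [None, None, "ANNOUNCEMENT", None, None]},
    index=times,
)

for event_time in df.index[df["msg_type"] == "ANNOUNCEMENT"]:
    window = hours_since(df, event_time, before_h=1, after_h=2)
    print(event_time.date(), window["value"].to_dict())
```

With the index already in hours, extra markers at known timestamps can be placed by subtracting the same announcement time, and no tick relabeling is needed.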

How to show timeline in matplotlib.axes.Axes.stem plot?

I am making a matplotlib.axes.Axes.stem plot whose x-axis is a timeline of days. Some days have data, while others have none (no entries for those days exist in my data).
Question 1: How do I make a timeline stem graph that will show my data, including days with no data? Is this possible? Is there some way to auto-scale the appearance of the data x-axis to handle such a situation?
Below is a sample data file called test.txt and my python script to read in its data to show a timeline stem plot for your consideration. output from this script is also given below.
Question2. Presentation question. How do I show a "-" symbol at each annotation? Also, how do I rotate the annotation by 30 degrees?
test.txt
No. Date
1 23/01/2020
2 24/01/2020
3 24/01/2020
4 26/01/2020
5 27/01/2020
6 28/01/2020
7 29/01/2020
8 29/01/2020
9 30/01/2020
10 30/01/2020
11 31/01/2020
12 31/01/2020
13 01/02/2020
14 01/02/2020
15 04/02/2020
16 04/02/2020
17 04/02/2020
18 05/02/2020
19 05/02/2020
20 05/02/2020
21 06/02/2020
22 07/02/2020
23 07/02/2020
24 07/02/2020
25 08/02/2020
26 08/02/2020
27 08/02/2020
28 08/02/2020
29 08/02/2020
30 09/02/2020
31 10/02/2020
32 10/02/2020
33 11/02/2020
34 11/02/2020
38 13/02/2020
39 13/02/2020
40 13/02/2020
41 13/02/2020
42 13/02/2020
43 13/02/2020
44 14/02/2020
45 14/02/2020
46 14/02/2020
47 14/02/2020
48 14/02/2020
49 14/02/2020
50 15/02/2020
51 15/02/2020
52 15/02/2020
53 15/02/2020
54 15/02/2020
57 18/02/2020
58 18/02/2020
59 18/02/2020
60 19/02/2020
61 21/02/2020
stem_plot.py
import matplotlib.pyplot as plt
import numpy as np
import matplotlib.dates as mdates
from datetime import datetime
from pathlib import Path
#########################
#### DATA EXTRACTION ####
#########################
source = Path('./test.txt')
with source.open() as f:
    lines = f.readlines()
#print( lines )
# Store source data in dictionary with date shown as mm-dd.
data={}
for line in lines[1:]:
    case, cdate = line.strip().split()
    cdate = datetime.strptime(cdate, "%d/%m/%Y").strftime('%m-%d')
    data[case] = cdate
print( f'\ndata = {data}' )
# Collate data's y-axis for each date, i.e. history
history2={}
cdates = list(data.values())
sorted_dates = sorted( set( cdates ) )
for i in sorted_dates:
    cases=[]
    for case, date in data.items():
        if i == date:
            cases.append(case)
    #print( i, cases)
    history2[i] = cases
print( f'\nhistory2 = {history2}')
###########################
#### DATA PRESENTATION ####
###########################
# Create figure and plot a stem plot with the date
fig, ax = plt.subplots(figsize=(8.8, 5), constrained_layout=True)
ax.set(title="Test")
labels=list( history2.values() ) # For annotation
yy = [ len(i) for i in labels ] # y-axis
xx = list(history2.keys()) # x-axis
markerline, stemline, baseline = ax.stem(
    xx, yy, linefmt="C1:", basefmt="k-", use_line_collection=True)
plt.setp(markerline, marker="None" )
# annotate stem lines
for ann_x, label in list(history2.items()):
    print(ann_x, label)
    each_count=1
    for each in label:
        ax.annotate( each, xy=(ann_x, each_count), xycoords='data')
        each_count += 1
        #print(f'each_count = {each_count}' )
# format xaxis
plt.setp( ax.get_xticklabels(), rotation=30 )
# remove top and right spines
for spine in ["top", "right"]:
    ax.spines[spine].set_visible(False)
# show axis name
ax.get_yaxis().set_label_text(label='Y-axis')
ax.get_xaxis().set_label_text(label='X-axis')
plt.show()
Current output:
About your first question. Basically, you make a list of all days between the days you are using and use that. So add this to the beginning of your code:
import pandas as pd
alldays = pd.date_range(start="20200123",
end="20200221",
normalize=True)
dates = []
for i in alldays:
    dates.append(f"{i.month:02}-{i.day:02}")
This gets a pandas date range between the two dates and converts it into a list of month-day strings.
Then modify this part of your code like this:
# Collate data's y-axis for each date, i.e. history
history2={}
cdates = list(data.values())
sorted_dates = sorted( set( cdates ) )
for i in dates: # This is the only change!
    cases=[]
    for case, date in data.items():
        if i == date:
            cases.append(case)
    #print( i, cases)
    history2[i] = cases
And this change would give you this:
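If only the per-day counts are needed (the stem heights), the same "include empty days" idea can be sketched with value_counts and reindex. This is an alternative under assumptions, not the code above: it uses a handful of dates from test.txt and keeps real datetime values until the labels are built.

```python
import pandas as pd

# A few of the dates from test.txt (dd/mm/yyyy); 25/01/2020 has no entry.
raw_dates = ["23/01/2020", "24/01/2020", "24/01/2020", "26/01/2020"]

dates = pd.to_datetime(pd.Series(raw_dates), format="%d/%m/%Y")
all_days = pd.date_range(dates.min(), dates.max(), freq="D")

# Count entries per day, then reindex so missing days appear with a count of 0.
counts = dates.value_counts().reindex(all_days, fill_value=0)
labels = counts.index.strftime("%m-%d")
print(dict(zip(labels, counts)))
```

Here labels and counts play the roles of xx and yy in the stem call.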
About your second question, change your code to this:
# annotate stem lines
for ann_x, label in list(history2.items()):
    print(ann_x, label)
    each_count=1
    for each in label:
        ax.annotate(f"--{each}", xy=(ann_x, each_count), xycoords='data', rotation=30)
        each_count += 1
I just changed the ax.annotate line. The two changes are:
added a "--" to each of your annotation labels,
added a rotation parameter. The rotation parameter does not appear directly in the annotate documentation, but the documentation says that additional keyword arguments are passed on to Text, and rotation is one of the Text properties.
This would hopefully give you what you have asked for:
Adding to @SinanKurmus's answer to my first question:
Solution1:
A time-axis with a daily interval for the entire history of the given data can be obtained using matplotlib's methods, namely drange and num2date, and python. The use of pandas can be avoided here.
First, express the start and end dates of the time axis as Python datetime objects. Note that you need to add one more day to the end date, otherwise data from the last date will not be included. Next, use one day as the time interval, expressed as a datetime.timedelta object. Then pass all three to matplotlib.dates.drange, which returns a NumPy array of date numbers. Matplotlib's num2date function in turn converts these back to Python datetime objects.
from datetime import datetime, timedelta
import matplotlib.dates as mdates
def get_time_axis( data ):
    # data values are expected to be date strings in YYYY-MM-DD form
    start = datetime.strptime(min(data.values()), "%Y-%m-%d")
    end = datetime.strptime(max(data.values()), "%Y-%m-%d") + timedelta(days=1)
    delta = timedelta(days=1)
    time_axis_md = mdates.drange( start, end, delta )
    time_axis_py = mdates.num2date( time_axis_md, tz=None ) # Add tz when required
    return time_axis_py
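The need for the extra day can be checked directly, since drange treats the end date as exclusive. A small sketch with made-up start and end dates:

```python
from datetime import datetime, timedelta

import matplotlib.dates as mdates

start = datetime(2020, 1, 23)
last = datetime(2020, 1, 25)
one_day = timedelta(days=1)

# Without the extra day the last date is left out; with it, the range is complete.
without_extra_day = mdates.num2date(mdates.drange(start, last, one_day))
with_extra_day = mdates.num2date(mdates.drange(start, last + one_day, one_day))

print([d.strftime("%m-%d") for d in without_extra_day])
print([d.strftime("%m-%d") for d in with_extra_day])
```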
Solution 2:
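Matplotlib also has a FAQ entry on how to skip dates where there is no data. I have included their sample code below. Note that it relies on mlab.csv2rec, which has since been removed from Matplotlib, so it only runs as written on older versions.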
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.mlab as mlab
import matplotlib.ticker as ticker
r = mlab.csv2rec('../data/aapl.csv')
r.sort()
r = r[-30:] # get the last 30 days
N = len(r)
ind = np.arange(N) # the evenly spaced plot indices
def format_date(x, pos=None):
    thisind = np.clip(int(x+0.5), 0, N-1)
    return r.date[thisind].strftime('%Y-%m-%d')
fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(ind, r.adj_close, 'o-')
ax.xaxis.set_major_formatter(ticker.FuncFormatter(format_date))
fig.autofmt_xdate()
plt.show()
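The technique in that sample (plot against evenly spaced integer positions and let a FuncFormatter translate positions back into dates) does not depend on csv2rec. Here is a rough sketch of the same idea with made-up business-day data in place of the aapl.csv file, which is an assumption of this example:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import numpy as np
import pandas as pd

# Made-up observations on business days only, so weekends are missing.
dates = pd.bdate_range("2020-01-23", periods=10)
values = np.arange(10) ** 2

ind = np.arange(len(dates))  # evenly spaced plot positions


def format_date(x, pos=None):
    # Map a tick position back to the nearest observation's date.
    i = int(np.clip(round(x), 0, len(dates) - 1))
    return dates[i].strftime("%Y-%m-%d")


fig, ax = plt.subplots()
ax.plot(ind, values, "o-")
ax.xaxis.set_major_formatter(ticker.FuncFormatter(format_date))
fig.autofmt_xdate()
fig.canvas.draw()
```

This is the opposite of showing empty days: gaps are closed up, so it suits data where days without observations should not take up space.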

Iterating rows and collecting values for output. Numpy and Python 3.6

Update 5/22/18: the answer by @aorr is below the original question.
I have thousands of inputs and want to collect the rows for each individual ID, sort them by date, then plot each ID's data and export one chart per ID.
Edited
Sample data:
Col names: Id Date O G Company Date2
aab72ffd-4d0b-4c62-b6fe-4c55b98be9a0 3/1/1999 180.66 673 A 1/1/1996
aab72ffd-4d0b-4c62-b6fe-4c55b98be9a0 3/1/1995 173.9 651 A 1/1/1996
a15961bc-0263-4c66-a825-1deb69bda8be 12/1/2010 55.14 542 C 1/1/2011
a15961bc-0263-4c66-a825-1deb69bda8be 5/1/2012 49.24 577 C 1/1/2011
10a1d17b-1f5c-4a4d-8186-e4dbf62e3bf2 12/1/2000 48.14 290 D 3/1/2002
10a1d17b-1f5c-4a4d-8186-e4dbf62e3bf2 3/1/2003 69.03 282.5 D 3/1/2002
Desired output arrays/charts, but sorted by date.
10a1d17b-1f5c-4a4d-8186-e4dbf62e3bf2 7/1/2005 28.24 327
10a1d17b-1f5c-4a4d-8186-e4dbf62e3bf2 10/1/1998 45.11 335
10a1d17b-1f5c-4a4d-8186-e4dbf62e3bf2 7/1/2001 28.22 348
10a1d17b-1f5c-4a4d-8186-e4dbf62e3bf2 7/1/1997 44.53 350.5
10a1d17b-1f5c-4a4d-8186-e4dbf62e3bf2 8/1/2001 28.4 333.5
10a1d17b-1f5c-4a4d-8186-e4dbf62e3bf2 10/1/2005 41.72 314
10a1d17b-1f5c-4a4d-8186-e4dbf62e3bf2 12/1/2001 29.53 313.5
10a1d17b-1f5c-4a4d-8186-e4dbf62e3bf2 8/1/2002 43.24 319
The code I have typed so far successfully creates an indexed array of the different data types. Now I am just trying to iterate over all rows and organize the data so that it prints or writes individual arrays/charts based on the IDs.
Here is what I have so far:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
#import data
mydataset = pd.read_csv('input_test.csv', dtype=None)
x = mydataset.iloc[:,:].values
y = mydataset.iloc[:,:].values
#Id
b = np.array((x[:,0]), dtype=str)
#Date
c = np.array((x[:,1]), dtype=str)
# O Var
d = np.array((x[:,2]), dtype=int)
# G var
e = np.array((x[:,3]), dtype=int)
#Stack
f = np.vstack((b,c,d,e))
#Transpose array
g = f.T
#Plot data
plt.figure()
plt.plot(x[:,2], y[:,3], label ='Rate over time')
plt.xlabel('m')
plt.ylabel('r/m')
#plt.legend()
Update based on @aorr's answer:
Thanks for helping us noobs.
This plots both O and G on the Y axis with Date on the X axis for each Id. And everything is sorted based on date. Great starting point to expand with this data. More to follow based on updates.
for Id in data['Id'].unique():
    fig, ax = plt.subplots(figsize=(5,3))
    plot_data = data.query("Id==@Id").sort_values('Date')
    _ = plot_data.plot(x='Date',y='O', ax=ax)
    _ = plot_data.plot(x='Date', y='G', ax=ax)
    #Plot Company name in each chart
    for Company in plot_data['Company']:
        _ = plt.title(Company)
    #Plot Date2 Event onto X-axis
    for Date2 in plot_data['Date2']:
        _ = plt.axvline(Date2)
Have you tried solving this with pandas? I don't think you need to create numpy arrays for every element, pandas already stores them as ndarrays internally.
import pandas as pd
import matplotlib.pyplot as plt
data = pd.read_csv('input_test.csv', parse_dates=['date'])
for id in data['id'].unique():
    fig, ax = plt.subplots(figsize=(5,3))
    plot_data = data.query("id==@id").sort_values('date')
    _ = plot_data.plot(x='O',y='G', ax=ax)
That should get you nearly all the way there. The pandas visualization docs have a number of other helpful options for exploring data quickly, but if you are particular about the look of the figure, you will want to use matplotlib directly for the figure and axes layout.
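For the "export the chart for each ID" part, a groupby loop that saves one image per group is one possible approach. The sketch below makes several assumptions: the IDs are shortened versions of the sample rows, the file names are hypothetical, and the charts go to a temporary directory.

```python
import io
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Shortened IDs and a few rows modeled on the sample data in the question.
csv_text = """Id,Date,O,G,Company,Date2
aab72ffd,3/1/1999,180.66,673,A,1/1/1996
aab72ffd,3/1/1995,173.9,651,A,1/1/1996
a15961bc,12/1/2010,55.14,542,C,1/1/2011
a15961bc,5/1/2012,49.24,577,C,1/1/2011
"""
data = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date", "Date2"])

out_dir = tempfile.mkdtemp()
saved = []
for id_value, group in data.groupby("Id"):
    group = group.sort_values("Date")
    fig, ax = plt.subplots(figsize=(5, 3))
    ax.plot(group["Date"], group["O"], label="O")
    ax.plot(group["Date"], group["G"], label="G")
    ax.axvline(group["Date2"].iloc[0], color="k", linestyle="--")  # event date
    ax.set_title(group["Company"].iloc[0])
    ax.legend()
    path = os.path.join(out_dir, f"chart_{id_value}.png")  # hypothetical file name
    fig.savefig(path)
    plt.close(fig)  # free the figure; matters with thousands of IDs
    saved.append(path)

print(saved)
```

Closing each figure after saving keeps memory use flat when the loop runs over thousands of IDs.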

Python: Predict the y value using Statsmodels - Linear Regression

I am using the statsmodels library of Python to predict the future balance using Linear Regression. The csv file is displayed below:
Year | Balance
3 | 30
8 | 57
9 | 64
13 | 72
3 | 36
6 | 43
11 | 59
21 | 90
1 | 20
16 | 83
It contains the 'Year' as the independent 'x' variable, while the 'Balance' is the dependent 'y' variable
Here's the code for Linear Regression for this data:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
import numpy as np
from matplotlib import pyplot as plt
import os
os.chdir(r'C:\Users\Admin\Desktop\csv')  # raw string so the backslashes are not treated as escapes
cw = pd.read_csv('data-table.csv')
y=cw.Balance
X=cw.Year
X = sm.add_constant(X) # Adds a constant term to the predictor
est = sm.OLS(y, X)
est = est.fit()
print(est.summary())
est.params
X_prime = np.linspace(X.Year.min(), X.Year.max(), 100)[:, np.newaxis]
X_prime = sm.add_constant(X_prime) # add constant as we did before
y_hat = est.predict(X_prime)
plt.scatter(X.Year, y, alpha=0.3) # Plot the raw data
plt.xlabel("Year")
plt.ylabel("Total Balance")
plt.plot(X_prime[:, 1], y_hat, 'r', alpha=0.9) # Add the regression line, colored in red
plt.show()
The question is how to predict the 'Balance' value, using Statsmodels when the value of 'Year'=10 ?
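You can use the predict method of the results object est. The simplest way to pass new values by name is to fit the model through the formula interface, using the ols function you already import from statsmodels.formula.api:
est = ols("Balance ~ Year", data=cw).fit()
est.predict(exog=new_values)
where new_values is a dictionary (or DataFrame) mapping the column name to the new values, for example dict(Year=[10]).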
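Put together for the table in the question, a prediction at Year=10 might look like the sketch below. It rebuilds the data inline instead of reading data-table.csv, and the names cw, new_values and manual are chosen for this illustration only.

```python
import pandas as pd
from statsmodels.formula.api import ols

# The Year/Balance table from the question, rebuilt inline.
cw = pd.DataFrame({
    "Year": [3, 8, 9, 13, 3, 6, 11, 21, 1, 16],
    "Balance": [30, 57, 64, 72, 36, 43, 59, 90, 20, 83],
})

# Fit Balance as a linear function of Year; the formula adds the intercept itself.
est = ols("Balance ~ Year", data=cw).fit()

# Predict for a new Year value by passing a frame with the same column name.
new_values = pd.DataFrame({"Year": [10]})
predicted = est.predict(new_values)
print(predicted.iloc[0])

# Cross-check against intercept + slope * 10 from the fitted parameters.
manual = est.params["Intercept"] + est.params["Year"] * 10
print(manual)
```

With the formula interface there is no need to call add_constant on the new values, because the intercept column is built from the formula.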

Plotting datetime output using matplotlib

So I have this code based on a simple data array that looks like this:
5020 : 2015 7 11 11 42 54 782705
5020 : 2015 7 11 11 44 55 575776
5020 : 2015 7 11 11 46 56 560755
5020 : 2015 7 11 11 48 57 104872
and the plot looks like the following:
import scipy as sp
import matplotlib.pyplot as plt
data = sp.genfromtxt("E:/Python/data.txt", delimiter=" : ")
x = data[:,0]
y = data[:,1]
plt.scatter(x,y)
plt.title("Instagram")
plt.xlabel("Time")
plt.ylabel("Followers")
plt.xticks([w*2*60 for w in range(10)],
['2-minute interval %i'%w for w in range(10)])
plt.autoscale(tight=True)
plt.grid()
plt.show()
I'm looking for a simple way to use the datetime output as x intervals on the graph, I can't figure out a way to make it understand it and there's this:
In [15]:sp.sum(sp.isnan(y))
Out[15]: 77
Which I guess is because of the spaces? I'm new to machine learning in Python, forgive my ignorance.
Thank you very much.
I would solve this by directly passing datetime.datetime objects to pyplot. Here is a short example:
import datetime as dt
import matplotlib.pyplot as plt
import matplotlib
# Note: please figure out yourself the data input
x = [dt.datetime(2015,7,11,11,42,54),
dt.datetime(2015,7,11,11,44,56),
dt.datetime(2015,7,11,11,46,56),
dt.datetime(2015,7,11,11,48,57)]
#define the x limit:
xstart= dt.datetime(2015,7,11,11,40,54)
xstop = dt.datetime(2015,7,11,11,50,54)
y = [782705, 575776, 560755, 104872]
fig,ax= plt.subplots()
ax.scatter(x,y)
xfmt = matplotlib.dates.DateFormatter('%D %H:%M:%S')
ax.xaxis.set_major_formatter(xfmt)
ax.set_title("Instagram")
ax.set_xlabel("Time")
ax.set_ylabel("Followers")
ax.set_xlim(xstart,xstop)
plt.xticks(rotation='vertical')
plt.show()
Result:
Yes it's because of the spaces. When you're importing the data it's assigning NaN to your x values.
Try this, it's a little longer but should work:
data = []
x = []
x_old = []
y = []
with open('data.txt', 'r') as f:
    for line in f:
        data.append(line.split(':'))
for i in data:
    y.append(i[0])
    x_old.append(i[1])
for t in x_old:
    x.append(float(t[17:19]+'.'+t[20:])/60+int(t[14:16]))
Because of the spaces I had to convert the data into float manually. I divided the seconds+milliseconds by 60 then added to minutes since I'm assuming you're only interested in that (2 min interval).
If the format is done better you can use datetime and extract the information better. For example:
from datetime import datetime
my_time = datetime.strptime('2015 7 11 11 42 54.782705', '%Y %m %d %H %M %S.%f')
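The lines in the file can also be parsed as they are, without fixed character positions, by splitting on " : " and handing the rest to strptime. This sketch assumes the first field is the follower count and the last field of the timestamp is microseconds; the question does not state either, so adjust if the columns mean something else.

```python
from datetime import datetime

# Two sample lines in the format shown in the question.
lines = [
    "5020 : 2015 7 11 11 42 54 782705",
    "5020 : 2015 7 11 11 44 55 575776",
]


def parse_line(line):
    # Assumed layout: "<count> : <year month day hour minute second microsecond>"
    count_text, stamp_text = line.split(" : ")
    stamp = datetime.strptime(stamp_text.strip(), "%Y %m %d %H %M %S %f")
    return stamp, int(count_text)


points = [parse_line(line) for line in lines]
x = [stamp for stamp, _ in points]
y = [count for _, count in points]
print(x[0], y[0])
```

The resulting x list holds datetime objects, so it can be passed to ax.scatter directly, as in the first answer.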
