Using axvline with datetimeindex - python

I am trying to plot a vertical line using axvline, but I keep getting errors even though I saw here that it's possible to just feed in the date into axvline: Python/Matplotlib plot vertical line for specific dates in line chart
Could anyone point out what am I doing wrong (I am also a beginner).
Ideally, I am looking for a way to just be able to feed in the date into axvline without adding extra pieces of code.
Here is my df:
CPISXS CPIX WTI UMCSI
Dates
2022-08-31 387.748 263.732 93.67 44.0
2022-09-30 390.555 264.370 84.26 42.0
2022-10-31 390.582 264.442 87.55 42.0
2022-11-30 390.523 263.771 84.37 46.0
2022-12-31 NaN NaN NaN NaN
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 756 entries, 1960-01-31 to 2022-12-31
Freq: M
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 CPISXS 480 non-null float64
1 CPIX 671 non-null float64
2 WTI 755 non-null float64
3 UMCSI 670 non-null float64
dtypes: float64(4)
memory usage: 45.7 KB`
And here is the code:
fig2, m1 = plt.subplots(figsize=(12,6))
m2=m1.twinx()
m1.axvline(x='1990-01-30')
m1.plot(df0['UMCSI'],'r--',linewidth=1)
m2.plot(df0['WTI'],'b')
When I run it I always get the vertical line on 1970-01-01

Move axvline after the plot commands.
Why the order matters:
If you use axvline first, the line gets added to a blank plot without a datetime x-axis. Since there is no datetime x-axis, axvline doesn't know what to do with a date string input.
Conversely if you plot the time series first, the datetime x-axis gets established, at which point axvline will understand the date input.
fig2, m1 = plt.subplots(figsize=(12, 6))
m2 = m1.twinx()
# plot the time series first to establish the datetime x-axis
m1.plot(df0['UMCSI'], 'r--', linewidth=1)
m2.plot(df0['WTI'], 'b')
# now axvline will understand the date input
m1.axvline(x='1990-01-30')

Related

Why are my forecast predictions coming out as NaN?

My problem is pretty simple, and I know I'm missing something very obvious, I just can't figure out what it is....
My test predictions for Holt-Winters are coming out as NaN and I can't figure out why. Can anyone help on this?
I'm using a Jupyter Notebook, and trying to forecast sales of one SKU using Holt-Winters method. I even went as far as using
Here is the code I used:
# Import the libraries needed to execute Holt-Winters
import pandas as pd
import numpy as np
%matplotlib inline
df = pd.read_csv('../Data/M1045_White.csv',index_col='Month',parse_dates=True)
# Set the month column as the index column
df.index.freq = 'MS'
df.index
df.head()
df.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 48 entries, 2015-05-01 to 2019-04-01
Freq: MS
Data columns (total 7 columns):
Sales 48 non-null int64
EWMA12 48 non-null float64
SES12 47 non-null float64
DESadd12 47 non-null float64
DESmul12 47 non-null float64
TESadd12 48 non-null float64
TESmul12 12 non-null float64
dtypes: float64(6), int64(1)
memory usage: 3.0 KB
from statsmodels.tsa.holtwinters import SimpleExpSmoothing
# Train Test Split
train_data = df.iloc[:36] # Goes up to but not including 36
test_data = df.iloc[12:]
# Fit the Model
fitted_model = exponentialSmoothing(train_data['Sales'],trend='mul',seasonal='mul',seasonal_periods=12).fit()
test_predictions = fitted_model.forecast(12).rename('HW M1045 White Forecast')
test_predictions
Here is the output of my predictions:
2018-05-01 NaN
2018-06-01 NaN
2018-07-01 NaN
2018-08-01 NaN
2018-09-01 NaN
2018-10-01 NaN
2018-11-01 NaN
2018-12-01 NaN
2019-01-01 NaN
2019-02-01 NaN
2019-03-01 NaN
2019-04-01 NaN
Freq: MS, Name: HW M1045 White Forecast, dtype: float64
Can someone please point out what I may have missed? This seems to be a simple problem with a simple solution, but it's kicking my butt.
Thanks!
The answer has something to do with the seasonal_periods variable being set to 12. If this is updated to 6 then the predictions yield actual values. I'm not a stats expert in Exponential Smoothing to understand why this is the case.
Reason:
Your training data contained some NaNs, so it was unable to model nor forecast.
See the non-null values count for each column, it is not the same.
df.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 48 entries, 2015-05-01 to 2019-04-01
Freq: MS
Data columns (total 7 columns):
Sales 48 non-null int64
EWMA12 48 non-null float64
SES12 47 non-null float64
DESadd12 47 non-null float64
DESmul12 47 non-null float64
TESadd12 48 non-null float64
TESmul12 12 non-null float64
dtypes: float64(6), int64(1)
memory usage: 3.0 KB
Check if there are any missing values in dataframe
df.isnull().sum()
Solution:
In your case, missing value treatment is needed before training the model.
Thanks all. My but there was a few blank cells, and N/A within my dataset that caused my code to throw me this error. My mistake not doing a better job with data cleaning. As well, I ensured my dates where formatted correctly and sales data should be integer.

Plotting 2 variables in Python from an Excel file

I am a Python beginner. I want to start learning Python with plotting.
I would really appreciate if someome can help me write a script to plot an Excel file with 2 variables (velocity, and direction) below:
Date Velocity Direction
3/12/2011 0:00 1.0964352 10
3/12/2011 0:30 1.1184975 15
3/12/2011 1:00 0.48979592 20
3/12/2011 1:30 0.13188942 45
Prepare the data
import pandas as pd
from io import StringIO
data = '''\
Date Velocity Direction
3/12/2011 0:00 1.0964352 10
3/12/2011 0:30 1.1184975 15
3/12/2011 1:00 0.48979592 20
3/12/2011 1:30 0.13188942 45
'''
df = pd.read_csv(StringIO(data), sep=r'\s{2,}', parse_dates=[0], dayfirst=True)
I use a trick here. Because the Dates in the Date column contain time elements, that are separated by a single whitespace, I separate columns by two or more whitespaces. This is why I give the separator as a regex sep=r'\s{2,}'. But of course in a CSV columns are normally separated by a comma which makes things easier (sep=',' which is the default setting).
Note that the Date column has been parsed as dates. Its column type is datetime64.
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 3 columns):
Date 4 non-null datetime64[ns]
Velocity 4 non-null float64
Direction 4 non-null int64
dtypes: datetime64[ns](1), float64(1), int64(1)
memory usage: 176.0 bytes
By setting the Date column as the index plotting the data is simple:
df.set_index('Date').plot()
This will result in a line plot where both velocity and direction are plotted for each timestamp.

Pandas: Group by year and plot density

I have a data frame that contains some time based data:
>>> temp.groupby(pd.TimeGrouper('AS'))['INC_RANK'].mean()
date
2001-01-01 0.567128
2002-01-01 0.581349
2003-01-01 0.556646
2004-01-01 0.549128
2005-01-01 NaN
2006-01-01 0.536796
2007-01-01 0.513109
2008-01-01 0.525859
2009-01-01 0.530433
2010-01-01 0.499250
2011-01-01 0.488159
2012-01-01 0.493405
2013-01-01 0.530207
Freq: AS-JAN, Name: INC_RANK, dtype: float64
And now I would like to plot the density for each year. The following command used to work for other data frames, but it is not here:
>>> temp.groupby(pd.TimeGrouper('AS'))['INC_RANK'].plot(kind='density')
ValueError: ordinal must be >= 1
Here's how that column looks like:
>>> temp['INC_RANK'].head()
date
2001-01-01 0.516016
2001-01-01 0.636038
2001-01-01 0.959501
2001-01-01 NaN
2001-01-01 0.433824
Name: INC_RANK, dtype: float64
I think it is due to the nan in your data, as density can not be estimated for nans. However, since you want to visualize density, it should not be a big issue to simply just drop the missing values, assuming the missing/unobserved cells should follow the same distribution as the observed/non-missing cells. Therefore, df.dropna().groupby(pd.TimeGrouper('AS'))['INC_RANK'].plot(kind='density') should suffice.
On the other hand, if the missing values are not 'unobserved', but rather are the values out of the measuring range (say data from a temperature sensor, which reads 0~50F, but sometimes, 100F temperate is encountered. Sensor sends out a error code and recorded as missing value), then dropna() probably is not a good idea.

Python: Matplotlib avoid plotting gaps

I am currently generating the plot below:
with this code:
ax = plt.subplots()
ax.plot(intra.to_pydatetime(), data)
plt.title('Intraday Net Spillover')
fig.autofmt_xdate()
where intra.to_pydatetime() is a:
<bound method DatetimeIndex.to_pydatetime of <class 'pandas.tseries.index.DatetimeIndex'>
[2011-01-03 09:35:00, ..., 2011-01-07 16:00:00]
Length: 390, Freq: None, Timezone: None>
So the dates go from 2011-01-03 09:35:00, increments by 5 minutes until 16:00:00, and then jumps to the next day, 2011-01-04 09:35:00 until 2011-01-04 16:00:00, and so on.
How can I avoid plotting the gaps between 16:00:00 and 9:30:00 on the following day? I don't want to see these straight lines.
UPDATE:
I will try this to see if it works.
Simply set the two values defining the line you don't want to see as NaN (Not a Number). Matplotlib will hide the line between the two values automatically.
Check out this example :
http://matplotlib.org/examples/pylab_examples/nan_test.html
Try to resample your dataframe.
For example :
df.plot()
gives me that result :
plot
and now with resample:
df = df.resample('H').first().fillna(value=np.nan)
plot after resample

Transforming financial data from postgres to pandas dataframe for use with Zipline

I'm new to Pandas and Zipline, and I'm trying to learn how to use them (and use them with this data that I have). Any sorts of tips, even if no full solution, would be much appreciated. I have tried a number of things, and have gotten quite close, but run into indexing issues, Exception: Reindexing only valid with uniquely valued Index objects, in particular. [Pandas 0.10.0, Python 2.7]
I'm trying to transform monthly returns data I have for thousands of stocks in postgres from the form:
ticker_symbol :: String, monthly_return :: Float, date :: Timestamp
e.g.
AAPL, 0.112, 28/2/1992
GS, 0.13, 30/11/1981
GS, -0.23, 22/12/1981
NB: The frequency of the reporting is monthly, but there is going to be considerable NaN data here, as not all of the over 6000 companies I have here are going to be around at the same time.
…to the form described below, which is what Zipline needs to run its backtester. (I think. Can Zipline's backtester work with monthly data like this, easily? I know it can, but any tips for doing this?)
The below is a DataFrame (of timeseries? How do you say this?), in the format I need:
> data:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2268 entries, 1993-01-04 00:00:00+00:00 to 2001-12-31 00:00:00+00:00
Data columns:
AA 2268 non-null values
AAPL 2268 non-null values
GE 2268 non-null values
IBM 2268 non-null values
JNJ 2268 non-null values
KO 2268 non-null values
MSFT 2268 non-null values
PEP 2268 non-null values
SPX 2268 non-null values
XOM 2268 non-null values
dtypes: float64(10)
The below is a TimeSeries, and is in the format I need.
> data.AAPL:
Date
1993-01-04 00:00:00+00:00 73.00
1993-01-05 00:00:00+00:00 73.12
...
2001-12-28 00:00:00+00:00 36.15
2001-12-31 00:00:00+00:00 35.55
Name: AAPL, Length: 2268
Note, there isn't return data here, but prices instead. They're adjusted (by Zipline's load_from_yahoo—though, from reading the source, really by functions in pandas) for dividends, splits, etc, so there's an isomorphism (less the initial price) between that and my return data (so, no problem here).
(EDIT: Let me know if you'd like me to write what I have, or attach my iPython notebook or a gist; I just doubt it'd be helpful, but I can absolutely do it if requested.)
I suspect you are trying to set the date as the index too early. My suggestion would be to first set_index as date and company name, then you can unstack the company name and resample.
Something like this:
In [11]: df1
Out[11]:
ticker_symbol monthly_return date
0 AAPL 0.112 1992-02-28 00:00:00
1 GS 0.130 1981-11-30 00:00:00
2 GS -0.230 1981-12-22 00:00:00
df2 = df2.set_index(['date','ticker_symbol'])
df3 = df2.unstack(level=1)
df4 = df.resample('M')
In [14]: df2
Out[14]:
monthly_return
date ticker_symbol
1992-02-28 AAPL 0.112
1981-11-30 GS 0.130
1981-12-22 GS -0.230
In [15]: df3
Out[15]:
monthly_return
ticker_symbol AAPL GS
date
1981-11-30 NaN 0.13
1981-12-22 NaN -0.23
1992-02-28 0.112 NaN
In [16]: df4
Out[16]:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 124 entries, 1981-11-30 00:00:00 to 1992-02-29 00:00:00
Freq: M
Data columns:
(monthly_return, AAPL) 1 non-null values
(monthly_return, GS) 2 non-null values
dtypes: float64(2)

Categories