Using a formula with Statsmodels of scipy - Python

I am building a straightforward regression model with statsmodels using a formula but I am getting an error which I do not understand.
For a reproducible example my dataframe is Prices:
Prices.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 25 entries, 2012-06-30 to 2018-06-30
Freq: Q-DEC
Data columns (total 4 columns):
FB 25 non-null float64
GOOG 25 non-null float64
GDP_growth 25 non-null float64
GDP_growth_shifted 24 non-null float64
dtypes: float64(4)
memory usage: 1000.0 bytes
Prices.to_json()
'{"FB":{"1341014400000":35.26,"1348963200000":31.9133333333,"1356912000000":24.76,"1364688000000":25.43,"1372550400000":27.6966666667,"1380499200000":29.8733333333,"1388448000000":42.2733333333,"1396224000000":54.78,"1404086400000":61.9233333333,"1412035200000":64.54,"1419984000000":72.7233333333,"1427760000000":77.42,"1435622400000":80.87,"1443571200000":83.9333333333,"1451520000000":90.2466666667,"1459382400000":100.8266666667,"1467244800000":106.8833333333,"1475193600000":116.6166666667,"1483142400000":122.8233333333,"1490918400000":125.43,"1498780800000":135.3733333333,"1506729600000":152.3233333333,"1514678400000":166.5233333333,"1522454400000":174.9033333333,"1530316800000":172.17},"GOOG":{"1341014400000":306.6033333333,"1348963200000":312.19,"1356912000000":338.7,"1364688000000":363.36,"1372550400000":396.7433333333,"1380499200000":428.11,"1388448000000":459.26,"1396224000000":520.9766666667,"1404086400000":568.72,"1412035200000":556.0,"1419984000000":565.0566666667,"1427760000000":535.56,"1435622400000":536.8866666667,"1443571200000":560.1466666667,"1451520000000":627.9,"1459382400000":709.6133333333,"1467244800000":738.7766666667,"1475193600000":743.08,"1483142400000":771.7733333333,"1490918400000":801.2033333333,"1498780800000":843.9766666667,"1506729600000":915.2533333333,"1514678400000":964.72,"1522454400000":1036.6966666667,"1530316800000":1083.77},"GDP_growth":{"1341014400000":1.7,"1348963200000":0.5,"1356912000000":0.5,"1364688000000":3.6,"1372550400000":0.5,"1380499200000":3.2,"1388448000000":3.2,"1396224000000":-1.0,"1404086400000":5.1,"1412035200000":4.9,"1419984000000":1.9,"1427760000000":3.3,"1435622400000":3.3,"1443571200000":1.0,"1451520000000":0.4,"1459382400000":1.5,"1467244800000":2.3,"1475193600000":1.9,"1483142400000":1.8,"1490918400000":1.8,"1498780800000":3.0,"1506729600000":2.8,"1514678400000":2.3,"1522454400000":2.2,"1530316800000":4.1},"GDP_growth_shifted":{"1341014400000":0.5,"1348963200000":0.5,"1356912000000":3.6,"1364688000000":0.5,"1372550400000":3.2,"1380499200000":3.2,"1388448000000":-1.0,"1396224000000":5.1,"1404086400000":4.9,"1412035200000":1.9,"1419984000000":3.3,"1427760000000":3.3,"1435622400000":1.0,"1443571200000":0.4,"1451520000000":1.5,"1459382400000":2.3,"1467244800000":1.9,"1475193600000":1.8,"1483142400000":1.8,"1490918400000":3.0,"1498780800000":2.8,"1506729600000":2.3,"1514678400000":2.2,"1522454400000":4.1,"1530316800000":null}}'
My code is:
import statsmodels.formula.api as sm
from statsmodels.formula.api import ols
result = sm.ols(formula="FB ~ GOOG + GDP_growth", data=Price.tail(-1)).fit()
PatsyError: Error evaluating factor: NameError: name 'GDP_growth' is not defined
FB ~ GOOG + GDP_growth
^^^^^^^^^^
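A possible first check, offered as a sketch rather than a confirmed diagnosis: patsy looks up each name in the formula in the object passed as data, and then in the calling namespace, so a NameError for GDP_growth suggests the frame actually passed may not have that column. The frame below is a hypothetical stand-in (the question defines Prices but passes Price), and formula_terms is typed out by hand for illustration.

```python
import pandas as pd

# Hypothetical stand-in for whatever object is passed as `data=`.
# It deliberately lacks a GDP_growth column to reproduce the symptom.
frame = pd.DataFrame({"FB": [35.26, 31.91], "GOOG": [306.60, 312.19]})

# Names used in the formula "FB ~ GOOG + GDP_growth" (listed by hand here).
formula_terms = ["FB", "GOOG", "GDP_growth"]

# Any name reported here cannot be resolved from the frame's columns.
missing = [name for name in formula_terms if name not in frame.columns]
print(missing)
```

If the list is non-empty, comparing frame.columns with Prices.columns would show whether the wrong object was passed.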

Related

Why are my forecast predictions coming out as NaN?

My problem is pretty simple, and I know I'm missing something very obvious, I just can't figure out what it is....
My test predictions for Holt-Winters are coming out as NaN and I can't figure out why. Can anyone help on this?
I'm using a Jupyter Notebook, and trying to forecast sales of one SKU using Holt-Winters method. I even went as far as using
Here is the code I used:
# Import the libraries needed to execute Holt-Winters
import pandas as pd
import numpy as np
%matplotlib inline
# Read the data and set the Month column as the index column
df = pd.read_csv('../Data/M1045_White.csv',index_col='Month',parse_dates=True)
# Declare the index frequency as month start
df.index.freq = 'MS'
df.index
df.head()
df.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 48 entries, 2015-05-01 to 2019-04-01
Freq: MS
Data columns (total 7 columns):
Sales 48 non-null int64
EWMA12 48 non-null float64
SES12 47 non-null float64
DESadd12 47 non-null float64
DESmul12 47 non-null float64
TESadd12 48 non-null float64
TESmul12 12 non-null float64
dtypes: float64(6), int64(1)
memory usage: 3.0 KB
from statsmodels.tsa.holtwinters import ExponentialSmoothing
# Train Test Split
train_data = df.iloc[:36] # Goes up to but not including 36
test_data = df.iloc[36:]
# Fit the Model
fitted_model = ExponentialSmoothing(train_data['Sales'],trend='mul',seasonal='mul',seasonal_periods=12).fit()
test_predictions = fitted_model.forecast(12).rename('HW M1045 White Forecast')
test_predictions
Here is the output of my predictions:
2018-05-01 NaN
2018-06-01 NaN
2018-07-01 NaN
2018-08-01 NaN
2018-09-01 NaN
2018-10-01 NaN
2018-11-01 NaN
2018-12-01 NaN
2019-01-01 NaN
2019-02-01 NaN
2019-03-01 NaN
2019-04-01 NaN
Freq: MS, Name: HW M1045 White Forecast, dtype: float64
Can someone please point out what I may have missed? This seems to be a simple problem with a simple solution, but it's kicking my butt.
Thanks!
The answer has something to do with the seasonal_periods variable being set to 12. If this is changed to 6, the predictions yield actual values. I don't know enough about exponential smoothing to understand why this is the case.
Reason:
Your training data contained some NaNs, so the model could neither be fitted nor produce a forecast.
Look at the non-null count for each column: it is not the same across columns.
df.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 48 entries, 2015-05-01 to 2019-04-01
Freq: MS
Data columns (total 7 columns):
Sales 48 non-null int64
EWMA12 48 non-null float64
SES12 47 non-null float64
DESadd12 47 non-null float64
DESmul12 47 non-null float64
TESadd12 48 non-null float64
TESmul12 12 non-null float64
dtypes: float64(6), int64(1)
memory usage: 3.0 KB
Check whether there are any missing values in the dataframe:
df.isnull().sum()
Solution:
In your case, missing value treatment is needed before training the model.
Thanks, all. My bad: there were a few blank cells and N/A values in my dataset that caused this error. It was my mistake for not doing a better job with data cleaning. I also made sure my dates were formatted correctly and that the sales data are integers.
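The check-then-treat steps described in the answer can be sketched as follows. The series here is made-up illustration data, not the asker's file, and linear interpolation is only one assumed choice of missing-value treatment; dropping or forward-filling may suit other data better.

```python
import numpy as np
import pandas as pd

# Made-up monthly sales with two gaps, standing in for the real CSV.
idx = pd.date_range("2015-05-01", periods=6, freq="MS")
sales = pd.Series([10.0, np.nan, 14.0, 15.0, np.nan, 19.0], index=idx, name="Sales")

# Step 1: count missing values before fitting anything.
n_missing = int(sales.isnull().sum())
print(n_missing)

# Step 2: treat them. Linear interpolation fills each gap from its neighbours.
cleaned = sales.interpolate(method="linear")
print(cleaned)
```

Only once cleaned contains no NaN would it be passed to the smoothing model.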

Pandas dataframe adding zero-padding before the datetime

I'm using a pandas DataFrame, and I have a DataFrame df like the following:
time id
-------------
5:13:40 1
16:20:59 2
...
In the first row, the time 5:13:40 has no leading zero, and I want to convert it to 05:13:40. So my expected df would look like:
time id
-------------
05:13:40 1
16:20:59 2
...
The type of time is <class 'datetime.timedelta'>. Could anyone give me some hints on how to handle this? Thanks so much!
Use pd.to_timedelta:
df['time'] = pd.to_timedelta(df['time'])
Before:
print(df)
time id
1 5:13:40 1.0
2 16:20:59 2.0
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2 entries, 1 to 2
Data columns (total 2 columns):
time 2 non-null object
id 2 non-null float64
dtypes: float64(1), object(1)
memory usage: 48.0+ bytes
After:
print(df)
time id
1 05:13:40 1.0
2 16:20:59 2.0
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2 entries, 1 to 2
Data columns (total 2 columns):
time 2 non-null timedelta64[ns]
id 2 non-null float64
dtypes: float64(1), timedelta64[ns](1)
memory usage: 48.0 bytes
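If a zero-padded string is wanted regardless of how a given pandas version prints timedeltas, one option is to format the value explicitly. This is a minimal sketch: format_hms and the time_str column are hypothetical names introduced here, and the helper assumes non-negative durations.

```python
import pandas as pd

def format_hms(td):
    """Format a non-negative timedelta as zero-padded HH:MM:SS."""
    total = int(td.total_seconds())
    hours, remainder = divmod(total, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"

df = pd.DataFrame({"time": ["5:13:40", "16:20:59"], "id": [1, 2]})

# Convert the strings to timedelta64[ns], as in the answer above.
df["time"] = pd.to_timedelta(df["time"])

# Keep a separate string column with explicit zero padding.
df["time_str"] = df["time"].apply(format_hms)
print(df["time_str"].tolist())
```

The same helper also accepts a plain datetime.timedelta, since it only relies on total_seconds().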

Tensorflow SKCompat converting float32 values in Pandas dataframe to tf.float64 values, and then giving an error

I am using the following Pandas dataframe as the training input for an SKCompat estimator:
>>> training_data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 8709 entries, 4396 to 1889
Data columns (total 8 columns):
season 8709 non-null int64
holiday 8709 non-null int64
workingday 8709 non-null int64
weather 8709 non-null int64
temp 8709 non-null float32
atemp 8709 non-null float32
humidity 8709 non-null int64
windspeed 8709 non-null float32
At some point in the tensorflow code it passes the dataframe through the function:
tensorflow.contrib.learn.python.learn.learn_io.pandas_io.extract_pandas_data.
This seems to lose the dtype information and go back to float64
>>> x_training = extract_pandas_data(x_training)
>>> x_training.dtype
{dtype} float64
Further on I then get the following exception, as the floats have been converted to float64:
TypeError: Input 'input_data' of 'TreePredictions' Op has type float64 that does not match expected type of float32.
I have seen a few examples of people using tf.cast to get around this issue, but I don't understand how to apply it to my use case. What do I need to do to this Pandas DataFrame to make it work with the TensorForestEstimator?
Many thanks,
Mark
Code example, with "tf.cast" fix:
def stackoverflow_example(x_training: pd.DataFrame, y_training: pd.DataFrame):
    params = tensor_forest.ForestHParams(
        num_classes=1, num_features=5,
        num_trees=10, max_nodes=1000)
    graph_builder_class = tensor_forest.TrainingLossForest
    est = estimator.SKCompat(random_forest.TensorForestEstimator(
        params, graph_builder_class=graph_builder_class))
    x_training = tf.cast(x_training.drop('datetime', 1), tf.float32)
    est.fit(x_training, y_training, batch_size=1000)
With the cast, this code raises the following exception:
ValueError: Inputs cannot be tensors. Please provide input_fn.
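The dtype loss described above can be reproduced with pandas alone: a frame mixing int64 and float32 columns is upcast to float64 when turned into a single array. The sketch below uses made-up values and shows one possible workaround, casting every column to float32 before conversion so that a NumPy array (not a tensor) is handed over. Whether SKCompat then accepts that array is an assumption not verified here.

```python
import numpy as np
import pandas as pd

# Made-up rows mirroring the mixed dtypes in training_data.info().
x_training = pd.DataFrame({
    "season": np.array([1, 2, 3], dtype=np.int64),
    "temp": np.array([9.84, 9.02, 13.12], dtype=np.float32),
})

# Mixed int64/float32 columns share float64 as their common dtype.
mixed = x_training.values
print(mixed.dtype)

# Casting the whole frame first keeps the resulting array at float32.
as_float32 = x_training.astype(np.float32).values
print(as_float32.dtype)
```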

Can pandas read_csv use dtype and write NaN on unparsable data? [duplicate]

I am working with pandas in Python and looking at a weather CSV file. I can pull data from it with no problem. However, I am not able to select rows that meet certain criteria, such as the days on which the temperature was above 100 degrees.
I have this as my code so far:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.read_csv('csv/weather.csv')
print(df[[df.MaxTemperatureF > 100 ]])
That last line is where I think I have my problem. The error traceback that I now get, after doing the steps below, is the following:
Traceback (most recent call last):
File "weather.py", line 40, in <module>
print(df[df['MaxTemperatureF' > 100]])
TypeError: unorderable types: str() > int()
Mikes-MBP-2:dataframes mikecuddy$ python3 weather.py
Traceback (most recent call last):
File "weather.py", line 41, in <module>
print(df[[df.MaxTemperatureF > 100 ]])
File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/pandas/core/frame.py", line 1991, in __getitem__
return self._getitem_array(key)
File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/pandas/core/frame.py", line 2028, in _getitem_array
(len(key), len(self.index)))
ValueError: Item wrong length 1 instead of 360.
I have been following a tutorial at http://www.gregreda.com/2013/10/26/working-with-pandas-dataframes/. Again, any help would be great! Thank you!
df.info() information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 360 entries, 0 to 359
Data columns (total 23 columns):
PST 360 non-null object
MaxTemperatureF 359 non-null float64
Mean TemperatureF 359 non-null float64
Min TemperatureF 359 non-null float64
Max Dew PointF 359 non-null float64
MeanDew PointF 359 non-null float64
Min DewpointF 359 non-null float64
Max Humidity 359 non-null float64
Mean Humidity 359 non-null float64
Min Humidity 359 non-null float64
Max Sea Level PressureIn 359 non-null float64
Mean Sea Level PressureIn 359 non-null float64
Min Sea Level PressureIn 359 non-null float64
Max VisibilityMiles 355 non-null float64
Mean VisibilityMiles 355 non-null float64
Min VisibilityMiles 355 non-null float64
Max Wind SpeedMPH 359 non-null float64
Mean Wind SpeedMPH 359 non-null float64
Max Gust SpeedMPH 211 non-null float64
PrecipitationIn 360 non-null float64
CloudCover 343 non-null float64
Events 18 non-null object
WindDirDegrees 360 non-null int64
dtypes: float64(20), int64(1), object(2)
memory usage: 64.8+ KB
None
For the max temperature you can specify a converter function:
df = pd.read_csv('csv/weather.csv', converters={'MaxTemperatureF':float})
Edit: as @ptrj mentions in a comment, you can substitute np.nan for unparsable string values in the MaxTemperatureF column. In pseudocode (a lambda cannot actually contain try/except, so this is not valid Python as written):
df = pd.read_csv('csv/weather.csv',
                 converters={'MaxTemperatureF':
                             lambda x: try: return float(x);
                                       except ValueError: return np.nan;})
Edit2: @ptrj's solution, since he can't write it up in a comment...
def my_conv(x):
    try:
        return float(x)
    except ValueError:
        return np.nan
df = pd.read_csv('csv/weather.csv', converters={'MaxTemperatureF': my_conv})
Other things:
If the first row of the csv file has the headers then don't pass header=0.
Since the file already provides the header, you don't need to specify cols=...
The default sep is ',' so you don't need to specify that.
Try this: you have '()' instead of '[]'.
print(df[df.MaxTemperatureF.astype(float) > 100 ])
notes:
df.isnull().sum()
df.dropna()
df.fillna(0)
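An alternative to a converter, shown here as a sketch on made-up inline data rather than the asker's file, is to read the column as-is and coerce it afterwards with pd.to_numeric(errors='coerce'), which turns unparsable entries into NaN. The boolean mask is then passed inside a single pair of brackets.

```python
import io
import pandas as pd

# Made-up CSV with one unparsable temperature, standing in for weather.csv.
csv_text = (
    "PST,MaxTemperatureF\n"
    "2016-06-01,98\n"
    "2016-06-02,unknown\n"
    "2016-06-03,104\n"
)
df = pd.read_csv(io.StringIO(csv_text))

# Strings that cannot be parsed as numbers become NaN instead of raising.
df["MaxTemperatureF"] = pd.to_numeric(df["MaxTemperatureF"], errors="coerce")

# One pair of brackets: the mask has one boolean per row.
hot = df[df["MaxTemperatureF"] > 100]
print(hot)
```

Comparisons with NaN evaluate to False, so the unparsable row is simply left out of the result.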

Pandas - Behaviour of DateTime x-axis when using secondary y-axis not as expected (what am I doing wrong?)

I'm trying to plot two dataframes over each other, both with a DatetimeIndex, using a secondary y-axis. First, here is how I load the data:
import pandas as pd
df1 = pd.read_csv('SmartIce_20140927_all_voltage.csv', encoding='latin1', parse_dates=['DateTime'], index_col='DateTime')
df2 = pd.read_csv('SmartIce_20140927_temperature.csv', encoding='latin1', parse_dates=['UTC_Time'], index_col='UTC_Time')
And check the data output:
In [7]: df1.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 10302 entries, 2014-09-27 16:58:54 to 2014-09-29 11:56:20
Data columns (total 5 columns):
DLPIO20_AIN0 10302 non-null float64
DLPIO20_AIN1 10302 non-null float64
DLPIO20_AIN2 10302 non-null float64
DLPIO20_AIN3 10302 non-null float64
DLPIO20_AIN4 10302 non-null float64
dtypes: float64(5)
In [8]: df1.head()
Out[8]:
DLPIO20_AIN0 DLPIO20_AIN1 DLPIO20_AIN2 DLPIO20_AIN3 \
DateTime
2014-09-27 16:58:54 0.004883 3.642578 3.696289 4.980469
2014-09-27 16:59:09 0.004883 3.637695 3.637695 4.985352
In [12]: df2.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2580 entries, 2014-09-27 16:53:00 to 2014-09-29 11:52:00
Data columns (total 3 columns):
Sample 2580 non-null int64
Temp 2580 non-null float64
DateTime 2580 non-null object
dtypes: float64(1), int64(1), object(1)
In [14]: df2.head()
Out[14]:
Sample Temp DateTime
UTC_Time
2014-09-27 16:53:00 1 -15.44 9/27/2014 14:23
2014-09-27 16:54:00 2 -14.61 9/27/2014 14:24
Now when I try to plot:
df1.DLPIO20_AIN4.plot()
df2.Temp.plot(secondary_y=True, style='g')
I get two images (I can't attach images because I need ten reputation). Image one has a time axis that shows only hours (formatted, for example, as 18:00:00 on a diagonal). Image two, which I wasn't expecting, has a time axis formatted as hours with the day underneath (which I prefer). I was expecting to get one plot laid over the other. I've played around with various things, but I'm not sure what I should be doing to fix it, nor how to proceed. I believe the DatetimeIndexes are identical, or at least I understand I have set them up like that.
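One way to make the overlay explicit, offered as a sketch and not as a diagnosis of the behaviour above, is to create the axes with matplotlib directly and attach a twin y-axis, so both series are guaranteed to land in the same figure. The data below is synthetic; only the column names DLPIO20_AIN4 and Temp come from the question, and the Agg backend is an assumption made so the example runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, assumed for headless runs
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-ins with different sampling intervals, as in the question.
idx1 = pd.date_range("2014-09-27 16:58:54", periods=8, freq=pd.Timedelta(seconds=15))
idx2 = pd.date_range("2014-09-27 16:53:00", periods=8, freq=pd.Timedelta(minutes=1))
df1 = pd.DataFrame({"DLPIO20_AIN4": np.linspace(4.9, 5.0, 8)}, index=idx1)
df2 = pd.DataFrame({"Temp": np.linspace(-15.4, -14.0, 8)}, index=idx2)

fig, ax_left = plt.subplots()
ax_left.plot(df1.index, df1["DLPIO20_AIN4"])
ax_left.set_ylabel("DLPIO20_AIN4")

# twinx() creates a second y-axis that shares the same x-axis.
ax_right = ax_left.twinx()
ax_right.plot(df2.index, df2["Temp"], "g")
ax_right.set_ylabel("Temp")

fig.autofmt_xdate()  # rotate the date labels so they do not overlap
```

Because both lines are drawn through explicit axes objects, the result does not depend on which axes pandas considers current.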
