Using a formula with statsmodels - Python
I am building a straightforward regression model with statsmodels using a formula, but I am getting an error that I do not understand.
For a reproducible example, my dataframe Prices is:
Prices.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 25 entries, 2012-06-30 to 2018-06-30
Freq: Q-DEC
Data columns (total 4 columns):
FB 25 non-null float64
GOOG 25 non-null float64
GDP_growth 25 non-null float64
GDP_growth_shifted 24 non-null float64
dtypes: float64(4)
memory usage: 1000.0 bytes
Prices.to_json()
'{"FB":{"1341014400000":35.26,"1348963200000":31.9133333333,"1356912000000":24.76,"1364688000000":25.43,"1372550400000":27.6966666667,"1380499200000":29.8733333333,"1388448000000":42.2733333333,"1396224000000":54.78,"1404086400000":61.9233333333,"1412035200000":64.54,"1419984000000":72.7233333333,"1427760000000":77.42,"1435622400000":80.87,"1443571200000":83.9333333333,"1451520000000":90.2466666667,"1459382400000":100.8266666667,"1467244800000":106.8833333333,"1475193600000":116.6166666667,"1483142400000":122.8233333333,"1490918400000":125.43,"1498780800000":135.3733333333,"1506729600000":152.3233333333,"1514678400000":166.5233333333,"1522454400000":174.9033333333,"1530316800000":172.17},"GOOG":{"1341014400000":306.6033333333,"1348963200000":312.19,"1356912000000":338.7,"1364688000000":363.36,"1372550400000":396.7433333333,"1380499200000":428.11,"1388448000000":459.26,"1396224000000":520.9766666667,"1404086400000":568.72,"1412035200000":556.0,"1419984000000":565.0566666667,"1427760000000":535.56,"1435622400000":536.8866666667,"1443571200000":560.1466666667,"1451520000000":627.9,"1459382400000":709.6133333333,"1467244800000":738.7766666667,"1475193600000":743.08,"1483142400000":771.7733333333,"1490918400000":801.2033333333,"1498780800000":843.9766666667,"1506729600000":915.2533333333,"1514678400000":964.72,"1522454400000":1036.6966666667,"1530316800000":1083.77},"GDP_growth":{"1341014400000":1.7,"1348963200000":0.5,"1356912000000":0.5,"1364688000000":3.6,"1372550400000":0.5,"1380499200000":3.2,"1388448000000":3.2,"1396224000000":-1.0,"1404086400000":5.1,"1412035200000":4.9,"1419984000000":1.9,"1427760000000":3.3,"1435622400000":3.3,"1443571200000":1.0,"1451520000000":0.4,"1459382400000":1.5,"1467244800000":2.3,"1475193600000":1.9,"1483142400000":1.8,"1490918400000":1.8,"1498780800000":3.0,"1506729600000":2.8,"1514678400000":2.3,"1522454400000":2.2,"1530316800000":4.1},"GDP_growth_shifted":{"1341014400000":0.5,"1348963200000":0.5,"1356912000000":3.6,"1364688000000":0.5,"1372550400000":3.2,"1380499200000":3.2,"1388448000000":-1.0,"1396224000000":5.1,"1404086400000":4.9,"1412035200000":1.9,"1419984000000":3.3,"1427760000000":3.3,"1435622400000":1.0,"1443571200000":0.4,"1451520000000":1.5,"1459382400000":2.3,"1467244800000":1.9,"1475193600000":1.8,"1483142400000":1.8,"1490918400000":3.0,"1498780800000":2.8,"1506729600000":2.3,"1514678400000":2.2,"1522454400000":4.1,"1530316800000":null}}'
My code is:
import statsmodels.formula.api as sm
from statsmodels.formula.api import ols
result = sm.ols(formula="FB ~ GOOG + GDP_growth", data=Price.tail(-1)).fit()
PatsyError: Error evaluating factor: NameError: name 'GDP_growth' is not defined
FB ~ GOOG + GDP_growth
^^^^^^^^^^
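A hedged guess at the cause, given that the reproducible frame is named Prices but the model is fitted against data=Price.tail(-1): if a different dataframe named Price exists in the session and lacks a GDP_growth column, patsy will fail to resolve that term against data= and then against the calling namespace, producing exactly this NameError. A minimal sketch of the check and the fix under that assumption:

# patsy resolves each formula term first against `data=`, then against
# the caller's namespace; a missing column surfaces as this NameError.
print(Prices.columns.tolist())   # confirm 'GDP_growth' really is a column

result = sm.ols(formula="FB ~ GOOG + GDP_growth",
                data=Prices.tail(-1)).fit()   # note: Prices, not Price
print(result.summary())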
Related
Why are my forecast predictions coming out as NaN?
My problem is pretty simple, and I know I'm missing something very obvious; I just can't figure out what it is. My test predictions for Holt-Winters are coming out as NaN and I can't figure out why. Can anyone help with this? I'm using a Jupyter Notebook, trying to forecast sales of one SKU using the Holt-Winters method. Here is the code I used:

# Import the libraries needed to execute Holt-Winters
import pandas as pd
import numpy as np
%matplotlib inline

# Set the month column as the index column
df = pd.read_csv('../Data/M1045_White.csv', index_col='Month', parse_dates=True)
df.index.freq = 'MS'
df.index
df.head()
df.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 48 entries, 2015-05-01 to 2019-04-01
Freq: MS
Data columns (total 7 columns):
Sales       48 non-null int64
EWMA12      48 non-null float64
SES12       47 non-null float64
DESadd12    47 non-null float64
DESmul12    47 non-null float64
TESadd12    48 non-null float64
TESmul12    12 non-null float64
dtypes: float64(6), int64(1)
memory usage: 3.0 KB

from statsmodels.tsa.holtwinters import ExponentialSmoothing

# Train/test split
train_data = df.iloc[:36]  # goes up to but not including 36
test_data = df.iloc[12:]

# Fit the model (ExponentialSmoothing, since SimpleExpSmoothing takes
# no trend/seasonal arguments)
fitted_model = ExponentialSmoothing(train_data['Sales'], trend='mul',
                                    seasonal='mul', seasonal_periods=12).fit()
test_predictions = fitted_model.forecast(12).rename('HW M1045 White Forecast')
test_predictions

Here is the output of my predictions:

2018-05-01   NaN
2018-06-01   NaN
2018-07-01   NaN
2018-08-01   NaN
2018-09-01   NaN
2018-10-01   NaN
2018-11-01   NaN
2018-12-01   NaN
2019-01-01   NaN
2019-02-01   NaN
2019-03-01   NaN
2019-04-01   NaN
Freq: MS, Name: HW M1045 White Forecast, dtype: float64

Can someone please point out what I may have missed? This seems to be a simple problem with a simple solution, but it's kicking my butt. Thanks!
The answer has something to do with the seasonal_periods variable being set to 12. If it is changed to 6, the predictions yield actual values. I'm not enough of a stats expert in exponential smoothing to understand why this is the case.
Reason: your training data contained some NaNs, so the model could neither fit nor forecast. See the non-null value counts for each column; they are not all the same.

df.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 48 entries, 2015-05-01 to 2019-04-01
Freq: MS
Data columns (total 7 columns):
Sales       48 non-null int64
EWMA12      48 non-null float64
SES12       47 non-null float64
DESadd12    47 non-null float64
DESmul12    47 non-null float64
TESadd12    48 non-null float64
TESmul12    12 non-null float64
dtypes: float64(6), int64(1)
memory usage: 3.0 KB

Check whether there are any missing values in the dataframe:

df.isnull().sum()

Solution: in your case, missing-value treatment is needed before training the model.
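The treatment itself could be as simple as the following sketch (interpolation is an assumption here; dropping the gaps or any other sensible imputation works just as well):

import pandas as pd

df = pd.read_csv('../Data/M1045_White.csv', index_col='Month', parse_dates=True)
df.index.freq = 'MS'

# Locate the gaps first, then fill them before fitting.
print(df.isnull().sum())
df['Sales'] = pd.to_numeric(df['Sales'], errors='coerce')  # stray text -> NaN
df['Sales'] = df['Sales'].interpolate()                    # or df.dropna()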
Thanks all. My bad: there were a few blank cells and N/A values in my dataset that caused the code to throw this error. My mistake for not doing a better job of data cleaning. I also made sure my dates were formatted correctly and that the sales data was stored as integers.
Pandas dataframe adding zero-padding before the datetime
I'm using a Pandas dataframe, and I have a dataframe df like the following:

time       id
-------------
5:13:40    1
16:20:59   2
...

For the first row, the time 5:13:40 has no zero-padding in front, and I want to convert it to 05:13:40. So my expected df would be:

time       id
-------------
05:13:40   1
16:20:59   2
...

The type of time is <class 'datetime.timedelta'>. Could anyone give me some hints on how to handle this? Thanks so much!
Use pd.to_timedelta:

df['time'] = pd.to_timedelta(df['time'])

Before:

print(df)
       time   id
1   5:13:40  1.0
2  16:20:59  2.0

df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2 entries, 1 to 2
Data columns (total 2 columns):
time    2 non-null object
id      2 non-null float64
dtypes: float64(1), object(1)
memory usage: 48.0+ bytes

After:

print(df)
       time   id
1  05:13:40  1.0
2  16:20:59  2.0

df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2 entries, 1 to 2
Data columns (total 2 columns):
time    2 non-null timedelta64[ns]
id      2 non-null float64
dtypes: float64(1), timedelta64[ns](1)
memory usage: 48.0 bytes
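If the column actually holds plain strings and only the display matters, left-padding the text is a lighter alternative (a sketch; it assumes every value has the H:MM:SS or HH:MM:SS shape, so 8 characters is the target width):

df['time'] = df['time'].astype(str).str.zfill(8)  # '5:13:40' -> '05:13:40'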
Tensorflow SKCompat converting float32 values in Pandas dataframe to tf.float64 values, and then giving an error
I am using the following Pandas dataframe as the training input for an SKCompat estimator:

>>> training_data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 8709 entries, 4396 to 1889
Data columns (total 8 columns):
season        8709 non-null int64
holiday       8709 non-null int64
workingday    8709 non-null int64
weather       8709 non-null int64
temp          8709 non-null float32
atemp         8709 non-null float32
humidity      8709 non-null int64
windspeed     8709 non-null float32

At some point the TensorFlow code passes the dataframe through the function tensorflow.contrib.learn.python.learn.learn_io.pandas_io.extract_pandas_data. This seems to lose the dtype information and fall back to float64:

>>> x_training = extract_pandas_data(x_training)
>>> x_training.dtype
{dtype} float64

Further on I then get the following exception, as the floats have been converted to float64:

TypeError: Input 'input_data' of 'TreePredictions' Op has type float64 that does not match expected type of float32.

I have seen a few examples of people using tf.cast to get around this issue, but I don't understand how to apply it to my use case. What do I need to do to this Pandas dataframe to make it work with the TensorForestEstimator? Many thanks, Mark

Code example, with the tf.cast "fix":

def stackoverflow_example(x_training: pd.DataFrame, y_training: pd.DataFrame):
    params = tensor_forest.ForestHParams(
        num_classes=1, num_features=5, num_trees=10, max_nodes=1000)
    graph_builder_class = tensor_forest.TrainingLossForest
    est = estimator.SKCompat(random_forest.TensorForestEstimator(
        params, graph_builder_class=graph_builder_class))
    x_training = tf.cast(x_training.drop('datetime', 1), tf.float32)
    est.fit(x_training, y_training, batch_size=1000)

This code returns the following exception with the cast:

ValueError: Inputs cannot be tensors. Please provide input_fn.
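The second error suggests the cast has to happen on the pandas/NumPy side rather than inside the graph: SKCompat.fit wants plain arrays, not tensors. A sketch of that approach (an assumption based on the error message, not a confirmed fix):

import numpy as np

# Cast in pandas so SKCompat still receives a plain float32 array,
# never a tf.Tensor.
x_training = x_training.drop('datetime', axis=1).astype(np.float32)
est.fit(x_training.values, y_training, batch_size=1000)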
Can pandas read_csv use dtype and write NaN on unparsable data? [duplicate]
I am working in pandas with Python, looking at a weather CSV file. I am able to pull data from it with no problem. However, I am not able to pull data that meets certain criteria, such as which days have a temperature above 100 degrees. I have this as my code so far:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

df = pd.read_csv('csv/weather.csv')
print(df[[df.MaxTemperatureF > 100 ]])

That last line is where I think I have my problem. The error traceback that I now get, after doing the steps below, is the following:

Traceback (most recent call last):
  File "weather.py", line 40, in <module>
    print(df[df['MaxTemperatureF' > 100]])
TypeError: unorderable types: str() > int()

Mikes-MBP-2:dataframes mikecuddy$ python3 weather.py
Traceback (most recent call last):
  File "weather.py", line 41, in <module>
    print(df[[df.MaxTemperatureF > 100 ]])
  File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/pandas/core/frame.py", line 1991, in __getitem__
    return self._getitem_array(key)
  File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/pandas/core/frame.py", line 2028, in _getitem_array
    (len(key), len(self.index)))
ValueError: Item wrong length 1 instead of 360.

I have been following a tutorial at http://www.gregreda.com/2013/10/26/working-with-pandas-dataframes/. Again, any help would be great! Thank you!

df.info() information:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 360 entries, 0 to 359
Data columns (total 23 columns):
PST                          360 non-null object
MaxTemperatureF              359 non-null float64
Mean TemperatureF            359 non-null float64
Min TemperatureF             359 non-null float64
Max Dew PointF               359 non-null float64
MeanDew PointF               359 non-null float64
Min DewpointF                359 non-null float64
Max Humidity                 359 non-null float64
Mean Humidity                359 non-null float64
Min Humidity                 359 non-null float64
Max Sea Level PressureIn     359 non-null float64
Mean Sea Level PressureIn    359 non-null float64
Min Sea Level PressureIn     359 non-null float64
Max VisibilityMiles          355 non-null float64
Mean VisibilityMiles         355 non-null float64
Min VisibilityMiles          355 non-null float64
Max Wind SpeedMPH            359 non-null float64
Mean Wind SpeedMPH           359 non-null float64
Max Gust SpeedMPH            211 non-null float64
PrecipitationIn              360 non-null float64
CloudCover                   343 non-null float64
Events                       18 non-null object
WindDirDegrees               360 non-null int64
dtypes: float64(20), int64(1), object(2)
memory usage: 64.8+ KB
None
For the max temperature you can specify a converter function:

df = pd.read_csv('csv/weather.csv', converters={'MaxTemperatureF': float})

Edit: as @ptrj mentions in a comment, you can substitute np.nan for string values in the MaxTemperatureF column. A lambda cannot contain a try/except, so use @ptrj's solution, written out here since he can't format it in a comment:

def my_conv(x):
    try:
        return float(x)
    except ValueError:
        return np.nan

df = pd.read_csv('csv/weather.csv', converters={'MaxTemperatureF': my_conv})

Other things: if the first row of the CSV file has the headers, don't pass header=0. Since you already have the header, you don't need to specify cols=... The default sep is ',', so you don't need to specify that either.
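To address the title question more directly: a converter isn't strictly required. Reading the column as-is and coercing afterwards gives the same NaN-on-unparsable behaviour (a sketch, assuming a pandas version new enough to have pd.to_numeric):

df = pd.read_csv('csv/weather.csv')
# Anything that can't be parsed as a number becomes NaN.
df['MaxTemperatureF'] = pd.to_numeric(df['MaxTemperatureF'], errors='coerce')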
Try this; you have an extra pair of brackets ([[...]] where it should be [...]):

print(df[df.MaxTemperatureF.astype(float) > 100])

Notes, for dealing with the missing values:

df.isnull().sum()
df.dropna()
df.fillna(0)
Pandas - Behaviour of DateTime x-axis when using secondary y-axis not as expected (what am I doing wrong?)
I'm trying to plot two dataframes over each other, both with a DatetimeIndex, using a secondary y-axis. First, how I load the data:

import pandas as pd
df1 = pd.read_csv('SmartIce_20140927_all_voltage.csv', encoding='latin1',
                  parse_dates=['DateTime'], index_col='DateTime')
df2 = pd.read_csv('SmartIce_20140927_temperature.csv', encoding='latin1',
                  parse_dates=['UTC_Time'], index_col='UTC_Time')

And check the data output:

In [7]: df1.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 10302 entries, 2014-09-27 16:58:54 to 2014-09-29 11:56:20
Data columns (total 5 columns):
DLPIO20_AIN0    10302 non-null float64
DLPIO20_AIN1    10302 non-null float64
DLPIO20_AIN2    10302 non-null float64
DLPIO20_AIN3    10302 non-null float64
DLPIO20_AIN4    10302 non-null float64
dtypes: float64(5)

In [8]: df1.head()
Out[8]:
                     DLPIO20_AIN0  DLPIO20_AIN1  DLPIO20_AIN2  DLPIO20_AIN3  \
DateTime
2014-09-27 16:58:54      0.004883      3.642578      3.696289      4.980469
2014-09-27 16:59:09      0.004883      3.637695      3.637695      4.985352

In [12]: df2.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2580 entries, 2014-09-27 16:53:00 to 2014-09-29 11:52:00
Data columns (total 3 columns):
Sample      2580 non-null int64
Temp        2580 non-null float64
DateTime    2580 non-null object
dtypes: float64(1), int64(1), object(1)

In [14]: df2.head()
Out[14]:
                     Sample   Temp         DateTime
UTC_Time
2014-09-27 16:53:00       1 -15.44  9/27/2014 14:23
2014-09-27 16:54:00       2 -14.61  9/27/2014 14:24

Now when I try to plot:

df1.DLPIO20_AIN4.plot()
df2.Temp.plot(secondary_y=True, style='g')

I get two images (I can't attach them because I need ten reputation). Image one has a time axis that is just hours (formatted, for example, as 18:00:00 at a diagonal). Image two, which I wasn't expecting, has a time axis formatted as hours with the day underneath (which I prefer). I was expecting one plot laid over the other. I've played around with various things, but I'm not sure what I should be doing to fix it, nor how to proceed. I believe the DatetimeIndexes are identical, or at least I understand that I have set them up that way.
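One plausible reading of the symptom is that the two plot calls are landing on different Axes (in some environments each call opens a new figure). A sketch of the usual remedy, passing the first Axes to the second call explicitly (an assumption, not a confirmed diagnosis):

import matplotlib.pyplot as plt

# Draw both series on one Axes; secondary_y still gives Temp its own scale.
ax = df1.DLPIO20_AIN4.plot()
df2.Temp.plot(ax=ax, secondary_y=True, style='g')
plt.show()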