Vector autoregressive (VAR) model fitting with different lag operator - python

I am a Master 2 student in computational neuroscience.
I'm at the very end of my analysis and I have a problem with the application of a VAR model (vector autoregressive model).
It is a rather complex problem to solve, and it concerns testing different lag orders on the data. For me, the problem arises when I try to compute the Cholesky factorization of a covariance matrix with negative numbers:
I may have found a solution but I can't include it in the python function that deploys the model ("VAR"). If someone has ten minutes to help me, please write me. Thanks for your attention :)
import numpy as np
import pandas as pd
from statsmodels.tsa.api import VAR
for i in [1,2,3,4,6,8,9,10,12,13,14,15,16,17,18,19,20]:
    print(i)
    df_entropie_G1_w_diff = df_entropie_G1_w.iloc[i,2145:].diff()
    df_RMSE_G1_w_diff = df_g1_RMSE_w.iloc[i,2145:].diff()
    df_var_G1_w_diff = df_var_G1_w.iloc[i,2145:].diff()
    df_data = pd.concat([df_entropie_G1_w_diff,df_RMSE_G1_w_diff,df_var_G1_w_diff],axis = 1)
    df_data = df_data.diff().dropna()
    df_data = df_data.T
    df_data = df_data.reset_index()
    del df_data['index']
    df_data = df_data.T
    df_data['Time'] = pd.to_timedelta(np.arange(537), unit='s')
    df_data.index = df_data['Time']
    del df_data['Time']
    # Arrange names of columns
    df_data_T = df_data.T
    df_data_T = df_data_T.reset_index()
    del df_data_T['index']
    df_data_T = df_data_T.T
    df_data = df_data_T.rename(columns={0:'Entropie',1:'RMSE',2:'Var'})
    model = VAR(df_data)
    liste_aic = []
    liste_bic = []
    liste_fpe = []
    liste_hqic = []
    for a in range(0,25,1):
        result = model.fit(a)
        print('Lag Order =', a)
        print('AIC : ', result.aic)
        print('BIC : ', result.bic)
        print('FPE : ', result.fpe)
        print('HQIC: ', result.hqic, '\n')
        liste_aic.append(result.aic)
        liste_bic.append(result.bic)
        liste_fpe.append(result.fpe)
        liste_hqic.append(result.hqic)
1
Lag Order = 0
AIC : -59.6358271069015
BIC : -59.61188298346849
FPE : 1.260344786813777e-26
HQIC: -59.626460351200464
Lag Order = 1
/opt/anaconda3/lib/python3.8/site-packages/statsmodels/tsa/base/tsa_model.py:578: ValueWarning: An unsupported index was provided and will be ignored when e.g. forecasting.
warnings.warn('An unsupported index was provided and will be'
Traceback (most recent call last):
File "", line 139, in
print('AIC : ', result.aic)
File "/opt/anaconda3/lib/python3.8/site-packages/statsmodels/base/wrapper.py", line 34, in __getattribute__
obj = getattr(results, attr)
File "/opt/anaconda3/lib/python3.8/site-packages/statsmodels/tsa/vector_ar/var_model.py", line 2139, in aic
return self.info_criteria['aic']
File "pandas/_libs/properties.pyx", line 33, in pandas._libs.properties.CachedProperty.get
File "/opt/anaconda3/lib/python3.8/site-packages/statsmodels/tsa/vector_ar/var_model.py", line 2120, in info_criteria
ld = logdet_symm(self.sigma_u_mle)
File "/opt/anaconda3/lib/python3.8/site-packages/statsmodels/tools/linalg.py", line 28, in logdet_symm
c, _ = linalg.cho_factor(m, lower=True)
File "/opt/anaconda3/lib/python3.8/site-packages/scipy/linalg/decomp_cholesky.py", line 152, in cho_factor
c, lower = _cholesky(a, lower=lower, overwrite_a=overwrite_a, clean=False,
File "/opt/anaconda3/lib/python3.8/site-packages/scipy/linalg/decomp_cholesky.py", line 37, in _cholesky
raise LinAlgError("%d-th leading minor of the array is not positive "
LinAlgError: 3-th leading minor of the array is not positive definite
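The traceback shows that the Cholesky factorization fails on the residual covariance matrix (sigma_u_mle), which means that matrix is not numerically positive definite. One possible cause, and this is an assumption about the data rather than something the traceback confirms, is that the series are nearly collinear or on a very small scale. A minimal diagnostic sketch using only NumPy; the helper name diagnose_covariance, the tolerance, and the toy data are made up for illustration:

```python
import numpy as np

def diagnose_covariance(sigma, rel_tol=1e-10):
    """Report whether a symmetric matrix is numerically positive definite."""
    sigma = np.asarray(sigma, dtype=float)
    # eigvalsh assumes a symmetric matrix and returns eigenvalues in ascending order
    eigenvalues = np.linalg.eigvalsh(sigma)
    # slogdet does not need positive definiteness, unlike a Cholesky-based log-determinant
    sign, logdet = np.linalg.slogdet(sigma)
    return {
        "min_eigenvalue": eigenvalues[0],
        "max_eigenvalue": eigenvalues[-1],
        "positive_definite": bool(eigenvalues[0] > rel_tol * eigenvalues[-1]),
        "sign": sign,
        "logdet": logdet,
    }

# Toy data: the third column is an exact linear combination of the first two,
# so the covariance matrix is singular and a Cholesky factorization may fail.
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 2))
data = np.column_stack([x, x[:, 0] + x[:, 1]])
report = diagnose_covariance(np.cov(data, rowvar=False))
print(report)
```

Running the same check on the residual covariance of the failing fit would show whether the smallest eigenvalue is zero or negative up to rounding, which would point towards redundant or over-differenced series rather than a bug in VAR itself.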

Related

Python Length Mismatch

I'm learning python and trying to adapt a notebook someone posted on Kaggle to my current project. Unfortunately, I keep getting a "Length Mismatch: Expected Axis has 33 elements, new values have 9 elements" error.
import pandas as pd
import numpy as np
pd.set_option('display.float_format', lambda x: '{:.3f}'.format(x))
data = pd.read_csv(r'c:\python\Rent Increase History.csv')
def stats(df, pred=None):
    obs = df.shape[0]
    types = df.dtypes
    counts = df.apply(lambda x: x.count())
    uniques = df.apply(lambda x: [x.unique()])
    nulls = df.apply(lambda x: x.isnull().sum())
    distinct = df.apply(lambda x: x.unique().shape[0])
    ratio_missing = (df.isnull().sum() / obs) * 100
    skewness = df.skew()
    kurtosis = df.kurt()
    print('Data shape:', df.shape)
    if pred is None:
        cols = ['types', 'counts', 'distinct', 'nulls', 'ratio_missing', 'uniques', 'skewness', 'kurtosis']
        str = pd.concat([types, counts, distinct, nulls, ratio_missing, uniques, skewness, kurtosis], axis=1)
    else:
        corr = df.corr()[pred]
        str = pd.concat([types, counts, distinct, nulls, ratio_missing, uniques, skewness, kurtosis, corr], axis=1,
                        sort=False)
        corr_col = 'corr ' + pred
        cols = ['types', 'counts', 'distinct', 'nulls', 'ratio_missing', 'uniques', 'skewness', 'kurtosis', corr_col]
    str.columns = cols
    dtypes = str.types.value_counts()
    print('___________________________\nData types:\n', str.types.value_counts())
    print('___________________________')
    return str
StatDetails = stats(data, 'MovedOutInPeriod')
print(StatDetails.sort_values(by='corr MovedOutInPeriod', ascending=False))
From what I can tell, this function should be returning 9 columns instead of the 33 I started with by design... Why am I still getting this error?
Thanks in advance. I'm sure this is something simple that I am missing.
Update - Here's the full list of errors:
Traceback (most recent call last):
File "C:\Users\john\PycharmProjects\pythonProject\main.py", line 37, in <module>
StatDetails = rstr(data, 'MovedOutInPeriod')
File "C:\Users\john\PycharmProjects\pythonProject\main.py", line 30, in rstr
str.columns = cols
File "C:\Users\john\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\core\generic.py", line 5152, in __setattr__
return object.__setattr__(self, name, value)
File "pandas\_libs\properties.pyx", line 66, in pandas._libs.properties.AxisProperty.__set__
File "C:\Users\john\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\core\generic.py", line 564, in _set_axis
self._mgr.set_axis(axis, labels)
File "C:\Users\john\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\core\internals\managers.py", line 226, in set_axis
raise ValueError(
ValueError: Length mismatch: Expected axis has 33 elements, new values have 9 elements
The problem is in how the uniques feature is computed. Remove the list brackets around x.unique(), like this:
uniques = df.apply(lambda x: x.unique())
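For what it's worth, here is a minimal sketch of the same kind of summary table built directly from per-column Series, so that every statistic contributes exactly one column and the number of column names always matches. The function name summarize and the toy data are made up for illustration, and it covers only a few of the statistics from the question:

```python
import pandas as pd

def summarize(df):
    """Build one row per input column and one column per statistic."""
    return pd.DataFrame({
        "types": df.dtypes.astype(str),
        "counts": df.count(),
        "distinct": df.nunique(dropna=False),
        "nulls": df.isnull().sum(),
        "ratio_missing": df.isnull().mean() * 100,
    })

# Hypothetical toy data standing in for the rent history file
toy = pd.DataFrame({
    "rent": [1000.0, 1100.0, None],
    "unit": ["a", "b", "b"],
    "moved_out": [0, 1, 1],
})
summary = summarize(toy)
print(summary)
```

Each value in the dictionary is a Series indexed by the original column names, so the result has one row per original column no matter how many columns the input has.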

OneHotEncoding error when applying to an empty field

The code applies one-hot encoding to two fields of a binetflow file: Proto and State. I have to do this for 5 files. The code below works perfectly on the first two, but on the third it throws this error:
TypeError: '<' not supported between instances of 'str' and 'float'.
I'm sure the error comes from the line 0.000000,icmp,,60,60.0,0 of the file, in which the State field is empty.
I simply want to skip the one-hot encoding for that row, copy the State field as it is (empty), and move on to the next line.
df = opendataset()
df['State2'] = df['State']
df['Proto2'] = df['Proto']
df['Dur'] = df.Dur.apply(lambda n: '%.6f' % n)
le = LabelEncoder()
dfle = df
dfle.State = le.fit_transform(dfle.State)
X = dfle[['State']].values
Y = dfle[['Proto']].values
ohe = OneHotEncoder()
OnehotX = ohe.fit_transform(X).toarray()
OnehotY = ohe.fit_transform(Y).toarray()
dx = pd.DataFrame(data=OnehotX)
dy = pd.DataFrame(data=OnehotY)
dfle['State'] = (dx[dx.columns[0:]].apply(lambda x:''.join(x.dropna().astype(int).astype(str)), axis=1))
dfle['Proto'] = (dy[dy.columns[0:]].apply(lambda y:''.join(y.dropna().astype(int).astype(str)), axis=1))
08-03 Edit
Below is the traceback I get when I run the code above. As you can see, the error is raised at dfle.State = le.fit_transform(dfle.State) and, as a consequence, at OnehotX = ohe.fit_transform(X).toarray().
Traceback (most recent call last):
  File "C:/Users/V/PycharmProjects/PreProcess/testfile.py", line 39, in
    dfle.State = le.fit_transform(dfle.State)
  File "C:\Users\V\PycharmProjects\PreProcess\venv\lib\site-packages\sklearn\preprocessing\label.py", line 236, in fit_transform
    self.classes_, y = _encode(y, encode=True)
  File "C:\Users\V\PycharmProjects\PreProcess\venv\lib\site-packages\sklearn\preprocessing\label.py", line 108, in _encode
    return _encode_python(values, uniques, encode)
  File "C:\Users\V\PycharmProjects\PreProcess\venv\lib\site-packages\sklearn\preprocessing\label.py", line 63, in _encode_python
    uniques = sorted(set(values))
TypeError: '<' not supported between instances of 'str' and 'float'
NEW CODE:
I tried to do what Hemerson Tacon said and wrapped the parts where the traceback points in try/except blocks, but that only leads to another error.
le = LabelEncoder()
dfle = df
try:
    dfle.State = le.fit_transform(dfle.State)
except TypeError:
    pass
X = dfle[['State']].values
Y = dfle[['Proto']].values
ohe = OneHotEncoder()
try:
    OnehotX = ohe.fit_transform(X).toarray()
except ValueError:
    pass
OnehotY = ohe.fit_transform(Y).toarray()
dx = pd.DataFrame(data=OnehotX)
dy = pd.DataFrame(data=OnehotY)
dfle['State'] = (dx[dx.columns[0:]].apply(lambda x:''.join(x.dropna().astype(int).astype(str)), axis=1))
dfle['Proto'] = (dy[dy.columns[0:]].apply(lambda y:''.join(y.dropna().astype(int).astype(str)), axis=1))
NEW ERROR:
Traceback (most recent call last):
  File "C:/Users/V/PycharmProjects/PreProcess/testfile.py", line 53, in
    dx = pd.DataFrame(data=OnehotX)
NameError: name 'OnehotX' is not defined
LAST EDIT 09/03
The solution was simply to add a replace() call to the code, so that NaN values in the State field are replaced with the word "empty" before encoding. That fixes the problem:
dfle['State'].replace(np.nan,"empty", inplace=True)
df = opendataset()
df['State2'] = df['State']
df['Proto2'] = df['Proto']
df['Dur'] = df.Dur.apply(lambda n: '%.6f' % n)
le = LabelEncoder()
dfle = df
dfle['State'].replace(np.nan,"empty", inplace=True)
dfle.State = le.fit_transform(dfle.State)
X = dfle[['State']].values
Y = dfle[['Proto']].values
ohe = OneHotEncoder()
OnehotX = ohe.fit_transform(X).toarray()
OnehotY = ohe.fit_transform(Y).toarray()
dx = pd.DataFrame(data=OnehotX)
dy = pd.DataFrame(data=OnehotY)
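The fix above works because the traceback fails at sorted(set(values)): a missing State is read as a float NaN, and Python cannot order a float against a str. A minimal sketch that reproduces this with a toy Series and shows the effect of filling the gap first. The toy values are made up, and fillna is used here as an equivalent of the replace call:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the State column, with one empty field read as NaN
state = pd.Series(["CON", np.nan, "INT", "CON"], name="State")

message = None
try:
    sorted(set(state))  # same operation that fails inside the encoder
except TypeError as exc:
    message = str(exc)
print(message)

# Replace the missing value with a placeholder string, then sorting works
filled = state.fillna("empty")
classes = sorted(set(filled))
print(classes)
```

Once every value is a string, the sort succeeds and the placeholder simply becomes one more category.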
You could put the code in question inside a try block and catch the TypeError exception, check whether this is the case where the State field is empty, ignore it if so (as you said), and re-raise the error otherwise.
If you had posted the actual code that applies the one-hot encoding to your data, it would be easier to answer and to provide some code in the answer.
Edit
The OnehotX variable is defined only inside the try block. You need to define it outside and before this block to fix the error. Something like OnehotX = None would work. Also, to reinforce what I said before: in the except block it would be good practice to test whether the exception is due to the problem you have identified, that is, to test whether the State field is empty.
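The pattern described above can be sketched without scikit-learn. This is only an illustration under stated assumptions: encode_labels is a hypothetical stand-in for the encoder (it sorts the unique labels the same way the traceback shows), and has_missing is a made-up helper that checks for the known cause:

```python
import math

def encode_labels(values):
    """Stand-in encoder: map sorted unique labels to integers."""
    classes = sorted(set(values))  # raises TypeError if str and float are mixed
    return [classes.index(v) for v in values]

def has_missing(values):
    """True if any value is a float NaN, i.e. an empty field."""
    return any(isinstance(v, float) and math.isnan(v) for v in values)

states = ["CON", float("nan"), "INT"]

encoded = None  # defined before the try block, so the name always exists afterwards
try:
    encoded = encode_labels(states)
except TypeError:
    if not has_missing(states):
        raise  # a different problem: do not hide it
    # known case (empty State): leave encoded as None and carry on

print(encoded)
```

Code that runs afterwards can then test whether encoded is None instead of failing with a NameError.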

Getting error slicing time series with pandas

I'm trying to slice a time series. It works perfectly this way:
subseries = series['2015-07-07 01:00:00':'2015-07-07 03:30:00']
But the following code doesn't work:
import datetime
import pandas as pd
def GetDatetime():
    # raw_input is Python 2; use input() on Python 3
    Y = int(raw_input("Year "))
    M = int(raw_input("Month "))
    D = int(raw_input("Day "))
    d = datetime.datetime(Y, M, D)  # creates a datetime object
    return d
filePath = "pathtofile.csv"
series = pd.read_csv(str(filePath), index_col='date')
series.index = pd.to_datetime(series.index, unit='s')
d = GetDatetime()
f = GetDatetime()
subseries = series[d:f]
The last line generates this error:
Traceback (most recent call last):
File "dontgivemeerrorsbrasommek.py", line 37, in <module>
brasla7nina= df[d:f]
File "/usr/local/lib/python2.7/dist-packages/pandas-0.20.2-py2.7-linux-x86_64.egg/pandas/core/frame.py", line 1952, in __getitem__
indexer = convert_to_index_sliceable(self, key)
File "/usr/local/lib/python2.7/dist-packages/pandas-0.20.2-py2.7-linux-x86_64.egg/pandas/core/indexing.py", line 1896, in convert_to_index_sliceable
return idx._convert_slice_indexer(key, kind='getitem')
File "/usr/local/lib/python2.7/dist-packages/pandas-0.20.2-py2.7-linux-x86_64.egg/pandas/core/indexes/base.py", line 1407, in _convert_slice_indexer
indexer = self.slice_indexer(start, stop, step, kind=kind)
File "/usr/local/lib/python2.7/dist-packages/pandas-0.20.2-py2.7-linux-x86_64.egg/pandas/core/indexes/datetimes.py", line 1515, in slice_indexer
return Index.slice_indexer(self, start, end, step, kind=kind)
File "/usr/local/lib/python2.7/dist-packages/pandas-0.20.2-py2.7-linux-x86_64.egg/pandas/core/indexes/base.py", line 3350, in slice_indexer
kind=kind)
File "/usr/local/lib/python2.7/dist-packages/pandas-0.20.2-py2.7-linux-x86_64.egg/pandas/core/indexes/base.py", line 3538, in slice_locs
start_slice = self.get_slice_bound(start, 'left', kind)
File "/usr/local/lib/python2.7/dist-packages/pandas-0.20.2-py2.7-linux-x86_64.egg/pandas/core/indexes/base.py", line 3487, in get_slice_bound
raise err
KeyError: 1435802520000000000
I think it's a timestamp conversion problem, so I tried the following, but it still doesn't work:
d3 = pandas.Timestamp(datetime(Y, M, D, H, m))
d2 = pandas.to_datetime(d)
Your help would be appreciated, thank you. :)
Change the return value of the GetDatetime() function to:
return str(d)
This returns a datetime string, which the time series is able to deal with.
If I understand your code correctly, when you do this:
subseries = series['2015-07-07 01:00:00':'2015-07-07 03:30:00']
you're slicing series (which, by the way, is a confusing name, given that pandas has a Series data type) between two strings.
If that works, then for subseries = df[d:f] to work as well, d and f need to be strings.
You can get them by calling the datetime method .strftime(), e.g.:
d = GetDatetime().strftime('%Y-%m-%d 00:00:00')
f = GetDatetime().strftime('%Y-%m-%d 00:00:00')
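A minimal sketch of the string-based slicing suggested above, on a toy series with a sorted DatetimeIndex. The dates and values are made up, and .loc is used here instead of plain brackets to make the label-based slice explicit:

```python
import datetime
import pandas as pd

# Toy series: one value every 30 minutes starting at midnight
index = pd.date_range("2015-07-07 00:00:00", periods=8, freq="30min")
series = pd.Series(range(8), index=index)

start = datetime.datetime(2015, 7, 7, 1, 0)
end = datetime.datetime(2015, 7, 7, 2, 30)

# Convert the datetime objects to strings before slicing
start_str = start.strftime("%Y-%m-%d %H:%M:%S")
end_str = end.strftime("%Y-%m-%d %H:%M:%S")

subseries = series.loc[start_str:end_str]  # both endpoints are included
print(subseries)
```

Note that label-based slices on a DatetimeIndex include the end point, unlike positional slices.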

Error in scikit code

I am new to machine learning and am trying the Titanic problem from Kaggle. I have written the code below, which uses a decision tree on the data. It raises an error that I am unable to get rid of.
Code :
#!/usr/bin/env python
from __future__ import print_function
import pandas as pd
import numpy as np
from sklearn import tree
train_uri = './titanic/train.csv'
test_uri = './titanic/test.csv'
train = pd.read_csv(train_uri)
test = pd.read_csv(test_uri)
# print(train[train["Sex"] == 'female']["Survived"].value_counts(normalize=True))
train['Child'] = float('NaN')
train['Child'][train['Age'] < 18] = 1
train['Child'][train['Age'] >= 18] = 0
# print(train[train['Child'] == 1]['Survived'].value_counts(normalize=True))
# print(train['Embarked'][train['Embarked'] == 'C'].value_counts())
# print(train.shape)
## Fill empty 'Embarked' values with 'S'
train['Embarked'] = train['Embarked'].fillna('S')
## Convert Embarked classes to integers
train["Embarked"][train["Embarked"] == "S"] = 0
train['Embarked'][train['Embarked'] == "C"] = 1
train['Embarked'][train['Embarked'] == "Q"] = 2
train['Sex'][train['Sex'] == 'male'] = 0
train['Sex'][train['Sex'] == 'female'] = 1
target = train['Survived'].values
features_a = train[['Pclass', 'Sex', 'Age', 'Fare']].values
tree_a = tree.DecisionTreeClassifier()
##### Line With Error #####
tree_a = tree_a.fit(features_a, target)
# print(tree_a.feature_importances_)
# print(tree_a.score(features_a, target))
Error:
Traceback (most recent call last):
File "titanic.py", line 40, in <module>
tree_a = tree_a.fit(features_a, target)
File "/usr/local/lib/python2.7/dist-packages/sklearn/tree/tree.py", line 739, in fit
X_idx_sorted=X_idx_sorted)
File "/usr/local/lib/python2.7/dist-packages/sklearn/tree/tree.py", line 122, in fit
X = check_array(X, dtype=DTYPE, accept_sparse="csc")
File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", line 407, in check_array
_assert_all_finite(array)
File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", line 58, in _assert_all_finite
" or a value too large for %r." % X.dtype)
ValueError: Input contains NaN, infinity or a value too large for dtype('float32').
This error doesn't occur when I run the code on the DataCamp server, but it does when I run it locally. I don't understand why it comes up: I have checked the data, and neither features_a nor target seems to contain NaN or very large values.
Try each feature one by one and you will probably find that one of them contains nulls. I note that you do not check whether Sex has nulls.
Also, when coding each categorical variable by hand it is easy to make a mistake, for example by misspelling one of the categories. Instead you can use df = pd.get_dummies(df), which encodes all the categorical variables automatically, so there is no need to specify each category manually.
You can also try the pandas dropna() function to drop all rows of the dataset that contain invalid values such as NaN.
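The suggestions above can be sketched on a small made-up frame that uses the same column names as the question. The values are toy data, not the Kaggle file, and filling Age with the median is just one possible choice next to dropna():

```python
import numpy as np
import pandas as pd

# Toy data with one missing Age, mimicking the feature columns in the question
train = pd.DataFrame({
    "Pclass": [3, 1, 3, 2],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Fare": [7.25, 71.28, 7.92, 13.0],
})

features = train[["Pclass", "Sex", "Age", "Fare"]]

# Count missing values per feature to find the column that breaks fit()
nan_counts = features.isnull().sum()
print(nan_counts)

# Fill the gaps (here with the median) and let pandas encode the categories
cleaned = features.copy()
cleaned["Age"] = cleaned["Age"].fillna(cleaned["Age"].median())
encoded = pd.get_dummies(cleaned, columns=["Sex"])
print(encoded)
```

After this, encoded contains only numeric or boolean columns without NaN, which is the kind of input the tree's fit() expects.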

Astroquery python: querying NED with list of objects

I have extracted a list of Simbad names from a VizieR catalog and would like to find the axis ratio of the objects from the diameters table in NED. Code below.
import numpy as np
from astropy.table import Table,Column
from astroquery.vizier import Vizier
from astroquery.ned import Ned
v = Vizier(columns = ['SimbadName','W50max'])
catalist = v.find_catalogs('VIII/73')
v.ROW_LIMIT = -1
a = v.get_catalogs(catalist.keys())
filter = a[0]['W50max'] > 500
targets = a[0][filter]
print targets
simName = targets['SimbadName']
W50max = targets['W50max']
counter = 1
for objects in simName:
    result_table = Ned.get_table(objects, table='diameters')
    ## find where Axis Ratio results are not masked
    notMasked = (np.where(result_table['NED Axis Ratio'].mask == False))
    ## calculate average value of Axis Ratio
    print counter, np.sum(result_table['NED Axis Ratio'])/np.size(notMasked)
    counter += 1
The fourth object in simName has no diameters table, so the query raises an error:
File "/home/tom/VizRauto.py", line 40, in <module>
result_table = Ned.get_table(objects, table='diameters')
File "/usr/local/lib/python2.7/dist-packages/astroquery/ned/core.py", line 505, in get_table
result = self._parse_result(response, verbose=verbose)
File "/usr/local/lib/python2.7/dist-packages/astroquery/ned/core.py", line 631, in _parse_result
raise RemoteServiceError("The remote service returned the following error message.\nERROR: {err_msg}".format(err_msg=err_msg))
RemoteServiceError: The remote service returned the following error message.
ERROR: Unknown error
So I tried:
counter = 1
for objects in simName:
    try:
        result_table = Ned.get_table(objects, table='diameters')
        ## find where Axis Ratio results are not masked
        notMasked = (np.where(result_table['NED Axis Ratio'].mask == False))
        ## calculate average value of Axis Ratio
        print counter, np.sum(result_table['NED Axis Ratio'])/np.size(notMasked)
    except RemoteServiceError:
        continue
    counter += 1
which produces:
Traceback (most recent call last):
File "/home/tom/Dropbox/AST03CosmoLarge/Project/scripts/VizRauto.py", line 57, in <module>
except RemoteServiceError:
NameError: name 'RemoteServiceError' is not defined
So obviously the RemoteServiceError from core.py is not recognized. What is the best way to handle this or is there a better method for querying NED with a list of objects?
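The NameError arises because an exception class has to be imported into the script's own namespace before it can be named in an except clause. For astroquery, the class is defined in astroquery.exceptions. Since a NED query needs network access, the sketch below shows the same skip-and-continue pattern with a standard-library exception instead; the JSON strings are made-up stand-ins for the per-object queries:

```python
# For the astroquery case, the analogous import would be:
#     from astroquery.exceptions import RemoteServiceError
from json import JSONDecodeError, loads

# Toy inputs: the second one fails, like the object without a diameters table
documents = ['{"ratio": 0.5}', "not json", '{"ratio": 0.8}']

ratios = []
skipped = 0
for text in documents:
    try:
        ratios.append(loads(text)["ratio"])
    except JSONDecodeError:  # works only because the name was imported above
        skipped += 1
        continue

print(ratios, skipped)
```

With the exception imported, the loop skips the failing item and carries on with the rest of the list.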
