Error in scikit code - python

I am new to machine learning and am trying the Titanic problem from Kaggle. I have written the code below, which uses a decision tree to classify the data. It raises an error that I am unable to get rid of.
Code:
#!/usr/bin/env python
from __future__ import print_function
import pandas as pd
import numpy as np
from sklearn import tree
train_uri = './titanic/train.csv'
test_uri = './titanic/test.csv'
train = pd.read_csv(train_uri)
test = pd.read_csv(test_uri)
# print(train[train["Sex"] == 'female']["Survived"].value_counts(normalize=True))
train['Child'] = float('NaN')
train['Child'][train['Age'] < 18] = 1
train['Child'][train['Age'] >= 18] = 0
# print(train[train['Child'] == 1]['Survived'].value_counts(normalize=True))
# print(train['Embarked'][train['Embarked'] == 'C'].value_counts())
# print(train.shape)
## Fill empty 'Embarked' values with 'S'
train['Embarked'] = train['Embarked'].fillna('S')
## Convert Embarked classes to integers
train["Embarked"][train["Embarked"] == "S"] = 0
train['Embarked'][train['Embarked'] == "C"] = 1
train['Embarked'][train['Embarked'] == "Q"] = 2
train['Sex'][train['Sex'] == 'male'] = 0
train['Sex'][train['Sex'] == 'female'] = 1
target = train['Survived'].values
features_a = train[['Pclass', 'Sex', 'Age', 'Fare']].values
tree_a = tree.DecisionTreeClassifier()
##### Line With Error #####
tree_a = tree_a.fit(features_a, target)
# print(tree_a.feature_importances_)
# print(tree_a.score(features_a, target))
Error:
Traceback (most recent call last):
File "titanic.py", line 40, in <module>
tree_a = tree_a.fit(features_a, target)
File "/usr/local/lib/python2.7/dist-packages/sklearn/tree/tree.py", line 739, in fit
X_idx_sorted=X_idx_sorted)
File "/usr/local/lib/python2.7/dist-packages/sklearn/tree/tree.py", line 122, in fit
X = check_array(X, dtype=DTYPE, accept_sparse="csc")
File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", line 407, in check_array
_assert_all_finite(array)
File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", line 58, in _assert_all_finite
" or a value too large for %r." % X.dtype)
ValueError: Input contains NaN, infinity or a value too large for dtype('float32').
This error does not occur when I run the code on the DataCamp server, but it does when I run it locally. I don't understand why it comes up: I have checked the data, and as far as I can tell neither features_a nor target contains NaN or very large values.

Try each feature one by one and you will probably find that one of them contains nulls. I note that you do not check whether Sex has nulls.
Also, when you code each categorical variable manually it is easy to make an error, for example by misspelling one of the categories. Instead you can use df = pd.get_dummies(df), which encodes all the categorical variables automatically, with no need to specify each category by hand.

You can also try the dropna() function of pandas to drop all rows of the dataset that contain invalid values such as NaN.
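The suggestions above can be sketched as follows. This is a minimal illustration on a made-up four-row frame standing in for train.csv, not the asker's actual data; the choice of the median as fill value is an assumption, and dropna(subset=...) would be the alternative mentioned above.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for train.csv; the values are made up for illustration.
train = pd.DataFrame({
    "Pclass": [3, 1, 3, 2],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Fare": [7.25, 71.28, 7.92, 13.0],
    "Survived": [0, 1, 1, 0],
})

feature_cols = ["Pclass", "Sex", "Age", "Fare"]

# Count missing values per feature to find the offending column.
null_counts = train[feature_cols].isnull().sum()
print(null_counts)

# Fill the gaps (here with the median) instead of dropping the rows.
filled = train.copy()
filled["Age"] = filled["Age"].fillna(filled["Age"].median())

# One-hot encode the categorical columns instead of mapping each category by hand.
features = pd.get_dummies(filled[feature_cols])
print(features.columns.tolist())
```

Checking null_counts on the real features before calling fit should show which column triggers the ValueError.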

Related

Removing data points above/below value in python

I have a dataframe from which I am trying to remove all values outside the range [-500, 500]; I simply want to drop the particular column/"index" values that exceed this limit. I have tried a lot of different things, but nothing really seems to work. With the code below I get this error:
File "C:\Users\Jeffs.spyder-py3\kplr006779699.py", line 30, in <module>
data = data[data['0'] < abs(500)]
File "C:\Users\Jeffs\anaconda3\lib\site-packages\pandas\core\frame.py", line 3024, in __getitem__
indexer = self.columns.get_loc(key)
File "C:\Users\Jeffs\anaconda3\lib\site-packages\pandas\core\indexes\range.py", line 354, in get_loc
raise KeyError(key)
KeyError: '0'
which I'm guessing is because the column I refer to as '0' doesn't really have a column name.
from astropy.io import ascii
import numpy as np
import matplotlib.pyplot as plt
import math
import pandas as pd
#Data from KIC 6779699
df = ascii.read(r'G:\Downloads\kplr006779699_kasoc-ts_llc_v1-2.dat')
# print(df)
x_Julian_data = df['col1']
x_data_raw = (x_Julian_data-54000)*86400 #Julian time to seconds: 60*60*24
data = np.linspace(0, 65541, num = int(65541) , endpoint = True)
y_data_raw = df['col2'] #Relative flux ppm
for i in range(65541 - 2):  # Cleaning up data
    data[i+1] = y_data_raw[i+1] - .5*(y_data_raw[i] + y_data_raw[i+2])
data[0] = 0
data[65540] = 0
data = pd.DataFrame(data)
data = data[data['0'] < abs(500)]
plt.plot(x_data_raw, data)
plt.xlim([1.1E8,1.25E8])
plt.ylim([-500,500])
I can't quite get it to work, even if I try using a definition.
Is there an easier way to approach this?
"data" is a numpy array (created using np.linspace), so you can filter it by value *before you create the data frame:
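data = data[np.abs(data) <= 500]
new_df = pd.DataFrame(data, columns=['a_useful_column_name'])
(Note that abs(500) is simply 500, so the absolute value has to be taken of the data itself to enforce both bounds. While debugging, consider using a new variable name for the DataFrame.)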
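One thing to watch when filtering like this is that the filtered array becomes shorter than the time axis, so the two can no longer be plotted against each other. A minimal sketch of keeping them aligned with a shared boolean mask, using made-up stand-in arrays (x and y are hypothetical names, not the asker's variables):

```python
import numpy as np

# Made-up stand-ins for the time axis and the cleaned flux values.
x = np.arange(6, dtype=float)
y = np.array([10.0, -620.0, 480.0, 501.0, -499.5, 0.0])

# Boolean mask of the points inside [-500, 500].
mask = np.abs(y) <= 500

# Apply the same mask to both arrays so they stay aligned for plotting.
x_kept = x[mask]
y_kept = y[mask]

print(x_kept, y_kept)
```

The filtered pair can then be passed to plt.plot(x_kept, y_kept) without a length mismatch.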

Vector autoregressive (VAR) model fitting with different lag operator

I am a second-year Master's (M2) student in computational neuroscience.
I'm at the very end of my analysis and I have a problem applying a VAR (vector autoregressive) model.
The problem is fairly complex and concerns testing different lag orders on the data. It appears when the Cholesky factorization is computed on a covariance matrix that contains negative numbers.
I may have found a solution, but I can't work it into the Python function that fits the model ("VAR"). If someone has ten minutes to help me, please write to me. Thanks for your attention :)
for i in [1,2,3,4,6,8,9,10,12,13,14,15,16,17,18,19,20]:
    print(i)
    df_entropie_G1_w_diff = df_entropie_G1_w.iloc[i,2145:].diff()
    df_RMSE_G1_w_diff = df_g1_RMSE_w.iloc[i,2145:].diff()
    df_var_G1_w_diff = df_var_G1_w.iloc[i,2145:].diff()
    df_data = pd.concat([df_entropie_G1_w_diff,df_RMSE_G1_w_diff,df_var_G1_w_diff],axis = 1)
    df_data = df_data.diff().dropna()
    df_data = df_data.T
    df_data = df_data.reset_index()
    del df_data['index']
    df_data = df_data.T
    df_data['Time'] = pd.to_timedelta(np.arange(537), unit='s')
    df_data.index = df_data['Time']
    del df_data['Time']
    # Arrange names of columns
    df_data_T = df_data.T
    df_data_T = df_data_T.reset_index()
    del df_data_T['index']
    df_data_T = df_data_T.T
    df_data = df_data_T.rename(columns={0:'Entropie',1:'RMSE',2:'Var'})
    model = VAR(df_data)
    liste_aic = []
    liste_bic = []
    liste_fpe = []
    liste_hqic = []
    for a in range(0,25,1):
        result = model.fit(a)
        print('Lag Order =', a)
        print('AIC : ', result.aic)
        print('BIC : ', result.bic)
        print('FPE : ', result.fpe)
        print('HQIC: ', result.hqic, '\n')
        liste_aic.append(result.aic)
        liste_bic.append(result.bic)
        liste_fpe.append(result.fpe)
        liste_hqic.append(result.hqic)
1
Lag Order = 0
AIC : -59.6358271069015
BIC : -59.61188298346849
FPE : 1.260344786813777e-26
HQIC: -59.626460351200464
Lag Order = 1
/opt/anaconda3/lib/python3.8/site-packages/statsmodels/tsa/base/tsa_model.py:578: ValueWarning: An unsupported index was provided and will be ignored when e.g. forecasting.
warnings.warn('An unsupported index was provided and will be'
Traceback (most recent call last):
File "", line 139, in
print('AIC : ', result.aic)
File "/opt/anaconda3/lib/python3.8/site-packages/statsmodels/base/wrapper.py", line 34, in getattribute
obj = getattr(results, attr)
File "/opt/anaconda3/lib/python3.8/site-packages/statsmodels/tsa/vector_ar/var_model.py", line 2139, in aic
return self.info_criteria['aic']
File "pandas/_libs/properties.pyx", line 33, in pandas._libs.properties.CachedProperty.get
File "/opt/anaconda3/lib/python3.8/site-packages/statsmodels/tsa/vector_ar/var_model.py", line 2120, in info_criteria
ld = logdet_symm(self.sigma_u_mle)
File "/opt/anaconda3/lib/python3.8/site-packages/statsmodels/tools/linalg.py", line 28, in logdet_symm
c, _ = linalg.cho_factor(m, lower=True)
File "/opt/anaconda3/lib/python3.8/site-packages/scipy/linalg/decomp_cholesky.py", line 152, in cho_factor
c, lower = _cholesky(a, lower=lower, overwrite_a=overwrite_a, clean=False,
File "/opt/anaconda3/lib/python3.8/site-packages/scipy/linalg/decomp_cholesky.py", line 37, in _cholesky
raise LinAlgError("%d-th leading minor of the array is not positive "
LinAlgError: 3-th leading minor of the array is not positive definite
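Negative entries in a covariance matrix do not by themselves prevent a Cholesky factorization; what it requires is that the symmetric matrix be positive definite, which is what the LinAlgError above complains about. The sketch below illustrates the difference on two made-up 2x2 matrices; is_positive_definite is a hypothetical helper, not part of statsmodels. Applying the same eigenvalue check to result.sigma_u_mle (the matrix named in the traceback) might help show at which lag order the residual covariance stops being positive definite.

```python
import numpy as np

def is_positive_definite(matrix):
    """Return True if the Cholesky factorization of a symmetric matrix succeeds."""
    try:
        np.linalg.cholesky(matrix)
        return True
    except np.linalg.LinAlgError:
        return False

# Negative off-diagonal entries are fine as long as all eigenvalues are positive.
cov_ok = np.array([[2.0, -1.0],
                   [-1.0, 2.0]])

# Symmetric but indefinite (one eigenvalue is negative), so Cholesky fails.
cov_bad = np.array([[1.0, 2.0],
                    [2.0, 1.0]])

for name, m in [("cov_ok", cov_ok), ("cov_bad", cov_bad)]:
    print(name, np.linalg.eigvalsh(m), is_positive_definite(m))
```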

OneHotEncoding error when applying to an empty field

The code applies one-hot encoding to two fields of a binetflow file: Proto and State. I have to do this for 5 files. The code below worked perfectly on the first two, but on the third it throws the error:
TypeError: '<' not supported between instances of 'str' and 'float'.
I'm sure the error comes from the line 0.000000,icmp,,60,60.0,0 of the file, in which the State field is empty.
I simply want to skip the one-hot encoding for that line, copy the State field as it is (empty), and move on to the next line.
df = opendataset()
df['State2'] = df['State']
df['Proto2'] = df['Proto']
df['Dur'] = df.Dur.apply(lambda n: '%.6f' % n)
le = LabelEncoder()
dfle = df
dfle.State = le.fit_transform(dfle.State)
X = dfle[['State']].values
Y = dfle[['Proto']].values
ohe = OneHotEncoder()
OnehotX = ohe.fit_transform(X).toarray()
OnehotY = ohe.fit_transform(Y).toarray()
dx = pd.DataFrame(data=OnehotX)
dy = pd.DataFrame(data=OnehotY)
dfle['State'] = (dx[dx.columns[0:]].apply(lambda x:''.join(x.dropna().astype(int).astype(str)), axis=1))
dfle['Proto'] = (dy[dy.columns[0:]].apply(lambda y:''.join(y.dropna().astype(int).astype(str)), axis=1))
08-03 Edit
Below is the traceback I get when I run the code above. As you can see, the error is raised by dfle.State = le.fit_transform(dfle.State), and consequently also affects OnehotX = ohe.fit_transform(X).toarray().
Traceback (most recent call last):
File "C:/Users/V/PycharmProjects/PreProcess/testfile.py", line 39, in <module>
dfle.State = le.fit_transform(dfle.State)
File "C:\Users\V\PycharmProjects\PreProcess\venv\lib\site-packages\sklearn\preprocessing\label.py", line 236, in fit_transform
self.classes_, y = _encode(y, encode=True)
File "C:\Users\V\PycharmProjects\PreProcess\venv\lib\site-packages\sklearn\preprocessing\label.py", line 108, in _encode
return _encode_python(values, uniques, encode)
File "C:\Users\V\PycharmProjects\PreProcess\venv\lib\site-packages\sklearn\preprocessing\label.py", line 63, in _encode_python
uniques = sorted(set(values))
TypeError: '<' not supported between instances of 'str' and 'float'
NEW CODE:
I tried to do what Hemerson Tacon suggested and wrapped the parts where the traceback reports an error in try/except, but I get a warning that there is an error, and then a different error is thrown.
le = LabelEncoder()
dfle = df
try:
    dfle.State = le.fit_transform(dfle.State)
except TypeError:
    pass
X = dfle[['State']].values
Y = dfle[['Proto']].values
ohe = OneHotEncoder()
try:
    OnehotX = ohe.fit_transform(X).toarray()
except ValueError:
    pass
OnehotY = ohe.fit_transform(Y).toarray()
dx = pd.DataFrame(data=OnehotX)
dy = pd.DataFrame(data=OnehotY)
dfle['State'] = (dx[dx.columns[0:]].apply(lambda x:''.join(x.dropna().astype(int).astype(str)), axis=1))
dfle['Proto'] = (dy[dy.columns[0:]].apply(lambda y:''.join(y.dropna().astype(int).astype(str)), axis=1))
NEW ERROR:
Traceback (most recent call last):
File "C:/Users/V/PycharmProjects/PreProcess/testfile.py", line 53, in <module>
dx = pd.DataFrame(data=OnehotX)
NameError: name 'OnehotX' is not defined
LAST EDIT 09/03
The solution was simply to add a replace() call to the code, so that NaN values in State are replaced with the word "empty" before encoding, which fixes the problem:
dfle['State'].replace(np.nan,"empty", inplace=True)
df = opendataset()
df['State2'] = df['State']
df['Proto2'] = df['Proto']
df['Dur'] = df.Dur.apply(lambda n: '%.6f' % n)
le = LabelEncoder()
dfle = df
dfle['State'].replace(np.nan,"empty", inplace=True)
dfle.State = le.fit_transform(dfle.State)
X = dfle[['State']].values
Y = dfle[['Proto']].values
ohe = OneHotEncoder()
OnehotX = ohe.fit_transform(X).toarray()
OnehotY = ohe.fit_transform(Y).toarray()
dx = pd.DataFrame(data=OnehotX)
dy = pd.DataFrame(data=OnehotY)
You could put the code in question inside a try block and catch the TypeError exception, check whether it is the case where the State field is empty and, if so, ignore it as you said; if not, raise the error again.
If you had posted the actual code that applies the OneHotEncoding to your data, it would be easier to answer you and to provide some code in the answer.
Edit
The OnehotX variable is defined only inside the try block. You need to define it outside and before this block to fix the error; something like OnehotX = None would work. Also, to reinforce what I said before: in the except block it is good practice to test whether the exception is due to the problem you have identified, that is, to test whether the State field is empty.
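The pattern described in this answer can be sketched in plain Python as follows. Here encode_labels is a hypothetical stand-in that only mimics the sorted(set(values)) step seen in the traceback, and the state values are made up; it is not the scikit-learn implementation.

```python
import math

def encode_labels(values):
    """Toy stand-in for a label encoder: sort the unique values, map each to an int."""
    classes = sorted(set(values))  # raises TypeError when str and float (NaN) are mixed
    index = {c: i for i, c in enumerate(classes)}
    return [index[v] for v in values]

def is_missing(value):
    """True for a float NaN, which is how pandas represents an empty field."""
    return isinstance(value, float) and math.isnan(value)

states = ["CON", float("nan"), "URP", "CON"]  # made-up values with one empty field

encoded = None  # defined before the try block so the name exists on every path
try:
    encoded = encode_labels(states)
except TypeError:
    # Only handle the case we identified (an empty State); re-raise anything else.
    if not any(is_missing(v) for v in states):
        raise
    cleaned = ["empty" if is_missing(v) else v for v in states]
    encoded = encode_labels(cleaned)

print(encoded)
```

Replacing the missing values before encoding, as in the asker's final edit, avoids the need for the try/except altogether.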

OverflowError: size does not fit in an int

I am writing a Python script to use in AzureML. My dataset is quite big. It has columns called ID (int) and DataType (text). I would like to concatenate these values so that I have a single text column containing both the ID and the DataType, separated by a comma.
How can I avoid getting an error when I do this? Are there any mistakes in my code?
When I run this code I get the following error:
Error 0085: The following error occurred during script evaluation, please view the output log for more information:
---------- Start of error message from Python interpreter ----------
data:text/plain,Caught exception while executing function: Traceback (most recent call last):
File "C:\server\invokepy.py", line 167, in batch
idfs.append(rutils.RUtils.RFileToDataFrame(infile))
File "C:\server\RReader\rutils.py", line 15, in RFileToDataFrame
rreader = RReaderFactory.construct_from_file(filename, compressed)
File "C:\server\RReader\rreaderfactory.py", line 25, in construct_from_file
return _RReaderFactory.construct_from_stream(stream)
File "C:\server\RReader\rreaderfactory.py", line 46, in construct_from_stream
return RReader(BinaryReader(RFactoryConstants.big_endian, stream.read()))
File "C:\pyhome\lib\gzip.py", line 254, in read
self._read(readsize)
File "C:\pyhome\lib\gzip.py", line 313, in _read
self._add_read_data( uncompress )
File "C:\pyhome\lib\gzip.py", line 329, in _add_read_data
self.crc = zlib.crc32(data, self.crc) & 0xffffffffL
OverflowError: size does not fit in an int
My code is as below:
# The script MUST contain a function named azureml_main
# which is the entry point for this module.
#
# The entry point function can contain up to two input arguments:
# Param<dataframe1>: a pandas.DataFrame
# Param<dataframe2>: a pandas.DataFrame
def azureml_main(dataframe1):
    import pandas as pd
    dataframe1['SignalID,DataType'] = dataframe1['ID'] + " , " + dataframe1['DataType']
    dataframe1 = dataframe1.drop('DataType')
    dataframe1 = dataframe1.drop('ID')
    # Return value must be of a sequence of pandas.DataFrame
    return dataframe1
I get the same error when I run the default python code in AzureML. So I am pretty sure my data just does not fit in the data frame.
The default script is the following:
# The script MUST contain a function named azureml_main
# which is the entry point for this module.
#
# The entry point function can contain up to two input arguments:
# Param<dataframe1>: a pandas.DataFrame
# Param<dataframe2>: a pandas.DataFrame
def azureml_main(dataframe1 = None, dataframe2 = None):
    # Execution logic goes here
    print('Input pandas.DataFrame #1:\r\n\r\n{0}'.format(dataframe1))
    # If a zip file is connected to the third input port is connected,
    # it is unzipped under ".\Script Bundle". This directory is added
    # to sys.path. Therefore, if your zip file contains a Python file
    # mymodule.py you can import it using:
    # import mymodule
    # Return value must be of a sequence of pandas.DataFrame
    return dataframe1,
If you need to concatenate the integer ID and string DataType columns into a new column SignalID, cast ID to a string with astype. You can then drop the DataType and ID columns by passing the parameter axis=1:
import pandas as pd
def azureml_main(dataframe1):
    # Parentheses let the expression continue across lines
    dataframe1['SignalID'] = (dataframe1['ID'].astype(str)
                              + " , "
                              + dataframe1['DataType'])
    dataframe1 = dataframe1.drop(['DataType', 'ID'], axis=1)
    # Return value must be of a sequence of pandas.DataFrame
    return dataframe1

Astroquery python: querying NED with list of objects

I have extracted a list of Simbad names from a VizieR catalog and would like to find the axis ratio of the objects from the diameters table in NED. Code below.
import numpy as np
from astropy.table import Table,Column
from astroquery.vizier import Vizier
from astroquery.ned import Ned
v = Vizier(columns = ['SimbadName','W50max'])
catalist = v.find_catalogs('VIII/73')
v.ROW_LIMIT = -1
a = v.get_catalogs(catalist.keys())
filter = a[0]['W50max'] > 500
targets = a[0][filter]
print targets
simName = targets['SimbadName']
W50max = targets['W50max']
counter = 1
for objects in simName:
    result_table = Ned.get_table(objects, table='diameters')
    ## find where Axis Ratio results are not masked
    notMasked = (np.where(result_table['NED Axis Ratio'].mask == False))
    ## calculate average value of Axis Ratio
    print counter, np.sum(result_table['NED Axis Ratio'])/np.size(notMasked)
    counter += 1
The fourth object in simName has no diameters table, which raises an error:
File "/home/tom/VizRauto.py", line 40, in <module>
result_table = Ned.get_table(objects, table='diameters')
File "/usr/local/lib/python2.7/dist-packages/astroquery/ned/core.py", line 505, in get_table
result = self._parse_result(response, verbose=verbose)
File "/usr/local/lib/python2.7/dist-packages/astroquery/ned/core.py", line 631, in _parse_result
raise RemoteServiceError("The remote service returned the following error message.\nERROR: {err_msg}".format(err_msg=err_msg))
RemoteServiceError: The remote service returned the following error message.
ERROR: Unknown error
So I tried:
counter = 1
for objects in simName:
    try:
        result_table = Ned.get_table(objects, table='diameters')
        ## find where Axis Ratio results are not masked
        notMasked = (np.where(result_table['NED Axis Ratio'].mask == False))
        ## calculate average value of Axis Ratio
        print counter, np.sum(result_table['NED Axis Ratio'])/np.size(notMasked)
    except RemoteServiceError:
        continue
    counter += 1
which produces:
Traceback (most recent call last):
File "/home/tom/Dropbox/AST03CosmoLarge/Project/scripts/VizRauto.py", line 57, in <module>
except RemoteServiceError:
NameError: name 'RemoteServiceError' is not defined
So obviously the RemoteServiceError from core.py is not recognized. What is the best way to handle this or is there a better method for querying NED with a list of objects?
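The NameError arises because the name RemoteServiceError was never imported into the script; an except clause can only refer to names that exist in the current namespace. In astroquery the exception classes live in the astroquery.exceptions module, so adding from astroquery.exceptions import RemoteServiceError at the top of the script should make the except clause work. The sketch below shows the loop pattern with stand-ins so that it runs without network access: the exception class, get_axis_ratios, and the object names and values are all hypothetical placeholders, not NED data.

```python
class RemoteServiceError(Exception):
    """Stand-in for astroquery.exceptions.RemoteServiceError (assumption for this sketch)."""

# Hypothetical lookup table replacing the real NED query; values are made up.
FAKE_AXIS_RATIOS = {"object_a": [0.5, 0.7], "object_c": [0.9]}

def get_axis_ratios(name):
    """Hypothetical stand-in for Ned.get_table(name, table='diameters')."""
    if name not in FAKE_AXIS_RATIOS:
        raise RemoteServiceError("ERROR: Unknown error")
    return FAKE_AXIS_RATIOS[name]

names = ["object_a", "object_b", "object_c"]
mean_ratios = {}
failed = []
for name in names:
    try:
        ratios = get_axis_ratios(name)
    except RemoteServiceError:
        # Record the object and move on instead of aborting the whole loop.
        failed.append(name)
        continue
    mean_ratios[name] = sum(ratios) / len(ratios)

print(mean_ratios, failed)
```

Keeping only the query inside the try block, and collecting the names that failed, makes it easier to see afterwards which objects had no diameters table.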
