OneHotEncoding error when applying to an empty field - python

The code applies one-hot encoding to two fields of a binetflow file: Proto and State. I have to do this for 5 files. The code below worked perfectly on the first two, but on the third it throws the error:
TypeError: '<' not supported between instances of 'str' and 'float'.
I'm sure the error comes from the line 0.000000,icmp,,60,60.0,0 of the file, in which the State field is empty.
I simply want to skip the one-hot encoding for that row, copy the State field as it is (empty), and move on to the next line.
df = opendataset()
df['State2'] = df['State']
df['Proto2'] = df['Proto']
df['Dur'] = df.Dur.apply(lambda n: '%.6f' % n)
le = LabelEncoder()
dfle = df
dfle.State = le.fit_transform(dfle.State)
X = dfle[['State']].values
Y = dfle[['Proto']].values
ohe = OneHotEncoder()
OnehotX = ohe.fit_transform(X).toarray()
OnehotY = ohe.fit_transform(Y).toarray()
dx = pd.DataFrame(data=OnehotX)
dy = pd.DataFrame(data=OnehotY)
dfle['State'] = (dx[dx.columns[0:]].apply(lambda x:''.join(x.dropna().astype(int).astype(str)), axis=1))
dfle['Proto'] = (dy[dy.columns[0:]].apply(lambda y:''.join(y.dropna().astype(int).astype(str)), axis=1))
08-03 Edit
Below is the traceback I get when I run the code above. As you can see, the error is raised at dfle.State = le.fit_transform(dfle.State) and consequently at OnehotX = ohe.fit_transform(X).toarray().
Traceback (most recent call last):
  File "C:/Users/V/PycharmProjects/PreProcess/testfile.py", line 39, in <module>
    dfle.State = le.fit_transform(dfle.State)
  File "C:\Users\V\PycharmProjects\PreProcess\venv\lib\site-packages\sklearn\preprocessing\label.py", line 236, in fit_transform
    self.classes_, y = _encode(y, encode=True)
  File "C:\Users\V\PycharmProjects\PreProcess\venv\lib\site-packages\sklearn\preprocessing\label.py", line 108, in _encode
    return _encode_python(values, uniques, encode)
  File "C:\Users\V\PycharmProjects\PreProcess\venv\lib\site-packages\sklearn\preprocessing\label.py", line 63, in _encode_python
    uniques = sorted(set(values))
TypeError: '<' not supported between instances of 'str' and 'float'
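The last frame of the traceback, uniques = sorted(set(values)), points at the likely cause: pandas loads an empty State field as NaN, which is a float, and Python 3 cannot order a float against a str. A minimal sketch that reproduces the same kind of TypeError with the standard library only (the sample values are made up):

```python
# Made-up State values; float("nan") stands in for the empty field,
# which pandas would load as NaN.
values = ["CON", "INT", float("nan")]

try:
    sorted(set(values))
    error = None
except TypeError as exc:
    # Python 3 refuses to order str against float.
    error = exc

print(type(error).__name__, error)
```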
NEW CODE:
I tried to do what Hemerson Tacon suggested and wrapped the parts where the traceback reports an error in try/except, but I get a warning and then the code throws another error.
le = LabelEncoder()
dfle = df
try:
    dfle.State = le.fit_transform(dfle.State)
except TypeError:
    pass
X = dfle[['State']].values
Y = dfle[['Proto']].values
ohe = OneHotEncoder()
try:
    OnehotX = ohe.fit_transform(X).toarray()
except ValueError:
    pass
OnehotY = ohe.fit_transform(Y).toarray()
dx = pd.DataFrame(data=OnehotX)
dy = pd.DataFrame(data=OnehotY)
dfle['State'] = (dx[dx.columns[0:]].apply(lambda x:''.join(x.dropna().astype(int).astype(str)), axis=1))
dfle['Proto'] = (dy[dy.columns[0:]].apply(lambda y:''.join(y.dropna().astype(int).astype(str)), axis=1))
NEW ERROR:
Traceback (most recent call last):
  File "C:/Users/V/PycharmProjects/PreProcess/testfile.py", line 53, in <module>
    dx = pd.DataFrame(data=OnehotX)
NameError: name 'OnehotX' is not defined
LAST EDIT 09/03
The solution was simply to add a replace() call to the code, so that NaN values in State are replaced with the word "empty" before encoding. That fixes the problem.
dfle['State'].replace(np.nan,"empty", inplace=True)
df = opendataset()
df['State2'] = df['State']
df['Proto2'] = df['Proto']
df['Dur'] = df.Dur.apply(lambda n: '%.6f' % n)
le = LabelEncoder()
dfle = df
dfle['State'].replace(np.nan,"empty", inplace=True)
dfle.State = le.fit_transform(dfle.State)
X = dfle[['State']].values
Y = dfle[['Proto']].values
ohe = OneHotEncoder()
OnehotX = ohe.fit_transform(X).toarray()
OnehotY = ohe.fit_transform(Y).toarray()
dx = pd.DataFrame(data=OnehotX)
dy = pd.DataFrame(data=OnehotY)
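A small sketch of why the replacement helps, on made-up data. It uses fillna with an explicit assignment rather than replace(..., inplace=True); this is a swapped-in equivalent, chosen on the assumption that in-place modification of a selected column may not reach the parent DataFrame in every pandas version:

```python
import numpy as np
import pandas as pd

# Made-up sample; the NaN stands in for the empty State field.
df = pd.DataFrame({"State": ["CON", np.nan, "INT"]})

# Assign the result back instead of relying on inplace=True.
df["State"] = df["State"].fillna("empty")

# The sort that failed inside LabelEncoder now works: every value is a str.
classes = sorted(set(df["State"]))
print(classes)
```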

You could put the code in question inside a try block and catch the TypeError exception. In the except block, check whether this is the case where the State field is empty: if so, ignore it as you said; if not, re-raise the error.
If you had posted the actual code that applies the one-hot encoding to your data, it would be easier to answer and to provide some code.
Edit
The OnehotX variable is defined only inside the try block, so when the exception is caught it is never assigned. You need to define it before the block to fix the NameError; something like OnehotX = None would work. Also, to reinforce what I said before: in the except block it is good practice to test whether the exception is due to the problem you have identified, that is, to test whether the State field is empty.
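The pattern described in this answer could look roughly like the sketch below. It is standard library only; encode_state is a hypothetical helper, and sorted(set(values)) stands in for the encoder call that fails in the traceback:

```python
def encode_state(values):
    """Return sorted unique labels, or None when an empty (NaN) field blocks sorting."""
    uniques = None  # defined before the try block, so it always exists afterwards
    try:
        uniques = sorted(set(values))
    except TypeError:
        # NaN is the only value that is not equal to itself.
        has_empty = any(isinstance(v, float) and v != v for v in values)
        if not has_empty:
            raise  # some other problem: do not hide it
    return uniques


print(encode_state(["INT", "CON"]))
print(encode_state(["CON", float("nan")]))
```

Callers still have to handle the None result explicitly; otherwise the failure only moves to the next line that uses the value.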

Related

Python Length Mismatch

I'm learning python and trying to adapt a notebook someone posted on Kaggle to my current project. Unfortunately, I keep getting a "Length Mismatch: Expected Axis has 33 elements, new values have 9 elements" error.
import pandas as pd
import numpy as np
pd.set_option('display.float_format', lambda x: '{:.3f}'.format(x))
# raw string so the backslashes in the Windows path are not treated as escapes
data = pd.read_csv(r'c:\python\Rent Increase History.csv')
def stats(df, pred=None):
    obs = df.shape[0]
    types = df.dtypes
    counts = df.apply(lambda x: x.count())
    uniques = df.apply(lambda x: [x.unique()])
    nulls = df.apply(lambda x: x.isnull().sum())
    distinct = df.apply(lambda x: x.unique().shape[0])
    ratio_missing = (df.isnull().sum() / obs) * 100
    skewness = df.skew()
    kurtosis = df.kurt()
    print('Data shape:', df.shape)
    if pred is None:
        cols = ['types', 'counts', 'distinct', 'nulls', 'ratio_missing', 'uniques', 'skewness', 'kurtosis']
        str = pd.concat([types, counts, distinct, nulls, ratio_missing, uniques, skewness, kurtosis], axis=1)
    else:
        corr = df.corr()[pred]
        str = pd.concat([types, counts, distinct, nulls, ratio_missing, uniques, skewness, kurtosis, corr], axis=1,
                        sort=False)
        corr_col = 'corr ' + pred
        cols = ['types', 'counts', 'distinct', 'nulls', 'ratio_missing', 'uniques', 'skewness', 'kurtosis', corr_col]
    str.columns = cols
    dtypes = str.types.value_counts()
    print('___________________________\nData types:\n', str.types.value_counts())
    print('___________________________')
    return str
StatDetails = stats(data, 'MovedOutInPeriod')
print(StatDetails.sort_values(by='corr MovedOutInPeriod', ascending=False))
From what I can tell, this function should be returning 9 columns instead of the 33 I started with by design... Why am I still getting this error?
Thanks in advance. I'm sure this is something simple that I am missing.
Update - Here's the full list of errors:
Traceback (most recent call last):
File "C:\Users\john\PycharmProjects\pythonProject\main.py", line 37, in <module>
StatDetails = rstr(data, 'MovedOutInPeriod')
File "C:\Users\john\PycharmProjects\pythonProject\main.py", line 30, in rstr
str.columns = cols
File "C:\Users\john\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\core\generic.py", line 5152, in __setattr__
return object.__setattr__(self, name, value)
File "pandas\_libs\properties.pyx", line 66, in pandas._libs.properties.AxisProperty.__set__
File "C:\Users\john\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\core\generic.py", line 564, in _set_axis
self._mgr.set_axis(axis, labels)
File "C:\Users\john\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\core\internals\managers.py", line 226, in set_axis
raise ValueError(
ValueError: Length mismatch: Expected axis has 33 elements, new values have 9 elements
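The problem is the uniques feature. Wrapping x.unique() in a list most likely changes what apply returns, so the concatenated frame ends up with more columns than there are names in cols. Write the code like this: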
uniques = df.apply(lambda x: x.unique())
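What apply returns for array-like results can depend on the pandas version and on whether the arrays happen to have equal length. As a hedged alternative (not the answerer's method), the per-column unique values can be built explicitly as a Series, which guarantees one entry per column. The sample frame below is made up:

```python
import pandas as pd

# Made-up frame with a different number of unique values in each column.
df = pd.DataFrame({"a": [1, 1, 2], "b": ["x", "y", "z"], "c": [0.5, 0.5, 0.5]})

# One entry per column, each holding that column's array of unique values.
uniques = pd.Series({col: df[col].unique() for col in df.columns})
counts = df.count()

# Both pieces are Series indexed by column name, so concat yields exactly two columns.
summary = pd.concat([counts, uniques], axis=1)
summary.columns = ["counts", "uniques"]
print(summary.shape)
```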

python - How to read table with chunksize and names

How can I read data from a CSV with chunksize and names?
I tried this:
sms = pd.read_table('demodata.csv', header=None, names=['label', 'good'])
X = sms.label.tolist()
y = sms.good.tolist()
and it worked totally fine. But if I try this, I get an error:
sms = pd.read_table('demodata.csv', chunksize=100, header=None, names=['label', 'good'])
X = sms.label.tolist()
y = sms.good.tolist()
This is the error:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-18-e3f35149ab7f> in <module>()
----> 1 X = sms.label.tolist()
2 y = sms.good.tolist()
AttributeError: 'TextFileReader' object has no attribute 'label'
Why does it work in the first case but not in the second?
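With chunksize set, read_table returns a TextFileReader, an iterator that yields one DataFrame per chunk, rather than a single DataFrame; that is why sms.label is missing. A minimal sketch of collecting the columns chunk by chunk, assuming tab-separated input (the in-memory sample data is made up and replaces demodata.csv):

```python
import io

import pandas as pd

# Made-up tab-separated sample standing in for demodata.csv.
raw = "ham\t1\nspam\t0\nham\t1\nspam\t0\nham\t1\n"

reader = pd.read_table(io.StringIO(raw), chunksize=2, header=None,
                       names=["label", "good"])

X, y = [], []
for chunk in reader:  # each chunk is an ordinary DataFrame
    X.extend(chunk["label"].tolist())
    y.extend(chunk["good"].tolist())

print(X, y)
```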

AttributeError: 'Series' object has no attribute 'label'

I'm trying to follow a tutorial on sound classification in neural networks, and I've found 3 different versions of the same tutorial, all of which work, but they all reach a snag at this point in the code, where I get the "AttributeError: 'Series' object has no attribute 'label'" issue. I'm not particularly au fait with either NNs or Python, so apologies if this is something trivial like a deprecation error, but I can't seem to figure it out myself.
def parser(row):
    # function to load files and extract features
    file_name = os.path.join(os.path.abspath(data_dir), 'Train/train', str(row.ID) + '.wav')
    # handle exception to check if there isn't a file which is corrupted
    try:
        # here kaiser_fast is a technique used for faster extraction
        X, sample_rate = librosa.load(file_name, res_type='kaiser_fast')
        # we extract mfcc feature from data
        mfccs = np.mean(librosa.feature.mfcc(y=X, sr=sample_rate, n_mfcc=40).T,axis=0)
    except Exception as e:
        # report the path that failed (the original printed the undefined name `file`)
        print("Error encountered while parsing file: ", file_name)
        return None, None
    feature = mfccs
    label = row.Class
    return [feature, label]
temp = train.apply(parser, axis=1)
temp.columns = ['feature', 'label']
from sklearn.preprocessing import LabelEncoder
X = np.array(temp.feature.tolist())
y = np.array(temp.label.tolist())
lb = LabelEncoder()
y = np_utils.to_categorical(lb.fit_transform(y))
As mentioned, I've seen three different tutorials on the same subject, all of which end with the same "temp = train.apply(parser, axis=1) temp.columns = ['feature', 'label']" fragment, so I'm assuming this is assigning correctly, but I don't know where it's going wrong otherwise. Help appreciated!
Edit: Traceback as requested, turns out I'd added the wrong traceback. Also I've since found out that this is a case of converting the series object to a dataframe, so any help with that would be great.
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-17-1613f53e2d98> in <module>()
1 from sklearn.preprocessing import LabelEncoder
2
----> 3 X = np.array(temp.feature.tolist())
4 y = np.array(temp.label.tolist())
5
/anaconda3/lib/python3.6/site-packages/pandas/core/generic.py in __getattr__(self, name)
4370 if self._info_axis._can_hold_identifiers_and_holds_name(name):
4371 return self[name]
-> 4372 return object.__getattribute__(self, name)
4373
4374 def __setattr__(self, name, value):
AttributeError: 'Series' object has no attribute 'feature'
Your current implementation of parser(row) returns a list for each row of the train DataFrame, and apply collects these lists into a pandas.Series object.
So temp is actually a Series, and the following line has no useful effect:
temp.columns = ['feature', 'label']
Since temp is a Series, it has no columns, so temp.feature and temp.label don't exist, hence the error.
Change your parser() function as follows:
def parser(row):
    ...
    ...
    ...
    # Return pandas.Series instead of List
    return pd.Series([feature, label])
By doing this, the apply call in temp = train.apply(parser, axis=1) will return a DataFrame, so the rest of your code will work.
I can't speak for the tutorials you are following. Maybe they used an older version of pandas that automatically converted a list to a DataFrame.
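The difference can be checked on a toy frame. In the sketch below the train data and both parser variants are made up for illustration, and a number replaces the MFCC features:

```python
import pandas as pd

# Made-up stand-in for the tutorial's train DataFrame.
train = pd.DataFrame({"ID": [1, 2], "Class": ["dog", "siren"]})


def parser_list(row):
    # Returns a plain list, like the original parser.
    return [row.ID * 10, row.Class]


def parser_series(row):
    # Returns a Series; its index becomes the column names of the result.
    return pd.Series([row.ID * 10, row.Class], index=["feature", "label"])


as_list = train.apply(parser_list, axis=1)
as_frame = train.apply(parser_series, axis=1)

print(type(as_list).__name__, type(as_frame).__name__)
```

Giving the returned Series an index also makes the later temp.columns assignment unnecessary.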

Error in scikit code

I am new to machine learning and am trying the Titanic problem from Kaggle. I have written the code below, which uses a decision tree on the data. There is an error that I am unable to get rid of.
Code :
#!/usr/bin/env python
from __future__ import print_function
import pandas as pd
import numpy as np
from sklearn import tree
train_uri = './titanic/train.csv'
test_uri = './titanic/test.csv'
train = pd.read_csv(train_uri)
test = pd.read_csv(test_uri)
# print(train[train["Sex"] == 'female']["Survived"].value_counts(normalize=True))
train['Child'] = float('NaN')
train['Child'][train['Age'] < 18] = 1
train['Child'][train['Age'] >= 18] = 0
# print(train[train['Child'] == 1]['Survived'].value_counts(normalize=True))
# print(train['Embarked'][train['Embarked'] == 'C'].value_counts())
# print(train.shape)
## Fill empty 'Embarked' values with 'S'
train['Embarked'] = train['Embarked'].fillna('S')
## Convert Embarked classes to integers
train["Embarked"][train["Embarked"] == "S"] = 0
train['Embarked'][train['Embarked'] == "C"] = 1
train['Embarked'][train['Embarked'] == "Q"] = 2
train['Sex'][train['Sex'] == 'male'] = 0
train['Sex'][train['Sex'] == 'female'] = 1
target = train['Survived'].values
features_a = train[['Pclass', 'Sex', 'Age', 'Fare']].values
tree_a = tree.DecisionTreeClassifier()
##### Line With Error #####
tree_a = tree_a.fit(features_a, target)
# print(tree_a.feature_importances_)
# print(tree_a.score(features_a, target))
Error:
Traceback (most recent call last):
File "titanic.py", line 40, in <module>
tree_a = tree_a.fit(features_a, target)
File "/usr/local/lib/python2.7/dist-packages/sklearn/tree/tree.py", line 739, in fit
X_idx_sorted=X_idx_sorted)
File "/usr/local/lib/python2.7/dist-packages/sklearn/tree/tree.py", line 122, in fit
X = check_array(X, dtype=DTYPE, accept_sparse="csc")
File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", line 407, in check_array
_assert_all_finite(array)
File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", line 58, in _assert_all_finite
" or a value too large for %r." % X.dtype)
ValueError: Input contains NaN, infinity or a value too large for dtype('float32').
This error doesn't occur when I run the code on the DataCamp server, but it does when I run it locally. I don't understand why; I have checked the data, and neither features_a nor target seems to contain NaN or very large values.
Try each feature one by one and you will probably find that one of them has some nulls. I note that you do not check whether Sex has nulls.
Also, coding each categorical variable by hand makes it easy to introduce an error, for example by misspelling one of the categories. Instead you can use df = pd.get_dummies(df), which codes all the categorical variables automatically, with no need to specify each category manually.
You can also try the pandas dropna() function to drop all rows that contain invalid values such as NaN.
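The suggestions above can be combined into a short check. This is a sketch on made-up rows that reuse the question's column names; in the question's code Age is never filled, so it is a likely candidate for the NaN:

```python
import numpy as np
import pandas as pd

# Made-up rows using the question's feature names; one Age is missing.
train = pd.DataFrame({
    "Pclass": [3, 1, 3, 2],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Fare": [7.25, 71.28, 7.92, 13.0],
})
features = ["Pclass", "Sex", "Age", "Fare"]

# Count the nulls in each feature to find the offending column.
null_counts = train[features].isnull().sum()
print(null_counts)

# Drop incomplete rows, then let pandas encode the categorical column.
clean = train.dropna(subset=features)
encoded = pd.get_dummies(clean[features], columns=["Sex"])
print(encoded.columns.tolist())
```

Dropping rows discards data; filling Age with a statistic such as the median is a common alternative, though which one is appropriate depends on the problem.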

Astroquery python: querying NED with list of objects

I have extracted a list of Simbad names from a VizieR catalog and would like to find the axis ratio of the objects from the diameters table in NED. Code below.
import numpy as np
from astropy.table import Table,Column
from astroquery.vizier import Vizier
from astroquery.ned import Ned
v = Vizier(columns = ['SimbadName','W50max'])
catalist = v.find_catalogs('VIII/73')
v.ROW_LIMIT = -1
a = v.get_catalogs(catalist.keys())
filter = a[0]['W50max'] > 500
targets = a[0][filter]
print targets
simName = targets['SimbadName']
W50max = targets['W50max']
counter = 1
for objects in simName:
    result_table = Ned.get_table(objects, table='diameters')
    ## find where Axis Ratio results are not masked
    notMasked = (np.where(result_table['NED Axis Ratio'].mask == False))
    ## calculate average value of Axis Ratio
    print counter, np.sum(result_table['NED Axis Ratio'])/np.size(notMasked)
    counter += 1
The fourth object in simName has no diameters table, so the query raises an error:
File "/home/tom/VizRauto.py", line 40, in <module>
result_table = Ned.get_table(objects, table='diameters')
File "/usr/local/lib/python2.7/dist-packages/astroquery/ned/core.py", line 505, in get_table
result = self._parse_result(response, verbose=verbose)
File "/usr/local/lib/python2.7/dist-packages/astroquery/ned/core.py", line 631, in _parse_result
raise RemoteServiceError("The remote service returned the following error message.\nERROR: {err_msg}".format(err_msg=err_msg))
RemoteServiceError: The remote service returned the following error message.
ERROR: Unknown error
So I tried:
counter = 1
for objects in simName:
    try:
        result_table = Ned.get_table(objects, table='diameters')
        ## find where Axis Ratio results are not masked
        notMasked = (np.where(result_table['NED Axis Ratio'].mask == False))
        ## calculate average value of Axis Ratio
        print counter, np.sum(result_table['NED Axis Ratio'])/np.size(notMasked)
    except RemoteServiceError:
        continue
    counter += 1
which produces:
Traceback (most recent call last):
File "/home/tom/Dropbox/AST03CosmoLarge/Project/scripts/VizRauto.py", line 57, in <module>
except RemoteServiceError:
NameError: name 'RemoteServiceError' is not defined
So obviously the RemoteServiceError from core.py is not recognized. What is the best way to handle this or is there a better method for querying NED with a list of objects?
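The NameError means the name RemoteServiceError was never imported into the script; astroquery defines it in astroquery.exceptions. A hedged sketch of the loop with the import in place follows. query_all and fake_fetch are hypothetical helpers, and the fallback class exists only so the sketch runs without astroquery or network access:

```python
try:
    # Where astroquery defines the exception (assumes astroquery is installed).
    from astroquery.exceptions import RemoteServiceError
except ImportError:
    # Stand-in so the sketch still runs without astroquery.
    class RemoteServiceError(Exception):
        pass


def query_all(names, fetch):
    """Call fetch(name) for every name; collect results and the names that failed."""
    results, failed = {}, []
    for name in names:
        try:
            results[name] = fetch(name)
        except RemoteServiceError:
            failed.append(name)
    return results, failed


def fake_fetch(name):
    # Hypothetical stand-in for Ned.get_table(name, table='diameters').
    if name == "no-diameters":
        raise RemoteServiceError("Unknown error")
    return [1.0, 2.0]


results, failed = query_all(["NGC 1", "no-diameters", "NGC 2"], fake_fetch)
print(sorted(results), failed)
```

With astroquery available, fetch could be lambda name: Ned.get_table(name, table='diameters'). Keeping the list of failed names makes it easy to see which objects had no diameters table.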
