I'm trying to label encode the second column, but I'm getting an error. What am I doing wrong? I'm able to encode the first column without any problem.
data.head()
area_type availability location size society total_sqft bath balcony price
0 Super built-up Area 19-Dec Electronic City Phase II 2 BHK Coomee 1056 2.0 1.0 39.07
1 Plot Area Ready To Move Chikka Tirupathi 4 Bedroom Theanmp 2600 5.0 3.0 120.00
2 Built-up Area Ready To Move Uttarahalli 3 BHK NaN 1440 2.0 3.0 62.00
3 Super built-up Area Ready To Move Lingadheeranahalli 3 BHK Soiewre 1521 3.0 1.0 95.00
4 Super built-up Area Ready To Move Kothanur 2 BHK NaN 1200 2.0 1.0 51.00
enc = LabelEncoder()
data.iloc[:,2] = enc.fit_transform(data.iloc[:,2])
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-20-53fda4a71b5e> in <module>()
1 enc = LabelEncoder()
----> 2 data.iloc[:,2] = enc.fit_transform(data.iloc[:,2])
~/anaconda3/lib/python3.6/site-packages/sklearn/preprocessing/label.py in fit_transform(self, y)
110 """
111 y = column_or_1d(y, warn=True)
--> 112 self.classes_, y = np.unique(y, return_inverse=True)
113 return y
114
~/anaconda3/lib/python3.6/site-packages/numpy/lib/arraysetops.py in unique(ar, return_index, return_inverse, return_counts, axis)
208 ar = np.asanyarray(ar)
209 if axis is None:
--> 210 return _unique1d(ar, return_index, return_inverse, return_counts)
211 if not (-ar.ndim <= axis < ar.ndim):
212 raise ValueError('Invalid axis kwarg specified for unique')
~/anaconda3/lib/python3.6/site-packages/numpy/lib/arraysetops.py in _unique1d(ar, return_index, return_inverse, return_counts)
272
273 if optional_indices:
--> 274 perm = ar.argsort(kind='mergesort' if return_index else 'quicksort')
275 aux = ar[perm]
276 else:
TypeError: '<' not supported between instances of 'float' and 'str'
I want to label encode the location column. If I use data.iloc[:,1] = enc.fit_transform(data.iloc[:,1]) I can label encode the availability column just fine, so why does column 2 fail, and how can I fix this?
What is the datatype of your column?
The error arises because the label encoder cannot sort a mix of floats and strings — and np.nan is a float, so a string column containing NaN is exactly such a mix.
To fix this you can:
- replace any NaN with an empty string: data['col_name'].fillna('', inplace=True), or
- convert the whole column to strings: data['col_name'] = data['col_name'].astype(str)
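Putting the first fix together with the encoder — a minimal sketch on toy data (the column name location and its values are illustrative, not your actual data):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

data = pd.DataFrame({'location': ['Uttarahalli', None, 'Kothanur', 'Uttarahalli']})

# Replace NaN with an empty string so every value in the column is a str
data['location'] = data['location'].fillna('')

enc = LabelEncoder()
data['location'] = enc.fit_transform(data['location'])
print(data['location'].tolist())  # [2, 0, 1, 2] — '' sorts first, so it gets code 0
```

Note that the empty string becomes its own category; if you'd rather treat missing locations differently (e.g. drop those rows), do that before encoding.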
Related
I need to transform floats to ints, but without losing any information in the conversion. The values (from a dataframe column used as y when building a model) are as follows:
-1.0
0.0
9.0
-0.5
1.5
1.5
...
If I convert them to int directly, -0.5 might become 0 or -1, so I would lose information. I need ints because I have to pass the column as y to model.fit(X, y). Is there a format that lets me pass these values to fit (the column above is meant to be y)?
Code:
import numpy as np
from sklearn import preprocessing
from sklearn.preprocessing import MinMaxScaler
from sklearn.semi_supervised import LabelSpreading

le = preprocessing.LabelEncoder()
X = df[['Col1', 'Col2']].apply(le.fit_transform)
X_transformed = np.concatenate((X[['Col1']], X[['Col2']]), axis=1)
y = df['Label'].values
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X_transformed)
model_LS = LabelSpreading(kernel='knn',
                          gamma=70,
                          alpha=0.5,
                          max_iter=30,
                          tol=0.001,
                          n_jobs=-1,
                          )
LS = model_LS.fit(X_scaled, y)
Data:
Col1 Col2 Label
Cust1 Cust2 1.0
Cust1 Cust4 1.0
Cust4 Cust5 -1.5
Cust12 Cust6 9.0
The error that I am getting running the above code is:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-174-14429cc07d75> in <module>
2
----> 3 LS=model_LS.fit(X_scaled, y)
~/opt/anaconda3/lib/python3.8/site-packages/sklearn/semi_supervised/_label_propagation.py in fit(self, X, y)
228 X, y = self._validate_data(X, y)
229 self.X_ = X
--> 230 check_classification_targets(y)
231
232 # actual graph construction (implementations should override this)
~/opt/anaconda3/lib/python3.8/site-packages/sklearn/utils/multiclass.py in check_classification_targets(y)
181 if y_type not in ['binary', 'multiclass', 'multiclass-multioutput',
182 'multilabel-indicator', 'multilabel-sequences']:
--> 183 raise ValueError("Unknown label type: %r" % y_type)
184
185
ValueError: Unknown label type: 'continuous'
You can multiply your values to remove the decimal part:
df = pd.DataFrame({'Label': [1.0, -1.3, 0.75, 9.0, 7.8236]})
decimals = df['Label'].astype(str).str.split('.').str[1].str.len().max()
df['y'] = df['Label'].mul(float(f"1e{decimals}")).astype(int)
print(df)
# Output:
Label y
0 1.0000 10000
1 -1.3000 -13000
2 0.7500 7500
3 9.0000 90000
4 7.8236 78236
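Since the multiplier is a power of ten, the scaling is reversible — you can recover the original labels by dividing back. A sketch using the same decimals idea (adding round() before the int cast guards against floating-point truncation like 78235.999…):

```python
import pandas as pd

df = pd.DataFrame({'Label': [1.0, -1.3, 0.75, 9.0, 7.8236]})

# The longest fractional part determines the scale factor
decimals = df['Label'].astype(str).str.split('.').str[1].str.len().max()
scale = 10 ** int(decimals)

df['y'] = (df['Label'] * scale).round().astype(int)  # integer labels for fitting
df['restored'] = df['y'] / scale                     # invert the scaling afterwards
```

Keep in mind the model will treat these as class labels, so the numeric spacing between them no longer matters once encoded.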
I think you need:
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame(data={'y': [-1.0, 0.0, 9.0, -0.5, 1.5, 1.5]})
le = LabelEncoder()
df['y'] = le.fit_transform(df['y'])
print(df)
OUTPUT
y
0 0
1 2
2 4
3 1
4 3
5 3
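A nice property of this approach is that no information is lost: the encoder remembers its classes, and inverse_transform maps the integer codes back to the original floats. A short sketch:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame(data={'y': [-1.0, 0.0, 9.0, -0.5, 1.5, 1.5]})

le = LabelEncoder()
df['y_enc'] = le.fit_transform(df['y'])

# Recover the original float labels from the integer codes
restored = le.inverse_transform(df['y_enc'])
print(list(restored))  # [-1.0, 0.0, 9.0, -0.5, 1.5, 1.5]
```

So you can fit the model on the encoded column and still translate predictions back to the original values afterwards.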
I have a dataframe (cgf) that looks as follows and I want to remove the outliers for only the numerical columns:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 180 entries, 0 to 179
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Product 180 non-null object
1 Age 180 non-null int64
2 Gender 180 non-null object
3 Education 180 non-null category
4 MaritalStatus 180 non-null object
5 Usage 180 non-null int64
6 Fitness 180 non-null category
7 Income 180 non-null int64
8 Miles 180 non-null int64
dtypes: category(2), int64(4), object(3)
I tried several scripts using z-score and IQR methods, but none of them worked. For example, here is a script for the z-score that didn't work
from scipy import stats
import numpy as np
z = np.abs(stats.zscore(cgf)) # get the z-score of every value with respect to their columns
print(z)
I get this error
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-102-2759aa3fbd60> in <module>
----> 1 z = np.abs(stats.zscore(cgf)) # get the z-score of every value with respect to their columns
2 print(z)
~\anaconda3\lib\site-packages\scipy\stats\stats.py in zscore(a, axis, ddof, nan_policy)
2495 sstd = np.nanstd(a=a, axis=axis, ddof=ddof, keepdims=True)
2496 else:
-> 2497 mns = a.mean(axis=axis, keepdims=True)
2498 sstd = a.std(axis=axis, ddof=ddof, keepdims=True)
2499
~\anaconda3\lib\site-packages\numpy\core\_methods.py in _mean(a, axis, dtype, out, keepdims)
160 ret = umr_sum(arr, axis, dtype, out, keepdims)
161 if isinstance(ret, mu.ndarray):
--> 162 ret = um.true_divide(
163 ret, rcount, out=ret, casting='unsafe', subok=False)
164 if is_float16_result and out is None:
TypeError: unsupported operand type(s) for /: 'str' and 'int'
Here is the IQR method I tried, but it also failed as follows:
np.where((cgf < (Q1 - 1.5 * IQR)) | (cgf > (Q3 + 1.5 * IQR)))
error message:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-96-bb3dfd2ce6c5> in <module>
----> 1 np.where((cgf < (Q1 - 1.5 * IQR)) | (cgf > (Q3 + 1.5 * IQR)))
~\anaconda3\lib\site-packages\pandas\core\ops\__init__.py in f(self, other)
702
703 # See GH#4537 for discussion of scalar op behavior
--> 704 new_data = dispatch_to_series(self, other, op, axis=axis)
705 return self._construct_result(new_data)
706
~\anaconda3\lib\site-packages\pandas\core\ops\__init__.py in dispatch_to_series(left, right, func, axis)
273 # _frame_arith_method_with_reindex
274
--> 275 bm = left._mgr.operate_blockwise(right._mgr, array_op)
276 return type(left)(bm)
277
~\anaconda3\lib\site-packages\pandas\core\internals\managers.py in operate_blockwise(self, other, array_op)
362 Apply array_op blockwise with another (aligned) BlockManager.
363 """
--> 364 return operate_blockwise(self, other, array_op)
365
366 def apply(self: T, f, align_keys=None, **kwargs) -> T:
~\anaconda3\lib\site-packages\pandas\core\internals\ops.py in operate_blockwise(left, right, array_op)
36 lvals, rvals = _get_same_shape_values(blk, rblk, left_ea, right_ea)
37
---> 38 res_values = array_op(lvals, rvals)
39 if left_ea and not right_ea and hasattr(res_values, "reshape"):
40 res_values = res_values.reshape(1, -1)
~\anaconda3\lib\site-packages\pandas\core\ops\array_ops.py in comparison_op(left, right, op)
228 if should_extension_dispatch(lvalues, rvalues):
229 # Call the method on lvalues
--> 230 res_values = op(lvalues, rvalues)
231
232 elif is_scalar(rvalues) and isna(rvalues):
~\anaconda3\lib\site-packages\pandas\core\ops\common.py in new_method(self, other)
63 other = item_from_zerodim(other)
64
---> 65 return method(self, other)
66
67 return new_method
~\anaconda3\lib\site-packages\pandas\core\arrays\categorical.py in func(self, other)
74 if not self.ordered:
75 if opname in ["__lt__", "__gt__", "__le__", "__ge__"]:
---> 76 raise TypeError(
77 "Unordered Categoricals can only compare equality or not"
78 )
TypeError: Unordered Categoricals can only compare equality or not
How do I resolve these errors? The mix of categorical and numerical data in my df seems to be the cause, but I am a newbie and I don't know how to fix it so that I can remove the outliers.
Restrict both methods to the numeric columns first, e.g. numeric = cgf.select_dtypes(include='number') — the z-score computation and the IQR comparison both fail on object and unordered category columns, which is exactly what the two tracebacks show. Also note that if you drop outliers based on, say, the 'Age' column, the change is reflected in the whole data frame, i.e. the entire row is dropped.
Reference: towardsdatascience
Reference: how-to-remove-outliers
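A minimal sketch of the numeric-only z-score approach on toy data shaped like cgf (the column names and values here are illustrative; 3 standard deviations is the conventional cutoff):

```python
import numpy as np
import pandas as pd
from scipy import stats

# Toy frame mixing numeric and non-numeric columns, like cgf
cgf = pd.DataFrame({
    'Product': [f'P{i}' for i in range(20)],
    'Gender': ['M', 'F'] * 10,
    'Age': [24, 25, 26] * 6 + [25, 95],       # 95 is an obvious outlier
    'Income': [50000 + 100 * i for i in range(20)],
})

# stats.zscore(cgf) fails on the string columns, so restrict to numeric first
numeric = cgf.select_dtypes(include='number')
z = np.abs(stats.zscore(numeric))

# Keep rows whose numeric values all lie within 3 standard deviations
cgf_clean = cgf[(z < 3).all(axis=1)]
print(len(cgf_clean))  # 19 — the Age=95 row is dropped
```

The same select_dtypes trick fixes the IQR version: compute Q1, Q3, and IQR on numeric only, since unordered Categoricals support equality checks but not < / > comparisons.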
See the error message below. It points to this code, which takes two numpy arrays of company brands and checks whether any new brand names appear in the new_df brand column.
I have looked at the inputs new_df['brand'].unique() and existing_df['brand'].unique(); neither is None — both are numpy arrays — so I don't see what the problem is:
#find new brands
brand_diff = np.setdiff1d(new_df['brand'].unique(),existing_df['brand'].unique(),False)
count_brand_diff = len(brand_diff)
TypeError Traceback (most recent call last)
<ipython-input-75-254b4c01e085> in <module>
71
72 #find new brands
---> 73 brand_diff = np.setdiff1d(new_df['brand'].unique(),existing_df['brand'].unique(),False)
74 count_brand_diff = len(brand_diff)
75
<__array_function__ internals> in setdiff1d(*args, **kwargs)
~/opt/anaconda3/lib/python3.7/site-packages/numpy/lib/arraysetops.py in setdiff1d(ar1, ar2, assume_unique)
782 ar1 = np.asarray(ar1).ravel()
783 else:
--> 784 ar1 = unique(ar1)
785 ar2 = unique(ar2)
786 return ar1[in1d(ar1, ar2, assume_unique=True, invert=True)]
<__array_function__ internals> in unique(*args, **kwargs)
~/opt/anaconda3/lib/python3.7/site-packages/numpy/lib/arraysetops.py in unique(ar, return_index, return_inverse, return_counts, axis)
260 ar = np.asanyarray(ar)
261 if axis is None:
--> 262 ret = _unique1d(ar, return_index, return_inverse, return_counts)
263 return _unpack_tuple(ret)
264
~/opt/anaconda3/lib/python3.7/site-packages/numpy/lib/arraysetops.py in _unique1d(ar, return_index, return_inverse, return_counts)
308 aux = ar[perm]
309 else:
--> 310 ar.sort()
311 aux = ar
312 mask = np.empty(aux.shape, dtype=np.bool_)
TypeError: '<' not supported between instances of 'NoneType' and 'NavigableString'
The problem is with the data you are using, not the code — the code itself is correct. For example:
>>existing_df
brand
apple
apple
bmw
>>new_df
brand
apple
lexus
bmw
>>count_brand_diff
1
Hence, if you need more help, please provide an example of the data you are using. (Judging by the traceback — 'NoneType' vs 'NavigableString' — your brand column probably mixes None values with BeautifulSoup strings, which np.unique cannot sort.)
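If the column does contain None values (common when the brands were scraped), cleaning it before the set difference avoids the sort failure. A sketch with toy data:

```python
import numpy as np
import pandas as pd

# Toy data: a scraped brand column can contain None alongside strings
new_df = pd.DataFrame({'brand': ['apple', 'lexus', None, 'bmw']})
existing_df = pd.DataFrame({'brand': ['apple', 'apple', 'bmw']})

# dropna() removes the None values that make np.unique's sort fail;
# astype(str) also normalises bs4 NavigableString objects to plain str
new_brands = new_df['brand'].dropna().astype(str).unique()
old_brands = existing_df['brand'].dropna().astype(str).unique()

brand_diff = np.setdiff1d(new_brands, old_brands)
count_brand_diff = len(brand_diff)
print(brand_diff)  # ['lexus']
```

Whether to drop the missing values or keep them as a sentinel string is a judgment call; dropping is shown here because None is rarely a meaningful brand.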
I am trying to convert my categorical columns into integers with Label Encoder in order to create a correlation matrix consisting of a mix of numerical and categorical variables. This is my table structure:
a int64
b int64
c object
d object
e object
f object
g object
dtype: object
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
for x in df.columns:
if df[x].dtypes=='object':
df[x]=le.fit_transform(df[x])
corr = df.corr()
Then I get this error:
TypeError: unorderable types: int() < str()
TypeError Traceback (most recent call last)
<command-205607> in <module>()
3 for x in df.columns:
4 if df[x].dtypes=='object':
----> 5 df[x]=le.fit_transform(df[x])
6 corr = df.corr()
/databricks/python/lib/python3.5/site-packages/sklearn/preprocessing/label.py in fit_transform(self, y)
129 y = column_or_1d(y, warn=True)
130 _check_numpy_unicode_bug(y)
--> 131 self.classes_, y = np.unique(y, return_inverse=True)
132 return y
133
/databricks/python/lib/python3.5/site-packages/numpy/lib/arraysetops.py in unique(ar, return_index, return_inverse, return_counts, axis)
221 ar = np.asanyarray(ar)
222 if axis is None:
--> 223 return _unique1d(ar, return_index, return_inverse, return_counts)
224 if not (-ar.ndim <= axis < ar.ndim):
225 raise ValueError('Invalid axis kwarg specified for unique')
/databricks/python/lib/python3.5/site-packages/numpy/lib/arraysetops.py in _unique1d(ar, return_index, return_inverse, return_counts)
278
279 if optional_indices:
--> 280 perm = ar.argsort(kind='mergesort' if return_index else 'quicksort')
281 aux = ar[perm]
282 else:
TypeError: unorderable types: int() < str()
Does anybody have an idea what is wrong?
Change df[x]=le.fit_transform(df[x]) to
df[x]=le.fit_transform(df[x].astype(str))
And it should work.
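The root cause: an object column that mixes real ints (or NaN) with strings can't be sorted by np.unique, which LabelEncoder calls internally. Casting to str first makes every value comparable — a minimal sketch on a toy column (note that None becomes the literal string 'None' and gets its own code):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# An object column mixing strings, an int, and a missing value
df = pd.DataFrame({'c': ['red', 3, 'blue', None]}, dtype=object)

le = LabelEncoder()
# le.fit_transform(df['c'])  # would raise: '<' not supported between int and str
df['c'] = le.fit_transform(df['c'].astype(str))
print(df['c'].tolist())  # [3, 0, 2, 1] — classes sort as '3', 'None', 'blue', 'red'
```

If 'None' as a category is not what you want, fillna() with a sentinel (or drop the rows) before the astype(str).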
I'm trying to use df[df['col'].str.contains("string")] (described in these two SO questions: 1 & 2) to select rows based on a partial string match. Here's my code:
import requests
import json
import pandas as pd
import datetime
url = "http://api.turfgame.com/v4/zones/all" # get request returns .json
r = requests.get(url)
df = pd.read_json(r.content) # create a df containing all zone info
print df[df['region'].str.contains("Uppsala")].head()
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-23-55bbf5679808> in <module>()
----> 1 print df[df['region'].str.contains("Uppsala")].head()
C:\Users\User\AppData\Local\Enthought\Canopy32\User\lib\site-packages\pandas\core\frame.pyc in __getitem__(self, key)
1670 if isinstance(key, (Series, np.ndarray, list)):
1671 # either boolean or fancy integer index
-> 1672 return self._getitem_array(key)
1673 elif isinstance(key, DataFrame):
1674 return self._getitem_frame(key)
C:\Users\User\AppData\Local\Enthought\Canopy32\User\lib\site-packages\pandas\core\frame.pyc in _getitem_array(self, key)
1714 return self.take(indexer, axis=0, convert=False)
1715 else:
-> 1716 indexer = self.ix._convert_to_indexer(key, axis=1)
1717 return self.take(indexer, axis=1, convert=True)
1718
C:\Users\User\AppData\Local\Enthought\Canopy32\User\lib\site-packages\pandas\core\indexing.pyc in _convert_to_indexer(self, obj, axis, is_setter)
1083 if isinstance(obj, tuple) and is_setter:
1084 return {'key': obj}
-> 1085 raise KeyError('%s not in index' % objarr[mask])
1086
1087 return indexer
KeyError: '[ nan nan nan ..., nan nan nan] not in index'
I don't understand why I get a KeyError, because df.columns returns:
Index([u'dateCreated', u'id', u'latitude', u'longitude', u'name', u'pointsPerHour', u'region', u'takeoverPoints', u'totalTakeovers'], dtype='object')
So the key is in the list of columns, and opening the page in a browser I can find 739 instances of 'Uppsala'.
The column I'm searching contains nested JSON objects that look like this: {"id":200,"name":"Scotland","country":"gb"}. Do I have to do something special to search between the '{}' characters? Could somebody explain where I've made my mistake(s)?
Looks to me like your region column contains dictionaries, which aren't really supported as elements, and so .str isn't working. One way to solve the problem is to promote the region dictionaries to columns in their own right, maybe with something like:
>>> region = pd.DataFrame(df.pop("region").tolist())
>>> df = df.join(region, rsuffix="_region")
after which you have
>>> df.head()
dateCreated id latitude longitude name pointsPerHour takeoverPoints totalTakeovers country id_region name_region
0 2013-06-15T08:00:00+0000 14639 55.947079 -3.206477 GrandSquare 1 185 32 gb 200 Scotland
1 2014-06-15T20:02:37+0000 31571 55.649181 12.609056 Stenringen 1 185 6 dk 172 Hovedstaden
2 2013-06-15T08:00:00+0000 18958 54.593570 -5.955772 Hospitality 0 250 1 gb 206 Northern Ireland
3 2013-06-15T08:00:00+0000 18661 53.754283 -1.526638 LanshawZone 0 250 0 gb 202 Yorkshire & The Humber
4 2013-06-15T08:00:00+0000 17424 55.949285 -3.144777 NoDogsZone 0 250 5 gb 200 Scotland
and
>>> df[df["name_region"].str.contains("Uppsala")].head()
dateCreated id latitude longitude name pointsPerHour takeoverPoints totalTakeovers country id_region name_region
28 2013-07-16T18:53:48+0000 20828 59.793476 17.775389 MoraStenRast 5 125 536 se 142 Uppsala
59 2013-02-08T21:42:53+0000 14797 59.570418 17.482116 BålWoods 3 155 555 se 142 Uppsala
102 2014-06-19T12:00:00+0000 31843 59.617637 17.077094 EnaAlle 5 125 168 se 142 Uppsala
328 2012-09-24T20:08:22+0000 11461 59.634438 17.066398 BluePark 6 110 1968 se 142 Uppsala
330 2014-08-28T20:00:00+0000 33695 59.867027 17.710792 EnbackensBro 4 140 59 se 142 Uppsala
(A hack workaround would be df["region"].apply(str).str.contains("Uppsala"), but I think it's best to clean the data right at the start.)
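On pandas >= 1.0, pd.json_normalize does the same promotion in one step. A sketch with a toy frame shaped like the API response (two rows instead of the full zone list):

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['GrandSquare', 'MoraStenRast'],
    'region': [{'id': 200, 'name': 'Scotland', 'country': 'gb'},
               {'id': 142, 'name': 'Uppsala', 'country': 'se'}],
})

# Flatten the nested region dicts into their own suffixed columns
region = pd.json_normalize(df['region'].tolist()).add_suffix('_region')
df = df.drop(columns='region').join(region)

matches = df[df['name_region'].str.contains('Uppsala')]
print(matches['name'].tolist())  # ['MoraStenRast']
```

Either way, the point stands: flatten the dicts into real columns once at load time rather than string-matching against their repr.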