Model prediction returns warnings - python

I'm trying to build a model that predicts the probability for an athlete to win a medal.
I have a dataframe that looks like this:
Here is what I've already done:
# Cleaning df
# Replace NaN with the column mean
df['Height'].fillna(value=df['Height'].mean(), inplace=True)
df['Weight'].fillna(value=df['Weight'].mean(), inplace=True)
# Changing type to integer
df.Height = df.Height.astype(int)
df.Weight = df.Weight.astype(int)
# Target variable
y = df["Medal"]
# If male = 0, if female = 1
df['Sex'] = df['Sex'].apply(lambda x: 1 if str(x) != 'M' else 0)
# Predictive features
feature_names = ["Age", "Sex", "Height", "Weight"]
X = df[feature_names]
# Regressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score
regressor = DecisionTreeRegressor(random_state=0)
cross_val_score(regressor, X, y, cv=10)
But when I run the code, it returns this warning:
warnings.warn("Estimator fit failed. The score on this train-test"
C:\Users\miss_\anaconda3\lib\site-packages\sklearn\model_selection\_validation.py:610: FitFailedWarning: Estimator fit failed. The score on this train-test partition for these parameters will be set to nan. Details:
Traceback (most recent call last):
File "C:\Users\miss_\anaconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 593, in _fit_and_score
estimator.fit(X_train, y_train, **fit_params)
File "C:\Users\miss_\anaconda3\lib\site-packages\sklearn\tree\_classes.py", line 1247, in fit
super().fit(
File "C:\Users\miss_\anaconda3\lib\site-packages\sklearn\tree\_classes.py", line 156, in fit
X, y = self._validate_data(X, y,
File "C:\Users\miss_\anaconda3\lib\site-packages\sklearn\base.py", line 430, in _validate_data
X = check_array(X, **check_X_params)
File "C:\Users\miss_\anaconda3\lib\site-packages\sklearn\utils\validation.py", line 63, in inner_f
return f(*args, **kwargs)
File "C:\Users\miss_\anaconda3\lib\site-packages\sklearn\utils\validation.py", line 663, in check_array
_assert_all_finite(array,
File "C:\Users\miss_\anaconda3\lib\site-packages\sklearn\utils\validation.py", line 103, in _assert_all_finite
raise ValueError(
ValueError: Input contains NaN, infinity or a value too large for dtype('float32').
And cross_val_score returns an array like this: array([nan, nan, nan, ...])
My X looks like this
Age Sex Height Weight
0 24.0 1 180 80
1 23.0 1 170 60
2 24.0 1 175 70
3 34.0 1 175 70
4 21.0 1 185 82
... ... ... ... ...
271111 29.0 1 179 89
271112 27.0 1 176 59
271113 27.0 1 176 59
271114 30.0 1 185 96
271115 34.0 1 185 96
And my y :
0 0
1 0
2 0
3 1
4 0
..
271111 0
271112 0
271113 0
271114 0
271115 0
Name: Medal, Length: 271116, dtype: int64

You fill missing values for "Height" and "Weight". You should apply the same operation to the feature "Age".
First locate your missing values in this column:
>>> df.loc[df['Age'].isna(), ['ID', 'Name', 'Age']]
If you have few missing values, you can fill with the mean:
>>> df['Age'].fillna(value=df['Age'].mean(), inplace=True)
But if you have many missing values, filling them with the global mean is probably not a good idea. "Age" can depend on "Country", "Sport", "Year", even the "Season" (winter or summer). In fact, the same goes for Height/Weight: the average height in Volleyball is probably not the same as in Archery...
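A per-group fill is a one-liner with groupby/transform. Here is a minimal sketch, assuming the dataframe has a "Sport" column to group on (adapt the key to whichever column drives the difference in your data):
# Fill each missing Age with the mean age of the athlete's sport,
# then fall back to the global mean for sports with no known ages
df['Age'] = df.groupby('Sport')['Age'].transform(lambda s: s.fillna(s.mean()))
df['Age'].fillna(df['Age'].mean(), inplace=True)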

Related

Convert float to int without losing information of original values

I need to transform floats to ints. However, I would like to not lose any information while converting. The values (from a dataframe column used as y in model building) that I am taking into account are as follows:
-1.0
0.0
9.0
-0.5
1.5
1.5
...
If I convert them to int directly, -0.5 might become 0 or -1, so I would lose some information.
I need to convert the values above to int because I need to pass them to fit a model with model.fit(X, y). Is there any format that would allow me to pass these values to the fit function (the above column is meant to be the y column)?
Code:
import numpy as np
from sklearn import preprocessing
from sklearn.preprocessing import MinMaxScaler
from sklearn.semi_supervised import LabelSpreading

le = preprocessing.LabelEncoder()
X = df[['Col1', 'Col2']].apply(le.fit_transform)
X_transformed = np.concatenate(((X[['Col1']]), (X[['Col2']])), axis=1)
y = df['Label'].values
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X_transformed)
model_LS = LabelSpreading(kernel='knn',
                          gamma=70,
                          alpha=0.5,
                          max_iter=30,
                          tol=0.001,
                          n_jobs=-1,
                          )
LS = model_LS.fit(X_scaled, y)
Data:
Col1 Col2 Label
Cust1 Cust2 1.0
Cust1 Cust4 1.0
Cust4 Cust5 -1.5
Cust12 Cust6 9.0
The error that I get when running the above code is:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-174-14429cc07d75> in <module>
2
----> 3 LS=model_LS.fit(X_scaled, y)
~/opt/anaconda3/lib/python3.8/site-packages/sklearn/semi_supervised/_label_propagation.py in fit(self, X, y)
228 X, y = self._validate_data(X, y)
229 self.X_ = X
--> 230 check_classification_targets(y)
231
232 # actual graph construction (implementations should override this)
~/opt/anaconda3/lib/python3.8/site-packages/sklearn/utils/multiclass.py in check_classification_targets(y)
181 if y_type not in ['binary', 'multiclass', 'multiclass-multioutput',
182 'multilabel-indicator', 'multilabel-sequences']:
--> 183 raise ValueError("Unknown label type: %r" % y_type)
184
185
ValueError: Unknown label type: 'continuous'
You can multiply your values to remove the decimal part:
import pandas as pd

df = pd.DataFrame({'Label': [1.0, -1.3, 0.75, 9.0, 7.8236]})
# Use the longest decimal part in the column as the power-of-ten scaling factor
decimals = df['Label'].astype(str).str.split('.').str[1].str.len().max()
df['y'] = df['Label'].mul(float(f"1e{decimals}")).astype(int)
print(df)
# Output:
Label y
0 1.0000 10000
1 -1.3000 -13000
2 0.7500 7500
3 9.0000 90000
4 7.8236 78236
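If you later need the original values back, divide by the same factor; a one-line sketch:
df['Label_restored'] = df['y'].div(float(f"1e{decimals}"))  # recovers 1.0, -1.3, 0.75, ...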
I think you need:
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame(data={'y': [-1.0, 0.0, 9.0, -0.5, 1.5, 1.5]})
le = LabelEncoder()
le.fit(df['y'])
df['y'] = le.transform(df['y'])
print(df)
OUTPUT
y
0 0
1 2
2 4
3 1
4 3
5 3
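A nice property of LabelEncoder is that the original float labels can be recovered after modeling; a short sketch using the objects above:
# Map the encoded integers back to the original float labels
print(le.inverse_transform(df['y']))
# [-1.   0.   9.  -0.5  1.5  1.5]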

xarray set new 2D coordinate as dimension

I have an xarray dataset of sea surface temperature values on an x/y grid. x and y are 1D vector coordinates, so it looks like this minimal example:
<xarray.Dataset>
Dimensions: (x: 10, y: 10)
Coordinates:
* x (x) int64 0 1 2 3 4 5 6 7 8 9
* y (y) int64 0 1 2 3 4 5 6 7 8 9
Data variables:
data (x, y) float64 0.559 0.01037 0.1562 ... 0.08778 0.3272 0.8661
I am able to compute the lat/lon from this x/y grid, and the output is 2 2D arrays. I can add them as coordinates with ds.assign_coords:
<xarray.Dataset>
Dimensions: (x: 10, y: 10)
Coordinates:
* x (x) int64 0 1 2 3 4 5 6 7 8 9
* y (y) int64 0 1 2 3 4 5 6 7 8 9
lat (x, y) float64 30.0 30.0 30.0 30.0 30.0 ... 39.0 39.0 39.0 39.0
lon (x, y) float64 -120.0 -119.0 -118.0 -117.0 ... -113.0 -112.0 -111.0
Data variables:
data (x, y) float64 0.559 0.01037 0.1562 ... 0.08778 0.3272 0.8661
But I'd like to .sel along slices of the lat/lon. This currently isn't possible, as I get the error:
ds.sel(lat=slice(32,36), lon=slice(-118, -115))
ValueError Traceback (most recent call last)
<ipython-input-20-28c79202d5f3> in <module>
----> 1 ds.sel(lat=slice(32,36), lon=slice(-118, -115))
~/.local/lib/python3.8/site-packages/xarray/core/dataset.py in sel(self, indexers, method, tolerance, drop, **indexers_kwargs)
2363 """
2364 indexers = either_dict_or_kwargs(indexers, indexers_kwargs, "sel")
-> 2365 pos_indexers, new_indexes = remap_label_indexers(
2366 self, indexers=indexers, method=method, tolerance=tolerance
2367 )
~/.local/lib/python3.8/site-packages/xarray/core/coordinates.py in remap_label_indexers(obj, indexers, method, tolerance, **indexers_kwargs)
419 }
420
--> 421 pos_indexers, new_indexes = indexing.remap_label_indexers(
422 obj, v_indexers, method=method, tolerance=tolerance
423 )
~/.local/lib/python3.8/site-packages/xarray/core/indexing.py in remap_label_indexers(data_obj, indexers, method, tolerance)
256 new_indexes = {}
257
--> 258 dim_indexers = get_dim_indexers(data_obj, indexers)
259 for dim, label in dim_indexers.items():
260 try:
~/.local/lib/python3.8/site-packages/xarray/core/indexing.py in get_dim_indexers(data_obj, indexers)
222 ]
223 if invalid:
--> 224 raise ValueError(f"dimensions or multi-index levels {invalid!r} do not exist")
225
226 level_indexers = defaultdict(dict)
ValueError: dimensions or multi-index levels ['lat', 'lon'] do not exist
So my question is this: how can I change the dimensions of data to be (lon: (10, 10), lat: (10, 10)) instead of (x: 10, y: 10)? Is this even possible?
Code to reproduce the example dataset:
import numpy as np
import xarray as xr
# Create sample data
data = np.random.rand(10,10)
x = y = np.arange(10)
# Set up dataset
ds = xr.Dataset(
    data_vars=dict(
        data=(["x", "y"], data)
    ),
    coords={
        "x": x,
        "y": y
    }
)
# Create example lat/lon and assign to dataset
lon, lat = np.meshgrid(np.linspace(-120, -111, 10), np.linspace(30, 39, 10))
ds = ds.assign_coords({
    "lat": (["x", "y"], lat),
    "lon": (["x", "y"], lon)
})
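One workaround worth noting (a sketch, not a way to make lat/lon true dimensions) is boolean selection with .where, which can slice by the 2D coordinates directly:
# Keep only cells whose 2D lat/lon fall inside the requested box;
# drop=True trims x/y rows and columns that become all-NaN
subset = ds.where(
    (ds.lat >= 32) & (ds.lat <= 36) & (ds.lon >= -118) & (ds.lon <= -115),
    drop=True,
)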

How to label encode a column by its index in the dataset?

I'm trying to label encode the second column, but I'm getting an error. What am I doing wrong? I'm able to encode the first column.
data.head()
area_type availability location size society total_sqft bath balcony price
0 Super built-up Area 19-Dec Electronic City Phase II 2 BHK Coomee 1056 2.0 1.0 39.07
1 Plot Area Ready To Move Chikka Tirupathi 4 Bedroom Theanmp 2600 5.0 3.0 120.00
2 Built-up Area Ready To Move Uttarahalli 3 BHK NaN 1440 2.0 3.0 62.00
3 Super built-up Area Ready To Move Lingadheeranahalli 3 BHK Soiewre 1521 3.0 1.0 95.00
4 Super built-up Area Ready To Move Kothanur 2 BHK NaN 1200 2.0 1.0 51.00
enc = LabelEncoder()
data.iloc[:,2] = enc.fit_transform(data.iloc[:,2])
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-20-53fda4a71b5e> in <module>()
1 enc = LabelEncoder()
----> 2 data.iloc[:,2] = enc.fit_transform(data.iloc[:,2])
~/anaconda3/lib/python3.6/site-packages/sklearn/preprocessing/label.py in fit_transform(self, y)
110 """
111 y = column_or_1d(y, warn=True)
--> 112 self.classes_, y = np.unique(y, return_inverse=True)
113 return y
114
~/anaconda3/lib/python3.6/site-packages/numpy/lib/arraysetops.py in unique(ar, return_index, return_inverse, return_counts, axis)
208 ar = np.asanyarray(ar)
209 if axis is None:
--> 210 return _unique1d(ar, return_index, return_inverse, return_counts)
211 if not (-ar.ndim <= axis < ar.ndim):
212 raise ValueError('Invalid axis kwarg specified for unique')
~/anaconda3/lib/python3.6/site-packages/numpy/lib/arraysetops.py in _unique1d(ar, return_index, return_inverse, return_counts)
272
273 if optional_indices:
--> 274 perm = ar.argsort(kind='mergesort' if return_index else 'quicksort')
275 aux = ar[perm]
276 else:
TypeError: '<' not supported between instances of 'float' and 'str'
I want to label encode the "location" column. If I use data.iloc[:,1] = enc.fit_transform(data.iloc[:,1]), I can label encode the availability column just fine. So
how can I fix this?
What is the datatype of your column?
The error arises because the label encoder cannot order a mix of numbers (np.nan is a float) and strings.
To fix this you can:
- Replace any NaN with an empty string: data['col_name'].fillna('', inplace=True); or
- Convert the column to strings: data['col_name'] = data['col_name'].astype(str)
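Putting it together for this dataset; a minimal sketch, assuming the problematic column sits at iloc position 2 as in the question:
from sklearn.preprocessing import LabelEncoder

# Cast to str so NaN becomes the string "nan" and every value is orderable
data.iloc[:, 2] = data.iloc[:, 2].astype(str)
enc = LabelEncoder()
data.iloc[:, 2] = enc.fit_transform(data.iloc[:, 2])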

Scikit-learn: how to print labels for the confusion matrix?

So I'm using scikit-learn to classify some data. I have 13 different class values/categories to classify the data into. I have been able to use cross-validation and print the confusion matrix. However, it only shows the TP, FP, etc. without the class labels, so I don't know which class is which. Below is my code and my output:
from imblearn.over_sampling import RandomOverSampler
from imblearn.pipeline import make_pipeline  # imblearn's pipeline supports samplers
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import cross_val_predict

def classify_data(df, feature_cols, file):
    nbr_folds = 5
    RANDOM_STATE = 0
    attributes = df.loc[:, feature_cols]  # Also known as X
    class_label = df['task']  # Class label, also known as y
    file.write("\nFeatures used: ")
    for feature in feature_cols:
        file.write(feature + ",")
    print("Features used", feature_cols)
    sampler = RandomOverSampler(random_state=RANDOM_STATE)
    print("RandomForest")
    file.write("\nRandomForest")
    rfc = RandomForestClassifier(max_depth=2, random_state=RANDOM_STATE)
    pipeline = make_pipeline(sampler, rfc)
    class_label_predicted = cross_val_predict(pipeline, attributes, class_label, cv=nbr_folds)
    conf_mat = confusion_matrix(class_label, class_label_predicted)
    print(conf_mat)
    accuracy = accuracy_score(class_label, class_label_predicted)
    print("Rows classified: " + str(len(class_label_predicted)))
    print("Accuracy: {0:.3f}%\n".format(accuracy * 100))
    file.write("\nClassifier settings:" + str(pipeline) + "\n")
    file.write("\nRows classified: " + str(len(class_label_predicted)))
    file.write("\nAccuracy: {0:.3f}%\n".format(accuracy * 100))
    file.writelines('\t'.join(str(j) for j in i) + '\n' for i in conf_mat)
#Output
Rows classified: 23504
Accuracy: 17.925%
0 372 46 88 5 73 0 536 44 317 0 200 127
0 501 29 85 0 136 0 655 9 154 0 172 67
0 97 141 78 1 56 0 336 37 429 0 435 198
0 135 74 416 5 37 0 507 19 323 0 128 164
0 247 72 145 12 64 0 424 21 296 0 304 223
0 190 41 36 0 178 0 984 29 196 0 111 43
0 218 13 71 7 52 0 917 139 177 0 111 103
0 215 30 84 3 71 0 1175 11 55 0 102 62
0 257 55 156 1 13 0 322 184 463 0 197 160
0 188 36 104 2 34 0 313 99 827 0 69 136
0 281 80 111 22 16 0 494 19 261 0 313 211
0 207 66 87 18 58 0 489 23 157 0 464 239
0 113 114 44 6 51 0 389 30 408 0 338 315
As you can see, you can't really know which column is which, and the print is also "misaligned", so it's difficult to understand.
Is there a way to print the labels as well?
From the docs, it seems there is no option to print the row and column labels of the confusion matrix. However, you can specify the label order using the labels=... argument.
Example:
from sklearn.metrics import confusion_matrix
y_true = ['yes','yes','yes','no','no','no']
y_pred = ['yes','no','no','no','no','no']
print(confusion_matrix(y_true, y_pred))
# Output:
# [[3 0]
# [2 1]]
print(confusion_matrix(y_true, y_pred, labels=['yes', 'no']))
# Output:
# [[1 2]
# [0 3]]
If you want to print the confusion matrix with labels, you may try pandas and set the index and columns of the DataFrame.
import pandas as pd
cmtx = pd.DataFrame(
confusion_matrix(y_true, y_pred, labels=['yes', 'no']),
index=['true:yes', 'true:no'],
columns=['pred:yes', 'pred:no']
)
print(cmtx)
# Output:
# pred:yes pred:no
# true:yes 1 2
# true:no 0 3
Or
import numpy as np

unique_label = np.unique([y_true, y_pred])
cmtx = pd.DataFrame(
    confusion_matrix(y_true, y_pred, labels=unique_label),
    index=['true:{:}'.format(x) for x in unique_label],
    columns=['pred:{:}'.format(x) for x in unique_label]
)
print(cmtx)
# Output:
# pred:no pred:yes
# true:no 3 0
# true:yes 2 1
It is important to ensure that the way you label your confusion matrix rows and columns corresponds exactly to the way sklearn has coded the classes. The true order of the labels can be revealed using the .classes_ attribute of the classifier. You can use the code below to prepare a confusion matrix data frame.
labels = rfc.classes_
conf_df = pd.DataFrame(confusion_matrix(class_label, class_label_predicted),
                       columns=labels, index=labels)
conf_df.index.name = 'True labels'
The second thing to note is that your classifier is not predicting labels well. The number of correctly predicted labels is shown on the main diagonal of the confusion matrix. You have non-zero values across the matrix, and some classes have not been predicted at all - the columns that are all zero. It might be a good idea to run the classifier with its default parameters first and then try to optimise them.
Another, arguably better, way of doing this is to use the crosstab function in pandas:
pd.crosstab(y_true, y_pred, rownames=['True'], colnames=['Predicted'], margins=True)
or
pd.crosstab(le.inverse_transform(y_true),
            le.inverse_transform(y_pred),
            rownames=['True'],
            colnames=['Predicted'],
            margins=True)
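For a self-contained illustration with the same toy labels as in the first answer:
import pandas as pd

y_true = pd.Series(['yes', 'yes', 'yes', 'no', 'no', 'no'])
y_pred = pd.Series(['yes', 'no', 'no', 'no', 'no', 'no'])
print(pd.crosstab(y_true, y_pred, rownames=['True'], colnames=['Predicted'], margins=True))
# Predicted  no  yes  All
# True
# no          3    0    3
# yes         2    1    3
# All         5    1    6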
Since the confusion matrix is just a numpy array, it does not carry any column information. What you can do is convert your matrix into a dataframe and then print that dataframe.
import pandas as pd
import numpy as np
def cm2df(cm, labels):
    df = pd.DataFrame()
    # rows
    for i, row_label in enumerate(labels):
        rowdata = {}
        # columns
        for j, col_label in enumerate(labels):
            rowdata[col_label] = cm[i, j]
        # note: DataFrame.append was removed in pandas 2.0; use pd.concat there
        df = df.append(pd.DataFrame.from_dict({row_label: rowdata}, orient='index'))
    return df[labels]
cm = np.arange(9).reshape((3, 3))
df = cm2df(cm, ["a", "b", "c"])
print(df)
Code snippet is from https://gist.github.com/nickynicolson/202fe765c99af49acb20ea9f77b6255e
Output:
a b c
a 0 1 2
b 3 4 5
c 6 7 8
It appears your data has 13 different classes, which is why your confusion matrix has 13 rows and columns. Furthermore, your classes aren't labeled in any way; they are just integers from what I can see.
If this isn't the case and your training data has actual labels, you can pass the list of unique labels to confusion_matrix:
conf_mat = confusion_matrix(class_label, class_label_predicted, labels=df['task'].unique())

Error raised while running model_selection.cross_val_score

I have this test part and it works well:
data = pd.read_csv('/home/noodle/B_train.csv')
print(data.head())
features = data.iloc[:, :-1].as_matrix()
targets = data.iloc[:, -1:].as_matrix()
targets = targets.reshape(-1)
print(targets.shape, utils.multiclass.type_of_target(targets))
clf = tree.DecisionTreeClassifier(max_depth=5)
scores = model_selection.cross_val_score(clf, features, targets)
print(scores)
The targets' shape is (115,) and type_of_target is binary...
here's the head of data:
no x y z m k l t
0 17 1 4 1 1 1020 1 1
1 17 1 10 2 1 1037 2 1
2 18 1 5 1 1 1512 3 1
3 18 1 2 0 1 1440 1 1
4 15 1 4 1 1 465 1 1
Here comes the problem:
while I am running another piece of code, it raises errors:
File "/home/noodle/PycharmProjects/qh/dc_tree.py", line 61, in find_common
scores = model_selection.cross_val_score(clf, features, labels, cv=5)
File "/usr/local/python34/lib/python3.4/site-packages/sklearn/model_selection/_validation.py", line 130, in cross_val_score
cv = check_cv(cv, y, classifier=is_classifier(estimator))
File "/usr/local/python34/lib/python3.4/site-packages/sklearn/model_selection/_split.py", line 1584, in check_cv
(type_of_target(y) in ('binary', 'multiclass'))):
File "/usr/local/python34/lib/python3.4/site-packages/sklearn/utils/multiclass.py", line 237, in type_of_target
if is_multilabel(y):
File "/usr/local/python34/lib/python3.4/site-packages/sklearn/utils/multiclass.py", line 153, in is_multilabel
labels = np.unique(y)
File "/usr/local/python34/lib/python3.4/site-packages/numpy/lib/arraysetops.py", line 214, in unique
ar.sort()
TypeError: unorderable types: str() > float()
Here are the code and the data head:
data = data.as_matrix()
labels = data[:, 0]
features = data[:, 1:]
print(labels.shape, utils.multiclass.type_of_target(labels))
clf = RandomForestClassifier(n_estimators=i, max_depth=None,
min_samples_split=2, random_state=0)
scores = model_selection.cross_val_score(clf, features, labels, cv=5)
working data head:
flag UserInfo_1 UserInfo_2 UserInfo_3 UserInfo_4 ProductInfo_1
0 0 missing 5226.590000 0.000000 0.0 0.0
1 0 missing 0.000000 0.000000 0.0 0.0
2 0 missing 5272.206555 2412.077228 missing missing
3 0 missing 5272.206555 2412.077228 missing missing
4 0 missing 5272.206555 2412.077228 missing missing
The labels' shape is (4000,) and type_of_target is binary. There seem to be no differences between labels and targets (in the test part), except the shape in the first dimension, so I think it may be caused by the strings in the features... I don't want to get_dummies my working data at first. So I tried changing the test data to:
no x y z m k l t
0 17 g 4 1 1 1020 1 1
1 17 g 10 2 1 1037 2 1
2 18 g 5 1 1 1512 3 1
3 18 g 2 0 1 1440 1 1
4 15 g 4 1 1 465 1 1
and ran it to figure out what's wrong, but it raises a different error:
Traceback (most recent call last):
File "/home/noodle/PycharmProjects/bigtest/tensortest.py", line 71, in <module>
scores = model_selection.cross_val_score(clf, features, targets1)
File "/usr/local/python34/lib/python3.4/site-packages/sklearn/model_selection/_validation.py", line 140, in cross_val_score
for train, test in cv_iter)
File "/usr/local/python34/lib/python3.4/site-packages/sklearn/externals/joblib/parallel.py", line 758, in __call__
while self.dispatch_one_batch(iterator):
File "/usr/local/python34/lib/python3.4/site-packages/sklearn/externals/joblib/parallel.py", line 608, in dispatch_one_batch
self._dispatch(tasks)
File "/usr/local/python34/lib/python3.4/site-packages/sklearn/externals/joblib/parallel.py", line 571, in _dispatch
job = self._backend.apply_async(batch, callback=cb)
File "/usr/local/python34/lib/python3.4/site-packages/sklearn/externals/joblib/_parallel_backends.py", line 109, in apply_async
result = ImmediateResult(func)
File "/usr/local/python34/lib/python3.4/site-packages/sklearn/externals/joblib/_parallel_backends.py", line 326, in __init__
self.results = batch()
File "/usr/local/python34/lib/python3.4/site-packages/sklearn/externals/joblib/parallel.py", line 131, in __call__
return [func(*args, **kwargs) for func, args, kwargs in self.items]
File "/usr/local/python34/lib/python3.4/site-packages/sklearn/externals/joblib/parallel.py", line 131, in <listcomp>
return [func(*args, **kwargs) for func, args, kwargs in self.items]
File "/usr/local/python34/lib/python3.4/site-packages/sklearn/model_selection/_validation.py", line 238, in _fit_and_score
estimator.fit(X_train, y_train, **fit_params)
File "/usr/local/python34/lib/python3.4/site-packages/sklearn/tree/tree.py", line 739, in fit
X_idx_sorted=X_idx_sorted)
File "/usr/local/python34/lib/python3.4/site-packages/sklearn/tree/tree.py", line 122, in fit
X = check_array(X, dtype=DTYPE, accept_sparse="csc")
File "/usr/local/python34/lib/python3.4/site-packages/sklearn/utils/validation.py", line 382, in check_array
array = np.array(array, dtype=dtype, order=order, copy=copy)
ValueError: could not convert string to float: 'n'
So the error that the working part raises is not caused by the strings in the working data... right? How can I fix it?
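A quick diagnostic worth running (a sketch, not from the thread): the traceback shows np.unique sorting the labels themselves, so it is worth checking whether that column mixes types somewhere below the visible head:
import pandas as pd

# If more than one type shows up here (e.g. str and float), the labels
# column itself is the culprit, not the features
print(pd.Series(labels).map(type).value_counts())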
