I need to convert float values to int without losing any information. The values (from a dataframe column used as y when fitting a model) are as follows:
-1.0
0.0
9.0
-0.5
1.5
1.5
...
If I cast them to int directly, -0.5 becomes 0 or -1, so I lose information.
I need integers because I have to pass these values to model.fit(X, y), where the column above is y. Is there any format that would let me pass these values to the fit function?
Code:
import numpy as np
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from sklearn.semi_supervised import LabelSpreading
le = LabelEncoder()
X = df[['Col1','Col2']].apply(le.fit_transform)
X_transformed = np.concatenate((X[['Col1']], X[['Col2']]), axis=1)
y = df['Label'].values
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X_transformed)
model_LS = LabelSpreading(kernel='knn',
                          gamma=70,
                          alpha=0.5,
                          max_iter=30,
                          tol=0.001,
                          n_jobs=-1,
                          )
LS = model_LS.fit(X_scaled, y)
Data:
Col1 Col2 Label
Cust1 Cust2 1.0
Cust1 Cust4 1.0
Cust4 Cust5 -1.5
Cust12 Cust6 9.0
The error that I am getting running the above code is:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-174-14429cc07d75> in <module>
2
----> 3 LS=model_LS.fit(X_scaled, y)
~/opt/anaconda3/lib/python3.8/site-packages/sklearn/semi_supervised/_label_propagation.py in fit(self, X, y)
228 X, y = self._validate_data(X, y)
229 self.X_ = X
--> 230 check_classification_targets(y)
231
232 # actual graph construction (implementations should override this)
~/opt/anaconda3/lib/python3.8/site-packages/sklearn/utils/multiclass.py in check_classification_targets(y)
181 if y_type not in ['binary', 'multiclass', 'multiclass-multioutput',
182 'multilabel-indicator', 'multilabel-sequences']:
--> 183 raise ValueError("Unknown label type: %r" % y_type)
184
185
ValueError: Unknown label type: 'continuous'
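You can multiply your values by a power of ten to remove the decimal part:
import pandas as pd
df = pd.DataFrame({'Label': [1.0, -1.3, 0.75, 9.0, 7.8236]})
# Largest number of digits after the decimal point in the column
decimals = df['Label'].astype(str).str.split('.').str[1].str.len().max()
# round() guards against floating-point error before the cast truncates
df['y'] = df['Label'].mul(float(f"1e{decimals}")).round().astype(int)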
print(df)
# Output:
Label y
0 1.0000 10000
1 -1.3000 -13000
2 0.7500 7500
3 9.0000 90000
4 7.8236 78236
I think you need LabelEncoder, which maps each distinct value to an integer class:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
df = pd.DataFrame(data={'y':[-1.0, 0.0 , 9.0, -0.5, 1.5 , 1.5]})
le = LabelEncoder()
le.fit(df['y'])
df['y'] = le.transform(df['y'])
print(df)
OUTPUT
y
0 0
1 2
2 4
3 1
4 3
5 3
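To check that no information is lost with this encoding, the codes can be mapped back to the original floats. The following is a minimal sketch, assuming the same six values as above; the variable names (labels, codes, restored) are only illustrative.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

labels = pd.Series([-1.0, 0.0, 9.0, -0.5, 1.5, 1.5])

le = LabelEncoder()
codes = le.fit_transform(labels)        # integer class index per row
restored = le.inverse_transform(codes)  # back to the original float values

print(le.classes_)  # sorted distinct values; a code is a position in this array
print(codes)
print(restored)
```

The integer codes are what you would pass as y, and le.inverse_transform turns predictions back into the original values.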
I'm trying to build a model that predicts the probability for an athlete to win a medal.
I have a dataframe that looks like this:
Here is what I've already done
#Cleaning df
#Replace NaN with mean or average
df['Height'].fillna(value=df['Height'].mean(), inplace=True)
df['Weight'].fillna(value=df['Weight'].mean(), inplace=True)
#Changing type to integer
df.Height = df.Height.astype(int)
df.Weight = df.Weight.astype(int)
#Target variable
y= df["Medal"]
#If Male =0, if female = 1
df['Sex'] = df['Sex'].apply(lambda x: 1 if str(x) != 'M' else 0)
#Predictive
feature_names = ["Age", "Sex", "Height", "Weight"]
X= df[feature_names]
#Regressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score
regressor = DecisionTreeRegressor(random_state=0)
cross_val_score(regressor, X, y, cv=10)
But when I run the code, I get this error:
warnings.warn("Estimator fit failed. The score on this train-test"
C:\Users\miss_\anaconda3\lib\site-packages\sklearn\model_selection\_validation.py:610: FitFailedWarning: Estimator fit failed. The score on this train-test partition for these parameters will be set to nan. Details:
Traceback (most recent call last):
File "C:\Users\miss_\anaconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 593, in _fit_and_score
estimator.fit(X_train, y_train, **fit_params)
File "C:\Users\miss_\anaconda3\lib\site-packages\sklearn\tree\_classes.py", line 1247, in fit
super().fit(
File "C:\Users\miss_\anaconda3\lib\site-packages\sklearn\tree\_classes.py", line 156, in fit
X, y = self._validate_data(X, y,
File "C:\Users\miss_\anaconda3\lib\site-packages\sklearn\base.py", line 430, in _validate_data
X = check_array(X, **check_X_params)
File "C:\Users\miss_\anaconda3\lib\site-packages\sklearn\utils\validation.py", line 63, in inner_f
return f(*args, **kwargs)
File "C:\Users\miss_\anaconda3\lib\site-packages\sklearn\utils\validation.py", line 663, in check_array
_assert_all_finite(array,
File "C:\Users\miss_\anaconda3\lib\site-packages\sklearn\utils\validation.py", line 103, in _assert_all_finite
raise ValueError(
ValueError: Input contains NaN, infinity or a value too large for dtype('float32').
And it returns an array like this: array[NaN, NaN, NaN...]
My X looks like this
Age Sex Height Weight
0 24.0 1 180 80
1 23.0 1 170 60
2 24.0 1 175 70
3 34.0 1 175 70
4 21.0 1 185 82
... ... ... ... ...
271111 29.0 1 179 89
271112 27.0 1 176 59
271113 27.0 1 176 59
271114 30.0 1 185 96
271115 34.0 1 185 96
And my y :
0 0
1 0
2 0
3 1
4 0
..
271111 0
271112 0
271113 0
271114 0
271115 0
Name: Medal, Length: 271116, dtype: int64
You fill missing values for "Height" and "Weight", but not for "Age". You should apply the same operation to the feature "Age".
First locate your missing values in this column:
>>> df.loc[df['Age'].isna(), ['ID', 'Name', 'Age']]
If you have only a few missing values, you can fill them with the mean:
>>> df['Age'] = df['Age'].fillna(df['Age'].mean())
But if you have many missing values, filling them with the global mean is probably not a good idea. "Age" can depend on "Country", "Sport", "Year" and even "Season" (winter or summer). In fact, the same goes for Height/Weight: the average height in volleyball is probably not the same as in archery...
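A group-wise fill along those lines could look like the sketch below. It is only an illustration: the toy frame, the choice of "Sport" as the grouping column, and the fallback to the global mean are assumptions, not something taken from the question's data.

```python
import numpy as np
import pandas as pd

# Toy data; the "Sport" values and ages are made up for illustration.
df = pd.DataFrame({
    "Sport": ["Volleyball", "Volleyball", "Volleyball",
              "Archery", "Archery", "Archery"],
    "Age": [24.0, np.nan, 26.0, 40.0, 44.0, np.nan],
})

# Mean age per sport, broadcast back to the original row order.
group_mean = df.groupby("Sport")["Age"].transform("mean")

# Fill within each group first; fall back to the global mean
# for any group whose ages are all missing.
df["Age"] = df["Age"].fillna(group_mean).fillna(df["Age"].mean())

print(df)
```

Grouping by several columns (for example ["Sport", "Sex"]) works the same way, as long as each group still has enough non-missing rows to give a meaningful mean.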
I have an xarray dataset of sea surface temperature values on an x/y grid. x and y are 1D vector coordinates, so it looks like this minimal example:
<xarray.Dataset>
Dimensions: (x: 10, y: 10)
Coordinates:
* x (x) int64 0 1 2 3 4 5 6 7 8 9
* y (y) int64 0 1 2 3 4 5 6 7 8 9
Data variables:
data (x, y) float64 0.559 0.01037 0.1562 ... 0.08778 0.3272 0.8661
I am able to compute the lat/lon from this x/y grid, and the output is two 2D arrays. I can add them as coordinates with ds.assign_coords:
<xarray.Dataset>
Dimensions: (x: 10, y: 10)
Coordinates:
* x (x) int64 0 1 2 3 4 5 6 7 8 9
* y (y) int64 0 1 2 3 4 5 6 7 8 9
lat (x, y) float64 30.0 30.0 30.0 30.0 30.0 ... 39.0 39.0 39.0 39.0
lon (x, y) float64 -120.0 -119.0 -118.0 -117.0 ... -113.0 -112.0 -111.0
Data variables:
data (x, y) float64 0.559 0.01037 0.1562 ... 0.08778 0.3272 0.8661
But I'd like to .sel along slices of the lat/lon. This currently isn't possible, as I get the error:
ds.sel(lat=slice(32,36), lon=slice(-118, -115))
ValueError Traceback (most recent call last)
<ipython-input-20-28c79202d5f3> in <module>
----> 1 ds.sel(lat=slice(32,36), lon=slice(-118, -115))
~/.local/lib/python3.8/site-packages/xarray/core/dataset.py in sel(self, indexers, method, tolerance, drop, **indexers_kwargs)
2363 """
2364 indexers = either_dict_or_kwargs(indexers, indexers_kwargs, "sel")
-> 2365 pos_indexers, new_indexes = remap_label_indexers(
2366 self, indexers=indexers, method=method, tolerance=tolerance
2367 )
~/.local/lib/python3.8/site-packages/xarray/core/coordinates.py in remap_label_indexers(obj, indexers, method, tolerance, **indexers_kwargs)
419 }
420
--> 421 pos_indexers, new_indexes = indexing.remap_label_indexers(
422 obj, v_indexers, method=method, tolerance=tolerance
423 )
~/.local/lib/python3.8/site-packages/xarray/core/indexing.py in remap_label_indexers(data_obj, indexers, method, tolerance)
256 new_indexes = {}
257
--> 258 dim_indexers = get_dim_indexers(data_obj, indexers)
259 for dim, label in dim_indexers.items():
260 try:
~/.local/lib/python3.8/site-packages/xarray/core/indexing.py in get_dim_indexers(data_obj, indexers)
222 ]
223 if invalid:
--> 224 raise ValueError(f"dimensions or multi-index levels {invalid!r} do not exist")
225
226 level_indexers = defaultdict(dict)
ValueError: dimensions or multi-index levels ['lat', 'lon'] do not exist
So my question is this: How can I change the dimensions of data to be (lon: (10,10), lat: (10,10)) instead of (x: 10, y: 10)? Is this even possible?
Code to reproduce the example dataset:
import numpy as np
import xarray as xr
# Create sample data
data = np.random.rand(10,10)
x = y = np.arange(10)
# Set up dataset
ds = xr.Dataset(
    data_vars = dict(
        data = (["x", "y"], data)
    ),
    coords= {
        "x" : x,
        "y" : y
    }
)
# Create example lat/lon and assign to dataset
lon, lat = np.meshgrid(np.linspace(-120, -111, 10), np.linspace(30, 39, 10))
ds = ds.assign_coords({
    "lat": (["x", "y"], lat),
    "lon": (["x", "y"], lon)
})
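One possible workaround, offered as a sketch rather than a confirmed answer: .sel only works on dimension (index) coordinates, and lat/lon here are 2D non-index coordinates, so a boolean mask with ds.where(..., drop=True) can stand in for the slice. The bounds below are the ones from the failing .sel call; the names mask and subset are only illustrative.

```python
import numpy as np
import xarray as xr

# Rebuild the example dataset from the question.
data = np.random.rand(10, 10)
x = y = np.arange(10)
ds = xr.Dataset(
    data_vars=dict(data=(["x", "y"], data)),
    coords={"x": x, "y": y},
)
lon, lat = np.meshgrid(np.linspace(-120, -111, 10), np.linspace(30, 39, 10))
ds = ds.assign_coords({"lat": (["x", "y"], lat), "lon": (["x", "y"], lon)})

# Boolean mask on the 2D coordinates (both bounds inclusive, like a label slice).
mask = (
    (ds["lat"] >= 32) & (ds["lat"] <= 36)
    & (ds["lon"] >= -118) & (ds["lon"] <= -115)
)

# drop=True trims the x/y positions that lie entirely outside the mask.
subset = ds.where(mask, drop=True)
print(subset)
```

On a regular grid like this one the result is a clean rectangle. On a curvilinear grid the bounding box that is kept may still contain NaN cells where the mask is False.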
I have a problem with scikit-learn.
I'm working on a really simple linear regression problem. Given input values of hours studied and the resulting grade, I want to be able to estimate a student's grade based on how long they study.
In [1]: import pandas as pd
In [2]: path = 'Desktop/hoursgrades.csv'
In [3]: df = pd.read_csv(path)
In [4]: X = df['Hours Studied']
In [5]: y = df['Grade']
In [6]: training_data_in = list()
In [7]: training_data_out = list()
In [8]: training_data_in.append(X)
In [9]: training_data_out.append(y)
In [11]: from sklearn.linear_model import LinearRegression
In [12]: model = LinearRegression(n_jobs =-1)
In [13]: model.fit(X = training_data_in, y = training_data_out)
Out[13]: LinearRegression(copy_X=True, fit_intercept=True, n_jobs=-1, normalize=False)
In this example, the DF looks like this:
In [16]: df
Out[16]:
Hours Studied Grade
0 1 10.0
1 2 20.0
2 3 30.0
3 4 40.0
4 5 50.0
5 6 60.0
6 7 70.0
7 8 80.0
8 9 90.0
9 10 100.0
And X looks like this:
In [17]: X
Out[17]:
0 1
1 2
2 3
3 4
4 5
5 6
6 7
7 8
8 9
9 10
Name: Hours Studied, dtype: int64
And y looks like this:
In [18]: y
Out[18]:
0 10.0
1 20.0
2 30.0
3 40.0
4 50.0
5 60.0
6 70.0
7 80.0
8 90.0
9 100.0
Name: Grade, dtype: float64
So far so good; it seems to have accepted everything I've put in. Now I want to test the model with some input data: I want to say that a student studied for 5 hours and have the model tell me the expected grade.
But when I put that into the model, I get the below error.
Can anyone advise?
In [14]: studied_hour = [[5]]
In [15]: outcome = model.predict(X = studied_hour)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-15-6fdab4ae2efd> in <module>()
----> 1 outcome = model.predict(X = studied_hour)
~/anaconda3/lib/python3.7/site-packages/sklearn/linear_model/base.py in predict(self, X)
254 Returns predicted values.
255 """
--> 256 return self._decision_function(X)
257
258 _preprocess_data = staticmethod(_preprocess_data)
~/anaconda3/lib/python3.7/site-packages/sklearn/linear_model/base.py in _decision_function(self, X)
239 X = check_array(X, accept_sparse=['csr', 'csc', 'coo'])
240 return safe_sparse_dot(X, self.coef_.T,
--> 241 dense_output=True) + self.intercept_
242
243 def predict(self, X):
~/anaconda3/lib/python3.7/site-packages/sklearn/utils/extmath.py in safe_sparse_dot(a, b, dense_output)
138 return ret
139 else:
--> 140 return np.dot(a, b)
141
142
ValueError: shapes (1,1) and (10,10) not aligned: 1 (dim 1) != 10 (dim 0)
I should add:
In [39]: X.shape
Out[39]: (10,)
In [40]: y.shape
Out[40]: (10,)
The input shapes of both X and y are not correct: X has to be (n_samples, n_features) and y has to be (n_samples,), as per the docs.
Appending each Series to a list produces inputs of shape (1, 10), so the model thinks you have a single sample with ten features and ten different outputs (hence the (10, 10) in the error).
You get the correct result by using
X = df[['Hours Studied']] # note the double brackets, shape (10, 1)
y = df['Grade']
model = LinearRegression().fit(X, y)
model.predict([[5]])
array([50.])
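To make the shape problem visible, here is a small sketch that fits the model both ways and prints the shapes involved. It rebuilds the hours/grades data with NumPy instead of reading the CSV; the variable names are only illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

hours = np.arange(1, 11)   # 1..10, as in the question's dataframe
grades = hours * 10.0

# What the question did: a list holding one length-10 sequence -> shape (1, 10),
# i.e. one sample with ten features (and ten targets for y).
bad_X = np.array([hours])
bad_y = np.array([grades])
bad_model = LinearRegression().fit(bad_X, bad_y)
print(bad_X.shape, bad_model.coef_.shape)

# The fix: one row per sample, one column per feature.
good_X = hours.reshape(-1, 1)
good_model = LinearRegression().fit(good_X, grades)
print(good_X.shape, good_model.coef_.shape)
print(good_model.predict([[5]]))
```

The coefficient shape of the first model matches the (10, 10) in the error message, which is why a (1, 1) input cannot be multiplied with it.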
I'm trying to label encode the second column, but I'm getting an error. What am I doing wrong?
I'm able to encode the first column
data.head()
             area_type   availability                  location       size  society  total_sqft  bath  balcony   price
0  Super built-up Area         19-Dec  Electronic City Phase II      2 BHK   Coomee        1056   2.0      1.0   39.07
1            Plot Area  Ready To Move          Chikka Tirupathi  4 Bedroom  Theanmp        2600   5.0      3.0  120.00
2        Built-up Area  Ready To Move               Uttarahalli      3 BHK      NaN        1440   2.0      3.0   62.00
3  Super built-up Area  Ready To Move        Lingadheeranahalli      3 BHK  Soiewre        1521   3.0      1.0   95.00
4  Super built-up Area  Ready To Move                  Kothanur      2 BHK      NaN        1200   2.0      1.0   51.00
enc = LabelEncoder()
data.iloc[:,2] = enc.fit_transform(data.iloc[:,2])
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-20-53fda4a71b5e> in <module>()
1 enc = LabelEncoder()
----> 2 data.iloc[:,2] = enc.fit_transform(data.iloc[:,2])
~/anaconda3/lib/python3.6/site-packages/sklearn/preprocessing/label.py in fit_transform(self, y)
110 """
111 y = column_or_1d(y, warn=True)
--> 112 self.classes_, y = np.unique(y, return_inverse=True)
113 return y
114
~/anaconda3/lib/python3.6/site-packages/numpy/lib/arraysetops.py in unique(ar, return_index, return_inverse, return_counts, axis)
208 ar = np.asanyarray(ar)
209 if axis is None:
--> 210 return _unique1d(ar, return_index, return_inverse, return_counts)
211 if not (-ar.ndim <= axis < ar.ndim):
212 raise ValueError('Invalid axis kwarg specified for unique')
~/anaconda3/lib/python3.6/site-packages/numpy/lib/arraysetops.py in _unique1d(ar, return_index, return_inverse, return_counts)
272
273 if optional_indices:
--> 274 perm = ar.argsort(kind='mergesort' if return_index else 'quicksort')
275 aux = ar[perm]
276 else:
TypeError: '<' not supported between instances of 'float' and 'str'
I want to label encode the "location" column. If I use data.iloc[:,1] = enc.fit_transform(data.iloc[:,1]) I can label encode the availability column, so the encoder itself works.
How can I fix this?
What is the datatype of your column?
The error arises because the label encoder has to sort the values, and it cannot compare numbers with strings. Missing values are stored as np.nan, which is a float, so a string column that contains NaN mixes both types.
To fix this you can either:
- Replace any NaN with an empty string: data['col_name'] = data['col_name'].fillna('');
- Convert the whole column to strings: data['col_name'] = data['col_name'].astype(str)
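The first option can be sketched as follows. The toy column is made up for illustration and only mimics a string column with one missing value; it is not the question's data.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy "location" column with a missing value (values are illustrative).
location = pd.Series(["Kothanur", np.nan, "Uttarahalli", "Kothanur"])

# Replace NaN with an empty string, which then becomes a class of its own.
filled = location.fillna("")

enc = LabelEncoder()
codes = enc.fit_transform(filled)

print(enc.classes_)  # sorted distinct strings, including ''
print(codes)
```

Because the empty string sorts before every other value, missing entries end up as class 0 and stay easy to identify after encoding.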
So I'm using scikit-learn to classify some data. I have 13 different class values/categories to classify the data into. I have been able to use cross-validation and print the confusion matrix. However, it only shows the counts (TP, FP, etc.) without the class labels, so I don't know which class is which. Below is my code and my output:
# Imports are assumed from context; the question did not show them.
from imblearn.over_sampling import RandomOverSampler
from imblearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import cross_val_predict
def classify_data(df, feature_cols, file):
    nbr_folds = 5
    RANDOM_STATE = 0
    attributes = df.loc[:, feature_cols]  # Also known as x
    class_label = df['task']  # Class label, also known as y.
    file.write("\nFeatures used: ")
    for feature in feature_cols:
        file.write(feature + ",")
    print("Features used", feature_cols)
    sampler = RandomOverSampler(random_state=RANDOM_STATE)
    print("RandomForest")
    file.write("\nRandomForest")
    rfc = RandomForestClassifier(max_depth=2, random_state=RANDOM_STATE)
    pipeline = make_pipeline(sampler, rfc)
    class_label_predicted = cross_val_predict(pipeline, attributes, class_label, cv=nbr_folds)
    conf_mat = confusion_matrix(class_label, class_label_predicted)
    print(conf_mat)
    accuracy = accuracy_score(class_label, class_label_predicted)
    print("Rows classified: " + str(len(class_label_predicted)))
    print("Accuracy: {0:.3f}%\n".format(accuracy * 100))
    file.write("\nClassifier settings:" + str(pipeline) + "\n")
    file.write("\nRows classified: " + str(len(class_label_predicted)))
    file.write("\nAccuracy: {0:.3f}%\n".format(accuracy * 100))
    file.writelines('\t'.join(str(j) for j in i) + '\n' for i in conf_mat)
#Output
Rows classified: 23504
Accuracy: 17.925%
0 372 46 88 5 73 0 536 44 317 0 200 127
0 501 29 85 0 136 0 655 9 154 0 172 67
0 97 141 78 1 56 0 336 37 429 0 435 198
0 135 74 416 5 37 0 507 19 323 0 128 164
0 247 72 145 12 64 0 424 21 296 0 304 223
0 190 41 36 0 178 0 984 29 196 0 111 43
0 218 13 71 7 52 0 917 139 177 0 111 103
0 215 30 84 3 71 0 1175 11 55 0 102 62
0 257 55 156 1 13 0 322 184 463 0 197 160
0 188 36 104 2 34 0 313 99 827 0 69 136
0 281 80 111 22 16 0 494 19 261 0 313 211
0 207 66 87 18 58 0 489 23 157 0 464 239
0 113 114 44 6 51 0 389 30 408 0 338 315
As you can see, you can't really know what column is what and the print is also "misaligned" so it's difficult to understand.
Is there a way to print the labels as well?
From the docs, it seems that there is no option to print the row and column labels of the confusion matrix. However, you can specify the label order using the argument labels=...
Example:
from sklearn.metrics import confusion_matrix
y_true = ['yes','yes','yes','no','no','no']
y_pred = ['yes','no','no','no','no','no']
print(confusion_matrix(y_true, y_pred))
# Output:
# [[3 0]
# [2 1]]
print(confusion_matrix(y_true, y_pred, labels=['yes', 'no']))
# Output:
# [[1 2]
# [0 3]]
If you want to print the confusion matrix with labels, you may try pandas and set the index and columns of the DataFrame.
import pandas as pd
cmtx = pd.DataFrame(
    confusion_matrix(y_true, y_pred, labels=['yes', 'no']),
    index=['true:yes', 'true:no'],
    columns=['pred:yes', 'pred:no']
)
print(cmtx)
# Output:
# pred:yes pred:no
# true:yes 1 2
# true:no 0 3
Or
import numpy as np
unique_label = np.unique([y_true, y_pred])
cmtx = pd.DataFrame(
    confusion_matrix(y_true, y_pred, labels=unique_label),
    index=['true:{:}'.format(x) for x in unique_label],
    columns=['pred:{:}'.format(x) for x in unique_label]
)
print(cmtx)
# Output:
# pred:no pred:yes
# true:no 3 0
# true:yes 2 1
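It is important to ensure that the way you label your confusion matrix rows and columns corresponds exactly to the way sklearn has ordered the classes. The true order of the labels can be read from the .classes_ attribute of a fitted classifier. Note that cross_val_predict fits clones of the estimator, so rfc itself has to be fitted before .classes_ exists; alternatively, use the sorted unique labels, which is the order confusion_matrix uses by default. You can use the code below to prepare a confusion matrix data frame.
labels = rfc.classes_  # rfc must have been fitted
conf_df = pd.DataFrame(confusion_matrix(class_label, class_label_predicted, labels=labels), columns=labels, index=labels)
conf_df.index.name = 'True labels'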
The second thing to note is that your classifier is not predicting labels well. The number of correctly predicted labels is shown on the main diagonal of the confusion matrix. You have non-zero values across the whole matrix, and some classes have not been predicted at all (the columns that are all zero). It might be a good idea to run the classifier with its default parameters and then try to optimise them.
Another, arguably better, way of doing this is to use the crosstab function in pandas.
pd.crosstab(y_true, y_pred, rownames=['True'], colnames=['Predicted'], margins=True)
or
pd.crosstab(le.inverse_transform(y_true),
            le.inverse_transform(y_pred),
            rownames=['True'],
            colnames=['Predicted'],
            margins=True)
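As a self-contained sketch of the crosstab approach, here it is applied to the small yes/no example used earlier in this thread. The labels are wrapped in pd.Series because a plain Python list is interpreted by crosstab as a list of separate arrays; the variable name table is only illustrative.

```python
import pandas as pd

y_true = pd.Series(['yes', 'yes', 'yes', 'no', 'no', 'no'])
y_pred = pd.Series(['yes', 'no', 'no', 'no', 'no', 'no'])

# Rows are the true labels, columns the predicted labels;
# margins=True adds an "All" row and column with the totals.
table = pd.crosstab(y_true, y_pred,
                    rownames=['True'],
                    colnames=['Predicted'],
                    margins=True)
print(table)
```

Unlike confusion_matrix, crosstab only creates a column for labels that actually occur in y_pred, so a class that is never predicted will be missing rather than shown as a column of zeros.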
Since the confusion matrix is just a NumPy array, it does not contain any row or column labels. What you can do is convert your matrix into a dataframe and then print this dataframe.
import pandas as pd
import numpy as np
def cm2df(cm, labels):
    # Rows are the true labels, columns the predicted labels.
    # The original gist built the frame row by row with DataFrame.append,
    # which was removed in pandas 2.0, so the frame is built in one step here.
    return pd.DataFrame(cm, index=labels, columns=labels)
cm = np.arange(9).reshape((3, 3))
df = cm2df(cm, ["a", "b", "c"])
print(df)
Code snippet is adapted from https://gist.github.com/nickynicolson/202fe765c99af49acb20ea9f77b6255e
Output:
a b c
a 0 1 2
b 3 4 5
c 6 7 8
It appears your data has 13 different classes, which is why your confusion matrix has 13 rows and columns. Furthermore, your classes aren't labeled in any way, just integers from what I can see.
If this isn't the case, and your training data has actual labels, you can pass a list of unique labels to confusion_matrix through the labels keyword argument:
conf_mat = confusion_matrix(class_label, class_label_predicted, labels=df['task'].unique())