StandardScaler in Python - python

I want to standardize 'x_train'.
The first 'x_train' in the picture is the original data set, and the next 'x_train' below the previous one is standardized.
I just want to standardize the first six columns, so I wrote x_train[:,0:6] during standardization.
However, the result of standardization is obviously unreasonable. Moreover, when I use the mean and standard deviation of 'x_train' to standardize x_test, the result went right. It's weird. I have no idea what's wrong with my code.
Below is my code for standardizing.

Try -
scaler = preprocessing.StandardScaler().fit(x_train.iloc[:, 0:6])
#returning the scaled values to a new variable
X_train_first_six = scaler.transform(x_train.iloc[:, 0:6])
X_test_first_six = scaler.transform(x_test.iloc[:, 0:6])
ref. pandas iloc

Related

How indexing of sliced data frame works in pandas

How to get the correct row from a datfarme which is sliced?
To show what I mean, look at this code sample:
import lightgbm as lgb
from sklearn.model_selection import train_test_split
import numpy as np
data=pd.DataFrame()
data['one']=range(0,1000)
data['p1']=data['one']+1
data['p2']=data['one']+2
label=data['p1']%2==0
X_train, X_test, y_train, y_test = train_test_split(data, label, test_size=0.2, random_state=100)
lgb_model = lgb.LGBMClassifier(objective = 'binary')
lgb_fitted = lgb_model.fit(X_train, y_train, verbose = False)
y_prob=lgb_fitted.predict_proba(X_test)
y_prob= pd.DataFrame(y_prob,columns = ['No','Yes'])
model_uncertain=y_prob.loc[(y_prob['Yes'] >= .5) & (y_prob['Yes'] <= .52)]
model_uncertain
My question:
How can I get the row in the X_test dataframe which is related to the first raw in model_uncertain data frame?
To make sure that I am getting the right row, I test it using passing the same row to
predict_proba using the following code as I should get the same result:
y_prob_3=lgb_fitted.predict_proba([X_test.iloc[3]])
y_prob_3
But the result is not the same.
I think I am not sending the correct row to predict_proba, as it should return the same value for a row.
What is the correct way to find the n row in model_uncertain and find the corresponding row in X_test data frame?
How can I get the row in the X_test dataframe which is related to the first raw in model_uncertain data frame?
You're on the right track:
>>> idx_of_first_uncertainty_row = model_uncertain.iloc[0].index
>>> row_in_test_data = X.loc[idx_of_first_uncertainty_row]
Yes, indexes are preserved between the original dataframe and its slices (unless you reset the index somewhere in between).
To make sure that I am getting the right row, I test it using passing the same row to predict_proba using the following code as I should get the same result (...) But the result is not the same.
Why do you think they're not the same? In the dataframe image you can't see all of the decimals. A better way to confirm if they're the same (well, really really similar) would be to use something like np.isclose to compare model_uncertain.iloc[0] (first row of dataframe) and X_train.loc[3] (row where index is 3):
>>> np.isclose(model_uncertain.iloc[0].values, X_train.loc[3].values)

sklearn cross validation : The least populated class in y has only 1 members, which is less than n_splits=10

i'm working in a machine learning project and i'm stuck with this warning when i try to use cross validation to know how many neighbours do i need to achieve the best accuracy in knn; here's the warning:
The least populated class in y has only 1 members, which is less than n_splits=10.
The dataset i'm using is https://archive.ics.uci.edu/ml/datasets/Student+Performance
In this dataset we have several attributes, but we'll be using only "G1", "G2", "G3", "studytime","freetime","health","famrel". all the instances in those columns are integers.
https://i.stack.imgur.com/sirSl.png <-dataset example
Next,here's my first chunk of code where i assign the train and test groups:
import pandas as pd
import numpy as np
from google.colab import drive
drive.mount('/gdrive')
import sklearn
data=pd.read_excel("/gdrive/MyDrive/Colab Notebooks/student-por.xls")
#print(data.head())
data = data[["G1", "G2", "G3", "studytime","freetime","health","famrel"]]
print(data)
predict = "G3"
x = np.array(data.drop([predict], axis=1))
y = np.array(data[predict])
print(y)
x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(x, y, test_size=0.3, random_state=42)
print(len(y))
print(len(x))
That's how i assign x and y. with len, i can see that x and y have 649 rows both, representing 649 students.
Here's the second chunk of code when i do the cross_val:
#CROSSVALIDATION
from sklearn.neighbors import KNeighborsClassifier
neighbors = list (range(2,30))
cv_scores=[]
#print(y_train)
from sklearn.model_selection import cross_val_score
import matplotlib.pyplot as plt
for k in neighbors:
knn = KNeighborsClassifier(n_neighbors=k)
scores = cross_val_score(knn,x_train,y_train,cv=11,scoring='accuracy')
cv_scores.append(scores.mean())
plt.plot(cv_scores)
plt.show()```
the code is pretty self explanatory as you can tell
The warning:
The least populated class in y has only 1 members, which is less than n_splits=10.
happens in every iteration of the for-loop
Although this warning happens every time, plt.show() is still able to plot a graph regarding which amount of neighbours is best to achieve a good accuracy, i dont know if the plot, or the readings in cv_scores are accurate.
my question is :
How my "class in y" has only 1 members, len(y) clearly says y have 649 instances, more than enough to be splitted in 59 groups of 11 members each one?, By members is it referring to "instances" in my dataset, or colums/labels in the y group?
I'm not using stratify=y when i do the train/test splits, it's seems to be the 1# solution to this warning but its useless in my case.
I've tried everything i've seen on google/stack overflow and nothing helped me, the dataset seems to be the problem, but i canĀ“t understand whats wrong.
I think your main mistake is that your are using KNeighborsClassifier, and your feature to predict seems to be continuous (G3 - final grade (numeric: from 0 to 20, output target)) and not categorical.
In this case, every single value of the "y" is taken as a different possible class or label. The message you obtain is saying that in your dataset (on the "y"), there are values that only appears one time. For example, the values 3, appears only one time inside your dataset. This is not an error, but indicates that the model won't work correctly or accurate.
After all, I strongly reccomend you to use the sklearn.neighbors.KNeighborsRegressor.
This is the Knn used for "continuous" variables and not classes. Using this model, you shouldn't have this problem anymore. The output value will be the mean between the number of nearest neighbors you defined.
With this simple changes, your problem will be solved.

Getting all values of permuation importnace

I calculated the permutation Importance using eli5. But I only get a subset of the values.
import eli5
eli5.show_weights(perm, feature_names = X.columns.tolist())
At the end of the original plot ..10 more is shown. How can I get all the values?
The show_weights method has a top argument, which when set to None there is no limit in the shown features (see the documentation), so that should fix your problem:
eli5.show_weights(perm, feature_names = X.columns.tolist(), top=None)

Unable to Apply Scikit-Learn Imputer To A Dataset With Two Features

I am trying to read in the complete Titanic dataset, which can be found here:
biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic3.xls
import pandas as pd
# Importing the dataset
dataset = pd.read_excel('titanic3.xls')
y = dataset.iloc[:, 1].values
x = dataset.iloc[:, 2:14].values
# Create Dataset for Men
men_on_board = dataset[dataset['sex'] == 'male']
male_fatalities = men_on_board[men_on_board['survived'] ==0]
X_male = male_fatalities.iloc[:, [4,8]].values
# Taking care of missing data
from sklearn.preprocessing import Imputer
imputer = Imputer(missing_values = 'NaN', strategy = 'mean', axis = 0)
imputer = imputer.fit(X_male[:,0])
X_male[:,0] = imputer.transform(X_male[:,0])
When I run all but the last line, I get the following warning:
/Users/<username>/anaconda/lib/python3.6/site-packages/sklearn/utils/validation.py:395: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
DeprecationWarning)
When I run the last line, it throws the following error:
File "<ipython-input-2-07afef05ee1c>", line 1, in <module>
X_male[:,0] = imputer.transform(X_male[:,0])
ValueError: could not broadcast input array from shape (523) into shape (682)
I've used the above code snippet for imputation on other projects, not sure why it's not working.
A quick solution is to change axis = 0 to axis = 1. This will make it work, though I'm not sure if that's what you want. So I want to give some explanation about what happened here as following:
The warning basically tells you sklearn estimator now requires 2D data arrays rather than 1D data arrays where interpreting data as samples (rows) vs as features (columns) matters. During this deprecation process, this requirement is enforce by np.atleast_2d which assume your data has a single sample (row). Meanwhile, you passed axis = 0 to the Imputer which "impute along columns" by strategy = 'mean'. However, you have only 1 row now. When it comes across a missing value, there is no mean to replace that missing value. Therefore the entire column (which contains just this missing value) is discarded. As you can see, this is equal to
X_male[:,0][~np.isnan(X_male[:,0])].reshape(1, -1)
That's why the assignment X_male[:,0] = imputer.transform(X_male[:,0]) failed: X_male[:,0] is shape(682) while imputer.transform(X_male[:,0]) is shape(523). My previous solution basically changes it to "impute along rows" where you do have mean to replace missing values. You won't drop anything this time and your imputer.transform(X_male[:,0]) is shape(682) which can be assigned to X_male[:,0].
Now I don't know why your code snippet for imputation works on other projects. For your specific case here, a (logically) better way in regarding to the deprecation warning could be using X.reshape(-1, 1) since your data has a single feature and 682 samples. However, you need to reshape the transformed data back before being able to be assigned to X_male[:,0]:
imputer = imputer.fit(X_male[:,0].reshape(-1, 1))
X_male[:,0] = imputer.transform(X_male[:,0].reshape(-1, 1)).reshape(-1)

using Imputer in scikit-learn

I need to fill the missing temperature value with the mean value of that month using Imputer() in scikit-learn.
First I split the dataframe into groups based on the month. Then I called the imputer function to calculate the mean for that group and fill in the missing values.
Here is the code I wrote but it didn't work:
def impute_missing (data_1_group):
imp = Imputer(missing_values='NaN', strategy='mean', axis=0)
imp.fit(data_1_group)
data_1_group=imp.transform(data_1_group['datetime'])
return(data_1_group)
for data_1_group in data_1.groupby(pd.TimeGrouper("M")):
impute_missing(data_1_group)
Any suggestion?
try this small change
imp=imp.fit(data_1_group['datetime'])
data_1_group=imp.transform(data_1_group['datetime'])
Though I m new to scikit myself, I am recommending the solution that worked for me. This is because
1) imp object needs to override to fit, as in the first line
2) it needs to fit and impute the same dataset, which in this case seems to be data_1_group['datetime']
I hope this helps

Categories