How to get the correct row from a dataframe which is sliced?
To show what I mean, look at this code sample:
import lightgbm as lgb
from sklearn.model_selection import train_test_split
import numpy as np
import pandas as pd
data=pd.DataFrame()
data['one']=range(0,1000)
data['p1']=data['one']+1
data['p2']=data['one']+2
label=data['p1']%2==0
X_train, X_test, y_train, y_test = train_test_split(data, label, test_size=0.2, random_state=100)
lgb_model = lgb.LGBMClassifier(objective = 'binary')
lgb_fitted = lgb_model.fit(X_train, y_train, verbose = False)
y_prob=lgb_fitted.predict_proba(X_test)
y_prob= pd.DataFrame(y_prob,columns = ['No','Yes'])
model_uncertain=y_prob.loc[(y_prob['Yes'] >= .5) & (y_prob['Yes'] <= .52)]
model_uncertain
My question:
How can I get the row in the X_test dataframe that corresponds to the first row in the model_uncertain dataframe?
To make sure that I am getting the right row, I pass that row to predict_proba
using the following code, since I should get the same result:
y_prob_3=lgb_fitted.predict_proba([X_test.iloc[3]])
y_prob_3
But the result is not the same.
I think I am not sending the correct row to predict_proba, as it should return the same value for a row.
What is the correct way to take the nth row in model_uncertain and find the corresponding row in the X_test dataframe?
How can I get the row in the X_test dataframe which is related to the first row in model_uncertain data frame?
You're on the right track:
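>>> idx_of_first_uncertainty_row = model_uncertain.index[0]
>>> row_in_test_data = X_test.iloc[idx_of_first_uncertainty_row]
Yes, indexes are preserved between the original dataframe and its slices (unless you reset the index somewhere in between). Here y_prob was built from a NumPy array, so its index is simply the row position (0, 1, 2, ...); that is why the lookup into X_test uses iloc rather than loc.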
To make sure that I am getting the right row, I pass the same row to predict_proba using the following code, as I should get the same result (...) But the result is not the same.
Why do you think they're not the same? In the dataframe image you can't see all of the decimals. A better way to confirm that they're the same (well, really, really similar) would be to use something like np.isclose to compare model_uncertain.iloc[0] (the first row of that dataframe) with the probabilities predicted for the matching row of X_test:
>>> pos = model_uncertain.index[0]
>>> np.isclose(model_uncertain.iloc[0].values, lgb_fitted.predict_proba([X_test.iloc[pos]])[0])
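To illustrate the index behaviour without training a model, here is a small sketch in plain pandas. The frames and the probability values are made up for illustration; only the names X_test, y_prob and model_uncertain mirror the question. It assumes, as in the question, that the probability frame is built from a bare array and therefore gets a fresh 0..n-1 index.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for X_test: a shuffled frame whose labels are not 0..n-1.
X_test = pd.DataFrame({"one": [10, 20, 30, 40]}, index=[7, 2, 9, 4])

# Hypothetical stand-in for predict_proba output: one row per X_test row.
proba = np.array([[0.9, 0.1], [0.49, 0.51], [0.2, 0.8], [0.485, 0.515]])
y_prob = pd.DataFrame(proba, columns=["No", "Yes"])  # fresh RangeIndex 0..3

# Slicing keeps the labels of y_prob, which here are row positions.
model_uncertain = y_prob.loc[(y_prob["Yes"] >= 0.5) & (y_prob["Yes"] <= 0.52)]
positions = list(model_uncertain.index)

# These are positions into X_test, so use iloc (loc would look up labels instead).
matching_rows = X_test.iloc[positions]
original_labels = list(matching_rows.index)
```

The labels in original_labels are the ones the rows carried in the original data before the train/test split, which is usually what you want to report.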
Related
I want to standardize 'x_train'.
The first 'x_train' in the picture is the original data set, and the next 'x_train' below the previous one is standardized.
I just want to standardize the first six columns, so I wrote x_train[:,0:6] during standardization.
However, the result of standardization is obviously unreasonable. Moreover, when I use the mean and standard deviation of 'x_train' to standardize x_test, the result looks right. It's weird, and I have no idea what's wrong with my code.
Below is my code for standardizing.
Try -
from sklearn import preprocessing
# fit the scaler on the first six columns of the training set only
scaler = preprocessing.StandardScaler().fit(x_train.iloc[:, 0:6])
# returning the scaled values to a new variable
X_train_first_six = scaler.transform(x_train.iloc[:, 0:6])
X_test_first_six = scaler.transform(x_test.iloc[:, 0:6])
ref. pandas iloc
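If it helps to see what the scaler is doing, the following is a hand-rolled sketch of the same idea using only pandas and NumPy. The data and the column names f0..f7 are invented; the point is that the mean and standard deviation come from the training set only and are then applied to both sets, and that only the first six columns are touched. It uses the population standard deviation (ddof=0), which is what StandardScaler computes.

```python
import numpy as np
import pandas as pd

# Hypothetical data: eight columns, of which only the first six are scaled.
rng = np.random.default_rng(0)
cols = [f"f{i}" for i in range(8)]
x_train = pd.DataFrame(rng.normal(5.0, 2.0, size=(50, 8)), columns=cols)
x_test = pd.DataFrame(rng.normal(5.0, 2.0, size=(20, 8)), columns=cols)

first_six = cols[:6]

# Statistics come from the training set only (population std, ddof=0).
mean = x_train[first_six].mean()
std = x_train[first_six].std(ddof=0)

# Work on copies so the original frames stay untouched.
x_train_scaled = x_train.copy()
x_test_scaled = x_test.copy()
x_train_scaled[first_six] = (x_train[first_six] - mean) / std
x_test_scaled[first_six] = (x_test[first_six] - mean) / std
```

Writing into a copy avoids the situation where the training array has already been overwritten by the time its statistics are reused for the test set.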
I am running some regression models to predict performance.
After running the models I created a variable to see the predictions (y_pred_* are lists with 2567 values):
y_pred_LR = regressor.predict(X_test)
y_pred_SVR = regressor2.predict(X_test)
y_pred_RF = regressor3.predict(X_test)
The types of these prediction lists are Array of float64, while y_test is a DataFrame.
I wanted to create a table with the results. I tried several approaches (calling them as lists, converting them, selecting the values), but I have not succeeded so far. Could anyone help?
My last trial was like below:
comparison = pd.DataFrame({'Real': y_test, 'LR': y_pred_LR, 'RF': y_pred_RF, 'SVM': y_pred_SVR})
In this case the DataFrame is created but the values don't appear.
Additionally, I would like to create two new rows with the mean and standard deviation of the results, and these rows should be located at the beginning (the first rows) of the DataFrame.
Thanks
import pandas as pd
import numpy as np
real = np.array([2] * 10).reshape(-1,1)
y_pred_LR = np.array([0] * 10)
y_pred_SVR = np.array([1] * 10)
y_pred_RF = np.array([5] * 10)
real = real.flatten()
comparison = pd.DataFrame({'real':real,'y_pred_LR':y_pred_LR,'y_pred_SVR':y_pred_SVR,"y_pred_RF":y_pred_RF})
Mean = comparison.mean(axis=0)
StD = comparison.std(axis=0)
Mean_StD = pd.concat([Mean,StD],axis=1).T
result = pd.concat([Mean_StD,comparison],ignore_index=True)
print(result)
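A variation on the snippet above, for the case described in the question where y_test is a one-column DataFrame rather than an array. This is only a sketch with made-up values: the column name "target" is an assumption, and the choice to label the summary rows "mean" and "std" (rather than 0 and 1) is mine.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins: y_test as a one-column DataFrame with a shuffled
# index, predictions as plain float arrays.
y_test = pd.DataFrame({"target": [2.0, 4.0, 6.0, 8.0]}, index=[11, 3, 7, 5])
y_pred_LR = np.array([1.0, 3.0, 5.0, 7.0])
y_pred_RF = np.array([2.0, 4.0, 6.0, 9.0])

comparison = pd.DataFrame({
    "Real": y_test.to_numpy().ravel(),  # 2-D (n, 1) -> 1-D (n,)
    "LR": y_pred_LR,
    "RF": y_pred_RF,
})

# Summary rows, labelled so they are distinguishable from the data rows.
summary = comparison.agg(["mean", "std"])
result = pd.concat([summary, comparison])
```

Flattening y_test to a 1-D array first means every column is positional, so the shuffled index left over from the train/test split cannot interfere with how the columns line up.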
I don't know whether my code is correct or not, but I got the error:
bad input shape (1, 301)
from sklearn import svm
import pandas as pd
clf = svm.SVC(gamma='scale')
df = pd.read_csv('C:\\Users\\Armin\\Desktop\\heart.csv')
x = [df.age[1:302], df.sex[1:302], df.cp[1:302], df.trestbps[1:302], df.chol[1:302], df.fbs[1:302], df.restecg[1:302], df.thalach[1:302], df.exang[1:302], df.oldpeak[1:302], df.slope[1:302], df.ca[1:302], df.thal[1:302]]
y = [df.target[1:302]]
clf.fit(x, y)
This is a very simple fix.
You need all the columns from df in x except the target column. For that, just do:
x = df.drop('target', axis=1)
And your target column will be:
y = df['target']
And now do your fit:
clf.fit(x, y)
It will work.
PS: What you were trying to do was pass a list of Series holding the feature values. All you need to do is pass the features and the target from the dataframe directly.
Some more references for you to get started and keep going:
Read more about what to pass to the fit method here: https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC.fit
Here is a super basic tutorial from the folks of scikit themselves: https://scikit-learn.org/stable/tutorial/basic/tutorial.html
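To see why the list-of-Series version produces the shape in the error message, it can help to compare the shapes directly. The sketch below uses a tiny invented frame with three feature columns instead of the real heart.csv, so the numbers are 5 and 3 rather than 301 and 13.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of heart.csv: 5 rows, 3 feature columns plus target.
df = pd.DataFrame({
    "age": [63, 37, 41, 56, 57],
    "sex": [1, 1, 0, 1, 0],
    "cp": [3, 2, 1, 1, 0],
    "target": [1, 1, 1, 0, 0],
})

# A list of Series puts each feature on its own row: (n_features, n_samples).
x_as_list = np.asarray([df.age, df.sex, df.cp])
y_as_list = np.asarray([df.target])  # (1, n_samples), like (1, 301) in the error

# Dropping the target keeps samples on rows: (n_samples, n_features).
x = df.drop("target", axis=1)
y = df["target"]
```

fit expects one row per sample, so x should have shape (n_samples, n_features) and y should be one-dimensional with n_samples entries.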
This is a follow on question from Subsetting Dask DataFrames. I wish to shuffle data from a dask dataframe before sending it in batches to a ML algorithm.
The answer in that question was to do the following:
for part in df.repartition(npartitions=100).to_delayed():
    batch = part.compute()
However, even if I was to shuffle the contents of batch I'm a bit worried that it might not be ideal. The data is a time series set so datapoints would be highly correlated within each partition.
What I would ideally like is something along the lines of:
rand_idx = np.random.choice(len(df), batch_size, replace=False)
batch = df.iloc[rand_idx, :]
which would work on pandas but not dask. Any thoughts?
Edit 1: Potential Solution
I tried doing
train_len = int(len_df*0.8)
idx = np.random.permutation(len_df)
train_idx = idx[:train_len]
test_idx = idx[train_len:]
train_df = df.loc[train_idx]
test_df = df.loc[test_idx]
However, if I try train_df.loc[:5,:].compute(), it returns a 124451-row dataframe, so I am clearly using dask wrong.
I recommend adding a column of random data to your dataframe and then using that to set the index:
df = df.map_partitions(add_random_column_to_pandas_dataframe, ...)
df = df.set_index('name-of-random-column')
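The helper passed to map_partitions is not spelled out above, so here is one possible sketch of it, checked on a plain pandas frame standing in for a single partition. The column name rand_key and the seed parameter are assumptions of mine, not part of the answer.

```python
import numpy as np
import pandas as pd

def add_random_column_to_pandas_dataframe(pdf, name="rand_key", seed=None):
    """Return a copy of one pandas partition with a column of uniform random floats."""
    rng = np.random.default_rng(seed)
    return pdf.assign(**{name: rng.random(len(pdf))})

# Plain-pandas check of what a single partition would look like.
partition = pd.DataFrame({"value": range(6)})
keyed = add_random_column_to_pandas_dataframe(partition, seed=0)

# Indexing and sorting on the random key reorders the rows.
shuffled = keyed.set_index("rand_key").sort_index()
```

With dask you would leave seed unset, since a fixed seed would give every partition the same sequence of keys, and then call set_index on the new column as shown above.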
I encountered the same issue recently and came up with a different approach using dask array and shuffle_slice, introduced in this pull request.
It shuffles the whole sample:
import numpy as np
from dask.array.slicing import shuffle_slice
d_arr = df.to_dask_array(True)
df_len = len(df)
np.random.seed(42)
index = np.random.choice(df_len, df_len, replace=False)
d_arr = shuffle_slice(d_arr, index)
and to transform back to a dask dataframe:
df = d_arr.to_dask_dataframe(df.columns)
For me it works well for large data sets.
If you're trying to separate your dataframe into training and testing subsets, that is what sklearn.model_selection.train_test_split does, and it works with pandas.DataFrame. (Go there for an example)
For your case of using it with dask, you may be interested in the library dklearn, which seems to implement this function.
To do that, we can use the train_test_split function, which mirrors
the scikit-learn function of the same name. We'll hold back 20% of the
rows:
from dklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y,test_size=0.2)
More information here.
Note: I did not perform any test with dklearn, this is just a thing I came across, but I hope it can help.
EDIT: what about dask.DataFrame.random_split?
Examples
50/50 split
>>> a, b = df.random_split([0.5, 0.5])
80/10/10 split, consistent random_state
>>> a, b, c = df.random_split([0.8, 0.1, 0.1], random_state=123)
Use for ML applications is illustrated here
For people here really just wanting to shuffle the rows as the title implies:
This is costly
import numpy as np
random_idx = np.random.permutation(len(sd.index))
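# assign returns a new dataframe, so keep the result
sd = sd.assign(random_idx=random_idx)
# index on the random column; it is not sorted, so sorted=True does not apply
sd = sd.set_index('random_idx')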
I am trying to read in the complete Titanic dataset, which can be found here:
biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic3.xls
import pandas as pd
# Importing the dataset
dataset = pd.read_excel('titanic3.xls')
y = dataset.iloc[:, 1].values
x = dataset.iloc[:, 2:14].values
# Create Dataset for Men
men_on_board = dataset[dataset['sex'] == 'male']
male_fatalities = men_on_board[men_on_board['survived'] ==0]
X_male = male_fatalities.iloc[:, [4,8]].values
# Taking care of missing data
from sklearn.preprocessing import Imputer
imputer = Imputer(missing_values = 'NaN', strategy = 'mean', axis = 0)
imputer = imputer.fit(X_male[:,0])
X_male[:,0] = imputer.transform(X_male[:,0])
When I run all but the last line, I get the following warning:
/Users/<username>/anaconda/lib/python3.6/site-packages/sklearn/utils/validation.py:395: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
DeprecationWarning)
When I run the last line, it throws the following error:
File "<ipython-input-2-07afef05ee1c>", line 1, in <module>
X_male[:,0] = imputer.transform(X_male[:,0])
ValueError: could not broadcast input array from shape (523) into shape (682)
I've used the above code snippet for imputation on other projects, not sure why it's not working.
A quick solution is to change axis = 0 to axis = 1. This will make it work, though I'm not sure if that's what you want, so here is some explanation of what happened:
The warning basically tells you that sklearn estimators now require 2D data arrays rather than 1D ones, because interpreting the data as samples (rows) versus features (columns) matters. During the deprecation period, this requirement is enforced by np.atleast_2d, which assumes your data has a single sample (row). Meanwhile, you passed axis = 0 to the Imputer, which imputes along columns using strategy = 'mean'. However, you now have only 1 row, so when the Imputer comes across a missing value there is no mean to replace it with. Therefore the entire column (which contains just this missing value) is discarded. As you can see, this is equal to
X_male[:,0][~np.isnan(X_male[:,0])].reshape(1, -1)
That's why the assignment X_male[:,0] = imputer.transform(X_male[:,0]) failed: X_male[:,0] has shape (682) while imputer.transform(X_male[:,0]) has shape (523). My previous solution basically changes it to "impute along rows", where you do have a mean to replace missing values. You won't drop anything this time, and imputer.transform(X_male[:,0]) has shape (682), which can be assigned to X_male[:,0].
Now I don't know why your code snippet for imputation works on other projects. For your specific case here, a (logically) better way to address the deprecation warning would be to use X.reshape(-1, 1), since your data has a single feature and 682 samples. However, you need to reshape the transformed data back before it can be assigned to X_male[:,0]:
imputer = imputer.fit(X_male[:,0].reshape(-1, 1))
X_male[:,0] = imputer.transform(X_male[:,0].reshape(-1, 1)).reshape(-1)
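The shape argument above can be reproduced with NumPy alone. The sketch below is a stand-in for the Imputer, not its implementation: it uses a five-element invented array in place of the 682 ages and np.nanmean for the mean strategy. As a side note, later scikit-learn releases replaced Imputer with sklearn.impute.SimpleImputer, which has no axis argument and always works column-wise.

```python
import numpy as np

# Hypothetical stand-in for X_male[:, 0]: one feature with missing values.
ages = np.array([22.0, np.nan, 30.0, np.nan, 38.0])

# Treated as a single row (1, n): each NaN sits alone in its column, so a
# column-wise mean has nothing to average and that column is dropped.
as_row = ages.reshape(1, -1)
kept = as_row[:, ~np.isnan(as_row[0])]  # fewer columns than the original

# Treated as a single feature (n, 1): the column mean ignores the NaNs and
# fills them, so the length is preserved.
as_column = ages.reshape(-1, 1)
col_mean = np.nanmean(as_column, axis=0)
filled = np.where(np.isnan(as_column), col_mean, as_column).reshape(-1)
```

Here kept has lost two entries, which mirrors the 682 versus 523 mismatch, while filled has the same length as ages and can be assigned back in place.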