MinMaxScaler for a number of columns in a pandas DataFrame - python

I want to apply MinMaxScaler to a number of pandas DataFrame columns 'together', meaning that I want the scaler to operate on all the data in those columns at once, not on each column separately.
My DataFrame has 20 columns, and I want to apply the scaler to 12 of them at the same time. I have already read this, but it does not solve my problem since it acts on each column separately.

IIUC, you want the sklearn scaler to fit and transform multiple columns using the same criteria (in this case, a single min and max shared by all of them). Here is one way to do this:
1. Save the initial shape of the columns and flatten the 2D numpy array of those columns into a single-column array with reshape(-1, 1).
2. Fit your scaler on this flattened array and transform it.
3. Use the saved shape to reshape the array back into the n columns you need and assign them back.
The advantage of this approach is that it works with any of the sklearn scalers, e.g. MinMaxScaler or StandardScaler.
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
dfTest = pd.DataFrame({'A': [14.00, 90.20, 90.95, 96.27, 91.21],
                       'B': [103.02, 107.26, 110.35, 114.23, 114.68],
                       'C': ['big', 'small', 'big', 'small', 'small']})
cols = ['A', 'B']
old_shape = dfTest[cols].shape  # (5, 2)
# flatten to one column, scale with a shared min/max, then restore the shape
dfTest[cols] = scaler.fit_transform(dfTest[cols].to_numpy().reshape(-1, 1)).reshape(old_shape)
print(dfTest)
          A         B      C
0  0.000000  0.884188    big
1  0.756853  0.926301  small
2  0.764303  0.956992    big
3  0.817143  0.995530  small
4  0.766885  1.000000  small
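To illustrate the claim that the same trick works with other scalers, here is a minimal sketch that swaps in StandardScaler. The frame reuses the sample data above; the names `block` and `df_scaled` are only illustrative. After scaling, the selected block as a whole (not each column) should have mean 0 and standard deviation 1.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "A": [14.00, 90.20, 90.95, 96.27, 91.21],
    "B": [103.02, 107.26, 110.35, 114.23, 114.68],
    "C": ["big", "small", "big", "small", "small"],
})
cols = ["A", "B"]

# Flatten the selected columns into one feature so a single mean/std is learned
block = df[cols].to_numpy()
scaled = StandardScaler().fit_transform(block.reshape(-1, 1)).reshape(block.shape)

# Work on a copy so the original frame stays available for comparison
df_scaled = df.copy()
df_scaled[cols] = scaled

print(df_scaled[cols].to_numpy().mean())  # close to 0 for the block as a whole
print(df_scaled[cols].to_numpy().std())   # close to 1 for the block as a whole
```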

You can extract the "min" and "max" statistics from those columns and perform the scaling yourself:
# columns of interest
cols = [...]
# get the minimum and maximum values in that region
vals = df[cols].to_numpy()
min_val = vals.min()
max_val = vals.max()
# scale the region using them
df[cols] = df[cols].sub(min_val).div(max_val - min_val)
(sub is the method form of "-" and div is the method form of "/".)
Above, df is your training dataframe. To scale the testing dataframe, replace df with it in the last line, e.g.,
test_df[cols] = test_df[cols].sub(min_val).div(max_val - min_val)
instead of extracting its min/max separately, which would leak information from the test set.
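The train/test point can be made concrete with a small sketch. The two frames below are made-up data for illustration only. Note that test values outside the training range land outside [0, 1], which is expected when the training statistics are reused.

```python
import pandas as pd

# Hypothetical training and testing frames, for illustration only
train_df = pd.DataFrame({"A": [14.00, 90.20, 96.27], "B": [103.02, 107.26, 114.68]})
test_df = pd.DataFrame({"A": [50.00, 120.00], "B": [110.00, 10.00]})
cols = ["A", "B"]

# Statistics come from the training block only
vals = train_df[cols].to_numpy()
min_val, max_val = vals.min(), vals.max()

# The same min/max is applied to both frames
train_scaled = train_df[cols].sub(min_val).div(max_val - min_val)
test_scaled = test_df[cols].sub(min_val).div(max_val - min_val)

print(train_scaled)
print(test_scaled)  # 120.0 maps above 1 and 10.0 maps below 0
```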

Related

pandas dataframe rows scaling with sklearn

How can I apply a sklearn scaler to all rows of a pandas dataframe? The question is related to "pandas dataframe columns scaling with sklearn". How can I apply a sklearn scaler to all values of a row?
NOTE: I know that for feature scaling it's normal to have features in columns and to scale them column-wise, as in the other question referenced above. However, I'd like to use sklearn scalers to preprocess data for visualization, where scaling row-wise is reasonable in my case.
Sklearn works with both pandas dataframes and numpy arrays, and numpy arrays allow some basic matrix transformations that dataframes don't.
You can convert the dataframe to a numpy array with vectors = df.values, then transpose the array, scale the transposed array column-wise, and transpose it back:
scaled_rows = scaler.fit_transform(vectors.T).T
Then convert the result to a dataframe: scaled_df = pd.DataFrame(data = scaled_rows, columns = df.columns)
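Put together, the transpose approach might look like the sketch below. The small frame is made up for illustration, and passing `index=df.index` is an addition to the answer above so that row labels survive the round trip.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical frame: each row should be scaled to [0, 1] on its own
df = pd.DataFrame(
    {"a": [1.0, 10.0], "b": [2.0, 20.0], "c": [3.0, 40.0]},
    index=["r1", "r2"],
)

scaler = MinMaxScaler()
# Transpose so rows become columns, scale column-wise, then transpose back
scaled_rows = scaler.fit_transform(df.to_numpy().T).T
scaled_df = pd.DataFrame(scaled_rows, columns=df.columns, index=df.index)

print(scaled_df)
```

One thing to keep in mind: the fitted scaler stores one min/max per original row, so it can only transform data with the same number of rows.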

Scikit: Problem returning Dataframe from imputer instead of Numpy Array

I am trying to impute some missing values in a Dataframe using the scikit-learn IterativeImputer(). The problem is that the imputer will take the pandas dataframe as an input, but will return a numpy array instead of the original dataframe. Here is a simple example taken from this post.
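import numpy as np
import pandas as pd
# IterativeImputer is experimental and must be enabled explicitly
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
# Create an empty dataset
df = pd.DataFrame()
# Create two variables called x0 and x1. Make the first value of x1 a missing value
df['x0'] = [0.3051,0.4949,0.6974,0.3769,0.2231,0.341,0.4436,0.5897,0.6308,0.5]
df['x1'] = [np.nan,0.2654,0.2615,0.5846,0.4615,0.8308,0.4962,0.3269,0.5346,0.6731]
imputer = IterativeImputer(max_iter=10, random_state=42)
imputer.fit(df)
imputed_df = imputer.transform(df)
imputed_df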
The problem is that when the numpy array is returned, the column names and other metadata are lost. I can of course manually extract that metadata from the original dataframe and reapply it, but that seems a bit hacky. Pandas has its own imputation in the form of DataFrame.fillna(), but its algorithms are not as sophisticated as the scikit-learn ones.
So is there a way to fit the imputer to a dataframe and get a dataframe back as the result?
Yes, you can: just assign the values back.
df[:] = imputer.transform(df)
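If overwriting the original frame is not desired, a possible alternative is to build a new DataFrame from the returned array and reuse the original column names and index. This is a sketch of that variant, not what the answer above does; the data is the sample from the question.

```python
import numpy as np
import pandas as pd
# IterativeImputer is experimental and must be enabled explicitly
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

df = pd.DataFrame({
    "x0": [0.3051, 0.4949, 0.6974, 0.3769, 0.2231, 0.341, 0.4436, 0.5897, 0.6308, 0.5],
    "x1": [np.nan, 0.2654, 0.2615, 0.5846, 0.4615, 0.8308, 0.4962, 0.3269, 0.5346, 0.6731],
})

imputer = IterativeImputer(max_iter=10, random_state=42)

# Wrap the returned array in a new frame; df itself is left unchanged
imputed_df = pd.DataFrame(
    imputer.fit_transform(df),
    columns=df.columns,
    index=df.index,
)

print(imputed_df.head())
```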

Sklearn Decision Tree - using sparse matrix and other features simultaneously

I am using Sklearn Decision Tree for some classification and I have two types of data: categorical and continuous. I used pd.get_dummies for my categorical values and ended up with over 90 features. Which is, of course, quite a lot.
The thing is that I then iterate over the max_features parameter to get the best score for my model, and having more than 20 features is too time-consuming. So I thought that Sklearn could use sparse matrices for my categorical features, instead of 70 columns of 0s and 1s.
The question is: can Sklearn use a mix of sparse matrices and regular arrays or not? If yes, how do I do that? Currently I get the error: setting an array element with a sequence
Here is some code to get the idea. df_with_dummies is what I currently use, but I hope there is a way to use df_with_sparse
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix
a = np.random.randn(10,3)
b = np.random.random((10,1))
df = pd.DataFrame(a, columns = "A B C".split())
df['temp'] = b
df['dum1'] = np.where(df.temp < 0.5, 1, 0)
df['dum2'] = np.where(df.temp >= 0.5, 1, 0)
del df['temp']
df_with_dummies = df.copy()
a = df[['dum1', 'dum2']]
dums = csr_matrix(a)
df['dums'] = dums
df_with_sparse = df.copy()
When you do:
df['dums'] = dums
dums, being a sparse matrix, is not handled correctly by the pandas DataFrame and is broadcast to every row. pandas does not complain about it, because it treats the sparse matrix as a non-array object.
That means that each element in the df['dums'] column points to the whole sparse matrix dums. So essentially, each array element is being set with an array, hence the error "setting an array element with a sequence" when it is processed by scikit-learn estimators.
Instead, you can stack the dense columns and the sparse matrix horizontally:
from scipy.sparse import hstack
df_with_sparse = hstack([df[['A', 'B', 'C']].values, dums])
Now you can pass this further.
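As a rough end-to-end sketch of "passing this further", the stacked matrix can be fed directly to a decision tree. The random data and the target `y` below are invented for illustration, and the dense block is wrapped in csr_matrix first so that every input to hstack is sparse.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Continuous features (dense) and two one-hot dummy columns (sparse)
dense = rng.normal(size=(10, 3))
flag = rng.random(10) < 0.5
dummies = csr_matrix(np.column_stack([flag, ~flag]).astype(float))

# Hypothetical target, only so the classifier has something to fit
y = flag.astype(int)

# Combine both blocks into a single sparse feature matrix
X = hstack([csr_matrix(dense), dummies]).tocsr()

clf = DecisionTreeClassifier(random_state=0).fit(X, y)
predictions = clf.predict(X)

print(X.shape)  # 10 rows, 3 dense + 2 dummy columns
```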

Row-wise prediction over Pandas dataframe by passing sklearn.predict to df.apply

Assume we have a pandas dataframe and a scikit-learn model trained (fit) using that dataframe. Is there a way to do row-wise prediction? The use case is to use the predict function to fill in empty values in the dataframe, using an sklearn model.
I expected that this would be possible using the pandas apply function (with axis=1), but I keep getting dimensionality errors.
Using Pandas version '0.22.0' and sklearn version '0.19.1'.
Simple example:
import pandas as pd
from sklearn.cluster import KMeans  # the class name is KMeans, not kmeans
data = [[x,y,x*y] for x in range(1,10) for y in range(10,15)]
df = pd.DataFrame(data,columns=['input1','input2','output'])
model = KMeans()
model.fit(df[['input1','input2']],df['output'])
df['predictions'] = df[['input1','input2']].apply(model.predict,axis=1)
The resulting dimensionality error:
ValueError: ('Expected 2D array, got 1D array instead:\narray=[ 1.
10.].\nReshape your data either using array.reshape(-1, 1) if your data has
a single feature or array.reshape(1, -1) if it contains a single sample.',
'occurred at index 0')
Running predict on the whole column works fine:
df['predictions'] = model.predict(df[['input1','input2']])
However, I want the flexibility to use this row-wise.
I've tried various approaches to reshape the data first, for example:
import numpy as np
def reshape_predict(df):
    return model.predict(np.reshape(df.values,(1,-1)))
df[['input1','input2']].apply(reshape_predict,axis=1)
Which just returns the input with no error, whereas I expect it to return a single column of output values (as an array).
SOLUTION:
Thanks to Yakym for providing a working solution! Trying a few variants based on his suggestion, the easiest solution was to simply wrap the row values in square brackets (I tried this previously, but without the 0 index for the prediction, with no luck).
df['predictions'] = df[['input1','input2']].apply(lambda x: model.predict([x])[0],axis=1)
Slightly more verbosely, you can turn each row into a 2D array by adding a new axis to the values. You then have to access the prediction at index 0:
df["predictions"] = df[["input1", "input2"]].apply(
    lambda s: model.predict(s.values[None])[0], axis=1
)
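For reference, here is a self-contained sketch that runs the row-wise call next to the whole-frame call. The `n_clusters`, `n_init` and `random_state` values are arbitrary choices for the example, and the model is fit on a plain array so that predict accepts plain arrays without feature-name warnings. Both calls should give the same labels; the row-wise version is just slower.

```python
import pandas as pd
from sklearn.cluster import KMeans

data = [[x, y] for x in range(1, 10) for y in range(10, 15)]
df = pd.DataFrame(data, columns=["input1", "input2"])
features = ["input1", "input2"]

# Arbitrary hyperparameters, chosen only to make the example deterministic
model = KMeans(n_clusters=3, n_init=10, random_state=0)
model.fit(df[features].to_numpy())

# Row-wise: each row is a 1D Series, so add a leading axis to make it 2D
row_wise = df[features].apply(
    lambda s: model.predict(s.to_numpy()[None])[0], axis=1
)

# Whole-frame: one call for all rows
vectorized = model.predict(df[features].to_numpy())

print((row_wise.to_numpy() == vectorized).mean())  # fraction of matching labels
```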

Proper way to do row correlations in a pandas dataframe

I want to calculate the correlation across two rows of a Pandas DataFrame. It is easy to calculate the correlation across two rows when all entries are of a numerical type, like this:
import pandas as pd
import numpy as np
example_df = pd.DataFrame(np.random.randn(10, 30), np.arange(10))
example_df.iloc[1, :].corr(example_df.iloc[2, :])
But if the DataFrame is of mixed type, you get an error when calculating the correlation even when you choose only the subset of numerical entries:
example_df['Letter'] = 'A'
example_df.iloc[1, :-1].corr(example_df.iloc[2, :-1])
AttributeError: 'numpy.float64' object has no attribute 'sqrt'
The Pearson's correlation function makes use of the square root function and that function doesn't exist for an object type so it can't do the correlation. You have to manually change the type to float and then you can calculate the correlation.
example_df.iloc[1, :-1].astype('float64').corr(example_df.iloc[2, :-1].astype('float64'))
Is there a better way to do this?
I don't know if these are any better than what you did, but here's a way with numpy:
np.corrcoef(example_df.iloc[1:3, :-1])
array([[ 1.        , -0.37194563],
       [-0.37194563,  1.        ]])
And here's a way with pandas:
example_df.iloc[1:3, :-1].T.corr()
          1         2
1  1.000000 -0.371946
2 -0.371946  1.000000
If you want to compare non-contiguous rows, adjust iloc like this:
example_df.iloc[[1, 4], :-1].T.corr()
You could hide the non-float column(s) in the index:
example_df = example_df.set_index(['Letter'], append=True)
so that the rows are once again purely of float dtype. Then
example_df.iloc[1, :].corr(example_df.iloc[2, :])
works as before.
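Another option, offered here only as a sketch and not taken from the answers above, is to select the numeric columns by dtype before slicing rows. That keeps each row a float Series, so no astype call or positional `:-1` slice is needed. The seeded generator is an assumption made so the example is reproducible.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
example_df = pd.DataFrame(rng.normal(size=(10, 30)))
example_df["Letter"] = "A"

# Keep only numeric columns; rows taken from this frame have float dtype
numeric = example_df.select_dtypes(include="number")

# Correlation between row 1 and row 2
r = numeric.iloc[1].corr(numeric.iloc[2])

# Correlation matrix for every pair of rows
all_pairs = numeric.T.corr()

print(r)
print(all_pairs.shape)
```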
