Pandas deep copy and scikit-learn MinMaxScaler - Python

I am trying to do the following:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.read_csv('a.csv')
scaler = MinMaxScaler()
df_copy = df.copy(deep=True)
for i in range(1, len(df)):
    df_chunk = df_copy.iloc[i:i+10]
    df_chunk = scaler.fit_transform(df_chunk)
So each df_chunk should be a scaled data frame.
The issue is that some are not scaled correctly.
If I plot the scaled data points, a properly scaled data frame looks like a range of numbers scattered fairly evenly between 0 and 1. But the data frames I get sit at two extremes: the first ~80% of the numbers are around 0.9, while the rest are near 0.1.
So it feels like the first ~80% of the data got scaled twice by the scaler. I have already tried using pandas deep copy to solve this, but it doesn't seem to help.
If you have any idea why, I would really appreciate it.

I'm not too sure why you want to apply the scaler to chunks of your data. If you're worried that your CSV may be too large, you should read the CSV in chunks in the read_csv operation and process those chunks.
Now onto your issue: you're re-fitting your scaler on every chunk, which is why you're getting the weird results. You either have to fit the scaler on the entire data, or fit it incrementally using the partial_fit method.
I'll provide you both solutions.
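But first, a minimal sketch with made-up numbers (not your CSV) to show why the per-chunk fit gives those odd values: every call to fit_transform stretches that chunk's own minimum and maximum onto 0 and 1, so values from different chunks end up on scales that are not comparable.
import numpy as np
from sklearn.preprocessing import MinMaxScaler

data = np.arange(20, dtype=float).reshape(-1, 1)    # toy column with values 0..19
scaler = MinMaxScaler()

first = scaler.fit_transform(data[:10])     # values 0..9 are mapped onto 0..1
second = scaler.fit_transform(data[10:])    # values 10..19 are *also* mapped onto 0..1
whole = MinMaxScaler().fit_transform(data)  # fitting once keeps everything comparable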
Solution #1: read and fit the entire data
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df = pd.read_csv('a.csv')
df_scaled = scaler.fit_transform(df)
Solution #2: read the csv by chunks, and online train
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
# first read the csv by chunks and update the scaler
for chunk in pd.read_csv('a.csv', chunksize=10):
    scaler.partial_fit(chunk)

# read the csv again by chunks to transform the chunks
for chunk in pd.read_csv('a.csv', chunksize=10):
    transformed = scaler.transform(chunk)
    # not too sure what you want to do after this,
    # but you can either print the results of the transformation
    # or write the transformed chunk to a new csv
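Continuing from the loops above, if you want to persist the results, here is a sketch of appending each transformed chunk to a new file (the output name a_scaled.csv is just an assumption):
for i, chunk in enumerate(pd.read_csv('a.csv', chunksize=10)):
    transformed = scaler.transform(chunk)            # returns a NumPy array
    pd.DataFrame(transformed, columns=chunk.columns).to_csv(
        'a_scaled.csv',
        mode='w' if i == 0 else 'a',                 # overwrite on the first chunk, append after
        header=(i == 0),                             # write the header only once
        index=False)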

Related

How to substitute scaled columns for the original columns in my dataframe

I have scaled the columns; how do I put them back into my data frame?
Here is the code that I have:
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer

num_cols = ['fare_amount', 'trip_distance', 'jfk_drop_distance', 'lga_drop_distance',
            'ewr_drop_distance', 'met_drop_distance', 'wtc_drop_distance']
features = train_df[num_cols]
ct = ColumnTransformer([('scaler', StandardScaler(), num_cols)],
                       remainder='passthrough')
ct.fit_transform(features)
The main data frame in which I want to substitute these columns for the old ones is train_df.
I think you're almost there.
Just assign the fit_transform output back to your dataframe, like below:
...
train_df[num_cols] = ct.fit_transform(features)
...
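A small note on why that works (my addition, not part of the original answer): ct.fit_transform returns a NumPy array whose columns come out in the order they were listed in the ColumnTransformer, which here is the same as num_cols, so the assignment lines up column by column. If you want to be explicit about it, you can wrap the result in a DataFrame first:
import pandas as pd

scaled = ct.fit_transform(features)   # NumPy array, columns in the order given to the transformer
train_df[num_cols] = pd.DataFrame(scaled, columns=num_cols, index=train_df.index)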

How to speed up the scaling process in Python?

I have a large text dataset and I'm using the MinMaxScaler to transform one feature. The code works fine but takes more than 3 minutes, and I want to reduce the time this step consumes. Are there any suggestions to speed up this process, or an alternative method to do this transformation faster?
df = cleanData('data.csv')
scaler = MinMaxScaler(feature_range=(0, 5))
scaler.fit(pd.DataFrame(df.loc[:,'year']))
df.loc[:,'year'] = scaler.transform(pd.DataFrame(df.loc[:,'year']))
You can try doing it with dask-ml:
import dask.dataframe as dd
from dask_ml.preprocessing import MinMaxScaler
# or read directly from csv with ddf = dd.read_csv('data.csv')
ddf = dd.from_pandas(df, npartitions=10)
scaler = MinMaxScaler(feature_range=(0, 5))
scaler.fit(ddf['year'])
ddf['year'] = scaler.transform(ddf['year'])
There are also other preprocessing tools available in dask_ml, see https://ml.dask.org/modules/generated/dask_ml.preprocessing.MinMaxScaler.html?highlight=minmaxscaler
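If you only need to scale that one column, another option (a sketch of my own, not from the original answer) is to do the min-max arithmetic directly with vectorized pandas operations, which avoids the scikit-learn overhead entirely. This reproduces MinMaxScaler's formula for a (0, 5) feature range:
col = df['year'].astype(float)
# (x - min) / (max - min), stretched into the range (0, 5)
df['year'] = (col - col.min()) / (col.max() - col.min()) * 5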

Data imputation in Python for Google Analytics data

I have sets of Google Analytics data from a website which I plan to analyse for a project. However, due to maintenance and other factors, there are chunks of dates for which there is no data. I want to impute this data while still maintaining the integrity of the data as I plan to plot these sets and compare the curves of different sets to each-other over time.
Example
I want to use the nearest valid datapoints to each missing datapoint to impute that value in order to maintain the underlying shape that can be seen from the image.
I've already tried to use scikit-learn's KNNImputer and IterativeImputer, but I'm either misunderstanding how these imputers are supposed to be used, or they're not the right tools for what I'm trying to do, potentially both.
import pandas as pd
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
import numpy as np
df = pd.read_csv('data.csv', names=['Day','Views'],delimiter=',',skiprows=3, usecols=[0,1], skipfooter=1, engine='python', quoting= 1)
df = df.replace(0, np.nan)
da = df.Views.rename_axis('ID').values
da = da.reshape(-1,1)
imputer = IterativeImputer(n_nearest_features = 100, max_iter = 10)
df_imputed = imputer.fit_transform(da)
df_imputed.reshape(1,-1)
df.Views = df_imputed
df
With what I have currently implemented, all of the NaN values are imputed as the exact same number.
Any help would be greatly appreciated.
The problem here was how I was reshaping the array. My data was just a 1D array of values, so I was making it 2D by reshaping it, which caused all of the NaN values to be imputed as the same number. When I added an index column and included it as an input to the imputer, the values were calculated correctly. I also ended up using sklearn's KNNImputer instead of the IterativeImputer in this instance.
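For anyone who lands here later, a minimal sketch of what that looks like (the n_neighbors value and the distance weighting are my assumptions, not from the original code):
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.read_csv('data.csv', names=['Day', 'Views'], skiprows=3,
                 usecols=[0, 1], skipfooter=1, engine='python')
df = df.replace(0, np.nan)

# stack the row position next to the values so the nearest neighbours are the nearest dates
X = np.column_stack([np.arange(len(df)), df['Views'].to_numpy()])
imputer = KNNImputer(n_neighbors=5, weights='distance')
df['Views'] = imputer.fit_transform(X)[:, 1]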

Issue with Scikit-learn data analysis

I am attempting to take a .dat file of about 90,000 data lines with two variables (wavelength and intensity) and apply scikit-learn's PCA to it.
Here is a small set of that data:
wavelength intensity
[um] [W/m**2/um/sr]
196.078431372549 1.108370393265022E-003
192.307692307692 1.163428008597600E-003
188.679245283019 1.223639983609668E-003
The code I am using to analyze the data is below:
pca = PCA(n_components=2)
pca.fit(data)
print(pca.components_)
This is the error I get when I try to apply 2 PCA components to one of the data sets:
ValueError: Datatype coercion is not allowed
Any help resolving this would be much appreciated.
I think in your case, the problem is the column name, especially [W/m**2/um/sr].
Also when using PCA, do not forget to rescale the input variables into "comparable" units using StandardScaler.
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

data = pd.DataFrame({
    'wavelength [um]': [196.078431372549, 192.307692307692, 188.679245283019],
    'intensity [W/m**2/um/sr]': [1.108370393265022E-003, 1.163428008597600E-003, 1.223639983609668E-003],
})
scaler = StandardScaler(with_mean=True, with_std=True)
pca = PCA(n_components=2)
pca.fit(scaler.fit_transform(data))
print(pca.components_)
Worked well for me. Maybe you just need to specify:
data.columns = data.columns.astype(str)
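It may also help to make sure the units line of the .dat file never becomes part of the data in the first place. A hedged sketch of one way to read it (the file name, the whitespace separator, and the cleaned-up column names are assumptions):
import pandas as pd

# line 1 holds the column names, line 2 holds the units, so skip both
data = pd.read_csv('spectrum.dat', sep=r'\s+', skiprows=2,
                   names=['wavelength_um', 'intensity_W_m2_um_sr'])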

Shuffling data in dask

This is a follow-on question from Subsetting Dask DataFrames. I wish to shuffle data from a dask dataframe before sending it in batches to a ML algorithm.
The answer in that question was to do the following:
for part in df.repartition(npartitions=100).to_delayed():
    batch = part.compute()
However, even if I were to shuffle the contents of batch, I'm a bit worried that it might not be ideal. The data is a time series, so datapoints would be highly correlated within each partition.
What I would ideally like is something along the lines of:
rand_idx = np.random.choice(len(df), batch_size, replace=False)
batch = df.iloc[rand_idx, :]
which would work on pandas but not dask. Any thoughts?
Edit 1: Potential Solution
I tried doing
len_df = len(df)
train_len = int(len_df * 0.8)
idx = np.random.permutation(len_df)
train_idx = idx[:train_len]
test_idx = idx[train_len:]
train_df = df.loc[train_idx]
test_df = df.loc[test_idx]
However, if I try train_df.loc[:5, :].compute(), this returns a 124451-row dataframe, so I'm clearly using dask wrong.
I recommend adding a column of random data to your dataframe and then using that to set the index:
df = df.map_partitions(add_random_column_to_pandas_dataframe, ...)
df = df.set_index('name-of-random-column')
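The helper function is left open in the snippet above; a minimal sketch of one way it could look (the column name shuffle_key and the toy frame are my own assumptions):
import numpy as np
import pandas as pd
import dask.dataframe as dd

def add_random_column_to_pandas_dataframe(part):
    # give every row of this pandas partition a random sort key
    return part.assign(shuffle_key=np.random.random(len(part)))

# toy frame standing in for the real time-series data
df = dd.from_pandas(pd.DataFrame({'x': range(1000)}), npartitions=10)
df = df.map_partitions(add_random_column_to_pandas_dataframe)
df = df.set_index('shuffle_key')   # the full shuffle across partitions happens here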
I encountered the same issue recently and came up with a different approach using dask array and shuffle_slice, introduced in this pull request. It shuffles the whole sample.
import numpy as np
from dask.array.slicing import shuffle_slice
d_arr = df.to_dask_array(True)
df_len = len(df)
np.random.seed(42)
index = np.random.choice(df_len, df_len, replace=False)
d_arr = shuffle_slice(d_arr, index)
and to transform it back to a dask dataframe:
df = d_arr.to_dask_dataframe(df.columns)
For me, it works well for large data sets.
If you're trying to separate your dataframe into training and testing subsets, that is exactly what sklearn.model_selection.train_test_split does, and it works with pandas.DataFrame (go there for an example).
And for your case of using it with dask, you may be interested in the library dklearn, which seems to implement this function.
To do that, we can use the train_test_split function, which mirrors
the scikit-learn function of the same name. We'll hold back 20% of the
rows:
from dklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
More information here.
Note: I did not perform any test with dklearn, this is just a thing I came across, but I hope it can help.
EDIT: what about dask.DataFrame.random_split?
Examples
50/50 split
>>> a, b = df.random_split([0.5, 0.5])
80/10/10 split, consistent random_state
>>> a, b, c = df.random_split([0.8, 0.1, 0.1], random_state=123)
Use for ML applications is illustrated here
For people here really just wanting to shuffle the rows as the title implies:
This is costly:
import numpy as np

random_idx = np.random.permutation(len(sd.index))
sd = sd.assign(random_idx=random_idx)   # assign returns a new dataframe, so keep the result
sd = sd.set_index('random_idx')         # setting the index shuffles the rows across partitions
