Shuffling data in dask

Shuffling data in dask - python

This is a follow on question from Subsetting Dask DataFrames. I wish to shuffle data from a dask dataframe before sending it in batches to a ML algorithm.
The answer in that question was to do the following:
for part in df.repartition(npartitions=100).to_delayed():
batch = part.compute()
However, even if I was to shuffle the contents of batch I'm a bit worried that it might not be ideal. The data is a time series set so datapoints would be highly correlated within each partition.
What I would ideally like is something along the lines of:
rand_idx = np.random.choice(len(df), batch_size, replace=False)
batch = df.iloc[rand_idx, :]
which would work on pandas but not dask. Any thoughts?
Edit 1: Potential Solution
I tried doing
train_len = int(len_df*0.8)
idx = np.random.permutation(len_df)
train_idx = idx[:train_len]
test_idx = idx[train_len:]
train_df = df.loc[train_idx]
test_df = df.loc[test_idx]
However, if I try doing train_df.loc[:5,:].compute() this return a 124451 row dataframe. So clearly using dask wrong.

I recommend adding a column of random data to your dataframe and then using that to set the index:
df = df.map_partitions(add_random_column_to_pandas_dataframe, ...)
df = df.set_index('name-of-random-column')

I encountered the same issue recently and came up with a different approach using dask array and shuffle_slice introduced in this pull request
It shuffles the whole sample
import numpy as np
from dask.array.slicing import shuffle_slice
d_arr = df.to_dask_array(True)
df_len = len(df)
np.random.seed(42)
index = np.random.choice(df_len, df_len, replace=False)
d_arr = shuffle_slice(d_arr, index)
and to transform back to dask dataframe
df = d_arr.to_dask_dataframe(df.columns)
for me it works well for large data sets

If you're trying to separate your dataframe into training and testing subsets, it is what does sklearn.model_selection.train_test_split and it works with pandas.DataFrame. (Go there for an example)
And for your case of using it with dask, you may be interested by the library dklearn, that seems to implements this function.
To do that, we can use the train_test_split function, which mirrors
the scikit-learn function of the same name. We'll hold back 20% of the
rows:
from dklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y,test_size=0.2)
More information here.
Note: I did not perform any test with dklearn, this is just a thing I came across, but I hope it can help.
EDIT: what about dask.DataFrame.random_split?
Examples
50/50 split
>>> a, b = df.random_split([0.5, 0.5])
80/10/10 split, consistent random_state
>>> a, b, c = df.random_split([0.8, 0.1, 0.1], random_state=123)
Use for ML applications is illustrated here

For people here really just wanting to shuffle the rows as the title implies:
This is costly
import numpy as np
random_idx = np.random.permutation(len(sd.index))
sd.assign(random_idx=random_idx)
sd = sd.set_index('x', sorted=True)

Related

How to create synthetic data based on dataset with mixed data types for classification problem?

I am trying to build a classification model, but I don't have enough data. What would be the most appropriate way to create synthetic data based on my existing dataset if I have numerical and categorical features?
I looked at using Vine copulas like here: https://sdv.dev/Copulas/tutorials/03_Multivariate_Distributions.html#Vine-Copulas but sampling such copulas gives floats even for the columns that I would like to be integers (label-encoded categorical features). And then I dont know how to convert such floats back to a categorical features.
Sample toy code of my problem is below
import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn.datasets import fetch_openml
from copulas.multivariate import VineCopula, GaussianMultivariate
X, y = fetch_openml("titanic", version=1, as_frame=True, return_X_y=True)
X['label'] = y
# reducing features and removing nulls to keep things simple
X = X[['sex', 'age', 'fare', 'embarked', 'label']]
row_keep = X.isnull().sum(axis=1) == 0
df = X.loc[row_keep, :].copy()
df.reset_index(drop=True, inplace=True)
# encoding columns
cat_cols = ['sex', 'embarked', 'label']
num_cols = ['age', 'fare']
label_encoders = {}
for c in cat_cols:
cat_proc = preprocessing.LabelEncoder()
col_proc = cat_proc.fit_transform(df[c])
df[c] = col_proc
label_encoders[c] = cat_proc
# Fit a copula
copula = VineCopula('regular')
copula.fit(df)
# Sample synthetic data
df_synthetic = copula.sample(1000)
All the columns of df_synthetic are floats. How would I convert those back to ints that I can map back to categorical features?
Is there another way to augment this sort of dataset? Would be even better, if it's performant and I can sample 7000-10000 new synthetic entries. The toy problem with 5 columns above took ~1mins to sample 1000 rows, but my real problem has 27 columns, which I imagine would take a lot longer.

To have your columns converted to ints, use round and then .astype(int):
df_synthetic["sex"] = round(df_synthetic["sex"]).astype(int)
df_synthetic["embarked"] = round(df_synthetic["embarked"]).astype(int)
df_synthetic["label"] = round(df_synthetic["label"]).astype(int)
You might have to adjust values manually (ex. cap sex in [0,1] if some larger/smaller value has been generated), but that will strongly depend on your data characteristics.

How indexing of sliced data frame works in pandas

How to get the correct row from a datfarme which is sliced?
To show what I mean, look at this code sample:
import lightgbm as lgb
from sklearn.model_selection import train_test_split
import numpy as np
data=pd.DataFrame()
data['one']=range(0,1000)
data['p1']=data['one']+1
data['p2']=data['one']+2
label=data['p1']%2==0
X_train, X_test, y_train, y_test = train_test_split(data, label, test_size=0.2, random_state=100)
lgb_model = lgb.LGBMClassifier(objective = 'binary')
lgb_fitted = lgb_model.fit(X_train, y_train, verbose = False)
y_prob=lgb_fitted.predict_proba(X_test)
y_prob= pd.DataFrame(y_prob,columns = ['No','Yes'])
model_uncertain=y_prob.loc[(y_prob['Yes'] >= .5) & (y_prob['Yes'] <= .52)]
model_uncertain
My question:
How can I get the row in the X_test dataframe which is related to the first raw in model_uncertain data frame?
To make sure that I am getting the right row, I test it using passing the same row to
predict_proba using the following code as I should get the same result:
y_prob_3=lgb_fitted.predict_proba([X_test.iloc[3]])
y_prob_3
But the result is not the same.
I think I am not sending the correct row to predict_proba, as it should return the same value for a row.
What is the correct way to find the n row in model_uncertain and find the corresponding row in X_test data frame?

How can I get the row in the X_test dataframe which is related to the first raw in model_uncertain data frame?
You're on the right track:
>>> idx_of_first_uncertainty_row = model_uncertain.iloc[0].index
>>> row_in_test_data = X.loc[idx_of_first_uncertainty_row]
Yes, indexes are preserved between the original dataframe and its slices (unless you reset the index somewhere in between).
To make sure that I am getting the right row, I test it using passing the same row to predict_proba using the following code as I should get the same result (...) But the result is not the same.
Why do you think they're not the same? In the dataframe image you can't see all of the decimals. A better way to confirm if they're the same (well, really really similar) would be to use something like np.isclose to compare model_uncertain.iloc[0] (first row of dataframe) and X_train.loc[3] (row where index is 3):
>>> np.isclose(model_uncertain.iloc[0].values, X_train.loc[3].values)

Pandas deep copy and scikit learn min max scaler

I am trying to do the following
df = pd.read_csv('a.csv')
scaler = MinMaxScaler()
df_copy = df.copy(deep=True)
for i in range(1, len(df)):
df_chunk = df_copy.iloc[i,i+10]
df_chunk = scaler.fit_transform (df_chunk)
so each df_chunk should be a scaled data frame.
The issue is that some are not scaled correctly.
If I were to plot the scaled data points, a properly scaled data frame will look like a range of numbers scattered between 0 and 1 sort of evenly. But the data frames I get are in 2 extremes, with the first ~80% of the numbers in the 0.9 range, while the others near the 0.1 range.
So it feels like the first ~80% of the data got scaled twice by the scaler. I have already tried using pandas deep copy to solve this, but it doesn't seem to help.
If you have any idea, why?
I would really appreciate it.

I'm not too sure why you want to apply the scaler on chunks of your data. If you fear that your CSV may be too large, you would want to read the CSV by chunks in the read_csv operation and process those chunks.
Now onto your issue. You're re-fitting your scaler on every chunk which is why you're getting the weird results. You either have to fit the entire data with your scaler, or you have to online fit the data using the partial_fit method.
I'll provide you both solutions.
Solution #1: read and fit the entire data
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df = pd.read_csv('a.csv')
df_scaled = scaler.fit_transform(df)
Solution #2: read the csv by chunks, and online train
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
# first read the csv by chunks and update the scaler
for chunk in pd.read_csv('a.csv', chunksize=10):
scaler.partial_fit(chunk)
# read the csv again by chunks to transform the chunks
for chunk in pd.read_csv('a.csv', chunksize=10):
transformed = scaler.transform(chunk)
# not too sure what you want to do after this
# but you can either print the results of the transformation
# or write the transformed chunk to a new csv

How to find the features names of the coefficients using scikit linear regression?

I use scikit linear regression and if I change the order of the features, the coef are still printed in the same order, hence I would like to know the mapping of the feature with the coeff.
#training the model
model_1_features = ['sqft_living', 'bathrooms', 'bedrooms', 'lat', 'long']
model_2_features = model_1_features + ['bed_bath_rooms']
model_3_features = model_2_features + ['bedrooms_squared', 'log_sqft_living', 'lat_plus_long']
model_1 = linear_model.LinearRegression()
model_1.fit(train_data[model_1_features], train_data['price'])
model_2 = linear_model.LinearRegression()
model_2.fit(train_data[model_2_features], train_data['price'])
model_3 = linear_model.LinearRegression()
model_3.fit(train_data[model_3_features], train_data['price'])
# extracting the coef
print model_1.coef_
print model_2.coef_
print model_3.coef_

The trick is that right after you have trained your model, you know the order of the coefficients:
model_1 = linear_model.LinearRegression()
model_1.fit(train_data[model_1_features], train_data['price'])
print(list(zip(model_1.coef_, model_1_features)))
This will print the coefficients and the correct feature. (Tested with pandas DataFrame)
If you want to reuse the coefficients later you can also put them in a dictionary:
coef_dict = {}
for coef, feat in zip(model_1.coef_,model_1_features):
coef_dict[feat] = coef
(You can test it for yourself by training two models with the same features but, as you said, shuffled order of features.)

import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
coef_table = pd.DataFrame(list(X_train.columns)).copy()
coef_table.insert(len(coef_table.columns),"Coefs",regressor.coef_.transpose())

#Robin posted a great answer, but for me I had to make one tweak on it to work the way I wanted, and it was to refer to the dimension of the 'coef_' np.array that I wanted, namely modifying to this: model_1.coef_[0,:], as below:
coef_dict = {}
for coef, feat in zip(model_1.coef_[0,:],model_1_features):
coef_dict[feat] = coef
Then the dict was created as I pictured it, with {'feature_name' : coefficient_value} pairs.

Here is what I use for pretty printing of coefficients in Jupyter. I'm not sure I follow why order is an issue - as far as I know the order of the coefficients should match the order of the input data that you gave it.
Note that the first line assumes you have a Pandas data frame called df in which you originally stored the data prior to turning it into a numpy array for regression:
fieldList = np.array(list(df)).reshape(-1,1)
coeffs = np.reshape(np.round(clf.coef_,5),(-1,1))
coeffs=np.concatenate((fieldList,coeffs),axis=1)
print(pd.DataFrame(coeffs,columns=['Field','Coeff']))

Borrowing from Robin, but simplifying the syntax:
coef_dict = dict(zip(model_1_features, model_1.coef_))
Important note about zip: zip assumes its inputs are of equal length, making it especially important to confirm that the lengths of the features and coefficients match (which in more complicated models might not be the case). If one input is longer than the other, the longer input will have the values in its extra index positions cut off. Notice the missing 7 in the following example:
In [1]: [i for i in zip([1, 2, 3], [4, 5, 6, 7])]
Out[1]: [(1, 4), (2, 5), (3, 6)]

pd.DataFrame(data=regression.coef_, index=X_train.columns)

All of these answers were great but what personally worked for me was this, as the feature names I needed were the columns of my train_date dataframe:
pd.DataFrame(data=model_1.coef_,columns=train_data.columns)

Right after training the model, the coefficient values are stored in the variable model.coef_[0]. We can iterate over the column names and store the column name and their coefficient value in a dictionary.
model.fit(X_train,y)
# assuming all the columns except last one is used in training
columns = data.iloc[:,-1].columns
coef_dict = {}
for i in range(0,len(columns)):
coef_dict[columns[i]] = model.coef_[0][i]
Hope this helps!

As of scikit-learn version 1.0, the LinearRegression estimator has a feature_names_in_ attribute. From the docs:
feature_names_in_ : ndarray of shape (n_features_in_,)
Names of features seen during fit. Defined only when X has feature names that are all strings.
New in version 1.0.
Assuming you're fitting on a pandas.DataFrame (train_data), your estimators (model_1, model_2, and model_3) will have the attribute. You can line up your coefficients using any of the methods listed in previous answers, but I'm in favor of this one:
coef_series = pd.Series(
data=model_1.coef_,
index=model_1.feature_names_in_
)
A minimally reproducible example
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
# for repeatability
np.random.seed(0)
# random data
Xy = pd.DataFrame(
data=np.random.random((10, 3)),
columns=["x0", "x1", "y"]
)
# separate X and y
X = Xy.drop(columns="y")
y = Xy.y
# initialize estimator
lr = LinearRegression()
# fit to pandas.DataFrame
lr.fit(X, y)
# get coeficients and their respective feature names
coef_series = pd.Series(
data=lr.coef_,
index=lr.feature_names_in_
)
print(coef_series)
x0 0.230524
x1 -0.275611
dtype: float64

How to split Test and Train data such that there is garenteed at least one of each Class in each

I have some fairly unbalanced data I am trying to classify.
However, it is classifying fairly well.
To evaluate exactly how well, I must split the data into training and test subsets.
Right now I am doing that by the very simple measure of:
import numpy as np
corpus = pandas.DataFrame(..., columns=["data","label"]) # My data, simplified
train_index = np.random.rand(len(corpus))>0.2
training_data = corpus[train_index]
test_data = corpus[np.logical_not(train_index)]
This is nice and simple, but some of the classes occur very very rarely:
about 15 occur less than 100 times each in the corpus of over 50,000 cases, and two of them each occur only once.
I would like to partition my data corpus into test and training subsets such that:
If a class occurs less than twice, it is excluded from both
each class occurs at least once, in test and in training
The split into test and training is otherwise random
I can throw together something to do this,
(likely the simplest way is to remove things with less than 2 occurances) and then just resample til the spit has both on each side), but I wonder if there is a clean method that already exists.
I don't think that sklearn.cross_validation.train_test_split will do for this, but that it exists suggests that sklearn might have this kind of functionality.

The following meets your 3 conditions for partitioning the data into test and training:
#get rid of items with fewer than 2 occurrences.
corpus=corpus[corpus.groupby('label').label.transform(len)>1]
from sklearn.cross_validation import StratifiedShuffleSplit
sss=StratifiedShuffleSplit(corpus['label'].tolist(), 1, test_size=0.5, random_state=None)
train_index, test_index =list(*sss)
training_data=corpus.iloc[train_index]
test_data=corpus.iloc[test_index]
I've tested the code above by using the following fictitious dataframe:
#create random data with labels 0 to 39, then add 2 label case and one label case.
corpus=pd.DataFrame({'data':np.random.randn(49998),'label':np.random.randint(40,size=49998)})
corpus.loc[49998]=[random.random(),40]
corpus.loc[49999]=[random.random(),40]
corpus.loc[50000]=[random.random(),41]
Which produces the following output when testing the code:
test_data[test_data['label']==40]
Out[110]:
data label
49999 0.231547 40
training_data[training_data['label']==40]
Out[111]:
data label
49998 0.253789 40
test_data[test_data['label']==41]
Out[112]:
Empty DataFrame
Columns: [data, label]
Index: []
training_data[training_data['label']==41]
Out[113]:
Empty DataFrame
Columns: [data, label]
Index: []

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Shuffling data in dask - python

I recommend adding a column of random data to your dataframe and then using that to set the index: df = df.map_partitions(add_random_column_to_pandas_dataframe, ...) df = df.set_index('name-of-random-column')

For people here really just wanting to shuffle the rows as the title implies: This is costly import numpy as np random_idx = np.random.permutation(len(sd.index)) sd.assign(random_idx=random_idx) sd = sd.set_index('x', sorted=True)

Related

How to create synthetic data based on dataset with mixed data types for classification problem?

How indexing of sliced data frame works in pandas

Pandas deep copy and scikit learn min max scaler

How to find the features names of the coefficients using scikit linear regression?

How to split Test and Train data such that there is garenteed at least one of each Class in each

Categories

Resources