I am trying to build a classification model, but I don't have enough data. What would be the most appropriate way to create synthetic data based on my existing dataset if I have numerical and categorical features?
I looked at using Vine copulas as shown here: https://sdv.dev/Copulas/tutorials/03_Multivariate_Distributions.html#Vine-Copulas but sampling such copulas yields floats even for the columns that I would like to be integers (label-encoded categorical features), and I don't know how to convert those floats back to categorical features.
Toy code illustrating my problem is below:
import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn.datasets import fetch_openml
from copulas.multivariate import VineCopula, GaussianMultivariate
X, y = fetch_openml("titanic", version=1, as_frame=True, return_X_y=True)
X['label'] = y
# reducing features and removing nulls to keep things simple
X = X[['sex', 'age', 'fare', 'embarked', 'label']]
row_keep = X.isnull().sum(axis=1) == 0
df = X.loc[row_keep, :].copy()
df.reset_index(drop=True, inplace=True)
# encoding columns
cat_cols = ['sex', 'embarked', 'label']
num_cols = ['age', 'fare']
label_encoders = {}
for c in cat_cols:
    cat_proc = preprocessing.LabelEncoder()
    col_proc = cat_proc.fit_transform(df[c])
    df[c] = col_proc
    label_encoders[c] = cat_proc
# Fit a copula
copula = VineCopula('regular')
copula.fit(df)
# Sample synthetic data
df_synthetic = copula.sample(1000)
All the columns of df_synthetic are floats. How would I convert those back to ints that I can map back to categorical features?
Is there another way to augment this sort of dataset? It would be even better if it were performant enough to sample 7,000-10,000 new synthetic entries. The toy problem with 5 columns above took ~1 minute to sample 1,000 rows, and my real problem has 27 columns, which I imagine would take much longer.
To convert your columns to ints, use round() and then .astype(int):
df_synthetic["sex"] = round(df_synthetic["sex"]).astype(int)
df_synthetic["embarked"] = round(df_synthetic["embarked"]).astype(int)
df_synthetic["label"] = round(df_synthetic["label"]).astype(int)
You might have to adjust values manually (e.g. cap sex to [0, 1] if a larger or smaller value has been generated), but that will depend strongly on your data characteristics.
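If you also want to map the rounded codes back to the original categories, a minimal sketch (assuming the cat_cols list and label_encoders dict from the question are still in scope) is to clip each column to the encoder's valid range and then call inverse_transform:
# Sketch: clip the rounded codes to each encoder's valid range,
# then map them back to the original category labels.
for c in cat_cols:
    n_classes = len(label_encoders[c].classes_)
    codes = df_synthetic[c].round().clip(0, n_classes - 1).astype(int)
    df_synthetic[c] = label_encoders[c].inverse_transform(codes)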
I have a list of categorical columns within my dataframe that I am trying to OneHotEncode. I've used the following code for each of these columns individually, but cannot figure out how to iterate through my categoricals list to do the same. Does anyone know how to do this?
categoricals = ['bedrooms', 'bathrooms', 'floors', 'condition', 'grade',
'yr_built']
from sklearn.preprocessing import OneHotEncoder
bedrooms = df[['bedrooms']]
bed = OneHotEncoder(categories="auto", sparse=False, handle_unknown="ignore")
bed.fit(bedrooms)
bed_encoded = bed.transform(bedrooms)
bed_encoded = pd.DataFrame(
    bed_encoded,
    columns=bed.categories_[0],
    index=df.index
)
df.drop("bedrooms", axis=1, inplace=True)
df = pd.concat([df, bed_encoded], axis=1)
Method 1:
Label-encode the columns with an ordinal encoder like LabelEncoder first, then do the one-hot encoding and rebuild the DataFrame.
categorical_cols = ['bedrooms', 'bathrooms', 'floors', 'condition', 'grade',
'yr_built']
from sklearn.preprocessing import LabelEncoder
# instantiate labelencoder object
le = LabelEncoder()
# apply le on categorical feature columns
# data is the dataframe
data[categorical_cols] = data[categorical_cols].apply(lambda col: le.fit_transform(col))
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder(sparse=False)  # dense output so the array can go straight into a DataFrame
#One-hot-encode the categorical columns.
#outputs an array instead of dataframe.
array_hot_encoded = ohe.fit_transform(data[categorical_cols])
#Convert it to df
data_hot_encoded = pd.DataFrame(array_hot_encoded, index=data.index)
#Extract only the columns that are numeric and don't need to be encoded
data_numeric_cols = data.drop(columns=categorical_cols)
#Concatenate the two dataframes :
data_out = pd.concat([data_hot_encoded, data_numeric_cols], axis=1)
You could also use pd.factorize() to map the categorical data to ordinal data.
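For instance, a minimal sketch of the pd.factorize() route (reusing the categorical_cols list and the data DataFrame from above) could look like this:
# Sketch: factorize each categorical column in place, keeping the uniques
# so the integer codes can be mapped back to categories later if needed.
uniques = {}
for col in categorical_cols:
    codes, cats = pd.factorize(data[col])
    data[col] = codes
    uniques[col] = cats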
Method 2:
Use pd.get_dummies(), which does the one-hot encoding directly from the raw data (no need to convert to ordinal data first):
import pandas as pd
df = pd.get_dummies(data, columns = categorical_cols)
I have two DataFrames and in each of them I have a categorical column col. I want to replace all the categories with numbers, so I decided to do it in this fashion:
df1['col'] = pd.factorize(df1['col'])[0]
Now the question is: how can I encode df2['col'] in the same way? And how can I also encode categories that are present in df2['col'] but not in df1['col']?
You need a LabelEncoder:
from sklearn.preprocessing import LabelEncoder
enc = LabelEncoder()
df1['col'] = enc.fit_transform(df1['col'])
df2['col'] = enc.transform(df2['col'])
For unseen labels, this may be a solution:
enc = LabelEncoder()
enc.fit(df1['col'])
diz_map = dict(zip(enc.classes_, enc.transform(enc.classes_)+1))
for i in set(df2['col']).difference(df1['col']):
    diz_map[i] = 0
df1['col'] = [diz_map[i] for i in df1['col'].values]
df2['col'] = [diz_map[i] for i in df2['col'].values]
All values in df2['col'] that were not seen in df1['col'] are mapped to 0 (the known labels are shifted up by 1 so that 0 is free for unseen categories).
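Equivalently, the loop over unseen values can be replaced with a vectorized sketch: .map() returns NaN for categories missing from diz_map (i.e. not seen in df1['col']), which you then fill with 0:
# Sketch: unseen categories become NaN after .map(), then get filled with 0
df2['col'] = df2['col'].map(diz_map).fillna(0).astype(int)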
I have this code that normalizes a pandas dataframe.
import numpy as np; import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
from sklearn import preprocessing
df = pd.read_csv('DS/RS_DS/final_dataset.csv')
rec_df = df.drop(['person_id','encounter_id','birthdate','CN','HN','DN','DIAG_DM','DIAG_NONDM','TPN'], axis=1)
#normalize values from 0 to 1
df_val = rec_df.values
min_max_scaler = preprocessing.MinMaxScaler()
df_val_scaled = min_max_scaler.fit_transform(df_val)
df_scaled = pd.DataFrame(df_val_scaled)
df_flask = pd.DataFrame([[42.8,151,73,79,0,1,74]],columns=['weight','height','wc','hc','isMale','isFemale','age'])
df_flask_val = df_flask.values
df_flask_val_scaled = min_max_scaler.fit_transform(df_flask_val)
df_flask_scaled = pd.DataFrame(df_flask_val_scaled)
df_scaled returns a normalized dataframe. df_flask is a dataframe that I want to normalize based on df_scaled so I can use it for comparison. df_flask_scaled comes back as all zeros, so I think it wasn't normalized based on the original dataframe. Is there any way to normalize the single-row dataframe, or should I add this data to the dataframe and then normalize?
I think you should do fit and transform separately. This ensures that the distribution of the data used in fitting is preserved when transforming other data.
# initialise scaler
min_max_scaler = preprocessing.MinMaxScaler()
# fit here
min_max_scaler.fit(rec_df.values)
# apply transformation
df_val_scaled = min_max_scaler.transform(rec_df.values)
df_flask_val_scaled = min_max_scaler.transform(df_flask_val)
I want to apply scaling (using StandardScaler() from sklearn.preprocessing) to a pandas dataframe. The following code returns a numpy array, so I lose all the column names and indices. This is not what I want.
features = df[["col1", "col2", "col3", "col4"]]
autoscaler = StandardScaler()
features = autoscaler.fit_transform(features)
A "solution" I found online is:
features = features.apply(lambda x: autoscaler.fit_transform(x))
It appears to work, but leads to a DeprecationWarning:
/usr/lib/python3.5/site-packages/sklearn/preprocessing/data.py:583:
DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17
and will raise ValueError in 0.19. Reshape your data either using
X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1)
if it contains a single sample.
I therefore tried:
features = features.apply(lambda x: autoscaler.fit_transform(x.reshape(-1, 1)))
But this gives:
Traceback (most recent call last):
  File "./analyse.py", line 91, in <module>
    features = features.apply(lambda x: autoscaler.fit_transform(x.reshape(-1, 1)))
  File "/usr/lib/python3.5/site-packages/pandas/core/frame.py", line 3972, in apply
    return self._apply_standard(f, axis, reduce=reduce)
  File "/usr/lib/python3.5/site-packages/pandas/core/frame.py", line 4081, in _apply_standard
    result = self._constructor(data=results, index=index)
  File "/usr/lib/python3.5/site-packages/pandas/core/frame.py", line 226, in __init__
    mgr = self._init_dict(data, index, columns, dtype=dtype)
  File "/usr/lib/python3.5/site-packages/pandas/core/frame.py", line 363, in _init_dict
    dtype=dtype)
  File "/usr/lib/python3.5/site-packages/pandas/core/frame.py", line 5163, in _arrays_to_mgr
    arrays = _homogenize(arrays, index, dtype)
  File "/usr/lib/python3.5/site-packages/pandas/core/frame.py", line 5477, in _homogenize
    raise_cast_failure=False)
  File "/usr/lib/python3.5/site-packages/pandas/core/series.py", line 2885, in _sanitize_array
    raise Exception('Data must be 1-dimensional')
Exception: Data must be 1-dimensional
How do I apply scaling to the pandas dataframe, leaving the dataframe intact? Without copying the data if possible.
You could convert the DataFrame to a numpy array using as_matrix(). Example on a random dataset:
Edit:
Changing as_matrix() to values (it doesn't change the result), per the last sentence of the as_matrix() docs:
Generally, it is recommended to use ‘.values’.
import pandas as pd
import numpy as np #for the random integer example
df = pd.DataFrame(np.random.randint(0.0, 100.0, size=(10, 4)),
                  index=range(10, 20),
                  columns=['col1', 'col2', 'col3', 'col4'],
                  dtype='float64')
Note, indices are 10-19:
In [14]: df.head(3)
Out[14]:
col1 col2 col3 col4
10 3 38 86 65
11 98 3 66 68
12 88 46 35 68
Now fit_transform the DataFrame to get the scaled_features array:
from sklearn.preprocessing import StandardScaler
scaled_features = StandardScaler().fit_transform(df.values)
In [15]: scaled_features[:3,:] #lost the indices
Out[15]:
array([[-1.89007341, 0.05636005, 1.74514417, 0.46669562],
[ 1.26558518, -1.35264122, 0.82178747, 0.59282958],
[ 0.93341059, 0.37841748, -0.60941542, 0.59282958]])
Assign the scaled data to a DataFrame (note: use the index and columns keyword arguments to keep your original indices and column names):
scaled_features_df = pd.DataFrame(scaled_features, index=df.index, columns=df.columns)
In [17]: scaled_features_df.head(3)
Out[17]:
col1 col2 col3 col4
10 -1.890073 0.056360 1.745144 0.466696
11 1.265585 -1.352641 0.821787 0.592830
12 0.933411 0.378417 -0.609415 0.592830
Edit 2:
Came across the sklearn-pandas package. It's focused on making scikit-learn easier to use with pandas. sklearn-pandas is especially useful when you need to apply more than one type of transformation to column subsets of the DataFrame, a more common scenario. It's documented, but this is how you'd achieve the transformation we just performed.
from sklearn_pandas import DataFrameMapper
mapper = DataFrameMapper([(df.columns, StandardScaler())])
scaled_features = mapper.fit_transform(df.copy(), 4)
scaled_features_df = pd.DataFrame(scaled_features, index=df.index, columns=df.columns)
import pandas as pd
from sklearn.preprocessing import StandardScaler
df = pd.read_csv('your file here')
ss = StandardScaler()
df_scaled = pd.DataFrame(ss.fit_transform(df),columns = df.columns)
The df_scaled will be the 'same' dataframe, only now with the scaled values.
Reassigning back to df.values preserves both index and columns.
df.values[:] = StandardScaler().fit_transform(df)
features = ["col1", "col2", "col3", "col4"]
autoscaler = StandardScaler()
df[features] = autoscaler.fit_transform(df[features])
This worked for me with MinMaxScaler to get the array values back into the original dataframe; it should work with StandardScaler as well.
data_scaled = pd.DataFrame(scaled_features, index=df.index, columns=df.columns)
where data_scaled is the new dataframe, scaled_features is the array after normalization, and df is the original dataframe whose index and columns we want back.
Works for me:
from sklearn.preprocessing import StandardScaler
cols = list(train_df_x_num.columns)
scaler = StandardScaler()
train_df_x_num[cols] = scaler.fit_transform(train_df_x_num[cols])
This is what I did:
X.Column1 = StandardScaler().fit_transform(X.Column1.values.reshape(-1, 1))
Since scikit-learn version 1.2, estimators can return a DataFrame that keeps the column names.
The output can be configured per estimator by calling the set_output method, or globally by setting set_config(transform_output="pandas").
See Release Highlights for scikit-learn 1.2 - Pandas output with set_output API
Example for set_output():
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler().set_output(transform="pandas")
Example for set_config():
from sklearn import set_config
set_config(transform_output="pandas")
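With either configuration, fit_transform then returns a DataFrame instead of a numpy array. A minimal sketch (using a small toy DataFrame rather than the question's data):
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({"col1": [1.0, 2.0, 3.0], "col2": [10.0, 20.0, 30.0]})
scaler = StandardScaler().set_output(transform="pandas")
scaled = scaler.fit_transform(df)  # DataFrame, keeps the original index and column names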
You can mix multiple data types in scikit-learn using Neuraxle:
Option 1: discard the row names and column names
from neuraxle.pipeline import Pipeline
from neuraxle.base import NonFittableMixin, BaseStep
class PandasToNumpy(NonFittableMixin, BaseStep):
    def transform(self, data_inputs, expected_outputs):
        return data_inputs.values

pipeline = Pipeline([
    PandasToNumpy(),
    StandardScaler(),
])
Then, you proceed as you intended:
features = df[["col1", "col2", "col3", "col4"]] # ... your df data
pipeline, scaled_features = pipeline.fit_transform(features)
Option 2: to keep the original column names and row names
You could even do this with a wrapper as such:
from neuraxle.pipeline import Pipeline
from neuraxle.base import MetaStepMixin, BaseStep
class PandasValuesChangerOf(MetaStepMixin, BaseStep):
    def transform(self, data_inputs, expected_outputs):
        new_data_inputs = self.wrapped.transform(data_inputs.values)
        new_data_inputs = self._merge(data_inputs, new_data_inputs)
        return new_data_inputs

    def fit_transform(self, data_inputs, expected_outputs):
        self.wrapped, new_data_inputs = self.wrapped.fit_transform(data_inputs.values)
        new_data_inputs = self._merge(data_inputs, new_data_inputs)
        return self, new_data_inputs

    def _merge(self, data_inputs, new_data_inputs):
        new_data_inputs = pd.DataFrame(
            new_data_inputs,
            index=data_inputs.index,
            columns=data_inputs.columns
        )
        return new_data_inputs
df_scaler = PandasValuesChangerOf(StandardScaler())
Then, you proceed as you intended:
features = df[["col1", "col2", "col3", "col4"]] # ... your df data
df_scaler, scaled_features = df_scaler.fit_transform(features)
You can try this code; it will give you a DataFrame with indexes:
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_boston # boston housing dataset
dt= load_boston().data
col= load_boston().feature_names
# Make a dataframe
df = pd.DataFrame(data=dt, columns=col)
# define a method to scale data, looping thru the columns, and passing a scaler
def scale_data(data, columns, scaler):
    for col in columns:
        data[col] = scaler.fit_transform(data[col].values.reshape(-1, 1))
    return data
# specify a scaler, and call the method on boston data
scaler = StandardScaler()
df_scaled = scale_data(df, col, scaler)
# view first 10 rows of the scaled dataframe
df_scaled[0:10]
You could directly assign a numpy array to a data frame by using slicing.
from sklearn.preprocessing import StandardScaler
features = df[["col1", "col2", "col3", "col4"]]
autoscaler = StandardScaler()
features[:] = autoscaler.fit_transform(features.values)