CSV normalization in Python

I'm working with a CSV file that contains medical data, and I want to use it for an ML model. Before running the model, I want to normalize the data to the range 0 to 1. Below is my script, but it produces an error. How do I resolve it?
Sample input file
import pandas as pd
import scipy as sp
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from pandas import read_csv
Data = ('Medical_Data.csv')
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(Data, names=names)
array = dataframe.values
# separate array into input and output components
X = array[:,0:8]
Y = array[:,8]
scaler = MinMaxScaler(feature_range=(0, 1))
rescaledX = scaler.fit_transform(X)
# summarize transformed data
np.set_printoptions(precision=3)
print(rescaledX[0:5,:])
Error message:
could not convert string to float: 'preg'

You're performing pd.read_csv twice. Data will already be a DataFrame, and you cannot perform pd.read_csv on a DataFrame.
---- UPDATE
names needs to be defined before read_csv. Please refer to https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html .
import pandas as pd
import scipy as sp
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from pandas import read_csv
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv('Medical_Data.csv', names=names)
array = dataframe.values
# separate array into input and output components
X = array[:,0:8]
Y = array[:,8]
scaler = MinMaxScaler(feature_range=(0, 1))
rescaledX = scaler.fit_transform(X)
# summarize transformed data
np.set_printoptions(precision=3)
print(rescaledX[0:5,:])
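If the error still appears with this code, the usual cause is that Medical_Data.csv already contains a header row: passing names= makes pandas read that header line as a data row, so the string 'preg' ends up in X and cannot be converted to float. A minimal sketch, assuming the file does have a header row, is to tell read_csv to skip it with header=0:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
# header=0 means the first line of the file is a header; it is replaced by the
# names list instead of being read as a data row
dataframe = pd.read_csv('Medical_Data.csv', names=names, header=0)
X = dataframe[names[:8]].values
rescaledX = MinMaxScaler(feature_range=(0, 1)).fit_transform(X)
print(rescaledX[:5])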

You don't have to use pd.read_csv twice. Just use it like this:
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
Data = pd.read_csv('Medical_Data.csv',names=names)
Also, if you want to get the DataFrame's columns, use this code:
columns = Data.columns
Data.columns will return the list of columns.

Related

How to create a for loop for OneHotEncoder

I have a list of categorical columns within my dataframe that I am trying to OneHotEncode. I've used the following code for each of these columns individually, but cannot figure out how to iterate through my categoricals list to do the same. Does anyone know how to do this?
categoricals = ['bedrooms', 'bathrooms', 'floors', 'condition', 'grade',
'yr_built']
from sklearn.preprocessing import OneHotEncoder
bedrooms = df[['bedrooms']]
bed = OneHotEncoder(categories="auto", sparse=False, handle_unknown="ignore")
bed.fit(bedrooms)
bed_encoded = bed.transform(bedrooms)
bed_encoded = pd.DataFrame(
bed_encoded,
columns=bed.categories_[0],
index=df.index
)
df.drop("bedrooms", axis=1, inplace=True)
df = pd.concat([df, bed_encoded], axis=1)
Method 1:
You can use an ordinal encoder such as LabelEncoder first and then do the one-hot encoding, building the DataFrame from the result.
categorical_cols = ['bedrooms', 'bathrooms', 'floors', 'condition', 'grade',
'yr_built']
from sklearn.preprocessing import LabelEncoder
# instantiate labelencoder object
le = LabelEncoder()
# apply le on categorical feature columns
# data is the dataframe
data[categorical_cols] = data[categorical_cols].apply(lambda col: le.fit_transform(col))
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder()
#One-hot-encode the categorical columns.
#outputs an array instead of dataframe.
array_hot_encoded = ohe.fit_transform(data[categorical_cols]).toarray()
#Convert it to df
data_hot_encoded = pd.DataFrame(array_hot_encoded, index=data.index)
#Extract only the columns that are numeric and don't need to be encoded
data_numeric_cols = data.drop(columns=categorical_cols)
#Concatenate the two dataframes :
data_out = pd.concat([data_hot_encoded, data_numeric_cols], axis=1)
You could also use pd.factorize() to map the categorical data to ordinal data.
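For example, a small sketch of pd.factorize() on a made-up column:
import pandas as pd
df = pd.DataFrame({'condition': ['good', 'fair', 'good', 'poor']})
# factorize returns integer codes plus the unique categories they map to
codes, uniques = pd.factorize(df['condition'])
df['condition'] = codes          # [0, 1, 0, 2]
print(uniques)                   # Index(['good', 'fair', 'poor'], dtype='object')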
Method 2:
Use pd.get_dummies() to do the one-hot encoding directly from the raw data (you don't have to convert to ordinal data first).
import pandas as pd
df = pd.get_dummies(data, columns = categorical_cols)
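To answer the literal question of iterating over the list, the per-column OneHotEncoder code from the question can also be wrapped in a loop. This is a sketch assuming df and categoricals are defined as in the question:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
for col in categoricals:
    enc = OneHotEncoder(handle_unknown="ignore")
    # .toarray() densifies the sparse result; newer scikit-learn also accepts
    # sparse_output=False on the encoder instead
    encoded = enc.fit_transform(df[[col]]).toarray()
    encoded_df = pd.DataFrame(
        encoded,
        columns=[f"{col}_{c}" for c in enc.categories_[0]],
        index=df.index,
    )
    # drop the original column and append its one-hot columns
    df = pd.concat([df.drop(columns=col), encoded_df], axis=1)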

How can I transform a 2d array to a pandas dataframe in python

Currently, I'm working on the Titanic dataset on Kaggle. The Age column has some missing values, and I tried to impute them using SimpleImputer from sklearn.impute.
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error as mae
from sklearn.model_selection import train_test_split as tts
from sklearn.impute import SimpleImputer
titanic_data = pd.read_csv("../input/titanic/train.csv")
imputer = SimpleImputer(missing_values=np.nan)
features = ['Age', 'Pclass']
X = titanic_data[features]
y = titanic_data.Survived
age_arr = X.Age.values.reshape(1, -1)
imputed_age = pd.DataFrame(imputer.fit_transform(age_arr))
X.Age = imputed_age
print(imputed_age)
As shown above, I'm having trouble arranging and converting those arrays and data columns. I need a proper way to get them back into a single Age column. When I print imputed_age, it gives me a dataframe where each age is a column; I want all of these in the same column, and I'd like a clean way to do the imputing.
How could I put those imputed values into the dataframe?
I asked this on a forum elsewhere and someone gave me a solution. I'll put it here, and I've modified it a bit.
import pandas as pd
import seaborn as sns
from sklearn.impute import SimpleImputer
df = sns.load_dataset("titanic")
features = ["pclass","age"]
X = df.loc[:,features]
y = df.survived
imputer = SimpleImputer()
age_transform = pd.DataFrame(imputer.fit_transform(pd.DataFrame(X.age)),columns=["Age"])
I checked your code and found that if we pass a DataFrame to imputer.fit_transform, we don't need to reshape to (1, -1).
So I just made the age column into a DataFrame, passed it to the imputer's fit_transform, and it works well.
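In other words, a one-column DataFrame such as titanic_data[['Age']] already has shape (n_samples, 1), so no reshape is needed and the result can be assigned straight back to the column. A minimal sketch reusing the question's variable names:
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
# fit_transform returns an (n_samples, 1) array; ravel() flattens it so it
# drops cleanly back into the single Age column
titanic_data['Age'] = imputer.fit_transform(titanic_data[['Age']]).ravel()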

Question regarding OneHotEncoding - Python

I'm working on a project applying the One Hot Encoding technique to a categorical column of a .binetflow file.
CODE:
import pandas as pd
from sklearn.preprocessing import LabelEncoder,OneHotEncoder
mydataset = pd.read_csv('originalfiletest.binetflow')
le = LabelEncoder()
dfle = mydataset
dfle.State = le.fit_transform(dfle.State)
X = dfle[['State']].values
ohe = OneHotEncoder()
Onehot = ohe.fit_transform(X).toarray()
dfle['State'] = Onehot
mydataset.to_csv('newfiletest.binetflow', columns=['Dur','State','TotBytes','average_packet_size','average_bits_psecond'], index=False)
Original binetflow file
Currently, I'm using pandas and I'm able to apply the technique. The problem is when I need to write to the second file.
When I write, the output I'm expecting in the Onehot variable is, for example, 0001 or 0.0.0.1, but what I get is either 0.0 or 1.0 when I try to pass it to the column dfle['State'].
The images can be found below.
variable Onehot
column dfle['State']
Moreover, for the column that should be written as-is, it prints correctly, but when it is written to the file it gains a few extra decimal places.
Original and new binetflow file
Onehot is a numpy array, and the problem is in your assignment of the array to a dataframe column:
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
mydataset = pd.DataFrame(data={'State': ['a', 'a', 'b', 'c', 'a', 'd']})
le = LabelEncoder()
mydataset.State = le.fit_transform(mydataset.State)
X = mydataset[['State']].values
ohe = OneHotEncoder()
Onehot = ohe.fit_transform(X).toarray()
dx = pd.DataFrame(data=Onehot)
mydataset['State'] = (dx[dx.columns[0:]].apply(lambda x: ','.join(x.dropna().astype(int).astype(str)), axis=1))
mydataset.to_csv('newfiletest.binetflow',
columns=['Dur', 'State', 'TotBytes', 'average_packet_size', 'average_bits_psecond'], index=False)
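If the goal is one 0/1 column per category rather than a single joined string, an alternative sketch (using the same toy State data, not the asker's actual file) is to expand the encoded array into separate named columns and concatenate them:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
mydataset = pd.DataFrame(data={'State': ['a', 'a', 'b', 'c', 'a', 'd']})
ohe = OneHotEncoder()
onehot = ohe.fit_transform(mydataset[['State']]).toarray()
# one 0/1 column per observed category: State_a, State_b, State_c, State_d
state_cols = pd.DataFrame(onehot,
                          columns=[f'State_{c}' for c in ohe.categories_[0]],
                          index=mydataset.index)
mydataset = pd.concat([mydataset.drop(columns='State'), state_cols], axis=1)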

Normalize input data based on a normalized dataset

I have this code that normalizes a pandas dataframe.
import numpy as np; import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
from sklearn import preprocessing
df = pd.read_csv('DS/RS_DS/final_dataset.csv')
rec_df = df.drop(['person_id','encounter_id','birthdate','CN','HN','DN','DIAG_DM','DIAG_NONDM','TPN'], axis=1)
#normalize values from 0 to 1
df_val = rec_df.values
min_max_scaler = preprocessing.MinMaxScaler()
df_val_scaled = min_max_scaler.fit_transform(df_val)
df_scaled = pd.DataFrame(df_val_scaled)
df_flask = pd.DataFrame([[42.8,151,73,79,0,1,74]],columns=['weight','height','wc','hc','isMale','isFemale','age'])
df_flask_val = df_flask.values
df_flask_val_scaled = min_max_scaler.fit_transform(df_flask_val)
df_flask_scaled = pd.DataFrame(df_flask_val_scaled)
df_scaled is a normalized dataframe. df_flask is a dataframe that I want to normalize based on df_scaled so I can use it for comparison. df_flask_scaled comes back as all 0s, so I think it didn't normalize based on the original dataframe. Is there any way to normalize the single-row dataframe, or should I add this data to the dataframe and then compute the normalization?
I think you should do fit and transform separately. This ensures that the distribution of the data used for fitting is maintained.
# initialise scaler
min_max_scaler = preprocessing.MinMaxScaler()
# fit here
min_max_scaler.fit(rec_df.values)
# apply transformation
df_val_scaled = min_max_scaler.transform(rec_df.values)
df_flask_val_scaled = min_max_scaler.transform(df_flask_val)
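A self-contained sketch of the same idea with made-up numbers, showing that the single new row is scaled with the training data's min/max rather than its own:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
train = pd.DataFrame({'weight': [40.0, 60.0, 80.0], 'age': [20, 40, 60]})
new_row = pd.DataFrame({'weight': [50.0], 'age': [30]})
scaler = MinMaxScaler()
scaler.fit(train)                  # learn min/max from the training data only
print(scaler.transform(new_row))   # [[0.25 0.25]], not all zeros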

How to use sklearn fit_transform with pandas and return dataframe instead of numpy array?

I want to apply scaling (using StandardScaler() from sklearn.preprocessing) to a pandas dataframe. The following code returns a numpy array, so I lose all the column names and indices. This is not what I want.
features = df[["col1", "col2", "col3", "col4"]]
autoscaler = StandardScaler()
features = autoscaler.fit_transform(features)
A "solution" I found online is:
features = features.apply(lambda x: autoscaler.fit_transform(x))
It appears to work, but it leads to a DeprecationWarning:
/usr/lib/python3.5/site-packages/sklearn/preprocessing/data.py:583:
DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17
and will raise ValueError in 0.19. Reshape your data either using
X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1)
if it contains a single sample.
I therefore tried:
features = features.apply(lambda x: autoscaler.fit_transform(x.reshape(-1, 1)))
But this gives:
Traceback (most recent call last):
  File "./analyse.py", line 91, in <module>
    features = features.apply(lambda x: autoscaler.fit_transform(x.reshape(-1, 1)))
  File "/usr/lib/python3.5/site-packages/pandas/core/frame.py", line 3972, in apply
    return self._apply_standard(f, axis, reduce=reduce)
  File "/usr/lib/python3.5/site-packages/pandas/core/frame.py", line 4081, in _apply_standard
    result = self._constructor(data=results, index=index)
  File "/usr/lib/python3.5/site-packages/pandas/core/frame.py", line 226, in __init__
    mgr = self._init_dict(data, index, columns, dtype=dtype)
  File "/usr/lib/python3.5/site-packages/pandas/core/frame.py", line 363, in _init_dict
    dtype=dtype)
  File "/usr/lib/python3.5/site-packages/pandas/core/frame.py", line 5163, in _arrays_to_mgr
    arrays = _homogenize(arrays, index, dtype)
  File "/usr/lib/python3.5/site-packages/pandas/core/frame.py", line 5477, in _homogenize
    raise_cast_failure=False)
  File "/usr/lib/python3.5/site-packages/pandas/core/series.py", line 2885, in _sanitize_array
    raise Exception('Data must be 1-dimensional')
Exception: Data must be 1-dimensional
How do I apply scaling to the pandas dataframe, leaving the dataframe intact? Without copying the data if possible.
You could convert the DataFrame to a numpy array using as_matrix(). Example on a random dataset:
Edit:
Changed as_matrix() to values (it doesn't change the result), per the last sentence of the as_matrix() docs:
Generally, it is recommended to use ‘.values’.
import pandas as pd
import numpy as np #for the random integer example
df = pd.DataFrame(np.random.randint(0.0,100.0,size=(10,4)),
index=range(10,20),
columns=['col1','col2','col3','col4'],
dtype='float64')
Note, indices are 10-19:
In [14]: df.head(3)
Out[14]:
col1 col2 col3 col4
10 3 38 86 65
11 98 3 66 68
12 88 46 35 68
Now fit_transform the DataFrame to get the scaled_features array:
from sklearn.preprocessing import StandardScaler
scaled_features = StandardScaler().fit_transform(df.values)
In [15]: scaled_features[:3,:] #lost the indices
Out[15]:
array([[-1.89007341, 0.05636005, 1.74514417, 0.46669562],
[ 1.26558518, -1.35264122, 0.82178747, 0.59282958],
[ 0.93341059, 0.37841748, -0.60941542, 0.59282958]])
Assign the scaled data to a DataFrame (note: use the index and columns keyword arguments to keep your original indices and column names):
scaled_features_df = pd.DataFrame(scaled_features, index=df.index, columns=df.columns)
In [17]: scaled_features_df.head(3)
Out[17]:
col1 col2 col3 col4
10 -1.890073 0.056360 1.745144 0.466696
11 1.265585 -1.352641 0.821787 0.592830
12 0.933411 0.378417 -0.609415 0.592830
Edit 2:
Came across the sklearn-pandas package. It's focused on making scikit-learn easier to use with pandas. sklearn-pandas is especially useful when you need to apply more than one type of transformation to column subsets of the DataFrame, a more common scenario. It's documented, but this is how you'd achieve the transformation we just performed.
from sklearn_pandas import DataFrameMapper
mapper = DataFrameMapper([(df.columns, StandardScaler())])
scaled_features = mapper.fit_transform(df.copy(), 4)
scaled_features_df = pd.DataFrame(scaled_features, index=df.index, columns=df.columns)
import pandas as pd
from sklearn.preprocessing import StandardScaler
df = pd.read_csv('your file here')
ss = StandardScaler()
df_scaled = pd.DataFrame(ss.fit_transform(df),columns = df.columns)
df_scaled will be the 'same' dataframe, only now with the scaled values.
Reassigning back to df.values preserves both index and columns.
df.values[:] = StandardScaler().fit_transform(df)
features = ["col1", "col2", "col3", "col4"]
autoscaler = StandardScaler()
df[features] = autoscaler.fit_transform(df[features])
This worked with MinMaxScaler for getting the array values back into the original dataframe. It should work with StandardScaler as well.
data_scaled = pd.DataFrame(scaled_features, index=df.index, columns=df.columns)
where data_scaled is the new dataframe, scaled_features is the array after normalization, and df is the original dataframe whose index and columns we need back.
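For example, a small self-contained sketch of that pattern (made-up column names):
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
df = pd.DataFrame({'a': [1.0, 2.0, 3.0], 'b': [10.0, 20.0, 30.0]})
scaled_features = MinMaxScaler().fit_transform(df.values)
# rebuild the dataframe with the original index and column names
data_scaled = pd.DataFrame(scaled_features, index=df.index, columns=df.columns)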
Works for me:
from sklearn.preprocessing import StandardScaler
cols = list(train_df_x_num.columns)
scaler = StandardScaler()
train_df_x_num[cols] = scaler.fit_transform(train_df_x_num[cols])
This is what I did:
X.Column1 = StandardScaler().fit_transform(X.Column1.values.reshape(-1, 1))
Since scikit-learn version 1.2, estimators can return a DataFrame that keeps the column names.
The output can be configured per estimator by calling the set_output method, or globally by setting set_config(transform_output="pandas").
See Release Highlights for scikit-learn 1.2 - Pandas output with set_output API
Example for set_output():
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler().set_output(transform="pandas")
Example for set_config():
from sklearn import set_config
set_config(transform_output="pandas")
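A short sketch (made-up column names) showing that the transformed output keeps the DataFrame's column names:
import pandas as pd
from sklearn.preprocessing import StandardScaler
df = pd.DataFrame({'col1': [1.0, 2.0, 3.0], 'col2': [10.0, 20.0, 30.0]})
scaler = StandardScaler().set_output(transform="pandas")
scaled = scaler.fit_transform(df)
print(type(scaled))     # <class 'pandas.core.frame.DataFrame'>
print(scaled.columns)   # Index(['col1', 'col2'], dtype='object')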
You can mix multiple data types in scikit-learn using Neuraxle:
Option 1: discard the row names and column names
from neuraxle.pipeline import Pipeline
from neuraxle.base import NonFittableMixin, BaseStep
class PandasToNumpy(NonFittableMixin, BaseStep):
    def transform(self, data_inputs, expected_outputs):
        return data_inputs.values

pipeline = Pipeline([
    PandasToNumpy(),
    StandardScaler(),
])
Then, you proceed as you intended:
features = df[["col1", "col2", "col3", "col4"]] # ... your df data
pipeline, scaled_features = pipeline.fit_transform(features)
Option 2: to keep the original column names and row names
You could even do this with a wrapper as such:
from neuraxle.pipeline import Pipeline
from neuraxle.base import MetaStepMixin, BaseStep
class PandasValuesChangerOf(MetaStepMixin, BaseStep):
    def transform(self, data_inputs, expected_outputs):
        new_data_inputs = self.wrapped.transform(data_inputs.values)
        new_data_inputs = self._merge(data_inputs, new_data_inputs)
        return new_data_inputs

    def fit_transform(self, data_inputs, expected_outputs):
        self.wrapped, new_data_inputs = self.wrapped.fit_transform(data_inputs.values)
        new_data_inputs = self._merge(data_inputs, new_data_inputs)
        return self, new_data_inputs

    def _merge(self, data_inputs, new_data_inputs):
        new_data_inputs = pd.DataFrame(
            new_data_inputs,
            index=data_inputs.index,
            columns=data_inputs.columns
        )
        return new_data_inputs
df_scaler = PandasValuesChangerOf(StandardScaler())
Then, you proceed as you intended:
features = df[["col1", "col2", "col3", "col4"]] # ... your df data
df_scaler, scaled_features = df_scaler.fit_transform(features)
You can try this code; it will give you a DataFrame with indexes:
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_boston # boston housing dataset
dt= load_boston().data
col= load_boston().feature_names
# Make a dataframe
df = pd.DataFrame(data=dt, columns=col)
# define a method to scale data, looping thru the columns, and passing a scaler
def scale_data(data, columns, scaler):
    for col in columns:
        data[col] = scaler.fit_transform(data[col].values.reshape(-1, 1))
    return data
# specify a scaler, and call the method on boston data
scaler = StandardScaler()
df_scaled = scale_data(df, col, scaler)
# view first 10 rows of the scaled dataframe
df_scaled[0:10]
You could directly assign a numpy array to a data frame by using slicing.
from sklearn.preprocessing import StandardScaler
features = df[["col1", "col2", "col3", "col4"]]
autoscaler = StandardScaler()
features[:] = autoscaler.fit_transform(features.values)
