normalize input data based on a normalized dataset - python

I have this code that normalizes a pandas dataframe.
import numpy as np; import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
from sklearn import preprocessing
df = pd.read_csv('DS/RS_DS/final_dataset.csv')
rec_df = df.drop(['person_id','encounter_id','birthdate','CN','HN','DN','DIAG_DM','DIAG_NONDM','TPN'], axis=1)
#normalize values from 0 to 1
df_val = rec_df.values
min_max_scaler = preprocessing.MinMaxScaler()
df_val_scaled = min_max_scaler.fit_transform(df_val)
df_scaled = pd.DataFrame(df_val_scaled)
df_flask = pd.DataFrame([[42.8,151,73,79,0,1,74]],columns=['weight','height','wc','hc','isMale','isFemale','age'])
df_flask_val = df_flask.values
df_flask_val_scaled = min_max_scaler.fit_transform(df_flask_val)
df_flask_scaled = pd.DataFrame(df_flask_val_scaled)
df_scaled returns a dataframe that is normalized. df_flask is a dataframe that I want to normalize based on df_scaled so I can use it for comparison. df_flask_scaled return all 0, I think it didnt normalize based on the dataframe. is there anyway to normalize the single row df.
or should I add this data to the dataframe then compute normalizing?

I think you should do fit and transform separately. This is done to ensure that the distribution of data using in fitting is maintained.
# initialise scaler
min_max_scaler = preprocessing.MinMaxScaler()
# fit here
min_max_scaler.fit(rec_df.values)
# apply transformation
df_val_scaled = min_max_scaler.transform(rec_df.values)
df_flask_val_scaled = min_max_scaler.transform(df_flask_val)

Related

Confuse why my KNN code is throwing a ValueError

I am using sklearn for KNN regressor:
#importing libraries and data
import pandas as pd
from sklearn.neighbors import KNeighborsRegressor as KNR
theta = pd.read_csv("train.csv")#pandas dataframe
#getting data wanted from theta and putting it in a new dataframe
a = theta.get("YearBuilt")
b = theta.get("YrSold")
A = a.to_frame()
B = b.to_frame()
glasses = [A,B]
x = pd.concat(glasses)
#getting target data
y = theta.get("SalePrice")
#using KNN
horses = KNR(n_neighbors = 3)
horses.fit(x,y)
I get this error message:
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
Could someone please explain this? My data is in the hundred thousands for target and the thousands for input. And there is no blanks in the data.
Before answering the question, Let me refactor the code. You are using a dataframe so you can index single or muliple fields of the dataframe without going through the extra steps you've used:
#importing libraries and data
import pandas as pd
from sklearn.neighbors import KNeighborsRegressor as KNR
theta = pd.read_csv("train.csv") # pandas dataframe
#getting data wanted from theta and putting it in a new dataframe
x = theta[["YearBuilt", "YrSold"]] # index multiple fields
#getting target data
y = theta["SalePrice"] # index single field
#using KNN
horses = KNR(n_neighbors = 3)
horses.fit(x,y) # fit KNN
Regarding your error, it indicates that you have some NaN, Inf, large values in your data. You can ensure these doesnt occur by filtering out the NaN and inf values using this:
theta = theta.replace([np.inf, -np.inf], np.nan)
theta.dropna(inplace=True)

Standardizing a set of columns in a pandas dataframe with sklearn

I have a table with four columns: CustomerID, Recency, Frequency and Revenue.
I need to standardize (scale) the columns Recency, Frequency and Revenue and save the column CustomerID.
I used this code:
from sklearn.preprocessing import normalize, StandardScaler
df.set_index('CustomerID', inplace = True)
standard_scaler = StandardScaler()
df = standard_scaler.fit_transform(df)
df = pd.DataFrame(data = df, columns = ['Recency', 'Frequency','Revenue'])
But the result is a table without the column CustomerID. Is there any way to get a table with the corresponding CustomerID and the scaled columns?
fit_transform returns an ndarray with no indices, so you are losing the index you set on df.set_index('CustomerID', inplace = True).
Instead of doing this, you can simply take the subset of columns you need to transform, pass them to StandardScaler, and overwrite the original columns.
# Subset of columns to transform
cols = ['Recency','Frequency','Revenue']
# Overwrite old columns with transformed columns
df[cols] = StandardScaler.fit_transform(df[cols])
This way, you leave CustomerID completely unchanged.
You can use scale to standardize specific columns:
from sklearn.preprocessing import scale
cols = ['Recency', 'Frequency', 'Revenue']
df[cols] = scale(df[cols])
You can use this metod:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
df[:, 3:] = sc.fit_transform(df[:, 1:])

ValueError: could not convert string to float: 'Q'

I am new to programming and I was working with the titanic dataset from Kaggle. I have been trying to build the Logistic Regression model after performing one-hot encoding. But I keep getting the error. I think the error is caused due to the dummy variable. Below is my code.
import numpy as np
import pandas as pd
import matplotlib as plt
import seaborn as sns
#Loading data
df=pd.read_csv(r"C:\Users\Downloads\train.csv")
#Deleting unwanted columns
df.drop(["PassengerId","Name","Cabin","Ticket"],axis=1,inplace=True)
#COunt of Missing values in each column
print(df.isnull().sum())
#Deleting rows with missing values based on column name
df.dropna(subset=['Embarked','Age'],inplace=True)
print(df.isnull().sum())
#One hot encoding for categorical variables
#Creating dummy variables for Sex column
dummies = pd.get_dummies(df.Sex)
dummies2=pd.get_dummies(df.Embarked)
#Appending the dummies dataframe with original dataframe
new_df= pd.concat([df,dummies,dummies2],axis='columns')
print(type(new_df))
#print(new_df.head(10))
#Drop the original sex,Embarked column and one of the dummy column for bth variables
new_df.drop(['Sex','Embarked'],axis='columns',inplace=True)
print(new_df.head(10))
new_df.info()
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix,accuracy_score
x = df.drop('Survived', axis=1)
y = df['Survived']
logmodel = LogisticRegression()
logmodel.fit(x, y)
As we discussed in the comments, here is the solution:
First, you need to modify your x and y variables to use new_df instead of df just like so:
x = new_df.drop('Survived', axis=1)
y = new_df['Survived']
Then, you need to increase the iteration of your Logistic Regression Model like so:
logmodel = LogisticRegression(max_iter=1000)

Not matching sample in y axis for knn

Im trying to make my way to a sligthly more flexible knn input script than the tutorials based of the iris dataset but Im having some trouble (I think) to add the matching 2nd dimension to the numpy array in #6 and when I come to #11. the fitting.
File "G:\PROGRAMMERING\Anaconda\lib\site-packages\sklearn\utils\validation.py", line 212, in check_consistent_length
" samples: %r" % [int(l) for l in lengths]) ValueError: Found input variables with inconsistent numbers of samples: [150, 1]
x is (150,5) and y is (150,1). 150 is the number of samples in both, but they differ in number of fields, is this the problem and if so how do I fix it?
#1. Loading the Pandas libraries as pd
import pandas as pd
import numpy as np
#2. Read data from the file 'custom.csv' placed in your code directory
data = pd.read_csv("custom.csv")
#3. Preview the first 5 lines of the loaded data
print(data.head())
print(type(data))
#4.Test the shape of the data
print(data.shape)
df = pd.DataFrame(data)
print(df)
#5. Convert non-numericals to numericals
print(df.dtypes)
# Any object should be converted to numerical
df['species'] = pd.Categorical(df['species'])
df['species'] = df.species.cat.codes
print("outcome:")
print(df.dtypes)
#6.Convert df to numpy.ndarray
np = df.to_numpy()
print(type(np)) #this should state <class 'numpy.ndarray'>
print(data.shape)
print(np)
x = np.data
y = [df['species']]
print(y)
#K-nearest neighbor (find closest) - searach for the K nearest observations in the dataset
#The model calculates the distance to all, and selects the K nearest ones.
#8. Import the class you plan to use
from sklearn.neighbors import (KNeighborsClassifier)
#9. Pick a value for K
k = 2
#10. Instantiate the "estimator" (make an instance of the model)
knn = KNeighborsClassifier(n_neighbors=k)
print(knn)
#11. fit the model with data/model training
knn.fit(x, y)
#12. Predict the response for a new observation
print(knn.predict([[3, 5, 4, 2]]))```
This is how I used the scikit-learn KNeighborsClassifier to fit the knn model:
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier
df = datasets.load_iris()
X = pd.DataFrame(df.data)
y = df.target
knn = KNeighborsClassifier(n_neighbors = 2)
knn.fit(X,y)
print(knn.predict([[6, 3, 5, 2]]))
#prints output class [2]
print(knn.predict([[3, 5, 4, 2]]))
#prints output class [1]
From DataFrame you don't need to convert to numpy array, you can directly fit the model on DataFrame, also while converting the DataFrame to numpy array you have named that as np which is also used to import numpy at the top import numpy as np
The input prediction input is 4 columns, leaving the fifth 'species' without prediction. Also, if 'species' was the target it cannot be given as input to the knn at the same time. The pop removes this particular column from the dataFrame df.
#npdf = df.to_numpy()
df = df.apply(lambda x:pd.Series(x))
y = np.asarray(df['species'])
#removes the target from the sample
df.pop('species')
x = df.to_numpy()

CSV normalization in Python

I'm working on a CSV file which contains several medical data and I want to implement it for ML model, but before executing the ML model, I want to normalize the data between 0 to 1. Below is my script, but it's producing error, how to resolve the error
Sample input file
import pandas as pd
import scipy as sp
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from pandas import read_csv
Data = ('Medical_Data.csv')
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(Data, names=names)
array = dataframe.values
# separate array into input and output components
X = array[:,0:8]
Y = array[:,8]
scaler = MinMaxScaler(feature_range=(0, 1))
rescaledX = scaler.fit_transform(X)
# summarize transformed data
np.set_printoptions(precision=3)
print(rescaledX[0:5,:])
Error massage:
could not convert string to float: 'preg'
You're performing pd.read_csv twice. Data will be in a DataFrame format and you cannot perform pd.read_csv on a DataFrame.
---- UPDATE
names needs to be defined before read_csv. Please refer to https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html .
import pandas as pd
import scipy as sp
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from pandas import read_csv
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv('Medical_Data.csv', names=names)
array = dataframe.values
# separate array into input and output components
X = array[:,0:8]
Y = array[:,8]
scaler = MinMaxScaler(feature_range=(0, 1))
rescaledX = scaler.fit_transform(X)
# summarize transformed data
np.set_printoptions(precision=3)
print(rescaledX[0:5,:])
You don`t have to use 'pd.read_csv' twice. Just use like that:
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
Data = pd.read_csv('Medical_Data.csv',names=names)
and also if you want to get DataFrame's columns, use this code:
columns = Data.columns
'Data.columns' will return column list

Categories