I'm trying to standardize a dataset in Python as part of Principal Component Analysis. I've managed to do the following so far:
cancer_data = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data', header=None)
cancer_data.columns = ['Sample code', 'Clump Thickness', 'Uniformity of Cell Size', 'Uniformity of Cell Shape',
'Marginal Adhesion', 'Single Epithelial Cell Size', 'Bare Nuclei', 'Bland Chromatin',
'Normal Nucleoli', 'Mitoses','Class']
cancer_data = cancer_data.replace('?', np.NaN)
cancer_data = cancer_data.fillna(cancer_data.median())
classDF = cancer_data['Class']
cancer_data = cancer_data.drop(['Class' ,'Sample code'], axis = 1)
# Standardization of data
standardized = StandardScaler().fit_transform(cancer_data)
x = pd.DataFrame(standardized, columns = cancer_data.columns)
However when I check the Mean values, I get the following output:
array([-5.08256606e-17, -9.14861892e-17, -3.04953964e-17, 5.08256606e-17,
5.08256606e-17, -8.13210570e-17, 3.04953964e-17, -1.32146718e-16,
-8.13210570e-17])
I'm not too sure what I'm doing wrong for these values to be wrong, so any help is much appreciated (I'm new to data mining).
Use the standardization formula, applied to each column to be standardized:
df_std[column] = (df_std[column] - df_std[column].mean()) / df_std[column].std()
Or use scikit-learn's StandardScaler:
from sklearn.preprocessing import StandardScaler
# create a scaler object
std_scaler = StandardScaler()
# fit and transform the data (here df_cars stands for your dataframe and column for its list of column names)
df_std = pd.DataFrame(std_scaler.fit_transform(df_cars), columns=column)
Read this for more information:
https://towardsdatascience.com/data-normalization-with-pandas-and-scikit-learn-7c1cc6ed6475
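For a quick sanity check, here is a minimal sketch (with a small made-up frame standing in for your cancer_data) that applies the column-wise formula and compares it with StandardScaler; a zero column mean is exactly what standardization should produce, and values on the order of 1e-17 are just floating-point round-off for zero:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
# small illustrative frame standing in for the cancer data
df = pd.DataFrame({'a': [1.0, 2.0, 3.0, 4.0], 'b': [10.0, 20.0, 30.0, 50.0]})
# manual standardization, column by column
df_std = df.copy()
for column in df_std.columns:
    df_std[column] = (df_std[column] - df_std[column].mean()) / df_std[column].std()
# StandardScaler equivalent (it divides by the population std, ddof=0, so the scaled values differ slightly from the pandas version)
scaled = pd.DataFrame(StandardScaler().fit_transform(df), columns=df.columns)
print(df_std.mean().values)  # zero mean, as expected
print(scaled.mean().values)  # tiny values like 1e-17 here are numerically zero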
So currently this is the code I have. Not attached are various graphs I have made that show the actual stock price from the CSV alongside my projections. I want to simply predict tomorrow's stock price given all of this historical data, but I'm having a difficult time. The "df.loc[len(df.index)] = ['2022-04-05',0,0,0,0,0,0]" lines were where I was trying to put the predictions for future days, although I am open to other ways.
# Machine learning
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
# For data manipulation
import pandas as pd
import numpy as np
# To plot
import matplotlib.pyplot as plt
plt.style.use('seaborn-darkgrid')
# To ignore warnings
import warnings
warnings.filterwarnings("ignore")
# Read the historical data CSV with pandas
df = pd.read_csv('data_files/MSFT.csv')
#add extra row of blank data for future prediction
df.loc[len(df.index)] = ['2022-04-05',0,0,0,0,0,0]
df.loc[len(df.index)] = ['2022-04-06',0,0,0,0,0,0]
df.loc[len(df.index)] = ['2022-04-07',0,0,0,0,0,0]
df.loc[len(df.index)] = ['2022-04-08',0,0,0,0,0,0]
# Set the Date column as the index
df.index = pd.to_datetime(df['Date'])
# Drop the original Date column
df = df.drop(['Date'], axis='columns')
print(df)
# Create predictor variables
df['Open-Close'] = df.Open - df.Close
df['High-Low'] = df.High - df.Low
# Store all predictor variables in a variable X
X = df[['Open-Close', 'High-Low']]
X.head()
# Target variables
y = np.where(df['Close'].shift(-1) > df['Close'], 1, 0)
print(y)
split_percentage = 0.8
split = int(split_percentage*len(df))
# Train data set
X_train = X[:split]
y_train = y[:split]
# Test data set
X_test = X[split:]
y_test = y[split:]
# Support vector classifier
cls = SVC().fit(X_train, y_train)
df['Predicted_Signal'] = cls.predict(X)
# Calculate daily returns
df['Return'] = df.Close.pct_change()
# Calculate strategy returns
df['Strategy_Return'] = df.Return * df.Predicted_Signal.shift(1)
# Calculate cumulative returns
df['Cum_Ret'] = df['Return'].cumsum()
# Plot Strategy Cumulative returns
df['Cum_Strategy'] = df['Strategy_Return'].cumsum()
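For reference, a minimal sketch of pulling a next-day signal out of the classifier fitted above; it assumes real price values have been filled into the appended future rows (rows of zeros make both predictor columns zero), and note that this SVC predicts an up/down class, not a price level:
# Take the predictor row for the most recent date and classify it
latest = X.iloc[[-1]]
next_day_signal = cls.predict(latest)[0]
print('Predicted direction for the next session:', 'up' if next_day_signal == 1 else 'down')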
I have data showing the price to lease different cars.
I have created a matrix to show the correlations between each of the elements involved, but I do not trust it. In my experience, the correlations it is showing should not be there. The blp (the cost to fully purchase the car) should be the most important factor; however, I'm getting seats and engine volume. (Engine volume I can understand, but seats?)
Perhaps the problem is how I scaled my data.
correlation matrix image
from matplotlib import pyplot
import pandas as pd
import numpy as np
from sklearn import preprocessing
def scale_this_data(data, col_names):
    print("scaling data now")
    new_df = pd.DataFrame(columns=col_names)
    for col in data.columns:
        wanted_col = False
        for the_col in col_names:
            if the_col == col:
                wanted_col = True
        if wanted_col == True:
            np_arr = data[col].values
            np_arr = np_arr.reshape(-1, 1)
            min_max_scaler = preprocessing.MinMaxScaler()
            np_arr = min_max_scaler.fit_transform(np_arr)
            #for n in range(len(data[col])):
            old = data[col].iloc[3]
            data[col] = np_arr
            print(str(old) + " became " + str(data[col].iloc[3]))
    return data
Path = "new_ratebook.csv"
col_names = ['Net Rental2','Doors2', 'Seats2', 'BHP2', 'Eng CC2', 'CO22', 'blp2']
data = pd.read_csv(Path , dtype = str , index_col=False, low_memory=False)
data = scale_this_data(data, col_names)
data.to_csv("scaleddata.csv")
correlations = data.corr()
fig = pyplot.figure()
ax = fig.add_subplot(111)
cax = ax.matshow(correlations, vmin=0, vmax=1)
fig.colorbar(cax)
ticks = np.arange(0,7,1)
ax.set_xticks(ticks)
ax.set_yticks(ticks)
ax.set_xticklabels(col_names)
ax.set_yticklabels(col_names)
pyplot.savefig('correlations.png')
pyplot.show()
Question: how do I confirm to myself that the correlation is correct?
You can confirm it in various ways. Some are the following:
Verify that the data are correct.
Take a subset of your data (reduce the length of your dataframe).
Calculate it by hand (a good way to convince yourself), with a calculator, in Excel, or with an online tool such as a Pearson Correlation Coefficient Calculator.
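A minimal sketch of that by-hand check, using made-up values for two of your columns; the hand-written formula and pandas' corr() should agree:
import numpy as np
import pandas as pd
# tiny made-up subset standing in for two columns of the scaled data
check = pd.DataFrame({'blp2': [0.10, 0.35, 0.50, 0.80, 1.00],
                      'Net Rental2': [0.15, 0.30, 0.55, 0.75, 0.95]})
x, y = check['blp2'], check['Net Rental2']
# Pearson r written out: covariance over the product of the standard deviations
r_manual = ((x - x.mean()) * (y - y.mean())).sum() / np.sqrt(((x - x.mean()) ** 2).sum() * ((y - y.mean()) ** 2).sum())
print(r_manual)                                  # by-hand value
print(check.corr().loc['blp2', 'Net Rental2'])   # pandas value; the two should match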
By the way, correlation does not imply causation (link, archived).
In the case where a dataframe has two or more columns with numerical and text values, plus one Label/Target column, and I want to apply a model like SVM, how can I use only the columns I am most interested in?
Ex.
Data                                      Num     Label/Target  No_Sense
What happens here?                        group1  1             Migrate
Customer Management                       group2  0             Change Stage
Life Cycle Stages                         group1  1             Restructure
Drop-down allows to select status type    group3  1             Restructure Status
and so on.
The approach I have taken is:
1. Encode the "Num" column:
one_hot = pd.get_dummies(df['Num'])
df = df.drop('Num',axis = 1)
df = df.join(one_hot)
2. Encode the "Data" column:
def bag_words(df):
    df = basic_preprocessing(df)
    count_vectorizer = CountVectorizer()
    count_vectorizer.fit(df['Data'])
    list_corpus = df["Data"].tolist()
    list_labels = df["Label/Target"].tolist()
    X = count_vectorizer.transform(list_corpus)
    return X, list_labels
Then apply bag_words to the dataset
X, y = bag_words(df)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=40)
Is there anything that I missed in these steps? How can I select only "Data" and "Num" features in my training dataset? (as I think "No_Sense" is not so relevant for my purposes)
EDIT: I have tried with
def bag_words(df):
    df = basic_preprocessing(df)
    count_vectorizer = CountVectorizer()
    count_vectorizer.fit(df['Data'])
    list_corpus = df["Data"].tolist() + df["group1"].tolist() + df["group2"].tolist() + df["group3"].tolist()  #<----
    list_labels = df["Label/Target"].tolist()
    X = count_vectorizer.transform(list_corpus)
    return X, list_labels
but I get this error:
TypeError: 'int' object is not iterable
I hope this helps you:
import pandas as pd
import numpy as np
import re
from sklearn.feature_extraction.text import CountVectorizer
#this part is just so I can recreate your df from the string you posted
#remove this part !!!!
data = """
Data                                      Num     Label/Target  No_Sense
What happens here?                        group1  1             Migrate
Customer Management                       group2  0             Change Stage
Life Cycle Stages                         group1  1             Restructure
Drop-down allows to select status type    group3  1             Restructure Status
"""
lines = [line for line in data.split("\n") if line.strip()]
df = pd.DataFrame(np.array([re.split(r'\s{2,}', line) for line in lines[1:]]),
                  columns=lines[0].split())
#what you want starts from here!!!!:
one_hot = pd.get_dummies(df['Num'])
df = df.drop('Num',axis = 1)
df = df.join(one_hot)
#at this point you have 3 new fetures for 'Num' variable
def bag_words(df):
    count_vectorizer = CountVectorizer()
    count_vectorizer.fit(df['Data'])
    matrix = count_vectorizer.transform(df['Data'])
    #this dataframe `encoded_df` has 15 new features, the result of fitting
    #the CountVectorizer to the 'Data' variable
    encoded_df = pd.DataFrame(data=matrix.toarray(), columns=["Data"+str(i) for i in range(matrix.shape[1])])
    #adding them to the dataframe (note the assignment; join alone does not modify df)
    df = df.join(encoded_df)
    #getting the numpy arrays that you can use in training
    X = df.loc[:, ["Data"+str(i) for i in range(matrix.shape[1])] + ["group1", "group2", "group3"]].to_numpy()
    y = df.loc[:, ["Label/Target"]].to_numpy()
    return X, y
X, y = bag_words(df)
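From there, a minimal follow-up sketch of feeding X and y into an SVM, reusing the split parameters from your question (just to show the shape of the step, nothing tuned):
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
# y comes back as a column vector from .loc[...]; ravel() flattens it for the classifier
X_train, X_test, y_train, y_test = train_test_split(X, y.ravel(), test_size=0.2, random_state=40)
# on your real dataset (not the 4-row sample above) this trains the SVM on the Data + Num features only
clf = SVC()
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))  # accuracy on the held-out rows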
I'm at a loss as to what's happening here.
I'm downloading historical stock data with Pandas Datareader, and after some small manipulations (i.e. rearranging the dataframe, adding moving averages, etc.), I pass the dataframe to FeatureTools to do a quick auto feature engineering, which it does fine by adding new columns to the dataframe...
BUT then I pass it to FeatureSelector (to remove all columns that are highly correlated, have no importance, etc.), and I run into an issue where FeatureSelector can no longer find the "label" column (Adj Close) that I'm trying to point it to. I'm new to FeatureSelector, so I'm not entirely sure how to use it yet. From there, it will pass the data on to TPOT to do an auto regression.
I have included my full code here; I know you're not supposed to, but it is working code that anyone can run to see my issue on their side. The error I get is:
KeyError: "labels ['Adj Close'] not contained in axis"
It would appear that FeatureSelector is removing the "Adj Close" label/column during the removal step, but I thought that was why we assign it to the internal "labels=" argument? Any suggestions would be great. Would love to get this working. Just type in a ticker symbol to get started (e.g. CLVS). Thanks!
ticker_input = input('Which stock ticker would you like to predict?') # Start with CLVS for testing
print('Getting the historical data for: ',ticker_input)
# Downloading historical data as dataframe
from datetime import datetime
from pandas_datareader import data as web
import pandas as pd
ex = 'yahoo'
start = datetime(2010, 1, 1)
end = datetime.now()
df = web.DataReader(ticker_input, ex, start, end).reset_index()
# Create the prediction dataset
df = df.drop(['Close'],axis=1)
df['PrevHi'] = df['High'].shift(1)
df['PrevLo'] = df['Low'].shift(1)
df['PrevClose'] = df['Adj Close'].shift(1)
df['PrevVol'] = df['Volume'].shift(1)
df['PrevOpen'] = df['Open'].shift(1)
df = df.drop(['High','Low','Volume'],axis=1)
# Get the 9 and 20 MA values
df['9MA'] = df['Open'].rolling(window=9).mean()
df['20MA'] = df['Open'].rolling(window=20).mean()
import time
# Reshape the df
df2 = df[['Date','Open','PrevOpen','PrevHi','PrevLo','PrevClose','PrevVol','9MA','20MA','Adj Close']]
df2.dropna(how='all') # THIS DROP ISN'T DROPPING ROWS W/ BLANK VALUES FOR SOME REASON???
# Auto Feature Engineering using Feature Tools
import featuretools as ft
#print(ft.list_primitives().to_string()) # To get full list of primitives that could be used
print('Adding the engineered features to the dataframe. This may take a while...')
es = ft.EntitySet(id = 'stockdata')
es.entity_from_dataframe(entity_id = 'data', dataframe = df2,
make_index = False,index = 'Date')
# Run deep feature synthesis with transformation primitives
feature_matrix, feature_defs = ft.dfs(entityset = es, target_entity = 'data', max_depth=2,verbose=True,
agg_primitives = ['skew','mean','median',
'all','count','num_unique','trend','max','mode',
'std','sum','min'],
trans_primitives = ['divide_numeric'])
# 'diff',
# 'greater_than',
# 'less_than_equal_to',
# 'cum_mean',
# 'time_since',
# 'cum_sum',
# 'add_numeric',
# 'multiply_numeric',
# 'greater_than_equal_to',
# 'negate',
# 'cum_min',
# 'subtract_numeric',
# 'not',
# 'cum_count',
# 'modulo_numeric',
# 'less_than'])
print(feature_matrix.head())
df2 = feature_matrix
df2.to_csv('FeatureMatrix.csv')
# Trying to now name all the feature columns and label for FeatureSelector...
features = df2.drop(['Adj Close'],axis=1)
label = df2['Adj Close'].values
# Now, drop all columns of low importance
from feature_selector import FeatureSelector
fs = FeatureSelector(data = features, labels = label)
fs.identify_all(selection_params = {'missing_threshold': 0.6,
'correlation_threshold': 0.98,
'task': 'regression',
'eval_metric': 'mse',
'cumulative_importance': 0.99})
df2 = fs.remove(methods = 'all')
# Somewhere above it's not recognizing my Adj Close label anymore?
# Training dataset
df = df2.iloc[:-90] # subtracting 90 rows/days to use as the predictions dataset later
print('Printing training dataframe...')
print(df)
# Prediction dataset for later use
prediction_df = df2.iloc[-90:]
print('Printing prediction dataframe for later use...')
print(prediction_df)
# Can keep adding to the dataset with things like PrevIndustryHi,Lo,Close,Open and other metrics
print('Pausing for 20 seconds to review before training...')
time.sleep(20)
# Now, train a TPOT Regressor
from tpot import TPOTRegressor
from sklearn.model_selection import train_test_split
import os
features = df.drop(['Adj Close'],axis=1)
label = df['Adj Close']
X_train, X_test, y_train, y_test = train_test_split(features, label,
train_size=0.75, test_size=0.25)
# Create a folder to cache the pipeline work (use if not using auto)
# if os.path.exists('./PipelineCache'):
# pass
# else:
# os.mkdir('./PipelineCache')
tpot = TPOTRegressor(generations=10, population_size=40, verbosity=2) #memory='./PipelineCache', memory='auto',
tpot.fit(X_train, y_train)
predictions = (tpot.predict(X_test))
actuals = y_test
last_row = df.tail(1)
print('The last closing price was :')
print(last_row['Adj Close'])
print("TPOT's final score on training data is : ")
print(tpot.score(X_test, y_test))
if os.path.exists('./Exported Pipelines'):
pass
else:
os.mkdir('./Exported Pipelines')
tpot.export('./Exported Pipelines/1day-prediction-pipeline.py')
# Now, use the TPOT model to predict on the held out predictions dataset
from sklearn.metrics import mean_squared_error
features = prediction_df.drop(['Adj Close'], axis=1)
labels = prediction_df['Adj Close']
# Fit the model to the prediction_df and predict the labels
#tpot.fit(features, labels)
results = tpot.predict(features)
predictions_list = []
for preds in results:
predictions_list.append(preds)
prediction_df['Predictions'] = predictions_list
prediction_df.to_csv('PredictionsPerformance.csv', index=True)
print('The Mean Square Error of the predictions is :')
print(mean_squared_error(labels,results))
print('DONE!')
# Clear the cache directory when you don't need it anymore.
# If you're testing the same dataset over and over, use the
# same cache file
#from shutil import rmtree
#rmtree('./PipelineCache')
As a workaround, I just re-added the Adj Close column to the df after the removal process, like so:
# Trying to now name all the feature columns and label for FeatureSelector...
features = df.drop("Adj Close", axis=1)
label = df["Adj Close"]
# Now, drop all columns of low importance
from feature_selector import FeatureSelector
fs = FeatureSelector(data = features, labels = label)
fs.identify_all(selection_params = {'missing_threshold': 0.6,
'correlation_threshold': 0.98,
'task': 'regression',
'eval_metric': 'mse',
'cumulative_importance': 0.99})
all_to_remove = fs.check_removal()
print(all_to_remove[:])
df = fs.remove(methods = 'all')
# Re-add Adj Close to the df because FeatureSelector removes it once you assign it as the label, for some reason
df['Adj Close'] = label
For the dataset that I am working with, the categorical variables are ordinal, ranging from 1 to 5, in three columns. I am going to be feeding this into XGBoost.
Would I be okay to just run this command and skip creating dummy variables:
ser = pd.Series([1, 2, 3], dtype='category')
ser = ser.to_frame()
ser = ser.T
I would like to know, conceptually: since the categorical data is ordinal, would simply converting it to type category be adequate for the model? I tried creating dummy variables, but all the values become 1.
As for the code now, it runs, but this command returns 'numpy.int64':
type(ser[0][0])
Am I going about this correctly? Any help would be great!
Edit: updated code
Edit2: Normalizing the numerical data values. Is this logic correct?:
from sklearn import preprocessing
r = [1, 2, 3, 100, 200]
scaler = preprocessing.StandardScaler()
r = preprocessing.scale(r)
r = pd.Series(r)
r = r.to_frame()
r = r.T
Edit3: This is the dataset.
Just setting categorical variables as dtype="category" is not sufficient and won't work.
You need to convert categorical values to true categorical values with pd.factorize(), where each category is assigned a numerical label.
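For instance, a small hypothetical column factorizes like this:
import pandas as pd
codes, uniques = pd.factorize(pd.Series(['low', 'mid', 'high', 'mid']))
print(codes)    # [0 1 2 1] -- each distinct category gets an integer label
print(uniques)  # Index(['low', 'mid', 'high'], dtype='object')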
Let's say df is your pandas dataframe. Then in general you could use this boilerplate code:
df_numeric = df.select_dtypes(exclude=['object'])
df_obj = df.select_dtypes(include=['object']).copy()
# factorize categoricals columnwise
for c in df_obj:
    df_obj[c] = pd.factorize(df_obj[c])[0]
# if you want to one hot encode then add this line:
df_obj = pd.get_dummies(df_obj, prefix_sep='_', drop_first = True)
# merge dataframes back to one dataframe
df_final = pd.concat([df_numeric, df_obj], axis=1)
Since your categorical variables already are factorized (as far as I understand), you can skip the factorization and just try one hot encoding.
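As a hypothetical illustration: with integer-coded ordinal columns you have to list the columns explicitly, because get_dummies only encodes object/category columns by default:
import pandas as pd
df_ord = pd.DataFrame({'quality': [1, 2, 5, 3], 'size': [2, 2, 1, 4]})
# pd.get_dummies(df_ord) alone would return the frame unchanged, since both columns are integers
dummies = pd.get_dummies(df_ord, columns=['quality', 'size'], drop_first=True)
print(dummies.head())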
See also this post on stats.stackexchange.
If you want to standardize/normalize your numerical data (not the categorical), use this function:
from sklearn import preprocessing
def scale_data(data, scale="robust"):
    x = data.values
    if scale == "minmax":
        scaler = preprocessing.MinMaxScaler()
        x_scaled = scaler.fit_transform(x)
    elif scale == "standard":
        scaler = preprocessing.StandardScaler()
        x_scaled = scaler.fit_transform(x)
    elif scale == "quantile":
        scaler = preprocessing.QuantileTransformer()
        x_scaled = scaler.fit_transform(x)
    elif scale == "robust":
        scaler = preprocessing.RobustScaler()
        x_scaled = scaler.fit_transform(x)
    data = pd.DataFrame(x_scaled, columns=data.columns)
    return data
scaled_df = scale_data(df_numeric, "robust")
Putting it all together for your dataset:
from sklearn import preprocessing
df = pd.read_excel("default of credit card clients.xls", skiprows=1)
y = df['default payment next month'] #target variable
del df['default payment next month']
c = [2,3,4] # index of categorical data columns
r = list(range(0,24))
r = [x for x in r if x not in c] # get list of all other columns
df_cat = df.iloc[:, [2,3,4]].copy()
df_con = df.iloc[:, r].copy()
# factorize categorical data
for c in df_cat:
    df_cat[c] = pd.factorize(df_cat[c])[0]
# scale continuous data
scaler = preprocessing.MinMaxScaler()
df_scaled = scaler.fit_transform(df_con)
df_scaled = pd.DataFrame(df_scaled, columns=df_con.columns)
df_final = pd.concat([df_cat, df_scaled], axis=1)
#reorder columns back to original order
cols = df.columns
df_final = df_final[cols]
To further improve the code, do the train/test split before normalization: call fit_transform() on the training data and only transform() on the test data. Otherwise you will have data leakage.
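A minimal sketch of that pattern, reusing df_con and y from the snippet above (the split parameters are just illustrative):
import pandas as pd
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
# split first, on the unscaled data
X_train, X_test, y_train, y_test = train_test_split(df_con, y, test_size=0.25, random_state=0)
# fit the scaler on the training portion only...
scaler = preprocessing.MinMaxScaler()
X_train_scaled = pd.DataFrame(scaler.fit_transform(X_train), columns=X_train.columns, index=X_train.index)
# ...and reuse the fitted scaler on the test portion
X_test_scaled = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns, index=X_test.index)
The factorized categorical columns from df_cat can then be joined back to each split by index before training.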