Create feature vector programmatically in Spark ML / pyspark - python

I'm wondering if there is a concise way to run ML (e.g. KMeans) on a DataFrame in pyspark if I have the features in multiple numeric columns.
I.e. as in the Iris dataset:
(a1=5.1, a2=3.5, a3=1.4, a4=0.2, id=u'id_1', label=u'Iris-setosa', binomial_label=1)
I'd like to use KMeans without recreating the DataFrame by manually adding the feature vector as a new column and hardcoding the original columns repeatedly in the code.
The solution I'd like to improve:
from pyspark.mllib.linalg import Vectors
from pyspark.sql.types import Row
from pyspark.ml.clustering import KMeans, KMeansModel
iris = sqlContext.read.parquet("/opt/data/iris.parquet")
iris.first()
# Row(a1=5.1, a2=3.5, a3=1.4, a4=0.2, id=u'id_1', label=u'Iris-setosa', binomial_label=1)
df = iris.map(lambda r: Row(
    id=r.id,
    a1=r.a1,
    a2=r.a2,
    a3=r.a3,
    a4=r.a4,
    label=r.label,
    binomial_label=r.binomial_label,
    features=Vectors.dense(r.a1, r.a2, r.a3, r.a4))
).toDF()
kmeans_estimator = KMeans()\
    .setFeaturesCol("features")\
    .setPredictionCol("prediction")
kmeans_transformer = kmeans_estimator.fit(df)
predicted_df = kmeans_transformer.transform(df).drop("features")
predicted_df.first()
# Row(a1=5.1, a2=3.5, a3=1.4, a4=0.2, binomial_label=1, id=u'id_1', label=u'Iris-setosa', prediction=1)
I'm looking for a solution along these lines:
feature_cols = ["a1", "a2", "a3", "a4"]
prediction_col_name = "prediction"
<dataframe independent code for KMeans>
<New dataframe is created, extended with the `prediction` column.>

You can use VectorAssembler:
from pyspark.ml.feature import VectorAssembler
ignore = ['id', 'label', 'binomial_label']
assembler = VectorAssembler(
    inputCols=[x for x in df.columns if x not in ignore],
    outputCol='features')
assembler.transform(df)
It can be combined with k-means using an ML Pipeline:
from pyspark.ml import Pipeline
pipeline = Pipeline(stages=[assembler, kmeans_estimator])
model = pipeline.fit(df)
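Putting the assembler, the estimator, and the pipeline together gives roughly the column-agnostic flow the question asks for. A minimal sketch, assuming df and the column names from the question:
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

feature_cols = ["a1", "a2", "a3", "a4"]

assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
kmeans_estimator = KMeans().setFeaturesCol("features").setPredictionCol("prediction")

pipeline = Pipeline(stages=[assembler, kmeans_estimator])
model = pipeline.fit(df)

# the fitted pipeline adds the "prediction" column; drop the assembled vector again
predicted_df = model.transform(df).drop("features")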

Related

Encoding multiple columns

If a dataframe has two or more columns with numerical and text values, plus one Label/Target column, and I want to apply a model like SVM, how can I use only the columns I am most interested in?
Ex.
Data Num Label/Target No_Sense
What happens here? group1 1 Migrate
Customer Management group2 0 Change Stage
Life Cycle Stages group1 1 Restructure
Drop-down allows to select status type group3 1 Restructure Status
and so on.
The approach I have taken is:
1. Encode the "Num" column:
one_hot = pd.get_dummies(df['Num'])
df = df.drop('Num',axis = 1)
df = df.join(one_hot)
2. Encode the "Data" column:
def bag_words(df):
    df = basic_preprocessing(df)
    count_vectorizer = CountVectorizer()
    count_vectorizer.fit(df['Data'])
    list_corpus = df["Data"].tolist()
    list_labels = df["Label/Target"].tolist()
    X = count_vectorizer.transform(list_corpus)
    return X, list_labels
Then apply bag_words to the dataset
X, y = bag_words(df)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=40)
Is there anything that I missed in these steps? How can I select only "Data" and "Num" features in my training dataset? (as I think "No_Sense" is not so relevant for my purposes)
EDIT: I have tried with
def bag_words(df):
    df = basic_preprocessing(df)
    count_vectorizer = CountVectorizer()
    count_vectorizer.fit(df['Data'])
    list_corpus = df["Data"].tolist() + df["group1"].tolist() + df["group2"].tolist() + df["group3"].tolist() #<----
    list_labels = df["Label/Target"].tolist()
    X = count_vectorizer.transform(list_corpus)
    return X, list_labels
but I get this error:
TypeError: 'int' object is not iterable
I hope this helps you:
import pandas as pd
import numpy as np
import re
from sklearn.feature_extraction.text import CountVectorizer
#this part is just so I can recreate your df from the string you posted
#remove this part !!!!
data="""
Data Num Label/Target No_Sense
What happens here? group1 1 Migrate
Customer Management group2 0 Change Stage
Life Cycle Stages group1 1 Restructure
Drop-down allows to select status type group3 1 Restructure Status
"""
lines = data.strip().split('\n')
df = pd.DataFrame(np.array([re.split(r'\s{2,}', line) for line in lines[1:]]),
                  columns=lines[0].split())
#what you want starts from here!!!!:
one_hot = pd.get_dummies(df['Num'])
df = df.drop('Num',axis = 1)
df = df.join(one_hot)
#at this point you have 3 new features for the 'Num' variable
def bag_words(df):
    count_vectorizer = CountVectorizer()
    count_vectorizer.fit(df['Data'])
    matrix = count_vectorizer.transform(df['Data'])
    #this dataframe `encoded_df` has 15 new features, the result of fitting
    #the CountVectorizer to the 'Data' variable
    encoded_df = pd.DataFrame(data=matrix.toarray(), columns=["Data"+str(i) for i in range(matrix.shape[1])])
    #adding them to the dataframe
    df = df.join(encoded_df)
    #getting the numpy arrays that you can use in training
    X = df.loc[:, ["Data"+str(i) for i in range(matrix.shape[1])] + ["group1", "group2", "group3"]].to_numpy()
    y = df.loc[:, ["Label/Target"]].to_numpy()
    return X, y
X, y = bag_words(df)
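If you'd rather let scikit-learn do the column selection for you, here is a sketch with ColumnTransformer, as an alternative to the manual get_dummies/join above (assuming the original df with the 'Data', 'Num', 'Label/Target' and 'No_Sense' columns):
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

# bag-of-words on the text column, one-hot on the group column;
# every column not listed here (e.g. 'No_Sense') is dropped by default
preprocess = ColumnTransformer([
    ("text", CountVectorizer(), "Data"),
    ("cat", OneHotEncoder(), ["Num"]),
])

X = preprocess.fit_transform(df)
y = df["Label/Target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=40)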

How do I standardize my data so that the Mean is 0?

I'm trying to standardize a dataset in Python as part of Principal Component Analysis. I've managed to do the following so far:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler

cancer_data = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data', header=None)
cancer_data.columns = ['Sample code', 'Clump Thickness', 'Uniformity of Cell Size', 'Uniformity of Cell Shape',
'Marginal Adhesion', 'Single Epithelial Cell Size', 'Bare Nuclei', 'Bland Chromatin',
'Normal Nucleoli', 'Mitoses','Class']
cancer_data = cancer_data.replace('?', np.NaN)
cancer_data = cancer_data.fillna(cancer_data.median())
classDF = cancer_data['Class']
cancer_data = cancer_data.drop(['Class' ,'Sample code'], axis = 1)
# Standardization of data
standardized = StandardScaler().fit_transform(cancer_data)
x = pd.DataFrame(standardized, columns = cancer_data.columns)
However when I check the Mean values, I get the following output:
array([-5.08256606e-17, -9.14861892e-17, -3.04953964e-17, 5.08256606e-17,
5.08256606e-17, -8.13210570e-17, 3.04953964e-17, -1.32146718e-16,
-8.13210570e-17])
I'm not sure what I'm doing wrong for these values to be off, so any help is much appreciated (I'm new to data mining).
Use the standardization formula:
column = column to be standardized
df_std[column] = (df_std[column] - df_std[column].mean()) / df_std[column].std()
or:
from sklearn.preprocessing import StandardScaler
# create a scaler object
std_scaler = StandardScaler()
# fit and transform the data
df_std = pd.DataFrame(std_scaler.fit_transform(df_cars), columns=df_cars.columns)
For more information, read:
https://towardsdatascience.com/data-normalization-with-pandas-and-scikit-learn-7c1cc6ed6475
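As a side note, the means shown in the question are on the order of 1e-17, which is zero up to floating-point rounding error, so the standardization there had already worked. A quick check, assuming x is the standardized DataFrame from the question:
import numpy as np
# means of ~1e-17 are numerically zero
print(np.allclose(x.mean(axis=0), 0))  # True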

Do you have to create dummy variables for ordinal variables? Also getting error with conversion

For the dataset that I am working with, the categorical variables are ordinal, ranging from 1 to 5 for three columns. I am going to be feeding this into XGBoost.
Would I be okay to just run this command and skip creating dummy variables:
ser = pd.Series([1, 2, 3], dtype='category')
ser = ser.to_frame()
ser = ser.T
I would like to know, conceptually: since the categorical data is ordinal, would simply converting it to type category be adequate for the model? I tried creating dummy variables, but all the values became 1.
As the code stands now, it runs, but this command returns 'numpy.int64':
type(ser[0][0])
Am I going about this correctly? Any help would be great!
Edit: updated code
Edit2: Normalizing the numerical data values. Is this logic correct?:
r = [1, 2, 3, 100 ,200]
scaler = preprocessing.StandardScaler()
r = preprocessing.scale(r)
r = pd.Series(r)
r = r.to_frame()
r = r.T
Edit3: This is the dataset.
Just setting categorical variables to dtype="category" is not sufficient and won't work.
You need to convert the categorical values to numeric category codes, e.g. with pd.factorize(), where each category is assigned a numerical label.
Let's say df is your pandas dataframe. Then in general you could use this boilerplate code:
df_numeric = df.select_dtypes(exclude=['object'])
df_obj = df.select_dtypes(include=['object']).copy()
# factorize categoricals columnwise
for c in df_obj:
    df_obj[c] = pd.factorize(df_obj[c])[0]
# if you want to one hot encode then add this line:
df_obj = pd.get_dummies(df_obj, prefix_sep='_', drop_first = True)
# merge dataframes back to one dataframe
df_final = pd.concat([df_numeric, df_obj], axis=1)
Since your categorical variables already are factorized (as far as I understand), you can skip the factorization and just try one hot encoding.
See also this post on stats.stackexchange.
If you want to standardize/normalize your numerical data (not the categorical) use this function:
from sklearn import preprocessing
def scale_data(data, scale="robust"):
    x = data.values
    if scale == "minmax":
        scaler = preprocessing.MinMaxScaler()
        x_scaled = scaler.fit_transform(x)
    elif scale == "standard":
        scaler = preprocessing.StandardScaler()
        x_scaled = scaler.fit_transform(x)
    elif scale == "quantile":
        scaler = preprocessing.QuantileTransformer()
        x_scaled = scaler.fit_transform(x)
    elif scale == "robust":
        scaler = preprocessing.RobustScaler()
        x_scaled = scaler.fit_transform(x)
    data = pd.DataFrame(x_scaled, columns=data.columns)
    return data
scaled_df = scale_data(df_numeric, "robust")
Putting it all together for your dataset:
from sklearn import preprocessing
df = pd.read_excel("default of credit card clients.xls", skiprows=1)
y = df['default payment next month'] #target variable
del df['default payment next month']
c = [2,3,4] # index of categorical data columns
r = list(range(0,24))
r = [x for x in r if x not in c] # get list of all other columns
df_cat = df.iloc[:, [2,3,4]].copy()
df_con = df.iloc[:, r].copy()
# factorize categorical data
for c in df_cat:
    df_cat[c] = pd.factorize(df_cat[c])[0]
# scale continuous data
scaler = preprocessing.MinMaxScaler()
df_scaled = scaler.fit_transform(df_con)
df_scaled = pd.DataFrame(df_scaled, columns=df_con.columns)
df_final = pd.concat([df_cat, df_scaled], axis=1)
#reorder columns back to original order
cols = df.columns
df_final = df_final[cols]
To further improve the code, do the train/test split before normalization: call fit_transform() on the training data and only transform() on the test data. Otherwise you will have data leakage.
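A minimal sketch of that split-then-scale pattern, assuming df_con and y from the snippet above (the split parameters are just examples):
from sklearn.model_selection import train_test_split
from sklearn import preprocessing

X_train, X_test, y_train, y_test = train_test_split(df_con, y, test_size=0.2, random_state=0)

scaler = preprocessing.MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit the scaler on the training data only
X_test_scaled = scaler.transform(X_test)        # reuse the training-set statistics on the test data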

How can I get class names back when using MultiLabelBinarizer

I have a csv file which looks like this:
target,data
AAA,some text document
AAA;BBB,more text
AAC,more text
Here is the code:
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import BernoulliNB
import pandas as pd
pdf = pd.read_csv("Train.csv", sep=',')
pdfT = pd.read_csv("Test.csv", sep=',')
X1 = pdf['data']
Y1 = [[t for t in tar.split(';')] for tar in pdf['target']]
X2 = pdfT['data']
Y2 = [[t for t in tar.split(';')] for tar in pdfT['target']]
# Vectorizer data
hv = HashingVectorizer(stop_words='english', non_negative=True)
X1 = hv.transform(X1)
X2 = hv.transform(X2)
mlb = MultiLabelBinarizer()
mlb.fit(Y1+Y2)
Y1 = mlb.transform(Y1)
# mlb.classes_ looks like ['AAA','AAC','BBB',...] len(mlb.classes_)==1363
# Y1 looks like [[0,0,0,....0,0,0], ... ] now
# fit
clsf = OneVsRestClassifier(BernoulliNB(alpha=.001))
clsf.fit(X1,Y1)
# predict_proba
proba = clsf.predict_proba(X2)
# want to get class names back
classnames = mlb.inverse_transform(clsf.classes_) # booom, shit happens
for i in range(len(proba)):
    # get classname -> probability dict
    preDict = dict(zip(classnames, proba[i]))
    # sort dict by probability value, print actual and top 5 predicted results
    print(Y2[i], dict(sorted(preDict.items(), key=lambda d: d[1], reverse=True)[0:5]))
The problem is that after clsf.fit(X1, Y1), clsf.classes_ is an int array [0, 1, 2, 3, ..., 1362].
Why is it not like Y1? How can I get the class names from clsf.classes_? Does mlb.classes_ == clsf.classes_, in the same order?
When you fit a OneVsRestClassifier with multiple labels, a LabelBinarizer is used internally during the fit call, which converts the multilabels into a unique label per class.
You can access the label_binarizer_ attribute of the clsf object; its classes_ attribute contains the class definitions for the classes fit in the call to clsf.
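Here is a sketch of one way to recover the names in the question's loop, assuming clsf, mlb, X2 and Y2 from the question: because Y1's columns were produced by mlb.transform, the columns of predict_proba follow the same order as mlb.classes_.
# column i of predict_proba corresponds to column i of the indicator matrix Y1,
# which in turn corresponds to mlb.classes_[i]
classnames = mlb.classes_
proba = clsf.predict_proba(X2)
for i in range(len(proba)):
    preDict = dict(zip(classnames, proba[i]))
    # print the actual labels and the top 5 predicted classes by probability
    print(Y2[i], dict(sorted(preDict.items(), key=lambda d: d[1], reverse=True)[:5]))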

Pandas categorical variable transformation

Data.csv:
param1,param2,param3,result
1,2,cat1,12
2,3,cat2,13
1,6,cat1,6
1,1,cat2,12
Suppose I read the data from the file and convert categorical variables into dummy variables like this:
import pandas as pd
from sklearn.cross_validation import train_test_split
from sklearn.linear_model import LinearRegression
data = pd.read_csv('data.csv')
type_dummies = pd.get_dummies(data.house_type)
data = pd.concat([data, type_dummies], axis=1)
I received a dataframe:
1,2,1,0,..
1,6,0,1,..
I fit a simple linear regression on that dataset and obtained the coefficients. How can I convert a new record (new_data = np.array([12, 19, 'cat1'])) into new_data = np.array([12, 19, 1, 0]) using pandas so I can use it with my linear model? (i.e. so that the new data's categorical variables are converted into dummy variables)
Typically you'll want to set up a pipeline to record the correct category:code mapping.
from itertools import chain
import pandas as pd
from sklearn.base import TransformerMixin

class CategoricalTransformer(TransformerMixin):
    def fit(self, X, y=None, *args, **kwargs):
        self.columns_ = X.columns
        self.cat_columns_ = X.select_dtypes(include=['category']).columns
        self.non_cat_columns_ = X.columns.drop(self.cat_columns_)
        self.cat_map_ = {col: X[col].cat.categories
                         for col in self.cat_columns_}
        self.ordered_ = {col: X[col].cat.ordered
                         for col in self.cat_columns_}
        self.dummy_columns_ = {col: ["_".join([col, v])
                                     for v in self.cat_map_[col]]
                               for col in self.cat_columns_}
        self.transformed_columns_ = pd.Index(
            self.non_cat_columns_.tolist() +
            list(chain.from_iterable(self.dummy_columns_[k]
                                     for k in self.cat_columns_))
        )
        return self

    def transform(self, X, y=None, *args, **kwargs):
        return (pd.get_dummies(X)
                .reindex(columns=self.transformed_columns_)
                .fillna(0))
More here.
With the pipeline sklearn.pipeline.make_pipeline(CategoricalTransformer(), LinearRegression()), your predict method should correctly translate the categorical house_type column into dummy variables.
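A minimal usage sketch, assuming the question's data.csv with param3 cast to a pandas category dtype (the new record below is illustrative):
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

data = pd.read_csv('data.csv')
data['param3'] = data['param3'].astype('category')

X = data.drop(columns=['result'])
y = data['result']

pipe = make_pipeline(CategoricalTransformer(), LinearRegression())
pipe.fit(X, y)

# a new record is passed as a DataFrame with the same columns; transform()
# recreates the fitted dummy columns (categories missing from the record become 0)
new_data = pd.DataFrame({'param1': [12], 'param2': [19], 'param3': ['cat1']})
new_data['param3'] = new_data['param3'].astype('category')
print(pipe.predict(new_data))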
