I'm wondering if there is a concise way to run ML (e.g. KMeans) on a DataFrame in PySpark when the features are spread across multiple numeric columns.
I.e. as in the Iris dataset:
(a1=5.1, a2=3.5, a3=1.4, a4=0.2, id=u'id_1', label=u'Iris-setosa', binomial_label=1)
I'd like to use KMeans without manually recreating the DataFrame with the feature vector added as a new column, and without the original columns hardcoded repeatedly in the code.
The solution I'd like to improve:
from pyspark.mllib.linalg import Vectors
from pyspark.sql.types import Row
from pyspark.ml.clustering import KMeans, KMeansModel
iris = sqlContext.read.parquet("/opt/data/iris.parquet")
iris.first()
# Row(a1=5.1, a2=3.5, a3=1.4, a4=0.2, id=u'id_1', label=u'Iris-setosa', binomial_label=1)
df = iris.map(lambda r: Row(
    id=r.id,
    a1=r.a1,
    a2=r.a2,
    a3=r.a3,
    a4=r.a4,
    label=r.label,
    binomial_label=r.binomial_label,
    features=Vectors.dense(r.a1, r.a2, r.a3, r.a4))
).toDF()
kmeans_estimator = KMeans()\
    .setFeaturesCol("features")\
    .setPredictionCol("prediction")
kmeans_transformer = kmeans_estimator.fit(df)
predicted_df = kmeans_transformer.transform(df).drop("features")
predicted_df.first()
# Row(a1=5.1, a2=3.5, a3=1.4, a4=0.2, binomial_label=1, id=u'id_1', label=u'Iris-setosa', prediction=1)
I'm looking for a solution along the lines of:
feature_cols = ["a1", "a2", "a3", "a4"]
prediction_col_name = "prediction"
<dataframe independent code for KMeans>
<New dataframe is created, extended with the `prediction` column.>
You can use VectorAssembler:
from pyspark.ml.feature import VectorAssembler
ignore = ['id', 'label', 'binomial_label']
assembler = VectorAssembler(
    inputCols=[x for x in df.columns if x not in ignore],
    outputCol='features')
assembler.transform(df)
It can be combined with k-means using an ML Pipeline:
from pyspark.ml import Pipeline
pipeline = Pipeline(stages=[assembler, kmeans_estimator])
model = pipeline.fit(df)
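Putting it together, a minimal end-to-end sketch in the generic shape you asked for (feature_cols and prediction_col_name are the names from your question; the rest is standard pyspark.ml API):
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

feature_cols = ["a1", "a2", "a3", "a4"]
prediction_col_name = "prediction"

# DataFrame-independent: only the two variables above are specific to the data.
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
kmeans = KMeans().setFeaturesCol("features").setPredictionCol(prediction_col_name)

pipeline = Pipeline(stages=[assembler, kmeans])
model = pipeline.fit(df)

# New dataframe, extended with the prediction column and without the helper vector.
predicted_df = model.transform(df).drop("features")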
Related
If a dataframe has two or more columns with numerical and text values, plus one Label/Target column, and I want to apply a model like SVM, how can I use only the columns I am most interested in?
Ex.
Data                                    Num     Label/Target  No_Sense
What happens here?                      group1  1             Migrate
Customer Management                     group2  0             Change Stage
Life Cycle Stages                       group1  1             Restructure
Drop-down allows to select status type  group3  1             Restructure Status
and so on.
The approach I have taken is:
1. Encode the "Num" column:
one_hot = pd.get_dummies(df['Num'])
df = df.drop('Num',axis = 1)
df = df.join(one_hot)
2. Encode the "Data" column:
def bag_words(df):
    df = basic_preprocessing(df)
    count_vectorizer = CountVectorizer()
    count_vectorizer.fit(df['Data'])
    list_corpus = df["Data"].tolist()
    list_labels = df["Label/Target"].tolist()
    X = count_vectorizer.transform(list_corpus)
    return X, list_labels
Then apply bag_words to the dataset:
X, y = bag_words(df)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=40)
Is there anything that I missed in these steps? How can I select only "Data" and "Num" features in my training dataset? (as I think "No_Sense" is not so relevant for my purposes)
EDIT: I have tried with:
def bag_words(df):
    df = basic_preprocessing(df)
    count_vectorizer = CountVectorizer()
    count_vectorizer.fit(df['Data'])
    list_corpus = df["Data"].tolist() + df["group1"].tolist() + df["group2"].tolist() + df["group3"].tolist()  # <----
    list_labels = df["Label/Target"].tolist()
    X = count_vectorizer.transform(list_corpus)
    return X, list_labels
but I get the error:
TypeError: 'int' object is not iterable
I hope this helps. The TypeError in your edit comes from mixing the integer dummy columns (group1, group2, group3) into the list of text documents: CountVectorizer can only process strings. Keep the text features and the dummy features separate and combine them afterwards, as below:
import pandas as pd
import numpy as np
import re
from sklearn.feature_extraction.text import CountVectorizer
# this part is just so I can recreate your df from the string you posted
# remove this part !!!!
data="""
Data Num Label/Target No_Sense
What happens here? group1 1 Migrate
Customer Management group2 0 Change Stage
Life Cycle Stages group1 1 Restructure
Drop-down allows to select status type group3 1 Restructure Status
"""
df = pd.DataFrame(np.array( [ re.split(r'\s{2,}', line) for line in lines[1:] ] ),
columns = lines[0].split())
#what you want starts from here!!!!:
one_hot = pd.get_dummies(df['Num'])
df = df.drop('Num',axis = 1)
df = df.join(one_hot)
# at this point you have 3 new features for the 'Num' variable
def bag_words(df):
    count_vectorizer = CountVectorizer()
    count_vectorizer.fit(df['Data'])
    matrix = count_vectorizer.transform(df['Data'])
    # this dataframe `encoded_df` has 15 new features, the result of fitting
    # the CountVectorizer to the 'Data' variable
    encoded_df = pd.DataFrame(data=matrix.toarray(),
                              columns=["Data" + str(i) for i in range(matrix.shape[1])])
    # adding them to the dataframe (note the assignment back to df)
    df = df.join(encoded_df)
    # getting the numpy arrays that you can use in training
    X = df.loc[:, ["Data" + str(i) for i in range(matrix.shape[1])] + ["group1", "group2", "group3"]].to_numpy()
    y = df.loc[:, ["Label/Target"]].to_numpy()
    return X, y
X, y = bag_words(df)
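Since you mentioned SVM, here is a minimal sketch of feeding the returned arrays to a classifier (LinearSVC is just one choice; any sklearn estimator with the same interface works):
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split

# y comes back as a column vector, so flatten it for sklearn
X_train, X_test, y_train, y_test = train_test_split(X, y.ravel(), test_size=0.2, random_state=40)

clf = LinearSVC()
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))  # mean accuracy on the held-out split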
I'm trying to standardize a dataset in Python as part of Principal Component Analysis. I've managed to do the following so far:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler

cancer_data = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data', header=None)
cancer_data.columns = ['Sample code', 'Clump Thickness', 'Uniformity of Cell Size', 'Uniformity of Cell Shape',
'Marginal Adhesion', 'Single Epithelial Cell Size', 'Bare Nuclei', 'Bland Chromatin',
'Normal Nucleoli', 'Mitoses','Class']
cancer_data = cancer_data.replace('?', np.NaN)
cancer_data = cancer_data.fillna(cancer_data.median())
classDF = cancer_data['Class']
cancer_data = cancer_data.drop(['Class' ,'Sample code'], axis = 1)
# Standardization of data
standardized = StandardScaler().fit_transform(cancer_data)
x = pd.DataFrame(standardized, columns = cancer_data.columns)
However, when I check the mean values, I get the following output:
array([-5.08256606e-17, -9.14861892e-17, -3.04953964e-17, 5.08256606e-17,
5.08256606e-17, -8.13210570e-17, 3.04953964e-17, -1.32146718e-16,
-8.13210570e-17])
I'm not too sure what I'm doing wrong for these values to come out like this, so any help is much appreciated (I'm new to data mining).
Use the standardization formula directly:
# column = the column to standardize
df_std[column] = (df_std[column] - df_std[column].mean()) / df_std[column].std()
or:
from sklearn.preprocessing import StandardScaler

# create a scaler object
std_scaler = StandardScaler()
# fit and transform the data (df_cars is your dataframe of numeric columns)
df_std = pd.DataFrame(std_scaler.fit_transform(df_cars), columns=df_cars.columns)
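As for the means you posted: values on the order of 1e-17 are zero up to floating-point rounding, so the standardization actually worked. A quick sanity check (a sketch, run against the standardized frame x from your code):
import numpy as np

# Means of standardized columns are zero to within floating-point error,
# and standard deviations are one (StandardScaler uses the population std, ddof=0).
print(np.allclose(x.mean(axis=0), 0.0))         # True
print(np.allclose(x.std(axis=0, ddof=0), 1.0))  # True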
Read for more information:
https://towardsdatascience.com/data-normalization-with-pandas-and-scikit-learn-7c1cc6ed6475
For the dataset that I am working with, the categorical variables are ordinal, ranging from 1 to 5 for three columns. I am going to be feeding this into XGBoost.
Would I be okay to just run this command and skip creating dummy variables:
ser = pd.Series([1, 2, 3], dtype='category')
ser = ser.to_frame()
ser = ser.T
I would like to know, conceptually: since the categorical data is ordinal, would simply converting it to type category be adequate for the model? I tried creating dummy variables, but all the values become 1.
As for the code above, it runs, but the following command returns 'numpy.int64':
type(ser[0][0])
Am I going about this correctly? Any help would be great!
Edit: updated code
Edit2: Normalizing the numerical data values. Is this logic correct?:
r = [1, 2, 3, 100 ,200]
scaler = preprocessing.StandardScaler()
r = preprocessing.scale(r)
r = pd.Series(r)
r = r.to_frame()
r = r.T
Edit3: This is the dataset.
Just setting categorical variables to dtype="category" is not sufficient and won't work.
You need to convert the categorical values to numerical labels, e.g. with pd.factorize(), where each category is assigned a numerical code.
Let's say df is your pandas dataframe. Then in general you could use this boilerplate code:
df_numeric = df.select_dtypes(exclude=['object'])
df_obj = df.select_dtypes(include=['object']).copy()

# factorize categorical columns columnwise
for c in df_obj:
    df_obj[c] = pd.factorize(df_obj[c])[0]

# if you want to one-hot encode, then add this line:
df_obj = pd.get_dummies(df_obj, prefix_sep='_', drop_first=True)

# merge dataframes back into one dataframe
df_final = pd.concat([df_numeric, df_obj], axis=1)
Since your categorical variables are already factorized (as far as I understand), you can skip the factorization and just try one-hot encoding.
See also this post on stats.stackexchange.
If you want to standardize/normalize your numerical data (not the categorical) use this function:
from sklearn import preprocessing
def scale_data(data, scale="robust"):
    x = data.values
    if scale == "minmax":
        scaler = preprocessing.MinMaxScaler()
    elif scale == "standard":
        scaler = preprocessing.StandardScaler()
    elif scale == "quantile":
        scaler = preprocessing.QuantileTransformer()
    elif scale == "robust":
        scaler = preprocessing.RobustScaler()
    else:
        raise ValueError("unknown scale: %s" % scale)
    x_scaled = scaler.fit_transform(x)
    return pd.DataFrame(x_scaled, columns=data.columns)
scaled_df = scale_data(df_numeric, "robust")
Putting it all together for your dataset:
from sklearn import preprocessing
df = pd.read_excel("default of credit card clients.xls", skiprows=1)
y = df['default payment next month'] #target variable
del df['default payment next month']
cat_idx = [2, 3, 4]  # indices of the categorical data columns
r = [x for x in range(0, 24) if x not in cat_idx]  # indices of all other columns
df_cat = df.iloc[:, cat_idx].copy()
df_con = df.iloc[:, r].copy()

# factorize categorical data
for c in df_cat:
    df_cat[c] = pd.factorize(df_cat[c])[0]
# scale continuous data
scaler = preprocessing.MinMaxScaler()
df_scaled = scaler.fit_transform(df_con)
df_scaled = pd.DataFrame(df_scaled, columns=df_con.columns)
df_final = pd.concat([df_cat, df_scaled], axis=1)
#reorder columns back to original order
cols = df.columns
df_final = df_final[cols]
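At this point df_final and y can be fed straight to XGBoost. A minimal hedged sketch (assuming the xgboost package is installed; the hyperparameters are illustrative, not tuned):
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df_final, y, test_size=0.2, random_state=42)

model = XGBClassifier(n_estimators=100)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))  # mean accuracy on the held-out split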
To further improve the code, do the train/test split before normalization: call fit_transform() on the training data and only transform() on the test data. Otherwise you will have a data leak.
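A minimal sketch of that leak-free version (reusing df_con and y from above):
from sklearn import preprocessing
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df_con, y, test_size=0.2, random_state=42)

scaler = preprocessing.MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)  # statistics learned from training data only
X_test_scaled = scaler.transform(X_test)        # the same statistics applied to test data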
I have a csv file which looks like this:
target,data
AAA,some text document
AAA;BBB,more text
AAC,more text
Here is the code:
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import BernoulliNB
import pandas as pd
pdf = pd.read_csv("Train.csv", sep=',')
pdfT = pd.read_csv("Test.csv", sep=',')
X1 = pdf['data']
Y1 = [[t for t in tar.split(';')] for tar in pdf['target']]
X2 = pdfT['data']
Y2 = [[t for t in tar.split(';')] for tar in pdfT['target']]
# Vectorize the data
hv = HashingVectorizer(stop_words='english', non_negative=True)
X1 = hv.transform(X1)
X2 = hv.transform(X2)
mlb = MultiLabelBinarizer()
mlb.fit(Y1+Y2)
Y1 = mlb.transform(Y1)
# mlb.classes_ looks like ['AAA','AAC','BBB',...] len(mlb.classes_)==1363
# Y1 looks like [[0,0,0,....0,0,0], ... ] now
# fit
clsf = OneVsRestClassifier(BernoulliNB(alpha=.001))
clsf.fit(X1,Y1)
# predict_proba
proba = clsf.predict_proba(X2)
# want to get class names back
classnames = mlb.inverse_transform(clsf.classes_) # booom, shit happens
for i in range(len(proba)):
    # get a {classname: probability} dict
    preDict = dict(zip(classnames, proba[i]))
    # sort dict by probability value, print actual and top 5 predicted results
    print(Y2[i], dict(sorted(preDict.items(), key=lambda d: d[1], reverse=True)[0:5]))
The problem is that after clsf.fit(X1, Y1),
clsf.classes_ is an int array [0, 1, 2, 3, ..., 1362].
Why is it not like Y1? How can I get the class names from clsf.classes_? Does mlb.classes_ == clsf.classes_ hold, in the same order?
When you fit OneVsRestClassifier with multilabel targets, a LabelBinarizer is called internally during the fit call, which converts the multilabels into a unique label for each class (hence the integer column indices you see in clsf.classes_).
You can access the label_binarizer_ attribute of the clsf object, which has a classes_ attribute containing the class definitions for the classes fit in the call to clsf.
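A short sketch using the mlb and clsf objects from your code. Since Y1 was produced by mlb.transform, its columns follow the order of mlb.classes_, so the string labels can be recovered from there:
# clsf.label_binarizer_ was fit on the indicator matrix Y1, so its
# classes_ are column indices (0..1362), matching clsf.classes_:
print(clsf.label_binarizer_.classes_[:5])  # [0 1 2 3 4]

# Y1's columns were produced by mlb.transform, in the order of mlb.classes_,
# so the original string names line up with the columns of predict_proba:
classnames = mlb.classes_
preDict = dict(zip(classnames, proba[0]))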
Data.csv:
param1,param2,param3,result
1,2,cat1,12
2,3,cat2,13
1,6,cat1,6
1,1,cat2,12
Suppose I read the data from the file and convert the categorical variable into dummy variables like this:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

data = pd.read_csv('data.csv')
type_dummies = pd.get_dummies(data.param3)  # param3 is the categorical column
data = pd.concat([data, type_dummies], axis=1)
I get a dataframe like:
1,2,1,0,..
1,6,0,1,..
I fit a simple linear regression on that dataset and obtained coefficients. How can I convert a new record (new_data = np.array([12, 19, 'cat1'])) into new_data = np.array([12, 19, 1, 0]) using pandas, so that its categorical variables are turned into the same dummy variables for use in my linear model?
Typically you'll want to set up a pipeline that records the correct category-to-code mapping.
from itertools import chain

import pandas as pd
from sklearn.base import TransformerMixin

class CategoricalTransformer(TransformerMixin):
    def fit(self, X, y=None, *args, **kwargs):
        self.columns_ = X.columns
        self.cat_columns_ = X.select_dtypes(include=['category']).columns
        self.non_cat_columns_ = X.columns.drop(self.cat_columns_)

        self.cat_map_ = {col: X[col].cat.categories
                         for col in self.cat_columns_}
        self.ordered_ = {col: X[col].cat.ordered
                         for col in self.cat_columns_}

        self.dummy_columns_ = {col: ["_".join([col, v])
                                     for v in self.cat_map_[col]]
                               for col in self.cat_columns_}
        self.transformed_columns_ = pd.Index(
            self.non_cat_columns_.tolist() +
            list(chain.from_iterable(self.dummy_columns_[k]
                                     for k in self.cat_columns_))
        )
        return self  # required so the transformer works inside a Pipeline

    def transform(self, X, y=None, *args, **kwargs):
        return (pd.get_dummies(X)
                .reindex(columns=self.transformed_columns_)
                .fillna(0))
More here.
With the pipeline sklearn.pipeline.make_pipeline(CategoricalTransformer(), LinearRegression()), your predict method should correctly translate from the categorical column to dummy variables.
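A hedged usage sketch (assuming the sample Data.csv above, with param3 as the categorical column):
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

data = pd.read_csv('data.csv')
data['param3'] = data['param3'].astype('category')  # the transformer keys off the 'category' dtype

X, y = data.drop('result', axis=1), data['result']
pipe = make_pipeline(CategoricalTransformer(), LinearRegression())
pipe.fit(X, y)

# A new record goes through the same dummy encoding automatically:
new_data = pd.DataFrame({'param1': [12], 'param2': [19],
                         'param3': pd.Categorical(['cat1'],
                                                  categories=X['param3'].cat.categories)})
print(pipe.predict(new_data))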