I have a csv file which looks like this:
target,data
AAA,some text document
AAA;BBB,more text
AAC,more text
Here is the code:
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import BernoulliNB
import pandas as pd
pdf = pd.read_csv("Train.csv", sep=',')
pdfT = pd.read_csv("Test.csv", sep=',')
X1 = pdf['data']
Y1 = [[t for t in tar.split(';')] for tar in pdf['target']]
X2 = pdfT['data']
Y2 = [[t for t in tar.split(';')] for tar in pdfT['target']]
# Vectorizer data
hv = HashingVectorizer(stop_words='english', non_negative=True)
X1 = hv.transform(X1)
X2 = hv.transform(X2)
mlb = MultiLabelBinarizer()
mlb.fit(Y1+Y2)
Y1 = mlb.transform(Y1)
# mlb.classes_ looks like ['AAA','AAC','BBB',...] len(mlb.classes_)==1363
# Y1 looks like [[0,0,0,....0,0,0], ... ] now
# fit
clsf = OneVsRestClassifier(BernoulliNB(alpha=.001))
clsf.fit(X1,Y1)
# predict_proba
proba = clsf.predict_proba(X2)
# want to get class names back
classnames = mlb.inverse_transform(clsf.classes_) # booom, shit happens
for i in range(len(proba)):
# get classnames,probability dict
preDict = dict(zip(classnames, proba[i]))
# sort dict by probability value, print actual and top 5 predict results
print(Y2[i], dict(sorted(preDict.items(),key=lambda d:d[1],reverse=True)[0:5]))
The problem is after clsf.fit(X1,Y1)
clsf.classes_ is an int array [0,1,2,3,...1362]
why is it not like Y1? How can I get the classnames from clsf.classes_? mlb.classes_ == clsf.classes_ or not, with same order?
When you fit OneVsRestClassifier with multiple labels a LabelBinarizer is called during the fit call, which will convert the the multilabels into unique labels for each class.
You can access the label_binarizer_ attribute of the clsf object, which has an attribute for classes that will contain the class definition for classes fit in the call to clsf.
Related
Using DB scan I am iterating through a csv with x, y, z, and id data. I would like to generate a new csv for a every combination of eps and minimum samples that occur within a set range.
The output csv should also include the number of points in the cluster, eps value, and minimum samples used as columns along with the original x, y, z, and id data. The code below will do this, but will only create a csv for the last cluster calculated.
%matplotlib notebook
from interp import *
import pandas as pd
import pandas as pdfrom sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors
from sklearn.metrics import silhouette_score
import itertools
points_file = "examples/example-1/fcr.csv"
eps_values = np.arange(0.1,0.5,0.1)
min_samples = np.arange(5,20)
dbscan_params = list(itertools.product(eps_values, min_samples))
cluster_n = []
epsvalues = []
min_samp = []
for p in dbscan_params:
dbscan_cluster = DBSCAN(eps=p[0],
min_samples=p[1]).fit(points)
epsvalues.append(p[0])
min_samp.append(p[1]), cluster_n.append(
len(np.unique(dbscan_cluster.labels_)))
df = pd.read_csv(points_file, header=None, names=["x", "y", "z", "id"])
df["cluster"] = clustering.labels_
df["eps"] = eps
df["min_sample"] = min_sam
csv_name = f'fcr_{eps}e{min_sam}m.csv'
df.to_csv(csv_name, index=False)
Why does the predict_proba function give 2 columns?
I looked to this website:
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression.predict_proba
However, it just says returns: T: array-like of shape (n_samples, n_classes)
Returns the probability of the sample for each class in the model, where classes are ordered as they are in self.classes_.
I still don't understand why the output always returns 2 columns.
import numpy as np
import pandas as pd
from pylab import rcParams
import seaborn as sb
from sklearn.preprocessing import scale
from collections import Counter
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
%matplotlib inline
rcParams['figure.figsize'] = 5,4
sb.set_style('whitegrid')
from sklearn.linear_model import LogisticRegression
import os
cwd = os.getcwd()
file_path = cwd + '\\Default.xlsx'
default_data = pd.read_excel(file_path)
default_data = pd.read_excel('Default.xlsx')
default_data = default_data.drop(['Unnamed: 0'], axis=1)
default_data['default_factor'] = default_data.default.factorize()[0]
default_data['student_factor'] = default_data.student.factorize()[0]
X = default_data[['balance']]
y = default_data['default_factor']
lr = LogisticRegression()
lr.fit(X, y)
X_pred = np.linspace(start = 0, stop = 3000, num = 2).reshape(-1,1)
y_pred = lr.predict_proba(X_pred)
X_pred
X_pred.shape
y_pred.shape
Short answer
In every column it gives you information about the probability, that sample belong to this class (zero column shows the probability for belonging to class 0, first column shows the probability for belonging to class 1 and so on)
Detailed answer
Let's say that y_pred.shape gives you shape (2, 2) means, that you have 2 samples and 2 classes.
let's say that your X_pred looks like this:
In: print(X_pred)
Out: [[ 0.],
[3000.]]
that means that you have two samples:
sample one, with only feature x = [0] and
sample two, with only feature x = [3000]
let's say that output of your prediction looks like this:
In: print(y_pred)
Out: [[0.28, 0.72]
[0.65, 0.35]]
so it means, that sample one most probably belongs to class = 1 (first row tells you that it could be class 0 with probability 28% and class 1 with probability 72%)
and sample two most probably belongs to class = 0 (second row tells you that it could be class 0 with probability 65% and class 1 with probability 35%)
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder,StandardScaler
from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.linear_model import LinearRegression
df = pd.DataFrame({'brand' : ['aaaa', 'asdfasdf', 'sadfds', 'NaN'],
'category' : ['asdf','asfa','asdfas','as'],
'num1' : [1, 1, 0, 0] ,
'target' : [0.2,0.11,1.34,1.123]})
train_continuous_cols = df.select_dtypes(include=["int64","float64"]).columns.tolist()
train_categorical_cols = df.select_dtypes(include=["object"]).columns.tolist()
preprocess = make_column_transformer(
(StandardScaler(),train_continuous_cols),
(OneHotEncoder(), train_categorical_cols)
)
df= preprocess.fit_transform(df)
Just trying to get all the feature names:
preprocess.get_feature_names()
Getting this error:
Transformer standardscaler (type StandardScaler) does not provide get_feature_names
How can I solve it? The examples online use pipeline and I'm trying to avoid that.
The following re-implementation of the ColumnTransformer returns a pandas DataFrame. Note that it should only be used if you input a pandas DataFrame to your pipeline.
All kudos go to Johannes Haupt who provided the get_feature_names() function that is resilient to transformers that don't have this function (see blogpost Extracting Column Names from the ColumnTransformer). I commented off the warnings because I did not want them and also pre-prending the transformation step to the column name; but it is easy to un-comment as you like.
#import warnings
import sklearn
import pandas as pd
class ColumnTransformerWithNames(ColumnTransformer):
def get_feature_names(column_transformer):
"""Get feature names from all transformers.
Returns
-------
feature_names : list of strings
Names of the features produced by transform.
"""
# Remove the internal helper function
#check_is_fitted(column_transformer)
# Turn loopkup into function for better handling with pipeline later
def get_names(trans):
# >> Original get_feature_names() method
if trans == 'drop' or (
hasattr(column, '__len__') and not len(column)):
return []
if trans == 'passthrough':
if hasattr(column_transformer, '_df_columns'):
if ((not isinstance(column, slice))
and all(isinstance(col, str) for col in column)):
return column
else:
return column_transformer._df_columns[column]
else:
indices = np.arange(column_transformer._n_features)
return ['x%d' % i for i in indices[column]]
if not hasattr(trans, 'get_feature_names'):
# >>> Change: Return input column names if no method avaiable
# Turn error into a warning
# warnings.warn("Transformer %s (type %s) does not "
# "provide get_feature_names. "
# "Will return input column names if available"
# % (str(name), type(trans).__name__))
# For transformers without a get_features_names method, use the input
# names to the column transformer
if column is None:
return []
else:
return [#name + "__" +
f for f in column]
return [#name + "__" +
f for f in trans.get_feature_names()]
### Start of processing
feature_names = []
# Allow transformers to be pipelines. Pipeline steps are named differently, so preprocessing is needed
if type(column_transformer) == sklearn.pipeline.Pipeline:
l_transformers = [(name, trans, None, None) for step, name, trans in column_transformer._iter()]
else:
# For column transformers, follow the original method
l_transformers = list(column_transformer._iter(fitted=True))
for name, trans, column, _ in l_transformers:
if type(trans) == sklearn.pipeline.Pipeline:
# Recursive call on pipeline
_names = column_transformer.get_feature_names(trans)
# if pipeline has no transformer that returns names
if len(_names)==0:
_names = [#name + "__" +
f for f in column]
feature_names.extend(_names)
else:
feature_names.extend(get_names(trans))
return feature_names
def transform(self, X):
indices = X.index.values.tolist()
original_columns = X.columns.values.tolist()
X_mat = super().transform(X)
new_cols = self.get_feature_names()
new_X = pd.DataFrame(X_mat.toarray(), index=indices, columns=new_cols)
return new_X
def fit_transform(self, X, y=None):
super().fit_transform(X, y)
return self.transform(X)
Then you can replace the calls to ColumnTransformer to ColumnTransformerWithNames. The output is a DataFrame and this step now has a working get_feature_names().
I am assuming you are looking for ways to access the result of the transformer, which yields a numpy array.
ColumnTransfomer has an attribute called transformers_ :`
From the documentation:
transformers_ : list
The collection of fitted transformers as tuples of
(name, fitted_transformer, column). `fitted_transformer` can be an
estimator, 'drop', or 'passthrough'. In case there were no columns
selected, this will be the unfitted transformer.
If there are remaining columns, the final element is a tuple of the
form:
('remainder', transformer, remaining_columns) corresponding to the
``remainder`` parameter. If there are remaining columns, then
``len(transformers_)==len(transformers)+1``, otherwise
``len(transformers_)==len(transformers)``.
So that provides unfortunately only information on the transformer itself and the column it has been applied to, however not on the location of the resulting data except for the following :
notes: The order of the columns in the transformed feature matrix follows the
order of how the columns are specified in the transformers list.
So we know that the order of the output columns is the same as the order in which the columns are specified in the transformers list. Plus, we also know for our transformer steps how much columns they yield, as a StandardScaler() yields the same number of columns as the original data and OneHotEncoder() yields number of columns equal to the number of categories.
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder,StandardScaler
from sklearn.compose import ColumnTransformer, make_column_transformer
df = pd.DataFrame({'brand' : ['aaaa', 'asdfasdf', 'sadfds', 'NaN'],
'category' : ['asdf','asfa','asdfas','asd'],
'num1' : [1, 1, 0, 0] ,
'target' : [0.2,0.11,1.34,1.123]})
train_continuous_cols = df.select_dtypes(include=["int64","float64"]).columns.tolist()
train_categorical_cols = df.select_dtypes(include=["object"]).columns.tolist()
# get n_categories for categorical features
n_categories = [df[x].nunique() for x in train_categorical_cols]
preprocess = make_column_transformer(
(StandardScaler(),train_continuous_cols),
(OneHotEncoder(), train_categorical_cols)
)
preprocessed_df = preprocess.fit_transform(df)
# the scaler yield 1 column each
indexes_scaler = list(range(0,len(train_continuous_cols)))
# the encoder yields a number of columns equal to the number of categories in the data
cum_index_encoder = [0] + list(np.cumsum(n_categories))
# the encoder indexes come after the scaler indexes
start_index_encoder = indexes_scaler[-1]+1
indexes_encoder = [x + start_index_encoder for x in cum_index_encoder]
# get both lower and uper bound of index
index_pairs= zip (indexes_encoder[:-1],indexes_encoder[1:])
This results in the following output:
print ('Transformed {} continious cols resulting in a df with shape:'.format(len(train_continuous_cols)))
print (preprocessed_df[: , indexes_scaler].shape)
Transformed 2 continious cols resulting in a df with shape:
(4, 2)
for column, (start_id, end_id) in zip (train_categorical_cols,index_pairs):
print('Transformed column {} resulted in a df with shape:'.format(column))
print(preprocessed_df[:, start_id:end_id].shape)
Transformed column brand resulted in a df with shape:
(4, 4)
Transformed column category resulted in a df with shape:
(4, 4)
I have a lookup table created using -
lookupTable, data_training_panda_y_indexed = np.unique(data_training_panda_y, return_inverse=True)
However, I want to apply the lookupTable on a different array data_cross_validation_panda_y
data_training_panda_y is a list of strings which can be these values - Incoming, Outgoing, Neutral.
So, lookUpTable is ndArray ('Incoming' 'Outgoing 'Neutral')
Code so far -
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from numpy import dtype
from _codecs import lookup
#Load data
data = np.genfromtxt('../Data/bezdekIris.csv',delimiter=',',usecols=[0,1,2,3,4],dtype=None)
labels = np.genfromtxt('../Data/bezdekIris.csv',delimiter=',',usecols=[4],dtype=None)
#Shuffle the rows
np.random.shuffle(data)
#Cut the data into 3 parts
data_rows = np.size(data, 0)
training_rows = int(round(0.6*data_rows))
cross_validation_rows = int(round(0.2*data_rows))
testing_rows = data_rows - training_rows - cross_validation_rows
data_training_panda = pd.DataFrame(data[:training_rows])
data_training_panda_X = data_training_panda.iloc[:,0:4]
data_training_panda_y = data_training_panda.iloc[:,4]
data_cross_validation_panda = pd.DataFrame(data[training_rows:training_rows+cross_validation_rows])
data_cross_validation_panda_X = data_cross_validation_panda.iloc[:,0:4]
data_cross_validation_panda_y = data_cross_validation_panda.iloc[:,4]
data_testing_panda = pd.DataFrame(data[training_rows+cross_validation_rows:])
data_testing_panda_X = data_testing_panda.iloc[:,0:4]
data_testing_panda_y = data_testing_panda.iloc[:,4]
#Take out the labels from the 3 parts
lookupTable, data_training_panda_y_indexed = np.unique(data_training_panda_y, return_inverse=True)
#Label the CV and Testing
data_cross_validation_panda_y_indexed = np.array([])
data_testing_panda_y_indexed = np.array([])
bezdekIris.csv Sample Data -
5.1,3.5,1.4,0.2,Incoming
4.9,3.0,1.4,0.2,Outgoing
4.7,3.2,1.3,0.2,Netural
Using searchsorted could be a solution.
data_cross_validation_panda_y_indexed = np.searchsorted(lookupTable, data_cross_validation_panda_y)
I'm wondering if there is a concise way to run ML (e.g KMeans) on a DataFrame in pyspark if I have the features in multiple numeric columns.
I.e. as in the Iris dataset:
(a1=5.1, a2=3.5, a3=1.4, a4=0.2, id=u'id_1', label=u'Iris-setosa', binomial_label=1)
I'd like to use KMeans without recreating the DataSet with the feature vector added manually as a new column and the original columns hardcoded repeatedly in the code.
The solution I'd like to improve:
from pyspark.mllib.linalg import Vectors
from pyspark.sql.types import Row
from pyspark.ml.clustering import KMeans, KMeansModel
iris = sqlContext.read.parquet("/opt/data/iris.parquet")
iris.first()
# Row(a1=5.1, a2=3.5, a3=1.4, a4=0.2, id=u'id_1', label=u'Iris-setosa', binomial_label=1)
df = iris.map(lambda r: Row(
id = r.id,
a1 = r.a1,
a2 = r.a2,
a3 = r.a3,
a4 = r.a4,
label = r.label,
binomial_label=r.binomial_label,
features = Vectors.dense(r.a1, r.a2, r.a3, r.a4))
).toDF()
kmeans_estimator = KMeans()\
.setFeaturesCol("features")\
.setPredictionCol("prediction")\
kmeans_transformer = kmeans_estimator.fit(df)
predicted_df = kmeans_transformer.transform(df).drop("features")
predicted_df.first()
# Row(a1=5.1, a2=3.5, a3=1.4, a4=0.2, binomial_label=1, id=u'id_1', label=u'Iris-setosa', prediction=1)
I'm looking for a solution, which is something like:
feature_cols = ["a1", "a2", "a3", "a4"]
prediction_col_name = "prediction"
<dataframe independent code for KMeans>
<New dataframe is created, extended with the `prediction` column.>
You can use VectorAssembler:
from pyspark.ml.feature import VectorAssembler
ignore = ['id', 'label', 'binomial_label']
assembler = VectorAssembler(
inputCols=[x for x in df.columns if x not in ignore],
outputCol='features')
assembler.transform(df)
It can be combined with k-means using ML Pipeline:
from pyspark.ml import Pipeline
pipeline = Pipeline(stages=[assembler, kmeans_estimator])
model = pipeline.fit(df)