Binding outputs of transformers in FeatureUnion - python

New to Python and sklearn, so apologies in advance. I have two transformers and I would like to gather the results in a `FeatureUnion` (for a final modelling step at the end). This should be quite simple, but `FeatureUnion` is stacking the outputs rather than providing an nx2 array or DataFrame. In the example below I generate some data that is 10 rows by 2 columns. This then produces two features that are each 10 rows by 1 column. I would like the final feature union to have 10 rows and 2 columns, but what I get is 20 rows by 1 column.
I will try to demonstrate with my example below:
some imports
import numpy as np
import pandas as pd
from sklearn import pipeline
from sklearn.base import TransformerMixin
some random data
df = pd.DataFrame(np.random.rand(10, 2), columns=['a', 'b'])
a custom transformer that selects a column
class Trans(TransformerMixin):
    def __init__(self, col_name):
        self.col_name = col_name
    def fit(self, X):
        return self
    def transform(self, X):
        return X[self.col_name]
a pipeline that uses the transformer twice (in my real case I have two different transformers but this reproduces the problem)
pipe = pipeline.FeatureUnion([
    ('select_a', Trans('a')),
    ('select_b', Trans('b'))
])
Now I use the pipeline, but it returns an array of twice the length:
pipe.fit_transform(df).shape
(20,)
However, I would like an array with dimensions (10, 2).
Quick fix?

The transformers in the FeatureUnion need to return 2-dimensional matrices; in your code, by selecting a single column with X[self.col_name] you are returning a 1-dimensional vector (a Series). You can fix this by selecting the column with X[[self.col_name]], which returns a one-column DataFrame.
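For illustration, a minimal sketch of the corrected transformer (same class as in the question, only the column selection changed):
import numpy as np
import pandas as pd
from sklearn import pipeline
from sklearn.base import TransformerMixin

class Trans(TransformerMixin):
    def __init__(self, col_name):
        self.col_name = col_name
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        # double brackets keep the output 2-dimensional (a one-column DataFrame)
        return X[[self.col_name]]

df = pd.DataFrame(np.random.rand(10, 2), columns=['a', 'b'])
union = pipeline.FeatureUnion([('select_a', Trans('a')), ('select_b', Trans('b'))])
print(union.fit_transform(df).shape)  # (10, 2)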

Related

SHAP plotting waterfall using an index value in dataframe

I am working on a binary classification using the random forest algorithm.
Currently, I am trying to explain the model predictions using SHAP values.
So, I referred to this useful post here and tried the below.
from shap import TreeExplainer, Explanation
from shap.plots import waterfall
sv = explainer(ord_test_t)
exp = Explanation(sv.values[:, :, 1],
                  sv.base_values[:, 1],
                  data=ord_test_t.values,
                  feature_names=ord_test_t.columns)
idx = 20
waterfall(exp[idx])
I like the above approach as it allows me to display the feature values along with the waterfall plot, so I wish to use it.
However, this doesn't help me get the waterfall plot for a specific row of ord_test_t (the test data).
For example, let's say ord_test_t.index.tolist() returns 3, 5, 8, 9, etc.
Now, I want to plot the waterfall plot for the row whose index label is 9 (ord_test_t.loc[[9]]), but when I pass exp[9], it just gets the 9th positional row, not the row with index label 9.
When I try exp.iloc[[9]] it throws an error because the Explanation object doesn't have iloc.
Can you help me with this please?
My suggestion is as follows:
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
from shap import TreeExplainer, Explanation
from shap.plots import waterfall
import shap
print(shap.__version__)
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
idx = 9
model = RandomForestClassifier(max_depth=5, n_estimators=100).fit(X, y)
explainer = TreeExplainer(model)
sv = explainer(X.loc[[idx]]) # corrected, pass the row of interest as df
exp = Explanation(
    sv.values[:, :, 1],  # class to explain
    sv.base_values[:, 1],
    data=X.loc[[idx]].values,  # corrected, pass the row of interest as df
    feature_names=X.columns,
)
waterfall(exp[0]) # pretend you have only 1 data point which is 0th
0.40.0
Proof:
model.predict_proba(X.loc[[idx]]) # corrected
array([[0.95752656, 0.04247344]])
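If you would rather keep explaining the whole test frame at once (as in your original snippet), a hedged sketch of an alternative: translate the index label into a positional index with pandas' Index.get_loc and use that to slice the Explanation (explainer and ord_test_t are assumed to be your existing objects):
sv = explainer(ord_test_t)
exp = Explanation(sv.values[:, :, 1],
                  sv.base_values[:, 1],
                  data=ord_test_t.values,
                  feature_names=ord_test_t.columns)
pos = ord_test_t.index.get_loc(9)  # positional row for the index label 9
waterfall(exp[pos])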

Pass arguments to FunctionTransformer in Pipeline

I have been learning about sklearn preprocessing and pipelines and came across the concept of FunctionTransformer. I want to understand how, if one integrates it in a pipeline, one would pass arguments to the function that the FunctionTransformer refers to. Consider the example below; for simplicity, I have written a small function:
def return_selected_dataset(dataset, columns):
    return dataset[columns]

pipe = Pipeline([('Return_Col', FunctionTransformer(return_selected_dataset))])
pipe.fit_transform(dataset, columns=['Col1', 'Col2'])
I am getting the following error:
ValueError: Pipeline.fit does not accept the columns parameter. You can pass parameters to specific steps of your pipeline using the stepname__parameter format, e.g. `Pipeline.fit(X, y, logisticregression__sample_weight=sample_weight)`.
How can I pass the value of columns to the function?
Also, can someone suggest any book or website where I can study the sklearn pipelines and preprocessing in detail and how to customize these processes?
Example dataset:
import numpy as np
from sklearn.preprocessing import FunctionTransformer
from sklearn.pipeline import Pipeline
import pandas as pd
X = pd.DataFrame({'Col1':[1,2],'Col2':[3,4],'Col3':[5,6]})
Your function:
def return_selected_dataset(dataset, columns):
    return dataset[columns]
Without the pipeline, it would be like:
FunctionTransformer(return_selected_dataset,
                    kw_args={'columns': ['Col1', 'Col2']}).transform(X)
Note that with a pipeline, you can only pass parameters to the fit method of each step; see the help page:
**fit_params : dict of string -> object
Parameters passed to the fit method of each step, where each parameter name is prefixed such that parameter p for step s has key s__p.
So I think what you can do is:
pipe = Pipeline([
    ('Return_Col',
     FunctionTransformer(return_selected_dataset,
                         kw_args={'columns': ['Col1', 'Col2']}))
])
pipe.fit_transform(X)
   Col1  Col2
0     1     3
1     2     4
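As a hedged follow-up (standard Pipeline / FunctionTransformer behaviour, not part of the original answer): since kw_args is an ordinary estimator parameter, you can also change the selected columns later with set_params using the stepname__parameter naming, for example:
pipe.set_params(Return_Col__kw_args={'columns': ['Col1', 'Col3']})
pipe.fit_transform(X)  # now returns Col1 and Col3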

Limited factor loading output in python factor-analyzer

I am new here but I hope you guys can help me out.
I'm trying to conduct factor analysis with word vectors in python using the Factor-Analyzer module. I have a DataFrame with
100 columns and more than 15,000 rows.
I did not receive any error when I performed a factor analysis. Below is the output:
FactorAnalyzer(bounds=(0.005, 1), impute='median', is_corr_matrix=False,
               method='minres', n_factors=10, rotation='varimax',
               rotation_kwargs={}, use_smc=True)
But when I try to get the loadings, it only returns 100 rows. I want to get the loadings for all rows.
Here is my code:
import pandas as pd
from factor_analyzer import FactorAnalyzer
import matplotlib.pyplot as plt
import numpy as np
import pickle
factor_df = pd.read_pickle("word_vectors.pkl")
factor_df = pd.DataFrame(data=factor_df)
fa = FactorAnalyzer(n_factors=10, rotation='varimax')
fa.fit(factor_df)
loading = fa.loadings_
loadings_df = pd.DataFrame(fa.loadings_)
loadings_df
The pickle file for my dataset is here.
Factor loadings are the weights and correlations between each variable (column in your DataFrame) and the factor, so the fa.loadings_ object is an array with shape (number_of_variables, number_of_factors) - in your example (100, 10).
If you would like to get the transformed data with only 10 columns per row, you should call fa.transform(factor_df) after fa.fit(factor_df). The returned array will have shape (number_of_rows, number_of_factors) - in your example (15_000, 10).
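A minimal sketch of that last step (variable names match the question; the shapes assume your 15,000 x 100 frame):
fa = FactorAnalyzer(n_factors=10, rotation='varimax')
fa.fit(factor_df)
print(fa.loadings_.shape)         # (100, 10): one row per variable (column)
scores = fa.transform(factor_df)  # factor scores for every observation
print(scores.shape)               # (15000, 10): one row per original row
scores_df = pd.DataFrame(scores, index=factor_df.index)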

Select top n TFIDF features for a given document

I am working with TFIDF sparse matrices for document classification and want to retain only the top n (say 50) terms for each document (ranked by TFIDF score). See EDIT below.
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
tfidfvectorizer = TfidfVectorizer(analyzer='word', stop_words='english',
                                  token_pattern='[A-Za-z][\w\-]*', max_df=0.25)
n = 50
df = pd.read_pickle('my_df.pickle')
df_t = tfidfvectorizer.fit_transform(df['text'])
df_t
Out[15]:
<21175x201380 sparse matrix of type '<class 'numpy.float64'>'
with 6055621 stored elements in Compressed Sparse Row format>
I have tried following the example in this post, although my aim is not to display the features, but just to select the top n for each document before training. But I get a memory error as my data is too large to be converted into a dense matrix.
df_t_sorted = np.argsort(df_t.toarray()).flatten()[::1][n]
Traceback (most recent call last):
File "<ipython-input-16-e0a74c393ca5>", line 1, in <module>
df_t_sorted = np.argsort(df_t.toarray()).flatten()[::1][n]
File "C:\Users\Me\AppData\Local\Continuum\anaconda3\lib\site-packages\scipy\sparse\compressed.py", line 943, in toarray
out = self._process_toarray_args(order, out)
File "C:\Users\Me\AppData\Local\Continuum\anaconda3\lib\site-packages\scipy\sparse\base.py", line 1130, in _process_toarray_args
return np.zeros(self.shape, dtype=self.dtype, order=order)
MemoryError
Is there any way to do what I want without working with a dense representation (i.e. without the toarray() call) and without reducing the feature space too much more than I already have (with min_df)?
Note: the max_features parameter is not what I want as it only considers "the top max_features ordered by term frequency across the corpus" (docs here) and what I want is a document-level ranking.
EDIT: I wonder if the best way to address this problem is to set the values of all features except the n-best to zero. I say this because the vocabulary has already been calculated, so feature indices must remain the same, as I will want to use them for other purposes (e.g. to visualise the actual words that correspond to the n-best features).
A colleague wrote some code to retrieve the indices of the n highest-ranked features:
n = 2
tops = np.zeros((df_t.shape[0], n), dtype=int)  # store the top indices in a new array
for ind in range(df_t.shape[0]):
    # for each row (i.e. document), argsort the negated scores (argsort is ascending) and slice the top n
    tops[ind,] = np.argsort(-df_t[ind].toarray())[0, 0:n]
But from there, I would need to either:
- retrieve the list of remaining (i.e. lowest-ranked) indices and modify the values "in place", or
- loop through the original matrix (df_t) and set all values to 0 except for the n best indices in tops.
There is a post here explaining how to work with a csr_matrix, but I'm not sure how to put this into practice to get what I want.
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
vect = TfidfVectorizer(tokenizer=word_tokenize, ngram_range=(1, 2), binary=True, max_features=50)
TFIDF = vect.fit_transform(df['processed_cv_data'])
The max_features parameter passed in the TfidfVectorizer will pick out the top 50 features ordered by their term frequency but not by their Tf-idf score.
You can view the features by using:
print(vect.get_feature_names())
As you mention, the max_features parameter of the TfidfVectorizer is one way of selecting features.
If you are looking for an alternative way which takes the relationship to the target variable into account, you can use sklearn's SelectKBest. By setting k=50, this will filter your data for the best features. The metric to use for selection can be specified as the parameter score_func.
Example:
from sklearn.feature_selection import SelectKBest
tfidfvectorizer = TfidfVectorizer(analyzer='word', stop_words='english',
                                  token_pattern='[A-Za-z][\w\-]*', max_df=0.25)
df_t = tfidfvectorizer.fit_transform(df['text'])
df_t_reduced = SelectKBest(k=50).fit_transform(df_t, df['target'])
You can also chain it in a pipeline:
pipeline = Pipeline([("vectorizer", TfidfVectorizer()),
                     ("feature_reduction", SelectKBest(k=50)),
                     ("classifier", classifier)])
You could break your matrix into chunks to limit memory use, then just concatenate the results:
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.datasets import fetch_20newsgroups

data = fetch_20newsgroups(subset='train').data
tfidfvectorizer = TfidfVectorizer(analyzer='word', stop_words='english',
                                  token_pattern='[A-Za-z][\w\-]*', max_df=0.25)
df_t = tfidfvectorizer.fit_transform(data)

n = 10
df_top = [np.argsort(-df_t[i: i+500, :].toarray(), axis=1)[:, :n]  # negate so the largest scores come first
          for i in range(0, df_t.shape[0], 500)]
np.concatenate(df_top, axis=0).shape
>> (11314, 10)
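If, as the EDIT asks, you want to keep the original matrix shape and simply zero out everything except each document's top n scores, here is a hedged sketch that works directly on the CSR internals (no dense conversion; the function name is illustrative):
import numpy as np

def keep_top_n_per_row(X, n):
    # return a copy of CSR matrix X with all but the n largest values per row set to 0
    X = X.tocsr().copy()
    for row in range(X.shape[0]):
        start, end = X.indptr[row], X.indptr[row + 1]
        row_data = X.data[start:end]  # a view, so assignments modify X.data
        if row_data.size > n:
            drop = np.argsort(row_data)[:-n]  # indices of everything except the n largest stored values
            row_data[drop] = 0
    X.eliminate_zeros()  # remove the explicit zeros from the sparse structure
    return X

df_t_top = keep_top_n_per_row(df_t, 50)  # same shape and vocabulary indices as df_t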

How can I get the feature names from sklearn TruncatedSVD object?

I have the following code
import pandas as pd
import numpy as np
from sklearn.decomposition import TruncatedSVD

df = pd.DataFrame(np.random.randn(1000, 25), columns=list('ABCDEFGHIJKLMOPQRSTUVWXYZ'))

def reduce(dim):
    svd = TruncatedSVD(n_components=dim, n_iter=7, random_state=42)
    return svd.fit(df)

fitted = reduce(5)
How do I get the column names from fitted?
In continuation of Mikhail's post.
Assume that you already have feature_names from vectorizer.get_feature_names() and that you have called svd.fit(X).
Now you can also extract the sorted best feature names using the following code:
best_features = [feature_names[i] for i in svd.components_[0].argsort()[::-1]]
The above code takes the argsort of svd.components_[0] in descending order, looks up the corresponding entries in feature_names (all of the features), and constructs the best_features array.
Then you can see for example the 10 best features:
In[21]: best_features[:10]
Out[21]:
['manag',
'develop',
'busi',
'solut',
'initi',
'enterprise',
'project',
'program',
'process',
'plan']
The "column names" of fitted would be the SVD dimensions.
Each dimension is a linear combination of input features. To understand what a particular dimension means, take a look at the svd.components_ array - it contains the matrix of coefficients the input features are multiplied by.
Your original example, slightly changed:
import pandas as pd
import numpy as np
from sklearn.decomposition import TruncatedSVD

feature_names = list('ABCDEF')
df = pd.DataFrame(
    np.random.randn(1000, len(feature_names)),
    columns=feature_names
)

def reduce(dim):
    svd = TruncatedSVD(n_components=dim, n_iter=7, random_state=42)
    return svd.fit(df)

svd = reduce(3)
Then you can do something like that to get a more readable SVD dimension name - let's compute it for 0th dimension:
" ".join([
"%+0.3f*%s" % (coef, feat)
for coef, feat in zip(svd.components_[0], feature_names)
])
It shows +0.170*A -0.564*B -0.118*C +0.367*D +0.528*E +0.475*F - this is a "feature name" you can use for a 0th SVD dimension in this case (of course, coefficients depend on data, so feature name also depends on data).
If you have many input dimensions you may trade some "precision" for inspectability, e.g. sort the coefficients and use only the few largest ones. A more elaborate example can be found in https://github.com/TeamHG-Memex/eli5/pull/208 (disclaimer: I'm one of eli5 maintainers; the pull request is not by me).
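To extend that idea to every component at once, a hedged sketch building on the snippet above (top_k is an illustrative parameter, not from the original answer): keep only the few largest-magnitude coefficients per dimension.
top_k = 3  # how many input features to show per SVD dimension
dim_names = []
for comp in svd.components_:
    top = np.argsort(-np.abs(comp))[:top_k]  # indices of the largest-magnitude coefficients
    dim_names.append(" ".join("%+0.3f*%s" % (comp[i], feature_names[i]) for i in top))
print(dim_names)  # one readable "feature name" per SVD dimension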
