Undo L2 Normalization in sklearn python - python

Once I normalized my data with an sklearn l2 normalizer and use it as training data:
How do I turn the predicted output back to the "raw" shape?
In my example I used normalized housing prices as y and normalized living space as x. Each used to fit their own X_ and Y_Normalizer.
The y_predict is in therefore also in the normalized shape, how do I turn in into the original raw currency state?
Thank you.

If you are talking about sklearn.preprocessing.Normalizer, which normalizes matrix lines, unfortunately there is no way to go back to original norms unless you store them by hand somewhere.
If you are using sklearn.preprocessing.StandardScaler, which normalizes columns, then you can obtain the values you need to go back in the attributes of that scaler (mean_ if with_mean is set to True and std_)
If you use the normalizer in a pipeline, you wouldn't need to worry about this, because you wouldn't modify your data in place:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer
# classifier example
from sklearn.svm import SVC
pipeline = make_pipeline(Normalizer(), SVC())

Thank you very much for your answer, I didn't know about the pipeline feature before
For the case of L2 normalization turns out you can do it manually.
Here is one example for a small array:
x = np.array([5, 8 , 12, 15])
#Using Sklearn
normalizer_x = preprocessing.Normalizer(norm = "l2").fit(x)
x_norm = normalizer_x.transform(x)[0]
print x_norm
>array([ 0.23363466, 0.37381545, 0.56072318, 0.70090397])
Or do it manually with the weight of the squareroot of the squaresum:
#Manually
w = np.sqrt(sum(x**2))
x_norm2 = x/w
print x_norm2
>array([ 0.23363466, 0.37381545, 0.56072318, 0.70090397])
So turning them "back" to the raw formate is simple by multiplying with "w".

Related

Apply coefficients from n degree polynomial to formula

I use a sklearn LinearRegression()estimator, with 5 variables
['feat1', 'feat2', 'feat3', 'feat4', 'feat5']
In order to predict a continuous value.
Estimator returns the list of coefficient values and the bias:
linear = LinearRegression()
print(linear.coef_)
print(linear.intercept_)
[ 0.18799409 -0.05406106 -0.01327966 -0.13348129 -0.00614054]
-0.011064865422734674
Then, given the fact I have each feature as variables, I can hardcode the coefficients into a linear formula and estimate my values, like so:
val = ((0.18799409*feat1) - (0.05406106*feat2) - (0.01327966*feat3) - (0.13348129*feat4) - (0.00614054*feat5)) -0.011064865422734674
Now lets say I use a polynomial regression of degree 2, using a pipeline, and by printing:
model = Pipeline(steps=[
('scaler',StandardScaler()),
('polynomial_features', PolynomialFeatures(degree=degree, include_bias=False)),
('linear_regression', LinearRegression())])
#fit model
model.fit(X_train, y_train)
print(model['linear_regression'].coef_)
print(model['linear_regression'].intercept_)
I get:
[ 7.06524186e-01 -2.98605001e-02 -4.67175212e-02 -4.86890790e-01
-1.06320101e-02 -2.77958604e-03 -3.38253025e-04 -7.80563090e-03
4.51356888e-03 8.32036733e-03 3.57638244e-02 -2.16446849e-02
-7.92169287e-02 3.36809467e-02 -6.60531497e-03 2.16613331e-02
2.10097993e-02 3.49970303e-02 -3.02970698e-02 -7.81462599e-03]
0.011042927069084668
How do I transform the formula above in order to calculate val from regression, with values from .coef_ and .intercept_, using array indexing instead of hardcoding the values, for any 'n' degree ?
Is there any scipy or numpy method suited for that?
It's important to note that polynomial regression is just an extended case of linear regression, thus all we need to do is transform our input data consistently. For any N we can use the PolynomialFeatures from sklearn.preprocessing. From using dummy data, we can see how this would work:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
#set parameters
X = np.stack([np.arange(i,i+10) for i in range(5)]).T
Y = np.random.randn(10)*10+3
N = 2
poly_reg=PolynomialFeatures(degree=N,include_bias=False)
X_poly=poly_reg.fit_transform(X)
#print(X[0],X_poly[0]) #to check parameters, note that it includes the y intercept as an input of 1
poly = LinearRegression().fit(X_poly, Y)
And thus, we can get the coef_ the way you were doing before, and simply perform a matrix multiplication to get the regressed value.
new_dat = poly_reg.transform(np.arange(2,2+10,2)[None]) #5 new datapoints
np.testing.assert_array_equal(poly.predict(new_dat),new_dat # poly.coef_ + poly.intercept_)
----EDIT----
In case you cannot use the transform for PolynomialFeatures, it's just a iterated combination loop to generate the data from your list of features.
new_feats = np.array([feat1,feat2,feat3,feat4,feat5])
from itertools import combinations_with_replacement
def gen_poly_feats(x,N):
#this function returns all unique groupings (w/ replacement) of the indices into the array x for use in polynomial regression.
return np.concatenate([[np.product(x[np.array(i)]) for i in list(combinations_with_replacement(range(len(x)), n))] for n in range(1,N+1)])[None]
new_feats_poly = gen_poly_feats(new_feats,N)
# just to be sure that this matches...
np.testing.assert_array_equal(new_feats_poly,poly_reg.transform(new_feats[None]))
#then we can use the above linear regression model to predict the new data
val = new_feats_poly # poly.coef_ + poly.intercept_

Issue with Scikit-learn data analysis

am attempting to take a .dat file of about 90,000 data lines of two variables (wavelength and intensity) and apply a sklearn.pca filter to it.
Here is a small set of that data:
wavelength intensity
[um] [W/m**2/um/sr]
196.078431372549 1.108370393265022E-003
192.307692307692 1.163428008597600E-003
188.679245283019 1.223639983609668E-003
The code I am using to analyze the data is below
pca= PCA(n_components=2)
pca.fit(data)
print(pca.components_)
The error code I get is this when I try to apply 2 pca components to one of the data sets:
ValueError: Datatype coercion is not allowed
Any help resolving would be much appreciated
I think in your case, the problem is the column name, especially [W/m**2/um/sr].
Also when using PCA, do not forget to rescale the input variables into "comparable" units using StandardScaler.
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
data = pd.DataFrame({'wavelength [um]': [196.078431372549, 1.108370393265022E-003, 192.307692307692], 'intensity [W/m**2/um/sr]': [1.163428008597600E-003, 188.679245283019, 1.223639983609668E-003]})
scaler = StandardScaler(with_mean=True, with_std=True)
pca= PCA(n_components=2)
pca.fit(scaler.fit_transform(data))
print(pca.components_)
Worked well for me. Maybe you just need to specify:
data.columns = data.columns.astype(str)

NLP in Python: Obtain word names from SelectKBest after vectorizing

I can't seem to find an answer to my exact problem. Can anyone help?
A simplified description of my dataframe ("df"): It has 2 columns: one is a bunch of text ("Notes"), and the other is a binary variable indicating if the resolution time was above average or not ("y").
I did bag-of-words on the text:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(lowercase=True, stop_words="english")
matrix = vectorizer.fit_transform(df["Notes"])
My matrix is 6290 x 4650. No problem getting the word names (i.e. feature names) :
feature_names = vectorizer.get_feature_names()
feature_names
Next, I want to know which of these 4650 are most associated with above average resolution times; and reduce the matrix I may want to use in a predictive model. I do a chi-square test to find the top 20 most important words.
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
selector = SelectKBest(chi2, k=20)
selector.fit(matrix, y)
top_words = selector.get_support().nonzero()
# Pick only the most informative columns in the data.
chi_matrix = matrix[:,top_words[0]]
Now I'm stuck. How do I get the words from this reduced matrix ("chi_matrix")? What are my feature names? I was trying this:
chi_matrix.feature_names[selector.get_support(indices=True)].tolist()
Or
chi_matrix.feature_names[features.get_support()]
These gives me an error: feature_names not found. What am I missing?
A
After figuring out really what I wanted to do (thanks Daniel) and doing more research, I found a couple other ways to meet my objective.
Way 1 - https://glowingpython.blogspot.com/2014/02/terms-selection-with-chi-square.html
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(lowercase=True,stop_words='english')
X = vectorizer.fit_transform(df["Notes"])
from sklearn.feature_selection import chi2
chi2score = chi2(X,df['AboveAverage'])[0]
wscores = zip(vectorizer.get_feature_names(),chi2score)
wchi2 = sorted(wscores,key=lambda x:x[1])
topchi2 = zip(*wchi2[-20:])
show=list(topchi2)
show
Way 2 - This is the way I used because it was the easiest for me to understand and produced a nice output listing the word, chi2 score, and p-value. Another thread on here: Sklearn Chi2 For Feature Selection
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
vectorizer = CountVectorizer(lowercase=True,stop_words='english')
X = vectorizer.fit_transform(df["Notes"])
y = df['AboveAverage']
# Select 10 features with highest chi-squared statistics
chi2_selector = SelectKBest(chi2, k=10)
chi2_selector.fit(X, y)
# Look at scores returned from the selector for each feature
chi2_scores = pd.DataFrame(list(zip(vectorizer.get_feature_names(), chi2_selector.scores_, chi2_selector.pvalues_)),
columns=['ftr', 'score', 'pval'])
chi2_scores
I had a similar problem recently, but I was not constricted to using the 20 most relevant words. Rather, I could select the words which had a chi score higher than a set threshold. I will give you the method I used to achieve this second task. The reason why this is preferable than just using the first n words accordingly to their chi-score, is that those 20 words may have an extremely low score and thus contribute next to nothing to the classification task.
Here is how I have done it for a binary classification task:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import chi2
THRESHOLD_CHI = 5 # or whatever you like. You may try with
# for threshold_chi in [1,2,3,4,5,6,7,8,9,10] if you prefer
# and measure the f1 scores
X = df['text']
y = df['labels']
cv = CountVectorizer()
cv_sparse_matrix = cv.fit_transform(X)
cv_dense_matrix = cv_sparse_matrix.todense()
chi2_stat, pval = chi2(cv_dense_matrix, y)
chi2_reshaped = chi2_stat.reshape(1,-1)
which_ones_to_keep = chi2_reshaped > THRESHOLD_CHI
which_ones_to_keep = np.repeat(which_ones_to_keep ,axis=0,repeats=which_ones_to_keep.shape[1])
The result is a matrix containing ones where the terms have a chi score higher than the threshold, and zeroes where they have a chi score lower than the threshold. This matrix can then be np.dot with either a cv matrix or a tfidf matrix, and subsequently passed to the fit method of a classifier.
If you do this, the columns of the matrix which_ones_to_keep correspond to the columns of the CountVectorizer object, and you can thus determine which terms were relevant for the given labels by comparing the non-zero columns of the which_ones_to_keep matrix to the indices of the .get_feature_names(), or you can just forget about it and pass it directly to a classifier.

xgboost.plot_tree: binary feature interpretation

I've built an XGBoost model and seek to examine the individual estimators. For reference, this was a binary classification task with discrete and continuous input features. The input feature matrix is a scipy.sparse.csr_matrix.
When I went to examine an individual estimator, however, I found difficulty interpreting the binary input features, such as f60150 below. The real-valued f60150 in the bottommost chart is easy to interpret - its criterion is in the expected range of that feature. However, the comparisons being made for the binary features, <X> < -9.53674e-07 doesn't make sense. Each of these features are either 1 or 0. -9.53674e-07 is a very small negative number, and I imagine this is just some floating-point idiosyncrasy within XGBoost or its underpinning plotting libraries, but it doesn't make sense to use that comparison when the feature is always positive. Can someone help me understand which direction (i.e. yes, missing vs. no corresponds to which true/false side of these binary feature nodes?
Here is a reproducible example:
import numpy as np
import scipy.sparse
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from xgboost import plot_tree, XGBClassifier
import matplotlib.pyplot as plt
def booleanize_csr_matrix(mat):
''' Convert sparse matrix with positive integer elements to 1s '''
nnz_inds = mat.nonzero()
keep = np.where(mat.data > 0)[0]
n_keep = len(keep)
result = scipy.sparse.csr_matrix(
(np.ones(n_keep), (nnz_inds[0][keep], nnz_inds[1][keep])),
shape=mat.shape
)
return result
### Setup dataset
res = fetch_20newsgroups()
text = res.data
outcome = res.target
### Use default params from CountVectorizer to create initial count matrix
vec = CountVectorizer()
X = vec.fit_transform(text)
# Whether to "booleanize" the input matrix
booleanize = True
# Whether to, after "booleanizing", convert the data type to match what's returned by `vec.fit_transform(text)`
to_int = True
if booleanize and to_int:
X = booleanize_csr_matrix(X)
X = X.astype(np.int64)
# Make it a binary classification problem
y = np.where(outcome == 1, 1, 0)
# Random state ensures we will be able to compare trees and their features consistently
model = XGBClassifier(random_state=100)
model.fit(X, y)
plot_tree(model, rankdir='LR'); plt.show()
Running the above with booleanize and to_int set to True yields the following chart:
Running the above with booleanize and to_int set to False yields the following chart:
Heck, even if I do a really simple example, I get the "right" results, regardless of whether X or y are integer or floating types.
X = np.matrix(
[
[1,0],
[1,0],
[0,1],
[0,1],
[1,1],
[1,0],
[0,0],
[0,0],
[1,1],
[0,1]
]
)
y = np.array([1,0,0,0,1,1,1,0,1,1])
model = XGBClassifier(random_state=100)
model.fit(X, y)
plot_tree(model, rankdir='LR'); plt.show()

How to find the features names of the coefficients using scikit linear regression?

I use scikit linear regression and if I change the order of the features, the coef are still printed in the same order, hence I would like to know the mapping of the feature with the coeff.
#training the model
model_1_features = ['sqft_living', 'bathrooms', 'bedrooms', 'lat', 'long']
model_2_features = model_1_features + ['bed_bath_rooms']
model_3_features = model_2_features + ['bedrooms_squared', 'log_sqft_living', 'lat_plus_long']
model_1 = linear_model.LinearRegression()
model_1.fit(train_data[model_1_features], train_data['price'])
model_2 = linear_model.LinearRegression()
model_2.fit(train_data[model_2_features], train_data['price'])
model_3 = linear_model.LinearRegression()
model_3.fit(train_data[model_3_features], train_data['price'])
# extracting the coef
print model_1.coef_
print model_2.coef_
print model_3.coef_
The trick is that right after you have trained your model, you know the order of the coefficients:
model_1 = linear_model.LinearRegression()
model_1.fit(train_data[model_1_features], train_data['price'])
print(list(zip(model_1.coef_, model_1_features)))
This will print the coefficients and the correct feature. (Tested with pandas DataFrame)
If you want to reuse the coefficients later you can also put them in a dictionary:
coef_dict = {}
for coef, feat in zip(model_1.coef_,model_1_features):
coef_dict[feat] = coef
(You can test it for yourself by training two models with the same features but, as you said, shuffled order of features.)
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
coef_table = pd.DataFrame(list(X_train.columns)).copy()
coef_table.insert(len(coef_table.columns),"Coefs",regressor.coef_.transpose())
#Robin posted a great answer, but for me I had to make one tweak on it to work the way I wanted, and it was to refer to the dimension of the 'coef_' np.array that I wanted, namely modifying to this: model_1.coef_[0,:], as below:
coef_dict = {}
for coef, feat in zip(model_1.coef_[0,:],model_1_features):
coef_dict[feat] = coef
Then the dict was created as I pictured it, with {'feature_name' : coefficient_value} pairs.
Here is what I use for pretty printing of coefficients in Jupyter. I'm not sure I follow why order is an issue - as far as I know the order of the coefficients should match the order of the input data that you gave it.
Note that the first line assumes you have a Pandas data frame called df in which you originally stored the data prior to turning it into a numpy array for regression:
fieldList = np.array(list(df)).reshape(-1,1)
coeffs = np.reshape(np.round(clf.coef_,5),(-1,1))
coeffs=np.concatenate((fieldList,coeffs),axis=1)
print(pd.DataFrame(coeffs,columns=['Field','Coeff']))
Borrowing from Robin, but simplifying the syntax:
coef_dict = dict(zip(model_1_features, model_1.coef_))
Important note about zip: zip assumes its inputs are of equal length, making it especially important to confirm that the lengths of the features and coefficients match (which in more complicated models might not be the case). If one input is longer than the other, the longer input will have the values in its extra index positions cut off. Notice the missing 7 in the following example:
In [1]: [i for i in zip([1, 2, 3], [4, 5, 6, 7])]
Out[1]: [(1, 4), (2, 5), (3, 6)]
pd.DataFrame(data=regression.coef_, index=X_train.columns)
All of these answers were great but what personally worked for me was this, as the feature names I needed were the columns of my train_date dataframe:
pd.DataFrame(data=model_1.coef_,columns=train_data.columns)
Right after training the model, the coefficient values are stored in the variable model.coef_[0]. We can iterate over the column names and store the column name and their coefficient value in a dictionary.
model.fit(X_train,y)
# assuming all the columns except last one is used in training
columns = data.iloc[:,-1].columns
coef_dict = {}
for i in range(0,len(columns)):
coef_dict[columns[i]] = model.coef_[0][i]
Hope this helps!
As of scikit-learn version 1.0, the LinearRegression estimator has a feature_names_in_ attribute. From the docs:
feature_names_in_ : ndarray of shape (n_features_in_,)
Names of features seen during fit. Defined only when X has feature names that are all strings.
New in version 1.0.
Assuming you're fitting on a pandas.DataFrame (train_data), your estimators (model_1, model_2, and model_3) will have the attribute. You can line up your coefficients using any of the methods listed in previous answers, but I'm in favor of this one:
coef_series = pd.Series(
data=model_1.coef_,
index=model_1.feature_names_in_
)
A minimally reproducible example
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
# for repeatability
np.random.seed(0)
# random data
Xy = pd.DataFrame(
data=np.random.random((10, 3)),
columns=["x0", "x1", "y"]
)
# separate X and y
X = Xy.drop(columns="y")
y = Xy.y
# initialize estimator
lr = LinearRegression()
# fit to pandas.DataFrame
lr.fit(X, y)
# get coeficients and their respective feature names
coef_series = pd.Series(
data=lr.coef_,
index=lr.feature_names_in_
)
print(coef_series)
x0 0.230524
x1 -0.275611
dtype: float64

Categories