recover column details after PCA and Kmeans - python

I ran KMeans clustering after reducing the numerical columns in my DataFrame from 5 to 2 using PCA, then plotted a scatterplot:
pc=PCA(n_components = 2).fit_transform(scaled_df)
scaled_df_PCA= pd.DataFrame(pc, columns=['pca_col1','pca_col2'])
#Then I did the KMeans and its plotting
label_PCA=final_km.fit_predict(scaled_df_PCA)
scaled_df_PCA["label_PCA_df"]=label_PCA
a=scaled_df_PCA[scaled_df_PCA.label_PCA_df==0]
b=scaled_df_PCA[scaled_df_PCA.label_PCA_df==1]
c=scaled_df_PCA[scaled_df_PCA.label_PCA_df==2]
sns.scatterplot(x=a.pca_col1, y=a.pca_col2, color="green")
sns.scatterplot(x=b.pca_col1, y=b.pca_col2, color="red")
sns.scatterplot(x=c.pca_col1, y=c.pca_col2, color="yellow")
I get 3 clusters from above based upon 2 columns reduced using PCA. Now I wish to get the columns back for further analysis of those clusters but I am not able to.
When I use pc.components_ I get this error:
AttributeError Traceback (most recent call last)
/tmp/ipykernel_33/4073743739.py in
----> 1 pc.components_
AttributeError: 'numpy.ndarray' object has no attribute 'components_'
or when I do scaled_df_PCA.components_
AttributeError: 'DataFrame' object has no attribute 'components_'
So I wanted to know how to recover details of columns back which were reduced during PCA.

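This line from your code stores the return value of fit_transform() (a NumPy ndarray of transformed data) in pc, rather than the PCA instance itself, so pc has no components_ attribute.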
pc=PCA(n_components = 2).fit_transform(scaled_df)
An easy fix is to create the PCA instance first and then call fit_transform().
pca = PCA(n_components=2)
df_transformed = pca.fit_transform(scaled_df)
Afterwards, you can still access attributes and methods of the PCA instance, pca.
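To make that concrete, here is a minimal sketch of how the fitted pca object can be used to relate the two components back to the original columns. The data, the column names c1..c5, and the names loadings, clustered, cluster_profile and approx_scaled are all made up for illustration; only PCA, KMeans and the pca_col1/pca_col2/label_PCA_df names come from the question.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the asker's data: 5 numeric columns.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(60, 5)), columns=["c1", "c2", "c3", "c4", "c5"])
scaled_df = pd.DataFrame(StandardScaler().fit_transform(df), columns=df.columns)

# Keep the fitted PCA object so its attributes stay reachable.
pca = PCA(n_components=2)
pc = pca.fit_transform(scaled_df)

# One row per principal component, one column per original feature.
loadings = pd.DataFrame(
    pca.components_, columns=scaled_df.columns, index=["pca_col1", "pca_col2"]
)

# Cluster on the reduced data, then attach the labels to the original rows.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(pc)
clustered = df.assign(label_PCA_df=labels)

# Per-cluster means expressed in the original columns.
cluster_profile = clustered.groupby("label_PCA_df").mean()

# Approximate reconstruction of the scaled columns from the 2 components.
approx_scaled = pd.DataFrame(pca.inverse_transform(pc), columns=scaled_df.columns)
```

Because PCA keeps the row order, the cluster labels line up with the rows of the original DataFrame, so the per-cluster analysis can be done on the original columns directly. The inverse_transform() result is only an approximation, since the information in the three dropped components is lost.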


Can't display bar plot with SHAP

I'm new to SHAP and trying to use it on top of my RandomForestClassifier. Here's the code snippet after I already ran clf.fit(train_x, train_y):
explainer = shap.Explainer(clf)
shap_values = explainer(train_x.to_numpy()[0:5, :])
shap.summary_plot(shap_values, plot_type='bar')
Here's the resulting plot:
Now, there are two problems with this. One is that it is not a bar plot even though I set the plot_type parameter. The other is that I seem to have lost my feature names somehow (and yes, they do exist on the dataframes when calling clf.fit()).
I tried replacing the last line with:
shap.summary_plot(shap_values, train_x.to_numpy()[0:5, :], plot_type='bar')
And that changed nothing. I also tried to replace it with the following to see if I could at least recover my feature names:
shap.summary_plot(shap_values, train_x.to_numpy()[0:5, :], feature_names=list(train_x.columns.values), plot_type='bar')
But that threw an error:
Traceback (most recent call last):
File "sklearn_model_runs.py", line 41, in <module>
main()
File "sklearn_model_runs.py", line 38, in main
shap.summary_plot(shap_values, train_x.to_numpy()[0:5, :], feature_names=list(train_x.columns.values), plot_type='bar')
File "C:\Users\kapoo\anaconda3\envs\sci\lib\site-packages\shap\plots\_beeswarm.py", line 554, in summary_legacy
feature_names=feature_names[sort_inds],
TypeError: only integer scalar arrays can be converted to a scalar index
I'm kind of at a loss at this point. I just tried it with 5 rows of the training set but want to use the whole thing once I get past this stumbling block. If it helps, the classifier had 5 labels and my SHAP version is 0.40.0.
Alright, here was the problem. Replace this:
shap_values = explainer(train_x.to_numpy()[0:5, :])
With this:
shap_values = explainer.shap_values(train_x) # Use whole thing as dataframe
Then you can use this during plotting:
feature_names=list(train_x.columns.values)
The documentation here should really be updated...

Map categorical data for logistic regression

I am trying to run a logistic regression, predicting income from age, num, and hours-per-week. The income column contains either <=50K or >50K. I tried to replace the categorical data with numeric values using the pandas map() function and received the error:
'DataFrame' object has no attribute 'map'. Then I tried adding .rdd (as shown below) but got the error:
'DataFrame' object has no attribute 'rdd'
import pandas as pd
import statsmodels.api as sm
adult_train = pd.read_csv("C:/.../adult_training.csv")
adult_test = pd.read_csv("C:/.../adult_test.csv")
# Separate data into predictor variables, X, and target variables, y:
X = pd.DataFrame(adult_train[['age', 'hours-per-week', 'num']])
X = sm.add_constant(X)
y = pd.DataFrame(adult_train[['income']]).rdd.map({'<=50K': 0, '>50K': 1}).astype(int)
logreg01 = sm.Logit(y, X).fit()
If you could please help me be able to run the last line of code, it would be really appreciated.
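One possible approach, sketched under assumptions: .rdd is a PySpark DataFrame attribute rather than a pandas one, and a dict-based map() is a pandas Series method, so selecting the column with single brackets (which returns a Series) should let the mapping work. The rows below are invented stand-ins for adult_training.csv, and the str.strip() call is only a guard in case the labels carry stray whitespace.

```python
import pandas as pd

# Hypothetical rows standing in for adult_training.csv.
adult_train = pd.DataFrame({
    "age": [39, 50, 38, 53],
    "income": ["<=50K", ">50K", " <=50K", ">50K "],
})

# Single brackets return a Series, which supports .map() with a dict.
income = adult_train["income"].str.strip()
y = income.map({"<=50K": 0, ">50K": 1})

# Labels missing from the dict become NaN, so check before casting to int.
if y.isna().any():
    raise ValueError("unexpected income label(s) found")
y = y.astype(int)
```

The resulting y is a Series of 0/1 integers that can be passed as the first argument of sm.Logit(y, X) in place of the DataFrame built in the question.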

'AffinityPropagation' object has no attribute 'label_'

I am using scikit-learn for the affinity propagation algorithm. My input data is a NumPy array of size 2303 x 2303; it is a similarity matrix. I want to calculate the distance of each element in a cluster to its centroid. When I try to print the labels, I get the following error:
"'AffinityPropagation' object has no attribute 'label_'". Here is the code:
clusterer = AffinityPropagation(affinity = 'precomputed')
af = clusterer.fit(l2)
print af.label_
I am getting the following error:
AttributeError: 'AffinityPropagation' object has no attribute 'label_'
Thanks.
According to the AffinityPropagation docs, the attribute is labels_ (plural), not label_:
print(af.labels_)
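For the second part of the question (how far each element is from its cluster's centre), here is a rough sketch. Affinity propagation picks exemplars, which are actual samples, rather than computing centroids, and with a precomputed matrix their row indices are exposed as cluster_centers_indices_. The blob data, the negative squared distance similarity, and the preference value of -20 are assumptions chosen for illustration, not taken from the question.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

# Hypothetical data: two separated 2-D blobs standing in for the 2303 items.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(15, 2)),
    rng.normal(loc=5.0, scale=0.5, size=(15, 2)),
])

# Similarity matrix: negative squared Euclidean distance (larger = more alike).
diff = X[:, None, :] - X[None, :, :]
S = -np.sum(diff ** 2, axis=-1)

# preference=-20.0 is an arbitrary illustrative choice.
af = AffinityPropagation(
    affinity="precomputed", preference=-20.0, random_state=0
).fit(S)

labels = af.labels_                    # cluster index of each sample
centers = af.cluster_centers_indices_  # row index of each cluster's exemplar

# Similarity of every sample to the exemplar of its own cluster.
sim_to_exemplar = S[np.arange(len(labels)), centers[labels]]
```

With a real similarity matrix, the last line alone does the lookup: it reads, for each row, the entry in the column of that row's exemplar. Whether that value can be turned back into a distance depends on how the similarities were defined.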

Cannot save to grib2 file using python iris module

I'm using the Python iris module to read in some netCDF data and output specific fields in GRIB format for further downstream processing. However, I get the following error:
.../pythonlib/iris/1.9.1/lib/python2.7/site-packages/Iris-1.9.1-py2.7-linux-x86_64.egg/iris/fileformats/grib/_save_rules.pyc in gribbability_check(cube)
1062 cs1 = cube.coord(dimensions=[1]).coord_system
1063 if cs0 is None or cs1 is None:
-> 1064 raise iris.exceptions.TranslationError("CoordSystem not present")
1065 if cs0 != cs1:
1066 raise iris.exceptions.TranslationError("Inconsistent CoordSystems")
TranslationError: CoordSystem not present
After reading the following:
Iris Google group thread https://groups.google.com/forum/#!searchin/scitools-iris/grib2/scitools-iris/D2InfYESaUM/yVT7ayXSFV0J
StackOverflow thread Converting NetCDF to GRIB2
iris source code at https://github.com/SciTools/iris/blob/master/lib/iris/fileformats/grib/grib_save_rules.py#L80
I attempted the following
In [26]: radius=iris.fileformats.pp.EARTH_RADIUS
In [27]: u.coord(dimensions=[0]).coord_system=iris.coord_systems.GeogCS(radius)
In [28]: u.coord(dimensions=[1]).coord_system=iris.coord_systems.GeogCS(radius)
In [29]: u.coord(dimensions=[0]).coord_system
Out[29]: GeogCS(6371229.0)
In [30]: u.coord(dimensions=[1]).coord_system
Out[30]: GeogCS(6371229.0)
In [31]: iris.save(u,'prod.grib2')
---------------------------------------------------------------------------
TranslationError Traceback (most recent call last)
<ipython-input-15-a38abe1720ac> in <module>()
----> 1 iris.save(u,'prod.grib2')
i.e. I still generate the same error, a failure in the iris subroutine gribbability_check
Hoping someone can help. I'm using iris 1.9.0 with python 2.7.6. The operation also fails with iris 1.8.0
Cheers
Thanks to Andrew Dawson on the iris google group for the answer. Dimensions [0] and [1] in grib_save_rules.py strictly refer to spatial dimensions, even if your cube may use time for the zeroth dimension. To quote:
There is a huge amount of code in between your cube and saving as
grib2. Since grib knows nothing about dimensionalities above 2 (it
only stores 1 grid per message) we split your cube up into one slice
per grid and pass that onward, hence in the function you are referring
to dimension 0 is latitude and 1 is longitude regardless of how many
other dimensions your cube had.
If I repeat the process but assign the coord_system to my spatial dimensions, and also give the vertical coordinate a standard name using
cube.coord('vertical_level').standard_name = 'air_pressure'
the GRIB file can be saved.

Error in statsmodels.api OLS predict attribute using complex formula

I am trying to use an OLS regression to predict missing (NaN) values of ustar from known values of wind speed (WS), the variation of WS by month, and radiation (Rn). All variables in the formula have some missing data at some point within the dataframe, but the regression showed strong correlations with all the variables in the formula and an R-squared value of 0.80, so I know this gap-filling method using predicted regression data is feasible. Here is my code:
regression_data = pd.DataFrame([])
regression_data['ustar'] = data['ustar']
regression_data['WS'] = data['WS']
regression_data['Rn'] = data['Rn']
regression_data['month'] = data.index.month
formula = "ustar ~ WS + (WS:C(month)) + (WS:Rn) + 1"
regression_model = sm.regression.linear_model.OLS.from_formula(formula,regression_data)
results = regression_model.fit()
predicted_values = results.predict(regression_data)
Traceback (most recent call last):
File "<ipython-input-61-073df0b2ae63>", line 1, in <module>
predicted_values = results.predict(regression_data)
File "/Users/JasonDucker/anaconda/lib/python3.5/site-packages/statsmodels/base/model.py", line 739, in predict
exog = dmatrix(self.model.data.orig_exog.design_info.builder,
File "/Users/JasonDucker/anaconda/lib/python3.5/site-packages/pandas/core/generic.py", line 2360, in __getattr__
(type(self).__name__, name))
AttributeError: 'DataFrame' object has no attribute 'design_info'
I understand that there have been similar past issues with the same error, but I do not know whether the complexity of my formula is what the predict method cannot handle. I was wondering if anyone has a perspective on how I should approach this problem.
