Performing PCA and knowing which columns were retained [duplicate] - python

This question already has answers here:
Recovering features names of explained_variance_ratio_ in PCA with sklearn
(5 answers)
Closed 2 years ago.
When performing PCA on a dataset in Python, the explained_variance_ratio_ will show us the different variances for each feature in our dataset.
How do we know which columnn corresponds with which of the resulting variances?
Context: I'm working on a project and I need to know which components give us 90% of the variance with PCA so that we can perform stepwise feature selection later on.
from sklearn.decomposition import PCA
pcaObj = PCA(n_components=None)
X_train = pcaObj.fit_transform(X_train)
X_test = pcaObj.transform(X_test)
components_variance = pcaObj.explained_variance_ratio_
print(sum(components_variance))
print(components_variance)

Edit: I discovered similar question: Recovering features names of explained_variance_ratio_ in PCA with sklearn
The answers are richer and detailed explanations. I have marked this question as duplication but will leave this comment for time being.
I believe you can get the values with;
pd.DataFrame(pcaObj.components_.T, index=X_train.columns)
if X_train is not DataFrame but numpy, pass in the name of the features as they appeared originally as a list.
pd.DataFrame(pcaObj.components_.T, index=['column_a','column_b','column_c'], columns =['PC-1', 'PC-2'])
# column_x where the name of features
.componets_ should return the values you need. We can place them on Pandas pd, with columns names.

The pca.explained_variance_ratio_ parameter gives you an array of the variance of each dimension. Therefore, pca.explained_variance_ratio[i] will give you the variance of the i+1st dimesion.
I don't believe there is a way to match the variance with the 'name' of the column, but going through the variance array in a for loop and noting the index with 90% variance should allow you to then match the index with the column name.

Related

Choosing between imputation methods [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed last year.
Improve this question
I'm trying to evaluate 2 methods for imputation of data.
My dataset: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data
My target label is LotFrontage.
First I encoded all categorial features with OneHotEncoding and then I used the correlation matrix and filter anything above -0.3 or blow 0.3.
encoded_df = pd.get_dummies(train_df, prefix_sep="_", columns=['MSZoning', 'Street', 'Alley',
'LotShape', 'LandContour', 'Utilities',
'LotConfig', 'LandSlope', 'Neighborhood',
'Condition1', 'Condition2', 'BldgType', 'HouseStyle'])
corrmat = encoded_df.corr()
corrmat[(corrmat > 0.3) | (corrmat < -0.3)]
# filtering out based on corrmat output...
encoded_df = encoded_df[['SalePrice', 'MSSubClass', 'LotFrontage', 'LotArea',
'BldgType_1Fam', 'BldgType_2fmCon', 'BldgType_Duplex', 'BldgType_Twnhs', 'BldgType_TwnhsE',
'MSZoning_C (all)', 'MSZoning_FV', 'MSZoning_RH', 'MSZoning_RL', 'MSZoning_RM']]
Then I try two imputation methods:
use the mean value of LotFrontage (used this method because I saw low outlier ratio)
Tried to predict LotFrontage with DecisionTreeRegressor
# imputate LotFrontage with the mean value (we saw low outliers ratio so we gonna use this)
encoded_df1 = encoded_df.copy()
encoded_df1['LotFrontage'].fillna(encoded_df['LotFrontage'].mean(), inplace=True)
X1 = encoded_df1.drop('LotFrontage', axis=1)
y1 = encoded_df1['LotFrontage']
X1_train, X1_test, y1_train, y1_test = train_test_split(X1, y1)
classifier1 = DecisionTreeRegressor()
classifier1.fit(X1_train, y1_train)
y1_pred = classifier1.predict(X1_test)
print('score1: ', classifier1.score(X1_test, y1_test))
# imputate LotFrontage with by preditcing it using DecisionTreeRegressor
encoded_df2 = encoded_df.copy()
X2 = encoded_df2[~encoded_df2['LotFrontage'].isnull()].drop('LotFrontage', axis=1)
y2 = encoded_df2[~encoded_df2['LotFrontage'].isnull()]['LotFrontage']
X2_train, X2_test, y2_train, y2_test = train_test_split(X2, y2)
classifier2 = DecisionTreeRegressor()
classifier2.fit(X2_train, y2_train)
y2_pred = classifier2.predict(encoded_df2[encoded_df2['LotFrontage'].isnull()].drop('LotFrontage', axis=1))
imputated_encoded_df2 = encoded_df2[encoded_df2['LotFrontage'].isnull()].assign(LotFrontage=y2_pred)
X3 = imputated_encoded_df2.drop('LotFrontage', axis=1)
y3 = imputated_encoded_df2['LotFrontage']
X3_train, X3_test, y3_train, y3_test = train_test_split(X3, y3)
classifier2.fit(X3_train, y3_train)
y3_pred = classifier2.predict(X3_test)
print('score2: ', classifier2.score(X3_test, y3_test))
My questions are:
Is it correct of me first using fillna with the mean value and then splitting to train and test and checking the score? Because if I'm filling the values prior to fitting the model won't it fit the model on the imputated data and thus giving me biased result? Same for the second method
Anything else I'm doing wrong since I can't determine the best method for imputation since I get bad and random score for both methods
1.Imputation Using (Mean/Median) Values:
This works by calculating the mean/median of the non-missing values in a column and then replacing the missing values within each column separately and independently from the others. It can only be used with numeric data.
Pros:
Easy and fast.
Works well with small numerical datasets.
Cons:
Doesn’t factor the correlations between features.
It only works on the column level.
Will give poor results on encoded categorical features (do NOT use it on categorical features).
Not very accurate.
Doesn’t account for the uncertainty in the imputations.
2.Imputation Using (Most Frequent) or (Zero/Constant) Values:
Most Frequent is another statistical strategy to impute missing values and YES!! It works with categorical features (strings or numerical representations) by replacing missing data with the most frequent values within each column.
Pros:
Works well with categorical features.
Cons:
It also doesn’t factor the correlations between features.
It can introduce bias in the data.
Zero or Constant imputation — as the name suggests — it replaces the missing values with either zero or any constant value you specify
3.Imputation Using k-NN:
The k nearest neighbours is an algorithm that is used for simple classification. The algorithm uses ‘feature similarity’ to predict the values of any new data points. This means that the new point is assigned a value based on how closely it resembles the points in the training set. This can be very useful in making predictions about the missing values by finding the k’s closest neighbours to the observation with missing data and then imputing them based on the non-missing values in the neighbourhood.
How does it work?
It creates a basic mean impute then uses the resulting complete list to construct a KDTree. Then, it uses the resulting KDTree to compute nearest neighbours (NN). After it finds the k-NNs, it takes the weighted average of them.
Pros:
Can be much more accurate than the mean, median or most frequent imputation methods (It depends on the dataset).
Cons:
Computationally expensive. KNN works by storing the whole training dataset in memory.
K-NN is quite sensitive to outliers in the data (unlike SVM)
Since the outlier ratio is low we can use method 3. It will also have less impact on the correlation between the imputed target variable(i.e LotFrontage) and other features.
import sys
from impyute.imputation.cs import fast_knn
sys.setrecursionlimit(100000) #Increase the recursion limit of the OS
# start the KNN training
train_df['LotFrontage']=fast_knn(train_df[['LotFrontage','1stFlrSF','MSSubClass']], k=30)
I've chosen the two features considering their correlation with the LotFrontage column.

Can the input values to the spectral clustering of scikit-learn be a negative value?

Let us say, I have a df of 20 columns and 10K rows. Since the data has a wide range values, I use the following code to normalize the data:
from sklearn.preprocessing import StandardScaler
min_max_scaler = preprocessing.StandardScaler()
df_scaled = min_max_scaler.fit_transform(df)
df_scaled now contains both negative and positive values.
Now if I pass this normalized data frame to the spectral cluster as follows,
spectral = SpectralClustering(n_clusters = k,
n_init=30,
affinity='nearest_neighbors', random_state=cluster_seed,
assign_labels='kmeans')
clusters = spectral.fit_predict(df_scaled)
I will get the cluster lables.
Here is what confuses me: the official doc says that
"Only kernels that produce similarity scores (non-negative values that increase with similarity) should be used. This property is not checked by the clustering algorithm."
Questions: Do the normalized negative values of df_scaled affect the clustering result?
OR
Does it depend on the affinity computation I am using e.g. precomputed, rbf? If so how can I use the normalized input values to SpectralClustering?
My understanding is that normalizing could improve the clustering results and good for faster computation.
I appreciate any help or tips on how to I can approach the problem.
You are passing a data matrix, not a precomputed affinity matrix.
The "nearest neighbors" uses a binary kernel, which is non-negative.
To better understand the inner workings, please have a look at the source code.

Python dask_ml linear regression Multiple constant columns detected error

I am using python with dask to create a logistic regression model, In order to speed up things when training.
I have x that is the feature array (numpy array) and y that is a label vector.
edit:
The numpy arrays are: x_train (n*m size) array of floats and the y_train is (n*1) vector of integers that are labels for the training. both suits well into sklearn LogisticRegression.fit and working fine there.
I tried to use this code to create a pandas df then converting it to dask ddf and training on it like shown here
from dask_ml.linear_model import LogisticRegression
from dask import dataframe as dd
df["label"] = y_train
sd = dd.from_pandas(df, npartitions=3)
lr = LogisticRegression(fit_intercept=False)
lr.fit(sd, sd["label"])
But getting an error
Could not find signature for add_intercept:
I found this issue on Gitgub
Explaining to use this code instead
from dask_ml.linear_model import LogisticRegression
from dask import dataframe as dd
df["label"] = y_train
sd = dd.from_pandas(df, npartitions=3)
lr = LogisticRegression(fit_intercept=False)
lr.fit(sd.values, sd["label"])
But I get this error
ValueError: Multiple constant columns detected!
How can I use dask to train a logistic regression over data originated from a numpy array?
Thanks.
You can bypass std verification by using
lr = LogisticRegression(solver_kwargs={"normalize":False})
Or you can use #Emptyless code to get faulty column_indices
and then remove those columns from your array.
This does not seem like an issue with dask_ml. Looking at the source, the std is calculated using:
mean, std = da.compute(X.mean(axis=0), X.std(axis=0))
This means that for every column in your provided array, dask_ml calculates the standard deviation. If the standard deviation of one of those columns is equal to zero (np.where(std == 0))) that means that that column has zero variation.
Including a column with zero variation does not allow any training, ergo it needs to be removed prior to training the model (in a data preparation / cleansing step).
You can quickly check which columns have no variation by checking the following:
import numpy as np
std = sd.std(axis=0)
column_indices = np.where(std == 0)
print(column_indices)
A little late to the party but here I go anyway. Hope future readers appreciate it. This answer is for the Multiple Columns error.
A Dask DataFrame is split up into many Pandas DataFrames. These are called partitions. If you set your npartitions to 1 it should have exactly the same effect as sci-kit learn. If you set it to more partitions it splits it into multiple DataFrames but I found it changes the shape of the DataFrames which in the end resulted in the Multiple Columns error. It also might cause a overflow warning. Unfortunately it is not in my interest to investigate the direct cause of this error. It might simply be because the DataFrame is too large or too small.
A source for partitioning
Below the errors for search engine indexing:
ValueError: Multiple constant columns detected!
RuntimeWarning: overflow encountered in exp return np.exp(A)

support vector regression time series forecasting - python

I have a dataset of peak load for a year. Its a simple two column dataset with the date and load(kWh).
I want to train it on the first 9 months and then let it predict the next three months . I can't get my head around how to implement SVR. I understand my 'y' would be predicted value in kWh but what about my X values?
Can anyone help?
given multi-variable regression, y =
Regression is a multi-dimensional separation which can be hard to visualize in ones head since it is not 3D.
The better question might be, which are consequential to the output value `y'.
Since you have the code to the loadavg in the kernel source, you can use the input parameters.
For Python (I suppose, the same way will be for R):
Collect the data in this way:
[x_i-9, x_i-8, ..., x_i] vs [x_i+1, x_i+2, x_i+3]
First vector - your input vector. Second vector - your output vector (or value if you like). Use method fit from here, for example: http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html#sklearn.svm.SVR.fit
You can try scaling, removing outliers, apply weights and so on. Play :)

Data Preprocessing Python

I have a DataFrame in Python and I need to preprocess my data. Which is the best method to preprocess data?, knowing that some variables have huge scale and others doesn't. Data hasn't huge deviance either. I tried with preprocessing.Scale function and it works, but I'm not sure at all if is the best method to proceed to the machine learning algorithms.
There are various techniques for data preprocessing, you can refer to the ideas in sklearn.preprocessing as potential guidelines to follow.
http://scikit-learn.org/stable/modules/preprocessing.html
Preprocessing is coupled to the data you are studying, but in general you could explore:
Assessing missing values, by computing their percentage per column
Compute the variance and remove variables with near zero variance
Assess the inter variable correlation to detect redundancy
You can compute these scores easily in pandas as follows:
data_file = "your_input_data_file.csv"
data = pd.read_csv(data_file, delimiter="|")
variance = data.var()
variance = variance.to_frame("variance")
variance["feature_names"] = variance.index
variance.reset_index(inplace=True)
#reordering columns
variance = variance[["feature_names","variance"]]
logging.debug("exporting variance to csv file")
variance.to_csv(data_file+"_variance.csv", sep="|", index=False)
missing_values_percentage = data.isnull().sum()/data.shape[0]
missing_values_percentage = missing_values_percentage.to_frame("missing_values_percentage")
missing_values_percentage["feature_names"] = missing_values_percentage.index
missing_values_percentage.reset_index(inplace=True)
missing_values_percentage = missing_values_percentage[["feature_names","missing_values_percentage"]]
logging.debug("exporting missing values to csv file")
missing_values_percentage.to_csv(data_file+"_mssing_values.csv", sep="|", index=False)
correlation = data.corr()
correlation.to_csv(data_file+"_correlation.csv", sep="|")
The above would generate three files holding respectively, the variance, missing values percentage and correlation results.
Refer to this blog article for a hands on tutorial.
always split your data to train and test split to prevent overfiting.
if some of your features has big scale and some doesnt you should standard the data.make sure to sandard the data only on the train set not to couse overfiting.
you also have to look for missing datas and replace or remove them.
if less than 0.5% of the data in a column is missing you can use 'dropna' otherwise you have to replace it with something(you can replace ut with zero,mean,the previous data...)
you also have to check outliers by using boxplot.
outliers are point that are significantly different from other data in the same group can also affects your prediction in machine learning.
its the best if we check the multicollinearity.
if some features have correlation we have multicollinearity can couse wrong prediction for our model.
for using your data some of the columns might be categorical with sholud be converted to numerical.

Categories