Fisher’s Score based feature selection - python

I am trying to select the best features based on Fisher's score. In the following code, X_train and y_train are pandas DataFrames.
from skfeature.function.similarity_based import fisher_score
ranks = fisher_score.fisher_score(X_train, y_train)
V = X_train.columns[ranks[:5]]
print(V)
The above code gives the error "Length of values (1) does not match the length of index (13)".
If I convert the DataFrames to NumPy arrays, that error goes away. The following code gets past it:
from skfeature.function.similarity_based import fisher_score
ranks = fisher_score.fisher_score(np.array(X_train), np.array(y_train))
V = X_train.columns[ranks[:5]]
print(V)
However, this version kills the kernel, probably because the NumPy array is so large. Is there any way to solve this without converting to a NumPy array, or any other approach that avoids the dead kernel?
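One possible workaround, sketched here under assumptions: compute the score directly from per-class means and variances instead of calling skfeature. This assumes the usual per-feature definition of the Fisher score (class-size-weighted between-class spread divided by class-size-weighted within-class variance), and it assumes the dead kernel comes from memory used inside the library call rather than from the data itself. The function name fisher_scores and the toy data are made up for illustration.

```python
import numpy as np

def fisher_scores(X, y):
    """Per-feature Fisher score: between-class spread / within-class variance."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y).ravel()
    overall_mean = X.mean(axis=0)
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for label in np.unique(y):
        Xc = X[y == label]          # rows belonging to this class
        n_c = Xc.shape[0]
        between += n_c * (Xc.mean(axis=0) - overall_mean) ** 2
        within += n_c * Xc.var(axis=0)
    # Guard against division by zero for features that are constant within every class
    within = np.where(within == 0, np.finfo(float).eps, within)
    return between / within

# Toy data (hypothetical): only feature 1 separates the two classes
rng = np.random.default_rng(0)
y_demo = np.repeat([0, 1], 50)
X_demo = rng.normal(size=(100, 3))
X_demo[:, 1] += 5 * y_demo

scores = fisher_scores(X_demo, y_demo)
top = np.argsort(scores)[::-1]      # feature indices, best first
```

With a DataFrame, the column names of the five best features would then be X_train.columns[top[:5]]. The working memory here is a few arrays of length n_features plus one class slice at a time.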

Related

ValueError: Data must be 1-Dimensional error while creating a dataframe

I am trying to solve a classification problem with a neural network. After I get the predictions, I want to create a pandas DataFrame with a column from the test dataset as the first column and my predictions as the second. But I keep getting an error. (The code and the error message were posted as images.)
Important sidenote: Please, take some time to look into How to make good reproducible pandas examples, there are great suggestions there on how you could ask your question better.
Now for your error:
Data must be 1-dimensional
That means pandas wants a 1-dimensional array, i.e. of the form [0,0,1,1,...,1]. But your preds array is 2-dimensional, i.e. of the form [[0],[0],[1],[1],...,[1]].
So you need to flatten the preds array.
Instead of for-loops, consider using a list comprehension to change your code to something like this:
predictions = [1 if p>0.5 else 0 for p in preds]
df = pd.DataFrame({'PassengerId': test['PassengerId'].values,
                   'Survived': predictions})
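Also, in the meantime, look into the ndarray.round method; it may fit your use case better. Note that round keeps the 2-dimensional shape of preds, so the result still has to be flattened with ravel:
predictions = preds.round().ravel()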
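To make the shape issue concrete, here is a minimal sketch with made-up values standing in for the network output and the PassengerId column (the original code was only posted as an image, so these arrays are assumptions). It flattens the (n, 1) array with ravel before building the DataFrame.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the network output: shape (n_samples, 1)
preds = np.array([[0.1], [0.8], [0.6], [0.3]])
passenger_ids = np.array([892, 893, 894, 895])  # made-up IDs

flat = preds.ravel()                      # shape (4,), now 1-dimensional
predictions = (flat > 0.5).astype(int)    # threshold at 0.5, as in the list comprehension

df = pd.DataFrame({'PassengerId': passenger_ids, 'Survived': predictions})
```

Passing preds itself (shape (4, 1)) as a column is what triggers the "Data must be 1-dimensional" error; the flattened array does not.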

Facing a problem while running reg.predict in a Jupyter notebook: "ValueError"

I am trying to learn sklearn in Python, but the Jupyter notebook gives an error saying "ValueError: Expected 2D array, got scalar array instead:
array=750.
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample."
But I have already defined x to be a 2D array using x.values.reshape(-1,1).
You can find the CSV file and a screenshot of the error here -> https://github.com/CaptainRD/CSV-for-StackOverflow
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
from sklearn.linear_model import LinearRegression
data = pd.read_csv('1.02. Multiple linear regression.csv')
data.head()
x = data[['SAT','Rand 1,2,3']]
y = data['GPA']
reg = LinearRegression()
reg.fit(x,y)
r2 = reg.score(x,y)
n = x.shape[0]
p = x.shape[1]
adjusted_r2 = 1-(1-r2)*(n-1)/(n-p-1)
adjusted_r2
reg.predict(1750)
As you can see in your code, your x has two variables, SAT and Rand 1,2,3. That means you need to provide a two-dimensional input to the predict method, with one value per variable. For example:
reg.predict([[1750, 1]])
which returns:
>>> array([1.88])
You are facing this error because you did not provide the second value (for the Rand 1,2,3 variable). Note that if this variable is not important, you should remove it from your x data.
This model maps two inputs (SAT and Rand 1,2,3) to a single output (GPA), and thus requires a list of two elements per sample for a valid prediction. I'm guessing the 1750 that you're supplying is meant to be the SAT value, but you also need to provide the Rand 1,2,3 value. Something like [[1750, 1]] would work.
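A small self-contained sketch of the same idea, using synthetic numbers in place of the CSV (the data, the coefficient 0.002 and the noise level are invented for illustration; only the shapes matter):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the CSV: two features (SAT, Rand 1,2,3) and a GPA target
rng = np.random.default_rng(0)
sat = rng.integers(1600, 2000, size=50)
rand = rng.integers(1, 4, size=50)            # values 1, 2 or 3
x = np.column_stack([sat, rand])              # shape (50, 2)
gpa = 0.002 * sat + rng.normal(scale=0.1, size=50)

reg = LinearRegression().fit(x, gpa)

# predict expects shape (n_samples, n_features): one row per sample
single = reg.predict([[1750, 1]])              # one sample, two features
several = reg.predict([[1750, 1], [1800, 3]])  # two samples
```

Each inner list is one sample, so the outer list can hold as many samples as needed.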

ValueError: could not convert string to float: 'GIAC'

I am trying to perform K-means clustering on a data set that consists entirely of text. I have tried the lines of code below and I am getting an error saying "ValueError: could not convert string to float: 'GIAC'".
I think the program is still having problems converting my text into vectors so that it can perform the clustering.
I really do not know how to solve this.
Here are the lines of code:
import numpy as np
import matplotlib.pyplot as plot
import pandas as pd
from sklearn.cluster import KMeans
Cert = pd.read_csv('Certification.csv')
X = Cert.iloc[:,:].values
wcss =[]
for i in range(1,5):
    kmeans = KMeans(n_clusters = i, init='k-means++', random_state = 0)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)
plot.plot(range(1,5),wcss)
plot.title('Elbow Method')
plot.xlabel('Number of Clusters')
plot.ylabel('WCSS')
plot.show()
I have also attached a screenshot of the error message.
K-means requires your data to be continuous variables.
Clearly, 'GIAC' is not a number, is it?
K-means cannot be used on this data as it stands. You'd need to do one-hot encoding or similar, but that comes with its very own set of problems with k-means. Usually, when you have data with values such as 'GIAC', there is no sound way to cluster the data in a statistically meaningful way. There are so many heuristic choices along the way to a result that you could get pretty much any other result, too. Try to approach the problem mathematically, not by copying and pasting code.
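For reference, this is roughly what the one-hot route looks like mechanically. The column names and values are invented (the real Certification.csv is not shown), and the caveat above still applies: the code runs, but that says nothing about whether the resulting clusters mean anything.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical stand-in for Certification.csv: text-only columns
cert = pd.DataFrame({
    'vendor': ['GIAC', 'CompTIA', 'GIAC', 'ISC2', 'CompTIA', 'GIAC'],
    'level': ['advanced', 'entry', 'advanced', 'advanced', 'entry', 'entry'],
})

# One 0/1 column per distinct category value
encoded = pd.get_dummies(cert).astype(float)

kmeans = KMeans(n_clusters=2, init='k-means++', n_init=10, random_state=0)
labels = kmeans.fit_predict(encoded)
```

The ValueError disappears because every column is now numeric, but distances between 0/1 vectors are a weak notion of similarity, which is the problem the answer warns about.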

Image of Mnist data Python - Error when displaying the image

I'm working with the MNIST data set in order to learn about machine learning. For now I'm trying to display the first digit in the data set as an image, and I have encountered a problem.
I have a matrix with the dimensions 784x10000, where each column is a digit in the data set. I created the matrix myself, because the MNIST data set came in the form of a text file, which caused me quite a lot of problems, but that's a question in itself.
The MN_train matrix below is my large 784x10000 matrix. What I'm trying to do below is fill up a 28x28 matrix in order to display my image.
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
grey = np.zeros(shape=(28,28))
k = 0
for l in range(28):
    for p in range(28):
        grey[p,l]=MN_train[k,0]
        k = k + 1
print(grey)
plt.show(grey)
But when I try to display the image, I get the following error:
The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
This is followed by an image plot that does not look like the number five, as I would expect.
Is there something I have overlooked, or does this tell me that my manipulation of the text file, in order to construct the MN_train matrix, has resulted in an error?
You get the error because you pass the array to show. show accepts only a single optional boolean argument, block=True or False, so the array ends up being evaluated as a boolean.
In order to create an image plot, you need to use imshow.
plt.imshow(grey)
plt.show() # <- no argument here
Also note that the loop is rather inefficient. You can simply reshape the input column array.
The complete code would then look like this:
import numpy as np
import matplotlib.pyplot as plt
MN_train = np.loadtxt( ... )
grey = MN_train[:,0].reshape((28,28))
plt.imshow(grey)
plt.show()
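One detail worth checking when swapping the loop for reshape: the loop in the question fills grey column by column, while reshape fills row by row by default. The sketch below uses a dummy column of 784 distinct values (not real MNIST data) to compare the two.

```python
import numpy as np

# Stand-in for one column of MN_train: 784 distinct values
column = np.arange(784)

# The loop from the question: l is the column index, p the row index
grey_loop = np.zeros((28, 28))
k = 0
for l in range(28):
    for p in range(28):
        grey_loop[p, l] = column[k]
        k += 1

grey_c = column.reshape((28, 28))             # row-major (NumPy default)
grey_f = column.reshape((28, 28), order='F')  # column-major, same as the loop
```

The loop result equals the order='F' reshape, which is the transpose of the default reshape. If the text file stores each digit row by row, the default reshape is the one that shows the digit upright, and a transposed fill could be one reason the original plot did not look like a five.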

ML - Getting feature names after feature selection - SelectPercentile, python

I have been struggling with this one for a while.
My goal is to take a text feature that I have and find the best 5-10 words in it to help me classify. Hence, I am running a TfidfVectorizer and choosing the ~90 best for now. However, after I reduce the number of features, I am unable to see which features were actually chosen.
Here is what I have:
import pandas
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectPercentile, f_classif
train=pandas.read_csv("train.tsv", sep='\t')
labels_train = train["label"]
documents = []
for i, row in train.iterrows():
    documents.append((row['boilerplate'][1:-1].lower()))
vectorizer = TfidfVectorizer(sublinear_tf=True, stop_words="english")
features_train_transformed = vectorizer.fit_transform(documents)
selector = SelectPercentile(f_classif, percentile=0.1)
selector.fit(features_train_transformed, labels_train)
features_train_transformed = selector.transform(features_train_transformed).toarray()
The result is that features_train_transformed contains a matrix of the tf-idf scores per document for each of the selected words. However, I have no idea which words were chosen, and methods like "get_feature_names()" are unavailable for the class SelectPercentile.
This is necessary because I need to add these features to a bunch of numeric features and only then do my training and predictions.
selector.get_support() gets you a boolean array marking the columns that were within the percentile range you specified.
train.columns.values should get you the complete list of column names for the original dataframe.
Filtering the latter with the former should give you the names of the columns that make up your chosen percentile range.
The code below (cut and pasted from working code) is similar enough to yours that it's hopefully helpful:
import numpy as np
from sklearn.feature_selection import SelectPercentile, f_regression
# train is the original dataframe, "y" its target column, y_train the target values
selection = SelectPercentile(f_regression, percentile=2)
train_minus_target = train.drop("y", axis=1)
x_features = selection.fit_transform(train_minus_target, y_train)
columns = np.asarray(train_minus_target.columns.values)
support = np.asarray(selection.get_support())
columns_with_support = columns[support]
Reference:
about get_support
