I am trying feature selection on the Iris dataset.
I'm referencing Feature Selection with Univariate Statistical Tests.
I am using the lines below and want to find the significant features:
import pandas
from pandas import read_csv
from numpy import set_printoptions
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif
dataframe = pandas.read_csv("C:\\dateset\\iris.csv")
array = dataframe.values
X = array[:,0:4]
Y = array[:,4]
test = SelectKBest(score_func=f_classif, k=2)
fit = test.fit(X, Y)
set_printoptions(precision=2)
arr = fit.scores_
print(arr)
# [ 119.26 47.36 1179.03 959.32]
To show the indices of the top 2 features by score, I added:
idx = (-arr).argsort()[:2]
print(idx)
# [2 3]
Further, how can I get the column/variable names (instead of their indices)?
Use indexing; here it is possible to use the column names directly, because the first 4 columns were selected:
#first 4 columns
X = array[:,0:4]
cols = dataframe.columns[idx]
If the selection for X is different, you also need to filter the DataFrame by position:
# e.g. selected 3rd to 6th columns
X = array[:,2:6]
cols = dataframe.iloc[:, 2:6].columns[idx]
import pandas
from pandas import read_csv
from numpy import set_printoptions
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif
dataframe = pandas.read_csv("iris.csv")
array = dataframe.values
X = array[:,0:4]
Y = array[:,4]
test = SelectKBest(score_func=f_classif, k=2)
fit = test.fit(X, Y)
set_printoptions(precision=2)
arr = fit.scores_
idx = (-arr).argsort()[:2]
print(idx)
print(arr)
names = dataframe.columns[idx]
print(names)
Output
[2 3]
[ 119.26 47.36 1179.03 959.32]
Index(['petal_length', 'petal_width'], dtype='object')
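As a side note (my addition, not part of the original answer), SelectKBest also exposes get_support(), which returns a boolean mask of the selected features, so the names can be recovered without argsort. A minimal sketch, assuming the same fit and dataframe as above:

mask = fit.get_support()              # boolean mask, e.g. [False False  True  True]
names = dataframe.columns[0:4][mask]  # apply the mask to the feature column names
print(names)                          # Index(['petal_length', 'petal_width'], dtype='object')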
Related
The spreadsheet shown in the screenshot above is referred to as sample.xlsx. I've been having trouble getting the beta for each stock using the LinearRegression() function.
Input:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
df = pd.read_excel('sample.xlsx')
mean = df['ChangePercent'].mean()
for index, row in df.iterrows():
    symbol = row['stock']
    perc = row['ChangePercent']
    x = np.array(perc).reshape((-1, 1))
    y = np.array(mean)
    model = LinearRegression().fit(x, y)
    print(model.coef_)
Output:
Line 16: model = LinearRegression().fit(x, y)
"Singleton array %r cannot be considered a valid collection." % x
TypeError: Singleton array array(3.34) cannot be considered a valid collection.
How can I make the collection valid so that I can get a beta value(model.coef_) for each stock?
x and y must have the same number of samples, and fit expects real collections rather than 0-d arrays, so you need to reshape y (the mean) to 1 row and 1 column as well. In this case that comes down to the following:
np.array(mean).reshape(-1,1) or np.array(mean).reshape(1,1)
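To see why the error message mentions a singleton, compare the shapes (a minimal illustration, not from the original code):

import numpy as np
mean = 3.34
print(np.array(mean).shape)                 # () - a 0-d singleton array, not a valid collection
print(np.array(mean).reshape(-1, 1).shape)  # (1, 1) - a 2-D array with one sample, which fit accepts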
Given that you are training 5 models, each one with just one sample, it is not surprising that all 5 will "learn" that the coefficient of the linear regression is 0 and the intercept is 3.34 (the value of y).
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
df = pd.DataFrame({
"stock": ["ABCD", "XYZ", "JK", "OPQ", "GHI"],
"ChangePercent": [-1.7, 30, 3.7, -15.3, 0]
})
mean = df['ChangePercent'].mean()
for index, row in df.iterrows():
    symbol = row['stock']
    perc = row['ChangePercent']
    x = np.array(perc).reshape(-1,1)
    y = np.array(mean).reshape(-1,1)
    model = LinearRegression().fit(x, y)
    print(f"{model.intercept_} + {model.coef_}*{x} = {y}")
This is correct from an algorithmic point of view, but it doesn't make any practical sense, given that you're only providing one sample to train each model.
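For a meaningful beta you would normally regress a time series of each stock's returns against the market's returns, so that every model sees many samples. A hedged sketch, assuming hypothetical stock_return and market_return columns that are not in the original sample.xlsx:

import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical long-format data: one return observation per stock per period.
df = pd.DataFrame({
    "stock":         ["ABCD", "ABCD", "ABCD", "XYZ", "XYZ", "XYZ"],
    "stock_return":  [0.010, -0.020, 0.030, 0.020, 0.000, -0.010],
    "market_return": [0.005, -0.010, 0.020, 0.005, -0.010, 0.020],
})

for symbol, group in df.groupby("stock"):
    x = group["market_return"].values.reshape(-1, 1)  # many samples, one feature
    y = group["stock_return"].values                  # one target per sample
    beta = LinearRegression().fit(x, y).coef_[0]      # slope of the fit = the stock's beta
    print(symbol, beta)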
I am using Python for clustering a set of data that I have, but it is showing me this error, and I do not know where I should make the changes or in which file:
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
and the following is my code:
from sklearn import datasets
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.cluster import KMeans
from apyori import apriori
dataset = pd.read_csv('autos1.csv',encoding= 'unicode_escape')
x = dataset.iloc[ : , 1:3]
km = KMeans(n_clusters = 2, random_state = 21)
km.fit(x)
centers = km.cluster_centers_
print(centers)
clusters = x.copy()
clusters['cluster_id'] = km.fit_predict(x)
plt.xlabel('price')
plt.ylabel('yearOfRegistration')
plt.scatter(clusters['fuelType'], clusters['yearOfRegistration'], c='black', cmap='rainbow')
plt.xlabel('price')
plt.ylabel('yearOfRegistration')
plt.show()
plt.scatter(centers[:,0], centers[:,1], c = 'black', s = 100 , alpha = 0.9 )
plt.scatter(clusters['price'], clusters['yearOfRegistration'], c=clusters['cluster_id'], cmap='rainbow')
plt.xlabel('price')
plt.ylabel('yearOfRegistration')
plt.show()
You need to remove any rows from your dataset that contain NaN or non-finite values.
import numpy as np

# Only select rows whose entries are all finite.
x = x[np.all(np.isfinite(x), axis=1)]
np.isfinite will return an array of the same shape as your input, so pass axis=1 to np.all to check if all columns (axis 1) of each row are finite.
Then, index into your array to only select those rows.
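A minimal illustration of the mask (my example, not the asker's data):

import numpy as np
a = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [4.0, np.inf]])
mask = np.all(np.isfinite(a), axis=1)  # [ True False False]
print(a[mask])                         # keeps only the fully finite first row

Since x in the question is a pandas DataFrame, an equivalent pandas-only sketch is x = x.replace([np.inf, -np.inf], np.nan).dropna(), which turns infinities into NaN and then drops any row containing NaN.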
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
df = pd.read_excel("Book1.xlsx")
for column in df:
    X = df["Row Labels"]
    Y = df[column]
    y1 = Y.values.reshape(-1,1)
    x1 = X.values.reshape(-1,1)
    regressor = LinearRegression()
    regressor.fit(x1, y1)
    y_new = []
    y_i = []
    for i in range(12,24):
        y_new.append(regressor.predict([[i]]))
        y_i.append(i)
    df2 = pd.DataFrame({'column':y_new})
I wrote this code to loop through the DataFrame columns, do a simple linear regression for each, and put all the predicted values in a DataFrame, but it keeps only the last column's predictions.
df2 = pd.DataFrame({'column':y_new}) creates a column literally named 'column' (not the name saved in the variable column). Moreover, df2 is recreated in every iteration, so it only ever holds the last y_new.
I think what you want is to create a new column in df2 in each iteration:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
df = pd.read_excel("Book1.xlsx")
df2 = pd.DataFrame()
for column in df:
    X = df["Row Labels"]
    Y = df[column]
    y1 = Y.values.reshape(-1,1)
    x1 = X.values.reshape(-1,1)
    regressor = LinearRegression()
    regressor.fit(x1, y1)
    y_new = []
    y_i = []
    for i in range(12,24):
        y_new.append(regressor.predict([[i]]))
        y_i.append(i)
    df2[column] = y_new
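One caveat (my addition, not part of the original answer): regressor.predict([[i]]) returns a (1, 1) array because y1 is 2-D, so each cell of df2 will hold a small array. If you prefer plain floats, unwrap the prediction:

y_new.append(regressor.predict([[i]])[0][0])  # extract the scalar from the (1, 1) result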
I have a lookup table created using -
lookupTable, data_training_panda_y_indexed = np.unique(data_training_panda_y, return_inverse=True)
However, I want to apply the lookupTable on a different array data_cross_validation_panda_y
data_training_panda_y is a list of strings which can be these values - Incoming, Outgoing, Neutral.
So, lookupTable is an ndarray: ['Incoming' 'Neutral' 'Outgoing'] (np.unique returns the values sorted).
Code so far -
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from numpy import dtype
from _codecs import lookup
#Load data
data = np.genfromtxt('../Data/bezdekIris.csv',delimiter=',',usecols=[0,1,2,3,4],dtype=None)
labels = np.genfromtxt('../Data/bezdekIris.csv',delimiter=',',usecols=[4],dtype=None)
#Shuffle the rows
np.random.shuffle(data)
#Cut the data into 3 parts
data_rows = np.size(data, 0)
training_rows = int(round(0.6*data_rows))
cross_validation_rows = int(round(0.2*data_rows))
testing_rows = data_rows - training_rows - cross_validation_rows
data_training_panda = pd.DataFrame(data[:training_rows])
data_training_panda_X = data_training_panda.iloc[:,0:4]
data_training_panda_y = data_training_panda.iloc[:,4]
data_cross_validation_panda = pd.DataFrame(data[training_rows:training_rows+cross_validation_rows])
data_cross_validation_panda_X = data_cross_validation_panda.iloc[:,0:4]
data_cross_validation_panda_y = data_cross_validation_panda.iloc[:,4]
data_testing_panda = pd.DataFrame(data[training_rows+cross_validation_rows:])
data_testing_panda_X = data_testing_panda.iloc[:,0:4]
data_testing_panda_y = data_testing_panda.iloc[:,4]
#Take out the labels from the 3 parts
lookupTable, data_training_panda_y_indexed = np.unique(data_training_panda_y, return_inverse=True)
#Label the CV and Testing
data_cross_validation_panda_y_indexed = np.array([])
data_testing_panda_y_indexed = np.array([])
bezdekIris.csv Sample Data -
5.1,3.5,1.4,0.2,Incoming
4.9,3.0,1.4,0.2,Outgoing
4.7,3.2,1.3,0.2,Neutral
Using searchsorted could be a solution.
data_cross_validation_panda_y_indexed = np.searchsorted(lookupTable, data_cross_validation_panda_y)
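This works because np.unique returns its values sorted, which is exactly what np.searchsorted requires. A minimal sketch with made-up labels:

import numpy as np

train_y = np.array(['Outgoing', 'Incoming', 'Neutral', 'Incoming'])
lookupTable, train_y_indexed = np.unique(train_y, return_inverse=True)
print(lookupTable)       # ['Incoming' 'Neutral' 'Outgoing'] - sorted
print(train_y_indexed)   # [2 0 1 0]

cv_y = np.array(['Neutral', 'Outgoing'])
cv_y_indexed = np.searchsorted(lookupTable, cv_y)
print(cv_y_indexed)      # [1 2] - same encoding as the training labels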
My problem is that I have a DataFrame of 200 rows and 200 columns. While I scroll to the right, the index column stays fixed (I can still see it), as it should.
However, when I select a column or a value in the DataFrame (for example to order the values in ascending or descending order), the index column changes and becomes the same as the column I selected.
I would like to still see the index column.
I am using Spyder 3.3.0 and Python 3.6
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import operator
# Importing the dataset
dataset = pd.read_csv('1992_2014.csv', index_col =0)
nations_all = dataset.iloc[:, 0].values
nations = [nations_all[0]]
for i in range(0, len(nations_all)):
    if nations_all[i] not in nations:
        nations.append(nations_all[i])
Year = dataset.iloc[:, 1].values
CO2 = dataset.iloc[:, 8].values
# Creating the Trend Matrix between two nations
trend_matrix = pd.DataFrame(index = nations, columns = nations)
for i in nations:
    n = dataset[dataset["Nation"] == i].index.values.astype(int)
    for k in nations:
        kn = dataset[dataset["Nation"] == k].index.values.astype(int)
        div_n = CO2[n[0]]
        div_kn = CO2[kn[0]]
        CO2_n = (CO2[n]/div_n)
        CO2_kn = (CO2[kn]/div_kn)
        trend_matrix.loc[i, k] = sum(list(map(abs, list(map(operator.sub, CO2_n, CO2_kn)))))
Thanks!