I am running a feature reduction (from 500 to around 30) for a randomforest classifier algo. I can reduce the number of features, but I want to see what features are left at every point in the reduction.As you can see below, I have made an attempt, but does not work.
X does not contain the ColumnNames. Ideally, it could be possible to also have the columnnames in X but only fit from row, then printing X would be possible I think.
I am sure there is a much better way though...
Anybody know how to do this?
FEATURES = []
readThisFile = r'C:\ManyFeatures.txt'
featuresFile = open(readThisFile)
AllFeatures = featuresFile.read()
FEATURES = AllFeatures.split('\n')
featuresFile.close()
Location = r'C:\MASSIVE.xlsx'
data = pd.read_excel(Location)
X = np.array(data[FEATURES])
y = data['_MiniTARGET'].values
for x in range(533, 10,-100):
X = SelectKBest(f_classif, k=x).fit_transform(X, y)
#U=pd.DataFrame(X)
#print (U.feature_importances_)
Related
I am going to run a study in which multiple raters have to evaluate whether each of a number of papers is '1' or '0'. The reason I use multiple raters is that I suspect that each individual rater is likely to make mistakes, and I hope that by using multiple raters I can control for that.
My aim is to estimate the true proportion of '1' in the population of papers, and I want to do this using a bayesian model in PyMC3. More general answers about model specification without the concrete implementation in PyMC3 are of course also welcome.
This is how I've simulated some data:
n = 250 # number of papers we sample
p = 0.3 # true rate
true_sample = binom.rvs(1, 0.3, size=n)
# add error
def rating(array,error_rate):
scores = []
for i in array:
scores.append(np.random.binomial(i, error_rate))
return np.array(scores)
r = 10 # number of raters
r_error = np.random.uniform(0.7, 0.99,10) # how often does each rater rate a paper correctly
#get the data
rated_data = {}
for i in range(r):
rated_data[f'rater_{i}'] = rating(true_sample, r_error[i])
df = pd.DataFrame(rated_data, index = [f'abstract_{i}' for i in range(250)])
This is the model I have tried:
with pm.Model() as binom_model2:
p = pm.Beta('p',0.5,0.5) # this is the proportion of '1' in the population
for i in range(10): # error_r and p for each rater separately
er = pm.Beta(f'er{i}',10,3)
prob = pm.Binomial(f'prob{i}', p = (p * er), n = n,observed = df.iloc[:,i].sum() )
This seems to work fine, in that it gives good estimates of p and error_r (but do tell me if you think there are problems with the model!). However, it doesn't use all information that is available, namely, the fact that the ratings on each row of the dataframe are ratings of the same paper. I presume that a model that could incorporate this, would give even more accurate estimates of p and of the error-rates. I'm not sure how to do this, and any help would be appreciated.
I am having problem with understanding how data should be prepared for different models:
One to many
Many to one
Many to many(A)
Many to many(B)
Is the right way to think o it this way. Shape numbers are no relevant and do not match the one on picture. I am just trying to understand logic behind.:
import numpy as np
#1. one to many
# X for input y for output
X = np.ones([10,1,5])
y = np.zeros([10,3]) #3 represnts size of output vector
#2. many to one
X = np.ones([10,5,5])
y = np.zeros([10,1])
#3. many to many
X = np.ones([10,5,5])
y = np.zeros([10,5])
# in this case cell should be different than y. It must be bigger to shift some data
#4. many to many
X = np.ones([10,5,5])
y = np.zeros([10,5])
# in this case cell is the same shape as y
Here's the problem:
It take 2 variable inputs, and predict a result.
For example: price and volume as inputs and a decision to buy/sell as a result.
I tried implementing this using K-Neighbors with no success. How would you go about this?
X = cleanedData['ES1 End Price'] #only accounts for 1 variable, don't know how to use input another.
y = cleanedData["Result"]
print(X.shape, y.shape)
kmm = KNeighborsClassifier(n_neighbors = 5)
kmm.fit(X,y) #ValueError for size inconsistency, but both are same size.
Thanks!
X needs to be a matrix/2d array where each column stands for a feature, which doesn't seem true from your code, try reshape X to 2d with X[:,None]:
kmm.fit(X[:,None], y)
Or without resorting to reshape, you'd better always use a list to extract features from a data frame:
X = cleanedData[['ES1 End Price']]
OR with more than one columns:
X = cleanedData[['ES1 End Price', 'volume']]
Then X would be a 2d array, and can be used directly in fit:
kmm.fit(X, y)
''In general, it would get better performance creating batches of linear constraints rather than creating them one at a time. I just wondering if it states even with a huge problem.'' - The wise programmer.
To be clear, I have a (35k x 40) dataset, and I want to do SVM on it. I need to produce the Gramm matrix of this dataset, it is fine, but to pass the coefficient to CPLEX is a mess, it takes hours, here my code:
nn = 35000
XXt = np.random.rand(nn,nn) # the gramm matrix of the dataset
yy = np.random.rand(nn) # the label vector of the dataset
temp = ((yy*XXt).T)*yy
xg, yg = np.meshgrid(range(nn), range(nn))
indici = np.dstack([yg,xg])
quadraric_part = []
for ii in xrange(nn):
for indd in indici[ii][ii:]:
quadraric_part.append([indd[0],indd[1],temp[indd[0],indd[1]]])
The 'quadratic_part' is a list of the form [i,j,c_ij] where c_ij is the coefficient stored in temp. It will be passed to the function 'objective.set_quadratic_coefficients()' of the CPLEX Python API.
There is a wiser way to do that?
P.S. I have maybe a Memory problem, so It wold be better, instead store the whole list 'quadratic_part', call several times the function 'objective.set_quadratic_coefficients()'.... you know what I mean?!
Under the hood, objective.set_quadratic makes use of the CPXXcopyquad function in the C Callable Library. Whereas, objective.set_quadratic_coefficients uses CPXXcopyqpsep.
Here is an example (bear in mind that I am not a numpy expert; it's quite possible there's a better way to do that part):
import numpy as np
import cplex
nn = 5 # a small example size here
XXt = np.random.rand(nn,nn) # the gramm matrix of the dataset
yy = np.random.rand(nn) # the label vector of the dataset
temp = ((yy*XXt).T)*yy
# create symetric matrix
tempu = np.triu(temp) # upper triangle
iu1 = np.triu_indices(nn, 1)
tempu.T[iu1] = tempu[iu1] # copy upper into lower
ind = np.array([[x for x in range(nn)] for x in range(nn)])
qmat = []
for i in range(nn):
qmat.append([np.arange(nn), tempu[i]])
c = cplex.Cplex()
c.variables.add(lb=[0]*nn)
c.objective.set_quadratic(qmat)
c.write("test2.lp")
Your Q matrix is completely dense so depending on the amount of memory you have, this technique may not scale. When it's possible, though, you should get better performance initializing your Q matrix with objective.set_quadratic. Perhaps you'll need to use some hybrid technique where you use both set_quadratic and set_quadratic_coefficients.
I'm loading a csv, using Numpy, as a dataset to create a decision tree model in Python. using the below extract places columns 0-7 in X and the last column as the target in Y.
#load and set data
data = np.loadtxt("data/tmp.csv", delimiter=",")
X = data[:,0:7] #identify columns as data sets
Y = data[:,8] #identfy last column as target
#create model
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X, Y)
What i'd like to know is if its possible to have the classifier in any column. for example if its in the fourth column would the following code still fit the model correctly or would it produce errors when it comes to predicting?
#load and set data
data = np.loadtxt("data/tmp.csv", delimiter=",")
X = data[:,0:8] #identify columns as data sets
Y = data[:,3] #identfy fourth column as target
#create model
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X, Y)
If you have >4 columns, and the 4th one is the target and the others are features, here's one way (out of many) to load them:
# load data
X = np.hstack([data[:, :3], data[:, 5:]]) # features
Y = data[:,4] # target
# process X & Y
(with belated thanks to #omerbp for reminding me hstack takes a tuple/list, not naked arguments!)
First of all, As suggested by #mescalinum in a comment to the question, think of this situation:
.... 4th_feature ... label
.... 1 ... 1
.... 0 ... 0
.... 1 ... 1
............................
In this example, the classifier (any classifier, not DecisionTreeClassifier particularly) will learn that the 4th feature can best predict the label, since the 4th feature is the label. Unfortunately, this issue happen a lot (by accident I mean).
Secondly, if you want the 4th feature to be input label, you can just swap the columns:
arr[:,[frm, to]] = arr[:,[to, frm]]
#Ahemed Fasih's answer can also do the trick, however its around 10 time slower:
import timeit
setup_code = """
import numpy as np
i, j = 400000, 200
my_array = np.arange(i*j).reshape(i, j)
"""
swap_cols = """
def swap_cols(arr, frm, to):
arr[:,[frm, to]] = arr[:,[to, frm]]
"""
stack ="np.hstack([my_array[:, :3], my_array[:, 5:]])"
swap ="swap_cols(my_array, 4, 8)"
print "hstack - total time:", min(timeit.repeat(stmt=stack,setup=setup_code,number=20,repeat=3))
#hstack - total time: 3.29988478635
print "swap - total time:", min(timeit.repeat(stmt=swap,setup=setup_code+swap_cols,number=20,repeat=3))
#swap - total time: 0.372791106328