I'm trying to use a RandomForestClassifier on some data I have. The code is below:
print train_data[0,0:20]
print train_data[0,21::]
print test_data[0]
print 'Training...'
forest = RandomForestClassifier(n_estimators=100)
forest = forest.fit( train_data[0::,0::20], train_data[0::,21::] )
print 'Predicting...'
output = forest.predict(test_data)
but this generates the following error:
ValueError: Number of features of the model must match the input.
Model n_features is 3 and input n_features is 21
The output from the first three print statements is:
[ 0. 0. 0. 0. 1. 0.
0. 0. 0. 0. 1. 0.
0. 0. 0. 37.7745986 -122.42589168
0. 0. 0. ]
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
1. 0.]
[ 0. 0. 0. 0. 0. 0.
0. 1. 0. 0. 1. 0.
0. 0. 0. 0. 37.73505101
-122.3995877 0. 0. 0. ]
I had assumed that the data was in the correct format for my fit/predict calls, but it is erroring out on the predict. Can anyone see what I am doing wrong here?
The input data used to train the model is train_data[0::,0::20], which I think is a mistake (why skip features in between?) -- it should be train_data[0::,0:20] instead, based on the debug prints you did at the beginning.
Also, it seems that the last column represents the labels in both train_data and test_data. When predicting, you might want to pass test_data[:, :20] instead of test_data when calling the predict function.
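Putting both fixes together, a minimal sketch (assuming, from your debug prints, that the first 20 columns are the features; adjust the label index if your labels live elsewhere):
forest = RandomForestClassifier(n_estimators=100)
# assumed layout: columns 0..19 are features, column 20 holds the label
forest = forest.fit(train_data[:, :20], train_data[:, 20])
output = forest.predict(test_data[:, :20])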
I tried interpolating not-a-number (nan) pixels in a scan with inpaint from OpenCV. This worked fine in the bulk of the image, but nan pixels at the edges of the image remained nan.
Here is a minimal python example to reproduce the problem:
import numpy as np
import cv2 as cv
if __name__ == '__main__':
    input = np.zeros((6, 6))
    input[1, 3] = np.nan
    input = np.float32(input)
    mask = np.uint8(input != 0)
    inpaintRadius = 2
    inpaintAlgorithm = cv.INPAINT_NS
    output = cv.inpaint(input, mask, inpaintRadius, inpaintAlgorithm)
    print(output)
This gives the output:
[[ 0. 0. 0. 0. 0. 0.]
[ 0. 0. 0. nan 0. 0.]
[ 0. 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 0.]]
As the nan pixel is interpolated with the Navier-Stokes equation, the correct solution is the equilibrium state. Therefore, I would expect and want the following output:
[[ 0. 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 0.]]
If I relocate the nan from [1, 3] to [2, 3], then I obtain the expected output.
Does someone know how inpaint from OpenCV handles the edges, and what the appropriate way to interpolate the edges is?
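One untested idea (my own assumption, not something I have verified against OpenCV's internals): pad the image with cv.copyMakeBorder so that no masked pixel lies near the border, inpaint, and then crop the padding away again:
pad = inpaintRadius
padded = cv.copyMakeBorder(input, pad, pad, pad, pad, cv.BORDER_REPLICATE)
padded_mask = cv.copyMakeBorder(mask, pad, pad, pad, pad,
                                cv.BORDER_CONSTANT, value=0)
output = cv.inpaint(padded, padded_mask, inpaintRadius, inpaintAlgorithm)
output = output[pad:-pad, pad:-pad]  # crop back to the original 6x6 size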
I am trying to select the k nearest points in a NumPy 2D array, excluding points with zero value.
My data is the following:
print(data)
[[0. 0. 0. ... 0. 0. 0. ]
[0. 0. 0. ... 0. 0. 0. ]
[0. 0. 6.41 ... 0. 0. 0. ]
...
[0. 0. 2.99 ... 0. 0. 0. ]
[0. 0. 0. ... 0. 0. 0. ]
[0. 0. 0. ... 0. 0. 0. ]]
I want to find the k nearest neighbours among the points whose value is greater than zero.
How can I deal with this problem?
Thank you!!!
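One direction I am considering (a rough sketch under my own assumptions: "nearest" means nearest in (row, column) coordinates, and the query point and k below are hypothetical placeholders) is to collect the coordinates of all over-zero points and index them with a k-d tree:
import numpy as np
from scipy.spatial import cKDTree

query = (10, 10)  # hypothetical query point (row, col)
k = 5             # hypothetical number of neighbours

coords = np.argwhere(data > 0)      # coordinates of every over-zero cell
tree = cKDTree(coords)              # spatial index over those coordinates
dist, idx = tree.query(query, k=k)  # distances and indices of the k nearest
nearest = coords[idx]               # their (row, col) coordinates
values = data[nearest[:, 0], nearest[:, 1]]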
How can I find the gradient of a log-likelihood layer if I have:
logP = [[-5.8971105e+00 -1.3536860e-01 -2.3225722e+00 -3.6559267e+00]
[-7.1035299e+00 -7.1037712e+00 -8.0828800e+00 -1.9549085e-03]]
oneHotTruth = [[0. 0. 0. 1.]
[0. 0. 0. 1.]]
gradInput should be equal to:
[[ 0. 0. 0. -0.5]
[ 0. 0. 0. -0.5]]
I need to implement this without using PyTorch / TensorFlow.
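Here is a NumPy-only sketch of what I believe the computation is, assuming the layer is the mean negative log-likelihood over the batch (with batch size 2 this reproduces the -0.5 entries):
import numpy as np

logP = np.array([[-5.8971105e+00, -1.3536860e-01, -2.3225722e+00, -3.6559267e+00],
                 [-7.1035299e+00, -7.1037712e+00, -8.0828800e+00, -1.9549085e-03]])
oneHotTruth = np.array([[0., 0., 0., 1.],
                        [0., 0., 0., 1.]])

N = logP.shape[0]                       # batch size
loss = -np.sum(oneHotTruth * logP) / N  # mean negative log-likelihood
gradInput = -oneHotTruth / N            # d(loss)/d(logP)
print(gradInput)                        # [[ 0.  0.  0. -0.5]
                                        #  [ 0.  0.  0. -0.5]]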
Hi all :) I'm trying to write Python code to compute (and print) a cosine similarity matrix between the words in a text file.
So, for example what I have is this text file:
f.txt:
"hello my name is Sara and now I'm looking for your help"
the output should look like this:
hello my name is Sara and now I'm looking for your help
hello 1 0.54 0.42 ... ........ .......... ...
my
name
is
sara
and
now
I'm
looking
for
your
help
And so on. Any help with coding that?
This is my attempt:
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
tokens = []
with open('try.txt', 'r') as f:
    for line in f.readlines():
        tokens += nltk.word_tokenize(line)  # the file is long, so this gives me a memory error
# Vectorise the data
vec = TfidfVectorizer()
X = vec.fit_transform(tokens)
S = cosine_similarity(X)
print(S)
I think you misunderstand what the cosine similarity is. I suggest reading up on what comparing texts based on their cosine similarity involves exactly, but just to give you a rough idea: Cosine similarity is commonly used to compare two strings, where each string consists of multiple tokens. You first tokenize each string and then translate the tokens into vectors. While anything in a string can be a token, it is quite common to choose individual words as tokens.
In your example, each string just consists of one token, which is the word. So you are essentially asking: "What is the similarity between the string 'Hello' and the string 'Sara', using the words in each string as the unit of comparison?" That doesn't make any sense: 'Hello' is not in 'Sara' and 'Sara' is not in 'Hello', so the similarity is 0. To show this, here is working code for your example:
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
the_corpus = ['Hello', 'my', 'name', 'is', 'Sara', 'and', 'now', 'I\'m',
              'looking', 'for', 'your', 'help']
# Vectorise the data
vec = TfidfVectorizer()
X = vec.fit_transform(the_corpus)
S = cosine_similarity(X)
print(S)
The output is not really helpful.
[[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]]
If you are interested in the similarity between individual words, there are a couple of things you can do. For instance, you could tokenize your strings using the individual letters of each word as tokens. What is more common, though, is to use alternative concepts such as the "Minimum Edit Distance". It might make sense to read up on those as well.
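For illustration, here is a sketch of the letter-based idea, using scikit-learn's analyzer='char' option so that individual characters become the tokens (a rough starting point only, not a full word-similarity measure):
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

words = ['hello', 'my', 'name', 'is', 'sara', 'and', 'now', "i'm",
         'looking', 'for', 'your', 'help']
vec = TfidfVectorizer(analyzer='char')  # every character is a token
X = vec.fit_transform(words)
S = cosine_similarity(X)
print(S.round(2))  # now e.g. 'hello' and 'help' share 'h', 'e', 'l'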
I tried a method to split data between train and test sets, but it seems that it fills train with zeros and leaves the data in test...
In theory, it works:
When I apply the following function, which randomly selects some columns of the given array, it worked with the DataLens numpy matrix but not with other data.
def train_test_split(array):
    test = np.zeros(array.shape)
    train = array.copy()
    for user in xrange(array.shape[0]):
        test_ratings = np.random.choice(array[user, :].nonzero()[0],
                                        size=10,
                                        replace=False)
        train[user, test_ratings] = 0.
        test[user, test_ratings] = ratings[user, test_ratings]

    # Test and training are truly disjoint
    assert(np.all((train * test) == 0))
    return train, test

train, test = train_test_split(ratings)
With simple data it doesn't work. When using these simple ratings:
[[ 1. 1. 0. 0. 0.]
[ 1. 0. 0. 0. 0.]
[ 0. 0. 1. 0. 0.]
[ 1. 0. 0. 0. 0.]
[ 0. 0. 0. 1. 1.]]
It fills the array with zeros one by one, even though train was a copy of ratings at the very beginning:
train:
[[ 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0.]]
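For reference, here is an untested variant of the function I am considering, with two changes: it never samples more test ratings than a row has non-zero entries, and it reads values from array instead of the global ratings:
def train_test_split(array, k=10):
    test = np.zeros(array.shape)
    train = array.copy()
    for user in xrange(array.shape[0]):
        nonzero = array[user, :].nonzero()[0]
        size = min(k, len(nonzero))  # guard against rows with fewer than k ratings
        test_ratings = np.random.choice(nonzero, size=size, replace=False)
        train[user, test_ratings] = 0.
        test[user, test_ratings] = array[user, test_ratings]  # array, not the global ratings
    assert np.all(train * test == 0)
    return train, test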