Python NLTK Maximum Entropy Classifier Error - python

I'm currently using NLTK's Naive Bayes classifier, however I also wanted to try out the Max Ent classifier. It seems from the documentation that it should take the same format for the feature set as the Naive Bayes, but for some reason I am getting this error when I try it:
File "/usr/lib/python2.7/site-packages/nltk/classify/maxent.py", line 323, in train
gaussian_prior_sigma, **cutoffs)
File "/usr/lib/python2.7/site-packages/nltk/classify/maxent.py", line 1453, in train_maxent_classifier_with_scipy
model.fit(algorithm=algorithm)
File "/usr/lib64/python2.7/site-packages/scipy/maxentropy/maxentropy.py", line 1026, in fit
return model.fit(self, self.K, algorithm)
File "/usr/lib64/python2.7/site-packages/scipy/maxentropy/maxentropy.py", line 226, in fit
callback=callback)
File "/usr/lib64/python2.7/site-packages/scipy/optimize/optimize.py", line 636, in fmin_cg
gfk = myfprime(x0)
File "/usr/lib64/python2.7/site-packages/scipy/optimize/optimize.py", line 176, in function_wrapper
return function(x, *args)
File "/usr/lib64/python2.7/site-packages/scipy/maxentropy/maxentropy.py", line 420, in grad
G = self.expectations() - self.K
ValueError: shape mismatch: objects cannot be broadcast to a single shape
I'm not sure what this means, but I am using the same exact input as I am when I run Naive Bayes and that works.(Training data, represented as a list of pairs, the first member of which is a featureset, and the second of which is a classification label.) Any ideas?
Thanks!

I also encountered this problem with NLTK. While I was unable to resolve it satisfactorily (i.e. get Maxent working using scipy), I was able to train a maxent classifier in NLTK when I used a different algorithm. Try training with
me_classifier = nltk.MaxentClassifier.train(trainset,algorithm="iis")
or one of the other acceptable values for algorithm, like "gis" or "megam".

This issue is also dependent on what version of scipy you are using.
NLTK makes use of scipy.maxentropy which was deprecated in scipy 0.10 and removed in 0.11, see the docs for it: http://docs.scipy.org/doc/scipy-0.10.0/reference/maxentropy.html#
I did create an issue for that on github: https://github.com/nltk/nltk/issues/307

you must install nltk then you can classify.
use the code bellow to classify using maximum entropy in python
me_classifier = nltk.MaxentClassifier.train(trainset,algorithm="gis")
print(me_classifier.classify(testing))

Related

Extreme Learning Machine Implementation Error

I am new to the extreme learning machine (ELM) and trying to implement its code. I am using a sample code from this link.
import elm
# load dataset
data = elm.read("iris.data")
# create a classifier
elmk = elm.ELMKernel()
# search for best parameter for this dataset
# define "kfold" cross-validation method, "accuracy" as a objective function
# to be optimized and perform 10 searching steps.
# best parameters will be saved inside 'elmk' object
elmk.search_param(data, cv="kfold", of="accuracy", eval=10)
# split data in training and testing sets
tr_set, te_set = elm.split_sets(data, training_percent=.8, perm=True)
#train and test
tr_result = elmk.train(tr_set)
te_result = elmk.test(te_set)
print(te_result.get_accuracy)
So far, I just ran the below portion of the code and I got an error:
import elm
# load dataset
data = elm.read("iris.data")
# create a classifier
elmk = elm.ELMKernel()
# search for best parameter for this dataset
# define "kfold" cross-validation method, "accuracy" as a objective function
# to be optimized and perform 10 searching steps.
# best parameters will be saved inside 'elmk' object
elmk.search_param(data, cv="kfold", of="accuracy", eval=10)
Here is the error:
Traceback (most recent call last):
File "C:/Users/Mahsa/PycharmProjects/test/ELM.py", line 16, in <module>
elmk.search_param(data, cv="kfold", of="accuracy", eval=10)
File "C:\Users\Mahsa\AppData\Local\Programs\Python\Python37\lib\site-packages\elm\elmk.py", line 489, in search_param
param_kernel=param_ranges[1])
File "C:\Users\Mahsa\AppData\Local\Programs\Python\Python37\lib\site-packages\optunity\api.py", line 212, in minimize
pmap=pmap)
File "C:\Users\Mahsa\AppData\Local\Programs\Python\Python37\lib\site-packages\optunity\api.py", line 245, in optimize
solution, report = solver.optimize(f, maximize, pmap=pmap)
File "C:\Users\Mahsa\AppData\Local\Programs\Python\Python37\lib\site-packages\optunity\solvers\CMAES.py", line 139, in optimize
sigma=self.sigma)
File "C:\Users\Mahsa\AppData\Local\Programs\Python\Python37\lib\site-packages\deap\cma.py", line 90, in __init__
self.dim = len(self.centroid)
TypeError: len() of unsized object
I searched a lot to find the solution but it seems many people had this problem and no one offered them any solution. Can anyone help me please?
BTW, I know someone posted the same question on stackoverflow, but since he didn't received any answers I asked this question again.

Not enough memory to perform factorization expm scipy.sparse.linalg.splu

I have been trying to use TieDIE. In a few words, this software includes an algorithm that find significant subnetwork when you pass some query nodes and a network. With smaller networks It works just fine, but the network that I am interested in, is quite big, It has 21988 nodes and 360474 edges. TieDIE generates an initial network kernel using scipy (although Matlab is also an option to generate this kernel I do not own a license). During the generation of this kernel I get the following error:
Not enough memory to perform factorization. Traceback (most recent call last):
File "Trials.py",
line 44, in <module> diffuser = SciPYKernel(network_path)
File "lib/kernel_scipy.py",
line 83, in __init__ self.kernel = expm(time_T*L)
File "/home/agmoreno/TieDIE-trials/TieDIE/local/lib/python2.7/site-packages/scipy/sparse/linalg/matfuncs.py",
line 602, in expm return _expm(A, use_exact_onenorm='auto')
File "/home/agmoreno/TieDIE-trials/TieDIE/local/lib/python2.7/site-packages/scipy/sparse/linalg/matfuncs.py",
line 665, in _expm X = _solve_P_Q(U, V, structure=structure)
File "/home/agmoreno/TieDIE-trials/TieDIE/local/lib/python2.7/site-packages/scipy/sparse/linalg/matfuncs.py",
line 699, in _solve_P_Q return spsolve(Q, P)
File "/home/agmoreno/TieDIE-trials/TieDIE/local/lib/python2.7/site-packages/scipy/sparse/linalg/dsolve/linsolve.py",
line 198, in spsolve Afactsolve = factorized(A)
File "/home/agmoreno/TieDIE-trials/TieDIE/local/lib/python2.7/site-packages/scipy/sparse/linalg/dsolve/linsolve.py",
line 440, in factorized return splu(A).solve
File "/home/agmoreno/TieDIE-trials/TieDIE/local/lib/python2.7/site-packages/scipy/sparse/linalg/dsolve/linsolve.py",
line 309, in splu ilu=False, options=_options)
MemoryError
What is the most interesting thing about this is that I am using a cluster computer that has 64 cpus, and 700GB or RAM and the software peaks at 1.3% of Memory usage (~10GB), according to a ps monitoring, at some moment of execution and crushing later. I have been told that there is no limit in the usage of RAM... So I really have no clue about what could be happening ...
Maybe someone here could help me on finding an alternative to scipy or solving it.
Is it possible that the memory error comes because of just one node is being used? In this the case, how could I distribute the work across the nodes?
Thanks in advance.
That's right, for a very large network like that you'll need high memory on a single node. The easiest solution is of course a workaround, either:
(1) Is there any way you reduce the size of your input network while still capturing relevant biology? Maybe just look for all the nodes 2 steps away from your input nodes?
(2) Use the new Cytoscape API to do the diffusion for you: http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005598 (https://github.com/idekerlab/heat-diffusion)
(3) Use PageRank instead of computing a heat kernel (not ideal, as we've shown that Diffusion tends to work better on biological networks).
Hope this helps!
-Evan Paull (TieDIE developer/lead author)

Memory error for decision tree with 66k features, using scikit python packages

Problem Statement
I am using a document of 1600000 lines and ~66k features.
I am using the bag of words approach to build a decision tree.
Following code is working fine for 1000 line document.
But throws memory error for the actual 1600000 line document.
My Server has a 64GB of RAM.
Instead of using .todense() or .toarray(), is there any way to use the sparse matrix directly ? OR
Is there any options to reduce the default type float64?
Kindly help me on this.
Code:
vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5,stop_words='english')
X_train = vectorizer.fit_transform(corpus)
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X_train.todense(),corpus2)
Error:
Traceback (most recent call last):
File "test123.py", line 103, in <module>
clf = clf.fit(X_train.todense(),corpus2)
File "/usr/lib/python2.7/dist-packages/scipy/sparse/base.py", line 458, in todense
return np.asmatrix(self.toarray())
File "/usr/lib/python2.7/dist-packages/scipy/sparse/compressed.py", line 550, in toarray
return self.tocoo(copy=False).toarray()
File "/usr/lib/python2.7/dist-packages/scipy/sparse/coo.py", line 219, in toarray
B = np.zeros(self.shape, dtype=self.dtype)
MemoryError
In short, is there any methods to use classification tree for large data set with 66k features.?
Add dtype=np.float32 eg: vec = TfidfVectorizer(..., dtype=np.float32)
As for sparse/dense I have similar problem.
GradientBoostingClassifier, RandomForestClassifier or DecisionTreeClassifier need dense data, for that reason I use SVC.

Meanshift in scikit learn (python) doesn't understand datatype

I have a dataset which has 7265 samples and 132 features.
I want to use the meanshift algorithm from scikit learn but I ran into this error:
Traceback (most recent call last):
File "C:\Users\OJ\Dropbox\Dt\Code\visual\facetest\facetracker_video.py", line 130, in <module>
labels, centers = getClusters(data,clusters)
File "C:\Users\OJ\Dropbox\Dt\Code\visual\facetest\facetracker_video.py", line 34, in getClusters
ms.fit(np.array(dataarray))
File "C:\python2.7\lib\site-packages\sklearn\cluster\mean_shift_.py", line 280, in fit
cluster_all=self.cluster_all)
File "C:\python2.7\lib\site-packages\sklearn\cluster\mean_shift_.py", line 137, in mean_shift
nbrs = NearestNeighbors(radius=bandwidth).fit(sorted_centers)
File "C:\python2.7\lib\site-packages\sklearn\neighbors\base.py", line 642, in fit
return self._fit(X)
File "C:\python2.7\lib\site-packages\sklearn\neighbors\base.py", line 180, in _fit
raise ValueError("data type not understood")
ValueError: data type not understood
My code:
dataarray = np.array(data)
bandwidth = estimate_bandwidth(dataarray, quantile=0.2, n_samples=len(dataarray))
ms = MeanShift(bandwidth=bandwidth, bin_seeding=True)
ms.fit(dataarray)
labels = ms.labels_
cluster_centers = ms.cluster_centers_
If I check the datatype of the data variable I see:
print isinstance( dataarray, np.ndarray )
>>> True
The bandwidth is 0.925538333061 and the dataarray.dtype is float64
I'm using scikit learn 0.14.1
I can cluster with other algorithms in sci-kit (tried kmeans and dbscan). What am I doing wrong ?
EDIT:
The data can be found here:
(pickle format) : http://ojtwist.be/datatocluster.p
and : http://ojtwist.be/datatocluster.npz
That`s a bug in scikit project. It is documented here.
There is a float -> int casting during the fitting process that can crash in some cases (by making the seed points be placed at the corner of the bins instead in the center). There is some code in the link to fix the problem.
If you don't wanna get your hands into the scikit code (and maintain compatibility between your code with other machines) i suggest you normalize your data before passing it to MeanShift.
Try this:
>>>from sklearn import preprocessing
>>>data2 = preprocessing.scale(dataarray)
And then use data2 into your code.
It worked for me.
If you don't want to do either solution, it is a great opportunity to contribute to the project, making a pull request with the solution :)
Edit: You probably want to retain information to "descale" the results of meanshift. So, use a StandardScaler object, instead using a function to scale.
Good luck!

Using sparse matrices/online learning in Naive Bayes (Python, scikit)

I'm trying to do Naive Bayes on a dataset that has over 6,000,000 entries and each entry 150k features. I've tried to implement the code from the following link:
Implementing Bag-of-Words Naive-Bayes classifier in NLTK
The problem is (as I understand), that when I try to run the train-method with a dok_matrix as it's parameter, it cannot find iterkeys (I've paired the rows with OrderedDict as labels):
Traceback (most recent call last):
File "skitest.py", line 96, in <module>
classif.train(add_label(matr, labels))
File "/usr/lib/pymodules/python2.6/nltk/classify/scikitlearn.py", line 92, in train
for f in fs.iterkeys():
File "/usr/lib/python2.6/dist-packages/scipy/sparse/csr.py", line 88, in __getattr__
return _cs_matrix.__getattr__(self, attr)
File "/usr/lib/python2.6/dist-packages/scipy/sparse/base.py", line 429, in __getattr__
raise AttributeError, attr + " not found"
AttributeError: iterkeys not found
My question is, is there a way to either avoid using a sparse matrix by teaching the classifier entry by entry (online), or is there a sparse matrix format I could use in this case efficiently instead of dok_matrix? Or am I missing something obvious?
Thanks for anyone's time. :)
EDIT, 6th sep:
Found the iterkeys, so atleast the code runs. It's still too slow, as it has taken several hours with a dataset of the size of 32k, and still hasn't finished. Here's what I got at the moment:
matr = dok_matrix((6000000, 150000), dtype=float32)
labels = OrderedDict()
#collect the data into the matrix
pipeline = Pipeline([('nb', MultinomialNB())])
classif = SklearnClassifier(pipeline)
add_label = lambda lst, lab: [(lst.getrow(x).todok(), lab[x])
for x in xrange(lentweets-foldsize)]
classif.train(add_label(matr[:(lentweets-foldsize),0], labels))
readrow = [matr.getrow(x + foldsize).todok() for x in xrange(lentweets-foldsize)]
data = np.array(classif.batch_classify(readrow))
The problem might be that each row that is taken doesn't utilize the sparseness of the vector, but goes through each of the 150k entry. As a continuation for the issue, does anyone know how to utilize this Naive Bayes with sparse matrices, or is there any other way to optimize the above code?
Check out the document classification example in scikit-learn. The trick is to let the library handle the feature extraction for you. Skip the NLTK wrapper, as it's not intended for such large datasets.(*)
If you have the documents in text files, then you can just hand those text files to the TfidfVectorizer, which creates a sparse matrix from them:
from sklearn.feature_extraction.text import TfidfVectorizer
vect = TfidfVectorizer(input='filename')
X = vect.fit_transform(list_of_filenames)
You now have a training set X in the CSR sparse matrix format, that you can feed to a Naive Bayes classifier if you also have a list of labels y (perhaps derived from the filenames, if you encoded the class in them):
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()
nb.fit(X, y)
If it turns out this doesn't work because the set of documents is too large (unlikely since the TfidfVectorizer was optimized for just this number of documents), look at the out-of-core document classification example, which demonstrates the HashingVectorizer and the partial_fit API for minibatch learning. You'll need scikit-learn 0.14 for this to work.
(*) I know, because I wrote that wrapper. Like the rest of NLTK, it's intended for educational purposes. I also worked on performance improvements in scikit-learn, and some of the code I'm advertising is my own.

Categories