scikit learn: Problems creating customized CountVectorizer and ChiSquare - python

I have the following code (based on the samples here), but it is not working:
[...]
def my_analyzer(s):
    return s.split()
my_vectorizer = CountVectorizer(analyzer=my_analyzer)
X_train = my_vectorizer.fit_transform(traindata)
ch2 = SelectKBest(chi2,k=1)
X_train = ch2.fit_transform(X_train,Y_train)
[...]
The following error is given when calling fit_transform:
AttributeError: 'function' object has no attribute 'analyze'
According to the documentation, CountVectorizer should be created like this: vectorizer = CountVectorizer(tokenizer=my_tokenizer). However, if I do that, I get the following error: "got an unexpected keyword argument 'tokenizer'".
My actual scikit-learn version is 0.10.

You're looking at the documentation for 0.11 (to be released soon), where the vectorizer has been overhauled. Check the documentation for 0.10, where there is no tokenizer argument and the analyzer should be an object implementing an analyze method:
class MyAnalyzer(object):
    @staticmethod
    def analyze(s):
        return s.split()
v = CountVectorizer(analyzer=MyAnalyzer())
http://scikit-learn.org/dev is the documentation for the upcoming release (which may change at any time), while http://scikit-learn.org/stable has the documentation for the current stable version.
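If you already have a plain analyzer function, it can be adapted to the object interface that 0.10 expects. This is only a minimal sketch: FunctionAnalyzer is a hypothetical helper name, not part of scikit-learn, and it relies solely on the behaviour described in the answer, namely that 0.10 calls analyzer.analyze(text).

```python
class FunctionAnalyzer(object):
    """Hypothetical adapter: wraps a function in an object with an analyze() method."""

    def __init__(self, func):
        self.func = func

    def analyze(self, text):
        # Delegate to the wrapped function.
        return self.func(text)


def my_analyzer(s):
    return s.split()


analyzer = FunctionAnalyzer(my_analyzer)
tokens = analyzer.analyze("the quick brown fox")

# Assumed usage on 0.10, following the answer:
# v = CountVectorizer(analyzer=FunctionAnalyzer(my_analyzer))
```

The bare function has no analyze attribute, which is exactly what the AttributeError in the question reports; the adapter supplies it.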

Related

TypeError: 'Vocab' object is not callable

I'm following the tutorial on torchtext transformers which is published on 1.9 pytorch. However, because I'm working on a Tegra TX2, I am stuck to using torchtext 0.6.0, and not 0.10.0 (which is what I assume the tutorial uses).
Following the tutorial, the following throws an error:
data = [torch.tensor(vocab(tokenizer(item)), dtype=torch.long) for item in raw_text_iter]
return torch.cat(tuple(filter(lambda t: t.numel() > 0, data)))
The error is:
TypeError: 'Vocab' object is not callable
I understand what the error means; what I don't know is what the expected return value from Vocab is in this case.
Looking at the documentation for TorchText 0.6.0 I see that it has:
stoi
itos
freqs
vectors
Is the example expecting the vectors from Vocab?
EDIT:
I looked up the 0.10.0 documentation and it doesn't have a __call__.
Looking at the source for the implementation of Vocab in 0.10.0, apparently it is a subclass of torch.nn.Module, which means it inherits __call__ from there (calling it is roughly equivalent to calling its forward() method, but with some additional machinery for implementing hooks and such).
We can also see that it wraps some underlying VocabPyBind object (equivalent to the Vocab class in older versions), and its forward() method just calls its lookup_indices method.
So in short, it seems the equivalent in older versions of the library would be to call vocab.lookup_indices(tokenizer(item)).
Update: Apparently in 0.6.0 the Vocab class does not even have a lookup_indices method, but reading the source for that, this is just equivalent to:
[vocab[token] for token in tokenizer(item)]
If you're ever able to upgrade, for the sake of forward-compatibility you could write a wrapper like:
from torchtext.vocab import Vocab as _Vocab

class Vocab(_Vocab):
    def lookup_indices(self, tokens):
        # Index into this vocabulary itself, not a global named vocab.
        return [self[token] for token in tokens]

    def __call__(self, tokens):
        return self.lookup_indices(tokens)
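To see the token-by-token lookup in isolation, here is a small sketch that runs without torchtext. TinyVocab and encode are hypothetical stand-ins, and the fallback of unknown tokens to an "<unk>" index is an assumption made for the illustration, not a statement about any particular torchtext version.

```python
class TinyVocab(object):
    """Hypothetical stand-in for a vocabulary that supports vocab[token]."""

    def __init__(self, tokens, unk="<unk>"):
        self.unk = unk
        # itos: index -> token, stoi: token -> index (names mirror the 0.6.0 attributes).
        self.itos = [unk] + sorted(set(tokens))
        self.stoi = {tok: i for i, tok in enumerate(self.itos)}

    def __getitem__(self, token):
        # Assumption: unknown tokens map to the index of the unk token.
        return self.stoi.get(token, self.stoi[self.unk])


def encode(vocab, tokenizer, text):
    # Same shape as the list comprehension in the answer.
    return [vocab[token] for token in tokenizer(text)]


vocab = TinyVocab("the cat sat".split())
ids = encode(vocab, str.split, "the dog sat")
```

The resulting list of integers is what the tutorial then wraps in torch.tensor(..., dtype=torch.long).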

Got `AttributeError` from `from_pymc3` of ArviZ

I am learning Bayesian inference from the book Bayesian Analysis with Python. However, when using plot_ppc, I got an AttributeError and the warning
/usr/local/Caskroom/miniconda/base/envs/kaggle/lib/python3.9/site-packages/pymc3/sampling.py:1689: UserWarning: samples parameter is smaller than nchains times ndraws, some draws and/or chains may not be represented in the returned posterior predictive sample
warnings.warn(
The model is
shift = pd.read_csv('../data/chemical_shifts.csv')
with pm.Model() as model_g:
    μ = pm.Uniform('μ', lower=40, upper=70)
    σ = pm.HalfNormal('σ', sd=10)
    y = pm.Normal('y', mu=μ, sd=σ, observed=shift)
    trace_g = pm.sample(1000, return_inferencedata=True)
If I use the following code
with model_g:
    y_pred_g = pm.sample_posterior_predictive(trace_g, 100, random_seed=123)
data_ppc = az.from_pymc3(trace_g.posterior, posterior_predictive=y_pred_g) # 'Dataset' object has no attribute 'report'
I got 'Dataset' object has no attribute 'report'.
If I use the following code
with model_g:
    y_pred_g = pm.sample_posterior_predictive(trace_g, 100, random_seed=123)
data_ppc = az.from_pymc3(trace_g, posterior_predictive=y_pred_g) # AttributeError: 'InferenceData' object has no attribute 'report'
I got AttributeError: 'InferenceData' object has no attribute 'report'.
ArviZ version: 0.11.2
PyMC3 Version: 3.11.2
Aesara/Theano Version: 1.1.2
Python Version: 3.9.6
Operating system: MacOS Big Sur
How did you install PyMC3: conda
You are passing return_inferencedata=True to pm.sample(), which according to the PyMC3 documentation will return an InferenceData object rather than a MultiTrace object.
return_inferencedata : bool, default=False
Whether to return the trace as an arviz.InferenceData (True) object or a MultiTrace (False). Defaults to False, but we’ll switch to True in an upcoming release.
The from_pymc3 function, however, expects a MultiTrace object.
The good news is that from_pymc3 returns an InferenceData object, so you can solve this in one of two ways:
1. The easiest solution is to simply remove the from_pymc3 calls, since it returns InferenceData, which you already have due to return_inferencedata=True.
2. Set return_inferencedata=False (you can also remove that argument, but the documentation states that in the future it will default to True, so to be future-proof it's best to explicitly set it to False). This will return a MultiTrace which can be passed to from_pymc3.

Doc2Vec __init__() got an unexpected keyword argument 'size'

I instantiated the Doc2Vec model like this:
mv_tags_doc = [TaggedDocument(words=word_tokenize_clean(D), tags=[str(i)]) for i, D in enumerate(mv_tags_corpus)]
max_epochs = 50
vector_size = 20
alpha = 0.025
model = Doc2Vec(size=vector_size,
                alpha=alpha,
                min_alpha=0.00025,
                min_count=1,
                dm=0)
model.build_vocab(mv_tags_doc)
but I get this error:
TypeError: __init__() got an unexpected keyword argument 'size'
In the latest version of the Gensim library that you appear to be using, the parameter size is now more consistently vector_size everywhere. See the 'migrating to Gensim 4.0' help page:
https://github.com/RaRe-Technologies/gensim/wiki/Migrating-from-Gensim-3.x-to-4#1-size-ctr-parameter-is-now-consistently-vector_size-everywhere
Separately, if you're consulting an online example that uses that outdated parameter name and also suggests the unnecessary specification of min_alpha and alpha, there's a good chance the example you're following is a bad reference in other ways.
So, also take a look at this answer: My Doc2Vec code, after many loops of training, isn't giving good results. What might be wrong?
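If you have older parameter dictionaries lying around, the rename can be applied mechanically before constructing the model. The helper below is a hypothetical sketch (migrate_size_kwarg is not a Gensim function) and handles only the size to vector_size rename discussed here.

```python
def migrate_size_kwarg(kwargs):
    """Return a copy of kwargs with the old 'size' key renamed to 'vector_size'."""
    migrated = dict(kwargs)
    if "size" in migrated and "vector_size" not in migrated:
        migrated["vector_size"] = migrated.pop("size")
    return migrated


old_params = {"size": 20, "min_count": 1, "dm": 0}
new_params = migrate_size_kwarg(old_params)

# Assumed usage with Gensim 4.x:
# model = Doc2Vec(**new_params)
```

The original dictionary is left untouched, so the same settings can still be used with an older installation.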

PyML: graphing the decision surface

PyML has a function for graphing decision surfaces.
First you need to tell PyML which data to use. Here I use a sparsevectordata with my feature vectors. This is the one I used to train my SVM.
demo2d.setData(training_vector)
Then you need to tell it which classifier you want to use. I give it a trained SVM.
demo2d.decisionSurface(best_svm, fileName = "dec.pdf")
However, I get this error message:
Traceback (most recent call last):
**deleted by The Unfun Cat**
demo2d.decisionSurface(best_svm, fileName = "dec.pdf")
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/PyML/demo/demo2d.py", line 140, in decisionSurface
results = classifier.test(gridData)
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/PyML/evaluators/assess.py", line 45, in test
classifier.verifyData(data)
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/PyML/classifiers/baseClassifiers.py", line 55, in verifyData
if len(misc.intersect(self.featureID, data.featureID)) != len(self.featureID) :
AttributeError: 'SVM' object has no attribute 'featureID'
I'm going to dive right into the source, because I have never used PyML. Tried to find it online, but I couldn't track down the verifyData method in the PyML 0.7.2 that was online, so I had to search through downloaded source.
A classifier's featureID is only set in the baseClassifier class's train method (lines 77-78):
if data.__class__.__name__ == 'VectorDataSet' :
    self.featureID = data.featureID[:]
In your code, data.__class__.__name__ is evaluating to "SparseDataSet" (or whatever other class you are using), so the condition evaluates to False and featureID is never set.
Then in demo2d.decisionSurface:
gridData = VectorDataSet(gridX)
gridData.attachKernel(data.kernel)
results = classifier.test(gridData)
Which tries to test your classifier using a VectorDataSet. In this instance classifier.test is equivalent to a call to the assess.test method which tries to verify if the data has the same features the training data had by using baseClassifier.verifyData:
def verifyData(self, data) :
    if data.__class__.__name__ != 'VectorDataSet' :
        return
    if len(misc.intersect(self.featureID, data.featureID)) != len(self.featureID) :
        raise ValueError, 'missing features in test data'
Which then tests the class of the passed data, which is now "VectorDataSet", and proceeds to try to access the featureID attribute that was never created.
Basically, it's either a bug, or a hidden feature.
Long story short, you have to convert your data to a VectorDataSet because SVM.featureID is not set otherwise.
Also, you don't need to pass it a trained classifier; the function trains the classifier for you.
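The interaction between the two class-name checks can be reproduced without PyML. In the sketch below every class is a hypothetical stand-in that copies only the two conditions quoted above; a set intersection replaces misc.intersect, and the raise is written in Python 3 syntax.

```python
class VectorDataSet(object):
    """Stand-in: only carries a featureID list."""
    def __init__(self, feature_ids):
        self.featureID = list(feature_ids)


class SparseDataSet(object):
    """Stand-in with the same data but a different class name."""
    def __init__(self, feature_ids):
        self.featureID = list(feature_ids)


class ToyClassifier(object):
    def train(self, data):
        # featureID is only recorded for VectorDataSet training data.
        if data.__class__.__name__ == 'VectorDataSet':
            self.featureID = data.featureID[:]

    def verifyData(self, data):
        if data.__class__.__name__ != 'VectorDataSet':
            return
        common = set(self.featureID) & set(data.featureID)
        if len(common) != len(self.featureID):
            raise ValueError('missing features in test data')


def fails_on_verify(training_data):
    """Train, then verify against a VectorDataSet grid, as decisionSurface does."""
    clf = ToyClassifier()
    clf.train(training_data)
    try:
        clf.verifyData(VectorDataSet(['x', 'y']))
    except AttributeError:
        return True
    return False


sparse_fails = fails_on_verify(SparseDataSet(['x', 'y']))
vector_fails = fails_on_verify(VectorDataSet(['x', 'y']))
```

Training on the sparse stand-in leaves featureID unset, so the later check against a VectorDataSet raises AttributeError; training on the vector stand-in does not.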
Edit:
I would also like to bring attention to the setData method:
def setData(data_) :
    global data
    data = data_
There is no type-checking at all. So someone could potentially set data to anything, e.g. an integer, a string, etc., which will cause an error in decisionSurface.
If you are going to use setData, you must use it carefully (only with a VectorDataSet), because the code is not as flexible as you would like.
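A guarded variant would surface the problem at the call site rather than later inside decisionSurface. This is a sketch of the idea only, not a patch to PyML: set_data is a hypothetical name, VectorDataSet is a stand-in class, and the class-name comparison simply mirrors the style of check used in the quoted source.

```python
class VectorDataSet(object):
    """Stand-in for PyML's VectorDataSet."""


data = None


def set_data(data_):
    """Store data_ globally, but only if it is a VectorDataSet."""
    global data
    if data_.__class__.__name__ != 'VectorDataSet':
        raise TypeError('expected a VectorDataSet, got %s'
                        % data_.__class__.__name__)
    data = data_


set_data(VectorDataSet())  # accepted

try:
    set_data(42)           # rejected immediately
    rejected = False
except TypeError:
    rejected = True
```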

How to decipher this cPickle error?

I have a pickled file called classifier.pkl that I am trying to load into another module. However, I get an error I don't understand.
My code to pickle:
features = ['bob','ice','snowing'] #... shortened for exposition's sake
def extract_features(document):
    return {'contains(%s)' % word: (word in set(document))
            for word in all_together_word_list}
training_set = classify.util.apply_features(extract_features,tweets[0])
classifier = NaiveBayesClassifier.train(training_set)
cPickle.dump((features, extract_features, classifier), open('cocaine_classifier.pkl','wb'))
My code to unpickle:
features, extract_features, classifier = cPickle.load(open('cocaine_classifier.pkl','rb'))
My error:
AttributeError: 'module' object has no attribute 'extract_features'
A while ago I made the .pkl file by pickling three things:
features : list
extract_features : function
classifier : instance of NLTK Naive Bayes Classifier
Puzzlingly, I get the same error with the following code:
x = cPickle.load(open('cocaine_classifier.pkl','rb'))
Why can't I retrieve three things? Even when I'm not trying to unpack the tuple?
Update
As NPE pointed out, the path of the function to be unpickled must exactly match the namespace into which it is being unpickled. I was debugging in Terminal, so from mod import * loads everything into the namespace, whereas import mod as m does not.
The problem is that when you pickle a function, only the (fully-qualified) name of the function is pickled, not the function itself. This means that you have to have the function definition in place when you're unpickling.
Did you by any chance mean to pickle the result of calling extract_features?
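The by-name behaviour is easy to observe. The sketch below uses Python 3's pickle module rather than the cPickle in the question (an assumption made so it runs on a current interpreter) and pickles a standard-library function, json.dumps, purely as a convenient example.

```python
import json
import pickle

# Pickling a module-level function stores a reference (module + name), not its code.
payload = pickle.dumps(json.dumps)

module_name_stored = b"json" in payload
function_name_stored = b"dumps" in payload

# Unpickling re-imports the module and looks the name up again,
# so the result is the very same function object.
restored = pickle.loads(payload)
same_object = restored is json.dumps
```

If the module or the name cannot be found at load time, the lookup fails, which is the situation the AttributeError in the question describes.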
