PyML has a function for graphing decision surfaces.
First you need to tell PyML which data to use. Here I use a SparseDataSet containing my feature vectors; it is the same one I used to train my SVM.
demo2d.setData(training_vector)
Then you need to tell it which classifier you want to use. I give it a trained SVM.
demo2d.decisionSurface(best_svm, fileName = "dec.pdf")
However, I get this error message:
Traceback (most recent call last):
**deleted by The Unfun Cat**
demo2d.decisionSurface(best_svm, fileName = "dec.pdf")
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/PyML/demo/demo2d.py", line 140, in decisionSurface
results = classifier.test(gridData)
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/PyML/evaluators/assess.py", line 45, in test
classifier.verifyData(data)
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/PyML/classifiers/baseClassifiers.py", line 55, in verifyData
if len(misc.intersect(self.featureID, data.featureID)) != len(self.featureID) :
AttributeError: 'SVM' object has no attribute 'featureID'
I'm going to dive right into the source, because I have never used PyML. I tried to find the verifyData method in the PyML 0.7.2 source posted online but couldn't track it down, so I searched through the downloaded source instead.
A classifier's featureID is only set in the baseClassifier class's train method (lines 77-78):
if data.__class__.__name__ == 'VectorDataSet' :
    self.featureID = data.featureID[:]
In your code, data.__class__.__name__ evaluates to "SparseDataSet" (or whatever other class you are using), so the condition is False and featureID is never set.
Then in demo2d.decisionSurface:
gridData = VectorDataSet(gridX)
gridData.attachKernel(data.kernel)
results = classifier.test(gridData)
This tries to test your classifier using a VectorDataSet. Here classifier.test is equivalent to a call to the assess.test method, which checks that the data has the same features the training data had by calling baseClassifier.verifyData:
def verifyData(self, data) :
    if data.__class__.__name__ != 'VectorDataSet' :
        return
    if len(misc.intersect(self.featureID, data.featureID)) != len(self.featureID) :
        raise ValueError, 'missing features in test data'
This checks the class of the passed data, which is now "VectorDataSet", and then tries to access the featureID attribute that was never created.
Basically, it's either a bug, or a hidden feature.
Long story short, you have to convert your data to a VectorDataSet, because SVM.featureID is not set otherwise.
Also, you don't need to pass it a trained classifier; the function trains the classifier for you.
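Something like the following minimal sketch should then work (X and labels here are placeholders for your raw 2-d feature vectors and their labels; adapt the construction to however you currently build your SparseDataSet):
from PyML import VectorDataSet, SVM
from PyML.demo import demo2d

# Build a dense VectorDataSet from the same 2-d feature vectors and labels
# you used for the SparseDataSet (X and labels are placeholders).
dense_data = VectorDataSet(X, L=labels)

demo2d.setData(dense_data)
# Pass an untrained SVM; decisionSurface trains it on the data itself.
demo2d.decisionSurface(SVM(), fileName="dec.pdf")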
Edit:
I would also like to bring attention to the setData method:
def setData(data_) :
    global data
    data = data_
There is no type-checking at all. So someone could potentially set data to anything, e.g. an integer, a string, etc., which will cause an error in decisionSurface.
If you are going to use setData, you must use it carefully (only with a VectorDataSet), because the code is not as flexible as you would like.
Related
I am using pycaret on my new laptop, and also after a gap of 6 months, so I am not sure whether this problem is due to some issue on my laptop or due to changes in the pycaret package itself. Earlier I simply used to create an experiment using pycaret's setup method and it used to work, but now it keeps raising one error after another. For example, I used the two lines below to set up the experiment.
from pycaret.classification import *
exp = setup(data=df.drop(['id'], axis=1), target='cancer', session_id=123)
But this gave an error:
ValueError: Setting a random_state has no effect since shuffle is False. You should leave random_state to its default (None), or set shuffle=True.
Then I changed my second line as below:
exp = setup(data=df.drop(['id'], axis=1), target='cancer', session_id=123, fold_shuffle=True, imputation_type='iterative')
Then it returned a new error:
AttributeError: 'Make_Time_Features' object has no attribute 'list_of_features'
I remember that earlier I never had to pass these arguments to the setup method. It looks like even the default values of the arguments in pycaret's setup method are not working. Can anyone suggest how to troubleshoot this?
I am trying to follow this example https://github.com/dsfsi/textaugment to load a pre-trained Gensim model for data augmentation:
import textaugment
import gensim
from textaugment import Word2vec
model = gensim.models.KeyedVectors.load_word2vec_format(r'\GoogleNews-vectors-negative300.bin', binary=True)
from textaugment import Word2vec
t = Word2vec(model)
t.augment('The stories are good')
but I get the following error:
TypeError: __init__() takes 1 positional argument but 2 were given
at line
t = Word2vec(model)
What am I doing wrong?
If you edit your question to include the full error message shown, including the traceback identifying the exact files/lines of code leading to the error, it will often provide important extra information about what's going wrong. (Whenever possible, show answerers all the text/output that you see, not just excerpts.)
But also, the examples at the page you link, https://github.com/dsfsi/textaugment, all show the model passed in as a named parameter (model=SOMETHING), not merely a positional parameter. You should try to do it the same way (and here I've changed the name of your local variable to make it more distinct from the parameter name, and removed the out-of-place r prefix):
my_model = gensim.models.KeyedVectors.load_word2vec_format('\GoogleNews-vectors-negative300.bin', binary=True)
t = Word2vec(model=my_model)
The error you got may be less confusing once you know, from experience or careful viewing of the traceback, that the call to the constructor Word2vec() actually calls another method, __init__(), behind the scenes. That __init__() method receives both the newly-created instance and whatever else you supplied as 'positional' arguments. That is: 2 positional arguments, when it normally only expects 1 (the new instance), with any extra arguments expected as named arguments (model=SOMETHING style).
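A tiny illustration of that behaviour, using a hypothetical class (not textaugment's actual code):
class Word2vecLike(object):
    # Only `self` can be passed positionally; everything else must be a keyword.
    def __init__(self, **kwargs):
        self.model = kwargs.get("model")

Word2vecLike(model="something")  # fine: model passed as a named argument
Word2vecLike("something")        # TypeError: __init__() takes 1 positional
                                 # argument but 2 were given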
I am trying to pickle a sklearn machine-learning model and load it in another project. The model is wrapped in a pipeline that does feature encoding, scaling, etc. The problem starts when I want to use self-written transformers in the pipeline for more advanced tasks.
Let's say I have 2 projects:
train_project: it has the custom transformers in src.feature_extraction.transformers.py
use_project: it has other things in src, or has no src catalog at all
If in "train_project" I save the pipeline with joblib.dump(), and then in "use_project" i load it with joblib.load() it will not find something such as "src.feature_extraction.transformers" and throw exception:
ModuleNotFoundError: No module named 'src.feature_extraction'
I should also add that my intention from the beginning was to simplify usage of the model, so a programmer can load the model like any other model, pass very simple, human-readable features, and all the "magic" preprocessing of features for the actual model (e.g. gradient boosting) happens inside.
I thought of creating a /dependencies/xxx_model/ directory in the root of both projects, and storing all the needed classes and functions there (copying the code from "train_project" to "use_project"), so that the structure of the projects is equal and the transformers can be loaded. I find this solution extremely inelegant, because it would force the structure of any project where the model would be used.
I thought of just recreating the pipeline and all the transformers inside "use_project" and somehow loading the fitted values of the transformers from "train_project".
The best possible solution would be if the dumped file contained all the needed info and needed no dependencies, and I am honestly shocked that sklearn Pipelines seem not to have that possibility. What's the point of fitting a pipeline if I cannot load the fitted object later? Yes, it would work if I used only sklearn classes and did not create custom ones, but the non-custom ones do not have all the needed functionality.
Example code:
train_project
src.feature_extraction.transformers.py
from sklearn.base import TransformerMixin

class FilterOutBigValuesTransformer(TransformerMixin):
    def __init__(self):
        pass

    def fit(self, X, y=None):
        self.biggest_value = X.c1.max()
        return self

    def transform(self, X):
        return X.loc[X.c1 <= self.biggest_value]
train_project
main.py
from sklearn.externals import joblib
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from src.feature_extraction.transformers import FilterOutBigValuesTransformer

pipeline = Pipeline([
    ('filter', FilterOutBigValuesTransformer()),
    ('encode', MinMaxScaler()),
])

X = load_some_pandas_dataframe()
pipeline.fit(X)
joblib.dump(pipeline, 'path.x')
test_project
main.py
from sklearn.externals import joblib
pipeline = joblib.load('path.x')
The expected result is the pipeline loading correctly, with the transform method available to use.
The actual result is an exception when loading the file.
I found a pretty straightforward solution. Assuming you are using Jupyter notebooks for training:
Create a .py file where the custom transformer is defined and import it into the Jupyter notebook.
This is the file custom_transformer.py
from sklearn.base import TransformerMixin

class FilterOutBigValuesTransformer(TransformerMixin):
    def __init__(self):
        pass

    def fit(self, X, y=None):
        self.biggest_value = X.c1.max()
        return self

    def transform(self, X):
        return X.loc[X.c1 <= self.biggest_value]
Train your model importing this class from the .py file and save it using joblib.
import joblib
from custom_transformer import FilterOutBigValuesTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

pipeline = Pipeline([
    ('filter', FilterOutBigValuesTransformer()),
    ('encode', MinMaxScaler()),
])

X = load_some_pandas_dataframe()
pipeline.fit(X)
joblib.dump(pipeline, 'pipeline.pkl')
When loading the .pkl file in a different python script, you will have to import the .py file in order to make it work:
import joblib
from utils import custom_transformer # decided to save it in a utils directory
pipeline = joblib.load('pipeline.pkl')
Apparently this problem arises when you split the definitions and the saving code into two different files. So I have found this workaround that has worked for me.
It consists in these steps:
Say we have your 2 projects/repositories: train_project and use_project.
train_project:
In your train_project, create a Jupyter notebook or a .py file.
In that file, define every custom transformer in a class, and import all the other tools needed from sklearn to design the pipelines. Then write the code that pickles the pipeline inside that same file. (Don't create an external .py file such as src.feature_extraction.transformers to define your custom transformers.)
Then fit and dump your pipeline by running that file.
On use_project:
Create a customthings.py file with all the functions and transformers defined inside.
Create another file, file_where_load.py, where you load the pickle. Inside it, make sure you have imported all the definitions from customthings.py. Ensure that the functions and classes have the same names as the ones you used in train_project.
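A minimal sketch of what file_where_load.py could look like under this approach (the file and class names follow the ones used in this thread; the pickle path is illustrative):
# file_where_load.py
import joblib

# Must use the same names as in train_project so unpickling can resolve them.
from customthings import FilterOutBigValuesTransformer

pipeline = joblib.load('pipeline.pkl')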
I hope it works for everyone with the same problem.
I have created a workaround solution. I do not consider it a complete answer to my question, but nonetheless it let me move on from my problem.
Conditions for the workaround to work:
I. Pipeline needs to have only 2 kinds of transformers:
sklearn transformers
custom transformers, but with only attributes of types:
number
string
list
dict
or any combination of those, e.g. a list of dicts with strings and numbers. The generally important thing is that the attributes are JSON serializable.
II. Names of pipeline steps need to be unique (even if there is pipeline nesting)
In short, the model would be stored as a directory with joblib-dumped files, a JSON file for the custom transformers, and a JSON file with other info about the model.
I have created a function that goes through the steps of a pipeline and checks the __module__ attribute of each transformer.
If it finds sklearn in it, it runs joblib.dump under the name specified in steps (the first element of the step tuple), into some selected model directory.
Otherwise (no sklearn in __module__) it adds the transformer's __dict__ to result_dict under a key equal to the name specified in steps. At the end I json.dump the result_dict into the model directory under the name result_dict.json.
If there is a need to go into some transformer, because e.g. there is a Pipeline inside a pipeline, you can probably run this function recursively by adding some rules to the beginning of the function, but then it becomes important to always have unique step/transformer names, even between the main pipeline and subpipelines.
If there is other information needed for creation of the model pipeline, save it in model_info.json.
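A minimal sketch of such a saving function, under the assumptions above (the function name, file names, and directory layout are illustrative, not part of any library):
import json
import os

import joblib

def save_pipeline(pipeline, model_dir):
    """Dump sklearn steps with joblib; store custom-step attributes as JSON."""
    os.makedirs(model_dir, exist_ok=True)
    result_dict = {}
    for name, transformer in pipeline.steps:
        if 'sklearn' in transformer.__module__:
            # Plain sklearn transformer: joblib can serialize it fully.
            joblib.dump(transformer, os.path.join(model_dir, name + '.joblib'))
        else:
            # Custom transformer: keep only its (JSON-serializable) attributes.
            result_dict[name] = transformer.__dict__
    with open(os.path.join(model_dir, 'result_dict.json'), 'w') as f:
        json.dump(result_dict, f)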
Then if you want to load the model for usage:
You need to create (without fitting) the same pipeline in the target project. If pipeline creation is somewhat dynamic and you need information from the source project, load it from model_info.json.
You can copy the function used for serialization and:
replace all joblib.dump calls with joblib.load, and assign the __dict__ of the loaded object to the __dict__ of the object already in the pipeline
replace all the places where you added __dict__ to result_dict with an assignment of the appropriate value from result_dict to the object's __dict__ (remember to load result_dict from the file beforehand)
After running this modified function, the previously unfitted pipeline should have all the transformer attributes that resulted from fitting, and the pipeline as a whole should be ready to predict.
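Correspondingly, a rough sketch of the loading side under the same assumptions (it fills an already constructed, unfitted pipeline with the saved state):
import json
import os

import joblib

def load_pipeline_state(pipeline, model_dir):
    """Restore fitted state into an unfitted pipeline built with the same steps."""
    with open(os.path.join(model_dir, 'result_dict.json')) as f:
        result_dict = json.load(f)
    for name, transformer in pipeline.steps:
        if 'sklearn' in transformer.__module__:
            fitted = joblib.load(os.path.join(model_dir, name + '.joblib'))
            transformer.__dict__ = fitted.__dict__
        else:
            transformer.__dict__.update(result_dict[name])
    return pipeline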
The main things I do not like about this solution are that it needs the pipeline code inside the target project and that all attributes of the custom transformers must be JSON serializable, but I leave it here for other people who stumble on a similar problem; maybe somebody will come up with something better.
I was similarly surprised when I came across the same problem some time ago. Yet there are multiple ways to address this.
Best practice solution:
As others have mentioned, the best practice solution is to move all dependencies of your pipeline into a separate Python package and define that package as a dependency of your model environment.
The environment then has to be recreated whenever the model is deployed. In simple cases this can be done manually e.g. via virtualenv or Poetry. But model stores and versioning frameworks (MLflow being one example) typically provide a way to define the required Python environment (e.g. via conda.yaml). They often can automatically recreate the environment at deployment time.
Solution by putting code into main:
In fact, class and function declarations can be serialized, but only declarations in __main__ actually get serialized. __main__ is the entry point of the script, the file that is run. So if all the custom code and all of its dependencies are in that file, then custom objects can later be loaded in Python environments that do not include the code. This kind of solves the problem, but who wants to have all that code in __main__? (Note that this property also applies to cloudpickle.)
Solution by "mainifying":
There is one other way, which is to "mainify" the classes or function objects before saving. I came across that same problem some time ago and have written a function that does that. It essentially redefines an existing object's code in __main__. Its application is simple: pass the object to the function, then serialize the object, voilà, it can be loaded anywhere. Like so:
# ------ In file1.py: ------
class Foo():
    pass

# ------ In file2.py: ------
from file1 import Foo

foo = Foo()
foo = mainify(foo)

import dill

with open('path/file.dill', 'wb') as f:
    dill.dump(foo, f)
I post the function code below. Note that I have tested this with dill, but I think it should work with pickle as well.
Also note that the original idea is not mine, but came from a blog post that I cannot find right now. I will add the reference/acknowledgement when I find it.
Edit: Blog post by Oege Dijk by which my code was inspired.
import inspect
import types

def mainify(obj, warn_if_exist=True):
    '''If obj is not defined in __main__ then redefine it in __main__. Allows dill
    to serialize custom classes and functions such that they can later be loaded
    without them being declared in the load environment.

    Parameters
    ----------
    obj : Object to mainify (function or class instance)
    warn_if_exist : Bool, default True. Throw exception if function (or class) of
                    same name as the mainified function (or same name as mainified
                    object's __class__) was already defined in __main__. If False,
                    don't throw exception and instead use what was defined in
                    __main__. See Limitations.

    Limitations
    -----------
    Assumes `obj` is either a function or an instance of a class.
    '''
    if obj.__module__ != '__main__':
        import __main__
        is_func = True if isinstance(obj, types.FunctionType) else False

        # Check if obj with same name is already defined in __main__ (for funcs)
        # or if class with same name as obj's class is already defined in __main__.
        # If so, simply return the func with same name from __main__ (for funcs)
        # or assign the class of same name to obj and return the modified obj.
        if is_func:
            on = obj.__name__
            if on in __main__.__dict__.keys():
                if warn_if_exist:
                    raise RuntimeError(f'Function with __name__ `{on}` already defined in __main__')
                return __main__.__dict__[on]
        else:
            ocn = obj.__class__.__name__
            if ocn in __main__.__dict__.keys():
                if warn_if_exist:
                    raise RuntimeError(f'Class with obj.__class__.__name__ `{ocn}` already defined in __main__')
                obj.__class__ = __main__.__dict__[ocn]
                return obj

        # Get source code and compile
        source = inspect.getsource(obj if is_func else obj.__class__)
        compiled = compile(source, '<string>', 'exec')

        # "Declare" in __main__, keeping track of which key of the __main__ dict is new
        pre = list(__main__.__dict__.keys())
        exec(compiled, __main__.__dict__)
        post = list(__main__.__dict__.keys())
        new_in_main = list(set(post) - set(pre))[0]

        # For a function, return the mainified version; else assign the new class to obj
        if is_func:
            obj = __main__.__dict__[new_in_main]
        else:
            obj.__class__ = __main__.__dict__[new_in_main]

    return obj
Have you tried using cloudpickle?
https://github.com/cloudpipe/cloudpickle
Based on my research it seems that the best solution is to create a Python package that includes your trained pipeline and all files.
Then you can pip install it in the project where you want to use it and import the pipeline with from <package name> import <pipeline name>.
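A rough sketch of what such a package's __init__.py could look like (all names and the layout are illustrative; for the unpickling to work, the pipeline must have been dumped while the transformer was imported from this package path, so the pickle's module references resolve):
# my_pipeline_pkg/__init__.py  (package layout is hypothetical)
import os

import joblib

# The custom transformer lives inside the package at a stable import path.
from .transformers import FilterOutBigValuesTransformer

def load_pipeline():
    """Load the fitted pipeline shipped as package data."""
    return joblib.load(os.path.join(os.path.dirname(__file__), 'pipeline.joblib'))
After pip installing it in use_project, loading becomes from my_pipeline_pkg import load_pipeline.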
Credit to Ture Friese for mentioning cloudpickle >=2.0.0, but here's an example for your use case.
import cloudpickle

# Register the module that defines the custom transformer so that its code is
# pickled by value together with the pipeline.
from src.feature_extraction import transformers
cloudpickle.register_pickle_by_value(transformers)

with open('./pipeline.cloudpkl', mode='wb') as file:
    cloudpickle.dump(
        obj=pipeline,   # the fitted Pipeline from above
        file=file,
    )
register_pickle_by_value() is the key, as it ensures your custom module (src.feature_extraction.transformers) is also included when serializing your primary object (pipeline). However, this is not built for recursive module dependence, e.g. it will not help if FilterOutBigValuesTransformer itself contains another import statement.
Adding the location of the transformers.py file to sys.path with sys.path.append may resolve the issue:
import sys
sys.path.append("src/feature_extraction/transformers")
I would like to train an NgramModel on one set of sentences, using Witten-Bell smoothing to estimate the unseen ngrams, and then use it to get the log-likelihood of a test set having been generated by that distribution. I want to do almost the same thing as in the documentation example found here: http://nltk.org/_modules/nltk/model/ngram.html, but with Witten-Bell smoothing instead. Here's some toy code trying to do about what I want to do:
from nltk.probability import WittenBellProbDist
from nltk import NgramModel
est = lambda fdist, bins: WittenBellProbDist(fdist)
fake_train = [str(t) for t in range(3000)]
fake_test = [str(t) for t in range(2900, 3010)]
lm = NgramModel(2, fake_train, estimator = est)
print lm.entropy(fake_test)
Unfortunately, when I try running that, I get the following error:
Traceback (most recent call last):
File "ngram.py", line 8, in <module>
lm = NgramModel(2, fake_train, estimator = est)
File "/usr/lib/python2.7/dist-packages/nltk/model/ngram.py", line 63, in __init__
self._model = ConditionalProbDist(cfd, estimator, len(cfd))
File "/usr/lib/python2.7/dist-packages/nltk/probability.py", line 2016, in __init__
**factory_kw_args)
File "ngram.py", line 4, in <lambda>
est = lambda fdist, bins: WittenBellProbDist(fdist)
File "/usr/lib/python2.7/dist-packages/nltk/probability.py", line 1210, in __init__
self._P0 = self._T / float(self._Z * (self._N + self._T))
ZeroDivisionError: float division by zero
What's causing this error? As far as I can tell I'm using everything correctly according to the documentation, and this works fine when I use Lidstone instead of Witten-Bell.
As a second question, I have data in the form of a collection of disjoint sentences. How can I use the sentences like a list of lists of strings, or do something equivalent that would produce the same distribution? (Of course I could just use a single list containing all the sentences with a dummy token separating subsequent sentences, but that wouldn't produce the same distribution.) The documentation in one place says that a list of lists of strings is allowed, but then I found a bug report where the documentation was supposedly edited to reflect that that wasn't allowed (and when I just try a list of lists of strings I get an error).
It's apparently been a known issue for almost 3 years. The ZeroDivisionError is caused by the following lines in __init__:
if bins == None:
    bins = freqdist.B()
self._freqdist = freqdist
self._T = self._freqdist.B()
self._Z = bins - self._freqdist.B()
Whenever the bins argument is not specified, it defaults to None so self._Z is really just freqdist.B() - freqdist.B() and
self._P0 = self._T / float(self._Z * (self._N + self._T))
reduces down to,
self._P0 = freqdist.B() / 0.0
Additionally, if you specify bins as any value greater than freqdist.B(), then when executing this line of your code,
print lm.entropy(fake_test)
you will receive NotImplementedError because within the WittenBellProbDist class,
def discount(self):
    raise NotImplementedError()
The discount method is apparently also used in prob and logprob of the NgramModel class so you won't be able to call them either.
One way to fix these problems, without changing NLTK, would be to inherit from WittenBellProbDist and override the relevant methods.
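For example, a rough (untested) sketch of that approach; note that the value returned by discount() here is just a placeholder assumption, not a correct Witten-Bell discount:
from nltk.probability import WittenBellProbDist

class MyWittenBellProbDist(WittenBellProbDist):
    def __init__(self, freqdist, bins=None):
        # Pass bins explicitly and keep it larger than freqdist.B(),
        # so self._Z is non-zero and _P0 can be computed.
        WittenBellProbDist.__init__(self, freqdist, bins=freqdist.B() + 1)

    def discount(self):
        # Placeholder so NgramModel.prob()/logprob() stop raising
        # NotImplementedError; a real implementation belongs here.
        return 0.0

est = lambda fdist, bins: MyWittenBellProbDist(fdist)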
Update Dec 2018
NLTK 3.4 contains the reworked ngram modeling module importable as nltk.lm
I would stay away from NLTK's NgramModel for the time being. There is currently a smoothing bug that causes the model to greatly overestimate likelihoods when n>1. This applies for all estimators including WittenBellProbDist and even LidstoneProbDist. I think this error has been around for a few years, suggesting that this part of NLTK is not well tested.
See:
https://github.com/nltk/nltk/issues/367
I have a pickled file called classifier.pkl that I am trying to load into another module. However, I get an error I don't understand.
My code to pickle:
features = ['bob','ice','snowing'] #... shortened for exposition's sake

def extract_features(document):
    return {'contains(%s)' % word: (word in set(document))
            for word in all_together_word_list}

training_set = classify.util.apply_features(extract_features, tweets[0])
classifier = NaiveBayesClassifier.train(training_set)
cPickle.dump((features, extract_features, classifier),
             open('cocaine_classifier.pkl','wb'))
My code to unpickle:
features, extract_features, classifier = \
    cPickle.load(open('cocaine_classifier.pkl','rb'))
My error:
AttributeError: 'module' object has no attribute 'extract_features'
A while ago I made the .pkl file by pickling three things:
features : list
extract_features : function
classifier : instance of NLTK Naive Bayes Classifier
Puzzlingly, I get the same error with the following code:
x = cPickle.load(open('cocaine_classifier.pkl','rb'))
Why can't I retrieve three things? Even when I'm not trying to unpack the tuple?
Update
As NPE pointed out, the path of the function being unpickled must exactly match the path of the function in the environment into which it's being unpickled. I was debugging in the terminal, and from mod import * loads everything into the namespace, whereas import mod as m does not.
The problem is that when you pickle a function, only the (fully-qualified) name of the function is pickled, not the function itself. This means that you have to have the function definition in place when you're unpickling.
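A small illustration of what actually gets stored when a function is pickled (the function name here is just for demonstration):
import cPickle

def extract_features(document):
    return {}

# The pickle stream stores only a reference to __main__.extract_features
# (its qualified name), not the function's code.
data = cPickle.dumps(extract_features)

# Loading that stream in a process whose __main__ (or original module) does not
# define extract_features raises an AttributeError like the one above.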
Did you by any chance mean to pickle the result of calling extract_features?