How to export SpaCy model with multiple components - python

I'm trying to build a spaCy pipeline with multiple components. At the moment my pipeline has only two components: an entity ruler and a custom component.
The way I build it is like this:
import os
import pickle
from pathlib import Path

from spacy.lang.en import English
from spacy.language import Language
from spacy.tokens import Doc

class EntityLookupComponent:
    def __init__(self, nlp):
        self.nlp = nlp

    def __call__(self, doc: Doc) -> Doc:
        print("Just testing")
        return doc

@Language.factory("entity_lookup_component")
def my_component(nlp, name):
    return EntityLookupComponent(nlp)

def main(patterns_path: Path, output_path: Path):
    """Build the spaCy model and output it to disk"""
    # Ensure output_path directory exists
    if not Path(os.path.dirname(output_path)).is_dir():
        os.makedirs(os.path.dirname(output_path))

    nlp = English()
    nlp.add_pipe("entity_ruler").from_disk(patterns_path)
    nlp.add_pipe("entity_lookup_component", name="entity_lookup", last=True)
    print(nlp.pipe_names)

    nlp.to_disk('./test')
    with open(output_path, "wb") as output_file:
        pickle.dump(nlp, output_file)
Outputting the pipe_names gives me: ['entity_ruler', 'entity_lookup'].
However, when I then try to load the model and test it by doing:
nlp = spacy.load("en_core_web_lg", disable=["ner"])
nlp.add_pipe("entity_ruler", source=spacy.load("./test"))
It instantly throws the following error:
ValueError: [E002] Can't find factory for 'entity_lookup_component' for language English (en). This usually happens when spaCy calls `nlp.create_pipe` with a custom component name that's not registered on the current language class. If you're using a Transformer, make sure to install 'spacy-transformers'. If you're using a custom component, make sure you've added the decorator `@Language.component` (for function components) or `@Language.factory` (for class components).
This only happens after I added the entity_lookup_component. The component is supposed to use a lookup table to add some metadata to existing entities.

At the place where you load the model, you need to have access to the code that defines the custom component. So if the file that defines the custom component is custom.py, you can put import custom at the top of the file where you load your pipeline, and it should work.
Also see the docs on saving and loading custom components.
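For example, here is a minimal sketch of that layout, assuming the component code from the question lives in custom.py and the pipeline was saved to ./test:

# custom.py -- defines and registers the factory
from spacy.language import Language
from spacy.tokens import Doc

class EntityLookupComponent:
    def __call__(self, doc: Doc) -> Doc:
        return doc

@Language.factory("entity_lookup_component")
def my_component(nlp, name):
    return EntityLookupComponent()

# load_pipeline.py -- importing custom registers the factory before loading
import spacy
import custom  # noqa: F401  (needed only for the side effect of registration)

nlp = spacy.load("./test")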

Related

Customizing Spacy's Text Categorizer

I am trying to train a spaCy model with a small dataset in spaCy 2.2. It is overfitting, so I want to customize the architecture of the TextCategorizer. I referred to this post on GitHub:
https://github.com/explosion/spaCy/issues/3320
However, I am unable to get it to work:
from spacy.pipeline import TextCategorizer
from thinc.api import layerize
from spacy.language import Language

class StupidTextCategorizer(TextCategorizer):
    name = 'stupid_textcat'

    @classmethod
    def Model(cls, nr_class, **cfg):
        return create_dummy_model(nr_class, cfg.get('preferred_class', 0))

def create_dummy_model(nr_class, preferred_class):
    """Create a Thinc model that always predicts the same class."""
    def dummy_model(docs, drop=0.):
        scores = model.ops.allocate((len(docs), nr_class))
        scores[:, preferred_class] = 1.0
        return scores
    model = layerize(dummy_model)
    return model
However, when I pass it to my training script, it throws this error, which I can't seem to understand:
"[E002] Can't find factory for 'stupid_textcat'. This usually happens when spaCy calls `nlp.create_pipe` with a component name that's not built in - for example, when constructing the pipeline from a model's meta.json. If you're using a custom component, you can write to `Language.factories['stupid_textcat']` or remove it from the model meta and add it via `nlp.add_pipe` instead."
PS: I'm still learning spaCy, but I can't find any helpful documentation or tutorial for the above.
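As the quoted error message itself suggests, one option in spaCy 2.x is to register the custom component under Language.factories before the pipeline is built. A minimal sketch follows; the exact factory signature is an assumption based on spaCy 2.x conventions, not something stated in the question:

from spacy.language import Language

# Assumption: in spaCy 2.x a factory is a callable taking (nlp, **cfg)
# and returning the pipeline component.
Language.factories['stupid_textcat'] = lambda nlp, **cfg: StupidTextCategorizer(nlp.vocab, **cfg)

# With the factory registered, nlp.create_pipe('stupid_textcat') and the
# lookup driven by the model's meta.json should both be able to find it.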

Load existing data catalog programmatically

I want to write pytest unit tests in Kedro 0.17.5. They need to perform integrity checks on dataframes created by the pipeline.
These dataframes are specified in the catalog.yml and already persisted successfully using kedro run. The catalog.yml is in conf/base.
I have a test module test_my_dataframe.py in src/tests/pipelines/my_pipeline/.
How can I load the data catalog based on my catalog.yml programmatically from within test_my_dataframe.py in order to properly access my specified dataframes?
Or, for that matter, how can I programmatically load the whole project context (including the data catalog) in order to also execute nodes etc.?
For unit testing, we test just the function under test; everything external to it should be mocked or patched. Check whether you really need the Kedro project context when writing the unit test.
If you really do need the project context in a test, you can do something like the following:
from pathlib import Path

from kedro.framework.project import configure_project
from kedro.framework.session import KedroSession

with KedroSession.create(package_name="demo", project_path=Path.cwd()) as session:
    context = session.load_context()
    catalog = context.catalog
Or you can create a pytest fixture to reuse it, with a scope of your choice:
import pytest
from pathlib import Path

from kedro.framework.session import KedroSession
from kedro.framework.session.session import _activate_session  # private Kedro 0.17.x helper

@pytest.fixture
def get_project_context():
    session = KedroSession.create(
        package_name="demo",
        project_path=Path.cwd()
    )
    _activate_session(session, force=True)
    context = session.load_context()
    return context
The different args supported by KedroSession.create are documented here: https://kedro.readthedocs.io/en/0.17.5/kedro.framework.session.session.KedroSession.html#kedro.framework.session.session.KedroSession.create
To read more about pytest fixtures, see https://docs.pytest.org/en/6.2.x/fixture.html#scope-sharing-fixtures-across-classes-modules-packages-or-session
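As an illustration of how such a fixture might be used in a test, here is a small sketch; the dataset name "my_dataframe" is hypothetical and stands in for an entry in your catalog.yml:

def test_my_dataframe_integrity(get_project_context):
    catalog = get_project_context.catalog

    # "my_dataframe" is a placeholder; use the dataset name from catalog.yml
    df = catalog.load("my_dataframe")

    # Replace these with the integrity checks that matter for your data
    assert not df.empty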

How to call Python function by name dynamically using a string?

I have a Python function call like so:
import torchvision
model = torchvision.models.resnet18(pretrained=configs.use_trained_models)
Which works fine.
If I attempt to make it dynamic:
import torchvision
model_name = 'resnet18'
model = torchvision.models[model_name](pretrained=configs.use_trained_models)
then it fails with:
TypeError: 'module' object is not subscriptable
Which makes sense, since torchvision.models is a module that exports a bunch of things, including the resnet functions:
# __init__.py for the "models" module
...
from .resnet import *
...
How can I call this function dynamically without knowing ahead of time its name (other than that I get a string with the function name)?
You can use the getattr function:
import torchvision
model_name = 'resnet18'
model = getattr(torchvision.models, model_name)(pretrained=configs.use_trained_models)
This is essentially the same as dot notation, just in function form: it takes a string and retrieves the attribute or method with that name.
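If the model name comes from user input or configuration and may not exist, a small defensive variant is sketched below; pretrained=True here simply stands in for configs.use_trained_models from the question:

import torchvision

model_name = 'resnet18'

# getattr returns the default (None) instead of raising if the name is unknown
model_fn = getattr(torchvision.models, model_name, None)
if model_fn is None:
    raise ValueError(f"Unknown torchvision model: {model_name}")

model = model_fn(pretrained=True)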
The new APIs since Aug 2022 are as follows:
from torchvision.models import ResNet50_Weights

# returns a list of available model names
torchvision.models.list_models()

# returns the specified model with its default pretrained weights
torchvision.models.get_model("alexnet", weights="DEFAULT")

# returns the specified model without pretrained weights (pretrained=False)
torchvision.models.get_model("alexnet", weights=None)

# returns the specified model with explicitly chosen pretrained weights
torchvision.models.get_model("resnet50", weights=ResNet50_Weights.IMAGENET1K_V2)
Reference:
https://pytorch.org/blog/easily-list-and-initialize-models-with-new-apis-in-torchvision/

Writing a custom predict method using MLFlow and pyspark

I am having trouble writing a custom predict method using MLFlow and pyspark (2.4.0). What I have so far is a custom transformer that changes the data into the format I need.
from pyspark.ml import Transformer
from pyspark.sql.functions import explode, split

class CustomGroupBy(Transformer):
    def __init__(self):
        pass

    def _transform(self, dataset):
        df = dataset.select("userid", explode(split("widgetid", ',')).alias("widgetid"))
        return df
Then I built a custom estimator to run one of the pyspark machine learning algorithms
from pyspark.ml import Estimator
from pyspark.ml.fpm import FPGrowth
from pyspark.ml.param.shared import HasInputCol
from pyspark.ml.util import DefaultParamsReadable, DefaultParamsWritable

class PipelineFPGrowth(Estimator, HasInputCol, DefaultParamsReadable, DefaultParamsWritable):
    def __init__(self, inputCol=None, minSupport=0.005, minConfidence=0.01):
        super(PipelineFPGrowth, self).__init__()
        self.minSupport = minSupport
        self.minConfidence = minConfidence

    def setInputCol(self, value):
        return self._set(inputCol=value)

    def _fit(self, dataset):
        c = self.getInputCol()
        fpgrowth = FPGrowth(itemsCol=c, minSupport=self.minSupport, minConfidence=self.minConfidence)
        model = fpgrowth.fit(dataset)
        return model
This runs in the pyspark ML pipeline:
from pyspark.ml import Pipeline

pipeline = Pipeline(stages=[CustomGroupBy(), PipelineFPGrowth(inputCol="widgetid")]).fit(df)
This all works. If I create a new pyspark dataframe with new data to predict on, I get predictions.
newDF = spark.createDataFrame([(123456,['123ABC', '789JSF'])], ["userid", "widgetid"])
pipeline.stages[1].transform(newDF).show(3, False)
# How to access frequent itemset.
pipeline.stages[1].freqItemsets.show(3, False)
Where I run into problems is writing a custom predict. I need to append the frequent itemset that FPGrowth generates to the end of the predictions. I have written the logic for that, but I am having a hard time figuring out how to put it into a custom method. I have tried adding it to my custom estimator but this didn't work. Then I wrote a separate class to take in the returned model and give the extended predictions. This was also unsuccessful.
Eventually I need to log and save the model so I can Dockerize it, which means I will need a custom flavor and mlflow.pyfunc. Does anyone have a hint on how to extend the predict method and then log and save the model?
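There is no accepted answer here, but one common pattern is to wrap the fitted model in an mlflow.pyfunc.PythonModel subclass whose predict method adds the extra output before returning. The sketch below shows the general shape only; the class name, the pandas/Spark conversion, and the logging call are assumptions rather than the asker's code, and in practice the Spark model may need to be persisted with its own writer rather than pickled along with the wrapper:

import mlflow.pyfunc
from pyspark.sql import SparkSession

class FPGrowthWrapper(mlflow.pyfunc.PythonModel):
    """Hypothetical wrapper adding custom logic around predict."""

    def __init__(self, fp_model):
        self.fp_model = fp_model

    def predict(self, context, model_input):
        # pyfunc hands predict a pandas DataFrame, so convert to Spark,
        # run the fitted FPGrowth model, then convert back.
        spark = SparkSession.builder.getOrCreate()
        sdf = spark.createDataFrame(model_input)
        predictions = self.fp_model.transform(sdf)
        # This is where the frequent itemsets (self.fp_model.freqItemsets)
        # could be joined onto or appended to the predictions.
        return predictions.toPandas()

# Log the wrapped model so it can later be saved, served or Dockerized.
with mlflow.start_run():
    mlflow.pyfunc.log_model(
        artifact_path="fpgrowth_model",
        python_model=FPGrowthWrapper(pipeline.stages[1]),
    )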

Python statsmodels OLS: how to save learned model to file

I am trying to learn an ordinary least squares model using Python's statsmodels library, as described here.
sm.OLS.fit() returns the learned model. Is there a way to save it to a file and reload it? My training data is huge and it takes around half a minute to learn the model, so I was wondering if any save/load capability exists for the OLS model.
I tried the repr() method on the model object but it does not return any useful information.
The models and results instances all have a save and load method, so you don't need to use the pickle module directly.
Edit to add an example:
import statsmodels.api as sm
data = sm.datasets.longley.load_pandas()
data.exog['constant'] = 1
results = sm.OLS(data.endog, data.exog).fit()
results.save("longley_results.pickle")
# we should probably add a generic load to the main namespace
from statsmodels.regression.linear_model import OLSResults
new_results = OLSResults.load("longley_results.pickle")
# or more generally
from statsmodels.iolib.smpickle import load_pickle
new_results = load_pickle("longley_results.pickle")
Edit 2: We've now added a load method to the main statsmodels API in master, so you can just do
new_results = sm.load('longley_results.pickle')
I've installed the statsmodels library and found that you can save the values using the pickle module in Python.
Models and results are pickleable via save/load, optionally saving the model data.
[source]
As an example:
Given that you have the results saved in the variable results:
To save the file:
import pickle
with open('learned_model.pkl', 'wb') as f:
    pickle.dump(results, f)
To read the file:
import pickle
with open('learned_model.pkl', 'rb') as f:
    model_results = pickle.load(f)
