I am trying to train a spaCy model with a small dataset in spaCy 2.2. It is overfitting, so I want to customize the architecture of the TextCategorizer. I referred to this post on GitHub:
https://github.com/explosion/spaCy/issues/3320
However, I am unable to get it working:
from spacy.pipeline import TextCategorizer
from thinc.api import layerize
from spacy.language import Language


class StupidTextCategorizer(TextCategorizer):
    name = 'stupid_textcat'

    @classmethod
    def Model(cls, nr_class, **cfg):
        return create_dummy_model(nr_class, cfg.get('preferred_class', 0))


def create_dummy_model(nr_class, preferred_class):
    """Create a Thinc model that always predicts the same class."""
    def dummy_model(docs, drop=0.):
        # `model` is resolved from the enclosing scope when the layer is called
        scores = model.ops.allocate((len(docs), nr_class))
        scores[:, preferred_class] = 1.0
        return scores
    model = layerize(dummy_model)
    return model
However, when I try to pass it to my training script, it throws this error, which I can't seem to understand:
"[E002] Can't find factory for 'stupid_textcat'. This usually happens when spaCy calls `nlp.create_pipe` with a component name that's not built in - for example, when constructing the pipeline from a model's meta.json. If you're using a custom component, you can write to `Language.factories['stupid_textcat']` or remove it from the model meta and add it via `nlp.add_pipe` instead."
PS: I'm still learning spaCy, but I can't find any helpful documentation or tutorial for the above.
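For reference, a minimal sketch of the registration route the error message points to in spaCy 2.x (the factory lambda mirrors the built-in textcat factory and is an assumption, not tested code):

import spacy

# Hypothetical registration, mirroring spaCy 2.x's built-in textcat factory
Language.factories['stupid_textcat'] = (
    lambda nlp, **cfg: StupidTextCategorizer(nlp.vocab, **cfg))

nlp = spacy.blank('en')
nlp.add_pipe(nlp.create_pipe('stupid_textcat'))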
Experts,
I have trained a machine learning NER (named entity recognition) model and saved it locally in .pkl format.
My prediction.py looks like:
import pickle


class NER:
    def __init__(self, ner_model):
        self.ner_model = pickle.load(open(ner_model, "rb"))

    def extract(self, tokens):
        <prediction of ner and return entity_list>
In the unit test I mocked the prediction.NER class, but this test case gives 0% coverage. I'm not sure what I am doing wrong.
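For context, a minimal sketch of how such a mock-based test is often written with unittest.mock (the module path prediction and the expected entities below are assumptions):

import unittest
from unittest import mock

import prediction  # the module shown above


class TestNER(unittest.TestCase):
    @mock.patch("prediction.NER")
    def test_extract(self, mock_ner_cls):
        # Configure the mock class so its instances return canned entities
        mock_ner_cls.return_value.extract.return_value = ["ORG", "PERSON"]
        ner = prediction.NER("dummy_path.pkl")   # returns the mock instance
        entities = ner.extract(["some", "tokens"])
        self.assertEqual(entities, ["ORG", "PERSON"])


if __name__ == "__main__":
    unittest.main()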
I'm going through the documentation of the sktime package. One thing I just cannot find is feature importances (which we'd get with sklearn models) or a model summary (like the one we can obtain from statsmodels). Is this something that is just not implemented yet?
It seems that this functionality is implemented for models like AutoETS or AutoARIMA.
from matplotlib import pyplot as plt
from sktime.datasets import load_airline
from sktime.forecasting.model_selection import temporal_train_test_split
from sktime.forecasting.base import ForecastingHorizon
from sktime.forecasting.ets import AutoETS

y = load_airline()
y_train, y_test = temporal_train_test_split(y)
fh = ForecastingHorizon(y_test.index, is_relative=False)

model = AutoETS(trend='add', seasonal='mul', sp=12)
model.fit(y_train, fh=fh)
model.summary()
I wonder if these summaries are accessible from instances like ForecastingPipeline.
Ok, I was able to solve it myself. I'm really glad the functionality is there!
The source code for ForecastingPipeline indicates that an instance of this class has a steps_ attribute, which holds the fitted instances of the pipeline's steps.
from sktime.forecasting.compose import ForecastingPipeline

model = ForecastingPipeline(steps=[
    ("forecaster", AutoETS(sp=1))])
model.fit(y_train)
model.steps_[-1][1].summary()  # model.steps[-1][1].summary() would throw an error
The output of model.steps_ is [('forecaster', AutoETS())] (and, as mentioned above, the AutoETS() instance here is already fitted).
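As a small variation (assuming the same fitted pipeline as above), the fitted forecaster can also be looked up by its step name instead of by position:

# steps_ is a list of (name, fitted_estimator) tuples, so it converts to a dict
fitted_forecaster = dict(model.steps_)["forecaster"]
fitted_forecaster.summary()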
I'm trying to build a spaCy pipeline using multiple components. My current pipeline has only two components at the moment: an entity ruler and a custom component.
The way I build it is like this:
import os
import pickle
from pathlib import Path

from spacy.lang.en import English
from spacy.language import Language
from spacy.tokens import Doc


class EntityLookupComponent:
    def __init__(self, nlp):
        self.nlp = nlp

    def __call__(self, doc: Doc) -> Doc:
        print("Just testing")
        return doc


@Language.factory("entity_lookup_component")
def my_component(nlp, name):
    return EntityLookupComponent(nlp)


def main(patterns_path: Path, output_path: Path):
    """Build the spaCy model and output it to disk"""
    # Ensure the output_path directory exists
    if not Path(os.path.dirname(output_path)).is_dir():
        os.makedirs(os.path.dirname(output_path))

    nlp = English()
    nlp.add_pipe("entity_ruler").from_disk(patterns_path)
    nlp.add_pipe("entity_lookup_component", name="entity_lookup", last=True)
    print(nlp.pipe_names)

    nlp.to_disk('./test')
    with open(output_path, "wb") as output_file:
        pickle.dump(nlp, output_file)
Outputting the pipe_names gives me: ['entity_ruler', 'entity_lookup'].
However, when I then try to load the model and test it, by doing:
nlp = spacy.load("en_core_web_lg", disable=["ner"])
nlp.add_pipe("entity_ruler", source=spacy.load("./test"))
it instantly throws the following error:
ValueError: [E002] Can't find factory for 'entity_lookup_component' for language English (en). This usually happens when spaCy calls `nlp.create_pipe` with a custom component name that's not registered on the current language class. If you're using a Transformer, make sure to install 'spacy-transformers'. If you're using a custom component, make sure you've added the decorator `@Language.component` (for function components) or `@Language.factory` (for class components).
This only happens after I added the entity_lookup_component. This component was supposed to use a lookup table, to add some metadata to existing entities.
At the place where you load the model, you need to have access to the code that defined the custom component. So if your file that defines the custom component is custom.py, you can put import custom at the top of the file where you're loading your pipeline and it should work.
Also see the docs on saving and loading custom components.
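To make this concrete, a minimal sketch of such a loading script (the module name custom.py and the saved path ./test follow the example above; everything else is an assumption):

import spacy

import custom  # registers the "entity_lookup_component" factory on import

nlp = spacy.load("./test")
print(nlp.pipe_names)  # expected: ['entity_ruler', 'entity_lookup']
doc = nlp("Just a quick check")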
I am having trouble writing a custom predict method using MLflow and PySpark (2.4.0). What I have so far is a custom transformer that changes the data into the format I need.
from pyspark.ml import Transformer
from pyspark.sql.functions import explode, split


class CustomGroupBy(Transformer):
    def __init__(self):
        super(CustomGroupBy, self).__init__()

    def _transform(self, dataset):
        df = dataset.select("userid", explode(split("widgetid", ',')).alias("widgetid"))
        return df
Then I built a custom estimator to run one of the pyspark machine learning algorithms
from pyspark.ml import Estimator
from pyspark.ml.fpm import FPGrowth
from pyspark.ml.param.shared import HasInputCol
from pyspark.ml.util import DefaultParamsReadable, DefaultParamsWritable


class PipelineFPGrowth(Estimator, HasInputCol, DefaultParamsReadable, DefaultParamsWritable):
    def __init__(self, inputCol=None, minSupport=0.005, minConfidence=0.01):
        super(PipelineFPGrowth, self).__init__()
        self.minSupport = minSupport
        self.minConfidence = minConfidence

    def setInputCol(self, value):
        return self._set(inputCol=value)

    def _fit(self, dataset):
        c = self.getInputCol()
        fpgrowth = FPGrowth(itemsCol=c, minSupport=self.minSupport,
                            minConfidence=self.minConfidence)
        model = fpgrowth.fit(dataset)
        return model
This runs in the MLFlow pipeline.
from pyspark.ml import Pipeline

pipeline = Pipeline(stages=[CustomGroupBy(), PipelineFPGrowth()]).fit(df)
This all works. If I create a new pyspark dataframe with new data to predict on, I get predictions.
newDF = spark.createDataFrame([(123456, ['123ABC', '789JSF'])], ["userid", "widgetid"])
pipeline.stages[1].transform(newDF).show(3, False)

# How to access the frequent itemsets.
pipeline.stages[1].freqItemsets.show(3, False)
Where I run into problems is writing a custom predict method. I need to append the frequent itemsets that FPGrowth generates to the end of the predictions. I have written the logic for that, but I am having a hard time figuring out how to put it into a custom method. I tried adding it to my custom estimator, but that didn't work. Then I wrote a separate class to take in the returned model and produce the extended predictions. That was also unsuccessful.
Eventually I need to log and save the model so I can Dockerize it, which means I will need a custom flavor and will need to use the pyfunc flavor. Does anyone have a hint on how to extend the predict method and then log and save the model?
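For what it's worth, a minimal sketch of the general mlflow.pyfunc pattern for a custom predict (the wrapper name, the collected-itemsets argument, and the post-processing shown here are assumptions, not the asker's actual logic):

import mlflow.pyfunc


class FPGrowthPyfunc(mlflow.pyfunc.PythonModel):
    """Hypothetical pyfunc wrapper with a custom predict method."""

    def __init__(self, itemsets_records):
        # e.g. the frequent itemsets collected from the fitted FPGrowth model,
        # stored as a plain list of dicts so the wrapper stays picklable
        self.itemsets_records = itemsets_records

    def predict(self, context, model_input):
        # model_input arrives as a pandas DataFrame; append the itemsets
        # as an extra column (placeholder for the real post-processing)
        return model_input.assign(
            freq_itemsets=[self.itemsets_records] * len(model_input))


# Logging the wrapper so it can later be served or Dockerized:
# itemsets_records = [row.asDict() for row in pipeline.stages[1].freqItemsets.collect()]
# with mlflow.start_run():
#     mlflow.pyfunc.log_model("model", python_model=FPGrowthPyfunc(itemsets_records))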
After creating a FastText model using Gensim, I want to load it but am running into errors seemingly related to callbacks.
The code used to create the model is
import gensim

TRAIN_EPOCHS = 30
WINDOW = 5
MIN_COUNT = 50
DIMS = 256

vocab_model = gensim.models.FastText(sentences=model_input,
                                     size=DIMS,
                                     window=WINDOW,
                                     iter=TRAIN_EPOCHS,
                                     workers=6,
                                     min_count=MIN_COUNT,
                                     callbacks=[EpochSaver("./ftchkpts/")])
vocab_model.save('ft_256_min_50_model_30eps')
and the callback EpochSaver is defined as
import os

from gensim.models.callbacks import CallbackAny2Vec


class EpochSaver(CallbackAny2Vec):
    '''Callback to save the model after each epoch and show training parameters.'''

    def __init__(self, savedir):
        self.savedir = savedir
        self.epoch = 0
        os.makedirs(self.savedir, exist_ok=True)

    def on_epoch_end(self, model):
        savepath = os.path.join(self.savedir, f"ft256_{self.epoch}e")
        model.save(savepath)
        print(f"Epoch saved: {self.epoch + 1}")
        if os.path.isfile(os.path.join(self.savedir, f"ft256_{self.epoch - 1}e")):
            os.remove(os.path.join(self.savedir, f"ft256_{self.epoch - 1}e"))
            print("Previous model deleted")
        self.epoch += 1
Aside from the type of model, this is identical to my process for Word2Vec, which worked without issue. However, when I open another file and try to load the model with
from gensim.models import FastText
vocab = FastText.load(r'vocab/ft_256_min_50_model_30eps')
I'm greeted with the error
AttributeError: Can't get attribute 'EpochSaver' on <module '__main__'>
What can I do to get the vocabulary to load so I can create the embedding layer for my Keras model? If it's relevant, this is happening in JupyterLab.
This extra difficulty loading models with custom callbacks is a known, open issue (at least through gensim-3.8.1 and October 2019).
You can see discussions of possible workarounds and fixes there – and the gensim team is considering simply disabling the auto-saving of callbacks altogether, requiring them to be re-specified for each later train() or similar call that needs them.
You may be able to load existing models saved with your custom callbacks by importing those same callback classes, under the same names, into the code context where you're doing the load().
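For example, a minimal sketch of that workaround (it assumes EpochSaver has been moved into an importable module named callbacks.py; the module name is an assumption):

from gensim.models import FastText

# Bring the class into the loading context under the same name, so it can be
# resolved when the saved model is unpickled
from callbacks import EpochSaver

vocab = FastText.load(r'vocab/ft_256_min_50_model_30eps')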
You could save callback-free versions of your trained models by blanking the model's callbacks property to its empty default value, just before you save(), e.g.:
model.callbacks = ()
model.save(save_path)
Then you wouldn't need to do any special importing of custom classes before a load(). (Of course, if you again needed callback functionality on the re-loaded model, the callbacks would then have to be explicitly re-established after load().)
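A short sketch of that re-establishing step (the save path, extra training data, and epoch count here are assumptions):

from gensim.models import FastText

model = FastText.load(save_path)
model.callbacks = (EpochSaver("./ftchkpts/"),)  # re-attach the callback explicitly
model.train(more_sentences, total_examples=len(more_sentences), epochs=5)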