AttributeError when trying to deploy streamlit app to Heroku - python

I have a simple streamlit app that includes transforms + an estimator stored as a pickle file for prediction. The app works well when deployed to the local host. When deployed to Heroku, the web layout works, but the prediction generates the error "AttributeError: 'ColumnTransformer' object has no attribute '_feature_names_in'".
I used the requirements.txt below, generated by pipreqs:
numpy==1.17.2
pandas==0.25.1
streamlit==0.67.1
Pillow==7.2.0
scikit_learn==0.23.2
From published answers to similar questions, I gather that this could be due to an incompatibility between scikit-learn versions, but I am not sure how to correct it.
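For reference, one quick way to rule a version mismatch in or out is to print the library versions both locally and on the dyno and compare them against requirements.txt; a minimal sketch (the script name check_env.py is just an example):

# Run this both locally and on the Heroku dyno (e.g. `heroku run python check_env.py`)
# and compare the output with the versions pinned in requirements.txt and the ones
# used when the pickle was created.
import sklearn
import numpy
import pandas

print("scikit-learn:", sklearn.__version__)
print("numpy:", numpy.__version__)
print("pandas:", pandas.__version__)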
Here is the error message from Heroku:
AttributeError: 'ColumnTransformer' object has no attribute '_feature_names_in'
Here is the code for app.py:
import pandas as pd
import numpy as np
import pickle
import streamlit as st
from PIL import Image
#from sklearn.preprocessing import OneHotEncoder
from sklearn.base import BaseEstimator, TransformerMixin
#from sklearn.impute import SimpleImputer
#from sklearn.pipeline import Pipeline
#from sklearn.preprocessing import MinMaxScaler
#from sklearn.compose import ColumnTransformer
import warnings
warnings.filterwarnings('ignore')

acc_ix, wt_ix, hpower_ix, cyl_ix = 4, 3, 2, 0

# custom class inheriting BaseEstimator and TransformerMixin
class CustomAttrAdder(BaseEstimator, TransformerMixin):
    def __init__(self, acc_and_power=True):
        self.acc_and_power = acc_and_power  # new optional variable

    def fit(self, X, y=None):
        return self  # nothing else to do

    def transform(self, X):
        wt_and_cyl = X[:, wt_ix] * X[:, cyl_ix]  # required new variable
        if self.acc_and_power:
            acc_and_power = X[:, acc_ix] * X[:, hpower_ix]
            return np.c_[X, acc_and_power, wt_and_cyl]  # returns a 2D array
        return np.c_[X, wt_and_cyl]

def predict_mpg_web1(config, regressor):
    if type(config) == dict:
        df = pd.DataFrame(config)
    else:
        df = config
    # Note the model is in the form of pipeline_m, including both transforms and the estimator
    # The config is with Origin already in country code
    y_pred = regressor.predict(df)
    return y_pred

# this is the main function in which we define our webpage
def main():
    # giving the webpage a title
    # st.title("MPG Prediction")
    st.write("""
# MPG Prediction App
based on a Random Forest Model built from
"http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data"
""")
    # here we define some of the front-end elements of the web page, such as
    # the font and background color, the padding and the text to be displayed
    html_temp = """
    <div style ="background-color:yellow;padding:13px">
    <h1 style ="color:black;text-align:center;">What is the mpg of my car? </h1>
    </div>
    """
    # this line allows us to display the front-end aspects we have
    # defined in the above code
    st.markdown(html_temp, unsafe_allow_html=True)
    # the following lines create dropdowns and numeric sliders in which the user can enter
    # the data required to make the prediction
    st.sidebar.header('Set My Car Configurations')
    Orig = st.sidebar.selectbox("Select Car Origin", ("India", "USA", "Germany"))
    Cyl = st.sidebar.slider('Cylinders', 3, 6, 8)
    Disp = st.sidebar.slider('Displacement', 68.0, 455.0, 193.0)
    Power = st.sidebar.slider('Horsepower', 46.0, 230.0, 104.0)
    WT = st.sidebar.slider('Weight', 1613.0, 5140.0, 2970.0)
    Acc = st.sidebar.slider('Acceleration', 8.0, 25.0, 15.57)
    MY = st.sidebar.slider('Model_Year', 70, 82, 76)
    image = Image.open('car.jpg')
    st.image(image, caption='MPG Prediction', use_column_width=True)
    st.subheader("Click the 'Predict' button below")
    # loading the saved model
    pickle_in = open('final_model.pkl', 'rb')
    regressor = pickle.load(pickle_in)
    result = ""
    # the lines below ensure that when the button called 'Predict' is clicked,
    # the prediction function defined above is called to make the prediction
    # and store it in the variable result
    # Set up the vehicle configuration
    vehicle = {"Origin": [Orig], "Cylinders": [Cyl], "Displacement": [Disp], "Horsepower": [Power],
               "Weight": [WT], "Acceelation": [Acc], "Model Year": [MY]}
    if st.button("Predict"):
        result = predict_mpg_web1(vehicle, regressor)
        mpg = int(result[0])
        st.success('The prediction is {}'.format(mpg))

if __name__ == '__main__':
    main()

Is it possible that you are trying to call predict using a ColumnTransformer which has not been fitted yet?
The attribute _feature_names_in is set during the fit_transform call. I have the same sklearn version and the attribute is present, so in my opinion it shouldn't be a problem with the version.
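If that is the suspicion, a quick check is to run sklearn's check_is_fitted on the unpickled transformer; a minimal sketch, assuming the pickled object is a Pipeline whose ColumnTransformer step is named 'preprocessor' (the step name is hypothetical, use your own):

# Sketch: verify the unpickled ColumnTransformer is fitted before calling predict.
import pickle
from sklearn.utils.validation import check_is_fitted

with open('final_model.pkl', 'rb') as f:
    pipeline_m = pickle.load(f)

ct = pipeline_m.named_steps['preprocessor']  # hypothetical step name
check_is_fitted(ct)                          # raises NotFittedError if it was never fitted
print(ct._feature_names_in)                  # the attribute the Heroku error says is missing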

I fixed the problem. It turns out that somehow the pickle file for the saved model was corrupted. I regenerated the model and the deployment works.
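For anyone hitting the same thing, a quick round-trip check before deploying can confirm the regenerated pickle is intact; a minimal sketch, where pipeline_m is the fitted transforms + estimator pipeline and X_train is the training DataFrame from my training script (both names assumed):

# Hypothetical round-trip check run in the training environment before pushing to Heroku.
import pickle

with open('final_model.pkl', 'wb') as f:
    pickle.dump(pipeline_m, f)            # pipeline_m: fitted pipeline from training (assumed name)

with open('final_model.pkl', 'rb') as f:
    reloaded = pickle.load(f)

print(reloaded.predict(X_train.head(1)))  # X_train: training DataFrame (assumed name); should not raise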
Thanks to anyone who spent the time reviewing my problem.
Apollo.

Related

Python: PyCUDA ERROR: The context stack was not empty upon module cleanup

I have created a Streamlit app as a demo of a project on Multilingual Text Classification using mBERT in PyTorch. When I run the app with the command python app.py it works fine, but when I run it with streamlit run app.py it throws a PyCUDA error.
Following is the code present in app.py:
import torch
from typing import Text
import streamlit as st
import pandas as pd
from textblob import TextBlob
from inference.inference_onnx import run_onnx_inference
from inference.inference_tensorRT import run_trt_inference
from googletrans import Translator

st.title("LinClass: Multilingual Text Classifier")
input_text = st.text_input('Text:')

####################
# Google Translate API
####################
translator = Translator()
input_text = translator.translate(
    input_text,
    dest="en"
)
input_text = input_text.text

####################
# Select Precision and Inference Method
####################
df = pd.DataFrame()
df["lang"] = ["en"]
precision = st.sidebar.selectbox("Select Precision:",
    ("16 Bit", "32 Bit")
)
inference = st.sidebar.selectbox("Inference Method:",
    ("ONNX", "TensorRT")
)
if st.button('Show Selected Configuration'):
    st.subheader("Selected Configuration:")
    st.write("Precision: ", precision)
    st.write("Inference: ", inference)

st.subheader("Results")

def result(x):
    """
    Function to classify the comment toxicity based on the probability and given threshold
    params: x(float) - Probability of Toxicity
    """
    if x >= 0.4:
        st.write("Toxic")
    else:
        st.write("Non Toxic")

####################
# Implement Selected Configuration
####################
if precision == "16 Bit":
    if inference == "ONNX":
        df["comment_text"] = [input_text]
        predictions = run_onnx_inference(
            onnx_model_path="/workspace/data/multilingual-text-classifier/output models/mBERT_lightning_fp16_2GPU.onnx",
            stage="inference",
            df_test=df
        )
        predictions = torch.sigmoid(torch.tensor(predictions))
        st.write(input_text)
        st.write(predictions)
        result(predictions)
    if inference == "TensorRT":
        df["content"] = [input_text]
        predictions = run_trt_inference(
            trt_model_path="/workspace/data/multilingual-text-classifier/output models/mBERT_lightning_fp16_bs16.engine",
            stage="inference",
            df_test=df
        )
        predictions = predictions.astype("float32")
        predictions = torch.sigmoid(torch.tensor(predictions))
        st.write(input_text)
        st.write(predictions)
        result(predictions)
if precision == "32 Bit":
    if inference == "ONNX":
        df["comment_text"] = [input_text]
        predictions = run_onnx_inference(
            onnx_model_path="/workspace/data/multilingual-text-classifier/output models/mBERT_fp32.onnx",
            stage="inference",
            df_test=df
        )
        predictions = torch.sigmoid(torch.tensor(predictions))
        st.write(input_text)
        st.write(predictions)
        result(predictions)
    if inference == "TensorRT":
        df["content"] = [input_text]
        predictions = run_trt_inference(
            trt_model_path="/workspace/data/multilingual-text-classifier/output models/mBERT_fp32.engine",
            stage="inference",
            df_test=df
        )
        predictions = predictions.astype("float32")
        predictions = torch.sigmoid(torch.tensor(predictions))
        st.write(input_text)
        st.write(predictions)
        result(predictions)

####################
# Take Feedback
####################
st.subheader("Feedback:")
feedback = st.radio(
    "Are you satisfied with the results?",
    ('Yes', 'No'))
st.write("Thanks for the Feedback!")
Error
-------------------------------------------------------------------
PyCUDA ERROR: The context stack was not empty upon module cleanup.
-------------------------------------------------------------------
A context was still active when the context stack was being
cleaned up. At this point in our execution, CUDA may already
have been deinitialized, so there is no way we can finish
cleanly. The program will be aborted now.
Use Context.pop() to avoid this problem.
-------------------------------------------------------------------
Aborted (core dumped)
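The error's own hint (Use Context.pop()) suggests a PyCUDA context is still active when the Streamlit script finishes. Not a guaranteed fix for this code, but a minimal sketch of explicitly creating and popping the context around the TensorRT call, assuming run_trt_inference relies on the caller's current CUDA context rather than creating its own:

# Sketch only: explicitly manage the CUDA context instead of relying on pycuda.autoinit,
# and pop it before the script ends so nothing is left on the context stack.
import pycuda.driver as cuda

cuda.init()
ctx = cuda.Device(0).make_context()     # push a context for this run
try:
    predictions = run_trt_inference(     # assumed to use the current context
        trt_model_path="...",            # path elided
        stage="inference",
        df_test=df
    )
finally:
    ctx.pop()                            # leave the context stack empty on exit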

How to fix: "error": "Prediction failed: unknown error." in custom prediction routine with scikit-learn?

I am trying to write a custom prediction routine on Google's AI Platform using scikit-learn's MLPClassifier. I have packaged and deployed the model successfully, but when I request online predictions via gcloud ai-platform predict, I get the error "error": "Prediction failed: unknown error." I then went to the console to test my model manually in the "Test & Use" section of my model and received the same error.
The training vectors are numpy arrays with 6 elements (e.g. [1,2,3,4,5,6]) and the targets are either 0, 1, or 2.
Here is my preprocess.py code:
import numpy as np

class MySimpleScaler(object):
    def __init__(self):
        self._means = None
        self._stds = None

    def preprocess(self, data):
        if self._means is None:  # during training only
            self._means = np.mean(data, axis=0)
        if self._stds is None:  # during training only
            self._stds = np.std(data, axis=0)
            if not self._stds.all():
                raise ValueError('At least one column has standard deviation of 0.')
        return (data - self._means) / self._stds
Here is my predictor.py code:
import os
import pickle

import numpy as np
from sklearn.externals import joblib
from sklearn.neural_network import MLPClassifier

class MyPredictor(object):
    def __init__(self, model, preprocessor):
        self._model = model
        self._preprocessor = preprocessor
        self._class_names = ["0-6 months", "7-18 months", "18+ months"]

    def predict(self, instances, **kwargs):
        inputs = np.asarray(instances)
        preprocessed_inputs = self._preprocessor.preprocess(inputs)
        if kwargs.get('probabilities'):
            probabilities = self._model.predict_proba(preprocessed_inputs)
            return probabilities.tolist()
        else:
            outputs = self._model.predict(preprocessed_inputs)
            return [self._class_names[class_num] for class_num in outputs]

    @classmethod
    def from_path(cls, model_dir):
        model_path = os.path.join(model_dir, 'model.joblib')
        model = joblib.load(model_path)
        preprocessor_path = os.path.join(model_dir, 'preprocessor.pkl')
        with open(preprocessor_path, 'rb') as f:
            preprocessor = pickle.load(f)
        return cls(model, preprocessor)
Here is the code where I train and export my model:
scaler = MySimpleScaler()
y = data[:, [0]]
features_scaled = scaler.preprocess(data[:, 1:])
scaled_data = np.concatenate((y, features_scaled), 1)  # put the scaled features and the y column back together
X = scaled_data[:, 1:]

clf = MLPClassifier()
clf.fit(X, y)

# export the model
joblib.dump(clf, 'model.joblib')
with open('preprocessor.pkl', 'wb') as f:
    pickle.dump(scaler, f)
setup.py:
from setuptools import setup

setup(
    name='my_custom_code',
    version='0.1',
    include_package_data=True,
    scripts=['predictor.py', 'preprocess.py'])
I have tried serving online predictions with the input.json file looking like this
[1,2,3,4,5,6]
with this command:
gcloud ai-platform predict --version $CORRECT_VERSION --model $CORRECT_MODEL --json-instances input.json
and I get the error above. Can someone please help? I wish Google AI Platform had more informative error messages.
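One way to get a real traceback instead of the opaque message is to exercise the custom predictor locally before deploying; a minimal sketch, where 'model_dir' is a local directory holding model.joblib and preprocessor.pkl (the directory name is assumed):

# Hypothetical local smoke test of the custom prediction routine: any exception raised here
# is the detail that "Prediction failed: unknown error." hides on AI Platform.
from predictor import MyPredictor

predictor = MyPredictor.from_path('model_dir')        # folder containing model.joblib and preprocessor.pkl
print(predictor.predict([[1, 2, 3, 4, 5, 6]]))        # same shape as one training vector
print(predictor.predict([[1, 2, 3, 4, 5, 6]], probabilities=True))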

Keras and TensorFlow (backend error): Tensor conv2d_1_input:0, specified in either feed_devices or fetch_devices was not found in the Graph

I was using Keras and TensorFlow, and I'm completely new to them.
I had trained my model, and when I try to make a prediction I get the error shown below.
This is the code I used for the image prediction:
import numpy as np
from flask import Flask, request, jsonify, render_template
import numpy
from PIL import Image
import os
import tensorflow.keras
from werkzeug.utils import secure_filename
from keras.models import load_model

app = Flask(__name__)
model = load_model('traffic_classifier.h5')
model._make_predict_function()

@app.route('/')
def index():
    # Main page
    return render_template('index.html')

@app.route('/traffic')
def traffic():
    # Main page
    return render_template('traffic.html')

@app.route('/sleep')
def sleep():
    # Main page
    return render_template('sleep.html')

@app.route('/predict', methods=['POST'])
def predict():
    '''
    For rendering results on HTML GUI
    '''
    classes = { 1:'Speed limit (20km/h)',
                2:'Speed limit (30km/h)',
                3:'Speed limit (50km/h)',
                4:'Speed limit (60km/h)',
                5:'Speed limit (70km/h)',
                6:'Speed limit (80km/h)',
                7:'End of speed limit (80km/h)',
                8:'Speed limit (100km/h)',
                9:'Speed limit (120km/h)',
                10:'No passing',
                11:'No passing veh over 3.5 tons',
                12:'Right-of-way at intersection',
                13:'Priority road',
                14:'Yield',
                15:'Stop',
                16:'No vehicles',
                17:'Veh > 3.5 tons prohibited',
                18:'No entry',
                19:'General caution',
                20:'Dangerous curve left',
                21:'Dangerous curve right',
                22:'Double curve',
                23:'Bumpy road',
                24:'Slippery road',
                25:'Road narrows on the right',
                26:'Road work',
                27:'Traffic signals',
                28:'Pedestrians',
                29:'Children crossing',
                30:'Bicycles crossing',
                31:'Beware of ice/snow',
                32:'Wild animals crossing',
                33:'End speed + passing limits',
                34:'Turn right ahead',
                35:'Turn left ahead',
                36:'Ahead only',
                37:'Go straight or right',
                38:'Go straight or left',
                39:'Keep right',
                40:'Keep left',
                41:'Roundabout mandatory',
                42:'End of no passing',
                43:'End no passing veh > 3.5 tons' }
    if request.method == "POST":
        # image = request.form["fileupload"]
        f = request.files['file']
        # Save the file to ./uploads
        basepath = os.path.dirname(__file__)
        file_path = os.path.join(
            basepath, 'uploads', secure_filename(f.filename))
        f.save(file_path)
        image = Image.open(file_path)
        image = image.resize((30, 30))
        image = numpy.expand_dims(image, axis=0)
        image = numpy.array(image)
        pred = model.predict_classes([image])[0]
        sign = classes[pred + 1]
        return render_template('traffic.html', prediction_text='This sign represents {}'.format(sign))

if __name__ == "__main__":
    app.run(debug=True)
I'm getting this error:
tensorflow.python.framework.errors_impl.InvalidArgumentError: Tensor conv2d_1_input:0, specified in either feed_devices or fetch_devices was not found in the Graph
What should I do about it?
I solved it by adding the following code:
config = tensorflow.ConfigProto(
    device_count={'GPU': 1},
    intra_op_parallelism_threads=1,
    allow_soft_placement=True
)
config.gpu_options.allow_growth = True
config.gpu_options.per_process_gpu_memory_fraction = 0.6
session = tensorflow.Session(config=config)
keras.backend.set_session(session)
model = load_model('traffic_classifier.h5')
model._make_predict_function()
The problem is that Flask uses threads: it creates a new thread for each request, so the model loaded at startup is not visible from the request's thread.
To solve this problem, you need to make the model part of a global session and graph that are used throughout.
The solution, taken from the discussion of this bug, is as follows:
from tensorflow.python.keras.backend import set_session
from tensorflow.python.keras.models import load_model
tf_config = some_custom_config
sess = tf.Session(config=tf_config)
graph = tf.get_default_graph()
# IMPORTANT: models have to be loaded AFTER SETTING THE SESSION for keras!
# Otherwise, their weights will be unavailable in the threads after the session there has been set
set_session(sess)
model = load_model(...)
then, inside your method:
def predict():
    ...
    global sess
    global graph
    with graph.as_default():
        set_session(sess)
        pred = model.predict_classes(...)
    ...

Create a temporary keras session?

I have a bunch of models floating around: I clone them, cross-validate them, do hyperparameter selection, and what have you. As such, my Keras global session can get quite mucked up. The solution suggested in various threads is to call .clear_session(), but this throws away any models that I want to keep. One option is to train all of my models in a multiprocessing thread. However, it would be convenient to just instantiate a new session for each model, as one might do with TensorFlow:
def score_model(**hyperparameters):
    with tf.Graph().as_default():
        my_model = build_model(**hyperparameters)
        with tf.Session() as sess:
            my_model.train(X, y)
            score = my_model.score()
    # now it's all gone, I have the score, so I don't need the model anymore
    # the rest of my_model should get garbage collected, hooray!
    return score
Can I do this sort of thing with keras?
UPDATE
The sess.as_default() method is crashing my kernel. My memory does not seem to be running low, and it gives no error whatsoever. In the following loop I can't even make it to i=2 before crashing.
from sklearn.datasets import load_iris
import numpy as np
import sklearn
import keras
import keras.wrappers.scikit_learn
import tensorflow as tf
import keras.models
import os

def sessioned(f):
    def sessioned_f(self, *args, **kwargs):
        if not hasattr(self, "sess"):
            self.sess = tf.Session()
        with self.sess.as_default():
            result = f(self, *args, **kwargs)
        return result
    return sessioned_f

class LogisticRegression(keras.wrappers.scikit_learn.KerasClassifier):
    def __init__(self, n_epochs=100, **kwargs):
        self.n_epochs = n_epochs
        super().__init__(**kwargs)

    @sessioned
    def fit(self, X, y, **kwargs):
        # get the shape of X and one-hot y
        self.input_shape = X.shape[-1]
        self.label_encoder = sklearn.preprocessing.LabelEncoder()
        self.label_encoder.fit(y)
        self.output_shape = len(self.label_encoder.classes_)
        label_encoded = self.label_encoder.transform(y).reshape((-1, 1))
        y_onehot = sklearn.preprocessing.OneHotEncoder().fit_transform(label_encoded).toarray()
        super().fit(X, y_onehot, epochs=self.n_epochs, verbose=1, **kwargs)
        return self

    @sessioned
    def predict_proba(self, X):
        return super().predict_proba(X)

    def check_params(self, params):
        # skip parameter checking
        pass

    @sessioned
    def __call__(self):  # the build_fn thing
        # create model
        model = keras.models.Sequential()
        model.add(keras.layers.Dense(self.output_shape, input_dim=self.input_shape, kernel_initializer="normal", activation="softmax"))
        # Compile model
        model.compile(loss='categorical_crossentropy', optimizer='adam')
        return model

data = load_iris()
i = 0
while True:
    print(i)
    graph = tf.Graph()
    with graph.as_default():
        model = LogisticRegression()
        model.fit(data.data, data.target)
        model.sess.close()
    del model
    i += 1
    del graph
You can use Keras exactly as you described, except that instead of running TensorFlow code inside the with statements you run the Keras code.
To set the session you would use:
with sess.as_default():
Here is a link with more information:
https://blog.keras.io/keras-as-a-simplified-interface-to-tensorflow-tutorial.html
I have also found it helpful to look at the source code inside keras.backend. If you look at get_session() you can see that Keras first checks whether there is a TensorFlow default session. Otherwise it uses the session set via set_session(). Finally, if no session has been set, it creates one.
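A minimal sketch of that pattern, assuming TF 1.x with standalone Keras and a hypothetical build_model helper that returns a compiled Keras model; each model gets its own Graph and Session so everything can be discarded once the score is computed:

# Sketch only: score a model inside its own graph/session, then throw it all away.
import tensorflow as tf
import keras.backend as K

def score_model(build_model, X, y, **hyperparameters):
    graph = tf.Graph()
    with graph.as_default():
        sess = tf.Session(graph=graph)
        with sess.as_default():
            K.set_session(sess)                      # point Keras at this session
            model = build_model(**hyperparameters)   # hypothetical builder (not from the question)
            model.fit(X, y, epochs=10, verbose=0)
            score = model.evaluate(X, y, verbose=0)
        sess.close()                                 # graph, session and model can now be collected
    return score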

Sklearn classifier and flask issues

I have been trying to self-host, with Apache, an sklearn classifier that I put together, and I ended up using joblib to serialize the saved model and then load it in a Flask app. Now, this app worked perfectly when running Flask's built-in development server, but when I set it up with a Debian 9 Apache server, I get a 500 error. Delving into Apache's error.log, I get:
AttributeError: module '__main__' has no attribute 'tokenize'
Now, this is funny to me because while I did write my own tokenizer, the web app gave me no problems when I was running it locally. Furthermore, the saved model that I used was trained on the webserver, so slightly different library versions should not be a problem.
My code for the web app is:
import re
import sys

from flask import Flask, request, render_template
from nltk import word_tokenize
from nltk.stem.wordnet import WordNetLemmatizer
from sklearn.externals import joblib

app = Flask(__name__)

def tokenize(text):
    # text = text.translate(str.maketrans('','',string.punctuation))
    text = re.sub(r'\W+', ' ', text)
    tokens = word_tokenize(text)
    lemas = []
    for item in tokens:
        lemas.append(WordNetLemmatizer().lemmatize(item))
    return lemas

@app.route('/')
def home():
    return render_template('home.html')

@app.route('/analyze', methods=['POST', 'GET'])
def analyze():
    if request.method == 'POST':
        result = request.form
        input_text = result['input_text']
        clf = joblib.load("model.pkl.z")
        parameters = clf.named_steps['clf'].get_params()
        predicted = clf.predict([input_text])
        # print(predicted)
        certainty = clf.decision_function([input_text])
        # Is it bonkers?
        if predicted[0]:
            verdict = "Not too nuts!"
        else:
            verdict = "Bonkers!"
        return render_template('result.html', prediction=[input_text, verdict, float(certainty), parameters])

if __name__ == '__main__':
    # app.debug = True
    app.run()
With the .wsgi file being:
import sys
sys.path.append('/var/www/mysite')
from conspiracydetector import app as application
Furthermore, I trained the model with this code:
import logging
import pprint  # Pretty stuff
import re
import sys  # For command line arguments
from time import time  # to show progress

import numpy as np
from nltk import word_tokenize
from nltk.stem.wordnet import WordNetLemmatizer
from sklearn import metrics
from sklearn.datasets import load_files
from sklearn.externals import joblib  # In order to save
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Tokenizer that does stemming and strips punctuation
def tokenize(text):
    # text = text.translate(str.maketrans('','',string.punctuation))
    text = re.sub(r'\W+', ' ', text)
    tokens = word_tokenize(text)
    lemas = []
    for item in tokens:
        lemas.append(WordNetLemmatizer().lemmatize(item))
    return lemas

if __name__ == "__main__":
    # NOTE: we put the following in a 'if __name__ == "__main__"' protected
    # block to be able to use a multi-core grid search that also works under
    # Windows, see: http://docs.python.org/library/multiprocessing.html#windows
    # The multiprocessing module is used as the backend of joblib.Parallel
    # that is used when n_jobs != 1 in GridSearchCV

    # Display progress logs on stdout
    print("Initializing...")

    # Command line arguments
    save = sys.argv[1]
    training_directory = sys.argv[2]

    logging.basicConfig(level=logging.INFO,
                        format='%(asctime)s %(levelname)s %(message)s')

    dataset = load_files(training_directory, shuffle=False)
    print("n_samples: %d" % len(dataset.data))

    # split the dataset in training and test set:
    print("Splitting the dataset in training and test set...")
    docs_train, docs_test, y_train, y_test = train_test_split(
        dataset.data, dataset.target, test_size=0.25, random_state=None)

    # Build a vectorizer / classifier pipeline that filters out tokens
    # that are too rare or too frequent
    # Also remove stop words
    print("Loading list of stop words...")
    with open('stopwords.txt', 'r') as f:
        words = [line.strip() for line in f]
    print("Stop words list loaded...")

    print("Setting up pipeline...")
    pipeline = Pipeline(
        [
            # ('vect', TfidfVectorizer(stop_words=words, min_df=0.001, max_df=0.5, ngram_range=(1,1))),
            ('vect',
             TfidfVectorizer(tokenizer=tokenize, stop_words=words, min_df=0.001, max_df=0.5, ngram_range=(1, 1))),
            ('clf', LinearSVC(C=5000)),
        ])
    print("Pipeline:", [name for name, _ in pipeline.steps])

    # Build a grid search to find out whether unigrams or bigrams are
    # more useful.
    # Fit the pipeline on the training set using grid search for the parameters
    print("Initializing grid search...")

    # uncommenting more parameters will give better exploring power but will
    # increase processing time in a combinatorial way
    parameters = {
        # 'vect__ngram_range': [(1, 1), (1, 2)],
        # 'vect__min_df': (0.0005, 0.001),
        # 'vect__max_df': (0.25, 0.5),
        # 'clf__C': (10, 15, 20),
    }
    print("Parameters:")
    pprint.pprint(parameters)
    grid_search = GridSearchCV(
        pipeline,
        parameters,
        n_jobs=-1,
        verbose=True)
    print("Training and performing grid search...\n")
    t0 = time()
    grid_search.fit(docs_train, y_train)
    print("\nDone in %0.3fs!\n" % (time() - t0))

    # Print the mean and std for each candidate along with the parameter
    # settings for all the candidates explored by grid search.
    n_candidates = len(grid_search.cv_results_['params'])
    for i in range(n_candidates):
        print(i, 'params - %s; mean - %0.2f; std - %0.2f'
              % (grid_search.cv_results_['params'][i],
                 grid_search.cv_results_['mean_test_score'][i],
                 grid_search.cv_results_['std_test_score'][i]))

    # Predict the outcome on the testing set and store it in a variable
    # named y_predicted
    print("\nRunning against testing set...\n")
    y_predicted = grid_search.predict(docs_test)

    # Save model
    print("\nSaving model to", save, "...")
    joblib.dump(grid_search.best_estimator_, save)
    print("Model Saved! \nPrepare for some awesome stats!")
I must confess that I am pretty stumped, and after tinkering around, searching, and making sure that my server is configured correctly, I felt that perhaps someone here might be able to help.
Any help is appreciated, and if there is any more information that I need to provide, please let me know and I will be happy to.
Also, I am running Python 3.5.3 with nltk and sklearn.
I solved this problem, although imperfectly, by removing my custom tokenizer and falling back on one of sklearn's.
However, I am still in the dark on how to integrate my own tokenizer.
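For the record, the usual fix for "AttributeError: module '__main__' has no attribute 'tokenize'" is to define the tokenizer in its own importable module rather than in the training script, so that the pickled pipeline references the module's tokenize instead of __main__.tokenize. A minimal sketch, with the module name my_tokenizer being hypothetical:

# my_tokenizer.py (hypothetical module name) -- shared by the training script and the Flask app
import re
from nltk import word_tokenize
from nltk.stem.wordnet import WordNetLemmatizer

def tokenize(text):
    text = re.sub(r'\W+', ' ', text)
    return [WordNetLemmatizer().lemmatize(t) for t in word_tokenize(text)]

# In both the training script and the Flask app, import it from the module so
# joblib records 'my_tokenizer.tokenize' instead of '__main__.tokenize':
# from my_tokenizer import tokenize
# ...
# TfidfVectorizer(tokenizer=tokenize, ...)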
