How do real-world machine learning production systems run? - python

Dear Machine Learning/AI Community,
I am just a budding and aspiring machine learner who has worked on open online data sets and some POCs built locally for my projects. I have built some models and converted them into pickle objects in order to avoid re-training.
And this question always puzzles me: how does a real production system for ML algorithms work?
Say I have trained my ML algorithm on some millions of rows of data and I want to move it to a production system or host it on a server. In the real world, do they convert it into a pickle object? If so, it would be a huge pickled file, wouldn't it? The one I trained locally on just 50,000 rows took 300 MB of disk space as a pickled object, so I don't think this is the right approach.
So how does it work, so that my ML algorithm does not need to re-train and can start predicting on incoming data? And how do we actually make an ML algorithm a continuous online learner? For example, say I built an image classifier and it starts predicting on incoming images. I then want to train the algorithm again by adding the incoming online images to my previously trained data set. Maybe not for every data point, but once a day I want to combine all the data received that day and re-train with the, say, 100 new images for which my previously trained classifier made predictions and the actual values are now known. And this approach shouldn't force my previously trained algorithm to stop predicting on incoming data, since the re-training may take time depending on the computational resources and the data.
I have Googled and read many articles, but couldn't find or understand an answer to the above questions, and this puzzles me every day. Is manual intervention needed for production systems as well, or is there an automated approach for it?
Any leads or answers to the above questions would be highly helpful and appreciated. Please let me know if my questions don't make sense or are not understandable.
I am not looking for anything project-specific, just a generic example of how real-world production ML systems work.
Thank you in advance!

Note that this is very broadly formulated, and your question should probably be put on hold, but I will try to give a brief summary of what you are asking:
"How does a real production system work?"
Well, it always depends on the scale of your product and on how you are using ML/AI in your system. For the most part, you would deploy a model on your server or in your app. Note that the deployed model does NOT scale linearly with the amount of training data you have; rather, the size of your network is determined by the number of activations in your network. Note also that, after training, you might not even need as much storage space, since for example CNNs have a very limited number of connections, while retaining a much larger number during training. I can highly recommend Roger Grosse's slides on the size of a network. This also directly relates to the second point.
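Before moving on: as a minimal sketch of what "deploying a model on your server" can look like (this is only one of many options; joblib and Flask are arbitrary choices here, and the file name model.joblib and the feature layout are made-up assumptions):

import joblib
from flask import Flask, request, jsonify

app = Flask(__name__)
model = joblib.load("model.joblib")  # trained and saved offline, loaded once at startup

@app.route("/predict", methods=["POST"])
def predict():
    # expects JSON like {"features": [[5.1, 3.5, 1.4, 0.2]]}
    features = request.get_json()["features"]
    prediction = model.predict(features).tolist()
    return jsonify({"prediction": prediction})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)

The point is that only the serialized model, not the training data, ships to the server, and the server only ever runs inference.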
"How to avoid re-training?"
As far as I am aware, most systems will not be retrained on a regular basis, at least at the smaller scale. This means that a network will mostly run in inference mode only, which has the aforementioned benefit regarding the size of the network (and the time it takes to compute a result). Then again, this also highly depends on the specific task for which you are deploying an ML model. Image classification on "standard categories" has the benefit that quite substantial models are already available (AlexNet, Inception, ResNet, ...), whereas a model for machine translation mostly depends on your specific domain and vocabulary.
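To illustrate what "inference mode only" means for image classification on standard categories, here is a rough sketch using a pre-trained ResNet from torchvision (purely illustrative; the library choice and the file name example.jpg are assumptions, not part of the original answer):

import torch
from torchvision import models, transforms
from PIL import Image

# Load a model that was already trained on ImageNet; no training happens here.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

img = Image.open("example.jpg").convert("RGB")
batch = preprocess(img).unsqueeze(0)

with torch.no_grad():  # inference only: no gradients, much lower memory use
    logits = model(batch)
print(logits.argmax(dim=1))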
"How would I go about re-training?"
This is actually the tricky part, behind which sits a significant field called "bandit learning". The problem is that most of your incoming "new" data will be unlabeled, i.e. it cannot be used directly in a new training phase. Instead, you rely on feedback from users to give you a sense of what was wrong or right. Then again, not every user gives the same rating for the same machine translation (or the same recommendations on Amazon, etc.), so judging whether your system is "right" or "wrong" becomes very hard.
There are obviously quite a few methods to automate labeling (e.g. nearest-neighbour search for images, or other similarity-based searches). Online learning therefore only works if you have this continuous loop of feedback and retraining.
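As a rough, generic illustration of folding newly labelled feedback into an existing model without retraining from scratch (a sketch only, using scikit-learn's partial_fit; the data shapes and the daily batch of 100 examples mirror the question and are otherwise made up):

import numpy as np
from sklearn.linear_model import SGDClassifier

# Initial offline training on the historical data set (random stand-in data).
X_hist = np.random.rand(50_000, 20)
y_hist = np.random.randint(0, 2, 50_000)
clf = SGDClassifier()
clf.partial_fit(X_hist, y_hist, classes=np.array([0, 1]))

# Later, e.g. once a day: fold in the newly labelled examples collected that day.
X_new = np.random.rand(100, 20)
y_new = np.random.randint(0, 2, 100)
clf.partial_fit(X_new, y_new)

A separate copy of the model can keep serving predictions while the update runs, with the refreshed model swapped in afterwards.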
For larger scale systems, it also becomes important to scale your models, to perform the desired amount of predictions/classifications per second. This is also mentioned in the link to the TensorFlow deployment page I provided, and mainly builds on top of cloud/distributed architectures, such as Hadoop or (more recently) Kubernetes. Then again, for smaller products this is mostly overkill, but serves the purpose of delivering enough resources at any arbitrary scale (and possibly on demand).
As for the integration cycle of machine learning models, there is a nice overview in this article. I want to conclude by stressing that this is a heavily opinionated question, so every answer might be different!

Related

Data testing framework for data streaming (deequ vs Great Expectations)

I want to introduce data quality testing (empty fields/max-min values/regex/etc.) into my pipeline, which will essentially consume Kafka topics and test the data before it is logged into the DB.
I am having a hard time choosing between the Deequ and Great Expectations frameworks. Deequ lacks clear documentation but has "anomaly detection", which can compare previous scans to current ones. Great Expectations has very nice and clear documentation and thus less overhead. I think neither of these frameworks is made for data streaming specifically.
Can anyone offer some advice/other framework suggestions?
As Philipp observed, in most cases batches of some sort are a good way to apply tests to streaming data (even Spark Streaming is effectively using a "mini-batch" system).
That said: if you need to use a streaming algorithm to compute a metric required for your validation (e.g. to maintain running counts over observed data), it is possible to decompose your target metric into a "state" and "update" portion, which can be properties of the "last" and "current" batches (even if those are only one record each). Improved support for that kind of cross-batch metric is actually the area we're most actively working on in Great Expectations now!
In that way, I think of the concept of the Batch as being baked deeply into the core concept of what gets validated, but also as sufficiently flexible to work in a streaming system.
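As a plain-Python illustration of that state/update decomposition (this is a generic sketch, not the Great Expectations API), a running-mean metric can be carried across arbitrarily small batches like this:

# "state" carried across batches: (count, total); "update" folds in the current batch.
def update_state(state, batch):
    count, total = state
    return count + len(batch), total + sum(batch)

state = (0, 0.0)
for batch in [[1.0, 2.0], [3.0], [4.0, 5.0]]:  # each batch could even be a single record
    state = update_state(state, batch)

count, total = state
print("running mean:", total / count)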
Disclaimer: I am one of the authors of Great Expectations. (Stack Overflow alerts! :))
You can mini-batch your data and apply data quality verification to each of these batches individually. Moreover, deequ allows for stateful computation of data quality metrics where, as James already pointed out, metrics are computed on two partitions of data and then merged. You can find deequ examples of this here.
Is there a specific example that was not covered in deequ's documentation? You can find a basic example of running deequ against a Spark Dataframe here. Also, there are more examples in the same folder, for example for anomaly detection use-cases.
Disclaimer: I am one of the authors of deequ.

Linear Regression Model that improves as the user selects and trains data

I'm developing a script that detects peaks in signal data from a biological source. I want to create a semi-automated model that helps predict which peaks are the correct ones. This script improves as the user manually selects a few of these peaks to help teach the model which ones are correct.
The workflow I'm trying to attain is this:
1. User manually selects data
2. Script obtains the correct data and fits it into the model
3. Use the model to predict the likelihood that a given peak is correct.
4. Hopefully with enough data and training, it could be automated to run through the rest.
I also don't know the name of the general topic and I'm struggling to find what to google.
I've tried to fit a linear regression model in scikit-learn, but I don't have enough data to start with (since the model learns from the user's first interventions). Is what I'm doing possible?
Sorry for the general-ness of this answer but the OP asked for general topics.
It sounds like semi-supervised learning may work; see here for scikit-learn and here for more details.
There is no labeled data to start with. A manual process is started to gain some labeled data. Soon, semi-supervised learning can kick in and take over, with a process measuring its accuracy. That is a match to your situation and a good place to start.
Eventually you may have "enough" correctly labeled data that you can investigate fitting a classic algorithm to predict the remainder. "Enough" being relative to how hard the problem is. Could be tens, hundreds, thousands, ...
Depending on other details of your situation, Reinforcement learning may work. As you described the situation, this may not work but there may be other details in your environment to leverage this family.
Word of warning - machine learning, and semi-supervised learning in particular, may not always work well for every problem. Measure, measure, measure.
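As a minimal sketch of the semi-supervised idea described above (using scikit-learn's LabelSpreading; the peak features and labels are made-up placeholders, with -1 marking peaks the user has not reviewed yet):

import numpy as np
from sklearn.semi_supervised import LabelSpreading

# Hypothetical per-peak features, e.g. [height, width].
X = np.array([[10.0, 1.2], [9.5, 1.0], [0.5, 3.0],
              [0.4, 2.8], [9.8, 1.1], [0.6, 3.2]])

# Labels supplied by the user so far; -1 means "not yet reviewed".
y = np.array([1, 1, 0, -1, -1, -1])

model = LabelSpreading(kernel="knn", n_neighbors=2)
model.fit(X, y)

# Estimated probability that each unreviewed peak is a correct peak.
print(model.label_distributions_[y == -1])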
Thank you everyone for all your help. I was talking to a colleague and he referred me to online machine learning. I think this is what I was looking for. Although I would not be handling time-series data or data streamed from an online source, I think the method is sufficient for my needs. It allows the model to be trained on examples one at a time rather than as a batch. I think scikit-learn currently does not offer full out-of-the-box online machine learning.
This, I think, gives a great rundown of the strengths of online machine learning (also showcasing the creme Python library).
Thanks again!

Deep Q learning Replay method Memory Vanishing

In the Q-learning algorithm used in reinforcement learning with experience replay, one uses a data structure that stores previous experiences for training (a basic example would be a tuple in Python). For a complex state space, I would need to train the agent in a very large number of different situations to obtain a NN that correctly approximates the Q-values. The experience data will occupy more and more memory, so I should impose an upper limit on the number of experiences to be stored, after which the computer should drop experiences from memory.
Do you think FIFO (first in, first out) would be a good way of managing which experiences are dropped from the agent's memory (that way, after reaching the memory limit, I would discard the oldest experiences, which may help the agent adapt more quickly to changes in the environment)? How could I compute a good maximum number of experiences to keep in memory so that Q-learning on the agent's NN converges towards the Q-function approximator I need? (I know this could be done empirically; I would like to know whether an analytical estimator for this limit exists.)
In the seminal paper on deep reinforcement learning, DeepMind achieved their results by randomly sampling which stored experiences to train on; older experiences were eventually dropped.
It's hard to say how a FIFO approach would affect your results without knowing more about the problem you're trying to solve. As dblclik points out, this may cause your learning agent to overfit. That said, it's worth trying. There very well may be a case where using FIFO to saturate the experience replay would result in an accelerated rate of learning. I would try both approaches and see if your agent reaches convergence more quickly with one.
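For concreteness, here is a minimal sketch of a FIFO replay memory with uniform random sampling, built on Python's collections.deque (the capacity value is a placeholder you would tune to your available memory):

import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        # deque with maxlen drops the oldest experience automatically (FIFO).
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform random minibatch for the Q-network update.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)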

Using Attention-OCR model (tensorflow/research) for extracting specific information from scanned documents

I have a few questions regarding the Attention-OCR model described in this paper: https://arxiv.org/pdf/1704.03549.pdf
Some context
My goal is to let Attention-OCR learn where to look for, and how to read, a specific piece of information in a scanned document. It should find a 10-digit number that is (in most cases) preceded by a descriptive label. The layout and type of the documents vary, so I concluded that without an attention mechanism the task is unsolvable due to the variable position...
My first question is: am I interpreting the model's capabilities right? Can it actually solve my problem? (1)
Progress so far
I managed to run training on my own dataset of about 200k images of size 736x736 (pretty big, though the quality isn't that high, and scaling them down further would make the text unrecognizable). Unfortunately, the machine I have at my disposal has only one GPU (an Nvidia Quadro M4000), and time is a significant constraint. I need a proof of concept soon, so I figured I could try to overtrain the model on a significantly smaller dataset, just to see whether it is able to learn.
I managed to overtrain it with 5k images - it predicts every image successfully. But I have some concerns regarding my interpretation of this result. It seems like the model hasn't actually learned where to look for the desired information, but has just memorized all of the strings, regardless of whether they are actually written somewhere in the document or not. I mean, it's not very surprising that the model memorized it all, but my question is: how many images are needed before the model starts generalizing and actually learning the attention? (2)
Spatial attention
Another thing I'd like to address is the spatial attention mechanism. In the early stages of implementing the model, I assumed that the spatial attention mechanism described in the paper was already included and working. Some time ago I stumbled upon an issue in the TensorFlow repository created by Alexander Gorban (one of the developers of Attention-OCR), where he stated that it is disabled by default.
So I turned it back on and realized that the memory usage became unbelievably high. The spatial dimensions of the tensor containing the encoded coordinates changed from
[batch_size, width, height, features]
to
[batch_size, width, height, features+width+height]
That caused the memory consumption to jump by a factor of ~10 (taking the size of the images into account), which I can't afford. This leads to my third question: is the spatial attention mechanism necessary for my task? (3)
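To make the channel growth concrete, here is a rough NumPy sketch of a one-hot coordinate encoding (the shapes are made-up placeholders; it only illustrates why the channel count grows from features to features+width+height):

import numpy as np

batch_size, width, height, features = 4, 92, 92, 512
f = np.zeros((batch_size, width, height, features), dtype=np.float32)

# One-hot encode the x and y coordinate of every spatial position.
x_onehot = np.eye(width, dtype=np.float32)[None, :, None, :].repeat(batch_size, 0).repeat(height, 2)
y_onehot = np.eye(height, dtype=np.float32)[None, None, :, :].repeat(batch_size, 0).repeat(width, 1)

# Concatenating along the channel axis grows the tensor accordingly.
f_with_coords = np.concatenate([f, x_onehot, y_onehot], axis=-1)
print(f.shape, "->", f_with_coords.shape)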
Bonus question
Is it possible to visualize the saliency and attention maps with coordinate encoding disabled?

Detecting 'unusual behavior' using machine learning with CouchDB and Python?

I am collecting a lot of really interesting data points as users come to my Python web service. For example, I have their current city, state, country, user-agent, etc. What I'd like to be able to do is run these through some type of machine learning system / algorithm (maybe a Bayesian classifier?), with the eventual goal of getting e-mail notifications when something out-of-the-ordinary occurs (anomaly detection). For example, Jane Doe has only ever logged in from USA on Chrome. So if she suddenly logs into my web service from the Ukraine on Firefox, I want to see that as a highly 'unusual' event and fire off a notification.
I am using CouchDB (specifically with Cloudant) already, and I see people often saying here and there online that Cloudant / CouchDB is perfect for this sort of thing (big data analysis). However I am at a complete loss for where to start. I have not found much in terms of documentation regarding relatively simple tracking of outlying events for a web service, let alone storing previously 'learned' data using CouchDB. I see several dedicated systems for doing this type of data crunching (PredictionIO comes to mind), but I can't help but feel that they are overkill given the nature of CouchDB in the first place.
Any insight would be much appreciated. Thanks!
You're correct in assuming that this is a problem ideally suited to machine learning, and scikit-learn is my preferred library for these types of problems. Don't worry about the specifics (CouchDB, Cloudant) for now; let's get your problem into a state where it can be solved.
If we can assume that variations in log-in details (time, location, user-agent, etc.) for a given user are low, then any large variation from this would trigger your alert. This is where the 'outlier' detection that @Robert McGibbon suggested comes into play.
For example, squeeze each log-in detail into one dimension, and then create a log-in detail vector for each user (there is significant room for improving this digest of log-in information):
log-in time (modulo 24 hrs)
location (maybe an array of integer locations, each integer representing a different country)
user-agent (a similar array of integer user-agents)
and so on. Every time a user logs in, create this detail array and store it. Once you have accumulated a large set of test data you can try running some ML routines.
So, we have a user and a set of log-in data corresponding to successful log-ins (a training set). We can now train a Support Vector Machine to recognise this user's log-in pattern:
from sklearn import svm
# training data [[11.0, 2, 2], [11.3, 2, 2] ... etc]
train_data = my_training_data()
# create and fit the model
clf = svm.OneClassSVM()
clf.fit(train_data)
and then, every time a new log-in event occurs, create a single log-in detail array and pass it to the SVM
# log_in_data is the detail array for the new log-in attempt
if clf.predict([log_in_data])[0] < 0:
    fire_alert_event()
else:
    # log-in is not dissimilar to previous attempts
    print('log in ok')
If the SVM finds the new data point to be significantly different from its training set, then it will fire the alarm.
My Two Pence. Once you've got hold of a good training set, there are many more ML techniques that may be better suited to your task (they may be faster, more accurate etc) but creating your training sets and then training the routines would be the most significant challenge.
There are many exciting things to try! If you know you have bad log-in attempts, you can add these to the training set by using a more complex SVM which you train with both good and bad log-ins. Instead of using an array of disparate 'location' values, you could compute the Euclidean distance between different log-ins and use that! This sounds like great fun, good luck!
I also thought the approach using svm.OneClassSVM from sklearn was going to produce a good outlier detector. However, I put together some representative data based upon the example in the question and it simply could not detect an outlier. I swept the nu and gamma parameters from .01 to .99 and found no satisfactory SVM predictor.
My theory is that because the samples have categorical data (cities, states, countries, web browsers) the SVM algorithm is not the right approach. (I did, BTW, first convert the data into binary feature vectors with the DictVectorizer.fit_transform method).
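For reference, the kind of conversion mentioned above might look roughly like this (a sketch only; the log-in records are made-up examples in the spirit of the question):

from sklearn.feature_extraction import DictVectorizer

logins = [
    {"country": "USA", "browser": "Chrome"},
    {"country": "USA", "browser": "Chrome"},
    {"country": "Ukraine", "browser": "Firefox"},
]

vec = DictVectorizer(sparse=False)
X = vec.fit_transform(logins)   # one binary column per (field, value) pair
print(vec.get_feature_names_out())
print(X)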
I believe @sullivanmatt is on the right track when he suggests using a Bayesian classifier. Bayesian classifiers are used for supervised learning but, at least on the surface, this problem was cast as an unsupervised learning problem, i.e. we don't know a priori which observations are normal and which are outliers.
Because the outliers you want to detect are very rare in the stream of web site visits, I believe you could train the Bayesian classifier by labeling every observation in your training set as a positive/normal observation. The classifier should predict that true normal observations have higher probability simply because the majority of the observations really are normal. A true outlier should stand out as receiving a low predicted probability.
If you're trying to investigate anomalies in user behaviour over time, I'd recommend you look at time-series anomaly detectors. With this approach you'll be able to statistically/automatically figure out new, potentially suspicious, emerging patterns and abnormal events.
http://www.autonlab.org/tutorials/biosurv.html and http://web.engr.oregonstate.edu/~wong/workshops/icml2006/slides/agarwal.ppt
explain some techniques based on machine learning. In this case you can use scikit-learn, a very powerful Python library that contains tons of ML algorithms.
