Google Cloud - Compute Engine VS Machine Learning - python

Does anyone know the difference between using Google Cloud Machine Learning compared to a virtual machine instance on Google Compute Engine?
I am using Keras with Python 3, and Cloud ML feels more restrictive (Python 2.7, an older version of TensorFlow, having to follow the given structure...). I guess there are benefits to using Cloud ML over a VM in GCE, but I would like to know what they are.

Google Cloud ML is a fully managed service whereas Google Compute Engine is not (the latter is IaaS).
Assuming that you just want to know some differences for the case where you have your own model, here are a few:
The most noticeable feature of Cloud ML is the deployment itself. You don't have to take care of things like setting up your cluster (that is, scaling), launching it, installing the packages and deploying your model for training. This is all done automatically; in Compute Engine you would have to do it yourself, although you would be unrestricted in what you can install.
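For a sense of what that managed deployment replaces, here is a rough sketch of submitting a training job to Cloud ML through the Python API client; the project, bucket and module names are placeholders, and the trainer package is assumed to already be staged on GCS (the gcloud CLI wraps the same call).

```python
from googleapiclient import discovery

# Placeholder project, bucket and module names; the trainer package must
# already exist on GCS for the job to start.
ml = discovery.build('ml', 'v1')
job_spec = {
    'jobId': 'keras_training_001',
    'trainingInput': {
        'scaleTier': 'BASIC_GPU',
        'packageUris': ['gs://my-bucket/packages/trainer-0.1.tar.gz'],
        'pythonModule': 'trainer.task',
        'region': 'us-central1',
    },
}
request = ml.projects().jobs().create(parent='projects/my-project', body=job_spec)
print(request.execute())
```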
Although you can automate more or less all of that deployment yourself, there is no magic to it. In fact, you can see in the logs of a Cloud ML training job that it is quite rudimentary, in the sense that a cluster of instances is launched and thereafter TensorFlow is installed and your model is run with the options you set. This is because TensorFlow is a framework decoupled from Google's systems.
However, there is a substantial difference between Cloud ML and Compute Engine when it comes to prediction, and I would say that is mostly what you pay for with Cloud ML. You can have a model deployed in Cloud ML for online and batch prediction pretty much out of the box. In Compute Engine, you would have to take care of all the quirks of TensorFlow Serving, which are not that trivial (compared to training your model).
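To give an idea of the "out of the box" part: once a model version is deployed, online prediction is roughly one API call. In this sketch the project and model names are placeholders, and the instance format must match whatever serving signature your model exports.

```python
from googleapiclient import discovery

# Placeholder project/model names; 'instances' must match the model's serving input.
ml = discovery.build('ml', 'v1')
response = ml.projects().predict(
    name='projects/my-project/models/my_keras_model',
    body={'instances': [{'input': [0.1, 0.2, 0.3]}]},
).execute()
print(response.get('predictions'))
```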
Another advantage of Cloud ML is hyper-parameter tuning. It is no more than a somewhat smart brute-forcing tool to find the best combination of hyper-parameters for your given model. You could possibly automate this yourself in Compute Engine, but you would have to figure out the optimisation algorithm that finds the combinations of parameter values which improve the objective function (usually maximising your accuracy or reducing your loss).
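To make that concrete, rolling your own tuning on Compute Engine would look something like the following plain random search, with a placeholder search space and a stand-in training function; the Cloud ML service plays the same role but chooses trials more intelligently and runs them in parallel for you.

```python
import random

# Placeholder search space; the real one would cover whatever your Keras model exposes.
search_space = {
    'learning_rate': [1e-4, 1e-3, 1e-2],
    'batch_size': [32, 64, 128],
    'hidden_units': [64, 128, 256],
}

def train_and_evaluate(params):
    # Stand-in for a real Keras training run returning validation accuracy.
    return random.random()

best_params, best_score = None, float('-inf')
for _ in range(20):
    params = {name: random.choice(values) for name, values in search_space.items()}
    score = train_and_evaluate(params)
    if score > best_score:
        best_params, best_score = params, score

print(best_params, best_score)
```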
Finally, pricing is slightly different in each service. Until recently, Cloud ML pricing was on par with other competitors (you paid for compute time in both training and prediction, but also per prediction, which you could compare with the compute time in Compute Engine). However, now you only pay for that compute time (and it is even cheaper than before), which probably renders the idea of managing and scaling your own cluster (with TensorFlow) on Compute Engine pointless in most scenarios.

Related

AWS Sagemaker Multiple Training Jobs

We currently have a system running on AWS Sagemaker whereby several units have their own trained machine learning model artifact (using an SKLearn training script with the Sagemaker SKLearn estimator).
Through the use of Sagemaker's multi-model endpoints, we are able to host all of these units on a single instance.
The problem we have is that we need to scale this system up so that we can train individual models for hundreds of thousands of units and then host the resulting model artifacts on a multi-model endpoint. But SageMaker has a limit on the number of models you can train in parallel (our limit is 30).
Aside from training our models in batches, does anyone have any ideas on how to go about implementing a system in AWS SageMaker whereby, for hundreds of thousands of units, we can have a separate trained model artifact for each unit?
Is there a way to output multiple model artifacts from one SageMaker training job with the SKLearn estimator?
Furthermore, how does Sagemaker make use of multiple CPUs when a training script is submitted? Does this have to be specified in the training script/estimator object or is this handled automatically?
Here are some ideas:
1. Does anyone have any ideas on how to go about implementing a system in AWS SageMaker whereby, for hundreds of thousands of units, we can have a separate trained model artifact for each unit? Is there a way to output multiple model artifacts from one SageMaker training job with the SKLearn estimator?
I don't know whether the 30-job training concurrency is a hard limit; if it is a blocker, you should open a support ticket to ask and try to get it raised. Otherwise, as you point out, you can train multiple models in one job and produce multiple artifacts that you can either (a) send to S3 manually, or (b) save to /opt/ml/model so that they all get packed into the model.tar.gz artifact in S3. Note that if this artifact gets too big, that could become impractical.
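As a sketch of option (b), the training script can simply write one artifact per unit into the model directory that SageMaker tars up; the unit list and data below are placeholders for whatever comes in through your input channels.

```python
import os
import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

# SageMaker exposes the output directory as SM_MODEL_DIR (usually /opt/ml/model);
# everything written there ends up in the job's model.tar.gz on S3.
model_dir = os.environ.get("SM_MODEL_DIR", "/opt/ml/model")
os.makedirs(model_dir, exist_ok=True)

unit_ids = ["unit_a", "unit_b", "unit_c"]  # placeholder; yours would come from the input channel

for unit_id in unit_ids:
    # Placeholder data; the real script would load this unit's rows instead.
    X = np.random.rand(200, 10)
    y = (X[:, 0] > 0.5).astype(int)
    model = LogisticRegression().fit(X, y)
    joblib.dump(model, os.path.join(model_dir, f"model_{unit_id}.joblib"))
```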
2. how does Sagemaker make use of multiple CPUs when a training script is submitted? Does this have to be specified in the training script/estimator object or is this handled automatically?
This depends on the type of training container you are using. SageMaker built-in containers are developed by Amazon teams and designed to use the available resources efficiently. If you use your own code, such as custom Python in the SKLearn container, you are responsible for making sure that your code is written efficiently and uses the available hardware; hence the framework choice is quite important. For example, some scikit-learn models support using multiple CPUs explicitly (e.g. the n_jobs parameter in the random forest), but I don't think scikit-learn natively supports GPU, multi-GPU or multi-node training.
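For instance, with scikit-learn the CPU parallelism is opt-in per estimator, as in this small sketch:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=10_000, n_features=20, random_state=0)

# n_jobs=-1 asks scikit-learn to use all CPUs the training instance exposes;
# leaving it at the default fits the forest on a single core regardless of instance size.
clf = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=0)
clf.fit(X, y)
```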

Training Machine Learning in Production

Is there a way to train your machine learning model in the cloud? Or does it really have to be batch training, i.e. pull some data from SQL, then feed that to the model?
What I was thinking is implementing my own model from scratch and using stochastic gradient descent to update the parameters for every row from the database.
I think you are looking for something like GCP AI Platform.
You can use BigQuery to store your data, do some analytics and run its built-in ML models.
AI Platform Notebooks to manage your notebooks.
Check this list of built-in algorithms in GCP.
Or, if you have your own model, you can use cloud resources to run it; check this link on how to use GCP resources for your model.
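For example, training one of BigQuery's built-in models is just a SQL statement run from Python; the dataset, table and label column names below are placeholders.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Placeholder dataset/table/label names; BigQuery ML trains the model in place,
# so there is no training cluster to provision or tear down.
query = """
CREATE OR REPLACE MODEL `my_dataset.churn_model`
OPTIONS (model_type = 'logistic_reg', input_label_cols = ['churned']) AS
SELECT * FROM `my_dataset.customer_features`
"""
client.query(query).result()  # blocks until training finishes
```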

Once Apache Beam supports Python 3, will tf.data be integrated into tf.Transform?

Reading about TFX, Kubeflow, Beam, Flink and a never-ending stream of Apache projects, I'm getting more and more confused. I'm curious what the TensorFlow team intends to promote as the canonical ETL API for training Keras models.
I'm currently pretty happy with tf.data support in tf.keras but two things are sorely missing:
Dataset reductions for standardizing features/targets.
Full dataset shuffling with persistent per-element caching (e.g. an equivalent of doing index permutations with a NumPy memmap).
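To illustrate the second point, here is roughly the pattern I mean, assuming the features have already been serialized to a .npy file that does not fit in RAM:

```python
import numpy as np
import tensorflow as tf

features = np.load("features.npy", mmap_mode="r")   # on-disk array, not loaded into RAM
perm = np.random.permutation(len(features))         # a true full-dataset shuffle

def shuffled_examples():
    for i in perm:
        yield features[i]                            # per-element reads served from the memmap

dataset = tf.data.Dataset.from_generator(
    shuffled_examples,
    output_types=tf.float32,
    output_shapes=features.shape[1:],
).batch(32)
```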
My hunch is tf.data exists because tf.Transform only works on Python 2. Once that is finally fixed, what's the future for tf.data? Will it be integrated into tf.Transform or the other way around? Will tf.data get the above mentioned features eventually, independently of Apache Beam's status?
TL;DR: What's going to be the canonical ETL API for training Keras models? Could a TensorFlower clarify the plans for the TensorFlow ecosystem and how it all ought to fit together?
PS: Where does tensorflow_io and tensorflow_datasets fit into all of this? They seem to do a lot of reinventing of the wheel instead of relying on tf.Transform.

How to deploy trained tensorflow network on e.g. Raspberry Pi

I'm trying to make a simple gesture recognition system to use with my Raspberry Pi equipped with a camera. I would like to train a neural network with TensorFlow on my more powerful laptop and then transfer it to the RPi for prediction (as part of a Magic Mirror). Is there a way to export the trained network and weights and use a lightweight version of TensorFlow for the linear algebra and prediction, without the overhead of all the symbolic graph machinery that is necessary for training? I have seen the tutorials on TensorFlow Serving, but I'd rather not set up a server and just have it run the prediction on the RPi.
Yes, it is possible, and the necessary pieces are available in the source repository. This allows you to deploy and run a model trained on your laptop. Note that this is the same model, which can be big.
To deal with size and efficiency, TF is currently moving towards a quantization approach. After your model is trained, a few extra steps allow you to "translate" it into a lighter model with similar accuracy. Currently, the implementation is quite slow, though. There is a recent post that shows the whole process for iOS, which is pretty similar to the Raspberry Pi overall.
The Makefile contribution is also quite relevant for tuning and extra configuration.
Beware that this code moves often and breaks. It is sometimes useful to checkout an old "release" tag to get something that works end to end.
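As a rough sketch of the inference side on the Pi (TF 1.x-style, with placeholder file and tensor names, assuming the graph was frozen on the laptop beforehand):

```python
import tensorflow as tf

# Placeholder file and node names; frozen_model.pb is a graph frozen on the laptop.
with tf.gfile.GFile("frozen_model.pb", "rb") as f:
    graph_def = tf.GraphDef()
    graph_def.ParseFromString(f.read())

with tf.Graph().as_default() as graph:
    tf.import_graph_def(graph_def, name="")
    input_tensor = graph.get_tensor_by_name("input:0")
    output_tensor = graph.get_tensor_by_name("predictions:0")

with tf.Session(graph=graph) as sess:
    # camera_frame stands in for whatever the Pi camera pipeline hands you.
    camera_frame = [[0.0] * 784]
    result = sess.run(output_tensor, feed_dict={input_tensor: camera_frame})
    print(result)
```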

Using sklearn and Python for a large application classification/scraping exercise

I am working on a relatively large text-based web classification problem and I am planning on using the multinomial Naive Bayes classifier in sklearn in python and the scrapy framework for the crawling. However, I am a little concerned that sklearn/python might be too slow for a problem that could involve classifications of millions of websites. I have already trained the classifier on several thousand websites from DMOZ.
The research framework is as follows:
1) The crawler lands on a domain name and scrapes the text from 20 links on the site (of depth no larger than one). (The number of tokenized words here seems to vary between a few thousand to up to 150K for a sample run of the crawler)
2) Run the sklearn multinomial NB classifier with around 50,000 features and record the domain name depending on the result.
My question is whether a Python-based classifier would be up to the task for such a large-scale application, or should I try re-writing the classifier (and maybe the scraper and word tokenizer as well) in a faster environment? If so, what might that environment be?
Or perhaps Python is enough if accompanied with some parallelization of the code?
Thanks
Use HashingVectorizer and one of the linear classification models that supports the partial_fit API (for instance SGDClassifier, Perceptron or PassiveAggressiveClassifier) to incrementally learn the model without having to vectorize and load all the data in memory upfront; you should not have any issue learning a classifier on hundreds of millions of documents with hundreds of thousands of (hashed) features.
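A minimal sketch of that out-of-core loop; the mini-batch generator and the class list are placeholders for your crawler output:

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

vectorizer = HashingVectorizer(n_features=2 ** 18)
clf = SGDClassifier()
all_classes = np.array([0, 1])  # every class must be declared on the first partial_fit call

def iter_minibatches():
    # Placeholder generator yielding (texts, labels) chunks from the crawler.
    yield ["some scraped page text", "another scraped page"], [0, 1]

for texts, labels in iter_minibatches():
    X = vectorizer.transform(texts)      # hashed features, no vocabulary kept in memory
    clf.partial_fit(X, labels, classes=all_classes)
```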
You should, however, load a small subsample that fits in memory (e.g. 100k documents) and grid search good parameters for the vectorizer using a Pipeline object and the RandomizedSearchCV class from the master branch. You can also fine-tune the value of the regularization parameter (e.g. C for PassiveAggressiveClassifier or alpha for SGDClassifier) using the same RandomizedSearchCV on a larger, pre-vectorized dataset that fits in memory (e.g. a couple of million documents).
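Something along these lines for the in-memory tuning pass (modern scikit-learn import paths; the parameter ranges are only illustrative):

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ("vec", HashingVectorizer()),
    ("clf", SGDClassifier()),
])
param_distributions = {
    "vec__n_features": [2 ** 16, 2 ** 18, 2 ** 20],
    "vec__ngram_range": [(1, 1), (1, 2)],
    "clf__alpha": [1e-7, 1e-6, 1e-5, 1e-4],
}
search = RandomizedSearchCV(pipeline, param_distributions, n_iter=10, cv=3)
# search.fit(subsample_texts, subsample_labels)  # a ~100k-document subsample that fits in RAM
```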
Also, linear models can be averaged (average the coef_ and intercept_ of two linear models), so you can partition the dataset, learn the linear models independently and then average them to get the final model.
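And a small sketch of the averaging trick applied to two fitted estimators:

```python
from copy import deepcopy

def average_linear_models(model_a, model_b):
    # Average two independently fitted linear models (e.g. one per data partition).
    averaged = deepcopy(model_a)
    averaged.coef_ = (model_a.coef_ + model_b.coef_) / 2.0
    averaged.intercept_ = (model_a.intercept_ + model_b.intercept_) / 2.0
    return averaged
```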
Fundamentally, if you rely on NumPy, SciPy and sklearn, Python will not be a bottleneck, as the most critical portions of those libraries are implemented as C extensions.
But since you're scraping millions of sites, you're going to be bound by your single machine's capabilities. I would consider using a service like PiCloud [1] or Amazon Web Services (EC2) to distribute your workload across many servers.
An example would be to funnel your scraping through Cloud Queues [2].
[1] http://www.picloud.com
[2] http://blog.picloud.com/2013/04/03/introducing-queues-creating-a-pipeline-in-the-cloud/
