Loading the data in memory to get a faster response - python

I am using an SVM classifier, which classifies data using a learned model. Here is the command I execute:
./svm_classify input.txt modelrank > input.txt.entities
svm_classify - the open-source classifier (Link)
input.txt - the input file to be classified
modelrank - the SVM model used for classification; it is 124 MB and was trained on a large dataset
input.txt.entities - the output file
But since modelrank is a large file (124 MB), loading it for every new classification request makes the process slow.
Is there any way to keep it in memory so that it responds instantly when a new request comes in?

Since editing the software would probably be a hassle, what you can do to speed things up is to get the file into RAM first. To do that you'd have to mount a partition stored in RAM, link.
This should reduce loading time noticeably.
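A minimal sketch of that idea in Python, assuming a Linux machine where /dev/shm is a RAM-backed tmpfs (the file names match the command above; the wrapper itself is just an illustration, not part of svm_classify):

import shutil
import subprocess

# Copy the 124 MB model into /dev/shm, which on most Linux systems is a
# RAM-backed tmpfs, so repeated reads of the model file avoid disk I/O.
ram_model = shutil.copy("modelrank", "/dev/shm/modelrank")

# Run the classifier against the RAM-resident copy of the model.
with open("input.txt.entities", "w") as out:
    subprocess.run(["./svm_classify", "input.txt", ram_model], stdout=out, check=True)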


Is it a good idea to store my dataset in my notebook instance in SageMaker?

I'm new to AWS and I am considering using Amazon SageMaker to train my deep learning model, because I'm having memory issues due to the large dataset and neural network that I have to train. I'm confused about whether to store my data in my notebook instance or in S3. If I store it in S3, would I be able to access it to train on my notebook instance? I'm confused about the concepts. Can anyone explain the use of S3 in machine learning on AWS?
Yes, you can use S3 as storage for your training datasets.
Refer to the diagram in this link describing how everything works together: https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-training.html
You may also want to check out the following blogs, which detail File mode and Pipe mode, the two mechanisms for transferring training data:
https://aws.amazon.com/blogs/machine-learning/accelerate-model-training-using-faster-pipe-mode-on-amazon-sagemaker/
In File mode, the training data is downloaded first to an encrypted EBS volume attached to the training instance prior to commencing the training. However, in Pipe mode the input data is streamed directly to the training algorithm while it is running.
https://aws.amazon.com/blogs/machine-learning/using-pipe-input-mode-for-amazon-sagemaker-algorithms/
With Pipe input mode, your data is fed on-the-fly into the algorithm container without involving any disk I/O. This approach shortens the lengthy download process and dramatically reduces startup time. It also offers generally better read throughput than File input mode. This is because your data is fetched from Amazon S3 by a highly optimized multi-threaded background process. It also allows you to train on datasets that are much larger than the 16 TB Amazon Elastic Block Store (EBS) volume size limit.
The blog also contains Python code snippets using Pipe input mode for reference.
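As a rough illustration, here is a minimal sketch of launching a training job with Pipe input mode via the SageMaker Python SDK; the image URI, role, and S3 paths are placeholders to replace with your own:

import sagemaker
from sagemaker.estimator import Estimator

session = sagemaker.Session()

estimator = Estimator(
    image_uri="<your-training-image-uri>",   # placeholder
    role="<your-execution-role-arn>",        # placeholder
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    input_mode="Pipe",   # stream data from S3 instead of downloading it first (File mode)
    sagemaker_session=session,
)

# The training channel points at the dataset stored in S3.
estimator.fit({"train": "s3://<your-bucket>/train/"})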

Is the Word2Vec Spark implementation distributed?

I'm relatively new to Spark and having some difficulty understanding Spark ML.
The problem I have is that I have 3 TB of text on which I want to train a Word2Vec model. The server I'm running on has around 1 TB of RAM, so I can't hold the whole file in memory.
The file is saved as a parquet that I import into Spark. My question is: does the Spark ML library distribute the Word2Vec training? If so, is there anything I need to worry about while processing such a large text file? If not, is there any way to stream this data while training Word2Vec?
From this pull request from 2014, https://github.com/apache/spark/pull/1719, you can glean that parallel processing is possible, per partition.
Quote:
To make our implementation more scalable, we train each partition
separately and merge the model of each partition after each iteration.
To make the model more accurate, multiple iterations may be needed.
But you have to have partitioned data.
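As a rough sketch with Spark ML (the parquet path, column name, and partition count are assumptions for illustration):

from pyspark.ml.feature import Word2Vec
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("word2vec-training").getOrCreate()

# Assumes the parquet file has a column "tokens" holding tokenized sentences.
df = spark.read.parquet("s3://your-bucket/corpus.parquet")

# numPartitions controls how many partitions are trained separately and then
# merged after each iteration, as described in the pull request above.
word2vec = Word2Vec(vectorSize=100, minCount=5, numPartitions=200,
                    inputCol="tokens", outputCol="vectors")
model = word2vec.fit(df)
model.save("s3://your-bucket/word2vec-model")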

How can I save my trained SVM model to retrieve it later for time saving in python?

I'm new to Python and working on machine learning. I have trained a LinearSVC from sklearn.svm, and training takes quite a long time, mostly because of stemming (7-8 minutes). I want to know whether it is possible to save the trained model in some format that can be fed back to Python as-is when running the application, just to avoid the training happening on every run of the application.
My answer:
Pickle or joblib can be used to save a trained model.
For reference, check out the link given below.
Reference Link
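For example, a minimal joblib sketch (the file name and toy dataset are placeholders; the real model would be your LinearSVC trained on the stemmed text features):

import joblib
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# Toy data stands in for the real, preprocessed training set.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

clf = LinearSVC()
clf.fit(X, y)

# Persist the trained model once...
joblib.dump(clf, "linear_svc.joblib")

# ...and in later runs load it back instead of retraining.
clf = joblib.load("linear_svc.joblib")
print(clf.predict(X[:5]))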

Real time data using sklearn

I have a real-time feed of patient health data that I connect to with Python. I want to run some sklearn algorithms over this data feed so that I can predict in real time if someone is going to get sick. Is there a standard way to connect real-time data to sklearn? I have traditionally had static datasets and never an incoming stream, so this is quite new to me. If anyone has some general rules/processes/tools they use, that would be great.
With most algorithms, training is slow and predicting is fast. Therefore it is better to train offline using training data, and then use the trained model to predict each new case in real time.
Obviously you might decide to train again later if you acquire more/better data. However, there is little benefit in retraining after every case.
It is feasible to train the model on a static dataset and use it to predict classifications for incoming data. Retraining the model with each new set of patient data is not; it also breaks the train/test approach for evaluating an ML model.
Trained models can be saved to a file and imported in the code used for real-time prediction.
In Python's scikit-learn, this is done via the pickle package.
In R, a model can be saved to an .rds object with saveRDS.
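For example, a minimal sketch of that workflow in Python, where model.pkl is assumed to hold a model already trained offline and read_next_record() is a hypothetical stand-in for whatever delivers one patient record from the live feed:

import pickle

# Load the model trained offline (done once, at startup).
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

def read_next_record():
    # Hypothetical placeholder: return one feature vector from the live feed,
    # or None when the feed ends.
    return None

while True:
    features = read_next_record()
    if features is None:
        break
    # Predict a single incoming case with the already-trained model.
    print(model.predict([features])[0])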
yay... my first time answering an ML question!

Save Naive Bayes Classifier in memory

I am new to NLTK and machine learning. I'm using Python with the NLTK Naive Bayes Classifier. I have created a Naive Bayes classifier for text classification using NLTK and saved it to disk. I am also able to load it when needed to classify some test data using this Python code:
import pickle
f = open('classifier.pickle', 'rb')  # open in binary mode for pickle
classifier = pickle.load(f)
f.close()
But my problem is that whenever new test data comes in, I have to load this classifier into memory again and again, which takes a lot of time (2-3 minutes) because of its large size. Also, if I have to run two instances of the same sentiment analysis program, that will take double the RAM, as both programs will load the classifier separately. My question is: is there any technique to keep this classifier in memory so that the sentiment analysis programs can read it directly whenever needed, or is there any other method to minimize the classifier's load time?
Thanks in advance for your help.
You can't have it both ways. You can either keep pickling/unpickling one at a time to use less RAM, or you can store both in memory, using twice as much RAM but reducing load times and disk I/O wait times.
Are the two classifiers trained using different training data, or are you using the same classifier in parallel? It sounds like the latter from your use of "two instances", and in that case you may want to look into threading to allow the same classifier to work with two sets of data (some parallelism may be achieved by classifying some of the data, then doing other work such as results processing while the other thread classifies, and repeating).
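A minimal threading sketch along those lines; it assumes classifier.pickle already exists on disk, and the feature extractor here is a placeholder that must match whatever was used at training time:

import pickle
import threading

# Load the pickled classifier once; both threads share this single in-memory copy.
with open('classifier.pickle', 'rb') as f:
    classifier = pickle.load(f)

def extract_features(text):
    # Placeholder word-presence features; replace with the extractor used for training.
    return {word: True for word in text.split()}

def classify_batch(name, documents):
    for doc in documents:
        label = classifier.classify(extract_features(doc))
        print(name, label)

# Two sets of data handled by two threads against one classifier, instead of
# two processes that each load their own copy.
batch_a = ["first batch of texts ..."]
batch_b = ["second batch of texts ..."]

t1 = threading.Thread(target=classify_batch, args=("A", batch_a))
t2 = threading.Thread(target=classify_batch, args=("B", batch_b))
t1.start(); t2.start()
t1.join(); t2.join()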
My expertise in this comes from having started an open source NLTK based sentiment analysis system: https://bitbucket.org/tommyjcarpenter/evopminer.
