I am trying to run a naive Bayes classifier over 1.6 million tweets using NLTK and Python.
Can someone tell me whether this is a reasonable thing to do? The process has taken about 12 hours so far and is currently using 3.2 GB of memory.
Is this just a waiting game that depends on my processing power, or are there more efficient ways of doing things?
Your data set is very large, so you should expect long running times and high memory consumption. It's hard to tell whether that is reasonable without more information.
You could, however, try some of the classifiers from scikit-learn instead of NLTK's basic ones; there are many efficient options there (k-nearest neighbors and logistic regression, to name a few), as well as alternative implementations of naive Bayes. I have had better success classifying text with those.
Here is a link to a wrapper for using them with NLTK-based datasets. Hope this helps.
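For example, a minimal sketch of training a scikit-learn naive Bayes model through NLTK's SklearnClassifier wrapper (the feature sets below are toy placeholders; real tweet features would come from your own extraction step):
from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.naive_bayes import MultinomialNB
# toy (feature_dict, label) pairs standing in for real tweet feature sets
train_set = [
    ({"contains(good)": True, "contains(bad)": False}, "pos"),
    ({"contains(good)": False, "contains(bad)": True}, "neg"),
]
classifier = SklearnClassifier(MultinomialNB())
classifier.train(train_set)
print(classifier.classify({"contains(good)": True, "contains(bad)": False}))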
I hope someone can help me with this. I am new to working with large datasets and need help optimizing both run time and memory usage.
I am working with news articles from 30 newspapers published between 2000 and 2018; there are approximately 12 million articles in the entire dataset. I am calculating TF-IDF and cosine similarity between the articles, and given that the data is around 40 GB, I am not sure how well this will scale.
At the moment I am only working with one month of data, and while it works, it is extremely slow.
For building the TF-IDF vocabulary, the corpus will never span more than one year of data (that is the upper limit on the duration of analysis we wish to perform). What is the best way to build the vocabulary? I searched around and found that gensim can build a vocabulary incrementally. Is that the best I can do, or is there a better/faster way to handle this?
Assuming I succeed in building the vocabulary for the corpus, I need to find all articles (from a particular date) whose cosine similarity with the other articles from the same date is below a given threshold. Since the vocabulary can be huge, repeatedly calling transform and cosine_similarity can be quite expensive. Any idea how I can improve on this? I thought of using Kruskal's algorithm for finding disconnected components, so as to minimize the calls to transform and cosine_similarity.
While using iterators and gensim to build the dictionary incrementally might help save memory, I am still not sure how to decrease the time needed to find the articles that have no similar articles.
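For reference, the incremental gensim approach I found looks roughly like this (the token batches below are toy placeholders; in practice each batch would be one month of tokenized articles):
from gensim.corpora import Dictionary
dictionary = Dictionary()
batches = [
    [["economy", "markets", "fell"], ["election", "results", "announced"]],
    [["markets", "recovered"], ["new", "election", "poll"]],
]
for batch in batches:
    dictionary.add_documents(batch)   # the vocabulary grows incrementally, batch by batch
print(len(dictionary))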
Finally, if anyone has experience working with similar data in pandas: should I move to a database, or is pandas sufficient for this task?
Thanks :)
I'm developing a script that detects peaks in signal data from a biological source. I want to create a semi-automated model that helps predict which peaks are the correct ones. The script improves as the user manually selects a few of these peaks to teach the model which ones are correct.
The workflow I'm trying to attain is this:
1. User manually selects data
2. Script obtains the correct data and fits the model to it
3. Use the model to predict the likelihood that a given peak is correct.
4. Hopefully, with enough data and training, it could be automated to run through the rest.
I also don't know the name of this general topic, so I'm struggling to find what to Google.
I've tried fitting a linear regression model in scikit-learn, but I don't have enough data (since the model only learns from the user's first interventions). Is what I'm doing possible?
Sorry for the generality of this answer, but the OP asked for general topics.
It sounds like semi-supervised learning may work; see here for scikit-learn's implementation and here for more details.
There is no labeled data to start with, so a manual process is used to gain some labeled data. Soon, semi-supervised learning can kick in and take over, with a process in place to measure its accuracy. That matches your situation and is a good place to start.
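A minimal sketch of that loop with scikit-learn's semi-supervised module (the peak features and the handful of user labels below are synthetic placeholders):
import numpy as np
from sklearn.semi_supervised import LabelSpreading
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))            # hypothetical peak features, e.g. [height, width]
y = np.full(100, -1)                     # -1 marks unlabeled peaks (scikit-learn's convention)
y[:10] = (X[:10, 0] > 0).astype(int)     # pretend the user hand-labeled the first 10 peaks
model = LabelSpreading(kernel="knn", n_neighbors=7)
model.fit(X, y)
proba = model.predict_proba(X)           # per-peak probability of being a "correct" peak
print(proba[:5])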
Eventually you may have "enough" correctly labeled data that you can investigate fitting a classic algorithm to predict the remainder. "Enough" is relative to how hard the problem is; it could be tens, hundreds, or thousands of examples.
Depending on other details of your situation, reinforcement learning may also apply. As you described the situation it may not fit, but there may be other details in your environment that let you leverage this family of methods.
Word of warning: machine learning, and semi-supervised learning in particular, does not work well on every problem. Measure, measure, measure.
Thank you everyone for all your help. I was talking to a colleague and he referred me to online machine learning, which I think is what I was looking for. Although I would not be handling time-series or streaming data, I think the method is sufficient for my needs: the model is trained on examples one at a time rather than as a batch. Scikit-learn's out-of-the-box support for online learning is limited, although some estimators do expose a partial_fit method.
This, I think, gives a great rundown of the strengths of online machine learning (and also showcases the creme Python library).
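As a rough illustration of the one-example-at-a-time idea, here is a minimal sketch using scikit-learn's SGDClassifier.partial_fit as a stand-in (the data stream and labels are synthetic, purely for illustration):
import numpy as np
from sklearn.linear_model import SGDClassifier
rng = np.random.default_rng(0)
model = SGDClassifier()
classes = np.array([0, 1])
for i in range(1000):
    x = rng.normal(size=(1, 3))           # one new example arrives
    y = np.array([int(x[0, 0] > 0)])      # its (synthetic) label
    model.partial_fit(x, y, classes=classes if i == 0 else None)
print(model.predict(rng.normal(size=(1, 3))))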
Thanks again!
I tried to use an SVM classifier to train on data with about 100k samples, but I found it to be extremely slow; even after two hours there was no response. When the dataset has around 1k samples, I get the result immediately. I also tried SGDClassifier and naive Bayes, which are quite fast and gave results within a couple of minutes. Could you explain this phenomenon?
General remarks about SVM-learning
SVM training with nonlinear kernels, which is the default in sklearn's SVC, has a complexity of approximately O(n_samples^2 * n_features) (see this question, where the approximation is given by one of sklearn's devs). This applies to the SMO algorithm used within libsvm, which is the core solver in sklearn for this type of problem.
This changes dramatically when no kernel is used, i.e. with sklearn.svm.LinearSVC (based on liblinear) or sklearn.linear_model.SGDClassifier.
So we can do some math to approximate the time-difference between 1k and 100k samples:
1k = 1,000^2 = 1,000,000 steps = Time X
100k = 100,000^2 = 10,000,000,000 steps = Time X * 10,000 !!!
This is only an approximation; reality can be better or worse (e.g. setting the cache size trades memory for speed gains)!
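As a rough illustration of the swap suggested above (synthetic data; exact timings will vary with hardware and parameters):
from sklearn.datasets import make_classification
from sklearn.svm import SVC, LinearSVC
from sklearn.linear_model import SGDClassifier
X, y = make_classification(n_samples=100_000, n_features=50, random_state=0)
# Kernelized SVC is roughly quadratic in n_samples and impractical at this size:
# SVC(kernel="rbf").fit(X, y)
# Linear solvers scale roughly linearly in n_samples:
LinearSVC(dual=False).fit(X, y)   # liblinear
SGDClassifier().fit(X, y)         # SGD with hinge loss, i.e. a linear SVM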
Scikit-learn specific remarks
The situation can also be much more complex because of all the nice things scikit-learn does for us behind the scenes. The above is valid for the classic two-class SVM. If you are trying to learn multi-class data, scikit-learn will automatically use one-vs-one or one-vs-rest approaches to do this (as the core SVM algorithm does not support multi-class). Read scikit-learn's docs to understand this part.
The same warning applies to generating probabilities: SVMs do not naturally produce probabilities for their predictions. To get them (activated by the probability parameter), scikit-learn uses an expensive internal cross-validation procedure called Platt scaling, which also takes a lot of time!
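For instance (a minimal sketch with synthetic data):
from sklearn.datasets import make_classification
from sklearn.svm import SVC
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
clf = SVC(probability=True).fit(X, y)   # probability=True triggers the internal cross-validation (Platt scaling)
print(clf.predict_proba(X[:3]))
# if you only need a score or ranking, decision_function avoids that extra cost:
clf2 = SVC().fit(X, y)
print(clf2.decision_function(X[:3]))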
Scikit-learn documentation
Because sklearn has some of the best documentation around, there is often a good section within the docs explaining exactly this kind of thing (link).
If you are using an Intel CPU, then Intel provides a solution for this.
Intel Extension for Scikit-learn offers you a way to accelerate existing scikit-learn code. The acceleration is achieved through patching: replacing the stock scikit-learn algorithms with their optimized versions provided by the extension.
Follow these steps:
First, install the Intel extension package for scikit-learn:
pip install scikit-learn-intelex
Now add the following lines at the top of your program:
from sklearnex import patch_sklearn
patch_sklearn()
Now run the program; it should be much faster than before.
You can read more about it at the following link:
https://intel.github.io/scikit-learn-intelex/
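Putting it together, a minimal sketch (the dataset is synthetic; patch_sklearn should be called before importing the estimators you want accelerated):
from sklearnex import patch_sklearn
patch_sklearn()                      # patch first, then import the estimators
from sklearn.datasets import make_classification
from sklearn.svm import SVC
X, y = make_classification(n_samples=20_000, n_features=30, random_state=0)
clf = SVC().fit(X, y)                # now runs on the accelerated implementation
print(clf.score(X, y))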
I recently started working on document clustering using the scikit-learn module in Python. However, I am having a hard time understanding the basics of document clustering.
What I know:
Document clustering is typically done using TF-IDF, which essentially converts the words in the documents to a vector space model that is then input to the algorithm.
There are many algorithms, like k-means, neural networks, and hierarchical clustering, to accomplish this.
My Data:
I am experimenting with LinkedIn data; each document is a LinkedIn profile summary. I would like to see whether profiles with similar jobs get clustered together.
Current Challenges:
My data has huge summary descriptions, which end up producing tens of thousands of features when I apply TF-IDF. Is there a proper way to handle this high-dimensional data?
K-means and other algorithms require that I specify the number of clusters (centroids), but in my case I do not know the number of clusters upfront. This, I believe, is completely unsupervised learning. Are there algorithms that can determine the number of clusters themselves?
I've never worked with document clustering before; if you are aware of tutorials, textbooks, or articles that address this topic, please feel free to suggest them.
I went through the code on the scikit-learn website, but it contains too many technical terms that I do not understand. If you have any code with good explanations or comments, please share. Thanks in advance.
My data has huge summary descriptions, which end up producing tens of thousands of features when I apply TF-IDF. Is there a proper way to handle this high-dimensional data?
My first suggestion is that you don't, unless you absolutely have to because of memory or execution-time problems.
If you must, use dimensionality reduction (PCA, for example) or feature selection (probably better in your case; see chi2, for example).
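For instance, a minimal sketch of dimensionality reduction on a TF-IDF matrix (the documents are toy placeholders; note that chi2 feature selection also needs a label such as a coarse job category, whereas TruncatedSVD does not):
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
docs = ["software engineer python", "nurse hospital care", "python data engineer"]  # toy profile summaries
X = TfidfVectorizer().fit_transform(docs)                  # sparse, one column per word
X_reduced = TruncatedSVD(n_components=2).fit_transform(X)  # dense, 2 columns
print(X_reduced.shape)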
K-means and other algorithms require that I specify the number of clusters (centroids), but in my case I do not know the number of clusters upfront. This, I believe, is completely unsupervised learning. Are there algorithms that can determine the number of clusters themselves?
If you look at the clustering algorithms available in scikit-learn, you'll see that not all of them require that you specify the number of clusters.
One that does not is hierarchical clustering, implemented in SciPy. Also see this answer.
I would also suggest that you use KMeans and try to manually tweak the number of clusters until you are satisfied with the results.
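For instance, DBSCAN does not need the number of clusters upfront; a minimal sketch (toy documents; eps and min_samples are illustrative and will need tuning):
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import DBSCAN
docs = ["python developer", "java developer", "registered nurse", "icu nurse"]  # toy profile summaries
X = TfidfVectorizer().fit_transform(docs)
labels = DBSCAN(eps=0.7, min_samples=2, metric="cosine").fit_predict(X)
print(labels)   # the number of clusters falls out of the data; -1 marks noise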
I've never worked with document clustering before; if you are aware of tutorials, textbooks, or articles that address this topic, please feel free to suggest them.
Scikit-learn has a lot of tutorials for working with text data; just use the "text data" search query on their site. One is for KMeans, others are for supervised learning, but I suggest you go over those too to get more familiar with the library. From a coding, style, and syntax point of view, unsupervised and supervised learning are pretty similar in scikit-learn, in my opinion.
Document clustering is typically done using TF-IDF, which essentially converts the words in the documents to a vector space model that is then input to the algorithm.
Minor correction here: TF-IDF has nothing to do with clustering. It is simply a method for turning text data into numerical data. It does not care what you do with that data (clustering, classification, regression, search-engine tasks, etc.) afterwards.
I understand the message you were trying to get across, but it is incorrect to say that "clustering is done using TF-IDF". It's done using a clustering algorithm; TF-IDF only plays a preprocessing role in document clustering.
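In code the two roles are clearly separate; a minimal sketch (toy documents):
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
docs = ["python developer", "java developer", "registered nurse", "icu nurse"]  # toy profile summaries
X = TfidfVectorizer().fit_transform(docs)                      # preprocessing: text -> sparse numeric matrix
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)    # the actual clustering step
print(km.labels_)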
For the large matrix after the TF-IDF transformation, consider using a sparse matrix.
You could try different values of k. I am not an expert in unsupervised clustering algorithms, but I expect that with such algorithms and different parameters you could also end up with a varying number of clusters.
This link might be useful; it provides a good amount of explanation of k-means clustering with visual output: http://brandonrose.org/clustering
I have a dataset of 22 GB that I would like to process on my laptop. Of course, I can't load it into memory.
I use sklearn a lot, but for much smaller datasets.
In this situation, the classical approach should be something like:
Read part of the data -> partially train your estimator -> discard that part -> read the next part -> continue training the estimator.
I have seen that some sklearn algorithms have a partial_fit method that should allow us to train the estimator on successive subsamples of the data.
Now I am wondering: is there an easy way to do that in sklearn?
I am looking for something like:
r = read_part_of_data('data.csv')
m = sk.my_model
for i in range(n):
    x = r.read_next_chunk(20)   # read the next 20 lines
    m.partial_fit(x)
m.predict(new_x)
Maybe sklearn is not the right tool for this kind of thing?
Let me know.
I've used several scikit-learn classifiers with out-of-core capabilities to train linear models: Stochastic Gradient Descent, Perceptron, and Passive Aggressive, as well as Multinomial Naive Bayes, on a Kaggle dataset of over 30 GB. All of these classifiers share the partial_fit method you mention. Some behave better than others, though.
You can find the methodology, the case study, and some good resources in this post:
http://www.opendatascience.com/blog/riding-on-large-data-with-scikit-learn/
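A minimal sketch of that kind of out-of-core pipeline (the file name, column names, chunk size, and binary classes are assumptions for illustration):
import pandas as pd
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier
vec = HashingVectorizer(n_features=2**18)   # stateless, so it works chunk by chunk without fitting
clf = SGDClassifier()
first = True
for chunk in pd.read_csv("data.csv", chunksize=10_000):   # hypothetical csv with 'text' and 'label' columns
    X = vec.transform(chunk["text"])
    clf.partial_fit(X, chunk["label"], classes=[0, 1] if first else None)   # classes assumed binary here
    first = False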
I think sklearn is fine for larger data. If your chosen algorithms support partial_fit or an online learning approach then you're on track. One thing to be aware of is that your chunk size may influence your success.
This link may be useful...
Working with big data in python and numpy, not enough ram, how to save partial results on disc?
I agree that h5py is useful but you may wish to use tools that are already in your quiver.
Another thing you can do is randomly decide whether or not to keep each row of your CSV file, and save the result to a .npy file so it loads quicker. That way you get a sample of your data that lets you start playing with all the algorithms, and you can deal with the bigger-data issue along the way (or not at all: sometimes a sample with a good approach is good enough, depending on what you want).
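A minimal sketch of that sampling trick (the file names and keep fraction are illustrative):
import random
import numpy as np
import pandas as pd
keep = 0.05   # keep roughly 5% of the rows
sample = pd.read_csv(
    "data.csv",
    skiprows=lambda i: i > 0 and random.random() > keep,   # i == 0 keeps the header row
)
np.save("sample.npy", sample.to_numpy())   # a .npy file reloads much faster than re-parsing the csv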
You may want to take a look at Dask or Graphlab
http://dask.pydata.org/en/latest/
https://turi.com/products/create/
They are similar to pandas but work on large-scale data (using out-of-core dataframes). The problem with pandas is that all the data has to fit into memory.
Both frameworks can be used with scikit-learn. You can load 22 GB of data into Dask or SFrame, then use it with sklearn.
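A minimal sketch with Dask (the column name is a placeholder):
import dask.dataframe as dd
df = dd.read_csv("data.csv")                   # lazy: the 22 GB file is not loaded into memory
print(df["some_column"].mean().compute())      # evaluated chunk by chunk, out of core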
I find it interesting that you have chosen to use Python for statistical analysis rather than R; however, I would start by putting my data into a format that can handle such large datasets. The Python h5py package is fantastic for this kind of storage, allowing very fast access to your data. You will need to chunk up your data into reasonable sizes, say 1-million-element chunks (e.g. 20 columns x 50,000 rows), writing each chunk to the HDF5 file. Next you need to think about what kind of model you are running, which you haven't really specified.
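A minimal sketch of that chunked HDF5 writing (the file names, column count, and chunk size follow the numbers above but are still assumptions):
import h5py
import pandas as pd
n_cols = 20
with h5py.File("data.h5", "w") as f:
    dset = f.create_dataset("X", shape=(0, n_cols), maxshape=(None, n_cols),
                            chunks=(50_000, n_cols), dtype="f8")
    for chunk in pd.read_csv("data.csv", chunksize=50_000):   # hypothetical 20-column numeric csv
        start = dset.shape[0]
        dset.resize(start + len(chunk), axis=0)
        dset[start:] = chunk.to_numpy()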
The fact is that you will probably have to write the model-fitting algorithm and the machine-learning cross-validation yourself because the data is large. Start by writing an algorithm to summarize the data, so that you know what you are looking at. Then, once you decide what model you want to run, you will need to think about what the cross-validation will be. Put a "column" into each chunk of the dataset that denotes which validation set each row belongs to; you may choose to assign each whole chunk to a particular validation set.
Next you will need to write a map-reduce-style algorithm to run your model on the validation subsets. The alternative is simply to run the model on each chunk of each validation set and average the results (consider the theoretical validity of this approach).
Consider using Spark, or R and rhdf5, or something similar. Beyond the sketch above, I haven't supplied any code because this is a project rather than just a simple coding question.