I'm training word2vec from scratch on a 34 GB pre-processed MS_MARCO corpus (22 GB before preprocessing; the preprocessed corpus is SentencePiece-tokenized, which is why it is larger). I'm training my word2vec model using the following code:
from gensim.test.utils import common_texts, get_tmpfile
from gensim.models import Word2Vec

class Corpus():
    """Iterate over sentences from the corpus."""
    def __init__(self):
        self.files = [
            "sp_cor1.txt",
            "sp_cor2.txt",
            "sp_cor3.txt",
            "sp_cor4.txt",
            "sp_cor5.txt",
            "sp_cor6.txt",
            "sp_cor7.txt",
            "sp_cor8.txt"
        ]

    def __iter__(self):
        for fname in self.files:
            for line in open(fname):
                words = line.split()
                yield words

sentences = Corpus()
model = Word2Vec(sentences, size=300, window=5, min_count=1, workers=8, sg=1, hs=1, negative=10)
model.save("word2vec.model")
My model has now been running for more than 30 hours. This seems doubtful, since on my i5 laptop with 8 cores, all 8 cores have been at 100% the entire time. Plus, my program seems to have read more than 100 GB of data from the disk so far. I don't know if anything is wrong here, but the main reason for my doubt about the training is this 100 GB of disk reads. The whole corpus is 34 GB, so why has my code read 100 GB of data from the disk? Does anyone know how long it should take to train word2vec on 34 GB of text, with 8 cores of an i5 CPU all running in parallel? Thank you. For more information, I'm also attaching a photo of my process from the system monitor.
I want to know why my model has read 112 GB from the disk, even though my corpus is 34 GB in total. Will my training ever finish? I'm also a bit worried about the health of my laptop, since it has been running constantly at peak capacity for the last 30 hours. It is really hot now.
Should I add any additional parameter in Word2Vec for quicker training without much performance loss?
Completing a model requires one pass over all the data to discover the vocabulary, then multiple passes, with a default of 5, to perform vector training. So, you should expect to see about 6x your data size in disk-reads, just from the model training.
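As a rough check of that arithmetic, the sketch below computes the expected read volume for the 34 GB corpus and estimates how far along a run is after 112 GB of reads. It assumes the default of 5 training passes and that the system monitor's counter reflects only corpus reads (no swapping, no caching effects); the variable names are purely illustrative.

```python
# Rough disk-read estimate for Word2Vec: 1 vocabulary pass + N training passes.
corpus_gb = 34          # size of the tokenized corpus on disk
training_passes = 5     # default number of training passes

expected_total_gb = corpus_gb * (1 + training_passes)

# Reads beyond the vocabulary pass are attributed to training passes.
read_so_far_gb = 112
passes_done = (read_so_far_gb - corpus_gb) / corpus_gb

print(f"expected total reads: {expected_total_gb} GB")
print(f"training passes completed so far: {passes_done:.1f} of {training_passes}")
```

Under those assumptions, 112 GB of reads looks consistent with a run that is partway through its third training pass, rather than with something having gone wrong.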
(If your machine winds up needing to use virtual-memory swapping during the process, there could be more disk activity – but you absolutely do not want that to happen, as the random-access pattern of word2vec training is nearly a worst-case for virtual memory usage, which will slow training immensely.)
If you'd like to understand the code's progress, and be able to estimate its completion time, you should enable Python logging to at least the INFO level. Various steps of the process will report interim results (such as the discovered and surviving vocabulary size) and estimated progress. You can often tell if something is going wrong before the end of a run by studying the logging outputs for sensible values, and once the 'training' phase has begun the completion time will be a simple projection from the training completed so far.
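A minimal way to turn that logging on, placed before the Word2Vec call, might look like the following. The format string is just one common choice, and the explicit "gensim" logger line assumes gensim names its loggers after its module paths.

```python
import logging

# Send INFO-level messages (vocabulary stats, training progress) to the console.
logging.basicConfig(
    format="%(asctime)s : %(levelname)s : %(message)s",
    level=logging.INFO,
)

# Also set the level on the "gensim" logger directly, in case the root logger
# was already configured elsewhere (basicConfig then does nothing).
logging.getLogger("gensim").setLevel(logging.INFO)
```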
I believe most laptops should throttle their own CPU if it's becoming so hot as to become unsafe or risk extreme wear on the CPU/components, but whether yours does, I can't say, and definitely make sure its fans work & vents are unobstructed.
I'd suggest you choose some small random subset of your data – maybe 1GB? – to be able to run all your steps to completion, becoming familiar with the Word2Vec logging output, resource usage, and results, and tinkering with settings to observe changes, before trying to run on your full dataset, which might require days of training time.
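One simple way to carve out such a subset is to keep each line with a fixed probability. The sketch below is only an illustration: `write_random_subset` and the output filename are hypothetical, and a `keep_fraction` of about 0.03 is an assumption chosen to give roughly 1 GB out of 34 GB.

```python
import random

def write_random_subset(in_paths, out_path, keep_fraction, seed=0):
    """Copy roughly `keep_fraction` of the lines from in_paths to out_path.

    Returns the number of lines kept. Lines are streamed one at a time,
    so memory use stays small regardless of corpus size.
    """
    rng = random.Random(seed)  # seeded so the subset is reproducible
    kept = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for path in in_paths:
            with open(path, encoding="utf-8") as src:
                for line in src:
                    if rng.random() < keep_fraction:
                        out.write(line)
                        kept += 1
    return kept

# Example call (not run here): about 3% of each of the eight corpus files.
# files = [f"sp_cor{i}.txt" for i in range(1, 9)]
# write_random_subset(files, "sp_subset.txt", keep_fraction=0.03)
```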
Some of your shown parameters aren't optimal for speedy training. In particular:
min_count=1 retains every word seen in the corpus-survey, including those with only a single occurrence. This results in a much, much larger model - potentially risking a model that doesn't fit into RAM, forcing disastrous swapping. But also, words with just a few usage examples can't possibly get good word vectors, as the process requires seeing many subtly-varied alternate uses. Still, via typical 'Zipfian' word-frequencies, the number of such words with just a few uses may be very large in total, so retaining all those words takes a lot of training time/effort, and even serves a bit like 'noise' making the training of other words, with plenty of usage examples, less effective. So for model size, training speed, and quality of remaining vectors, a larger min_count is desirable. The default of min_count=5 is better for more projects than min_count=1 – this is a parameter that should only really be changed if you're sure you know the effects. And, when you have plentiful data – as with your 34GB – the min_count can go much higher to keep the model size manageable.
hs=1 should only be enabled if you want to use the 'hierarchical-softmax' training mode instead of 'negative-sampling' – and in that case, negative=0 should also be set to disable 'negative-sampling'. You probably don't want to use hierarchical-softmax: it's not the default for a reason, and it doesn't scale as well to larger datasets. But here you've enabled it in addition to negative-sampling, likely more-than-doubling the required training time.
Did you choose negative=10 because you had problems with the default negative=5? Because this non-default choice, again, would slow training noticeably. (But also, again, a non-default choice here would be more common with smaller datasets, while larger datasets like yours are more likely to experiment with a smaller negative value.)
The theme of the above observations is: "only change the defaults if you've already got something working, and you have a good theory (or way of testing) how that change might help".
With a large-enough dataset, there's another default parameter to consider changing to speed up training (& often improve word-vector quality, as well): sample, which controls how-aggressively highly-frequent words (with many redundant usage-examples) may be downsampled (randomly skipped).
The default value, sample=0.001 (aka 1e-03), is very conservative. A smaller value, such as sample=1e-05, will discard many-more of the most-frequent-words' redundant usage examples, speeding overall training considerably. (And, for a corpus of your size, you could eventually experiment with even smaller, more-aggressive values.)
Finally, to the extent all your data (for either a full run, or a subset run) can be in an already-space-delimited text file, you can use the corpus_file alternate method of specifying the corpus. Then, the Word2Vec class will use an optimized multithreaded IO approach to assign sections of the file to alternate worker threads – which, if you weren't previously seeing full saturation of all threads/CPU-cores, could increase your throughput. (I'd put this off until after trying other things, then check if your best setup still leaves some of your 8 threads often idle.)
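Pulling the parameter suggestions above together, a starting configuration might look like the sketch below. Treat it as a guess to experiment from, not a tuned recipe: `SUGGESTED_PARAMS` and `train` are hypothetical names, `sample=1e-5` is just the example value mentioned above, and the corpus path is assumed to be a single space-delimited file (the eight files would need to be concatenated first). The dimensionality argument is `size` in gensim 3.x, as in the question's code, and `vector_size` in gensim 4.x, so the sketch picks the name from the installed version.

```python
# Defaults restored for min_count, hs and negative; more aggressive sample.
SUGGESTED_PARAMS = dict(
    window=5,
    min_count=5,    # default; drop words seen fewer than 5 times
    workers=8,
    sg=1,           # skip-gram, as in the question
    hs=0,           # hierarchical softmax off
    negative=5,     # default negative sampling
    sample=1e-5,    # downsample very frequent words more aggressively
)

def train(corpus_path, dim=300):
    """Train from one space-delimited text file using corpus_file mode."""
    # Imported here so the parameter dict is usable without gensim installed.
    import gensim
    from gensim.models import Word2Vec

    major = int(gensim.__version__.split(".")[0])
    size_key = "vector_size" if major >= 4 else "size"
    return Word2Vec(corpus_file=corpus_path, **{size_key: dim}, **SUGGESTED_PARAMS)

# Example call (not run here), on a hypothetical subset file:
# model = train("sp_subset.txt")
# model.save("word2vec.model")
```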
Some preliminary testing shows that a project I'm working on could potentially benefit from the use of a Support-Vector-Machine to solve a tricky problem. The concern that I have is that there will be major memory constraints. Prototyping and testing is being done in Python with scikit-learn. The final version will be custom written in C. The model would be pre-trained and only the decision function would be stored on the final product. There would be <= 10 training features, and <= 5000 training data-points. I've been reading mixed things regarding SVM memory, and I know the default sklearn memory cache is 200 MB (much larger than what I have available). Is this feasible? I know there are multiple different types of SVM kernel and that kernels can also be custom written. What kernel types could this potentially work with, if any?
If you're that strapped for space, you'll probably want to skip scikit and simply implement the math yourself. That way, you can cycle through the data in structures of your own choosing. Memory requirements depend on the class of SVM you're using; a two-class linear SVM can be done with a single pass through the data, considering only one observation at a time as you accumulate sum-of-products, so your command logic would take far more space than the data requirements.
If you need to keep the entire data set in memory for multiple passes, that's "only" 5000*10*8 bytes (at 8 bytes per double-precision float), or 400k of your 1Mb, which might be enough room to do your manipulations. Also consider a slow training process, re-reading the data on each pass, as this reduces the 400k to a triviality at the cost of wall-clock time.
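The arithmetic above, plus a rough estimate of what the stored model itself might need, can be sketched as follows. The 4-byte figures assume single-precision floats on the target device, and the worst-case kernel figure assumes every training point ends up as a support vector; both are assumptions for illustration, not properties of any particular implementation.

```python
n_samples, n_features = 5000, 10

# Training set held in memory as 8-byte doubles (the estimate above).
train_bytes_f64 = n_samples * n_features * 8
# The same data as 4-byte single-precision floats.
train_bytes_f32 = n_samples * n_features * 4

# Stored model, linear kernel: one weight per feature plus a bias.
linear_model_bytes = (n_features + 1) * 4
# Stored model, non-linear kernel, worst case: all support vectors,
# one coefficient per support vector, plus a bias.
kernel_model_bytes_max = (n_samples * n_features + n_samples + 1) * 4

print(train_bytes_f64, train_bytes_f32, linear_model_bytes, kernel_model_bytes_max)
```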
All of this is under your control if you look up a usable SVM implementation and alter the I/O portions as needed.
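Since only the decision function has to live on the final product, it may help to see how little it involves. The sketch below is plain Python meant as a reference for a later C port; the function names are hypothetical. In scikit-learn, the inputs would correspond to the fitted model's `coef_` and `intercept_` for a linear model, and to `support_vectors_`, `dual_coef_`, `intercept_`, and the `gamma` value for an RBF `SVC`.

```python
import math

def linear_decision(x, weights, bias):
    """Linear SVM: f(x) = w . x + b; the sign of f(x) gives the class."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias

def rbf_decision(x, support_vectors, dual_coefs, bias, gamma):
    """RBF SVM: f(x) = sum_i coef_i * exp(-gamma * ||x - sv_i||^2) + b."""
    total = bias
    for sv, coef in zip(support_vectors, dual_coefs):
        sq_dist = sum((a - b) ** 2 for a, b in zip(x, sv))
        total += coef * math.exp(-gamma * sq_dist)
    return total
```

The linear form needs only the weight vector and bias, while the RBF form has to store every support vector, which is the main reason a linear kernel is the easier fit for a tight memory budget.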
Does that help?
I'm very new to scikit-learn and machine learning in general. I have a data set which comprises 140,565 rows and 17 columns. I'm using someone else's code which runs a Random Forest model, on a machine that has a 2.7GHz processor, 4GB RAM, Windows 10.
Obviously 4GB RAM isn't enough, and I can't upgrade this system (ultrabook). It has an SSD in it. Is there a way to configure scikit to use the hard drive instead of RAM (more space at the expense of speed)?
As far as I know, you still need memory: reading from and writing to disk during training is not possible for sklearn ML tasks (you would need to try other software). You can try to fit within memory using the strategy described in the scikit-learn documentation under "Scaling with instances using out-of-core learning", but this limits which algorithms you can use.
Performance and results will be affected in these cases, and the batch size also affects the results.
Note: SAS and Hadoop (MapReduce) are better suited to reading from and writing to disk, whereas sklearn requires RAM.
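For illustration, out-of-core learning in scikit-learn means feeding an estimator one batch at a time through `partial_fit`. The sketch below uses `SGDClassifier`, which supports that method; `RandomForestClassifier` does not, so this swaps in a different model from the one in the question. The helper names and the batch size are hypothetical.

```python
def iter_batches(rows, batch_size):
    """Yield lists of at most `batch_size` items from any iterable of rows."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # final, possibly shorter, batch
        yield batch

def fit_out_of_core(labelled_rows, classes, batch_size=10_000):
    """Train incrementally on an iterable of (features, label) pairs.

    Only one batch is held in memory at a time. `classes` must list every
    label up front, because no single batch is guaranteed to contain all.
    """
    from sklearn.linear_model import SGDClassifier  # needs scikit-learn

    model = SGDClassifier()
    for batch in iter_batches(labelled_rows, batch_size):
        X = [features for features, _ in batch]
        y = [label for _, label in batch]
        model.partial_fit(X, y, classes=classes)
    return model
```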
I'm quite comfortable using XGBoost to come up with predictive models; my concern is using it with a dataset that is (to me, at least) massive. I have four ~20 GB CSV files with some training data that I am trying to clean up and get ready for model training. I am a little confused about how to start getting the data 'primed' for everything else; here are a few thoughts that I had (I'm not certain they are the best) and some limitations I foresee:
pymysql or sqlalchemy: take data, somehow pass it in to a SQL database. QUESTION: Do I process data first, or process it once it is in the database?
Dask on a single computer (and not a cluster); again, just not sure how to interface it with XGBoost after one-hot encoding.
Using Numpy somehow; I remember reading about how that would work with representing arrays of each column somehow but I can't be damned to remember.
HDF5 file format; still don't think it would make it small enough to work with reasonably.
My system has 24 GB of RAM on 64-bit Ubuntu. Is there any way to use swap memory somehow to do all of the processing? It would be stupidly slow, certainly.
Effectively I'm wondering what one would recommend for cleaning, one-hot encoding, and training a machine learning algorithm with such a massive data set. Thank you!
I am trying to find a means of starting to work with very large CSV files in Pandas, ultimately to be able to do some machine learning with XGBoost.
I am torn between using MySQL or some SQLite framework to manage chunks of my data; my issue is in the machine learning aspect of it later on, and in loading in chunks at a time to train the model.
My other thought was to use Dask, which is built on top of Pandas, but also has XGBoost functionality.
I'm not sure what the best starting point is and was hoping to ask for an opinion! I am leaning towards Dask but I have not used it yet.
This blogpost goes through an example using XGBoost on a large CSV dataset. However it did so by using a distributed cluster with enough RAM to fit the entire dataset in memory at once. While many dask.dataframe operations can operate in small space I don't think that XGBoost training is likely to be one of them. XGBoost seems to operate best when all data is available all the time.
I haven't tried this, but I would load your data into an HDF5 file using h5py. This library lets you store data on disk but access it like a NumPy array, so your dataset is no longer constrained by memory.
For the XGBoost part, I would use the sklearn API and pass in the h5py object as the X value. I recommend the sklearn API since it accepts NumPy-like arrays as input, which should let h5py objects work. Make sure to use a small value for subsample; otherwise you'll likely run out of memory fast.
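The CSV-to-HDF5 step could be sketched roughly as below, streaming the CSV in chunks so the whole file never sits in memory. This is untested against real data and rests on several assumptions: every column is already numeric (so cleaning and one-hot encoding would have to happen before or during this step), the dataset name "X" and both function names are made up for illustration, and whether XGBoost can then train from the h5py object without copying it into memory is not something this sketch establishes.

```python
import csv
from itertools import islice

def iter_csv_chunks(csv_path, chunk_rows, skip_header=True):
    """Yield lists of at most `chunk_rows` rows, each row a list of floats."""
    with open(csv_path, newline="") as f:
        reader = csv.reader(f)
        if skip_header:
            next(reader, None)
        while True:
            chunk = [[float(v) for v in row] for row in islice(reader, chunk_rows)]
            if not chunk:
                return
            yield chunk

def csv_to_hdf5(csv_path, h5_path, n_cols, chunk_rows=100_000):
    """Append the CSV to a resizable on-disk dataset, one chunk at a time."""
    import h5py          # third-party; imported lazily
    import numpy as np

    with h5py.File(h5_path, "w") as h5:
        dset = h5.create_dataset(
            "X", shape=(0, n_cols), maxshape=(None, n_cols),
            dtype="float32", chunks=True,
        )
        for chunk in iter_csv_chunks(csv_path, chunk_rows):
            start = dset.shape[0]
            dset.resize(start + len(chunk), axis=0)  # grow along rows
            dset[start:] = np.asarray(chunk, dtype="float32")
```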