processing text with nlp.pipe taking hours - python

I have 300.000 news articles in my dataset and I'm using en_core_web_sm to do POS tagging, parsing, ner extraction. However, this is taking hours and hours and seems to never be done.
The code is working, but very slow. When I take a sample of my dataset 650 articles it takes 37seconds and 6500 articles takes 6min. However I really need my full dataset of 300.000 articles to be completed in a reasonable amount of time...
texts = df["content"]
for doc in nlp.pipe(texts, n_process=2, batch_size=100, disable=["senter","attribute_ruler" "lemmatizer"]):
df["spacy_sm"] = texts
Is there a way to speed this up significantly, or am I missing something here?

On CPU you can usually use a much larger batch size and you can use a larger number of processes depending on your CPU. The default batch size for sm/md/lg models is currently 256 but you can increase it further as long as you aren't running out of RAM. A batch size of 1000 or higher would be totally reasonable on CPU for many tasks, but it really depends on the pipeline components and text lengths.
As long as your text lengths don't vary too much, you can figure out the approximate RAM required for a given batch size with one process and then you can estimate how many processes you can fit in the available RAM for that batch size.

Related

Query execution time with small batches vs entire input set

I'm using ArangoDB 3.9.2 for search task. The number of items in dataset is 100.000. When I pass the entire dataset as an input list to the engine - the execution time is around ~10 sec, which is pretty quick. But if I pass the dataset in small batches one by one - 100 items per batch, the execution time is rapidly growing. In this case, to process the full dataset takes about ~2 min. Could you explain please, why is it happening? The dataset is the same.
I'm using python driver "ArangoClient" from python-arango lib ver 0.2.1
PS: I had the similar problem with Neo4j, but the problem was solved using transactions committing with HTTP API. Does the ArangoDB have something similar?
Every time you make a call to a remote system (Neo4J or ArangoDB or any database) there is overhead in making the connection, sending the data, and then after executing your command, tearing down the connection.
What you're doing is trying to find the 'sweet spot' for your implementation as to the most efficient batch size for the type of data you are sending, the complexity of your query, the performance of your hardware, etc.
What I recommend doing is writing a test script that sends the data in varying batch sizes to help you determine the optimal settings for your use case.
I have taken this approach with many systems that I've designed and the optimal batch sizes are unique to each implementation. It totally depends on what you are doing.
See what results you get for the overall load time if you use batch sizes of 100, 1000, 2000, 5000, and 10000.
This way you'll work out the best answer for you.

Featuretools taking too long to build features without using CPU cores

I'm using featuretools Deep Feature Sintesys to build features for a dataset of 40k rows and 200 columns. I choose about 40 transformation primitivies, as you can see in the code bellow:
feature_matrix, feature_defs = ft.dfs(entityset=es, target_entity="df", n_jobs=6,
trans_primitives=primitives.name.to_list(),
verbose=True)
but when I run my code, It takes a lot of time to discover the features to build, and this process doesn't run in multiple cores in my CPU, and not even a single-core gets 100% of usage. In other words, I'm waiting hours to run a process that is just using the minimal resources of my machine (memory also is not a problem).
After the feature tools discover the features (and print a log "built n features") then it creates the cluster and uses all the cores specified in the "n_jobs" parameter, in 100% of capability. This second moment is really fast, just some seconds, once all my resources are being used.
My question is, why is this happening? It's possible to discover the features faster to reduce this time? And just don't understand how a process that doesn't use resources takes too long.

Training time of gensim word2vec

I'm training word2vec from scratch on 34 GB pre-processed MS_MARCO corpus(of 22 GB). (Preprocessed corpus is sentnecepiece tokenized and so its size is more) I'm training my word2vec model using following code :
from gensim.test.utils import common_texts, get_tmpfile
from gensim.models import Word2Vec
class Corpus():
"""Iterate over sentences from the corpus."""
def __init__(self):
self.files = [
"sp_cor1.txt",
"sp_cor2.txt",
"sp_cor3.txt",
"sp_cor4.txt",
"sp_cor5.txt",
"sp_cor6.txt",
"sp_cor7.txt",
"sp_cor8.txt"
]
def __iter__(self):
for fname in self.files:
for line in open(fname):
words = line.split()
yield words
sentences = Corpus()
model = Word2Vec(sentences, size=300, window=5, min_count=1, workers=8, sg=1, hs=1, negative=10)
model.save("word2vec.model")
My model is running now for about more than 30 hours now. This is doubtful since on my i5 laptop with 8 cores, I'm using all the 8 cores at 100% for every moment of time. Plus, my program seems to have read more than 100 GB of data from the disk now. I don't know if there is anything wrong here, but the main reason after my doubt on the training is because of this 100 GB of read from the disk. The whole corpus is of 34 GB, then why my code has read 100 GB of data from the disk? Does anyone know how much time should it take to train word2vec on 34 GB of text, with 8 cores of i5 CPU running all in parallel? Thank you. For more information, I'm also attaching the photo of my process from system monitor.
I want to know why my model has read 112 GB from memory, even when my corpus is of 34 GB in total? Will my training ever get finished? Also I'm bit worried about health of my laptop, since it is running constantly at its peak capacity since last 30 hours. It is really hot now.
Should I add any additional parameter in Word2Vec for quicker training without much performance loss?
Completing a model requires one pass over all the data to discover the vocabulary, then multiple passes, with a default of 5, to perform vector training. So, you should expect to see about 6x your data size in disk-reads, just from the model training.
(If your machine winds up needing to use virtual-memory swapping during the process, there could be more disk activity – but you absolutely do not want that to happen, as the random-access pattern of word2vec training is nearly a worst-case for virtual memory usage, which will slow training immensely.)
If you'd like to understand the code's progress, and be able to estimate its completion time, you should enable Python logging to at least the INFO level. Various steps of the process will report interim results (such as the discovered and surviving vocabulary size) and estimated progress. You can often tell if something is going wrong before the end of a run by studying the logging outputs for sensible values, and once the 'training' phase has begun the completion time will be a simple projection from the training completed so far.
I believe most laptops should throttle their own CPU if it's becoming so hot as to become unsafe or risk extreme wear on the CPU/components, but whether yours does, I can't say, and definitely make sure its fans work & vents are unobstructed.
I'd suggest you choose some small random subset of your data – maybe 1GB? – to be able to run all your steps to completion, becoming familiar with the Word2Vec logging output, resource usage, and results, and tinkering with settings to observe changes, before trying to run on your full dataset, which might require days of training time.
Some of your shown parameters aren't optimal for speedy training. In particular:
min_count=1 retains every word seen in the corpus-survey, including those with only a single occurrence. This results in a much, much larger model - potentially risking a model that doesn't fit into RAM, forcing disastrous swapping. But also, words with just a few usage examples can't possibly get good word vectors, as the process requires seeing many subtly-varied alternate uses. Still, via typical 'Zipfian' word-frequencies, the number of such words with just a few uses may be very large in total, so retaining all those words takes a lot of training time/effort, and even serves a bit like 'noise' making the training of other words, with plenty of usage examples, less effective. So for model size, training speed, and quality of remaining vectors, a larger min_count is desirable. The default of min_count=5 is better for more projects than min_count=1 – this is a parameter that should only really be changed if you're sure you know the effects. And, when you have plentiful data – as with your 34GB – the min_count can go much higher to keep the model size manageable.
hs=1 should only be enabled if you want to use the 'hierarchical-softmax' training mode instead of 'negative-sampling' – and in that case, negative=0 should also be set to disable 'negative-sampling'. You probably don't want to use hierarchical-softmax: it's not the default for a reason, and it doesn't scale as well to larger datasets. But here you've enabled in in addition to negative-sampling, likely more-than-doubling the required training time.
Did you choose negative=10 because you had problems with the default negative=5? Because this non-default choice, again, would slow training noticeably. (But also, again, a non-default choice here would be more common with smaller datasets, while larger datasets like yours are more likely to experiment with a smaller negative value.)
The theme of the above observations is: "only change the defaults if you've already got something working, and you have a good theory (or way of testing) how that change might help".
With a large-enough dataset, there's another default parameter to consider changing to speed up training (& often improve word-vector quality, as well): sample, which controls how-aggressively highly-frequent words (with many redundant usage-examples) may be downsampled (randomly skipped).
The default value, sample=0.001 (aka 1e-03), is very conservative. A smaller value, such as sample=1e-05, will discard many-more of the most-frequent-words' redundant usage examples, speeding overall training considerably. (And, for a corpus of your size, you could eventually experiment with even smaller, more-aggressive values.)
Finally, to the extent all your data (for either a full run, or a subset run) can be in an already-space-delimited text file, you can use the corpus_file alternate method of specifying the corpus. Then, the Word2Vec class will use an optimized multithreaded IO approach to assign sections of the file to alternate worker threads – which, if you weren't previously seeing full saturation of all threads/CPU-cores, could increase our throughput. (I'd put this off until after trying other things, then check if your best setup still leaves some of your 8 threads often idle.)

DASK Memory Per Worker Guide

I'm currently working on refactoring some legacy analytics into Python/DASK to show the efficacy of this as a solution going forward.
I'm trying to set up a demo scenario and am having problems with memory and would like some advice.
My scenario; I have my data split into 52 gzip compressed parquet files on S3, each one, uncompressed in memory is around 100MB, giving a total dataset size of ~5.5GB and exactly 100,000,000 rows.
My scheduler is on a T2.Medium (4GB/2vCPUs) as are my 4 workers.
Each worker is being run with 1 process, 1 thread and a memory limit of 4GB, I.e. dask-worker MYADDRESS --nprocs 1 --nthreads=1 --memory-limit=4GB.
Now, I'm pulling the parquet files and immediately repartitioning on a column in such a way that I end up with roughly 480 partitions each of ~11MB.
Then I'm using map_partitions to do the main body of work.
This works fine for small datasets, however for the 100 mil dataset, my workers keep crashing due to not having enough memory.
What am I doing wrong here?
For implementation specific info, the function i'm passing to map_partitions can sometimes need roughly 1GB, due to what is essentially a cross join on the partition dataframe.
Am I not understanding something to do with DASK's architecture? Between my scheduler and my 4 workers there is 20GB of memory to work with, yet this is proving to be not enough.
From what I've read from the DASK documentation is that, so long as each partition, and what you do with that partition, fits in the memory of the worker then you should be ok?
Is 4GB just not enough? Does it need way more to handle scheduler/inter process communication oerheads?
Thanks for reading.
See https://docs.dask.org/en/latest/best-practices.html#avoid-very-large-partitions
I'll copy the text here for convenience
Your chunks of data should be small enough so that many of them fit in a worker’s available memory at once. You often control this when you select partition size in Dask DataFrame or chunk size in Dask Array.
Dask will likely manipulate as many chunks in parallel on one machine as you have cores on that machine. So if you have 1 GB chunks and ten cores, then Dask is likely to use at least 10 GB of memory. Additionally, it’s common for Dask to have 2-3 times as many chunks available to work on so that it always has something to work on.
If you have a machine with 100 GB and 10 cores, then you might want to choose chunks in the 1GB range. You have space for ten chunks per core which gives Dask a healthy margin, without having tasks that are too small

Is it feasible to run a Support Vector Machine Kernel on a device with <= 1 MB RAM and <= 10 MB ROM?

Some preliminary testing shows that a project I'm working on could potentially benefit from the use of a Support-Vector-Machine to solve a tricky problem. The concern that I have is that there will be major memory constraints. Prototyping and testing is being done in python with scikit-learn. The final version will be custom written in C. The model would be pre-trained and only the decision function would be stored on the final product. There would be <= 10 training features, and <= 5000 training data-points. I've been reading mixed things regarding SVM memory, and I know the default sklearn memory cache is 200 MB. (Much larger than what I have available) Is this feasible? I know there are multiple different types of SVM kernel and that the kernel's can also be custom written. What kernel types could this potentially work with, if any?
If you're that strapped for space, you'll probably want to skip scikit and simply implement the math yourself. That way, you can cycle through the data in structures of your own choosing. Memory requirements depend on the class of SVM you're using; a two-class linear SVM can be done with a single pass through the data, considering only one observation at a time as you accumulate sum-of-products, so your command logic would take far more space than the data requirements.
If you need to keep the entire data set in memory for multiple passes, that's "only" 5000*10*8 bytes for floats, or 400k of your 1Mb, which might be enough room to do your manipulations. Also consider a slow training process, re-reading the data on each pass, as this reduces the 400k to a triviality at the cost of wall-clock time.
All of this is under your control if you look up a usable SVM implementation and alter the I/O portions as needed.
Does that help?

Categories