I have a pandas dataframe with 3 million rows of social media comments. I'm using the language-tool-python library to find the number of grammatical errors in a comment. Afaik the language-tool library by default sets up a local language-tool server on your machine and queries responses from that.
Getting the number of grammatical errors just consists of creating an instance of the LanguageTool object and calling the .check() method with the string you want to check as a parameter.
>>> tool = language_tool_python.LanguageTool('en-US')
>>> text = 'A sentence with a error in the Hitchhiker’s Guide tot he Galaxy'
>>> matches = tool.check(text)
>>> len(matches)
2
So the method I used is df['body_num_errors'] = df['body'].apply(lambda row: len(tool.check(row))). I am pretty sure this works; it's quite straightforward. This single line of code has been running for the past hour.
Running the above example took 10-20 seconds, so with 3 million instances it might as well take virtually forever.
Is there any way I can cut my losses and speed this process up? Would iterating over every row and putting the whole thing inside a ThreadPoolExecutor help? Intuitively that makes sense to me, as it's an I/O-bound task.
I am open to any suggestions on how to speed up this process, and if the above method works, I would appreciate it if someone could show me some sample code.
Edit - Correction: it takes 10-20 seconds including the instantiation; calling the method itself is almost instantaneous.
I'm the creator of language_tool_python. First, none of the comments here make sense. The bottleneck is in tool.check(); there is nothing slow about using pd.DataFrame.map().
LanguageTool is running on a local server on your machine. There are at least two major ways to speed this up:
Method 1: Initialize multiple servers
servers = []
for i in range(100):
    servers.append(language_tool_python.LanguageTool('en-US'))
Then call each server from a different thread, or alternatively initialize each server within its own thread.
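A minimal sketch of that idea, assuming the df['body'] column from the question; the server count, the round-robin assignment and the ThreadPoolExecutor are illustrative choices, not something the library requires:

import concurrent.futures
import language_tool_python

N_SERVERS = 8  # illustrative; each instance starts its own local LT server
tools = [language_tool_python.LanguageTool('en-US') for _ in range(N_SERVERS)]

def count_errors(indexed_text):
    i, text = indexed_text
    # round-robin the comments across the tool instances
    return len(tools[i % N_SERVERS].check(text))

texts = df['body'].tolist()
with concurrent.futures.ThreadPoolExecutor(max_workers=N_SERVERS) as pool:
    df['body_num_errors'] = list(pool.map(count_errors, enumerate(texts)))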
Method 2: Increase the thread count
LanguageTool takes a maxCheckThreads option – see the LT HTTPServerConfig documentation – so you could also try playing around with that. From a glance at LanguageTool's source code, it looks like the default number of threads in a single LanguageTool server is 10.
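If the library forwards this option through its config dict (the same mechanism used for other server options), it would look something like this; treat the key name and the value of 20 as assumptions to verify against the LanguageTool documentation:

import language_tool_python

# Assumption: 'maxCheckThreads' is forwarded to the local LT server's HTTPServerConfig;
# 20 is just an illustrative value.
tool = language_tool_python.LanguageTool(
    'en-US', config={'maxCheckThreads': 20})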
In the documentation, we can see that language-tool-python has the configuration option maxSpellingSuggestions.
However, despite the name of the variable and the default value being 0, I have noticed that the code runs noticeably faster (almost 2 times faster) when this parameter is actually set to 1.
I don't know where this discrepancy comes from, and the documentation does not mention anything specific about the default behavior. It is a fact, however, that this setting improves performance, at least for my own dataset (which I don't think should affect the running time much).
Example initialization:
import language_tool_python
language_tool = language_tool_python.LanguageTool('en-US', config={'maxSpellingSuggestions': 1})
If you are worried about scaling up with pandas, switch to Dask instead. It integrates with pandas and will use multiple cores of your CPU (which I am assuming you have) instead of the single core that pandas uses. This helps parallelize the 3 million instances and can speed up your execution time. You can read more about Dask here or see an example here.
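A hedged sketch of what that could look like, assuming the df and tool objects from the question; the partition count is illustrative, and with the default threaded scheduler all partitions share the one local LanguageTool server:

import dask.dataframe as dd

# Split the pandas DataFrame into partitions and let Dask run the checks in parallel.
ddf = dd.from_pandas(df, npartitions=32)
ddf['body_num_errors'] = ddf['body'].map(
    lambda text: len(tool.check(text)),
    meta=('body_num_errors', 'int64'))
df = ddf.compute()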
I've been having a hard time using a large dictionary (~86GB, 1.75 billion keys) to process a big dataset (2TB) using multiprocessing in Python.
Context: a dictionary mapping strings to strings is loaded from pickled files into memory. Once loaded, worker processes (ideally >32) are created that must look up values in the dictionary but not modify its contents, in order to process the ~2TB dataset. The dataset needs to be processed in parallel; otherwise the task would take over a month.
Here are the nine approaches (all failing) that I have tried:
1. Store the dictionary as a global variable in the Python program and then fork the ~32 worker processes. Theoretically this method might work, since the dictionary is not being modified and therefore the COW mechanism of fork on Linux would mean that the data structure would be shared and not copied among processes. However, when I attempt this, my program crashes on os.fork() inside of multiprocessing.Pool.map with OSError: [Errno 12] Cannot allocate memory. I'm convinced that this is because the kernel is configured to never overcommit memory (/proc/sys/vm/overcommit_memory is set to 2, and I can't change this setting since I don't have root access on the machine).
2. Load the dictionary into a shared-memory dictionary with multiprocessing.Manager.dict. With this approach I was able to fork the 32 worker processes without crashing, but the subsequent data processing is orders of magnitude slower than another version of the task that required no dictionary (the only difference being the absence of the dictionary lookup). I theorize that this is because of the inter-process communication between the manager process holding the dictionary and each worker process, which is required for every single dictionary lookup. Although the dictionary is not being modified, it is being accessed many, many times, often simultaneously by many processes.
3. Copy the dictionary into a C++ std::map and rely on Linux's COW mechanism to prevent it from being copied (like approach #1, except with the dictionary in C++). With this approach, it took a long time to load the dictionary into the std::map, and it subsequently crashed with ENOMEM on os.fork() just as before.
4. Copy the dictionary into pyshmht. It takes far too long to copy the dictionary into pyshmht.
5. Try using SNAP's HashTable. The underlying implementation in C++ allows it to be made and used in shared memory. Unfortunately, the Python API does not offer this functionality.
6. Use PyPy. The crash still happened as in #1.
7. Implement my own shared-memory hash table in Python on top of multiprocessing.Array. This approach still resulted in the out-of-memory error that occurred in #1.
8. Dump the dictionary into dbm. After trying to dump the dictionary into a dbm database for four days and seeing an ETA of "33 days", I gave up on this approach.
9. Dump the dictionary into Redis. When I try to dump the dictionaries (the 86GB dict is loaded from 1024 smaller dicts) into Redis using redis.mset, I get a connection-reset-by-peer error. When I try to dump the key-value pairs one at a time in a loop, it takes an extremely long time.
How can I process this dataset in parallel efficiently, without requiring inter-process communication for every lookup into this dictionary? I would welcome any suggestions for solving this problem!
I'm using Python 3.6.3 from Anaconda on Ubuntu on a machine with 1TB RAM.
Edit: What finally worked:
I was able to get this to work using Redis. To get around the issue in #9, I had to chunk the large key-value insertion and lookup queries into "bite-sized" pieces, so that the work was still processed in batches but didn't time out from too large a query. Doing this allowed the insertion of the 86GB dictionary to be performed in 45 minutes (with 128 threads and some load balancing), and the subsequent processing was not hampered in performance by the Redis lookup queries (finished in 2 days).
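A rough sketch of what chunked insertion and lookup with redis-py can look like (the chunk size and the big_dict name are placeholders; this is an illustration of the approach, not the code that was actually used):

import redis

r = redis.Redis()
CHUNK = 10000  # "bite-sized"; tune so individual requests stay small

items = list(big_dict.items())            # big_dict: one of the 1024 smaller dicts
for i in range(0, len(items), CHUNK):
    r.mset(dict(items[i:i + CHUNK]))      # insert in batches instead of one giant mset

def lookup_many(keys):
    out = {}
    for i in range(0, len(keys), CHUNK):
        batch = keys[i:i + CHUNK]
        out.update(zip(batch, r.mget(batch)))   # batched lookups as well
    return out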
Thank you all for your help and suggestions.
You should probably use a system that's meant for sharing large amounts of data with many different processes -- like a Database.
Take your giant dataset and create a schema for it and dump it into a database. You could even put it on a separate machine.
Then launch as many processes as you want, across as many hosts as you want, to process the data in parallel. Pretty much any modern database will be more than capable of handling the load.
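As a hedged sketch of that idea using just the stdlib sqlite3 module (any RDBMS works the same way; the table and column names, and the pairs iterable, are made up for illustration):

import sqlite3

# Build the shared store once...
conn = sqlite3.connect('/data/mapping.db')
conn.execute('CREATE TABLE IF NOT EXISTS kv (k TEXT PRIMARY KEY, v TEXT)')
conn.executemany('INSERT OR REPLACE INTO kv VALUES (?, ?)', pairs)
conn.commit()

# ...then each worker can open its own connection and look values up on demand.
def lookup(conn, key):
    row = conn.execute('SELECT v FROM kv WHERE k = ?', (key,)).fetchone()
    return row[0] if row else None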
Instead of using a dictionary, use a data structure that compresses data, but still has fast lookups.
e.g.:
keyvi: https://github.com/cliqz-oss/keyvi
keyvi is an FSA-based key-value data structure optimized for space and lookup speed. Multiple processes reading from keyvi will reuse the memory, because a keyvi structure is memory-mapped and uses shared memory. Since your worker processes don't need to modify the data structure, I think this would be your best bet.
marisa trie: https://github.com/pytries/marisa-trie - a static trie structure for Python, based on the marisa-trie C++ library. Like keyvi, marisa-trie also uses memory mapping; multiple processes using the same trie will use the same memory.
EDIT:
To use keyvi for this task, you can first install it with pip install pykeyvi. Then use it like this:
from pykeyvi import StringDictionaryCompiler, Dictionary
# Create the dictionary
compiler = StringDictionaryCompiler()
compiler.Add('foo', 'bar')
compiler.Add('key', 'value')
compiler.Compile()
compiler.WriteToFile('test.keyvi')
# Use the dictionary
dct = Dictionary('test.keyvi')
dct['foo'].GetValue()
> 'bar'
dct['key'].GetValue()
> 'value'
marisa trie is just a trie, so it wouldn't work as a mapping out of the box, but you can, for example, use a delimiter character to separate keys from values.
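A hedged sketch of the delimiter idea with marisa-trie: store "key<TAB>value" strings in a plain Trie and recover the value with a prefix lookup. The sample data is made up, and it assumes neither keys nor values contain the delimiter:

import marisa_trie

pairs = {u'foo': u'bar', u'key': u'value'}
trie = marisa_trie.Trie([k + u'\t' + v for k, v in pairs.items()])
trie.save('mapping.marisa')          # readers can memory-map this file

def lookup(trie, key):
    hits = trie.keys(key + u'\t')    # all stored strings starting with "key\t"
    return hits[0].split(u'\t', 1)[1] if hits else None

print(lookup(trie, u'foo'))          # -> 'bar'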
If you can successfully load that data into a single process as in approach #1, you can most likely work around the problem of fork doing copies by using gc.freeze, introduced in https://bugs.python.org/issue31558.
You have to use Python 3.7+ and call that function before you fork (or before you map over the process pool).
Since this requires a virtual copy of the whole memory for the CoW to work, you need to make sure your overcommit settings allow you to do that.
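A minimal sketch of that sequence, assuming Python 3.7+; load_dictionary, worker and chunks are placeholders for your own code:

import gc
from multiprocessing import Pool

big_dict = load_dictionary()   # build/load everything before freezing

# Python 3.7+: move all tracked objects to a permanent generation so the
# forked workers don't dirty the shared pages.
gc.freeze()

with Pool(32) as pool:
    results = pool.map(worker, chunks)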
As most people here already mentioned:
Don't use that big a dictionary; dump it into a database instead!
After dumping your data into a database, using indexes will help reduce data retrieval times.
A good indexing explanation for PostgreSQL databases here.
You can optimize your database even further (I give a PostgreSQL example because that is what I mostly use, but the same concepts apply to almost every database).
Assuming you did the above (or if you want to use the dictionary either way...), you can implement a parallel and asynchronous processing routine using Python's asyncio (needs Python version >= 3.4).
The base idea is to create a mapping method to assign (map) an asynchronous task to each item of an iterable and register each task to asyncio's event_loop.
Finally, we will collect all those promises with asyncio.gather and we will wait to receive all the results.
A skeleton code example of this idea:
import asyncio

async def my_processing(value):
    # do stuff with the value...
    return processed_value

def my_async_map(my_coroutine, my_iterable):
    my_loop = asyncio.get_event_loop()
    my_future = asyncio.gather(
        *(my_coroutine(val) for val in my_iterable)
    )
    return my_loop.run_until_complete(my_future)

my_async_map(my_processing, my_ginormous_iterable)
You can use gevent instead of asyncio, but keep in mind that asyncio is part of the standard library.
Gevent implementation:
import gevent
from gevent.pool import Group

def my_processing(value):
    # do stuff with the value...
    return processed_value

def my_async_map(my_coroutine, my_iterable):
    my_group = Group()
    return my_group.map(my_coroutine, my_iterable)

my_async_map(my_processing, my_ginormous_iterable)
The already mentioned keyvi (http://keyvi.org) sounds like the best option to me, because "python shared memory dictionary" describes exactly what it is. I am the author of keyvi, so call me biased, but give me the chance to explain:
Shared memory makes it scalable, especially for Python, where the GIL forces you to use multiprocessing rather than threading. That's why a heap-based in-process solution wouldn't scale. Shared memory can also be bigger than main memory; parts can be swapped in and out.
External, network-based solutions require an extra network hop, which you can avoid by using keyvi; this makes a big performance difference even on the local machine. The question is also whether the external process is single-threaded and therefore introduces a bottleneck again.
I wonder about your dictionary size: 86GB. There is a good chance that keyvi compresses that nicely, but it is hard to say without knowing the data.
As for processing: note that keyvi works nicely in pySpark/Hadoop.
Your use case, BTW, is exactly what keyvi is used for in production, even at a higher scale.
The Redis solution sounds good, at least better than some database solution. For saturating the cores you should use several instances and divide the key space using consistent hashing. But still, I am sure keyvi would scale much better. You should try it if you have to repeat the task and/or need to process more data.
Last but not least, you will find nice material on the website, explaining the above in more detail.
Maybe you should try doing this in a database, or try using Dask to solve the problem and let Dask take care of the low-level multiprocessing, so you can focus on the main question you want to answer with that large data.
Here is a link you may want to look at: Dask
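A hedged sketch of the Dask route, treating the 2TB dataset as a bag of text lines; process_line and the paths are placeholders for your own logic:

import dask.bag as db

lines = db.read_text('/data/input/*.txt', blocksize='128MB')
results = lines.map(process_line)        # dictionary lookup + processing per line
results.to_textfiles('/data/output/part-*.txt')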
Well I do believe that the Redis or a database would be the easiest and quickest fix.
But from what I understood, why not reduce the problem, building on your second approach? That is, first try to load a portion of the billion keys into memory (say 50 million). Then, using multiprocessing, create a pool to work on the 2TB file. If a line's key exists in the loaded table, push the data to a list of processed lines; if it doesn't exist, push it to a list of unmatched lines. Once you complete reading the dataset, pickle your processed list and flush the keys you have stored from memory. Then load the next 50 million keys and repeat the process, this time reading from the list of unmatched lines instead of the original file. Once it is finished completely, read all your pickled objects.
This should handle the speed issue that you were facing. Of course, I have very little knowledge of your data set and do not know if this is even feasible. Of course, you might be left with lines that did not get a proper dictionary key read, but at this point your data size would be significantly reduced.
Don't know if that is of any help.
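A hedged sketch of that staged approach; the helper names (load_key_chunk, extract_key, write_lines), n_chunks and the paths are placeholders, and the chunk size is illustrative:

import pickle
from multiprocessing import Pool

def classify(line):
    return (extract_key(line) in current_keys, line)

input_path = '/data/big_input.txt'
for chunk_id in range(n_chunks):
    current_keys = load_key_chunk(chunk_id)        # e.g. ~50 million keys at a time
    matched, unmatched = [], []
    with Pool(32) as pool:                         # workers inherit current_keys via fork (Linux)
        for hit, line in pool.imap_unordered(classify, open(input_path), chunksize=10000):
            (matched if hit else unmatched).append(line)
    with open('matched_%d.pkl' % chunk_id, 'wb') as f:
        pickle.dump(matched, f)
    input_path = write_lines(unmatched, chunk_id)  # next pass only re-reads the leftovers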
Another solution could be to use some existing database driver which can allocate / retire pages as necessary and deal with the index lookup quickly.
dbm has a nice dictionary interface available, and with automatic caching of pages it may be fast enough for your needs. If nothing is modified, you should be able to effectively cache the whole file at the VFS level.
Just remember to disable locking, open in unsynchronized mode, and open with 'r' only, so nothing impacts caching or concurrent access.
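A minimal sketch of that read path with the gdbm backend (the path is made up); the 'ru' flag opens the file read-only and unlocked so the OS page cache does the heavy lifting:

import dbm.gnu

db = dbm.gnu.open('/data/mapping.gdbm', 'ru')

def lookup(key):
    try:
        return db[key.encode()]      # keys and values are bytes
    except KeyError:
        return None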
Since you're only looking to create a read-only dictionary it is possible that you can get better speed than some off the shelf databases by rolling your own simple version. Perhaps you could try something like:
import os.path
import functools

db_dir = '/path/to/my/dbdir'

def write(key, value):
    path = os.path.join(db_dir, key)
    with open(path, 'w') as f:
        f.write(value)

@functools.lru_cache(maxsize=None)
def read(key):
    path = os.path.join(db_dir, key)
    with open(path) as f:
        return f.read()
This will create a folder full of text files. The name of each file is the dictionary key and the contents are the value. Timing this myself I get about 300us per write (using a local SSD). Using those numbers theoretically the time taken to write your 1.75 billion keys would be about a week but this is easily parallelisable so you might be able to get it done a lot faster.
For reading I get about 150us per read with warm cache and 5ms cold cache (I mean the OS file cache here). If your access pattern is repetitive you could memoize your read function in process with lru_cache as above.
You may find that storing this many files in one directory is not possible with your filesystem or that it is inefficient for the OS. In that case you can do like the .git/objects folder: Store the key abcd in a file called ab/cd (i.e. in a file cd in folder ab).
The above would take something like 15TB on disk based on a 4KB block size. You could make it more efficient on disk and for OS caching by trying to group together keys by the first n letters so that each file is closer to the 4KB block size. The way this would work is that you have a file called abc which stores key value pairs for all keys that begin with abc. You could create this more efficiently if you first output each of your smaller dictionaries into a sorted key/value file and then mergesort as you write them into the database so that you write each file one at a time (rather than repeatedly opening and appending).
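A small sketch of the .git/objects-style sharding mentioned above, where key 'abcd...' lives in '<db_dir>/ab/cd...':

import os

def shard_path(db_dir, key):
    return os.path.join(db_dir, key[:2], key[2:])

def write(db_dir, key, value):
    path = shard_path(db_dir, key)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, 'w') as f:
        f.write(value)

def read(db_dir, key):
    with open(shard_path(db_dir, key)) as f:
        return f.read()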
While the majority suggestion here of "use a database" is wise and proven, it sounds like you may want to avoid using a database for some reason (and you are finding the load into the db to be prohibitive), so essentially it seems you are IO-bound and/or processor-bound.
You mention that you are loading the 86GB index from 1024 smaller indexes. If your keys are reasonably regular and evenly distributed, is it possible for you to go back to your 1024 smaller indexes and partition your dictionary? In other words, if, for example, your keys are all 20 characters long and comprised of the letters a-z, create 26 smaller dictionaries: one for all keys beginning with 'a', one for keys beginning with 'b', and so on. You could extend this concept to a larger number of smaller dictionaries dedicated to the first 2 characters or more: for example, you could load one dictionary for the keys beginning 'aa', one for keys beginning 'ab', and so on, giving you 676 individual dictionaries. The same logic applies for a partition over the first 3 characters, using 17,576 smaller dictionaries.
Essentially, I guess what I'm saying here is "don't load your 86GB dictionary in the first place". Instead, use a strategy that naturally distributes your data and/or load.
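A hedged sketch of how the partitioned lookup could be driven, processing one prefix-partition at a time so only one sub-dictionary is ever in memory; partition_lines, extract_key and process are placeholders for your own logic, and the shard filenames are made up:

import pickle

def process_partition(prefix, lines):
    with open('shard_%s.pkl' % prefix, 'rb') as f:
        shard = pickle.load(f)                    # only the keys starting with `prefix`
    for line in lines:
        process(line, shard.get(extract_key(line)))

for prefix, lines in partition_lines('/data/big_input'):   # e.g. 676 two-letter partitions
    process_partition(prefix, lines)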
I need some help. I've been working on a file searching app as I learn Python, and it's been a very interesting experience so far, learned a lot and realized how little that actually is.
So, it's my first app, and it needs to be fast! I am unsatisfied with (among other things) the speed of finding matches for sparse searches.
The app caches file and folder names as dbm keys, and the search is basically running search words past these keys.
The GUI is in Tkinter, and to try not to get it jammed, I've put my search loop in a thread. The thread receives queries from the GUI via a queue, then passes results back via another queue.
That's how the code looks:
def TMakeSearch(fdict, squeue=None, rqueue=None):
    '''Circumventing StopIteration(), did not see speed advantage'''
    RESULTS_PER_BATCH = 50

    if whichdb(DB) == 'dbhash' or 'dumb' in whichdb(DB):
        '''iteration is not implemented for gdbm and (n)dbm, forced to
        pop the keys out in advance for "for key in fdict:" '''
        fdict = fdict
    else:
        # 'dbm.gnu', 'gdbm', 'dbm.ndbm', 'dbm'
        fdict = fdict.keys()

    search_list = None
    while True:
        query = None
        while not squeue.empty():
            # more items may get in (or not?) while the condition is checked
            query = squeue.get()
        try:
            search_list = query.lower().encode(ENCODING).split()
            if Tests.is_query_passed:
                print(search_list)
        except:
            # No new query, or a new database has been created and needs to be synced
            sleep(0.1)
            continue
        else:
            is_new_query = True
            result_batch = []
            for key in fdict:
                separator = '*'.encode(ENCODING)  # Python 3, yaaay
                filename = key.split(separator)[0].lower()
                # Add key if matching
                for token in search_list:
                    if token not in filename:
                        break
                else:
                    # Loop hasn't ended abruptly
                    result_batch.append(key)
                if len(result_batch) >= RESULTS_PER_BATCH:
                    # Time to send off a batch
                    rqueue.put((result_batch, is_new_query))
                    if Tests.is_result_batch:
                        print(result_batch, len(result_batch))
                        print('is_result_batch: results on queue')
                    result_batch = []
                    is_new_query = False
                    sleep(0.1)
                if not squeue.empty():
                    break
            # Loop ended naturally, with some batch < 50
            rqueue.put((result_batch, is_new_query))
When there are only a few matching results, they cease to be real-time and instead take a few seconds to appear, and that's on my smallish 120GB hard disk.
I believe it can be faster, and wish to make the search real-time.
What approaches exist to make the search faster?
My current ideas all involve ramping up the machinery that I use - use multiprocessing somehow, use Cython, perhaps somehow use ctypes to make the searches circumvent the Python runtime.
However, I suspect there are simpler things that can be done to make it work, as I am not savvy with Python and optimization.
Assistance please!
I wish to stay within the standard library if possible, as a proof of concept and for portability (currently I only use scandir as an external library, on Python < 3.5), so for example ctypes would be preferable to Cython.
If it's relevant/helpful, the rest of the code is here -
https://github.com/h5rdly/Jiffy
EDIT:
This is the heart of the function, give or take a few pre-arrangements:
for key in fdict:
    for token in search_list:
        if token not in key:
            break
    else:
        result_batch.append(key)
where search_list is a list of strings, and fdict is a dictionary or a dbm (didn't see a speed difference trying both).
This is what I wish to make faster, so that results arrive in real-time, even when there are only few keys containing my search words.
EDIT 2:
On @hpaulj's advice, I've put the dbm keys in a (frozen) set, gaining a noticeable improvement on Windows/Python 2.7 (dbhash).
I have some caveats though -
For my ~50GB in use, the frozenset takes 28MB, as measured by pympler.asizeof. So for the full 1TB, I suspect it'll take a nice share of RAM.
On Linux, for some reason, the conversion not only doesn't help, but the query itself stops getting updated in real time for the duration of the search, making the GUI look unresponsive.
On Windows, this is almost as fast as I want, but still not warp-immediate.
So this comes around to this addition:
if 'win' in sys.platform:
    try:
        fdict = frozenset(fdict)
    except:
        fdict = frozenset(fdict.keys())
Since it would take a significant amount of RAM for larger disks, I think I'll add it as an optional faster search for now, "Scorch Mode".
I wonder what to do next. I thought that perhaps, if I could somehow export the keys/filenames to a datatype that ctypes can pass along, I could then call a relevant C function to do the searches.
Also, perhaps learn the Python bytecode and do some lower-level optimization.
I'd like this to be as fast as Python would let me, please advise.
I'm designing a (hopefully) simple GUI application using PyQt4 that I'm trying to make scalable. In brief, the user inputs some basic information and sends it into one of n queues (implementing waiting lists). Each of these n queues (QTableViews) is identical, and each has controls to pop from, delete from and rearrange its queue. These, along with some labels etc., form a 'module'. Currently my application is hardcoded to 4 queue modules, so there are elements named btn_table1_pop, btn_table2_pop... etc.; four copies of every single module widget. This is obviously not very good UI design if you always assume your clients have four people that need waiting lists! I'd like to be able to easily modify this program so 8 people could use it, or 3 people could use it without a chunk of useless screen estate!
The really naive solution to programming my application is duplicating the code for each module, but this is really messy, unmaintainable, and bounds my application to always have four queues. A better thought would be to write functions for each button that set an index and call a function that implements the common logic, but I'm still hardcoded to 4, because the branch logic and the calling functions still have to take into account the names of the elements. If there were a way to 'vectorize' the names of the elements so that I could, for example, write
btn_table[index]_pop.setEnabled(False)
...I could eliminate this branch logic and really condense my code. But I'm way too new at Python/PyQt to know 1) whether this is even possible, or 2) how to go about it, or if this is even the way to go.
Thanks again, SO.
In case anyone is interested, I was able to get it working with dummybutton = getattr(self, 'btn_table{}'.format(i)) and calling the button's methods on dummybutton.
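For completeness, a hypothetical sketch of that pattern inside the window class's setup method; the widget naming scheme, num_queues and pop_from_queue are assumptions about how the modules are laid out:

import functools

for i in range(1, num_queues + 1):
    pop_button = getattr(self, 'btn_table{}_pop'.format(i))
    pop_button.setEnabled(False)
    # functools.partial lets one slot serve every queue, keyed by its index
    pop_button.clicked.connect(functools.partial(self.pop_from_queue, i))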
In my Python/Pyramid app, I let users generate HTML pages which are stored in an Amazon S3 bucket. I want each page to have a separate path like www.domain.com/2cxj4kl. I have figured out how to generate the random string to put in the URL, but I am more concerned with duplicates. How can I check each of these strings against a list of existing strings so that nothing is overwritten? Can I just put each string in a dictionary or array and check the ever-growing array/dict each time a new one is created? Are there issues with continuing to grow such an object, and will it permanently exist in the app's memory somehow? How can I do this?
The approach of storing a list of existing identifiers in some storage and comparing new identifiers with the list would work in a simple case, however, this may become tricky if you have to store, say, billions of identifiers, or if you want to generate them on more than one machine. This also complicates things with storing the list, retrieving, comparing etc. Not to mention locking - what if two users decide to create a page at exactly the same second?
Universally Unique Identifiers (UUIDs) have a very, very low chance of collision - much lower than, say, the chance of our planet being swallowed by a black hole in the next five minutes. So low that you can ignore it for any practical purposes.
Python has a library called uuid to generate UUIDs:
>>> import uuid
>>> # make a random UUID
>>> u = uuid.uuid4()
>>> u.hex
'f3db6f9a34ed48938a45113ac4b5f156'
The resulting string is 32 characters long, which may be too long for you.
Alternatively, you may just generate a random string like this:
import random
import string

''.join(random.choice(string.ascii_letters + string.digits) for x in range(12))
At 10-15 characters long it will probably be less random than a UUID, but the chance of a collision would still be much lower than, say, the chance of a janitor at an Amazon data center going mental, destroying your server with an axe and setting the data center on fire :)
I am new to Python and programming, but here are a few issues I can see with the 'random string' idea:
You will most probably end up generating the same string over and over if you are using shorter strings. On the other hand, if you are using longer strings, the chances of getting the same string are lower. However, you will want to watch out for duplicates in either case. Therefore my suggestion is to estimate how many URLs you will need and use an optimal string length for that.
Easiest way is to keep these urls in a list, and use a simple if check before registering new ones:
if new_url in url_list:
    generate_new_url()
else:
    url_list.append(new_url)
However, it also sounds like you will want to employ a database to permanently store your URLs. In most SQL-based databases you can set up your URL column to be 'unique', so the database stops you from having duplicate URLs.
I am not sure, but with the database you can probably do something like this:
try:
    # insert value into database
    insert_url(new_url)    # placeholder for your actual insert call
except:
    generate_new_url()
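A concrete sketch of that idea with the stdlib sqlite3 module (the table name is made up; any database with a unique constraint works the same way):

import sqlite3

conn = sqlite3.connect('urls.db')
conn.execute('CREATE TABLE IF NOT EXISTS pages (slug TEXT UNIQUE)')

def register(conn, slug):
    try:
        with conn:
            conn.execute('INSERT INTO pages (slug) VALUES (?)', (slug,))
        return True
    except sqlite3.IntegrityError:   # duplicate slug; caller should generate a new one
        return False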