Multiple Threads Accessing Same Nested Dictionary

Multiple Threads Accessing Same Nested Dictionary - python

(warning incoming novice). I am about to run a multi threaded python script that will go through 100s of thousands of files, and will update a dictionary. More specifically the threads will be appending to a list that is stored in a nested dictionary.
For example Dict['X1']['X1PROPERTIES'] where 'X1PROPERTIES' is a list.
Now depending on which file a thread is currently reading, there may be cases where multiple threads are appending to the same X#'s PROPERTIES. After reading about atomic properties and locks, what I was wondering is if I will require the use of locks. To my understanding if it is simply appending to a list in place I will not need to use locks, but I am a bit unclear on which cases locks are necessary.
Any Insight will be much appreciated.
To add a little bit more context here, the actual processing being done on the data extracted from each file is very minimal and the result is either an append in place to a list inside a global dictionary or nothing. It is really just the sheer volume of directories and files I need to navigate through. The time spent reading in files is brutal. Are there more effective solutions I can use to speed up the IO component? I am wondering if I may be misusing threading in this context.

Related

A good way to manage human readable and programmable config files in Python across multiple processes and source files

As the title says, I am looking for a way to manage config files in python with options that should be accessible across multiple source files and multiple processes, with a (hopefully) human readable format, where I can also set some options programmatically.
I have tried/looked at the following:
Using a global.py: file and importing it everywhere. However, when I change any of the values, the change is not reflected in the other processes. I suppose this cannot be done as per this answer: Globals variables and Python multiprocessing. Is there a way to make this work?
configparser: This seems to be a really good option, though I don't understand the following: When I change some of the options after reading a file, how do I make sure this change is reflected in the other files/processes?
configargparser: I like this as well, since it is extremely customizable. However, I again have the same question as above.
yaml: This seems to be the most human readable option. In that, I can give lists and dicts as arguments without much of a problem. I still don't know how to use it across files and processes though. I guess one way would be to read everything in a main file, dump it somewhere before spawning other processes and read that dumped file in other processes. Is there a better way?

Python Shared Memory Dictionary for Mapping Big Data

I've been having a hard time using a large dictionary (~86GB, 1.75 billion keys) to process a big dataset (2TB) using multiprocessing in Python.
Context: a dictionary mapping strings to strings is loaded from pickled files into memory. Once loaded, worker processes (ideally >32) are created that must lookup values in the dictionary but not modify it's contents, in order to process the ~2TB dataset. The data set needs to be processed in parallel otherwise the task would take over a month.
Here are the two three four five six seven eight nine approaches (all failing) that I have tried:
Store the dictionary as a global variable in the Python program and then fork the ~32 worker processes. Theoretically this method might work since the dictionary is not being modified and therefore the COW mechanism of fork on Linux would mean that the data structure would be shared and not copied among processes. However, when I attempt this, my program crashes on os.fork() inside of multiprocessing.Pool.map from OSError: [Errno 12] Cannot allocate memory. I'm convinced that this is because the kernel is configured to never overcommit memory (/proc/sys/vm/overcommit_memory is set to 2, and I can't configure this setting on the machine since I don't have root access).
Load the dictionary into a shared-memory dictionary with multiprocessing.Manager.dict. With this approach I was able to fork the 32 worker process without crashing but the subsequent data processing is orders of magnitude slower than another version of the task that required no dictionary (only difference is no dictionary lookup). I theorize that this is because of the inter-process communication between the manager process containing the dictionary and each worker process, that is required for every single dictionary lookup. Although the dictionary is not being modified, it is being accessed many many times, often simultaneously by many processes.
Copy the dictionary into a C++ std::map and rely on Linux's COW mechanism to prevent it from being copied (like approach #1 except with the dictionary in C++). With this approach, it took a long time to load the dictionary into std::map and subsequently crashed from ENOMEM on os.fork() just as before.
Copy the dictionary into pyshmht. It takes far too long to copy the dictionary into pyshmht.
Try using SNAP's HashTable. The underlying implementation in C++ allows for it to be made and used in shared memory. Unfortunately the Python API does not offer this functionality.
Use PyPy. Crash still happened as in #1.
Implement my own shared-memory hash table in python on top of multiprocessing.Array. This approach still resulted in the out of memory error that ocured in #1.
Dump the dictionary into dbm. After trying to dump the dictionary into a dbm database for four days and seeing an ETA of "33 days", I gave up on this approach.
Dump the dictionary into Redis. When I try to dump the dictionaries (the 86GB dict is loaded from 1024 smaller dicts) into Redis using redis.mset I get a connection reset by peer error. When I try to dump the key-value pairs using a loop, it takes an extremely long time.
How can I process this dataset in parallel efficiently without requiring inter-process communication in order to lookup values in this dictionary. I would welcome any suggestions for solving this problem!
I'm using Python 3.6.3 from Anaconda on Ubuntu on a machine with 1TB RAM.
Edit: What finally worked:
I was able to get this to work using Redis. To get around the issued in #9, I had to chunk the large key-value insertion and lookup queries into "bite-sized" chunks so that it was still processing in batches, but didn't time-out from too large a query. Doing this allowed the insertion of the 86GB dictionary to be performed in 45 minutes (with 128 threads and some load balancing), and the subsequent processing was not hampered in performance by the Redis lookup queries (finished in 2 days).
Thank you all for your help and suggestions.

You should probably use a system that's meant for sharing large amounts of data with many different processes -- like a Database.
Take your giant dataset and create a schema for it and dump it into a database. You could even put it on a separate machine.
Then launch as many processes as you want, across as many hosts as you want, to process the data in parallel. Pretty much any modern database will be more than capable of handling the load.

Instead of using a dictionary, use a data structure that compresses data, but still has fast lookups.
e.g:
keyvi: https://github.com/cliqz-oss/keyvi
keyvi is a FSA-based key-value data structure optimized for space & lookup speed. multiple processes reading from keyvi will re-use the memory, because a keyvi structure is memory mapped and it uses shared memory. Since your worker processes don't need to modify the data structure, I think this would be your best bet.
marisa trie: https://github.com/pytries/marisa-trie static trie structure for Python, based on the marisa-trie C++ library. Like keyvi, marisa-trie also uses memory-mapping. Multiple processes using the same trie will use the same memory.
EDIT:
To use keyvi for this task, you can first install it with pip install pykeyvi. Then use it like this:
from pykeyvi import StringDictionaryCompiler, Dictionary
# Create the dictionary
compiler = StringDictionaryCompiler()
compiler.Add('foo', 'bar')
compiler.Add('key', 'value')
compiler.Compile()
compiler.WriteToFile('test.keyvi')
# Use the dictionary
dct = Dictionary('test.keyvi')
dct['foo'].GetValue()
> 'bar'
dct['key'].GetValue()
> 'value'
marisa trie is just a trie, so it wouldn't work as a mapping out of the box, but you can for example us a delimiter char to separate keys from values.

If you can successfully load that data into a single process in point 1, you can most likely work around the problem of fork doing copies by using gc.freeze introduced in https://bugs.python.org/issue31558
You have to use python 3.7+ and call that function before you fork. (or before you do the map over process pool)
Since this requires a virtual copy of the whole memory for the CoW to work, you need to make sure your overcommit settings allow you to do that.

As most people here already mentioned:
Don't use that big a dictionary, Dump it on a Database instead!!!
After dumping your data into a database, using indexes will help reduce data retrieval times.
A good indexing explanation for PostgreSQL databases here.
You can optimize your database even further (I give a PostgreSQL example because that is what I mostly use, but those concepts apply to almost every database)
Assuming you did the above (or if you want to use the dictionary either way...), you can implement a parallel and asynchronous processing routine using Python's asyncio (needs Python version >= 3.4).
The base idea is to create a mapping method to assign (map) an asynchronous task to each item of an iterable and register each task to asyncio's event_loop.
Finally, we will collect all those promises with asyncio.gather and we will wait to receive all the results.
A skeleton code example of this idea:
import asyncio
async def my_processing(value):
do stuff with the value...
return processed_value
def my_async_map(my_coroutine, my_iterable):
my_loop = asyncio.get_event_loop()
my_future = asyncio.gather(
*(my_coroutine(val) for val in my_iterable)
)
return my_loop.run_until_complete(my_future)
my_async_map(my_processing, my_ginormous_iterable)
You can use gevent instead of asyncio, but keep in mind that asyncio is part of the standard library.
Gevent implementation:
import gevent
from gevent.pool import Group
def my_processing(value):
do stuff with the value...
return processed_value
def my_async_map(my_coroutine, my_iterable):
my_group = Group()
return my_group.map(my_coroutine, my_iterable)
my_async_map(my_processing, my_ginormous_iterable)

The already mentioned keyvi (http://keyvi.org) sounds like the best option to me, because "python shared memory dictionary" describes exactly what it is. I am the author of keyvi, call me biased, but give me the chance to explain:
Shared memory make it scalable, especially for python where the GIL-problematic forces you to use multiprocessing rather than threading. That's why a heap-based in-process solution wouldn't scale. Also shared memory can be bigger than main memory, parts can be swapped in and out.
External process network based solutions require an extra network hop, which you can avoid by using keyvi, this makes a big performance difference even on the local machine. The question is also whether the external process is single-threaded and therefore introduces a bottleneck again.
I wonder about your dictionary size: 86GB: there is a good chance that keyvi compresses that nicely, but hard to say without knowing the data.
As for processing: Note that keyvi works nicely in pySpark/Hadoop.
Your usecase BTW is exactly what keyvi is used for in production, even on a higher scale.
The redis solution sounds good, at least better than some database solution. For saturating the cores you should use several instances and divide the key space using consistent hashing. But still, using keyvi, I am sure, would scale way better. You should try it, if you have to repeat the task and/or need to process more data.
Last but not least, you find nice material on the website, explaining the above in more detail.

Maybe you should try do it in database, and maybe try to use Dask to solve your problem,let Dask to care about how to multiprocessing in the low level. You can focus on the main question you want to solve using that large data.
And this the link you may want to look Dask

Well I do believe that the Redis or a database would be the easiest and quickest fix.
But from what I understood, why not reduce the problem from your second solution? That is, first try to load a portion of the billion keys into memory (say 50 Million). Then using Multi-processing, create a pool to work on the 2 TB file. If the lookup of the line exists in the table, push the data to a list of processed lines. If it doesn't exist, push it to a list. Once you complete reading the data set, pickle your list and flush the keys you have stored from memory. Then load the next million and repeat the process instead reading from your list. Once it is finished completely, read all your pickle objects.
This should handle the speed issue that you were facing. Of course, I have very little knowledge of your data set and do not know if this is even feasible. Of course, you might be left with lines that did not get a proper dictionary key read, but at this point your data size would be significantly reduced.
Don't know if that is of any help.

Another solution could be to use some existing database driver which can allocate / retire pages as necessary and deal with the index lookup quickly.
dbm has a nice dictionary interface available and with automatic caching of pages may be fast enough for your needs. If nothing is modified, you should be able to effectively cache the whole file at VFS level.
Just remember to disable locking, open in not synch-ed mode, and open for 'r' only so nothing impacts caching/concurrent access.

Since you're only looking to create a read-only dictionary it is possible that you can get better speed than some off the shelf databases by rolling your own simple version. Perhaps you could try something like:
import os.path
import functools
db_dir = '/path/to/my/dbdir'
def write(key, value):
path = os.path.join(db_dir, key)
with open(path, 'w') as f:
f.write(value)
#functools.lru_cache(maxsize=None)
def read(key):
path = os.path.join(db_dir, key)
with open(path) as f:
return f.read()
This will create a folder full of text files. The name of each file is the dictionary key and the contents are the value. Timing this myself I get about 300us per write (using a local SSD). Using those numbers theoretically the time taken to write your 1.75 billion keys would be about a week but this is easily parallelisable so you might be able to get it done a lot faster.
For reading I get about 150us per read with warm cache and 5ms cold cache (I mean the OS file cache here). If your access pattern is repetitive you could memoize your read function in process with lru_cache as above.
You may find that storing this many files in one directory is not possible with your filesystem or that it is inefficient for the OS. In that case you can do like the .git/objects folder: Store the key abcd in a file called ab/cd (i.e. in a file cd in folder ab).
The above would take something like 15TB on disk based on a 4KB block size. You could make it more efficient on disk and for OS caching by trying to group together keys by the first n letters so that each file is closer to the 4KB block size. The way this would work is that you have a file called abc which stores key value pairs for all keys that begin with abc. You could create this more efficiently if you first output each of your smaller dictionaries into a sorted key/value file and then mergesort as you write them into the database so that you write each file one at a time (rather than repeatedly opening and appending).

While the majority suggestion of "use a database" here is wise and proven, it sounds like you may want to avoid using a database for some reason (and you are finding the load into the db to be prohibitive), so essentially it seems you are IO-bound, and/or processor-bound. You mention that you are loading the 86GB index from 1024 smaller indexes. If your key is reasonably regular, and evenly-distributed, is it possible for you to go back to your 1024 smaller indexes and partition your dictionary? In other words, if, for example, your keys are all 20 characters long, and comprised of the letters a-z, create 26 smaller dictionaries, one for all keys beginning with 'a', one for keys beginning 'b' and so on. You could extend this concept to a large number of smaller dictionaries dedicated to the first 2 characters or more. So, for example, you could load one dictionary for the keys beginning 'aa', one for keys beginning 'ab' and so on, so you would have 676 individual dictionaries. The same logic would apply for a partition over the first 3 characters, using 17,576 smaller dictionaries. Essentially I guess what I'm saying here is "don't load your 86GB dictionary in the first place". Instead use a strategy that naturally distributes your data and/or load.

How to keep a very large dictionary loaded in memory in python?

I have a very large dictionary of size ~ 200 GB which I need to query very often for my algorithm. To get quick results, I want to put it in memory which is possible, because fortunately I have a 500GB RAM.
However, my main issue is that I want to load it only once in memory and then let other processes query the same dictionary, rather than having to load it again everytime I create a new process or iterate over my code.
So, I would like something like this:
Script 1:
# Load dictionary in memory
def load(data_dir):
dictionary = load_from_dir(data_dir) ...
Script 2:
# Connect to loaded dictionary (already put in memory by script 1)
def use_dictionary(my_query):
query_loaded_dictionary(my_query)
What's the best way to achieve this ? I have considered a rest API, but I wonder if going over a REST request will erode all the speed I gained by putting the dictionary in memory in the first place.
Any suggestions ?

Either run a separate service that you access with a REST API like you mentioned, or use an in-memory database.
I had a very good experience with Redis personally, but there are many others (Memcached is also popular). Redis was easy to use with Python and Django.
In both solutions there can be data serialization though, so some performance will be dropped. There is a way to fill Redis with simple structures such as lists, but I haven't tried. I packed my numeric arrays and serialized them (with numpy), it was fast enough in the end. If you use simple string key-value pairs anyway, then the performance will be optimal, and maybe better with memcached.

Common or centralised dictionary in python

I am looking for a way to use a common or centralised dictionary in python. I have several scripts that work on the same data sets, but do different things. I have dictionarys in these scripts (e.g. instrument name) that are used in more than one script. If I make changes or add values I have to make shure I do this in all scripts, which is of course painfull and prone for errors.
Is there a possibility, that I have one dictionary and use it in all the different scripts?
Any help is appreciated - thanks.

I didn't exactly understand if you're talking about several scripts running in the same process, or several processes that need to share the same data:
Several processes:
I think the simplest way of doing this, sharing data between several scripts (different processes), is to use a simple file database.
SQLite3 comes bundled with Python, so that would be a good choice.
Single process:
If you simply want to centralize the access of several methods running at the same time (within the same process), then you can declare your dictionary in one script, and have all the others access that one location, without duplicating the structure.
You basically declare this dict in a script, and import it in all the others.

For such situations, I have a general_values.py file in my project root.
Since lists and dictionaries are all use reference to store values, using the same dict or list in different scripts do not cause problems since they all will use the same single data stored in memory.
Likwise:
main_values.py
my_dict = {1:1}
some.py
import main_values
main_values.my_dict[2] = 2
other.py
import main_values
print main_values.my_dict
>> {1:1, 2:2}

Some options:
You could use a shared shelved file. See http://docs.python.org/2/library/shelve.html
You could store configuration information in a JSON/YAML or similar file
It may be overkill for your usecase, but something like http://redis.io (or a lightweight key store) database will enable you to use a dict-like interface to persist keys across processes and be available to other languages.

Are there memory efficiencies gained when code is wrapped in functions?

I have been working on some code. My usual approach is to first solve all of the pieces of the problem, creating the loops and other pieces of code I need as I work through the problem and then if I expect to reuse the code I go back through it and group the parts of code together that I think should be grouped to create functions.
I have just noticed that creating functions and calling them seems to be much more efficient than writing lines of code and deleting containers as I am finished with them.
for example:
def someFunction(aList):
do things to aList
that create a dictionary
return aDict
seems to release more memory at the end than
>>do things to alist
>>that create a dictionary
>>del(aList)
Is this expected behavior?
EDIT added example code
When this function finishes running the PF Usage shows an increase of about 100 mb the filingsList has about 8 million lines.
def getAllCIKS(filingList):
cikDICT=defaultdict(int)
for filing in filingList:
if filing.startswith('.'):
del(filing)
continue
cik=filing.split('^')[0].strip()
cikDICT[cik]+=1
del(filing)
ciklist=cikDICT.keys()
ciklist.sort()
return ciklist
allCIKS=getAllCIKS(open(r'c:\filinglist.txt').readlines())
If I run this instead I show an increase of almost 400 mb
cikDICT=defaultdict(int)
for filing in open(r'c:\filinglist.txt').readlines():
if filing.startswith('.'):
del(filing)
continue
cik=filing.split('^')[0].strip()
cikDICT[cik]+=1
del(filing)
ciklist=cikDICT.keys()
ciklist.sort()
del(cikDICT)
EDIT
I have been playing around with this some more today. My observation and question should be refined a bit since my focus has been on the PF Usage. Unfortunately I can only poke at this between my other tasks. However I am starting to wonder about references versus copies. If I create a dictionary from a list does the dictionary container hold a copy of the values that came from the list or do they hold references to the values in the list? My bet is that the values are copied instead of referenced.
Another thing I noticed is that items in the GC list were items from containers that were deleted. Does that make sense? Soo I have a list and suppose each of the items in the list was [(aTuple),anInteger,[another list]]. When I started learning about how to manipulate the gc objects and inspect them I found those objects in the gc even though the list had been forcefully deleted and even though I passed the 0,1 & 2 value to the method that I don't remember to try to still delete them.
I appreciate the insights people have been sharing. Unfortunately I am always interested in figuring out how things work under the hood.

Maybe you used some local variables in your function, which are implicitly released by reference counting at the end of the function, while they are not released at the end of your code segment?

You can use the Python garbage collector interface provided to more closely examine what (if anything) is being left around in the second case. Specifically, you may want to check out gc.get_objects() to see what is left uncollected, or gc.garbage to see if you have any reference cycles.

Some extra memory is freed when you return from a function, but that's exactly as much extra memory as was allocated to call the function in the first place. In any case - if you seeing a large amount of difference, that's likely an artifact of the state of the runtime, and is not something you should really be worrying about. If you are running low on memory, the way to solve the problem is to keep more data on disk using things like b-trees (or just use a database), or use algorithms that use less memory. Also, keep an eye out for making unnecessary copies of large data structures.
The real memory savings in creating functions is in your short-term memory. By moving something into a function, you reduce the amount of detail you need to remember by encapsulating part of the minutia away.

Maybe you should re-engineer your code to get rid of unnecessary variables (that may not be freed instantly)... how about the following snippet?
myfile = file(r"c:\fillinglist.txt")
ciklist = sorted(set(x.split("^")[0].strip() for x in myfile if not x.startswith(".")))
EDIT: I don't know why this answer was voted negative... Maybe because it's short? Or maybe because the dude who voted was unable to understand how this one-liner does the same that the code in the question without creating unnecessary temporal containers?
Sigh...

I asked another question about copying lists and the answers, particularly the answer directing me to look at deepcopy caused me to think about some dictionary behavior. The problem I was experiencing had to do with the fact that the original list is never garbage collected because the dictionary maintains references to the list. I need to use the information about weakref in the Python Docs.
objects referenced by dictionaries seem to stay alive. I think (but am not sure) the process of pushing the dictionary out of the function forces the copy process and kills the object. This is not complete I need to do some more research.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.