I have a Python application that performs correlations on large files and stores the results in a dict. Depending on the input files, this dict can become really large, to the point where it no longer fits into memory. This causes the system to hang, which I want to prevent.
My idea is that there are always correlations which are not so relevant for the later processing. These could be deleted without changing the overall result too much, and I would like to do this whenever I am running low on memory.
Hence, I check the available memory periodically. If it drops too low (say, below 300 MB), I delete the irrelevant correlations to gain more space. That's the theory.
Now for my problem: In Python, you cannot delete from a dict while iterating over it. But this is exactly what I need to do, since I have to check each dict entry for relevancy before deleting.
The usual solution would be to create a copy of the dict for iteration, or to create a new dict containing only the elements I want to preserve. However, the dict might be several GB in size while there are only a few hundred MB of free memory left, so I cannot do much copying, since that may again cause the system to hang.
Here I am stuck. Can anyone think of a better method to achieve what I need? If in-place deletion of dict entries is absolutely not possible while iterating, maybe there is some workaround that could save me?
Thanks in advance!
EDIT -- some more information about the dict:
The keys are tuples specifying the values by which the data is correlated.
The values are dicts containing the correlated data. The keys of these inner dicts are always strings, and the values are numbers (int or float).
I am checking for relevancy by comparing the number values in the value-dicts with certain thresholds. If the values are below the thresholds, the particular correlation can be dropped.
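For illustration, the relevancy check described above might be something like the following sketch (the threshold names and values here are hypothetical, just to make things concrete):
# Hypothetical thresholds, keyed by the string keys of the inner value-dicts.
THRESHOLDS = {"score": 0.5, "count": 10}

def is_relevant(values):
    # Keep the correlation if any of its numbers reaches its threshold;
    # if all values are below the thresholds, it can be dropped.
    return any(values.get(name, 0) >= limit for name, limit in THRESHOLDS.items())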
I do not think that your solution to the problem is prudent.
If you have that much data, I recommend you find some bigger tools in your toolbox; one suggestion would be to let a local Redis server help you out.
Take a look at redis-collections, which will provide you with a dictionary-like object backed by Redis, giving you a sustainable solution.
>>> from redis_collections import Dict
>>> d = Dict()
>>> d['answer'] = 42
>>> d
<redis_collections.Dict at fe267c1dde5d4f648e7bac836a0168fe {'answer': 42}>
>>> d.items()
[('answer', 42)]
Best of luck!
Are the keys large? If not, you can loop over the dict to determine which entries should be deleted; store the key for each such entry in a list. Then loop over those keys and delete them from the dict.
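A minimal sketch of that two-pass approach, assuming the big dict is called correlations and reusing the hypothetical is_relevant() check sketched in the question:
def prune_irrelevant(correlations):
    # Pass 1: collect the keys of entries that fail the relevancy check.
    # The keys are small tuples, so this list stays tiny compared to the
    # multi-GB dict itself.
    to_delete = [key for key, values in correlations.items()
                 if not is_relevant(values)]
    # Pass 2: delete them; we are no longer iterating over the dict here.
    for key in to_delete:
        del correlations[key]
The extra memory is proportional to the number of keys being dropped, not to the size of their values, so this stays cheap even when the dict itself is huge.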
Related
The reason why I am asking this question is that I am working with huge amounts of data.
In my algorithm, I basically need something like this:
users_per_document = {}
documents_per_user = {}
As you understand it from the names of the dictionaries, I need users that clicked a specific document and documents that are clicked by a specific user.
In that case I have "duplicated" data, and both of them together overflow the memory and my script gets killed after a while. Because I use very large data sets, I have to do this in an efficient way.
I think it is not possible, but I have to ask: is there a way to get all keys for a specific value from a dictionary?
Because if there is a way to do that, I will not need one of the dictionaries anymore.
For example:
users_per_document["document1"] obviously returns the appropriate users;
what I need is users_per_document.getKeys("user1"), because that would basically return the same thing as documents_per_user["user1"].
If it is not possible, any suggestion is appreciated.
If you are using Python 3.x, you can do the following. If 2.x, just use .iteritems() instead.
user1_documents = [key for key, users in users_per_document.items() if "user1" in users]
Note: This does iterate over the whole dictionary. A dictionary isn't really an ideal data structure to get all keys for a specific value, as it will be O(n^2) if you have to perform this operation n times.
I am not very sure about Python specifically, but in general computer science terms you can solve the problem in the following way:
Basically, you can have a two-dimensional boolean array: the first index is for users, the second index is for documents, and the stored value is a boolean.
The boolean value represents whether there is a relation between the specific user and the specific document.
PS: if you have a really sparse matrix, you can make it much more efficient, but that is another story.
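In Python terms, one way to sketch the sparse variant hinted at in the PS (my interpretation, not code from the answer) is a single set of (user, document) pairs, which can answer both questions without duplicating the data:
# One relation instead of two mirrored dictionaries: a pair is present
# exactly when that user clicked that document (a sparse boolean matrix).
clicks = {("user1", "document1"), ("user2", "document1")}

def users_for(document):
    return [u for u, d in clicks if d == document]

def documents_for(user):
    return [d for u, d in clicks if u == user]
Both helpers scan the whole relation, so this trades the duplicated memory for linear-time lookups.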
I want to create the following structure for a dictionary:
{ id1: {id2: {id3: [] }}}
That would be a triple dictionary that will finally point to a list.
I use the following code to initiate it in Python:
for i in range(2160):
    for j in range(2160):
        for k in range(2160):
            subnetwork.update({i: {j: {k: [] }}})
This code takes too much time to execute. It is of Big-O(N^3) complexity.
Are there any ways to speed-up this process? Serializing maybe the data structure and retrieving it from hard drive is faster?
What data structures can achieve similar results? Would a flat dictionary using three-element tuples as keys serve my purpose?
Do you really need 10 billion entries (2,160 ** 3 == 10,077,696,000) in this structure? It's unlikely that any disk-based solution will be quicker than a memory-based one, but at the same time your program might be exceeding the bounds of real memory, causing "page thrashing" to occur.
Without knowing anything about your intended application it's hard to propose a suitable solution. What is it that you are trying to do?
For example, if you don't need to randomly look up items you might consider a flat dictionary using three-element tuples as keys. But since you don't tell us what you are trying to do, anything more would probably be highly speculative.
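If a flat, tuple-keyed dictionary does fit the use case, a lazy version that only creates entries when they are first touched avoids building 10 billion empty lists up front. A sketch using the standard library's defaultdict, keeping the question's name subnetwork:
from collections import defaultdict

# No initialization loop at all: a fresh list is created only for the
# (i, j, k) keys that actually get used.
subnetwork = defaultdict(list)

subnetwork[(5, 17, 2159)].append("some value")
print(subnetwork[(5, 17, 2159)])   # ['some value']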
I have a dict containing about 50,000 integer values, and a set that contains the keys of 100 of them. My inner loop increments or decrements the values of the dict items in an unpredictable way.
Periodically I need to replace one member of the set with the key of the largest element not already in the set. As an aside, if the dict items were sorted, the order of the sort would change slightly, not dramatically, between invocations of this routine.
Re-sorting the entire dict every time seems wasteful, although maybe less so given that it's already "almost" sorted. While I may be guilty of premature optimization, performance will matter as this will run a very large number of iterations, so I thought it worth asking my betters whether there's an obviously more efficient and pythonic approach.
I'm aware of the notion of dict "views" - Windows onto the contents which are updated as the contents change. Is there such a thing as a "sorted view"?
Instead of using a dict you could use a Counter object, which has a neat most_common(n) method that will:
Return a list of the n most common elements and their counts from the most common to the least.
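A minimal sketch of how that could replace the "largest key not already in the set" step (counts and in_use are hypothetical stand-ins for the question's dict and set):
from collections import Counter

counts = Counter({"a": 10, "b": 7, "c": 15, "d": 3})
in_use = {"c"}

# most_common() yields (key, count) pairs from largest to smallest,
# so the first key not already in the set is the one we want.
best_outside = next(key for key, _ in counts.most_common() if key not in in_use)
print(best_outside)   # 'a'
A Counter supports the same increment/decrement updates as a plain dict, so the inner loop would not need to change.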
I'm going to store on the order of 10,000 securities X 300 date pairs X 2 Types in some caching mechanism.
I'm assuming I'm going to use a dictionary.
Question Part 1:
Which is more efficient or faster? Assume that I'll generally be looking things up knowing a list of security IDs and the two dates plus the type. If there is a big efficiency gain from tweaking my lookup, I'm happy to do that. Also assume I can be wasteful of memory to an extent.
Method 1: store and look up using keys that look like strings "securityID_date1_date2_type"
Method 2: store and look up using keys that look like tuples (securityID, date1, date2, type)
Method 3: store and look up using nested dictionaries of some variation mentioned in methods 1 and 2
Question Part 2:
Is there an easy and better way to do this?
It's going to depend a lot on your use case. Is lookup the only activity or will you do other things, e.g:
Iterate all keys/values? For simplicity, you wouldn't want to nest dictionaries if iteration is relatively common.
What about iterating a subset of keys with a given securityID, type, etc.? Nested dictionaries (each keyed on one or more components of your key) would be beneficial if you needed to iterate "keys" with one component having a given value.
What about if you need to iterate based on a different subset of the key components? If that's the case, plain dict is probably not the best idea; you may want relational database, either the built-in sqlite3 module or a third party module for a more "production grade" DBMS.
Aside from that, it matters quite a bit how you construct and use keys. Strings cache their hash code (and can be interned for even faster comparisons), so if you reuse a string for lookup having stored it elsewhere, it's going to be fast. But tuples are usually safer (strings constructed from multiple pieces can accidentally produce the same string from different keys if the separation between components in the string isn't well maintained). And you can easily recover the original components from a tuple, where a string would need to be parsed to recover the values. Nested dicts aren't likely to win (and require some finesse with methods like setdefault to populate properly) in a simple contest of lookup speed, so it's only when iterating a subset of the data for a single component of the key that they're likely to be beneficial.
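As a tiny illustration of that string-key hazard (made-up values, not from the question): if the separator character can also appear inside a component, two different keys can collapse into the same string, while the tuples stay distinct.
key_a = ("ABC", "2020_01", "02_TYPE")
key_b = ("ABC_2020", "01_02", "TYPE")

# Joined with the same separator, both produce "ABC_2020_01_02_TYPE"...
assert "_".join(key_a) == "_".join(key_b)
# ...but as tuple keys they remain two different dictionary entries.
assert key_a != key_b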
If you want to benchmark, I'd suggest populating a dict with sample data, then use the timeit module (or ipython's %timeit magic) to test something approximating your use case. Just make sure it's a fair test, e.g. don't lookup the same key each time (using itertools.cycle to repeat a few hundred keys would work better) since dict optimizes for that scenario, and make sure the key is constructed each time, not just reused (unless reuse would be common in the real scenario) so string's caching of hash codes doesn't interfere.
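A rough benchmark along those lines might look like the following sketch; the sample data is made up, and only the tuple-key variant is shown, but the string-key version would be measured the same way:
import itertools
import timeit

# Made-up sample data: 10,000 securities x 1 date pair x 2 types.
keys = [("SEC%05d" % i, "2020-01-01", "2020-12-31", t)
        for i in range(10_000) for t in ("bid", "ask")]
data = {k: 0.0 for k in keys}

# Cycle through a few hundred different keys so the same key isn't hit
# every time, and rebuild the key inside the timed function so key
# construction is part of what gets measured.
probe = itertools.cycle(keys[:500])

def lookup_tuple():
    sec, d1, d2, typ = next(probe)
    return data[(sec, d1, d2, typ)]

print(timeit.timeit(lookup_tuple, number=1_000_000))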
So I currently have a 2d list of objects that define the map of a game where each object represents a tile on that map. As I was repurposing the code for something else, I wondered if it would make more sense to use a dictionary to store the map data or to continue using a list. With a list, the indices represent the x and y of the map, whereas in a dictionary, a (x,y) tuple would be the keys for the dictionary.
The reason I ask is that the map changing is a rare event, so the data is fairly static, and as far as I know the frequent lookups will be faster with dictionaries. It should also simplify looping through the map to draw it. Mostly I think using dictionaries will simplify accessing the data, though I'm not sure that will hold in every case.
Are these benefits worth the additional memory that I assume the dictionary will take up? Or am I even right about the benefits being benefits?
EDIT
I know that the current method works; this was more about whether it would make sense to switch in order to have cleaner-looking code, and to find any potential drawbacks.
Stuff like looping through the array would go from something like
for i in range(size[0]):
    for e in range(size[1]):
        thing.blit(....using i and e)
to
for (x, y), tile in self.dict.items():
    tile.blit(....using x and y)
or looking up a dict item would be
def get(x, y):
    if (x in range(size[0])) and (y in range(size[1])):
        return self.map[x][y].tile
to
def get(item):
    return self.dict.get(item)
It's not much, but it's somewhat cleaner, and if it's not any slower and there are no other drawbacks I see no reason not to.
I would be wary of premature optimization.
Does your current approach have unacceptable performance? Does the data structure you're using make it harder to reason about or write your code?
If there isn't a specific problem you need to solve that can't be addressed with your current architecture, I would be wary about changing it.
This is a good answer to reference about the speed and memory usage of Python lists vs dictionaries: https://stackoverflow.com/a/513906/1506904
Until you get to an incredibly large data set it is most likely that your current method, if it is working well for you, will be perfectly suitable.
I'm not sure that you'll get "the right" answer for this, but when I created the Game of Life in Python I used a dict. Realistically there should not be a substantial difference between the lookup cost of multi-dimensional lists and the lookup cost in a dict (both are O(1)), but if you're using a dict then you don't need to worry about instantiating the entire game board. In chess, this means that you are only creating 32 pieces instead of 64 squares and 32 pieces. In the game of Go, on the other hand, you create only 1 object instead of 361 list cells.
That said, in the case of the dict you will need to instantiate the tuples. If you can cache those (or only iterate the dict's keys) then maybe you will get the best of all worlds.
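A small sketch of that sparse, dict-based board; Piece is a made-up class here, just to show that only occupied squares need to exist:
class Piece:
    def __init__(self, name):
        self.name = name

# Only occupied squares get an entry; empty squares simply aren't stored.
board = {(0, 0): Piece("rook"), (0, 1): Piece("knight")}

def get(pos):
    # Returns None for empty squares instead of raising KeyError.
    return board.get(pos)

for (x, y), piece in board.items():   # iterates only the pieces, never empty squares
    print(x, y, piece.name)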