I want to create the following structure for a dictionary:
{ id1: {id2: {id3: [] }}}
That would be a triply nested dictionary that finally points to a list.
I use the following code to initialize it in Python:
for i in range(2160):
    for j in range(2160):
        for k in range(2160):
            subnetwork.update({i: {j: {k: []}}})
This code takes far too long to execute; it is O(N^3).
Are there any ways to speed up this process? Would serializing the data structure and retrieving it from disk be faster?
What data structures can achieve similar results? Would a flat dictionary using three-element tuples as keys serve my purpose?
Do you really need 10 billion entries (2,160 ** 3 == 10,077,696,000) in this structure? It's unlikely that any disk-based solution will be quicker than a memory-based one, but at the same time your program might be exceeding the bounds of real memory, causing "page thrashing" to occur.
Without knowing anything about your intended application it's hard to propose a suitable solution. What is it that you are trying to do?
For example, if you don't need to randomly look up items you might consider a flat dictionary using three-element tuples as keys. But since you don't tell us what you are trying to do, anything more would probably be highly speculative.
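If lazy creation is acceptable, here is a minimal sketch of the tuple-keyed idea; the variable names and sample keys are illustrative, not taken from the question, and defaultdict only creates entries when they are first touched, so nothing is pre-populated:

from collections import defaultdict

# A flat dictionary keyed by (id1, id2, id3) tuples. defaultdict(list) creates
# the empty list lazily on first access, so no 10-billion-entry loop is needed.
subnetwork = defaultdict(list)

subnetwork[(5, 17, 204)].append("some value")  # entry is created on demand
print(subnetwork[(5, 17, 204)])                # ['some value']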
The reason I am asking this question is that I am working with huge data sets.
In my algorithm, I basically need something like this:
users_per_document = {}
documents_per_user = {}
As you can tell from the names of the dictionaries, I need the users that clicked a specific document and the documents that were clicked by a specific user.
That means I hold "duplicated" data, and together the two structures overflow memory and my script gets killed after a while. Because I use very large data sets, I have to do this efficiently.
I suspect it is not possible, but I have to ask: is there a way to get all keys that map to a specific value in a dictionary?
If there is a way to do that, I will no longer need one of the dictionaries.
For example:
users_per_document["document1"] obviously returns the appropriate users;
what I need is something like users_per_document.getKeys("user1"), because that would return the same thing as documents_per_user["user1"].
If it is not possible, any suggestion is appreciated.
If you are using Python 3.x, you can do the following. If 2.x, just use .iteritems() instead.
user1_values = [key for key, value in users_per_document.items() if "user1" in value]
Note: this does iterate over the whole dictionary. A dictionary isn't really an ideal data structure for finding all keys that map to a specific value; the scan is O(n), so performing it n times costs O(n^2).
I am not very sure about the Python side, but in general computer-science terms you can solve the problem in the following way:
Basically, you can have a two-dimensional boolean matrix: the first index is the user, the second index is the document, and the stored boolean value says whether there is a relation between that specific user and that specific document.
PS: if the matrix is really sparse, you can make this much more efficient, but that is another story.
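For what it's worth, a minimal Python sketch of that sparse idea is to keep only a set of (user, document) pairs instead of a full matrix; recovering either direction then requires a scan, so it trades lookup speed for memory, and all names below are illustrative:

# Store only the pairs that are actually related.
clicks = set()
clicks.add(("user1", "document1"))
clicks.add(("user2", "document1"))

# The membership test replaces the boolean cell lookup.
print(("user1", "document1") in clicks)   # True

# Either direction can be recovered by scanning the set.
users_of_doc1 = {u for u, d in clicks if d == "document1"}
docs_of_user1 = {d for u, d in clicks if u == "user1"}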
I have a Python application that performs correlation on large files and stores the results in a dict. Depending on the input files, this dict can become really large, to the point where it no longer fits into memory. This causes the system to hang, so I want to prevent it.
My idea is that there are always correlations which are not very relevant for the later processing. These could be deleted without changing the overall result too much. I want to do this when there is not much memory left.
Hence, I check the available memory periodically. If it drops too low (say, below 300 MB), I delete the irrelevant correlations to regain space. That's the theory.
Now for my problem: in Python, you cannot delete entries from a dict while iterating over it. But that is exactly what I need to do, since I have to check each entry for relevancy before deciding whether to delete it.
The usual solution would be to create a copy of the dict for iteration, or to create a new dict containing only the elements that I want to preserve. However, the dict might be several GBs big and there are only a few hundred MB of free memory left. So I cannot do much copying since that may again cause the system to hang.
Here I am stuck. Can anyone think of a better method to achieve what I need? If in-place deletion of dict entries is absolutely not possible while iterating, maybe there is some workaround that could save me?
Thanks in advance!
EDIT -- some more information about the dict:
The keys are tuples specifying the values by which the data is correlated.
The values are dicts containing the correlated data. The keys of these inner dicts are always strings, and the values are numbers (int or float).
I am checking for relevancy by comparing the number values in the value-dicts with certain thresholds. If the values are below the thresholds, the particular correlation can be dropped.
I do not think that your solution to the problem is prudent.
If you have that much data, I recommend you find some bigger tools in your toolbox; one suggestion would be to let a local Redis server help you out.
Take a look at redis-collections, which will provide you with a dictionary-like object backed by Redis, giving you a sustainable solution.
>>> from redis_collections import Dict
>>> d = Dict()
>>> d['answer'] = 42
>>> d
<redis_collections.Dict at fe267c1dde5d4f648e7bac836a0168fe {'answer': 42}>
>>> d.items()
[('answer', 42)]
Best of luck!
Are the keys large? If not, you can loop over the dict to determine which entries should be deleted; store the key for each such entry in a list. Then loop over those keys and delete them from the dict.
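A minimal sketch of that two-pass approach; correlations and is_relevant are placeholder names for your dict and your threshold check:

# Pass 1: collect only the keys of entries that fail the relevancy test.
keys_to_delete = [key for key, values in correlations.items()
                  if not is_relevant(values)]

# Pass 2: delete them; we are no longer iterating over the dict itself.
for key in keys_to_delete:
    del correlations[key]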
I am developing an AI to perform MDP. I get states (just integers in this case) and assign each a value, and I will be doing this a lot. So I am looking for a data structure that can hold that information (no deletion needed) and has very fast get/update operations. Is there something faster than the regular dictionary? I am open to anything: native Python or open source, I just need fast lookups.
Using a Python dictionary is the way to go.
You're saying that all your keys are integers? In that case, it might be faster to use a list and just treat the list indices as the key values. However, you'd have to make sure that you never delete or add list items; just start with as many as you think you'll need, setting them all equal to None, as shown:
mylist = [None for i in xrange(totalitems)]
Then, when you need to "add" an item, just set the corresponding value.
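For instance, assuming the states really are integers in range(totalitems), the get and update operations reduce to plain indexing; state and value are placeholder names:

mylist[state] = value     # "update"
current = mylist[state]   # "get"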
Note that this probably won't actually gain you much in terms of actual efficiency, and it might be more confusing than just using a dictionary.
For 10,000 items, it turns out (on my machine, with my particular test case) that accessing each one and assigning it to a variable takes about 334.8 seconds with a list and 565 seconds with a dictionary.
If you want a rapid prototype, use python. And don't worry about speed.
If you want to write fast scientific code (and you can't build on fast native libraries, like LAPACK for linear algebra stuff) write it in C, C++ (maybe only to call from Python). If fast instead of ultra-fast is enough, you can also use Java or Scala.
So I currently have a 2D list of objects that defines the map of a game, where each object represents a tile on that map. As I was repurposing the code for something else, I wondered whether it would make more sense to use a dictionary to store the map data or to continue using a list. With a list, the indices represent the x and y of the map, whereas in a dictionary an (x, y) tuple would be the key.
The reason I ask is that the map changes only rarely, so the data is fairly static, and as far as I know the near-constant lookups will be faster in a dictionary. It should also simplify looping over the map to draw it. Mostly I think using a dictionary will simplify accessing the data, though I'm not sure that will be the case everywhere.
Are these benefits worth the additional memory that I assume the dictionary will take up? Or am I even right that the benefits are benefits?
EDIT
I know that the current method works; it was more a question of whether it would make sense to switch in order to have cleaner-looking code, and of finding any potential drawbacks.
Stuff like looping through the array would go from something like
for i in range(size[0]):
    for e in range(size[1]):
        thing.blit(....using i and e)
to
for i, e in dict.items():
    i.blit(....using i and e)
or looking up a map tile would go from
def get(x, y):
    if (x in range(size[0])) and (y in range(size[1])):
        return self.map[x][y].tile
to
def get(item):
    return self.dict.get(item)
It's not much, but it's somewhat cleaner, and if it's not any slower and there are no other drawbacks, I see no reason not to switch.
I would be wary of premature optimization.
Does your current approach have unacceptable performance? Does the data structure you're using make it harder to reason about or write your code?
If there isn't a specific problem you need to solve that can't be addressed with your current architecture, I would be wary about changing it.
This is a good answer to reference about the speed and memory usage of Python lists vs. dictionaries: https://stackoverflow.com/a/513906/1506904
Until you get to an incredibly large data set it is most likely that your current method, if it is working well for you, will be perfectly suitable.
I'm not sure that you'll get "the right" answer for this, but when I created the Game of Life in Python I used a dict. Realistically there should not be a substantial difference between the lookup cost of multi-dimensional lists and the lookup cost in a dict (both are O(1)), but if you're using a dict then you don't need to worry about instantiating the entire game board. In chess, this means that you are only creating 32 pieces instead of 64 squares and 32 pieces. In the game of Go, on the other hand, you can create only 1 object instead of 361 list-cells.
That said, in the case of the dict you will need to instantiate the tuples. If you can cache those (or only iterate the dict's keys) then maybe you will get the best of all worlds.
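As a rough illustration of that, a dict keyed by (x, y) tuples only holds the occupied squares; Tile and its blit method below are hypothetical stand-ins for your own classes, not anything from the question:

board = {}
board[(3, 4)] = Tile("grass")     # only occupied squares exist in the dict

def get(x, y):
    return board.get((x, y))      # None for empty or off-board squares

for (x, y), tile in board.items():
    tile.blit(x, y)               # iterate only over existing tiles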
I have a problem with my code running on Google App Engine. I don't know how to modify my code to suit GAE. The following is my problem:
for j in range(n):
    for d in range(j):
        for d1 in range(d):
            for d2 in range(d1):
                # block which runs in O(n^2)
Effectively the entire code block is O(N^6), and it can run for more than 10 minutes depending on n. That is why I am using task queues. I also need a 4-dimensional array, stored as a nested list (e.g. A[j][d][d1][d2]) of size n x n x n x n, i.e. it needs O(N^4) memory.
Since the limit of put() is 10 MB, I can't store the entire array. So I tried chopping it into smaller chunks, storing those, and recombining them on retrieval. I used json for this, but it doesn't work for larger n (> 40).
Then I stored the whole matrix as individual list entities in the datastore, i.e. one entity per A[j][d][d1], so there is no local variable. When I access A[j][d][d1][d2] in my code, I call my own getitem and putitem functions to get and put data from the datastore (with caching as well). As a result, my code takes more time to compute. After a few iterations, GAE raises error 203 and the task fails with code 500.
I know that my code may not be best suited for GAE. But what is the best way to implement it on GAE ?
There may be even more efficient ways to store your data and to iterate over it.
Questions:
What datatype are you storing, a list of lists ... of ints?
What range of the nested list does your innermost O(n^2) block typically operate over?
When you call putitem or getitem, how many values are you retrieving or storing in a single call?
Ideas:
You could try compressing your JSON (and base64-encoding it for cutting and pasting): 'myjson'.encode('zlib').encode('base64')
Use divide and conquer (map reduce) as @Robert suggested. You may be able to use a dictionary with tuples as keys, which may mean fewer lookups than A[j][d][d1][d2] in your inner loop. It would also allow you to populate your structure sparsely, though you would then need another way to track the bounds of the data you have loaded. A[j][d][d1][d2] becomes D[(j,d,d1,d2)] or D[j,d,d1,d2]; see the sketch after this list.
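A minimal sketch of that tuple-keyed, sparse replacement for A[j][d][d1][d2]; the helper names and the default of 0 for missing cells are assumptions, not part of the original code:

D = {}

def put_cell(j, d, d1, d2, value):
    D[(j, d, d1, d2)] = value          # store only the cells you actually compute

def get_cell(j, d, d1, d2, default=0):
    return D.get((j, d, d1, d2), default)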
You've omitted important details like the expected size of n from your question. Also, does the "# block which runs in O(n^2)" need access to the entire matrix, or are you simply populating the matrix based on the index values?
Here is a general answer: you need to find a way to break this up into smaller chunks. Maybe you can use some type of divide and conquer strategy and use tasks for parallelism. How you store your matrix depends on how you split the problem up. You might be able to store submatrices, or perhaps subvectors using the index values as key-names; again, this will depend on your problem and the strategy you use.
An alternative, if for some reason you cannot figure out how to parallelize your algorithm, is to use a continuation strategy of some kind. In other words, figure out roughly how many iterations you can typically do within the time constraints (leaving a safety margin), then once you hit that limit save your data and insert a new task to continue the processing. You'll just need to pass in the starting position and resume running from there. You may be able to do this easily by giving a starting parameter to the outermost range, but again it depends on the specifics of your problem.
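Here is a schematic, framework-agnostic sketch of that continuation idea; do_one_iteration, save_partial_results and enqueue_task are hypothetical placeholders for your own work unit, checkpointing and task-queue call:

MAX_ITERATIONS_PER_TASK = 1000       # assumed safety margin per task

def process(start, n):
    for i in range(start, min(start + MAX_ITERATIONS_PER_TASK, n)):
        do_one_iteration(i)          # one unit of work
    next_start = start + MAX_ITERATIONS_PER_TASK
    if next_start < n:
        save_partial_results()       # persist state before handing off
        enqueue_task(process, next_start, n)   # continue in a fresh task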
Sam, here is an idea and a pointer on where to start.
If what you need is somewhere between storing the whole matrix and storing the numbers one by one, you may be interested in using pickle to serialize your lists and store them in the datastore for later retrieval.
A list is a Python object, so you should be able to serialize it.
http://appengine-cookbook.appspot.com/recipe/how-to-put-any-python-object-in-a-datastore
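A minimal sketch of that, assuming my_list is the list you want to persist and CHUNK_SIZE is an illustrative value chosen to stay safely under the per-entity size limit:

import pickle

data = pickle.dumps(my_list, protocol=2)          # serialize the whole list
CHUNK_SIZE = 900 * 1024                           # illustrative chunk size
chunks = [data[i:i + CHUNK_SIZE]
          for i in range(0, len(data), CHUNK_SIZE)]

# ...store each chunk as its own datastore entity, then to read back:
restored = pickle.loads(b"".join(chunks))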