Is there a dictionary in python which will only keep the most recently accessed keys. Specifically, I am caching relatively large blobs of data in a dictionary, and I am looking for a way of preventing the dictionary from ballooning in size, and to drop to the variables which were only accessed a long time ago [i.e. to only keep the say the 1000 most recently accessed keys - and when a new key gets added, to drop the key that was accessed the longest ago].
I suspect this is not part of the standard dictionary class, but am hoping there is something analogous.
Sounds like you want a Least Recently Used (LRU) cache.
Here's a Python implementation already: https://pypi.python.org/pypi/lru-dict/
Here's another one: https://www.kunxi.org/blog/2014/05/lru-cache-in-python/
Related
I need a dictionary where I can store items with a TTL (time to live) so that the items disappear once the time is up. I found the ExpiringDict class for this purpose but it appears to be restricted to having the same timeout for each item in the dictionary. Is there an alternative that lets me specify different timeout values for each key?
It is easy to build yourself. Ingredients: a normal dict to store the values; a heapq to store (expiry, key) pairs; a Thread to run a loop, check the top of the heap and delete (or mark expired, depending on what your need is) while top's expiry is in the past (don't forget to let it sleep). When you push to dict, at the same time add (now + ttl, key) to the heapq. There's some details that you might want to attend to (e.g. removing stuff from heapq if you delete from dict etc, though that'd be a bit slow as you'd have to search the heap, then re-heapify - again, only necessary if your use case requires it) but the basic idea is quite simple.
One place to look for inspiration might be Django's LocMemCache in-memory key-value object. It basically wraps a _cache dict to hold the actual cache and an _expire_info dict to store any expiry. Then in get(), it will call self._has_expired() and make the comparison to the current timestamp with time.time().
You can find the class at django.core.cache.backends.locmem.
Granted, this is not a dict subclass; as mentioned above, it actually wraps two separate dictionaries, one for caching and one for storing expiries; but, its API is dictionary-like.
I am implementing a stack data structure using redis-py and Redis list data type. I am not clear on how to handle the case when the corresponding list data type is empty. Default Redis behaviour appears to be that once a list is empty the related key is deleted. The empty list case is hit on Redis for example, when I pop or clear all elements in my stack data structure on the Python end. Basically, my set up is that I have stack object in my code that calls operations on the Redis list. For example, when a client of the stack object does, stack.pop(), the stack object then calls BRPOP on the corresponding list in Redis using redis-py. Also, in my set up, the stack object has key attribute, which is the key of the related list in Redis.
I have thought about 2 possible solutions so far:
Never empty the Redis list completely. At least maintain one element in the list. From the client's perspective a stack is empty if the Redis list only contains 1 element. This approach works, but I mainly dont like it as it involves keeping track of number of elements pushed/popped.
If the list is empty, and the related key is deleted. Upon subsequent push, just create a new list on Redis. This approach also works, but an added complexity here is that I cannot be sure if someone else has created k,v pair on Redis using the same key as that of my stack object.
So, I am basically looking for a way to keep a key with an empty list that does not involve the bookkeeping required in the above two approaches. Thanks.
Your 2nd solution is the best Fit.
Yes you have to maintain the naming conventions, this is very basic for any no sql databases or key value store. You must have the control over the keys you put in. If you don't have that control, then over the period of time you don't know which key is used for which purpose. To achieve the same, you can prefix some meaningful string in all the keys you put in.
For example, if i want to store 3 hashmaps for a single user user1 I will do like this
hmset ACTIONS_user1 a 10 b 20 ...
hmset VIEWS_user1 home_page 4 login 10 ...
hmset ALERTS_user1 daily 5 hourly 12 ...
In the above example user1 is dynamically created by the app logic and you append them with a meaningful string representing what that key holds.
In this way you will always have control over the keys you put in and you will never face key collision.
Hope this helps.
Does the python code
for key in dict:
...
, where dict is a dict data type, always iterate in a fixed order with regrard to key? For example, suppose dict={"aaa":1,"bbb",2}, will the above code always first let key="aaa" and then key="bbb" (or in another fixed order)? Is it possible that the order is random? I am using python 3.3 in ubuntu 13 and let's assume this running environment doesn't change. Thank you.
add one thing: during multiple runs, the variable dict remains unchanged, i.e., generate once and read multiple times.
Intrinsically, a dictionary has no order in which it stores it keys. So you can not rely on the order. (I wouldn't assume the order to be unchanged even when the environment is identical).
One of the few reliable ways:
for key in sorted(yourDictionary.keys()):
# Use as key and yourDictionary[key]
EDIT: Response to your comment:
Python does not store keys in a random fashion. All the documentation says is that, you should not rely on this order. It depends on the implementation how the keys are ordered. What I will say here about your question is: If you are relying on this order, you are probably doing something wrong. In general you should/need not rely on this at all. :-)
CPython implementation detail: Keys and values are listed in an arbitrary
order which is non-random, varies across Python implementations, and
depends on the dictionary’s history of insertions and deletions.
For more info:http://docs.python.org/2/library/stdtypes.html#dict.items
What's more, You could use collections.OrderedDict to make the order fixed.
I want my Python program to be deterministic, so I have been using OrderedDicts extensively throughout the code. Unfortunately, while debugging memory leaks today, I discovered that OrderedDicts have a custom __del__ method, making them uncollectable whenever there's a cycle. It's rather unfortunate that there's no warning in the documentation about this.
So what can I do? Is there any deterministic dictionary in the Python standard library that plays nicely with gc? I'd really hate to have to roll my own, especially over a stupid one line function like this.
Also, is this something I should file a bug report for? I'm not familiar with the Python library's procedures, and what they consider a bug.
Edit: It appears that this is a known bug that was fixed back in 2010. I must have somehow gotten a really old version of 2.7 installed. I guess the best approach is to just include a monkey patch in case the user happens to be running a broken version like me.
If the presence of the __del__ method is problematic for you, just remove it:
>>> import collections
>>> del collections.OrderedDict.__del__
You will gain the ability to use OrderedDicts in a reference cycle. You will lose having the OrderedDict free all its resources immediately upon deletion.
It sounds like you've tracked down a bug in OrderedDict that was fixed at some point after your version of 2.7. If it wasn't in any actual released versions, maybe you can just ignore it. But otherwise, yeah, you need a workaround.
I would suggest that, instead of monkeypatching collections.OrderedDict, you should instead use the Equivalent OrderedDict recipe that runs on Python 2.4 or later linked in the documentation for collections.OrderedDict (which does not have the excess __del__). If nothing else, when someone comes along and says "I need to run this on 2.6, how much work is it to port" the answer will be "a little less"…
But two more points:
rewriting everything to avoid cycles is a huge amount of effort.
The fact that you've got cycles in your dictionaries is a red flag that you're doing something wrong (typically using strong refs for a cache or for back-pointers), which is likely to lead to other memory problems, and possibly other bugs. So that effort may turn out to be necessary anyway.
You still haven't explained what you're trying to accomplish; I suspect the "deterministic" thing is just a red herring (especially since dicts actually are deterministic), so the best solution is s/OrderedDict/dict/g.
But if determinism is necessary, you can't depend on the cycle collector, because it's not deterministic, and that means your finalizer ordering and so on all become non-deterministic. It also means your memory usage is non-deterministic—you may end up with a program that stays within your desired memory bounds 99.999% of the time, but not 100%; if those bounds are critically important, that can be worse than failing every time.
Meanwhile, the iteration order of dictionaries isn't specified, but in practice, CPython and PyPy iterate in the order of the hash buckets, not the id (memory location) of either the value or the key, and whatever Jython and IronPython do (they may be using some underlying Java or .NET collection that has different behavior; I haven't tested), it's unlikely that the memory order of the keys would be relevant. (How could you efficiently iterate a hash table based on something like that?) You may have confused yourself by testing with objects that use id for hash, but most objects hash based on value.
For example, take this simple program:
d={}
d[0] = 0
d[1] = 1
d[2] = 2
for k in d:
print(k, d[k], id(k), id(d[k]), hash(k))
If you run it repeatedly with CPython 2.7, CPython 3.2, and PyPy 1.9, the keys will always be iterated in order 0, 1, 2. The id columns may also be the same each time (that depends on your platform), but you can fix that in a number of ways—insert in a different order, reverse the order of the values, use string values instead of ints, assign the values to variables and then insert those variables instead of the literals, etc. Play with it enough and you can get every possible order for the id columns, and yet the keys are still iterated in the same order every time.
The order of iteration is not predictable, because to predict it you need the function for converting hash(k) into a bucket index, which depends on information you don't have access to from Python. Even if it's just hash(k) % self._table_size, unless that _table_size is exposed to the Python interface, it's not helpful. (It's a complex function of the sequence of inserts and deletes that could in principle be calculated, but in practice it's silly to try.)
But it is deterministic; if you insert and delete the same keys in the same order every time, the iteration order will be the same every time.
Note that the fix made in Python 2.7 to eliminate the __del__ method and so stop them from being uncollectable does unfortunately mean that every use of an OrderedDict (even an empty one) results in a reference cycle which must be garbage collected. See this answer for more details.
I am new to python. I need a data structure to store counts of some objects. For example, I want to store the most visited webpages. Lets say. I have 100 the most visited webpages. I keep the counts of visits to each webpage. I may need to update the list. I will definitely update the visit-counts. It does not have to be ordered. I will look at the associated visit-count given the webpage ID. I am planning to use a dictionary. Is there a faster way of doing this in python?
The dictionary is an appropriate and fast data structure for this task (mapping webpage IDs to visit counts).
Python dictionaries are implemented using hash tables for fast O(1) access. They are so fast that almost any attempt to avoid them will make code run slower and make the code unpleasant to look at.
P.S. Also take a look at collections.Counter which is specifically designed for this kind of work (counting hits). It is implemented as a dictionary with initial default values set to zero.
Python dictionary object is one of the most optimized parts of the whole Python language and the reason is that dictionaries are used everywhere.
For example normally every object instance of every class uses a dictionary to keep instance data members content, the class is a dictionary containing the methods, the modules use a dictionary to keep the globals, the system uses a dictionary to keep and lookup the modules and so on.
For keeping a counter using a dictionary is a good approach in Python.