How to remove many elements in a very big dict? - python

I have a very large dict, and I want to delete many elements from it. Perhaps I should do this:
new_dict = {key: big_dict[key] for key in big_dict if check(big_dict[key])}
However, I don't have enough memory to keep both big_dict and new_dict in RAM. Is there any way to deal with this?
Edit:
I can't delete elements one by one. I need to run a test on the values to find which elements I want to delete.
I also can't delete elements in a for loop like:
for key in dic:
    if test(dic[key]):
        del dic[key]
It causes an error: you can't change the size of the dict while looping over it (RuntimeError: dictionary changed size during iteration).
My God... I can't even make a set to remember the keys to delete; there are too many keys...
I see, if the dict class doesn't have a function to do this, perhaps the only way is to buy a new computer...

Here are some options:
Make a new 'dict' on disk, for which pickle and shelve may be helpful.
Iterate through and build up a list of keys until it reaches a certain size, delete those keys, and then repeat the iteration, allowing you to build a bigger list each time (see the sketch after this list).
Store the keys to delete in terms of their index in .keys(), which can be more memory efficient. This is OK as long as the dictionary is not modified between calls to .keys(). If about half of the elements are to be deleted, do this with a binary sequence (1 = delete, 0 = keep). If a vast majority of elements are to be deleted (or not deleted), store the appropriate keys as integers in a list.
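A minimal sketch of the second option, assuming (as in the question) that check(value) returns True for entries that should be kept:
CHUNK = 100000  # tune to the memory you can spare for the key list

while True:
    to_delete = []
    # Collect up to CHUNK keys whose values fail the test...
    for key, value in big_dict.items():
        if not check(value):
            to_delete.append(key)
            if len(to_delete) >= CHUNK:
                break
    if not to_delete:
        break
    # ...then delete them outside the iteration, which is safe.
    for key in to_delete:
        del big_dict[key]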

You could try iterating through the dictionary and deleting the elements you do not require with
del big_dict[key]
This way you wouldn't be making a copy of the dictionary. Note, though, that you can't delete entries while iterating over the dict itself; iterate over a snapshot of the keys (list(big_dict)) instead.
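A sketch of that approach, again assuming check(value) is True for entries to keep:
# list(big_dict) snapshots the keys up front, so deleting entries inside
# the loop is safe; the snapshot itself costs O(N) memory for the key list.
for key in list(big_dict):
    if not check(big_dict[key]):
        del big_dict[key]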

You can use
big_dict.pop("key", None)
Refer here:
How to remove a key from a python dictionary?


What is the faster method?

I have to use the length of a list which is the value for a key in a dictionary, and I need this value in a for loop. Is it better to fetch the length of the list associated with the key every time, or to fetch the length from a different dictionary which has the same keys?
I am using len() in the for loop as of now.
len() is very fast - it runs in constant time (see Cost of len() function), so I would not build a new data structure just to cache its answer. Just use it each time you need it.
Building a whole extra data structure would definitely use more resources and would most likely be slower. Just make sure you write your loop over my_dict.items(), not over the keys, so you don't unnecessarily redo the key lookups inside the loop.
E.g., use something like this for efficient looping over your dict:
my_dict = ...  # some dict where the values are lists
for key, value in my_dict.items():
    # use key, value (your list) and len(value) (its length) as needed
    ...

How do I get the last inserted key in Python (>3.7) dict without iterating through the whole list?

I know that Python dict's keys() since 3.7 are ordered by insertion order. I also know that I can get the first inserted key in O(1) time by doing next(dict.keys())
What I want to know is, is it possible to get the last inserted key in O(1)?
Currently, the only way I know of is to do list(dict.keys())[-1] but this takes O(n) time and space.
By last inserted key, I mean:
d = {}
d['a'] = ...
d['b'] = ...
d['a'] = ...
The last inserted key is 'b' as it's the last element in d.keys()
Dicts support reverse iteration:
next(reversed(d))
Note that due to how the dict implementation works, this is O(1) for a dict where no deletions have occurred, but it may take longer than O(1) if items have been deleted from the dict. If items have been deleted, the iterator may have to traverse a bunch of dummy entries to find the last real entry. (Finding the first item may also take longer than O(1) in such a case.)
If you want to both find the last element and remove it, use d.popitem(), as Tim Peters suggested. popitem is optimized for this. Starting with a dict where no deletions have occurred, removing all elements from the dict with del d[next(reversed(d))] or del d[next(iter(d))] is O(N^2), while removing all elements with d.popitem() is O(N).
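A quick demonstration of both, on a hypothetical two-key dict:
d = {'a': 1, 'b': 2}
last_key = next(reversed(d))  # 'b': O(1) when no deletions have occurred
key, value = d.popitem()      # ('b', 2): removes and returns the last item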
In more complex cases, you may want to consider using collections.OrderedDict. collections.OrderedDict uses a linked list-based ordering implementation with a few advantages, one of those advantages being that finding the first or last element is always O(1) regardless of what deletions have occurred.
@user2357112 supports Monica's next(reversed(d)) does the trick. I'll add that, when I want the most recently added key, I usually want to remove it from the dict too. In that case, a simple d.popitem() does the trick.
CAUTIONS
I also know that I can get the first inserted key in O(1) time by doing next(dict.keys())
That doesn't work. If you try, you'll get
TypeError: 'dict_keys' object is not an iterator
However, e.g., next(iter(d)) will work.
But it's not necessarily O(1). It is for the OrderedDict library implementation, but not for the built-in dicts. The problem is that if, e.g., you delete "the first" key over & over & over ... again, that leaves an increasing number of "holes" at the start of the data structure, and finding the first non-hole has to skip over all of those one at a time.
So, in a loop, repeatedly deleting the first and then accessing the new first can take time quadratic in the number of loop iterations. You probably won't notice unless the dict has at least about 10 thousand elements, though.
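For illustration, the quadratic pattern described above looks like this:
# Each next(iter(d)) may have to skip over the "holes" left by all the
# earlier deletions, so emptying the dict this way can be quadratic.
while d:
    del d[next(iter(d))]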

Python: Fastest way to search if long string is in list of strings

I have an input of about 2-5 millions strings of about 400 characters each, coming from a stored text file.
I need to check for duplicates before adding them to the collection I check against (it doesn't have to be a list; it can be any other data type, and is technically a set since all items are unique).
I can expect at most about 0.01% of my data to be non-unique, and I need to filter those out.
I'm wondering if there is any faster way for me to check if the item exists in the list rather than:
a = []
for item in data:
    if item not in a:
        a.append(item)
I do not want to lose the order.
Would hashing be faster (I don't need encryption)? But then I'd have to maintain a hash table for all the values to check first.
Is there any way I'm missing?
I'm on Python 2, and can at most go up to Python 3.5.
It's hard to answer this question because it keeps changing ;-) The version I'm answering asks whether there's a faster way than:
a = []
for item in data:
    if item not in a:
        a.append(item)
That will be horridly slow, taking time quadratic in len(data). In any version of Python the following will take expected-case time linear in len(data):
seen = set()
for item in data:
    if item not in seen:
        seen.add(item)
        emit(item)
where emit() does whatever you like (append to a list, write to a file, whatever).
In comments I already noted ways to achieve the same thing with ordered dictionaries (whether ordered by language guarantee in Python 3.7, or via the OrderedDict type from the collections package). The code just above is the most memory-efficient, though.
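For example, a sketch of the ordered-dict route (on CPython 3.7+, plain dict.fromkeys behaves the same way):
from collections import OrderedDict

# The keys of an ordered dict act as an insertion-ordered "seen" set,
# so this removes duplicates while keeping first-occurrence order.
unique_items = list(OrderedDict.fromkeys(data))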
You can try this:
a = list(set(data))
A list is an ordered sequence of elements, whereas a set is an unordered collection of distinct elements. Note that this removes duplicates but does not preserve the original order, which the question asks for.

Python, how to check object is in list?

I have an object to store data:
Vertex(key, children)
and a list to store these objects:
vertices = []
I'm using another list
key_vertices = []
to store the keys of my vertices, so I can easily check (without looping over every object) whether a vertex with a given key already exists, like this:
if key not in self.key_vertices:
    # add a new key to the array
    self.key_vertices.append(key)
    # create a vertex object to store dependency information
    self.verteces.append(Vertex(key, children))
I think it's a bit complicated; maybe someone knows a better way to store multiple Vertex objects with the ability to easily check and access them.
Thanks
Your example works fine; the only problem you could have is a performance issue with the in operator for lists, which is O(n).
If you don't care about the order of the keys (which is likely), just do this:
self.key_vertices = set()
then:
if key not in self.key_vertices:
    # add a new key to the array
    self.key_vertices.add(key)
    # create a vertex object to store dependency information
    self.verteces.append(Vertex(key, children))
You'll save a lot of time on the in operator because in on a set is way faster, thanks to key hashing.
And if you don't care about order in self.verteces, just use a dictionary; in that case, you probably don't need the first key parameter to your Vertex structure:
self.verteces = dict()
if key not in self.verteces:
    # create a vertex object to store dependency information
    self.verteces[key] = Vertex(children)
When you need to check for membership, a list is not the best choice as every object in the list will be checked.
If key is hashable, use a set.
If it's not hashable but is comparable, use a tree (unavailable in the standard library). Try to make it hashable.
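For example, a sketch of making Vertex hashable by its key, assuming the key uniquely identifies a vertex:
class Vertex:
    def __init__(self, key, children):
        self.key = key
        self.children = children

    def __hash__(self):
        # Hash on the key so Vertex objects can live in a set.
        return hash(self.key)

    def __eq__(self, other):
        return isinstance(other, Vertex) and self.key == other.key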
If I understand correctly, you want to check whether an element has already been added in O(1) (i.e. without having to check every element in the list).
The easiest way to do that is to use a set. A set is an unordered collection that lets you check whether an element exists in constant time, O(1). You can think of a set as a dict with only keys, but iterating over it and testing membership work just like they do for a list:
for value in mySet:
    print(value)
print("hello" in mySet)
If you need an ordered list (most of the time, you don't), your approach is pretty good but I would use a set instead:
self.vertece_set = set()  # Init somewhere else ;)
if key not in self.vertece_set:
    # add a new key to the array
    self.vertece_set.add(key)
    # create a vertex object to store dependency information
    self.verteces.append(Vertex(key, children))

Look up python dict value by expression

I have a dict that has unix epoch timestamps for keys, like so:
lookup_dict = {
    1357899: {},  # some dict of data
    1357910: {},  # some other dict of data
}
Except, you know, millions and millions and millions of entries. I'd like to subset this dict, over and over again. Ideally, I'd love to be able to write something like I can in R, like:
lookup_value = 1357900
dict_subset = lookup_dict[key >= lookup_value]
# dict_subset now contains {1357910: {}}
But I confess, I can't find any actual proof that this is something Python can do without having, one way or the other, to iterate over every row. If I understand Python correctly (and I might not), key lookup of the form key in dict uses binary search, and is thus very fast; any way to do a binary search, on dict keys?
To do this without iterating, you're going to need the keys in sorted order. Then you just need to do a binary search for the first one >= lookup_value, instead of checking each one for >= lookup_value.
If you're willing to use a third-party library, there are plenty out there. The first two that spring to mind are bintrees (which uses a red-black tree, like C++, Java, etc.) and blist (which uses a B+Tree). For example, with bintrees, it's as simple as this:
dict_subset = lookup_dict[lookup_value:]
And this will be as efficient as you'd hope—basically, it adds a single O(log N) search on top of whatever the cost of using that subset. (Of course usually what you want to do with that subset is iterate the whole thing, which ends up being O(N) anyway… but maybe you're doing something different, or maybe the subset is only 10 keys out of 1000000.)
Of course there is a tradeoff. Random access to a tree-based mapping is O(log N) instead of "usually O(1)". Also, your keys obviously need to be fully ordered, instead of hashable (and that's a lot harder to detect automatically and raise nice error messages on).
If you want to build this yourself, you can. You don't even necessarily need a tree; just a sorted list of keys alongside a dict. You can maintain the list with the bisect module in the stdlib, as JonClements suggested. You may want to wrap up bisect to make a sorted list object—or, better, get one of the recipes on ActiveState or PyPI to do it for you. You can then wrap the sorted list and the dict together into a single object, so you don't accidentally update one without updating the other. And then you can extend the interface to be as nice as bintrees, if you want.
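A minimal sketch of that do-it-yourself route, using the stdlib bisect module (the helper name subset_from is just for illustration):
import bisect

# Keep a sorted list of keys alongside the dict; it must be kept in
# sync whenever lookup_dict changes.
sorted_keys = sorted(lookup_dict)

def subset_from(lookup_value):
    # O(log N) binary search for the first key >= lookup_value.
    start = bisect.bisect_left(sorted_keys, lookup_value)
    return {key: lookup_dict[key] for key in sorted_keys[start:]}

dict_subset = subset_from(1357900)  # {1357910: {...}}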
The following code will also work:
some_time_to_filter_for = ...  # some Unix epoch time
# Create a new sub-dictionary
sub_dict = {key: val for key, val in lookup_dict.items()
            if key >= some_time_to_filter_for}
Basically, we iterate through all the keys in the dictionary and, given a time to filter on, take every key greater than or equal to that value and place it into the new dictionary. Note that, unlike the binary-search approaches above, this does iterate over every entry.
