fast data comparison in python

I want to compare a large set of data in the form of 2 dictionaries of varying lengths.
post = {0: [0.96180319786071777, 0.37529754638671875],
10: [0.20612385869026184, 0.17849941551685333],
20: [0.20612400770187378, 0.17510984838008881],...}
pre = {0: [0.96180319786071777, 0.37529754638671875],
1: [0.20612385869026184, 0.17849941551685333],
2: [0.20612400770187378, 0.17510984838008881],
5065: [0.80861318111419678, 0.76381617784500122],...}
The answer we need to get is 5065: [0.80861318111419678, 0.76381617784500122]. This is based on the fact that we are only comparing the values and not the indices at all.
I am using this key value pair only to remember the sequence of data. The data type can be replaced with a list/set if need be. I need to find out the key:value (index and value) pairs of the elements that are not in common to the dictionaries.
The code that I am using is very simple..
new = {}
found = []
for i in range(0, len(post)):
    found = []
    for j in range(0, len(pre)):
        if post[i] not in pre.values():
            if post[i] not in new:
                new[i] = post[i]
                found.append(j)
                break
    if found:
        for f in found: pre.pop(f)
new{} contains the elements I need.
The problem I am facing is that this process is too slow. It sometimes takes over an hour to run, and the data can be much larger at times. I need it to be faster.
Is there an efficient way of doing what I am trying to achieve? I would prefer not to depend on external packages beyond those bundled with Python 2.5 (64 bit) unless absolutely necessary.
Thank you all.

This is basically what sets are designed for (computing differences in sets of items). The only gotcha is that the things you put into a set need to be hashable, and lists aren't. However, tuples are, so if you convert to that, you can put those into a set:
post_set = set(tuple(x) for x in post.itervalues())
pre_set = set(tuple(x) for x in pre.itervalues())
items_in_only_one_set = post_set ^ pre_set
For more about sets: http://docs.python.org/library/stdtypes.html#set
To get the original indices after you've computed the differences, what you'd probably want is to generate reverse lookup tables:
post_indices = dict((tuple(v),k) for k,v in post.iteritems())
pre_indices = dict((tuple(v),k) for k,v in pre.iteritems())
Then you can just take a given tuple and look up its index via the dictionaries:
index = post_indices.get(a_tuple, pre_indices.get(a_tuple))
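For readers on Python 3, where itervalues() and iteritems() no longer exist, the same idea might be sketched as follows. The data is the sample from the question; the names only_in_one and result are assumptions made for illustration, not part of the original answer.

```python
post = {0: [0.96180319786071777, 0.37529754638671875],
        10: [0.20612385869026184, 0.17849941551685333],
        20: [0.20612400770187378, 0.17510984838008881]}
pre = {0: [0.96180319786071777, 0.37529754638671875],
       1: [0.20612385869026184, 0.17849941551685333],
       2: [0.20612400770187378, 0.17510984838008881],
       5065: [0.80861318111419678, 0.76381617784500122]}

# Lists are unhashable, so convert each value to a tuple first.
post_set = {tuple(v) for v in post.values()}
pre_set = {tuple(v) for v in pre.values()}

# Symmetric difference: values present in exactly one of the two dicts.
only_in_one = post_set ^ pre_set

# Reverse lookup tables (value tuple -> original key). If the same value
# occurs under several keys, the last key seen wins.
post_indices = {tuple(v): k for k, v in post.items()}
pre_indices = {tuple(v): k for k, v in pre.items()}

# Hypothetical result container: original key -> value for each difference.
result = {}
for value in only_in_one:
    key = post_indices.get(value, pre_indices.get(value))
    result[key] = list(value)

print(result)
```

Building the sets and lookup tables takes one pass over each dictionary, so the whole comparison is roughly linear in the total number of entries rather than quadratic. Note that result is keyed by the original index, so a key that appears in both post and pre with different values would be overwritten in this sketch.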

Your problem is likely the nested for loops combined with the use of range(), which in Python 2 builds a new list on every call and can be slow. You will probably get some automatic speedups by iterating over pre and post directly and by not nesting the iteration.
post = {0: [0.96180319786071777, 0.37529754638671875],
10: [0.20612385869026184, 0.17849941551685333],
20: [0.20612400770187378, 0.17510984838008881]}
pre = {0: [0.96180319786071777, 0.37529754638671875],
1: [0.20612385869026184, 0.17849941551685333],
2: [0.20612400770187378, 0.17510984838008881],
5065: [0.80861318111419678, 0.76381617784500122]}
'''Create sets of values, independent of dict key for O(1) lookup'''
post_set=set(map(tuple, post.values()))
pre_set=set(map(tuple, pre.values()))
'''Iterate through each structure only once, filtering items that are found in
the sets we created earlier, updating new_diff'''
from itertools import ifilterfalse
new_diff=dict(ifilterfalse(lambda x: tuple(x[1]) in pre_set, post.items()))
new_diff.update(ifilterfalse(lambda x: tuple(x[1]) in post_set, pre.items()))
new_diff is now a dict such that each value is not found in both post and pre, with the original index preserved.
>>> print new_diff
{5065: [0.80861318111419678, 0.76381617784500122]}
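In Python 3, itertools.ifilterfalse was renamed to itertools.filterfalse, and the same filtering can also be written with a dict comprehension. The following is only a sketch of that variant; the function name diff_by_value is hypothetical and does not appear in the answer above.

```python
def diff_by_value(post, pre):
    """Return the key:value pairs whose value occurs in only one of the dicts."""
    # Tuples are hashable, so membership tests against the sets are O(1) on average.
    post_set = set(map(tuple, post.values()))
    pre_set = set(map(tuple, pre.values()))

    # Keep entries of post whose value is absent from pre ...
    new_diff = {k: v for k, v in post.items() if tuple(v) not in pre_set}
    # ... and entries of pre whose value is absent from post.
    new_diff.update((k, v) for k, v in pre.items() if tuple(v) not in post_set)
    return new_diff


post = {0: [0.96180319786071777, 0.37529754638671875],
        10: [0.20612385869026184, 0.17849941551685333],
        20: [0.20612400770187378, 0.17510984838008881]}
pre = {0: [0.96180319786071777, 0.37529754638671875],
       1: [0.20612385869026184, 0.17849941551685333],
       2: [0.20612400770187378, 0.17510984838008881],
       5065: [0.80861318111419678, 0.76381617784500122]}

new_diff = diff_by_value(post, pre)
print(new_diff)
```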

Related

The following code is gives the output as i = 0, 1, 2, 3, 4 for some reason. Can anyone explain how this is happening? [duplicate]

Let's say we have a Python dictionary d, and we're iterating over it like so:
for k, v in d.iteritems():
    del d[f(k)] # remove some item
    d[g(k)] = v # add a new item
(f and g are just some black-box transformations.)
In other words, we try to add/remove items to d while iterating over it using iteritems.
Is this well defined? Could you provide some references to support your answer?
See also How to avoid "RuntimeError: dictionary changed size during iteration" error? for the separate question of how to avoid the problem.

Alex Martelli weighs in on this here.
It may not be safe to change the container (e.g. dict) while looping over the container.
So del d[f(k)] may not be safe. As you know, the workaround is to use d.copy().items() (to loop over an independent copy of the container) instead of d.iteritems() or d.items() (which use the same underlying container).
It is okay to modify the value at an existing index of the dict, but inserting values at new indices (e.g. d[g(k)] = v) may not work.
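The difference between the two cases can be seen in a small experiment. This is a minimal sketch for CPython 3 (so it uses plain iteration rather than iteritems()); the sample dictionary and the variable error are made up for illustration.

```python
d = {1: "a", 2: "b", 3: "c"}

# Overwriting the value of an existing key does not change the dict's size,
# so this loop runs to completion.
for k in d:
    d[k] = d[k] + "x"

# Inserting a new key changes the size; CPython detects this on the next
# iteration step and raises RuntimeError.
error = None
try:
    for k in d:
        d[k + 10] = d[k]
except RuntimeError as exc:
    error = str(exc)

print(d)
print(error)
```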

It is explicitly mentioned on the Python doc page (for Python 2.7) that
Using iteritems() while adding or deleting entries in the dictionary may raise a RuntimeError or fail to iterate over all entries.
Similarly for Python 3.
The same holds for iter(d), d.iterkeys() and d.itervalues(), and I'll go as far as saying that it also holds for a loop of the form for k, v in d.items(): (I can't remember exactly what for does, but I would not be surprised if the implementation called iter(d)).

You cannot do that, at least with d.iteritems(). I tried it, and Python fails with
RuntimeError: dictionary changed size during iteration
If you instead use d.items(), then it works.
In Python 3, d.items() is a view into the dictionary, like d.iteritems() in Python 2. To do this in Python 3, instead use d.copy().items(). This will similarly allow us to iterate over a copy of the dictionary in order to avoid modifying the data structure we are iterating over.

I have a large dictionary containing Numpy arrays, so the dict.copy().keys() thing suggested by @murgatroid99 was not feasible (though it worked). Instead, I just converted the keys_view to a list and it worked fine (in Python 3.4):
for item in list(dict_d.keys()):
    temp = dict_d.pop(item)
    dict_d['some_key'] = 1 # Some value
I realize this doesn't dive into the philosophical realm of Python's inner workings like the answers above, but it does provide a practical solution to the stated problem.

The following code shows that this is not well defined:
def f(x):
    return x
def g(x):
    return x+1
def h(x):
    return x+10
try:
    d = {1:"a", 2:"b", 3:"c"}
    for k, v in d.iteritems():
        del d[f(k)]
        d[g(k)] = v+"x"
    print d
except Exception as e:
    print "Exception:", e
try:
    d = {1:"a", 2:"b", 3:"c"}
    for k, v in d.iteritems():
        del d[f(k)]
        d[h(k)] = v+"x"
    print d
except Exception as e:
    print "Exception:", e
The first example calls g(k), and throws an exception (dictionary changed size during iteration).
The second example calls h(k) and throws no exception, but outputs:
{21: 'axx', 22: 'bxx', 23: 'cxx'}
Which, looking at the code, seems wrong - I would have expected something like:
{11: 'ax', 12: 'bx', 13: 'cx'}

In Python 3 you should just build a new dictionary:
prefix = 'item_'
t = {'f1': 'ffw', 'f2': 'fca'}
t2 = dict()
for k,v in t.items():
    t2[k] = prefix + v
or work on a copy:
t2 = t.copy()
You should never modify the original dictionary; it leads to confusion as well as potential bugs or RuntimeErrors, unless you only add entries to the dictionary under new key names.

This question asks about using an iterator (and funny enough, that Python 2 .iteritems iterator is no longer supported in Python 3) to delete or add items, and it must have a No as its only right answer as you can find it in the accepted answer. Yet: most of the searchers try to find a solution, they will not care how this is done technically, be it an iterator or a recursion, and there is a solution for the problem:
You cannot loop-change a dict without using an additional (recursive) function.
This question should therefore be linked to a question that has a working solution:
How can I remove a key:value pair wherever the chosen key occurs in a deeply nested dictionary? (= "delete")
Also helpful as it shows how to change the items of a dict on the run: How can I replace a key:value pair by its value wherever the chosen key occurs in a deeply nested dictionary? (= "replace").
With the same recursive approach, you will also be able to add items, as the question asks.
Since my request to link this question was declined, here is a copy of the solution that can delete items from a dict. See How can I remove a key:value pair wherever the chosen key occurs in a deeply nested dictionary? (= "delete") for examples / credits / notes.
import copy
def find_remove(this_dict, target_key, bln_overwrite_dict=False):
    if not bln_overwrite_dict:
        this_dict = copy.deepcopy(this_dict)
    for key in this_dict:
        # if the current value is a dict, dive into it
        if isinstance(this_dict[key], dict):
            if target_key in this_dict[key]:
                this_dict[key].pop(target_key)
            this_dict[key] = find_remove(this_dict[key], target_key)
    return this_dict
dict_nested_new = find_remove(nested_dict, "sub_key2a")
The trick
The trick is to find out in advance whether a target_key is among the next children (= this_dict[key] = the values of the current dict iteration) before you reach the child level recursively. Only then you can still delete a key:value pair of the child level while iterating over a dictionary. Once you have reached the same level as the key to be deleted and then try to delete it from there, you would get the error:
RuntimeError: dictionary changed size during iteration
The recursive solution makes any change only on the next values' sub-level and therefore avoids the error.

I got the same problem and I used following procedure to solve this issue.
A Python list can be iterated over even if you modify it during the iteration.
So, for a non-empty list, the following code will print 1 forever:
for i in list:
    list.append(1)
    print 1
So using list and dict collaboratively you can solve this problem.
d_list = [] # keys still to visit; fill with the initial keys of d_dict
d_dict = {}
for k in d_list:
    if d_dict[k] != -1: # compare by value; "is not -1" tests identity
        v = d_dict[k]
        d_dict[f(k)] = -1 # rather than deleting it, mark it with -1 or another value to show that it will not be considered further (deleted)
        d_dict[g(k)] = v # add a new item
        d_list.append(g(k))

Today I had a similar use-case, but instead of simply materializing the keys on the dictionary at the beginning of the loop, I wanted changes to the dict to affect the iteration of the dict, which was an ordered dict.
I ended up building the following routine, which can also be found in jaraco.itertools:
def _mutable_iter(dict):
    """
    Iterate over items in the dict, yielding the first one, but allowing
    it to be mutated during the process.
    >>> d = dict(a=1)
    >>> it = _mutable_iter(d)
    >>> next(it)
    ('a', 1)
    >>> d
    {}
    >>> d.update(b=2)
    >>> list(it)
    [('b', 2)]
    """
    while dict:
        prev_key = next(iter(dict))
        yield prev_key, dict.pop(prev_key)
The docstring illustrates the usage. This function could be used in place of d.iteritems() above to have the desired effect.

Python Set Comprehension Nested in Dict Comprehension

I have a list of tuples, where each tuple contains a string and a number in the form of:
[(string_1, num_a), (string_2, num_b), ...]
The strings are nonunique, and so are the numbers, e.g. (string_1 , num_m) or (string_9 , num_b) are likely to exist in the list.
I'm attempting to create a dictionary with the string as the key and a set of all numbers occurring with that string as the value:
dict = {string_1: {num_a, num_m}, string_2: {num_b}, ...}
I've done this somewhat successfully with the following dictionary comprehension with nested set comprehension:
#st_id_list = [(string_1, num_a), ...]
#st_dict = {string_1: {num_a, num_m}, ...}
st_dict = {
    st[0]: set(
        st_[1]
        for st_ in st_id_list
        if st_[0] == st[0]
    )
    for st in st_id_list
}
There's only one issue: st_id_list is 18,000 items long. This snippet of code takes less than ten seconds to run for a list of 500 tuples, but over twelve minutes to run for the full 18,000 tuples. I have to think this is because I've nested a set comprehension inside a dict comprehension.
Is there a way to avoid this, or a smarter way to do it?

You have a double loop, so you take O(N**2) time to produce your dictionary. For 500 items, 250,000 steps are taken, and for your 18k items, 324 million steps need to be done.
Here is an O(N) loop instead, so 500 steps for your smaller dataset and 18,000 steps for the larger dataset:
st_dict = {}
for st, id in st_id_list:
    st_dict.setdefault(st, set()).add(id)
This uses the dict.setdefault() method to ensure that for a given key (your string values), there is at least an empty set available if the key is missing, then adds the current id value to that set.
You can do the same with a collections.defaultdict() object:
from collections import defaultdict
st_dict = defaultdict(set)
for st, id in st_id_list:
    st_dict[st].add(id)
The defaultdict() uses the factory passed in to set a default value for missing keys.
The disadvantage of the defaultdict approach is that the object continues to produce default values for missing keys after your loop, which can hide application bugs. Use st_dict.default_factory = None to disable the factory explicitly to prevent that.
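As a small illustration of that last point, here is a sketch with made-up sample data (the tuples in st_id_list and the flag raised are assumptions for the example). Once default_factory is set to None, looking up a missing key raises KeyError again instead of silently creating an empty set.

```python
from collections import defaultdict

# Hypothetical sample data in the shape described in the question.
st_id_list = [("string_1", 1), ("string_2", 2), ("string_1", 3), ("string_9", 2)]

st_dict = defaultdict(set)
for st, num in st_id_list:
    st_dict[st].add(num)

# Disable the factory so later lookups of missing keys fail loudly.
st_dict.default_factory = None

try:
    st_dict["missing"]
    raised = False
except KeyError:
    raised = True

print(dict(st_dict))
print(raised)
```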

Why are you using two loops when you could do in one loop like this:
list_1=[('string_1', 'num_a'), ('string_2', 'num_b'),('string_1' , 'num_m'),('string_9' , 'num_b')]
string_num={}
for i in list_1:
    if i[0] not in string_num:
        string_num[i[0]]={i[1]}
    else:
        string_num[i[0]].add(i[1])
print(string_num)
output:
{'string_9': {'num_b'}, 'string_1': {'num_a', 'num_m'}, 'string_2': {'num_b'}}

Pythonic way to get the index of element from a list of dicts depending on multiple keys

I am very new to python, and I have the following problem. I came up with the following solution. I am wondering whether it is "pythonic" or not. If not, what would be the best solution ?
The problem is :
I have a list of dict
each dict has at least three items
I want to find the position in the list of the dict with specific three values
This is my python example
import collections
import random
# lets build the list, for the example
dicts = []
dicts.append({'idName':'NA','idGroup':'GA','idFamily':'FA'})
dicts.append({'idName':'NA','idGroup':'GA','idFamily':'FB'})
dicts.append({'idName':'NA','idGroup':'GB','idFamily':'FA'})
dicts.append({'idName':'NA','idGroup':'GB','idFamily':'FB'})
dicts.append({'idName':'NB','idGroup':'GA','idFamily':'FA'})
dicts.append({'idName':'NB','idGroup':'GA','idFamily':'FB'})
dicts.append({'idName':'NB','idGroup':'GB','idFamily':'FA'})
dicts.append({'idName':'NB','idGroup':'GB','idFamily':'FB'})
# let's shuffle it, again for example
random.shuffle(dicts)
# now I want to have for each combination the index
# I use a recursive defaultdict definition
# because it permits creating a dict of dict
# even if it is not initialized
def tree(): return collections.defaultdict(tree)
# initiate mapping
mapping = tree()
# fill the mapping
for i,d in enumerate(dicts):
    idFamily = d['idFamily']
    idGroup = d['idGroup']
    idName = d['idName']
    mapping[idName][idGroup][idFamily] = i
# I end up with the mapping providing me with the index within
# list of dicts

Looks reasonable to me, but perhaps a little too much. You could instead do:
mapping = {
    (d['idName'], d['idGroup'], d['idFamily']) : i
    for i, d in enumerate(dicts)
}
Then access it with mapping['NA', 'GA', 'FA'] instead of mapping['NA']['GA']['FA']. But it really depends how you're planning to use the mapping. If you need to be able to take mapping['NA'] and use it as a dictionary then what you have is fine.
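A self-contained sketch of the tuple-keyed variant, using a shortened, unshuffled version of the list from the question (the names index and missing are only for illustration):

```python
# Shortened version of the example list from the question.
dicts = [
    {'idName': 'NA', 'idGroup': 'GA', 'idFamily': 'FA'},
    {'idName': 'NA', 'idGroup': 'GB', 'idFamily': 'FB'},
    {'idName': 'NB', 'idGroup': 'GA', 'idFamily': 'FA'},
]

# One flat dict keyed by the (name, group, family) triple.
mapping = {
    (d['idName'], d['idGroup'], d['idFamily']): i
    for i, d in enumerate(dicts)
}

# mapping['NB', 'GA', 'FA'] is shorthand for mapping[('NB', 'GA', 'FA')].
index = mapping['NB', 'GA', 'FA']

# Unlike the recursive defaultdict, looking up an unknown combination
# does not create new entries; .get() simply returns None.
missing = mapping.get(('NB', 'GB', 'FB'))

print(index, missing)
```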

How to compare an element of a tuple (int) to determine if it exists in a list

I have the two following lists:
# List of tuples representing the index of resources and their unique properties
# Format of (ID,Name,Prefix)
resource_types=[('0','Group','0'),('1','User','1'),('2','Filter','2'),('3','Agent','3'),('4','Asset','4'),('5','Rule','5'),('6','KBase','6'),('7','Case','7'),('8','Note','8'),('9','Report','9'),('10','ArchivedReport',':'),('11','Scheduled Task',';'),('12','Profile','<'),('13','User Shared Accessible Group','='),('14','User Accessible Group','>'),('15','Database Table Schema','?'),('16','Unassigned Resources Group','#'),('17','File','A'),('18','Snapshot','B'),('19','Data Monitor','C'),('20','Viewer Configuration','D'),('21','Instrument','E'),('22','Dashboard','F'),('23','Destination','G'),('24','Active List','H'),('25','Virtual Root','I'),('26','Vulnerability','J'),('27','Search Group','K'),('28','Pattern','L'),('29','Zone','M'),('30','Asset Range','N'),('31','Asset Category','O'),('32','Partition','P'),('33','Active Channel','Q'),('34','Stage','R'),('35','Customer','S'),('36','Field','T'),('37','Field Set','U'),('38','Scanned Report','V'),('39','Location','W'),('40','Network','X'),('41','Focused Report','Y'),('42','Escalation Level','Z'),('43','Query','['),('44','Report Template ','\\'),('45','Session List',']'),('46','Trend','^'),('47','Package','_'),('48','RESERVED','`'),('49','PROJECT_TEMPLATE','a'),('50','Attachments','b'),('51','Query Viewer','c'),('52','Use Case','d'),('53','Integration Configuration','e'),('54','Integration Command f'),('55','Integration Target','g'),('56','Actor','h'),('57','Category Model','i'),('58','Permission','j')]
# This is a list of resource ID's that we do not want to reference directly, ever.
unwanted_resource_types=[0,1,3,10,11,12,13,14,15,16,18,20,21,23,25,27,28,32,35,38,41,47,48,49,50,57,58]
I'm attempting to compare the two in order to build a third list containing the 'Name' of each unique resource type that currently exists in unwanted_resource_types. e.g. The final result list should be:
result = ['Group','User','Agent','ArchivedReport','ScheduledTask','...','...']
I've tried the following that (I thought) should work:
result = []
for res in resource_types:
    if res[0] in unwanted_resource_types:
        result.append(res[1])
and when that failed to populate result I also tried:
result = []
for res in resource_types:
    for type in unwanted_resource_types:
        if res[0] == type:
            result.append(res[1])
also to no avail. Is there something I'm missing? I believe this would be the right place to use a list comprehension, but that's still in my grey basket of things I don't fully understand (the Python docs are a bit too succinct for me in this case).
I'm also open to completely rethinking this problem, but I do need to retain the list of tuples as it's used elsewhere in the script. Thank you for any assistance you may provide.

Your resource types are using strings, and your unwanted resources are using ints, so you'll need to do some conversion to make it work.
Try this:
result = []
for res in resource_types:
    if int(res[0]) in unwanted_resource_types:
        result.append(res[1])
or using a list comprehension:
result = [item[1] for item in resource_types if int(item[0]) in unwanted_resource_types]

The numbers in resource_types are numbers contained within strings, whereas the numbers in unwanted_resource_types are plain numbers, so your comparison is failing. This should work:
result = []
for res in resource_types:
    if int( res[0] ) in unwanted_resource_types:
        result.append(res[1])

The problem is that your triples contain strings and your unwanted resources contain numbers, change the data to
resource_types=[(0,'Group','0'), ...
or use int() to convert the strings to ints before comparison, and it should work. Your result can be computed with a list comprehension as in
result=[rt[1] for rt in resource_types if int(rt[0]) in unwanted_resource_types]
If you change ('0', ...) into (0, ... you can leave out the int() call.
Additionally, you may change the unwanted_resource_types variable into a set, like
unwanted_resource_types=set([0,1,3, ... ])
to improve speed (if speed is an issue, else it's unimportant).
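Putting the two suggestions together, a sketch with a shortened, hypothetical subset of the data from the question could look like this:

```python
# Shortened subset of the data from the question, for illustration only.
resource_types = [('0', 'Group', '0'), ('1', 'User', '1'), ('2', 'Filter', '2'),
                  ('3', 'Agent', '3'), ('4', 'Asset', '4')]

# A set gives average O(1) membership tests; a list is scanned linearly.
unwanted_resource_types = {0, 1, 3}

# The IDs are stored as strings, so convert them before comparing with ints.
result = [name for res_id, name, prefix in resource_types
          if int(res_id) in unwanted_resource_types]

print(result)
```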

The one-liner:
result = map(lambda x: dict(map(lambda a: (int(a[0]), a[1]), resource_types))[x], unwanted_resource_types)
without any explicit loop does the job.
Ok - you don't want to use this in production code - but it's fun. ;-)
Comment:
The inner dict(map(lambda a: (int(a[0]), a[1]), resource_types)) creates a dictionary from the input data:
{0: 'Group', 1: 'User', 2: 'Filter', 3: 'Agent', ...
The outer map chooses the names from the dictionary.

Python dictionary error

In the below code d_arr is an array of dictionaries
def process_data(d_arr):
    flag2 = 0
    for dictionaries in d_arr:
        for k in dictionaries:
            if ( k == "*TYPE" ):
                """ Here we determine the type """
                if (dictionaries[k].lower() == "name"):
                    dictionaries.update({"type" : 0})
                    func = name(dictionaries)
                    continue
                elif (dictionaries[k].lower() == "ma"):
                    dictionaries.update({"type" : 1})
                    func = DCC(dictionaries)
                    logging.debug(type(func))
                    continue
When the above runs, I get an error saying
for k in dictionaries:
RuntimeError: dictionary changed size during iteration
2010-08-02 05:26:44,167 DEBUG Returning
Is it forbidden to do something like this?

It is, indeed, forbidden. Moreover, you don't really need a loop over all keys here, given that the weirdly named dictionaries appears to be a single dict; rather than the for k in dictionaries: (or the workable for k in dictionaries.keys() that @Triptych's answer suggests), you could use...:
tp = dictionaries.get('*TYPE')
if tp is not None:
    """ Here we determine the type """
    if tp.lower() == 'name':
        dictionaries.update({"type" : 0})
        func = name(dictionaries)
    elif tp.lower() == "ma":
        dictionaries.update({"type" : 1})
        func = DCC(dictionaries)
        logging.debug(type(func))
This is going to be much faster if dictionaries has any considerable length, for you're reaching directly for the one entry you care about, rather than looping over all entries to check each of them for the purpose of seeing if it is the one you care about.
Even if you've chosen to omit part of your code, so that after this start the loop on dictionaries is still needed, I think my suggestion is still preferable because it lets you get any alteration to dictionaries done and over with (assuming of course that you don't keep altering it in the hypothetical part of your code I think you may have chosen to omit;-).

That error is pretty informative; you can't change the size of a dictionary you are currently iterating over.
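The solution is to get the keys all at once and iterate over them. In Python 2, keys() returns a new list, so the loop no longer walks the dictionary itself (in Python 3, keys() is a view and has to be wrapped in list() to get the same effect):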
# Do this
for k in dictionaries.keys():
# Not this
for k in dictionaries:
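On Python 3 the same fix needs an explicit snapshot of the keys. A minimal sketch, in which the dictionary record and its contents are hypothetical stand-ins for one element of d_arr:

```python
# Hypothetical stand-in for one of the dictionaries in d_arr.
record = {"*TYPE": "Name", "value": 42}

# list(record) copies the keys up front, so adding the "type" entry
# inside the loop does not disturb the iteration.
for k in list(record):
    if k == "*TYPE" and record[k].lower() == "name":
        record["type"] = 0

print(record)
```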
