Avoid double lookup when updating dictionary integer members - python

If a dictionary contains something to which you can hold a reference, you can default-or-update it with one dictionary lookup:
d.setdefault('k', []).append(2)
However, modifying dictionary entries in the same manner is not possible if they're numbers:
d.setdefault('k', 0) += 1 # doesn't work
Instead, you need to do two dict lookups, one for read and one for write:
d['a'] = d.get('a', 0) + 1
This doesn't seem like a great idea for dictionaries with a huge number of keys. So, is there a way to do a default-or-update operation on dictionaries containing numbers? Or, phrased another way, what's the most performant way to apply a default-or-update operation on such dictionaries?

A quick test suggests that collections.defaultdict is about 2.5 times faster than your double-lookup (tested on Python 2.6):
>>> import timeit
>>> s1 = "d = dict((str(n), 0) for n in range(1000000))"
>>> timeit.repeat("d['a'] = d.get('a', 0) + 1", setup=s1)
[0.17711305618286133, 0.17411494255065918, 0.17812514305114746]
>>> s2 = """
... from collections import defaultdict
... d = defaultdict(int, ((str(n), 0) for n in range(1000000)))
... """
>>> timeit.repeat("d['a'] += 1", setup=s2)
[0.07185506820678711, 0.07294416427612305, 0.12155508995056152]
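For clarity, here is a minimal sketch of the single-lookup pattern that the timing above exercises (the key 'a' is just an example):
# Minimal sketch: defaultdict(int) supplies 0 for missing keys,
# so the default-or-update happens in a single lookup.
from collections import defaultdict

d = defaultdict(int)
d['a'] += 1    # creates the key with 0, then adds 1
d['a'] += 1
print(d['a'])  # 2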

Related

Hashing bitarray for counting

It seems that, for some reason, a dict cannot de-duplicate keys that are bitarray() objects.
For example:
data = {}
for _ in xrange(10):
    ba = ...generate repeatable bitarrays ...
    data[ba] = 1
print data
{bitarray('11011'): 1, bitarray('11011'): 1, bitarray('11011'): 1, bitarray('01111'): 1, bitarray('11110'): 1, bitarray('11110'): 1, bitarray('01111'): 1, bitarray('01111'): 1, bitarray('11110'): 1, bitarray('11110'): 1}
You can clearly see that duplicates are stored as different keys (e.g. the first two elements), which is weird. What could be the reason?
My goal is simply to count the number of times a bit pattern shows up; dicts are perfect for this, but it seems that bitarray() is somehow opaque to the hashing algorithm.
By the way, I have to use bitarray() because my patterns are 10000+ bits long.
Any other ideas for an efficient way of counting occurrences of bit patterns?
This answer addresses your first point of confusion, regarding duplicate dictionary keys. I assume you're referring to bitarray() from the bitarray module; I haven't used that module myself.
In your example above, you're not actually getting duplicate dictionary keys; they only look like duplicates to the naked eye. For instance:
>>> class X:
...     def __repr__(self):
...         return '"X obj"'
...
>>> x1 = X()
>>> x2 = X()
>>> d = {x1:1, x2:2}
>>> d
{"X obj": 2, "X obj": 1}
But x1 is not equal to x2, hence they're not duplicates; they're distinct objects of class X:
>>> x1 == x2
False
>>> # same as
... id(x1) == id(x2)
False
>>> # same as
... x1 is x2
False
Moreover, because class X defines __repr__, which returns the string representation of its objects, you might think dictionary d has duplicate keys. Again, there are no duplicate keys, nor are the keys of type str: the key of value 1 is one X object and the key of value 2 is another, literally two different objects sharing the single string representation returned by their class's __repr__ method:
>>> # keys are instances of X, not strings
... d
{"X obj": 2, "X obj": 1}
>>> d["X obj"]
KeyError: 'X obj'
>>> d[x1]
1
>>> d[x2]
2
Up to bitarray 0.8.1 (and possibly later), I believe bitarray does not satisfy the hash invariant (objects that compare equal should have equal hashes).
To work around it, you can convert the bit array to its byte representation, as follows:
>>> from bitarray import bitarray
>>> l = [bitarray('11111'), bitarray('11111'), bitarray('11010'), bitarray('11110'), bitarray('11111'), bitarray('11010')]
>>> ht = {}
>>> for x in l: ht[x.tobytes()] = 0
...
>>> for x in l: ht[x.tobytes()] += 1
...
>>> ht
{'\xf8': 3, '\xf0': 1, '\xd0': 2}
Remember that you can get the bitarray back from its byte form using frombytes(bytes). In that case, though, you will have to keep track of the bitarray's size explicitly, because frombytes() returns a bitarray whose length is a multiple of 8.
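As a rough sketch of that round trip (assuming the tobytes()/frombytes() methods used above; the length is stored separately so the padding can be trimmed off):
from bitarray import bitarray

ba = bitarray('11010')
raw, nbits = ba.tobytes(), len(ba)   # keep the byte form plus the true bit length

restored = bitarray()
restored.frombytes(raw)              # comes back padded to 8 bits: 11010000
restored = restored[:nbits]          # trim back to the original 5 bits
assert restored == ba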
If you want to keep the bitarray in the dictionary also:
>>> from bitarray import bitarray
>>> l = [bitarray('11111'), bitarray('11111'), bitarray('11010'), bitarray('11110'), bitarray('11111'), bitarray('11010')]
>>> ht = {}
>>> for x in l: ht[x.tobytes()] = (0, x)
...
>>> for x in l:
...     old_count = ht[x.tobytes()][0]
...     ht[x.tobytes()] = (old_count+1, x)
...
>>> ht
{'\xf8': (3, bitarray('11111')), '\xf0': (1, bitarray('11110')), '\xd0': (2, bitarray('11010'))}
>>> for x,y in ht.iteritems(): print(y)
...
(3, bitarray('11111'))
(1, bitarray('11110'))
(2, bitarray('11010'))
I solved it:
desc = bitarray(res).to01()
if desc in data: data[desc] += 1
else: data[desc] = 1
Gosh, I miss Perl's no-nonsense autovivification :)
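For what it's worth, the same to01() idea can be written with collections.Counter, which gives you the default-or-update behaviour for free (a sketch; 'patterns' is a placeholder for however you generate the bitarrays):
from collections import Counter

data = Counter()
for ba in patterns:          # 'patterns' stands in for your generated bitarrays
    data[ba.to01()] += 1     # missing keys start at 0, so no if/else is needed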

How to hash strings in python to match within 1 character?

I've read about LSH hashing and am wondering what is the best implementation to match strings within 1 character?
test = {'dog':1, 'cat': 2, 'eagle': 3}
test['dog']
>> 1
I would want to also return 1 if I lookup test['dogs'] or test['dogg']. I realize that it would also return 1 if I were to look up "log" or "cog", but I can write a method to exclude those results.
Also, how can I extend this method to general strings, returning a match within X characters?
string1 = "brown dogs"
string2 = "brown doggie"
Assuming only string1 is stored in my dictionary, a lookup for string2 would return string1.
Thanks
Well, you can define the similarity between two strings by the length of the prefix they share ('doga' and 'dogs' share 3 characters, for instance). This is simplistic, but it could fit your needs.
With this assumption, you can define this:
>>> test = {'dog':1, 'cat': 2, 'eagle': 3}
>>> def same_start(s1, s2):
...     ret = 0
...     for i in range(min(len(s1), len(s2))):
...         if s1[i] != s2[i]:
...             break
...         ret += 1
...     return ret
...
>>> def closest_match(s):
...     return max(((k, v, same_start(k, s)) for k, v in test.iteritems()), key=lambda x: x[2])[1]
...
>>> closest_match('dogs') # matches dog
1
>>> closest_match('cogs') # matches cat
2
>>> closest_match('eaogs') # matches eagle
3
>>>
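As a rough sketch of a different approach (within_one_edit and lookup_within_one are made-up helper names, not library functions): instead of shared prefixes, you could test whether two strings are within one edit (insert, delete, or substitute) of each other:
def within_one_edit(s1, s2):
    # True if s1 can be turned into s2 with at most one edit.
    if abs(len(s1) - len(s2)) > 1:
        return False
    if len(s1) == len(s2):
        # same length: allow at most one substituted character
        return sum(a != b for a, b in zip(s1, s2)) <= 1
    if len(s1) > len(s2):
        s1, s2 = s2, s1          # make s1 the shorter string
    i = j = mismatches = 0
    while i < len(s1) and j < len(s2):
        if s1[i] == s2[j]:
            i += 1
        else:
            mismatches += 1      # skip one inserted character in s2
            if mismatches > 1:
                return False
        j += 1
    return True

def lookup_within_one(d, key):
    # Exact hit if possible, otherwise the value of any key within one edit.
    if key in d:
        return d[key]
    for k, v in d.items():
        if within_one_edit(k, key):
            return v
    return None

print(lookup_within_one({'dog': 1, 'cat': 2, 'eagle': 3}, 'dogs'))  # 1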
Maybe you could try using a Soundex function as your dictionary key?
Since your relation is not 1:1, maybe you could define your own dict type with a redefined __getitem__ that returns a list of possible items. Here's what I mean:
class MyDict(dict):
    def __getitem__(self, key):
        l = []
        for k, v in self.items():
            if key.startswith(k):  # or some other comparison method
                l.append(v)
        return l
This is just an idea; probably other dict methods should be redefined too, in order to avoid possible errors or infinite loops. Also, @Emmanuel's answer could be very useful here if you want only one item returned instead of a list, and that way you wouldn't have to redefine everything.
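For what it's worth, a quick usage sketch of that idea (the values in the comments follow from the startswith() rule above):
d = MyDict({'dog': 1, 'cat': 2, 'eagle': 3})
print(d['dogs'])   # [1]  -- 'dogs' starts with the stored key 'dog'
print(d['cats'])   # [2]
print(d['do'])     # []   -- 'do' does not start with any stored key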

Efficiency of dict.get() method

I have just started learning Python, and I'm wondering whether I should use dict.get(key, default_value) or write my own check for it... do they have any differences?
[1st method]:
dict = {}
for c in string:
    if c in dict:
        dict[c] += 1
    else:
        dict[c] = 1
and the other dict.get() method that python provides
for c in string:
    dict[c] = dict.get(c, 0) + 1
Do they differ in efficiency or speed, or are they the same, with the second one just saving a few lines of code?
For this specific case, use either a collections.Counter() or a collections.defaultdict() object instead:
import collections

dct = collections.defaultdict(int)
for c in string:
    dct[c] += 1
or
dct = collections.Counter(string)
Both are subclasses of the standard dict type. The Counter type adds some more helpful functionality, like summing two counters or listing the most common elements that have been counted. The defaultdict class can also be given other default factories; use defaultdict(list), for example, to collect things into lists per key.
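As a hedged illustration of those extras (the outputs in the comments are what I would expect; tie and key order may vary):
from collections import Counter, defaultdict

c = Counter('mississippi')
print(c.most_common(2))          # [('i', 4), ('s', 4)] -- the most frequent letters
print(c + Counter('missouri'))   # counters can be summed element by element

groups = defaultdict(list)       # a defaultdict with a different default factory
for word in ['apple', 'ant', 'bee']:
    groups[word[0]].append(word)
print(dict(groups))              # {'a': ['apple', 'ant'], 'b': ['bee']}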
When you want to compare performance of two different approaches, you want to use the timeit module:
>>> import timeit
>>> def intest(dct, values):
...     for c in values:
...         if c in dct:
...             dct[c] += 1
...         else:
...             dct[c] = 1
...
>>> def get(dct, values):
...     for c in values:
...         dct[c] = dct.get(c, 0) + 1
...
>>> values = range(10) * 10
>>> timeit.timeit('test(dct, values)', 'from __main__ import values, intest as test; dct={}')
22.210275888442993
>>> timeit.timeit('test(dct, values)', 'from __main__ import values, get as test; dct={}')
27.442166090011597
This shows that using in is a little faster.
There is, however, a third option to consider: catching the KeyError exception:
>>> def tryexcept(dct, values):
...     for c in values:
...         try:
...             dct[c] += 1
...         except KeyError:
...             dct[c] = 1
...
>>> timeit.timeit('test(dct, values)', 'from __main__ import values, tryexcept as test; dct={}')
18.023509979248047
which happens to be the fastest, because only 1 in 10 cases are for a new key.
Last but not least, the two alternatives I proposed:
>>> def default(dct, values):
...     for c in values:
...         dct[c] += 1
...
>>> timeit.timeit('test(dct, values)', 'from __main__ import values, default as test; from collections import defaultdict; dct=defaultdict(int)')
15.277361154556274
>>> timeit.timeit('Counter(values)', 'from __main__ import values; from collections import Counter')
38.657804012298584
So the Counter() type is slowest, but defaultdict is very fast indeed. Counter()s do a lot more work though, and the extra functionality can bring ease of development and execution speed benefits elsewhere.

How do I fill a list from many variables in Python

I have some variables and I need to compare each of them and fill three lists according to the comparison: if the var == 1, add a 1 to lista_a; if var == 2, add a 1 to lista_b; and so on. Like:
inx0=2 inx1=1 inx2=1 inx3=1 inx4=4 inx5=3 inx6=1 inx7=1 inx8=3 inx9=1
inx10=2 inx11=1 inx12=1 inx13=1 inx14=4 inx15=3 inx16=1 inx17=1 inx18=3 inx19=1
inx20=2 inx21=1 inx22=1 inx23=1 inx24=2 inx25=3 inx26=1 inx27=1 inx28=3 inx29=1
lista_a=[]
lista_b=[]
lista_c=[]
#this example is the comparison for the first variable inx0
#and the same for inx1, inx2, etc...
for k in range(1,30):
    if inx0==1:
        lista_a.append(1)
    elif inx0==2:
        lista_b.append(1)
    elif inx0==3:
        lista_c.append(1)
I need get:
#lista_a = [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1]
#lista_b = [1,1,1]
#lista_c = [1]
Your inx* variables should almost certainly be a list to begin with:
inx = [2,1,1,1,4,3,1,1,3,1,2,1,1,1,4,3,1,1,3,1,2,1,1,1,2,3,1,1,3,1]
Then, to find out how many 2's it has:
inx.count(2)
If you must, you can build a new list out of that:
list_a = [1]*inx.count(1)
list_b = [1]*inx.count(2)
list_c = [1]*inx.count(3)
but it seems silly to keep a list of ones. Really the only data you need to keep is a single integer (the count), so why bother carrying around a list?
An alternate approach to get the lists of ones would be to use a defaultdict:
from collections import defaultdict
d = defaultdict(list)
for item in inx:
    d[item].append(1)
in this case, what you want as list_a could be accessed by d[1], list_b could be accessed as d[2], etc.
Or, as stated in the comments, you could get the counts using a collections.Counter:
from collections import Counter #python2.7+
counts = Counter(inx)
list_a = [1]*counts[1]
list_b = [1]*counts[2]
...

Using a for statement on 2 elements at once in Python

I have the following list of variables and a mastervariable
a = (1,5,7)
b = (1,3,5)
c = (2,2,2)
d = (5,2,8)
e = (5,5,8)
mastervariable = (3,2,5)
I'm trying to check whether 2 elements in each variable exist in the master variable, such that the above would show b (3, 5) and d (5, 2) as variables with at least 2 elements matching the mastervariable. Also note that using sets would result in c showing up as matching, but I don't want to count c because only one of the elements in c is in mastervariable (i.e. 2 shows up only once in mastervariable, not twice).
I currently have the very inefficient:
if current_variable[0] == mastervariable[0]:
    if current_variable[1] == mastervariable[1]:
        True
    elif current_variable[2] == mastervariable[1]:
        True
#### I don't use OR here because I need to know which variables match.
elif current_variable[1] == mastervariable[0]:  ## <-- I'm now checking the 2nd element
    etc. etc.
I then continue like the above, checking one element at a time, which is extremely inefficient. I did it this way because using a for statement resulted in me checking the first element twice, which was incorrect:
for i in a:
    for j in a:
        ### this checked whether 1 was in the master variable, not 1,5 or 1,7
Is there a way to use two for statements that lets me check 2 elements of a list at once while skipping any element that has already been used? Alternatively, can you suggest an efficient way to do what I'm trying to do?
Edit: Mastervariable can have duplicates in it.
For the case where matching elements can be duplicated, so that sets break down, use Counter as a multiset; the duplicates between a and master are found by:
from collections import Counter

count_a = Counter(a)
count_master = Counter(master)
count_both = count_a + count_master
dups = Counter({e : min((count_a[e], count_master[e])) for e in count_a if count_both[e] > count_a[e]})
The logic is reasonably intuitive: if the combined count of an item is larger than its count in a alone, then the item also occurs in master and is duplicated; its multiplicity is however many of that item are in whichever of a and master has fewer of them.
It gives a Counter of all the duplicates, where the count is their multiplicity. If you want it back as a tuple, you can do tuple(dups.elements()):
>>> a
(2, 2, 2)
>>> master
(1, 2, 2)
>>> count_a = Counter(a)
>>> count_master = Counter(master)
>>> count_both = count_a + count_master
>>> dups = Counter({e : min((count_a[e], count_master[e])) for e in count_a if count_both[e] > count_a[e]})
>>> tuple(dups.elements())
(2, 2)
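As an aside (a sketch, not part of the answer above): Counter already provides multiset intersection through the & operator, which computes the same per-element minimum in one step:
from collections import Counter

a = (2, 2, 2)
master = (1, 2, 2)
dups = Counter(a) & Counter(master)   # per-element minimum of the two counts
print(tuple(dups.elements()))         # (2, 2)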
Seems like a good job for sets. Edit: sets aren't suitable since mastervariable can contain duplicates. Here is a version using Counters.
>>> a = (1,5,7)
>>> b = (1,3,5)
>>> c = (2,2,2)
>>> d = (5,2,8)
>>> e = (5,5,8)
>>> D = dict(a=a, b=b, c=c, d=d, e=e)
>>> from collections import Counter
>>> mastervariable = (5,5,3)
>>> mvc = Counter(mastervariable)
>>> for k,v in D.items():
...     vc = Counter(v)
...     if sum(min(count, vc[item]) for item, count in mvc.items())==2:
...         print k
...
b
e
