I'm creating a recommendation engine for work and ended up with an 8,000 by 8,000 item-item similarity matrix. The matrix is pretty sparse, so I set out to build a dictionary in which each key maps to a sorted list of product recommendations (as tuples of product ID and similarity score). I got this to work, see below.
In [191]: dictionary["15454-M6-ECU2="]
Out[191]:
[('15454-M-TSCE-K9=', 0.8),
('15454-M2-AC=', 0.52),
('15454-M6-DC-RF', 0.45),
('15454-M6-ECU2=', 0.63)]
However, I now have a problem in interpreting the result:
In [204]: sys.getsizeof(dictionary)
Out[204]: 786712
In [205]: sys.getsizeof(similarity_matrix)
Out[205]: 69168
I eliminated a ton of zeros (each of which was stored as a 32- or 64-bit number), so why did the reported object size increase even though I removed all that redundant data?
sys.getsizeof only returns the size of the container, not the container plus the size of the items inside it. A dict reports the same size regardless of how big its values are, and it's still only about 98 bytes per key/value pair: it stores a reference to the key and a reference to the value, plus overhead for the hash buckets.
>>> sys.getsizeof(dict((i,'a'*10000) for i in range(8000)))
786712
>>> sys.getsizeof(dict((i,'a'*1) for i in range(8000)))
786712
>>> 786712/8000
98
A tuple is much smaller, since it stores only the references themselves.
>>> sys.getsizeof(tuple((i,'a'*10000) for i in range(8000)))
64056
>>> sys.getsizeof(tuple((i,'a'*1) for i in range(8000)))
64056
>>> 64056/8000
8
Judging by the size of your dictionary, it seems that you have one key/value pair for every possible key, even keys with no similar items.
I imagine your code looks something like this:
# initialise sparse dict with one empty list of similar nodes for each node
sparse_dict = dict((key, []) for key in range(1000))
sparse_dict[0].append((2, 0.5)) # 0 is similar to 2 by 50%
def get_similarity(d, x, y):
    for key, value in d[x]:
        if key == y:
            return value
    return 0
assert get_similarity(sparse_dict, 0, 1) == 0
assert get_similarity(sparse_dict, 0, 2) == 0.5
However, using the get method of a dict, you can implement even sparser dictionaries:
# initialise empty mapping -- literally an empty dict
very_sparse_dict = {}
very_sparse_dict[0] = [(2, 0.5)] # 0 is similar to 2 by 50%
def get_similarity2(d, x, y):
    for key, value in d.get(x, ()):
        if key == y:
            return value
    return 0
# 0 not linked to 1, so 0% similarity
assert get_similarity2(very_sparse_dict, 0, 1) == 0
# 0 and 2 are similar
assert get_similarity2(very_sparse_dict, 0, 2) == 0.5
# 1 not similar to anything as it is not even present in the dict
assert get_similarity2(very_sparse_dict, 1, 2) == 0
And the size of each dict is:
>>>> print("sparse_dict:", sys.getsizeof(sparse_dict))
sparse_dict: 49248
>>> print("very_sparse_dict", sys.getsizeof(very_sparse_dict))
very_sparse_dict: 288
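To connect this back to your matrix, here is a minimal sketch of building such a very sparse mapping straight from a dense matrix, keeping only the nonzero entries (recs and similarity_matrix are assumed names, and I use row indices where your real dict uses product IDs):
recs = {}
for i, row in enumerate(similarity_matrix):
    # keep only the nonzero, off-diagonal similarities for this row
    neighbours = [(j, sim) for j, sim in enumerate(row) if j != i and sim > 0]
    if neighbours:
        recs[i] = neighbours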
I am working on a project that needs to decide which ID is most likely. Let me explain with an example.
I have 3 dictionaries that contain IDs and their scores.
Ex: d1 = {74701: 3, 90883: 2}
I assign percentage scores like this:
d1_p = {74701: 60.0, 90883: 40.0}
Here each percentage is (value of the key in d1) / (total sum of values).
Similarly I have 2 other dictionaries:
d2 = {90883: 2, 74701: 2}, d2_p = {90883.0: 50.0, 74701.0: 50.0}
d3 = {75853: 2}, d3_p = {75853: 100.0}
My task is to give a composite score to each ID from the above 3 dictionaries and decide a winner by taking the highest score. How would I mathematically assign a composite score between 0 and 100 to each of these IDs?
Ex: in the above case, 74701 should be the clear winner.
I tried averaging, but it fails, because I need to give more weight to IDs that occur in multiple dictionaries.
Ex: say 74701 had the majority in d1 and d2 with percentages 30 and 40. Its average would be (30+40+0)/3 = 23.33, while 75853, which occurs only once at 100%, would get (100+0+0)/3 = 33.33 and be declared the winner, which is wrong.
Can someone suggest a good mathematical way (ideally with Python code) to assign such a score and decide the majority?
Since your main goal is to analyze frequency, instead of trying to create a global score from separate dictionaries, I would suggest summarizing all the data into a single dictionary, which is less error prone in general. Say I have 3 dictionaries:
a = {1: 2, 2: 3}
b = {2: 4, 3: 5}
c = {3: 4, 4: 9}
You could summarize these three dictionaries into one by summing the values for each key:
result = {1: 2, 2: 7, 3: 9, 4: 9}
That could be easily achieved by using Counter:
from collections import Counter
result = Counter(a)
result.update(Counter(b))  # Counter.update adds counts instead of replacing values
result.update(Counter(c))
result = dict(result)
Which would yield the desired summary. If you want different weights for each dictionary, that can be done in a similar fashion. The takeaway is that you should not try to extract information from the dictionaries as separate entities, but instead merge them into one statistic.
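For example, here is a minimal sketch of the weighted variant (the weights are made up for illustration):
from collections import Counter

# one weight per source dict; illustrative values only
weighted = [(a, 1.0), (b, 0.5), (c, 2.0)]
result = Counter()
for d, w in weighted:
    for key, value in d.items():
        result[key] += w * value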
Think of the data in a tabular way: for each game/match/whatever, each ID gets
a certain number of points. If you care most about the overall point total for
the entire sequence of games (the entire "season", so to speak), then add up
the points to determine a winner (and then scale everything down/up to 0 to
100).
        74701   90883   75853
-----------------------------
1           3       2       0
2           2       2       0
3           0       0       2
Total       5       4       2
Alternatively, we can express those same scores in percentage terms per game.
Again, every ID must be given a value. In this case, we need to average the
percentages -- all of them, including the zeros:
        74701   90883   75853
-----------------------------
1          .6      .4       0
2          .5      .5       0
3           0       0     1.0
Avg       .37     .30     .33
Both approaches could make sense, depending on the context. And both also
declare 74701 to be the winner, as desired. But notice that they give different
results for 2nd and 3rd place. Such differences occur because the two systems
prioritize different things. You need to decide which approach you prefer.
Either way, the first step is to organize the data better. It seems more
convenient to have all scores or percentages for each ID, so you can do the
needed math with them: that sounds like a dict mapping IDs to lists of scores
or percentages.
# Put the data into one collection.
d1 = {74701: 3, 90883: 2}
d2 = {90883: 2, 74701: 2}
d3 = {75853: 2}
raw_scores = [d1, d2, d3]
# Find all IDs.
ids = tuple(set(i for d in raw_scores for i in d))
# Total points/scores for each ID.
points = {
    i: [d.get(i, 0) for d in raw_scores]
    for i in ids
}
# If needed, use that dict to create a similar dict for percentages. Or you
# could create a dict with the same structure holding *both* point totals and
# percentages. Just depends on the approach you pick.
game_totals = [sum(d.values()) for d in raw_scores]
pcts = {}
for i, scores in points.items():
    # divide by each game's total (not the ID's own total) to get per-game percentages
    pcts[i] = [sc / tot for sc, tot in zip(scores, game_totals)]
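From there, picking a winner under either scheme is straightforward (a sketch; the variable names are mine):
# Scheme 1: total points. Scheme 2: average per-game percentage.
totals = {i: sum(scores) for i, scores in points.items()}
avg_pcts = {i: sum(ps) / len(ps) for i, ps in pcts.items()}
winner_by_total = max(totals, key=totals.get)    # 74701
winner_by_avg = max(avg_pcts, key=avg_pcts.get)  # 74701
Scaling either metric to the 0-100 range is then just a matter of dividing by its maximum and multiplying by 100.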
The Python code is:
d = {0, 1, 2}
for x in d:
    print(d.add(x))
What is the output, and why?
The output is just
None
None
None
This is because d.add(x) adds x to the set and returns None.
Let's understand this step by step:
(1) d is a set with the elements {0, 1, 2}.
(2) A set is a data structure that holds only unique values.
(3) The loop
for x in d:
    print(d.add(x))
takes each element of the set and adds it back to the set itself via d.add(x), which returns nothing.
(4) You get the output None three times because the loop runs three times.
Since d is a set, adding its own elements back leaves it unchanged (duplicate entries are discarded).
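A quick way to see this for yourself (a minimal sketch):
d = {0, 1, 2}
result = d.add(3)  # add() mutates the set in place
print(result)      # None -- mutating methods conventionally return None
print(d)           # {0, 1, 2, 3}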
Heyo everyone, I have a question.
I have three variables, rF, tF, and dF.
Now these values can range from -100 to +100. I want to check all of them and see if they are less than 1; if they are, set them to 1.
An easy way of doing this is just 3 if statements, like
if rF < 1:
    rF = 1
if tF < 1:
    tF = 1
if dF < 1:
    dF = 1
However, as you can see, this looks bad, and if I had, say, 50 of these values, it could get out of hand quite easily.
I tried to put them in an array like so:
for item in [rF, tF, dF]:
    if item < 1:
        item = 1
However, this doesn't work. Rebinding the loop variable only changes what the name item refers to; it doesn't change rF, tF, or dF (or even the contents of the list).
So my question is: What is an elegant way of doing this?
Why not use a dictionary, if you've only got three variables of which to keep track?
rF, tF, dF = 100, -100, 1
d = {'rF': rF, 'tF': tF, 'dF': dF}
for k in d:
    if d[k] < 1:
        d[k] = 1
print(d)
{'rF': 100, 'tF': 1, 'dF': 1}
Then if you're referencing any of those values later, you can simply do this (as a trivial example):
def f(var):
print("'%s' is equal to %d" % (var, d[var]))
>>> f('rF')
'rF' is equal to 100
If you really wanted to use lists, and you knew the order of your list, you could do this (but dictionaries are made for this type of problem):
arr = [rF, tF, dF]
arr = [1 if x < 1 else x for x in arr]
print(arr)
[100, 1, 1]
Note that the list comprehension approach won't actually change the values of rF, tF, and dF.
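If you do want the names themselves rebound, you can unpack the comprehension back into them (a one-line sketch in the same style):
rF, tF, dF = [1 if x < 1 else x for x in (rF, tF, dF)]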
You can simply use a dictionary and then unpack the dict:
d = {'rF': rF, 'tF': tF, 'dF': dF}
for key in d:
    if d[key] < 1:
        d[key] = 1
rF, tF, dF = d['rF'], d['tF'], d['dF']
You can use the following instead of the last line:
rF, tF, dF = map(d.get, ('rF', 'tF', 'dF'))
Here's exactly what you asked for:
rF = -3
tF = 9
dF = -2
myenv = locals()
for k in list(myenv.keys()):
    if len(k) == 2 and k[1] == "F":
        myenv[k] = max(1, myenv[k])
print(rF, tF, dF)
# prints 1 9 1
This may accidentally modify variables you don't really want to change, so I recommend using a proper data structure instead of hacking the user environment. Note also that writing through locals() is only reliable at module scope, where locals() is the same dict as globals(); inside a function, assignments to locals() generally have no effect.
Edit: fixed a RuntimeError: dictionary changed size during iteration. A dictionary cannot change size while it is being iterated over. The fix is to copy the keys first and iterate over that copy rather than the live dictionary. The code now works in Python 2 and 3; before, it was Python 2 only.
Use a list comprehension and the max function.
items = [-32, 0, 43]
items = [max(1, item) for item in items]
rF, tF, dF = items
print(rF, tF, dF)
It seems that, for some reason, a dict can end up holding bitarray() keys that look like duplicates. For example:
data = {}
for _ in xrange(10):
    ba = ...  # generate repeatable bitarrays
    data[ba] = 1
print data
{bitarray('11011'): 1, bitarray('11011'): 1, bitarray('11011'): 1, bitarray('01111'): 1, bitarray('11110'): 1, bitarray('11110'): 1, bitarray('01111'): 1, bitarray('01111'): 1, bitarray('11110'): 1, bitarray('11110'): 1}
You can clearly see that duplicates are stored as different keys (e.g. the first two elements), which is weird. What could be the reason?
My goal is simply to count the number of times each bit pattern shows up, and of course dicts are perfect for this, but bitarray() seems to be opaque to the hashing algorithm.
By the way, I have to use bitarray(), because my patterns are 10,000+ bits long.
Any other ideas for an efficient way of counting occurrences of bit patterns?
This answer addresses your first confusion, the apparently duplicate dictionary keys. I assume you're referring to bitarray() from the bitarray module; I've not used this module myself.
In your example above, you're not actually getting duplicate dictionary keys. They may look like duplicates, but they are duplicates to the naked eye only. For instance:
>>> class X:
...     def __repr__(self):
...         return '"X obj"'
...
>>> x1 = X()
>>> x2 = X()
>>> d = {x1:1, x2:2}
>>> d
{"X obj": 2, "X obj": 1}
But x1 is not equal to x2, and hence they're not duplicates; they're distinct objects of class X:
>>> x1 == x2
False
>>> id(x1) == id(x2)  # same as above
False
>>> x1 is x2  # same as above
False
Moreover, because class X defines __repr__, which returns the string representation of its objects, you might think dictionary d has duplicate keys. Again, there are no duplicate keys, nor are the keys of type str: the key for value 1 is one X object and the key for value 2 is another, literally two different objects that share the single string representation returned by their class's __repr__ method:
>>> d  # keys are instances of X, not strings
{"X obj": 2, "X obj": 1}
>>> d["X obj"]
KeyError: 'X obj'
>>> d[x1]
1
>>> d[x2]
2
As of bitarray 0.8.1 (and possibly later), I believe bitarray does not satisfy the hash invariant: objects that compare equal should have equal hashes.
To work around it, convert the bitarray to bytes first, as follows.
>>> from bitarray import bitarray
>>> l = [bitarray('11111'), bitarray('11111'), bitarray('11010'), bitarray('11110'), bitarray('11111'), bitarray('11010')]
>>> ht = {}
>>> for x in l: ht[x.tobytes()] = 0
...
>>> for x in l: ht[x.tobytes()] += 1
...
>>> ht
{'\xf8': 3, '\xf0': 1, '\xd0': 2}
Remember that you can get the bitarray back from its byte form with frombytes(bytes). In that case, you will have to keep track of the bitarray's length explicitly, since the result is padded to a length that is a multiple of 8.
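For instance, a quick round trip that shows the padding (a sketch in the same interpreter style as above):
>>> b = bitarray('11010')
>>> restored = bitarray()
>>> restored.frombytes(b.tobytes())
>>> restored            # padded out to a whole byte
bitarray('11010000')
>>> restored[:5]        # the original length must be tracked separately
bitarray('11010')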
If you want to keep the bitarray in the dictionary also:
>>> from bitarray import bitarray
>>> l = [bitarray('11111'), bitarray('11111'), bitarray('11010'), bitarray('11110'), bitarray('11111'), bitarray('11010')]
>>> ht = {}
>>> for x in l: ht[x.tobytes()] = (0, x)
...
>>> for x in l:
...     old_count = ht[x.tobytes()][0]
...     ht[x.tobytes()] = (old_count + 1, x)
...
>>> ht
{'\xf8': (3, bitarray('11111')), '\xf0': (1, bitarray('11110')), '\xd0': (2, bitarray('11010'))}
>>> for x,y in ht.iteritems(): print(y)
...
(3, bitarray('11111'))
(1, bitarray('11110'))
(2, bitarray('11010'))
I solved it:
desc = bitarray(res).to01()
if desc in data:
    data[desc] += 1
else:
    data[desc] = 1
gosh, I miss Perl's no-nonsense autovivification :)
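For what it's worth, collections.Counter wraps this counting pattern up for you (a sketch; patterns stands in for your generated bitarrays):
from collections import Counter
from bitarray import bitarray

patterns = [bitarray('11011'), bitarray('11011'), bitarray('01111')]
counts = Counter(b.to01() for b in patterns)
print(counts)  # Counter({'11011': 2, '01111': 1})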
Suppose we have the following set, S, and the value v:
S = {(0,1),(2,3),(4,5)}
v = 3
I want to test if v is the second element of any of the pairs within the set. My current approach is:
for _, y in S:
    if y == v:
        return True
return False
I don't really like this, as I have to put it in a separate function, and something tells me there's probably a nicer way to do it. Can anyone shed some light?
The any function is tailor-made for this:
any(y == v for _, y in S)
If you have a large set that doesn't change often, you might want to project the y values onto a set.
yy = set(y for _, y in S)
v in yy
Of course, this is only of benefit if you compute yy once after S changes, not before every membership test.
You can't do an O(1) lookup, so you don't get much benefit from having a set. You might consider building a second set, especially if you'll be doing lots of lookups.
S = {(0,1), (2,3), (4,5)}
T = {x[1] for x in S}
v = 3
if v in T:
    # do something
The trivial answer is any (see Marcelo's answer).
An alternative is zip:
>>> zip(*S)
[(4, 0, 2), (5, 1, 3)]
>>> v in zip(*S)[1]
True
Note that this is Python 2 output; in Python 3, zip returns an iterator, so you would need v in list(zip(*S))[1].