Histogram of list entries - python

I have a number of lists as follows:
list1 = ['a_1','a_2','b_17','c_19']
list2 = ['aa_1','a_12','b_15','d_39']
list3 = ['a_1','a_200','ba_1','u_0']
I wish to generate a histogram based on the labels, ignoring the numbering; that is, a has 4 entries over all the lists, ba has 1 entry, u has 1 entry, and so on. The labels are file names from a specific folder, before the numbers are added, so they form a finite, known list.
How can I perform such a count without a bunch of ugly loops? Can I use unique here, somehow?

You cannot achieve it without a loop, but you can use a list comprehension to reduce it to a single line. Something like this:
list1 = ['a_1','a_2','b_17','c_19']
list2 = ['aa_1','a_12','b_15','d_39']
list3 = ['a_1','a_200','ba_1','u_0']
lst = [x.split('_')[0] for x in (list1 + list2 + list3)]
print({x: lst.count(x) for x in lst})
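If calling lst.count once per element becomes slow on long lists, one possible alternative is collections.Counter from the standard library, which tallies the prefixes in a single pass. A minimal sketch, reusing the lists from the question (the name histogram is only illustrative):

```python
from collections import Counter
from itertools import chain

list1 = ['a_1', 'a_2', 'b_17', 'c_19']
list2 = ['aa_1', 'a_12', 'b_15', 'd_39']
list3 = ['a_1', 'a_200', 'ba_1', 'u_0']

# Strip the numeric suffix from each entry and tally the prefixes in one pass.
histogram = Counter(
    entry.split('_')[0] for entry in chain(list1, list2, list3)
)

print(dict(histogram))
# {'a': 5, 'b': 2, 'c': 1, 'aa': 1, 'd': 1, 'ba': 1, 'u': 1}
```

The loop is still there, but it lives inside Counter rather than in your own code.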

You can use a defaultdict initialized to 0 to count the occurrence and get a nice container with the required information.
So, define the container:
from collections import defaultdict
histo = defaultdict(int)
I'd split the operation into functions.
First, get the prefix from the string, to be used as the key in the dictionary:
def get_prefix(string):
    return string.split('_')[0]
This works like
get_prefix('az_1')
#=> 'az'
Then a function to update the dictionary:
def count_elements(lst):
    for e in lst:
        histo[get_prefix(e)] += 1
Finally you can call this way:
count_elements(list1)
count_elements(list2)
count_elements(list3)
dict(histo)
#=> {'a': 5, 'b': 2, 'c': 1, 'aa': 1, 'd': 1, 'ba': 1, 'u': 1}
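Or, starting from an empty histo (the counts accumulate across calls), directly:
count_elements(list1 + list2 + list3)
To count each distinct entry only once, clear histo first and pass a set:
histo.clear()
count_elements(set(list1 + list2 + list3))
dict(histo)
#=> {'ba': 1, 'a': 4, 'aa': 1, 'b': 2, 'u': 1, 'd': 1, 'c': 1}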
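Because histo is a module-level container, repeated calls keep adding to the same counts. A small sketch of a variant that returns a fresh dictionary on every call may be easier to reuse; the function name count_prefixes and its unique flag are hypothetical, not part of the answer above:

```python
from collections import defaultdict

def count_prefixes(entries, unique=False):
    """Return a new prefix -> count dict; `unique` drops duplicate entries first."""
    if unique:
        entries = set(entries)
    histo = defaultdict(int)
    for entry in entries:
        histo[entry.split('_')[0]] += 1
    return dict(histo)

list1 = ['a_1', 'a_2', 'b_17', 'c_19']
list2 = ['aa_1', 'a_12', 'b_15', 'd_39']
list3 = ['a_1', 'a_200', 'ba_1', 'u_0']
all_entries = list1 + list2 + list3

print(count_prefixes(all_entries)['a'])               # 5
print(count_prefixes(all_entries, unique=True)['a'])  # 4
```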

Related

Strange behaviour at key substitution in a dictionary

I have a dictionary with keys as single characters. I want to substitute the upper-cased characters with doubled versions of them.
For example, I have this structure:
x = 'AbCDEfGH'
a = dict(zip(list(x), range(len(x))))
print(a)
which creates this dictionary:
{'A': 0, 'b': 1, 'C': 2, 'D': 3, 'E': 4, 'f': 5, 'G': 6, 'H': 7}
The values don't matter, so I just use some integers. What I want is to substitute the upper-cased keys with double characters, so that I get this:
{'AA': 0, 'b': 1, 'CC': 2, 'DD': 3, 'EE': 4, 'f': 5, 'GG': 6, 'HH': 7}
So, I tried the following in-place substitution:
for k, v in a.items():
    if k.isupper():
        a[k+k] = a.pop(k)
print(a)
But this, strangely, results in:
{'b': 1, 'E': 4, 'f': 5, 'G': 6, 'CCCCCCCCCCCCCCCC': 2, 'DDDDDDDDDDDDDDDD': 3, 'HHHHHHHHHHHHHHHH': 7, 'AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA': 0}
Even stranger, if I set all keys to upper-case:
y = 'ABCDEFGH'
a = dict(zip(list(y), range(len(y))))
for k, v in a.items():
    if k.isupper():
        a[k+k] = a.pop(k)
print(a)
it yields:
{'D': 3, 'E': 4, 'F': 5, 'CCCCCCCC': 2, 'GGGGGGGG': 6, 'HHHHHHHH': 7, 'AAAAAAAAAAAAAAAA': 0, 'BBBBBBBBBBBBBBBB': 1}
What is happening? I see the keys are being repeated in powers of 2. But why?
I don't really care about the order of the items, but I see some aren't even being changed.
Is there any other way to substitute the keys the way I intend to?
.items() returns a live view of the underlying dict contents. Mutating the dict while iterating it causes unpredictable effects, usually leading to some keys being processed more than once (thus some keys doubling multiple times), while others aren't processed at all. Python tries to defend you from this by raising a RuntimeError if the dict changes size during iteration, but your code is keeping a constant size at the time of the check (when the next item is requested from the iterator), so Python's cheap length check doesn't save you.
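To illustrate the size check mentioned above, here is a minimal sketch with a made-up two-key dictionary. Adding a key without removing one changes the size, so CPython raises as soon as the next item is requested:

```python
d = {'a': 1, 'b': 2}
message = None
try:
    for k in d:
        d[k + k] = d[k]  # adds a key, so the dict grows mid-iteration
except RuntimeError as exc:
    message = str(exc)

print(message)  # dictionary changed size during iteration
```

The pop-then-insert pattern in the question removes one key for every key it adds, which is why this check never fires there.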
The minimal fix is to make your loop run over a snapshot of the items:
for k, v in tuple(a.items()):
A simpler solution is a dict comprehension though:
a = {k*2 if k.isupper() else k: v for k, v in a.items()}
That builds a whole new dict with the doubled keys before reassigning a, so no mutation issues occur. You could build a in one fell swoop for that matter, just by doing:
a = {let*2 if let.isupper() else let: i for i, let in enumerate(x)}
No need to listify x (strings already iterate by character) and enumerate can take care of numbering the values for you without needing zip, range or len at all.
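For completeness, the snapshot fix written out in full might look like the sketch below. It iterates over tuple(a), a frozen copy of the keys, since the values are not needed inside the loop:

```python
x = 'AbCDEfGH'
a = dict(zip(x, range(len(x))))

# tuple(a) copies the keys up front, so mutating `a` cannot disturb the loop.
for k in tuple(a):
    if k.isupper():
        a[k + k] = a.pop(k)

print(sorted(a.items(), key=lambda item: item[1]))
# [('AA', 0), ('b', 1), ('CC', 2), ('DD', 3), ('EE', 4), ('f', 5), ('GG', 6), ('HH', 7)]
```

Note that the moved keys end up at the end of the dictionary's insertion order, which is why the output above is sorted by value.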

Using other dictionary values to define a dictionary value during initialization

Say I have three variables which I want to store in a dictionary such that the third is the sum of the first two. Is there a way to do this in one call when the dictionary is initialized? For example:
myDict = {'a': 1, 'b': 2, 'c': myDict['a'] + myDict['b']}
Python >= 3.8's assignment expressions (the := "walrus" operator) allow something like the following, which I guess you could interpret as one call:
>>> md = {**(md := {'a': 2, 'b': 3}), **{'c': md['a'] + md['b']}}
>>> md
{'a': 2, 'b': 3, 'c': 5}
But this is really just a fanciful way of forcing a two-liner into a single line and making it less readable and less memory-efficient (because of the intermediate dicts). Also note that the md used on the right hand side of the = really could be any name.
You could actually be a little more efficient and get rid of one spurious auxiliary dict:
(md := {'a': 2, 'b': 3}).update({'c': md['a'] + md['b']})
You can do:
>>> myDict = {'a': 1, 'b': 2}
>>> myDict["c"] = myDict["a"] + myDict["b"]
>>> myDict
{'a': 1, 'b': 2, 'c': 3}
You cannot do this inside a single dict literal, because myDict does not exist yet when the value for 'c' is evaluated.
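If the aim is mainly to avoid repeating the literal values, one option (a sketch, not the only approach) is to bind the first two values to ordinary names before building the dictionary. The names a and b are illustrative:

```python
# Bind the inputs once, then reuse them for both the entries and the sum.
a, b = 1, 2
myDict = {'a': a, 'b': b, 'c': a + b}

print(myDict)  # {'a': 1, 'b': 2, 'c': 3}
```

Keep in mind that 'c' is computed once; it will not update if 'a' or 'b' is changed later.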

Finding item frequency in list of lists

Let's say I have a list of lists and I want to find the frequency in which pairs (or more) of elements appears in total.
For example, if I have [[a,b,c],[b,c,d],[c,d,e]]
I want: (a,b) = 1, (b,c) = 2, (c,d) = 2, etc.
I tried finding a usable apriori algorithm that would allow me to do this, but I couldn't find one in Python that was easy to implement.
How would I approach this problem in a better way?
This is a way to do it:
from itertools import combinations
l = [['a','b','c'],['b','c','d'],['c','d','e']]
d = {}
for i in l:
    # for every item in l, take all the possible combinations of 2
    comb = combinations(i, 2)
    for c in comb:
        k = ''.join(c)
        if d.get(k):
            d[k] += 1
        else:
            d[k] = 1
Result:
>>> d
{'bd': 1, 'ac': 1, 'ab': 1, 'bc': 2, 'de': 1, 'ce': 1, 'cd': 2}
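A possible variation uses collections.Counter with tuple keys, which stay unambiguous when elements are longer than one character (joining 'ab' and 'c' gives the same string as joining 'a' and 'bc'). The sketch below assumes that order and duplicates within a sublist do not matter, hence sorted(set(sub)). The function name group_counts and its size parameter are hypothetical:

```python
from collections import Counter
from itertools import combinations

def group_counts(lists, size=2):
    """Count how many sublists contain each group of `size` distinct elements."""
    return Counter(
        group
        for sub in lists
        for group in combinations(sorted(set(sub)), size)
    )

l = [['a', 'b', 'c'], ['b', 'c', 'd'], ['c', 'd', 'e']]

pairs = group_counts(l)
print(pairs[('a', 'b')])  # 1
print(pairs[('b', 'c')])  # 2
print(group_counts(l, size=3)[('b', 'c', 'd')])  # 1
```

Passing a larger size covers the "or more" part of the question without changing the code.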

How to store parts of a string into a dictionary

For example I have
from collections import Counter
cnt = Counter()
text = 'CTGGAT'
def freqWords(text, k):
    for i in text:
        cnt[i] += 1
    print(cnt)
Outputs: Counter({'A': 10, 'C': 9, 'T': 8, 'G': 4})
Which returns a nice dictionary, however, I want to store my items by the value of k. Like so, if k=2, then the dict will populate with the values of:
CT, TG, GG, GA, AT. If k=3 then: CTG, TGG, GGA, GAT.
Your for i in text iterates over the characters of text. Instead, iterate over every start index from 0 to len(text) - k inclusive and take the substring of length k that starts there:
def freqWords(text, k):
    return Counter(text[i:i+k] for i in range(len(text) - k + 1))
It works like this:
freqWords('CTGGAT', 2)
# Counter({'CT': 1, 'TG': 1, 'GG': 1, 'GA': 1, 'AT': 1})

How to quickly convert from items in a list of lists to list of dictionaries in python?

Assuming I have the following structure:
listoflist = [[0,1,2,3,4],[2,4,2,3,4],[3,4,5,None,3],...]
Assuming I have:
headers = ["A","B","C","D","E"]
I want to convert each sublist to a dictionary keyed by the headers:
listofobj = [{"A":0,"B":1,"C":2,"D":3,"E":4},{"A":2,"B":4,"C":2,"D":3,"E":4},{"A":3,"B":4,"C":5,"E":3}]
What is the best way to do this?
Note that D does not show up in the 3rd dictionary of the converted list because its value is None. I am looking for the fastest way to do this.
You can use a list comprehension to perform an operation on each element of a list, the zip built-in function to pair each element of headers with the corresponding element of a sublist of listoflist, and the dict built-in function to convert each of those pairings into a dictionary. So, the code you want is
listofobj = [dict(zip(headers, sublist)) for sublist in listoflist]
Removing None values is probably best done in another function:
def without_none_values(d):
    return {k: d[k] for k in d if d[k] is not None}
With that function, we can complete the list with
listofobj = [without_none_values(dict(zip(headers, sublist))) for sublist in listoflist]
Easy to do in Python >= 2.7 using dictionary comprehension:
listofobj = [
    {k: v for k, v in zip(headers, sublist) if v is not None}
    for sublist in listoflist
]
In Python 2.6 one needs to use dict:
listofobj = [
    dict((k, v) for k, v in zip(headers, sublist) if v is not None)
    for sublist in listoflist
]
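As a quick check, the dictionary comprehension can be run against the three sample rows from the question (the trailing ... in the question's list is left out here). Because the filter is v is not None rather than a truthiness test, the 0 in the first row is kept:

```python
headers = ["A", "B", "C", "D", "E"]
listoflist = [[0, 1, 2, 3, 4], [2, 4, 2, 3, 4], [3, 4, 5, None, 3]]

listofobj = [
    {k: v for k, v in zip(headers, sublist) if v is not None}
    for sublist in listoflist
]

print(listofobj[0])  # {'A': 0, 'B': 1, 'C': 2, 'D': 3, 'E': 4}
print(listofobj[2])  # {'A': 3, 'B': 4, 'C': 5, 'E': 3}
```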
I would iterate through each list, and zip it with the list of headers.
headers = ["A","B","C","D","E"]
listoflist = [[0,1,2,3,4],[2,4,2,3,4],[1,2,3,4,4],[5,6,7,8,9],[0,9,7,6,5]]
[dict(zip(headers, sublist)) for sublist in listoflist]
Output
[{'A': 0, 'C': 2, 'B': 1, 'E': 4, 'D': 3},
{'A': 2, 'C': 2, 'B': 4, 'E': 4, 'D': 3},
{'A': 1, 'C': 3, 'B': 2, 'E': 4, 'D': 4},
{'A': 5, 'C': 7, 'B': 6, 'E': 9, 'D': 8},
{'A': 0, 'C': 7, 'B': 9, 'E': 5, 'D': 6}]
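Create a pandas Series from each of the lists in your listoflist, using headers as the index.
Then drop the None values using the dropna() method, and finally create a dictionary from each Series. Note that a sublist containing None is stored as a floating-point Series, so its remaining integers come back as floats (for example 3.0 instead of 3).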
import pandas as pd
listofobj = [dict(pd.Series(x, index = headers).dropna()) for x in listoflist]
[x for x in l if x is not None] keeps all values that are not None.
We enumerate over headers and access the element at the corresponding index in each sublist of listoflist using l[ind].
l[ind] is not None is True when the value is not None, so we keep that header as a key; otherwise we skip it.
list_of_obj = [dict(zip([x for ind ,x in enumerate(headers) if l[ind] is not None],[x for x in l if x is not None])) for l in listoflist]
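Since the question asks about speed, the approaches are best compared by measuring them on your own data. The sketch below uses timeit from the standard library on a made-up input (the sample rows repeated 1000 times). The function names are hypothetical, and no particular ranking is implied:

```python
import timeit

headers = ["A", "B", "C", "D", "E"]
# Made-up benchmark input: the question's sample rows repeated.
listoflist = [[0, 1, 2, 3, 4], [2, 4, 2, 3, 4], [3, 4, 5, None, 3]] * 1000

def with_dict_comprehension():
    return [
        {k: v for k, v in zip(headers, sub) if v is not None}
        for sub in listoflist
    ]

def with_two_filters():
    return [
        dict(zip(
            [h for i, h in enumerate(headers) if sub[i] is not None],
            [v for v in sub if v is not None],
        ))
        for sub in listoflist
    ]

for fn in (with_dict_comprehension, with_two_filters):
    seconds = timeit.timeit(fn, number=20)
    print(f"{fn.__name__}: {seconds:.3f}s for 20 runs")
```

Timings vary with the machine, the Python version, and the share of None values, so treat the numbers as relative.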
