I have a dataframe with a column that contains dictionaries. I want to count the number of occurrences of each dictionary key across the whole column.
One way of doing it is below:
import pandas as pd
from collections import Counter
df = pd.DataFrame({"data": [{"weight": 3, "color": "blue"},
                            {"size": 5, "weight": 2},
                            {"size": 3, "color": "red"}]})
c = Counter()
for index, row in df.iterrows():
    for item in list(row["data"].keys()):
        c[item] += 1
print(c)
Which gives
Counter({'weight': 2, 'color': 2, 'size': 2})
Are there faster ways of doing it?
A much faster approach is to flatten the column with itertools.chain and build a Counter from the result; iterating over each dictionary yields only its keys, so the keys are exactly what gets counted:
from itertools import chain
Counter(chain.from_iterable(df.data.values.tolist()))
# Counter({'weight': 2, 'color': 2, 'size': 2})
Timings:
def OP(df):
    c = Counter()
    for index, row in df.iterrows():
        for item in list(row["data"].keys()):
            c[item] += 1
    return c
%timeit OP(df)
# 570 µs ± 49.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit Counter(chain.from_iterable(df.data.values.tolist()))
# 14.2 µs ± 902 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
We can use pd.DataFrame to expand the dictionaries into columns and then count the non-null cells per column:
pd.DataFrame(df.data.tolist()).notnull().sum().to_dict()
Out[653]: {'color': 2, 'size': 2, 'weight': 2}
First, creating an empty Counter and filling it by hand is, in my opinion, pretty pointless: Counter can do the counting for you if you supply it an iterable; that is its main purpose.
I would do:
from functools import reduce
c = reduce(lambda x, y : x+y, [Counter(x.keys()) for x in df['data']])
and c is:
Counter({'color': 2, 'size': 2, 'weight': 2})
To explain what the line above does: first, the list comprehension iterates over the column and builds a Counter from the keys of each dictionary. Then reduce sums those Counters, since Counter supports addition.
On my machine this approach, using the provided input, is roughly 4 times faster than the OP's method.
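If reduce feels indirect, an equivalent formulation is the built-in sum with an empty Counter as the start value (a minimal sketch, not part of the original answer):
from collections import Counter

# Sum the per-row Counters, starting from an empty Counter; this mirrors
# what reduce(lambda x, y: x + y, ...) does above.
c = sum((Counter(d.keys()) for d in df['data']), Counter())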
Briefly, with pandas features:
In [171]: df['data'].apply(pd.Series).count(axis=0).to_dict()
Out[171]: {'weight': 2, 'color': 2, 'size': 2}
I have two lists. One gives me the IDs, say:
list1 = [0, 0, 0, 0, 1, 1, 2, 3, 3, 3]
Another gives me values associated with those IDs, say:
list2 = [10, 20, 30, 40, 1, 5, 9, 10, 15, 20]
I want to organize the values of list2 based on the values in list1, such that:
dict = {0:[10,20,30,40], 1:[1, 5], 2:[9], 3:[10,15,20]}
My current problem is that I need an efficient way to do this, since my lists are ~150,000 points long and my solution is very slow:
dict = {i: np.array(list2)[np.where(np.array(list1) == i)] for i in np.unique(list1)}
Seems like a simple problem and perhaps there is an obvious answer but I just can't seem to find the right combination of words to get a good answer.
If list1 is sorted (or if you pre-sort; see the sketch after the example below), you can use groupby to collect the values into subgroups on the fly, then create a dictionary where the key comes from list1 and the value from the grouped items of list2:
>>> from itertools import groupby
>>> {k:[i[1] for i in g] for k,g in groupby(zip(list1, list2), key=lambda i: i[0])}
{0: [10, 20, 30, 40], 1: [1, 5], 2: [9], 3: [10, 15, 20]}
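If list1 is not already sorted, a minimal sketch of the pre-sorting step mentioned above (my addition, assuming the same list1 and list2 from the question):
from itertools import groupby

# Sort the (id, value) pairs by id so that groupby sees each id as one contiguous run.
pairs = sorted(zip(list1, list2), key=lambda p: p[0])
grouped = {k: [v for _, v in g] for k, g in groupby(pairs, key=lambda p: p[0])}
# {0: [10, 20, 30, 40], 1: [1, 5], 2: [9], 3: [10, 15, 20]}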
Are you open to pandas?
import pandas as pd
out_dict = {k:list(v) for k,v in pd.Series(list2).groupby(list1)}
Output:
{0: [10, 20, 30, 40], 1: [1, 5], 2: [9], 3: [10, 15, 20]}
Update: a vanilla Python version:
def python_func(list1, list2):
    out_dict = {}
    for k, v in zip(list1, list2):
        if k in out_dict:
            out_dict[k].append(v)
        else:
            out_dict[k] = [v]
    return out_dict
Performance Test:
Data:
import numpy as np

n = 100000
list1 = np.random.randint(0, 10, n)
list2 = np.random.randint(0, 60, n)
Run time:
%timeit -n 10 out_dict = {k:list(v) for k,v in pd.Series(list2).groupby(list1)}
# 8.51 ms ± 95.3 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit -n 10 python_func(list1, list2);
# slightly slower, but equivalent to Stephan's answer
# 31.6 ms ± 777 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit -n 10 {k:[i[1] for i in g] for k,g in groupby(zip(list1, list2), key=lambda i: i[0])}
# 58.2 ms ± 1.31 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In a single loop you can do the following using a defaultdict, with no assumptions about the ordering of the indices array:
In [1]: import numpy as np; from collections import defaultdict
In [2]: indices = np.asarray([0, 0, 0, 0, 1, 1, 2, 3, 3, 3])
In [3]: values = np.asarray([10, 20, 30, 40, 1, 5, 9, 10, 15, 20])
In [4]: mapping = defaultdict(list)
...: for i, v in zip(indices, values):
...:     mapping[i].append(v)
...:
In [5]: mapping
Out[5]: defaultdict(list, {0: [10, 20, 30, 40], 1: [1, 5], 2: [9], 3: [10, 15, 20]})
What is the fastest way to count the occurrences of two-letter pairs in a string (i.e. AA, AB, AC, etc.)? Is it possible to use numpy to speed up this computation?
I am using a list comprehension with str.count(), but this is quite slow.
import itertools
seq = 'MRNLAIIPARSGSKGLKDKNIKLLSGKPLLAYTIEAARESGLFGEIMVSTDSQEYAD'\
'IAKQWGANVPFLRSNELSNDTASSWDVVKEVIEGYKNLGTEFDTVVLLQPTSPLRTS'\
'IEGYKIMKEKDANFVVGVCEMDHSPLWANTLPEDLSMENFIRPEVVKMPRQSIPTYY'\
'RINGALYIVKVDYLMRTSDIYGERSIASVMRKENSIDIDNQMDFTIAEVLISERSKK'
chars = list('ACDEFGHIKLMNPQRSTVWY')
pairs = [''.join(pair) for pair in itertools.product(chars, chars)]
print(pairs[:10])
print(len(pairs))
['AA', 'AC', 'AD', 'AE', 'AF', 'AG', 'AH', 'AI', 'AK', 'AL']
400
%timeit counts = np.array([seq.count(pair) for pair in pairs])
231 µs ± 5.88 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
print(counts[:10])
[0, 1, 1, 0, 0, 0, 1, 1, 1, 0]
If you don't mind getting the counts in a dictionary, the Counter class from collections would process 2-3 times faster:
from collections import Counter
chars = set('ACDEFGHIKLMNPQRSTVWY')
counts = Counter(a + b for a, b in zip(seq, seq[1:]) if a in chars and b in chars)
print(counts)
Counter({'RS': 4, 'VV': 4, 'SI': 4, 'MR': 3, 'SG': 3, 'LL': 3, 'LS': 3,
'PL': 3, 'IE': 3, 'DI': 3, 'IA': 3, 'AN': 3, 'VK': 3, 'KE': 3,
'EV': 3, 'TS': 3, 'NL': 2, 'LA': 2, 'IP': 2, 'AR': 2, 'SK': 2,
...
This approach will properly count runs of the same character repeated 3 or more times (e.g. "WWW" counts as 2 occurrences of "WW", whereas seq.count() or re.findall() would only count 1).
Keep in mind that the Counter dictionary will return zero for counts['LC'], but counts.items() will not contain 'LC' or any other pair not actually present in the string.
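A tiny illustration of that overlap difference (my toy example, not from the question):
s = "WWW"
print(s.count("WW"))                                    # 1: str.count() counts non-overlapping matches
print(Counter(a + b for a, b in zip(s, s[1:]))["WW"])   # 2: the pairwise zip sees both overlapping pairs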
If needed you could get the counts for all theoretical pairs in a second step:
from itertools import product
chars = 'ACDEFGHIKLMNPQRSTVWY'
print([counts[a+b] for a,b in product(chars,chars)][:10])
[1, 0, 1, 1, 0, 0, 0, 1, 1, 1]
There is a numpy function, np.char.count(). But it appears to be much slower than str.count().
%timeit counts = np.array([np.char.count(seq, pair) for pair in pairs])
1.79 ms ± 32.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Since speed is of utmost importance, below is a comparison of different methods:
import numpy as np
import itertools
from collections import Counter
seq = 'MRNLAIIPARSGSKGLKDKNIKLLSGKPLLAYTIEAARESGLFGEIMVSTDSQEYAD'\
'IAKQWGANVPFLRSNELSNDTASSWDVVKEVIEGYKNLGTEFDTVVLLQPTSPLRTS'\
'IEGYKIMKEKDANFVVGVCEMDHSPLWANTLPEDLSMENFIRPEVVKMPRQSIPTYY'\
'RINGALYIVKVDYLMRTSDIYGERSIASVMRKENSIDIDNQMDFTIAEVLISERSKK'
chars = list('ACDEFGHIKLMNPQRSTVWY')
pairs = [''.join(pair) for pair in itertools.product(chars, chars)]
def countpairs1():
    return np.array([seq.count(pair) for pair in pairs])
%timeit counts = countpairs1()
144 µs ± 1.02 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
def countpairs2():
    counted = Counter(a + b for a, b in zip(seq, seq[1:]))
    return np.array([counted[pair] for pair in pairs])
%timeit counts = countpairs2()
102 µs ± 729 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
def countpairs3():
    return np.array([np.char.count(seq, pair) for pair in pairs])
%timeit counts = countpairs3()
1.65 ms ± 4.62 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Obviously, the best/fastest method is Counter.
Given a = [1, 2, 3, 4, 5]
After encoding, a' = [1, 1, 1, 1, 1]; each element represents the difference compared to its previous element (the first element is left unchanged).
I know this can be done with
for i in range(len(a) - 1, 0, -1):
    a[i] = a[i] - a[i - 1]
Is there a faster way? I am working with 2 billion numbers here, the process is taking about 30 minutes.
One way using itertools.starmap, islice and operator.sub:
from operator import sub
from itertools import starmap, islice
l = list(range(1, 10000000))
[l[0], *starmap(sub, zip(islice(l, 1, None), l))]
Output:
[1, 1, 1, ..., 1]
Benchmark:
l = list(range(1, 100000000))
# OP's method
%timeit [l[i] - l[i - 1] for i in range(len(l) - 1, 0, -1)]
# 14.2 s ± 373 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# numpy approach by #ynotzort
%timeit np.diff(l)
# 8.52 s ± 301 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# zip approach by #Nick
%timeit [nxt - cur for cur, nxt in zip(l, l[1:])]
# 7.96 s ± 243 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# itertool and operator approach by #Chris
%timeit [l[0], *starmap(sub, zip(islice(l, 1, None), l))]
# 6.4 s ± 255 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
You could use zip to pair the list with an offset version of itself and subtract those values:
a = [1, 2, 3, 4, 5]
a[1:] = [nxt - cur for cur, nxt in zip(a, a[1:])]
print(a)
Output:
[1, 1, 1, 1, 1]
Out of interest, I ran this, the original code, and #ynotzort's answer through timeit. This was much faster than the numpy code for short lists, remaining faster up to about 10M values, and both were about 30% faster than the original code. As the list size increases beyond 10M, the numpy code gains relatively more speed and eventually becomes faster, from about 20M values onward.
Update
Also tested the starmap code, and that is about 40% faster than the numpy code at 20M values...
Update 2
#Chris has some more comprehensive performance data in their answer. This answer can be sped up further (about 10%) by using itertools.islice to generate the offset list:
from itertools import islice

a = [a[0], *[nxt - cur for cur, nxt in zip(a, islice(a, 1, None))]]
You could use numpy.diff. For example:
import numpy as np
a = [1, 2, 3, 4, 5]
npa = np.array(a)
a_diff = np.diff(npa)
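Note that np.diff returns one element fewer than the input. If you want the first element kept in place, as in the in-place loop from the question, a sketch using the prepend argument (available in NumPy 1.16+):
# prepend=0 keeps the output the same length as the input and
# leaves the first value unchanged (a[0] - 0 == a[0]).
a_diff = np.diff(npa, prepend=0)
# array([1, 1, 1, 1, 1])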
I have a dataframe in which each row shows one transaction and the items within that transaction. Here is what my dataframe looks like:
itemList
A,B,C
B,F
G,A
...
I want to find the frequency of each item (how many times it appears across the transactions). I have defined a dictionary and try to update its values as shown below:
counts = {}
def update(itemList):
    # Update the count of each item in the dict
    for item in itemList.split(','):
        counts[item] = counts.get(item, 0) + 1
df.itemList.apply(lambda x: update(x))
As the apply function gets executed for multiple rows at the same time, multiple rows try to update the values in the dict simultaneously, and that is causing an issue. How can I make sure multiple updates to the dict do not cause any issues?
I think you only need Series.str.get_dummies:
df['itemList'].str.get_dummies(',').sum().to_dict()
#{'A': 2, 'B': 2, 'C': 1, 'F': 1, 'G': 1}
If there are more columns use:
df.stack().str.get_dummies(',').sum().to_dict()
If you want to count per row:
df['itemList'].str.get_dummies(',').to_dict('index')
#{0: {'A': 1, 'B': 1, 'C': 1, 'F': 0, 'G': 0},
# 1: {'A': 0, 'B': 1, 'C': 0, 'F': 1, 'G': 0},
# 2: {'A': 1, 'B': 0, 'C': 0, 'F': 0, 'G': 1}}
As #Quang Hoang said in the comments, apply simply applies the function to each row/column using a loop.
You might be better off relying on native Python here:
df = pd.DataFrame({'itemlist':['a,b,c', 'b,f', 'g,a', 'd,g,f,d,s,a,v', 'e,w,d,f,g,h', 's,d,f,e,r,t', 'e,d,f,g,r,r','s,d,f']})
Here is a solution using Counter:
df['itemlist'].str.replace(',','').apply(lambda x: Counter(x)).sum()
Some comparisons,
%timeit df['itemlist'].str.split(',', expand = True).stack().value_counts().to_dict()
2.64 ms ± 99.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit df['itemlist'].str.get_dummies(',').sum().to_dict()
3.22 ms ± 68.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
from collections import Counter
%timeit df['itemlist'].str.replace(',','').apply(lambda x: Counter(x)).sum()
778 µs ± 12.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
a = [3, 4, 2, 1, 7, 6, 5]
b = [4, 6]
The answer should be 1, because of the elements in b, 4 is the one that appears first in a, and its index there is 1.
The question is: is there any fast way in Python to achieve this?
PS: Actually a is a random permutation and b is a subset of a, but it's represented as a list.
If b is to be seen as a subset (order doesn't matter, all values are present in a), then use min() with a map():
min(map(a.index, b))
This returns the lowest index. This is an O(NK) solution (where N is the length of a, K that of b), but all looping is executed in C code.
Another option is to convert a to a set and use next() on a loop over enumerate():
bset = set(b)
next(i for i, v in enumerate(a) if v in bset)
This is an O(N) solution, but it has a higher constant cost (Python bytecode to execute). Which one is faster depends heavily on the sizes of a and b.
For the small input example in the question, min(map(...)) wins:
In [86]: a = [3, 4, 2, 1, 7, 6, 5]
...: b = [4, 6]
...:
In [87]: %timeit min(map(a.index, b))
...:
608 ns ± 64.5 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [88]: bset = set(b)
...:
In [89]: %timeit next(i for i, v in enumerate(a) if v in bset)
...:
717 ns ± 30.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In one line:
print("".join([str(index) for item in b for index,item1 in enumerate(a) if item==item1][:1]))
output:
1
In detail:
a = [3, 4, 2, 1, 7, 6, 5]
b = [4, 6]
new=[]
for item in b:
    for index, item1 in enumerate(a):
        if item == item1:
            new.append(index)
print("".join([str(x) for x in new[:1]]))
For a small b, the set approach is output dependent: its execution time grows linearly with the index it returns. NumPy can provide a better solution in this case.
import numpy as np

N = 10**6
A = np.unique(np.random.randint(0, N, N))
np.random.shuffle(A)
B = A[:3].copy()
np.random.shuffle(A)
def find(A, B):
    pos = np.in1d(A, B).nonzero()[0]
    return pos[A[pos].argsort()][B.argsort().argsort()].min()
def findset(A, B):
    bset = set(B)
    return next(i for i, v in enumerate(A) if v in bset)
#In [29]: find(A,B)==findset(A,B)
#Out[29]: True
#In [30]: %timeit findset(A,B)
# 63.5 ms ± 1.31 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
#
# In [31]: %timeit find(A,B)
# 2.24 ms ± 52.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
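For reference, a small usage sketch on the lists from the original question (my example; find expects NumPy arrays):
# a = [3, 4, 2, 1, 7, 6, 5], b = [4, 6] from the original question.
a_arr = np.array([3, 4, 2, 1, 7, 6, 5])
b_arr = np.array([4, 6])
print(find(a_arr, b_arr))     # 1
print(findset(a_arr, b_arr))  # 1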