Fastest way to count two-letter pairs in a string - python

What is the fastest way to count the number of two-letter pairs in a string (e.g. AA, AB, AC, ...)? Is it possible to use numpy to speed up this computation?
I am using a list comprehension with str.count(), but this is quite slow.
import itertools
import numpy as np
seq = 'MRNLAIIPARSGSKGLKDKNIKLLSGKPLLAYTIEAARESGLFGEIMVSTDSQEYAD'\
'IAKQWGANVPFLRSNELSNDTASSWDVVKEVIEGYKNLGTEFDTVVLLQPTSPLRTS'\
'IEGYKIMKEKDANFVVGVCEMDHSPLWANTLPEDLSMENFIRPEVVKMPRQSIPTYY'\
'RINGALYIVKVDYLMRTSDIYGERSIASVMRKENSIDIDNQMDFTIAEVLISERSKK'
chars = list('ACDEFGHIKLMNPQRSTVWY')
pairs = [''.join(pair) for pair in itertools.product(chars, chars)]
print(pairs[:10])
print(len(pairs))
['AA', 'AC', 'AD', 'AE', 'AF', 'AG', 'AH', 'AI', 'AK', 'AL']
400
%timeit counts = np.array([seq.count(pair) for pair in pairs])
231 µs ± 5.88 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
print(counts[:10])
[0, 1, 1, 0, 0, 0, 1, 1, 1, 0]

If you don't mind getting the counts in a dictionary, the Counter class from collections is 2-3 times faster:
from collections import Counter
chars = set('ACDEFGHIKLMNPQRSTVWY')
counts = Counter( a+b for a,b in zip(seq,seq[1:]) if a in chars and b in chars)
print(counts)
Counter({'RS': 4, 'VV': 4, 'SI': 4, 'MR': 3, 'SG': 3, 'LL': 3, 'LS': 3,
'PL': 3, 'IE': 3, 'DI': 3, 'IA': 3, 'AN': 3, 'VK': 3, 'KE': 3,
'EV': 3, 'TS': 3, 'NL': 2, 'LA': 2, 'IP': 2, 'AR': 2, 'SK': 2,
...
This approach also handles runs of the same character correctly (e.g. "WWW" counts as 2 occurrences of "WW", whereas seq.count() or re.findall() would only count 1, since they look for non-overlapping matches).
Keep in mind that the Counter dictionary will return zero for counts['LC'], but counts.items() will not contain 'LC' or any other pair not actually present in the string.
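A quick illustration of the overlap behaviour (a small sketch, not part of the original answer):
s = 'WWW'
print(s.count('WW'))  # 1, str.count() only finds non-overlapping matches
print(Counter(a + b for a, b in zip(s, s[1:]))['WW'])  # 2, every adjacent pair is counted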
If needed you could get the counts for all theoretical pairs in a second step:
from itertools import product
chars = 'ACDEFGHIKLMNPQRSTVWY'
print([counts[a+b] for a,b in product(chars,chars)][:10])
[1, 0, 1, 1, 0, 0, 0, 1, 1, 1]

There is a NumPy function, np.char.count(), but it appears to be much slower than str.count():
%timeit counts = np.array([np.char.count(seq, pair) for pair in pairs])
1.79 ms ± 32.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Since speed is of utmost importance, below is a comparison of different methods:
import numpy as np
import itertools
from collections import Counter
seq = 'MRNLAIIPARSGSKGLKDKNIKLLSGKPLLAYTIEAARESGLFGEIMVSTDSQEYAD'\
'IAKQWGANVPFLRSNELSNDTASSWDVVKEVIEGYKNLGTEFDTVVLLQPTSPLRTS'\
'IEGYKIMKEKDANFVVGVCEMDHSPLWANTLPEDLSMENFIRPEVVKMPRQSIPTYY'\
'RINGALYIVKVDYLMRTSDIYGERSIASVMRKENSIDIDNQMDFTIAEVLISERSKK'
chars = list('ACDEFGHIKLMNPQRSTVWY')
pairs = [''.join(pair) for pair in itertools.product(chars, chars)]
def countpairs1():
    return np.array([seq.count(pair) for pair in pairs])
%timeit counts = countpairs1()
144 µs ± 1.02 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
def countpairs2():
    counted = Counter(a+b for a,b in zip(seq,seq[1:]))
    return np.array([counted[pair] for pair in pairs])
%timeit counts = countpairs2()
102 µs ± 729 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
def countpairs3():
    return np.array([np.char.count(seq, pair) for pair in pairs])
%timeit counts = countpairs3()
1.65 ms ± 4.62 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
The fastest method is clearly the Counter-based one, since it makes a single pass over the sequence instead of scanning it once for each of the 400 pairs.

Related

How to generate vector in Python

There is a random 1D array m_0
np.array([0, 1, 2])
I need to generate two 1D arrays:
np.array([0, 1, 2, 0, 1, 2, 0, 1, 2])
np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
Is there a faster way to do it than this one:
import numpy as np
import time
N = 3
m_0 = np.arange(N)
t = time.time()
m_1 = np.tile(m_0, N)
m_2 = np.repeat(m_0, N)
t = time.time() - t
The actual size of m_0 is 10**3.
You could use itertools.product to form the Cartesian product of m_0 with itself, then take the result apart again to get your two arrays.
import numpy as np
from itertools import product
N = 3
m_0 = np.arange(N)
m_2, m_1 = map(np.array, zip(*product(m_0, m_0)))
# m_1 is now array([0, 1, 2, 0, 1, 2, 0, 1, 2])
# m_2 is now array([0, 0, 0, 1, 1, 1, 2, 2, 2])
However, for large N this is probably quite a bit less performant than your solution, as it loops over Python objects and can't use NumPy's vectorized (SIMD) operations.
For alternatives and comparisons, you'll probably want to look at the answers to Cartesian product of x and y array points into single array of 2D points.
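As an aside not from the original answers, np.meshgrid can build both arrays in a single vectorized call; a small sketch (whether it beats tile/repeat for large N is not measured here):
import numpy as np
m_0 = np.arange(3)
X, Y = np.meshgrid(m_0, m_0)  # X[i, j] == m_0[j], Y[i, j] == m_0[i]
m_1 = X.ravel()               # array([0, 1, 2, 0, 1, 2, 0, 1, 2])
m_2 = Y.ravel()               # array([0, 0, 0, 1, 1, 1, 2, 2, 2])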
I guess you could try reshape:
>>> np.reshape([m_0]*3, (-1,), order='C')
array([0, 1, 2, 0, 1, 2, 0, 1, 2])
>>> np.reshape([m_0]*3, (-1,), order='F')
array([0, 0, 0, 1, 1, 1, 2, 2, 2])
It should be a tiny bit faster for larger arrays.
>>> m_0 = np.random.randint(0, 10**3, size=(10**3,))
>>> %timeit np.tile([m_0]*10**3, N)
5.85 ms ± 138 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %timeit np.reshape([m_0]*10**3, (-1,), order='C')
1.94 ms ± 46.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
You can slightly improve speed if you reuse your first variable to create the second.
N=1000
%timeit t = np.arange(N); a = np.tile(t, N); b = np.repeat(t, N)
%timeit t = np.arange(N); a = np.tile(t, N); b = np.reshape(a.reshape((N,N)),-1,'F')
7.55 ms ± 46.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
5.54 ms ± 23.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
If you insist on speeding it up further, you can specify the dtype of your array.
%timeit t = np.arange(N,dtype=np.uint16); a = np.tile(t, N); b = np.repeat(t, N)
%timeit t = np.arange(N,dtype=np.uint16); a = np.tile(t, N); b = np.reshape(a.reshape((N,N)),-1,'F')
6.03 ms ± 587 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
3.2 ms ± 37.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Be sure to keep the data type limit in mind.
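The limit matters because uint16 only goes up to 65535; a quick way to check (a small addition, not in the original answer):
import numpy as np
print(np.iinfo(np.uint16).max)  # 65535, so choose a wider dtype if N can exceed this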

Fast delta encoding for increasing sequence of integers in Python

Given a = [1, 2, 3, 4, 5]
After encoding, a' = [1, 1, 1, 1, 1]; each element represents the difference compared to its previous element.
I know this can be done with
for i in range(len(a) - 1, 0, -1):
    a[i] = a[i] - a[i - 1]
Is there a faster way? I am working with 2 billion numbers here, the process is taking about 30 minutes.
One way using itertools.starmap, islice and operator.sub:
from operator import sub
from itertools import starmap, islice
l = list(range(1, 10000000))
[l[0], *starmap(sub, zip(islice(l, 1, None), l))]
Output:
[1, 1, 1, ..., 1]
Benchmark:
l = list(range(1, 100000000))
# OP's method
%timeit [l[i] - l[i - 1] for i in range(len(l) - 1, 0, -1)]
# 14.2 s ± 373 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# numpy approach by @ynotzort
%timeit np.diff(l)
# 8.52 s ± 301 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# zip approach by @Nick
%timeit [nxt - cur for cur, nxt in zip(l, l[1:])]
# 7.96 s ± 243 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# itertools and operator approach by @Chris
%timeit [l[0], *starmap(sub, zip(islice(l, 1, None), l))]
# 6.4 s ± 255 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
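As an aside not covered by the benchmark above: np.diff(l) pays for converting the Python list to an array first; if the data already lives in a NumPy array, the diff itself is vectorized and should be considerably cheaper:
import numpy as np
arr = np.asarray(l)   # one-off conversion from list to array
d = np.diff(arr)      # vectorized diff on the array itself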
You could use zip to put together the list with an offset version and subtract those values
a = [1, 2, 3, 4, 5]
a[1:] = [nxt - cur for cur, nxt in zip(a, a[1:])]
print(a)
Output:
[1, 1, 1, 1, 1]
Out of interest, I ran this, the original code, and @ynotzort's answer through timeit. This was much faster than the numpy code for short lists and remained faster up to about 10M values; both were about 30% faster than the original code. As the list size increases beyond 10M, the numpy code gains relatively and is eventually faster from about 20M values onward.
Update
Also tested the starmap code, and that is about 40% faster than the numpy code at 20M values...
Update 2
@Chris has some more comprehensive performance data in their answer. This answer can be sped up further (about 10%) by using itertools.islice to generate the offset list:
from itertools import islice
a = [a[0], *[nxt - cur for cur, nxt in zip(a, islice(a, 1, None))]]
You could use numpy.diff. For example:
import numpy as np
a = [1, 2, 3, 4, 5]
npa = np.array(a)
a_diff = np.diff(npa)
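Note that np.diff returns an array one element shorter than the input (array([1, 1, 1, 1]) here). If you want the first element preserved, as in the in-place loop above, you can prepend a zero (the prepend argument requires NumPy >= 1.16):
a_diff = np.diff(npa, prepend=0)  # array([1, 1, 1, 1, 1]), matching the in-place result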

How to repeat string in an array?

I want to repeat a string len(non_current_assets) times in an array. So I tried:
["", "totalAssets", "total_non_current_assets" * len(non_current_assets), "totalAssets"]
But it returns:
['',
'totalAssets',
'total_non_current_assetstotal_non_current_assetstotal_non_current_assetstotal_non_current_assetstotal_non_current_assets',
'totalAssets']
Place your string inside a list, multiply, then unpack it (using the * operator), that is:
non_current_assets = (1, 2, 3, 4, 5) # so len(non_current_assets) == 5, might be anything as long as supports len
lst = ["", "totalAssets", *["total_non_current_assets"] * len(non_current_assets), "totalAssets"]
print(lst)
Output:
['', 'totalAssets', 'total_non_current_assets', 'total_non_current_assets', 'total_non_current_assets', 'total_non_current_assets', 'total_non_current_assets', 'totalAssets']
(tested in Python 3.7)
This should work:
string_to_be_repeated = ["total_non_current_assets"]
needed_list = string_to_be_repeated * 3
list_to_appended = ["","totalAssets"]
list_to_appended.extend(needed_list)
print(list_to_appended)
You want to use a loop:
for x in range(len(non_current_assets)):
    YOUR_ARRAY.append("total_non_current_assets")
You can use itertools.repeat together with the unpacking operator *:
import itertools as it
["", "totalAssets",
*it.repeat("total_non_current_assets", len(non_current_assets)),
"totalAssets"]
It makes the intent pretty clear and saves the creation of a temporary list (hence better performance).
In [1]: import itertools as it
In [2]: %timeit [0, 1, *[3]*1000, 4, 5]
6.51 µs ± 8.57 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [3]: %timeit [0, 1, *it.repeat(3, 1000), 4, 5]
4.94 µs ± 73.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

How to properly update a global variable in python using lambda

I have a dataframe in which each row shows one transaction and the items within that transaction. Here is what my dataframe looks like:
itemList
A,B,C
B,F
G,A
...
I want to find the frequency of each item (how many times it appeared in the transactions). I have defined a dictionary and try to update its values as shown below:
dict = {}
def update(itemList):
    # Update the value of each item in the dict
    ...
df.itemList.apply(lambda x: update(x))
As the apply function gets executed for multiple rows at the same time, multiple rows try to update the values in the dict at the same time, and that causes an issue. How can I make sure that multiple updates to the dict do not cause any issue?
I think you only need Series.str.get_dummies:
df['itemList'].str.get_dummies(',').sum().to_dict()
#{'A': 2, 'B': 2, 'C': 1, 'F': 1, 'G': 1}
If there are more columns use:
df.stack().str.get_dummies(',').sum().to_dict()
If you want the counts for each row:
df['itemList'].str.get_dummies(',').to_dict('index')
#{0: {'A': 1, 'B': 1, 'C': 1, 'F': 0, 'G': 0},
# 1: {'A': 0, 'B': 1, 'C': 0, 'F': 1, 'G': 0},
# 2: {'A': 1, 'B': 0, 'C': 0, 'F': 0, 'G': 1}}
As @Quang Hoang said in the comments, apply simply applies the function to each row/column using a loop.
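As an aside (a minimal sketch, not from the original answers): since apply runs a plain loop, there is no concurrency to worry about, and you can just as well build the dictionary with an explicit loop over a Counter, assuming itemList holds comma-separated strings:
from collections import Counter
freq = Counter()
for items in df['itemList']:
    freq.update(items.split(','))
print(dict(freq))  # {'A': 2, 'B': 2, 'C': 1, 'F': 1, 'G': 1} for the sample data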
You might be better off relying on native python here,
df = pd.DataFrame({'itemlist':['a,b,c', 'b,f', 'g,a', 'd,g,f,d,s,a,v', 'e,w,d,f,g,h', 's,d,f,e,r,t', 'e,d,f,g,r,r','s,d,f']})
Here is a solution using Counter,
df['itemlist'].str.replace(',','').apply(lambda x: Counter(x)).sum()
Some comparisons,
%timeit df['itemlist'].str.split(',', expand = True).stack().value_counts().to_dict()
2.64 ms ± 99.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit df['itemlist'].str.get_dummies(',').sum().to_dict()
3.22 ms ± 68.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
from collections import Counter
%timeit df['itemlist'].str.replace(',','').apply(lambda x: Counter(x)).sum()
778 µs ± 12.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

How to find the index of the element in a list that first appears in another given list?

a = [3, 4, 2, 1, 7, 6, 5]
b = [4, 6]
The answer should be 1, because among the elements of b, 4 is the one that appears first in a, and its index in a is 1.
Is there any fast way in Python to achieve this?
PS: Actually a is a random permutation and b is a subset of a, but it's represented as a list.
If b is to be seen as a subset (order doesn't matter, all values are present in a), then use min() with a map():
min(map(a.index, b))
This returns the lowest index. This is an O(NK) solution (where N is the length of a, K that of b), but all the looping is executed in C code.
Another option is to convert a to a set and use next() on a loop over enumerate():
bset = set(b)
next(i for i, v in enumerate(a) if v in bset)
This is an O(N) solution, but it has a higher constant cost (Python bytecode to execute). Which one is faster depends heavily on the sizes of a and b.
For the small input example in the question, min(map(...)) wins:
In [86]: a = [3, 4, 2, 1, 7, 6, 5]
...: b = [4, 6]
...:
In [87]: %timeit min(map(a.index, b))
...:
608 ns ± 64.5 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [88]: bset = set(b)
...:
In [89]: %timeit next(i for i, v in enumerate(a) if v in bset)
...:
717 ns ± 30.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In one line:
print("".join([str(index) for item in b for index,item1 in enumerate(a) if item==item1][:1]))
output:
1
In detail:
a = [3, 4, 2, 1, 7, 6, 5]
b = [4, 6]
new=[]
for item in b:
    for index,item1 in enumerate(a):
        if item==item1:
            new.append(index)
print("".join([str(x) for x in new[:1]]))
For a small B, the set approach is output dependent: its execution time grows linearly with the index of the result. NumPy can provide a better solution in this case.
N=10**6
A=np.unique(np.random.randint(0,N,N))
np.random.shuffle(A)
B=A[:3].copy()
np.random.shuffle(A)
def find(A,B):
    pos = np.in1d(A,B).nonzero()[0]
    return pos[A[pos].argsort()][B.argsort().argsort()].min()
def findset(A,B):
    bset = set(B)
    return next(i for i, v in enumerate(A) if v in bset)
#In [29]: find(A,B)==findset(A,B)
#Out[29]: True
#In [30]: %timeit findset(A,B)
# 63.5 ms ± 1.31 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
#
# In [31]: %timeit find(A,B)
# 2.24 ms ± 52.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
