How to repeat string in an array? - python

I want to repeat len(non_current_assets) times a string in an array. So I tried:
["", "totalAssets", "total_non_current_assets" * len(non_current_assets), "totalAssets"]
But it returns:
['',
'totalAssets',
'total_non_current_assetstotal_non_current_assetstotal_non_current_assetstotal_non_current_assetstotal_non_current_assets',
'totalAssets']

Place your str inside list, multiply, then unpack (using * operator) that is:
non_current_assets = (1, 2, 3, 4, 5) # so len(non_current_assets) == 5, might be anything as long as supports len
lst = ["", "totalAssets", *["total_non_current_assets"] * len(non_current_assets), "totalAssets"]
print(lst)
Output:
['', 'totalAssets', 'total_non_current_assets', 'total_non_current_assets', 'total_non_current_assets', 'total_non_current_assets', 'total_non_current_assets', 'totalAssets']
(tested in Python 3.7)

This should work:
string_to_be_repeated = ["total_non_current_assets"]
needed_list = string_to_be_repeated * 3
list_to_appended = ["","totalAssets"]
list_to_appended.extend(needed_list)
print(list_to_appended)

You want to use a loop:
for x in range(len(non_current_assets)):
YOUR_ARRAY.append(”total_non_current_assets“)

You can use itertools.repeat together with the unpacking operator *:
import itertools as it
["", "totalAssets",
*it.repeat("total_non_current_assets", len(non_current_assets)),
"totalAssets"]
It makes the intent pretty clear and saves the creation of a temporary list (hence better performance).
In [1]: import itertools as it
In [2]: %timeit [0, 1, *[3]*1000, 4, 5]
6.51 µs ± 8.57 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [3]: %timeit [0, 1, *it.repeat(3, 1000), 4, 5]
4.94 µs ± 73.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Related

Fastest way to count two-letter pairs in a string

What is the fastest way to count number of two-letter pairs in a string (i.e AA, AB, AC, ... etc)? Is it possible to use numpy to speed up this computation?
I am using a list comprehension with str.count(), but this is quite slow.
import itertools
seq = 'MRNLAIIPARSGSKGLKDKNIKLLSGKPLLAYTIEAARESGLFGEIMVSTDSQEYAD'\
'IAKQWGANVPFLRSNELSNDTASSWDVVKEVIEGYKNLGTEFDTVVLLQPTSPLRTS'\
'IEGYKIMKEKDANFVVGVCEMDHSPLWANTLPEDLSMENFIRPEVVKMPRQSIPTYY'\
'RINGALYIVKVDYLMRTSDIYGERSIASVMRKENSIDIDNQMDFTIAEVLISERSKK'
chars = list('ACDEFGHIKLMNPQRSTVWY')
pairs = [''.join(pair) for pair in itertools.product(chars, chars)]
print(pairs[:10])
print(len(pairs))
['AA', 'AC', 'AD', 'AE', 'AF', 'AG', 'AH', 'AI', 'AK', 'AL']
400
%timeit counts = np.array([seq.count(pair) for pair in pairs])
231 µs ± 5.88 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
print counts[:10]
[0, 1, 1, 0, 0, 0, 1, 1, 1, 0]
If you don't mind getting the counts in a dictionary, the Counter class from collections would process 2-3 times faster:
from collections import Counter
chars = set('ACDEFGHIKLMNPQRSTVWY')
counts = Counter( a+b for a,b in zip(seq,seq[1:]) if a in chars and b in chars)
print(counts)
Counter({'RS': 4, 'VV': 4, 'SI': 4, 'MR': 3, 'SG': 3, 'LL': 3, 'LS': 3,
'PL': 3, 'IE': 3, 'DI': 3, 'IA': 3, 'AN': 3, 'VK': 3, 'KE': 3,
'EV': 3, 'TS': 3, 'NL': 2, 'LA': 2, 'IP': 2, 'AR': 2, 'SK': 2,
...
This approach will properly count sequences of the same character repeated 3 or more times (i.e. "WWW" will count as 2 for "WW" whereas seq.count() or re.findall() would only count 1).
Keep in mind that he Counter dictionary will return zero for counts['LC'] but counts.items() will not contain 'LC' or any other pair not actually in the string.
If needed you could get the counts for all theoretical pairs in a second step:
from itertools import product
chars = 'ACDEFGHIKLMNPQRSTVWY'
print([counts[a+b] for a,b in product(chars,chars)][:10])
[1, 0, 1, 1, 0, 0, 0, 1, 1, 1]
There is a numpy function, np.char.count(). But it appears to be much slower than str.count().
%timeit counts = np.array([np.char.count(seq, pair) for pair in pairs])
1.79 ms ± 32.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Since speed is of utmost importance, below is a comparison of different methods:
import numpy as np
import itertools
from collections import Counter
seq = 'MRNLAIIPARSGSKGLKDKNIKLLSGKPLLAYTIEAARESGLFGEIMVSTDSQEYAD'\
'IAKQWGANVPFLRSNELSNDTASSWDVVKEVIEGYKNLGTEFDTVVLLQPTSPLRTS'\
'IEGYKIMKEKDANFVVGVCEMDHSPLWANTLPEDLSMENFIRPEVVKMPRQSIPTYY'\
'RINGALYIVKVDYLMRTSDIYGERSIASVMRKENSIDIDNQMDFTIAEVLISERSKK'
chars = list('ACDEFGHIKLMNPQRSTVWY')
pairs = [''.join(pair) for pair in itertools.product(chars, chars)]
def countpairs1():
return np.array([seq.count(pair) for pair in pairs])
%timeit counts = countpairs1()
144 µs ± 1.02 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
def countpairs2():
counted = Counter(a+b for a,b in zip(seq,seq[1:]))
return np.array([counted[pair] for pair in pairs])
%timeit counts = countpairs2()
102 µs ± 729 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
def countpairs3():
return np.array([np.char.count(seq, pair) for pair in pairs])
%timeit counts = countpairs3()
1.65 ms ± 4.62 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Obviously, the best/fastest method is Counter.

Search indexes where values in my array match a value in a different array (python) [duplicate]

I have two numpy arrays, A and B. A conatains unique values and B is a sub-array of A.
Now I am looking for a way to get the index of B's values within A.
For example:
A = np.array([1,2,3,4,5,6,7,8,9,10])
B = np.array([1,7,10])
# I need a function fun() that:
fun(A,B)
>> 0,6,9
You can use np.in1d with np.nonzero -
np.nonzero(np.in1d(A,B))[0]
You can also use np.searchsorted, if you care about maintaining the order -
np.searchsorted(A,B)
For a generic case, when A & B are unsorted arrays, you can bring in the sorter option in np.searchsorted, like so -
sort_idx = A.argsort()
out = sort_idx[np.searchsorted(A,B,sorter = sort_idx)]
I would add in my favorite broadcasting too in the mix to solve a generic case -
np.nonzero(B[:,None] == A)[1]
Sample run -
In [125]: A
Out[125]: array([ 7, 5, 1, 6, 10, 9, 8])
In [126]: B
Out[126]: array([ 1, 10, 7])
In [127]: sort_idx = A.argsort()
In [128]: sort_idx[np.searchsorted(A,B,sorter = sort_idx)]
Out[128]: array([2, 4, 0])
In [129]: np.nonzero(B[:,None] == A)[1]
Out[129]: array([2, 4, 0])
Have you tried searchsorted?
A = np.array([1,2,3,4,5,6,7,8,9,10])
B = np.array([1,7,10])
A.searchsorted(B)
# array([0, 6, 9])
Just for completeness: If the values in A are non negative and reasonably small:
lookup = np.empty((np.max(A) + 1), dtype=int)
lookup[A] = np.arange(len(A))
indices = lookup[B]
I had the same question these days. However, the timing performance is very critical for me. Therefore, I guess the timing comparison of different solutions may be useful for others.
As Divakar mentioned, you can use np.in1d(A, B) with np.where, np.nonzero. Moreover, you can use the np.in1d(A, B) with np.intersect1d (based on this page). Also, you can use np.searchsorted as another useful approach for sorted arrays.
I want to add another simple solution. You can use the comprehension list. It may take longer that the previous ones. However, if you take the advantage of Numba python package, it is much less time-consuming.
In [1]: import numpy as np
In [2]: from numba import njit
In [3]: a = np.array([1,2,3,4,5,6,7,8,9,10])
In [4]: b = np.array([1,7,10])
In [5]: np.where(np.in1d(a, b))[0]
...: array([0, 6, 9])
In [6]: np.nonzero(np.in1d(a, b))[0]
...: array([0, 6, 9])
In [7]: np.searchsorted(a, b)
...: array([0, 6, 9])
In [8]: np.searchsorted(a, np.intersect1d(a, b))
...: array([0, 6, 9])
In [9]: [i for i, x in enumerate(a) if x in b]
...: [0, 6, 9]
In [10]: #njit
...: def func(a, b):
...: return [i for i, x in enumerate(a) if x in b]
In [11]: func(a, b)
...: [0, 6, 9]
Now, let's compare the timing performance of these solutions.
In [12]: %timeit np.where(np.in1d(a, b))[0]
4.26 µs ± 6.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [13]: %timeit np.nonzero(np.in1d(a, b))[0]
4.39 µs ± 14.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [14]: %timeit np.searchsorted(a, b)
800 ns ± 6.04 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [15]: %timeit np.searchsorted(a, np.intersect1d(a, b))
8.8 µs ± 73.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [16]: %timeit [i for i, x in enumerate(a) if x in b]
15.4 µs ± 18.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [17]: %timeit func(a, b)
336 ns ± 0.579 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

How to find the index of the element in a list that first appears in another given list?

a = [3, 4, 2, 1, 7, 6, 5]
b = [4, 6]
The answer should be 1. Because in a, 4 appears first in list b, and it's index is 1.
The question is that is there any fast code in python to achieve this?
PS: Actually a is a random permutation and b is a subset of a, but it's represented as a list.
If b is to be seen as a subset (order doesn't matter, all values are present in a), then use min() with a map():
min(map(a.index, b))
This returns the lowest index. This is a O(NK) solution (where N is the length of a, K that of b), but all looping is executed in C code.
Another option is to convert a to a set and use next() on a loop over enumerate():
bset = set(b)
next(i for i, v in enumerate(a) if v in bset)
This is a O(N) solution, but has higher constant cost (Python bytecode to execute). It heavily depends on the sizes of a and b which one is going to be faster.
For the small input example in the question, min(map(...)) wins:
In [86]: a = [3, 4, 2, 1, 7, 6, 5]
...: b = [4, 6]
...:
In [87]: %timeit min(map(a.index, b))
...:
608 ns ± 64.5 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [88]: bset = set(b)
...:
In [89]: %timeit next(i for i, v in enumerate(a) if v in bset)
...:
717 ns ± 30.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In one line :
print("".join([str(index) for item in b for index,item1 in enumerate(a) if item==item1][:1]))
output:
1
In detail :
a = [3, 4, 2, 1, 7, 6, 5]
b = [4, 6]
new=[]
for item in b:
for index,item1 in enumerate(a):
if item==item1:
new.append(index)
print("".join([str(x) for x in new[:1]]))
For little B sample, the set approach is output dependent, execution time grow linearly with index output. Numpy can provide better solution in this case.
N=10**6
A=np.unique(np.random.randint(0,N,N))
np.random.shuffle(A)
B=A[:3].copy()
np.random.shuffle(A)
def find(A,B):
pos=np.in1d(A,B).nonzero()[0]
return pos[A[pos].argsort()][B.argsort().argsort()].min()
def findset(A,B):
bset = set(B)
return next(i for i, v in enumerate(A) if v in bset)
#In [29]: find(A,B)==findset(A,B)
#Out[29]: True
#In [30]: %timeit findset(A,B)
# 63.5 ms ± 1.31 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
#
# In [31]: %timeit find(A,B)
# 2.24 ms ± 52.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Finding indices of matches of one array in another array

I have two numpy arrays, A and B. A conatains unique values and B is a sub-array of A.
Now I am looking for a way to get the index of B's values within A.
For example:
A = np.array([1,2,3,4,5,6,7,8,9,10])
B = np.array([1,7,10])
# I need a function fun() that:
fun(A,B)
>> 0,6,9
You can use np.in1d with np.nonzero -
np.nonzero(np.in1d(A,B))[0]
You can also use np.searchsorted, if you care about maintaining the order -
np.searchsorted(A,B)
For a generic case, when A & B are unsorted arrays, you can bring in the sorter option in np.searchsorted, like so -
sort_idx = A.argsort()
out = sort_idx[np.searchsorted(A,B,sorter = sort_idx)]
I would add in my favorite broadcasting too in the mix to solve a generic case -
np.nonzero(B[:,None] == A)[1]
Sample run -
In [125]: A
Out[125]: array([ 7, 5, 1, 6, 10, 9, 8])
In [126]: B
Out[126]: array([ 1, 10, 7])
In [127]: sort_idx = A.argsort()
In [128]: sort_idx[np.searchsorted(A,B,sorter = sort_idx)]
Out[128]: array([2, 4, 0])
In [129]: np.nonzero(B[:,None] == A)[1]
Out[129]: array([2, 4, 0])
Have you tried searchsorted?
A = np.array([1,2,3,4,5,6,7,8,9,10])
B = np.array([1,7,10])
A.searchsorted(B)
# array([0, 6, 9])
Just for completeness: If the values in A are non negative and reasonably small:
lookup = np.empty((np.max(A) + 1), dtype=int)
lookup[A] = np.arange(len(A))
indices = lookup[B]
I had the same question these days. However, the timing performance is very critical for me. Therefore, I guess the timing comparison of different solutions may be useful for others.
As Divakar mentioned, you can use np.in1d(A, B) with np.where, np.nonzero. Moreover, you can use the np.in1d(A, B) with np.intersect1d (based on this page). Also, you can use np.searchsorted as another useful approach for sorted arrays.
I want to add another simple solution. You can use the comprehension list. It may take longer that the previous ones. However, if you take the advantage of Numba python package, it is much less time-consuming.
In [1]: import numpy as np
In [2]: from numba import njit
In [3]: a = np.array([1,2,3,4,5,6,7,8,9,10])
In [4]: b = np.array([1,7,10])
In [5]: np.where(np.in1d(a, b))[0]
...: array([0, 6, 9])
In [6]: np.nonzero(np.in1d(a, b))[0]
...: array([0, 6, 9])
In [7]: np.searchsorted(a, b)
...: array([0, 6, 9])
In [8]: np.searchsorted(a, np.intersect1d(a, b))
...: array([0, 6, 9])
In [9]: [i for i, x in enumerate(a) if x in b]
...: [0, 6, 9]
In [10]: #njit
...: def func(a, b):
...: return [i for i, x in enumerate(a) if x in b]
In [11]: func(a, b)
...: [0, 6, 9]
Now, let's compare the timing performance of these solutions.
In [12]: %timeit np.where(np.in1d(a, b))[0]
4.26 µs ± 6.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [13]: %timeit np.nonzero(np.in1d(a, b))[0]
4.39 µs ± 14.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [14]: %timeit np.searchsorted(a, b)
800 ns ± 6.04 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [15]: %timeit np.searchsorted(a, np.intersect1d(a, b))
8.8 µs ± 73.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [16]: %timeit [i for i, x in enumerate(a) if x in b]
15.4 µs ± 18.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [17]: %timeit func(a, b)
336 ns ± 0.579 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

Python numpy.square vs **

Is there a difference between numpy.square and using the ** operator on a Numpy array?
From what I can see it yields the same result.
Any differences in efficiency of execution?
An example for clarification:
In [1]: import numpy as np
In [2]: A = np.array([[2, 2],[2, 2]])
In [3]: np.square(A)
Out[3]:
array([[4, 4],
[4, 4]])
In [4]: A ** 2
Out[4]:
array([[4, 4],
[4, 4]])
You can check the execution time to get clear picture of it
In [2]: import numpy as np
In [3]: A = np.array([[2, 2],[2, 2]])
In [7]: %timeit np.square(A)
1000000 loops, best of 3: 923 ns per loop
In [8]: %timeit A ** 2
1000000 loops, best of 3: 668 ns per loop
For most appliances, both will give you the same results.
Generally the standard pythonic a*a or a**2 is faster than the numpy.square() or numpy.pow(), but the numpy functions are often more flexible and precise.
If you do calculations that need to be very accurate, stick to numpy and probably even use other datatypes float96.
For normal usage a**2 will do a good job and way faster job than numpy.
The guys in this thread gave some good examples to a similar questions.
#saimadhu.polamuri and #foehnx/#Lumos
On my machine, currently, NumPy performs faster than **.
In [1]: import numpy as np
In [2]: A = np.array([[1,2],[3,4]])
In [3]: %timeit A ** 2
256 ns ± 0.922 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [4]: %timeit np.square(A)
240 ns ± 0.759 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

Categories