How to use numpy.char.join? - python

A critical portion of my script relies on the concatenation of a large number of fixed-length strings, so I would like to use the low-level numpy.char.join function instead of the classic built-in str.join.
However, I can't get it to work right:
import numpy as np
# Example array.
array = np.array([
    ['a', 'b', 'c'],
    ['d', 'e', 'f'],
    ['g', 'h', 'i'],
], dtype='<U1')
# Now I wish to get:
# array(['abc', 'def', 'ghi'], dtype='<U3')
# But none of these is successful :(
np.char.join('', array)
np.char.join('', array.astype('<U3'))
np.char.join(np.array(''), array.astype('<U3'))
np.char.join(np.array('').astype('<U3'), array.astype('<U3'))
np.char.join(np.array(['', '', '']).astype('<U3'), array.astype('<U3'))
np.char.join(np.char.asarray(['', '', '']).astype('<U3'), np.char.asarray(array))
np.char.asarray(['', '', '']).join(array)
np.char.asarray(['', '', '']).astype('<U3').join(array.astype('<U3'))
... and every call just returns my initial array unchanged.
What am I missing here?
What's numpy's most efficient way to concatenate each row of a large 2D <U1 array?
[EDIT]: Since performance is a concern, I have benchmarked proposed solutions. But I still don't know how to call np.char.join properly.
import numpy as np
import numpy.random as rd
from string import ascii_lowercase as letters
from time import time
# Build up an array with many random letters
n_lines = int(1e7)
n_columns = 4
array = np.array(list(letters))[rd.randint(0, len(letters), n_lines * n_columns)]
array = array.reshape((n_lines, n_columns))
# One quick-n-dirty way to benchmark.
class MeasureTime(object):
    def __enter__(self):
        self.tic = time()
    def __exit__(self, type, value, traceback):
        toc = time()
        print(f"{toc-self.tic:0.3f} seconds")
# And test three concatenation procedures.
with MeasureTime():
    # Involves str.join
    cat = np.apply_along_axis("".join, 1, array)
with MeasureTime():
    # Involves str.join
    cat = np.array(["".join(row) for row in array])
with MeasureTime():
    # Involves low-level np functions instead.
    # Here np.char.add, for example.
    cat = np.char.add(
        np.char.add(np.char.add(array[:, 0], array[:, 1]), array[:, 2]), array[:, 3]
    )
outputs
41.722 seconds
19.921 seconds
15.206 seconds
on my machine.
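For reference, the chained np.char.add generalizes to any number of columns with functools.reduce; a sketch (untimed, equivalent to the nested calls above when n_columns == 4):
from functools import reduce
# Fold np.char.add across the columns, left to right.
cat = reduce(np.char.add, (array[:, i] for i in range(n_columns)))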
Would np.char.join do better? How to make it work?

On the original (3,3) array (timings may scale differently):
The chained np.char.add:
In [88]: timeit np.char.add(np.char.add(arr[:,0],arr[:,1]),arr[:,2])
29 µs ± 223 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
An equivalent approach, using object dtype. For Python strings, '+' is string concatenation.
In [89]: timeit arr.astype(object).sum(axis=1)
14.1 µs ± 18.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
For a list of strings, ''.join() is supposed to be faster than string sum. Plus it lets you specify a 'delimiter':
In [90]: timeit np.array([''.join(row) for row in arr])
13.8 µs ± 41.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Without the conversion back to array:
In [91]: timeit [''.join(row) for row in arr]
10.2 µs ± 15.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Better yet, use tolist to convert the array to a list of lists of strings:
In [92]: timeit [''.join(row) for row in arr.tolist()]
1.01 µs ± 1.81 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
The list-comprehension equivalent of the nested np.char.add:
In [97]: timeit [row[0]+row[1]+row[2] for row in arr.tolist()]
1.19 µs ± 2.68 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
numpy does not have low-level string code, at least not in the same sense that it has low-level compiled numeric code. It still depends on Python string code, even if it calls it from the C-API.
====
Since the strings are U1, we can view them as U3:
In [106]: arr.view('U3')
Out[106]:
array([['abc'],
       ['def'],
       ['ghi']], dtype='<U3')
In [107]: arr.view('U3').ravel()
Out[107]: array(['abc', 'def', 'ghi'], dtype='<U3')
In [108]: timeit arr.view('U3').ravel()
1.04 µs ± 9.81 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
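The view trick generalizes to any (n, m) '<U1' array, as long as the memory layout cooperates; a minimal sketch (the helper name is mine):
import numpy as np

def join_rows_via_view(arr):
    # Reinterpret an (n, m) array of single characters as n strings of length m.
    # ascontiguousarray guards against non-contiguous inputs (e.g. slices),
    # since .view() needs the characters of each row adjacent in memory.
    n, m = arr.shape
    return np.ascontiguousarray(arr).view(f'<U{m}').ravel()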
===
To use np.char.join we have to collect the rows into some sort of tuple, list, etc. One way to do that is to make an object dtype array and fill it from the original array:
In [110]: temp = np.empty(arr.shape[0], object)
In [111]: temp
Out[111]: array([None, None, None], dtype=object)
In [112]: temp[:] = list(arr)
In [113]: temp
Out[113]:
array([array(['a', 'b', 'c'], dtype='<U1'),
       array(['d', 'e', 'f'], dtype='<U1'),
       array(['g', 'h', 'i'], dtype='<U1')], dtype=object)
In [114]: np.char.join('',temp)
Out[114]: array(['abc', 'def', 'ghi'], dtype='<U3')
or filling it with a list of lists:
In [115]: temp[:] = arr.tolist()
In [116]: temp
Out[116]:
array([list(['a', 'b', 'c']), list(['d', 'e', 'f']),
       list(['g', 'h', 'i'])], dtype=object)
In [117]: np.char.join('',temp)
Out[117]: array(['abc', 'def', 'ghi'], dtype='<U3')
In [122]: %%timeit
...: temp = np.empty(arr.shape[0], object)
...: temp[:] = arr.tolist()
...: np.char.join('', temp)
...:
...:
22.1 µs ± 69.1 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
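Wrapped as a convenience helper (the name is mine), the recipe reads:
def char_join(sep, arr2d):
    # Pack each row into a Python list so np.char.join sees sequences,
    # then join each one with the given separator.
    temp = np.empty(arr2d.shape[0], object)
    temp[:] = arr2d.tolist()
    return np.char.join(sep, temp)

char_join('', arr)   # array(['abc', 'def', 'ghi'], dtype='<U3')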
====
To get a better idea of what np.char.join can do, compare it with split:
In [132]: temp
Out[132]:
array([list(['a', 'b', 'c']), list(['d', 'e', 'f']),
       list(['g', 'h', 'i'])], dtype=object)
In [133]: b = np.char.join(',',temp)
In [134]: b
Out[134]: array(['a,b,c', 'd,e,f', 'g,h,i'], dtype='<U5')
In [135]: np.char.split(b,',')
Out[135]:
array([list(['a', 'b', 'c']), list(['d', 'e', 'f']),
       list(['g', 'h', 'i'])], dtype=object)
Another way to apply a string join to the elements of the object array:
In [136]: np.frompyfunc(lambda s: ','.join(s), 1,1)(temp)
Out[136]: array(['a,b,c', 'd,e,f', 'g,h,i'], dtype=object)

np.array([''.join(row) for row in array])
is the pythonic way: build the strings with a list comprehension, then wrap the result in a numpy array.
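If a separator is needed, the same pattern extends directly, and the .tolist() speedup noted above still applies:
np.array(['-'.join(row) for row in array.tolist()])
# e.g. array(['a-b-c', 'd-e-f', 'g-h-i'], dtype='<U5')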


What are the downsides of always using numpy arrays instead of python lists?

I'm writing a program in which I want to flatten an array, so I used the following code:
list_of_lists = [["a","b","c"], ["d","e","f"], ["g","h","i"]]
flattened_list = [i for j in list_of_lists for i in j]
This results in ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i'], the desired output.
I then found out that using a numpy array, I could've done the same simply by using np.array(((1,2),(3,4),(5,6))).flatten().
I was wondering if there is any downside to always using numpy arrays in the place of regular Python lists? In other words, is there something that Python lists can do which numpy arrays can't?
With your small example, the list comprehension is faster than the array method, even when taking the array creation out of the timing loop:
In [204]: list_of_lists = [["a","b","c"], ["d","e","f"], ["g","h","i"]]
...: flattened_list = [i for j in list_of_lists for i in j]
In [205]: timeit [i for j in list_of_lists for i in j]
757 ns ± 17.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [206]: np.ravel(list_of_lists)
Out[206]: array(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i'], dtype='<U1')
In [207]: timeit np.ravel(list_of_lists)
8.05 µs ± 12.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [208]: %%timeit x = np.array(list_of_lists)
...: np.ravel(x)
2.33 µs ± 22.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
With a much larger example, I expect [208] to get better.
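A sketch of that larger benchmark with the timeit module (the size and repeat counts are my choices; no timings are claimed here):
from timeit import timeit
import numpy as np

big = [["a", "b", "c"]] * 100_000   # hypothetical larger input
t_list = timeit(lambda: [i for j in big for i in j], number=10)
t_arr = timeit(lambda: np.ravel(np.array(big)), number=10)
x = np.array(big)                   # take array creation out of the loop
t_pre = timeit(lambda: np.ravel(x), number=10)
print(f"list comp: {t_list:.3f}s  ravel+create: {t_arr:.3f}s  ravel only: {t_pre:.3f}s")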
If the sublists differ in size, the array is not 2d, and flatten does nothing:
In [209]: list_of_lists = [["a","b","c",23], ["d",None,"f"], ["g","h","i"]]
...: flattened_list = [i for j in list_of_lists for i in j]
In [210]: flattened_list
Out[210]: ['a', 'b', 'c', 23, 'd', None, 'f', 'g', 'h', 'i']
In [211]: np.array(list_of_lists)
Out[211]:
array([list(['a', 'b', 'c', 23]), list(['d', None, 'f']),
       list(['g', 'h', 'i'])], dtype=object)
Growing lists is more efficient:
In [217]: alist = []
In [218]: for row in list_of_lists:
     ...:     alist.append(row)
     ...:
In [219]: alist
Out[219]: [['a', 'b', 23], ['d', None, 'f'], ['g', 'h', 'i']]
In [220]: np.array(alist)
Out[220]:
array([['a', 'b', 23],
       ['d', None, 'f'],
       ['g', 'h', 'i']], dtype=object)
We strongly discourage iterative concatenation. Collect the sublists or arrays in a list first.
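To illustrate (a hedged sketch; the variable names are mine): repeated concatenation copies the accumulated result on every step, while appending to a list and converting once does not.
import numpy as np

rows = [['a', 'b', 'c'], ['d', 'e', 'f'], ['g', 'h', 'i']]

# Discouraged: each np.concatenate reallocates and copies -> O(n**2) overall.
out = np.empty(0, dtype=object)
for row in rows:
    out = np.concatenate([out, np.array(row, dtype=object)])

# Preferred: collect in a Python list, convert once at the end -> O(n).
collected = []
for row in rows:
    collected.extend(row)
result = np.array(collected, dtype=object)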
Yes, there are. The rule of thumb: numpy.array is better for data of a uniform datatype (all integers, all double-precision floats, all booleans, strings of the same length, etc.) rather than a mixed bag of things. In the latter case you might just as well use a generic list, considering this:
In [93]: a = [b'5', 5, '55', 'ab', 'cde', 'ef', 4, 6]
In [94]: b = np.array(a)
In [95]: %timeit 5 in a
65.6 ns ± 0.79 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
In [96]: %timeit 6 in a # worst case
219 ns ± 5.48 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [97]: %timeit 5 in b
10.9 µs ± 217 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Note the performance difference of several orders of magnitude, where numpy.array is slower! Certainly this depends on the size of the list, and in this particular case on the position of 5 or 6 (worst case O(n) complexity), but you get the idea.
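If repeated membership tests are the bottleneck, a plain Python set beats both containers (my note, not part of the original timing):
s = set(a)      # one-time O(n) conversion
5 in s          # average O(1) per lookup afterwards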
Numpy arrays and functions are better for the most part. Here is an article if you want to look into it more: https://webcourses.ucf.edu/courses/1249560/pages/python-lists-vs-numpy-arrays-what-is-the-difference

Create an array with a letter repeated a given number of times given by another array

I have an array a and I want to create another array b with a certain string repeated the number of times specified by a:
a = np.array([1,2,3])
s = 'a'
I want b to be np.array(['a','aa','aaa']). What would be the numpy way to do this without loops?
Though my use case does not need it, more generally, given
a = np.array([1,2,3])
s = np.array(['a','b','c'])
How to get b to be np.array(['a','bb','ccc']) without loops?
There is a built-in method:
output = np.core.defchararray.multiply(s,a)
Let's compare the alternatives:
In [495]: a = np.array([1, 2, 3])
...: s = np.array(['a', 'b', 'c'])
Using the np.char function. Under the covers this applies string multiply to each element of the array (with a loop):
In [496]: np.char.multiply(s,a)
Out[496]: array(['a', 'bb', 'ccc'], dtype='<U3')
An explicit loop. i.item() converts the numpy string to Python string:
In [498]: np.array([i.item()*j for i,j in zip(s,a)])
Out[498]: array(['a', 'bb', 'ccc'], dtype='<U3')
Another way of creating an array of Python strings:
In [499]: s.astype(object)*a
Out[499]: array(['a', 'bb', 'ccc'], dtype=object)
Timings:
In [500]: timeit np.char.multiply(s,a)
21.3 µs ± 975 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [501]: timeit np.array([i.item()*j for i,j in zip(s,a)])
13.4 µs ± 21.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [502]: timeit s.astype(object)*a
9.16 µs ± 226 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
So the explicit loop approach does pretty well.
Another idea - use frompyfunc:
In [504]: np.frompyfunc(lambda i,j: i*j, 2,1)(s,a)
Out[504]: array(['a', 'bb', 'ccc'], dtype=object)
In [505]: timeit np.frompyfunc(lambda i,j: i*j, 2,1)(s,a)
6.28 µs ± 56 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
I thought of frompyfunc because I wondered if we could use broadcasting:
In [508]: np.frompyfunc(lambda i,j: i*j, 2,1)(s,a[:,None])
Out[508]:
array([['a', 'b', 'c'],
       ['aa', 'bb', 'cc'],
       ['aaa', 'bbb', 'ccc']], dtype=object)
But that kind of broadcasting works for the other methods as well.
np.vectorize wraps np.frompyfunc and adds dtype conversion (frompyfunc always returns object dtype), but it tends to be slower.
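For completeness, the np.vectorize equivalent might look like this (my sketch; passing otypes skips the trial call normally used for dtype inference):
np.vectorize(lambda i, j: i * j, otypes=[object])(s, a)
# array(['a', 'bb', 'ccc'], dtype=object)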
I don't understand why you insist on them being numpy objects. Maybe I'm misunderstanding the question, but you would handle it the same way as a list:
import numpy as np
a = np.array([1, 2, 3])
s = np.array(['a', 'b', 'c'])
new_array = np.array([str(s[i])*a[i] for i in range(len(s))])
print(new_array)
Outputs:
['a' 'bb' 'ccc']
This assumes a and s are of equal length, since it was not specified otherwise.

vector substitution in numpy/python [duplicate]

This question already has answers here:
Numpy: get values from array where indices are in another array
(2 answers)
Closed 4 years ago.
Given two vectors v=['a','b','c'] and i=np.random.randint(len(v),size=10), I can get the "substitution" vector
vi = [v[i[x]] for x in range(len(i))]
E.g., vi is
['a', 'a', 'c', 'c', 'b', 'a', 'c', 'a', 'c', 'a']
if
i = array([0, 0, 2, 2, 1, 0, 2, 0, 2, 0])
Is there a vectorized operation for this?
You can simply use numpy indexing (note that you need to convert v to a numpy.array for this to work):
v = np.array(['a','b','c'])
i = np.random.randint(len(v),size=10)
>>> v[i]
array(['c', 'b', 'a', 'b', 'c', 'b', 'a', 'a', 'b', 'b'], dtype='<U1')
Timings
In [26]: i = np.random.randint(len(v),size=1000000)
In [27]: %timeit [v[i[x]] for x in range(len(i))]
554 ms ± 6.41 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [28]: %timeit v[i]
4.85 ms ± 12.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [29]: %timeit [v[s] for s in i]
505 ms ± 1.95 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Search indexes where values in my array match a value in a different array (python) [duplicate]

I have two numpy arrays, A and B. A contains unique values and B is a sub-array of A.
Now I am looking for a way to get the index of B's values within A.
For example:
A = np.array([1,2,3,4,5,6,7,8,9,10])
B = np.array([1,7,10])
# I need a function fun() that:
fun(A,B)
>> 0,6,9
You can use np.in1d with np.nonzero -
np.nonzero(np.in1d(A,B))[0]
You can also use np.searchsorted, if you care about maintaining the order -
np.searchsorted(A,B)
For a generic case, when A & B are unsorted arrays, you can bring in the sorter option in np.searchsorted, like so -
sort_idx = A.argsort()
out = sort_idx[np.searchsorted(A,B,sorter = sort_idx)]
I would also add my favorite broadcasting to the mix, to solve the generic case -
np.nonzero(B[:,None] == A)[1]
Sample run -
In [125]: A
Out[125]: array([ 7, 5, 1, 6, 10, 9, 8])
In [126]: B
Out[126]: array([ 1, 10, 7])
In [127]: sort_idx = A.argsort()
In [128]: sort_idx[np.searchsorted(A,B,sorter = sort_idx)]
Out[128]: array([2, 4, 0])
In [129]: np.nonzero(B[:,None] == A)[1]
Out[129]: array([2, 4, 0])
Have you tried searchsorted?
A = np.array([1,2,3,4,5,6,7,8,9,10])
B = np.array([1,7,10])
A.searchsorted(B)
# array([0, 6, 9])
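Note that searchsorted returns insertion points, so this shortcut is only correct when A is sorted and every element of B actually occurs in A; otherwise you silently get an insertion index (my illustration):
A.searchsorted(np.array([11]))
# array([10]) -- an insertion point past the end, not a located match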
Just for completeness: if the values in A are non-negative and reasonably small:
lookup = np.empty((np.max(A) + 1), dtype=int)
lookup[A] = np.arange(len(A))
indices = lookup[B]
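For instance, with the unsorted sample from the earlier answer (a worked example of my own):
A = np.array([7, 5, 1, 6, 10, 9, 8])
B = np.array([1, 10, 7])
lookup = np.empty(np.max(A) + 1, dtype=int)
lookup[A] = np.arange(len(A))
lookup[B]
# array([2, 4, 0])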
I had the same question recently; however, timing performance was critical for me, so a timing comparison of the different solutions may be useful to others.
As Divakar mentioned, you can use np.in1d(A, B) with np.where or np.nonzero. Moreover, you can combine np.in1d(A, B) with np.intersect1d (based on this page). Also, np.searchsorted is another useful approach for sorted arrays.
I want to add another simple solution: a list comprehension. It may take longer than the previous ones, but if you take advantage of the Numba package, it becomes much less time-consuming.
In [1]: import numpy as np
In [2]: from numba import njit
In [3]: a = np.array([1,2,3,4,5,6,7,8,9,10])
In [4]: b = np.array([1,7,10])
In [5]: np.where(np.in1d(a, b))[0]
...: array([0, 6, 9])
In [6]: np.nonzero(np.in1d(a, b))[0]
...: array([0, 6, 9])
In [7]: np.searchsorted(a, b)
...: array([0, 6, 9])
In [8]: np.searchsorted(a, np.intersect1d(a, b))
...: array([0, 6, 9])
In [9]: [i for i, x in enumerate(a) if x in b]
...: [0, 6, 9]
In [10]: @njit
    ...: def func(a, b):
    ...:     return [i for i, x in enumerate(a) if x in b]
In [11]: func(a, b)
...: [0, 6, 9]
Now, let's compare the timing performance of these solutions.
In [12]: %timeit np.where(np.in1d(a, b))[0]
4.26 µs ± 6.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [13]: %timeit np.nonzero(np.in1d(a, b))[0]
4.39 µs ± 14.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [14]: %timeit np.searchsorted(a, b)
800 ns ± 6.04 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [15]: %timeit np.searchsorted(a, np.intersect1d(a, b))
8.8 µs ± 73.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [16]: %timeit [i for i, x in enumerate(a) if x in b]
15.4 µs ± 18.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [17]: %timeit func(a, b)
336 ns ± 0.579 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

Sampling with repetition in Python

I have a string with 50-ish elements. I need to randomize it and generate a much longer string. I found that random.sample() only picks unique elements, which is great but not fit for my purpose. Is there a way to allow repetitions in Python, or do I need to manually build a cycle?
You can use numpy.random.choice. It has an argument to specify how many samples you want, and an argument to specify whether you want replacement. Something like the following should work.
import numpy as np
choices = np.random.choice([1, 2, 3], size=10, replace=True)
# array([2, 1, 2, 3, 3, 1, 2, 2, 3, 2])
If your input is a string, say something like my_string = 'abc', you can use:
choices = np.random.choice([char for char in my_string], size=10, replace=True)
# array(['c', 'b', 'b', 'c', 'b', 'a', 'a', 'a', 'c', 'c'], dtype='<U1')
Then get a new string out of it with:
new_string = ''.join(choices)
# 'cbbcbaaacc'
Performance
Timing the three answers so far and random.choices from the comments (skipping the ''.join part since we all used it) producing 1000 samples from the string 'abc', we get:
numpy.random.choice([char for char in 'abc'], size=1000, replace=True):
34.1 µs ± 213 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
random.choices('abc', k=1000)
269 µs ± 4.27 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
[random.choice('abc') for _ in range(1000)]:
924 µs ± 10.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
[random.sample('abc',1)[0] for _ in range(1000)]:
4.32 ms ± 67.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Numpy is fastest by far. If you put the ''.join parts in there, you actually see numpy and random.choices neck and neck, with both being three times faster than the next fastest for this example.
You could do something like this:
import random
chars = 'abcdef'  # renamed from `dict` to avoid shadowing the built-in
''.join([random.choice(chars) for x in range(50)])
Not saying this is the most effective approach (you should probably use choice here), but consider it:
import random
a = ['a','b','c']
' '.join([random.sample(a,1)[0] for _ in range(6)])
I have found this; I forgot to mention I was on Python 3.6:
DICTIONARY_NUMBERS_HEX = ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'A', 'B', 'C', 'D', 'E', 'F']
block_text = "".join(random.choices(DICTIONARY_NUMBERS_HEX, k=50))
Using the k=50 keyword argument allows repeated elements.
