Efficiently merging a large number of pyspark DataFrames

Efficiently merging a large number of pyspark DataFrames - python

I'm trying to perform dataframe union of thousands of dataframes in a Python list. I'm using two approaches I found. The first one is by means of for loop union and the second one is using functools.reduce. Both them work well for toy examples, however for thousands of dataframes I'm experimenting a severe overhead, probably caused by code out of the JVM, sequentialy appending each dataframe at a time (using both merging approaches).
from functools import reduce # For Python 3.x
from pyspark.sql import DataFrame
# The reduce approach
def unionAll(dfs):
return reduce(DataFrame.unionAll, dfs)
df_list = [td2, td3, td4, td5, td6, td7, td8, td9, td10]
df = unionAll(df_list)
#The loop approach
df = df_list[0].union(df_list[1])
for d in df_list[2:]:
df = df.union(d)
The question is how to perform this multiple dataframe operation efficiently, probably circunventing the overhead caused by merging dataframes one-by-one.
Thank you very much

You are currently joining your DataFrames like this:
(((td1 + td2) + td3) + td4)
At each stage, you are concatenating a huge dataframe with a small dataframe, resulting in a copy at each step and a lot of wasted memory. I would suggest combining them like this:
(td1 + td2) + (td3 + td4)
The idea is to iteratively coalesce pairs of roughly the same size until you are left with a single result. Here is a prototype:
def pairwise_reduce(op, x):
while len(x) > 1:
v = [op(i, j) for i, j in zip(x[::2], x[1::2])]
if len(x) > 1 and len(x) % 2 == 1:
v[-1] = op(v[-1], x[-1])
x = v
return x[0]
result = pairwise_reduce(DataFrame.unionAll, df_list)
You will see how this makes a huge difference for python lists.
from functools import reduce
from operator import add
x = [[1, 2, 3], [4, 5, 6], [7, 8], [9, 10, 11, 12]] * 1000
%timeit sum(x, [])
%timeit reduce(add, x)
%timeit pairwise_reduce(add, x)
64.2 ms ± 606 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
66.3 ms ± 679 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
970 µs ± 9.02 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
sum(x, []) == reduce(add, x) == pairwise_reduce(add, x)
# True

Related

Numpy fastest way to get indexes of neighbors with current value (flood fill)

I need to find fast way to get indicies of neighbors with values like current
For example:
arr = [0, 0, 0, 1, 0, 1, 1, 1, 1, 0]
indicies = func(arr, 6)
# [5, 6, 7, 8]
6th element has value 1, so I need full slice containing 6th and all it's neighbors with same value
It is like a part of flood fill algorithm. Is there a way to do it fast in numpy?
Is there a way for 2D array?
EDIT
Let's see some perfomance tests:
import numpy as np
import random
np.random.seed(1488)
arr = np.zeros(5000)
for x in np.random.randint(0, 5000, size = 100):
arr[x:x+50] = 1
I will compare function from #Ehsan:
def func_Ehsan(arr, idx):
change = np.insert(np.flatnonzero(np.diff(arr)), 0, -1)
loc = np.searchsorted(change, idx)
start = change[max(loc-1,0)]+1 if loc<len(change) else change[loc-1]
end = change[min(loc, len(change)-1)]
return (start, end)
change = np.insert(np.flatnonzero(np.diff(arr)), 0, -1)
def func_Ehsan_same_arr(arr, idx):
loc = np.searchsorted(change, idx)
start = change[max(loc-1,0)]+1 if loc<len(change) else change[loc-1]
end = change[min(loc, len(change)-1)]
return (start, end)
with my pure python function:
def my_func(arr, index):
val = arr[index]
size = arr.size
end = index + 1
while end < size and arr[end] == val:
end += 1
start = index - 1
while start > -1 and arr[start] == val:
start -= 1
return start + 1, end
Take a look:
np.random.seed(1488)
%timeit my_func(arr, np.random.randint(0, 5000))
# 42.4 µs ± 700 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
np.random.seed(1488)
%timeit func_Ehsan(arr, np.random.randint(0, 5000))
# 115 µs ± 1.92 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
np.random.seed(1488)
%timeit func_Ehsan_same_arr(arr, np.random.randint(0, 5000))
# 18.1 µs ± 953 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Is there a way to use same logic by numpy, without C module/Cython/Numba/python loops? And make it faster!

I don't know how to solve this problem with numpy but If you use pandas, you might get the result that you want with this:
import pandas as pd
df=pd.DataFrame(arr,columns=["data"])
df["new"] = df["data"].diff().ne(0).cumsum()
[{i[0]:j.index.tolist()} for i,j in df.groupby(["data","new"],sort=False)]
Output:
[{0: [0, 1, 2]}, {1: [3]}, {0: [4]}, {1: [5, 6, 7, 8]}, {0: [9]}]

The main problem is that Numpy is not currently designed to solve this problem efficiently. A "find first index of value fast" or any similar lazy function call is required to solve this problem efficiently. However, while this feature as been discussed since 10 years ago, this feature is still no available in Numpy. See this post for more information. I do not expect any change soon. Until then, the best solution on relatively big array appear to use an iterative solution using relatively slow pure-Python loops and slow Numpy calls/accesses.
Beside this, one solution to speed up the computation is to work on small chunks. Here is an implementation:
def my_func_opt1(arr, index):
val = arr[index]
size = arr.size
chunkSize = 128
end = index + 1
while end < size:
chunk = arr[end:end+chunkSize]
locations = (chunk != val).nonzero()[0]
if len(locations) > 0:
foundCount = locations[0]
end += foundCount
break
else:
end += len(chunk)
start = index
while start > 0:
chunk = arr[max(start-chunkSize,0):start]
locations = (chunk != val).nonzero()[0]
if len(locations) > 0:
foundCount = locations[-1]
start -= chunkSize - 1 - foundCount
break
else:
start -= len(chunk)
return start, end
Here are performance results on my machine:
func_Ehsan: 53.8 µs ± 449 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
my_func: 17.5 µs ± 97.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
my_func_opt1: 7.31 µs ± 52.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
The thing is the result are a bit biased since np.random.randint takes actually 2.01 µs. Without this Numpy call included in the benchmark, here are the results:
func_Ehsan: 51.8 µs
my_func: 15.5 µs
my_func_opt1: 5.31 µs
As a result, my_func_opt1 is about 3 times faster than my_func. This is very difficult to write a faster code as any Numpy call introduces a relatively big overhead of 0.5-1.0 µs on my machine whatever the array size (eg. empty arrays) due to internal checks.
The following is useful for people interested in speeding up the operation and that can use Numba.
The simplest solution consist in using the Numba's JIT and more specifically just add decorator. This solution is also very fast.
#nb.njit('UniTuple(i8,2)(f8[::1], i8)')
def my_func_opt2(arr, index):
val = arr[index]
size = arr.size
end = index + 1
while end < size and arr[end] == val:
end += 1
start = index - 1
while start > -1 and arr[start] == val:
start -= 1
return start + 1, end
On my machine my_func_opt2 takes only 0.63 µs (wit the random call excluded). As a result, my_func_opt2 is about 25 times faster than my_func. I highly doubt there is a faster solution since any Numpy calls on my machine take at least 0.5 µs and an empty Numba function takes 0.25 µs to call.
Beside this, please note that arr contains double-precision values which are pretty expensive to compute. It should be faster to use integers if you can. Also, please note that an array of 0 and 1 values can be stored in a int8 value which takes 8 times less memory and is often significantly faster to compute (due to CPU caches, the smaller the array the faster the computation). You can specify the type during the creation of the array: np.zeros(5000, dtype=np.int8)

Here is a numpy solution. I think you can improve it by a little more work:
def func(arr, idx):
change = np.insert(np.flatnonzero(np.diff(arr)), 0, -1)
loc = np.searchsorted(change, idx)
start = change[max(loc-1,0)]+1 if loc<len(change) else change[loc-1]
end = change[min(loc, len(change)-1)]
return np.arange(start, end)
sample output:
indices = func(arr, 6)
#[5 6 7 8]
This would specially be faster if you have few changes in your arr (relative to array size) and you are looking for multiple of those index searches in the same array. Otherwise, faster solutions come to mind.
Performance comparison:
If you are performing on same array multiple times, simply put first line out of function like this to avoid repetition.
change = np.insert(np.flatnonzero(np.diff(arr)), 0, -1)
def func(arr, idx):
loc = np.searchsorted(change, idx)
start = change[max(loc-1,0)]+1 if loc<len(change) else change[loc-1]
end = change[min(loc, len(change)-1)]
return np.arange(start, end)
For same input as OP's example:
np.random.seed(1488)
%timeit func_OP(arr, np.random.randint(0, 5000))
#23.5 µs ± 631 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
np.random.seed(1488)
%timeit func_Ehsan(arr, np.random.randint(0, 5000))
#7.89 µs ± 113 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
np.random.seed(1488)
%timeit func_Jérôme_opt1(arr, np.random.randint(0, 5000))
#12.1 µs ± 757 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit func_Jérôme_opt2(arr, np.random.randint(0, 5000))
#3.45 µs ± 179 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
With func_Ehsan being fastest (excluding Numba). Please mind that again, the performance of these functions vary significantly on number of changes in array, array size and how many times the function is being called on the same array. And of course Numba is faster than all (almost 2x faster than func_Ehsan. And if you are going to run it many times, build the groups in O(n) and use hash map to indices in O(1).

How can merging on a wider p dataframe (n, p) can take more time?

I observe longer computation runtime when merging a dataframe with more columns. But the number of links to make remains unganged. Does someone has an insight on the reason of this behaviour ?
Here is the code :
def set_vars(n, p):
np.random.seed(seed=42)
df = pd.DataFrame(np.random.randint(27, size=(n,p)), columns=['merge'] + list(range(p-1)))
df_merge = pd.DataFrame(np.concatenate((np.arange(26).reshape((26,1)),
np.arange(26).reshape((26,1))), axis=1), columns=['merge', 'val'])
return df, df_merge
df, df_merge = set_vars(1000000, 5)
%%timeit
b = df.merge(df_merge, on='merge')
# 255 ms ± 63.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
df, df_merge = set_vars(1000000, 20)
%%timeit
b = df.merge(df_merge, on='merge')
# 494 ms ± 46.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Each row after the definition of set_vars() was run separately, so df and df_merge definition are not impacting.
Not sure to understand why. As the merge is on only one column, why having a big (p) dataset is impacting the performance ?

df.merge creates a new dataframe. This operation is costly and dependant of the size of the input dataframes and also the resulting dataframe. In the first case, the size of the output dataframe b is 963224 x 6, while in the second case, it is 963137 x 21. Producing a ~3 time bigger dataframe takes about twice more time.

Fastest way to append nonzero numpy array elements to list

I want to add all nonzero elements from a numpy array arr to a list out_list. Previous research suggests that for numpy arrays, using np.nonzero is most efficient. (My own benchmark below actually suggests it can be slightly improved using np.delete).
However, in my case I want my output to be a list, because I am combining many arrays for which I don't know the number of nonzero elements (so I can't effectively preallocate a numpy array for them). Hence, I was wondering whether there are some synergies that can be exploited to speed up the process. While my naive list comprehension approach is much slower than the pure numpy approach, I got some promising results combining list comprehension with numba.
Here's what I found so far:
import numpy as np
n = 60_000 # size of array
nz = 0.3 # fraction of zero elements
arr = (np.random.random_sample(n) - nz).clip(min=0)
# method 1
def add_to_list1(arr, out):
out.extend(list(arr[np.nonzero(arr)]))
# method 2
def add_to_list2(arr, out):
out.extend(list(np.delete(arr, arr == 0)))
# method 3
def add_to_list3(arr, out):
out += [x for x in arr if x != 0]
# method 4 (not sure how to get numba to accept an empty list as argument)
#njit
def add_to_list4(arr):
return [x for x in arr if x != 0]
out_list = []
%timeit add_to_list1(arr, out_list)
out_list = []
%timeit add_to_list2(arr, out_list)
out_list = []
%timeit add_to_list3(arr, out_list)
_ = add_to_list4(arr) # call once to compile
out_list = []
%timeit out_list.extend(add_to_list4(arr))
Yielding the following results:
2.51 ms ± 137 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.19 ms ± 133 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
15.6 ms ± 183 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
1.63 ms ± 158 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Not surprisingly, numba outperforms all other methods. Among the rest, method 2 (using np.delete) is the best. Am I missing any obvious alternative that exploits the fact that I am converting to a list afterwards? Can you think of anything to further speed up the process?
Edit 1:
Performance of .tolist():
# method 5
def add_to_list5(arr, out):
out += arr[arr != 0].tolist()
# method 6
def add_to_list6(arr, out):
out += np.delete(arr, arr == 0).tolist()
# method 7
def add_to_list7(arr, out):
out += arr[arr.astype(bool)].tolist()
Timings are on par with numba:
1.62 ms ± 118 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.65 ms ± 104 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each
1.78 ms ± 119 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Edit 2:
Here's some benchmarking using Mad Physicist's suggestion to use np.concatenate to construct a numpy array instead.
# construct numpy array using np.concatenate
out_list = []
t = time.perf_counter()
for i in range(100):
out_list.append(arr[arr != 0])
result = np.concatenate(out_list)
print(f"Time elapsed: {time.perf_counter() - t:.4f}s")
# compare with best list-based method
out_list = []
t = time.perf_counter()
for i in range(100):
out_list += arr[arr != 0].tolist()
print(f"Time elapsed: {time.perf_counter() - t:.4f}s")
Concatenating numpy arrays yields indeed another significant speed-up, although it is not directly comparable since the output is a numpy array instead of a list. So it will depend on the precise use what will be best.
Time elapsed: 0.0400s
Time elapsed: 0.1430s
TLDR;
1/ using arr[arr != 0] is the fastest of all the indexing options
2/ using .tolist() instead of list(.) speeds up things by a factor 1.3 - 1.5
3/ with the gains of 1/ and 2/ combined, the speed is on par with numba
4/ if having a numpy array instead of a list is acceptable, then using np.concatenate yields another gain in speed by a factor of ~3.5 compared to the best alternative

I submit that the method of choice, if you are indeed looking for a list output, is:
def f(arr, out_list):
out_list += arr[arr != 0].tolist()
It seems to beat all the other methods mentioned so far in the OP's question or in other responses (at the time of this writing).
If, however, you are looking for a result as a numpy array, then following #MadPhysicist's version (slightly modified to use arr[arr != 0] instead of using np.nonzero()) is almost 6x faster, see at the end of this post.
Side note: I would avoid using %timeit out_list.extend(some_list): it keeps adding to out_list during the many loops of timeit. Example:
out_list = []
%timeit out_list.extend([1,2,3])
and now:
>>> len(out_list)
243333333 # yikes
Timings
On 60K items on my machine, I see:
out_list = []
a = %timeit -o out_list + arr[arr != 0].tolist()
b = %timeit -o out_list + arr[np.nonzero(arr)].tolist()
c = %timeit -o out_list + list(arr[np.nonzero(arr)])
Yields:
1.23 ms ± 10.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.53 ms ± 2.53 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
4.29 ms ± 3.02 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
And:
>>> c.average / a.average
3.476
>>> b.average / a.average
1.244
For a numpy array result instead
Following #MadPhysicist, you can get some extra boost by not turning the arrays into lists, but using np.concatenate() instead:
def all_nonzero(arr_iter):
"""return non zero elements of all arrays as a np.array"""
return np.concatenate([a[a != 0] for a in arr_iter])
def all_nonzero_list(arr_iter):
"""return non zero elements of all arrays as a list"""
out_list = []
for a in arr_iter:
out_list += a[a != 0].tolist()
return out_list
from itertools import repeat
ta = %timeit -o all_nonzero(repeat(arr, 100))
tl = %timeit -o all_nonzero_list(repeat(arr, 100))
Yields:
39.7 ms ± 107 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
227 ms ± 680 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
and
>>> tl.average / ta.average
5.75

Instead of extending a list by all of the elements of a new array, append the array itself. This will make for much fewer and smaller reallocations. You can also pre-allocate a list of Nones up-front or even use an object array, if you have an upper bound on the number of arrays you will process.
When you're done, call np.concatenate on the list.
So instead of this:
L = []
for i in range(10):
arr = (np.random.random_sample(n) - nz).clip(min=0)
L.extend(arr[np.nonzero(arr)])
result = np.array(L)
Try this:
L = []
for i in range(10):
arr = (np.random.random_sample(n) - nz).clip(min=0)
L.append(arr[np.nonzero(arr)])
result = np.concatenate(L)
Since you're keeping arrays around, the final concatenation will be a series of buffer copies (which is fast), rather than a bunch of python-to numpy type conversions (which won't be). The exact method you choose for deletion is of course still up to the result of your benchmark.
Also, here's another method to add to your benchmark:
def add_to_list5(arr, out):
out.extend(list(arr[arr.astype(bool)]))
I don't expect this to be overwhelmingly fast, but it's interesting to see how masking stacks up next to indexing.

Vectorized operations on 2 columns of different dataframes [duplicate]

I have two numpy arrays that define the x and y axes of a grid. For example:
x = numpy.array([1,2,3])
y = numpy.array([4,5])
I'd like to generate the Cartesian product of these arrays to generate:
array([[1,4],[2,4],[3,4],[1,5],[2,5],[3,5]])
In a way that's not terribly inefficient since I need to do this many times in a loop. I'm assuming that converting them to a Python list and using itertools.product and back to a numpy array is not the most efficient form.

A canonical cartesian_product (almost)
There are many approaches to this problem with different properties. Some are faster than others, and some are more general-purpose. After a lot of testing and tweaking, I've found that the following function, which calculates an n-dimensional cartesian_product, is faster than most others for many inputs. For a pair of approaches that are slightly more complex, but are even a bit faster in many cases, see the answer by Paul Panzer.
Given that answer, this is no longer the fastest implementation of the cartesian product in numpy that I'm aware of. However, I think its simplicity will continue to make it a useful benchmark for future improvement:
def cartesian_product(*arrays):
la = len(arrays)
dtype = numpy.result_type(*arrays)
arr = numpy.empty([len(a) for a in arrays] + [la], dtype=dtype)
for i, a in enumerate(numpy.ix_(*arrays)):
arr[...,i] = a
return arr.reshape(-1, la)
It's worth mentioning that this function uses ix_ in an unusual way; whereas the documented use of ix_ is to generate indices into an array, it just so happens that arrays with the same shape can be used for broadcasted assignment. Many thanks to mgilson, who inspired me to try using ix_ this way, and to unutbu, who provided some extremely helpful feedback on this answer, including the suggestion to use numpy.result_type.
Notable alternatives
It's sometimes faster to write contiguous blocks of memory in Fortran order. That's the basis of this alternative, cartesian_product_transpose, which has proven faster on some hardware than cartesian_product (see below). However, Paul Panzer's answer, which uses the same principle, is even faster. Still, I include this here for interested readers:
def cartesian_product_transpose(*arrays):
broadcastable = numpy.ix_(*arrays)
broadcasted = numpy.broadcast_arrays(*broadcastable)
rows, cols = numpy.prod(broadcasted[0].shape), len(broadcasted)
dtype = numpy.result_type(*arrays)
out = numpy.empty(rows * cols, dtype=dtype)
start, end = 0, rows
for a in broadcasted:
out[start:end] = a.reshape(-1)
start, end = end, end + rows
return out.reshape(cols, rows).T
After coming to understand Panzer's approach, I wrote a new version that's almost as fast as his, and is almost as simple as cartesian_product:
def cartesian_product_simple_transpose(arrays):
la = len(arrays)
dtype = numpy.result_type(*arrays)
arr = numpy.empty([la] + [len(a) for a in arrays], dtype=dtype)
for i, a in enumerate(numpy.ix_(*arrays)):
arr[i, ...] = a
return arr.reshape(la, -1).T
This appears to have some constant-time overhead that makes it run slower than Panzer's for small inputs. But for larger inputs, in all the tests I ran, it performs just as well as his fastest implementation (cartesian_product_transpose_pp).
In following sections, I include some tests of other alternatives. These are now somewhat out of date, but rather than duplicate effort, I've decided to leave them here out of historical interest. For up-to-date tests, see Panzer's answer, as well as Nico Schlömer's.
Tests against alternatives
Here is a battery of tests that show the performance boost that some of these functions provide relative to a number of alternatives. All the tests shown here were performed on a quad-core machine, running Mac OS 10.12.5, Python 3.6.1, and numpy 1.12.1. Variations on hardware and software are known to produce different results, so YMMV. Run these tests for yourself to be sure!
Definitions:
import numpy
import itertools
from functools import reduce
### Two-dimensional products ###
def repeat_product(x, y):
return numpy.transpose([numpy.tile(x, len(y)),
numpy.repeat(y, len(x))])
def dstack_product(x, y):
return numpy.dstack(numpy.meshgrid(x, y)).reshape(-1, 2)
### Generalized N-dimensional products ###
def cartesian_product(*arrays):
la = len(arrays)
dtype = numpy.result_type(*arrays)
arr = numpy.empty([len(a) for a in arrays] + [la], dtype=dtype)
for i, a in enumerate(numpy.ix_(*arrays)):
arr[...,i] = a
return arr.reshape(-1, la)
def cartesian_product_transpose(*arrays):
broadcastable = numpy.ix_(*arrays)
broadcasted = numpy.broadcast_arrays(*broadcastable)
rows, cols = numpy.prod(broadcasted[0].shape), len(broadcasted)
dtype = numpy.result_type(*arrays)
out = numpy.empty(rows * cols, dtype=dtype)
start, end = 0, rows
for a in broadcasted:
out[start:end] = a.reshape(-1)
start, end = end, end + rows
return out.reshape(cols, rows).T
# from https://stackoverflow.com/a/1235363/577088
def cartesian_product_recursive(*arrays, out=None):
arrays = [numpy.asarray(x) for x in arrays]
dtype = arrays[0].dtype
n = numpy.prod([x.size for x in arrays])
if out is None:
out = numpy.zeros([n, len(arrays)], dtype=dtype)
m = n // arrays[0].size
out[:,0] = numpy.repeat(arrays[0], m)
if arrays[1:]:
cartesian_product_recursive(arrays[1:], out=out[0:m,1:])
for j in range(1, arrays[0].size):
out[j*m:(j+1)*m,1:] = out[0:m,1:]
return out
def cartesian_product_itertools(*arrays):
return numpy.array(list(itertools.product(*arrays)))
### Test code ###
name_func = [('repeat_product',
repeat_product),
('dstack_product',
dstack_product),
('cartesian_product',
cartesian_product),
('cartesian_product_transpose',
cartesian_product_transpose),
('cartesian_product_recursive',
cartesian_product_recursive),
('cartesian_product_itertools',
cartesian_product_itertools)]
def test(in_arrays, test_funcs):
global func
global arrays
arrays = in_arrays
for name, func in test_funcs:
print('{}:'.format(name))
%timeit func(*arrays)
def test_all(*in_arrays):
test(in_arrays, name_func)
# `cartesian_product_recursive` throws an
# unexpected error when used on more than
# two input arrays, so for now I've removed
# it from these tests.
def test_cartesian(*in_arrays):
test(in_arrays, name_func[2:4] + name_func[-1:])
x10 = [numpy.arange(10)]
x50 = [numpy.arange(50)]
x100 = [numpy.arange(100)]
x500 = [numpy.arange(500)]
x1000 = [numpy.arange(1000)]
Test results:
In [2]: test_all(*(x100 * 2))
repeat_product:
67.5 µs ± 633 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
dstack_product:
67.7 µs ± 1.09 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
cartesian_product:
33.4 µs ± 558 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
cartesian_product_transpose:
67.7 µs ± 932 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
cartesian_product_recursive:
215 µs ± 6.01 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
cartesian_product_itertools:
3.65 ms ± 38.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [3]: test_all(*(x500 * 2))
repeat_product:
1.31 ms ± 9.28 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
dstack_product:
1.27 ms ± 7.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
cartesian_product:
375 µs ± 4.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
cartesian_product_transpose:
488 µs ± 8.88 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
cartesian_product_recursive:
2.21 ms ± 38.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
cartesian_product_itertools:
105 ms ± 1.17 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [4]: test_all(*(x1000 * 2))
repeat_product:
10.2 ms ± 132 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
dstack_product:
12 ms ± 120 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
cartesian_product:
4.75 ms ± 57.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
cartesian_product_transpose:
7.76 ms ± 52.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
cartesian_product_recursive:
13 ms ± 209 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
cartesian_product_itertools:
422 ms ± 7.77 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In all cases, cartesian_product as defined at the beginning of this answer is fastest.
For those functions that accept an arbitrary number of input arrays, it's worth checking performance when len(arrays) > 2 as well. (Until I can determine why cartesian_product_recursive throws an error in this case, I've removed it from these tests.)
In [5]: test_cartesian(*(x100 * 3))
cartesian_product:
8.8 ms ± 138 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
cartesian_product_transpose:
7.87 ms ± 91.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
cartesian_product_itertools:
518 ms ± 5.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [6]: test_cartesian(*(x50 * 4))
cartesian_product:
169 ms ± 5.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
cartesian_product_transpose:
184 ms ± 4.32 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
cartesian_product_itertools:
3.69 s ± 73.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [7]: test_cartesian(*(x10 * 6))
cartesian_product:
26.5 ms ± 449 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
cartesian_product_transpose:
16 ms ± 133 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
cartesian_product_itertools:
728 ms ± 16 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [8]: test_cartesian(*(x10 * 7))
cartesian_product:
650 ms ± 8.14 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
cartesian_product_transpose:
518 ms ± 7.09 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
cartesian_product_itertools:
8.13 s ± 122 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
As these tests show, cartesian_product remains competitive until the number of input arrays rises above (roughly) four. After that, cartesian_product_transpose does have a slight edge.
It's worth reiterating that users with other hardware and operating systems may see different results. For example, unutbu reports seeing the following results for these tests using Ubuntu 14.04, Python 3.4.3, and numpy 1.14.0.dev0+b7050a9:
>>> %timeit cartesian_product_transpose(x500, y500)
1000 loops, best of 3: 682 µs per loop
>>> %timeit cartesian_product(x500, y500)
1000 loops, best of 3: 1.55 ms per loop
Below, I go into a few details about earlier tests I've run along these lines. The relative performance of these approaches has changed over time, for different hardware and different versions of Python and numpy. While it's not immediately useful for people using up-to-date versions of numpy, it illustrates how things have changed since the first version of this answer.
A simple alternative: meshgrid + dstack
The currently accepted answer uses tile and repeat to broadcast two arrays together. But the meshgrid function does practically the same thing. Here's the output of tile and repeat before being passed to transpose:
In [1]: import numpy
In [2]: x = numpy.array([1,2,3])
...: y = numpy.array([4,5])
...:
In [3]: [numpy.tile(x, len(y)), numpy.repeat(y, len(x))]
Out[3]: [array([1, 2, 3, 1, 2, 3]), array([4, 4, 4, 5, 5, 5])]
And here's the output of meshgrid:
In [4]: numpy.meshgrid(x, y)
Out[4]:
[array([[1, 2, 3],
[1, 2, 3]]), array([[4, 4, 4],
[5, 5, 5]])]
As you can see, it's almost identical. We need only reshape the result to get exactly the same result.
In [5]: xt, xr = numpy.meshgrid(x, y)
...: [xt.ravel(), xr.ravel()]
Out[5]: [array([1, 2, 3, 1, 2, 3]), array([4, 4, 4, 5, 5, 5])]
Rather than reshaping at this point, though, we could pass the output of meshgrid to dstack and reshape afterwards, which saves some work:
In [6]: numpy.dstack(numpy.meshgrid(x, y)).reshape(-1, 2)
Out[6]:
array([[1, 4],
[2, 4],
[3, 4],
[1, 5],
[2, 5],
[3, 5]])
Contrary to the claim in this comment, I've seen no evidence that different inputs will produce differently shaped outputs, and as the above demonstrates, they do very similar things, so it would be quite strange if they did. Please let me know if you find a counterexample.
Testing meshgrid + dstack vs. repeat + transpose
The relative performance of these two approaches has changed over time. In an earlier version of Python (2.7), the result using meshgrid + dstack was noticeably faster for small inputs. (Note that these tests are from an old version of this answer.) Definitions:
>>> def repeat_product(x, y):
... return numpy.transpose([numpy.tile(x, len(y)),
numpy.repeat(y, len(x))])
...
>>> def dstack_product(x, y):
... return numpy.dstack(numpy.meshgrid(x, y)).reshape(-1, 2)
...
For moderately-sized input, I saw a significant speedup. But I retried these tests with more recent versions of Python (3.6.1) and numpy (1.12.1), on a newer machine. The two approaches are almost identical now.
Old Test
>>> x, y = numpy.arange(500), numpy.arange(500)
>>> %timeit repeat_product(x, y)
10 loops, best of 3: 62 ms per loop
>>> %timeit dstack_product(x, y)
100 loops, best of 3: 12.2 ms per loop
New Test
In [7]: x, y = numpy.arange(500), numpy.arange(500)
In [8]: %timeit repeat_product(x, y)
1.32 ms ± 24.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [9]: %timeit dstack_product(x, y)
1.26 ms ± 8.47 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
As always, YMMV, but this suggests that in recent versions of Python and numpy, these are interchangeable.
Generalized product functions
In general, we might expect that using built-in functions will be faster for small inputs, while for large inputs, a purpose-built function might be faster. Furthermore for a generalized n-dimensional product, tile and repeat won't help, because they don't have clear higher-dimensional analogues. So it's worth investigating the behavior of purpose-built functions as well.
Most of the relevant tests appear at the beginning of this answer, but here are a few of the tests performed on earlier versions of Python and numpy for comparison.
The cartesian function defined in another answer used to perform pretty well for larger inputs. (It's the same as the function called cartesian_product_recursive above.) In order to compare cartesian to dstack_prodct, we use just two dimensions.
Here again, the old test showed a significant difference, while the new test shows almost none.
Old Test
>>> x, y = numpy.arange(1000), numpy.arange(1000)
>>> %timeit cartesian([x, y])
10 loops, best of 3: 25.4 ms per loop
>>> %timeit dstack_product(x, y)
10 loops, best of 3: 66.6 ms per loop
New Test
In [10]: x, y = numpy.arange(1000), numpy.arange(1000)
In [11]: %timeit cartesian([x, y])
12.1 ms ± 199 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [12]: %timeit dstack_product(x, y)
12.7 ms ± 334 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
As before, dstack_product still beats cartesian at smaller scales.
New Test (redundant old test not shown)
In [13]: x, y = numpy.arange(100), numpy.arange(100)
In [14]: %timeit cartesian([x, y])
215 µs ± 4.75 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [15]: %timeit dstack_product(x, y)
65.7 µs ± 1.15 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
These distinctions are, I think, interesting and worth recording; but they are academic in the end. As the tests at the beginning of this answer showed, all of these versions are almost always slower than cartesian_product, defined at the very beginning of this answer -- which is itself a bit slower than the fastest implementations among the answers to this question.

>>> numpy.transpose([numpy.tile(x, len(y)), numpy.repeat(y, len(x))])
array([[1, 4],
[2, 4],
[3, 4],
[1, 5],
[2, 5],
[3, 5]])
See Using numpy to build an array of all combinations of two arrays for a general solution for computing the Cartesian product of N arrays.

You can just do normal list comprehension in python
x = numpy.array([1,2,3])
y = numpy.array([4,5])
[[x0, y0] for x0 in x for y0 in y]
which should give you
[[1, 4], [1, 5], [2, 4], [2, 5], [3, 4], [3, 5]]

I was interested in this as well and did a little performance comparison, perhaps somewhat clearer than in #senderle's answer.
For two arrays (the classical case):
For four arrays:
(Note that the length the arrays is only a few dozen entries here.)
Code to reproduce the plots:
from functools import reduce
import itertools
import numpy
import perfplot
def dstack_product(arrays):
return numpy.dstack(numpy.meshgrid(*arrays, indexing="ij")).reshape(-1, len(arrays))
# Generalized N-dimensional products
def cartesian_product(arrays):
la = len(arrays)
dtype = numpy.find_common_type([a.dtype for a in arrays], [])
arr = numpy.empty([len(a) for a in arrays] + [la], dtype=dtype)
for i, a in enumerate(numpy.ix_(*arrays)):
arr[..., i] = a
return arr.reshape(-1, la)
def cartesian_product_transpose(arrays):
broadcastable = numpy.ix_(*arrays)
broadcasted = numpy.broadcast_arrays(*broadcastable)
rows, cols = reduce(numpy.multiply, broadcasted[0].shape), len(broadcasted)
dtype = numpy.find_common_type([a.dtype for a in arrays], [])
out = numpy.empty(rows * cols, dtype=dtype)
start, end = 0, rows
for a in broadcasted:
out[start:end] = a.reshape(-1)
start, end = end, end + rows
return out.reshape(cols, rows).T
# from https://stackoverflow.com/a/1235363/577088
def cartesian_product_recursive(arrays, out=None):
arrays = [numpy.asarray(x) for x in arrays]
dtype = arrays[0].dtype
n = numpy.prod([x.size for x in arrays])
if out is None:
out = numpy.zeros([n, len(arrays)], dtype=dtype)
m = n // arrays[0].size
out[:, 0] = numpy.repeat(arrays[0], m)
if arrays[1:]:
cartesian_product_recursive(arrays[1:], out=out[0:m, 1:])
for j in range(1, arrays[0].size):
out[j * m : (j + 1) * m, 1:] = out[0:m, 1:]
return out
def cartesian_product_itertools(arrays):
return numpy.array(list(itertools.product(*arrays)))
perfplot.show(
setup=lambda n: 2 * (numpy.arange(n, dtype=float),),
n_range=[2 ** k for k in range(13)],
# setup=lambda n: 4 * (numpy.arange(n, dtype=float),),
# n_range=[2 ** k for k in range(6)],
kernels=[
dstack_product,
cartesian_product,
cartesian_product_transpose,
cartesian_product_recursive,
cartesian_product_itertools,
],
logx=True,
logy=True,
xlabel="len(a), len(b)",
equality_check=None,
)

Building on #senderle's exemplary ground work I've come up with two versions - one for C and one for Fortran layouts - that are often a bit faster.
cartesian_product_transpose_pp is - unlike #senderle's cartesian_product_transpose which uses a different strategy altogether - a version of cartesion_product that uses the more favorable transpose memory layout + some very minor optimizations.
cartesian_product_pp sticks with the original memory layout. What makes it fast is its using contiguous copying. Contiguous copies turn out to be so much faster that copying a full block of memory even though only part of it contains valid data is preferable to only copying the valid bits.
Some perfplots. I made separate ones for C and Fortran layouts, because these are different tasks IMO.
Names ending in 'pp' are my approaches.
1) many tiny factors (2 elements each)
2) many small factors (4 elements each)
3) three factors of equal length
4) two factors of equal length
Code (need to do separate runs for each plot b/c I couldn't figure out how to reset; also need to edit / comment in / out appropriately):
import numpy
import numpy as np
from functools import reduce
import itertools
import timeit
import perfplot
def dstack_product(arrays):
return numpy.dstack(
numpy.meshgrid(*arrays, indexing='ij')
).reshape(-1, len(arrays))
def cartesian_product_transpose_pp(arrays):
la = len(arrays)
dtype = numpy.result_type(*arrays)
arr = numpy.empty((la, *map(len, arrays)), dtype=dtype)
idx = slice(None), *itertools.repeat(None, la)
for i, a in enumerate(arrays):
arr[i, ...] = a[idx[:la-i]]
return arr.reshape(la, -1).T
def cartesian_product(arrays):
la = len(arrays)
dtype = numpy.result_type(*arrays)
arr = numpy.empty([len(a) for a in arrays] + [la], dtype=dtype)
for i, a in enumerate(numpy.ix_(*arrays)):
arr[...,i] = a
return arr.reshape(-1, la)
def cartesian_product_transpose(arrays):
broadcastable = numpy.ix_(*arrays)
broadcasted = numpy.broadcast_arrays(*broadcastable)
rows, cols = numpy.prod(broadcasted[0].shape), len(broadcasted)
dtype = numpy.result_type(*arrays)
out = numpy.empty(rows * cols, dtype=dtype)
start, end = 0, rows
for a in broadcasted:
out[start:end] = a.reshape(-1)
start, end = end, end + rows
return out.reshape(cols, rows).T
from itertools import accumulate, repeat, chain
def cartesian_product_pp(arrays, out=None):
la = len(arrays)
L = *map(len, arrays), la
dtype = numpy.result_type(*arrays)
arr = numpy.empty(L, dtype=dtype)
arrs = *accumulate(chain((arr,), repeat(0, la-1)), np.ndarray.__getitem__),
idx = slice(None), *itertools.repeat(None, la-1)
for i in range(la-1, 0, -1):
arrs[i][..., i] = arrays[i][idx[:la-i]]
arrs[i-1][1:] = arrs[i]
arr[..., 0] = arrays[0][idx]
return arr.reshape(-1, la)
def cartesian_product_itertools(arrays):
return numpy.array(list(itertools.product(*arrays)))
# from https://stackoverflow.com/a/1235363/577088
def cartesian_product_recursive(arrays, out=None):
arrays = [numpy.asarray(x) for x in arrays]
dtype = arrays[0].dtype
n = numpy.prod([x.size for x in arrays])
if out is None:
out = numpy.zeros([n, len(arrays)], dtype=dtype)
m = n // arrays[0].size
out[:, 0] = numpy.repeat(arrays[0], m)
if arrays[1:]:
cartesian_product_recursive(arrays[1:], out=out[0:m, 1:])
for j in range(1, arrays[0].size):
out[j*m:(j+1)*m, 1:] = out[0:m, 1:]
return out
### Test code ###
if False:
perfplot.save('cp_4el_high.png',
setup=lambda n: n*(numpy.arange(4, dtype=float),),
n_range=list(range(6, 11)),
kernels=[
dstack_product,
cartesian_product_recursive,
cartesian_product,
# cartesian_product_transpose,
cartesian_product_pp,
# cartesian_product_transpose_pp,
],
logx=False,
logy=True,
xlabel='#factors',
equality_check=None
)
else:
perfplot.save('cp_2f_T.png',
setup=lambda n: 2*(numpy.arange(n, dtype=float),),
n_range=[2**k for k in range(5, 11)],
kernels=[
# dstack_product,
# cartesian_product_recursive,
# cartesian_product,
cartesian_product_transpose,
# cartesian_product_pp,
cartesian_product_transpose_pp,
],
logx=True,
logy=True,
xlabel='length of each factor',
equality_check=None
)

As of Oct. 2017, numpy now has a generic np.stack function that takes an axis parameter. Using it, we can have a "generalized cartesian product" using the "dstack and meshgrid" technique:
import numpy as np
def cartesian_product(*arrays):
ndim = len(arrays)
return (np.stack(np.meshgrid(*arrays), axis=-1)
.reshape(-1, ndim))
a = np.array([1,2])
b = np.array([10,20])
cartesian_product(a,b)
# output:
# array([[ 1, 10],
# [ 2, 10],
# [ 1, 20],
# [ 2, 20]])
Note on the axis=-1 parameter. This is the last (inner-most) axis in the result. It is equivalent to using axis=ndim.
One other comment, since Cartesian products blow up very quickly, unless we need to realize the array in memory for some reason, if the product is very large, we may want to make use of itertools and use the values on-the-fly.

The Scikit-learn package has a fast implementation of exactly this:
from sklearn.utils.extmath import cartesian
product = cartesian((x,y))
Note that the convention of this implementation is different from what you want, if you care about the order of the output. For your exact ordering, you can do
product = cartesian((y,x))[:, ::-1]

I used #kennytm answer for a while, but when trying to do the same in TensorFlow, but I found that TensorFlow has no equivalent of numpy.repeat(). After a little experimentation, I think I found a more general solution for arbitrary vectors of points.
For numpy:
import numpy as np
def cartesian_product(*args: np.ndarray) -> np.ndarray:
"""
Produce the cartesian product of arbitrary length vectors.
Parameters
----------
np.ndarray args
vector of points of interest in each dimension
Returns
-------
np.ndarray
the cartesian product of size [m x n] wherein:
m = prod([len(a) for a in args])
n = len(args)
"""
for i, a in enumerate(args):
assert a.ndim == 1, "arg {:d} is not rank 1".format(i)
return np.concatenate([np.reshape(xi, [-1, 1]) for xi in np.meshgrid(*args)], axis=1)
and for TensorFlow:
import tensorflow as tf
def cartesian_product(*args: tf.Tensor) -> tf.Tensor:
"""
Produce the cartesian product of arbitrary length vectors.
Parameters
----------
tf.Tensor args
vector of points of interest in each dimension
Returns
-------
tf.Tensor
the cartesian product of size [m x n] wherein:
m = prod([len(a) for a in args])
n = len(args)
"""
for i, a in enumerate(args):
tf.assert_rank(a, 1, message="arg {:d} is not rank 1".format(i))
return tf.concat([tf.reshape(xi, [-1, 1]) for xi in tf.meshgrid(*args)], axis=1)

More generally, if you have two 2d numpy arrays a and b, and you want to concatenate every row of a to every row of b (A cartesian product of rows, kind of like a join in a database), you can use this method:
import numpy
def join_2d(a, b):
assert a.dtype == b.dtype
a_part = numpy.tile(a, (len(b), 1))
b_part = numpy.repeat(b, len(a), axis=0)
return numpy.hstack((a_part, b_part))

The fastest you can get is either by combining a generator expression with the map function:
import numpy
import datetime
a = np.arange(1000)
b = np.arange(200)
start = datetime.datetime.now()
foo = (item for sublist in [list(map(lambda x: (x,i),a)) for i in b] for item in sublist)
print (list(foo))
print ('execution time: {} s'.format((datetime.datetime.now() - start).total_seconds()))
Outputs (actually the whole resulting list is printed):
[(0, 0), (1, 0), ...,(998, 199), (999, 199)]
execution time: 1.253567 s
or by using a double generator expression:
a = np.arange(1000)
b = np.arange(200)
start = datetime.datetime.now()
foo = ((x,y) for x in a for y in b)
print (list(foo))
print ('execution time: {} s'.format((datetime.datetime.now() - start).total_seconds()))
Outputs (whole list printed):
[(0, 0), (1, 0), ...,(998, 199), (999, 199)]
execution time: 1.187415 s
Take into account that most of the computation time goes into the printing command. The generator calculations are otherwise decently efficient. Without printing the calculation times are:
execution time: 0.079208 s
for generator expression + map function and:
execution time: 0.007093 s
for the double generator expression.
If what you actually want is to calculate the actual product of each of the coordinate pairs, the fastest is to solve it as a numpy matrix product:
a = np.arange(1000)
b = np.arange(200)
start = datetime.datetime.now()
foo = np.dot(np.asmatrix([[i,0] for i in a]), np.asmatrix([[i,0] for i in b]).T)
print (foo)
print ('execution time: {} s'.format((datetime.datetime.now() - start).total_seconds()))
Outputs:
[[ 0 0 0 ..., 0 0 0]
[ 0 1 2 ..., 197 198 199]
[ 0 2 4 ..., 394 396 398]
...,
[ 0 997 1994 ..., 196409 197406 198403]
[ 0 998 1996 ..., 196606 197604 198602]
[ 0 999 1998 ..., 196803 197802 198801]]
execution time: 0.003869 s
and without printing (in this case it doesn't save much since only a tiny piece of the matrix is actually printed out):
execution time: 0.003083 s

This can also be easily done by using itertools.product method
from itertools import product
import numpy as np
x = np.array([1, 2, 3])
y = np.array([4, 5])
cart_prod = np.array(list(product(*[x, y])),dtype='int32')
Result:
array([[1, 4],
[1, 5],
[2, 4],
[2, 5],
[3, 4],
[3, 5]], dtype=int32)
Execution time: 0.000155 s

In the specific case that you need to perform simple operations such as addition on each pair, you can introduce an extra dimension and let broadcasting do the job:
>>> a, b = np.array([1,2,3]), np.array([10,20,30])
>>> a[None,:] + b[:,None]
array([[11, 12, 13],
[21, 22, 23],
[31, 32, 33]])
I'm not sure if there is any similar way to actually get the pairs themselves.

I'm a bit late to the party, but I encoutered a tricky variant of that problem.
Let's say I want the cartesian product of several arrays, but that cartesian product ends up being much larger than the computers' memory (however, the computation done with that product are fast, or at least parallelizable).
The obvious solution is to divide this cartesian product in chunks, and treat these chunks one after the other (in sort of a "streaming" manner). You can do that easily with itertools.product, but it's horrendously slow. Also, none of the proposed solutions here (as fast as they are) give us this possibility. The solution I propose uses Numba, and is slightly faster than the "canonical" cartesian_product mentioned here. It's pretty long because I tried to optimize it everywhere I could.
import numba as nb
import numpy as np
from typing import List
#nb.njit(nb.types.Tuple((nb.int32[:, :],
nb.int32[:]))(nb.int32[:],
nb.int32[:],
nb.int64, nb.int64))
def cproduct(sizes: np.ndarray, current_tuple: np.ndarray, start_idx: int, end_idx: int):
"""Generates ids tuples from start_id to end_id"""
assert len(sizes) >= 2
assert start_idx < end_idx
tuples = np.zeros((end_idx - start_idx, len(sizes)), dtype=np.int32)
tuple_idx = 0
# stores the current combination
current_tuple = current_tuple.copy()
while tuple_idx < end_idx - start_idx:
tuples[tuple_idx] = current_tuple
current_tuple[0] += 1
# using a condition here instead of including this in the inner loop
# to gain a bit of speed: this is going to be tested each iteration,
# and starting a loop to have it end right away is a bit silly
if current_tuple[0] == sizes[0]:
# the reset to 0 and subsequent increment amount to carrying
# the number to the higher "power"
current_tuple[0] = 0
current_tuple[1] += 1
for i in range(1, len(sizes) - 1):
if current_tuple[i] == sizes[i]:
# same as before, but in a loop, since this is going
# to get called less often
current_tuple[i + 1] += 1
current_tuple[i] = 0
else:
break
tuple_idx += 1
return tuples, current_tuple
def chunked_cartesian_product_ids(sizes: List[int], chunk_size: int):
"""Just generates chunks of the cartesian product of the ids of each
input arrays (thus, we just need their sizes here, not the actual arrays)"""
prod = np.prod(sizes)
# putting the largest number at the front to more efficiently make use
# of the cproduct numba function
sizes = np.array(sizes, dtype=np.int32)
sorted_idx = np.argsort(sizes)[::-1]
sizes = sizes[sorted_idx]
if chunk_size > prod:
chunk_bounds = (np.array([0, prod])).astype(np.int64)
else:
num_chunks = np.maximum(np.ceil(prod / chunk_size), 2).astype(np.int32)
chunk_bounds = (np.arange(num_chunks + 1) * chunk_size).astype(np.int64)
chunk_bounds[-1] = prod
current_tuple = np.zeros(len(sizes), dtype=np.int32)
for start_idx, end_idx in zip(chunk_bounds[:-1], chunk_bounds[1:]):
tuples, current_tuple = cproduct(sizes, current_tuple, start_idx, end_idx)
# re-arrange columns to match the original order of the sizes list
# before yielding
yield tuples[:, np.argsort(sorted_idx)]
def chunked_cartesian_product(*arrays, chunk_size=2 ** 25):
"""Returns chunks of the full cartesian product, with arrays of shape
(chunk_size, n_arrays). The last chunk will obviously have the size of the
remainder"""
array_lengths = [len(array) for array in arrays]
for array_ids_chunk in chunked_cartesian_product_ids(array_lengths, chunk_size):
slices_lists = [arrays[i][array_ids_chunk[:, i]] for i in range(len(arrays))]
yield np.vstack(slices_lists).swapaxes(0,1)
def cartesian_product(*arrays):
"""Actual cartesian product, not chunked, still fast"""
total_prod = np.prod([len(array) for array in arrays])
return next(chunked_cartesian_product(*arrays, total_prod))
a = np.arange(0, 3)
b = np.arange(8, 10)
c = np.arange(13, 16)
for cartesian_tuples in chunked_cartesian_product(*[a, b, c], chunk_size=5):
print(cartesian_tuples)
This would output our cartesian product in chunks of 5 3-uples:
[[ 0 8 13]
[ 0 8 14]
[ 0 8 15]
[ 1 8 13]
[ 1 8 14]]
[[ 1 8 15]
[ 2 8 13]
[ 2 8 14]
[ 2 8 15]
[ 0 9 13]]
[[ 0 9 14]
[ 0 9 15]
[ 1 9 13]
[ 1 9 14]
[ 1 9 15]]
[[ 2 9 13]
[ 2 9 14]
[ 2 9 15]]
If you're willing to understand what is being done here, the intuition behind the njitted function is to enumerate each "number" in a weird numerical base whose elements would be composed of the sizes of the input arrays (instead of the same number in regular binary, decimal or hexadecimal bases).
Obviously, this solution is interesting for large products. For small ones, the overhead might be a bit costly.
NOTE: since numba is still under heavy development, i'm using numba 0.50 to run this, with python 3.6.

Yet another one:
>>>x1, y1 = np.meshgrid(x, y)
>>>np.c_[x1.ravel(), y1.ravel()]
array([[1, 4],
[2, 4],
[3, 4],
[1, 5],
[2, 5],
[3, 5]])

Inspired by Ashkan's answer, you can also try the following.
>>> x, y = np.meshgrid(x, y)
>>> np.concatenate([x.flatten().reshape(-1,1), y.flatten().reshape(-1,1)], axis=1)
This will give you the required cartesian product!

This is a generalized version of the accepted answer (Cartesian product of multiple arrays using numpy.tile and numpy.repeat functions).
from functors import reduce
from operator import mul
def cartesian_product(arrays):
return np.vstack(
np.tile(
np.repeat(arrays[j], reduce(mul, map(len, arrays[j+1:]), 1)),
reduce(mul, map(len, arrays[:j]), 1),
)
for j in range(len(arrays))
).T

If you are willing to use PyTorch, I should think it is highly efficient:
>>> import torch
>>> torch.cartesian_prod(torch.as_tensor(x), torch.as_tensor(y))
tensor([[1, 4],
[1, 5],
[2, 4],
[2, 5],
[3, 4],
[3, 5]])
and you can easily get a numpy array:
>>> torch.cartesian_prod(torch.as_tensor(x), torch.as_tensor(y)).numpy()
array([[1, 4],
[1, 5],
[2, 4],
[2, 5],
[3, 4],
[3, 5]])

Cartesian product of x and y array points into single array of 2D points

I have two numpy arrays that define the x and y axes of a grid. For example:
x = numpy.array([1,2,3])
y = numpy.array([4,5])
I'd like to generate the Cartesian product of these arrays to generate:
array([[1,4],[2,4],[3,4],[1,5],[2,5],[3,5]])
In a way that's not terribly inefficient since I need to do this many times in a loop. I'm assuming that converting them to a Python list and using itertools.product and back to a numpy array is not the most efficient form.

A canonical cartesian_product (almost)
There are many approaches to this problem with different properties. Some are faster than others, and some are more general-purpose. After a lot of testing and tweaking, I've found that the following function, which calculates an n-dimensional cartesian_product, is faster than most others for many inputs. For a pair of approaches that are slightly more complex, but are even a bit faster in many cases, see the answer by Paul Panzer.
Given that answer, this is no longer the fastest implementation of the cartesian product in numpy that I'm aware of. However, I think its simplicity will continue to make it a useful benchmark for future improvement:
def cartesian_product(*arrays):
la = len(arrays)
dtype = numpy.result_type(*arrays)
arr = numpy.empty([len(a) for a in arrays] + [la], dtype=dtype)
for i, a in enumerate(numpy.ix_(*arrays)):
arr[...,i] = a
return arr.reshape(-1, la)
It's worth mentioning that this function uses ix_ in an unusual way; whereas the documented use of ix_ is to generate indices into an array, it just so happens that arrays with the same shape can be used for broadcasted assignment. Many thanks to mgilson, who inspired me to try using ix_ this way, and to unutbu, who provided some extremely helpful feedback on this answer, including the suggestion to use numpy.result_type.
Notable alternatives
It's sometimes faster to write contiguous blocks of memory in Fortran order. That's the basis of this alternative, cartesian_product_transpose, which has proven faster on some hardware than cartesian_product (see below). However, Paul Panzer's answer, which uses the same principle, is even faster. Still, I include this here for interested readers:
def cartesian_product_transpose(*arrays):
broadcastable = numpy.ix_(*arrays)
broadcasted = numpy.broadcast_arrays(*broadcastable)
rows, cols = numpy.prod(broadcasted[0].shape), len(broadcasted)
dtype = numpy.result_type(*arrays)
out = numpy.empty(rows * cols, dtype=dtype)
start, end = 0, rows
for a in broadcasted:
out[start:end] = a.reshape(-1)
start, end = end, end + rows
return out.reshape(cols, rows).T
After coming to understand Panzer's approach, I wrote a new version that's almost as fast as his, and is almost as simple as cartesian_product:
def cartesian_product_simple_transpose(arrays):
la = len(arrays)
dtype = numpy.result_type(*arrays)
arr = numpy.empty([la] + [len(a) for a in arrays], dtype=dtype)
for i, a in enumerate(numpy.ix_(*arrays)):
arr[i, ...] = a
return arr.reshape(la, -1).T
This appears to have some constant-time overhead that makes it run slower than Panzer's for small inputs. But for larger inputs, in all the tests I ran, it performs just as well as his fastest implementation (cartesian_product_transpose_pp).
In following sections, I include some tests of other alternatives. These are now somewhat out of date, but rather than duplicate effort, I've decided to leave them here out of historical interest. For up-to-date tests, see Panzer's answer, as well as Nico Schlömer's.
Tests against alternatives
Here is a battery of tests that show the performance boost that some of these functions provide relative to a number of alternatives. All the tests shown here were performed on a quad-core machine, running Mac OS 10.12.5, Python 3.6.1, and numpy 1.12.1. Variations on hardware and software are known to produce different results, so YMMV. Run these tests for yourself to be sure!
Definitions:
import numpy
import itertools
from functools import reduce
### Two-dimensional products ###
def repeat_product(x, y):
return numpy.transpose([numpy.tile(x, len(y)),
numpy.repeat(y, len(x))])
def dstack_product(x, y):
return numpy.dstack(numpy.meshgrid(x, y)).reshape(-1, 2)
### Generalized N-dimensional products ###
def cartesian_product(*arrays):
la = len(arrays)
dtype = numpy.result_type(*arrays)
arr = numpy.empty([len(a) for a in arrays] + [la], dtype=dtype)
for i, a in enumerate(numpy.ix_(*arrays)):
arr[...,i] = a
return arr.reshape(-1, la)
def cartesian_product_transpose(*arrays):
broadcastable = numpy.ix_(*arrays)
broadcasted = numpy.broadcast_arrays(*broadcastable)
rows, cols = numpy.prod(broadcasted[0].shape), len(broadcasted)
dtype = numpy.result_type(*arrays)
out = numpy.empty(rows * cols, dtype=dtype)
start, end = 0, rows
for a in broadcasted:
out[start:end] = a.reshape(-1)
start, end = end, end + rows
return out.reshape(cols, rows).T
# from https://stackoverflow.com/a/1235363/577088
def cartesian_product_recursive(*arrays, out=None):
arrays = [numpy.asarray(x) for x in arrays]
dtype = arrays[0].dtype
n = numpy.prod([x.size for x in arrays])
if out is None:
out = numpy.zeros([n, len(arrays)], dtype=dtype)
m = n // arrays[0].size
out[:,0] = numpy.repeat(arrays[0], m)
if arrays[1:]:
cartesian_product_recursive(arrays[1:], out=out[0:m,1:])
for j in range(1, arrays[0].size):
out[j*m:(j+1)*m,1:] = out[0:m,1:]
return out
def cartesian_product_itertools(*arrays):
return numpy.array(list(itertools.product(*arrays)))
### Test code ###
name_func = [('repeat_product',
repeat_product),
('dstack_product',
dstack_product),
('cartesian_product',
cartesian_product),
('cartesian_product_transpose',
cartesian_product_transpose),
('cartesian_product_recursive',
cartesian_product_recursive),
('cartesian_product_itertools',
cartesian_product_itertools)]
def test(in_arrays, test_funcs):
global func
global arrays
arrays = in_arrays
for name, func in test_funcs:
print('{}:'.format(name))
%timeit func(*arrays)
def test_all(*in_arrays):
test(in_arrays, name_func)
# `cartesian_product_recursive` throws an
# unexpected error when used on more than
# two input arrays, so for now I've removed
# it from these tests.
def test_cartesian(*in_arrays):
test(in_arrays, name_func[2:4] + name_func[-1:])
x10 = [numpy.arange(10)]
x50 = [numpy.arange(50)]
x100 = [numpy.arange(100)]
x500 = [numpy.arange(500)]
x1000 = [numpy.arange(1000)]
Test results:
In [2]: test_all(*(x100 * 2))
repeat_product:
67.5 µs ± 633 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
dstack_product:
67.7 µs ± 1.09 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
cartesian_product:
33.4 µs ± 558 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
cartesian_product_transpose:
67.7 µs ± 932 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
cartesian_product_recursive:
215 µs ± 6.01 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
cartesian_product_itertools:
3.65 ms ± 38.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [3]: test_all(*(x500 * 2))
repeat_product:
1.31 ms ± 9.28 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
dstack_product:
1.27 ms ± 7.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
cartesian_product:
375 µs ± 4.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
cartesian_product_transpose:
488 µs ± 8.88 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
cartesian_product_recursive:
2.21 ms ± 38.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
cartesian_product_itertools:
105 ms ± 1.17 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [4]: test_all(*(x1000 * 2))
repeat_product:
10.2 ms ± 132 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
dstack_product:
12 ms ± 120 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
cartesian_product:
4.75 ms ± 57.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
cartesian_product_transpose:
7.76 ms ± 52.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
cartesian_product_recursive:
13 ms ± 209 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
cartesian_product_itertools:
422 ms ± 7.77 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In all cases, cartesian_product as defined at the beginning of this answer is fastest.
For those functions that accept an arbitrary number of input arrays, it's worth checking performance when len(arrays) > 2 as well. (Until I can determine why cartesian_product_recursive throws an error in this case, I've removed it from these tests.)
In [5]: test_cartesian(*(x100 * 3))
cartesian_product:
8.8 ms ± 138 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
cartesian_product_transpose:
7.87 ms ± 91.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
cartesian_product_itertools:
518 ms ± 5.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [6]: test_cartesian(*(x50 * 4))
cartesian_product:
169 ms ± 5.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
cartesian_product_transpose:
184 ms ± 4.32 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
cartesian_product_itertools:
3.69 s ± 73.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [7]: test_cartesian(*(x10 * 6))
cartesian_product:
26.5 ms ± 449 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
cartesian_product_transpose:
16 ms ± 133 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
cartesian_product_itertools:
728 ms ± 16 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [8]: test_cartesian(*(x10 * 7))
cartesian_product:
650 ms ± 8.14 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
cartesian_product_transpose:
518 ms ± 7.09 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
cartesian_product_itertools:
8.13 s ± 122 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
As these tests show, cartesian_product remains competitive until the number of input arrays rises above (roughly) four. After that, cartesian_product_transpose does have a slight edge.
It's worth reiterating that users with other hardware and operating systems may see different results. For example, unutbu reports seeing the following results for these tests using Ubuntu 14.04, Python 3.4.3, and numpy 1.14.0.dev0+b7050a9:
>>> %timeit cartesian_product_transpose(x500, y500)
1000 loops, best of 3: 682 µs per loop
>>> %timeit cartesian_product(x500, y500)
1000 loops, best of 3: 1.55 ms per loop
Below, I go into a few details about earlier tests I've run along these lines. The relative performance of these approaches has changed over time, for different hardware and different versions of Python and numpy. While it's not immediately useful for people using up-to-date versions of numpy, it illustrates how things have changed since the first version of this answer.
A simple alternative: meshgrid + dstack
The currently accepted answer uses tile and repeat to broadcast two arrays together. But the meshgrid function does practically the same thing. Here's the output of tile and repeat before being passed to transpose:
In [1]: import numpy
In [2]: x = numpy.array([1,2,3])
...: y = numpy.array([4,5])
...:
In [3]: [numpy.tile(x, len(y)), numpy.repeat(y, len(x))]
Out[3]: [array([1, 2, 3, 1, 2, 3]), array([4, 4, 4, 5, 5, 5])]
And here's the output of meshgrid:
In [4]: numpy.meshgrid(x, y)
Out[4]:
[array([[1, 2, 3],
[1, 2, 3]]), array([[4, 4, 4],
[5, 5, 5]])]
As you can see, it's almost identical. We need only reshape the result to get exactly the same result.
In [5]: xt, xr = numpy.meshgrid(x, y)
...: [xt.ravel(), xr.ravel()]
Out[5]: [array([1, 2, 3, 1, 2, 3]), array([4, 4, 4, 5, 5, 5])]
Rather than reshaping at this point, though, we could pass the output of meshgrid to dstack and reshape afterwards, which saves some work:
In [6]: numpy.dstack(numpy.meshgrid(x, y)).reshape(-1, 2)
Out[6]:
array([[1, 4],
[2, 4],
[3, 4],
[1, 5],
[2, 5],
[3, 5]])
Contrary to the claim in this comment, I've seen no evidence that different inputs will produce differently shaped outputs, and as the above demonstrates, they do very similar things, so it would be quite strange if they did. Please let me know if you find a counterexample.
Testing meshgrid + dstack vs. repeat + transpose
The relative performance of these two approaches has changed over time. In an earlier version of Python (2.7), the result using meshgrid + dstack was noticeably faster for small inputs. (Note that these tests are from an old version of this answer.) Definitions:
>>> def repeat_product(x, y):
... return numpy.transpose([numpy.tile(x, len(y)),
numpy.repeat(y, len(x))])
...
>>> def dstack_product(x, y):
... return numpy.dstack(numpy.meshgrid(x, y)).reshape(-1, 2)
...
For moderately-sized input, I saw a significant speedup. But I retried these tests with more recent versions of Python (3.6.1) and numpy (1.12.1), on a newer machine. The two approaches are almost identical now.
Old Test
>>> x, y = numpy.arange(500), numpy.arange(500)
>>> %timeit repeat_product(x, y)
10 loops, best of 3: 62 ms per loop
>>> %timeit dstack_product(x, y)
100 loops, best of 3: 12.2 ms per loop
New Test
In [7]: x, y = numpy.arange(500), numpy.arange(500)
In [8]: %timeit repeat_product(x, y)
1.32 ms ± 24.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [9]: %timeit dstack_product(x, y)
1.26 ms ± 8.47 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
As always, YMMV, but this suggests that in recent versions of Python and numpy, these are interchangeable.
Generalized product functions
In general, we might expect that using built-in functions will be faster for small inputs, while for large inputs, a purpose-built function might be faster. Furthermore for a generalized n-dimensional product, tile and repeat won't help, because they don't have clear higher-dimensional analogues. So it's worth investigating the behavior of purpose-built functions as well.
Most of the relevant tests appear at the beginning of this answer, but here are a few of the tests performed on earlier versions of Python and numpy for comparison.
The cartesian function defined in another answer used to perform pretty well for larger inputs. (It's the same as the function called cartesian_product_recursive above.) In order to compare cartesian to dstack_prodct, we use just two dimensions.
Here again, the old test showed a significant difference, while the new test shows almost none.
Old Test
>>> x, y = numpy.arange(1000), numpy.arange(1000)
>>> %timeit cartesian([x, y])
10 loops, best of 3: 25.4 ms per loop
>>> %timeit dstack_product(x, y)
10 loops, best of 3: 66.6 ms per loop
New Test
In [10]: x, y = numpy.arange(1000), numpy.arange(1000)
In [11]: %timeit cartesian([x, y])
12.1 ms ± 199 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [12]: %timeit dstack_product(x, y)
12.7 ms ± 334 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
As before, dstack_product still beats cartesian at smaller scales.
New Test (redundant old test not shown)
In [13]: x, y = numpy.arange(100), numpy.arange(100)
In [14]: %timeit cartesian([x, y])
215 µs ± 4.75 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [15]: %timeit dstack_product(x, y)
65.7 µs ± 1.15 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
These distinctions are, I think, interesting and worth recording; but they are academic in the end. As the tests at the beginning of this answer showed, all of these versions are almost always slower than cartesian_product, defined at the very beginning of this answer -- which is itself a bit slower than the fastest implementations among the answers to this question.

>>> numpy.transpose([numpy.tile(x, len(y)), numpy.repeat(y, len(x))])
array([[1, 4],
[2, 4],
[3, 4],
[1, 5],
[2, 5],
[3, 5]])
See Using numpy to build an array of all combinations of two arrays for a general solution for computing the Cartesian product of N arrays.

You can just do normal list comprehension in python
x = numpy.array([1,2,3])
y = numpy.array([4,5])
[[x0, y0] for x0 in x for y0 in y]
which should give you
[[1, 4], [1, 5], [2, 4], [2, 5], [3, 4], [3, 5]]

I was interested in this as well and did a little performance comparison, perhaps somewhat clearer than in #senderle's answer.
For two arrays (the classical case):
For four arrays:
(Note that the length the arrays is only a few dozen entries here.)
Code to reproduce the plots:
from functools import reduce
import itertools
import numpy
import perfplot
def dstack_product(arrays):
return numpy.dstack(numpy.meshgrid(*arrays, indexing="ij")).reshape(-1, len(arrays))
# Generalized N-dimensional products
def cartesian_product(arrays):
la = len(arrays)
dtype = numpy.find_common_type([a.dtype for a in arrays], [])
arr = numpy.empty([len(a) for a in arrays] + [la], dtype=dtype)
for i, a in enumerate(numpy.ix_(*arrays)):
arr[..., i] = a
return arr.reshape(-1, la)
def cartesian_product_transpose(arrays):
broadcastable = numpy.ix_(*arrays)
broadcasted = numpy.broadcast_arrays(*broadcastable)
rows, cols = reduce(numpy.multiply, broadcasted[0].shape), len(broadcasted)
dtype = numpy.find_common_type([a.dtype for a in arrays], [])
out = numpy.empty(rows * cols, dtype=dtype)
start, end = 0, rows
for a in broadcasted:
out[start:end] = a.reshape(-1)
start, end = end, end + rows
return out.reshape(cols, rows).T
# from https://stackoverflow.com/a/1235363/577088
def cartesian_product_recursive(arrays, out=None):
arrays = [numpy.asarray(x) for x in arrays]
dtype = arrays[0].dtype
n = numpy.prod([x.size for x in arrays])
if out is None:
out = numpy.zeros([n, len(arrays)], dtype=dtype)
m = n // arrays[0].size
out[:, 0] = numpy.repeat(arrays[0], m)
if arrays[1:]:
cartesian_product_recursive(arrays[1:], out=out[0:m, 1:])
for j in range(1, arrays[0].size):
out[j * m : (j + 1) * m, 1:] = out[0:m, 1:]
return out
def cartesian_product_itertools(arrays):
return numpy.array(list(itertools.product(*arrays)))
perfplot.show(
setup=lambda n: 2 * (numpy.arange(n, dtype=float),),
n_range=[2 ** k for k in range(13)],
# setup=lambda n: 4 * (numpy.arange(n, dtype=float),),
# n_range=[2 ** k for k in range(6)],
kernels=[
dstack_product,
cartesian_product,
cartesian_product_transpose,
cartesian_product_recursive,
cartesian_product_itertools,
],
logx=True,
logy=True,
xlabel="len(a), len(b)",
equality_check=None,
)

Building on #senderle's exemplary ground work I've come up with two versions - one for C and one for Fortran layouts - that are often a bit faster.
cartesian_product_transpose_pp is - unlike #senderle's cartesian_product_transpose which uses a different strategy altogether - a version of cartesion_product that uses the more favorable transpose memory layout + some very minor optimizations.
cartesian_product_pp sticks with the original memory layout. What makes it fast is its using contiguous copying. Contiguous copies turn out to be so much faster that copying a full block of memory even though only part of it contains valid data is preferable to only copying the valid bits.
Some perfplots. I made separate ones for C and Fortran layouts, because these are different tasks IMO.
Names ending in 'pp' are my approaches.
1) many tiny factors (2 elements each)
2) many small factors (4 elements each)
3) three factors of equal length
4) two factors of equal length
Code (need to do separate runs for each plot b/c I couldn't figure out how to reset; also need to edit / comment in / out appropriately):
import numpy
import numpy as np
from functools import reduce
import itertools
import timeit
import perfplot
def dstack_product(arrays):
return numpy.dstack(
numpy.meshgrid(*arrays, indexing='ij')
).reshape(-1, len(arrays))
def cartesian_product_transpose_pp(arrays):
la = len(arrays)
dtype = numpy.result_type(*arrays)
arr = numpy.empty((la, *map(len, arrays)), dtype=dtype)
idx = slice(None), *itertools.repeat(None, la)
for i, a in enumerate(arrays):
arr[i, ...] = a[idx[:la-i]]
return arr.reshape(la, -1).T
def cartesian_product(arrays):
la = len(arrays)
dtype = numpy.result_type(*arrays)
arr = numpy.empty([len(a) for a in arrays] + [la], dtype=dtype)
for i, a in enumerate(numpy.ix_(*arrays)):
arr[...,i] = a
return arr.reshape(-1, la)
def cartesian_product_transpose(arrays):
broadcastable = numpy.ix_(*arrays)
broadcasted = numpy.broadcast_arrays(*broadcastable)
rows, cols = numpy.prod(broadcasted[0].shape), len(broadcasted)
dtype = numpy.result_type(*arrays)
out = numpy.empty(rows * cols, dtype=dtype)
start, end = 0, rows
for a in broadcasted:
out[start:end] = a.reshape(-1)
start, end = end, end + rows
return out.reshape(cols, rows).T
from itertools import accumulate, repeat, chain
def cartesian_product_pp(arrays, out=None):
la = len(arrays)
L = *map(len, arrays), la
dtype = numpy.result_type(*arrays)
arr = numpy.empty(L, dtype=dtype)
arrs = *accumulate(chain((arr,), repeat(0, la-1)), np.ndarray.__getitem__),
idx = slice(None), *itertools.repeat(None, la-1)
for i in range(la-1, 0, -1):
arrs[i][..., i] = arrays[i][idx[:la-i]]
arrs[i-1][1:] = arrs[i]
arr[..., 0] = arrays[0][idx]
return arr.reshape(-1, la)
def cartesian_product_itertools(arrays):
return numpy.array(list(itertools.product(*arrays)))
# from https://stackoverflow.com/a/1235363/577088
def cartesian_product_recursive(arrays, out=None):
arrays = [numpy.asarray(x) for x in arrays]
dtype = arrays[0].dtype
n = numpy.prod([x.size for x in arrays])
if out is None:
out = numpy.zeros([n, len(arrays)], dtype=dtype)
m = n // arrays[0].size
out[:, 0] = numpy.repeat(arrays[0], m)
if arrays[1:]:
cartesian_product_recursive(arrays[1:], out=out[0:m, 1:])
for j in range(1, arrays[0].size):
out[j*m:(j+1)*m, 1:] = out[0:m, 1:]
return out
### Test code ###
if False:
perfplot.save('cp_4el_high.png',
setup=lambda n: n*(numpy.arange(4, dtype=float),),
n_range=list(range(6, 11)),
kernels=[
dstack_product,
cartesian_product_recursive,
cartesian_product,
# cartesian_product_transpose,
cartesian_product_pp,
# cartesian_product_transpose_pp,
],
logx=False,
logy=True,
xlabel='#factors',
equality_check=None
)
else:
perfplot.save('cp_2f_T.png',
setup=lambda n: 2*(numpy.arange(n, dtype=float),),
n_range=[2**k for k in range(5, 11)],
kernels=[
# dstack_product,
# cartesian_product_recursive,
# cartesian_product,
cartesian_product_transpose,
# cartesian_product_pp,
cartesian_product_transpose_pp,
],
logx=True,
logy=True,
xlabel='length of each factor',
equality_check=None
)

As of Oct. 2017, numpy now has a generic np.stack function that takes an axis parameter. Using it, we can have a "generalized cartesian product" using the "dstack and meshgrid" technique:
import numpy as np
def cartesian_product(*arrays):
ndim = len(arrays)
return (np.stack(np.meshgrid(*arrays), axis=-1)
.reshape(-1, ndim))
a = np.array([1,2])
b = np.array([10,20])
cartesian_product(a,b)
# output:
# array([[ 1, 10],
# [ 2, 10],
# [ 1, 20],
# [ 2, 20]])
Note on the axis=-1 parameter. This is the last (inner-most) axis in the result. It is equivalent to using axis=ndim.
One other comment, since Cartesian products blow up very quickly, unless we need to realize the array in memory for some reason, if the product is very large, we may want to make use of itertools and use the values on-the-fly.

The Scikit-learn package has a fast implementation of exactly this:
from sklearn.utils.extmath import cartesian
product = cartesian((x,y))
Note that the convention of this implementation is different from what you want, if you care about the order of the output. For your exact ordering, you can do
product = cartesian((y,x))[:, ::-1]

I used #kennytm answer for a while, but when trying to do the same in TensorFlow, but I found that TensorFlow has no equivalent of numpy.repeat(). After a little experimentation, I think I found a more general solution for arbitrary vectors of points.
For numpy:
import numpy as np
def cartesian_product(*args: np.ndarray) -> np.ndarray:
"""
Produce the cartesian product of arbitrary length vectors.
Parameters
----------
np.ndarray args
vector of points of interest in each dimension
Returns
-------
np.ndarray
the cartesian product of size [m x n] wherein:
m = prod([len(a) for a in args])
n = len(args)
"""
for i, a in enumerate(args):
assert a.ndim == 1, "arg {:d} is not rank 1".format(i)
return np.concatenate([np.reshape(xi, [-1, 1]) for xi in np.meshgrid(*args)], axis=1)
and for TensorFlow:
import tensorflow as tf
def cartesian_product(*args: tf.Tensor) -> tf.Tensor:
"""
Produce the cartesian product of arbitrary length vectors.
Parameters
----------
tf.Tensor args
vector of points of interest in each dimension
Returns
-------
tf.Tensor
the cartesian product of size [m x n] wherein:
m = prod([len(a) for a in args])
n = len(args)
"""
for i, a in enumerate(args):
tf.assert_rank(a, 1, message="arg {:d} is not rank 1".format(i))
return tf.concat([tf.reshape(xi, [-1, 1]) for xi in tf.meshgrid(*args)], axis=1)

More generally, if you have two 2d numpy arrays a and b, and you want to concatenate every row of a to every row of b (A cartesian product of rows, kind of like a join in a database), you can use this method:
import numpy
def join_2d(a, b):
assert a.dtype == b.dtype
a_part = numpy.tile(a, (len(b), 1))
b_part = numpy.repeat(b, len(a), axis=0)
return numpy.hstack((a_part, b_part))

The fastest you can get is either by combining a generator expression with the map function:
import numpy
import datetime
a = np.arange(1000)
b = np.arange(200)
start = datetime.datetime.now()
foo = (item for sublist in [list(map(lambda x: (x,i),a)) for i in b] for item in sublist)
print (list(foo))
print ('execution time: {} s'.format((datetime.datetime.now() - start).total_seconds()))
Outputs (actually the whole resulting list is printed):
[(0, 0), (1, 0), ...,(998, 199), (999, 199)]
execution time: 1.253567 s
or by using a double generator expression:
a = np.arange(1000)
b = np.arange(200)
start = datetime.datetime.now()
foo = ((x,y) for x in a for y in b)
print (list(foo))
print ('execution time: {} s'.format((datetime.datetime.now() - start).total_seconds()))
Outputs (whole list printed):
[(0, 0), (1, 0), ...,(998, 199), (999, 199)]
execution time: 1.187415 s
Take into account that most of the computation time goes into the printing command. The generator calculations are otherwise decently efficient. Without printing the calculation times are:
execution time: 0.079208 s
for generator expression + map function and:
execution time: 0.007093 s
for the double generator expression.
If what you actually want is to calculate the actual product of each of the coordinate pairs, the fastest is to solve it as a numpy matrix product:
a = np.arange(1000)
b = np.arange(200)
start = datetime.datetime.now()
foo = np.dot(np.asmatrix([[i,0] for i in a]), np.asmatrix([[i,0] for i in b]).T)
print (foo)
print ('execution time: {} s'.format((datetime.datetime.now() - start).total_seconds()))
Outputs:
[[ 0 0 0 ..., 0 0 0]
[ 0 1 2 ..., 197 198 199]
[ 0 2 4 ..., 394 396 398]
...,
[ 0 997 1994 ..., 196409 197406 198403]
[ 0 998 1996 ..., 196606 197604 198602]
[ 0 999 1998 ..., 196803 197802 198801]]
execution time: 0.003869 s
and without printing (in this case it doesn't save much since only a tiny piece of the matrix is actually printed out):
execution time: 0.003083 s

This can also be easily done by using itertools.product method
from itertools import product
import numpy as np
x = np.array([1, 2, 3])
y = np.array([4, 5])
cart_prod = np.array(list(product(*[x, y])),dtype='int32')
Result:
array([[1, 4],
[1, 5],
[2, 4],
[2, 5],
[3, 4],
[3, 5]], dtype=int32)
Execution time: 0.000155 s

In the specific case that you need to perform simple operations such as addition on each pair, you can introduce an extra dimension and let broadcasting do the job:
>>> a, b = np.array([1,2,3]), np.array([10,20,30])
>>> a[None,:] + b[:,None]
array([[11, 12, 13],
[21, 22, 23],
[31, 32, 33]])
I'm not sure if there is any similar way to actually get the pairs themselves.

I'm a bit late to the party, but I encoutered a tricky variant of that problem.
Let's say I want the cartesian product of several arrays, but that cartesian product ends up being much larger than the computers' memory (however, the computation done with that product are fast, or at least parallelizable).
The obvious solution is to divide this cartesian product in chunks, and treat these chunks one after the other (in sort of a "streaming" manner). You can do that easily with itertools.product, but it's horrendously slow. Also, none of the proposed solutions here (as fast as they are) give us this possibility. The solution I propose uses Numba, and is slightly faster than the "canonical" cartesian_product mentioned here. It's pretty long because I tried to optimize it everywhere I could.
import numba as nb
import numpy as np
from typing import List
#nb.njit(nb.types.Tuple((nb.int32[:, :],
nb.int32[:]))(nb.int32[:],
nb.int32[:],
nb.int64, nb.int64))
def cproduct(sizes: np.ndarray, current_tuple: np.ndarray, start_idx: int, end_idx: int):
"""Generates ids tuples from start_id to end_id"""
assert len(sizes) >= 2
assert start_idx < end_idx
tuples = np.zeros((end_idx - start_idx, len(sizes)), dtype=np.int32)
tuple_idx = 0
# stores the current combination
current_tuple = current_tuple.copy()
while tuple_idx < end_idx - start_idx:
tuples[tuple_idx] = current_tuple
current_tuple[0] += 1
# using a condition here instead of including this in the inner loop
# to gain a bit of speed: this is going to be tested each iteration,
# and starting a loop to have it end right away is a bit silly
if current_tuple[0] == sizes[0]:
# the reset to 0 and subsequent increment amount to carrying
# the number to the higher "power"
current_tuple[0] = 0
current_tuple[1] += 1
for i in range(1, len(sizes) - 1):
if current_tuple[i] == sizes[i]:
# same as before, but in a loop, since this is going
# to get called less often
current_tuple[i + 1] += 1
current_tuple[i] = 0
else:
break
tuple_idx += 1
return tuples, current_tuple
def chunked_cartesian_product_ids(sizes: List[int], chunk_size: int):
"""Just generates chunks of the cartesian product of the ids of each
input arrays (thus, we just need their sizes here, not the actual arrays)"""
prod = np.prod(sizes)
# putting the largest number at the front to more efficiently make use
# of the cproduct numba function
sizes = np.array(sizes, dtype=np.int32)
sorted_idx = np.argsort(sizes)[::-1]
sizes = sizes[sorted_idx]
if chunk_size > prod:
chunk_bounds = (np.array([0, prod])).astype(np.int64)
else:
num_chunks = np.maximum(np.ceil(prod / chunk_size), 2).astype(np.int32)
chunk_bounds = (np.arange(num_chunks + 1) * chunk_size).astype(np.int64)
chunk_bounds[-1] = prod
current_tuple = np.zeros(len(sizes), dtype=np.int32)
for start_idx, end_idx in zip(chunk_bounds[:-1], chunk_bounds[1:]):
tuples, current_tuple = cproduct(sizes, current_tuple, start_idx, end_idx)
# re-arrange columns to match the original order of the sizes list
# before yielding
yield tuples[:, np.argsort(sorted_idx)]
def chunked_cartesian_product(*arrays, chunk_size=2 ** 25):
"""Returns chunks of the full cartesian product, with arrays of shape
(chunk_size, n_arrays). The last chunk will obviously have the size of the
remainder"""
array_lengths = [len(array) for array in arrays]
for array_ids_chunk in chunked_cartesian_product_ids(array_lengths, chunk_size):
slices_lists = [arrays[i][array_ids_chunk[:, i]] for i in range(len(arrays))]
yield np.vstack(slices_lists).swapaxes(0,1)
def cartesian_product(*arrays):
"""Actual cartesian product, not chunked, still fast"""
total_prod = np.prod([len(array) for array in arrays])
return next(chunked_cartesian_product(*arrays, total_prod))
a = np.arange(0, 3)
b = np.arange(8, 10)
c = np.arange(13, 16)
for cartesian_tuples in chunked_cartesian_product(*[a, b, c], chunk_size=5):
print(cartesian_tuples)
This would output our cartesian product in chunks of 5 3-uples:
[[ 0 8 13]
[ 0 8 14]
[ 0 8 15]
[ 1 8 13]
[ 1 8 14]]
[[ 1 8 15]
[ 2 8 13]
[ 2 8 14]
[ 2 8 15]
[ 0 9 13]]
[[ 0 9 14]
[ 0 9 15]
[ 1 9 13]
[ 1 9 14]
[ 1 9 15]]
[[ 2 9 13]
[ 2 9 14]
[ 2 9 15]]
If you're willing to understand what is being done here, the intuition behind the njitted function is to enumerate each "number" in a weird numerical base whose elements would be composed of the sizes of the input arrays (instead of the same number in regular binary, decimal or hexadecimal bases).
Obviously, this solution is interesting for large products. For small ones, the overhead might be a bit costly.
NOTE: since numba is still under heavy development, i'm using numba 0.50 to run this, with python 3.6.

Yet another one:
>>>x1, y1 = np.meshgrid(x, y)
>>>np.c_[x1.ravel(), y1.ravel()]
array([[1, 4],
[2, 4],
[3, 4],
[1, 5],
[2, 5],
[3, 5]])

Inspired by Ashkan's answer, you can also try the following.
>>> x, y = np.meshgrid(x, y)
>>> np.concatenate([x.flatten().reshape(-1,1), y.flatten().reshape(-1,1)], axis=1)
This will give you the required cartesian product!

This is a generalized version of the accepted answer (Cartesian product of multiple arrays using numpy.tile and numpy.repeat functions).
from functors import reduce
from operator import mul
def cartesian_product(arrays):
return np.vstack(
np.tile(
np.repeat(arrays[j], reduce(mul, map(len, arrays[j+1:]), 1)),
reduce(mul, map(len, arrays[:j]), 1),
)
for j in range(len(arrays))
).T

If you are willing to use PyTorch, I should think it is highly efficient:
>>> import torch
>>> torch.cartesian_prod(torch.as_tensor(x), torch.as_tensor(y))
tensor([[1, 4],
[1, 5],
[2, 4],
[2, 5],
[3, 4],
[3, 5]])
and you can easily get a numpy array:
>>> torch.cartesian_prod(torch.as_tensor(x), torch.as_tensor(y)).numpy()
array([[1, 4],
[1, 5],
[2, 4],
[2, 5],
[3, 4],
[3, 5]])

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Efficiently merging a large number of pyspark DataFrames - python

Related

Numpy fastest way to get indexes of neighbors with current value (flood fill)

How can merging on a wider p dataframe (n, p) can take more time?

Fastest way to append nonzero numpy array elements to list

Vectorized operations on 2 columns of different dataframes [duplicate]

Cartesian product of x and y array points into single array of 2D points

Categories

Resources