Performance of C-contiguous vs. Fortran-contiguous array operations in Python

Below, I compared the performance of sum operations on C-contiguous and Fortran-contiguous arrays (C vs. Fortran memory order). I set axis=0 to ensure the numbers are added up column-wise. I was surprised that the Fortran-contiguous array is actually slower than its C counterpart. Isn't it the case that a Fortran-contiguous array stores its columns contiguously in memory and should therefore be better at column-wise operations?
import numpy as np
a = np.random.standard_normal((10000, 10000))
c = np.array(a, order='C')
f = np.array(a, order='F')
In a Jupyter notebook, run:
%timeit c.sum(axis=0)
10 loops, best of 3: 84.6 ms per loop
%timeit f.sum(axis=0)
10 loops, best of 3: 137 ms per loop

I think the answer lies in the implementation of np.sum(). For example:
import numpy as np
A = np.random.standard_normal((10000,10000))
C = np.array(A, order='C')
F = np.array(A, order='F')
Benchmarking with IPython:
In [7]: %timeit C.sum(axis=0)
10 loops, best of 3: 101 ms per loop
In [8]: %timeit C.sum(axis=1)
10 loops, best of 3: 149 ms per loop
In [9]: %timeit F.sum(axis=0)
10 loops, best of 3: 149 ms per loop
In [10]: %timeit F.sum(axis=1)
10 loops, best of 3: 102 ms per loop
So it behaves exactly the opposite of what one would expect. But let's try out some other function:
In [17]: %timeit np.amax(C, axis=0)
1 loop, best of 3: 173 ms per loop
In [18]: %timeit np.amax(C, axis=1)
10 loops, best of 3: 70.4 ms per loop
In [13]: %timeit np.amax(F,axis=0)
10 loops, best of 3: 72 ms per loop
In [14]: %timeit np.amax(F,axis=1)
10 loops, best of 3: 168 ms per loop
Sure, it's apples to oranges. But np.amax(), like sum, works along an axis and returns a vector with one element for each row/column. And it behaves as one would expect.
In [25]: C.strides
Out[25]: (80000, 8)
In [26]: F.strides
Out[26]: (8, 80000)
This tells us that the arrays are in fact packed in row order and column order respectively, so looping in that direction should be a lot faster. Unless, for example, the sum adds row by row as it travels along the columns to produce the column sum (axis=0). But without a means of peeking inside the .pyd I'm just speculating.
EDIT:
From percusse's link: http://docs.scipy.org/doc/numpy/reference/generated/numpy.ufunc.reduce.html
Reduces a's dimension by one, by applying ufunc along one axis.
Let a.shape = (N_0, ..., N_i, ..., N_{M-1}). Then ufunc.reduce(a, axis=i)[k_0, ..., k_{i-1}, k_{i+1}, ..., k_{M-1}] = the result of iterating j over range(N_i), cumulatively applying ufunc to each a[k_0, ..., k_{i-1}, j, k_{i+1}, ..., k_{M-1}].
So in pseudocode, F.sum(axis=0) works roughly like this:
out = F[0, :]                 # one running total per column
for j in range(1, rows):      # j walks down the reduced axis (axis=0)
    for i in range(cols):
        out[i] += F[j, i]     # the inner loop sweeps across a row
So it actually iterates over the rows when calculating the column sums, which slows it down considerably when the array is in column-major order. Behaviour such as this would explain the difference.
eric's link provides the implementation, for anybody curious enough to go through large amounts of code for the reason.
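To check this picture without digging through the C source, here is a rough Python emulation of that traversal order (a sketch, not the real implementation):
import numpy as np
F = np.array(np.random.standard_normal((1000, 1000)), order='F')
out = F[0, :].copy()
for j in range(1, F.shape[0]):           # walk down the reduced axis
    out += F[j, :]                       # each step reads one full (strided) row
print(np.allclose(out, F.sum(axis=0)))   # True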

That's expected. If you check the result of
%timeit f.sum(axis=1)
it gives a timing similar to that of c. Similarly,
%timeit c.sum(axis=1)
is slower.
Some explanation: suppose you have the following structure
|1| |6|
|2| |7|
|3| |8|
|4| |9|
|5| |10|
As Eric mentioned, these operations work with reduce. Let's say we are asking for the column sums. The intuitive mechanism, in which each column is accessed once, summed, and recorded, is not at play. In fact it is the opposite: each row is accessed in turn and the function (here, summing) is applied, in essence similar to having two arrays a, b and executing
a += b
That's a very informal way of repeating what is mentioned super-cryptically in the documentation of reduce.
This requires the rows to be accessed contiguously even though we are performing a column sum: [1,6] + [2,7] + [3,8] + ... Hence the traversal direction is dictated by the operation's implementation, not by the array's layout.
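As a concrete sketch of that mechanism on the 5x2 structure above:
import numpy as np
M = np.array([[1, 6], [2, 7], [3, 8], [4, 9], [5, 10]])
acc = M[0].copy()
for b in M[1:]:
    acc += b                          # [1,6] + [2,7] + [3,8] + ...
print(acc)                            # [15 40]
print(np.add.reduce(M, axis=0))       # the same result via the ufunc reduce machinery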

Related

Optimized way of accessing elements from n-dimensional array using multiple indices [duplicate]

I am attempting to extract values from a 3d numpy array. At the moment I can perform the following operations:
newmesh.shape
(40,40,40)
newmesh[2,5,6]
6
However, if I try to index it with an array, the result is not as expected:
newmesh[np.array([2,5,6])].shape
(3, 40, 40)
I have tried using np.take, however it produces the following:
np.take(newmesh, np.array([2,5,6]))
[-1 -1 -1]
Any ideas why this is happening? My goal is to input an (n,3) array, where each row corresponds to one element of newmesh, i.e. inputting an (n,3) array would give back a 1d array of length n.
With idx as the (n,3) indexing array, one approach using linear-indexing would be with np.ravel_multi_index -
np.take(newmesh,np.ravel_multi_index(idx.T,newmesh.shape))
An approach with tuple formation would look like this -
newmesh[tuple(idx.T)]
If there are just three dimensions, you can even just use columnar slices for indexing into each dimension, like so -
newmesh[idx[:,0],idx[:,1],idx[:,2]]
Runtime test
If anyone's interested in seeing the performance numbers associated with the listed approaches, here's a quick runtime test -
In [18]: newmesh = np.random.rand(40,40,40)
In [19]: idx = np.random.randint(0,40,(1000,3))
In [20]: %timeit np.take(newmesh,np.ravel_multi_index(idx.T,newmesh.shape))
10000 loops, best of 3: 22.5 µs per loop
In [21]: %timeit newmesh[tuple(idx.T)]
10000 loops, best of 3: 20.9 µs per loop
In [22]: %timeit newmesh[idx[:,0],idx[:,1],idx[:,2]]
100000 loops, best of 3: 17.2 µs per loop

For loop is faster than numpy average, with a slightly different result

I'm calculating the average slope between a part of a series and a lagged part of the series:
One way is:
def get_trend(series, slen):
    total = 0.0                  # 'total' rather than 'sum', so the built-in isn't shadowed
    for i in range(slen):
        total += series[i + slen] - series[i]
    return total / slen**2
The second way is:
numpy.average(numpy.subtract(series[slen:2*slen], series[:slen]))/float(slen)
The first code is faster than the second by about 50% according to timeit, and the results differ in the 18th digit and onward for a series of size 200 with slen = 66 and numbers in the series ranging between 0 and 1.
I've also tried replacing the average with a sum and dividing by slen**2, like I do in the for loop:
numpy.sum(numpy.subtract(series[slen:2*slen], series[:slen]))/float(slen**2)
This is equivalent in execution time to the for-loop version, but the result is still not exactly the same; it is also (sometimes) different from the average version, although more often than not it matches the average version.
Questions are:
Which of these should give the most accurate answer?
Why does the last version give a different answer from the for loop?
And why is the average function so inefficient?
Note: for timing I'm measuring the operation on a standard list; on a numpy array the average is faster than the for loop, but the sum is still more than twice as fast as the average.
A better suggestion
I think a better vectorized approach would be with slicing -
(series[slen:2*slen] - series[:slen]).sum()/float(slen**2)
Runtime test and verification -
In [139]: series = np.random.randint(11,999,(200))
...: slen= 66
...:
# Original app
In [140]: %timeit get_trend(series, slen)
100000 loops, best of 3: 17.1 µs per loop
# Proposed app
In [141]: %timeit (series[slen:2*slen] - series[:slen]).sum()/float(slen**2)
100000 loops, best of 3: 3.81 µs per loop
In [142]: out1 = get_trend(series, slen)
In [143]: out2 = (series[slen:2*slen] - series[:slen]).sum()/float(slen**2)
In [144]: out1, out2
Out[144]: (0.7587235996326905, 0.75872359963269054)
Investigating the average-based approach against the loopy one
Let's add the second approach (vectorized one) from the question for testing -
In [146]: np.average(np.subtract(series[slen:2*slen], series[:slen]))/float(slen)
Out[146]: 0.75872359963269054
The timings are better than the loopy one and the results look good. So, I suspect the way you are timing might be off.
If you are using NumPy ufuncs to leverage vectorized operations, you should work with arrays. So, if your data is a list, convert it to an array first and then use the vectorized approach. Let's investigate a bit more -
Case #1 : With a list of 200 elems and slen = 66
In [147]: series_list = np.random.randint(11,999,(200)).tolist()
In [148]: series = np.asarray(series_list)
In [149]: slen = 66
In [150]: %timeit get_trend(series_list, slen)
100000 loops, best of 3: 5.68 µs per loop
In [151]: %timeit np.asarray(series_list)
100000 loops, best of 3: 7.99 µs per loop
In [152]: %timeit np.average(np.subtract(series[slen:2*slen], series[:slen]))/float(slen)
100000 loops, best of 3: 6.98 µs per loop
Case #2 : Scale it 10x
In [157]: series_list = np.random.randint(11,999,(2000)).tolist()
In [159]: series = np.asarray(series_list)
In [160]: slen = 660
In [161]: %timeit get_trend(series_list, slen)
10000 loops, best of 3: 53.6 µs per loop
In [162]: %timeit np.asarray(series_list)
10000 loops, best of 3: 65.4 µs per loop
In [163]: %timeit np.average(np.subtract(series[slen:2*slen], series[:slen]))/float(slen)
100000 loops, best of 3: 8.71 µs per loop
So, it's the overhead of converting to an array that's hurting you!
Investigating the sum-based approach against the average-based one
On the third part of the comparison, sum-based code against average-based: np.average is indeed slower than doing it "manually" with summation. Timing it here as well -
In [173]: a = np.random.randint(0,1000,(1000))
In [174]: %timeit np.sum(a)/float(len(a))
100000 loops, best of 3: 4.36 µs per loop
In [175]: %timeit np.average(a)
100000 loops, best of 3: 7.2 µs per loop
A better option than np.average is np.mean -
In [179]: %timeit np.mean(a)
100000 loops, best of 3: 6.46 µs per loop
Now, looking into the source code for np.average, it seems to be using np.mean. This explains why it's slower than np.mean: by calling np.mean directly we avoid the extra function-call overhead. On the tussle between np.sum and np.mean, I think np.mean does take care of overflow in case we are adding a huge number of elements, which we might miss with np.sum. So, to be on the safe side, I guess it's better to go with np.mean.
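A quick check of that relationship (a sketch; with no weights, np.average should agree exactly with np.mean):
import numpy as np
a = np.random.randint(0, 1000, 1000)
print(np.average(a) == np.mean(a))                         # True: average defers to mean
print(np.isclose(np.sum(a) / float(len(a)), np.mean(a)))   # the manual version matches too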
As for the first two questions... well, I would not say you have different results!
A difference on the order of 1e-18 is very small (also think of it in the context of your script).
This is why you should not compare floats for strict equality, but set a tolerance:
What is the best way to compare floats for almost-equality in Python?
https://www.python.org/dev/peps/pep-0485/
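For example, a minimal sketch of tolerance-based comparison:
import math
import numpy as np
a = 0.1 + 0.2
b = 0.3
print(a == b)                             # False: a is 0.30000000000000004
print(math.isclose(a, b, rel_tol=1e-9))   # True, the PEP 485 approach
print(np.isclose(a, b))                   # True; also works elementwise on arrays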

Aggregate numpy functions

I have a numpy operation that I call intensively and I need to optimise:
np.sum(a**2, axis=1)**.5 # where a is a 2 dimensional ndarray
This operation is composed of three functions and requires iterating through 'a' three times. It would be more efficient to aggregate all operations under one function and apply that function just once along axis 1. Unfortunately, numpy's apply_along_axis function is not an option as performance is around x1000 worse.
Is there a way of aggregating several numpy operations so it only has to loop once over the array?
When working with floating-point arrays, you can use np.einsum -
np.sqrt(np.einsum('ij,ij->i',a,a))
Runtime test -
In [34]: a = np.random.rand(1000,1000)
In [35]: np.allclose(np.sum(a**2, axis=1)**.5,np.sqrt(np.einsum('ij,ij->i',a,a)))
Out[35]: True
In [36]: %timeit np.sum(a**2, axis=1)**.5
100 loops, best of 3: 7.57 ms per loop
In [37]: %timeit np.sqrt(np.einsum('ij,ij->i',a,a))
1000 loops, best of 3: 1.52 ms per loop
Take a look at numexpr, which allows you to evaluate numerical expressions faster than pure numpy:
In [19]: a = np.arange(1e6).reshape(1000,1000)
In [20]: import numexpr as ne
In [21]: %timeit np.sum(a**2,axis=1)**0.5
100 loops, best of 3: 6.08 ms per loop
In [22]: %timeit ne.evaluate("sum(a**2,axis=1)")**0.5
100 loops, best of 3: 4.27 ms per loop
The **0.5 is not part of the expression because sum is a reduction operation and needs to be computed last in an expression. You could also run another evaluation for the sqrt/**0.5.
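A rough sketch of that two-step variant, keeping both the reduction and the root inside numexpr:
import numexpr as ne
import numpy as np
a = np.arange(1e6).reshape(1000, 1000)
s = ne.evaluate("sum(a**2, axis=1)")   # the reduction has to finish first
out = ne.evaluate("sqrt(s)")           # a second evaluation for the root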

Is there a non-copying constructor for a pandas DataFrame

Reposted from https://groups.google.com/forum/#!topic/pydata/5mhuatNAl5g
It seems that when creating a DataFrame from a structured array, the data is copied?
I get similar results if the data is instead a dictionary of numpy arrays.
Is there any way to create a DataFrame from a structured array or similar without any copying or checking?
In [44]: sarray = randn(int(1e7), 10).view([(name, float) for name in 'abcdefghij']).squeeze()
In [45]: for N in [10,100,1000,10000,100000,1000000,10000000]:
...: s = sarray[:N]
...: %timeit z = pd.DataFrame(s)
...:
1000 loops, best of 3: 830 µs per loop
1000 loops, best of 3: 834 µs per loop
1000 loops, best of 3: 872 µs per loop
1000 loops, best of 3: 1.33 ms per loop
100 loops, best of 3: 15.4 ms per loop
10 loops, best of 3: 161 ms per loop
1 loops, best of 3: 1.45 s per loop
Thanks,
Dave
Pandas' DataFrame uses a BlockManager to consolidate similarly typed data into a single memory chunk. It's this consolidation into a single chunk that causes the copy. If you initialize as follows:
pd.DataFrame(npmatrix, copy=False)
then the DataFrame will not copy the data, but reference it instead.
HOWEVER, sometimes you may come in with multiple arrays, and the BlockManager will try to consolidate the data into a single chunk. In that situation, I think your only option is to monkey-patch the BlockManager so it does not consolidate the data.
I agree with @DaveHirschfeld: this could be provided as a consolidate=False parameter to BlockManager. Pandas would be better for it.
This will by definition coerce the dtypes to a single dtype (e.g. float64 in this case); there is no way around that. The result is a view on the original array. Note that this only helps with construction: most operations will tend to make and return copies.
In [44]: s = sarray[:1000000]
Original Method
In [45]: %timeit DataFrame(s)
10 loops, best of 3: 107 ms per loop
Coerce to an ndarray and pass in copy=False (this doesn't affect a structured array, ONLY a plain single-dtyped ndarray).
In [47]: %timeit DataFrame(s.view(np.float64).reshape(-1,len(s.dtype.names)),columns=s.dtype.names,copy=False)
100 loops, best of 3: 3.3 ms per loop
In [48]: result = DataFrame(s.view(np.float64).reshape(-1,len(s.dtype.names)),columns=s.dtype.names,copy=False)
In [49]: result2 = DataFrame(s)
In [50]: result.equals(result2)
Out[50]: True
Note that both DataFrame.from_dict and DataFrame.from_records will copy this. Pandas keeps like-dtyped ndarrays as a single ndarray, and it's expensive to do the np.concatenate needed to aggregate them, which is what happens under the hood. Using a view avoids this issue.
I suppose this could be the default for a structured array if the passed dtypes are all the same. But then you have to ask why you are using a structured array in the first place. (Obviously to get name access... but is there another reason?)
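If you want to confirm whether a given construction copied, np.shares_memory offers a quick check (a sketch; pandas internals vary by version, so this may not hold everywhere):
import numpy as np
from pandas import DataFrame
# A small structured array standing in for sarray[:N] above
s = np.zeros(1000, dtype=[(name, float) for name in 'abcdefghij'])
arr = s.view(np.float64).reshape(-1, len(s.dtype.names))
result = DataFrame(arr, columns=s.dtype.names, copy=False)
print(np.shares_memory(result.values, arr))   # True when no copy was made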

Efficient column indexing and selection in PANDAS

I'm looking for the most efficient way to select multiple columns from a data frame:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.rand(4,8), columns = list('abcdefgh'))
I want to select only the following columns, a, c, e, f, g, which can be done using indexing:
df.ix[:,[0,2,4,5,6]]
For a large data frame with many columns this seems an inefficient method, and I would much rather specify consecutive column indexes by range, if at all possible. But attempts such as the following both throw up syntax errors:
df.ix[:,[0,2,4:6]]
or
df.ix[:,[0,2,[4:6]]]
As soon as you select non-adjacent columns, you pay a price.
If your data is homogeneous, falling back to numpy gives you a notable improvement.
In [147]: %timeit df[['a','c','e','f','g']]
%timeit df.values[:,[0,2,4,5,6]]
%timeit df.ix[:,[0,2,4,5,6]]
%timeit pd.DataFrame(df.values[:,[0,2,4,5,6]],columns=df.columns[[0,2,4,5,6]])
100 loops, best of 3: 2.67 ms per loop
10000 loops, best of 3: 58.7 µs per loop
1000 loops, best of 3: 1.81 ms per loop
1000 loops, best of 3: 568 µs per loop
I think you can use range:
print [0,2] + range(4,7)
[0, 2, 4, 5, 6]
print df.ix[:, [0,2] + range(4,7)]
a c e f g
0 0.278231 0.192650 0.653491 0.944689 0.663457
1 0.416367 0.477074 0.582187 0.730247 0.946496
2 0.396906 0.877941 0.774960 0.057290 0.556719
3 0.119685 0.211581 0.526096 0.213282 0.492261
Pandas is relatively well thought out; the shortest way is the most efficient:
df[['a','c','e','f','g']]
You don't need ix, which would do a lookup into your data; but for this you obviously need the names of the columns.
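If you only know the column positions, you can translate them into labels first and keep the fast label-based path (a small sketch):
cols = df.columns[[0, 2] + list(range(4, 7))]   # positions -> labels: a, c, e, f, g
df[cols]                                        # same as df[['a','c','e','f','g']]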
