Efficiently aggregate results into a Python data structure

I often find myself looping over some long INPUT list (or dataframe, or dictionary). On each iteration I do some calculations on the input data and then push the results into some OUTPUT data structure. Often the final output is a dataframe (since it is convenient to deal with).
Below are two methods that loop over a long list, and aggregate some dummy results into a dataframe. Approach 1 is very slow (~3 seconds per run), whereas Approach 2 is very fast (~18 ms per run). Approach 1 is not good, because it is slow. Approach 2 is faster, but it is not ideal either, because it effectively "caches" data in a local file (and then relies on pandas to read that file back in very quickly). Ideally, we do everything in memory.
What approaches can people suggest to efficiently aggregate results? Bonus: And what if we don't know the exact size/length of our output structure (e.g. the actual output size may exceed the initial size estimate)? Any ideas appreciated.
import time
import pandas as pd
def run1(long_list):
    my_df = pd.DataFrame(columns=['A', 'B', 'C'])
    for el in long_list:
        my_df.loc[len(my_df)] = [el, el+1, 1/el]  # Dummy calculations
    return my_df
def run2(long_list):
    with open('my_file.csv', 'w') as f:
        f.write('A,B,C\n')
        for el in long_list:
            f.write(f'{el},{el+1},{1/el}\n')  # Dummy calculations
    return pd.read_csv('my_file.csv')
long_list = range(1, 2000)
%timeit df1 = run1(long_list) # 3 s ± 349 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit df2 = run2(long_list) # 18 ms ± 697 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

You can do this by creating and then dropping a dummy input column and doing all of the calculations directly in pandas:
def func(long_list):
    my_df = pd.DataFrame(long_list, columns=['input'])
    my_df = my_df.assign(
        A=my_df.input,
        B=my_df.input + 1,
        C=1/my_df.input)
    return my_df.drop('input', axis=1)
Comparing the times:
%timeit df1 = run1(long_list)
%timeit df2 = run2(long_list)
%timeit df3 = func(long_list)
3.81 s ± 6.99 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
5.54 ms ± 28.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
3.19 ms ± 3.95 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Pros:
- All in memory
- Really fast
- Easy to read
Cons:
- Probably not as fast as vectorized NumPy operations (see the sketch below)
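For reference, here is a sketch of what such a vectorized NumPy variant could look like; func_np is a name introduced here purely for illustration, not part of the answer above:
import numpy as np
import pandas as pd
def func_np(long_list):
    # Convert the input once, then compute each column as a whole-array operation.
    arr = np.asarray(long_list, dtype=float)
    return pd.DataFrame({'A': arr, 'B': arr + 1, 'C': 1 / arr})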

You can directly build a DataFrame from a list of lists:
def run3(long_list):
    return pd.DataFrame([[el, el+1, 1/el] for el in long_list],
                        columns=['A', 'B', 'C'])
It should be much faster than the first one, and still faster than the second one, because it does not use disk I/O.
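Regarding the bonus question: since a plain Python list grows dynamically, the same pattern also works when the output length is not known in advance. A minimal sketch (run4 is a name introduced here for illustration):
def run4(iterable):
    rows = []  # grows as needed, so no size estimate is required
    for el in iterable:
        rows.append({'A': el, 'B': el + 1, 'C': 1/el})  # dummy calculations
    return pd.DataFrame(rows, columns=['A', 'B', 'C'])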

Related

When indexing a DataFrame with a boolean mask, is it faster to apply the masks sequentially?

Given a large DataFrame df, which is faster in general?
# combining the masks first
sub_df = df[(df["column1"] < 5) & (df["column2"] > 10)]
# applying the masks sequentially
sub_df = df[df["column1"] < 5]
sub_df = sub_df[sub_df["column2"] > 10]
The first approach selects from the DataFrame only once, which may be faster; however, the second selection in the sequential approach only has to consider a smaller DataFrame.
It depends on your dataset.
First let's generate a DataFrame with almost all values that should be dropped in the first condition:
import numpy as np
import pandas as pd

n = 1_000_000
p = 0.0001
np.random.seed(0)
df = pd.DataFrame({'column1': np.random.choice([0, 6], size=n, p=[p, 1-p]),
                   'column2': np.random.choice([0, 20], size=n)})
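For reference, the timings below were presumably produced by something along these lines (successive() is a helper introduced here just to make the two-step version timeable in one statement):
def successive(df):
    sub_df = df[df["column1"] < 5]
    return sub_df[sub_df["column2"] > 10]

%timeit df[(df["column1"] < 5) & (df["column2"] > 10)]  # simultaneous conditions
%timeit successive(df)                                  # successive slicing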
And as expected:
# simultaneous conditions
5.69 ms ± 300 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# successive slicing
2.99 ms ± 45.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
It is faster to first generate a small intermediate.
Now, let's change the probability to p = 0.9999. This means that the first condition will remove very few rows.
We could expect both solutions to run with a similar speed, but:
# simultaneous conditions
27.5 ms ± 2.33 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# successive slicing
55.7 ms ± 3.44 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Now the overhead of creating the intermediate DataFrame is not negligible.
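A third variant that can be worth adding to such a benchmark is DataFrame.query(), which expresses the combined condition in a single call; whether it wins again depends on the data (and on whether numexpr is installed as the evaluation engine):
# Same selection expressed with query(); benchmark it against the two
# approaches above on your own data before committing to it.
sub_df = df.query("column1 < 5 and column2 > 10")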

Pandas groupby: efficiently chain several functions

I need to group a DataFrame and apply several chained functions on each group.
My problem is basically the same as in pandas - Groupby two functions: apply cumsum then shift on each group.
There are answers there on how to obtain a correct result; however, they seem to have suboptimal performance. My specific question is thus: is there a more efficient way than the ones I describe below?
First here is some large testing data:
from string import ascii_lowercase
import numpy as np
import pandas as pd
n = 100_000_000
np.random.seed(0)
df = pd.DataFrame(
    {
        "x": np.random.choice(np.array([*ascii_lowercase]), size=n),
        "y": np.random.normal(size=n),
    }
)
Below is the performance of each function:
%timeit df.groupby("x")["y"].cumsum()
4.65 s ± 71 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit df.groupby("x")["y"].shift()
5.29 s ± 54.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
A basic solution is to group twice. It seems suboptimal since grouping is a large part of the total runtime and should only be done once.
%timeit df.groupby("x")["y"].cumsum().groupby(df["x"]).shift()
10.1 s ± 63.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
The accepted answer to the aforementioned question suggests using apply with a custom function to avoid this issue. However, for some reason it actually performs much worse than the previous solution.
def cumsum_shift(s):
    return s.cumsum().shift()
%timeit df.groupby("x")["y"].apply(cumsum_shift)
27.8 s ± 858 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Do you have any idea how to optimize this code? Especially in a case where I'd like to chain more than two functions, performance gains can become quite significant.
Let me know if this helps; a few weeks back I was having the same issue.
I solved it by just splitting the code and creating a separate groupby object that contains the group information.
# creating groupby object
g = df.groupby('x')['y']
%timeit g.cumsum()
592 ms ± 8.67 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit g.shift()
1.7 s ± 8.68 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
I would suggest giving transform a try instead of apply.
try this:
%timeit df.groupby("x")["y"].transform(np.cumsum).transform(lambda x: x.shift())
or, also try using
from toolz import pipe
%timeit df.groupby("x").pipe(lambda g: g["y"].cumsum().shift())
I am pretty sure that pipe can be more efficient than apply or transform
Let us know if it works well
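Whichever variant you try, a quick equality check against the two-groupby baseline from the question is a cheap way to confirm that the chained operations are still being applied per group; a minimal sanity check:
# Baseline from the question: group twice (correct, but slower).
baseline = df.groupby("x")["y"].cumsum().groupby(df["x"]).shift()

# Candidate: one of the suggested one-liners above (here the pipe variant).
candidate = df.groupby("x").pipe(lambda g: g["y"].cumsum().shift())

# Series.equals() treats NaNs in matching positions as equal.
print(baseline.equals(candidate))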

Small Pandas DataFrame = bad performance

Use case and conclusions
Create a 2x2 table filled with integers, change the values of one specific row, and access several rows. For this I was planning to use a pandas DataFrame, but I was very disappointed with the performance. The conclusion shows an irreducible cost for a pandas DataFrame when the data is a small table:
- pandas.loc = pandas.iloc = 115 µs
- pandas.iat = 5 µs (20 times faster, but only accesses one cell)
- numpy access = 0.5 µs (200 times faster, acceptable performance)
Am I using pandas DataFrame incorrectly? Is it meant only for massive tables of data? Given that my goal is a very simple multi-indexation (type 1, type 2 and date), is there an existing data structure that offers performance similar to numpy arrays and is as easy to use as a pandas DataFrame?
The other option I am considering is to create my own class of multi-indexed numpy arrays.
Code of the use case
import pandas as pd
import numpy as np
from datetime import datetime
import time
data = [[1,2],[3,4]]
array = np.array(data)
index = pd.DatetimeIndex([datetime(2000, 2, 2), datetime(2000, 2, 4)])
df = pd.DataFrame(data=data, index=index)
It takes around 0.1 ms with pandas using the loc function:
%timeit df.loc[datetime(2000,2,2)] = [12, 42]
115 µs ± 2.19 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
iloc has the same bad performance, although I would intuitively have thought it would be faster:
%timeit df.iloc[0] = [12, 42]
114 µs ± 3.49 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Testing iat, we are down to 5 µs, which is more acceptable:
%timeit df.iat[0,0] = 42
5.03 µs ± 37.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Finally, testing the same behavior with a numpy array, we get excellent performance:
%timeit array[0,:] = [12, 42]
705 ns ± 4.96 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

Why is NumPy sometimes slower than NumPy + plain Python loop?

This is based on this question, asked in October 2018.
Consider the following code. Three simple functions to count non-zero elements in a NumPy 3D array (1000 × 1000 × 1000).
import numpy as np
def f_1(arr):
    return np.sum(arr > 0)

def f_2(arr):
    ans = 0
    for val in range(arr.shape[0]):
        ans += np.sum(arr[val, :, :] > 0)
    return ans

def f_3(arr):
    return np.count_nonzero(arr)

if __name__ == '__main__':
    data = np.random.randint(0, 10, (1_000, 1_000, 1_000))
    print(f_1(data))
    print(f_2(data))
    print(f_3(data))
Runtimes on my machine (Python 3.7.?, Windows 10, NumPy 1.16.?):
%timeit f_1(data)
1.73 s ± 21.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit f_2(data)
1.4 s ± 1.36 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit f_3(data)
2.38 s ± 956 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
So, f_2() works faster than f_1() and f_3(). However, this is not the case with smaller data. The question is: why? Is it NumPy, Python, or something else?
This is due to memory access and caching. Each of these functions is doing two things, taking the first code as an example:
np.sum(arr > 0)
It first does a comparison to find where arr is greater than zero (or non-zero, since arr contains non-negative integers). This creates an intermediate array the same shape as arr. Then, it sums this array.
Straightforward, right? Well, when calling np.sum(arr > 0) on the full array, that intermediate is a large array. When it's too large to fit in cache, performance decreases: by the time the processor starts executing the sum, most of the array elements have been evicted from the cache and need to be reloaded.
Since f_2 iterates over the first dimension, it is dealing with smaller sub-arrays. The same comparison and sum are done, but this time the intermediate array fits in the cache. It's created, used, and destroyed without ever being evicted. This is much faster.
Now, you would think that f_3 would be fastest (using an in-built method and all), but looking at the source code shows that it uses the following operations:
a_bool = a.astype(np.bool_, copy=False)
return a_bool.sum(axis=axis, dtype=np.intp)
a_bool is just another way of finding the non-zero entries, and creates a large intermediate array.
Conclusions
Rules of thumb are just that, and are frequently wrong. If you want faster code, profile it and see what the problems are (good work on that here).
Python does some things very well. In cases where it's optimized, it can be faster than numpy. Don't be afraid to use plain old python code or datatypes in combination with numpy.
If you frequently find yourself manually writing for loops for better performance, you may want to take a look at numexpr - it automates some of this. I haven't used it much myself, but it should provide a good speedup if intermediate arrays are what's slowing down your program.
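For what it's worth, a minimal sketch of what using numexpr could look like for this example, assuming the library is installed (untimed here, so treat it as a sketch rather than a benchmark):
import numexpr as ne

# numexpr evaluates the comparison and the reduction chunk by chunk, so no
# full-size boolean intermediate array is materialized.
count = ne.evaluate("sum(where(arr > 0, 1, 0))", local_dict={"arr": data})
print(int(count))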
It's all a matter of how the data is laid out in memory and how the code accesses it. Essentially, data is fetched from the memory in blocks which are then cached; if an algorithm manages to use data from a block that is in the cache, there is no need to read from memory again. This can result in huge time savings, especially when the cache is much smaller than the data you are dealing with.
Consider these variations, which only differ in which axis we are iterating on:
def f_2_0(arr):
    ans = 0
    for val in range(arr.shape[0]):
        ans += np.sum(arr[val, :, :] > 0)
    return ans

def f_2_1(arr):
    ans = 0
    for val in range(arr.shape[1]):
        ans += np.sum(arr[:, val, :] > 0)
    return ans

def f_2_2(arr):
    ans = 0
    for val in range(arr.shape[2]):
        ans += np.sum(arr[:, :, val] > 0)
    return ans
And the results on my laptop:
%timeit f_1(data)
2.31 s ± 47.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit f_2_0(data)
1.88 s ± 60 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit f_2_1(data)
2.65 s ± 142 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit f_2_2(data)
12.8 s ± 650 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
You can see that f_2_1 is almost as fast as f_1, which makes me think that numpy is not using the optimal access pattern (the one used by f_2_0). The explanation of how exactly caching affects the timing is in the other answer.
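A quick way to see this layout for yourself is to inspect the array's strides; the exact numbers depend on the integer dtype np.random.randint gives you on your platform:
# For a C-ordered 1000 x 1000 x 1000 array the last axis has the smallest
# stride, i.e. its elements sit next to each other in memory.
print(data.strides)   # e.g. (8000000, 8000, 8) with an 8-byte integer dtype
print(data.itemsize)  # bytes per element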
Let's remove the temporary array completely
As @user2699 already mentioned in his answer, allocating and writing to a large array that doesn't fit in cache can slow down the process quite a lot. To show this behavior I have written two small functions using Numba (a JIT compiler).
In compiled languages (C, Fortran, ...) you normally avoid temporary arrays. In interpreted Python (without using Cython or Numba) you often want to call a compiled function on a larger chunk of data (vectorization), because loops in interpreted code are extremely slow. But this can also have a few downsides (like temporary arrays or bad cache usage).
Function without temporary array allocation
import numba as nb

@nb.njit(fastmath=True, parallel=False)
def f_4(arr):
    sum = 0
    for i in nb.prange(arr.shape[0]):
        for j in range(arr.shape[1]):
            for k in range(arr.shape[2]):
                if arr[i, j, k] > 0:
                    sum += 1
    return sum
With temporary array
Please note that if you turn on parallelization with parallel=True, the compiler not only tries to parallelize the code; other optimizations such as loop fusing are also turned on.
@nb.njit(fastmath=True, parallel=False)
def f_5(arr):
    return np.sum(arr > 0)
Timings
%timeit f_1(data)
1.65 s ± 48.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit f_2(data)
1.27 s ± 5.66 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit f_3(data)
1.99 s ± 6.11 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit f_4(data)  # parallel=False
216 ms ± 5.45 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit f_4(data)  # parallel=True
121 ms ± 4.85 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit f_5(data)  # parallel=False
1.12 s ± 19 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit f_5(data)  # parallel=True; the temporary array is automatically optimized away
146 ms ± 12.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Efficiency of random slicing on a numpy memory map

I have a 20 GB, 100k x 100k float16 2D array as a data file. I memory-map it as follows:
fp_read = np.memmap(filename, dtype='float16', mode='r', shape=(100000, 100000))
I then attempt to read slices from it. The vertical slices I need to take are effectively random, but the performance of this is very poor. Or am I doing something wrong?
Analysis:
I have compared this with other forms of cross-sectional slicing, which perform much better, although I don't know why they should:
%timeit fp_read[:,17000:17005] # slice 5 consecutive cols
1.64 µs ± 16.4 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit fp_read[:,11000:11050:10]
1.67 µs ± 21 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit fp_read[:,5000:6000:200]
1.66 µs ± 27.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit fp_read[:,0:100000:20000] # slice 5 disperse cols
1.69 µs ± 14.7 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit fp_read[:,[1,1001,27009,81008,99100]] # slice 5 rand cols
32.4 ms ± 10.9 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
a = np.arange(100000); b = np.array([1,1001,27009,81008,99100])
%timeit fp_read[np.ix_(a,b)]
18 ms ± 142 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Even these timeit functions don't accurately capture the performance degradation, since:
import time
a = np.arange(100000)
cols = np.arange(100000)
np.random.shuffle(cols)
cols = np.sort(cols[:5])
t = time.time()
arr = fp_read[np.ix_(a,cols)]
print('Actually took: {} seconds'.format(time.time() - t))
Actually took: 24.5 seconds
Compared with:
t = time.time()
arr = fp_read[:,0:100000:20000]
print('Actually took: {} seconds'.format(time.time() - t))
Actually took: 0.00024 seconds
The performance difference is explained by one key difference between "basic slicing and indexing" and "advanced indexing"; see these docs. The key line here is:
Advanced indexing always returns a copy of the data (contrast with basic slicing that returns a view).
How much the copy hurts can be seen from comparing fp_read[:,5000:6000:200] against fp_read[:,5000:6000:200].copy().
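For illustration, that comparison could be run like this (absolute numbers will depend heavily on hardware and on what the OS page cache already holds):
%timeit fp_read[:, 5000:6000:200]         # basic slicing: only builds a view
%timeit fp_read[:, 5000:6000:200].copy()  # forces a copy, so data must actually be read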
Although making an array copy is always going to be slower than making a new view, it's especially bad for a memmap:
Reading from disk is relatively slow. The data needs to be read from disk to make the (in-memory) copy, while a view doesn't have to read any data at all! There is simply a new ndarray object created with new offset and stepsize (strides) parameters for the memory buffer.
The memory layout of your data is row-major order (vs. column-major; see Wikipedia). For accessing random columns this means that a sector has to be read from disk for every single value of data. Compare that to contiguous access, where you only read one sector for every 256 values (assuming float16 and 512-byte sectors). With memory-mapped I/O this effect is even worse, because then data is read in blocks (memory pages) of 4 kB, i.e. 8 x 512-byte sectors.
Now we can also understand why the timeit results are not really representative: That particular part of the file is cached by the OS in memory.
