Better Python code to reduce memory usage? - python

I have a data frame of about 19 million rows, of which 4 variables are latitudes & longitudes. I created a function to calculate the distance between the coordinate pairs with the help of the Python haversine package.
# function to calculate distance of 2 coordinates
def measure_distance(lat_1, long_1, lat_2, long_2):
    coordinate_start = list(zip(lat_1, long_1))
    coordinate_end = list(zip(lat_2, long_2))
    distance = haversine_vector(coordinate_start, coordinate_end, Unit.KILOMETERS)
    return distance
I use the magic command %%memit to measure the memory usage of the calculation. On average, memory usage is between 8 and 10 GB. I run my work on Google Colab, which has 12 GB of RAM; as a result, the operation sometimes hits the runtime limit and restarts.
%%memit
measure_distance(df.station_latitude_start.values,
                 df.station_longitude_start.values,
                 df.station_latitude_end.values,
                 df.station_longitude_end.values)
peak memory: 7981.16 MiB, increment: 5312.66 MiB
Is there a way to optimise my code?

TL;DR: use Numpy and compute the result in chunks.
The amount of memory taken by the CPython interpreter is expected given the large input size.
Indeed, CPython stores values in lists using references. On a 64-bit system, a reference takes 8 bytes and basic types (floats and small integers) usually take 24-32 bytes. A tuple of two floats is a compound object that contains the size of the tuple as well as references to the two floats (not the values themselves); its size should be close to 64 bytes. Since you have 2 lists containing 19 million (references to) float pairs and 4 lists containing 19 million (references to) floats, the resulting memory taken should be about 4*19e6*(8+32) + 2*19e6*(8+64) ≈ 5.7 GB. Not to mention that haversine can make some internal copies and the result takes some space too.
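These per-object sizes can be sanity-checked with sys.getsizeof (a rough check only; exact values vary with the CPython version and platform, and getsizeof does not follow references):
import sys

print(sys.getsizeof(1.0))         # a Python float object: typically 24 bytes on 64-bit CPython
print(sys.getsizeof((1.0, 2.0)))  # a 2-tuple: ~56 bytes, plus the two float objects it references
print(sys.getsizeof([]))          # an empty list: ~56 bytes, plus 8 bytes per stored reference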
If you want to reduce the memory usage, use Numpy. Indeed, float Numpy arrays store values in a much more compact way (no references, no internal tags). You can replace each list of tuples by an N x 2 Numpy 2D array. The resulting size should be about 4*19e6*8 + 2*19e6*(8*2) ≈ 1.2 GB. Moreover, the computation will be much faster since haversine uses Numpy internally. Here is an example:
import numpy as np
# Assume lat_1, long_1, lat_2 and long_2 are of type np.array.
# Use np.array(yourList) if you want to convert it.
def measure_distance(lat_1, long_1, lat_2, long_2):
    coordinate_start = np.column_stack((lat_1, long_1))
    coordinate_end = np.column_stack((lat_2, long_2))
    return haversine_vector(coordinate_start, coordinate_end, Unit.KILOMETERS)
The above code is about 25 times faster.
If you want to reduce the memory usage even more, you can compute the distances chunk by chunk (for example 32K values at a time) and then concatenate the output chunks. You can also use single-precision numbers rather than double precision if you do not care too much about the accuracy of the computed distances.
Here is an example of how to compute the result by chunk:
def better_measure_distance(lat_1, long_1, lat_2, long_2):
    chunk_size = 65536
    result = np.zeros(len(lat_1))
    for i in range(0, len(lat_1), chunk_size):
        coordinate_start = np.column_stack((lat_1[i:i+chunk_size], long_1[i:i+chunk_size]))
        coordinate_end = np.column_stack((lat_2[i:i+chunk_size], long_2[i:i+chunk_size]))
        result[i:i+chunk_size] = haversine_vector(coordinate_start, coordinate_end, Unit.KILOMETERS)
    return result
On my machine, using double precision, the above code takes about 800 MB while the initial implementation takes about 8 GB. Thus, 10 times less memory! It is also still 23 times faster! Using single precision, the above code takes about 500 MB, so 16 times less memory, and it is 48 times faster!
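For reference, here is a minimal sketch of the single-precision variant. It assumes haversine_vector accepts the float32 arrays as-is; the accuracy loss is typically negligible for kilometre-scale distances:
import numpy as np
from haversine import haversine_vector, Unit

def measure_distance_f32(lat_1, long_1, lat_2, long_2, chunk_size=65536):
    # Cast once to single precision to halve the memory of the stacked chunks.
    lat_1 = np.asarray(lat_1, dtype=np.float32)
    long_1 = np.asarray(long_1, dtype=np.float32)
    lat_2 = np.asarray(lat_2, dtype=np.float32)
    long_2 = np.asarray(long_2, dtype=np.float32)
    result = np.zeros(len(lat_1), dtype=np.float32)
    for i in range(0, len(lat_1), chunk_size):
        start = np.column_stack((lat_1[i:i+chunk_size], long_1[i:i+chunk_size]))
        end = np.column_stack((lat_2[i:i+chunk_size], long_2[i:i+chunk_size]))
        result[i:i+chunk_size] = haversine_vector(start, end, Unit.KILOMETERS)
    return result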

Related

Python: how to speed up this function and make it more scalable?

I have the following function, which accepts an indicator matrix of shape (20,000 x 20,000), and I have to run the function 20,000 x 20,000 = 400,000,000 times. Note that indicator_Matrix has to be a pandas dataframe when passed as a parameter into the function, as my actual problem's dataframe has a time index and integer columns, but I have simplified this a bit for the sake of understanding the problem.
Pandas Implementation
indicator_Matrix = pd.DataFrame(np.random.randint(0,2,[20000,20000]))
def operations(indicator_Matrix):
    s = indicator_Matrix.sum(axis=1)
    d = indicator_Matrix.div(s, axis=0)
    res = d[d>0].mean(axis=0)
    return res.iloc[-1]
I tried to improve it by using numpy, but it still takes ages to run. I also tried concurrent.futures.ThreadPoolExecutor, but it still takes a long time and is not much of an improvement over the list comprehension.
Numpy Implementation
indicator_Matrix = pd.DataFrame(np.random.randint(0,2,[20000,20000]))
def operations(indicator_Matrix):
    s = indicator_Matrix.to_numpy().sum(axis=1)
    d = (indicator_Matrix.to_numpy().T / s).T
    d = pd.DataFrame(d, index=indicator_Matrix.index, columns=indicator_Matrix.columns)
    res = d[d>0].mean(axis=0)
    return res.iloc[-1]

output = [operations(indicator_Matrix) for i in range(0, 20000**2)]
Note that the reason I convert d back to a dataframe is that I need to obtain the column means and retain only the last column mean using .iloc[-1]. d[d>0].mean(axis=0) returns the column means, i.e.
2478 1.0
0 1.0
Update: I am still stuck on this problem. I wonder if using GPU packages like cudf and CuPy on my local desktop would make any difference.
Assuming the answer of @CrazyChucky is correct, one can implement a faster parallel Numba version. The idea is to use plain loops and to read the data contiguously. Reading data contiguously is important to make the computation cache-friendly/memory-efficient. Here is an implementation:
import numpy as np
import numba as nb

@nb.njit(['(int_[:,:],)', '(int_[:,::1],)', '(int_[::1,:],)'], parallel=True)
def compute_fastest(matrix):
    n, m = matrix.shape
    sum_by_row = np.zeros(n, matrix.dtype)
    is_row_major = matrix.strides[0] >= matrix.strides[1]
    if is_row_major:
        for i in nb.prange(n):
            s = 0
            for j in range(m):
                s += matrix[i, j]
            sum_by_row[i] = s
    else:
        for chunk_id in nb.prange(0, (n+63)//64):
            start = chunk_id * 64
            end = min(start+64, n)
            for j in range(m):
                for i2 in range(start, end):
                    sum_by_row[i2] += matrix[i2, j]
    count = 0
    s = 0.0
    for i in range(n):
        value = matrix[i, -1] / sum_by_row[i]
        if value > 0:
            s += value
            count += 1
    return s / count

# output = [compute_fastest(indicator_Matrix.to_numpy()) for i in range(0,20000**2)]
Pandas dataframes can contain both row-major and column-major arrays. Depending on the memory layout, it is better to iterate over the rows or over the columns; this is why there are two implementations of the sum based on is_row_major. There are also 3 Numba signatures: one for row-major contiguous arrays, one for column-major contiguous arrays and one for non-contiguous arrays. Numba will compile the 3 function variants and automatically pick the best one at runtime. The JIT-compiler of Numba can generate a faster implementation (e.g. using SIMD instructions) when the input 2D array is known to be contiguous.
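To see which case applies to your own dataframe, you can inspect the flags and strides of the underlying array, for example:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 2, (4, 5)))
arr = df.to_numpy()

# Homogeneous pandas dataframes are often backed by a column-major (Fortran-order) block.
print(arr.flags['C_CONTIGUOUS'], arr.flags['F_CONTIGUOUS'])
print(arr.strides)  # the axis with the smallest stride is the contiguous one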
Experimental Results
This computation is about 14.5 times faster than operations_simpler on my i5-9600KF processor (6 cores). It still takes a lot of time, but the computation is memory-bound and nearly optimal on my machine: it is bound by the main memory, which has to be read:
On a 2000x2000 dataframe with 32-bit integers:
- operations: 86.310 ms/iter
- operations_simpler: 5.450 ms/iter
- compute_fastest: 0.375 ms/iter
- optimal: 0.345-0.370 ms/iter
If you want faster code, then you need to use more compact data types. For example, a uint8 data type is large enough to hold the values 0 and 1, and it is 4 times smaller in memory than the default integer type on Windows. This means the code can be up to 4 times faster in this case. The smaller the data type, the faster the program. One could even try to pack 8 columns into 1 using bit tricks, though this is generally significantly slower with Numba unless you have a lot of available cores.
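A minimal sketch of the uint8 idea (the cast makes a copy, so it only pays off if the dataframe is built with that dtype in the first place or the casted array is reused; compute_fastest would also need a matching uint8 signature added to its nb.njit decorator):
import numpy as np
import pandas as pd

indicator_Matrix = pd.DataFrame(np.random.randint(0, 2, [2000, 2000], dtype=np.uint8))
matrix = indicator_Matrix.to_numpy()  # already uint8: 4-8x smaller than the default integer dtype
# compute_fastest(matrix)             # assuming a '(uint8[:,::1],)'-style signature was added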
Notes & Discussion
The above code works only with uniformly-typed columns. If this is not the case, you can split the dataframe into multiple column groups and convert each group to a Numpy array, then call the Numba function (modified to support groups). Note that @CrazyChucky's code has a similar issue: a dataframe with mixed-datatype columns converted to a Numpy array results in an object-based Numpy array, which is very inefficient (especially as a row-major Numpy array).
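As a hedged illustration of the column-group idea, select_dtypes can be used to extract a homogeneous numeric block before converting to Numpy (the dataframe below is made-up example data):
import numpy as np
import pandas as pd

# Hypothetical mixed-type dataframe: two integer columns plus a text column.
df = pd.DataFrame({'a': [0, 1, 1], 'b': [1, 0, 1], 'label': ['x', 'y', 'z']})

# Select a homogeneous numeric block so that to_numpy() does not fall back to dtype=object.
numeric_block = df.select_dtypes(include=[np.number]).to_numpy()
print(numeric_block.dtype)  # an integer dtype, not object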
Note that using a GPU will not make the computation faster unless the input dataframe is already stored in the GPU memory. Indeed, CPU-GPU data transfers are more expensive than just reading the RAM (due to the interconnect overhead, generally a rather slow PCIe link). Note also that GPU memory is quite limited compared to CPU memory. If the target dataframe(s) do not need to be transferred, then using cudf is relatively simple and should give a small speed-up. For faster code, one needs to write a fast CUDA kernel, but this is clearly far from easy for dataframes with mixed datatypes. In the end, the resulting speed-up should be main_ram_throughput / gpu_ram_throughput, assuming there is no data transfer; this factor is generally 5-12. Note also that CUDA and cudf require an Nvidia GPU.
Finally, reducing the input data size or simply the amount of computation is certainly the best solution (as indicated in the comment by @zvone), since the task is very computationally intensive.
You're doing some extra math you don't have to. In plain English, what you're doing is:
Summing each row
Turning the list of sums "sideways" and dividing each column by it
Taking the mean of each column, ignoring values ≤ 0
Returning only the rightmost mean
After step one, you no longer need anything but the rightmost column; you can ignore the other columns, only dividing and averaging the one whose result you care about. Changing your code accordingly:
def operations_simpler(indicator_matrix):
    sums = indicator_matrix.sum(axis=1)
    last_column = indicator_matrix.iloc[:, -1]
    divided = last_column / sums
    return divided[divided > 0].mean()
...yields the same result, and takes about a hundredth of the time. Extrapolating from shorter test runs, this cuts the time for 400,000,000 runs on my machine from about 114 years down to... about 324 days. Still not great. So far I've not managed to get it to run any faster by converting to NumPy, compiling with Numba, or employing multiprocessing, but I'll go ahead and post this for now in case it's helpful.
Note: You're unlikely to see any improvement from threading with compute-heavy work like this; if anything, you'd want to use multiprocessing. concurrent.futures offers executors for both. Threads are mostly useful for avoiding waits on I/O.
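For completeness, here is a rough sketch of the multiprocessing route with concurrent.futures. It assumes you have several independent matrices to process; whether it actually helps depends on how expensive each call is relative to the process startup and pickling overhead:
import numpy as np
import pandas as pd
from concurrent.futures import ProcessPoolExecutor

def operations_simpler(indicator_matrix):
    sums = indicator_matrix.sum(axis=1)
    last_column = indicator_matrix.iloc[:, -1]
    divided = last_column / sums
    return divided[divided > 0].mean()

if __name__ == "__main__":
    # Hypothetical workload: a handful of independent matrices.
    matrices = [pd.DataFrame(np.random.randint(0, 2, [2000, 2000])) for _ in range(8)]
    with ProcessPoolExecutor() as executor:
        results = list(executor.map(operations_simpler, matrices))
    print(results)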
As per the previous answer, you can use Numba, or you can use other alternatives such as Dask, a distributed computing package, to parallelize your function's execution: it can divide your data into smaller chunks and distribute the computation across many CPU cores or even multiple machines.
import dask.array as da
def operations(indicator_matrix):
    s = indicator_matrix.sum(axis=1)
    d = indicator_matrix.div(s, axis=0)
    res = d[d > 0].mean(axis=0)
    return res.iloc[-1]

indicator_matrix_dask = da.from_array(indicator_matrix, chunks=(1000, 1000))
output_dask = indicator_matrix_dask.map_blocks(operations, dtype=float)
output = output_dask.compute()
Or you can use CuPy, which uses the GPU to speed up your function's execution:
import cupy as cp
def operations(indicator_matrix):
    s = cp.sum(indicator_matrix, axis=1)
    d = cp.divide(indicator_matrix.T, s).T
    d = pd.DataFrame(d, index=indicator_matrix.index, columns=indicator_matrix.columns)
    res = d[d > 0].mean(axis=0)
    return res.iloc[-1]

indicator_matrix_cupy = cp.asarray(indicator_matrix)
output_cupy = operations(indicator_matrix_cupy)
output = cp.asnumpy(output_cupy)

Why is my memory footprint blowing up in this greedy approach to TSP?

I have an assignment to use a greedy approach to solve TSP. The problem has 33708 cities. Because I had a lot of helpful tools for this from the previous assignment, I decided to reuse that approach and precompute the distances.
So that is barely more than half a billion entries (33708 choose 2), each comfortably fitting in a float32. The x and y coordinates, likewise, are numbers |n| < 10000 with no more than 4 decimal places.
My python for the same was:
def get_distance(left, right):
    """ return the euclidean distance between tuples left and right, which are coordinates"""
    return ((left[0] - right[0]) ** 2 + (left[1] - right[1]) ** 2) ** 0.5

# precompute all distances
distances = {}
for i in range(len(cities)):
    for j in range(i + 1, len(cities)):
        d = get_distance(cities[i], cities[j])
        distances[frozenset((i, j))] = d
I expected this to occupy (3 × 32 bits) × 568M ≈ 6.7 GB of memory. But in fact, watching the live runtime in my Jupyter notebook, it appears to be shooting past even 35 GB (442 s and counting). I had to kill it as I was well into my swap space and it slowed down a lot. Does anyone know why this is so surprisingly large?
Update: trying again with tuple(sorted((i,j))) -- but already at 110 s it is 15 GB and counting.
sizes
>>> import sys
>>> a = frozenset((1,2))
>>> sys.getsizeof(a)
216
>>> sys.getsizeof(tuple(sorted((1,2))))
56
>>> sys.getsizeof(1)
28
Is there anything like float32 and int16 in Python? -- answer: numpy has them.
updated attempt:
from numpy import float32, int16
from itertools import combinations
import sys
def get_distance(left, right):
    """ return the euclidean distance between tuples left and right, which are coordinates"""
    return float32(((left[0] - right[0]) ** 2 + (left[1] - right[1]) ** 2) ** 0.5)

# precompute all distances
distances = {}
for i, j in combinations(range(len(cities)), 2):
    distances[tuple(sorted((int16(i), int16(j))))] = get_distance(cities[i], cities[j])

print(sys.getsizeof(distances))
observed sizes:
with cities = cities[:2] : 232
with cities = cities[:3] : also 232
with cities = cities[:10] : 2272
with cities = cities[:100] : 147552
with cities = cities[:1000] : 20971608 (20MB)
with cities = cities[:10000] : 2684354656 (2.6GB)
Note that the growth is not exactly proportional to the number of entries, even as we approach 50 million entries, i.e. 10000 choose 2 (about 10% of the total size of the data):
2684354656 / (10000 choose 2 / 1000 choose 2 * 20971608) ≈ 1.27
20971608 / (1000 choose 2 / 100 choose 2 * 147552) ≈ 1.4
I decided to halt my attempt at the full cities list, as my OS snapshot of the memory grew to well over 30GB and I was going to swap. This means that, even if the final object ends up that big, the amount of memory the notebook is requiring is much larger still.
Python objects have an overhead because of dynamic typing and reference counting. The absolute minimal object, object(), has a size of 16 bytes on 64-bit machines: an 8-byte reference count and an 8-byte type pointer. No Python object can be smaller than that. float and int are slightly larger, at least 24 bytes. A list is at least an array of pointers, which adds an additional 8 bytes per element. So the smallest possible memory footprint of a list of half a billion ints is 32 * 500_000_000 ≈ 16 GB. sets and dicts are even larger than that, since they store more than just one pointer per element.
Use numpy (maybe the stdlib array module is already enough).
(Note: a standalone numpy float32 scalar object still carries Python object overhead and cannot be smaller than 16 bytes either; the savings only apply to values stored inside a numpy array, where each one takes 4 bytes.)
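To make the numpy suggestion concrete, here is a minimal sketch that stores every unordered pair of distances in one flat float32 array (the "condensed" layout). For 33708 cities that is about 568 million entries × 4 bytes ≈ 2.3 GB, with no per-object overhead; the coordinates below are made-up example data:
import numpy as np

n = 33708
cities = np.random.uniform(-10000, 10000, size=(n, 2)).astype(np.float32)

# One float32 per unordered pair (i < j): ~568 million entries, ~2.3 GB.
distances = np.empty(n * (n - 1) // 2, dtype=np.float32)

offset = 0
for i in range(n - 1):
    diff = cities[i + 1:] - cities[i]  # offsets from city i to every later city
    distances[offset:offset + len(diff)] = np.sqrt((diff * diff).sum(axis=1))
    offset += len(diff)

def pair_index(i, j, n):
    """Index of the unordered pair (i, j), with i < j, in the condensed layout."""
    return i * n - i * (i + 1) // 2 + (j - i - 1)

print(distances[pair_index(5, 42, n)])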

Why is my vectorized Numpy code taking longer than the non-vectorized code

So I am calculating Poisson distributions using large amounts of data. I have an array of shape (2666667,19) - "spikes", and an array of shape (19,100) - "placefields". I used to have a for loop that iterated through the 2666667 dimension, which took around 60 seconds to complete. Then I learned that if I vectorize for loops, the code becomes much faster, so I tried to do so. The vectorized form works and outputs the same results; however, it now takes 120 seconds :/
Here is the original loop (60s):
def compute_probability(spikes, placefields):
    nTimeBins = len(spikes[0])
    probability = np.empty((nTimeBins, 99))  # empty probability matrix
    for i in range(nTimeBins):
        nspikes = np.tile(spikes[:, i], (99))
        nspikes = np.swapaxes(nspikes, 0, 1)
        maxL = stats.poisson.pmf(nspikes, placefields)
        maxL = maxL.prod(axis=0)
        probability[i, :] = maxL
    return probability
And here is the vectorised form (120s)
def compute_probability(spikes, placefields):
    placefields = np.reshape(placefields, (19, 99, 1))
    # prepared placefields
    nspikes = np.tile(spikes, (99, 1, 1))
    nspikes = np.swapaxes(nspikes, 0, 1)
    # prepared nspikes
    probability = stats.poisson.pmf(nspikes, placefields)
    probability = np.swapaxes(probability.prod(axis=0), 0, 1)
    return probability
Why is it SO SLOW? I think it might be that the tiled arrays created by the vectorized form are so gigantic that they take up a huge amount of memory. How can I make it go faster?
Download samplespikes and sampleplacefields (as suggested by the comments): https://mega.nz/file/lpRF1IKI#YHq1HtkZ9EzYvaUdlrMtBwMg-0KEwmhFMYswxpaozXc
EDIT:
The issue was that although it was vectorized, the huge array was taking up too much RAM. I have split the calculation into chunks, and it does better now:
placefields = np.reshape(placefields, (len(placefields), 99, 1))
nspikes = np.swapaxes(np.tile(spikes, (xybins, 1, 1)), 0, 1)
probability = np.empty((len(spikes[0]), xybins))
chunks = len(spikes[0]) // 20
n = int(len(spikes[0]) / chunks)
for i in range(0, len(nspikes[0][0]), n):
    nspikes_chunk = nspikes[:, :, i:i+n]
    probability_chunk = stats.poisson.pmf(nspikes_chunk, placefields)
    probability_chunk = np.swapaxes(probability_chunk.prod(axis=0), 0, 1)
    if len(probability_chunk) < (len(spikes) // chunks):
        probability[i:] = probability_chunk
    else:
        probability[i:i+len(probability_chunk)] = probability_chunk
This is likely due to memory/cache effects.
The first code works on small arrays fitting in the CPU caches. It is not great because each Numpy function call takes some time. The second code fixes that issue. However, it allocates/fills huge arrays of several GiB in memory. It is much faster to work in the CPU caches than in main memory (RAM). This is especially true when the working arrays are used only once (because of expensive OS page faults), which seems to be the case in your code. If you do not have enough memory, the OS will read/write temporary data to SSD/HDD storage devices that are very slow compared to the RAM and the CPU caches.
The best solution is probably to work on chunks so that the operation is both vectorized (reducing the overhead of the Numpy function calls) and fits in the CPU caches (reducing the cost of RAM reads/writes). Note that the size of the last-level cache is typically a few MiB nowadays on mainstream PC processors.
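As a generic illustration of this idea (not the exact Poisson computation above), processing a large array in cache-sized chunks keeps the vectorization benefit while avoiding multi-GiB temporaries:
import numpy as np

def process_chunked(x, chunk_size=262144):  # ~2 MiB of float64 per chunk
    out = np.empty_like(x)
    for i in range(0, len(x), chunk_size):
        part = x[i:i+chunk_size]
        # Several vectorized operations applied while the chunk is still in cache.
        out[i:i+chunk_size] = np.exp(-part) * np.sin(part) + part ** 2
    return out

x = np.random.rand(10_000_000)
result = process_chunked(x)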
The takeaway message is that vectorization does not always make things faster. For better performance, one should care about the size of the manipulated data chunks so that they fit in the CPU caches.
PS: note that if you do not care too much about precision, you can use single precision (np.float32) instead of double precision (np.float64) to speed up the computation a bit.

NumPy: Compute mode row-wise spanning over multiple arrays from iterator

In my application I receive from an iterator an arbitrary number (let's say 1000 for now) of big 1-dimensional arrays arr1, arr2, arr3, ..., arr1000 (10000 entries each). Each entry is an integer between 0 and n, where in this case n = 9. My ultimate goal is to compute a 1-dimensional array result such that result[i] == the mode of arr1[i], arr2[i], arr3[i], ..., arr1000[i].
However, it is not tractable to concatenate the arrays to one big matrix and then compute the mode row-wise, since this may exceed the RAM on my machine.
An alternative would be to set up an array res2 of shape (10000, 10), then loop through every array, using each entry e as an index to increase the value of res2[i][e] by 1. After looping, I would apply something like argmax. However, this is too slow.
So: is there a way to perform the task quickly, maybe by using NumPy's advanced indexing?
EDIT (due to the comments):
This is basically the code which calculates the modes row-wise, avoiding concatenating the arrays:
def foo(length, n):
    counts = np.zeros((length, n), dtype=np.int_)
    for arr in array_iterator():
        i = 0
        for e in arr:
            counts[i][e] += 1
            i += 1
    return np.argmax(counts, axis=1)
It already takes 60 seconds for 100 arrays of size 10000 (although there is more work done behind the scenes, which accounts for that time; however, this work scales linearly with the number of arrays).
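As a side note on the advanced-indexing idea mentioned above, the inner Python loop can be expressed as a single NumPy update per array (a sketch, assuming all arrays have the same length and values in range(n)):
import numpy as np

def mode_rowwise(arrays, length, n):
    counts = np.zeros((length, n), dtype=np.int_)
    rows = np.arange(length)
    for arr in arrays:
        # One advanced-indexing update replaces the Python-level "for e in arr" loop.
        # Each row index appears exactly once per array, so plain fancy indexing is safe;
        # np.add.at(counts, (rows, arr), 1) is the general alternative for repeated indices.
        counts[rows, arr] += 1
    return np.argmax(counts, axis=1)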
Regarding the real sizes:
The number of different arrays is really arbitrary. It's a parameter of the experiments, and I'd like to have the option of setting it to values like 10^6. The length of each array depends on the data set I'm working with. This could be 10000, or 100000, or even worse. However, splitting this into smaller pieces may be possible, though annoying.
My free RAM for this task is about 4 GB.
EDIT 2:
The running time I gave above leads to a wrong impression. Actually, the running time of just the inner loop (for e in arr) in the above scenario is only 5 seconds, which is now acceptable for me, since it's negligible compared to the remaining running time. I will leave this question open for a moment anyway, since there might be an even faster method waiting out there.

Simple to implement struct for memory efficient list of tuples

I need to create a list of the following type
[(latitude, longitude, date), ...]
where latitude and longitude are floats, and date is an integer. I'm running out of memory on my local machine because I need to store about 60 million of these tuples. What is the most memory efficient (and at the same time simple to implement) way of representing these tuples in python?
The precision of the latitude and longitude does not need to be so great (just enough to represent values such as -65.100234) and the integers need to be big enough to handle UNIX timestamps.
I have used SWIG before to define C structs, which are in general much more memory efficient than their Python equivalents, but this is complicated to implement... maybe there is some scipy or numpy way to declare such tuples that uses less memory... any ideas?
If you are fine with using NumPy, you could use a numpy.recarray. If you want 8 significant digits for your coordinates, single precision floats are probably just not enough, so your records would have two double precision floats and a 32-bit integer, which is twenty bytes in total, so 60 million records would need 1.2 GB of memory. Note that NumPy arrays have fixed size and need to be reallocated if the size changes.
Code example:
# Create an uninitialised array with 100 records
a = numpy.recarray(100,
                   formats=["f8", "f8", "i4"],
                   names=["latitude", "longitude", "date"])
# initialise to 0
a[:] = (0.0, 0.0, 0)
# assign a single record
a[0] = (-65.100234, -38.32432, 1309351408)
# access the date of the first record
a[0].date
# access the whole date column
a.date
If you want to avoid a dependency on NumPy, you could also use ctypes arrays of ctypes structures, which are less convenient than NumPy arrays, but more convenient than using SWIG.
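A hedged sketch of the ctypes alternative: one C struct per record stored in a single contiguous array (about 24 bytes per record after alignment, so roughly 1.4 GB for 60 million records):
import ctypes

class Record(ctypes.Structure):
    _fields_ = [("latitude", ctypes.c_double),
                ("longitude", ctypes.c_double),
                ("date", ctypes.c_int32)]

# One contiguous block of records, allocated up front (use 60_000_000 for the real data).
n = 1_000_000
records = (Record * n)()

records[0] = Record(-65.100234, -38.32432, 1309351408)
print(records[0].latitude, records[0].date)
print(ctypes.sizeof(Record))  # 24 bytes per record on typical 64-bit platforms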
