How to use numpy.savez inside a for loop - python

I want to use numpy.savez inside a loop so that I can save each array directly to a file and spare RAM, since I am dealing with huge data sets. Each iteration performs some work on the arrays; my idea is to stack the resulting arrays from each iteration onto the previous arrays inside the npz file.
import numpy as np

a = [2, 3, 4]
b = [5, 6, 7]
steps = 10
for n in range(steps):
    a = a * 3
    b = b * 4
    np.savez('filename.npz', a=a, b=b)
But when I load the file, only the arrays from the last iteration are available. I know the file is being overwritten on each iteration, but is there a way to stack all the arrays inside the file?

You could do the following:
import numpy as np
a = np.array([2,3,4])
b = np.array([5,6,7])
steps = 10
Then, to save a different file on each iteration, build the filename with str.format inside the loop:
for i in range(steps):
    a = a * 3
    b = b * 3
    np.savez('filename{:03}.npz'.format(i), a=a, b=b)
To load:
data = np.load('filename009.npz')
print(data['a']) # [118098 177147 236196]
print(data['b']) # [295245 354294 413343]
data2 = np.load('filename004.npz')
print(data2['b']) # [1215 1458 1701]
print(data2['a']) # [486 729 972]
You could even use something as crazy as timestamps!
from datetime import datetime

for _ in range(steps):
    a = a * 3
    b = b * 3
    np.savez('filename-{}.npz'.format(datetime.now().isoformat(sep='_', timespec='auto')), a=a, b=b)
The result is:
filename-2020-01-24_00:02:42.013775.npz filename-2020-01-24_00:02:42.017066.npz
filename-2020-01-24_00:02:42.014601.npz filename-2020-01-24_00:02:42.017710.npz
filename-2020-01-24_00:02:42.015249.npz filename-2020-01-24_00:02:42.018611.npz
filename-2020-01-24_00:02:42.015946.npz filename-2020-01-24_00:02:42.019326.npz
filename-2020-01-24_00:02:42.016516.npz filename-2020-01-24_00:02:42.019864.npz
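If you eventually need everything stacked together again, a minimal sketch (assuming the numbered-filename scheme above, so the glob pattern here is an assumption) is to glob the saved files after the loop and stack the loaded pieces:

import glob
import numpy as np

# Collect the per-iteration files written as filename000.npz, filename001.npz, ...
a_parts, b_parts = [], []
for path in sorted(glob.glob('filename*.npz')):
    with np.load(path) as data:
        a_parts.append(data['a'])
        b_parts.append(data['b'])

# Stack into one 2D array per key: one row per iteration.
a_all = np.vstack(a_parts)
b_all = np.vstack(b_parts)
print(a_all.shape, b_all.shape)  # (steps, 3) for each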

Related

Python shared memory, how can I put the random integer into the shared memory block?

I created a shared memory block with a size of 10 bytes and wanted to generate random numbers and put them into the memory block, but it always gives me error messages, so I wonder if I am doing something wrong.
from multiprocessing import shared_memory
import random

shared_mem_1 = shared_memory.SharedMemory(create=True, size=10)
num = random.sample(range(1, 1000), 10)
for i, c in enumerate(num):
    shared_mem_1.buf[i] = c
The error message:
Traceback (most recent call last):
  File "main.py", line 7, in <module>
    shared_mem_1.buf[i] = c
ValueError: memoryview: invalid value for format 'B'
The problem is that num contains values over 255, and when such a value is assigned to buf, the invalid value for format 'B' error appears. Format B is exactly the format for bytes (check the table of formats here).
There are 2 options:
Change the range of the random numbers to be between 0 and 255; or,
Convert to bytes with the int.to_bytes function.
Option 1
from multiprocessing import shared_memory
import random

shared_mem_1 = shared_memory.SharedMemory(create=True, size=10)
num = random.sample(range(0, 255), 10)
for i, c in enumerate(num):
    shared_mem_1.buf[i] = c
shared_mem_1.unlink()
Option 2
For option 2 you need to pay attention to the byte order (big-endian/little-endian) and to how many bytes each integer takes in your case (the amount of memory to allocate also depends on this length). The assignment into the buffer has to account for the offset of the values already written.
from multiprocessing import shared_memory
import random

int_length = 4
shared_mem_1 = shared_memory.SharedMemory(create=True, size=int_length * 10)
num = random.sample(range(1, 1000), 10)
for i, c in enumerate(num):
    pos = i * int_length
    shared_mem_1.buf[pos:pos + int_length] = c.to_bytes(int_length, 'big')
shared_mem_1.unlink()
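Reading the values back is the reverse of that conversion, using int.from_bytes with the same length and byte order. A minimal round-trip sketch under the same layout (variable names are illustrative, not from the original answer):

from multiprocessing import shared_memory
import random

int_length = 4
shm = shared_memory.SharedMemory(create=True, size=int_length * 10)
num = random.sample(range(1, 1000), 10)

# Write: one 4-byte big-endian integer per slot.
for i, c in enumerate(num):
    shm.buf[i * int_length:(i + 1) * int_length] = c.to_bytes(int_length, 'big')

# Read back: reverse the conversion with int.from_bytes.
restored = [int.from_bytes(shm.buf[i * int_length:(i + 1) * int_length], 'big')
            for i in range(10)]
print(restored == num)  # True

shm.close()
shm.unlink()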
I find the most useful way to take advantage of multiprocessing.shared_memory is to create a numpy array that uses the shared memory region as its buffer. Numpy handles setting the correct data type (is it an 8-bit integer? a 32-bit float? a 64-bit float? etc.) and provides a convenient interface (similar to, but more extensible than, Python's built-in array module). That way any modification to the array is visible in every process that has the same memory region mapped to an array.
from multiprocessing import Process
from multiprocessing.shared_memory import SharedMemory
import numpy as np

def foo(shm, shape, dtype):
    arr = np.ndarray(shape, dtype, buffer=shm.buf)  # remote view of arr
    print(arr)
    arr[0] = 20  # modify some data in arr to show that the change crosses to the other process
    shm.close()  # SharedMemory is internally a file, which needs to be closed

if __name__ == "__main__":
    shm = SharedMemory(create=True, size=40)  # 40 bytes for 10 float32 values
    arr = np.ndarray([10], 'f4', shm.buf)  # local view of arr (10 floats)
    arr[:] = np.random.rand(10)  # insert some data into arr
    p = Process(target=foo, args=(shm, arr.shape, arr.dtype))
    p.start()
    p.join()  # wait for p to finish
    print(arr)  # arr reflects the changes made in foo, which ran in another process
    shm.close()  # close the file
    shm.unlink()  # delete the file (happens automatically on Windows but not on Linux)

Compiled Numba function not faster than CPython

I have a function compiled with Numba that splits an array based on an index; this returns an irregular (variable-length) list of numpy arrays. The irregular list is then padded to form a 2D array.
Problem
The compiled function 'nb_array2mat' should be much faster than the pure Python 'array2mat', but it is not.
Additionally, is this possible using numpy?
length of the array and index
1456391 95007
times:
numba: 1.3438396453857422
python: 1.1407015323638916
I think I am not using Numba compilation in the proper manner. Any help would be great.
EDIT
Using the dummy data, as edited into the code section, I now get a speedup. Why does it not work with the actual data?
length of the array and index
1456391 95007
times:
numba: 0.012002706527709961
python: 0.13403034210205078
Code
idx_split: https://drive.google.com/file/d/1hSduTs1_s3seEFAiyk_n5yk36ZBl0AXW/view?usp=sharing
dist_min_orto: https://drive.google.com/file/d/1fwarVmBa0NGbWPifBEezTzjEZSrHncSN/view?usp=sharing
import time
import numba
import numpy as np
from numba.pycc import CC

cc = CC('compile_func')
cc.verbose = True

@numba.njit(parallel=True, fastmath=True)
@cc.export('nb_array2mat', 'f8[:,:](f8[:], i4[:])')
def array2mat(arr, idx):
    # split arr by idx indexes
    out = []
    s = 0
    for n in numba.prange(len(idx)):
        e = idx[n]
        out.append(arr[s:e])
        s = e
    # create a 2d array with arr values, padding empty values with fill_value=1000000.0
    _len = [len(_i) for _i in out]
    cols = max(_len)
    rows = len(out)
    mat = np.full(shape=(rows, cols), fill_value=1000000.0)
    for row in numba.prange(rows):
        len_col = len(out[row])
        mat[row, :len_col] = out[row]
    return mat

if __name__ == "__main__":
    cc.compile()
# PYTHON FUNC
def array2mat(arr, idx):
    # split arr by idx indexes
    out = []
    s = 0
    for n in range(len(idx)):
        e = idx[n]
        out.append(arr[s:e])
        s = e
    # create a 2d array with arr values, padding empty values with fill_value=1000000.0
    _len = [len(_i) for _i in out]
    cols = max(_len)
    rows = len(out)
    mat = np.full(shape=(rows, cols), fill_value=1000000.0)
    for row in range(rows):
        len_col = len(out[row])
        mat[row, :len_col] = out[row]
    return mat
import compile_func

# ACTUAL DATA
arr = np.load('dist_min_orto.npy').astype(float)
idx = np.load('idx_split.npy').astype(int)

# DUMMY DATA
arr = np.random.randint(50, size=1456391).astype(float)
idx = np.cumsum(np.random.randint(5, size=95007).astype(int))
print(len(arr), len(idx))

# NUMBA FUNC
t0 = time.time()
print(compile_func.nb_array2mat(arr, idx))
print(time.time() - t0)

# PYTHON FUNC
t0 = time.time()
print(array2mat(arr, idx))
print(time.time() - t0)
You cannot use nb.prange on the first loop, since out is shared between threads and is both read and written by them. This causes a race condition. Numba assumes that there are no dependencies between iterations, and it is your responsibility to guarantee this. The simplest solution is not to use a parallel loop here.
Additionally, the second loop is mainly memory-bound, so I do not expect a big speedup from using multiple threads, since the RAM is a shared resource with limited throughput (a few threads are often enough to saturate it, especially on a PC, where sometimes one thread is enough).
Hopefully, you do not need to create the out temporary list at all, just the end offsets, so that len_col can be computed in the parallel loop. The maximum cols can be computed on the fly in the first loop. The first loop should execute very quickly compared to the second loop. Filling a newly allocated big matrix is often faster in parallel on Linux, since page faults can be handled in parallel. AFAIK, on Windows this is less true (certainly because page faults scale more poorly). Parallelism is also better here since the range 0:len_col is variable, so the time to fill that part of the matrix is variable, causing some threads to finish after others (the slowest thread bounds the execution). Furthermore, this is generally much faster on NUMA machines since each NUMA node can write to its own memory.
Note that AOT compilation does not support automatic parallel execution. To quote a Numba developer:
From discussion in today's triage meeting, related to #7696: this is not likely to be supported as AOT code doesn't require Numba to be installed - this would mean a great deal of work and issues to overcome for packaging the code for the threading layers.
The same thing applies to fastmath, although it is likely to be added in the next upcoming release given the current work.
Note that JIT compilation and AOT compilation are two separate processes. The parameters of njit are not shared with cc.export, and the signature is not shared with njit. This means the njit-decorated function would be compiled during its first execution due to lazy compilation. That being said, the function is redefined below, so the njit decorator is simply useless here (overwritten).
Here is the resulting code (using only the JIT implementation with an eager compilation instead of the AOT one):
import time
import numba
import numpy as np

@numba.njit('f8[:,:](f8[:], i4[:])', fastmath=True)
def nb_array2mat(arr, idx):
    # split arr by idx indexes
    s = 0
    ends = np.empty(len(idx), dtype=np.int_)
    cols = 0
    for n in range(len(idx)):
        e = idx[n]
        ends[n] = e
        len_col = e - s
        cols = max(cols, len_col)
        s = e
    # create a 2d array with arr values, padding empty values with fill_value=1000000.0
    rows = len(idx)
    mat = np.empty(shape=(rows, cols))
    for row in numba.prange(rows):
        s = ends[row - 1] if row >= 1 else 0
        e = ends[row]
        len_col = e - s
        mat[row, 0:len_col] = arr[s:e]
        mat[row, len_col:cols] = 1000000.0
    return mat
# PYTHON FUNC
def array2mat(arr, idx):
    # split arr by idx indexes
    out = []
    s = 0
    for n in range(len(idx)):
        e = idx[n]
        out.append(arr[s:e])
        s = e
    # create a 2d array with arr values, padding empty values with fill_value=1000000.0
    _len = [len(_i) for _i in out]
    cols = max(_len)
    rows = len(out)
    mat = np.full(shape=(rows, cols), fill_value=1000000.0)
    for row in range(rows):
        len_col = len(out[row])
        mat[row, :len_col] = out[row]
    return mat

# ACTUAL DATA
arr = np.load('dist_min_orto.npy').astype(np.float64)
idx = np.load('idx_split.npy').astype(np.int32)

# NUMBA FUNC
t0 = time.time()
print(nb_array2mat(arr, idx))
print(time.time() - t0)

# PYTHON FUNC
t0 = time.time()
print(array2mat(arr, idx))
print(time.time() - t0)
On my machine, the new Numba code is slightly faster: it takes 0.358 seconds for the Numba implementation and 0.418 seconds for the Python implementation. In fact, a sequential Numba version is even slightly faster on my machine, as it takes 0.344 seconds.
Note that the shape of the output matrix is (95007, 5469). Thus, the matrix takes 3.87 GiB in memory. You should check that you have enough memory to store it. In fact, the Python implementation takes about 7.5 GiB on my machine (possibly because the GC/default allocator does not release the memory immediately). If you do not have enough memory, then the system can use the very slow swap memory (which uses your storage device). Moreover, x86-64 processors use a write-allocate cache policy, causing written cache lines to actually be read first by default. Non-temporal writes can be used to avoid this on a big matrix. Unfortunately, neither Numpy nor Numba uses them on my machine. This means half the RAM throughput is wasted. Not to mention that page faults are pretty expensive: in sequential mode, 60% of the time of the Numpy implementation is spent in page faults. The Numba code spends almost all its time writing to memory and performing page faults. Here is a related open issue.
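As a quick sanity check of that size figure (a back-of-the-envelope calculation, not from the original post):

rows, cols = 95007, 5469        # output shape reported above
bytes_per_float64 = 8
size_gib = rows * cols * bytes_per_float64 / 2**30
print(round(size_gib, 2))       # ~3.87 GiB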
Based on @Jérôme Richard's answer, I wrote the same function. The improvement is in the way the mat numpy array is created: as the previous answer stated, allocating an array of this size with np.full takes a lot longer, so the solution was to initialize it with np.empty.
The improvement between Python and Numba is not large, but the way the mat array is allocated has a big impact on processing time.
1456391 95007
python: 0.29506611824035645
numba: 0.1800403594970703
Code
@cc.export('nb_array2mat', 'f8[:,:](f8[:], i4[:])')
def nb_array2mat(arr, idx):
    s = 0
    _len = np.empty(len(idx), dtype=np.int_)
    _len[0] = idx[0]
    _len[1:] = idx[1:] - idx[:-1]
    # create a 2d array
    cols = int(np.max(_len))
    rows = len(idx)
    mat = np.empty(shape=(rows, cols), dtype=np.float_)
    for row in range(len(idx)):
        e = idx[row]
        len_col = _len[row]
        mat[row, :len_col] = arr[s:e]
        s = e
    return mat

IndexError: tuple index out of range - Sending the calculation of a mean to 4 processes in parallel

I am trying to explore the topic of concurrency in Python. I saw a couple of posts about how to optimize processing by splitting the input data, processing the chunks separately, and then joining the results. My task is to calculate the mean along the Z axis of a stack of rasters. I read the list of rasters from a text file and then create a stacked numpy array with the data.
Then I wrote a simple function that takes the stacked array as input and calculates the mean. This task takes me some minutes to complete, and I would like to process the numpy array in chunks to speed up the script. However, when I do so using numpy.split (maybe not a good idea for splitting my 3D array), I get the following error:
Traceback (most recent call last):
  File "C:\Users\~\AppData\Local\conda\conda\envs\geoprocessing\lib\site-packages\numpy\lib\shape_base.py", line 553, in split
    len(indices_or_sections)
TypeError: object of type 'int' has no len()

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "tf_calculation_numpy.py", line 69, in <module>
    main()
  File "tf_calculation_numpy.py", line 60, in main
    subarrays = np.split(final_array, 4)
  File "C:\Users\~\AppData\Local\conda\conda\envs\geoprocessing\lib\site-packages\numpy\lib\shape_base.py", line 559, in split
    raise ValueError('array split does not result in an equal division')
ValueError: array split does not result in an equal division
Code is:
import rasterio
import os
import numpy as np
import time
import concurrent.futures

def mean_py(array):
    print("Calculating mean of array")
    start_time = time.time()
    x = array.shape[1]
    y = array.shape[2]
    values = np.empty((x, y), type(array[0][0][0]))
    for i in range(x):
        for j in range(y):
            # no more need for append operations
            values[i][j] = np.mean(array[:, i, j])
    end_time = time.time()
    hours, rem = divmod(end_time - start_time, 3600)
    minutes, seconds = divmod(rem, 60)
    print("{:0>2}:{:0>2}:{:05.2f}".format(int(hours), int(minutes), seconds))
    print(f"{'.'*80}")
    return values

def TF_mean(ndarray):
    sdir = r'G:\Mosaics\VH'
    final_array = np.asarray(ndarray)
    final_array = mean_py(final_array)
    out_name = sdir + "/" + "MEAN_VH.tif"
    print(final_array.shape)
    with rasterio.open(out_name, "w", **profile) as dst:
        dst.write(final_array.astype('float32'), 1)
    print(out_name)
    print(f"\nDone!\n{'.'*80}")

def main():
    sdir = r'G:\Mosaics\VH'
    a = np.random.randn(250_000)
    b = np.random.randn(250_000)
    c = np.random.randn(250_000)
    e = np.random.randn(250_000)
    f = np.random.randn(250_000)
    g = np.random.randn(250_000)
    h = np.random.randn(250_000)
    arrays = [a, b, c, e, f, g, h]
    final_array = []
    for array in arrays:
        final_array.append(array)
        print(f"{array} added")
    print("Splitting nd-array!")
    final_array = np.asarray(final_array)
    subarrays = np.split(final_array, 4)
    with concurrent.futures.ProcessPoolExecutor() as executor:
        for subarray, mean in zip(subarrays, executor.map(TF_mean, subarrays)):
            print(f'Processing {subarray}')

if __name__ == '__main__':
    main()
I just expect to have four processes running in parallel and a way to obtain the four subarrays and write them as a single GeoTIFF file.
The second exception is the important one here, in terms of describing the error: "array split does not result in an equal division"
final_array is a 2D array, with shape 7 by 250,000. numpy.split operates along an axis, defaulting to axis 0, so you just asked it to split a length seven axis into four equal parts. Obviously, this isn't possible, so it gives up.
To fix, you can:
Split more; you could just split into seven parts and process each separately. The executor is perfectly happy to do seven tasks, no matter how many workers you have; seven won't split evenly across four workers, so at the tail end of processing you'll likely have some workers idle while the rest finish up, but that's not the end of the world.
Split at a more fine-grained level. You could just flatten the array, e.g. final_array = final_array.reshape(final_array.size), which would make it a flat 1,750,000-element array that can be split into four parts.
Split unevenly; instead of subarrays = np.split(final_array, 4), which requires axis 0 to be evenly splittable, use subarrays = np.split(final_array, (2, 4, 6)), which splits into three groups of two rows plus one group with a single row.
There are many other options depending on your use case (e.g. split on axis=1 instead of the default axis=0), but those three are the least invasive (#1 and #3 shouldn't change behavior meaningfully; #2 might, depending on whether the separation between 250K-element blocks is meaningful). A short sketch of options 2 and 3 follows below.
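A minimal sketch of options 2 and 3 on a dummy array of the same shape (7 by 250,000), just to show what the resulting chunks look like:

import numpy as np

final_array = np.random.randn(7, 250_000)

# Option 2: flatten first, then split into four equal pieces of 437,500 elements each.
flat = final_array.reshape(final_array.size)
equal_chunks = np.split(flat, 4)
print([c.shape for c in equal_chunks])   # [(437500,), (437500,), (437500,), (437500,)]

# Option 3: uneven split along axis 0: three groups of two rows, one group of one row.
uneven_chunks = np.split(final_array, (2, 4, 6))
print([c.shape for c in uneven_chunks])  # [(2, 250000), (2, 250000), (2, 250000), (1, 250000)]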

Writing Python arrays into txt file, one array per line

I would like to write a number of Python arrays into a txt file, one array per line, and afterwards read the arrays back line by line.
My work-in-progress code is below. The problem I am working on involves about 100,000 arrays (the length of L).
from __future__ import division
from array import array

M = array('I', [1, 2, 3])
N = array('I', [10, 20, 30])
L = [M, N]

with open('manyArrays.txt', 'w') as file:
    for a in L:
        sA = a.tostring()
        file.write(sA + '\n')

with open('manyArrays.txt', 'r') as file:
    for line in file:
        lineRead = array('I', [])
        lineRead.fromstring(line)
        print lineRead
The error message I get is
lineRead.fromstring(line)
ValueError: string length not a multiple of item size
You can either use numpy functions for this or code the lines yourself.
You could concatenate your arrays into one 2D array and save it directly with np.savetxt, then load it with np.genfromtxt:
import numpy as np

M = np.array([1, 2, 3], dtype='I')
N = np.array([10, 20, 30], dtype='I')
data = np.array([M, N])

file = 'test.txt'
np.savetxt(file, data)
M2, N2 = np.genfromtxt(file)
Or do :
file2 = 'test2.txt'
form = "%i %i %i \n"
with open(file2, 'w') as f:
    for i in range(len(data)):
        f.write(form % (data[i, 0], data[i, 1], data[i, 2]))
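To read that manually written file back, one array per line, a small sketch (not part of the original answer) is to parse each line yourself:

# Read test2.txt back: one whitespace-separated integer array per line.
arrays = []
with open(file2) as f:
    for line in f:
        arrays.append([int(v) for v in line.split()])
print(arrays)  # [[1, 2, 3], [10, 20, 30]]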

return value to vba from xlwings

Let's use the example from the xlwings documentation.
Given the following python code:
import numpy as np
from xlwings import Workbook, Range

def rand_numbers():
    """Produces std. normally distributed random numbers with shape (n, n)."""
    wb = Workbook.caller()  # Creates a reference to the calling Excel file
    n = int(Range('Sheet1', 'B1').value)  # Write desired dimensions into cell B1
    rand_num = np.random.randn(n, n)
    Range('Sheet1', 'C3').value = rand_num
This is the original example.
Let's say we modify it slightly to be:
import numpy as np
from xlwings import Workbook, Range

def rand_numbers():
    """Produces std. normally distributed random numbers with shape (n, n)."""
    wb = Workbook.caller()  # Creates a reference to the calling Excel file
    n = int(Range('Sheet1', 'B1').value)  # Write desired dimensions into cell B1
    rand_num = np.random.randn(n, n)
    return rand_num  # modified line
And we call it from VBA using the following call:
Sub MyMacro()
    Dim z 'new line
    z = RunPython("import mymodule; mymodule.rand_numbers()")
End Sub
We get z as an empty value.
Is there any way to return a value to VBA directly, without writing to a text file or putting the value into the Excel document first?
Thank you for any pointers.
RunPython does not allow you to return values, as per the xlwings documentation.
To overcome this issue, use UDFs; see VBA: User Defined Functions (UDFs). However, this is currently limited to Windows only.
https://docs.xlwings.org/en/stable/udfs.html#udfs
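For illustration, a minimal sketch of what a UDF variant might look like (the function name and the expand='table' return option are assumptions here; see the linked docs for the exact setup):

import numpy as np
import xlwings as xw

@xw.func
@xw.ret(expand='table')  # spill the 2D result into neighbouring cells
def rand_numbers(n):
    """Return an (n, n) array of standard normal random numbers to Excel."""
    n = int(n)
    return np.random.randn(n, n).tolist()

After importing the module through the xlwings add-in, this can be called from a worksheet cell as =rand_numbers(B1), so the result comes back as a cell value instead of being pushed by RunPython.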
