How to manage a number of huge arrays pythonically - python

I use Python for scientific computation. I have several numpy arrays, each of which is huge, so I cannot keep all of them in memory at the same time. I want to save the arrays to disk and read them back one at a time to do some calculation. How can I do this pythonically?
I know that if all the data fit in memory, I can create a list named array_list like this:
array_list = []
for i0 in range(n_array):
    t_ayyay = do_some_calculate()
    array_list.append(t_ayyay)
and when I want to use them:
for i0 in range(n_array):
    t_ayyay = array_list[i0]
    # do something.
How can I save array_list to disk so that I can read each array by its index without loading all of them into memory?
Thanks.

Pickle is your friend for serialization.
import pickle

some_list = [....]
# pickle files must be opened in binary mode
pickle_out = open("some_list.pickle", "wb")
pickle.dump(some_list, pickle_out)
pickle_out.close()
To load your saved list back:
pickle_in = open("some_list.pickle", "rb")
some_list = pickle.load(pickle_in)
pickle_in.close()
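Since the question asks to read each array by index without loading everything, here is an alternative sketch (reusing n_array and do_some_calculate from the question, with illustrative file names): save each array to its own .npy file with numpy.save and load only the one you need.
import numpy as np

# save: one .npy file per array
for i0 in range(n_array):
    t_ayyay = do_some_calculate()
    np.save(f"array_{i0}.npy", t_ayyay)

# load: only the requested array is read from disk
for i0 in range(n_array):
    t_ayyay = np.load(f"array_{i0}.npy")
    # do something.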

Related

Saving multiple arrays in a text file (python)

I have 500 arrays named A0 to A499 (All arrays are of different sizes). I want to save these arrays in a text file. Is there any way to get my job done? Also, if possible, I would like to keep their names (A0,A1 etc) so that it is easier to recall later.
I am able to save a single array using np.savetxt, but I have no idea how to do it for these 500 arrays.
Thank you very much.
for i in range(500):
    exec("A%s=SMtoM(outputS(115,15,0.62))"%(i))
This is how I made my 500 arrays!
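As an aside, a dict keeps both the arrays and their names without exec; a minimal sketch reusing the question's own SMtoM and outputS calls:
# build the 500 arrays keyed by their names
arrays = {f"A{i}": SMtoM(outputS(115, 15, 0.62)) for i in range(500)}
# arrays["A0"], arrays["A1"], ... recover the named arrays later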
Using pickle:
import pickle

def load(filename):
    with open(filename, 'rb') as f:
        my_lists = pickle.load(f)
    return my_lists

def save(filename, my_lists):
    with open(filename, 'wb') as f:
        pickle.dump(my_lists, f)

# Where my_lists = [A0, A1, ..., A499]
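If keeping the names A0 to A499 matters, a hedged alternative is numpy's savez, which stores several arrays in one .npz file keyed by name (the file name below is illustrative, and my_lists is assumed to be the list from the answer above):
import numpy as np

# save all arrays under their original names
np.savez("all_arrays.npz", **{f"A{i}": arr for i, arr in enumerate(my_lists)})

# load lazily; each array is only read from disk when accessed by name
loaded = np.load("all_arrays.npz")
A0 = loaded["A0"]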
Try:
outpt = open("file.txt", "w")
for array in arraylist:
    # write each array as one space-separated line
    outpt.write(" ".join(str(x) for x in array))
    outpt.write("\n")
outpt.close()
Pickle works even better as mentioned above.

fast formatted file output of numpy array

I have a large numpy array and I'd like to dump it into a file using ASCII format. I would like to specify the format. This works:
import numpy
a = numpy.random.rand(5)
fmt = "{:.11e}\n"
with open("out.dat", "w") as f:
    for item in a:
        f.write(fmt.format(item))
but it is slow because I manually loop over all entries of a. Is there a way to handle this in just one write operation?
Provided RAM is not an issue, you can try formatting the whole array to a string and then writing it out in one call:
import numpy as np

a_str = np.array2string(a, formatter={'float_kind': lambda x: "%.11f" % x}, separator='\n', threshold=np.inf)[1:-1]
with open("out.dat", "w") as f:
    f.write(a_str)
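For comparison, a minimal sketch using numpy.savetxt, which accepts a format string and handles the whole array in a single call:
import numpy as np

a = np.random.rand(5)
# one call, one line per entry, formatted as requested
np.savetxt("out.dat", a, fmt="%.11e")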

Apply ufunc to xarray single Dataset variable as delayed operation using dask

I would like to apply a custom function to a variable within an xarray.Dataset modifying only the specified variable. At the same time I am trying to make this part of a dask computation graph so it can be delayed prior to reading out to disk with to_netcdf.
At the moment I can apply the ufunc using xr.apply_ufunc() but only to all variables within the Dataset.
I understand I could probably access the variable directly using its name, like Dataset.var, and pass this to apply_ufunc(), but I don't quite understand how the output of this function (a delayed future) would be recombined with the original dataset prior to output.
Ideally, I want to do something like this (where 'data.nc' has multiple variables and only var1 is squared):
import xarray as xr
from distributed import Client

dask_client = Client()

def square(x):
    return x * x

data = xr.open_dataset('data.nc', chunks={'d1': 10})
fut_sq = xr.apply_ufunc(square, data.var1, dask='parallelized', output_dtypes=['float'])
data['var1'] = fut_sq
fut_save = data.to_netcdf('new.nc', compute=False)
dask_client.compute(fut_save)
So I played around with this a bit more and decided that the best way to do this was to extract the data from the netCDF4 file, convert it to a dask.array and then rewrite a new file to disk. This involves writing custom functions using the dask.delayed functionality. Using the ufunc approach was probably inappropriate for my problem.
A few drawbacks of this:
You don't seem to be able to modify the file in place. To save the modified variables from the original NetCDF4 file you have to rewrite the whole file to disk.
For me at least, the best way to parallelise the custom square function was to create my own data chunks, pass these chunks individually to square, and then reconstitute them using dask.array.concatenate. I know dask has some bagging functionality, but I struggled to get it to work the way I wanted.
The reading of the file happens in parallel but it does not appear that dask writes to NetCDF4 in parallel.
It would be great if I can be corrected on these points.
Here is my amended example
import xarray as xr
from distributed import Client
import dask
import dask.array as da

dask_client = Client()

def bag_slices(ind, n=10):
    bag = list()
    prev = 0
    for i in range(len(ind)):
        if (i+1) % n == 0:
            bag.append(slice(prev, i+1, 1))
            prev = i+1
    if prev != i+1:
        bag.append(slice(prev, i+1, 1))
    return bag

@dask.delayed
def square(x):
    return x*x

@dask.delayed
def assign(old_xr_dataset, new_data):
    old_xr_dataset['var1'].values = new_data
    return old_xr_dataset

# for me data.var1 is 3D and I process it by splitting the data along the second dimension.
with xr.open_dataset('data.nc', chunks={'d1': 10}) as data:
    # create slice bags for distributed processing along preferred axis
    bags = bag_slices(data.coords['dim2'].values, n=10)
    # convert to dask array
    data_da = da.from_array(data.var1.values)
    # create data bags
    bags = [data_da[:, slc, :] for slc in bags]
    future_squared = []
    for data_bag in bags:
        # concatenate doesn't understand delayed objects
        # so must convert them back to delayed arrays
        future_squared.append(da.from_delayed(square(data_bag), data_bag.shape, dtype=float))
    data_new = dask.array.concatenate(future_squared, axis=1)
    fut_dataset = assign(data, data_new)
    fut_nc_save = fut_dataset.to_netcdf('data_squared.nc', compute=False)
    fut_nc_save.compute()
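For completeness, here is a minimal, untested sketch of the apply_ufunc route the question originally asked about, reusing the question's file and variable names: the lazily computed DataArray is simply assigned back into the Dataset before the deferred to_netcdf.
import xarray as xr

def square(x):
    return x * x

data = xr.open_dataset('data.nc', chunks={'d1': 10})
squared = xr.apply_ufunc(square, data['var1'], dask='parallelized', output_dtypes=[float])
data = data.assign(var1=squared)          # recombine with the original dataset
delayed_save = data.to_netcdf('new.nc', compute=False)
delayed_save.compute()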

reading v7.3 mat files in python

I am trying to read a matlab file with the following code
import scipy.io
mat = scipy.io.loadmat('test.mat')
and it gives me the following error
raise NotImplementedError('Please use HDF reader for matlab v7.3 files')
NotImplementedError: Please use HDF reader for matlab v7.3 files
Has anyone had the same problem, and could you please share some sample code?
Thanks.
I've created a small library to load MATLAB 7.3 files:
pip install mat73
To load a v7.3 .mat file into Python as a dictionary:
import mat73
data_dict = mat73.loadmat('data.mat')
simple as that!
Try using the h5py module:
import h5py

with h5py.File('test.mat', 'r') as f:
    print(list(f.keys()))
import h5py
import numpy as np

filepath = '/path/to/data.mat'
arrays = {}
f = h5py.File(filepath, 'r')
for k, v in f.items():
    arrays[k] = np.array(v)
you should end up with your data in the arrays dict, unless you have MATLAB structures, I suspect. Hope it helps!
Per Magu_'s answer on a related thread, check out the package hdf5storage which has convenience functions to read v7.3 matlab mat files; it is as simple as
import hdf5storage
mat = hdf5storage.loadmat('test.mat')
I had a look at this issue: https://github.com/h5py/h5py/issues/726. If you saved your mat file with the -v7.3 option, you can list the keys with (under Python 3.x):
import h5py

with h5py.File('test.mat', 'r') as file:
    print(list(file.keys()))
In order to access the variable a for instance, you have to use the same trick:
with h5py.File('test.mat', 'r') as file:
    a = list(file['a'])
According to the SciPy Cookbook (http://wiki.scipy.org/Cookbook/Reading_mat_files):
Beginning at release 7.3 of Matlab, mat files are actually saved using the HDF5 format by default (except if you use the -vX flag at save time, see help save in Matlab). These files can be read in Python using, for instance, the PyTables or h5py package. Reading Matlab structures in mat files does not seem supported at this point.
Perhaps you could use Octave to re-save using the -vX flag.
Despite hours of searching I've not found how to access Matlab v7.3 structures either. Hopefully this partial answer will help someone, and I'd be very happy to see extra pointers.
So, starting with the following (I think the [0][0] arises from Matlab giving everything two dimensions):
f = h5py.File('filename', 'r')
f['varname'][0][0]
gives: <HDF5 object reference>
Pass this reference to f again:
f[f['varname'][0][0]]
which gives an array; convert this to a numpy array and extract the value (or, recursively, another <HDF5 object reference>):
np.array(f[f['varname'][0][0]])[0][0]
If accessing the disk is slow, maybe loading to memory would help.
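A consolidated sketch of the reference-chasing steps above, with hypothetical file and variable names:
import h5py
import numpy as np

with h5py.File('filename.mat', 'r') as f:
    ref = f['varname'][0][0]      # an <HDF5 object reference>
    value = np.array(f[ref])      # dereference it through the file handle
    print(value[0][0])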
Further edit: after much futile searching, my final workaround (I really hope someone else has a better solution!) was calling Matlab from Python, which is pretty easy and fast:
import matlab.engine

eng = matlab.engine.start_matlab()    # first fire up a Matlab instance
eng.quit()
eng = matlab.engine.connect_matlab()  # or connect to an existing one

eng.sqrt(4.0)
x = 4.0
eng.workspace['y'] = x
a = eng.eval('sqrt(y)')
print(a)
x = eng.eval('parameterised_function_in_Matlab(1, 1)', nargout=1)
a = eng.eval('Structured_variable{1}{2}.object_name')  # (nested cell, cell, object)
This function reads Matlab-produced HDF5 .mat files, and returns a structure of nested dicts of Numpy arrays. Matlab writes matrices in Fortran order, so this also transposes matrices and higher-dimensional arrays into conventional Numpy order arr[..., page, row, col].
import h5py

def read_matlab(filename):
    def conv(path=''):
        p = path or '/'
        paths[p] = ret = {}
        for k, v in f[p].items():
            if type(v).__name__ == 'Group':
                ret[k] = conv(f'{path}/{k}')  # Nested struct
                continue
            v = v[()]  # It's a Numpy array now
            if v.dtype == 'object':
                # HDF5ObjectReferences are converted into a list of actual pointers
                ret[k] = [r and paths.get(f[r].name, f[r].name) for r in v.flat]
            else:
                # Matrices and other numeric arrays
                ret[k] = v if v.ndim < 2 else v.swapaxes(-1, -2)
        return ret

    paths = {}
    with h5py.File(filename, 'r') as f:
        return conv()
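Hypothetical usage, assuming a v7.3 file named test.mat:
data = read_matlab('test.mat')
print(data.keys())   # nested structs appear as nested dicts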
If you are only reading in basic arrays and structs, see vikrantt's answer on a similar post. However, if you are working with a Matlab table, then IMHO the best solution is to avoid the save option altogether.
I've created a simple helper function to convert a Matlab table to a standard hdf5 file, and another helper function in Python to extract the data into a Pandas DataFrame.
Matlab Helper Function
function table_to_hdf5(T, path, group)
%TABLE_TO_HDF5 Save a Matlab table in an hdf5 file format
%
% TABLE_TO_HDF5(T) Saves the table T to the HDF5 file inputname.h5 at the root ('/')
% group, where inputname is the name of the input argument for T
%
% TABLE_TO_HDF5(T, path) Saves the table T to the HDF5 file specified by path at the
% root ('/') group.
%
% TABLE_TO_HDF5(T, path, group) Saves the table T to the HDF5 file specified by path
% at the group specified by group.
%
%%%

if nargin < 2
    path = [inputname(1),'.h5']; % default file name to input argument
end

if nargin < 3
    group = ''; % We will prepend '/' later, so this is effectively root
end

for field = T.Properties.VariableNames
    % Prepare to write
    field = field{:};
    dataset_name = [group '/' field];
    data = T.(field);
    if ischar(data) || isstring(data)
        warning('String columns not supported. Skipping...')
        continue
    end
    % Write the data
    h5create(path, dataset_name, size(data))
    h5write(path, dataset_name, data)
end

end
Python Helper Function
import pandas as pd
import h5py

def h5_to_df(path, group='/'):
    """
    Load an hdf5 file into a pandas DataFrame
    """
    df = pd.DataFrame()
    with h5py.File(path, 'r') as f:
        data = f[group]
        for k, v in data.items():
            if v.shape[0] > 1:  # Multiple column field
                for i in range(v.shape[0]):
                    k_new = f'{k}_{i}'
                    df[k_new] = v[i]
            else:
                df[k] = v[0]
    return df
Important Notes
This will only work on numerical data. If you know how to add string data, please comment.
This will create the file if it does not already exist.
This will crash if the data already exists in the file. You'll want to include logic to handle those cases as you deem appropriate.
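For illustration, a hypothetical round trip, assuming the MATLAB side wrote the table to mytable.h5 at the root group:
# load the exported table back into pandas
df = h5_to_df('mytable.h5')
print(df.head())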

How to write variables as binary data with Python?

I need to write a long list of ints and floats with Python the same way fwrite would do in C - in a binary form.
This is necessary to create input files for another piece of code I am working with.
What is the best way to do this?
You can do this quite simply with the struct module.
For example, to write a list of 32-bit integers in binary:
import struct

ints = [10, 50, 100, 2500, 256]
# open in binary mode, since struct.pack produces bytes
with open('output', 'wb') as fh:
    data = struct.pack('i' * len(ints), *ints)
    fh.write(data)
This will write the bytes b'\n\x00\x00\x002\x00\x00\x00d\x00\x00\x00\xc4\t\x00\x00\x00\x01\x00\x00'
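A hedged companion sketch for reading the values back, assuming the same 'output' file and native 32-bit ints:
import struct

with open('output', 'rb') as fh:
    data = fh.read()
count = len(data) // 4                     # 4 bytes per 32-bit int
ints = list(struct.unpack('i' * count, data))
print(ints)                                # [10, 50, 100, 2500, 256]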
Have a look at numpy's tofile:
With the array method tofile you can write binary data (myarray below stands for your own array):
import numpy as num

# define the on-disk format (32-bit floats here)
numdtype = num.dtype('f4')
# write data: cast to the target dtype, then dump the raw bytes
myarray.astype(numdtype).tofile('filename')
Another way is to use memmaps: numpy memmap
# create memmap (myoffset and my_shape are your own values)
data = num.memmap('filename', mode='w+', dtype=num.float64, offset=myoffset, shape=my_shape, order='C')
# put some data into it:
data[1:10] = num.random.rand(9)
# flush to disk:
data.flush()
del data
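A hedged sketch for reading such a raw binary file back (same hypothetical 'filename' as above):
import numpy as np

# a raw binary file carries no metadata, so you must supply the dtype
# (and reshape yourself) to match whatever was written
restored = np.fromfile('filename', dtype=np.float32)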
