Python - read 2d array from binary data

I'm trying to read a 2D array of floats from a binary file with Python. The files have been written in big-endian byte order by a Fortran program (it is the intermediate file of the Weather Research and Forecasting model). I already know the dimension sizes of the array to read (nx and ny), but as a Fortran and IDL programmer I am completely lost as to how to manage it in Python. (Later on I want to visualize the array.)
Shall I use struct.unpack or numpy.fromfile or the array module?
Do I have to read a vector first and reshape it afterwards? (I have seen this option only for the numpy approach.)
How do I define a 2d array with numpy and how do I define the dtype to read with big-endian byte ordering?
Is there an issue with array ordering (column- or row-major) to take into account?

Short answers per sub-question:
I don't think the array module has a way to specify endianness.
Between the struct module and Numpy, I think Numpy is easier to use, especially for Fortran-like ordered arrays.
All data is inherently 1-dimensional as far as the hardware (disk, RAM, etc.) is concerned, so yes, reshaping to get a 2D representation is always necessary. With numpy.fromfile the reshape must happen explicitly afterwards, but numpy.memmap provides a way to reshape more implicitly (see the memmap sketch after the code below).
The easiest way to specify endianness with Numpy is to use a short type string, actually very similar to the approach needed for the struct module. In Numpy, >f and >f4 specify single-precision and >d and >f8 double-precision big-endian floating point.
Your binary file could walk the array along the rows (C-like) or along the columns (Fortran-like). Whichever of the two it is, this has to be taken into account to represent the data properly. Numpy makes this easy with the order keyword argument for reshape and memmap (among others).
All in all, the code could look like this, for example:
import numpy as np
filename = 'somethingsomething'
with open(filename, 'rb') as f:
    nx, ny = ...  # parse the header; advance the file pointer to the data segment
    data = np.fromfile(f, dtype='>f8', count=nx*ny)
    array = np.reshape(data, [nx, ny], order='F')
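If the byte offset of the data segment within the file is known, numpy.memmap can do the read and the Fortran-order reshape in one step. A minimal sketch, assuming offset was determined while parsing the header:
import numpy as np
# offset: byte position of the data segment, assumed known from the header
array = np.memmap(filename, dtype='>f8', mode='r', offset=offset,
                  shape=(nx, ny), order='F')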

Related

When saving a numpy array of float arrays to a .npy file using numpy.save/numpy.load, is there any reason why the order of the arrays would change?

I currently have data where each row has a text passage and a numpy float array.
As far as I know, it's not efficient to save these two datatypes into one data format (correct me if I am wrong). So I am going to save them separately, with another column of ints that will be used to map the two datasets together when I want to join them again.
I am having trouble figuring out how to append a column of ints next to the float arrays (if anyone has a solution to that I would love to hear it) and then save the numpy array.
But then I realized I can just save the float arrays as is with numpy.save without the extra int column if I can get a confirmation that numpy.save and numpy.load will never change the order of the arrays.
That way I can just append the loaded numpy float arrays to the pandas dataframe as is.
Logically, I don't see any reason why the order of the rows would change, but perhaps there's some optimization or compression step that I am unaware of.
Would numpy.save or numpy.load ever change the order of a numpy array of float arrays?
The order will not be changed by numpy save/load. You are saving the numpy object as-is, and an array is an ordered object.
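A quick round-trip check (a sketch with made-up data):
>>> import numpy as np
>>> a = np.random.rand(5, 3)
>>> np.save('out.npy', a)
>>> np.array_equal(a, np.load('out.npy'))  # same values, same row order
True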
Note: if you want to save multiple data arrays to the same file, you can use np.savez.
>>> np.savez('out.npz', f=array_of_floats, s=array_of_strings)
You can retrieve back each with the following:
>>> data = np.load('out.npz')
>>> array_of_floats = data['f']
>>> array_of_strings = data['s']

Does the np.nan in numpy array occupy memory?

I have a huge csv file which cannot be loaded into memory. Transforming it to libsvm format may save some memory.
There are many NaNs in the csv file. If I read the lines and store them as an np.array, with np.nan as NULL, will the array still occupy too much memory?
Does the np.nan in array also occupy memory ?
When working with floating point representations of numbers, non-numeric values (NaN and inf) are also represented by a specific binary pattern occupying the same number of bits as any numeric floating point value. Therefore, NaNs occupy the same amount of memory as any other number in the array.
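To see this concretely, a small sketch: a NaN is just one specific IEEE 754 bit pattern, the same width as any other float64.
import struct
import numpy as np
print(struct.pack('>d', np.nan).hex())  # '7ff8000000000000' on typical platforms
print(np.float64(np.nan).nbytes)        # 8 bytes, like any other float64 value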
As far as I know, yes: nan and zero values occupy the same memory as any other value. However, you can address your problem in other ways:
Have you tried using a sparse vector? They are intended for vectors with a lot of 0 values, and memory consumption is optimized.
SVM Module Scipy
Sparse matrices Scipy
There you have some info about SVM and sparse matrices; if you have further questions, just ask.
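For a sense of the difference, a small sketch comparing a dense array with its csr equivalent (the sizes printed are illustrative):
import numpy as np
from scipy import sparse

dense = np.zeros((1000, 1000))
dense[0, 0] = 1.0
sp = sparse.csr_matrix(dense)
print(dense.nbytes)  # 8000000 bytes: every zero is stored as a full float64
print(sp.data.nbytes + sp.indices.nbytes + sp.indptr.nbytes)  # a few kilobytes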
Edited to provide an answer as well as a solution
According to the getsizeof() function from the sys module, it does. A simple and fast example:
import sys
import numpy as np

x = np.array([1, 2, 3])        # no NaN
y = np.array([1, np.nan, 3])   # contains a NaN (this makes the dtype float64)

x_size = sys.getsizeof(x)
y_size = sys.getsizeof(y)
print(x_size)
print(y_size)
print(y_size == x_size)
This should print out
120
120
True
So my conclusion was that a NaN uses as much memory as a normal entry.
Instead you could use sparse matrices (scipy.sparse), which do not store zero/Null values at all and are therefore more memory efficient. But SciPy strongly discourages using NumPy methods on them directly (https://docs.scipy.org/doc/scipy/reference/sparse.html), since NumPy might not interpret them correctly.

Save a csr_matrix and a numpy array in one file

I need to save a large sparse csr_matrix and a numpy array to be able to read them back later. Let X be the sparse csr_matrix and Y be the numpy array.
Currently I take the following slightly insane route.
from scipy.sparse import csr_matrix
import numpy as np
def save_sparse_csr(filename, array):
    np.savez(filename, data=array.data, indices=array.indices,
             indptr=array.indptr, shape=array.shape)

def load_sparse_csr(filename):
    loader = np.load(filename)
    return csr_matrix((loader['data'], loader['indices'], loader['indptr']),
                      shape=loader['shape'])
save_sparse_csr("file1", X)
np.save("file2", Y)
Then when I want to read them in it is:
X = load_sparse_csr("file1.npz")
Y = np.load("file2.npy")
Two questions:
Is there a better way to save a csr_matrix than this?
Can I save both X and Y to the same file somehow? It seems crazy to have to make two files for this.
So you are saving the 3 array attributes of the csr along with its shape. And that is sufficient to recreate the array, right?
What's wrong with that? Even if you find a function that saves the csr for you, I bet it is doing the same thing - saving those same arrays.
The normal way in Python to save a class is to pickle it. But the class has to create the appropriate pickle method. numpy does that (essentially its save function). But as far as I know scipy.sparse has not provided that.
Since scipy.sparse has its roots in the MATLAB sparse code (and in C/Fortran code developed for linear algebra problems), it can load/save using the loadmat/savemat functions. I'd have to double-check, but I think they work with csc, the default MATLAB sparse format.
There are one or two other io modules that handle sparse matrices, but I haven't worked with those. They are formats for sharing sparse arrays among different packages working on the same problems (for example PDEs or finite elements). More than likely those formats use a coo-compatible layout (data, rows, cols), either as 3 arrays, a csv of 3 columns, or a 2D array.
Mentioning coo format raises another possibility: make a structured array with data, row, col fields, and use np.save or even np.savetxt, as in the sketch below. I don't think it's any faster or cleaner than saving the csr directly, but it does put all the data in one array (though shape might still need a separate entry).
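A sketch of that structured-array idea (the field names are illustrative, and X is the csr_matrix from the question):
import numpy as np
coo = X.tocoo()
rec = np.zeros(coo.nnz, dtype=[('data', coo.data.dtype),
                               ('row', np.intp), ('col', np.intp)])
rec['data'], rec['row'], rec['col'] = coo.data, coo.row, coo.col
np.save('coo_file.npy', rec)  # shape still needs to be stored separately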
You might also be able to pickle the dok format, since it is a dict subclass.
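As for saving both X and Y to the same file: np.savez accepts any number of keyword arrays, so the save_sparse_csr helper from the question can simply carry Y as one more field. A sketch (save_both/load_both are hypothetical names):
import numpy as np
from scipy.sparse import csr_matrix

def save_both(filename, X, Y):
    # the csr components and the dense array share one .npz archive
    np.savez(filename, data=X.data, indices=X.indices,
             indptr=X.indptr, shape=X.shape, Y=Y)

def load_both(filename):
    loader = np.load(filename)
    X = csr_matrix((loader['data'], loader['indices'], loader['indptr']),
                   shape=loader['shape'])
    return X, loader['Y']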

Exporting a 2D array with 0-values into a txt/csv file

I have a 100x100 array which I would like to export as either a txt or csv file. The elements of the array are all 0 and a few other integer numbers. When using the following code, the integer numbers are exported properly, but the zeros are replaced by random numbers with giganormous exponents (1.98E-258). Does anyone know a way to avoid this behavior?
The code that I am using is the following:
import numpy as np
my_array=np.ndarray(shape=(100,100))
my_array[[],[]]=0 #WRONG
np.savetxt("my_file.csv", my_array, delimiter=",")
That's actually a really small number ... But what you need to do is tell numpy that the array will be filled with integers, not floats:
# or np.int32, np.int64, np.uint8, ... depending on the desired range
my_array = np.zeros((100, 100), dtype=int)
While we're at it, I used np.zeros to give you an array initialized to zero since that seems to be what you want anyway. Generally speaking, np.ndarray is used for subclassing a numpy array -- It's not very idiomatic to call the constructor yourself.
The problem is with the line
my_array[[],[]]=0
replace it with
my_array[:,:]=0
The issue is that you're never really initializing the array, so everything is just random, including the exponents. The above correction sets everything to zero.
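Putting both answers together, a sketch of the corrected script (the sample assignment is illustrative):
import numpy as np

my_array = np.zeros((100, 100), dtype=int)  # initialized to zero, integer dtype
my_array[10, 20] = 5                        # example: set a few integer entries
np.savetxt("my_file.csv", my_array, fmt="%d", delimiter=",")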

Is there a way to view how much memory a SciPy matrix used?

I know in python it's hard to see the memory usage of an object.
Is it easier to do this for SciPy objects (for example, sparse matrix)?
you can use array.itemsize (the size of the contained type in bytes) and array.size (the number of elements) to obtain it:
# a is your array
bytes = a.itemsize * a.size
It's not the exact value, as it ignores the whole array infrastructure, but for a big array it's the value that matters (and I guess you care because you have something big).
If you want to use it on a sparse matrix you have to modify it, as sparse matrices don't have the itemsize attribute; you have to access the dtype and get the itemsize from it:
bytes = a.dtype.itemsize * a.size
In general I don't think it's easy to evaluate the real memory occupied by a python object... the numpy array is an exception, being just a thin layer over a C array.
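For a csr_matrix specifically, a fuller sketch is to sum the nbytes of its three underlying arrays (this still ignores the Python object overhead):
import scipy.sparse as sparse
a = sparse.random(1000, 1000, density=0.01, format='csr')
print(a.data.nbytes + a.indices.nbytes + a.indptr.nbytes)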
If you are inside IPython, you can also use its %whos magic function, which gives you information about the session's variables, including how much RAM each takes.
