Python : read array in binary file - python

I am currently trying to read an fortran file with python with the following technique
with open(myfile, "rb") as f:
for i in range (0, n):
s = struct.unpack('=f', f.read(4))
mylist.append(s[0])
But it is very slow for large arrays. Is there a way to read the content of the entire loop in one time and put it to mylist in order to avoid a conversion/append of each item one by one?
Thank you very much.

This is what the array module is for:
a = array.array('f')
a.fromfile(f, n)
Now you can use the array object like a normal sequence type. You can also convert it to a list if you need to with tolist.

Related

Saving multiple Numpy arrays to a Numpy binary file (Python)

I want to save multiple large-sized numpy arrays to a numpy binary file to prevent my code from crashing, but it seems like it keeps getting overwritten when I add on an array. The last array saved is what is set to allarrays when save.npy is opened and read. Here is my code:
with open('save.npy', 'wb') as f:
for num in range(500):
array = np.random.rand(100,400)
np.save(f, array)
with open('save.npy', 'rb') as f:
allarrays = np.load(f)
If the file existed before, I want it to be overwritten if the code is rerun. That's why I chose 'wb' instead of 'ab'.
alist =[]
with open('save.npy', 'rb') as f:
alist.append(np.load(f))
When you load you have collect all loads in a list or something. load only loads one array, starting at the current file position.
You can try memory mapping to disk.
# merge arrays using memory mapped file
mm = np.memmap("mmap.bin", dtype='float32', mode='w+', shape=(500,100,400))
for num in range(500):
mm[num::] = np.random.rand(100,400)
# save final array to npy file
with open('save.npy', 'wb') as f:
np.save(f, mm[::])
I ran into this problem as well, and solved it in not a very neat way, but perhaps it's useful for others. It's inspired by hpaulj's approach, which is incomplete (i.e., doesn't load the data). Perhaps this is not how one is supposed to solve this problem to begin with...but anyhow, read on.
I had saved my data using a similar procedure as the OP,
# Saving the data in a for-loop
with open(savefilename, 'wb') as f:
for datafilename in list_of_datafiles:
# Do the processing
data_to_save = ...
np.save( savefilename, data_to_save )
And ran into the problem that calling np.load() only loaded the last saved array, none of the rest. However, I knew that the data was in principle contained in the *.npy file, given the file size was growing during the saving loop. What was required was to simply loop over the content of the numpy array while calling the load command repeatedly. As I didn't quite know how many files were contained in the file, I simply looped over the loading loop until it failed. It's hacky, but it works.
# Loading the data in a for-loop
data_to_read = []
with open(savefilename, 'r') as f:
while True:
try:
data_to_read.append( np.load(f) )
except:
print("all data has been read!")
break
Then you can call, e.g., len(data_to_read) to see how many of the arrays are contained in it. Calling, e.g., data_to_read[0] gives you the first saved array, etc.

Two implementations of Numpy fromfile?

I am trying to update some legacy code that uses np.fromfile in a method. When I try searching the numpy source for this method I only find np.core.records.fromfile, but when you search the docs you can find np.fromfile. Taking a look at these two methods you can see they have different kwargs which makes me feel like they are different methods altogether.
My questions are:
1) Where is the source for np.fromfile located?
2) Why are there two different functions under the same name? This can clearly get confusing if you aren't careful about the difference as the two behave differently. Specifically np.core.records.fromfile will raise errors if you try to read more bytes than a file contains while np.fromfile does not. You can find a minimal example below.
In [1]: import numpy as np
In [2]: my_bytes = b'\x04\x00\x00\x00\xac\x92\x01\x00\xb2\x91\x01'
In [3]: with open('test_file.itf', 'wb') as f:
f.write(my_bytes)
In [4]: with open('test_file.itf', 'rb') as f:
result = np.fromfile(f, 'int32', 5)
In [5]: result
Out [5]:
In [6]: with open('test_file.itf', 'rb') as f:
result = np.core.records.fromfile(f, 'int32', 5)
ValueError: Not enough bytes left in file for specified shape and type
If you use help on np.fromfile you will find something very... helpful:
Help on built-in function fromfile in module numpy.core.multiarray:
fromfile(...)
fromfile(file, dtype=float, count=-1, sep='')
Construct an array from data in a text or binary file.
A highly efficient way of reading binary data with a known data-type,
as well as parsing simply formatted text files. Data written using the
`tofile` method can be read using this function.
As far as I can tell, this is implemented in C and can be found here.
If you are trying to save and load binary data, you shouldn't use np.fromfile anymore. You should use np.save and np.load which will use a platform-independent binary format.

python struct.pack and write vs matlab fwrite

I am trying to port this bit of matlab code to python
matlab
function write_file(im,name)
fp = fopen(name,'wb');
M = size(im);
fwrite(fp,[M(1) M(2) M(3)],'int');
fwrite(fp,im(:),'float');
fclose(fp);
where im is a 3D matrix. As far as I understand, the function first writes a binary file with a header row containing the matrix size. The header is made of 3 integers. Then, the im is written as a single column of floats. In matlab this takes few seconds for a file of 150MB.
python
import struct
import numpy as np
def write_image(im, file_name):
with open(file_name, 'wb') as f:
l = im.shape[0]*im.shape[1]*im.shape[2]
header = np.array([im.shape[0], im.shape[1], im.shape[2]])
header_bin = struct.pack("I"*3, *header)
f.write(header_bin)
im_bin = struct.pack("f"*l,*np.reshape(im, (l,1), order='F'))
f.write(im_bin)
f.close()
where im is a numpy array. This code works well as I compared with the binary returned by matlab and they are the same. However, for the 150MB file, it takes several seconds and tends to drain all the memory (in the image linked I stopped the execution to avoid it, but you can see how it builds up!).
This does not make sense to me as I am running the function on a 15GB of RAM PC. How come a 150MB file processing requires so much memory?
I'd happy to use a different method, as far as it is possible to have two formats for the header and the data column.
There is no need to use struct to save your array. numpy.ndarray has a convenience method for saving itself in binary mode: ndarray.tofile. The following should be much more efficient than creating a gigantic string with the same number of elements as your array:
def write_image(im, file_name):
with open(file_name, 'wb') as f:
np.array(im.shape).tofile(f)
im.T.tofile(f)
tofile always saves in row-major C order, while MATLAB uses column-major Fortran order. The simplest way to get around this is to save the transpose of the array. In general, ndarray.T should create a view (wrapper object pointing to the same underlying data) instead of a copy, so your memory usage should not increase noticeably from this operation.

Memory runs out when creating a numpy array from a list of lists in python

I want to read in a tab separated value file and turn it into a numpy array. The file has 3 lines. It looks like this:
Ann1 Bill1 Chris1 Dick1
Ann2 Bill2 Chris2 "Dick2
Ann3 Bill3 Chris3 Dick3
So, I used this simple line of code:
new_list = []
with open('/home/me/my_tsv.tsv') as tsv:
for line in csv.reader(tsv, delimiter="\t"):
new_list.append(line)
new = np.array(job_posts)
print new.shape
And because of that pesky " character, the shape of my fancy new numpy array is
(2,4)
That's not right! So, the solution is to include the argument, quoting, in the csv.reader call, as follows:
for line in csv.reader(tsv, delimiter="\t", quoting=csv.QUOTE_NONE):
This is great! Now my dimension is
(3,4)
as I hoped.
Now comes the real problem -- in all actuality, I have a 700,000 X 10 .tsv file, with long fields. I can read the file into Python with no problems, just like in the above situation. But, when I get to the step where I create new = np.array(job_posts), my wimpy 16 GB laptop cries and says...
MEMORY ERROR
Obviously, I cannot simultaneously have these 2 object in memory -- the Python list of lists, and the numpy array.
My question is therefore this: how can I read this file directly into a numpy array, perhaps using genfromtxt or something similar...yet also achieve what I've achieved by using the quoting=csv.QUOTE_NONE argument in the csv.reader?
So far, I've found no analogy to the quoting=csv.QUOTE_NONE option available anywhere in the standard ways to read in a tsv file using numpy.
This is a tough little problem. I though about iteratively building the numpy array during the reading in process, but I can't figure it out.
I tried
nparray = np.genfromtxt("/home/me/my_tsv.tsv", delimiter="/t")
print obj.shap
and got
(3,0)
If anyone has any advice I'd be greatly appreciative. Also, I know the real answer is probably to use Pandas...but at this point I'm committed to using numpy for a lot of compelling reasons...
Thanks in advance.

Read binary file into Python 2D array

I have a large binary file that I want to read as a 48x1414339 array. I read it in this way:
f = open(fname, 'rb')
s = f.read()
import array
a = array.array('f',s)
But this gives me a 1D string. Is there a way to keep the columns distinct?
Wrap it in a class and implement e.g. __getitem__() to convert an index pair to the linear index. Using separate arrays is probably just adding overhead unless you plan to use rows separately.

Categories