Effective way to convert 3D-numpy array to ASCII format - python

I need to convert 18 3D numpy arrays (each weighing ~2 GB) to ASCII-format files.
Here is the info about one of the ndarrays:
I've searched the web for ideas on this conversion; one suggestion was to save the ndarray to txt with numpy.savetxt, but it seems that only works for 2D arrays. For 3D arrays the recommendation was to first slice them into 2D arrays and then write those out to an ASCII text file. But since I am new to Python, I am not sure this is the optimal way to deal with the huge amount of data I have, because in my case it would mean 256 2D arrays for each file (and I have 18 files in total).
I would really appreciate your ideas and help!

This procedure can be done with the awrite function in Py4CatS or with the numpy.savetxt function in Python.
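For the numpy.savetxt route, here is a minimal sketch of slicing a 3D array along its first axis and writing each 2D slice into one ASCII text file (the array, file name, and shape below are placeholders):

import numpy as np

# Stand-in for one of the 18 arrays: 256 slices of 512 x 512 values.
data = np.random.rand(256, 512, 512).astype(np.float32)

with open("array_ascii.txt", "w") as f:
    # Record the original shape so the file can be reshaped on read-back.
    f.write("# shape: {}\n".format(data.shape))
    for k, slice2d in enumerate(data):
        f.write("# slice {}\n".format(k))
        # savetxt accepts an open file handle, so all 256 slices land in one file.
        np.savetxt(f, slice2d, fmt="%.6e")

Note that ASCII output of this size will be several times larger on disk than the original binary arrays.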

Related

Python: How to write a 2-D array to a binary file with precision (uint8, uint16)?

I'm hoping someone can explain to me how to write a binary output to a text file.
I know that in MATLAB it is pretty simple, as it has the 'fwrite' function, which has the following format: fwrite(fileID,A,precision)
However, I'm not sure how to translate this to Python. I currently have the data stored in a 2-D numpy array and would like to write it to the file in the same way that MATLAB does.
I have attempted it with the following:
# let Y be the numpy matrix
with open(filepath, 'wb') as FileToWrite:
    np.array(Y.shape).tofile(FileToWrite)
    Y.T.tofile(FileToWrite)
However, this didn't lead to the desired result. The file size is way too large and the data is incorrectly formatted.
Ideally I should be able to specify the data format to be uint8 or uint16 as well.
Any help would be massively appreciated.
So, I figured it out. The solution is as follows.
# let Y be the numpy matrix
with open(filepath, 'wb') as FileToWrite:
    np.asarray(Y, dtype=np.uint8).tofile(FileToWrite)
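To keep the MATLAB-like precision argument and still be able to read the file back, a sketch along these lines should work (the function names are made up, and uint16 is just an example):

import numpy as np

def fwrite_like(filepath, Y, dtype=np.uint16):
    # Write the 2-D shape as two uint32 values, then the data in the chosen precision.
    with open(filepath, 'wb') as f:
        np.asarray(Y.shape, dtype=np.uint32).tofile(f)
        np.asarray(Y, dtype=dtype).tofile(f)

def fread_like(filepath, dtype=np.uint16):
    # Read back an array written by fwrite_like.
    with open(filepath, 'rb') as f:
        shape = np.fromfile(f, dtype=np.uint32, count=2)
        return np.fromfile(f, dtype=dtype).reshape(shape)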

Is there a way to concatenate MATLAB arrays with Python (e.g. using scipy.io)?

I'm trying to write a large (~20 GB) .fvp file, which behaves like a delimited text file, to a MATLAB array and save it as a .mat with Python 3. Right now, I am reading the file in Python, converting the values to a single numpy array, and saving it with scipy.io.savemat().
However, my PC runs out of memory in the process, which I think is due to the large size of the numpy array, since my code runs fine for smaller .fvp files.
To solve this problem, I want to write and save sections of the .fvp file in multiple .mat files and join them up later, preferably in Python. Is there a way to do it? I can't find it in scipy.io.
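scipy.io has no concatenation helper, but one possible way to join the chunks afterwards, assuming each chunk file stores its piece under the same variable name (the file pattern and the 'data' key below are hypothetical), is to load them one at a time and stack them:

import glob
import numpy as np
from scipy.io import loadmat, savemat

# Chunk files written earlier, each holding part of the array under the key 'data'.
chunk_files = sorted(glob.glob("chunk_*.mat"))
parts = [loadmat(path)["data"] for path in chunk_files]

full = np.concatenate(parts, axis=0)   # stack the chunks along the row axis
savemat("combined.mat", {"data": full})

Note that the final concatenation still needs the whole array in memory, so this only helps if the joining step runs on a machine with enough RAM; otherwise, writing the chunks into a single HDF5 dataset (which MATLAB can also read) avoids holding everything at once.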

Memory-efficient 2d growable array in python?

I'm working on an app that processes a lot of data.
.... and keeps running my computer out of memory. :(
Python has a huge amount of memory overhead on variables (as per sys.getsizeof()). A basic tuple with one integer in it takes up 56 bytes, for example. An empty list, 64 bytes. Serious overhead.
Numpy arrays are great for reducing overhead. But they're not designed to grow efficiently (see Fastest way to grow a numpy numeric array). The array module (https://docs.python.org/3/library/array.html) seems promising, but it's 1-D. My data is 2-D, with an arbitrary number of rows and a column width of 3 floats (ideally float32) for one array, and a column width of two ints (ideally uint32) for the other. Obviously, using ~80 bytes of Python structure to store 12 or 8 bytes of data per row is going to blow up my memory consumption.
Is the only realistic way to keep memory usage down in Python to "fake" 2d, aka by addressing the array as arr[row*WIDTH+column] and counting rows as len(arr)/WIDTH?
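For reference, here is a minimal sketch of that flat addressing with the stdlib array module, using the 3-float row width mentioned above (the helper names are made up):

from array import array

WIDTH = 3                      # three float32 values per row
rows = array('f')              # flat, growable single-precision storage

def append_row(a, values):
    a.extend(values)           # grows in place, no per-row Python objects

def get(a, row, col):
    return a[row * WIDTH + col]

def num_rows(a):
    return len(a) // WIDTH

append_row(rows, (1.0, 2.0, 3.0))
append_row(rows, (4.0, 5.0, 6.0))
print(get(rows, 1, 2))         # -> 6.0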
Based on your comments, I'd suggest that you split your task into two parts:
1) In part 1, parse the JSON files using regexes and generate two CSV files in simple format: no headers, no spaces, just numbers. This should be quick and performant, with no memory issues: read text in, write text out. Don't try to keep anything in memory that you don't absolutely have to.
2) In part 2, use pandas read_csv() function to slurp in the CSV files directly. (Yes, pandas! You've probably already got it, and it's hella fast.)
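A sketch of part 2, assuming the two headerless CSV files from part 1 (the file names and column layouts are placeholders):

import numpy as np
import pandas as pd

# Three float32 columns per row in one file, two uint32 columns in the other.
floats = pd.read_csv("floats.csv", header=None, dtype=np.float32).to_numpy()
ints = pd.read_csv("ints.csv", header=None, dtype=np.uint32).to_numpy()

print(floats.shape, floats.dtype)   # (n_rows, 3) float32
print(ints.shape, ints.dtype)       # (n_rows, 2) uint32

Because the dtype is fixed up front, the resulting arrays hold only 12 and 8 bytes per row, with no per-row Python objects.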

What data format for large files in R?

I produce a very large data file with Python, mostly consisting of 0s (false) and only a few 1s (true). It has about 700,000 columns and 15,000 rows, and thus a size of 10.5 GB. The first row is the header.
This file then needs to be read and visualized in R.
I'm looking for the right data format to export my file from Python.
As stated here:
HDF5 is row based. You get MUCH efficiency by having tables that are
not too wide but are fairly long.
As I have a very wide table, I assume, HDF5 is inappropriate in my case?
So what data format suits best for this purpose?
Would it also make sense to compress (zip) it?
Example of my file:
id,col1,col2,col3,col4,col5,...
1,0,0,0,1,0,...
2,1,0,0,0,1,...
3,0,1,0,0,1,...
4,...
Zipping won't help you, as you'll have to unzip it to process it. If you could post your code that generates the file, that might help a lot.
Also, what do you want to accomplish in R? Might it be faster to visualize it in Python, avoiding the read/write of 10.5 GB?
Perhaps rethinking your approach to how you're storing the data (eg: store the coordinates of the 1's if there are very few) might be a better angle here.
For instance, instead of storing a 700K by 15K table of all zeroes except for a 1 in line 600492 column 10786, I might just store the tuple (600492, 10786) and achieve the same visualization in R.
SciPy has scipy.io.mmwrite which makes files that can be read by R's readMM command. SciPy also supports several different sparse matrix representations.
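A minimal sketch of that sparse route, with made-up coordinates for the 1 entries and the 700K-by-15K orientation used in the example above:

import numpy as np
from scipy import sparse
from scipy.io import mmwrite

# Row/column indices of the 1 entries (example values); everything else is 0.
rows = np.array([600492, 12, 7000])
cols = np.array([10786, 3, 14999])
vals = np.ones(len(rows), dtype=np.uint8)

# Only the nonzeros take space in the COO representation.
m = sparse.coo_matrix((vals, (rows, cols)), shape=(700000, 15000))
mmwrite("matrix.mtx", m)   # readable in R with Matrix::readMM("matrix.mtx")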

Extracting specific values from .npy file

I have a .npy file of which I know basically everything (size, number of elements, type of elements, etc.), and I'd like a way to retrieve specific values without loading the whole array. The goal is to use as little memory as possible.
I'm looking for something like
def extract('test.npy', i, j):
    return "test.npy[i,j]"
I kinda know how to do it with a text file (see recent questions), but doing this with a .npy array would allow me to do more than line extraction.
Also if you know any way to do this with a scipy sparse matrix that would be really great.
Thank you.
Just use data = np.load(filename, mmap_mode='r') (or one of the other modes, if you need to change specific elements, as well).
This will return a memory-mapped array. The contents of the array won't be loaded into memory; they stay on disk, but you can access individual items by indexing the array as you normally would. (Be aware that accessing some slices will take much longer than others, depending on the shape and order of your array.)
HDF is a more efficient format for this, but the .npy format is designed to allow for memmapped arrays.
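A short sketch of the memory-mapped access (the file name and indices are placeholders):

import numpy as np

data = np.load("test.npy", mmap_mode="r")   # nothing is read into memory yet

value = data[123, 456]        # only the pages holding this element are touched
block = data[1000:1010, :]    # slices work too, loading just what they cover
print(value, block.shape)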
