Saving data incrementaly with python

Saving data incrementaly with python - python

I am working on a project in which a lot of data is being generated. I want a way to save my data as I go so I don't have to keep it all in RAM. I am currently using numpy to save everything in a npz file when the program finishes. The things that need to be saved are scalars, list, and list of lists. The lists have values added on to them incrementally and so I need a way to append to each list without having to load everything into memory.
I am still a bit new to python so if there is a standard way of doing this please point me in that direction.
Thanks

PyTables is a numpy friendly package that is designed to page data to disk to operate on data-sets that don't fit in memory.
See: https://www.pytables.org/usersguide/tutorials.html
https://kastnerkyle.github.io/posts/using-pytables-for-larger-than-ram-data-processing/
Usage
# Create a data-frame description (called a table)
# each attribute of Particle below is a column.
from tables import *
class Particle(IsDescription):
name = StringCol(16) # 16-character String
idnumber = Int64Col() # Signed 64-bit integer
ADCcount = UInt16Col() # Unsigned short integer
TDCcount = UInt8Col() # unsigned byte
grid_i = Int32Col() # 32-bit integer
grid_j = Int32Col() # 32-bit integer
pressure = Float32Col() # float (single-precision)
energy = Float64Col() # double (double-precision)
# create a hdf5 file on disk to store data in
h5file = open_file("tutorial1.h5", mode="w", title="Test file")
# create a table within the file, using the Particle description class
table = h5file.create_table(group, 'readout', Particle, "Readout example")
Performance
It is especially useful for computations across many data rows.
PyTables supports Blosc (which is a neat trick)
You can perform "in kernal" queries using blosc with the where method.
result = [row['col2'] for row in table.where(
'''(((col4 >= lim1) & (col4 < lim2)) |
((col2 > lim3) & (col2 < lim4)) &
((col1+3.1*col2+col3*col4) > lim5))''')]

Related

Pytables: Can Appended Earray be reduced in size?

Following suggestions on SO Post, I also found PyTables-append is exceptionally time efficient. However, in my case the output file (earray.h5) has huge size. Is there a way to append the data such that the output file is not as huge? For example, in my case (see link below) a 13GB input file (dset_1: 2.1E8 x 4 and dset_2: 2.1E8 x 4) gives a 197 GB output file with just one column (2.5E10 x 1). All elements are float64.
I want to reduce the output file size such that the execution speed of the script is not compromised and the output file reading is also efficient for later use. Can saving the data along columns and not just rows help? Any suggestions on this? Given below is a MWE.
Output and input files' details here
# no. of chunks from dset-1 and dset-2 in inp.h5
loop_1 = 40
loop_2 = 20
# save to disk after these many rows
app_len = 10**6
# **********************************************
# Grabbing input.h5 file
# **********************************************
filename = 'inp.h5'
f2 = h5py.File(filename, 'r')
chunks1 = f2['dset_1']
chunks2 = f2['dset_2']
shape1, shape2 = chunks1.shape[0], chunks2.shape[0]
f1 = tables.open_file("table.h5", "w")
a = f1.create_earray(f1.root, "dataset_1", atom=tables.Float64Atom(), shape=(0, 4))
size1 = shape1//loop_1
size2 = shape2//loop_2
# ***************************************************
# Grabbing chunks to process and append data
# ***************************************************
for c in range(loop_1):
h = c*size1
# grab chunks from dset_1 of inp.h5
chunk1 = chunks1[h:(h + size1)]
for d in range(loop_2):
g = d*size2
chunk2 = chunks2[g:(g + size2)] # grab chunks from dset_2 of inp.h5
r1 = chunk1.shape[0]
r2 = chunk2.shape[0]
left, right = 0, 0
for j in range(r1): # grab col.2 values from dataset-1
e1 = chunk1[j, 1]
#...Algaebraic operations here to output a row containing 4 float64
#...append to a (earray) when no. of rows reach a million
del chunk2
del chunk1
f2.close()

I wrote the answer you are referencing. That is a simple example that "only" writes 1.5e6 rows. I didn't do anything to optimize performance for very large files. You are creating a very large file, but did not say how many rows (obviously way more than 10**6). Here are some suggestions based on comments in another thread.
Areas I recommend (3 related to PyTables code, and 2 based on external utilizes).
PyTables code suggestions:
Enable compression when you create the file (add the filters= parameter when you create the file). Start with tb.Filters(complevel=1).
Define the expectedrows= parameter in .create_tables() (per PyTables docs, 'this will optimize the HDF5 B-Tree and amount of memory used'). The default value is set in tables/parameters.py (look for EXPECTED_ROWS_TABLE; It's only 10000 in my installation). I suggest you set this to a larger value if you are creating 10**6 (or more) rows.
There is a side benefit to setting expectedrows=. If you don't define chunkshape, 'a sensible value is calculated based on the expectedrows parameter'. Check the value used. This won't decrease the created file size, but will improve I/O performance.
If you didn't use compression when you created the file, there are 2 methods to compress existing files:
External Utilities:
The PyTables utility ptrepack - run against a HDF5 file to create a
new file (useful to go from uncompressed to compressed, or vice-versa). It is delivered with PyTables, and runs on the command line.
The HDF5 utility h5repack - works similar to ptrepack. It is delivered with the HDF5 installer from The HDF Group.
There are trade-offs with file compression: it reduces the file size, but increases access time (reduces I/O performance). I tend to use uncompressed files I open frequently (for best I/O performance). Then when done, I convert to compressed format for long term archiving. You can continue to work with them in compress format (the API handles cleanly).

How to store 25M 3-D int tuple with python?

As a hobby, I try to code a basic game in python, and I need to store a map of the game world. It can be viewed as a 2-D array to store height. The point is, for the moment, my map dimensions are 5000x5000.
I store that in a sqlite db (schema : CREATE TABLE map (x SMALLINT, y SMALLINT, h SMALLINT); + VACCUM at the end of the creation), but it take up to 500MB on the disk.
I can compress (lzma, for example) the sqlite file, and it only takes ~35-40MB, but in order to use it in python, I need to unzip it first, so it always ends up taking so much place.
How would you store that kind of data in python ?
A 2-D array of int, or a list 3-int tuple of that dimensions (or bigger) and it could still run on a Raspberry Pi ? Speed is not important, but RAM and file size are.

You need 10 bits to store each height, so 10 bytes can store 8 heights, and thus 31.25Mo can store all 25,000,000 of them. You can figure out which block of 10 bytes stores a desired height (how depends on how you arrange them), and a little bit shifting can isolate the specific height you want (since every height will be split between 2 adjacent bytes).

I finally used the HDF5 file format, with pyTables.
The outcome is a ~20MB file for the exact same data, directly usable by the application.
Here is how I create it:
import tables
db_struct = {
'x': tables.Int16Col(),
'y': tables.Int16Col(),
'h': tables.Int16Col()
}
h5file = tables.open_file("my_file.h5", mode="w", title='Map')
filters = tables.Filters(complevel=9, complib='lzo')
group = h5file.create_group('/', 'group', 'Group')
table = h5file.create_table(group, 'map', db_struct, filters=filters)
heights = table.row
for y in range(0, int(MAP_HEIGHT)):
for x in range(0, int(MAP_WIDTH)):
heights['x'] = x
heights['y'] = y
heights['h'] = h
heights.append()
table.flush()
table.flush()
h5file.close()

What is the fastest way to sort and unpack a large bytearray?

I have a large binary file that needs to be converted into hdf5 file format.
I am using Python3.6. My idea is to read in the file, sort the relevant information, unpack it and store it away. My information is stored in a way that the 8 byte time is followed by 2 bytes of energy and then 2 bytes of extra information, then again time, ... My current way of doing it, is the following (my information is read as an bytearray, with the name byte_array):
for i in range(0, len(byte_array)+1, 12):
if i == 0:
timestamp_bytes = byte_array[i:i+8]
energy_bytes = byte_array[i+8:i+10]
extras_bytes = byte_array[i+10:i+12]
else:
timestamp_bytes += byte_array[i:i+8]
energy_bytes += byte_array[i+8:i+10]
extras_bytes += byte_array[i+10:i+12]
timestamp_array = np.ndarray((len(timestamp_bytes)//8,), '<Q',timestamp_bytes)
energy_array = np.ndarray((len(energy_bytes) // 2,), '<h', energy_bytes)
extras_array = np.ndarray((len(timestamp_bytes) // 8,), '<H', extras_bytes)
I assume there is a much faster way of doing this, maybe by avoiding to loop over the the whole thing. My files are up to 15GB in size so every bit of improvement would help a lot.

You should be able to just tell NumPy to interpret the data as a structured array and extract fields:
as_structured = numpy.ndarray(shape=(len(byte_array)//12,),
dtype='<Q, <h, <H',
buffer=byte_array)
timestamps = as_structured['f0']
energies = as_structured['f1']
extras = as_structured['f2']
This will produce three arrays backed by the input bytearray. Creating these arrays should be effectively instant, but I can't guarantee that working with them will be fast - I think NumPy may need to do some implicit copying to handle alignment issues with these arrays. It's possible (I don't know) that explicitly copying them yourself with .copy() first might speed things up.

You can use numpy.frombuffer with a custom datatype:
import struct
import random
import numpy as np
data = [
(random.randint(0, 255**8), random.randint(0, 255*255), random.randint(0, 255*255))
for _ in range(20)
]
Bytes = b''.join(struct.pack('<Q2H', *row) for row in data)
dtype = np.dtype([('time', np.uint64),
('energy', np.uint16), # you may need to change that to `np.int16`, if energy can be negative
('extras', np.uint16)])
original = np.array(data, dtype=np.uint64)
result = np.frombuffer(Bytes, dtype)
print((result['time'] == original[:, 0]).all())
print((result['energy'] == original[:, 1]).all())
print((result['extras'] == original[:, 2]).all())
print(result)
Example output:
True
True
True
[(6048800706604665320, 52635, 291) (8427097887613035313, 15520, 4976)
(3250665110135380002, 44078, 63748) (17867295175506485743, 53323, 293)
(7840430102298790024, 38161, 27601) (15927595121394361471, 47152, 40296)
(8882783920163363834, 3480, 46666) (15102082728995819558, 25348, 3492)
(14964201209703818097, 60557, 4445) (11285466269736808083, 64496, 52086)
(6776526382025956941, 63096, 57267) (5265981349217761773, 19503, 32500)
(16839331389597634577, 49067, 46000) (16893396755393998689, 31922, 14228)
(15428810261434211689, 32003, 61458) (5502680334984414629, 59013, 42330)
(6325789410021178213, 25515, 49850) (6328332306678721373, 59019, 64106)
(3222979511295721944, 26445, 37703) (4490370317582410310, 52413, 25364)]

I'm not an expert on numpy, but here's my 5 cents:
You have lots of data, and probably it's more than your RAM.
This points to the simplest solution - don't try to fit all data in your program.
When you read a file into a variable - then the X GB is being read into RAM. If it's more than available RAM, then swapping is done by your OS. Swapping slows you down, since not only do you have disk operations for reading from source file, now you also have writing to disk to dump RAM contents into swap file.
Instead of that write the script so that it uses parts of the input file as necessary (in your case you read the file along anyways and don't go back or jump far ahead).
Try opening the input file as memory mapped data structure (please note differences in usage between Unix and windows environments)
Then you can do simple read([n]) bytes at a time and append that to your arrays.
behind the scenes data is read into RAM page by page as needed and will not exceed the available memory, also, leaving more space for your arrays to grow.
Also consider the fact that your resultant arrays can also outgrow RAM, which will cause similar slowdown as reading of a big file.

PyTables vs Matlab HDF5 read times

I have an HDF5 output file from NASTRAN that contains mode shape data. I am trying to read them into Matlab and Python to check various post-processing techniques. The file in question is in the local directory for both of these tests. The file is semi-large at 1.2 GB but certainly not that large in terms of HDF5 files I have read previously. There are 17567342 rows and 8 columns in the table I want to access. The first and last columns are integers the middle 6 are floating point numbers.
Matlab:
file = 'HDF5.h5';
hinfo = hdf5info(file);
% ... Find the dataset I want to extract
t = hdf5read(file, '/NASTRAN/RESULT/NODAL/EIGENVECTOR');
This last operation is extremely slow (can be measured in hours).
Python:
import tables
hfile = tables.open_file("HDF5.h5")
modetable = hfile.root.NASTRAN.RESULT.NODAL.EIGENVECTOR
data = modetable.read()
This last operation is basically instant. I can then access data as if it were a numpy array. I am clearly missing something very basic about what these commands are doing. I'm thinking it might have something to do with data conversion but I'm not sure. If I do type(data) I get back numpy.ndarray and type(data[0]) returns numpy.void.
What is the correct (i.e. speedy) way to read the dataset I want into Matlab?

Matt, Are you still working on this problem?
I am not a matlab guy, but I am familiar with Nastran HDF5 file. You are right; 1.2 GB is big, but not that big by today's standards.
You might be able to diagnose the matlab performance bottle neck by running tests with different numbers of rows in your EIGENVECTOR dataset. To do that (without running a lot of Nastran jobs), I created some simple code to create a HDF5 file with a user defined # of rows. It mimics the structure of the Nastran Eigenvector Result dataset. See below:
import tables as tb
import numpy as np
hfile = tb.open_file('SO_54300107.h5','w')
eigen_dtype = np.dtype([('ID',int), ('X',float),('Y',float),('Z',float),
('RX',float),('RY',float),('RZ',float), ('DOMAIN_ID',int)])
fsize = 1000.0
isize = int(fsize)
recarr = np.recarray((isize,),dtype=eigen_dtype)
id_arr = np.arange(1,isize+1)
dom_arr = np.ones((isize,), dtype=int)
arr = np.array(np.arange(fsize))/fsize
recarr['ID'] = id_arr
recarr['X'] = arr
recarr['Y'] = arr
recarr['Z'] = arr
recarr['RX'] = arr
recarr['RY'] = arr
recarr['RZ'] = arr
recarr['DOMAIN_ID'] = dom_arr
modetable = hfile.create_table('/NASTRAN/RESULT/NODAL', 'EIGENVECTOR',
createparents=True, obj=recarr )
hfile.close()
Try running this with different values for fsize (# of rows), then attach the HDF5 file it creates to matlab. Maybe you can find the point where performance noticeably degrades.

Matlab provided another HDF5 reader called h5read. Using the same basic approach the amount of time taken to read the data was drastically reduced. In fact hdf5read is listed for removal in a future version. Here is same basic code with the perfered functions.
file = 'HDF5.h5';
hinfo = h5info(file);
% ... Find the dataset I want to extract
t = h5read(file, '/NASTRAN/RESULT/NODAL/EIGENVECTOR');

astropy.fits: Manipulating image data from a fits Table? (e.g., 3072R x 2C)

I'm currently having a little issue with a fits file. The data is in table format, a format I haven't previously used. I'm a python user, and rely heavily on astropy.fits to manipulate fits images. A quick output of the info gives:
No. Name Type Cards Dimensions Format
0 PRIMARY PrimaryHDU 60 ()
1 BinTableHDU 29 3072R x 2C [1024E, 1024E]
The header for the BinTableHDU is as follows:
XTENSION= 'BINTABLE' /Written by IDL: Mon Jun 22 23:28:21 2015
BITPIX = 8 /
NAXIS = 2 /Binary table
NAXIS1 = 8192 /Number of bytes per row
NAXIS2 = 3072 /Number of rows
PCOUNT = 0 /Random parameter count
GCOUNT = 1 /Group count
TFIELDS = 2 /Number of columns
TFORM1 = '1024E ' /Real*4 (floating point)
TFORM2 = '1024E ' /Real*4 (floating point)
TTYPE1 = 'COUNT_RATE' /
TUNIT1 = '1e-6cts/s/arcmin^2' /
TTYPE2 = 'UNCERTAINTY' /
TUNIT2 = '1e-6cts/s/arcmin^2' /
HISTORY g000m90r1b120pm.fits created on 10/08/97. PI channel range: 8: 19
PIXTYPE = 'HEALPIX ' / HEALPIX pixelisation
ORDERING= 'NESTED ' / Pixel ordering scheme, either RING or NESTED
NSIDE = 512 / Healpix resolution parameter
NPIX = 3145728 / Total number of pixels
OBJECT = 'FULLSKY ' / Sky coverage, either FULLSKY or PARTIAL
FIRSTPIX= 0 / First pixel # (0 based)
LASTPIX = 3145727 / Last pixel # (zero based)
INDXSCHM= 'IMPLICIT' / indexing : IMPLICIT or EXPLICIT
GRAIN = 0 / GRAIN = 0: No index,
COMMENT GRAIN =1: 1 pixel index for each pixel,
COMMENT GRAIN >1: 1 pixel index for Grain consecutive pixels
BAD_DATA= -1.63750E+30 / Sentinel value given to bad pixels
COORDSYS= 'G ' / Pixelization coordinate system
COMMENT G = Galactic, E = ecliptic, C = celestial = equatorial
END
I'd like to access the fits image which is stored within the TTYPE labeled 'COUNT-RATE', and then have this in a format with which I can then add to other count-rate arrays with the same dimensions.
I started with my usual prodcedure for opening a fits file:
hdulist_RASS_SXRB_R1 = fits.open('/Users/.../RASS_SXRB_R1.fits')
hdulist_RASS_SXRB_R1.info()
image_XRAY_SKYVIEW_R1 = hdulist_RASS_SXRB_R1[1].data
image_XRAY_SKYVIEW_R1 = numpy.array(image_XRAY_SKYVIEW_R1)
image_XRAY_SKYVIEW_header_R1 = hdulist_RASS_SXRB_R1[1].header
But this is coming back with IndexError: too many indices for array. I've had a look at accessing table data in the astropy documentation here (Accessing data stored as a table in a multi-extension FITS (MEF) file)
If anyone has a tried and tested method for accessing such images from a fits table I'd be very grateful! Many thanks.

I can't be sure without seeing the full traceback but I think the exception you're getting is from this:
image_XRAY_SKYVIEW_R1 = numpy.array(image_XRAY_SKYVIEW_R1)
There's no reason to manually wrap numpy.array() around the array. It's already a Numpy array. But in this case it's a structured array (see http://docs.scipy.org/doc/numpy/user/basics.rec.html).
#Andromedae93's answer is right one. But also for general documentation on this see: http://docs.astropy.org/en/stable/io/fits/index.html#working-with-table-data
However, the way you're working (which is fine for images) of manually calling fits.open, accessing the .data attribute of the HDU, etc. is fairly low level, and Numpy structured arrays are good at representing tables, but not great for manipulating them.
You're better off generally using Astropy's higher-level Table interface. A FITS table can be read directly into an Astropy Table object with Table.read(): http://docs.astropy.org/en/stable/io/unified.html#fits
The only reason the same thing doesn't exist for FITS images is there's no a generic "Image" class yet.

I used astropy.io.fits during my internship in Astrophysics and this is my process to open file .fits and make some operations :
# Opening the .fits file which is named SMASH.fits
field = fits.open(SMASH.fits)
# Data fits reading
tbdata = field[1].data
Now, with this kind of method, tbdata is a numpy.array and you can make lots of things.
For example, if you have data like :
ID, Name, Object
1, HD 1527, Star
2, HD 7836, Star
3, NGC 6739, Galaxy
If you want to print data along one condition :
Data_name = tbdata['Name']
You will get :
HD 1527
HD 7836
NGC 6739
I don't know what do you want exactly with your data, but I can help you ;)

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.