How to store 25M 3-D int tuples with Python?

As a hobby, I'm coding a basic game in Python, and I need to store a map of the game world. It can be viewed as a 2-D array storing heights. The point is, for the moment, my map dimensions are 5000x5000.
I store that in an SQLite db (schema: CREATE TABLE map (x SMALLINT, y SMALLINT, h SMALLINT); plus a VACUUM at the end of creation), but it takes up to 500MB on disk.
I can compress the SQLite file (with lzma, for example), and it only takes ~35-40MB, but to use it in Python I need to decompress it first, so it always ends up taking that much space anyway.
How would you store that kind of data in Python?
A 2-D array of ints, or a list of 3-int tuples, at those dimensions (or bigger), that could still run on a Raspberry Pi? Speed is not important, but RAM and file size are.

You need 10 bits to store each height, so 10 bytes can store 8 heights, and thus 31.25MB can store all 25,000,000 of them. You can figure out which block of 10 bytes stores a desired height (how depends on how you arrange them), and a little bit-shifting can isolate the specific height you want (since every height will be split between 2 adjacent bytes).
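A minimal sketch of that packing scheme, assuming heights fit in 0..1023 and a row-major layout (index = y * 5000 + x); the function name is illustrative, not from the question:

def get_height(packed: bytes, index: int) -> int:
    # Each height occupies 10 bits; find the byte and bit where it starts.
    bit_offset = index * 10
    byte_offset, shift = divmod(bit_offset, 8)
    # A 10-bit value spans at most two adjacent bytes; read both, then shift and mask.
    two_bytes = int.from_bytes(packed[byte_offset:byte_offset + 2], 'little')
    return (two_bytes >> shift) & 0x3FF  # keep the low 10 bits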

I finally used the HDF5 file format, with pyTables.
The outcome is a ~20MB file for the exact same data, directly usable by the application.
Here is how I create it:
import tables

db_struct = {
    'x': tables.Int16Col(),
    'y': tables.Int16Col(),
    'h': tables.Int16Col()
}

h5file = tables.open_file("my_file.h5", mode="w", title='Map')
filters = tables.Filters(complevel=9, complib='lzo')
group = h5file.create_group('/', 'group', 'Group')
table = h5file.create_table(group, 'map', db_struct, filters=filters)

heights = table.row
# MAP_WIDTH, MAP_HEIGHT and h come from the game's existing map data
for y in range(0, int(MAP_HEIGHT)):
    for x in range(0, int(MAP_WIDTH)):
        heights['x'] = x
        heights['y'] = y
        heights['h'] = h
        heights.append()
    table.flush()
table.flush()
h5file.close()
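For reference, a minimal sketch of reading the table back, assuming the /group/map layout created above (the query value is just an example):

import tables

h5file = tables.open_file("my_file.h5", mode="r")
table = h5file.root.group.map
# In-kernel query: fetch the heights of a single map row without loading everything.
row_heights = [(r['x'], r['h']) for r in table.where('y == 42')]
h5file.close()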

Related

What is the fastest way to sort and unpack a large bytearray?

I have a large binary file that needs to be converted into hdf5 file format.
I am using Python 3.6. My idea is to read in the file, sort the relevant information, unpack it and store it away. My information is stored in such a way that an 8-byte time is followed by 2 bytes of energy and then 2 bytes of extra information, then again time, ... My current way of doing it is the following (my information is read as a bytearray, with the name byte_array):
for i in range(0, len(byte_array)+1, 12):
    if i == 0:
        timestamp_bytes = byte_array[i:i+8]
        energy_bytes = byte_array[i+8:i+10]
        extras_bytes = byte_array[i+10:i+12]
    else:
        timestamp_bytes += byte_array[i:i+8]
        energy_bytes += byte_array[i+8:i+10]
        extras_bytes += byte_array[i+10:i+12]

timestamp_array = np.ndarray((len(timestamp_bytes)//8,), '<Q', timestamp_bytes)
energy_array = np.ndarray((len(energy_bytes) // 2,), '<h', energy_bytes)
extras_array = np.ndarray((len(timestamp_bytes) // 8,), '<H', extras_bytes)
I assume there is a much faster way of doing this, maybe by avoiding looping over the whole thing. My files are up to 15GB in size, so every bit of improvement would help a lot.
You should be able to just tell NumPy to interpret the data as a structured array and extract fields:
as_structured = numpy.ndarray(shape=(len(byte_array)//12,),
                              dtype='<Q, <h, <H',
                              buffer=byte_array)

timestamps = as_structured['f0']
energies = as_structured['f1']
extras = as_structured['f2']
This will produce three arrays backed by the input bytearray. Creating these arrays should be effectively instant, but I can't guarantee that working with them will be fast - I think NumPy may need to do some implicit copying to handle alignment issues with these arrays. It's possible (I don't know) that explicitly copying them yourself with .copy() first might speed things up.
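If that turns out to matter, a minimal sketch of the explicit copy mentioned above, reusing the names from the snippet:

# Copying each field yields contiguous, aligned arrays detached from byte_array.
timestamps = as_structured['f0'].copy()
energies = as_structured['f1'].copy()
extras = as_structured['f2'].copy()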
You can use numpy.frombuffer with a custom datatype:
import struct
import random
import numpy as np

data = [
    (random.randint(0, 255**8), random.randint(0, 255*255), random.randint(0, 255*255))
    for _ in range(20)
]

Bytes = b''.join(struct.pack('<Q2H', *row) for row in data)

dtype = np.dtype([('time', np.uint64),
                  ('energy', np.uint16),  # you may need to change that to `np.int16`, if energy can be negative
                  ('extras', np.uint16)])

original = np.array(data, dtype=np.uint64)
result = np.frombuffer(Bytes, dtype)

print((result['time'] == original[:, 0]).all())
print((result['energy'] == original[:, 1]).all())
print((result['extras'] == original[:, 2]).all())
print(result)
Example output:
True
True
True
[(6048800706604665320, 52635, 291) (8427097887613035313, 15520, 4976)
(3250665110135380002, 44078, 63748) (17867295175506485743, 53323, 293)
(7840430102298790024, 38161, 27601) (15927595121394361471, 47152, 40296)
(8882783920163363834, 3480, 46666) (15102082728995819558, 25348, 3492)
(14964201209703818097, 60557, 4445) (11285466269736808083, 64496, 52086)
(6776526382025956941, 63096, 57267) (5265981349217761773, 19503, 32500)
(16839331389597634577, 49067, 46000) (16893396755393998689, 31922, 14228)
(15428810261434211689, 32003, 61458) (5502680334984414629, 59013, 42330)
(6325789410021178213, 25515, 49850) (6328332306678721373, 59019, 64106)
(3222979511295721944, 26445, 37703) (4490370317582410310, 52413, 25364)]
I'm not an expert on numpy, but here's my 5 cents:
You have lots of data, probably more than your RAM.
This points to the simplest solution: don't try to fit all the data in your program.
When you read a file into a variable, those X GB are read into RAM. If that's more than the available RAM, your OS starts swapping. Swapping slows you down, since on top of the disk reads from the source file you now also have disk writes to dump RAM contents into the swap file.
Instead, write the script so that it uses parts of the input file only as necessary (in your case you read the file sequentially anyway and don't jump back or far ahead).
Try opening the input file as a memory-mapped data structure (note the differences in usage between Unix and Windows environments); a sketch follows below.
Then you can do a simple read([n]) of a few bytes at a time and append that to your arrays.
Behind the scenes, data is read into RAM page by page as needed and will not exceed the available memory, which also leaves more space for your arrays to grow.
Also consider that your resulting arrays can outgrow RAM too, which will cause a slowdown similar to reading a big file.
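A minimal sketch of that memory-mapped approach with Python's built-in mmap module (the filename is hypothetical; the 12-byte record layout mirrors the question):

import mmap

with open('events.bin', 'rb') as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        record_size = 12  # 8-byte time + 2-byte energy + 2-byte extras
        for offset in range(0, len(mm) - record_size + 1, record_size):
            record = mm[offset:offset + record_size]
            # ... unpack the record and append to your arrays here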

Is there a faster way to convert a big file from hex to binary and binary to int?

I have a big DataFrame (1999048 rows and 1 col) with hexadecimal data. I want to convert each line to binary, cut it into pieces, and translate each piece into decimal format.
I tried this:
for i in range(len(df.index)):
    hexa_line = hex2bin(str(f1.iloc[i]))[::-1]
    channel = int(hexa_line[0:3][::-1], 2)
    edge = int(hexa_line[3][::-1], 2)
    time = int(hexa_line[4:32][::-1], 2)
    sweep = int(hexa_line[32:48][::-1], 2)
    tag = int(hexa_line[48:63][::-1], 2)
    datalost = int(hexa_line[63][::-1], 2)
    line = np.array([[channel, edge, time, sweep, tag, datalost]])
    tab = np.concatenate((tab, line), axis=0)
But it is really, really slow... Is there a faster way to do that?
The only thing I can imagine helping a lot would be changing these lines:
line=np.array([[channel, edge, time, sweep, tag, datalost]])
tab=np.concatenate((tab, line), axis=0)
Certainly in pandas, and I think also in NumPy, concatenating is an expensive operation whose cost depends on the total size of both arrays (unlike, say, list.append).
I think what this does is rewrite the entire array tab each time you call it. Perhaps you could try appending each line to a list, then concatenating the whole list together at the end.
e.g. something more like this:
tab = []
for i in range(len(df.index)):
    hexa_line = hex2bin(str(f1.iloc[i]))[::-1]
    channel = int(hexa_line[0:3][::-1], 2)
    edge = int(hexa_line[3][::-1], 2)
    time = int(hexa_line[4:32][::-1], 2)
    sweep = int(hexa_line[32:48][::-1], 2)
    tag = int(hexa_line[48:63][::-1], 2)
    datalost = int(hexa_line[63][::-1], 2)
    line = np.array([[channel, edge, time, sweep, tag, datalost]])
    tab.append(line)

final_tab = np.concatenate(tab, axis=0)
# or whatever the syntax is :p

Why would using a database (Redis, SQL) help when loading big data and RAM is running out?

I need to take 100 000 images from a directory and put them all in one big dictionary where the keys are the ids of the pictures and the values are the numpy arrays of the pixels of the images. Creating this dict takes 19 GB of my RAM and I have 24 GB in total. Then I need to order the dictionary with respect to the key and at the end take only the values of this ordered dictionary and save them as one big numpy array. I need this big numpy array because I want to send it to sklearn's train_test_split function and split the whole data into train and test sets with respect to their labels. I found this question, where they have the same problem of running out of RAM in the step where, after creating the 19 GB dictionary, I try to sort the dict: How to sort a LARGE dictionary. There, people suggest using a database.
def save_all_images_as_one_numpy_array():
    data_dict = {}
    for img in os.listdir('images'):
        id_img = img.split('_')[1]
        loadimg = load_img(os.path.join('images', img))
        x = image.img_to_array(loadimg)
        data_dict[id_img] = x

    data_dict = np.stack([v for k, v in sorted(data_dict.items(), key=lambda x: int(x[0]))])
    mmamfile = open_memmap('trythismmapfile.npy', dtype=np.float32, mode='w+', shape=data_dict.shape)
    mmamfile[:] = data_dict[:]


def load_numpy_array_with_images():
    a = open_memmap('trythismmapfile.npy', dtype=np.float32, mode='r')
When using np.stack I am stacking each numpy array into a new array, and this is where I run out of RAM. I can't afford to buy more RAM. I thought I could use Redis in a Docker container, but I don't understand why and how using a database would solve my problem?
The reason using a DB helps is that the DB library stores data on the hard disk rather than in memory. If you look at the documentation for the library the linked answer suggests, you'll see that the first argument is filename, demonstrating that the hard disk is used.
https://docs.python.org/2/library/bsddb.html#bsddb.hashopen
However, the linked question is talking about sorting by value, not key. Sorting by key will be much less memory intensive although you'll likely still have memory issues when training your model. I'd suggest trying something along the lines of
# Get the list of file names
imgs = os.listdir('images')

# Create a mapping of ID to file name
# This will allow us to sort the IDs then load the files in order
img_ids = {int(img.split('_')[1]): img for img in imgs}

# Get the list of file names sorted by ID
sorted_imgs = [v for k, v in sorted(img_ids.items(), key=lambda x: x[0])]

# Define a function for loading a named img as an array
# (renamed so it does not shadow the load_img it calls)
def load_img_array(img):
    loadimg = load_img(os.path.join('images', img))
    return image.img_to_array(loadimg)

# Iterate through the sorted file names and stack the results
data_dict = np.stack([load_img_array(img) for img in sorted_imgs])
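If the stacked array itself threatens to outgrow RAM, a minimal sketch that builds on the question's own open_memmap call, assuming every image has the same shape and reusing load_img_array and sorted_imgs from above:

from numpy.lib.format import open_memmap

# Write each image straight into the on-disk .npy array so the full
# stack never has to sit in RAM at once.
first = load_img_array(sorted_imgs[0])
out = open_memmap('trythismmapfile.npy', dtype=np.float32, mode='w+',
                  shape=(len(sorted_imgs),) + first.shape)
out[0] = first
for i, name in enumerate(sorted_imgs[1:], start=1):
    out[i] = load_img_array(name)
out.flush()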

Saving data incrementally with Python

I am working on a project in which a lot of data is being generated. I want a way to save my data as I go so I don't have to keep it all in RAM. I am currently using numpy to save everything in an npz file when the program finishes. The things that need to be saved are scalars, lists, and lists of lists. The lists have values added onto them incrementally, so I need a way to append to each list without having to load everything into memory.
I am still a bit new to python so if there is a standard way of doing this please point me in that direction.
Thanks
PyTables is a numpy-friendly package designed to page data to disk, so you can operate on data sets that don't fit in memory.
See: https://www.pytables.org/usersguide/tutorials.html
https://kastnerkyle.github.io/posts/using-pytables-for-larger-than-ram-data-processing/
Usage
# Create a data-frame description (called a table);
# each attribute of Particle below is a column.
from tables import *

class Particle(IsDescription):
    name = StringCol(16)        # 16-character string
    idnumber = Int64Col()       # signed 64-bit integer
    ADCcount = UInt16Col()      # unsigned short integer
    TDCcount = UInt8Col()       # unsigned byte
    grid_i = Int32Col()         # 32-bit integer
    grid_j = Int32Col()         # 32-bit integer
    pressure = Float32Col()     # float (single-precision)
    energy = Float64Col()       # double (double-precision)

# Create an HDF5 file on disk to store data in
h5file = open_file("tutorial1.h5", mode="w", title="Test file")

# Create a group, then a table within it, using the Particle description class
group = h5file.create_group("/", "detector", "Detector information")
table = h5file.create_table(group, 'readout', Particle, "Readout example")
Performance
It is especially useful for computations across many data rows.
PyTables supports Blosc compression (which is a neat trick).
You can perform "in-kernel" queries with the where method:
result = [row['col2'] for row in table.where(
    '''(((col4 >= lim1) & (col4 < lim2)) |
       ((col2 > lim3) & (col2 < lim4)) &
       ((col1+3.1*col2+col3*col4) > lim5))''')]
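Since the question is about saving incrementally, here is a minimal sketch of appending rows as they are generated (the values are placeholders), continuing from the table created above:

particle = table.row
for i in range(10_000):
    particle['name'] = 'Particle: %6d' % i
    particle['idnumber'] = i
    particle['ADCcount'] = i % 65536
    particle['TDCcount'] = i % 256
    particle['grid_i'] = i
    particle['grid_j'] = 10 - i
    particle['pressure'] = float(i * i)
    particle['energy'] = float(i ** 2)
    particle.append()        # buffered; nothing piles up in your own lists
    if i % 1000 == 0:
        table.flush()        # push buffered rows to disk periodically
table.flush()
h5file.close()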

astropy.fits: Manipulating image data from a fits Table? (e.g., 3072R x 2C)

I'm currently having a little issue with a fits file. The data is in table format, a format I haven't previously used. I'm a python user, and rely heavily on astropy.fits to manipulate fits images. A quick output of the info gives:
No. Name Type Cards Dimensions Format
0 PRIMARY PrimaryHDU 60 ()
1 BinTableHDU 29 3072R x 2C [1024E, 1024E]
The header for the BinTableHDU is as follows:
XTENSION= 'BINTABLE' /Written by IDL: Mon Jun 22 23:28:21 2015
BITPIX = 8 /
NAXIS = 2 /Binary table
NAXIS1 = 8192 /Number of bytes per row
NAXIS2 = 3072 /Number of rows
PCOUNT = 0 /Random parameter count
GCOUNT = 1 /Group count
TFIELDS = 2 /Number of columns
TFORM1 = '1024E ' /Real*4 (floating point)
TFORM2 = '1024E ' /Real*4 (floating point)
TTYPE1 = 'COUNT_RATE' /
TUNIT1 = '1e-6cts/s/arcmin^2' /
TTYPE2 = 'UNCERTAINTY' /
TUNIT2 = '1e-6cts/s/arcmin^2' /
HISTORY g000m90r1b120pm.fits created on 10/08/97. PI channel range: 8: 19
PIXTYPE = 'HEALPIX ' / HEALPIX pixelisation
ORDERING= 'NESTED ' / Pixel ordering scheme, either RING or NESTED
NSIDE = 512 / Healpix resolution parameter
NPIX = 3145728 / Total number of pixels
OBJECT = 'FULLSKY ' / Sky coverage, either FULLSKY or PARTIAL
FIRSTPIX= 0 / First pixel # (0 based)
LASTPIX = 3145727 / Last pixel # (zero based)
INDXSCHM= 'IMPLICIT' / indexing : IMPLICIT or EXPLICIT
GRAIN = 0 / GRAIN = 0: No index,
COMMENT GRAIN =1: 1 pixel index for each pixel,
COMMENT GRAIN >1: 1 pixel index for Grain consecutive pixels
BAD_DATA= -1.63750E+30 / Sentinel value given to bad pixels
COORDSYS= 'G ' / Pixelization coordinate system
COMMENT G = Galactic, E = ecliptic, C = celestial = equatorial
END
I'd like to access the fits image which is stored within the TTYPE labeled 'COUNT_RATE', and then have it in a format I can add to other count-rate arrays with the same dimensions.
I started with my usual procedure for opening a fits file:
hdulist_RASS_SXRB_R1 = fits.open('/Users/.../RASS_SXRB_R1.fits')
hdulist_RASS_SXRB_R1.info()
image_XRAY_SKYVIEW_R1 = hdulist_RASS_SXRB_R1[1].data
image_XRAY_SKYVIEW_R1 = numpy.array(image_XRAY_SKYVIEW_R1)
image_XRAY_SKYVIEW_header_R1 = hdulist_RASS_SXRB_R1[1].header
But this is coming back with IndexError: too many indices for array. I've had a look at accessing table data in the astropy documentation here (Accessing data stored as a table in a multi-extension FITS (MEF) file)
If anyone has a tried and tested method for accessing such images from a fits table I'd be very grateful! Many thanks.
I can't be sure without seeing the full traceback but I think the exception you're getting is from this:
image_XRAY_SKYVIEW_R1 = numpy.array(image_XRAY_SKYVIEW_R1)
There's no reason to manually wrap numpy.array() around the array. It's already a Numpy array. But in this case it's a structured array (see http://docs.scipy.org/doc/numpy/user/basics.rec.html).
@Andromedae93's answer is the right one. But also, for general documentation on this, see: http://docs.astropy.org/en/stable/io/fits/index.html#working-with-table-data
However, the way you're working (which is fine for images) of manually calling fits.open, accessing the .data attribute of the HDU, etc. is fairly low level, and Numpy structured arrays are good at representing tables, but not great for manipulating them.
You're better off generally using Astropy's higher-level Table interface. A FITS table can be read directly into an Astropy Table object with Table.read(): http://docs.astropy.org/en/stable/io/unified.html#fits
The only reason the same thing doesn't exist for FITS images is that there's no generic "Image" class yet.
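A minimal sketch of that higher-level route, using the file from the question (the column name comes from the TTYPE1 card in the header; the path is shortened to just the filename):

from astropy.table import Table

t = Table.read('RASS_SXRB_R1.fits', hdu=1)
count_rate = t['COUNT_RATE']   # one 1024-element float32 vector per row (3072 rows)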
I used astropy.io.fits during my internship in astrophysics, and this is my process to open a .fits file and perform some operations:
# Opening the .fits file, which is named SMASH.fits
field = fits.open('SMASH.fits')

# Reading the FITS table data
tbdata = field[1].data
Now, with this kind of method, tbdata is a numpy array and you can do lots of things with it.
For example, if you have data like:
ID, Name, Object
1, HD 1527, Star
2, HD 7836, Star
3, NGC 6739, Galaxy
If you want to print the data of one column:
Data_name = tbdata['Name']
You will get:
HD 1527
HD 7836
NGC 6739
I don't know exactly what you want to do with your data, but I can help you ;)
