So my question is how I should save a large amount of simulation data to a file using Python (or update an existing file with new data rows).
Let's say I have NN=1000 particles, and I want to save the position and velocity data of each particle (x y z, vx vy vz). The data is in the format [x1,y1,z1,vx1,vy1,vz1, x2,y2,z2,vx2,vy2,vz2, ...] and so on.
The simulation itself works well, but I believe the methods I use for saving and keeping this information stored are not really optimal.
Pseudocode similar to my code:
import numpy as np

T_max = 1000  # for example
dt = 0.1      # time step
T = 0         # current time
iterations = int(T_max/dt)  # number of iterations we are doing
NN = 1000     # number of particles

ZZ = np.zeros((iterations, 2 + NN*6))  # Here I generate the whole data matrix at the beginning.
# ^ might not be the best idea, as the system needs to keep everything in memory the whole time
# So I guess saving could be done in chunks?

ZZ[0][0], ZZ[0][1] = T, dt
# ZZ[0][2:] = initialize_system(NN=NN)  # so let's initialize the system.
# However, for this post I do this differently for simplicity. See below.
ZZ[0][2:] = np.random.uniform(-100, 100, NN*6)

i = 0
while i < iterations - 1:
    T += dt
    ZZ[i+1][0], ZZ[i+1][1] = T, dt
    # ZZ[i+1][2:] = rk4(EOM_function, posvel=ZZ[i][2:])
    # ^ Using this I would calculate new positions based on the previous ones.
    ZZ[i+1][2:] = np.random.uniform(-100, 100, NN*6)  # This is just for the example here.
    i += 1

# Now the simulation data is basically done, so one would need to save it.

# This one feels slow, as it takes 181 s to save and the file size is 1046246 KB
np.savetxt('test1.txt', ZZ)

# Other method, with a bit less accuracy as I don't need all decimals saved
np.savetxt('test2.txt', ZZ, fmt='%1.6f')  # Takes 125 s and the size is 426698 KB

# Both of the above are kind of slow, so I also tried saving to npy format
np.save('test.npy', ZZ)  # It took 8.9 s and the size is 164118 KB
So this np.save() method seems fast, but I read somewhere that I cannot append data to it. So it would not work if I keep saving the data in parts while calculating new positions.
So back to my question: how should/could I save the data efficiently (fast and memory friendly)? I keep running into memory issues when NN and T_max get larger, because with this method I keep the whole ZZ in memory all the time.
So I guess I should calculate ZZ in parts, i.e. in iterations/10 parts, but then I would need to append this data to an existing file, and the tests I have made felt slow. Any suggestions?
EDIT: feel free to ask more clarifying questions, as I feel like I forgot to explain something.
That highly depends on what you intend to use the output for. If it's stored for further calculations, .npy or some other binary format is always the way to go, as it is faster, takes less space, and doesn't lose precision between loads and saves, unlike serializing into a human-readable format. If you need it to be readable, you might as well just output row by row to a CSV file or something similar.
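For the readable route, appending chunk by chunk keeps only the current chunk in memory. A minimal sketch (the file name and the random chunk are placeholders standing in for one computed block of rows, matching the layout in the question):

import numpy as np

NN = 1000
chunk = np.random.uniform(-100, 100, (1000, 2 + NN*6))  # stand-in for one computed chunk

# Open in append mode so each chunk is written as soon as it is computed.
with open('test_rows.csv', 'a') as f:
    np.savetxt(f, chunk, fmt='%1.6f', delimiter=',')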
If you want to do it with binary, h5py allows you to extend a dataset after saving and append more stuff to it.
import numpy as np
import h5py

T_max = 10**4  # for example
dt = 0.1       # time step
T = 0          # current time
iterations = int(T_max/dt)  # number of iterations we are doing
NN = 1000      # number of particles
chunk_size = 10**3

ZZ = np.zeros((chunk_size, 2 + NN*6))
ZZ[0][0], ZZ[0][1] = T, dt
# ZZ[0][2:] = initialize_system(NN=NN)  # so let's initialize the system.
# However, for this post I do this differently for simplicity. See below.
ZZ[0][2:] = np.random.uniform(-100, 100, NN*6)

with h5py.File("test.h5", "a") as f:
    dset = f.create_dataset('ZZ', (0, 2 + NN*6), maxshape=(None, 2 + NN*6),
                            dtype='float64', chunks=(chunk_size, 2 + NN*6))
    for chunk in range(0, iterations, chunk_size):
        for i in range(0, chunk_size - 1):
            T += dt
            ZZ[i + 1][0], ZZ[i + 1][1] = T, dt
            # ZZ[i+1][2:] = rk4(EOM_function, posvel=ZZ[i][2:])
            # ^ Using this I would calculate new positions based on the previous ones.
            ZZ[i + 1][2:] = np.random.uniform(-100, 100, NN*6)  # This is just for the example here.

        # Expand the dataset here to make room for more data.
        dset.resize(dset.shape[0] + chunk_size, axis=0)
        dset[chunk: chunk + chunk_size] = ZZ

        # Update and initialize the next chunk. The next chunk's first row should be
        # the last row of the previous chunk plus one iteration.
        T += dt
        ZZ[0][0], ZZ[0][1] = T, dt
        # ZZ[0][2:] = rk4(EOM_function, posvel=ZZ[-1][2:])
        # ^ Using this I would calculate new positions based on the previous ones.
        ZZ[0][2:] = np.random.uniform(-100, 100, NN*6)  # This is just for the example here.

    print(dset.shape)
This takes 70 seconds for the save step on my computer, generating a 45 GB file, for a dataset that is 100 times the size of your original code's.
The above code is more general in case you are streaming your data and don't know your final size. If you know it from the start, you can replace the initial create_dataset with
dset = f.create_dataset('ZZ', (iterations,2+NN*6), dtype='float64')
and remove the dset.resize(dset.shape[0] + chunk_size, axis=0)
You'll probably also want to read it back in chunks afterwards for other processing, in which case you can follow the docs here: https://docs.h5py.org/en/latest/high/dataset.html#reading-writing-data
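For instance, a minimal reading sketch (the slab size of 1000 rows is an arbitrary choice) that pulls one slice at a time instead of loading the whole dataset:

import h5py

with h5py.File('test.h5', 'r') as f:
    dset = f['ZZ']
    for start in range(0, dset.shape[0], 1000):
        block = dset[start:start + 1000]  # only this slab is read from disk
        # ... process `block` (a plain numpy array) here ...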
Okay, so I'm continuing my question / providing a possible answer to it based on the answer of EricChen1248. EDIT: The answer provided by EricChen1248 works now and is way better than my code below. See his code.
I still do not completely understand how f.create_dataset() truly works (i.e. when it actually writes data to the file inside the loop, etc.).
Using the code provided by Eric, the data files were created and saved quickly, but when I read the file as follows
hf = h5py.File('temp/test.h5', 'r')
ZZ = np.array(hf['ZZ'])
hf.close()
and plotted the first column (the time T column, which should increase by the timestep dt after each iteration), I get the following figure
plt.plot(ZZ[:,0])
[figure: time T column plotted]
As can be seen, it grows to a time of 100 and then drops to zero. This happens after the first chunk_size rows have been passed. I started to read the docs provided by Eric, and using his code as a reference I managed to write something like this
import numpy as np
import h5py

T_max = 10**4
dt = 0.1
T = 0
NN = 1000
iterations = int(T_max/dt)
chunk_size = 10**3

with h5py.File('temp/data12.h5', 'a') as hf:
    dset = hf.create_dataset("ZZ", (chunk_size, 2 + NN*6), maxshape=(None, 2 + NN*6),
                             chunks=(chunk_size, 2 + NN*6), dtype='f8')
    # ^ First I create a dataset equal to one chunk_size.

    # Here I initialize the system. Columns: 0=T, 1=dt, 2=arbitrary data point, 3=sin(column 2);
    # all the rest of the columns are just random numbers to fill in some values.
    dset[0, 0], dset[0, 1] = T, dt
    # dset[0, 2:] = np.random.uniform(0, 1, NN*6)
    dset[0, 2] = 1
    dset[0, 3] = np.sin(dset[0, 2])
    dset[0, 4:] = np.random.uniform(0, 1, NN*6 - 2)

    print('starts')

    # The main difference below is that I use the dataset (dset)
    # as the data matrix to be filled, instead of the matrix ZZ as in my question.
    i = 0
    # for j, s in enumerate(dset.iter_chunks()):
    for j, s in enumerate(range(0, iterations, chunk_size)):
        print(j, s)
        while i < iterations and i < chunk_size*(j+1) - 1:
            # for i in range(chunk_size*j, chunk_size*(j+1)-1):
            T += dt
            dset[i+1, 0], dset[i+1, 1] = T, dt
            # dset[i+1, 2:] = np.sin(dset[i, 2:] + dt)
            dset[i+1, 2] = dset[i, 2] + dt
            dset[i+1, 3] = np.sin(dset[i, 2] + dt)
            dset[i+1, 4:] = dset[i, 4:] + np.random.uniform(-1, 1, NN*6 - 2)
            i += 1

        print(dset.shape)
        dset.resize(dset.shape[0] + chunk_size, axis=0)
This code runs in 1 min 50 s and saves a file of 4.47 GB, so I am happy with the speed, and what I'm really happy about is that it does not use much memory while iterating (I used to run into problems with huge RAM usage).
When I read the data file produced by my code (in the same way as above), the time T column grows nicely to T = 10^4 as it should [figure: time T column plotted, my code version]. However, it still generates one extra chunk_size block at the end of the dataset, which is full of zeros and which I need to get rid of. One more piece of proof that the code works and saves the data without weird problems is this sinusoidal plot, plt.plot(ZZ[500:1500,0], ZZ[500:1500,3]) [figure: sinusoidal proof]. Note that the plot is limited to T ~ [50, 150] so one can still see something there (if the whole thing were plotted, one could not see the lines well).
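One way to drop that trailing all-zero block, a minimal sketch assuming it runs right after the loop above while the file is still open, is simply to shrink the dataset back to the rows actually written:

    # Shrink back to the number of rows actually written; chunked datasets
    # with maxshape=(None, ...) can be shrunk as well as grown.
    dset.resize(iterations, axis=0)
    print(dset.shape)  # should now be (iterations, 2 + NN*6)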
I believe this is not the best way to write this code, but it is the way I got it working. So if someone sees improvements, please let me know. Also, I am curious to know why the code provided by Eric did not work, at least for me.
EDIT : fixed typos
Following suggestions in an SO post, I also found PyTables' append to be exceptionally time efficient. However, in my case the output file (earray.h5) is huge. Is there a way to append the data such that the output file is not as huge? For example, in my case (see the link below) a 13 GB input file (dset_1: 2.1E8 x 4 and dset_2: 2.1E8 x 4) gives a 197 GB output file with just one column (2.5E10 x 1). All elements are float64.
I want to reduce the output file size such that the execution speed of the script is not compromised and reading the output file is also efficient for later use. Can saving the data along columns rather than just rows help? Any suggestions on this? Given below is an MWE.
Output and input files' details here
import h5py
import tables

# no. of chunks from dset-1 and dset-2 in inp.h5
loop_1 = 40
loop_2 = 20

# save to disk after these many rows
app_len = 10**6

# **********************************************
# Grabbing input.h5 file
# **********************************************
filename = 'inp.h5'
f2 = h5py.File(filename, 'r')
chunks1 = f2['dset_1']
chunks2 = f2['dset_2']
shape1, shape2 = chunks1.shape[0], chunks2.shape[0]

f1 = tables.open_file("table.h5", "w")
a = f1.create_earray(f1.root, "dataset_1", atom=tables.Float64Atom(), shape=(0, 4))

size1 = shape1//loop_1
size2 = shape2//loop_2

# ***************************************************
# Grabbing chunks to process and append data
# ***************************************************
for c in range(loop_1):
    h = c*size1
    # grab chunks from dset_1 of inp.h5
    chunk1 = chunks1[h:(h + size1)]
    for d in range(loop_2):
        g = d*size2
        chunk2 = chunks2[g:(g + size2)]  # grab chunks from dset_2 of inp.h5
        r1 = chunk1.shape[0]
        r2 = chunk2.shape[0]
        left, right = 0, 0
        for j in range(r1):  # grab col. 2 values from dataset-1
            e1 = chunk1[j, 1]
            # ...algebraic operations here to output a row containing 4 float64 values
            # ...append to a (earray) when the number of rows reaches a million
        del chunk2
    del chunk1

f2.close()
f1.close()
I wrote the answer you are referencing. That is a simple example that "only" writes 1.5e6 rows; I didn't do anything to optimize performance for very large files. You are creating a very large file, but did not say how many rows (obviously way more than 10**6). Here are some suggestions, based on comments in another thread.
Areas I recommend (3 related to the PyTables code, and 2 based on external utilities):
PyTables code suggestions:
Enable compression when you create the file (add the filters= parameter when you create the file). Start with tb.Filters(complevel=1).
Define the expectedrows= parameter in .create_earray() (per the PyTables docs, 'this will optimize the HDF5 B-Tree and amount of memory used'). The default value is set in tables/parameters.py (look for EXPECTED_ROWS_TABLE; it's only 10000 in my installation). I suggest you set this to a larger value if you are creating 10**6 (or more) rows.
There is a side benefit to setting expectedrows=: if you don't define chunkshape, 'a sensible value is calculated based on the expectedrows parameter'. Check the value that gets used. This won't decrease the created file size, but will improve I/O performance. (A sketch combining these settings follows this list.)
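A minimal sketch of those points together (a sketch only: the file name and node name follow the MWE, and the row estimate is approximated from the 2.5E10 rows mentioned in the question):

import tables as tb

FILTERS = tb.Filters(complevel=1)   # light compression to start with
EXPECTED_ROWS = 25_000_000_000      # rough estimate of the final row count

with tb.open_file("earray.h5", mode="w", filters=FILTERS) as h5f:
    a = h5f.create_earray(h5f.root, "dataset_1",
                          atom=tb.Float64Atom(),
                          shape=(0, 1),                 # extendable along axis 0
                          expectedrows=EXPECTED_ROWS)   # helps PyTables size the B-tree and chunks
    print(a.chunkshape)  # check the chunkshape PyTables calculated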
If you didn't use compression when you created the file, there are 2 methods to compress existing files:
External Utilities:
The PyTables utility ptrepack - run it against an HDF5 file to create a new file (useful to go from uncompressed to compressed, or vice versa). It is delivered with PyTables and runs on the command line.
The HDF5 utility h5repack - works similarly to ptrepack. It is delivered with the HDF5 installer from The HDF Group.
There are trade-offs with file compression: it reduces the file size but increases access time (reduces I/O performance). I tend to use uncompressed files for the ones I open frequently (for best I/O performance). Then, when done, I convert them to a compressed format for long-term archiving. You can continue to work with them in compressed format (the API handles it cleanly).
I'm looking for a data file structure that enables fast reading of random data samples for deep learning, and have been experimenting with lmdb today. However, one thing that seems surprising to me is how inefficiently it seems to store the data.
I have an ASCII file that is around 120 GB with gene sequences.
Now, I would have expected to be able to fit this data in an lmdb database of roughly the same size, or perhaps even a bit smaller, since ASCII is a highly inefficient storage method.
However, what I'm seeing seems to suggest that I need around 350 GB to store this data in an lmdb file, and I simply don't understand that.
Am I not utilizing some setting correctly, or what exactly am I doing wrong here?
import time
import lmdb
import pyarrow as pa

def dumps_pyarrow(obj):
    """
    Serialize an object.

    Returns:
        Implementation-dependent bytes-like object.
    """
    return pa.serialize(obj).to_buffer()

t0 = time.time()
filepath = './../../Uniparc/uniparc_active/uniparc_active.fasta'
output_file = './../data/out_lmdb.fasta'
write_freq = 100000
start_line = 2
nprot = 0

db = lmdb.open(output_file, map_size=1e9, readonly=False,
               meminit=False, map_async=True)
txn = db.begin(write=True)

with open(filepath) as fp:
    line = fp.readline()
    cnt = 1
    protein = ''
    while line:
        if cnt >= start_line:
            if line[0] == '>':  # Old protein finished, new protein starting on the next line
                txn.put(u'{}'.format(nprot).encode('ascii'), dumps_pyarrow((protein)))
                nprot += 1
                if nprot % write_freq == 0:
                    t1 = time.time()
                    print("committing... nprot={} ,time={:2.2f}".format(nprot, t1 - t0))
                    txn.commit()
                    txn = db.begin(write=True)
                    line_checkpoint = cnt
                protein = ''
            else:
                protein += line.strip()
        line = fp.readline()
        cnt += 1

txn.commit()
keys = [u'{}'.format(k).encode('ascii') for k in range(nprot + 1)]
with db.begin(write=True) as txn:
    txn.put(b'__keys__', dumps_pyarrow(keys))
    txn.put(b'__len__', dumps_pyarrow(len(keys)))

print("Flushing database ...")
db.sync()
db.close()
t2 = time.time()
print("All done, time taken {:2.2f}s".format(t2 - t0))
Edit:
Some additional information about the data:
In the 120 GB file the data is structured like this (Here I am showing the first 2 proteins):
>UPI00001E0F7B status=inactive
YPRSRSQQQGHHNAAQQAHHPYQLQHSASTVSHHPHAHGPPSQGGPGGPGPPHGGHPHHP
HHGGAGSGGGSGPGSHGGQPHHQKPRRTASQRIRAATAARKLHFVFDPAGRLCYYWSMVV
SMAFLYNFWVIIYRFAFQEINRRTIAIWFCLDYLSDFLYLIDILFHFRTGYLEDGVLQTD
ALKLRTHYMNSTIFYIDCLCLLPLDFLYLSIGFNSILRSFRLVKIYRFWAFMDRTERHTN
YPNLFRSTALIHYLLVIFHWNGCLYHIIHKNNGFGSRNWVYHDSESADVVKQYLQSYYWC
TLALTTIGDLPKPRSKGEYVFVILQLLFGLMLFATVLGHVANIVTSVSAARKEFQGESNL
RRQWVKVVWSAPASG
>UPI00001E0FBF status=active
MWRAQPSLWIWWIFLILVPSIRAVYEDYRLPRSVEPLHYNLRILTHLNSTDQRFEGSVTI
DLLARETTKNITLHAAYLKIDENRTSVVSGQEKFGVNRIEVNEVHNFYILHLGRELVKDQ
IYKLEMHFKAGLNDSQSGYYKSNYTDIVTKEVHHLAVTQFSPTFARQAFPCFDEPSWKAT
FNITLGYHKKYMGLSGMPVLRCQDHDSLTNYVWCDHDTLLRTSTYLVAFAVHDLENAATE
ESKTSNRVIFRNWMQPKLLGQEMISMEIAPKLLSFYENLFQINFPLAKVDQLTVPTHRFT
AMENWGLVTYNEERLPQNQGDYPQKQKDSTAFTVAHEYAHQWFGNLVTMNWWNDLWLKEG
PSTYFGYLALDSLQPEWRRGERFISRDLANFFSKDSNATVPAISKDVKNPAEVLGQFTEY
VYEKGSLTIRMLHKLVGEEAFFHGIRSFLERFSFGNVAQADLWNSLQMAALKNQVISSDF
NLSRAMDSWTLQGGYPLVTLIRNYKTGEVTLNQSRFFQEHGIEKASSCWWVPLRFVRQNL
PDFNQTTPQFWLECPLNTKVLKLPDHLSTDEWVILNPQVATIFRVNYDEHNWRLIIESLR
NDPNSGGIHKLNKAQLLDDLMALAAVRLHKYDKAFDLLEYLKKEQDFLPWQRAIGILNRL
GALLNVAEANKFKNYMQKLLLPLYNRFPKLSGIREAKPAIKDIPFAHFAYSQACRYHVAD
CTDQAKILAITHRTEGQLELPSDFQKVAYCSLLDEGGDAEFLEVFGLFQNSTNGSQRRIL
ASALGCVRNFGNFEQFLNYTLESDEKLLGDCYMLAVKSALNREPLVSPTANYIISHAKKL
GEKFKKKELTGLLLSLAQNLRSTEEIDRLKAQLEDLKEFEEPLKKALYQGKMNQKWQKDC
SSDFIEAIEKHL
When I store the data in the database, I concatenate all the lines making up each protein and store them as a single data point. I ignore the header line (the line starting with >).
The reason I believe the data should take less space when stored in the database is that I expect it to be stored in some binary form, which I would expect to be more compact - though I admit I don't know whether that is how it actually works (for comparison, the data is only 70 GB when compressed/zipped).
I would be okay with the data taking up a similar amount of space in lmdb format, but I don't understand why it should take up almost 3 times the space it does in ASCII format.
LMDB does not implement any sort of compression on data: 1 byte in memory is 1 byte on disk.
But its internals can amplify the space required:
data handling is done in pages (generally 4 KB)
each "record" needs to store additional structures for the B-tree (for the key and the data) plus the count of pages occupied by the key and data
Bottom line: LMDB is designed for FAST data access, not to save space.
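LMDB will happily store values you have compressed yourself, though. If size matters more than raw read speed, one option is to compress each protein before putting it; a minimal sketch (zlib and this key handling are illustrative choices, not anything LMDB requires, and they replace the pyarrow serialization used in the question):

import zlib

def put_protein(txn, key, protein):
    # Compress the ASCII sequence before handing it to LMDB.
    txn.put(key, zlib.compress(protein.encode('ascii')))

def get_protein(txn, key):
    return zlib.decompress(txn.get(key)).decode('ascii')

The trade-off is extra CPU time on every read and write, which works against the fast-random-access goal.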
I have a function which I want to compute in parallel using multiprocessing. The function takes an argument, but also loads subsets from two very large dataframes which have already been loaded into memory (one is about 1 GB and the other is just over 6 GB).
import multiprocessing
import pandas as pd

largeDF1 = pd.read_csv(directory + 'name1.csv')
largeDF2 = pd.read_csv(directory + 'name2.csv')

def f(x):
    load_content1 = largeDF1.loc[largeDF1['FirstRow'] == x]
    load_content2 = largeDF2.loc[largeDF2['FirstRow'] == x]
    # some computation happens here
    new_data.to_csv(directory + 'output.csv', index=False)

def main():
    multiprocessing.set_start_method('spawn', force=True)
    pool = multiprocessing.Pool(processes=multiprocessing.cpu_count())
    input = input_data['col']
    pool.map_async(f, input)
    pool.close()
    pool.join()
The problem is that the files are too big, and when I run the function over multiple cores I get memory issues. I want to know if there is a way for the loaded files to be shared across all processes.
I have tried Manager() but could not get it to work. Any help is appreciated. Thanks.
If you were running this on a UNIX-like system (which uses the fork start method by default), the data would be shared out of the box. Most operating systems use copy-on-write for memory pages, so even if you fork a process several times, the processes would share most of the memory pages that contain the dataframes, as long as you don't modify those dataframes.
But when using the spawn start method, each worker process has to load the dataframes. I'm not sure if the OS is smart enough in that case to share the memory pages, or indeed whether these spawned processes would all have the same memory layout.
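For completeness, a minimal sketch of the fork route (UNIX-only; the CSV paths, the pool inputs, and the placeholder computation are illustrative, not the code from the question):

import multiprocessing as mp
import pandas as pd

# Loaded once in the parent, before any workers exist; forked children see
# these pages via copy-on-write instead of reloading the CSVs.
largeDF1 = pd.read_csv('name1.csv')
largeDF2 = pd.read_csv('name2.csv')

def f(x):
    sub1 = largeDF1.loc[largeDF1['FirstRow'] == x]
    sub2 = largeDF2.loc[largeDF2['FirstRow'] == x]
    return len(sub1) + len(sub2)  # placeholder for the real computation

if __name__ == '__main__':
    mp.set_start_method('fork')   # the default on Linux; not available on Windows
    with mp.Pool(processes=mp.cpu_count()) as pool:
        results = pool.map(f, ['a', 'b', 'c'])  # hypothetical inputs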
The only portable solution I can think of would be to leave the data on disk and use mmap in the workers to map it into memory read-only. That way the OS would notice that multiple processes are mapping the same file, and it would only load one copy.
The downside is that the data would be in memory in on-disk CSV format, which makes reading data from it (without making a copy!) less convenient. So you might want to prepare the data beforehand into a form that is easier to use, e.g. convert the data from 'FirstRow' into a binary file of float or double values that you can iterate over with struct.iter_unpack.
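A minimal sketch of that idea (the column dump and the file name are assumptions for illustration; PROT_READ is POSIX-only, as in the example below):

import mmap
import struct

import numpy as np
import pandas as pd

# One-time preparation: dump the column as raw float64 values.
pd.read_csv('name1.csv')['FirstRow'].to_numpy(dtype=np.float64).tofile('firstrow.bin')

# In each worker: map the file read-only and iterate over it without copying.
with open('firstrow.bin', 'rb') as fileobj:
    with mmap.mmap(fileobj.fileno(), 0, prot=mmap.PROT_READ) as mm:
        for (value,) in struct.iter_unpack('d', mm):
            pass  # use `value` here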
The function below (from my statusline script) uses mmap to count the number of messages in a mailbox file.
import os
import mmap

def mail(storage, mboxname):
    """
    Report unread mail.

    Arguments:
        storage: a dict with keys (unread, time, size) from the previous call or an empty dict.
            This dict will be *modified* by this function.
        mboxname (str): name of the mailbox to read.

    Returns: A string to display.
    """
    stats = os.stat(mboxname)
    if stats.st_size == 0:
        return 'Mail: 0'
    # When mutt modifies the mailbox, it seems to only change the
    # ctime, not the mtime! This is probably related to how mutt saves the
    # file. See also stat(2).
    newtime = stats.st_ctime
    newsize = stats.st_size
    if not storage or newtime > storage['time'] or newsize != storage['size']:
        with open(mboxname) as mbox:
            with mmap.mmap(mbox.fileno(), 0, prot=mmap.PROT_READ) as mm:
                start, total = 0, 1  # First mail is not found; it starts on the first line...
                while True:
                    rv = mm.find(b'\n\nFrom ', start)
                    if rv == -1:
                        break
                    else:
                        total += 1
                        start = rv + 7
                start, read = 0, 0
                while True:
                    rv = mm.find(b'\nStatus: R', start)
                    if rv == -1:
                        break
                    else:
                        read += 1
                        start = rv + 10
        unread = total - read
        # Save values for the next run.
        storage['unread'], storage['time'], storage['size'] = unread, newtime, newsize
    else:
        unread = storage['unread']
    return f'Mail: {unread}'
In this case I used mmap because it was 4x faster than just reading the file. See normal reading versus using mmap.
I have a 120 GB file saved (in binary via pickle) that contains about 50,000 (600x600) 2d numpy arrays. I need to stack all of these arrays using a median. The easiest way to do this would be to simply read in the whole file as a list of arrays and use np.median(arrays, axis=0). However, I don't have much RAM to work with, so this is not a good option.
So, I tried to stack them pixel-by-pixel, as in I focus on one pixel position (i, j) at a time, then read in each array one by one, appending the value at the given position to a list. Once all the values for a certain position across all arrays are saved, I use np.median and then just have to save that value in a list -- which in the end will have the medians of each pixel position. In the end I can just reshape this to 600x600, and I'll be done. The code for this is below.
import pickle
import time
import numpy as np

filename = 'images.dat'  # contains my 50,000 2D numpy arrays

def stack_by_pixel(i, j):
    pixels_at_position = []
    with open(filename, 'rb') as f:
        while True:
            try:
                # Gather pixels at a given position
                array = pickle.load(f)
                pixels_at_position.append(array[i][j])
            except EOFError:
                break
    # Stacking at position (median)
    stacked_at_position = np.median(np.array(pixels_at_position))
    return stacked_at_position

# Form whole stacked image
stacked = []
for i in range(600):
    for j in range(600):
        t1 = time.time()
        stacked.append(stack_by_pixel(i, j))
        t2 = time.time()
        print('Done with element %d, %d: %f seconds' % (i, j, (t2-t1)))

stacked_image = np.reshape(stacked, (600, 600))
After seeing some of the time printouts, I realized that this is wildly inefficient. Each completion of a position (i, j) takes about 150 seconds or so, which is not surprising since it reads about 50,000 arrays one by one. And given that there are 360,000 (i, j) positions in my large arrays, this is projected to take 22 months to finish! Obviously that isn't feasible. But I'm sort of at a loss, because there's not enough RAM available to read in the whole file. Or maybe I could save all the pixel positions at once (a separate list for each position) while opening the arrays one by one, but wouldn't keeping 360,000 lists (each about 50,000 elements long) in Python use a lot of RAM as well?
Any suggestions are welcome for how I could make this run significantly faster without using a lot of RAM. Thanks!
This is a perfect use case for numpy's memory mapped arrays.
Memory mapped arrays allow you to treat a .npy file on disk as though it were loaded in memory as a numpy array, without actually loading it. It's as simple as
arr = np.load('filename', mmap_mode='r')
For the most part you can treat this as any other array. Array elements are only loaded into memory as required. Unfortunately, some quick experimentation suggests that median doesn't handle memory-mapped arrays well*; it still seems to load a substantial portion of the data into memory at once. So median(arr, 0) may not work.
However, you can still loop over each index and calculate the median without running into memory issues.
[[np.median([arr[k][i][j] for k in range(50000)]) for i in range(600)] for j in range(600)]
where 50,000 reflects the total number of arrays.
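If that triple loop is too slow, a variant (a sketch only; the shape and dtype follow the question and the open_memmap call below) is to loop over rows only, so numpy still vectorizes over the remaining axes while pulling in just one 50,000 x 600 slab at a time:

import numpy as np

arr = np.load('filename', mmap_mode='r')  # shape (50000, 600, 600)
median_image = np.empty((600, 600), dtype=np.float32)
for i in range(600):
    # arr[:, i, :] materializes one 50000 x 600 slab (~120 MB as float32)
    # instead of the full ~120 GB array.
    median_image[i] = np.median(arr[:, i, :], axis=0)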
Without the overhead of unpickling every array just to extract a single pixel, the run time should be much quicker (by about 360,000 times).
Of course, that leaves the problem of creating a .npy file containing all of the data. Such a file can be created as follows:
arr = np.lib.format.open_memmap(
    'filename',              # File to store in
    mode='w+',               # Specify to create the file and write to it
    dtype=np.float32,        # Change this to your data's type
    shape=(50000, 600, 600)  # Shape of the resulting array
)
Then, load the data as before and store it into the array (which just writes it to disk behind the scenes).
idx = 0
with open(filename, 'rb') as f:
    while True:
        try:
            arr[idx] = pickle.load(f)
            idx += 1
        except EOFError:
            break
Give it a couple hours to run, then head back to the start of this answer to see how to load it and take the median. Can't be any simpler**.
*I just tested it on a 7 GB file, taking the median of 1,500 samples of 5,000,000 elements, and memory usage was around 7 GB, suggesting the entire array may have been loaded into memory. It doesn't hurt to try this way first, though. If anyone else has experience with median on memmapped arrays, feel free to comment.
** If you believe strangers on the internet.
Note: I use Python 2.x, porting this to 3.x shouldn't be difficult.
My idea is simple - disk space is plentiful, so let's do some preprocessing and turn that big pickle file into something that is easier to process in small chunks.
Preparation
In order to test this, I wrote a small script that generates a pickle file resembling yours. I assumed your input images are grayscale with 8-bit depth, and generated 10,000 random images using numpy.random.randint.
This script will act as a benchmark that we can compare the preprocessing and processing stages against.
import numpy as np
import pickle
import time

IMAGE_WIDTH = 600
IMAGE_HEIGHT = 600
FILE_COUNT = 10000

t1 = time.time()

with open('data/raw_data.pickle', 'wb') as f:
    for i in range(FILE_COUNT):
        data = np.random.randint(256, size=IMAGE_WIDTH*IMAGE_HEIGHT, dtype=np.uint8)
        data = data.reshape(IMAGE_HEIGHT, IMAGE_WIDTH)
        pickle.dump(data, f)
        print i,

t2 = time.time()
print '\nDone in %0.3f seconds' % (t2 - t1)
In a test run this script completed in 372 seconds, generating a ~10 GB file.
Preprocessing
Let's split the input images on a row-by-row basis -- we will have 600 files, where file N contains row N from each input image. We can store the row data in binary using numpy.ndarray.tofile (and later load those files using numpy.fromfile).
import numpy as np
import pickle
import time

# Increase open file limit
# See https://stackoverflow.com/questions/6774724/why-python-has-limit-for-count-of-file-handles
import win32file
win32file._setmaxstdio(1024)

IMAGE_WIDTH = 600
IMAGE_HEIGHT = 600
FILE_COUNT = 10000

t1 = time.time()

outfiles = []
for i in range(IMAGE_HEIGHT):
    outfilename = 'data/row_%03d.dat' % i
    outfiles.append(open(outfilename, 'wb'))

with open('data/raw_data.pickle', 'rb') as f:
    for i in range(FILE_COUNT):
        data = pickle.load(f)
        for j in range(IMAGE_HEIGHT):
            data[j].tofile(outfiles[j])
        print i,

for i in range(IMAGE_HEIGHT):
    outfiles[i].close()

t2 = time.time()
print '\nDone in %0.3f seconds' % (t2 - t1)
In a test run, this script completed in 134 seconds, generating 600 files of 6 million bytes each. It used ~30 MB of RAM.
Processing
Simple, just load each array using numpy.fromfile, then use numpy.median to get per-column medians, reducing it back to a single row, and accumulate such rows in a list.
Finally, use numpy.vstack to reassemble a median image.
import numpy as np
import time

IMAGE_WIDTH = 600
IMAGE_HEIGHT = 600

t1 = time.time()

result_rows = []
for i in range(IMAGE_HEIGHT):
    outfilename = 'data/row_%03d.dat' % i
    data = np.fromfile(outfilename, dtype=np.uint8).reshape(-1, IMAGE_WIDTH)
    median_row = np.median(data, axis=0)
    result_rows.append(median_row)
    print i,

result = np.vstack(result_rows)
print result

t2 = time.time()
print '\nDone in %0.3f seconds' % (t2 - t1)
In a test run, this script completed in 74 seconds. You could even parallelize it quite easily (see the sketch below), but it doesn't seem to be worth it. The script used ~40 MB of RAM.
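For reference, a rough sketch of that parallel variant (assuming the row files produced by the preprocessing step above; one worker per row file):

import multiprocessing

import numpy as np

IMAGE_WIDTH = 600
IMAGE_HEIGHT = 600

def median_of_row(i):
    # Each worker loads one row file and reduces it to a single median row.
    data = np.fromfile('data/row_%03d.dat' % i, dtype=np.uint8).reshape(-1, IMAGE_WIDTH)
    return np.median(data, axis=0)

if __name__ == '__main__':
    pool = multiprocessing.Pool()
    result = np.vstack(pool.map(median_of_row, range(IMAGE_HEIGHT)))
    pool.close()
    pool.join()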
Given how both of those scripts are linear, the time used should scale linearly as well. For 50,000 images, this is about 11 minutes for preprocessing and 6 minutes for the final processing. This is on an i7-4930K @ 3.4 GHz, using 32-bit Python on purpose.