Optimizing my large data code with little RAM

Optimizing my large data code with little RAM - python

I have a 120 GB file saved (in binary via pickle) that contains about 50,000 (600x600) 2d numpy arrays. I need to stack all of these arrays using a median. The easiest way to do this would be to simply read in the whole file as a list of arrays and use np.median(arrays, axis=0). However, I don't have much RAM to work with, so this is not a good option.
So, I tried to stack them pixel-by-pixel, as in I focus on one pixel position (i, j) at a time, then read in each array one by one, appending the value at the given position to a list. Once all the values for a certain position across all arrays are saved, I use np.median and then just have to save that value in a list -- which in the end will have the medians of each pixel position. In the end I can just reshape this to 600x600, and I'll be done. The code for this is below.
import pickle
import time
import numpy as np
filename = 'images.dat' #contains my 50,000 2D numpy arrays
def stack_by_pixel(i, j):
pixels_at_position = []
with open(filename, 'rb') as f:
while True:
try:
# Gather pixels at a given position
array = pickle.load(f)
pixels_at_position.append(array[i][j])
except EOFError:
break
# Stacking at position (median)
stacked_at_position = np.median(np.array(pixels_at_position))
return stacked_at_position
# Form whole stacked image
stacked = []
for i in range(600):
for j in range(600):
t1 = time.time()
stacked.append(stack_by_pixel(i, j))
t2 = time.time()
print('Done with element %d, %d: %f seconds' % (i, j, (t2-t1)))
stacked_image = np.reshape(stacked, (600,600))
After seeing some of the time printouts, I realize that this is wildly inefficient. Each completion of a position (i, j) takes about 150 seconds or so, which is not surprising since it is reading about 50,000 arrays one by one. And given that there are 360,000 (i, j) positions in my large arrays, this is projected to take 22 months to finish! Obviously this isn't feasible. But I'm sort of at a loss, because there's not enough RAM available to read in the whole file. Or maybe I could save all the pixel positions at once (a separate list for each position) for the arrays as it opens them one by one, but wouldn't saving 360,000 lists (that are about 50,000 elements long) in Python use a lot of RAM as well?
Any suggestions are welcome for how I could make this run significantly faster without using a lot of RAM. Thanks!

This is a perfect use case for numpy's memory mapped arrays.
Memory mapped arrays allow you to treat a .npy file on disk as though it were loaded in memory as a numpy array, without actually loading it. It's as simple as
arr = np.load('filename', mmap_mode='r')
For the most part you can treat this as any other array. Array elements are only loaded into memory as required. Unfortunately some quick experimentation suggests that median doesn't handle memmory mapped arrays well*, it still seems to load a substantial portion of the data into memory at once. So median(arr, 0) may not work.
However, you can still loop over each index and calculate the median without running into memory issues.
[[np.median([arr[k][i][j] for k in range(50000)]) for i in range(600)] for j in range(600)]
where 50,000 reflects the total number of arrays.
Without the overhead of unpickling each file just to extract a single pixel the run time should be much quicker (by about 360000 times).
Of course, that leaves the problem of creating a .npy file containing all of the data. A file can be created as follows,
arr = np.lib.format.open_memmap(
'filename', # File to store in
mode='w+', # Specify to create the file and write to it
dtype=float32, # Change this to your data's type
shape=(50000, 600, 600) # Shape of resulting array
)
Then, load the data as before and store it into the array (which just writes it to disk behind the scenes).
idx = 0
with open(filename, 'rb') as f:
while True:
try:
arr[idx] = pickle.load(f)
idx += 1
except EOFError:
break
Give it a couple hours to run, then head back to the start of this answer to see how to load it and take the median. Can't be any simpler**.
*I just tested it on a 7GB file, taking the median of 1,500 samples of 5,000,000 elements and memory usage was around 7GB, suggesting the entire array may have been loaded into memory. It doesn't hurt to try this way first though. If anyone else has experience with median on memmapped arrays feel free to comment.
** If you believe strangers on the internet.

Note: I use Python 2.x, porting this to 3.x shouldn't be difficult.
My idea is simple - disk space is plentiful, so let's do some preprocessing and turn that big pickle file into something that is easier to process in small chunks.
Preparation
In order to test this, I wrote a small script the generates a pickle file that resembles yours. I assumed your input images are grayscale and have 8bit depth, and generated 10000 random images using numpy.random.randint.
This script will act as a benchmark that we can compare the preprocessing and processing stages against.
import numpy as np
import pickle
import time
IMAGE_WIDTH = 600
IMAGE_HEIGHT = 600
FILE_COUNT = 10000
t1 = time.time()
with open('data/raw_data.pickle', 'wb') as f:
for i in range(FILE_COUNT):
data = np.random.randint(256, size=IMAGE_WIDTH*IMAGE_HEIGHT, dtype=np.uint8)
data = data.reshape(IMAGE_HEIGHT, IMAGE_WIDTH)
pickle.dump(data, f)
print i,
t2 = time.time()
print '\nDone in %0.3f seconds' % (t2 - t1)
In a test run this script completed in 372 seconds, generating ~ 10 GB file.
Preprocessing
Let's split the input images on a row-by-row basis -- we will have 600 files, where file N contains row N from each input image. We can store the row data in binary using numpy.ndarray.tofile (and later load those files using numpy.fromfile).
import numpy as np
import pickle
import time
# Increase open file limit
# See https://stackoverflow.com/questions/6774724/why-python-has-limit-for-count-of-file-handles
import win32file
win32file._setmaxstdio(1024)
IMAGE_WIDTH = 600
IMAGE_HEIGHT = 600
FILE_COUNT = 10000
t1 = time.time()
outfiles = []
for i in range(IMAGE_HEIGHT):
outfilename = 'data/row_%03d.dat' % i
outfiles.append(open(outfilename, 'wb'))
with open('data/raw_data.pickle', 'rb') as f:
for i in range(FILE_COUNT):
data = pickle.load(f)
for j in range(IMAGE_HEIGHT):
data[j].tofile(outfiles[j])
print i,
for i in range(IMAGE_HEIGHT):
outfiles[i].close()
t2 = time.time()
print '\nDone in %0.3f seconds' % (t2 - t1)
In a test run, this script completed in 134 seconds, generating 600 files, 6 million bytes each. It used ~30MB or RAM.
Processing
Simple, just load each array using numpy.fromfile, then use numpy.median to get per-column medians, reducing it back to a single row, and accumulate such rows in a list.
Finally, use numpy.vstack to reassemble a median image.
import numpy as np
import time
IMAGE_WIDTH = 600
IMAGE_HEIGHT = 600
t1 = time.time()
result_rows = []
for i in range(IMAGE_HEIGHT):
outfilename = 'data/row_%03d.dat' % i
data = np.fromfile(outfilename, dtype=np.uint8).reshape(-1, IMAGE_WIDTH)
median_row = np.median(data, axis=0)
result_rows.append(median_row)
print i,
result = np.vstack(result_rows)
print result
t2 = time.time()
print '\nDone in %0.3f seconds' % (t2 - t1)
In a test run, this script completed in 74 seconds. You could even parallelize it quite easily, but it doesn't seem to be worth it. The script used ~40MB of RAM.
Given how both of those scripts are linear, the time used should scale linearly as well. For 50000 images, this is about 11 minutes for preprocessing and 6 minutes for the final processing. This is on i7-4930K # 3.4GHz, using 32bit Python on purpose.

Related

Saving continuously generated simulation data with Python3

So my question is how I should save a large amount of simulation data to a file using Python (or update new data rows to the existing file).
Lets say I have NN=1000 particles, and I want to save the position and velocity data of each particle (x y z, vx vy vz). The data is in format [x1,y1,z1,vx1,vy1,vz1, x2,y2,z2,vx2,vy2,vz2, ...] and so on.
Simulation is working well, but I believe the methods I use for saving and keeping these information saved is not really optimal for me.
Pseudo code similar to my code
T_max = 1000 # for example
dt = 0.1 # time step
T = 0 # current time
iterations = int(T_max/dt) # number of iterations we are doing
NN = 1000 # Number of particles
ZZ = np.zeros( (iterations, 2+NN*6 ) ) # Here I generate whole data matrix at the beginning.
# ^ might not be the best idea as the system needs to keep everything in memory for the whole time
# So I guess saving could be done in chunks?
ZZ[0][0], ZZ[0][1] = T , dt
# ZZ[0][2:] = initialize_system(NN=NN) # so lets initialize the system.
# However, for this post I do this differently due to simplicity. See below
ZZ[0][2:] = np.random.uniform(-100,100,NN*6)
i = 0
while i < iteration:
T += dt
Z[i+1][0], Z[i+1][1] = T, dt
#Z[i+1][2:] = rk4(EOM_function, posvel=Z[i][2:])
# ^ Using this I would calculate new positions based on previous ones.
Z[i+1][2:] = np.random.uniform(-100,100,NN*6) #This is just for example here.
i += 1
# Now the simulation data is basically done, so one would need to save
# This one feels slow, as it takes 181s to save and is size of 1046246KB
np.savetxt('test1.txt', ZZ)
#other method with a bit less accuracy as I don't need to have all decimals saved
np.savetxt('test2.txt', ZZ, fmt='%1.6f') # Takes 125s and size is 426698KB
# Both of the above are kinda slow so I also tried to save to npy format
np.save('test.npy', ZZ) # It took 8.9s and size 164118KB
so this np.save() method seems to be fast, but I read somewhere that I can not append data to it. So this would not work if I keep saving the data in parts while calculating new positions.
So back to my question. How should/could I save the data efficiently (fast and memory friendly). I keep having some memory issues when NN and T_max gets larger because with this method I keep this whole ZZ all the time in memory.
So I guess I should calculate ZZ in parts, i.e. iterations/10 parts but then I should append this data to an existing file, and tests I have made felt slow. Any suggestions?
EDIT: feel free to ask more specifying questions as I feel like I forgot to explain something.

That highly depends on what you intend to use the output for. If it's stored for further calculations, .npy or some other binary format is always the way to go as it is faster, takes less space, and doesn't lose precision between loads and saves, instead of serializing it into a human readable format. If you need it to be readable, you might as well just output row by row to a csv file or something.
If you want to do it with binary, h5py allows you to extend a dataset after saving and append more stuff to it.
import numpy as np
import h5py
T_max = 10**4 # for example
dt = 0.1 # time step
T = 0 # current time
iterations = int(T_max/dt) # number of iterations we are doing
NN = 1000 # Number of particles
chunk_size = 10**3
ZZ = np.zeros( (chunk_size, 2+NN*6 ) )
ZZ[0][0], ZZ[0][1] = T , dt
# ZZ[0][2:] = initialize_system(NN=NN) # so lets initialize the system.
# However, for this post I do this differently due to simplicity. See below
ZZ[0][2:] = np.random.uniform(-100,100,NN*6)
with h5py.File("test.h5", "a") as f:
dset = f.create_dataset('ZZ', (0,2+NN*6), maxshape=(None,2+NN*6), dtype='float64', chunks=(chunk_size,2+NN+6))
for chunk in range(0, iterations, chunk_size):
for i in range(0, chunk_size - 1):
T += dt
ZZ[i + 1][0], ZZ[i + 1][1] = T, dt
#Z[i+1][2:] = rk4(EOM_function, posvel=Z[i][2:])
# ^ Using this I would calculate new positions based on previous ones.
ZZ[i + 1][2:] = np.random.uniform(-100,100,NN*6) #This is just for example here.
# Expand the file here to allow for more data.
dset.resize(dset.shape[0] + chunk_size, axis=0)
dset[chunk: chunk + chunk_size ] = ZZ
# update and initialize next chunk. the next chunk's first row should be the last row of the previous chunk + iteration
T += dt
ZZ[0][0], ZZ[0][1] = T, dt
#Z[0][2:] = rk4(EOM_function, posvel=Z[-1][2:])
# ^ Using this I would calculate new positions based on previous ones.
ZZ[0][2:] = np.random.uniform(-100,100,NN*6) #This is just for example here.
print(dset.shape)
This takes 70 seconds on the save step on my computer, generating a 45GB file, for a dataset that is 100 times your original code.
The above code is more general in case you are streaming your data and don't know your final size. If you know it from the start, you can replace the initial create_dataset with
dset = f.create_dataset('ZZ', (iterations,2+NN*6), dtype='float64')
and remove the dset.resize(dset.shape[0] + chunk_size, axis=0)
You'll probably also want to read it back in chunks afterwards for other processing, in which case you can follow the docs here: https://docs.h5py.org/en/latest/high/dataset.html#reading-writing-data

Okay so I'm continuing my question / providing possible answer to it based on the answer of EricChen1248. EDIT: Answer provided by EricChen1248 works now and is way better than this my code part. See his code
I do not yet still understand completely how this f.create_dataset () truly works (i.e. when does it write data to file in the loop etc).
Using the code provided by Eric, it created and saved the data files fastly, but when I read the file as follows
hf = h5py.File('temp/test.h5', 'r')
ZZ = np.array(hf['ZZ'])
hf.close()
and plotted the first column (time T column, which should increase by timestep dt after each iteration) I get the following figure
plt.plot(ZZ[:,0])
time T column plotted
and as can be seen, it grows to a time of 100, and then goes to zero. This happens after the first 'chunk_size' has been passed. I started to read docs provided by Eric, and using his code as reference I managed to write something like this
import numpy as np
import h5py
T_max = 10**4
dt = 0.1
T = 0
NN = 1000
iterations = int(T_max/dt)
chunk_size = 10**3
with h5py.File('temp/data12.h5', 'a') as hf:
dset = hf.create_dataset("ZZ", (chunk_size, 2+NN*6),maxshape=(None,2+NN*6) ,chunks=(chunk_size, 2+NN*6), dtype='f8' )
# ^ first I create data set equals to one chunk_size
# Here I initialize the system. Columns ; 0=T , 1=dt, 2=arbitrary data point, 3=sin(column2)
# all the rest columns are random numbers just to fill some numbers in
dset[0,0], dset[0,1] = T, dt
#dset[0,2:] = np.random.uniform(0,1,NN*6)
dset[0,2] = 1
dset[0,3] = np.sin(dset[0,2])
dset[0,4:] = np.random.uniform(0,1,NN*6 -2)
print('starts')
# Main difference down there is that I use dataset (dset)
# as a data matrix to be filled instead of matrix ZZ as in my question.
i = 0
#for j, s in enumerate(dset.iter_chunks()):
for j, s in enumerate(range(0, iterations, chunk_size )):
print(j, s)
while i < iterations and i < chunk_size*(j+1) -1:
#for i in range(chunk_size*j, chunk_size*(j+1)-1):
T += dt
dset[i+1,0], dset[i+1,1] = T, dt
#dset[i+1,2:] = np.sin(dset[i,2:]+dt)
dset[i+1,2] = dset[i,2] + dt
dset[i+1,3] = np.sin(dset[i,2]+dt)
dset[i+1,4:] = dset[i,4:] + np.random.uniform(-1,1,NN*6-2)
i+=1
print(dset.shape)
dset.resize(dset.shape[0] + chunk_size, axis=0)
This code runs in 1min 50s , and saves a file of size 4.47GB so I am happy with the speed, and what I'm really happy is that it do not use so much memory while iterating (I used to get into problem with huge RAM usage).
When I read the data file provided by my code (similarly as above) I get following image for time Time T column plotted, my code version and it grows nicely to T=10e4 as should be. It still generated one more chunk_size block to the end of dataset which is full of zeros. That I need to get rid of. One more proof that the code works and saves data without weird problems is this sinusoidal plot plt.plot(ZZ[500:1500,0] , ZZ[500:1500,3]). Sinusoidal image proof Note that the plot is limited for T ~ [50,150] so one could still see something there (if plotted the whole thing, one could not see lines well).
I believe this is not the best way to write this code, but it is the way I got this working. So if someone sees improvements, please let me know. Also, I am curious to know why the code provided by Eric did not work, at least for me.
EDIT : fixed typos

Modifying a list from multiple Python pools

I have a large data set (~2Gb) to analyse and I'd like to multi process it to reduce the run time of the code. I've imported the dataset into a list which I will then want to run numerous passes over. On each pass I'll set up a pool for each available core and each pool will then only assess a certain block of the data set (note, the pool still needs access to the complete data set).
Each line of the input file takes the format "a,b,c,d,e,f,g,h" and all are numbers.
I'm struggling to separate out the get the parameters in the Calc1stPass Pool; I'm getting a tuple index out or range error. Can anyone help me out with this error please?
def Calc1stPass(DataSet,Params):
print("DataSet =", DataSet)
print("Params =", Params)
Pass, (PoolNumber, ArrayCount, CoreCount) = Params
StartRow = int((ArrayCount / CoreCount) * PoolNumber)
EndRow = int(((ArrayCount / CoreCount) * (PoolNumber+1))-1)
for Row in range(StartRow,EndRow):
Rand = randrange(ArrayCount)
Value1 = Decimal(DataSet[Row][0]) + Decimal(DataSet[Row][1])
Value2 = Decimal(DataSet[Rand][0]) + Decimal(DataSet[Rand][1])
Value3 = Value1 - Value2
NewValue = Decimal(DataSet[Row][7]) + Value3
DataSet[Row][7] = str(NewValue)
def main():
#Importing the file
print("Importing File ", FileToImport)
OriginalDataSet = []
f = open(FileToImport)
for line in f:
StrippedLine = line.rstrip()
OriginalDataSet.append(StrippedLine.split(",",))
ArrayCount = len(OriginalDataSet)
#Running passes on dataset
for Pass in range(NumberofPasses):
print("Running Pass : ", Pass + 1, " of ", NumberofPasses)
CoreCount = mp.cpu_count()
WorkPool=mp.Pool(CoreCount)
for PoolNumber in range(CoreCount):
Params = [Pass,PoolNumber,ArrayCount,CoreCount]
RevisedDataSet = WorkPool.starmap(Calc1stPass, product(OriginalDataSet, zip(range(1),Params)))
print(RevisedDataSet)
if __name__ == "__main__":
freeze_support()
main()

Okay, here we go with what I came up with after some discussion plus trial and error. I hope I've kept it somewhat comprehensible. However, it seems you are very new to a lot of this, so you probably have a lot of reading to do regarding how certain libraries and data types work.
Analyzing the algorithm
Let's start with taking a closer look at your computation:
for Pass in range(Passes:
for Row in range(StartRow,EndRow):
Rand = randrange(ArrayCount)
Value1 = Decimal(DataSet[Row][0]) + Decimal(DataSet[Row][1])
Value2 = Decimal(DataSet[Rand][0]) + Decimal(DataSet[Rand][1])
Value3 = Value1 - Value2
NewValue = Decimal(DataSet[Row][7]) + Value3
DataSet[Row][7] = str(NewValue)
So basically, we update a single row through a computation involving another random row.
Assumptions that I make:
the real algorithm does a bit more stuff, otherwise it is hard to see what you want to achieve
the access pattern of the real algorithm stays the same
Following our discussion, there are no functional reasons for the following aspects:
Computation in Decimal is unnecessary. float will do just fine
The values don't need to be stored as string. We can use an array of float
At this point it is clear that we can save tremendous amounts of runtime by using a numpy array instead of a list of string.
There is an additional hazard here for parallelization: We use random numbers. When we use multiple processes, the random number generators need to be set up for parallel generation. We'll cross that bridge when we get there.
Notably, the output column is no input for the next pass. The inputs per pass stay constant.
Input / Output
The input file format seems to be a simple CSV mostly filled with floating point numbers (using only one decimal place) and one column not being a floating point number. The text based format coupled with your information that there are gigabytes of data means that a significant amount of time will be spent just parsing the input file or formatting the output. I'll try to be efficient in both but keep things simple enough that extensions in both are possible.
Optimizing the sequential algorithm
It is always advisable to first optimize the sequential case before parallelizing. So we start here. We begin with parsing the input file into a numpy array.
import numpy as np
def ReadInputs(Filename):
"""Read a CSV file containing 10 columns
The 7th column is skipped because it doesn't contains floating point values
Return value:
2D numpy array of floats
"""
UsedColumns = (0, 1, 2, 3, 4, 5, 7, 8, 9)
return np.loadtxt(Filename, delimiter=',', usecols=UsedColumns)
Since we are using numpy, we switch over to its random number generators. This is the setup routine. It allows us to force deterministic values for easier debugging.
def MakeRandomGenerator(Deterministic=False):
"""Initializes the random number generator avoiding birthday paradox
Arguments:
Deterministic -- if True, the same same random numbers are being used
Return value:
numpy random number generator
"""
SeedInt = 0 if Deterministic else None
Seed = np.random.SeedSequence(SeedInt)
return np.random.default_rng(Seed)
And now the main computation. Numpy makes this very straight-forward.
def ComputePass(DataSets, RandomGenerator):
"""The main computation
Arguments:
DataSets -- 2D numpy array. Changed in place
RandomGenerator -- numpy random number generator
"""
Count = len(DataSets)
RandomIndices = RandomGenerator.integers(
low=0, high=Count, size=Count)
RandomRows = DataSets[RandomIndices]
# All rows: first column + second column
Value1 = DataSets[:, 0] + DataSets[:, 1]
Value2 = RandomRows[:, 0] + RandomRows[:, 1]
Value3 = Value1 - Value2
# This change is in-place of the whole DataSets array
DataSets[:, 7] += Value3
I've kept the structure the same. That means there are a few optimizations that we can still do:
We never use most columns. Columns that are unnecessary should be removed from the array (skipped in input parsing) to reduce memory consumption and improve locality of data. If necessary for output, it is better to merge in the output stage, maybe by re-reading the input file to gather the remaining columns
Since Value1 and Value2 never change, we could pre-compute Value3 for all rows and just use that. Again, if we don't need the first two columns in memory, better to remove them
If we transpose the array (or store in Fortran order), we improve vectorization. This will make the use of MPI harder, but not impossible
I've not done any of this because I do not want to stray too far from the original algorithm.
The last step is the output. Here I go with a pure Python route to keep things simple and replicate the input file format:
def WriteOutputs(Filename, DataSets):
LineFormat = "{:.1f}, " * 6 + "+" + ", {:.1f}" * 3 + "\n"
with open(Filename, 'w') as OutFile:
for Row in DataSets:
OutFile.write(LineFormat.format(*Row))
Now the entire operation is rather simple:
def main():
InFilename = "indata.csv"
OutFilename = "outdata.csv"
Passes = 20
RandomGenerator = MakeRandomGenerator()
DataSets = ReadInputs(InFilename)
for _ in range(Passes):
ComputePass(DataSets, RandomGenerator)
WriteOutputs(OutFilename, DataSets)
if __name__ == '__main__':
main()
Parallelization framework
There are two main concerns for parallelization:
For every row, we need access to the entire input data set to pick a random entry
The amount of calculation per row is very low
So we need to find a way that keeps overhead per row small and shares the input data set efficiently.
The first choice is multiprocessing since, you know, standard library and all that. However, I think that the normal usage patterns have too much overhead. It's certainly possible but I would like to use MPI for this to give us as much performance as possible. Also, your first attempt at parallelization used a pattern that matches MPI's preferred pattern. So it is a good fit.
A word towards the concept of MPI: multiprocessing.Pool works with a main process that distributes work items among a set of worker processes. MPI start N processes that all execute the same code. There is no main process. The only distinguishing feature is the process "rank", which is a number [0, N). If you need a main process, the one with rank 0 is usually chosen. Other than that, the idea is that all processes execute the same code, only picking different indices or offsets based on their rank. If processes need to communicate, there are a couple of "collective" communication patterns such as broadcasting, scattering, and gathering.
Option 1: Pure MPI
Let's rewrite the code. The main idea is this: We distribute rows in the data set among all processes. Then each process calculates all passes for its own set of rows. Input and output take considerable time, so we try to do as much as possible parallelized, too.
We start by defining a helper function that defines how we distribute rows among all processes. This is very similar to what you had in your original version.
from mpi4py import MPI
def MakeDistribution(NumberOfRows):
"""Computes how the data set should be distributed across processes
Arguments:
NumberOfRows -- size of the whole dataset
Return value:
(Offsets, Counts) numpy integer arrays. One entry per process
"""
Comm = MPI.COMM_WORLD
WorldSize = Comm.Get_size()
SameSize, Tail = divmod(NumberOfRows, WorldSize)
Counts = np.full(WorldSize, SameSize, dtype=int)
Counts[:Tail] += 1
# Start offset per process
Offsets = np.cumsum(Counts) - Counts[0]
return Offsets, Counts
A second helper function is used to distribute the data sets among all processes. MPI's allgather function is used to collect results of a computation among all processes into one array. The normal function gather collects the whole array on one process. Allgather collects it in all processes. Since all processes need access to all data sets for their random access, we use allgather. Allgatherv is a generalized version that allows different number of entries per process. We need this because we cannot guarantee that all processes have the same number of rows in their local data set.
This function uses the "buffer" interface of mpi4py. This is the more efficient version but also very error-prone. If we mess up an index or the size of a data type, we risk data corruption.
def DistributeDataSets(DataSets, Offsets, Counts):
"""Shares the datasets with all other processes
Arguments:
DataSets -- numpy array of floats. Changed in place
Offsets, Counts -- See MakeDistribution
Return value:
DataSets. Most likely a reference to the original.
Might be an updated copy
"""
# Sanitize the input. Better safe than sorry and shouldn't cost anything
DataSets = np.ascontiguousarray(DataSets, dtype='f8')
assert len(DataSets) == np.sum(Counts)
# MPI works best if we pretend to have 1-dimensional data
InnerSize = np.prod(DataSets.shape[1:], dtype=int)
# I really wish mpi4py had a helper for this
BufferDescr = (DataSets,
Counts * InnerSize,
Offsets * InnerSize,
MPI.DOUBLE)
MPI.COMM_WORLD.Allgatherv(MPI.IN_PLACE, BufferDescr)
return DataSets
We split reading the input data into two parts. First we read all lines in a single process. This is relatively cheap and we need to know the total number of rows before we can distribute the datasets. Then we scatter the lines among all processes and let each process parse its own set of rows. After that, we use the DistributeDataSets function to let each process know all the results.
Scattering the lines uses mpi4py's pickle interface that can transfer arbitrary objects among processes. It's slower but more convenient. For stuff like lists of strings it's very good.
def ParseLines(TotalLines, Offset, OwnLines):
"""Allocates a data set and parses the own segment of it
Arguments:
TotalLines -- number of rows in the whole data set across all processes
Offset -- starting offset of the set of rows parsed by this process
OwnLines -- list of lines to be parsed by the local process
Return value:
a 2D numpy array. The rows [Offset:Offset+len(OwnLines)] are initialized
with the parsed values
"""
UsedColumns = (0, 1, 2, 3, 4, 5, 7, 8, 9)
DataSet = np.empty((TotalLines, len(UsedColumns)), dtype='f8')
OwnEnd = Offset + len(OwnLines)
for Row, Line in zip(DataSet[Offset:OwnEnd], OwnLines):
Columns = Line.split(',')
# overwrite in-place with new values
Row[:] = [float(Columns[Column]) for Column in UsedColumns]
return DataSet
def DistributeInputs(Filename):
"""Read input from the file and distribute it among processes
Arguments:
Filename -- path to the CSV file to parse
Return value:
(DataSets, Offsets, Counts) with
DataSets -- 2D array containing all values in the CSV file
Offsets -- Row indices (one per rank) where each process starts its own
processing
Counts -- number of rows per process
"""
Comm = MPI.COMM_WORLD
Rank = Comm.Get_rank()
Lines = None
LineCount = None
if not Rank:
# Read the data. We do as little work as possible here so that other
# processes can help with the parsing
with open(Filename) as InFile:
Lines = InFile.readlines()
LineCount = len(Lines)
# broadcast so that all processes know the number of datasets
LineCount = Comm.bcast(LineCount, root=0)
Offsets, Counts = MakeDistribution(LineCount)
# reshape into one list per process
if not Rank:
Lines = [Lines[Offset:Offset+Count]
for Offset, Count
in zip(Offsets, Counts)]
# distribute strings for parsing
Lines = Comm.scatter(Lines, root=0)
# parse into a float array
DataSets = ParseLines(LineCount, Offsets[Rank], Lines)
del Lines # release strings because this is a huge array
# Share the parsed result
DataSets = DistributeDataSets(DataSets, Offsets, Counts)
return DataSets, Offsets, Counts
Now we need to update the way the random number generator is initialized. What we need to prevent is that each process has the same state and generates the same random numbers. Thankfully, numpy gives us a convenient way of doing this.
def MakeRandomGenerator(Deterministic=False):
"""Initializes the random number generator avoiding birthday paradox
Arguments:
Deterministic -- if True, the same number of processes should always result
in the same random numbers being used
Return value:
numpy random number generator
"""
Comm = MPI.COMM_WORLD
Rank = Comm.Get_rank()
AllSeeds = None
if not Rank:
# the root process (rank=0) generates a seed sequence for everyone else
WorldSize = Comm.Get_size()
SeedInt = 0 if Deterministic else None
OwnSeed = np.random.SeedSequence(SeedInt)
AllSeeds = OwnSeed.spawn(WorldSize)
# mpi4py can scatter Python objects. This is the simplest way
OwnSeed = Comm.scatter(AllSeeds, root=0)
return np.random.default_rng(OwnSeed)
The computation itself is almost unchanged. We just need to limit it to the rows for which the individual process is responsible.
def ComputePass(DataSets, Offset, Count, RandomGenerator):
"""The main computation
Arguments:
DataSets -- 2D numpy array. Changed in place
Offset, Count -- rows that should be updated by this process
RandomGenerator -- numpy random number generator
"""
RandomIndices = RandomGenerator.integers(
low=0, high=len(DataSets), size=Count)
RandomRows = DataSets[RandomIndices]
# Creates a "view" into the whole dataset for the given slice
OwnDataSets = DataSets[Offset:Offset + Count]
# All rows: first column + second column
Value1 = OwnDataSets[:, 0] + OwnDataSets[:, 1]
Value2 = RandomRows[:, 0] + RandomRows[:, 1]
Value3 = Value1 - Value2
# This change is in-place of the whole DataSets array
OwnDataSets[:, 7] += Value3
Now we come to writing the output. The most expensive part is formatting the floating point numbers into strings. So we let each process format its own data. MPI has a file IO interface that allows all processes to write a single file together. Unfortunately, for text files, we need to calculate the offsets before writing the data. So we format all rows into one huge string per process, then write the file.
import io
def WriteOutputs(Filename, DataSets, Offset, Count):
"""Writes all DataSets to a CSV file
We parse all rows to a string (one per process), then write it
collectively using MPI
Arguments:
Filename -- output path
DataSets -- all values among all processes
Offset, Count -- the rows for which the local process is responsible
"""
StringBuf = io.StringIO()
LineFormat = "{:.6f}, " * 6 + "+" + ", {:.6f}" * 3 + "\n"
for Row in DataSets[Offset:Offset+Count]:
StringBuf.write(LineFormat.format(*Row))
StringBuf = StringBuf.getvalue() # to string
StringBuf = StringBuf.encode() # to bytes
Comm = MPI.COMM_WORLD
BytesPerProcess = Comm.allgather(len(StringBuf))
Rank = Comm.Get_rank()
OwnOffset = sum(BytesPerProcess[:Rank])
FileLength = sum(BytesPerProcess)
AccessMode = MPI.MODE_WRONLY | MPI.MODE_CREATE
OutFile = MPI.File.Open(Comm, Filename, AccessMode)
OutFile.Set_size(FileLength)
OutFile.Write_ordered(StringBuf)
OutFile.Close()
The main process is almost unchanged.
def main():
InFilename = "indata.csv"
OutFilename = "outdata.csv"
Passes = 20
RandomGenerator = MakeRandomGenerator()
DataSets, Offsets, Counts = DistributeInputs(InFilename)
Rank = MPI.COMM_WORLD.Get_rank()
Offset = Offsets[Rank]
Count = Counts[Rank]
for _ in range(Passes):
ComputePass(DataSets, Offset, Count, RandomGenerator)
WriteOutputs(OutFilename, DataSets, Offset, Count)
if __name__ == '__main__':
main()
You need to call this script with mpirun or mpiexec. E.g. mpiexec python3 script_name.py
Using shared memory
The MPI pattern has one significant drawback: Each process needs its own copy of the whole data set. Given its size, this is very inconvenient. We might run out of memory before we run out of CPU cores for multithreading. As a different idea, we can use shared memory. Shared memory allows multiple processes to access the same physical memory without any extra cost. This has some drawbacks:
We need a very recent python version. 3.8 IIRC
Python's implementation may behave differently on various operating systems. I could only test it on Linux. There is a chance that it will not work on any different system
IMHO python's implementation is not great. You will notice that the final version will print some warnings which I think are harmless. Maybe I'm using it wrong but I don't see a more correct way of using it
It limits you to a single PC. MPI itself is perfectly capable (and indeed designed to) operate across multiple systems on a network. Shared memory works only locally.
The major benefit is that the memory consumption does not increase with the number of processes.
We start by allocating such a data set.
From here on, we put in "barriers" at various points where processes may have to wait for one another. For example because all processes need to access the same shared memory segment, they all have to open it before we can unlink it.
from multiprocessing import shared_memory
def AllocateSharedDataSets(NumberOfRows, NumberOfCols=9):
"""Creates a numpy array in shared memory
Arguments:
NumberOfRows, NumberOfCols -- basic shape
Return value:
(DataSets, Buf) with
DataSets -- numpy array shaped (NumberOfRows, NumberOfCols).
Datatype float
Buf -- multiprocessing.shared_memory.SharedMemory that backs the array.
Close it when no longer needed
"""
length = NumberOfRows * NumberOfCols * np.float64().itemsize
Comm = MPI.COMM_WORLD
Rank = Comm.Get_rank()
Buf = None
BufName = None
if not Rank:
Buf = shared_memory.SharedMemory(create=True, size=length)
BufName = Buf.name
BufName = Comm.bcast(BufName)
if Rank:
Buf = shared_memory.SharedMemory(name=BufName, size=length)
DataSets = np.ndarray((NumberOfRows, NumberOfCols), dtype='f8',
buffer=Buf.buf)
Comm.barrier()
if not Rank:
Buf.unlink() # this may differ among operating systems
return DataSets, Buf
The input parsing also changes a little because have to put the data into the previously allocated array
def ParseLines(DataSets, Offset, OwnLines):
"""Reads lines into a preallocated array
Arguments:
DataSets -- [Rows, Cols] numpy array. Will be changed in-place
Offset -- starting offset of the set of rows parsed by this process
OwnLines -- list of lines to be parsed by the local process
"""
UsedColumns = (0, 1, 2, 3, 4, 5, 7, 8, 9)
OwnEnd = Offset + len(OwnLines)
OwnDataSets = DataSets[Offset:OwnEnd]
for Row, Line in zip(OwnDataSets, OwnLines):
Columns = Line.split(',')
Row[:] = [float(Columns[Column]) for Column in UsedColumns]
def DistributeInputs(Filename):
"""Read input from the file and stores it in shared memory
Arguments:
Filename -- path to the CSV file to parse
Return value:
(DataSets, Offsets, Counts, Buf) with
DataSets -- [Rows, 9] array containing two copies of all values in the
CSV file
Offsets -- Row indices (one per rank) where each process starts its own
processing
Counts -- number of rows per process
Buf -- multiprocessing.shared_memory.SharedMemory object backing the
DataSets object
"""
Comm = MPI.COMM_WORLD
Rank = Comm.Get_rank()
Lines = None
LineCount = None
if not Rank:
# Read the data. We do as little work as possible here so that other
# processes can help with the parsing
with open(Filename) as InFile:
Lines = InFile.readlines()
LineCount = len(Lines)
# broadcast so that all processes know the number of datasets
LineCount = Comm.bcast(LineCount, root=0)
Offsets, Counts = MakeDistribution(LineCount)
# reshape into one list per process
if not Rank:
Lines = [Lines[Offset:Offset+Count]
for Offset, Count
in zip(Offsets, Counts)]
# distribute strings for parsing
Lines = Comm.scatter(Lines, root=0)
# parse into a float array
DataSets, Buf = AllocateSharedDataSets(LineCount)
try:
ParseLines(DataSets, Offsets[Rank], Lines)
Comm.barrier()
return DataSets, Offsets, Counts, Buf
except:
Buf.close()
raise
Output writing is exactly the same. The main process changes slightly because now we have to manage the life time of the shared memory.
import contextlib
def main():
InFilename = "indata.csv"
OutFilename = "outdata.csv"
Passes = 20
RandomGenerator = MakeRandomGenerator()
Comm = MPI.COMM_WORLD
Rank = Comm.Get_rank()
DataSets, Offsets, Counts, Buf = DistributeInputs(InFilename)
with contextlib.closing(Buf):
Offset = Offsets[Rank]
Count = Counts[Rank]
for _ in range(Passes):
ComputePass(DataSets, Offset, Count, RandomGenerator)
WriteOutputs(OutFilename, DataSets, Offset, Count)
Results
I've not benchmarked the original version. The sequential version requires 2 GiB memory and 3:20 minutes for 12500000 lines and 20 passes.
The pure MPI version requires 6 GiB and 42 seconds with 6 cores.
The shared memory version requires a bit over 2 GiB of memory and 38 seconds with 6 cores.

Pytables: Can Appended Earray be reduced in size?

Following suggestions on SO Post, I also found PyTables-append is exceptionally time efficient. However, in my case the output file (earray.h5) has huge size. Is there a way to append the data such that the output file is not as huge? For example, in my case (see link below) a 13GB input file (dset_1: 2.1E8 x 4 and dset_2: 2.1E8 x 4) gives a 197 GB output file with just one column (2.5E10 x 1). All elements are float64.
I want to reduce the output file size such that the execution speed of the script is not compromised and the output file reading is also efficient for later use. Can saving the data along columns and not just rows help? Any suggestions on this? Given below is a MWE.
Output and input files' details here
# no. of chunks from dset-1 and dset-2 in inp.h5
loop_1 = 40
loop_2 = 20
# save to disk after these many rows
app_len = 10**6
# **********************************************
# Grabbing input.h5 file
# **********************************************
filename = 'inp.h5'
f2 = h5py.File(filename, 'r')
chunks1 = f2['dset_1']
chunks2 = f2['dset_2']
shape1, shape2 = chunks1.shape[0], chunks2.shape[0]
f1 = tables.open_file("table.h5", "w")
a = f1.create_earray(f1.root, "dataset_1", atom=tables.Float64Atom(), shape=(0, 4))
size1 = shape1//loop_1
size2 = shape2//loop_2
# ***************************************************
# Grabbing chunks to process and append data
# ***************************************************
for c in range(loop_1):
h = c*size1
# grab chunks from dset_1 of inp.h5
chunk1 = chunks1[h:(h + size1)]
for d in range(loop_2):
g = d*size2
chunk2 = chunks2[g:(g + size2)] # grab chunks from dset_2 of inp.h5
r1 = chunk1.shape[0]
r2 = chunk2.shape[0]
left, right = 0, 0
for j in range(r1): # grab col.2 values from dataset-1
e1 = chunk1[j, 1]
#...Algaebraic operations here to output a row containing 4 float64
#...append to a (earray) when no. of rows reach a million
del chunk2
del chunk1
f2.close()

I wrote the answer you are referencing. That is a simple example that "only" writes 1.5e6 rows. I didn't do anything to optimize performance for very large files. You are creating a very large file, but did not say how many rows (obviously way more than 10**6). Here are some suggestions based on comments in another thread.
Areas I recommend (3 related to PyTables code, and 2 based on external utilizes).
PyTables code suggestions:
Enable compression when you create the file (add the filters= parameter when you create the file). Start with tb.Filters(complevel=1).
Define the expectedrows= parameter in .create_tables() (per PyTables docs, 'this will optimize the HDF5 B-Tree and amount of memory used'). The default value is set in tables/parameters.py (look for EXPECTED_ROWS_TABLE; It's only 10000 in my installation). I suggest you set this to a larger value if you are creating 10**6 (or more) rows.
There is a side benefit to setting expectedrows=. If you don't define chunkshape, 'a sensible value is calculated based on the expectedrows parameter'. Check the value used. This won't decrease the created file size, but will improve I/O performance.
If you didn't use compression when you created the file, there are 2 methods to compress existing files:
External Utilities:
The PyTables utility ptrepack - run against a HDF5 file to create a
new file (useful to go from uncompressed to compressed, or vice-versa). It is delivered with PyTables, and runs on the command line.
The HDF5 utility h5repack - works similar to ptrepack. It is delivered with the HDF5 installer from The HDF Group.
There are trade-offs with file compression: it reduces the file size, but increases access time (reduces I/O performance). I tend to use uncompressed files I open frequently (for best I/O performance). Then when done, I convert to compressed format for long term archiving. You can continue to work with them in compress format (the API handles cleanly).

PyTables vs Matlab HDF5 read times

I have an HDF5 output file from NASTRAN that contains mode shape data. I am trying to read them into Matlab and Python to check various post-processing techniques. The file in question is in the local directory for both of these tests. The file is semi-large at 1.2 GB but certainly not that large in terms of HDF5 files I have read previously. There are 17567342 rows and 8 columns in the table I want to access. The first and last columns are integers the middle 6 are floating point numbers.
Matlab:
file = 'HDF5.h5';
hinfo = hdf5info(file);
% ... Find the dataset I want to extract
t = hdf5read(file, '/NASTRAN/RESULT/NODAL/EIGENVECTOR');
This last operation is extremely slow (can be measured in hours).
Python:
import tables
hfile = tables.open_file("HDF5.h5")
modetable = hfile.root.NASTRAN.RESULT.NODAL.EIGENVECTOR
data = modetable.read()
This last operation is basically instant. I can then access data as if it were a numpy array. I am clearly missing something very basic about what these commands are doing. I'm thinking it might have something to do with data conversion but I'm not sure. If I do type(data) I get back numpy.ndarray and type(data[0]) returns numpy.void.
What is the correct (i.e. speedy) way to read the dataset I want into Matlab?

Matt, Are you still working on this problem?
I am not a matlab guy, but I am familiar with Nastran HDF5 file. You are right; 1.2 GB is big, but not that big by today's standards.
You might be able to diagnose the matlab performance bottle neck by running tests with different numbers of rows in your EIGENVECTOR dataset. To do that (without running a lot of Nastran jobs), I created some simple code to create a HDF5 file with a user defined # of rows. It mimics the structure of the Nastran Eigenvector Result dataset. See below:
import tables as tb
import numpy as np
hfile = tb.open_file('SO_54300107.h5','w')
eigen_dtype = np.dtype([('ID',int), ('X',float),('Y',float),('Z',float),
('RX',float),('RY',float),('RZ',float), ('DOMAIN_ID',int)])
fsize = 1000.0
isize = int(fsize)
recarr = np.recarray((isize,),dtype=eigen_dtype)
id_arr = np.arange(1,isize+1)
dom_arr = np.ones((isize,), dtype=int)
arr = np.array(np.arange(fsize))/fsize
recarr['ID'] = id_arr
recarr['X'] = arr
recarr['Y'] = arr
recarr['Z'] = arr
recarr['RX'] = arr
recarr['RY'] = arr
recarr['RZ'] = arr
recarr['DOMAIN_ID'] = dom_arr
modetable = hfile.create_table('/NASTRAN/RESULT/NODAL', 'EIGENVECTOR',
createparents=True, obj=recarr )
hfile.close()
Try running this with different values for fsize (# of rows), then attach the HDF5 file it creates to matlab. Maybe you can find the point where performance noticeably degrades.

Matlab provided another HDF5 reader called h5read. Using the same basic approach the amount of time taken to read the data was drastically reduced. In fact hdf5read is listed for removal in a future version. Here is same basic code with the perfered functions.
file = 'HDF5.h5';
hinfo = h5info(file);
% ... Find the dataset I want to extract
t = h5read(file, '/NASTRAN/RESULT/NODAL/EIGENVECTOR');

How can I gradually free memory from a numpy array?

I'm in a situation where I'm constantly hitting my memory limit (I have 20G of RAM). Somehow I managed to get the huge array into memory and carry on my processes. Now the data needs to be saved onto the disk. I need to save it in leveldb format.
This is the code snippet responsible for saving the normalized data onto the disk:
print 'Outputting training data'
leveldb_file = dir_des + 'svhn_train_leveldb_normalized'
batch_size = size_train
# create the leveldb file
db = leveldb.LevelDB(leveldb_file)
batch = leveldb.WriteBatch()
datum = caffe_pb2.Datum()
for i in range(size_train):
if i % 1000 == 0:
print i
# save in datum
datum = caffe.io.array_to_datum(data_train[i], label_train[i])
keystr = '{:0>5d}'.format(i)
batch.Put( keystr, datum.SerializeToString() )
# write batch
if(i + 1) % batch_size == 0:
db.Write(batch, sync=True)
batch = leveldb.WriteBatch()
print (i + 1)
# write last batch
if (i+1) % batch_size != 0:
db.Write(batch, sync=True)
print 'last batch'
print (i + 1)
Now, my problem is, I hit my limit pretty much at the very end (495k out of 604k items that need to be saved to the disk) when saving to the disk.
To get around this issue, I thought after writing each batch, I release the corresponding memory from the numpy array (data_train) since it seems leveldb writes the data in a transaction manner, and until all the data are written, they are not flushed to the disk!
My second thought is to somehow, make the write non-transactional, and when each batch is written using the db.Write, it actually saves the content to the disk.
I don't know if any of these ideas are applicable.

Try reducing batch_size to something smaller than the entire dataset, for example, 100000.
Converted to Community Wiki from #ren 's comment

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.