Sharing a large dataframe with multiprocessing.pool

Sharing a large dataframe with multiprocessing.pool - python

I have a function which I want to compute in parallel using multiprocessing. The function takes an argument, but also loads subsets from two very large dataframe which has already been loaded into memory (one of which is about 1G and the other is just over 6G).
largeDF1 = pd.read_csv(directory + 'name1.csv')
largeDF2 = pd.read_csv(directory + 'name2.csv')
def f(x):
load_content1 = largeDF1.loc[largeDF1['FirstRow'] == x]
load_content2 = largeDF1.loc[largeDF1['FirstRow'] == x]
#some computation happens here
new_data.to_csv(directory + 'output.csv', index = False)
def main():
multiprocessing.set_start_method('spawn', force = True)
pool = multiprocessing.Pool(processes = multiprocessing.cpu_count())
input = input_data['col']
pool.map_async(f, input)
pool.close()
pool.join()
The problem is that the files are too big and when I run them over multiple cores I get a memory issue. I want to know if there is a way where the loaded files can be shared across all processes.
I have tried manager() but could not get it to work. Any help is appreciated. Thanks.

If you were running this on a UNIX-like system (which uses the fork startmethod by default) the data would be shared out-of-the-box. Most operating systems use copy-on-write for memory pages. So even if you fork a process several times they would share most of the memory pages that contain the dataframes, al long as you don't modify those dataframes.
But when using the spawn start method, each worker process has to load the dataframe. I'm not sure if the OS is smart enough in that case to share the memory pages. Or indeed that these spawned processes would all have the same memory lay-out.
The only portable solution I can think of would be to leave the data on disk and use mmap in the workers to map it into memory read-only. That way the OS would notice that multiple processes are mapping the same file, and it would only load one copy.
The downside is that the data would be in memory in on-disk csv format, which makes reading data from it (without making a copy!) less convenient. So you might want to prepare the data beforehand into a form that it easier to use. Like e.g. convert the data from 'FirstRow' into a binary file of float or double that you can iterate over with struct.iter_unpack.
The function below (from my statusline script) uses mmap to count the amount of messages in a mailbox file.
def mail(storage, mboxname):
"""
Report unread mail.
Arguments:
storage: a dict with keys (unread, time, size) from the previous call or an empty dict.
This dict will be *modified* by this function.
mboxname (str): name of the mailbox to read.
Returns: A string to display.
"""
stats = os.stat(mboxname)
if stats.st_size == 0:
return 'Mail: 0'
# When mutt modifies the mailbox, it seems to only change the
# ctime, not the mtime! This is probably releated to how mutt saves the
# file. See also stat(2).
newtime = stats.st_ctime
newsize = stats.st_size
if not storage or newtime > storage['time'] or newsize != storage['size']:
with open(mboxname) as mbox:
with mmap.mmap(mbox.fileno(), 0, prot=mmap.PROT_READ) as mm:
start, total = 0, 1 # First mail is not found; it starts on first line...
while True:
rv = mm.find(b'\n\nFrom ', start)
if rv == -1:
break
else:
total += 1
start = rv + 7
start, read = 0, 0
while True:
rv = mm.find(b'\nStatus: R', start)
if rv == -1:
break
else:
read += 1
start = rv + 10
unread = total - read
# Save values for the next run.
storage['unread'], storage['time'], storage['size'] = unread, newtime, newsize
else:
unread = storage['unread']
return f'Mail: {unread}'
In this case I used mmap because it was 4x faster than just reading the file. See normal reading versus using mmap.

Related

Modifying a list from multiple Python pools

I have a large data set (~2Gb) to analyse and I'd like to multi process it to reduce the run time of the code. I've imported the dataset into a list which I will then want to run numerous passes over. On each pass I'll set up a pool for each available core and each pool will then only assess a certain block of the data set (note, the pool still needs access to the complete data set).
Each line of the input file takes the format "a,b,c,d,e,f,g,h" and all are numbers.
I'm struggling to separate out the get the parameters in the Calc1stPass Pool; I'm getting a tuple index out or range error. Can anyone help me out with this error please?
def Calc1stPass(DataSet,Params):
print("DataSet =", DataSet)
print("Params =", Params)
Pass, (PoolNumber, ArrayCount, CoreCount) = Params
StartRow = int((ArrayCount / CoreCount) * PoolNumber)
EndRow = int(((ArrayCount / CoreCount) * (PoolNumber+1))-1)
for Row in range(StartRow,EndRow):
Rand = randrange(ArrayCount)
Value1 = Decimal(DataSet[Row][0]) + Decimal(DataSet[Row][1])
Value2 = Decimal(DataSet[Rand][0]) + Decimal(DataSet[Rand][1])
Value3 = Value1 - Value2
NewValue = Decimal(DataSet[Row][7]) + Value3
DataSet[Row][7] = str(NewValue)
def main():
#Importing the file
print("Importing File ", FileToImport)
OriginalDataSet = []
f = open(FileToImport)
for line in f:
StrippedLine = line.rstrip()
OriginalDataSet.append(StrippedLine.split(",",))
ArrayCount = len(OriginalDataSet)
#Running passes on dataset
for Pass in range(NumberofPasses):
print("Running Pass : ", Pass + 1, " of ", NumberofPasses)
CoreCount = mp.cpu_count()
WorkPool=mp.Pool(CoreCount)
for PoolNumber in range(CoreCount):
Params = [Pass,PoolNumber,ArrayCount,CoreCount]
RevisedDataSet = WorkPool.starmap(Calc1stPass, product(OriginalDataSet, zip(range(1),Params)))
print(RevisedDataSet)
if __name__ == "__main__":
freeze_support()
main()

Okay, here we go with what I came up with after some discussion plus trial and error. I hope I've kept it somewhat comprehensible. However, it seems you are very new to a lot of this, so you probably have a lot of reading to do regarding how certain libraries and data types work.
Analyzing the algorithm
Let's start with taking a closer look at your computation:
for Pass in range(Passes:
for Row in range(StartRow,EndRow):
Rand = randrange(ArrayCount)
Value1 = Decimal(DataSet[Row][0]) + Decimal(DataSet[Row][1])
Value2 = Decimal(DataSet[Rand][0]) + Decimal(DataSet[Rand][1])
Value3 = Value1 - Value2
NewValue = Decimal(DataSet[Row][7]) + Value3
DataSet[Row][7] = str(NewValue)
So basically, we update a single row through a computation involving another random row.
Assumptions that I make:
the real algorithm does a bit more stuff, otherwise it is hard to see what you want to achieve
the access pattern of the real algorithm stays the same
Following our discussion, there are no functional reasons for the following aspects:
Computation in Decimal is unnecessary. float will do just fine
The values don't need to be stored as string. We can use an array of float
At this point it is clear that we can save tremendous amounts of runtime by using a numpy array instead of a list of string.
There is an additional hazard here for parallelization: We use random numbers. When we use multiple processes, the random number generators need to be set up for parallel generation. We'll cross that bridge when we get there.
Notably, the output column is no input for the next pass. The inputs per pass stay constant.
Input / Output
The input file format seems to be a simple CSV mostly filled with floating point numbers (using only one decimal place) and one column not being a floating point number. The text based format coupled with your information that there are gigabytes of data means that a significant amount of time will be spent just parsing the input file or formatting the output. I'll try to be efficient in both but keep things simple enough that extensions in both are possible.
Optimizing the sequential algorithm
It is always advisable to first optimize the sequential case before parallelizing. So we start here. We begin with parsing the input file into a numpy array.
import numpy as np
def ReadInputs(Filename):
"""Read a CSV file containing 10 columns
The 7th column is skipped because it doesn't contains floating point values
Return value:
2D numpy array of floats
"""
UsedColumns = (0, 1, 2, 3, 4, 5, 7, 8, 9)
return np.loadtxt(Filename, delimiter=',', usecols=UsedColumns)
Since we are using numpy, we switch over to its random number generators. This is the setup routine. It allows us to force deterministic values for easier debugging.
def MakeRandomGenerator(Deterministic=False):
"""Initializes the random number generator avoiding birthday paradox
Arguments:
Deterministic -- if True, the same same random numbers are being used
Return value:
numpy random number generator
"""
SeedInt = 0 if Deterministic else None
Seed = np.random.SeedSequence(SeedInt)
return np.random.default_rng(Seed)
And now the main computation. Numpy makes this very straight-forward.
def ComputePass(DataSets, RandomGenerator):
"""The main computation
Arguments:
DataSets -- 2D numpy array. Changed in place
RandomGenerator -- numpy random number generator
"""
Count = len(DataSets)
RandomIndices = RandomGenerator.integers(
low=0, high=Count, size=Count)
RandomRows = DataSets[RandomIndices]
# All rows: first column + second column
Value1 = DataSets[:, 0] + DataSets[:, 1]
Value2 = RandomRows[:, 0] + RandomRows[:, 1]
Value3 = Value1 - Value2
# This change is in-place of the whole DataSets array
DataSets[:, 7] += Value3
I've kept the structure the same. That means there are a few optimizations that we can still do:
We never use most columns. Columns that are unnecessary should be removed from the array (skipped in input parsing) to reduce memory consumption and improve locality of data. If necessary for output, it is better to merge in the output stage, maybe by re-reading the input file to gather the remaining columns
Since Value1 and Value2 never change, we could pre-compute Value3 for all rows and just use that. Again, if we don't need the first two columns in memory, better to remove them
If we transpose the array (or store in Fortran order), we improve vectorization. This will make the use of MPI harder, but not impossible
I've not done any of this because I do not want to stray too far from the original algorithm.
The last step is the output. Here I go with a pure Python route to keep things simple and replicate the input file format:
def WriteOutputs(Filename, DataSets):
LineFormat = "{:.1f}, " * 6 + "+" + ", {:.1f}" * 3 + "\n"
with open(Filename, 'w') as OutFile:
for Row in DataSets:
OutFile.write(LineFormat.format(*Row))
Now the entire operation is rather simple:
def main():
InFilename = "indata.csv"
OutFilename = "outdata.csv"
Passes = 20
RandomGenerator = MakeRandomGenerator()
DataSets = ReadInputs(InFilename)
for _ in range(Passes):
ComputePass(DataSets, RandomGenerator)
WriteOutputs(OutFilename, DataSets)
if __name__ == '__main__':
main()
Parallelization framework
There are two main concerns for parallelization:
For every row, we need access to the entire input data set to pick a random entry
The amount of calculation per row is very low
So we need to find a way that keeps overhead per row small and shares the input data set efficiently.
The first choice is multiprocessing since, you know, standard library and all that. However, I think that the normal usage patterns have too much overhead. It's certainly possible but I would like to use MPI for this to give us as much performance as possible. Also, your first attempt at parallelization used a pattern that matches MPI's preferred pattern. So it is a good fit.
A word towards the concept of MPI: multiprocessing.Pool works with a main process that distributes work items among a set of worker processes. MPI start N processes that all execute the same code. There is no main process. The only distinguishing feature is the process "rank", which is a number [0, N). If you need a main process, the one with rank 0 is usually chosen. Other than that, the idea is that all processes execute the same code, only picking different indices or offsets based on their rank. If processes need to communicate, there are a couple of "collective" communication patterns such as broadcasting, scattering, and gathering.
Option 1: Pure MPI
Let's rewrite the code. The main idea is this: We distribute rows in the data set among all processes. Then each process calculates all passes for its own set of rows. Input and output take considerable time, so we try to do as much as possible parallelized, too.
We start by defining a helper function that defines how we distribute rows among all processes. This is very similar to what you had in your original version.
from mpi4py import MPI
def MakeDistribution(NumberOfRows):
"""Computes how the data set should be distributed across processes
Arguments:
NumberOfRows -- size of the whole dataset
Return value:
(Offsets, Counts) numpy integer arrays. One entry per process
"""
Comm = MPI.COMM_WORLD
WorldSize = Comm.Get_size()
SameSize, Tail = divmod(NumberOfRows, WorldSize)
Counts = np.full(WorldSize, SameSize, dtype=int)
Counts[:Tail] += 1
# Start offset per process
Offsets = np.cumsum(Counts) - Counts[0]
return Offsets, Counts
A second helper function is used to distribute the data sets among all processes. MPI's allgather function is used to collect results of a computation among all processes into one array. The normal function gather collects the whole array on one process. Allgather collects it in all processes. Since all processes need access to all data sets for their random access, we use allgather. Allgatherv is a generalized version that allows different number of entries per process. We need this because we cannot guarantee that all processes have the same number of rows in their local data set.
This function uses the "buffer" interface of mpi4py. This is the more efficient version but also very error-prone. If we mess up an index or the size of a data type, we risk data corruption.
def DistributeDataSets(DataSets, Offsets, Counts):
"""Shares the datasets with all other processes
Arguments:
DataSets -- numpy array of floats. Changed in place
Offsets, Counts -- See MakeDistribution
Return value:
DataSets. Most likely a reference to the original.
Might be an updated copy
"""
# Sanitize the input. Better safe than sorry and shouldn't cost anything
DataSets = np.ascontiguousarray(DataSets, dtype='f8')
assert len(DataSets) == np.sum(Counts)
# MPI works best if we pretend to have 1-dimensional data
InnerSize = np.prod(DataSets.shape[1:], dtype=int)
# I really wish mpi4py had a helper for this
BufferDescr = (DataSets,
Counts * InnerSize,
Offsets * InnerSize,
MPI.DOUBLE)
MPI.COMM_WORLD.Allgatherv(MPI.IN_PLACE, BufferDescr)
return DataSets
We split reading the input data into two parts. First we read all lines in a single process. This is relatively cheap and we need to know the total number of rows before we can distribute the datasets. Then we scatter the lines among all processes and let each process parse its own set of rows. After that, we use the DistributeDataSets function to let each process know all the results.
Scattering the lines uses mpi4py's pickle interface that can transfer arbitrary objects among processes. It's slower but more convenient. For stuff like lists of strings it's very good.
def ParseLines(TotalLines, Offset, OwnLines):
"""Allocates a data set and parses the own segment of it
Arguments:
TotalLines -- number of rows in the whole data set across all processes
Offset -- starting offset of the set of rows parsed by this process
OwnLines -- list of lines to be parsed by the local process
Return value:
a 2D numpy array. The rows [Offset:Offset+len(OwnLines)] are initialized
with the parsed values
"""
UsedColumns = (0, 1, 2, 3, 4, 5, 7, 8, 9)
DataSet = np.empty((TotalLines, len(UsedColumns)), dtype='f8')
OwnEnd = Offset + len(OwnLines)
for Row, Line in zip(DataSet[Offset:OwnEnd], OwnLines):
Columns = Line.split(',')
# overwrite in-place with new values
Row[:] = [float(Columns[Column]) for Column in UsedColumns]
return DataSet
def DistributeInputs(Filename):
"""Read input from the file and distribute it among processes
Arguments:
Filename -- path to the CSV file to parse
Return value:
(DataSets, Offsets, Counts) with
DataSets -- 2D array containing all values in the CSV file
Offsets -- Row indices (one per rank) where each process starts its own
processing
Counts -- number of rows per process
"""
Comm = MPI.COMM_WORLD
Rank = Comm.Get_rank()
Lines = None
LineCount = None
if not Rank:
# Read the data. We do as little work as possible here so that other
# processes can help with the parsing
with open(Filename) as InFile:
Lines = InFile.readlines()
LineCount = len(Lines)
# broadcast so that all processes know the number of datasets
LineCount = Comm.bcast(LineCount, root=0)
Offsets, Counts = MakeDistribution(LineCount)
# reshape into one list per process
if not Rank:
Lines = [Lines[Offset:Offset+Count]
for Offset, Count
in zip(Offsets, Counts)]
# distribute strings for parsing
Lines = Comm.scatter(Lines, root=0)
# parse into a float array
DataSets = ParseLines(LineCount, Offsets[Rank], Lines)
del Lines # release strings because this is a huge array
# Share the parsed result
DataSets = DistributeDataSets(DataSets, Offsets, Counts)
return DataSets, Offsets, Counts
Now we need to update the way the random number generator is initialized. What we need to prevent is that each process has the same state and generates the same random numbers. Thankfully, numpy gives us a convenient way of doing this.
def MakeRandomGenerator(Deterministic=False):
"""Initializes the random number generator avoiding birthday paradox
Arguments:
Deterministic -- if True, the same number of processes should always result
in the same random numbers being used
Return value:
numpy random number generator
"""
Comm = MPI.COMM_WORLD
Rank = Comm.Get_rank()
AllSeeds = None
if not Rank:
# the root process (rank=0) generates a seed sequence for everyone else
WorldSize = Comm.Get_size()
SeedInt = 0 if Deterministic else None
OwnSeed = np.random.SeedSequence(SeedInt)
AllSeeds = OwnSeed.spawn(WorldSize)
# mpi4py can scatter Python objects. This is the simplest way
OwnSeed = Comm.scatter(AllSeeds, root=0)
return np.random.default_rng(OwnSeed)
The computation itself is almost unchanged. We just need to limit it to the rows for which the individual process is responsible.
def ComputePass(DataSets, Offset, Count, RandomGenerator):
"""The main computation
Arguments:
DataSets -- 2D numpy array. Changed in place
Offset, Count -- rows that should be updated by this process
RandomGenerator -- numpy random number generator
"""
RandomIndices = RandomGenerator.integers(
low=0, high=len(DataSets), size=Count)
RandomRows = DataSets[RandomIndices]
# Creates a "view" into the whole dataset for the given slice
OwnDataSets = DataSets[Offset:Offset + Count]
# All rows: first column + second column
Value1 = OwnDataSets[:, 0] + OwnDataSets[:, 1]
Value2 = RandomRows[:, 0] + RandomRows[:, 1]
Value3 = Value1 - Value2
# This change is in-place of the whole DataSets array
OwnDataSets[:, 7] += Value3
Now we come to writing the output. The most expensive part is formatting the floating point numbers into strings. So we let each process format its own data. MPI has a file IO interface that allows all processes to write a single file together. Unfortunately, for text files, we need to calculate the offsets before writing the data. So we format all rows into one huge string per process, then write the file.
import io
def WriteOutputs(Filename, DataSets, Offset, Count):
"""Writes all DataSets to a CSV file
We parse all rows to a string (one per process), then write it
collectively using MPI
Arguments:
Filename -- output path
DataSets -- all values among all processes
Offset, Count -- the rows for which the local process is responsible
"""
StringBuf = io.StringIO()
LineFormat = "{:.6f}, " * 6 + "+" + ", {:.6f}" * 3 + "\n"
for Row in DataSets[Offset:Offset+Count]:
StringBuf.write(LineFormat.format(*Row))
StringBuf = StringBuf.getvalue() # to string
StringBuf = StringBuf.encode() # to bytes
Comm = MPI.COMM_WORLD
BytesPerProcess = Comm.allgather(len(StringBuf))
Rank = Comm.Get_rank()
OwnOffset = sum(BytesPerProcess[:Rank])
FileLength = sum(BytesPerProcess)
AccessMode = MPI.MODE_WRONLY | MPI.MODE_CREATE
OutFile = MPI.File.Open(Comm, Filename, AccessMode)
OutFile.Set_size(FileLength)
OutFile.Write_ordered(StringBuf)
OutFile.Close()
The main process is almost unchanged.
def main():
InFilename = "indata.csv"
OutFilename = "outdata.csv"
Passes = 20
RandomGenerator = MakeRandomGenerator()
DataSets, Offsets, Counts = DistributeInputs(InFilename)
Rank = MPI.COMM_WORLD.Get_rank()
Offset = Offsets[Rank]
Count = Counts[Rank]
for _ in range(Passes):
ComputePass(DataSets, Offset, Count, RandomGenerator)
WriteOutputs(OutFilename, DataSets, Offset, Count)
if __name__ == '__main__':
main()
You need to call this script with mpirun or mpiexec. E.g. mpiexec python3 script_name.py
Using shared memory
The MPI pattern has one significant drawback: Each process needs its own copy of the whole data set. Given its size, this is very inconvenient. We might run out of memory before we run out of CPU cores for multithreading. As a different idea, we can use shared memory. Shared memory allows multiple processes to access the same physical memory without any extra cost. This has some drawbacks:
We need a very recent python version. 3.8 IIRC
Python's implementation may behave differently on various operating systems. I could only test it on Linux. There is a chance that it will not work on any different system
IMHO python's implementation is not great. You will notice that the final version will print some warnings which I think are harmless. Maybe I'm using it wrong but I don't see a more correct way of using it
It limits you to a single PC. MPI itself is perfectly capable (and indeed designed to) operate across multiple systems on a network. Shared memory works only locally.
The major benefit is that the memory consumption does not increase with the number of processes.
We start by allocating such a data set.
From here on, we put in "barriers" at various points where processes may have to wait for one another. For example because all processes need to access the same shared memory segment, they all have to open it before we can unlink it.
from multiprocessing import shared_memory
def AllocateSharedDataSets(NumberOfRows, NumberOfCols=9):
"""Creates a numpy array in shared memory
Arguments:
NumberOfRows, NumberOfCols -- basic shape
Return value:
(DataSets, Buf) with
DataSets -- numpy array shaped (NumberOfRows, NumberOfCols).
Datatype float
Buf -- multiprocessing.shared_memory.SharedMemory that backs the array.
Close it when no longer needed
"""
length = NumberOfRows * NumberOfCols * np.float64().itemsize
Comm = MPI.COMM_WORLD
Rank = Comm.Get_rank()
Buf = None
BufName = None
if not Rank:
Buf = shared_memory.SharedMemory(create=True, size=length)
BufName = Buf.name
BufName = Comm.bcast(BufName)
if Rank:
Buf = shared_memory.SharedMemory(name=BufName, size=length)
DataSets = np.ndarray((NumberOfRows, NumberOfCols), dtype='f8',
buffer=Buf.buf)
Comm.barrier()
if not Rank:
Buf.unlink() # this may differ among operating systems
return DataSets, Buf
The input parsing also changes a little because have to put the data into the previously allocated array
def ParseLines(DataSets, Offset, OwnLines):
"""Reads lines into a preallocated array
Arguments:
DataSets -- [Rows, Cols] numpy array. Will be changed in-place
Offset -- starting offset of the set of rows parsed by this process
OwnLines -- list of lines to be parsed by the local process
"""
UsedColumns = (0, 1, 2, 3, 4, 5, 7, 8, 9)
OwnEnd = Offset + len(OwnLines)
OwnDataSets = DataSets[Offset:OwnEnd]
for Row, Line in zip(OwnDataSets, OwnLines):
Columns = Line.split(',')
Row[:] = [float(Columns[Column]) for Column in UsedColumns]
def DistributeInputs(Filename):
"""Read input from the file and stores it in shared memory
Arguments:
Filename -- path to the CSV file to parse
Return value:
(DataSets, Offsets, Counts, Buf) with
DataSets -- [Rows, 9] array containing two copies of all values in the
CSV file
Offsets -- Row indices (one per rank) where each process starts its own
processing
Counts -- number of rows per process
Buf -- multiprocessing.shared_memory.SharedMemory object backing the
DataSets object
"""
Comm = MPI.COMM_WORLD
Rank = Comm.Get_rank()
Lines = None
LineCount = None
if not Rank:
# Read the data. We do as little work as possible here so that other
# processes can help with the parsing
with open(Filename) as InFile:
Lines = InFile.readlines()
LineCount = len(Lines)
# broadcast so that all processes know the number of datasets
LineCount = Comm.bcast(LineCount, root=0)
Offsets, Counts = MakeDistribution(LineCount)
# reshape into one list per process
if not Rank:
Lines = [Lines[Offset:Offset+Count]
for Offset, Count
in zip(Offsets, Counts)]
# distribute strings for parsing
Lines = Comm.scatter(Lines, root=0)
# parse into a float array
DataSets, Buf = AllocateSharedDataSets(LineCount)
try:
ParseLines(DataSets, Offsets[Rank], Lines)
Comm.barrier()
return DataSets, Offsets, Counts, Buf
except:
Buf.close()
raise
Output writing is exactly the same. The main process changes slightly because now we have to manage the life time of the shared memory.
import contextlib
def main():
InFilename = "indata.csv"
OutFilename = "outdata.csv"
Passes = 20
RandomGenerator = MakeRandomGenerator()
Comm = MPI.COMM_WORLD
Rank = Comm.Get_rank()
DataSets, Offsets, Counts, Buf = DistributeInputs(InFilename)
with contextlib.closing(Buf):
Offset = Offsets[Rank]
Count = Counts[Rank]
for _ in range(Passes):
ComputePass(DataSets, Offset, Count, RandomGenerator)
WriteOutputs(OutFilename, DataSets, Offset, Count)
Results
I've not benchmarked the original version. The sequential version requires 2 GiB memory and 3:20 minutes for 12500000 lines and 20 passes.
The pure MPI version requires 6 GiB and 42 seconds with 6 cores.
The shared memory version requires a bit over 2 GiB of memory and 38 seconds with 6 cores.

Poor CPU utilization when transforming netcdfs to zarr and rechunking

I am transferring and rechunking data from netcdf to zarr. The process is slow and is not using much of the CPUs. I have tried several different configurations, sometimes it seems to do slightly better, but it hasn't worked well. Does anyone have any tips for making this run more efficiently?
The last attempt (and some, perhaps all, of the previous attempts) (with single machine, distributed scheduler and using threads) the logs gave this message:
distributed.core - INFO - Event loop was unresponsive in Worker for 10.05s. This is often caused by long-running GIL-holding functions or moving large chunks of data.
Previously I have had errors with memory getting used up, so I am writing the zarr in pieces, using the "stepwise_to_zarr" function below:
def stepwise_to_zarr(dataset, step_dim, step_size, chunks, out_loc, group):
start = dataset[step_dim].min()
end = dataset[step_dim].max()
iis = np.arange(start, end, step_size)
if end > iis[-1]:
iis = np.append(iis, end)
lon=dataset.get_index(step_dim)
first = True
failures = []
for i in range(1,len(iis)):
lower, upper = (iis[i-1], iis[i])
if upper >= end:
lon_list= [l for l in lon if lower <= l <= upper]
else:
lon_list= [l for l in lon if lower <= l < upper]
sub = dataset.sel(longitude=lon_list)
rechunked_sub = sub.chunk(chunks)
write_sync=zarr.ThreadSynchronizer()
if first:
rechunked_sub.to_zarr(out_loc, group=group,
consolidated=True, synchronizer=write_sync, mode="w")
first = False
else:
rechunked_sub.to_zarr(out_loc, group=group,
consolidated=True, synchronizer=write_sync, append_dim=step_dim)
chunks = {'time':8760, 'latitude':21, 'longitude':20}
ds = xr.open_mfdataset("path to data", parallel=True, combine="by_coords")
stepwise_to_zarr(ds, step_size=10, step_dim="longitude",
chunks=chunks, out_loc="path to output", group="group name")
In the plot above, the drop from ~6% utilization to ~0.5% utilization seems to coincide with the first "batch" of 10 degreees latitude being finished.
Background info:
I am using a single GCE instance of 32 vCPUs and 256 GB memory.
The data is a about 600 GB and is spread over about 150 netcdf files.
The data is in GCS and I am using Cloud Storage FUSE to read and write data.
I am rechunking the data from chunk sizes: {'time':1, 'latitude':521, 'longitude':1440} to chunksizes:{'time':8760, 'latitude':21, 'longitude':20}
I have tried:
Using the default multiprocessing scheduler
Using distributed scheduler for single machine (https://docs.dask.org/en/latest/setup/single-distributed.html) both with processes=True and processes=False.
Both distributed scheduler and the default multiprocessing sceduler while also setting environment variables to avoid oversubscribing threads, like so:
export OMP_NUM_THREADS=1
export MKL_NUM_THREADS=1
export OPENBLAS_NUM_THREADS=1
as described in best practices(https://docs.dask.org/en/latest/array-best-practices.html?highlight=export#avoid-oversubscribing-threads)

I ended up solving my problem by writing to an intermediate Zarr storage with chunks: {'time':8760, 'latitude':260, 'longitude':360}. This went fast, even though cpu the resources were only fully utilized for a relatively small portion of the job. I then read this intermediate zarr and stored in the final chunking, using a modified version of the stepwise process described in the question. This gave acceptable performance, although not ideal.
CPU utilization when writing to intermediate store
CPU utilization when writing from intermediate to final store
Here is the code:
def stepwise_to_zarr(dataset, step_dim, step_size, encoding, out_loc, group, include_end=True):
start = dataset[step_dim].min()
end = dataset[step_dim].max()
iis = np.arange(start, end, step_size)
if end > iis[-1]:
iis = np.append(iis, end)
lon=dataset.get_index(step_dim)
first = True
failures = []
for i in range(1,len(iis)):
lower, upper = (iis[i-1], iis[i])
if upper >= end and include_end:
lon_list= [l for l in lon if lower <= l <= upper]
else:
lon_list= [l for l in lon if lower <= l < upper]
sub = dataset.sel(longitude=lon_list)
write_sync=zarr.ThreadSynchronizer()
if first:
sub_write=sub.to_zarr(output_loc,
group=varname,
consolidated=True,
synchronizer=write_sync,
encoding=encoding,
mode="w", compute=False)
first = False
else:
sub_write=sub.to_zarr(output_loc,
group=varname,
consolidated=True,
synchronizer=write_sync,
append_dim=step_dim,
compute=False)
sub_write.compute(retries=2)
z = xr.open_zarr(input_loc, group=groupname)
new_chunks = {'time':8760, 'latitude':21, 'longitude':20}
z_rechunked=z.chunk(new_chunks)
#Workaround to avoid:NotImplementedError: Specified zarr chunks (8760, 260, 360) would #overlap multiple dask chunks
#See https://github.com/pydata/xarray/issues/2300
encoding = {}
for v in ['var1', 'var2', 'var3']:
encoding.update({v:z[v].encoding.copy()})
encoding[v]["chunks"]=(96408, 21, 20)
stepwise_to_zarr(z_rechunked, "longitude", 60, encoding, output_loc, group=groupname)
Note I had to overwrite the encodings to be able to rechunk the zarrs.
This process worked, but was a bit cumbersome. I only did it this way because I had not heard of rechunker. The next time I am rechunking I will try rechunker to it takes care of the issue.

Writing to file in Pool multiprocessing (Python 2.7)

I'm doing a lot of calculations writing the results to a file. Using multiprocessing I'm trying to parallelise the calculations.
Problem here is that I'm writing to one output file, which all the workers are writing too. I'm quite new to multiprocessing, and wondering how I could make it work.
A very simple concept of the code is given below:
from multiprocessing import Pool
fout_=open('test'+'.txt','w')
def f(x):
fout_.write(str(x) + "\n")
if __name__ == '__main__':
p = Pool(5)
p.map(f, [1, 2, 3])
The result I want would be a file with:
1 2 3
However now I get an empty file. Any suggestions?
I greatly appreciate any help :)!

You shouldn't be letting all the workers/processes write to a single file. They can all read from one file (which may cause slow downs due to workers waiting for one of them to finish reading), but writing to the same file will cause conflicts and potentially corruption.
Like said in the comments, write to separate files instead and then combine them into one on a single process. This small program illustrates it based on the program in your post:
from multiprocessing import Pool
def f(args):
''' Perform computation and write
to separate file for each '''
x = args[0]
fname = args[1]
with open(fname, 'w') as fout:
fout.write(str(x) + "\n")
def fcombine(orig, dest):
''' Combine files with names in
orig into one file named dest '''
with open(dest, 'w') as fout:
for o in orig:
with open(o, 'r') as fin:
for line in fin:
fout.write(line)
if __name__ == '__main__':
# Each sublist is a combination
# of arguments - number and temporary output
# file name
x = range(1,4)
names = ['temp_' + str(y) + '.txt' for y in x]
args = list(zip(x,names))
p = Pool(3)
p.map(f, args)
p.close()
p.join()
fcombine(names, 'final.txt')
It runs f for each argument combination which in this case are value of x and temporary file name. It uses a nested list of argument combinations since pool.map does not accept more than one arguments. There are other way to go around this, especially on newer Python versions.
For each argument combination and pool member it creates a separate file to which it writes the output. In principle your output will be longer, you can simply add another function that computes it to the f function. Also, no need to use Pool(5) for 3 arguments (though I assume that only three workers were active anyway).
Reasons for calling close() and join() are explained well in this post. It turns out (in the comment to the linked post) that map is blocking, so here you don't need them for the original reasons (wait till they all finish and then write to the combined output file from just one process). I would still use them in case other parallel features are added later.
In the last step, fcombine gathers and copies all the temporary files into one. It's a bit too nested, if you for instance decide to remove the temporary file after copying, you may want to use a separate function under the with open('dest', ).. or the for loop underneath - for readability and functionality.

Multiprocessing.pool spawns processes, writing to a common file without lock from each process can cause data loss.
As you said you are trying to parallelise the calculation, multiprocessing.pool can be used to parallelize the computation.
Below is the solution that do parallel computation and writes the result in file, hope it helps:
from multiprocessing import Pool
# library for time
import datetime
# file in which you want to write
fout = open('test.txt', 'wb')
# function for your calculations, i have tried it to make time consuming
def calc(x):
x = x**2
sum = 0
for i in range(0, 1000000):
sum += i
return x
# function to write in txt file, it takes list of item to write
def f(res):
global fout
for x in res:
fout.write(str(x) + "\n")
if __name__ == '__main__':
qs = datetime.datetime.now()
arr = [1, 2, 3, 4, 5, 6, 7]
p = Pool(5)
res = p.map(calc, arr)
# write the calculated list in file
f(res)
qe = datetime.datetime.now()
print (qe-qs).total_seconds()*1000
# to compare the improvement using multiprocessing, iterative solution
qs = datetime.datetime.now()
for item in arr:
x = calc(item)
fout.write(str(x)+"\n")
qe = datetime.datetime.now()
print (qe-qs).total_seconds()*1000

Optimize parsing of GB sized files in parallel

I have several compressed files with sizes on the order of 2GB compressed. The beginning of each file has a set of headers which I parse and extract a list of ~4,000,000 pointers (pointers).
For each pair of pointers (pointers[i], pointers[i+1]) for 0 <= i < len(pointers), I
seek to pointers[i]
read pointers[i+1]-pointer[i]
decompress it
do a single pass operation on that data and update a dictionary with what I find.
The issue is, I can only process roughly 30 of pointer pairs a second using a single Python process, which means each file takes more than a day to get through.
Assuming splitting up the pointers list among multiple processes doesn't hurt performance (due to each process looking at the same file, though different non-overlapping parts), how can I use multiprocessing to speed this up?
My single threaded operation looks like this:
def search_clusters(pointers, filepath, automaton, counter):
def _decompress_lzma(f, pointer, chunk_size=2**14):
# skipping over this
...
return uncompressed_buffer
first_pointer, last_pointer = pointers[0], pointers[-1]
with open(filepath, 'rb') as fh:
fh.seek(first_pointer)
f = StringIO(fh.read(last_pointer - first_pointer))
for pointer1, pointer2 in zip(pointers, pointers[1:]):
size = pointer2 - pointer1
f.seek(pointer1 - first_pointer)
buffer = _decompress_lzma(f, 0)
# skipping details, ultimately the counter dict is
# modified passing the uncompressed buffer through
# an aho corasick automaton
counter = update_counter_with_buffer(buffer, automaton, counter)
return counter
# parse file and return pointers list
bzf = ZimFile(infile)
pointers = bzf.cluster_pointers
counter = load_counter_dict() # returns collections.Counter()
automaton = load_automaton()
search_clusters(pointers, infile, autmaton, counter)
I tried changing this to use multiprocessing.Pool:
from itertools import repeat, izip
import logging
import multiprocessing
logger = multiprocessing.log_to_stderr()
logger.setLevel(multiprocessing.SUBDEBUG)
def chunked(pointers, chunksize=1024):
for i in range(0, len(pointers), chunksize):
yield list(pointers[i:i+chunksize+1])
def search_wrapper(args):
return search_clusters(*args)
# parse file and return pointers list
bzf = ZimFile(infile)
pointers = bzf.cluster_pointers
counter = load_counter_dict() # returns collections.Counter()
map_args = izip(chunked(cluster_pointers), repeat(infile),
repeat(automaton.copy()), repeat(word_counter.copy()))
pool = multiprocessing.Pool(20)
results = pool.map(search_wrapper, map_args)
pool.close()
pool.join()
but after a little while of processing, I get the following message and the script just hangs there with no further output:
[DEBUG/MainProcess] cleaning up worker 0
[DEBUG/MainProcess] added worker
[INFO/PoolWorker-20] child process calling self.run()
However, if I run with a serialized version of map without multiprocessing, things run just fine:
map(search_wrapper, map_args)
Any advice on how to change my multiprocessing code so it doesn't hang? Is it even a good idea to attempt to use multiple processes to read the same file?

Time-series data analysis using scientific python: continuous analysis over multiple files

The Problem
I'm doing time-series analysis. Measured data comes from the sampling the voltage output of a sensor at 50 kHz and then dumping that data to disk as separate files in hour chunks. Data is saved to an HDF5 file using pytables as a CArray. This format was chosen to maintain interoperability with MATLAB.
The full data set is now multiple TB, far too large to load into memory.
Some of my analysis requires me to iterative over the full data set. For analysis that requires me to grab chunks of data, I can see a path forward through creating a generator method. I'm a bit uncertain of how to proceed with analysis that requires a continuous time series.
Example
For example, let's say I'm looking to find and categorize transients using some moving window process (e.g. wavelet analysis) or apply a FIR filter. How do I handle the boundaries, either at the end or beginning of a file or at chunk boundaries? I would like the data to appear as one continuous data set.
Request
I would love to:
Keep the memory footprint low by loading data as necessary.
Keep a map of the entire data set in memory so that I can address the data set as I would a regular pandas Series object, e.g. data[time1:time2].
I'm using scientific python (Enthought distribution) with all the regular stuff: numpy, scipy, pandas, matplotlib, etc. I only recently started incorporating pandas into my work flow and I'm still unfamiliar with all of its capabilities.
I've looked over related stackexchange threads and didn't see anything that exactly addressed my issue.
EDIT: FINAL SOLUTION.
Based upon the helpful hints I built a iterator that steps over files and returns chunks of arbitrary size---a moving window that hopefully handles file boundaries with grace. I've added the option of padding the front and back of each of the windows with data (overlapping windows). I can then apply a succession of filters to the overlapping windows and then remove the overlaps at the end. This, I hope, gives me continuity.
I haven't yet implemented __getitem__ but it's on my list of things to do.
Here's the final code. A few details are omitted for brevity.
class FolderContainer(readdata.DataContainer):
def __init__(self,startdir):
readdata.DataContainer.__init__(self,startdir)
self.filelist = None
self.fs = None
self.nsamples_hour = None
# Build the file list
self._build_filelist(startdir)
def _build_filelist(self,startdir):
"""
Populate the filelist dictionary with active files and their associated
file date (YYYY,MM,DD) and hour.
Each entry in 'filelist' has the form (abs. path : datetime) where the
datetime object contains the complete date and hour information.
"""
print('Building file list....',end='')
# Use the full file path instead of a relative path so that we don't
# run into problems if we change the current working directory.
filelist = { os.path.abspath(f):self._datetime_from_fname(f)
for f in os.listdir(startdir)
if fnmatch.fnmatch(f,'NODE*.h5')}
# If we haven't found any files, raise an error
if not filelist:
msg = "Input directory does not contain Illionix h5 files."
raise IOError(msg)
# Filelist is a ordered dictionary. Sort before saving.
self.filelist = OrderedDict(sorted(filelist.items(),
key=lambda t: t[0]))
print('done')
def _datetime_from_fname(self,fname):
"""
Return the year, month, day, and hour from a filename as a datetime
object
"""
# Filename has the prototype: NODE##-YY-MM-DD-HH.h5. Split this up and
# take only the date parts. Convert the year form YY to YYYY.
(year,month,day,hour) = [int(d) for d in re.split('-|\.',fname)[1:-1]]
year+=2000
return datetime.datetime(year,month,day,hour)
def chunk(self,tstart,dt,**kwargs):
"""
Generator expression from returning consecutive chunks of data with
overlaps from the entire set of Illionix data files.
Parameters
----------
Arguments:
tstart: UTC start time [provided as a datetime or date string]
dt: Chunk size [integer number of samples]
Keyword arguments:
tend: UTC end time [provided as a datetime or date string].
frontpad: Padding in front of sample [integer number of samples].
backpad: Padding in back of sample [integer number of samples]
Yields:
chunk: generator expression
"""
# PARSE INPUT ARGUMENTS
# Ensure 'tstart' is a datetime object.
tstart = self._to_datetime(tstart)
# Find the offset, in samples, of the starting position of the window
# in the first data file
tstart_samples = self._to_samples(tstart)
# Convert dt to samples. Because dt is a timedelta object, we can't use
# '_to_samples' for conversion.
if isinstance(dt,int):
dt_samples = dt
elif isinstance(dt,datetime.timedelta):
dt_samples = np.int64((dt.day*24*3600 + dt.seconds +
dt.microseconds*1000) * self.fs)
else:
# FIXME: Pandas 0.13 includes a 'to_timedelta' function. Change
# below when EPD pushes the update.
t = self._parse_date_str(dt)
dt_samples = np.int64((t.minute*60 + t.second) * self.fs)
# Read keyword arguments. 'tend' defaults to the end of the last file
# if a time is not provided.
default_tend = self.filelist.values()[-1] + datetime.timedelta(hours=1)
tend = self._to_datetime(kwargs.get('tend',default_tend))
tend_samples = self._to_samples(tend)
frontpad = kwargs.get('frontpad',0)
backpad = kwargs.get('backpad',0)
# CREATE FILE LIST
# Build the the list of data files we will iterative over based upon
# the start and stop times.
print('Pruning file list...',end='')
tstart_floor = datetime.datetime(tstart.year,tstart.month,tstart.day,
tstart.hour)
filelist_pruned = OrderedDict([(k,v) for k,v in self.filelist.items()
if v >= tstart_floor and v <= tend])
print('done.')
# Check to ensure that we're not missing files by enforcing that there
# is exactly an hour offset between all files.
if not all([dt == datetime.timedelta(hours=1)
for dt in np.diff(np.array(filelist_pruned.values()))]):
raise readdata.DataIntegrityError("Hour gap(s) detected in data")
# MOVING WINDOW GENERATOR ALGORITHM
# Keep two files open, the current file and the next in line (que file)
fname_generator = self._file_iterator(filelist_pruned)
fname_current = fname_generator.next()
fname_next = fname_generator.next()
# Iterate over all the files. 'lastfile' indicates when we're
# processing the last file in the que.
lastfile = False
i = tstart_samples
while True:
with tables.openFile(fname_current) as fcurrent, \
tables.openFile(fname_next) as fnext:
# Point to the data
data_current = fcurrent.getNode('/data/voltage/raw')
data_next = fnext.getNode('/data/voltage/raw')
# Process all data windows associated with the current pair of
# files. Avoid unnecessary file access operations as we moving
# the sliding window.
while True:
# Conditionals that depend on if our slice is:
# (1) completely into the next hour
# (2) partially spills into the next hour
# (3) completely in the current hour.
if i - backpad >= self.nsamples_hour:
# If we're already on our last file in the processing
# que, we can't continue to the next. Exit. Generator
# is finished.
if lastfile:
raise GeneratorExit
# Advance the active and que file names.
fname_current = fname_next
try:
fname_next = fname_generator.next()
except GeneratorExit:
# We've reached the end of our file processing que.
# Indicate this is the last file so that if we try
# to pull data across the next file boundary, we'll
# exit.
lastfile = True
# Our data slice has completely moved into the next
# hour.
i-=self.nsamples_hour
# Return the data
yield data_next[i-backpad:i+dt_samples+frontpad]
# Move window by amount dt
i+=dt_samples
# We've completely moved on the the next pair of files.
# Move to the outer scope to grab the next set of
# files.
break
elif i + dt_samples + frontpad >= self.nsamples_hour:
if lastfile:
raise GeneratorExit
# Slice spills over into the next hour
yield np.r_[data_current[i-backpad:],
data_next[:i+dt_samples+frontpad-self.nsamples_hour]]
i+=dt_samples
else:
if lastfile:
# Exit once our slice crosses the boundary of the
# last file.
if i + dt_samples + frontpad > tend_samples:
raise GeneratorExit
# Slice is completely within the current hour
yield data_current[i-backpad:i+dt_samples+frontpad]
i+=dt_samples
def _to_samples(self,input_time):
"""Convert input time, if not in samples, to samples"""
if isinstance(input_time,int):
# Input time is already in samples
return input_time
elif isinstance(input_time,datetime.datetime):
# Input time is a datetime object
return self.fs * (input_time.minute * 60 + input_time.second)
else:
raise ValueError("Invalid input 'tstart' parameter")
def _to_datetime(self,input_time):
"""Return the passed time as a datetime object"""
if isinstance(input_time,datetime.datetime):
converted_time = input_time
elif isinstance(input_time,str):
converted_time = self._parse_date_str(input_time)
else:
raise TypeError("A datetime object or string date/time were "
"expected")
return converted_time
def _file_iterator(self,filelist):
"""Generator for iterating over file names."""
for fname in filelist:
yield fname

#Sean here's my 2c
Take a look at this issue here which I created a while back. This is essentially what you are trying to do. This is a bit non-trivial.
Without knowing more details, I would offer a couple of suggestions:
HDFStore CAN read in a standard CArray type of format, see here
You can easily create a 'Series' like object that has nice properties of a) knowing where each file is and its extents, and uses __getitem__ to 'select' those files, e.g. s[time1:time2]. From a top-level view this might be a very nice abstraction, and you can then dispatch operations.
e.g.
class OutOfCoreSeries(object):
def __init__(self, dir):
.... load a list of the files in the dir where you have them ...
def __getitem__(self, key):
.... map the selection key (say its a slice, which 'time1:time2' resolves) ...
.... to the files that make it up .... , then return a new Series that only
.... those file pointers ....
def apply(self, func, **kwargs):
""" apply a function to the files """
results = []
for f in self.files:
results.append(func(self.read_file(f)))
return Results(results)
This can very easily get quite complicated. For instance, if you apply an operation that does a reduction that you can fit in memory, Results can simpley be a pandas.Series (or Frame). Hoever,
you may be doing a transformation which necessitates you writing out a new set of transformed data files. If you so, then you have to handle this.
Several more suggestions:
You may want to hold onto your data in possibly multiple ways that may be useful. For instance you say that you are saving multiple values in a 1-hour slice. It may be that you can split these 1-hour files instead into a file for each variable you are saving but save a much longer slice that then becomes memory readable.
You might want to resample the data to lower frequencies, and work on these, loading the data in a particular slice as needed for more detailed work.
You might want to create a dataset that is queryable across time, e.g. say high-low peaks at varying frequencies, e.g. maybe using the Table format see here
Thus you may have multiple variations of the same data. Disk space is usually much cheaper/easier to manage than main memory. It makes a lot of sense to take advantage of that.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Sharing a large dataframe with multiprocessing.pool - python

Related

Modifying a list from multiple Python pools

Poor CPU utilization when transforming netcdfs to zarr and rechunking

Writing to file in Pool multiprocessing (Python 2.7)

Optimize parsing of GB sized files in parallel

Time-series data analysis using scientific python: continuous analysis over multiple files

Categories

Resources