Dataflow Sideinputs - Worker Cache Size in SDK 2.x

I am experiencing performance issues in my pipeline in a DoFn that uses a large side input of ~1 GB. The side input is passed using pvalue.AsList(), which forces materialization of the side input.
The execution graph of the pipeline shows that this step spends most of its time reading the side input. The total amount of data read exceeds the size of the side input by far. Consequently, I conclude that the side input does not fit into the memory / cache of the workers, even though their RAM is sufficient (using n1-highmem-4 workers with 26 GB RAM).
How do I know how big this cache actually is? Is there a way to control its size using Beam Python SDK 2.15.0 (like there was the pipeline option --workerCacheMb=200 for Java 1.x SDK)?
There is no easy way of shrinking my side input by more than 10%.

If you are using AsList, you are correct that the whole side input should be loaded into memory. It may be that your worker has enough memory available, but it just takes very long to read 1GB of data into the list. Also, the size of the data that is read depends on the encoding of it. If you can share more details about your algorithm, we can try to figure out how to write a pipeline that may run more efficiently.
Another option may be to have an external service keep your side input - for instance, a Redis instance that you write to on one side, and read from on the other side.
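On the point that the number of bytes read depends on the encoding, here is a small standard-library illustration. It does not use Beam's coders at all (that is an assumption-free stand-in, not what Dataflow does internally); it only shows that the same 100,000 floats take a different number of bytes depending on how they are serialized.

```python
import pickle
from array import array

# The same values, encoded two different ways.
values = [float(i) for i in range(100_000)]

pickled = pickle.dumps(values)            # general-purpose object encoding
packed = array("d", values).tobytes()     # fixed 8 bytes per float, no per-item overhead

print(len(pickled), len(packed))
```

The pickled form carries a per-element opcode on top of the 8-byte payload, so it comes out larger than the packed form. The side input in a real job may show a similar gap between its in-memory size and the bytes actually read.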

Related

Numpy's memmap acting strangely?

I am dealing with large numpy arrays and I am trying out memmap as it could help.
big_matrix = np.memmap(parameters.big_matrix_path, dtype=np.float16, mode='w+', shape=(1000000, 1000000))
The above works fine and it creates a file on my hard drive of about 140GB.
1000000 is just a random number I used - not the one I am actually using.
I want to fill the matrix with values. Currently it is just set to zero.
for i in tqdm(range(len(big_matrix))):
    modified_row = get_row(i)
    big_matrix[i, :] = modified_row
At this point now, I have a big_matrix filled with the values I want.
The problem is that from this point on I can't operate on this memmap.
For example I want to multiply column wise (broadcast).
I run this:
big_matrix * weights[:, np.newaxis]
Where weights has the same length.
It just hangs and throws an out of memory error as my RAM and SWAP are all used.
My understanding was that the memmap will keep everything on the hard drive.
For example save the results directly there.
So I tried this then:
for i in tqdm(range(big_matrix.shape[1])):
    temp = big_matrix[:, i].tolist()
    temp = np.array(temp) * weights
The above loads only 1 column in memory, and multiply that with the weights.
Then I will save that column back in big_matrix.
But even with 1 column my program hangs. The only difference here is that the RAM is not used up.
At this point I am thinking of switching to sqlite.
I wanted to get some insight into why my code is not working.
Do I need to flush the memmap every time I change it?

np.memmap maps a part of the virtual memory to the storage device space here. The OS is free to preload pages and cache them for fast reuse. The memory is generally not flushed unless it is reclaimed (e.g. by another process or the same process). When this happens, the OS typically (partially) flushes data to the storage device and (partially) frees the physical memory used for the mapping. That being said, this behaviour depends on the actual OS. It works that way on Windows. On Linux, you can use madvise to tune this behaviour, but madvise is a low-level C function not yet supported by Numpy (though it is apparently supported in Python, see this issue for more information). Actually, Numpy does not even support closing the memmapped space (which is leaky). The solution is generally to flush data manually so as not to lose it. There are alternative solutions but none of them is great yet.
big_matrix * weights[:, np.newaxis]
It just hangs and throws an out of memory error as my RAM and SWAP are all used
This is normal since Numpy creates a new temporary array stored in RAM. There is no way to tell Numpy to store temporary arrays on the storage device. That being said, you can tell Numpy where the output data is stored using the out parameter of some functions (e.g. np.multiply supports it). The output array can be created using memmap so as not to use too much memory (subject to the behaviour of the OS).
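A minimal sketch of the out-parameter approach, assuming the result goes into a second memmap file and the work is done one block of rows at a time. The helper name scale_rows_to_memmap, the file names, and the demo sizes are made up for illustration; only np.memmap and np.multiply(..., out=...) are real NumPy calls.

```python
import os
import tempfile

import numpy as np


def scale_rows_to_memmap(src, weights, out, rows_per_chunk=1024):
    """Compute out[i, :] = src[i, :] * weights[i], one block of rows at a time."""
    for start in range(0, src.shape[0], rows_per_chunk):
        stop = min(start + rows_per_chunk, src.shape[0])
        # The result is written straight into `out`, so NumPy does not
        # allocate a full-size temporary array in RAM.
        np.multiply(src[start:stop], weights[start:stop, np.newaxis],
                    out=out[start:stop])
    out.flush()  # write pending changes back to the file


# Small demo sizes so the sketch runs anywhere (hypothetical file names).
tmp_dir = tempfile.mkdtemp()
shape = (2000, 64)
src = np.memmap(os.path.join(tmp_dir, "src.dat"), dtype=np.float16,
                mode="w+", shape=shape)
src[:] = 1.0
weights = np.arange(shape[0], dtype=np.float16)
out = np.memmap(os.path.join(tmp_dir, "out.dat"), dtype=np.float16,
                mode="w+", shape=shape)

scale_rows_to_memmap(src, weights, out, rows_per_chunk=256)
print(out[5, 0], out[1999, 63])
```

Working on whole rows keeps every read and write contiguous on disk, which matters for the stride problem discussed next.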
But even with 1 column my program hangs. The only difference here is that the RAM is not used up.
This is also expected, especially if you use an HDD and not an SSD. Indeed, the array is stored (virtually) contiguously on the storage device. big_matrix[:, i] has to fetch data with a huge stride. For each item, with a size of only 2 bytes, the OS will perform an IO request to the storage device. Storage devices are optimized for contiguous reads so fetches are buffered and each IO request has a pretty significant latency. In practice, the OS will generally fetch at least a page (typically 4096 bytes, that is 2048 times more than what is actually needed). Moreover, there is a limit on the number of IO requests that can be completed per second. HDDs can typically do about 20-200 IO requests per second while the fastest NVMe SSDs reach 100_000-600_000 IO requests per second. Note that the cache helps avoid reloading data for the next column unless there are too many loaded pages and the OS has to flush them. Reading a matrix of size (1_000_000,1_000_000) causes up to 1_000_000*1_000_000=1_000_000_000_000 fetches, which is horribly inefficient. The cache could reduce this by a large margin, but operating simultaneously on 1_000_000 pages is also horribly inefficient since the processor cannot do that (due to a limited number of entries in the TLB). This will typically result in TLB misses, that is, expensive kernel calls for each item to be read. Because a kernel call typically takes (at least) about 1 us on a mainstream PC, this means more than a week to do the whole computation.
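To make the scale of those numbers concrete, here is the arithmetic spelled out with the figures quoted above (about 1 us per kernel call, 200 IO requests per second at the top of the HDD range, 600_000 at the top of the NVMe range). It assumes the worst case of one request per matrix item, so it is an upper bound, not a measurement.

```python
SECONDS_PER_DAY = 86_400

n_rows = n_cols = 1_000_000
fetches = n_rows * n_cols            # worst case: one fetch per matrix item

kernel_call_s = 1e-6                 # ~1 us per kernel call
hdd_iops = 200                       # upper end of the HDD range quoted above
nvme_iops = 600_000                  # upper end of the NVMe range quoted above

kernel_days = fetches * kernel_call_s / SECONDS_PER_DAY
hdd_years = fetches / hdd_iops / SECONDS_PER_DAY / 365
nvme_days = fetches / nvme_iops / SECONDS_PER_DAY

print(f"kernel-call overhead alone: {kernel_days:.1f} days")
print(f"HDD at {hdd_iops} IOPS: {hdd_years:.0f} years")
print(f"NVMe at {nvme_iops} IOPS: {nvme_days:.1f} days")
```

The kernel-call overhead alone works out to roughly eleven and a half days, which is where the "more than a week" estimate comes from.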
If you want to read columns efficiently, then you need to read large chunks of columns. For example, you certainly need to read at least several hundred columns at once even on a fast NVMe SSD. For an HDD, it takes at least several tens of thousands of columns to get a proper throughput. This means you certainly cannot read the full columns efficiently due to the large amount of RAM required. Using another data layout (tiled + transposed data) is critical in this case.
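A rough sketch of reading blocks of columns rather than one column at a time. The helper names column_blocks and scale_columns_in_blocks and the demo sizes are hypothetical, and the block width would need tuning to the storage device and the available RAM. This does not implement the tiled/transposed layout; it only shows the chunked access pattern.

```python
import os
import tempfile

import numpy as np


def column_blocks(n_cols, block_width):
    """Yield (start, stop) column ranges that cover 0..n_cols."""
    for start in range(0, n_cols, block_width):
        yield start, min(start + block_width, n_cols)


def scale_columns_in_blocks(matrix, col_weights, block_width):
    """Multiply matrix[:, j] by col_weights[j] in place, a block of columns at a time."""
    for start, stop in column_blocks(matrix.shape[1], block_width):
        # Each row of the block is a contiguous run of (stop - start) items,
        # so far fewer pages are touched per useful byte than for one column.
        block = np.array(matrix[:, start:stop])   # copy the block into RAM
        block *= col_weights[start:stop]
        matrix[:, start:stop] = block             # write the block back
    matrix.flush()


# Small demo sizes so the sketch runs anywhere (hypothetical file name).
path = os.path.join(tempfile.mkdtemp(), "matrix.dat")
matrix = np.memmap(path, dtype=np.float16, mode="w+", shape=(500, 300))
matrix[:] = 1.0
col_weights = np.arange(300, dtype=np.float16)

scale_columns_in_blocks(matrix, col_weights, block_width=128)
print(matrix[0, 7], matrix[499, 299])
```

With 1_000_000 rows of float16, a block of 512 columns needs about 1 GB of RAM, which is the trade-off described above.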

Dataflow pipeline throughput decreases drastically as execution advances + unexpected side input behavior

I have a dataflow pipeline processing an input of about 1Gb of data with two dicts as side_inputs. The goal is to calculate features from the main dataset with the help of those two side_inputs.
Overall structure of the pipeline is as follows:
# First side input, ends up as a 2GB dict with 3.5 million keys
side_inp1 = (p |
    "read side_input1" >> beam.io.ReadFromAvro("$PATH/*.avro") |
    # labels must be unique within a pipeline
    "side_input1 to tuples" >> beam.Map(lambda row: (row["key"], row["value"]))
)
# Second side input, ends up as a 1.6GB dict with 4.5 million keys
side_inp2 = (p |
    "read side_input2" >> beam.io.ReadFromAvro("$PATH2/*.avro") |
    "side_input2 to tuples" >> beam.Map(lambda row: (row["key"], row["value"]))
)
# The main part of the pipeline, reading an avro dataset of 1 million rows -- 20GB
(p |
    "read inputs" >> beam.io.ReadFromAvro("$MainPath/*.avro") |
    "main func" >> beam.Map(MyMapper, pvalue.AsDict(side_inp1), pvalue.AsDict(side_inp2))
)
Here's the Dataflow graph:
And the "Featurize" step unwrapped:
So Featurize is a function that looks up ids in the side inputs, gets the vectors, and computes about 180 different vector dot products to calculate some features. It's a completely CPU-bound process and it's expected to take longer than the rest of the pipeline, but the stalling is what's strange here.
My problems are twofold:
1. The Dataflow pipeline seems to slow down drastically as it moves further along in the process. I don't know what the reasons are or how I can alleviate this problem. A throughput chart of the MyMapper step can be seen below; I'm wondering about the declining throughput (from ~400 rows/sec to nearly ~1 row/sec at the end).
2. Also, the behavior of side_inputs is strange to me. I expected the side_inputs to be read once and only once, but when I check the Job Metrics / Throughput chart, I observe the following chart. As can be seen, the pipeline is constantly reading in side_inputs, while what I want is just two dicts that are kept in memory.
Other job configurations
zone: us-central1-a
machine_type: m1-ultramem-40 (40 CPU cores, 960GB RAM)
disk_type/size: ssd/50GB
experiments: shuffle-service enabled.
max_num_workers: 1 to help ease calculations and metrics, and not have them vary due to auto-scaling.
Extra Observations
I'm constantly seeing log entries like the following in LogViewer: "[INFO] Completed workitem: 4867151000103436312 in 1069.056863785 seconds"
All completed workItems so far have taken about 1000-1100 seconds. This is another source of confusion: why should throughput drop while processing workItems takes the same time as before? Has parallelism dropped for some reason? (Maybe some hidden threading threshold that's out of my control, like harness_threads?)
In the later parts of the pipeline, looking at the logs, it looks like the execution pattern is very sequential (it seems to execute one workItem, finish it, and go on to the next, which is strange to me, considering there's 1TB of available memory and 40 cores).
There are 0 errors or even warnings

The throughput chart in point 1 is a good indicator that the performance in your job decreased somehow.
The side input is intended to be in memory; however, I'm not quite sure that a pipeline with only 1 highmem node is a good approach. With only one node, the pipeline might have bottlenecks that are difficult to identify, e.g. network or OS limitations (like the maximum number of open files in the OS, related to the files loaded into memory). Because of Beam's architecture, I think it is not a problem to have more nodes, even with autoscaling enabled, since autoscaling automatically chooses the appropriate number of worker instances required to run your job. If you are worried about calculations and metrics for other reasons, please share.
Regarding point 2, I think it is expected to find activity on the graph since the side input (in memory) is read by each element being processed. However, if this doesn't make sense for you, you can always add the complete job graph for us to understand any other details of the pipeline steps.
My recommendation is to add more workers to distribute the workload, as a PCollection is a distributed dataset that will be spread among the available nodes. You can try to get similar computational resources with more nodes, for example, 4 n2d-highmem-16 instances (16 vCPU, 128 GB). With these changes it is possible that any bottlenecks disappear or can be mitigated; in addition, you can monitor the new job in the same way:
Remember to check errors in your pipeline, so you can identify any other issues that are happening/causing the performance issue.
Check the CPU and memory usage in the Dataflow UI. If memory errors are happening at the job level, Stackdriver should show them as memory errors, but the memory of the host instance should also be checked to be sure that it is not reaching an OS limit for other reasons.
You might want to check this example with side inputs as dictionaries. I'm not an expert, but you can follow the best practice in the example.
UPDATE
If the n2d-highmem-16 machines run out of memory (OOM), it seems to me that each harness thread might be using its own copy of the dicts. I'm not quite sure whether configuring the number of threads can help, but you can try setting number_of_worker_harness_threads in the pipeline options.
On the other hand, can you expand the Featurize step? The wall time is very high in this step (~6 days), so let's check which composite transforms account for that latency. For the problematic composite transforms, please share the code snippet. To identify the composite transforms that may have issues, please refer to the Side Inputs Metrics, especially Time spent writing and Time spent reading.
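For reference, the worker suggestions above could be expressed as pipeline flags along these lines. This is only a sketch: the thread count of 8 is an arbitrary value to experiment with, not a recommendation, and the project, region, and bucket flags a real Dataflow job needs are left out.

```python
# Hypothetical values for illustration; project, region, and
# staging/temp locations are intentionally omitted.
pipeline_args = [
    "--runner=DataflowRunner",
    "--machine_type=n2d-highmem-16",            # 16 vCPU, 128 GB per worker
    "--num_workers=4",
    "--max_num_workers=4",
    "--number_of_worker_harness_threads=8",     # arbitrary starting point
]

print(" ".join(pipeline_args))
```

A list like this would typically be handed to apache_beam.options.pipeline_options.PipelineOptions when constructing the pipeline.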

How large data can Python Ray handle?

Python Ray looks interesting for machine learning applications. However, I wonder how large a dataset Python Ray can handle. Is it limited by memory, or can it actually handle data that exceeds memory?

It currently works best when the data fits in memory (if you're on a cluster, then that means the aggregate memory of the cluster). If the data exceeds the available memory, then Ray will evict the least recently used objects. If those objects are needed later on, they will be reconstructed by rerunning the tasks that created them.
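The evict-then-reconstruct behaviour described above can be pictured with a toy model. This is a plain-Python illustration of the idea only; the class ToyObjectStore and its methods are invented for this sketch and are not Ray's API or its actual implementation.

```python
from collections import OrderedDict


class ToyObjectStore:
    """Toy LRU object store that rebuilds evicted objects by rerunning tasks.

    Hypothetical illustration only; not Ray's API or implementation.
    """

    def __init__(self, capacity):
        self.capacity = capacity        # max number of objects held in memory
        self.objects = OrderedDict()    # object id -> value, least recent first
        self.lineage = {}               # object id -> task that produces it
        self.reconstructions = 0

    def put(self, object_id, task):
        """Run `task`, remember it as lineage, and store its result."""
        self.lineage[object_id] = task
        self._store(object_id, task())

    def get(self, object_id):
        if object_id in self.objects:
            self.objects.move_to_end(object_id)   # mark as most recently used
            return self.objects[object_id]
        # Evicted earlier: rebuild the value by rerunning the task.
        self.reconstructions += 1
        value = self.lineage[object_id]()
        self._store(object_id, value)
        return value

    def _store(self, object_id, value):
        self.objects[object_id] = value
        self.objects.move_to_end(object_id)
        while len(self.objects) > self.capacity:
            self.objects.popitem(last=False)      # evict least recently used


store = ToyObjectStore(capacity=2)
store.put("a", lambda: [1, 2, 3])
store.put("b", lambda: [4, 5, 6])
store.put("c", lambda: [7, 8, 9])     # store is full, so "a" is evicted
value_a = store.get("a")              # rebuilt by rerunning its task; "b" is evicted
print(value_a, store.reconstructions, list(store.objects))
```

The cost of exceeding memory in this model is recomputation, which is why it works best when the working set fits.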

Run out of VRAM using Theano on Amazon cluster

I'm trying to execute the logistic_sgd.py code on an Amazon cluster running the ami-b141a2f5 (Theano - CUDA 7) image.
Instead of the included MNIST database I am using the SD19 database, which requires changing a few dimensional constants, but otherwise no code has been touched. The code runs fine locally, on my CPU, but once I SSH the code and data to the Amazon cluster and run it there, I get this output:
It looks to me like it is running out of VRAM, but it was my understanding that the code should run on a GPU already, without any tinkering on my part necessary. After following the suggestion from the error message, the error persists.

There's nothing especially strange here. The error message is almost certainly accurate: there really isn't enough VRAM. Often, a script will run fine on CPU but then fail like this on GPU simply because there is usually much more system memory available than GPU memory, especially since the system memory is virtualized (and can page out to disk if required) while the GPU memory isn't.
For this script, there needs to be enough memory to store the training, validation, and testing data sets, the model parameters, and enough working space to store intermediate results of the computation. There are two options available:
Reduce the amount of memory needed for one or more of these three components. Reducing the amount of training data is usually easiest; reducing the size of the model comes next. Unfortunately, both of those options will often impair the quality of the result that is being looked for. Reducing the amount of memory needed for intermediate results is usually beyond the developer's control -- it is managed by Theano, but there is sometimes scope for altering the computation to achieve this goal once a good understanding of Theano's internals is achieved.
If the model parameters and working memory can fit in GPU memory then the most common solution is to change the code so that the data is no longer stored in GPU memory (i.e. just store it as numpy arrays, not as Theano shared variables) then pass each batch of data in as inputs instead of givens. The LSTM sample code is an example of this approach.
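The second option can be sketched without any Theano code: keep the full data set as ordinary NumPy arrays in host memory and hand the training function one batch at a time. The helper iterate_minibatches, the array shapes, and the train_model name in the comment are all hypothetical stand-ins; the real script's compiled function would take the batch as explicit inputs.

```python
import numpy as np


def iterate_minibatches(features, labels, batch_size):
    """Yield (x_batch, y_batch) slices of arrays that live in host memory."""
    for start in range(0, len(features), batch_size):
        stop = start + batch_size
        yield features[start:stop], labels[start:stop]


# Stand-in data; in the real script these would be the SD19 arrays.
features = np.zeros((1000, 16), dtype=np.float32)
labels = np.zeros(1000, dtype=np.int32)

batch_shapes = []
for x_batch, y_batch in iterate_minibatches(features, labels, batch_size=300):
    # A compiled training function would be called here with the batch as
    # explicit inputs (hypothetical: train_model(x_batch, y_batch)).
    batch_shapes.append(x_batch.shape)

print(batch_shapes)
```

Only one batch needs to be on the GPU at any moment, so the data set size no longer counts against VRAM.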

Processing buffers bigger than 65536 in Clyther/OpenCL

I am currently in the process of discovering OpenCL via the Python binding Clyther. So far I am messing with a very simple script to get the sin or cos of a buffer of 65536.
Apparently 65536 is the limit for buffers on my card, but say I had 16 million numbers in my buffer: how would I go about it without constantly bringing the CPU into it to retrieve/send data?
What I have done so far is fill buffer, run kernel, retrieve buffer, in a loop, but that also hits the CPU badly.
I looked a bit at the OpenCL docs but I just failed to understand how that is achieved.
Thank you

This looks awfully like you are using __constant memory. The solution is to use __global memory instead, but you have to be careful about how you access it for best performance.
__constant memory is a special address space for often used constant values, but is restricted in size on current GPUs.
