We are running logfile parsing jobs in Google Dataflow using the Python SDK. Data is spread over several hundred daily logs, which we read via file pattern from Cloud Storage. Data volume for all files is about 5-8 GB (gz files) with 50-80 million lines in total.
loglines = p | ReadFromText('gs://logfile-location/logs*-20180101')
In addition, we have a simple (small) mapping CSV that maps logfile entries to human-readable text. It has about 400 lines and is about 5 KB in size.
For example, a logfile entry with [param=testing2] should be mapped to "Customer requested 14day free product trial" in the final output.
We do this in a simple beam.Map with a side input, like so:
customerActions = loglines | beam.Map(map_logentries,mappingTable)
where map_logentries is the mapping function and mappingTable is said mapping table.
However, this only works if we read the mapping table in native Python via open() / read(). If we instead read it through the Beam pipeline via ReadFromText() and pass the resulting PCollection as a side input to the Map, like so:
mappingTable = p | ReadFromText('gs://side-inputs/category-mapping.csv')
customerActions = loglines | beam.Map(map_logentries,beam.pvalue.AsIter(mappingTable))
performance breaks down completely to about 2-3 items per second.
Now, my questions:
Why would performance break down so badly? What is wrong with passing a PCollection as a side input?
If PCollections are not recommended as side inputs, how is one supposed to build such a pipeline, one that needs mappings that cannot/should not be hard-coded into the mapping function?
For us, the mapping does change frequently, and I need to find a way to have "normal" users provide it. The idea was to have the mapping CSV available in Cloud Storage and simply incorporate it into the pipeline via ReadFromText(). Reading it locally involves providing the mapping to the workers, so only the tech team can do this.
I am aware that there are caching issues with side inputs, but surely this should not apply to a 5 KB input.
All code above is pseudo code to explain the problem. Any ideas and thoughts on this would be highly appreciated!
For more efficient side inputs (of small to medium size) you can use
beam.pvalue.AsList(mappingTable)
since AsList causes Beam to materialize the data, so you can be sure that you will get an in-memory list for that PCollection.
Intended for use in side-argument specification---the same places where AsSingleton and AsIter are used, but forces materialization of this PCollection as a list.
Source: https://beam.apache.org/documentation/sdks/pydoc/2.2.0/apache_beam.pvalue.html?highlight=aslist#apache_beam.pvalue.AsList
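Applied to the code in the question, the only change is the wrapper around the side input. A rough sketch (extract_param is a hypothetical parsing helper, and the mapping CSV is assumed to have two comma-separated columns):
import apache_beam as beam

def map_logentries(line, mapping_rows):
    # mapping_rows is the materialized side input: a plain in-memory list of CSV lines.
    # Rebuilding the dict is negligible for ~400 rows.
    lookup = dict(row.split(',', 1) for row in mapping_rows if ',' in row)
    return lookup.get(extract_param(line), 'unknown')

customerActions = loglines | beam.Map(
    map_logentries, beam.pvalue.AsList(mappingTable))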
The code looks fine. However, since mappingTable is a mapping, wouldn't beam.pvalue.AsDict be more appropriate for your use case?
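A hedged sketch of that variant (same assumptions as above, plus every CSV row having both columns): the rows are first parsed into (key, value) pairs so AsDict can hand the mapping function a ready-made dict.
import apache_beam as beam

# Turn each CSV row into a (param, description) tuple so it can be passed via AsDict.
kv_mapping = mappingTable | 'ToKV' >> beam.Map(lambda row: tuple(row.split(',', 1)))

def map_logentries(line, lookup):
    # lookup arrives as a plain Python dict built from the side input.
    return lookup.get(extract_param(line), 'unknown')

customerActions = loglines | beam.Map(
    map_logentries, beam.pvalue.AsDict(kv_mapping))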
Your mappingTable is small enough that a side input is a good use case here.
Given that mappingTable is also static, you can load it from GCS in the start_bundle function of your DoFn. See the answer to this post for more details. If mappingTable becomes very large in the future, you can also consider converting your map_logentries and mappingTable into PCollections of key-value pairs and joining them using CoGroupByKey.
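A minimal sketch of the start_bundle approach, assuming a two-column comma-separated CSV and using Beam's FileSystems API to open the GCS object (extract_param is a hypothetical parsing helper):
import apache_beam as beam
from apache_beam.io.filesystems import FileSystems

class MapLogEntries(beam.DoFn):
    """Loads the small, static mapping CSV from GCS once per bundle."""

    def __init__(self, mapping_path):
        self.mapping_path = mapping_path
        self.lookup = None

    def start_bundle(self):
        # For a ~5 KB file, re-reading it per bundle is cheap.
        with FileSystems.open(self.mapping_path) as f:
            rows = f.read().decode('utf-8').splitlines()
        self.lookup = dict(row.split(',', 1) for row in rows if ',' in row)

    def process(self, line):
        yield self.lookup.get(extract_param(line), 'unknown')

customerActions = loglines | beam.ParDo(
    MapLogEntries('gs://side-inputs/category-mapping.csv'))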
I'm looking to use a distributed cache in Python. I have a FastAPI application and want every instance to have access to the same data, as our load balancer may route incoming requests differently. The problem is that I'm storing / editing information about a relatively big data set from an Arrow feather file and processing it with Vaex. The feather file automatically loads the correct types for the data. The data structure I need to store will use a user id as a key, and the value will be a large array of arrays of numbers. I've looked at memcache and Redis as possible caching solutions, but both seem to store entries as strings / simple values. I'm looking to avoid parsing strings and extra processing on a large amount of data. Is there a distributed caching strategy that will let me persist types?
One solution we came up with is to store the data in multiple feather files in a directory that is accessible to all instances of the app, but this seems messy, as you would need to clean up / delete the files after each session.
Redis 'strings' are actually able to store arbitrary binary data; they aren't limited to actual strings. From https://redis.io/topics/data-types:
Redis Strings are binary safe, this means that a Redis string can contain any kind of data, for instance a JPEG image or a serialized Ruby object.
A String value can be at max 512 Megabytes in length.
Another option is to use Flatbuffers, which is a serialisation protocol specifically designed to allow reading/writing serialised objects without expensive deserialisation.
Although I would suggest reconsidering storing large, complex data structures as cache values. The drawback is that any change means rewriting the entire value in the cache, which can get expensive, so consider breaking it up into smaller key/value pairs if possible. You could use the Redis Hash data type to make this easier to implement.
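As a hedged sketch combining both points (redis-py and NumPy are assumed; key names and the dtype are made up), the per-user arrays can be stored as raw bytes in a Redis hash keyed by user id, avoiding any string parsing:
import numpy as np
import redis

r = redis.Redis(host='localhost', port=6379)

def cache_user_arrays(user_id, arrays):
    # Store each array as its raw buffer; Redis values are binary safe.
    for name, arr in arrays.items():
        r.hset(f'user:{user_id}', name, np.asarray(arr, dtype=np.float64).tobytes())

def load_user_array(user_id, name):
    raw = r.hget(f'user:{user_id}', name)
    # The dtype (and shape, if multi-dimensional) must be known or stored alongside.
    return np.frombuffer(raw, dtype=np.float64) if raw is not None else None

cache_user_arrays('u42', {'scores': [1.5, 2.5, 3.5]})
print(load_user_array('u42', 'scores'))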
The Apache Beam documentation Authoring I/O Transforms - Overview states:
Reading and writing data in Beam is a parallel task, and using ParDos, GroupByKeys, etc… is usually sufficient. Rarely, you will need the more specialized Source and Sink classes for specific features.
Could someone please provide a very basic example of how to do this in Python?
For example, if I had a local folder containing 100 jpeg images, how would I:
Use ParDos to read/open the files.
Run some arbitrary code on the images (maybe convert them to grey-scale).
Use ParDos to write the modified images to a different local folder.
Thanks,
Here is an example of a pipeline: https://github.com/apache/beam/blob/fc738ab9ac7fdbc8ac561e580b1a557b919437d0/sdks/python/apache_beam/examples/wordcount.py#L37
In your case, get the names of the files first, then read each file one at a time, process it, and write the output.
You might also want to push the file names to a groupby to use the parallelization provided by the runner.
So in total your pipeline might look something like
Read the list of filenames -> Send the filenames through a shuffle using GroupByKey -> Get one filename at a time in a ParDo -> Read the single file, process it, and write the result in a ParDo
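A hedged, minimal sketch of that shape, assuming Pillow is available for the image work and using Beam's built-in Reshuffle transform in place of the hand-rolled GroupByKey shuffle described above (directory paths are made up):
import glob
import os

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from PIL import Image  # Pillow, assumed to be installed on the workers

INPUT_DIR = '/path/to/input'     # hypothetical local folders
OUTPUT_DIR = '/path/to/output'

class ProcessImage(beam.DoFn):
    def process(self, filename):
        # Read/open the file, convert it to grayscale, write it to the output folder.
        img = Image.open(os.path.join(INPUT_DIR, filename)).convert('L')
        img.save(os.path.join(OUTPUT_DIR, filename))
        yield filename

with beam.Pipeline(options=PipelineOptions()) as p:
    _ = (
        p
        | 'ListFiles' >> beam.Create(
            [os.path.basename(f) for f in glob.glob(os.path.join(INPUT_DIR, '*.jpg'))])
        | 'Shuffle' >> beam.Reshuffle()   # redistribute filenames so the runner can parallelize
        | 'ReadProcessWrite' >> beam.ParDo(ProcessImage())
    )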
My task is:
I have a program, written in Python which takes a set of variables (A,B,C) as input, and outputs two numbers (X, Y).
I want to run this program over a large range of inputs (A,B,C)
I'll refer to running the program with a given set of variables as 'running an experiment'
I am using a cluster to do this (I submit my jobs in SLURM)
I then want to save the results of the experiments into a single file, e.g. a data frame with columns [A|B|C|X|Y], where each row is the output of a different experiment.
My current situation is:
I have written my program in the following form:
import io
from optparse import OptionParser
parser = OptionParser()
parser.add_option("-f", "--file",action="store", type="string", dest="filename")
parser.add_option("-a", "--input_a",type="int", dest="a")
parser.add_option("-b", "--input_b",type="int", dest="b")
parser.add_option("-c", "--input_c",type="int", dest="c")
(options, args) = parser.parse_args()
def F(a, b, c):
    return (a + b, b + c)

# optparse stores values under `dest`, so the parsed values are options.a/b/c,
# not options.input_a etc.
Alice = options.a
Bob = options.b
Carol = options.c

with io.open("results_a_{0}_b_{1}_c_{2}.txt".format(Alice, Bob, Carol), "a") as f:
    (x, y) = F(Alice, Bob, Carol)
    f.write(u"first output = {0}, second output = {1}".format(x, y))
This allows me to run the program once, and save the results in a single file.
In principle, I could then submit this job for a range of (A,B,C), obtain a large number of text files with the results, and then try to aggregate them into a single file.
However, I assume this isn't the best way of going about things.
What I'd like to know is:
Is there a more natural way for me to run these experiments and save the results all in one file, and if so, what is it?
How should I submit my collection of jobs on SLURM to do so?
I currently am submitting (effectively) the script below, which is not really working:
...
for a in 1 2 3; do
for b in 10 20 30; do
for c in 100 200 300; do
python toy_experiment.py -a $a -b $b -c $c &
done
done
done
wait
[I'm wary that there are possibly many other places I'm going wrong here - I'm open to using something other than optparse to pass arguments to my program, to saving the results differently, etc. - my main goal is having a working solution.]
TL;DR
You should research how to use file locking in your system and use that to access a single output file safely.
You should also submit your script using a job array, letting each job in the array run a single experiment.
Many things going on here.
One important question: how long does F() usually take to compute? I assume you just wrote an example, but this is needed to decide the best approach to the problem.
If the time span for every calculation is short, maybe you can run a few batches, aggregating several computations in a single script: in the example, 3 batches (for a==1, a==2 and a==3), each of them computing all the possible experiments for b and c and generating 3 files that have to be aggregated at the end.
If the time span is long, then the overhead of creating some thousands of small files is not a big deal. And concatenating them afterwards will be easy.
Another thing: by putting all your jobs running simultaneously in the background you are probably overloading a single machine. I don't know how you ask SLURM for resources, but for sure you will be using only one node. And overusing it. If there are other users there, they will probably be pissed off. You must limit the number of simultaneous jobs running on a node to the number of processors granted on that node. Probably, starting your calculations with srun will help.
Another way would be to create a single job per calculation. You can encapsulate them in a job array. In that case, you will only run one experiment per job and you won't have to start and wait for anything in the background.
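As a rough sketch of the job-array idea (the parameter grid is an assumption taken from your shell loop), each array task can derive its own (a, b, c) from the SLURM_ARRAY_TASK_ID environment variable, and the whole grid is then submitted with something like sbatch --array=0-26:
import itertools
import os

# Hypothetical parameter grid; 3*3*3 = 27 combinations.
A_VALUES = [1, 2, 3]
B_VALUES = [10, 20, 30]
C_VALUES = [100, 200, 300]
grid = list(itertools.product(A_VALUES, B_VALUES, C_VALUES))

# SLURM sets this environment variable for every task in the array.
task_id = int(os.environ['SLURM_ARRAY_TASK_ID'])
a, b, c = grid[task_id]

x, y = a + b, b + c  # stand-in for the real F(a, b, c)
print(a, b, c, x, y)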
Finally, getting to your main question: what is the best way to save all this information to disk efficiently?
Creating thousands of files is easy and possible, but not the best thing for the file system. Maybe you have access to some RAM disk on a common node. Writing a small file to the compute node's local storage and then copying it to the common in-memory disk would be quite a lot more efficient. And when all the experiments are done you can aggregate the results easily. The drawback is that the space is usually very limited (I don't really know the real size of your data) and it is an in-memory disk: it can be lost in case of a power failure.
Using a single file would be a better approach, but as Dmitri Chubarov pointed out, you have to make use of the file locking mechanisms. Otherwise you risk getting mixed results.
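A minimal sketch of that locking approach using Python's fcntl (POSIX only; the filename and CSV layout are assumptions). Whether flock is honoured across nodes depends on the shared filesystem, so test it on yours:
import fcntl

def append_result(path, a, b, c, x, y):
    # Append one row under an exclusive lock so concurrent jobs cannot interleave writes.
    with open(path, 'a') as f:
        fcntl.flock(f, fcntl.LOCK_EX)
        try:
            f.write(f'{a},{b},{c},{x},{y}\n')
            f.flush()
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)

append_result('results.csv', 1, 10, 100, 11, 110)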
Finally, the approach that I feel is best suited to your problem is to use some kind of database-like solution. If you have access to a relational DB that supports transactions, just create a table with the needed fields and let your code connect and insert the results. Extracting them at the end will be a breeze. The DB can be a client/server one (MySQL, PostgreSQL, Oracle...) or an embedded one (HSQLDB). Another option would be to use some file format like NetCDF, which is precisely intended for this kind of scientific data and has some support for parallel access.
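For instance, with Python's built-in sqlite3, an embedded option in the same spirit as those named above (schema and filename are assumptions), each experiment can insert its row inside a transaction:
import sqlite3

def record_result(db_path, a, b, c, x, y):
    con = sqlite3.connect(db_path, timeout=60)  # wait if another job holds the write lock
    try:
        with con:  # commits on success, rolls back on error
            con.execute(
                'CREATE TABLE IF NOT EXISTS results (a INTEGER, b INTEGER, c INTEGER, x INTEGER, y INTEGER)')
            con.execute('INSERT INTO results VALUES (?, ?, ?, ?, ?)', (a, b, c, x, y))
    finally:
        con.close()

record_result('experiments.db', 1, 10, 100, 11, 110)
Bear in mind that SQLite's locking can be unreliable on some network filesystems, so on a cluster a client/server database is the safer choice.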
I have a data set which I do multiple mappings on.
Assuming that I have 3 key-values pair for the reduce function, how do I modify the output such that I have 3 blobfiles - one for each of the key value pair?
Do let me know if I can clarify further.
I don't think such functionality exists (yet?) in the GAE Mapreduce library.
Depending on the size of your dataset and the type of output required, you can hack your way around it with a small time investment by co-opting the reducer as another output writer. For example, if one of the reducer outputs should go straight back to the datastore and another output should go to a file, you could open a file yourself and write the outputs to it. Alternatively, you could serialize and explicitly store the intermediate map results to a temporary datastore using operation.db.Put, and perform separate Map or Reduce jobs on that datastore. Of course, that will end up being more expensive than the first workaround.
In your specific key-value example, I'd suggest writing to a Google Cloud Storage File, and postprocessing it to split it into three files as required. That'll also give you more control over final file names.
I would like to get suggestions on using a NoSQL datastore for my particular requirements.
Let me explain:
I have to process five CSV files. Each CSV contains 5 million rows, and a common id field is present in each CSV. So I need to merge all the CSVs by iterating over 5 million rows, and I went with a Python dictionary to merge all files based on the common id field. But the bottleneck here is that you can't store 5 million keys in memory (< 1 GiB) with a Python dictionary.
So I decided to use NoSQL. I think it might be helpful for handling the 5 million key-value pairs. I still don't have clear thoughts on this.
In any case, we can't reduce the iteration, since each of the five CSVs has to be iterated to update the values.
Are there simple steps to go about this?
If this is the way to go, could you suggest a NoSQL datastore for processing the key-value pairs?
Note: the values are of list type as well.
If the CSVs are already sorted by id, you can use the merge-join algorithm. It allows you to iterate over single lines, so you don't have to keep everything in memory.
Extending the algorithm to multiple tables/CSV files will be a greater challenge, though. (But probably faster than learning something new like Hadoop)
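A hedged sketch of that idea for several id-sorted CSVs (file names, the id column position, and the merge semantics are assumptions; ids are compared as strings here): stream all files through heapq.merge and group rows by id, so only one id group is in memory at a time.
import csv
import heapq
import itertools

FILES = ['a.csv', 'b.csv', 'c.csv', 'd.csv', 'e.csv']  # hypothetical, each sorted by id

def rows(path):
    # Stream one file; row[0] is assumed to be the common id column.
    with open(path, newline='') as f:
        for row in csv.reader(f):
            yield row

def merged_by_id(paths):
    streams = [rows(p) for p in paths]
    # heapq.merge keeps only one pending row per file in memory.
    ordered = heapq.merge(*streams, key=lambda r: r[0])
    for key, group in itertools.groupby(ordered, key=lambda r: r[0]):
        yield key, list(group)  # all rows sharing this id, across the five files

with open('merged.csv', 'w', newline='') as out:
    writer = csv.writer(out)
    for key, group in merged_by_id(FILES):
        # Concatenate the non-id columns from every file that has this id.
        writer.writerow([key] + [cell for row in group for cell in row[1:]])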
If this is just a one-time process, you might want to just set up an EC2 node with more than 1 GB of memory and run the Python scripts there. 5 million items isn't that much, and a Python dictionary should be fairly capable of handling it. I don't think you need Hadoop in this case.
You could also try to optimize your scripts by reordering the items in several runs, then running over the 5 files in a synchronized way using iterators so that you don't have to keep everything in memory at the same time.
As I understand it, you want to merge about 5 million items from 5 input files. If you do this on one machine it might take a long time to process 1 GB of data. So I suggest checking the possibility of using Hadoop. Hadoop is a batch processing tool. Usually Hadoop programs are written in Java, but you can write them in Python as well.
I recommend checking the feasibility of using Hadoop to process your data on a cluster. You may use HBase (a column datastore) to store your data. It's an idea; check whether it's applicable to your problem.
If this does not help, give some more details about the problem you are trying to solve. Technically you can use any language or datastore to solve this problem. But you need to find which one solves it best (in terms of time or resources) and matches your willingness to use/learn a new tool/DB.
Excellent tutorial to get started: http://developer.yahoo.com/hadoop/tutorial/