pyspark accumulators - understanding their use - python

I would like to understand what accumulators are used for. Based on online examples it seems we can use them to count specific issues with the data. For example, if I have a lot of license numbers, I can count how many of them are invalid using accumulators. But couldn't we do the same using filter and map operations? Would it be possible to show a good example where accumulators are genuinely needed? I would appreciate sample code in pyspark instead of java or scala.

Accumulators are used mostly for diagnostics and for retrieving additional data from actions, and they typically shouldn't be used as part of the main logic, especially when updated inside transformations*.
Let's start with the first case. You can use an accumulator or a named accumulator to monitor program execution in close-to-real time (it is updated per task) and, for example, kill the job if you encounter too many invalid records. The state of named accumulators can be monitored, for example, through the driver UI.
In the case of actions, accumulators can be used to gather additional statistics. For example, if you use foreach or foreachPartition to push data to an external system, you can use accumulators to keep track of failures.
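For example, here is a minimal PySpark sketch of that second case; the "external system" is just a stand-in function that rejects some records:

    from pyspark import SparkContext

    sc = SparkContext(appName="accumulator-demo")

    # Accumulator used purely for diagnostics: count records that fail to be pushed.
    failures = sc.accumulator(0)

    def push_to_external_system(record):
        # Stand-in for a real sink; here we simply reject odd numbers.
        if record % 2:
            raise ValueError("rejected")

    def push_partition(records):
        for record in records:
            try:
                push_to_external_system(record)
            except Exception:
                failures.add(1)  # updated on the executors, aggregated back on the driver

    sc.parallelize(range(1000)).foreachPartition(push_partition)
    print("failed records:", failures.value)  # read on the driver, after the action completes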
* When are accumulators truly reliable?


Recommended python scientific workflow management tool that defines dependency completeness on parameter state rather than time?

It's past time for me to move from my custom scientific workflow management (python) to some group effort. In brief, my workflow involves long running (days) processes with a large number of shared parameters. As a dependency graph, nodes are tasks that produce output or do some other work. That seems fairly universal in workflow tools.
However, key to my needs is that each task is defined by the parameters it requires. Tasks are instantiated with respect to the state of those parameters and all parameters of their dependencies. Thus if a task has completed its job according to a given parameter state, it is complete and not rerun. This parameter state is NOT the global parameter state but only what is relevant to that part of the DAG. This reliance on parameter state rather than time completed appears to be the essential difference between my needs and existing tools (at least what I have gathered from a quick look at Luigi and Airflow). Time completed might be one such parameter, but in general it is not the time that determines a (re)run of the DAG, but whether the parameter state is congruent with the parameter state of the calling task. There are non-trivial issues (to me) with 'parameter explosion' and the relationship between parameter state and the DAG, but those are not my question here.
My question -- which existing python tool would more readily allow defining 'complete' with respect to this parameter state? It's been suggested that Luigi is compatible with my needs by writing a custom complete method that would compare the metadata of built data ('targets') with the needed parameter state.
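As a sketch of that suggestion (rather than overriding complete() directly, the relevant parameter state could be encoded in the target path so that Luigi's default complete(), which just checks output().exists(), depends on that state; the parameters here are made up):

    import hashlib
    import json
    import luigi

    def state_hash(params):
        # Hash only the parameters relevant to this task (and, if needed, its upstream tasks).
        return hashlib.sha1(json.dumps(params, sort_keys=True).encode()).hexdigest()[:12]

    class Reduce(luigi.Task):
        threshold = luigi.FloatParameter()
        window = luigi.IntParameter()

        def output(self):
            # The target path depends on the parameter state, so complete() does too.
            h = state_hash({"threshold": self.threshold, "window": self.window})
            return luigi.LocalTarget("results/reduce-%s.json" % h)

        def run(self):
            with self.output().open("w") as f:
                json.dump({"threshold": self.threshold, "window": self.window}, f)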
How about Airflow? I don't see any mention of this issue but have only briefly perused the docs. Since adding this functionality is a significant effort that takes away from my 'scientific' work, I would like to start out with the better tool. Airflow definitely has momentum but my needs may be too far from its purpose.
Defining the complete parameter state is needed for two reasons -- 1) with complex, long running tasks, I can't just re-run the DAG every time I change some parameter in the very large global parameter state, and 2) I need to know how the intermediate and final results have been produced for scientific and data integrity reasons.
I looked further into Luigi and Airflow, and as far as I could discern neither is suitable for modification to my needs. The primary incompatibility is that these tools are fundamentally based on predetermined DAGs/workflows. My existing framework operates on instantiated and fully specified DAGs that are discovered at run-time rather than concisely described externally -- necessary because knowing whether each task is complete, for a given request, depends on many combinations of parameter values that define the output of that task and the utilized output of all upstream tasks. By instantiated, I mean the 'intermediate' results of individual runs, each described by the full parameter state (variable values) necessary to reproduce (notwithstanding any stochastic element) identical output from that task.
So a 'Scheduler' that operates on a DAG ahead of time is not useful.
In general, most existing workflow frameworks, at least in python, that I've glanced at appear to be designed to automate many relatively simple tasks in an easily scalable and robust manner with parallelization. Little emphasis is put on the incremental building up of more complex analyses: linking complex and expensive computational tasks whose results must be reused where possible, and whose output may well in turn become input for a further, unforeseen analysis.
I just discovered the 'Prefect' workflow this morning, and am intrigued to learn more -- at least it is clearly well funded ;-). My initial sense is that it may be less reliant on pre-scheduling and thus more fluid and more readily adapted to my needs, but that's just a hunch. In many ways some of my more complex 'single' tasks might be well suited to wrap an entire Prefect Flow if they played nicely together. It seems my needs are on the far end of the spectrum of deep complicated DAGs (I will not try to write mine out!) with never ending downstream additions.
I'm going to look into Prefect and Luigi more closely and see what I can borrow to make my framework more robust and less baroque. Or maybe I can add a layer of full data description to Prefect...
UPDATE -- after discussing with the Prefect folks, it's clear that I need to start with the underlying Dask and see if it is flexible enough, perhaps using Dask delayed or futures. Clearly Dask is extraordinary. Graphchain, built on top of Dask, is a move in the right direction: it facilitates permanent storage of 'intermediate' output computed over a dependency 'chain' identified by a hash of the code base and parameters. That is pretty close to what I need, though I want more explicit handling of the parameters that deterministically define the outputs.
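To illustrate the idea (this is not Graphchain's actual API, just a rough sketch of memoizing dask.delayed tasks to disk, keyed by their own parameters plus the identities of their dependencies; the task functions are made up):

    import hashlib
    import json
    import os
    import pickle

    from dask import delayed

    def key_for(name, params, upstream_keys=()):
        # A task's identity = its own parameters + the identities of its dependencies,
        # which transitively capture the upstream parameter state.
        payload = {"task": name, "params": params, "deps": sorted(upstream_keys)}
        return hashlib.sha1(json.dumps(payload, sort_keys=True).encode()).hexdigest()[:12]

    def cached_call(name, key, func, *args, **kwargs):
        path = os.path.join("intermediate", "%s-%s.pkl" % (name, key))
        if os.path.exists(path):  # this parameter state has already been computed
            with open(path, "rb") as f:
                return pickle.load(f)
        result = func(*args, **kwargs)
        os.makedirs("intermediate", exist_ok=True)
        with open(path, "wb") as f:
            pickle.dump(result, f)
        return result

    # Made-up tasks standing in for long-running simulation/analysis steps.
    def simulate(seed, resolution):
        return list(range(resolution))

    def analyze(data, threshold):
        return [x for x in data if x >= threshold]

    sim_params = {"seed": 1, "resolution": 256}
    sim_key = key_for("simulate", sim_params)
    sim = delayed(cached_call)("simulate", sim_key, simulate, **sim_params)

    ana_params = {"threshold": 10}
    ana_key = key_for("analyze", ana_params, upstream_keys=[sim_key])
    ana = delayed(cached_call)("analyze", ana_key, analyze, sim, **ana_params)

    print(len(ana.compute()))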

Why do I need to shuffle my PCollection for it to autoscale on Cloud Dataflow?

Context
I am reading a file from Google Storage in Beam using a process that looks something like this:
data = pipeline | beam.Create(['gs://my/file.pkl']) | beam.ParDo(LoadFileDoFn())
Where LoadFileDoFn loads the file and creates a Python list of objects from it, which ParDo then returns as a PCollection.
I know I could probably implement a custom source to achieve something similar, but this answer and Beam's own documentation indicate that this approach of reading via a pseudo-dataset and ParDo is not uncommon, and that custom sources may be overkill.
It also works - I get a PCollection with the correct number of elements, which I can process as I like! However...
Autoscaling problems
The resulting PCollection does not autoscale at all on Cloud Dataflow. I first have to transform it via:
shuffled_data = data | beam.Shuffle()
I know the answer I linked above pretty much explains this process - but it doesn't give any insight into why this is necessary. As far as I can see at Beam's very high level of abstraction, I have a PCollection with N elements before the shuffle and a similar PCollection after the shuffle. Why does one scale, but the other not?
The documentation is not very helpful in this case (or in general, but that's another matter). What hidden attribute does the first PCollection have that prevents it from being distributed to multiple workers, which the other doesn't have?
When you read via Create you are creating a PCollection that is bound to one worker. Since there are no keys associated with the items, there is no mechanism to distribute the work. Shuffle() creates key/value pairs underneath the covers and then shuffles them, which enables the PCollection items to be distributed to new workers as they spin up. You can verify this behavior by turning off autoscaling and fixing the worker count at, say, 25 - without the Shuffle you will only see one worker doing work.
Another way to distribute this work when creating/reading would be to build your own custom I/O for reading PKL files. You'd create the appropriate splitter; however, not knowing what you have pickled, it may not be splittable. IMO Shuffle() is a safe bet, unless you have an optimization to gain by writing a splittable reader.
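For completeness, here is a sketch using beam.Reshuffle(), which is how the transform is spelled in the Beam Python SDK versions I'm familiar with (the question calls it beam.Shuffle()); the body of LoadFileDoFn is a guess at what it does:

    import pickle

    import apache_beam as beam
    from apache_beam.io.filesystems import FileSystems

    class LoadFileDoFn(beam.DoFn):
        # Assumed implementation: read a pickled list from GCS and emit its elements.
        def process(self, path):
            with FileSystems.open(path) as f:
                for item in pickle.load(f):
                    yield item

    with beam.Pipeline() as pipeline:
        data = (
            pipeline
            | beam.Create(['gs://my/file.pkl'])
            | beam.ParDo(LoadFileDoFn())
            # Reshuffle assigns keys and materializes the elements, breaking the
            # fusion with Create so the items can be rebalanced across workers.
            | beam.Reshuffle()
        )
        data | beam.Map(print)  # downstream steps can now scale out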

appengine-mapreduce fails with out of memory during shuffle stage

I have ~50M entities stored in datastore. Each item can be of one type out of a total of 7 types.
Next, I have a simple MapReduce job that counts the number of items of each type. It is written in python and based on the appengine-mapreduce library. The Mapper emits (type, 1) pairs. The reducer simply adds up the 1s received for each type.
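Simplified, the two functions look roughly like this (the MapreducePipeline wiring, shard configuration, and entity attribute name are omitted or made up):

    def count_map(entity):
        # Emit one (type, 1) pair per datastore entity.
        yield (entity.item_type, "1")

    def count_reduce(key, values):
        # values holds every "1" shuffled to this type; the count is its length.
        yield "%s: %d\n" % (key, len(values))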
When I run this job with 5000 shards, the map stage runs fine. It uses a total of 20 instances, which is the maximum possible based on my task-queue configuration.
However, the shuffle-hash stage makes use of only one instance and fails with an out-of-memory error. I am not able to understand why only one instance is being used for hashing, or how I can fix this out-of-memory error.
I have tried writing a combiner but I never saw a combiner stage on the mapreduce status page or in the logs.
Also, the wiki for appengine-mapreduce on github is obsolete and I cannot find an active community where I can ask questions.
You are correct that the Python shuffle is in-memory and does not scale. There is a way to make the Python MR use the Java MR shuffle phase (which is fast and scales). Unfortunately, documentation about it (the setup and how the two libraries communicate) is poor. See this issue for more information.

Mapreduce on Google App Engine

I'm very confused with the state and documentation of mapreduce support in GAE.
In the official doc https://developers.google.com/appengine/docs/python/dataprocessing/, there is an example, but:
the application uses mapreduce.input_readers.BlobstoreZipInputReader, and I would like to use mapreduce.input_readers.DatastoreInputReader. The documentation mentions the parameters of DatastoreInputReader, but not the return value sent back to the map function...
the application "demo" (Helloworld page) has a mapreduce.yaml file which IS NOT USED in the application???
So I found http://code.google.com/p/appengine-mapreduce/. There is a complete example with mapreduce.input_readers.DatastoreInputReader, but it is written that the reduce phase isn't supported yet!
So I would like to know whether it is possible to implement the first form of mapreduce, with the DatastoreInputReader, to execute a real map/reduce and get a GROUP BY equivalent?
The second example is from the earlier release, which did indeed support just the mapper phase. However, as the first example shows, the full map/reduce functionality is now supported and has been for some time. The mapreduce.yaml is from that earlier version; it is not used now.
I'm not sure what your actual question is. The value sent to the map function from DatastoreInputReader is, not surprisingly, the individual entity which is taken from the kind being mapped over.

Designing an extensible pipeline with Python

Context: I'm currently using Python to code a data-reduction pipeline for a large astronomical imaging system. The main pipeline class passes experimental data through a number of discrete processing 'stages'.
The stages are written in separate .py files which constitute a package. A list of available stages is generated at runtime so the user can choose which stages to run the data through.
The aim of this approach is to allow the user to create additional stages in the future.
Issue: All of the pipeline configuration parameters and data structures are (currently) located within the main pipeline class. Is there a simple way to access these from within the stages which are imported at runtime?
My current best attempt seems 'wrong' and somewhat primitive, as it uses circular imports and class variables. Is there perhaps a way for a pipeline instance to pass a reference to itself as an argument to each of the stages it calls?
This is my first time coding a large python project and my lack of design knowledge is really showing.
Any help would be greatly appreciated.
I've built a similar system; it's called collective.transmogrifier. One of these days I'll make it more generic (it is currently tied to the CMF, one of the underpinnings of Plone).
Decoupling
What you need is a way to decouple the component registration for your pipeline. In Transmogrifier, I use the Zope Component Architecture (embodied in the zope.component package). The ZCA lets me register components that implement a given interface and later look up those components either as a sequence or by name. There are other ways of doing this too; for example, python eggs have the concept of entry points.
The point is that each component in the pipeline is referable by a text-only name, de-referenced at construction time. Third-party components can be slotted in for re-use by registering their own components independently of your pipeline package.
Configuration
Transmogrifier pipelines are configured using a textual format based on the python ConfigParser module, where different components of the pipeline are named, configured, and slotted together. When constructing the pipeline, each section is thus given a configuration object. Sections don't have to look up configuration centrally; they are configured on instantiation.
Central state
I also pass in a central 'transmogrifier' instance, which represents the pipeline. If any component needs to share per-pipeline state (such as caching a database connection for re-use between components), they can do so on that central instance. So in my case, each section does have a reference to the central pipeline.
Individual components and behaviour
Transmogrifier pipeline components are generators that consume elements from a preceding component in the pipeline, then yield the results of their own processing. Components thus generally have a reference to the previous stage, but have no knowledge of what consumes their output. I say 'generally' because in Transmogrifier some pipeline elements can produce elements from an external source instead of using a previous element.
If you do need to alter the behaviour of a pipeline component based on individual items to be processed, mark those items themselves with extra information for each component to discover. In Transmogrifier, items are dictionaries, and you can add extra keys to a dictionary that use the name of a component so each component can look for this extra info and alter behaviour as needed.
Summary
Decouple your pipeline components by using an indirect lookup of elements based on a configuration.
When you instantiate your components, configure them at the same time and give them what they need to do their job. That could include a central object to keep track of pipeline-specific state.
When running your pipeline, only pass through items to process, and let each component base its behaviour on that individual item only.
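To make that concrete, here is a minimal, framework-free sketch of the pattern (all names are illustrative, and the ZCA lookup is replaced by a plain dict registry):

    class Pipeline:
        def __init__(self, config):
            self.config = config  # per-pipeline configuration (e.g. parsed from ConfigParser)
            self.shared = {}      # per-pipeline state, e.g. a cached database connection

    def source(pipeline, previous, options):
        # Produces items from an external source; ignores the previous stage.
        for value in range(options.get("count", 5)):
            yield {"value": value}

    def scale(pipeline, previous, options):
        # Consumes items from the previous stage and yields transformed items.
        factor = options.get("factor", 1)
        for item in previous:
            item["value"] *= factor
            yield item

    STAGES = {"source": source, "scale": scale}  # name-based, indirect lookup

    def build(pipeline, spec):
        previous = iter(())
        for name, options in spec:  # slot named stages together from the configuration
            previous = STAGES[name](pipeline, previous, options)
        return previous

    pipe = Pipeline(config={"title": "demo"})
    for item in build(pipe, [("source", {"count": 3}), ("scale", {"factor": 10})]):
        print(item)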
Ruffus is a python library "designed to allow scientific and other analyses to be automated with the minimum of fuss and the least effort".
Positive: It allows incremental processing of data and you can define very complicated sequences. Additionally, tasks are automatically parallelized. It allows you to switch on/off functions and their order is defined automatically from the patterns you specify.
Negative: it is sometimes too pythonic for my taste, and it only switches and orders functions, not, for example, classes. But then of course you can have the code to initialize the classes within each function.
For the purpose you want, you use the @active_if decorator above a function to enable or disable it in the pipeline. You can determine whether it should be activated from an external configuration file, which you read with a ConfigParser.
In order to load the ConfigParser values, you have to write another python module which initializes the ConfigParser instance. This module has to be imported in the first lines of the pipeline module.
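A minimal sketch of that, assuming a config file with a [stages] section of on/off flags (section and option names are made up; the config is read inline here rather than in a separate module, for brevity):

    from configparser import ConfigParser
    from ruffus import active_if, originate, transform, suffix, pipeline_run

    config = ConfigParser()
    config.read("pipeline.cfg")  # must be loaded before the decorators are evaluated

    @originate(["raw.txt"])
    def make_raw(output_file):
        open(output_file, "w").write("data\n")

    @active_if(config.getboolean("stages", "calibrate", fallback=True))
    @transform(make_raw, suffix(".txt"), ".calibrated.txt")
    def calibrate(input_file, output_file):
        open(output_file, "w").write(open(input_file).read().upper())

    pipeline_run()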
Two options:
Have configuration somewhere else: have a config module, and use something like the Django settings system to make it available.
Instead of having the stages import the pipeline class, pass them a pipeline instance on instantiation.
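A minimal sketch of that second option (class and attribute names are illustrative):

    class Stage:
        def __init__(self, pipeline):
            self.pipeline = pipeline  # stages keep a reference; no circular import needed

        def run(self, data):
            offset = self.pipeline.config["offset"]
            return [x + offset for x in data]

    class Pipeline:
        def __init__(self, config, stage_classes):
            self.config = config
            self.stages = [cls(self) for cls in stage_classes]  # pass `self` at instantiation

        def run(self, data):
            for stage in self.stages:
                data = stage.run(data)
            return data

    print(Pipeline({"offset": 1}, [Stage]).run([1, 2, 3]))  # -> [2, 3, 4]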
A colleague of mine has worked on a similar pipeline for astrophysical synthetic emission maps from simulation data (svn checkout https://svn.gforge.hlrs.de/svn//opensesame).
The way he does this is:
The config lives in a separate object (actually a dictionary as in your case).
The stages either (see the sketch after this list):
receive the config object at instantiation as a constructor argument
get the config through assignment later on (e.g. stage.config = config_object)
receive the config object as an argument when executed (e.g. stage.exec(config_object, other_params))
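A compact sketch of those three styles (all names are illustrative):

    config_object = {"gain": 2.0}

    class CtorStage:
        def __init__(self, config):               # 1) config as a constructor argument
            self.config = config

    class AttrStage:
        config = None                              # 2) config assigned after construction

    class ExecStage:
        def exec(self, config, data):              # 3) config passed at execution time
            return [x * config["gain"] for x in data]

    ctor_stage = CtorStage(config_object)
    attr_stage = AttrStage()
    attr_stage.config = config_object
    print(ExecStage().exec(config_object, [1, 2, 3]))  # -> [2.0, 4.0, 6.0]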
