Flink batch data processing - python

I'm evaluating Flink for processing batches of data. As a simple example, say I have 2000 points which I would like to pass through an FIR filter using functionality provided by scipy. The scipy filter is a simple function which accepts a set of coefficients and the data to filter and returns the filtered data. Is it possible to create a transformation to handle this in Flink? It seems Flink transformations are applied on a point-by-point basis, but I may be missing something.

This should certainly be possible. Flink already has a Python API (beta) you might want to use.
About your second question: Flink can apply a function point by point and can do other things too. It depends on what kind of function you are defining. For example, filter, project, map, and flatMap are applied per record; max, min, reduce, etc. are applied to a group of records (the groups are defined via groupBy). There is also the possibility to join data from different datasets using join, cross, or coGroup. Please have a look at the list of available transformations in the documentation: https://ci.apache.org/projects/flink/flink-docs-release-1.0/apis/batch/dataset_transformations.html
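For the FIR-filter use case specifically, the scipy call operates on a whole array at once rather than point by point, so in Flink it would typically be wrapped in a group- or partition-level function rather than a per-record map. A minimal sketch of just the scipy side, assuming scipy.signal and a batch of 2000 samples (the tap count and cutoff below are purely illustrative):

import numpy as np
from scipy import signal

# Example FIR coefficients: a 31-tap low-pass filter (illustrative values).
coeffs = signal.firwin(numtaps=31, cutoff=0.2)

def fir_filter(points):
    # Operates on the whole batch of points at once, not record by record,
    # which is why it belongs in a group/partition-level transformation.
    return signal.lfilter(coeffs, 1.0, np.asarray(points))

filtered = fir_filter(np.random.rand(2000))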

Related

can I define data filters with intake catalogs?

I would like to use intake to not only link to published datasets, but also filter them in the catalog itself. Filtering is trivial to do in Python once you open the data, but this means providing the user with code beyond the metadata in order to give some guidance.
Motivation: often the user is not as familiar with the dataset as the producer, and it would be nice to do some preprocessing for them without adding a series of different filtering steps in Python.
E.g. if we have opened a CSV already, we can filter with:
df[df['rain'] > 70]
but I don't see any arguments in read_csv for either pandas or dask to do this.
There is, indeed, no way to pass a filter to pandas' or dask's read_csv functions, and therefore this is not an option supported by Intake's CSV driver.
However, Intake does support dataset transforms: https://intake.readthedocs.io/en/latest/transforms.html This means that you can operate on the output of one data source and assign a new catalogue entry to the result. The transform/computation is performed on every access; the filtered dataset is not stored anywhere (unless you also use the persist functionality).
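As a rough illustration of the idea (the exact driver name and catalogue syntax vary between Intake versions, so treat the details as assumptions rather than the documented API), the filter itself is just an ordinary function taking and returning a DataFrame, which a derived/transform catalogue entry can then reference; the catalogue path and entry name below are hypothetical:

import intake
import pandas as pd

def rainy_days(df: pd.DataFrame) -> pd.DataFrame:
    # The filtering logic the producer wants to ship alongside the metadata.
    return df[df["rain"] > 70]

# Applied by hand to the output of an existing source; a transform entry in
# the catalogue would reference this function instead, so users get the
# filtered view without writing any of this themselves.
cat = intake.open_catalog("catalog.yaml")  # hypothetical catalogue
df = cat.weather.read()                    # hypothetical entry name
filtered = rainy_days(df)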

Is .withColumn and .agg calculated in parallel in pyspark?

Consider for example
df.withColumn("customr_num", col("customr_num").cast("integer")).\
withColumn("customr_type", col("customr_type").cast("integer")).\
agg(myMax(sCollect_list("customr_num")).alias("myMaxCustomr_num"), \
myMean(sCollect_list("customr_type")).alias("myMeanCustomr_type"), \
myMean(sCollect_list("customr_num")).alias("myMeancustomr_num"),\
sMin("customr_num").alias("min_customr_num")).show()
Are .withColumn and the list of functions inside agg (sMin, myMax, myMean, etc.) calculated in parallel by Spark, or in sequence?
If sequential, how do we parallelize them ?
In essence, as long as you have more than one partition, operations are always parallelized in Spark. If what you mean, though, is whether the withColumn operations are going to be computed in one pass over the dataset, then the answer is also yes. In general, you can use the Spark UI to learn more about the way things are computed.
Let's take an example that's very similar to your example.
spark.range(1000)
  .withColumn("test", 'id cast "double")
  .withColumn("test2", 'id + 10)
  .agg(sum('id), mean('test2), count('*))
  .show
And let's have a look at the UI.
Range corresponds to the creation of the data; then you have Project (the two withColumn operations) and then the aggregation (agg) within each partition (we have 2 here). In a given partition, these things are done sequentially, but for all the partitions at the same time. Also, they are in the same stage (one blue box), which means that they are all computed in one pass over the data.
Then there is a shuffle (Exchange), which means that data is exchanged over the network (the result of the aggregations per partition), and the final aggregation is performed (HashAggregate) and then sent to the driver (collect).
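For reference, a roughly equivalent PySpark version of the example above (a sketch, using explain() instead of the UI); the physical plan shows the same pattern of a partial HashAggregate before the Exchange and a final one after it:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df = (spark.range(1000)
      .withColumn("test", F.col("id").cast("double"))
      .withColumn("test2", F.col("id") + 10))

# Both withColumn projections and the partial aggregation run in one pass
# per partition; only the per-partition results are shuffled.
df.agg(F.sum("id"), F.mean("test2"), F.count("*")).explain()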

Dask difference between .take and .compute

I don't fully understand the semantics of .take and .compute.
From the dask.bag.Bag API docs:
Bag.compute(**kwargs) Compute this dask collection
Bag.take(k[, npartitions, compute, warn]) Take the first k elements.
This makes me think that if I compute the whole collection and then take one element, .take will not trigger a re-computation. But it does. So when should someone use take vs compute? Is compute not supposed to be used during development when you want to inspect the result of a computation? Because if take(N) has the same result and using compute doesn't save anything, why use compute?
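A minimal sketch of the behaviour being described, assuming a small local bag; the values are illustrative, and persist() is shown only as the usual way to keep computed partitions in memory so that a later take() does not redo the work:

import dask.bag as db

bag = db.from_sequence(range(10)).map(lambda x: x * 2)

result = bag.compute()  # runs the whole graph, returns a plain Python list
first = result[:1]      # plain list slicing, no further dask work

bag.take(1)             # runs the graph again (on one partition by default)

bag = bag.persist()     # keep the computed partitions in memory
bag.take(1)             # now served from the persisted partitions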

Should the DataFrame function groupBy be avoided?

This link and others tell me that the Spark groupByKey is not to be used if there is a large number of keys, since Spark shuffles all the keys around. Does the same apply to the groupBy function as well? Or is this something different?
I'm asking this because I want to do what this question tries to do, but I have a very large number of keys. It should be possible to do this without shuffling all the data around by reducing on each node locally, but I can't find the PySpark way to do this (frankly, I find the documentation quite lacking).
Essentially, what I am trying to do is:
# Non-working pseudocode
df.groupBy("A").reduce(lambda x,y: if (x.TotalValue > y.TotalValue) x else y)
However, the dataframe API does not offer a "reduce" option. I'm probably misunderstanding what exactly dataframe is trying to achieve.
A DataFrame groupBy followed by an agg will not move the data around unnecessarily, see here for a good example. Hence, there is no need to avoid it.
When using the RDD API, the opposite is true. Here it is preferable to avoid groupByKey and use reduceByKey or combineByKey where possible. Some situations, however, do require one to use groupByKey.
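For the RDD case, a minimal sketch (assuming an existing SparkContext sc and illustrative (key, value) pairs) of the local-then-global reduction that reduceByKey gives you:

# reduceByKey combines values locally within each partition first, so only
# one record per key crosses the network, unlike groupByKey.
pairs = sc.parallelize([("A", 3), ("A", 7), ("B", 5)])
max_per_key = pairs.reduceByKey(lambda x, y: x if x > y else y)
print(max_per_key.collect())  # e.g. [('A', 7), ('B', 5)]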
The normal way to do this type of operation with the DataFrame API is to use groupBy followed by an aggregation using agg. In your example case, you want to find the maximum value of a single column for each group; this can be achieved with the max function:
from pyspark.sql import functions as F
joined_df.groupBy("A").agg(F.max("TotalValue").alias("MaxValue"))
In addition to max, there is a multitude of functions that can be used in combination with agg; see here for all operations.
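If, as in the linked question, you need the whole row with the maximum TotalValue per group rather than just the value itself, one common approach is a window function; this is only a sketch and assumes the same joined_df and column names as above:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy("A").orderBy(F.desc("TotalValue"))

# Keep only the top row per group; row_number breaks ties arbitrarily.
top_rows = (joined_df
            .withColumn("rn", F.row_number().over(w))
            .filter(F.col("rn") == 1)
            .drop("rn"))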
The documentation is pretty all over the place.
There has been a lot of optimization work for DataFrames. DataFrames carry additional information about the structure of your data, which helps with this. I often find that many people recommend DataFrames over RDDs due to "increased optimization."
There is a lot of heavy wizardry behind the scenes.
I recommend that you try "groupBy" on both RDDs and dataframes on large datasets and compare the results. Sometimes, you may need to just do it.
Also, for performance improvements, I suggest fiddling (through trial and error) with:
the Spark configuration options (see the configuration docs)
spark.sql.shuffle.partitions (see the Spark SQL tuning docs; a small example follows below)
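For example, the shuffle-partition setting can be changed per session (the value below is purely illustrative):

# spark.sql.shuffle.partitions controls how many partitions are used after a
# shuffle, such as the exchange triggered by groupBy; the default is 200.
spark.conf.set("spark.sql.shuffle.partitions", "64")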

Difference between tf.data.Dataset.map() and tf.data.Dataset.apply()

With the recent upgrade to version 1.4, TensorFlow included tf.data in the library core.
One "major new feature" described in the version 1.4 release notes is tf.data.Dataset.apply(), which is a "method for applying custom transformation functions". How is this different from the already existing tf.data.Dataset.map()?
The difference is that map will execute one function on every element of the Dataset separately, whereas apply will execute one function on the whole Dataset at once (such as group_by_window, given as an example in the documentation).
The argument of apply is a function that takes a Dataset and returns a Dataset, whereas the argument of map is a function that takes one element and returns one transformed element.
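A minimal sketch of the contrast, using toy lambdas rather than a real custom transformation like group_by_window:

import tensorflow as tf

ds = tf.data.Dataset.range(10)

# map: the function receives one element at a time and returns one element.
per_element = ds.map(lambda x: x * 2)

# apply: the function receives the whole Dataset and returns a new Dataset.
whole_dataset = ds.apply(lambda dataset: dataset.filter(lambda x: tf.equal(x % 2, 0)))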
Sunreef's answer is absolutely correct. You might still be wondering why we introduced Dataset.apply(), and I thought I'd offer some background.
The tf.data API has a set of core transformations—like Dataset.map() and Dataset.filter()—that are generally useful across a wide range of datasets, unlikely to change, and implemented as methods on the tf.data.Dataset object. In particular, they are subject to the same backwards compatibility guarantees as other core APIs in TensorFlow.
However, the core approach is a bit restrictive. We also want the freedom to experiment with new transformations before adding them to the core, and to allow other library developers to create their own reusable transformations. Therefore, in TensorFlow 1.4 we split out a set of custom transformations that live in tf.contrib.data. The custom transformations include some that have very specific functionality (like tf.contrib.data.sloppy_interleave()), and some where the API is still in flux (like tf.contrib.data.group_by_window()). Originally we implemented these custom transformations as functions from Dataset to Dataset, which had an unfortunate effect on the syntactic flow of a pipeline. For example:
dataset = tf.data.TFRecordDataset(...).map(...)
# Method chaining breaks when we apply a custom transformation.
dataset = custom_transformation(dataset, x, y, z)
dataset = dataset.shuffle(...).repeat(...).batch(...)
Since this seemed to be a common pattern, we added Dataset.apply() as a way to chain core and custom transformations in a single pipeline:
dataset = (tf.data.TFRecordDataset(...)
           .map(...)
           .apply(custom_transformation(x, y, z))
           .shuffle(...)
           .repeat(...)
           .batch(...))
It's a minor feature in the grand scheme of things, but hopefully it helps to make tf.data programs easier to read, and the library easier to extend.
I don't have enough reputation to comment, but I just wanted to point out that you can actually use map to apply a function to multiple elements in a dataset, contrary to @sunreef's comments on his own post.
According to the documentation, map takes as an argument
map_func: A function mapping a nested structure of tensors (having shapes and types defined by self.output_shapes and self.output_types) to another nested structure of tensors.
The output_shapes are defined by the dataset and can be modified by using API functions like batch. So, for example, you can do a batch normalization using only dataset.batch and dataset.map with:
dataset = dataset ...
dataset = dataset.batch(batch_size)
dataset = dataset.map(normalize_fn)
It seems like the primary utility of apply() is when you really want to do a transformation across the entire dataset.
Simply put, the argument transformation_func of apply() is a Dataset; the argument map_func of map() is an element.
