Efficient ParDo setup or start_bundle for side input - python

list A: 25M hashes
list B: 175K hashes
I want to check each hash in list B for existence in list A. For this I have a ParDo function, and I yield the element when it is not matched. This is a deduplication process.
How do I set up this ParDo efficiently? Currently I pass list A as a side input while processing list B. But shouldn't the side input go to setup() or start_bundle() of the ParDo, so that the lookup list (A) is stored on the worker just once?
class Checknewrecords(beam.DoFn):
    def process(self, element, hashlist):
        if element['TA_HASH'] not in hashlist:
            yield element
        else:
            pass
If you have the answer, please include a link to the documentation, because I did not find any good documentation for the Python version.
transformed_records is a PCollection from a previous transformation
current_data is a PCollection from a BigQuery.read
new_records = transformed_records | 'Checknewrecords' >> beam.ParDo(Checknewrecords(), pvalue.AsList(current_data))

I believe that pvalue.AsDict is what you need; it gives you a dictionary-style interface for the side input. You can find some examples with the Apache Beam GitHub search.
Here is a simplified example I just wrote, but please see the checked-in example below (though a bit more complicated) in case I made a mistake.
class ComputeHashes(beam.DoFn):
    def process(self, element):
        # Use the hash of the element as a key to produce a KV; the value is not used.
        yield (HashFunction(element), True)

initial_elements = beam.Create(["foo"])
computed_hashes = initial_elements | beam.ParDo(ComputeHashes())
class FilterIfAlreadyComputedHash(beam.DoFn):
    def process(self, element, hashes):
        # Filter out the element if its hash already exists in the side-input dict.
        if not hashes.get(HashFunction(element)):
            yield element

more_elements = beam.Create(["foo", "bar"])  # Read from your pipeline's source
small_words = more_elements | beam.ParDo(FilterIfAlreadyComputedHash(), beam.pvalue.AsDict(computed_hashes))
In the checked-in example from the Beam GitHub repo, in visionml_test.py, a PCollection is converted to a dictionary-type view using beam.pvalue.AsDict().
class VisionMlTestIT(unittest.TestCase):
  def test_text_detection_with_language_hint(self):
    IMAGES_TO_ANNOTATE = [
        'gs://apache-beam-samples/advanced_analytics/vision/sign.jpg'
    ]
    IMAGE_CONTEXT = [vision.types.ImageContext(language_hints=['en'])]
    with TestPipeline(is_integration_test=True) as p:
      contexts = p | 'Create context' >> beam.Create(
          dict(zip(IMAGES_TO_ANNOTATE, IMAGE_CONTEXT)))
      output = (
          p
          | beam.Create(IMAGES_TO_ANNOTATE)
          | AnnotateImage(
              features=[vision.types.Feature(type='TEXT_DETECTION')],
              context_side_input=beam.pvalue.AsDict(contexts))
          | beam.ParDo(extract))
The side input is passed into a FlatMap (in visionml.py), and in the FlatMap's function an entry is retrieved from the dictionary with .get(). This could also be passed into a Map or ParDo. See the Beam Python side input documentation (there they use .AsSingleton instead of .AsDict). You can find an example here of using it in the process call.
class AnnotateImage(PTransform):
  """A ``PTransform`` for annotating images using the GCP Vision API.
  ref: https://cloud.google.com/vision/docs/

  Batches elements together using ``util.BatchElements`` PTransform and sends
  each batch of elements to the GCP Vision API.
  Element is a Union[text_type, binary_type] of either an URI (e.g. a GCS URI)
  or binary_type base64-encoded image data.
  Accepts an `AsDict` side input that maps each image to an image context.
  """

  MAX_BATCH_SIZE = 5
  MIN_BATCH_SIZE = 1

  def __init__(
      self,
      features,
      retry=None,
      timeout=120,
      max_batch_size=None,
      min_batch_size=None,
      client_options=None,
      context_side_input=None,
      metadata=None):
    """
    Args:
      features: (List[``vision.types.Feature.enums.Feature``]) Required.
        The Vision API features to detect
      retry: (google.api_core.retry.Retry) Optional.
        A retry object used to retry requests.
        If None is specified (default), requests will not be retried.
      timeout: (float) Optional.
        The time in seconds to wait for the response from the Vision API.
        Default is 120.
      max_batch_size: (int) Optional.
        Maximum number of images to batch in the same request to the Vision API.
        Default is 5 (which is also the Vision API max).
        This parameter is primarily intended for testing.
      min_batch_size: (int) Optional.
        Minimum number of images to batch in the same request to the Vision API.
        Default is None. This parameter is primarily intended for testing.
      client_options:
        (Union[dict, google.api_core.client_options.ClientOptions]) Optional.
        Client options used to set user options on the client.
        API Endpoint should be set through client_options.
      context_side_input: (beam.pvalue.AsDict) Optional.
        An ``AsDict`` of a PCollection to be passed to the
        _ImageAnnotateFn as the image context mapping containing additional
        image context and/or feature-specific parameters.
        Example usage::

          image_contexts =
            [(''gs://cloud-samples-data/vision/ocr/sign.jpg'', Union[dict,
              ``vision.types.ImageContext()``]),
             (''gs://cloud-samples-data/vision/ocr/sign.jpg'', Union[dict,
              ``vision.types.ImageContext()``]),]

          context_side_input =
            (
                p
                | "Image contexts" >> beam.Create(image_contexts)
            )

          visionml.AnnotateImage(features,
            context_side_input=beam.pvalue.AsDict(context_side_input)))
      metadata: (Optional[Sequence[Tuple[str, str]]]): Optional.
        Additional metadata that is provided to the method.
    """
    super(AnnotateImage, self).__init__()
    self.features = features
    self.retry = retry
    self.timeout = timeout
    self.max_batch_size = max_batch_size or AnnotateImage.MAX_BATCH_SIZE
    if self.max_batch_size > AnnotateImage.MAX_BATCH_SIZE:
      raise ValueError(
          'Max batch_size exceeded. '
          'Batch size needs to be smaller than {}'.format(
              AnnotateImage.MAX_BATCH_SIZE))
    self.min_batch_size = min_batch_size or AnnotateImage.MIN_BATCH_SIZE
    self.client_options = client_options
    self.context_side_input = context_side_input
    self.metadata = metadata

  def expand(self, pvalue):
    return (
        pvalue
        | FlatMap(self._create_image_annotation_pairs, self.context_side_input)
        | util.BatchElements(
            min_batch_size=self.min_batch_size,
            max_batch_size=self.max_batch_size)
        | ParDo(
            _ImageAnnotateFn(
                features=self.features,
                retry=self.retry,
                timeout=self.timeout,
                client_options=self.client_options,
                metadata=self.metadata)))

  @typehints.with_input_types(
      Union[text_type, binary_type], Optional[vision.types.ImageContext])
  @typehints.with_output_types(List[vision.types.AnnotateImageRequest])
  def _create_image_annotation_pairs(self, element, context_side_input):
    if context_side_input:  # If we have a side input image context, use that
      image_context = context_side_input.get(element)
    else:
      image_context = None
    if isinstance(element, text_type):
      image = vision.types.Image(
          source=vision.types.ImageSource(image_uri=element))
    else:  # Typehint checks only allows text_type or binary_type
      image = vision.types.Image(content=element)
    request = vision.types.AnnotateImageRequest(
        image=image, features=self.features, image_context=image_context)
    yield request
Note, in Java you use it as .asMap().
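Applied back to the question's Checknewrecords DoFn, a minimal sketch of the AsDict approach might look like the following (my own sketch, not part of the original answer; it assumes the BigQuery rows in current_data expose the hash under 'TA_HASH' - adjust the Map if they are bare hash strings):
import apache_beam as beam

class Checknewrecords(beam.DoFn):
    def process(self, element, hash_dict):
        # hash_dict is the AsDict view of list A: {TA_HASH: True}
        if element['TA_HASH'] not in hash_dict:
            yield element

# Turn each existing hash into a (hash, True) pair so it can be viewed as a dict.
hash_kvs = current_data | 'HashToKV' >> beam.Map(lambda row: (row['TA_HASH'], True))

new_records = (
    transformed_records
    | 'Checknewrecords' >> beam.ParDo(Checknewrecords(), beam.pvalue.AsDict(hash_kvs))
)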

Sorry, I initially misunderstood the question. Actually, I don't think it's possible to have a side input in start_bundle; it is only accessible in process. But you could instead do the work on the first call to process and get a similar result.
class DoFnMethods(beam.DoFn):
    def __init__(self):
        self.first_element_processed = False
        self.once_retrieved_side_input_data = None

    def called_once(self, side_input):
        if self.first_element_processed:
            return
        self.once_retrieved_side_input_data = side_input.get(...)
        self.first_element_processed = True

    def process(self, element, side_input):
        self.called_once(side_input)
        ...
Note: you do need to be aware that start_bundle and finish_bundle are called once for the bundle across all windows, while the side input provided to process is different for each window computed. So if you are working with windows, you may need to use a dict (keyed by window) for the self.first_element_processed and self.once_retrieved_side_input_data variables, so that called_once runs once for each window, as in the sketch below.
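For example, a minimal per-window variant of the DoFn above might look like this (my own sketch, not from the original answer; beam.DoFn.WindowParam supplies the current window, and 'some_key' is a placeholder for whatever lookup you need):
import apache_beam as beam

class WindowAwareDoFnMethods(beam.DoFn):
    def __init__(self):
        # Cache the extracted side-input data separately for each window.
        self.side_input_data_by_window = {}

    def process(self, element, side_input, window=beam.DoFn.WindowParam):
        if window not in self.side_input_data_by_window:
            # First element seen for this window: do the expensive
            # side-input work once and cache the result.
            self.side_input_data_by_window[window] = side_input.get('some_key')
        cached = self.side_input_data_by_window[window]
        # ... use `cached` and `element` here ...
        yield element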

Related

Windowing Strategy for Unbounded Side Input

I have a pipeline which streams IoT log data. In addition, it has a bounded side input containing the initial configurations of the devices. This configuration changes over time and has to be updated by specific logs (ConfigLogs) coming from the main PubSub source. All remaining logs (PayloadLogs) need to consume the updated configuration at some point.
In order to have access to the latest configuration, I came up with the following pipeline design: the ConfigLogs update the bounded config, and the result is fed back as a side input to the transform processing the PayloadLogs. However, I was unable to get this to work. In particular, I struggle with the correct window/trigger strategy for the side input of Use Config Side Input.
Here is a pseudocode example:
import apache_beam as beam
from apache_beam.transforms import window
from apache_beam.transforms.trigger import AccumulationMode, Repeatedly, AfterCount


class UpdateConfig(beam.DoFn):
    # currently dummy logic
    def process(self, element, outdated_config_list):
        outdated_config = outdated_config_list[0]
        print(element)
        print(outdated_config)
        yield {**outdated_config, **{element[0]: element[1]}}


class UseConfig(beam.DoFn):
    # currently dummy logic
    def process(self, element, latest_config_list):
        print(element)
        print(latest_config_list)


with beam.Pipeline() as pipeline:
    # Bounded side input
    outdated_config = (
        pipeline
        | "Config BQ" >> beam.Create([{3: 'config3', 4: 'config4', 5: 'config5'}])
        | "Window Side Input" >> beam.WindowInto(window.GlobalWindows())
    )

    # Unbounded source
    config_logs, payload_logs = (
        pipeline
        | "IoT Data" >>
        beam.io.ReadFromPubSub(subscription="MySub").with_output_types(bytes)
        | "Decode" >> beam.Map(lambda x: eval(x.decode('utf-8')))
        | "Partition Config/Payload Logs" >>
        beam.Partition(lambda x, nums: 0 if type(x[0]) == int else 1, 2)
    )

    # Update of bounded side input with part of the unbounded source
    latest_config = (
        config_logs
        | "Update Batch Config" >>
        beam.ParDo(UpdateConfig(), outdated_config_list=beam.pvalue.AsList(outdated_config))
        | "LatestConfig Window/Trigger" >>
        beam.WindowInto(
            window.GlobalWindows(),
            trigger=Repeatedly(AfterCount(1)),
            accumulation_mode=AccumulationMode.DISCARDING
        )
    )

    # Use updated side input
    (
        payload_logs
        | "Use Config Side Input" >>
        beam.ParDo(UseConfig(), latest_config_list=beam.pvalue.AsList(latest_config))
    )
which I feed with the following type of dummy data:
# ConfigLog: (1, 'config1')
# PayloadLog: ('1', 'value1')
The prints of Update Batch Config are always executed; however, Dataflow waits indefinitely in Use Config Side Input, even though I added an AfterCount trigger in LatestConfig Window/Trigger.
When I switch to a FixedWindow (and also add one for the PayloadLogs), I do get the prints in Use Config Side Input, but the side input is mostly empty, since the FixedWindow is triggered after a fixed amount of time, irrespective of whether an incoming ConfigLog arrived or not.
What is the correct windowing/triggering strategy when using an unbounded side input?

Elasticsearch behaviour

I have a question about the expected behaviour of Elasticsearch (version 7.9.1) that I'm having a hard time finding the answer to.
I query Elasticsearch with the help of the elasticsearch-dsl (version 7.3.0) library. My code is as follows:
item_search = ItemSearch(search, query_facets)
item_search = item_search[0:9999]
res = item_search.execute()
Here search is a search term for full-text search, and query_facets is a dictionary mapping fields to the terms in the fields.
The ItemSearch class looks like this:
class ItemSearch(FacetedSearch):
    doc_types = [ItemsIndex, ]
    size = 20
    facets = {'language': TermsFacet(field='language.raw', size=size), }

    def __init__(self, search, query_facets):
        super().__init__(search, query_facets)

    def search(self):
        s = super(ItemSearch, self).search()
        return s
The language field has many thousands of values, but I limited the return size to 20 since we never want to display more results than around that number anyway.
Now on to my actual question: I would expect that if I pass, for example, {'language': ["Dutch"]} to ItemSearch as the query_facets parameter, Elasticsearch returns the count for "Dutch" whether or not it belongs to the top 20 results. However, this is not the case. Is this the expected behaviour, or am I missing something? And if so, how can I achieve the result I'm after?

Redis - Python example of xadd and xread

Could you give a very simple example of using Redis' xread and xadd in Python (one that displays the type and format of the return values from xread and the input of xadd)? I've already read a lot of documentation, but none of it is in Python.
The Redis doc gives an example:
> XADD mystream * sensor-id 1234 temperature 19.8
1518951480106-0
but if I try in Python:
sample = {b"hello":b"12"}
id = r.xadd("mystream", sample)
I get this error:
redis.exceptions.ResponseError: WRONGTYPE Operation against a key holding the wrong kind of value
Make sure to flush before running, just to be sure there isn't an existing list/stream with the same name:
redis-cli flushall
import redis
from json import JSONEncoder

if __name__ == '__main__':
    r = redis.Redis(host='localhost', port=6379, db=0)
    encoder = JSONEncoder()
    sample = {"hello": encoder.encode([1234, 125, 1235, 1235])}  # converts the list to a string
    stream_name = 'mystream'
    for i in range(10):
        r.xadd(stream_name, sample)
    # "$" doesn't seem to work in python
    read_samples = r.xread({stream_name: b"0-0"})
Based on redis-py documentation:
Redis initialization:
redis = redis.Redis(host='localhost')
To add a key-value pair (key-value should be a dictionary):
redis.xadd(stream_name, {key: value})
Block to read:
redis.xread({stream_name: '$'}, None, 0)
stream_name and ID are passed together as a dictionary.
$ means the newest message.
Moreover, instead of passing a normal ID for the stream mystream I
passed the special ID $. This special ID means that XREAD should use
as last ID the maximum ID already stored in the stream mystream, so
that we will receive only new messages, starting from the time we
started listening. (from here)
COUNT should be None if you want to receive everything that is new, rather than limiting the number of messages.
0 for the BLOCK option means block with a timeout of 0 milliseconds (that is, never time out).
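Putting those pieces together, here is a minimal end-to-end sketch (my own, not from the answers above) that prints the types and rough shapes of the values involved; the exact entry IDs will differ on every run:
import redis

r = redis.Redis(host='localhost', port=6379, db=0)

# xadd takes a dict of field/value pairs and returns the generated entry ID.
entry_id = r.xadd('mystream', {'sensor-id': '1234', 'temperature': '19.8'})
print(type(entry_id), entry_id)   # e.g. <class 'bytes'> b'1518951480106-0'

# xread takes a dict of {stream_name: last_id_seen}; 0 means "from the beginning".
messages = r.xread({'mystream': 0})
print(type(messages))             # <class 'list'>
# Roughly: [[b'mystream', [(b'1518951480106-0', {b'sensor-id': b'1234',
#                                                b'temperature': b'19.8'})]]]
print(messages)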
Looking at the help (or the docstrings (1), (2)) for the functions, they're quite straightforward:
>>> import redis
>>> r = redis.Redis()
>>> help(r.xadd)
xadd(name, fields, id='*', maxlen=None, approximate=True)
    Add to a stream.
    name: name of the stream
    fields: dict of field/value pairs to insert into the stream
    id: Location to insert this record. By default it is appended.
    maxlen: truncate old stream members beyond this size
    approximate: actual stream length may be slightly more than maxlen

>>> help(r.xread)
xread(streams, count=None, block=None)
    Block and monitor multiple streams for new data.
    streams: a dict of stream names to stream IDs, where
        IDs indicate the last ID already seen.
    count: if set, only return this many items, beginning with the
        earliest available.
    block: number of milliseconds to wait, if nothing already present.

Simple example of retrieving 500 items from dynamodb using Python

Looking for a simple example of retrieving 500 items from DynamoDB while minimizing the number of queries. I know there's a "multiget" function that would let me break this up into chunks of 50 queries, but I'm not sure how to do this.
I'm starting with a list of 500 keys. I'm then thinking of writing a function that takes this list of keys, breaks it up into "chunks," retrieves the values, stitches them back together, and returns a dict of 500 key-value pairs.
Or is there a better way to do this?
As a corollary, how would I "sort" the items afterwards?
Depending on your schema, there are two ways of efficiently retrieving your 500 items.
1. Items are under the same hash_key, using a range_key:
Use the query method with the hash_key. You may ask to sort the range_keys A-Z or Z-A.
2. Items are on "random" keys:
You said it: use the BatchGetItem method. Good news: the limit is actually 100 items per request, or 1 MB max, but you will have to sort the results on the Python side.
On the practical side, since you use Python, I highly recommend the Boto library for low-level access, or the dynamodb-mapper library for higher-level access (disclaimer: I am one of the core devs of dynamodb-mapper).
Sadly, neither of these libraries provides an easy way to wrap the batch_get operation. By contrast, there is a generator for scan and for query which 'pretends' you get everything in a single query.
In order to get optimal results with the batch query, I recommend this workflow:
Submit a batch with all of your 500 items.
Store the results in your dicts.
Re-submit with the UnprocessedKeys as many times as needed.
Sort the results on the Python side.
Quick example
I assume you have created a table "MyTable" with a single hash_key
import boto

# Helper function. This is more or less the code
# I added to the develop branch
def resubmit(batch, prev):
    # Empty (re-use) the batch
    del batch[:]

    # The batch answer contains the list of
    # unprocessed keys grouped by tables
    if 'UnprocessedKeys' in prev:
        unprocessed = prev['UnprocessedKeys']
    else:
        return None

    # Load the unprocessed keys
    for table_name, table_req in unprocessed.iteritems():
        table_keys = table_req['Keys']
        table = batch.layer2.get_table(table_name)

        keys = []
        for key in table_keys:
            h = key['HashKeyElement']
            r = None
            if 'RangeKeyElement' in key:
                r = key['RangeKeyElement']
            keys.append((h, r))

        attributes_to_get = None
        if 'AttributesToGet' in table_req:
            attributes_to_get = table_req['AttributesToGet']

        batch.add_batch(table, keys, attributes_to_get=attributes_to_get)

    return batch.submit()


# Main
db = boto.connect_dynamodb()
table = db.get_table('MyTable')
batch = db.new_batch_list()

keys = range(100)  # Get items from 0 to 99
batch.add_batch(table, keys)

res = batch.submit()
while res:
    print res  # Do some useful work here
    res = resubmit(batch, res)

# The END
EDIT:
I've added a resubmit() function to BatchList in the Boto develop branch. It greatly simplifies the workflow:
add all of your requested keys to BatchList
submit()
resubmit() as long as it does not return None
This should be available in the next release.
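Since the answer leaves the chunking and the final client-side sort to the reader, here is a small generic sketch (mine, not boto-specific and not part of the original answer) of splitting 500 keys into batches of at most 100 and stitching the fetched items back into the original key order; fetch_batch is a placeholder for whichever batch-get call you use:
def chunks(seq, size=100):
    # Yield successive slices of at most `size` elements.
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

def fetch_all(keys, fetch_batch):
    # `fetch_batch` is a placeholder: it takes a list of keys and returns
    # a dict {key: item} for the keys it managed to retrieve.
    items = {}
    for chunk in chunks(keys, 100):  # the batch-get limit is 100 keys per request
        items.update(fetch_batch(chunk))
    # "Sort on the Python side": put the items back in the original key order.
    return [items[k] for k in keys if k in items]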

MapReduce on more than one datastore kind in Google App Engine

I just watched the "Batch data processing with App Engine" session of Google I/O 2010, read some parts of the MapReduce article from Google Research, and now I am thinking of using MapReduce on Google App Engine to implement a recommender system in Python.
I prefer using appengine-mapreduce instead of the Task Queue API because the former offers easy iteration over all instances of some kind, automatic batching, automatic task chaining, etc. The problem is: my recommender system needs to calculate the correlation between instances of two different Models, i.e., instances of two distinct kinds.
Example:
I have these two Models: User and Item. Each one has a list of tags as an attribute. Below are the functions to calculate correlation between users and items. Note that calculateCorrelation should be called for every combination of users and items:
def calculateCorrelation(user, item):
    return calculateCorrelationAverage(user.tags, item.tags)

def calculateCorrelationAverage(tags1, tags2):
    correlationSum = 0.0
    for (tag1, tag2) in allCombinations(tags1, tags2):
        correlationSum += correlation(tag1, tag2)
    return correlationSum / (len(tags1) + len(tags2))

def allCombinations(list1, list2):
    combinations = []
    for x in list1:
        for y in list2:
            combinations.append((x, y))
    return combinations
But calculateCorrelation is not a valid Mapper in appengine-mapreduce, and maybe this function is not even compatible with the MapReduce computation model. Still, I need to be sure... it would be really great to have those appengine-mapreduce advantages like automatic batching and task chaining.
Is there any solution for that?
Should I define my own InputReader? Is a new InputReader that reads all instances of two different kinds compatible with the current appengine-mapreduce implementation?
Or should I try the following?
Combine all keys of all entities of these two kinds, two by two, into instances of a new Model (possibly using MapReduce)
Iterate using mappers over instances of this new Model
For each instance, use keys inside it to get the two entities of different kinds and calculate the correlation between them.
Following Nick Johnson's suggestion, I wrote my own InputReader. This reader fetches entities of two different kinds and yields tuples with all combinations of these entities. Here it is:
class TwoKindsInputReader(InputReader):
    _APP_PARAM = "_app"
    _KIND1_PARAM = "kind1"
    _KIND2_PARAM = "kind2"
    MAPPER_PARAMS = "mapper_params"

    def __init__(self, reader1, reader2):
        self._reader1 = reader1
        self._reader2 = reader2

    def __iter__(self):
        for u in self._reader1:
            for e in self._reader2:
                yield (u, e)

    @classmethod
    def from_json(cls, input_shard_state):
        reader1 = DatastoreInputReader.from_json(input_shard_state[cls._KIND1_PARAM])
        reader2 = DatastoreInputReader.from_json(input_shard_state[cls._KIND2_PARAM])
        return cls(reader1, reader2)

    def to_json(self):
        json_dict = {}
        json_dict[self._KIND1_PARAM] = self._reader1.to_json()
        json_dict[self._KIND2_PARAM] = self._reader2.to_json()
        return json_dict

    @classmethod
    def split_input(cls, mapper_spec):
        params = mapper_spec.params
        app = params.get(cls._APP_PARAM)
        kind1 = params.get(cls._KIND1_PARAM)
        kind2 = params.get(cls._KIND2_PARAM)
        shard_count = mapper_spec.shard_count
        shard_count_sqrt = int(math.sqrt(shard_count))
        splitted1 = DatastoreInputReader._split_input_from_params(app, kind1, params, shard_count_sqrt)
        splitted2 = DatastoreInputReader._split_input_from_params(app, kind2, params, shard_count_sqrt)
        inputs = []
        for u in splitted1:
            for e in splitted2:
                inputs.append(TwoKindsInputReader(u, e))
        # mapper_spec.shard_count = len(inputs)  # uncomment this in case of "Incorrect number of shard states" (at line 408 in handlers.py)
        return inputs

    @classmethod
    def validate(cls, mapper_spec):
        return True  # TODO
This code should be used when you need to process all combinations of entities of two kinds. You can also generalize this for more than two kinds.
Here is a valid mapreduce.yaml for TwoKindsInputReader:
mapreduce:
- name: recommendationMapReduce
  mapper:
    input_reader: customInputReaders.TwoKindsInputReader
    handler: recommendation.calculateCorrelationHandler
    params:
    - name: kind1
      default: kinds.User
    - name: kind2
      default: kinds.Item
    - name: shard_count
      default: 16
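For completeness, here is a hedged sketch (mine, not part of the original post) of what the recommendation.calculateCorrelationHandler referenced above could look like, given that TwoKindsInputReader yields (user, item) tuples and reusing calculateCorrelation from the question:
def calculateCorrelationHandler(entity_pair):
    # TwoKindsInputReader yields (user, item) tuples, so the mapper
    # receives one combination per call.
    user, item = entity_pair
    score = calculateCorrelation(user, item)  # from the question's code
    # Do something with the score here, e.g. store it or emit a
    # datastore write operation; this part depends on your pipeline.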
It's difficult to know what to recommend without more details of what you're actually calculating. One simple option is simply to fetch the related entity inside the map call - there's nothing preventing you from doing datastore operations there (a rough sketch follows at the end of this answer).
This will result in a lot of small calls, though. Writing a custom InputReader, as you suggest, will allow you to fetch both sets of entities in parallel, which will significantly improve performance.
If you give more details as to how you need to join these entities, we may be able to provide more concrete suggestions.
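To illustrate the "fetch inside the map call" option, a rough sketch (mine, not from the original answer, assuming Item is the question's google.appengine.ext.db model and calculateCorrelation the question's function):
def correlate_user(user):
    # Mapper over the User kind: for each mapped user, fetch the items
    # inside the map call and compute the correlations directly.
    for item in Item.all():  # one datastore query per mapped user: many small calls
        score = calculateCorrelation(user, item)
        # store or emit the score here, depending on your pipeline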
