How To Combine Parsed Text Files In Apache Beam Dataflow in Python? - Python

This seems to work fine in DirectRunner, but errors out when I switch to DataflowRunner. I basically need to somehow combine the files that are read in, but as soon as I use beam.combiners.ToList() to concatenate my data, it introduces a whole slew of issues.
Code Example:
import io

import apache_beam as beam
import pandas as pd


def convert_to_dataframe(readable_file):
    yield pd.read_csv(io.TextIOWrapper(readable_file.open()))


class merge_dataframes(beam.DoFn):
    def process(self, element):
        yield pd.concat(element).reset_index(drop=True)


with beam.Pipeline(options=pipeline_options) as p:
    (p
     | 'Match Files From GCS' >> beam.io.fileio.MatchFiles(raw_data_path)
     | 'Read Files' >> beam.io.fileio.ReadMatches()
     | 'Shuffle' >> beam.Reshuffle()
     | 'Create DataFrames' >> beam.FlatMap(convert_to_dataframe)
     | 'Combine To List' >> beam.combiners.ToList()
     | 'Merge DataFrames' >> beam.ParDo(merge_dataframes())
     | 'Apply Transformations' >> beam.ParDo(ApplyPipeline(creds_path=args.creds_path,
                                                           project_name=args.project_name,
                                                           feature_group_name=args.feature_group_name))
     | 'Write To GCS' >> beam.io.WriteToText(feature_data_path,
                                             file_name_suffix='.csv',
                                             shard_name_template='')
    )
Error:
"No objects to concatenate [while running 'Merge DataFrames']"
I don't understand this error, because the 'Combine To List' step should have produced a list of DataFrames that then gets passed into the 'Merge DataFrames' step, which is indeed what happens when I use DirectRunner.

Given this error, I suspect that MatchFiles is not actually matching anything (e.g. due to a bad filepattern) and, consequently, the output of beam.combiners.ToList is an empty list.
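If you want to confirm that before digging further, a quick way is to log and count what MatchFiles actually returns. A minimal debugging sketch (the extra step names and the logging approach are illustrative additions, not part of the original pipeline):

import logging

import apache_beam as beam


def log_match(metadata):
    # metadata is a FileMetadata with .path and .size_in_bytes
    logging.info("Matched file: %s (%d bytes)", metadata.path, metadata.size_in_bytes)
    return metadata


with beam.Pipeline(options=pipeline_options) as p:
    matches = (p
               | 'Match Files From GCS' >> beam.io.fileio.MatchFiles(raw_data_path)
               | 'Log Matches' >> beam.Map(log_match))

    # A count of 0 in the worker logs means the filepattern matched nothing on
    # Dataflow, e.g. because raw_data_path points somewhere the workers cannot see.
    (matches
     | 'Count Matches' >> beam.combiners.Count.Globally()
     | 'Log Match Count' >> beam.Map(lambda n: logging.info("Matched %d files", n)))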

Related

Apache Beam – issue with Deduplicate function

I have an issue with the apache_beam.transforms.deduplicate.Deduplicate transformation. Please look at the code sample below:
import apache_beam as beam
from apache_beam.transforms.deduplicate import Deduplicate
from typing import AnyStr

with beam.Pipeline() as pipeline:
    (
        pipeline
        # | 'Load' >> beam.Create(['a', 'b', 'b'])  ## <- works fine
        | 'Load' >> beam.io.ReadFromText('./input.txt')  ## <- breaks Dedup
        | 'Dedup' >> Deduplicate(processing_time_duration=1000).with_input_types(AnyStr)
        | 'Print' >> beam.Map(print)
    )
If I create the collection manually, everything works fine and as expected. But when I try to load something from disk (a text file, Avro files, etc.), Deduplicate stops working and throws an exception:
Traceback (most recent call last):
File "/Users/ds/.pyenv/versions/3.9.14/lib/python3.9/site-packages/apache_beam/runners/direct/executor.py", line 370, in call
self.attempt_call(
File "/Users/ds/.pyenv/versions/3.9.14/lib/python3.9/site-packages/apache_beam/runners/direct/executor.py", line 404, in attempt_call
evaluator.start_bundle()
File "/Users/ds/.pyenv/versions/3.9.14/lib/python3.9/site-packages/apache_beam/runners/direct/transform_evaluator.py", line 867, in start_bundle
self.runner.start()
File "/Users/ds/.pyenv/versions/3.9.14/lib/python3.9/site-packages/apache_beam/runners/common.py", line 1475, in start
self._invoke_bundle_method(self.do_fn_invoker.invoke_start_bundle)
File "/Users/ds/.pyenv/versions/3.9.14/lib/python3.9/site-packages/apache_beam/runners/common.py", line 1460, in _invoke_bundle_method
self._reraise_augmented(exn)
File "/Users/ds/.pyenv/versions/3.9.14/lib/python3.9/site-packages/apache_beam/runners/common.py", line 1507, in _reraise_augmented
raise new_exn.with_traceback(tb)
File "/Users/ds/.pyenv/versions/3.9.14/lib/python3.9/site-packages/apache_beam/runners/common.py", line 1458, in _invoke_bundle_method
bundle_method()
File "/Users/ds/.pyenv/versions/3.9.14/lib/python3.9/site-packages/apache_beam/runners/common.py", line 559, in invoke_start_bundle
self.signature.start_bundle_method.method_value())
File "/Users/ds/.pyenv/versions/3.9.14/lib/python3.9/site-packages/apache_beam/runners/direct/sdf_direct_runner.py", line 122, in start_bundle
self._invoker = DoFnInvoker.create_invoker(
TypeError: create_invoker() got an unexpected keyword argument 'output_processor' [while running 'Load/Read/SDFBoundedSourceReader/ParDo(SDFBoundedSourceDoFn)/pair']
This happens only with Deduplicate and DeduplicatePerKey transformations. All other things like ParDo, Map, etc. work fine.
Python version: 3.9.14
Apache Beam: 2.41.0
Platform: Apple's M1 ARM
I hope this helps.
Indeed, I tested your code and it doesn't work. Maybe I am wrong, but I think the Deduplicate PTransform is better suited for jobs with windowing logic (processing time and event time).
It works with beam.Create (even though it's a bounded source) but not with ReadFromText, because a type is not inferred:
E TypeError: create_invoker() got an unexpected keyword argument 'output_processor'
I propose another solution that works in your case and is better adapted to deduplicating data in a batch job with a bounded source:
def test_dedup(self):
    with TestPipeline() as p:
        (
            p
            # | 'Load' >> beam.Create(['a', 'b', 'b'])  ## <- works fine
            | 'Load' >> beam.io.ReadFromText(
                f'{ROOT_DIR}/input.txt')  ## <- breaks Dedup
            # | 'Dedup' >> Deduplicate(processing_time_duration=1000).with_input_types(AnyStr)
            | 'Group by' >> beam.GroupBy(lambda el: el)
            | 'Get key' >> beam.Map(lambda t: t[0])
            | 'Print' >> beam.Map(self.print_el)
        )
The input.txt content is:
1
2
2
3
4
The output PCollection is:
1
2
3
4
I used GroupBy on the element itself, which gives me a Tuple of the key and its grouped values, e.g. 2 -> [2, 2],
and then I added a Map that keeps only the deduplicated key from each Tuple.
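Note that for a bounded source like this, the built-in beam.Distinct transform may be a shorter alternative, since it performs element-wise deduplication without any trigger configuration. A minimal sketch (not tested against the exact setup above):

import apache_beam as beam

with beam.Pipeline() as p:
    (
        p
        | 'Load' >> beam.io.ReadFromText('./input.txt')
        | 'Dedup' >> beam.Distinct()  # element-wise deduplication of the PCollection
        | 'Print' >> beam.Map(print)
    )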

Log Apache Beam WriteToDatastore result to BigQuery - Python

I have a very simple pipeline in Python:
with beam.Pipeline(options=create_pipeline_options(pipeline_args)) as p:
    rows = (p | 'ReadFromBigquery' >> beam.io.ReadFromBigQuery(table=f"{known_args.project}:{known_args.datasetId}.{known_args.tableId}", use_standard_sql=True))
    entities = (rows | 'GetEntities' >> beam.ParDo(GetEntity()))
    updated = (entities | 'Update Entities' >> beam.ParDo(UpdateEntity()))
    _ = (updated | 'Write To Datastore' >> WriteToDatastore(known_args.project))
I want to log which entities have been correctly updated after WriteToDatastore has finished running, so that I can write them to a BigQuery audit table. Ideally it would look something like this:
successful_entities, failed_entities = (updated | 'Write To Datastore' >> WriteToDatastoreWrapper(known_args.project))
_ = (successful_entities | 'Write Success To Bigquery' >> beam.io.WriteToBigQuery(table=f"{c.audit_table}:{known_args.datasetId}.{known_args.tableId}"))
_ = (failed_entities | 'Write Failed To Bigquery' >> beam.io.WriteToBigQuery(table=f"{c.audit_table}:{known_args.datasetId}.{known_args.tableId}"))
Is this possible to achieve?
Alternatively, if the whole batch fails after n retries, is it possible to catch that failure and log which batch failed (assuming I have some sort of runId to keep track of batches)?
I hope this helps.
I propose a solution with a dead-letter queue before writing the result to Datastore.
Beam suggests using a dead-letter queue in this case, and we can achieve that with TupleTags.
You can write it with native Beam, but the code is verbose.
I created a library in Beam Java and Python called Asgarde.
Here is the link to the Python version: https://github.com/tosun-si/pasgarde
You can install the package with pip:
pip install asgarde==0.16.0
With Asgarde, you can catch errors at each step of the pipeline, before writing the result with the IO (Datastore in this case).
Example:
input_teams: PCollection[str] = p | 'Read' >> beam.Create(team_names)

result = (CollectionComposer.of(input_teams)
          .map('Map with country', lambda tname: TeamInfo(name=tname, country=team_countries[tname], city=''))
          .map('Map with city', lambda tinfo: TeamInfo(name=tinfo.name, country=tinfo.country, city=team_cities[tinfo.name]))
          .filter('Filter french team', lambda tinfo: tinfo.country == 'France'))

result_outputs: PCollection[TeamInfo] = result.outputs
result_failures: PCollection[Failure] = result.failures
Asgarde provides a wrapper, the CollectionComposer class, which is instantiated from a PCollection.
Each operator like map, flat_map and filter then applies the operation with error handling.
The result of a CollectionComposer is a tuple with:
a PCollection of successful outputs
a PCollection of Failure elements
Failure is an object provided by Asgarde:
@dataclass
class Failure:
    pipeline_step: str
    input_element: str
    exception: Exception
This object gives the input_element concerned by the error and the exception that was raised.
pipeline_step is the name used for the transformation in which the failure occurred.
Your pipeline can be adapted in the following way with Asgarde:
with beam.Pipeline(options=create_pipeline_options(pipeline_args)) as p:
    rows = (p | 'ReadFromBigquery' >> beam.io.ReadFromBigQuery(table=f"{known_args.project}:{known_args.datasetId}.{known_args.tableId}", use_standard_sql=True))

    result = (CollectionComposer.of(rows)
              .map('GetEntities', lambda el: get_entity_function(el))
              .map('Update Entities', lambda entity: update_entity_function(entity)))

    result_outputs = result.outputs
    result_failures: PCollection[Failure] = result.failures

    (result_outputs | 'Write To Datastore' >> WriteToDatastore(known_args.project))

    (result_failures
     | 'Map before Write to BQ' >> beam.Map(failure_to_your_obj_function)
     | 'Write Failed To Bigquery' >> beam.io.WriteToBigQuery(table=f"{c.audit_table}:{known_args.datasetId}.{known_args.tableId}"))
You can also apply the same logic with native Beam; I share an example from my personal GitHub repository:
https://github.com/tosun-si/teams-league-python-dlq-native-beam-summit/blob/main/team_league/domain_ptransform/team_stats_transform.py
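For reference, the native-Beam version of that dead-letter pattern is a DoFn with tagged outputs. The sketch below is only illustrative: the DoFn name, the 'failures' tag and the update_entity_function helper are assumptions, not code from the question or from Asgarde.

import apache_beam as beam
from apache_beam import pvalue


class UpdateEntityWithDeadLetter(beam.DoFn):
    FAILURES = 'failures'  # arbitrary tag name for the error output

    def process(self, element):
        try:
            # update_entity_function stands in for whatever UpdateEntity does.
            yield update_entity_function(element)
        except Exception as e:
            # Route the failing element (plus the error message) to a side output.
            yield pvalue.TaggedOutput(self.FAILURES, {'element': str(element), 'error': str(e)})


results = (entities
           | 'Update Entities' >> beam.ParDo(UpdateEntityWithDeadLetter()).with_outputs(
               UpdateEntityWithDeadLetter.FAILURES, main='updated'))

updated = results.updated                              # successfully updated entities
failed = results[UpdateEntityWithDeadLetter.FAILURES]  # failure records for the audit table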

Beam job with Repeated AfterProcessingTime trigger runs forever

I tried to create a Beam pipeline that allows late events and has an AfterProcessingTime trigger, so that the trigger aggregates all the data that arrives on time in its first firing and fires as few times as possible for late data. I found this tutorial, but my pipeline gets stuck with their late_data_stream:
options = StandardOptions(streaming=True)

with TestPipeline(options=options) as p:
    _ = (p | create_late_data_stream()
         | beam.Map(lambda x: x)  # Work around for typing issue
         | beam.WindowInto(beam.window.FixedWindows(5),
                           trigger=beam.trigger.Repeatedly(beam.trigger.AfterProcessingTime(5)),
                           accumulation_mode=beam.trigger.AccumulationMode.DISCARDING,
                           allowed_lateness=50)
         | beam.combiners.Count.PerKey()
         | beam.Map(lambda x: f'Count is {x[1]}')
         | "Output Window" >> beam.ParDo(GetElementTimestamp(print_pane_info=True))
         | "Print count" >> PrettyPrint()
         )
Any idea why this is happening? And is there a way to have the Repeatedly trigger stop once the watermark goes past window_end + allowed_lateness?
Thanks in advance for any help.
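For what it's worth, the trigger configuration usually suggested for "fire once for the on-time data, then occasionally for late data, and never after the lateness horizon" is AfterWatermark with a late-firing clause rather than a bare Repeatedly(AfterProcessingTime(...)). A hedged sketch of just the windowing step (whether it also resolves the hang with this particular late_data_stream is not verified here):

import apache_beam as beam

windowed = (p
            | beam.WindowInto(
                beam.window.FixedWindows(5),
                # Fire once when the watermark passes the end of the window,
                # then once per batch of late data until the window expires.
                trigger=beam.trigger.AfterWatermark(
                    late=beam.trigger.AfterProcessingTime(5)),
                accumulation_mode=beam.trigger.AccumulationMode.DISCARDING,
                allowed_lateness=50))

In the Beam model a window expires once the watermark passes window_end + allowed_lateness; after that no further panes fire and remaining late data for that window is dropped.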

Clarifications related to getting the total number of records in an aerospike set? Is Lua scripting needed?

I would like to get the total number of records in an Aerospike set via Python.
I guess it is the value shown against n_objects for a set in the output of show sets:
aql> show sets
+-----------+------------------+----------------+-------------------+----------------+-------------+-----------+------------+
| n_objects | disable-eviction | set-enable-xdr | stop-writes-count | n-bytes-memory | ns_name     | set_name  | set-delete |
+-----------+------------------+----------------+-------------------+----------------+-------------+-----------+------------+
| 179       | "true"           | "use-default"  | 0                 | 0              | "namespace" | "setName" | "false"    |
+-----------+------------------+----------------+-------------------+----------------+-------------+-----------+------------+
From what I read in this discussion, it seems this is only possible via Lua scripting:
https://discuss.aerospike.com/t/fastest-way-to-count-records-returned-by-a-query/2379/4
Can someone confirm the same?
I am, however, able to find the count by using a counter variable while iterating over the result of select(), and it matches the count above:
aeroCount = 0

# Python 2-style callback; each scanned record increments the counter.
def process_result((key, metadata, record)):
    global aeroCount
    aeroCount = aeroCount + 1

client = aerospike.client(config).connect()
scan = client.scan('namespace', 'set')
scan.select('PK', 'expiresIn', 'clientId', 'scopes', 'roles')
scan.foreach(process_result)
print "Total aeroCount"
print aeroCount
Update
I tried running the command asinfo -v sets on the command line first. It gives me the object count as well, like this:
ns=namespace:set=setName:objects=29949:.
I am not sure how exactly to get the object count for a set from this. Does this command qualify as a command for the Python function? I tried this:
client = aerospike.client(config).connect()
response = client.info_all("asinfo -v sets")
Here is the error I am getting:
File "Sandeepan-oauth_token_cache_complete_sanity_cp.py", line 89, in <module>
response = client.info_all("asinfo -v sets")
AttributeError: 'aerospike.Client' object has no attribute 'info_all'
Look into info_all() in the Python client (https://www.aerospike.com/apidocs/python/client.html?highlight=info#aerospike.Client.info_all) and pass the correct info command from the info command reference: https://www.aerospike.com/docs/reference/info
The sets info command gives you instantaneous stats such as the number of objects in a specified set.
$ python
>>> import aerospike
>>> aerospike.__version__
'2.1.2'
>>> config = {'hosts':[("127.0.0.1", 3000)]}
>>> client = aerospike.client(config).connect()
>>> client.info("sets")
{'BB9BE1CFE290C00': (None, 'ns=test:set=testMap:objects=1:tombstones=0:memory_data_bytes=0:truncate_lut=0:stop-writes-count=0:set-enable-xdr=use-default:disable-eviction=false;\n')}
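If your installed client is recent enough to expose info_all() (the AttributeError above suggests it needs an upgrade), the string to pass is the bare info command "sets", not the asinfo shell invocation. A rough sketch of pulling the object count out of the response; the parsing helper and the response-handling loop are my own assumptions, based on the response format shown above:

import aerospike

config = {'hosts': [("127.0.0.1", 3000)]}
client = aerospike.client(config).connect()

# Each node answers with a string like:
# 'ns=test:set=testMap:objects=1:tombstones=0:...;'
responses = client.info_all("sets")


def objects_in_set(info_str, namespace, set_name):
    for entry in info_str.strip().strip(';').split(';'):
        fields = dict(kv.split('=', 1) for kv in entry.split(':') if '=' in kv)
        if fields.get('ns') == namespace and fields.get('set') == set_name:
            return int(fields['objects'])
    return 0


for node, (err, info_str) in responses.items():
    if info_str:
        print(node, objects_in_set(info_str, 'namespace', 'setName'))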

web.py select returning no results, but the length of that list does (also a manual query at the SQL prompt returns results)

I am working on my first (bigger) Python application, but I am running into some issues. I am trying to select entries from a table using the web.py module (I am using it since I will be adding a web front-end later).
Below is my (simplified) code:
db = web.database(dbn='mysql', host='xxx', port=3306, user='monitor', pw='xxx', db='monitor')
dict = dict(hostname=nodeName)
# nodes = db.select('Nodes', dict, where="hostName = $hostname")
nodes = db.query('SELECT * FROM Nodes')  # I have tried both, with comparable results (this returns more entries)
length = len(list(nodes))
print(length)
print(list(nodes))
print(list(nodes)[0])
Below is the output from Python:
0.03 (1): SELECT * FROM Nodes
6 <-- Length is correct
[] <-- Why is this empty?
Traceback (most recent call last):
File "monitor.py", line 30, in <module>
print(list(nodes)[0]) <-- If it is empty I can't select first element
IndexError: list index out of range
Below is the MySQL output:
mysql> select * from monitor.Nodes;
+--------+-------------+
| nodeId | hostName |
+--------+-------------+
| 1 | TestServer |
| 2 | raspberryPi |
| 3 | TestServer |
| 4 | TestServer |
| 5 | TestServer |
| 6 | TestServer |
+--------+-------------+
6 rows in set (0.00 sec)
Conclusion: the table contains entries, and the select/query statement gets them partially (it gets the length, but not the actual values?).
I have tried multiple ways, but currently I am not able to get what I want: I want to select the data from my table and use it in my code.
Thanks for helping.
Thanks to the people over at Reddit I was able to solve the issue: https://www.reddit.com/r/learnpython/comments/53hdq1/webpy_select_returning_no_results_but_length_of/
Bottom line: the query method returns a result set that is consumed as it is read, so by the second time I call list(nodes) the data is already gone, hence the empty result.
The solution is to store list(nodes) in a variable once and work from that.
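Applied to the snippet above, the fix is simply to materialize the result once and reuse it. A minimal sketch with the same names as in the question:

nodes = db.query('SELECT * FROM Nodes')
rows = list(nodes)   # materialize the result exactly once

print(len(rows))     # 6
print(rows)          # all six rows
print(rows[0])       # first row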
