I tried to create a Beam pipeline that allows late events and uses an AfterProcessingTime trigger, so that the trigger aggregates all the data that arrives on time in its first firing and fires as few times as possible for late data. I found this tutorial, but my pipeline gets stuck with their late_data_stream:
options = StandardOptions(streaming=True)
with TestPipeline(options=options) as p:
    _ = (p | create_late_data_stream()
           | beam.Map(lambda x: x)  # Work around for typing issue
           | beam.WindowInto(beam.window.FixedWindows(5),
                             trigger=beam.trigger.Repeatedly(beam.trigger.AfterProcessingTime(5)),
                             accumulation_mode=beam.trigger.AccumulationMode.DISCARDING,
                             allowed_lateness=50)
           | beam.combiners.Count.PerKey()
           | beam.Map(lambda x: f'Count is {x[1]}')
           | "Output Window" >> beam.ParDo(GetElementTimestamp(print_pane_info=True))
           | "Print count" >> PrettyPrint()
         )
Any idea why this is happening? And is there a way to have the Repeatedly trigger stop firing once the watermark goes past window_end + allowed_lateness?
Thanks in advance for any help.
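For reference, the trigger shape that usually expresses "fire once with everything that arrived on time, then fire occasionally for late data" is AfterWatermark with a late-firing spec rather than Repeatedly(AfterProcessingTime(...)); firing should also stop once the window expires at window_end + allowed_lateness, since expired windows are no longer processed. A minimal sketch with the same windowing values (an assumption, not the tutorial's code):

# Sketch: one on-time firing when the watermark passes the window end,
# then at most one firing per 5s of processing time for late data.
beam.WindowInto(
    beam.window.FixedWindows(5),
    trigger=beam.trigger.AfterWatermark(
        late=beam.trigger.AfterProcessingTime(5)),
    accumulation_mode=beam.trigger.AccumulationMode.DISCARDING,
    allowed_lateness=50)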
I have a very simple pipeline in Python:
with beam.Pipeline(options=create_pipeline_options(pipeline_args)) as p:
    rows = (p | 'ReadFromBigquery' >> beam.io.ReadFromBigQuery(table=f"{known_args.project}:{known_args.datasetId}.{known_args.tableId}", use_standard_sql=True))
    entities = (rows | 'GetEntities' >> beam.ParDo(GetEntity()))
    updated = (entities | 'Update Entities' >> beam.ParDo(UpdateEntity()))
    _ = (updated | 'Write To Datastore' >> WriteToDatastore(known_args.project))
I want to log which entities have been correctly updated after WriteToDatastore has finished running, so I can write them to a BigQuery audit table. Ideally it would look something like this:
successful_entities, failed_entities = (updated | 'Write To Datastore' >> WriteToDatastoreWrapper(known_args.project))
_ = (successful_entities | 'Write Success To Bigquery' >> beam.io.WriteToBigQuery(table=f"{c.audit_table}:{known_args.datasetId}.{known_args.tableId}"))
_ = (failed_entities | 'Write Failed To Bigquery' >> beam.io.WriteToBigQuery(table=f"{c.audit_table}:{known_args.datasetId}.{known_args.tableId}"))
Is this possible to achieve?
Alternatively, if the whole batch fails after n retries, is it possible to catch that failure and log which batch failed (assuming I have some sort of runId to keep track of batches)?
I hope this helps.
I propose a solution with a dead letter queue applied before writing the result to Datastore.
Beam suggests using a dead letter queue in this case, and we can achieve that with TupleTags.
You can write this with native Beam, but the code is verbose.
I created a library for Beam in Java and Python called Asgarde.
Here is the link to the Python version: https://github.com/tosun-si/pasgarde
You can install the package with pip :
pip install asgarde==0.16.0
With Asgarde, you can catch errors at each step of the pipeline, before writing the result with the IO (Datastore in this case).
Example :
input_teams: PCollection[str] = p | 'Read' >> beam.Create(team_names)

result = (CollectionComposer.of(input_teams)
          .map('Map with country', lambda tname: TeamInfo(name=tname, country=team_countries[tname], city=''))
          .map('Map with city', lambda tinfo: TeamInfo(name=tinfo.name, country=tinfo.country, city=team_cities[tinfo.name]))
          .filter('Filter french team', lambda tinfo: tinfo.country == 'France'))

result_outputs: PCollection[TeamInfo] = result.outputs
result_failures: PCollection[Failure] = result.failures
Asgarde provides a wrapper, the CollectionComposer class, instantiated from a PCollection.
Each operator like map, flat_map and filter then applies the operation with error handling.
The result of CollectionComposer is a tuple with:
PCollection of successful outputs
PCollection of Failure
Failure is an object given by Asgarde :
@dataclass
class Failure:
    pipeline_step: str
    input_element: str
    exception: Exception
This object gives the input_element concerned by the error and the exception that was raised.
pipeline_step is the name of the transformation in which the error occurred.
Your pipeline can be adapted in the following way with Asgarde :
with beam.Pipeline(options=create_pipeline_options(pipeline_args)) as p:
    rows = (p | 'ReadFromBigquery' >> beam.io.ReadFromBigQuery(table=f"{known_args.project}:{known_args.datasetId}.{known_args.tableId}", use_standard_sql=True))

    result = (CollectionComposer.of(rows)
              .map('GetEntities', lambda el: get_entity_function(el))
              .map('Update Entities', lambda entity: update_entity_function(entity)))

    result_outputs = result.outputs
    result_failures: PCollection[Failure] = result.failures

    (result_outputs | 'Write To Datastore' >> WriteToDatastore(known_args.project))

    (result_failures
     | 'Map before Write to BQ' >> beam.Map(failure_to_your_obj_function)
     | 'Write Failed To Bigquery' >> beam.io.WriteToBigQuery(table=f"{c.audit_table}:{known_args.datasetId}.{known_args.tableId}"))
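For completeness, failure_to_your_obj_function can be as simple as flattening the Failure fields into a dict matching the audit table schema (the column names below are assumptions):

def failure_to_your_obj_function(failure):
    # Hypothetical mapping from an Asgarde Failure to a BigQuery row dict;
    # adjust the keys to match your audit table schema.
    return {
        'pipeline_step': failure.pipeline_step,
        'input_element': failure.input_element,
        'exception': str(failure.exception),
    }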
You can also apply the same logic with native Beam; here is an example from my personal GitHub repository:
https://github.com/tosun-si/teams-league-python-dlq-native-beam-summit/blob/main/team_league/domain_ptransform/team_stats_transform.py
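For reference, a minimal sketch of the same dead letter pattern in native Beam with tagged outputs, applied to the entities PCollection from the question and the update_entity_function used above (the class and tag names are illustrative):

import apache_beam as beam
from apache_beam import pvalue

class UpdateEntityWithDlq(beam.DoFn):
    FAILURE_TAG = 'failures'

    def process(self, element):
        try:
            # update_entity_function is assumed to be your own update logic.
            yield update_entity_function(element)
        except Exception as e:
            # Route the failing element plus the error to a side output.
            yield pvalue.TaggedOutput(self.FAILURE_TAG, (element, str(e)))

results = (entities
           | 'Update Entities' >> beam.ParDo(UpdateEntityWithDlq())
                 .with_outputs(UpdateEntityWithDlq.FAILURE_TAG, main='updated'))
updated, failures = results.updated, results[UpdateEntityWithDlq.FAILURE_TAG]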
This seems to work fine in DirectRunner, but errors out when I switch to DataflowRunner. I basically need to somehow combine the files that are read in, but as soon as I use beam.combiners.ToList() to concatenate my data, it introduces a whole slew of issues.
Code Example:
def convert_to_dataframe(readable_file):
    yield pd.read_csv(io.TextIOWrapper(readable_file.open()))

class merge_dataframes(beam.DoFn):
    def process(self, element):
        yield pd.concat(element).reset_index(drop=True)

with beam.Pipeline(options=pipeline_options) as p:
    (p
     | 'Match Files From GCS' >> beam.io.fileio.MatchFiles(raw_data_path)
     | 'Read Files' >> beam.io.fileio.ReadMatches()
     | 'Shuffle' >> beam.Reshuffle()
     | 'Create DataFrames' >> beam.FlatMap(convert_to_dataframe)
     | 'Combine To List' >> beam.combiners.ToList()
     | 'Merge DataFrames' >> beam.ParDo(merge_dataframes())
     | 'Apply Transformations' >> beam.ParDo(ApplyPipeline(creds_path=args.creds_path,
                                                           project_name=args.project_name,
                                                           feature_group_name=args.feature_group_name
                                                           ))
     | 'Write To GCS' >> beam.io.WriteToText(feature_data_path,
                                             file_name_suffix='.csv',
                                             shard_name_template='')
     )
Error:
"No objects to concatenate [while running 'Merge DataFrames']"
I don't understand this error because the part that does 'Combine To List' should have produced a list of dataframes that would then get passed into the step 'Merge DataFrames', which is indeed the case when I use DirectRunner.
Given this error, I suspect that MatchFiles is not actually matching anything (e.g. due to a bad filepattern) and, consequently, the output of beam.combiners.ToList is an empty list.
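If that is the case, a quick way to check (a sketch reusing the same pipeline_options and raw_data_path from the question) is to log what MatchFiles actually returns, so the worker logs show whether the pattern matched anything:

import logging
import apache_beam as beam

with beam.Pipeline(options=pipeline_options) as p:
    matches = p | 'Match Files From GCS' >> beam.io.fileio.MatchFiles(raw_data_path)
    # Log every matched path; on Dataflow these show up in the worker logs.
    _ = matches | 'Log Paths' >> beam.Map(lambda m: logging.info('Matched: %s', m.path))
    # Also log the total count; an empty match yields 0 here.
    _ = (matches
         | 'Count Matches' >> beam.combiners.Count.Globally()
         | 'Log Count' >> beam.Map(lambda n: logging.info('Matched %d files', n)))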
This is simple tic-tac-toe game code in Python; there seems to be an infinite loop, but I cannot spot it.
while loc[:3]!='x'or 'o' | loc[3:6]!='x'or 'o' | loc[6:]!='x'or'o'| ((loc[0] and loc[3] and loc[6])!='x'or'o') | ((loc[1] and loc[4] and loc[7])!='x'or'o') | ((loc[2] and loc[5] and loc[8])!='x'or'o') | ((loc[2] and loc[4] and loc[6])!='x'or'o') | ((loc[0] and loc[4] and loc[8])!='x'or'o'):
    keeping_turns=[]
    if len(keeping_turns)==0 or len(keeping_turns)%2==0:
        player=player_1
        symbol=symbol_1
    else:
        player=player_2
        symbol=symbol_2
    #result=resultglob
    result=input(' please ' +player+' insert your chosen location: ')
I expect to get to the last line and input a value, but I get stuck on '*' and have to interrupt the kernel every time.
The problem
Each time the system receives a message from Pub/Sub with a sliding window, it gets duplicated.
The code
| 'Parse dictionary' >> beam.Map(lambda elem: (elem['Serial'], int(elem['Value'])))
| 'window' >> beam.WindowInto(window.SlidingWindows(30, 15),accumulation_mode=AccumulationMode.DISCARDING)
| 'Count' >> beam.CombinePerKey(beam.combiners.MeanCombineFn())
The output
If I send only one message from Pub/Sub and try to print what I have after the sliding window finishes, using this code:
class print_row2(beam.DoFn):
    def process(self, row=beam.DoFn.ElementParam, window=beam.DoFn.WindowParam, timestamp=beam.DoFn.TimestampParam):
        print row, timestamp2str(float(window.start)), timestamp2str(float(window.end)), timestamp2str(float(timestamp))
The result
('77777', 120.0) 2018-11-16 08:21:15.000 2018-11-16 08:21:45.000 2018-11-16 08:21:45.000
('77777', 120.0) 2018-11-16 08:21:30.000 2018-11-16 08:22:00.000 2018-11-16 08:22:00.000
If I print the message before 'window' >> beam.WindowInto(window.SlidingWindows(30, 15)), I only get it once.
The process in "graphic" mode:
time: ----t+00---t+15---t+30----t+45----t+60------>
: : : : :
w1: |=X===========| : :
w2: |==============| :
...
The message X was sent only once at the beginning of the sliding window, so it should only be received once, but it is being received twice.
I have tried with both AccumulationMode values, and also with trigger=AfterWatermark, but I cannot fix the problem.
What could be wrong?
Extra
With FixedWindows this is the correct code for my purpose:
| 'Window' >> beam.WindowInto(window.FixedWindows(1 * 30))
| 'Speed Average' >> beam.GroupByKey()
| "Calculating average" >> beam.CombineValues(beam.combiners.MeanCombineFn())
or
| 'Window' >> beam.WindowInto(window.FixedWindows(1 * 30))
| "Calculating average" >> beam.CombinePerKey(beam.combiners.MeanCombineFn())
All elements that belong to the window are emitted. If an element belongs to multiple windows it will be emitted in each window.
Accumulation mode only matters if you plan to handle late data or multiple trigger firings. In this case, discarding mode gives you only the new elements in the window when the trigger fires again, i.e. it emits only the elements that arrived in that window since the previous trigger firing; elements that were already emitted are not emitted again and are discarded. In accumulating mode the whole window is emitted for every trigger firing: it includes the old elements that were already emitted last time plus the new elements that have arrived since then.
If I understand your example, you have sliding windows with a length of 30 seconds that start every 15 seconds, so they overlap for 15 seconds:
time: ----t+00---t+15---t+30----t+45----t+60------>
: : : : :
w1: |=============| : :
w2: |==============| :
w3: |===============|
...
So any element in your case will belong to at least two windows (except for first and last windows).
E.g. in your example, if your message was sent between 17:07:15 and 17:07:30 it will appear in both windows.
Fixed windows don't overlap, so an element can belong to only one window:
time: ----t+00---t+15---t+30----t+45----t+60------>
: : :
w1: |=============| :
w2: |===============|
w3: |====...
...
More about windows here: https://beam.apache.org/documentation/programming-guide/#windowing
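To see the window assignment directly, here is a small self-contained sketch (batch mode, with the element timestamp assigned by hand; the key '77777' is just taken from the output above). A single element stamped at t=20s falls into the two 30-second windows [0, 30) and [15, 45), so its mean is emitted twice, once per window:

import apache_beam as beam
from apache_beam.transforms import window
from apache_beam.transforms.window import TimestampedValue

with beam.Pipeline() as p:
    _ = (p
         | beam.Create([('77777', 120)])
         # Stamp the element at t=20s; it belongs to [0, 30) and [15, 45).
         | beam.Map(lambda kv: TimestampedValue(kv, 20))
         | beam.WindowInto(window.SlidingWindows(30, 15))
         | beam.CombinePerKey(beam.combiners.MeanCombineFn())
         # Prints ('77777', 120.0) twice: once for each window.
         | beam.Map(print))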
I have exactly the same issue, but in Java. I have a window with a 10-second duration and a step of 3 seconds. When an event is emitted from the MQTT topic that I subscribe to, it looks like my ParDo function runs and emits the first and only event to all three of the "constructed" windows.
X is the event that I send at a random timestamp: 2020-09-15T21:17:57.292Z
time: ----t+00---t+15---t+30----t+45----t+60------>
: : : : :
w1: |X============| : :
w2: |X=============| :
w3: |X==============|
...
Even the same timestamp is assigned to them! I must really be doing something completely wrong.
I use Scala 2.12 and Beam 2.23 with a DirectRunner.
[Hint]: I use state in the processElement function, where the state is held per key + window. Maybe there is a bug there? I will try to test it without state.
UPDATE: After removing the state fields, the single event is assigned to one window.
I have written a simple function to resize an image from 1500x2000px to 900x1200px.
def resizeImage(file_list):
    if file_list:
        if not os.path.exists('resized'):
            os.makedirs('resized')
        i = 0
        for files in file_list:
            i += 1
            im = Image.open(files)
            im = im.resize((900,1200),Image.ANTIALIAS)
            im.save('resized/' + files, quality=90)
        print str(i) + " files resized successfully"
    else:
        print "No files to resize"
I used the timeit function to measure how long it takes to run with some example images. Here is an example of the results.
+---------------+-----------+---------------+---------------+---------------+
| Test Name | No. files | Min | Max | Average |
+---------------+-----------+---------------+---------------+---------------+
| Resize normal | 10 | 5.25000018229 | 5.31371171493 | 5.27186083393 |
+---------------+-----------+---------------+---------------+---------------+
But if I repeat the test, the times gradually keep increasing, i.e.
+---------------+-----------+---------------+---------------+---------------+
| Test Name | No. files | Min | Max | Average |
+---------------+-----------+---------------+---------------+---------------+
| Resize normal | 10 | 5.36660298734 | 5.57177596057 | 5.45903467485 |
+---------------+-----------+---------------+---------------+---------------+
+---------------+-----------+---------------+---------------+---------------+
| Test Name | No. files | Min | Max | Average |
+---------------+-----------+---------------+---------------+---------------+
| Resize normal | 10 | 5.58739076382 | 5.76515489024 | 5.70014196601 |
+---------------+-----------+---------------+---------------+---------------+
+---------------+-----------+---------------+---------------+-------------+
| Test Name | No. files | Min | Max | Average |
+---------------+-----------+---------------+---------------+-------------+
| Resize normal | 10 | 5.77366483042 | 6.00337707034 | 5.891541538 |
+---------------+-----------+---------------+---------------+-------------+
+---------------+-----------+---------------+--------------+---------------+
| Test Name | No. files | Min | Max | Average |
+---------------+-----------+---------------+--------------+---------------+
| Resize normal | 10 | 5.91993466793 | 6.1294756299 | 6.03516199948 |
+---------------+-----------+---------------+--------------+---------------+
This is how I'm running the test.
def resizeTest(repeats):
    os.chdir('C:/Users/dominic/Desktop/resize-test')
    files = glob.glob('*.jpg')
    t = timeit.Timer(
        "resizeImage(filess)",
        setup="from imageToolkit import resizeImage; import glob; filess = glob.glob('*.jpg')"
    )
    time = t.repeat(repeats, 1)
    results = {
        'name': 'Resize normal',
        'files': len(files),
        'min': min(time),
        'max': max(time),
        'average': averageTime(time)
    }
    resultsTable(results)
I have moved the images that are processed from my mechanical hard drive to the SSD and the issue persists. I have also checked the memory being used: it stays pretty steady through all the runs, topping out at around 26 MB, and the process uses around 12% of one CPU core.
Going forward I'd like to experiment with the multiprocessing library to increase the speed, but I'd like to get to the bottom of this issue first.
Would this be an issue with my loop that causes the performance to degrade?
The im.save() call is slowing things down; repeated writing to the same directory is perhaps thrashing OS disk caches. When you removed the call, the OS was able to optimize the image read access times via disk caches.
If your machine has multiple CPU cores, you can indeed speed up the resize process, as the OS will schedule multiple sub-processes across those cores to run each resize operation. You'll not get a linear performance improvement, as all those processes still have to access the same disk for both reads and writes.
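A minimal sketch of that parallel approach using multiprocessing.Pool, keeping the same resize parameters as the question (function names and the worker count are assumptions, not part of the original code):

import os
from multiprocessing import Pool

from PIL import Image

def resize_one(filename):
    # Resize a single image and save it under resized/.
    im = Image.open(filename)
    im = im.resize((900, 1200), Image.ANTIALIAS)  # Image.LANCZOS in newer Pillow
    im.save(os.path.join('resized', filename), quality=90)
    return filename

def resize_all(file_list, workers=4):
    if not os.path.exists('resized'):
        os.makedirs('resized')
    # Each worker process resizes files independently; results come back
    # as they finish.
    with Pool(processes=workers) as pool:
        for done in pool.imap_unordered(resize_one, file_list):
            print('resized', done)

# The __main__ guard matters on Windows, where multiprocessing spawns
# fresh interpreter processes.
if __name__ == '__main__':
    import glob
    resize_all(glob.glob('*.jpg'), workers=4)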