I am using Apache Beam with Python and would like to ask: what is the equivalent of the Java SDK's Wait.on() in the Python SDK?
Currently I am having a problem with the code snippet below:
if len(output_pcoll) > 1:
    merged = (tuple(output_pcoll) |
              'MergePCollections1' >> beam.Flatten())
else:
    merged = output_pcoll[0]

outlier_side_input = self.construct_outlier_side_input(merged)

(merged |
 "RemoveOutlier" >>
 beam.ParDo(utils.Remove_Outliers(),
            beam.pvalue.AsDict(outlier_side_input)) |
 "WriteToCSV" >>
 beam.io.WriteToText('../../ML-DATA/{0}.{1}'.format(self.BUCKET, self.OUTPUT),
                     num_shards=1))
It seems Apache Beam does not wait until self.construct_outlier_side_input has finished executing, which results in an empty side input when "RemoveOutlier" runs in the next step. In the Java SDK you can use Wait.on() to wait for construct_outlier_side_input to finish executing, but I could not find an equivalent method in the Python SDK.
--Edit--
What I am trying to achieve is almost the same as in this link:
https://rmannibucau.metawerx.net/post/apache-beam-initialization-destruction-task
You can use the additional outputs feature of Beam to do this.
A sample code snippet is as follows:
results = (words | beam.ParDo(ProcessWords(), cutoff_length=2, marker='x')
                   .with_outputs('above_cutoff_lengths', 'marked strings',
                                 main='below_cutoff_strings'))
below = results.below_cutoff_strings
above = results.above_cutoff_lengths
marked = results['marked strings']  # indexing works as well
Once you run the above code snippet you get multiple PCollections such as below, above and marked. You can then use side inputs to further filter or join the results, as sketched below.
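For example, here is a minimal sketch of feeding one tagged output into another branch as a side input; below and marked come from the sample above, while KeepIfNotMarked is a hypothetical DoFn name, not part of the Beam API. Because a side input must be fully computed for the current window before the consuming ParDo runs, this gives an effect comparable to Wait.on() in the Java SDK.

import apache_beam as beam

class KeepIfNotMarked(beam.DoFn):
    def process(self, element, marked_side):
        # marked_side (the side input) is fully materialized before
        # this DoFn processes its first element.
        if element not in marked_side:
            yield element

filtered = (below
            | 'KeepIfNotMarked' >> beam.ParDo(KeepIfNotMarked(),
                                              beam.pvalue.AsList(marked)))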
Hope that helps.
Update
Based on the comments, I would like to mention that Apache Beam has capabilities for stateful processing with the help of ValueState and BagState. If the requirement is to read through a PCollection and then make decisions based on whether a prior value is present or not, such requirements can be handled through BagState, as shown below:
import apache_beam as beam
from apache_beam.coders import VarIntCoder
from apache_beam.transforms.timeutil import TimeDomain
from apache_beam.transforms.userstate import BagStateSpec, TimerSpec, on_timer

class StatefulDoFn(beam.DoFn):
    # State and timer specs (the VarIntCoder is a placeholder; use a coder
    # that matches your values).
    BUFFER_STATE_1 = BagStateSpec('buffer1', VarIntCoder())
    BUFFER_STATE_2 = BagStateSpec('buffer2', VarIntCoder())
    WATERMARK_TIMER = TimerSpec('watermark_timer', TimeDomain.WATERMARK)

    def process(self,
                element,
                timestamp=beam.DoFn.TimestampParam,
                window=beam.DoFn.WindowParam,
                buffer_1=beam.DoFn.StateParam(BUFFER_STATE_1),
                buffer_2=beam.DoFn.StateParam(BUFFER_STATE_2),
                watermark_timer=beam.DoFn.TimerParam(WATERMARK_TIMER)):
        # Do your processing here
        key, value = element
        # Read all the data from buffer_1
        all_values_in_buffer_1 = [x for x in buffer_1.read()]

        if StatefulDoFn._is_clear_buffer_1_required(all_values_in_buffer_1):
            # Clear the buffer data if the required conditions are met.
            buffer_1.clear()

        # Add the value to buffer 2
        buffer_2.add(value)

        if StatefulDoFn._all_condition_met():
            # Clear the timer if a certain condition is met and you don't
            # want to trigger the callback method.
            watermark_timer.clear()

        yield element

    @on_timer(WATERMARK_TIMER)
    def on_expiry_1(self,
                    timestamp=beam.DoFn.TimestampParam,
                    window=beam.DoFn.WindowParam,
                    key=beam.DoFn.KeyParam,
                    buffer_1=beam.DoFn.StateParam(BUFFER_STATE_1),
                    buffer_2=beam.DoFn.StateParam(BUFFER_STATE_2)):
        # The window and key parameters are really useful, especially for debugging issues.
        yield 'expired1'
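For completeness, a rough sketch of how such a DoFn might be applied: stateful processing requires a keyed PCollection, the example elements below are made up, and the placeholder helpers in the snippet above (_is_clear_buffer_1_required, _all_condition_met) would still need real implementations.

with beam.Pipeline() as p:
    _ = (p
         | 'CreateKeyedInput' >> beam.Create([('key_a', 1), ('key_a', 2), ('key_b', 3)])
         | 'StatefulProcessing' >> beam.ParDo(StatefulDoFn()))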
Related
I have a streaming Apache Beam pipeline which does operations on data and writes to BigQuery. The table name and schema of said data are contained within the data itself, so I am using side inputs to provide the table name and schema (via table_side_inputs and schema_side_inputs).
So my pipeline code looks something like this:
pipeline | "Writing to big query">>beam.io.WriteToBigQuery(
schema=lambda row,schema:write_table_schema(row,schema),
schema_side_inputs = (table_schema,),
project=args['PROJECT_ID'],dataset=args['DATASET_ID'],
table = lambda row,table_name:write_table_name(row,table_name),table_side_inputs=(table_name,) ,ignore_unknown_columns=args['ignore_unknown_columns'],
additional_bq_parameters=additional_bq_parameters, insert_retry_strategy= RetryStrategy.RETRY_ON_TRANSIENT_ERROR))
For this to work, I needed to add windowing (before the write to BigQuery):
pipeline = pipeline | "To Window Fixed Intervals" >> beam.WindowInto(beam.window.FixedWindows(10))
This windowed data then becomes the input to 3 pipeline operations. The 2 side inputs to WriteToBigQuery are built like this:
table_name = (pipeline
              | "Get table name" >> beam.Map(lambda record: get_table_name(record)))
table_name = beam.pvalue.AsSingleton(table_name)

table_schema = (pipeline
                | "Get table schema" >> beam.Map(lambda record: get_table_schema(record)))
table_schema = beam.pvalue.AsSingleton(table_schema)
All of this was working fine until I needed to split the data before the windowing step, like this:
mapped_data = (pipeline
               | "Converting to map" >> beam.ParDo(ConvertToMap()).with_outputs("SUCCESS", "FAILURE"))

pipeline = (mapped_data['SUCCESS']
            | "To Window Fixed Intervals" >> beam.WindowInto(beam.window.FixedWindows(10)))
As soon as I did this, I encountered the following error:
ValueError: PCollection of size 2 with more than one element accessed as a singleton view. First two elements encountered are "name_1", "name_1". [while running 'Writing to big query/_StreamToBigQuery/AppendDestination-ptransform-48']
I've skipped some steps from the pipeline as it was way too complex.
How can I fix this error?
I've tried using AsDict instead of AsSingleton, but it gives the following error:
ValueError: dictionary update sequence element #0 has length 20; 2 is required [while running 'Writing to big query/_StreamToBigQuery/AppendDestination-ptransform-48']
I don't think there is any use case for AsDict here.
Maybe the issue was not due to the tagging at all; it may just have been waiting to happen at higher data volumes, since this is a streaming pipeline.
Solution -
The issue here was that the side inputs were being generated every time, while the main input was being generated only conditionally. This makes the number of side input elements greater than the number of main input elements, hence the error.
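Concretely, the fix amounts to deriving the side inputs from the same conditional branch that feeds the main input. A rough sketch, reusing the names from the code above (success_windowed is a new variable name introduced here, and this is not verified against the full pipeline):

success_windowed = (mapped_data['SUCCESS']
                    | "To Window Fixed Intervals" >> beam.WindowInto(beam.window.FixedWindows(10)))

# Build the side inputs from the same windowed SUCCESS branch so that every
# main-input element has matching side-input elements in its window.
table_name = beam.pvalue.AsSingleton(
    success_windowed | "Get table name" >> beam.Map(get_table_name))
table_schema = beam.pvalue.AsSingleton(
    success_windowed | "Get table schema" >> beam.Map(get_table_schema))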
After fixing this issue by making the side inputs go through the same conditions as the main input, I encountered another issue:
Cannot convert GlobalWindow to apache_beam.utils.windowed_value._IntervalWindowBase [while running 'Writing to big query/_StreamToBigQuery/StreamInsertRows/ParDo(BigQueryWriteFn)-ptransform-124']
Adding the following windowing transforms to the pipeline
"Window into Global Intervals" >> beam.WindowInto(beam.window.FixedWindows(1)) | beam.GroupByKey()
gave the following error:
AbstractComponentCoderImpl.encode_to_stream ValueError: Number of components does not match number of coders. [while running 'WindowInto(WindowIntoFn)
Any help here is appreciated.
This issue probably occurs because some code returns elements in the GlobalWindow while the PCollection has a different window set. For your requirement, I would suggest inserting beam.WindowInto(beam.window.GlobalWindows()) between the beam.WindowInto(NONGLOBALWINDOW) | beam.GroupByKey() step and the PTransform that causes problems.
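A rough sketch of that ordering, reusing the names from the question (whether the GroupByKey is needed at all depends on the rest of your pipeline, so treat this only as an illustration of where the re-windowing goes):

windowed = (mapped_data['SUCCESS']
            | "To Window Fixed Intervals" >> beam.WindowInto(beam.window.FixedWindows(10))
            | "Group by key" >> beam.GroupByKey()
            | "Back to Global Window" >> beam.WindowInto(beam.window.GlobalWindows()))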
I am trying to implement the pull model to query the change feed using the Azure Cosmos Python SDK. I found that, to parallelise the querying process, the official documentation mentions the FeedRange value and creating a FeedIterator to iterate through each range of partition key values obtained from the FeedRange.
Currently my code snippet to query the change feed looks like this, and it is pretty straightforward:
# function to get items from the change feed based on a condition
def get_response(container_client, condition):
    # Historical data read
    if condition:
        response = container_client.query_items_change_feed(
            is_start_from_beginning=True,
            # partition_key_range_id=0
        )
    # Reading from a checkpoint
    else:
        response = container_client.query_items_change_feed(
            is_start_from_beginning=False,
            continuation=last_continuation_token  # token saved from a previous read
        )
    return response
The problem with this approach is efficiency when getting all the items from the beginning (historical data read). I tried this method with a pretty small dataset of 500 items and the response took around 60 seconds. When dealing with millions or even billions of items, the response might take far too long to return.
Would querying the change feed in parallel for each partition key range save time?
If yes, how do I get the PartitionKeyRangeId in the Python SDK?
Are there any problems I need to consider when implementing this? A rough sketch of what I have in mind is shown below.
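For illustration only, this is roughly the kind of parallel read I am imagining; the list of partition key range IDs is hypothetical, since obtaining it from the SDK is exactly what I am asking about:

from concurrent.futures import ThreadPoolExecutor

# Hypothetical list of partition key range IDs; how to obtain these from the
# Python SDK is the open question.
partition_key_range_ids = ["0", "1", "2"]

def read_range(range_id):
    # Read the full change feed for a single partition key range.
    response = container.query_items_change_feed(
        is_start_from_beginning=True,
        partition_key_range_id=range_id)
    return list(response)

with ThreadPoolExecutor() as executor:
    results = list(executor.map(read_range, partition_key_range_ids))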
I hope I make sense!
I am trying to get all the data from a view (Lotus Notes) with LotusScript and Python (the noteslib module) and export it to CSV, but the problem is that this takes too much time. I have tried two ways of looping through all the documents:
import noteslib

db = noteslib.Database('database', 'file.nsf')
view = db.GetView('My View')
doc = view.GetFirstDocument()
data = list()
while doc:
    data.append(doc.ColumnValues)
    doc = view.GetNextDocument(doc)
Getting about 1,000 lines of data took me 70 seconds, but the view has about 85,000 lines, so getting all the data this way would take far too long, especially since exporting everything to CSV manually via File->Export in Lotus Notes only takes about 2 minutes.
I also tried a second way with AllEntries, but it was even slower:
database = []
ec = view.AllEntries
ent = ec.GetFirstEntry()
while ent:
    row = []
    for v in ent.ColumnValues:
        row.append(v)
    database.append(row)
    ent = ec.GetNextEntry(ent)
Everything that I found on the Internet is based on "NextDocument" or "AllEntries". Is there any way to do it faster?
It is (or at least used to be) very expensive from a time standpoint to open a Notes document, like you are doing in your code.
Since you are saying that you want to export the data that is being displayed in the view, you could use the NotesViewEntry class instead. It should be much faster.
Set col = view.AllEntries
Set entry = col.GetFirstEntry()
Do Until entry Is Nothing
    values = entry.ColumnValues  '*** Array of column values
    '*** Do stuff here
    Set entry = col.GetNextEntry(entry)
Loop
I wrote a blog about this back in 2013:
http://blog.texasswede.com/which-is-faster-columnvalues-or-getitemvalue/
Something is going on with your code "outside" the view navigation: you already chose the most performant way to navigate a view, using GetFirstDocument and GetNextDocument. Using the NotesViewNavigator as mentioned in the comments will be slightly better, but not significantly.
You might get a little bit of performance out of your code by setting view.AutoUpdate = False to prevent the view object from refreshing when something in the backend changes. But as you only read data and do not change view data, that will not give you much of a performance boost.
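For example, this is where that setting would go in the question's first loop (assuming the AutoUpdate property of the underlying NotesView COM object is reachable through noteslib the same way as the other properties used above):

data = list()
view = db.GetView('My View')
view.AutoUpdate = False  # stop the view from refreshing while we read it
doc = view.GetFirstDocument()
while doc:
    data.append(doc.ColumnValues)
    doc = view.GetNextDocument(doc)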
My suggestion: Identify the REAL bottleneck of your code by commenting out single sections to find out when it starts to get slower:
First attempt:
while doc:
    doc = view.GetNextDocument(doc)
Slow?
If not then next attempt:
while doc:
    arr = doc.ColumnValues
    doc = view.GetNextDocument(doc)
Slow?
If yes: ColumnValues is your enemy...
If not then next attempt:
while doc:
    arr = doc.ColumnValues
    data.append(arr)
    doc = view.GetNextDocument(doc)
I would be very interested to get your results of where it starts to become slow.
I would suspect the performance issue is using COM/ActiveX in Python to access Notes databases. Transferring data via COM involves datatype 'marshalling', possibly at every step, and especially for 'out-of-process' method/property calls.
I don't think there is any way around this in COM. You should consider arranging a Notes 'agent' to do this for you instead (LotusScript or Java maybe). Even a basic LotusScript agent can export 000's of docs per minute. A further alternative may be to look at the Notes C-API (not an easy option and requires API calls from Python).
I'm processing analytics hits in an Apache Beam pipeline written in Python. I'm using FixedWindows of 10 minutes, and I would like to trigger an alert (for example with Cloud Pub/Sub) when a window is empty. So far, here's what I've done:
ten_min_windows = day_hits | '10MinutesWindows' >> beam.WindowInto(
    beam.window.FixedWindows(10 * 60))

ten_min_alerts = (ten_min_windows
                  | 'CountTransactions10Min' >> beam.CombineGlobally(count_transactions).without_defaults()
                  | 'KeepZeros10Min' >> beam.Filter(keep_zeros)
                  | 'ConvertToAlerts10Min' >> beam.ParDo(ToAlert()))
count_transactions filters to only keep transaction hits, then returns the length of the resulting list. keep_zeros returns True if the resulting length is 0. The problem is that if the PCollection did not contain any transaction hits, no length is returned at all, and I get an empty PCollection because of without_defaults. It seems I cannot drop without_defaults, as that is not allowed when using non-global windows.
I've seen this thread advising to add a dummy element to each window, then check that the count is more than one.
Is this the best solution, or is there a better way?
How can I do this, given that I will need exactly one element per window? Can I code this in the pipeline directly, or do I need to schedule a fake hit to be sent (for example through Cloud Pub/Sub) every 10 minutes?
You can use Metrics.counter to monitor the number of elements processed, in Stackdriver for example.
From there you can then set up alerting, based on your own rules, from your favorite monitoring tool.
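For reference, a minimal sketch of incrementing such a counter inside a DoFn; CountHits and the namespace/name pair ('analytics', 'hits_processed') are made-up examples, and how the counter surfaces in monitoring depends on your runner:

import apache_beam as beam
from apache_beam.metrics import Metrics

class CountHits(beam.DoFn):
    def __init__(self):
        # The runner exports this counter (e.g. Dataflow publishes it to
        # Cloud Monitoring / Stackdriver).
        self.hits_counter = Metrics.counter('analytics', 'hits_processed')

    def process(self, element):
        self.hits_counter.inc()
        yield element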
I am trying to work out a way to change between two sets of labels on a map. I have a map with zip codes that are labeled and I want to be able to output two maps: one with the zip code label (ZIP) and one with a value from a field I have joined to the data (called chrlabel). The goal is to have one map showing data for each zip code, and a second map giving the zip code as reference.
My initial attempt, which I can't get working, looks like this:
1) I added a second data frame to my map and added a new layer that contains two polygons named "zip" and "chrlabel".
2) I use this frame to enable data driven pages and then hide it behind the primary frame (I don't want to see those polygons, I just want to use them to control the data driven pages).
3) In the zip code labels I tried to write a VBScript expression like this pseudo-code:
test = "
If test = "zip" then
label = ZIP
else
label = CHRLABEL
endif
This does not work because the dynamic text does not resolve to the page name in the VBScript.
Is there some way to call the page name in VBScript so that I can make this work?
If not, is there another way to do this?
My other thought is to add another field to the layer that gets filled with a one or a zero. Then I could replace the if-then test condition with if NewField = 1.
Then I would just need to write a script that updates all the NewFields for the zip code features when the data driven page advances to the second page. Is there a way to trigger a script (Python or other) when a data driven page changes?
Thanks
8 months too late, but for posterity...
You're making things hard on yourself - it would be much easier to set up a duplicate layer with different labels and then adjust layer visibility. I'm not familiar with VBScript for this sort of thing, but in Python (using ESRI's arcpy library) it would look something like the following [Python 2.6, ArcMap 10 - sample only, I haven't debugged this, but I do similar things quite often]:
from arcpy import mapping

## Load the map from disk
mxdFilePath = "C:\\GIS_Maps_Folder\\MyMap.mxd"
mapDoc = mapping.MapDocument(mxdFilePath)

## Load map elements
dataFrame = mapping.ListDataFrames(mapDoc)[0]  # assumes you want the first data frame; you can also search by name
mxdLayers = mapping.ListLayers(mapDoc, "", dataFrame)

## Adjust layers
for layer in mxdLayers:
    if layer.name == 'zip':
        zip_lyr = layer
    elif layer.name == 'sample_units':  # the duplicate layer carrying the other labels
        labels_lyr = layer

## Print zip code map
zip_lyr.visible = True
zip_lyr.showLabels = True
labels_lyr.visible = False
labels_lyr.showLabels = False
zip_path = "C:\\Output_Folder\\Zips.pdf"
mapping.ExportToPDF(mapDoc, zip_path, layers_attributes="NONE", resolution=150)

## Print labels map
zip_lyr.visible = False
zip_lyr.showLabels = False
labels_lyr.visible = True
labels_lyr.showLabels = True
labels_path = "C:\\Output_Folder\\Labels.pdf"
mapping.ExportToPDF(mapDoc, labels_path, layers_attributes="NONE", resolution=150)

## Combine files (if desired)
pdfDoc = mapping.PDFDocumentCreate("C:\\Output_Folder\\Output.pdf")
pdfDoc.appendPages(zip_path)
pdfDoc.appendPages(labels_path)
pdfDoc.saveAndClose()
As far as the Data Driven Pages go, you can export them all at once or in a loop (a rough loop is sketched below) and adjust whatever you want, although I'm not sure why you'd need to if you use something similar to the above. The ESRI documentation and examples are actually quite good on this. (You should be able to get to all the other Python documentation pretty easily from that page.)
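For completeness, a rough sketch of the per-page export loop; this assumes the mxd already has Data Driven Pages enabled, and the paths are placeholders:

from arcpy import mapping

mapDoc = mapping.MapDocument("C:\\GIS_Maps_Folder\\MyMap.mxd")
ddp = mapDoc.dataDrivenPages

## Export each Data Driven Page to its own PDF
for pageNum in range(1, ddp.pageCount + 1):
    ddp.currentPageID = pageNum
    out_path = "C:\\Output_Folder\\Page_{0}.pdf".format(pageNum)
    mapping.ExportToPDF(mapDoc, out_path, resolution=150)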