Getting messages from Pub/Sub and then saving them into hourly (or other interval) files on GCS does not work. The job only writes the files when I shut it down. Can anyone point me in the right direction?
import uuid

import apache_beam as beam
from apache_beam.io.avroio import WriteToAvro
from apache_beam.io.gcp.pubsub import ReadFromPubSub
from apache_beam.transforms import window
from apache_beam.transforms.trigger import AccumulationMode, AfterWatermark

# Utils, DecodeAvro and pipelineoptions are defined elsewhere in the job.
topic = 'test.txt'
jobname = 'streaming-' + topic.replace('.', '-')
input_topic = 'projects/PROJECT/topics/' + topic

u = Utils()
parsed_schema = u.get_parsed_avro_from_schema_service(
    schema_name=topic,
    schema_repo_url='localhost'
)

p = beam.Pipeline(options=pipelineoptions)

messages = p | 'Read from topic: ' + topic >> ReadFromPubSub(topic=input_topic).with_input_types(bytes)

windowed_lines = (
    messages
    | 'decode' >> beam.ParDo(DecodeAvro(), parsed_schema)
    | beam.WindowInto(
        window.FixedWindows(60),
        trigger=AfterWatermark(),
        accumulation_mode=AccumulationMode.DISCARDING
    )
)

output = windowed_lines | 'write result' >> WriteToAvro(
    file_path_prefix='gs://BUCKET/streaming/tests/',
    shard_name_template=topic.split('.')[0] + '_' + str(uuid.uuid4()) + '_SSSS-of-NNNN',
    schema=parsed_schema,
    file_name_suffix='.avro',
    num_shards=2
)

result = p.run()
result.wait_until_finish()
After some more research, I found that writing from an unbounded source to a bounded sink is not supported by the Python SDK yet, so I will have to switch to the Java SDK for this.
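For completeness, here is a minimal, untested sketch of a possible Python-only alternative, assuming a newer SDK release where apache_beam.io.fileio.WriteToFiles supports windowed writes in streaming pipelines. The Avro decoding and schema handling from the question are omitted and records are written as plain text lines, so treat it as a starting point rather than a drop-in replacement.

import apache_beam as beam
from apache_beam import window
from apache_beam.io import fileio
from apache_beam.io.gcp.pubsub import ReadFromPubSub
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    (
        p
        | 'Read from Pub/Sub' >> ReadFromPubSub(topic='projects/PROJECT/topics/test')
        | 'Decode' >> beam.Map(lambda msg: msg.decode('utf-8'))
        | 'Window' >> beam.WindowInto(window.FixedWindows(60))
        | 'Write windowed files' >> fileio.WriteToFiles(
            path='gs://BUCKET/streaming/tests/',
            shards=2,
            sink=lambda dest: fileio.TextSink(),
            file_naming=fileio.default_file_naming(prefix='output', suffix='.txt'),
        )
    )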
I created a gist modeled after the side-input example in the documentation: https://beam.apache.org/documentation/patterns/side-inputs/
The gist is here: https://gist.github.com/krishnap/9168f823d2b27547b5f5a41b5740896b#file-gistfile1-txt-L60
With or without windowing the two inputs, the join never gets called and the job gets stuck. Any ideas about what's going wrong? I tried both the DirectRunner and the Dataflow runner on GCP.
The side input does get updated periodically, though.
Here's the core part of the code.
# Create pipeline.
pipeline_options = PipelineOptions(streaming=True, save_main_session=True)
pipeline = beam.Pipeline(options=pipeline_options)
logging.info("started running")

side_input = (
    pipeline
    | "PeriodicImpulse" >> PeriodicImpulse(fire_interval=10, apply_windowing=True)
    | "MapToFileName" >> beam.Map(load_data_gcs)
    | "WindowMpInto xxx" >> beam.WindowInto(beam.transforms.window.FixedWindows(10))  # didn't help, with or without it
)

main_input = (
    pipeline
    | "MpImpulse" >> beam.Create(sample_main_input_elements)
    | "MapMpToTimestamped" >> beam.Map(lambda src: TimestampedValue(src, src))  # didn't help, with or without it
    | "WindowMpInto" >> beam.WindowInto(beam.transforms.window.FixedWindows(10))  # didn't help, with or without it
)

result = (
    main_input
    | "ApplyCrossJoin" >> beam.FlatMap(cross_join, rights=beam.pvalue.AsIter(side_input))
    | beam.Map(logging.info)
)

pipeline.run()
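For context, the snippet relies on two helpers defined in the linked gist. The sketches below are only my approximation of them (the names come from the snippet, the bodies are assumed), shown so the example reads self-contained; cross_join follows the cross-join helper from the Beam side-input pattern documentation and pairs each main-input element with every element of the AsIter side input.

# Assumed helpers, approximating what the gist defines; bodies are illustrative only.
def load_data_gcs(unused_element):
    # The real implementation presumably reads a file from GCS and returns its
    # contents; a constant payload stands in here.
    return 'side-input-payload'

def cross_join(left, rights):
    # Pair each main-input element with every current side-input element,
    # as in the Beam side-input pattern docs (AsIter version).
    for right in rights:
        yield (left, right)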
EDIT: the code below works on the Dataflow runner but not on the DirectRunner locally, so the DirectRunner still seems to have a bug. But I don't understand why the previous version above doesn't work on Dataflow either. What am I missing?
Gist with the full example: https://gist.github.com/krishnap/5b373614a82ca4131a7931cef50912ff
logging.info("started running")

side_input = (
    pipeline
    | "PeriodicImpulse" >> PeriodicImpulse(fire_interval=10, apply_windowing=True)
    | "MapToFileName" >> beam.Map(load_data_gcs)
    | beam.WindowInto(
        beam.transforms.window.GlobalWindows(),
        trigger=beam.trigger.Repeatedly(beam.trigger.AfterCount(1)),
        accumulation_mode=beam.trigger.AccumulationMode.DISCARDING,
    )
)

main_input = (
    pipeline
    | "MpImpulse" >> beam.Create(sample_main_input_elements)
    | "MapMpToTimestamped" >> beam.Map(lambda src: TimestampedValue(src, src))
    | "WindowMpInto" >> beam.WindowInto(beam.transforms.window.FixedWindows(1))
)

result = (
    main_input
    | "ApplyCrossJoin" >> beam.FlatMap(cross_join, rights=beam.pvalue.AsSingleton(side_input))
    | beam.Map(logging.info)
)
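One difference worth noting between the two versions: with beam.pvalue.AsSingleton, the rights argument arrives as a single value rather than an iterable, so the cross_join helper has to treat it accordingly. A sketch of what that variant might look like (my assumption, not taken from the gist):

def cross_join(left, rights):
    # `rights` is the single current side-input value here, not a list.
    yield (left, rights)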
I have a branching pipeline with multiple ParDo transforms whose outputs are merged and written as text records to a GCS bucket.
I am receiving the following messages after my pipeline crashes:
The worker lost contact with the service.
RuntimeError: FileNotFoundError: [Errno 2] Not found: gs://MYBUCKET/JOBNAME.00000-of-00001.avro [while running 'WriteToText/WriteToText/Write/WriteImpl/WriteBundles/WriteBundles']
This looks like it can't find the output file it has been writing to. Everything seems fine until a certain point, when the error occurs. I'd like to wrap a try/except around it or set a breakpoint, but I'm not even sure how to discover the root cause.
Is there a way to write just a single file, or to only open a file for writing once? The job is spamming thousands of output files into this bucket, which is something I'd like to eliminate and which may be a factor.
with beam.Pipeline(argv=pipeline_args) as p:
    csvlines = (
        p
        | 'Read From CSV' >> beam.io.ReadFromText(known_args.input, skip_header_lines=1)
        | 'Parse CSV to Dictionary' >> beam.ParDo(Split())
        | 'Read Files into Memory' >> beam.ParDo(DownloadFilesDoFn())
        | 'Windowing' >> beam.WindowInto(window.FixedWindows(20 * 60))
    )

    b1 = csvlines | 'Branch1' >> beam.ParDo(Branch1DoFn())
    b2 = csvlines | 'Branch2' >> beam.ParDo(Branch2DoFn())
    b3 = csvlines | 'Branch3' >> beam.ParDo(Branch3DoFn())
    b4 = csvlines | 'Branch4' >> beam.ParDo(Branch4DoFn())
    b5 = csvlines | 'Branch5' >> beam.ParDo(Branch5DoFn())
    b6 = csvlines | 'Branch6' >> beam.ParDo(Branch6DoFn())

    output = (
        (b1, b2, b3, b4, b5, b6)
        | 'Merge PCollections' >> beam.Flatten()
        | 'WriteToText' >> beam.io.Write(beam.io.textio.WriteToText(known_args.output))
    )
This question is linked to this previous question, which contains more detail about the implementation. The solution there suggested creating an instance of google.cloud.storage.Client() in the start_bundle() of every ParDo(DoFn). This client connects to the same GCS bucket that is given via the args in WriteToText(known_args.output):
class DownloadFilesDoFn(beam.DoFn):
    def __init__(self):
        import re
        self.gcs_path_regex = re.compile(r'gs:\/\/([^\/]+)\/(.*)')

    def start_bundle(self):
        import google.cloud.storage
        self.gcs = google.cloud.storage.Client()

    def process(self, element):
        self.file_match = self.gcs_path_regex.match(element['Url'])
        self.bucket = self.gcs.get_bucket(self.file_match.group(1))
        self.blob = self.bucket.get_blob(self.file_match.group(2))
        self.f = self.blob.download_as_bytes()
It's likely that the cause of this error is related to having too many connections to the client. I'm not clear on good practice for this, since it's been suggested elsewhere that you can set up network connections in this way for each bundle.
Adding the following to the end of the DoFn, to remove the client object from memory at the end of the bundle, should help close some unnecessary lingering connections:
    def finish_bundle(self):
        del self.gcs, self.gcs_path_regex
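An alternative worth considering (a sketch under my own assumptions, not part of the original answer) is to create the client once per DoFn instance with setup() and release it in teardown(); these lifecycle hooks run less often than the per-bundle ones, so even fewer clients get created.

import re

import apache_beam as beam
import google.cloud.storage

class DownloadFilesDoFn(beam.DoFn):
    def setup(self):
        # Runs once per DoFn instance rather than once per bundle.
        self.gcs_path_regex = re.compile(r'gs:\/\/([^\/]+)\/(.*)')
        self.gcs = google.cloud.storage.Client()

    def process(self, element):
        file_match = self.gcs_path_regex.match(element['Url'])
        bucket = self.gcs.get_bucket(file_match.group(1))
        blob = bucket.get_blob(file_match.group(2))
        yield {**element, 'file_contents': blob.download_as_bytes()}

    def teardown(self):
        # Drop the client reference so its connections can be reclaimed.
        self.gcs = None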
I am developing an ETL pipeline for Google Cloud Dataflow where I have several branching ParDo transforms which each require a local audio file. The branched results are then combined and exported as text.
This was initially a Python script that ran on a single machine, which I am attempting to adapt for parallelisation across VM workers using GC Dataflow.
The extraction process downloads the files from a single GCS bucket location and then deletes them after the transform is completed, to keep storage under control. This is because the pre-processing module requires local access to the files. It could be re-engineered to handle a byte stream instead of a file by rewriting some of the pre-processing libraries myself; however, some attempts at this aren't going well, and I'd first like to explore how to handle parallelised local file operations in Apache Beam / GC Dataflow in order to understand the framework better. A rough sketch of the byte-stream idea is included below for reference.
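(Sketch only, and hypothetical: it assumes the pre-processing library can accept a file-like object, which may not hold in practice. The bucket name comes from the code below; the object path is a placeholder.)

import io

from google.cloud import storage

# Download the object's bytes and wrap them in a file-like stream instead of
# writing a local file.
client = storage.Client()
blob = client.bucket('mybucket').get_blob('path/to/audio.wav')
audio_stream = io.BytesIO(blob.download_as_bytes())
# audio = preprocess.load(audio_stream)  # only if preprocess.load accepts file-like objects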
In this rough implementation each branch downloads and deletes the files, with lots of double handling. In my implementation I have 8 branches, so each file is downloaded and deleted 8 times. Could a GCS bucket instead be mounted on every worker rather than downloading the files from the remote bucket?
Or is there another way to ensure workers are being passed the correct reference to a file so that:
a single DownloadFilesDoFn() can download a batch
then fan out the local file references in PCollection to all the branches
and then a final CleanUpFilesDoFn() can remove them
How can you parallelise local file references?
What is the best branched ParDo strategy for Apache Beam / GC Dataflow if local file operations cannot be avoided?
Here is some example code of my existing implementation, with two branches for simplicity.
# singleton decorator
def singleton(cls):
    instances = {}
    def getinstance():
        if cls not in instances:
            instances[cls] = cls()
        return instances[cls]
    return getinstance

@singleton
class Predict():
    def __init__(self, model):
        '''
        Process audio, reads in filename
        Returns Prediction
        '''
        self.model = model

    def process(self, filename):
        # simplified pseudocode
        audio = preprocess.load(filename=filename)
        prediction = inference(self.model, audio)
        return prediction

class PredictDoFn(beam.DoFn):
    def __init__(self, model):
        self.localfile, self.model = "", model

    def process(self, element):
        # Construct Predict() object singleton per worker
        predict = Predict(self.model)
        subprocess.run(['gsutil', 'cp', element['GCSPath'], './'], cwd=cwd, shell=False)
        self.localfile = cwd + "/" + element['GCSPath'].split('/')[-1]
        res = predict.process(self.localfile)
        return [{
            'Index': element['Index'],
            'Title': element['Title'],
            'File': element['GCSPath'],
            self.model + 'Prediction': res
        }]

    def finish_bundle(self):
        subprocess.run(['rm', self.localfile], cwd=cwd, shell=False)

# DoFn to split the CSV into elements (the GCS bucket could maybe be read as a PCollection instead)
class Split(beam.DoFn):
    def process(self, element):
        Index, Title, GCSPath = element.split(",")
        GCSPath = 'gs://mybucket/' + GCSPath
        return [{
            'Index': int(Index),
            'Title': Title,
            'GCSPath': GCSPath
        }]
A simplified version of the pipeline:
with beam.Pipeline(argv=pipeline_args) as p:
    files = (
        p
        | 'Read From CSV' >> beam.io.ReadFromText(known_args.input)
        | 'Parse CSV into Dict' >> beam.ParDo(Split())
    )

    # prediction 1 branch
    preds1 = files | 'Prediction 1' >> beam.ParDo(PredictDoFn(model1))

    # prediction 2 branch
    preds2 = files | 'Prediction 2' >> beam.ParDo(PredictDoFn(model2))

    # join branches
    joined = { preds1, preds2 }

    # output to file
    output = joined | 'WriteToText' >> beam.io.Write(beam.io.textio.WriteToText(known_args.output))
To avoid downloading the files repeatedly, the contents of the files can be put into the PCollection.
class DownloadFilesDoFn(beam.DoFn):
    def __init__(self):
        import re
        self.gcs_path_regex = re.compile(r'gs:\/\/([^\/]+)\/(.*)')

    def start_bundle(self):
        import google.cloud.storage
        self.gcs = google.cloud.storage.Client()

    def process(self, element):
        file_match = self.gcs_path_regex.match(element['GCSPath'])
        bucket = self.gcs.get_bucket(file_match.group(1))
        blob = bucket.get_blob(file_match.group(2))
        element['file_contents'] = blob.download_as_bytes()
        yield element
Then PredictDoFn becomes:
class PredictDoFn(beam.DoFn):
    def __init__(self, model):
        self.model = model

    def start_bundle(self):
        self.predict = Predict(self.model)

    def process(self, element):
        res = self.predict.process(element['file_contents'])
        return [{
            'Index': element['Index'],
            'Title': element['Title'],
            'File': element['GCSPath'],
            self.model + 'Prediction': res
        }]
and the pipeline:
with beam.Pipeline(argv=pipeline_args) as p:
    files = (
        p
        | 'Read From CSV' >> beam.io.ReadFromText(known_args.input)
        | 'Parse CSV into Dict' >> beam.ParDo(Split())
        | 'Read files' >> beam.ParDo(DownloadFilesDoFn())
    )

    # prediction 1 branch
    preds1 = files | 'Prediction 1' >> beam.ParDo(PredictDoFn(model1))

    # prediction 2 branch
    preds2 = files | 'Prediction 2' >> beam.ParDo(PredictDoFn(model2))

    # join branches
    joined = { preds1, preds2 }

    # output to file
    output = joined | 'WriteToText' >> beam.io.Write(beam.io.textio.WriteToText(known_args.output))
I'm trying to create fixed windows of 10 seconds using Apache Beam 2.23 with Kafka as the data source.
The window seems to be triggered for every record, even if I try to set an AfterProcessingTime trigger of 15 seconds, and it throws the following error if I try to use GroupByKey.
Error: KeyError: 0 [while running '[17]: FixedWindow']
Data simulation:
from kafka import KafkaProducer
import time

producer = KafkaProducer()
id_val = 1001
while(1):
    message = {}
    message['id_val'] = str(id_val)
    message['sensor_1'] = 10
    if (id_val < 1003):
        id_val = id_val + 1
    else:
        id_val = 1001
    time.sleep(2)
    print(time.time())
    producer.send('test', str(message).encode())
Beam snippet:
class AddTimestampFn(beam.DoFn):
    def process(self, element):
        timestamp = int(time.time())
        yield beam.window.TimestampedValue(element, timestamp)

pipeline_options = PipelineOptions()
pipeline_options.view_as(StandardOptions).streaming = True
p = beam.Pipeline(options=pipeline_options)
with beam.Pipeline() as p:
    lines = p | "Reading messages from Kafka" >> kafkaio.KafkaConsume(kafka_config)
    groups = (
        lines
        | 'ParseEventFn' >> beam.Map(lambda x: ast.literal_eval(x[1]))
        | 'Add timestamp' >> beam.ParDo(AddTimestampFn())
        | 'After timestamp add' >> beam.ParDo(PrintFn("timestamp add"))
        | 'FixedWindow' >> beam.WindowInto(
            beam.window.FixedWindows(10 * 1), allowed_lateness=30)
        | 'Group' >> beam.GroupByKey()
        | 'After group' >> beam.ParDo(PrintFn("after group"))
    )
What am I doing wrong here? I have just started using beam so it could be something really silly.
I have two questions about my development.
Question 1
I'm trying to create a template from Python code that reads from BigQuery tables, applies some transformations, and writes to a different BigQuery table (which may or may not already exist).
The point is that I need to pass the target table as a parameter, but it looks like I can't use parameters in the pipeline method WriteToBigQuery, as it raises the following error message: apache_beam.error.RuntimeValueProviderError: RuntimeValueProvider(option: project_target, type: str, default_value: 'Test').get() not called from a runtime context
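For reference, the snippets below assume a custom options class roughly along these lines (sketch only; the real definition is omitted here), with the target table pieces declared as runtime value providers:

from apache_beam.options.pipeline_options import PipelineOptions

class CustomOptions(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        # Runtime template parameters, available as ValueProviders.
        parser.add_value_provider_argument('--project_target', type=str, default='Test')
        parser.add_value_provider_argument('--dataset_target', type=str)
        parser.add_value_provider_argument('--table_target', type=str)
        parser.add_value_provider_argument('--target_schema', type=str)

custom_options = options.view_as(CustomOptions)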
Approach 1
with beam.Pipeline(options=options) as pipeline:
    logging.info("Start logic process...")
    kpis_report = (
        pipeline
        | "Process start" >> Create(["1"])
        | "Delete previous data" >> ParDo(preTasks())
        | "Read table" >> ParDo(readTable())
        ....
        | 'Write table 2' >> Write(WriteToBigQuery(
            table=custom_options.project_target.get() + ":" + custom_options.dataset_target.get() + "." + custom_options.table_target.get(),
            schema=custom_options.target_schema.get(),
            write_disposition=BigQueryDisposition.WRITE_APPEND,
            create_disposition=BigQueryDisposition.CREATE_IF_NEEDED)
Approach 2
I created a ParDo function in order to get the variable there and set the WriteToBigQuery method. However, despite the pipeline execution completing successfully and the output seeming to return rows (theoretically written), I can't see the table or any data inserted into it.
with beam.Pipeline(options=options) as pipeline:
    logging.info("Start logic process...")
    kpis_report = (
        pipeline
        | "Process start" >> Create(["1"])
        | "Pre-tasks" >> ParDo(preTasks())
        | "Read table" >> ParDo(readTable())
        ....
        | 'Write table 2' >> Write(WriteToBigQuery())
I tried two methods and neither works: BigQueryBatchFileLoads and WriteToBigQuery.
class writeTable(beam.DoFn):
    def process(self, element):
        try:
            # First load the parameters from the custom_options variable (here we can do it)
            result1 = Write(BigQueryBatchFileLoads(
                destination=target_table,
                schema=target_schema,
                write_disposition=BigQueryDisposition.WRITE_APPEND,
                create_disposition=BigQueryDisposition.CREATE_IF_NEEDED))
            result2 = WriteToBigQuery(
                table=target_table,
                schema=target_schema,
                write_disposition=BigQueryDisposition.WRITE_APPEND,
                create_disposition=BigQueryDisposition.CREATE_IF_NEEDED,
                method="FILE_LOADS")
Question 2
Another doubt I have is whether, in this last ParDo class, I need to return something such as the element, result1, or result2, since this is the last step of the pipeline.
I'd appreciate your help on this.
The most advisable way to do this is similar to Approach 1, but passing the value provider without calling get(), and passing a lambda as the table:
with beam.Pipeline(options=options) as pipeline:
    logging.info("Start logic process...")
    kpis_report = (
        pipeline
        | "Process start" >> Create(["1"])
        | "Delete previous data" >> ParDo(preTasks())
        | "Read table" >> ParDo(readTable())
        ....
        | 'Write table 2' >> WriteToBigQuery(
            table=lambda x: custom_options.project_target.get() + ":" + custom_options.dataset_target.get() + "." + custom_options.table_target.get(),
            schema=custom_options.target_schema,
            write_disposition=BigQueryDisposition.WRITE_APPEND,
            create_disposition=BigQueryDisposition.CREATE_IF_NEEDED)
This should work.