Dataflow batch job not scaling - python

My Dataflow job (Job ID: 2020-08-18_07_55_15-14428306650890914471) is not scaling past 1 worker, despite Dataflow setting the target workers to 1000.
The job is configured to query the Google Patents BigQuery dataset, tokenize the text using a ParDo custom function and the transformers (huggingface) library, serialize the result, and write everything to a giant parquet file.
I had assumed (after running the job yesterday, which mapped a function instead of using a beam.DoFn class) that the issue was some non-parallelizable object preventing scaling; hence the refactoring of the tokenization step into a class.
Here's the script, which is run from the command line with the following command:
python bq_to_parquet_pipeline_w_class.py --extra_package transformers-3.0.2.tar.gz
The script:
import os
import re
import argparse
import google.auth
import apache_beam as beam
from apache_beam.options import pipeline_options
from apache_beam.options.pipeline_options import GoogleCloudOptions
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.options.pipeline_options import SetupOptions
from apache_beam.runners import DataflowRunner
from apache_beam.io.gcp.internal.clients import bigquery
import pyarrow as pa
import pickle
from transformers import AutoTokenizer
print('Defining TokDoFn')
class TokDoFn(beam.DoFn):
    def __init__(self, tok_version, block_size=200):
        self.tok = AutoTokenizer.from_pretrained(tok_version)
        self.block_size = block_size

    def process(self, x):
        txt = x['abs_text'] + ' ' + x['desc_text'] + ' ' + x['claims_text']
        enc = self.tok.encode(txt)
        for idx, token in enumerate(enc):
            chunk = enc[idx:idx + self.block_size]
            serialized = pickle.dumps(chunk)
            yield serialized
def run(argv=None, save_main_session=True):
    query_big = '''
    with data as (
      SELECT
        (select text from unnest(abstract_localized) limit 1) abs_text,
        (select text from unnest(description_localized) limit 1) desc_text,
        (select text from unnest(claims_localized) limit 1) claims_text,
        publication_date,
        filing_date,
        grant_date,
        application_kind,
        ipc
      FROM `patents-public-data.patents.publications`
    )
    select *
    FROM data
    WHERE
      abs_text is not null
      AND desc_text is not null
      AND claims_text is not null
      AND ipc is not null
    '''

    query_sample = '''
    SELECT *
    FROM `client_name.patent_data.patent_samples`
    LIMIT 2;
    '''

    print('Start Run()')
    parser = argparse.ArgumentParser()
    known_args, pipeline_args = parser.parse_known_args(argv)

    '''
    Configure Options
    '''
    # Setting up the Apache Beam pipeline options.
    # We use the save_main_session option because one or more DoFn's in this
    # workflow rely on global context (e.g., a module imported at module level).
    options = PipelineOptions(pipeline_args)
    options.view_as(SetupOptions).save_main_session = save_main_session

    # Sets the project to the default project in your current Google Cloud environment.
    _, options.view_as(GoogleCloudOptions).project = google.auth.default()

    # Sets the Google Cloud Region in which Cloud Dataflow runs.
    options.view_as(GoogleCloudOptions).region = 'us-central1'

    # IMPORTANT! Adjust the following to choose a Cloud Storage location.
    dataflow_gcs_location = 'gs://client_name/dataset_cleaned_pq_classTok'

    # Dataflow Staging Location. This location is used to stage the Dataflow Pipeline and SDK binary.
    options.view_as(GoogleCloudOptions).staging_location = f'{dataflow_gcs_location}/staging'

    # Dataflow Temp Location. This location is used to store temporary files or intermediate results before finally outputting to the sink.
    options.view_as(GoogleCloudOptions).temp_location = f'{dataflow_gcs_location}/temp'

    # The directory to store the output files of the job.
    output_gcs_location = f'{dataflow_gcs_location}/output'

    print('Options configured per GCP Notebook Examples')
    print('Configuring BQ Table Schema for Beam')

    # Write Schema (to PQ):
    schema = pa.schema([
        ('block', pa.binary())
    ])

    print('Starting pipeline...')

    with beam.Pipeline(runner=DataflowRunner(), options=options) as p:
        res = (p
               | 'QueryTable' >> beam.io.Read(beam.io.BigQuerySource(query=query_big, use_standard_sql=True))
               | beam.ParDo(TokDoFn(tok_version='gpt2', block_size=200))
               | beam.Map(lambda x: {'block': x})
               | beam.io.WriteToParquet(os.path.join(output_gcs_location, f'pq_out'),
                                        schema,
                                        record_batch_size=1000)
               )

    print('Pipeline built. Running...')

if __name__ == '__main__':
    import logging
    logging.getLogger().setLevel(logging.INFO)
    logging.getLogger("transformers.tokenization_utils_base").setLevel(logging.ERROR)
    run()

The solution is twofold:
The following quotas were being exceeded when I ran my job, all under 'Compute Engine API' (view your quotas here: https://console.cloud.google.com/iam-admin/quotas):
CPUs (I requested an increase to 50)
Persistent Disk Standard (GB) (I requested an increase to 12,500)
In_Use_IP_Address (I requested an increase to 50)
Note: If you read the console output while your job is running, any exceeded quotas should print out as an INFO line.
Following Peter Kim's advice above, I passed the flag --max_num_workers as part of my command:
python bq_to_parquet_pipeline_w_class.py --extra_package transformers-3.0.2.tar.gz --max_num_workers 22
And I started scaling!
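For reference, a minimal sketch (not from the original post) of setting the same cap programmatically via Beam's WorkerOptions instead of the command-line flag; the value 22 simply mirrors the flag above:

from apache_beam.options.pipeline_options import PipelineOptions, WorkerOptions

options = PipelineOptions(pipeline_args)
# Cap autoscaling so the job stays within the (increased) CPU and in-use IP quotas.
options.view_as(WorkerOptions).max_num_workers = 22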
All in all, it would be nice if there were a way to prompt users via the Dataflow console when a quota is being reached, and to provide an easy means of requesting an increase to that quota (and recommended complementary quotas), along with suggestions for the amounts to request.

Related

How to avoid duplication in a dataflow beam pipeline in the writeToBq step?

We have a job running on Dataflow that ingests data from Pub/Sub and writes it to BigQuery. On a limited amount of data we did not see any duplicates, but at our current volume of 100 events/s we get duplicates in the BigQuery tables. What we call a duplicate here is a row with the same event uuid.
Here is my code:
class CustomParse(beam.DoFn):
    """ Custom ParallelDo class to apply a custom transformation """
    def to_runner_api_parameter(self, unused_context):
        return "beam:transforms:custom_parsing:custom_v0", None

    def process(self, message: beam.io.PubsubMessage, timestamp=beam.DoFn.TimestampParam, window=beam.DoFn.WindowParam):
        import uuid
        data_parsed = {
            "data": message.data,
            "dataflow_timestamp": timestamp.to_rfc3339(),
            "uuid": uuid.uuid4().hex
        }
        yield data_parsed

def run():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--input_subscription",
        help='Input PubSub subscription of the form "projects/<PROJECT>/subscriptions/<SUBSCRIPTION>."'
    )
    parser.add_argument(
        "--output_table", help="Output BigQuery Table"
    )
    known_args, pipeline_args = parser.parse_known_args()

    additional_bq_parameters = {
        'timePartitioning': {'type': 'HOUR'}}

    # Creating pipeline options
    pipeline_options = PipelineOptions(pipeline_args)

    def get_table_name(x):
        namespace = NAMESPACE_EXTRACTED
        date = x['dataflow_timestamp'][:10].replace('-', '')
        return f"{known_args.output_table}_{namespace}_{date}"

    # Defining our pipeline and its steps
    p = beam.Pipeline(options=pipeline_options)
    (
        p
        | "ReadFromPubSub" >> beam.io.gcp.pubsub.ReadFromPubSub(
            subscription=known_args.input_subscription, timestamp_attribute=None, with_attributes=True
        )
        | "Prevent fusion" >> beam.transforms.util.Reshuffle()
        | "CustomParse" >> beam.ParDo(CustomParse(broker_model))
        | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
            table=get_table_name,
            schema=BIGQUERY_SCHEMA,
            additional_bq_parameters=additional_bq_parameters,
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
            batch_size=1000
        )
    )
    pipeline_result = p.run()

if __name__ == "__main__":
    run()
What should we do to avoid this? Are we missing a combining step? For the record, it has not happened following an error.
I'm missing some context (for example, you haven't included the BrokerParsing transform), but based on what you've included here, it seems like the issue might be that you're not including the id_label parameter in the ReadFromPubSub transform. According to the documentation:
id_label – The attribute on incoming Pub/Sub messages to use as a unique record identifier. When specified, the value of this attribute (which can be any string that uniquely identifies the record) will be used for deduplication of messages. If not provided, we cannot guarantee that no duplicate data will be delivered on the Pub/Sub stream. In this case, deduplication of the stream will be strictly best effort.
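A minimal sketch of that suggestion, assuming your publisher sets a unique attribute on each message (the attribute name "message_id" below is hypothetical; id_label is honored by the Dataflow runner):

import apache_beam as beam

messages = (
    p
    | "ReadFromPubSub" >> beam.io.gcp.pubsub.ReadFromPubSub(
        subscription=known_args.input_subscription,
        with_attributes=True,
        id_label="message_id",  # hypothetical attribute carrying a unique record id
    )
)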
I believe this is due to beam.io.WriteToBigQuery streaming_inserts guarantee
Use the BigQuery streaming insert API to insert data. This provides the lowest-latency insert path into BigQuery, and therefore is the default method when the input is unbounded. BigQuery will make a strong effort to ensure no duplicates when using this path, however there are some scenarios in which BigQuery is unable to make this guarantee (see https://cloud.google.com/bigquery/streaming-data-into-bigquery). A query can be run over the output table to periodically clean these rare duplicates. Alternatively, using the FILE_LOADS insert method does guarantee no duplicates, though the latency for the insert into BigQuery will be much higher. For more information, see Streaming Data into BigQuery.
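If avoiding duplicates matters more than latency, a hedged sketch of switching the sink to FILE_LOADS; here "parsed" stands for the output of the CustomParse step, and the triggering_frequency value is an assumption (check your Beam SDK version for the exact options supported):

write = parsed | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
    table=get_table_name,
    schema=BIGQUERY_SCHEMA,
    additional_bq_parameters=additional_bq_parameters,
    write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
    create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
    method=beam.io.WriteToBigQuery.Method.FILE_LOADS,
    triggering_frequency=300,  # seconds between load jobs; value is an assumption
)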

How to handle BigQuery insert errors in a Dataflow pipeline using Python?

I'm trying to create a streaming pipeline with Dataflow that reads messages from a PubSub topic to end up writing them on a BigQuery table. I don't want to use any Dataflow template.
For the moment I just want to create a pipeline in a Python3 script executed from a Google VM Instance to carry out a loading and transformation process of every message that arrives from Pubsub (parsing the records that it contains and adding a new field) to end up writing the results on a BigQuery table.
Simplifying, my code would be:
#!/usr/bin/env python
from apache_beam.options.pipeline_options import PipelineOptions
from google.cloud import pubsub_v1
import apache_beam as beam
import apache_beam.io.gcp.bigquery
import logging
import argparse
import sys
import json
from datetime import datetime, timedelta

def load_pubsub(message):
    try:
        data = json.loads(message)
        records = data["messages"]
        return records
    except:
        raise ImportError("Something went wrong reading data from the Pub/Sub topic")

class ParseTransformPubSub(beam.DoFn):
    def __init__(self):
        self.water_mark = (datetime.now() + timedelta(hours=1)).strftime("%Y-%m-%d %H:%M:%S.%f")

    def process(self, records):
        for record in records:
            record["E"] = self.water_mark
            yield record

def main():
    # parse_table_schema_from_json expects the JSON string, not a file object.
    table_schema = apache_beam.io.gcp.bigquery.parse_table_schema_from_json(open("TableSchema.json").read())

    parser = argparse.ArgumentParser()
    parser.add_argument('--input_topic')
    parser.add_argument('--output_table')
    known_args, pipeline_args = parser.parse_known_args(sys.argv)

    with beam.Pipeline(argv=pipeline_args) as p:
        pipe = (p
                | 'ReadDataFromPubSub' >> beam.io.ReadStringsFromPubSub(known_args.input_topic)
                | 'LoadJSON' >> beam.Map(load_pubsub)
                | 'ParseTransform' >> beam.ParDo(ParseTransformPubSub())
                | 'WriteToAvailabilityTable' >> beam.io.WriteToBigQuery(
                    table=known_args.output_table,
                    schema=table_schema,
                    create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
                    write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)
                )

    result = p.run()
    result.wait_until_finish()

if __name__ == '__main__':
    logging.getLogger().setLevel(logging.INFO)
    main()
(For example) The messages published to the PubSub topic usually come as follows:
'{"messages":[{"A":"Alpha", "B":"V1", "C":3, "D":12},{"A":"Alpha", "B":"V1", "C":5, "D":14},{"A":"Alpha", "B":"V1", "C":3, "D":22}]}'
If the field "E" is added in the record, then the structure of the record (dictionary in Python) and the data type of the fields is what the BigQuery table expects.
The problems that a I want to handle are:
If some messages come with an unexpected structure I want to fork the pipeline flatten and write them in another BigQuery table.
If some messages come with an unexpected data type of a field, then in the last level of the pipeline when they should be written in the table an error will occur. I want to manage this type of error by diverting the record to a third table.
I read the documentation found on the following pages but I found nothing:
https://cloud.google.com/dataflow/docs/guides/troubleshooting-your-pipeline
https://cloud.google.com/dataflow/docs/guides/common-errors
By the way, if I choose the option to configure the pipeline through the template that reads from a PubSubSubscription and writes into BigQuery I get the following schema which turns out to be the same one I'm looking for:
Template: Cloud Pub/Sub Subscription to BigQuery
You can't catch the errors that occur in the BigQuery sink; the messages that you write to BigQuery must already be valid.
The best pattern is to perform a transform that checks the structure and field types of your messages. In case of error, you create an error flow and write that flow somewhere else, for example to a file, or to a table without a schema where you store the message as plain text.
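A hedged sketch of that pattern using Beam's tagged outputs; the ValidateRecord DoFn, the expected keys, and the "invalid" tag are assumptions based on the sample message, and "parsed" stands for the output of the ParseTransform step:

import json
import apache_beam as beam

class ValidateRecord(beam.DoFn):  # hypothetical validation step
    EXPECTED_KEYS = {"A", "B", "C", "D", "E"}  # assumption based on the sample message

    def process(self, record):
        if isinstance(record, dict) and self.EXPECTED_KEYS.issubset(record):
            yield record  # goes to the main output
        else:
            # Route malformed records to a side output for a separate sink.
            yield beam.pvalue.TaggedOutput("invalid", json.dumps(record, default=str))

results = parsed | beam.ParDo(ValidateRecord()).with_outputs("invalid", main="valid")
valid, invalid = results.valid, results.invalid
# 'valid' feeds the main BigQuery table; 'invalid' can go to an error table or a GCS file.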
We do the following when errors occur at the BigQuery sink:
send a message (without the stack trace) to GCP Error Reporting, so developers are notified
log the error to Stackdriver
stop the pipeline execution (the best place for messages to wait until a developer has fixed the issue is the incoming Pub/Sub subscription)
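For completeness, a related code-level option (not what this answer describes, and availability depends on your Beam SDK version): with streaming inserts, WriteToBigQuery exposes the rows the sink rejected, which you could route to a third table yourself. A hedged sketch, where 'valid' is the PCollection of already-validated records from the sketch above:

from apache_beam.io.gcp.bigquery import BigQueryWriteFn
from apache_beam.io.gcp.bigquery_tools import RetryStrategy

write_result = valid | beam.io.WriteToBigQuery(
    table=known_args.output_table,
    schema=table_schema,
    write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
    create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
    insert_retry_strategy=RetryStrategy.RETRY_NEVER,  # don't retry rejected rows forever
)
# PCollection of (destination table, row) tuples that BigQuery rejected.
failed_rows = write_result[BigQueryWriteFn.FAILED_ROWS]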

FileUploadMiscError while persisting output file from Azure Batch

I'm facing the following error while trying to persist log files to Azure Blob storage from an Azure Batch execution: "FileUploadMiscError - A miscellaneous error was encountered while uploading one of the output files". This error doesn't give much information about what might be going wrong. I tried checking the Microsoft documentation, but it doesn't mention this particular error code.
Below is the relevant code for adding the task to Azure Batch that I have ported from C# to Python for persisting the log files.
Note: The container that I have configured gets created when the task is added, but there's no blob inside.
import datetime
import logging
import os

import azure.storage.blob.models as blob_model
import yaml
from azure.batch import models
from azure.storage.blob.baseblobservice import BaseBlobService
from azure.storage.common.cloudstorageaccount import CloudStorageAccount
from dotenv import load_dotenv

LOG = logging.getLogger(__name__)

def add_tasks(batch_client, job_id, task_id, io_details, blob_details):
    task_commands = "This is a placeholder. Actual code has an actual task. This gets completed successfully."

    LOG.info("Configuring the blob storage details")
    base_blob_service = BaseBlobService(
        account_name=blob_details['account_name'],
        account_key=blob_details['account_key'])
    LOG.info("Base blob service created")

    base_blob_service.create_container(
        container_name=blob_details['container_name'], fail_on_exist=False)
    LOG.info("Container present")

    container_sas = base_blob_service.generate_container_shared_access_signature(
        container_name=blob_details['container_name'],
        permission=blob_model.ContainerPermissions(write=True),
        expiry=datetime.datetime.now() + datetime.timedelta(days=1))
    LOG.info(f"Container SAS created: {container_sas}")

    container_url = base_blob_service.make_container_url(
        container_name=blob_details['container_name'], sas_token=container_sas)
    LOG.info(f"Container URL created: {container_url}")

    # fpath = task_id + '/output.txt'
    fpath = task_id
    LOG.info(f"Creating output file object:")

    out_files_list = list()
    out_files = models.OutputFile(
        file_pattern=r"../stderr.txt",
        destination=models.OutputFileDestination(
            container=models.OutputFileBlobContainerDestination(
                container_url=container_url, path=fpath)),
        upload_options=models.OutputFileUploadOptions(
            upload_condition=models.OutputFileUploadCondition.task_completion))
    out_files_list.append(out_files)
    LOG.info(f"Output files: {out_files_list}")

    LOG.info(f"Creating the task now: {task_id}")
    task = models.TaskAddParameter(
        id=task_id, command_line=task_commands, output_files=out_files_list)
    batch_client.task.add(job_id=job_id, task=task)
    LOG.info(f"Added task: {task_id}")
There is a bug in Batch's OutputFile handling which causes it to fail to upload to containers if the full container URL includes any query-string parameters other than the ones included in the SAS token. Unfortunately, the azure-storage-blob Python module includes an extra query string parameter when generating the URL via make_container_url.
This issue was just raised to us, and a fix will be released in the coming weeks, but an easy workaround is to craft the URL yourself instead of using make_container_url, like so: container_url = 'https://{}/{}?{}'.format(blob_service.primary_endpoint, blob_details['container_name'], container_sas).
The resulting URL should look something like this: https://<account>.blob.core.windows.net/<container>?se=2019-01-12T01%3A34%3A05Z&sp=w&sv=2018-03-28&sr=c&sig=<sig> - specifically it shouldn't have restype=container in it (which is what the azure-storage-blob package is including)
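Putting the workaround together with the question's variable names, a minimal sketch:

# Build the container URL manually so no extra query-string parameters
# (such as restype=container) get appended after the SAS token.
container_url = 'https://{}/{}?{}'.format(
    base_blob_service.primary_endpoint,   # e.g. <account>.blob.core.windows.net
    blob_details['container_name'],
    container_sas,
)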

Deploying a Dataflow Pipeline using Python and Apache Beam

I am new to using Apache Beam and Dataflow. I would like to use a data-set as an input for a function that will be deployed in parallel using Dataflow. Here is what I have so far:
import os
import apache_beam as beam
from apache_beam.options.pipeline_options import SetupOptions
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.options.pipeline_options import StandardOptions
from apache_beam.options.pipeline_options import GoogleCloudOptions
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = '[location of json service credentials]'

dataflow_options = ['--project=[PROJECT NAME]',
                    '--job_name=[JOB NAME]',
                    '--temp_location=gs://[BUCKET NAME]/temp',
                    '--staging_location=gs://[BUCKET NAME]/stage']
options = PipelineOptions(dataflow_options)
gcloud_options = options.view_as(GoogleCloudOptions)
options.view_as(StandardOptions).runner = 'dataflow'

with beam.Pipeline(options=options) as p:
    new_p = (p
             | beam.io.ReadFromText(file_pattern='[file location].csv',
                                    skip_header_lines=1)
             | beam.ParDo([Function Name]()))
The CSV file will have 4 columns with n rows. Each row represents an instance and each column represents a parameter of that instance. I would like to pass all of the parameters of an instance into a beam.DoFn so I can run it on multiple machines with the help of Dataflow.
How do I write the function so that it takes multiple arguments from a PCollection? The function below is how I imagine it would go.
class function_name(beam.DoFn):
    def process(self, col_1, col_2, col_3, col_4):
        function = function(col_1) + function(col_2) + function(col_3) + function(col_4)
        return [function]
The materialized return from ReadFromText will be a PCollection of strings, where each string is still comma-delimited.
Your ParDo should take a string element, split it, and yield a dict of column name to value.
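A hedged sketch of that suggestion (the DoFn name and column names are placeholders, since the question doesn't give them):

import apache_beam as beam

class SplitCsvRow(beam.DoFn):  # hypothetical name
    COLUMNS = ['col_1', 'col_2', 'col_3', 'col_4']  # placeholder column names

    def process(self, element):
        values = element.split(',')
        # Yield a dict of column name -> value for downstream transforms.
        yield dict(zip(self.COLUMNS, values))

# Usage: rows = lines | beam.ParDo(SplitCsvRow())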

How to iterate all files in google cloud storage to be used as dataflow input?

Use case
I want to parse multiple files from Cloud storage and insert the results into a BigQuery table.
Selecting one particular file to read works fine. However, I'm struggling when switching from that one file to all files by using the * glob pattern.
I'm executing the job like this:
python batch.py --project foobar --job_name foobar-metrics --runner DataflowRunner --staging_location gs://foobar-staging/dataflow --temp_location gs://foobar-staging/dataflow_temp --output foobar.test
This is the first Dataflow experiment and I'm not sure how to debug it or what best practices there are for a pipeline like this.
Expected outcome
I would expect the job to be uploaded to the Dataflow runner, and the gathering and iteration of the file list to happen in the cloud at run time. I would expect to be able to pass the contents of all files in the same way as I do when reading one file.
Actual outcome
The job already blocks at the point of trying to submit it to the Cloud Dataflow runner.
Contents of batch.py
"""A metric sink workflow."""
from __future__ import absolute_import
import json
import argparse
import logging
import apache_beam as beam
from apache_beam.io import ReadFromText
from apache_beam.io import WriteToText
from apache_beam.metrics import Metrics
from apache_beam.metrics.metric import MetricsFilter
from apache_beam.utils.pipeline_options import PipelineOptions
from apache_beam.utils.pipeline_options import SetupOptions
from apache_beam.utils.pipeline_options import GoogleCloudOptions
class ExtractDatapointsFn(beam.DoFn):
"""
Parse json documents and extract the metrics datapoints.
"""
def __init__(self):
super(ExtractDatapointsFn, self).__init__()
self.total_invalid = Metrics.counter(self.__class__, 'total_invalid')
def process(self, element):
"""
Process json that contains metrics of each element.
Args:
element: the element being processed.
Returns:
unmarshaled json for each metric point.
"""
try:
# Catch parsing errors as well as our custom key check.
document = json.loads(element)
if not "DataPoints" in document:
raise ValueError("missing DataPoints")
except ValueError:
self.total_invalid.inc(1)
return
for point in document["DataPoints"]:
yield point
def run(argv=None):
"""
Main entry point; defines and runs the pipeline.
"""
parser = argparse.ArgumentParser()
parser.add_argument('--input',
dest='input',
default='gs://foobar-sink/*',
help='Input file to process.')
parser.add_argument('--output',
required=True,
help=(
'Output BigQuery table for results specified as: PROJECT:DATASET.TABLE '
'or DATASET.TABLE.'))
known_args, pipeline_args = parser.parse_known_args(argv)
# We use the save_main_session option because one or more DoFn's in this
# workflow rely on global context (e.g., a module imported at module level).
pipeline_options = PipelineOptions(pipeline_args)
pipeline_options.view_as(SetupOptions).save_main_session = True
pipeline_options.view_as(GoogleCloudOptions)
pipe = beam.Pipeline(options=pipeline_options)
# Read the json data and extract the datapoints.
documents = pipe | 'read' >> ReadFromText(known_args.input)
metrics = documents | 'extract datapoints' >> beam.ParDo(ExtractDatapointsFn())
# BigQuery sink table.
_ = metrics | 'write bq' >> beam.io.Write(
beam.io.BigQuerySink(
known_args.output,
schema='Path:STRING, Value:FLOAT, Timestamp:TIMESTAMP',
create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE))
# Actually run the pipeline (all operations above are deferred).
result = pipe.run()
result.wait_until_finish()
total_invalid_filter = MetricsFilter().with_name('total_invalid')
query_result = result.metrics().query(total_invalid_filter)
if query_result['counters']:
total_invalid_counter = query_result['counters'][0]
logging.info('number of invalid documents: %d', total_invalid_counter.committed)
else:
logging.info('no invalid documents were found')
if __name__ == '__main__':
logging.getLogger().setLevel(logging.INFO)
run()
We do size estimation of sources at job submission so that the Dataflow service can use that information when initializing the job (for example, to determine the initial number of workers). To estimate the size of a glob we need to expand the glob. This can take some time (I believe several minutes for GCS) if the glob expands into more than 100k files. We'll look into ways in which we can improve the user experience here.
