I am new to Apache Beam and Dataflow. I would like to use a dataset as the input for a function that will be run in parallel using Dataflow. Here is what I have so far:
import os
import apache_beam as beam
from apache_beam.options.pipeline_options import SetupOptions
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.options.pipeline_options import StandardOptions
from apache_beam.options.pipeline_options import GoogleCloudOptions
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = '[location of json service credentails]'
dataflow_options = ['--project=[PROJECT NAME]',
                    '--job_name=[JOB NAME]',
                    '--temp_location=gs://[BUCKET NAME]/temp',
                    '--staging_location=gs://[BUCKET NAME]/stage']
options = PipelineOptions(dataflow_options)
gcloud_options = options.view_as(GoogleCloudOptions)
options.view_as(StandardOptions).runner = 'DataflowRunner'
with beam.Pipeline(options=options) as p:
    new_p = (p
             | beam.io.ReadFromText(file_pattern='[file location].csv',
                                    skip_header_lines=1)
             | beam.ParDo([Function Name]()))
The CSV file will have 4 columns and n rows. Each row represents an instance and each column represents a parameter of that instance. I would like to pass all of the parameters of an instance into a beam.DoFn so I can run it on multiple machines with the help of Dataflow.
How do I write the function so that it takes multiple arguments from a PCollection? The function below is how I imagine it would go.
class function_name(beam.DoFn):
    def process(self, col_1, col_2, col_3, col_4):
        function = function(col_1) + function(col_2) + function(col_3) + function(col_4)
        return [function]
The materialized return from ReadFromText will be a PCollection of strings, one per line of the file, with each line still comma-delimited.
Your ParDo should take a string element, split it on the delimiter, and yield the result, for example as a dict of column name to value.
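For example, a minimal sketch of that pattern (the column names and the summing computation are assumptions for illustration, not from your post):
import apache_beam as beam

class SplitCsvLine(beam.DoFn):
    # Hypothetical column names; replace them with your CSV's real header.
    COLUMNS = ['col_1', 'col_2', 'col_3', 'col_4']

    def process(self, element):
        # element is one line of the file, still comma-delimited.
        values = element.split(',')
        yield dict(zip(self.COLUMNS, values))

class ProcessRow(beam.DoFn):
    def process(self, row):
        # row is the dict yielded above; apply your per-column function here.
        yield sum(float(row[c]) for c in ('col_1', 'col_2', 'col_3', 'col_4'))

# In the pipeline:
# new_p = (p
#          | beam.io.ReadFromText('[file location].csv', skip_header_lines=1)
#          | beam.ParDo(SplitCsvLine())
#          | beam.ParDo(ProcessRow()))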
I'm trying to run a beam script in python on GCP following this tutorial:
https://levelup.gitconnected.com/scaling-scikit-learn-with-apache-beam-251eb6fcf75b
but I keep getting the following error:
AttributeError: module 'google.cloud' has no attribute 'storage'
I have google-cloud-storage in my requirements.txt so really not sure what I'm missing here.
My full script:
import apache_beam as beam
import json
query = """
SELECT
year,
plurality,
apgar_5min,
mother_age,
father_age,
gestation_weeks,
ever_born,
case when mother_married = true then 1 else 0 end as mother_married,
weight_pounds as weight,
current_timestamp as time,
GENERATE_UUID() as guid
FROM `bigquery-public-data.samples.natality`
order by rand()
limit 100
"""
class ApplyDoFn(beam.DoFn):
    def __init__(self):
        self._model = None
        from google.cloud import storage
        import pandas as pd
        import pickle as pkl
        self._storage = storage
        self._pkl = pkl
        self._pd = pd

    def process(self, element):
        if self._model is None:
            bucket = self._storage.Client().get_bucket('bqr_dump')
            blob = bucket.get_blob('natality/sklearn-linear')
            self._model = self._pkl.loads(blob.download_as_string())
        new_x = self._pd.DataFrame.from_dict(element,
                                             orient='index').transpose().fillna(0)
        pred_weight = self._model.predict(new_x.iloc[:, 1:8])[0]
        return [{'guid': element['guid'],
                 'predicted_weight': pred_weight,
                 'time': str(element['time'])}]
# set up pipeline options
options = {'project': 'my-project-name',
           'runner': 'DataflowRunner',
           'temp_location': 'gs://bqr_dump/tmp',
           'staging_location': 'gs://bqr_dump/tmp'
           }
pipeline_options = beam.pipeline.PipelineOptions(flags=[], **options)
with beam.Pipeline(options=pipeline_options) as pipeline:
    (
        pipeline
        | 'ReadTable' >> beam.io.Read(beam.io.BigQuerySource(
            query=query,
            use_standard_sql=True))
        | 'Apply Model' >> beam.ParDo(ApplyDoFn())
        | 'Save to BigQuery' >> beam.io.WriteToBigQuery(
            'pzn-pi-sto:beam_test.weight_preds',
            schema='guid:STRING,weight:FLOAT64,time:STRING',
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED))
and my requirements.txt:
google-cloud==0.34.0
google-cloud-storage==1.30.0
apache-beam[GCP]==2.20.0
This issue is usually related to one of two causes: either the module was not installed correctly (something broke during installation), or the module is not being imported correctly.
If broken modules are the cause, reinstalling them, or checking them in a virtual environment, should be the solution. As indicated here, in a similar case to yours, this should fix it.
For the second cause, try changing your code so that all the modules are imported at the beginning of the file, as demonstrated in this official example here. Your code should look something like this:
import apache_beam as beam
import json
import pandas as pd
import pickle as pkl
from google.cloud import storage
...
Let me know if this information helped you!
Make sure you have installed the correct version. The modules Google maintains receive constant updates, and a plain pip install of the required package will install its latest version.
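As a quick sanity check (a sketch, not part of the original answer), you can confirm that the package is importable and which version is actually installed in the environment running your pipeline:
# Should not raise ImportError/AttributeError if google-cloud-storage is installed correctly.
from google.cloud import storage
import pkg_resources

print(pkg_resources.get_distribution('google-cloud-storage').version)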
My Dataflow job (Job ID: 2020-08-18_07_55_15-14428306650890914471) is not scaling past 1 worker, despite Dataflow setting the target workers to 1000.
The job is configured to query the Google Patents BigQuery dataset, tokenize the text using a ParDo custom function and the transformers (huggingface) library, serialize the result, and write everything to a giant parquet file.
I had assumed (after running the job yesterday, which mapped a function instead of using a beam.DoFn class) that the issue was some non-parallelizable object preventing scaling; hence the refactoring of the tokenization process into a class.
Here's the script, which is run from the command line with the following command:
python bq_to_parquet_pipeline_w_class.py --extra_package transformers-3.0.2.tar.gz
The script:
import os
import re
import argparse
import google.auth
import apache_beam as beam
from apache_beam.options import pipeline_options
from apache_beam.options.pipeline_options import GoogleCloudOptions
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.options.pipeline_options import SetupOptions
from apache_beam.runners import DataflowRunner
from apache_beam.io.gcp.internal.clients import bigquery
import pyarrow as pa
import pickle
from transformers import AutoTokenizer
print('Defining TokDoFn')
class TokDoFn(beam.DoFn):
    def __init__(self, tok_version, block_size=200):
        self.tok = AutoTokenizer.from_pretrained(tok_version)
        self.block_size = block_size

    def process(self, x):
        txt = x['abs_text'] + ' ' + x['desc_text'] + ' ' + x['claims_text']
        enc = self.tok.encode(txt)
        for idx, token in enumerate(enc):
            chunk = enc[idx:idx + self.block_size]
            serialized = pickle.dumps(chunk)
            yield serialized
def run(argv=None, save_main_session=True):
    query_big = '''
    with data as (
      SELECT
        (select text from unnest(abstract_localized) limit 1) abs_text,
        (select text from unnest(description_localized) limit 1) desc_text,
        (select text from unnest(claims_localized) limit 1) claims_text,
        publication_date,
        filing_date,
        grant_date,
        application_kind,
        ipc
      FROM `patents-public-data.patents.publications`
    )
    select *
    FROM data
    WHERE
      abs_text is not null
      AND desc_text is not null
      AND claims_text is not null
      AND ipc is not null
    '''

    query_sample = '''
    SELECT *
    FROM `client_name.patent_data.patent_samples`
    LIMIT 2;
    '''

    print('Start Run()')
    parser = argparse.ArgumentParser()
    known_args, pipeline_args = parser.parse_known_args(argv)

    '''
    Configure Options
    '''
    # Setting up the Apache Beam pipeline options.
    # We use the save_main_session option because one or more DoFn's in this
    # workflow rely on global context (e.g., a module imported at module level).
    options = PipelineOptions(pipeline_args)
    options.view_as(SetupOptions).save_main_session = save_main_session

    # Sets the project to the default project in your current Google Cloud environment.
    _, options.view_as(GoogleCloudOptions).project = google.auth.default()

    # Sets the Google Cloud Region in which Cloud Dataflow runs.
    options.view_as(GoogleCloudOptions).region = 'us-central1'

    # IMPORTANT! Adjust the following to choose a Cloud Storage location.
    dataflow_gcs_location = 'gs://client_name/dataset_cleaned_pq_classTok'

    # Dataflow Staging Location. This location is used to stage the Dataflow Pipeline and SDK binary.
    options.view_as(GoogleCloudOptions).staging_location = f'{dataflow_gcs_location}/staging'

    # Dataflow Temp Location. This location is used to store temporary files or intermediate results before finally outputting to the sink.
    options.view_as(GoogleCloudOptions).temp_location = f'{dataflow_gcs_location}/temp'

    # The directory to store the output files of the job.
    output_gcs_location = f'{dataflow_gcs_location}/output'

    print('Options configured per GCP Notebook Examples')
    print('Configuring BQ Table Schema for Beam')

    # Write Schema (to PQ):
    schema = pa.schema([
        ('block', pa.binary())
    ])

    print('Starting pipeline...')
    with beam.Pipeline(runner=DataflowRunner(), options=options) as p:
        res = (p
               | 'QueryTable' >> beam.io.Read(beam.io.BigQuerySource(query=query_big, use_standard_sql=True))
               | beam.ParDo(TokDoFn(tok_version='gpt2', block_size=200))
               | beam.Map(lambda x: {'block': x})
               | beam.io.WriteToParquet(os.path.join(output_gcs_location, f'pq_out'),
                                        schema,
                                        record_batch_size=1000)
               )
    print('Pipeline built. Running...')


if __name__ == '__main__':
    import logging
    logging.getLogger().setLevel(logging.INFO)
    logging.getLogger("transformers.tokenization_utils_base").setLevel(logging.ERROR)
    run()
The solution is twofold:
The following quotas were being exceeded when I ran my job, all under 'Compute Engine API' (view your quotas here: https://console.cloud.google.com/iam-admin/quotas):
CPUs (I requested an increase to 50)
Persistent Disk Standard (GB) (I requested an increase to 12,500)
In_Use_IP_Address (I requested an increase to 50)
Note: If you read the console output while your job is running, any exceeded quotas should print out as an INFO line.
Following Peter Kim's advice above, I passed the flag --max_num_workers as part of my command:
python bq_to_parquet_pipeline_w_class.py --extra_package transformers-3.0.2.tar.gz --max_num_workers 22
And I started scaling!
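If you prefer to set that cap in code rather than on the command line, a sketch using Beam's WorkerOptions (mirroring the options setup in the script above; 22 is just the value used here) would be:
from apache_beam.options.pipeline_options import PipelineOptions, WorkerOptions

options = PipelineOptions()  # or PipelineOptions(pipeline_args) as in the script above
# Keep autoscaling below your Compute Engine quotas.
options.view_as(WorkerOptions).max_num_workers = 22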
All in all, it would be nice if there were a way to prompt users via the Dataflow console when a quota is being reached, and to provide an easy means of requesting an increase to that quota (and recommended complementary quotas), along with a suggestion for the amount to request.
I'm trying to create a streaming pipeline with Dataflow that reads messages from a PubSub topic to end up writing them on a BigQuery table. I don't want to use any Dataflow template.
For the moment I just want to create a pipeline in a Python 3 script, executed from a Google VM instance, that loads and transforms every message arriving from Pub/Sub (parsing the records it contains and adding a new field) and finally writes the results to a BigQuery table.
Simplifying, my code would be:
#!/usr/bin/env python
from apache_beam.options.pipeline_options import PipelineOptions
from google.cloud import pubsub_v1
import apache_beam as beam
import apache_beam.io.gcp.bigquery
import logging
import argparse
import sys
import json
from datetime import datetime, timedelta
def load_pubsub(message):
    try:
        data = json.loads(message)
        records = data["messages"]
        return records
    except:
        raise ImportError("Something went wrong reading data from the Pub/Sub topic")


class ParseTransformPubSub(beam.DoFn):
    def __init__(self):
        self.water_mark = (datetime.now() + timedelta(hours=1)).strftime("%Y-%m-%d %H:%M:%S.%f")

    def process(self, records):
        for record in records:
            record["E"] = self.water_mark
            yield record
def main():
    table_schema = apache_beam.io.gcp.bigquery.parse_table_schema_from_json(open("TableSchema.json").read())

    parser = argparse.ArgumentParser()
    parser.add_argument('--input_topic')
    parser.add_argument('--output_table')
    known_args, pipeline_args = parser.parse_known_args(sys.argv)

    with beam.Pipeline(argv=pipeline_args) as p:
        pipe = (p
                | 'ReadDataFromPubSub' >> beam.io.ReadStringsFromPubSub(known_args.input_topic)
                | 'LoadJSON' >> beam.Map(load_pubsub)
                | 'ParseTransform' >> beam.ParDo(ParseTransformPubSub())
                | 'WriteToAvailabilityTable' >> beam.io.WriteToBigQuery(
                    table=known_args.output_table,
                    schema=table_schema,
                    create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
                    write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)
                )


if __name__ == '__main__':
    logging.getLogger().setLevel(logging.INFO)
    main()
(For example) The messages published to the Pub/Sub topic usually come as follows:
'{"messages":[{"A":"Alpha", "B":"V1", "C":3, "D":12},{"A":"Alpha", "B":"V1", "C":5, "D":14},{"A":"Alpha", "B":"V1", "C":3, "D":22}]}'
Once the field "E" is added to the record, the structure of the record (a Python dictionary) and the data types of its fields are what the BigQuery table expects.
The problems I want to handle are:
If some messages come with an unexpected structure, I want to fork the pipeline and write them to another BigQuery table.
If some messages come with an unexpected data type for a field, then an error will occur at the last stage of the pipeline, when they are written to the table. I want to manage this type of error by diverting the record to a third table.
I read the documentation found on the following pages but I found nothing:
https://cloud.google.com/dataflow/docs/guides/troubleshooting-your-pipeline
https://cloud.google.com/dataflow/docs/guides/common-errors
By the way, if I choose the option of configuring the pipeline through the template that reads from a Pub/Sub subscription and writes into BigQuery, I get the following schema, which turns out to be the same one I'm looking for:
Template: Cloud Pub/Sub Subscription to BigQuery
You can't catch the errors that occur in the BigQuery sink; the messages that you write into BigQuery must already be well formed.
The best pattern is to perform a transform that checks your message structure and field types. In case of error, you create an error flow and write this flow somewhere else, for example to a file, or to a table without a schema where you write your message as plain text.
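One way to build that error flow in Beam is with tagged outputs; here is a minimal sketch (the validation rule and the destinations are assumptions):
import json
import apache_beam as beam
from apache_beam import pvalue

EXPECTED_FIELDS = {'A', 'B', 'C', 'D'}  # hypothetical expected keys

class ValidateRecord(beam.DoFn):
    def process(self, record):
        if isinstance(record, dict) and EXPECTED_FIELDS.issubset(record):
            yield record  # main output: records that look valid
        else:
            # Divert everything else to the 'invalid' output as plain text.
            yield pvalue.TaggedOutput('invalid', json.dumps(record))

# parsed_records is the PCollection coming out of the ParseTransform step above.
results = (parsed_records
           | 'Validate' >> beam.ParDo(ValidateRecord()).with_outputs('invalid', main='valid'))

# results.valid   -> beam.io.WriteToBigQuery(<main table>, ...)
# results.invalid -> beam.io.WriteToText(...) or a table without a schema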
We do the following when errors occur at the BigQuery sink:
send a message (without the stack trace) to GCP Error Reporting, so that developers are notified
log the error to Stackdriver (see the sketch below)
stop the pipeline execution (the best place for messages to wait until a developer has fixed the issue is the incoming Pub/Sub subscription)
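A minimal sketch of the logging step only (the validate() helper is hypothetical, and this sketch logs and drops the bad record rather than stopping the pipeline, so adapt it to the strategy above):
import logging
import apache_beam as beam

class LogInvalid(beam.DoFn):
    def process(self, record):
        try:
            validate(record)  # hypothetical validation helper
            yield record
        except Exception:
            # On Dataflow, worker logs are forwarded to Stackdriver / Cloud Logging.
            logging.exception('Invalid record: %s', record)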
I'm using Cloud Composer as my Airflow environment. When I try to use Jinja in my HQL code, it does not translate it correctly.
I know that the HiveOperator has a Jinja translator, as I'm used to using it, but the DataProcHiveOperator doesn't.
I've tried to use HiveConf variables directly in my HQL files, but when setting those values on my partition (i.e. INSERT INTO TABLE abc PARTITION (ds = ${hiveconf:ds})), it doesn't work.
I have also added the following to my HQL file:
SET ds=to_date(current_timestamp());
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
But it didn't work, as Hive transforms the formula above into a STRING.
So my idea was to combine both operators to get the Jinja translator working, but when I do that, I get the following error: ERROR - submit() takes from 3 to 4 positional arguments but 5 were given.
I'm not very familiar with Python, so any help would be great. See the code below for the operator I'm trying to build;
Header of the Python File (please note that the file contains other Operators not mentioned in this question):
import ntpath
import os
import re
import time
import uuid
from datetime import timedelta
from airflow.contrib.hooks.gcp_dataproc_hook import DataProcHook
from airflow.contrib.hooks.gcs_hook import GoogleCloudStorageHook
from airflow.exceptions import AirflowException
from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults
from airflow.version import version
from googleapiclient.errors import HttpError
from airflow.utils import timezone
from airflow.utils.operator_helpers import context_to_airflow_vars
Modified DataProcHiveOperator:
class DataProcHiveOperator(BaseOperator):
    template_fields = ['query', 'variables', 'job_name', 'cluster_name', 'dataproc_jars']
    template_ext = ('.q',)
    ui_color = '#0273d4'

    @apply_defaults
    def __init__(
            self,
            query=None,
            query_uri=None,
            hiveconfs=None,
            hiveconf_jinja_translate=False,
            variables=None,
            job_name='{{task.task_id}}_{{ds_nodash}}',
            cluster_name='cluster-1',
            dataproc_hive_properties=None,
            dataproc_hive_jars=None,
            gcp_conn_id='google_cloud_default',
            delegate_to=None,
            region='global',
            job_error_states=['ERROR'],
            *args,
            **kwargs):
        super(DataProcHiveOperator, self).__init__(*args, **kwargs)
        self.gcp_conn_id = gcp_conn_id
        self.delegate_to = delegate_to
        self.query = query
        self.query_uri = query_uri
        self.hiveconfs = hiveconfs or {}
        self.hiveconf_jinja_translate = hiveconf_jinja_translate
        self.variables = variables
        self.job_name = job_name
        self.cluster_name = cluster_name
        self.dataproc_properties = dataproc_hive_properties
        self.dataproc_jars = dataproc_hive_jars
        self.region = region
        self.job_error_states = job_error_states

    def prepare_template(self):
        if self.hiveconf_jinja_translate:
            self.query_uri = re.sub(
                "(\$\{(hiveconf:)?([ a-zA-Z0-9_]*)\})", "{{ \g<3> }}", self.query_uri)

    def execute(self, context):
        hook = DataProcHook(gcp_conn_id=self.gcp_conn_id,
                            delegate_to=self.delegate_to)
        job = hook.create_job_template(self.task_id, self.cluster_name, "hiveJob",
                                       self.dataproc_properties)

        if self.query is None:
            job.add_query_uri(self.query_uri)
        else:
            job.add_query(self.query)

        if self.hiveconf_jinja_translate:
            self.hiveconfs = context_to_airflow_vars(context)
        else:
            self.hiveconfs.update(context_to_airflow_vars(context))

        job.add_variables(self.variables)
        job.add_jar_file_uris(self.dataproc_jars)
        job.set_job_name(self.job_name)

        job_to_submit = job.build()
        self.dataproc_job_id = job_to_submit["job"]["reference"]["jobId"]

        hook.submit(hook.project_id, job_to_submit, self.region, self.job_error_states)
I would like to be able to use Jinja templating inside my HQL code to allow partition automation on my data pipeline.
P.S.: I'll use the Jinja templating mostly for the partition datestamp.
Does anyone know what the error message I'm getting means, and how to solve it?
ERROR - submit() takes from 3 to 4 positional arguments but 5 were given
Thank you!
It is because of the 5th argument job_error_states which is only in master and not in the current stable release (1.10.1).
Source Code for 1.10.1 -> https://github.com/apache/incubator-airflow/blob/76a5fc4d2eb3c214ca25406f03b4a0c5d7250f71/airflow/contrib/hooks/gcp_dataproc_hook.py#L219
So remove that parameter and it should work.
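In other words, with Airflow 1.10.1 the final call in execute() would look like this (a sketch of the one-line change):
# DataProcHook.submit() in 1.10.1 takes project_id, job and an optional region,
# so drop the job_error_states argument.
hook.submit(hook.project_id, job_to_submit, self.region)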
Use case
I want to parse multiple files from Cloud storage and insert the results into a BigQuery table.
Selecting one particular file to read works fine. However, I'm struggling when I switch out that one file for all files using the * glob pattern.
I'm executing the job like this:
python batch.py --project foobar --job_name foobar-metrics --runner DataflowRunner --staging_location gs://foobar-staging/dataflow --temp_location gs://foobar-staging/dataflow_temp --output foobar.test
This is my first Dataflow experiment, and I'm not sure how to debug it or what best practices there are for a pipeline like this.
Expected outcome
I would expect the job to be uploaded to the Dataflow runner, and for gathering the list of files and iterating over each of them to happen in the cloud at run time. I would expect to be able to pass the contents of all the files in the same way as I do when reading a single file.
Actual outcome
The job already blocks at the point of trying to submit it to the Cloud Dataflow runner.
Contents of batch.py
"""A metric sink workflow."""
from __future__ import absolute_import
import json
import argparse
import logging
import apache_beam as beam
from apache_beam.io import ReadFromText
from apache_beam.io import WriteToText
from apache_beam.metrics import Metrics
from apache_beam.metrics.metric import MetricsFilter
from apache_beam.utils.pipeline_options import PipelineOptions
from apache_beam.utils.pipeline_options import SetupOptions
from apache_beam.utils.pipeline_options import GoogleCloudOptions
class ExtractDatapointsFn(beam.DoFn):
    """
    Parse json documents and extract the metrics datapoints.
    """
    def __init__(self):
        super(ExtractDatapointsFn, self).__init__()
        self.total_invalid = Metrics.counter(self.__class__, 'total_invalid')

    def process(self, element):
        """
        Process json that contains metrics of each element.

        Args:
            element: the element being processed.

        Returns:
            unmarshaled json for each metric point.
        """
        try:
            # Catch parsing errors as well as our custom key check.
            document = json.loads(element)
            if not "DataPoints" in document:
                raise ValueError("missing DataPoints")
        except ValueError:
            self.total_invalid.inc(1)
            return

        for point in document["DataPoints"]:
            yield point
def run(argv=None):
    """
    Main entry point; defines and runs the pipeline.
    """
    parser = argparse.ArgumentParser()
    parser.add_argument('--input',
                        dest='input',
                        default='gs://foobar-sink/*',
                        help='Input file to process.')
    parser.add_argument('--output',
                        required=True,
                        help=(
                            'Output BigQuery table for results specified as: PROJECT:DATASET.TABLE '
                            'or DATASET.TABLE.'))
    known_args, pipeline_args = parser.parse_known_args(argv)

    # We use the save_main_session option because one or more DoFn's in this
    # workflow rely on global context (e.g., a module imported at module level).
    pipeline_options = PipelineOptions(pipeline_args)
    pipeline_options.view_as(SetupOptions).save_main_session = True
    pipeline_options.view_as(GoogleCloudOptions)
    pipe = beam.Pipeline(options=pipeline_options)

    # Read the json data and extract the datapoints.
    documents = pipe | 'read' >> ReadFromText(known_args.input)
    metrics = documents | 'extract datapoints' >> beam.ParDo(ExtractDatapointsFn())

    # BigQuery sink table.
    _ = metrics | 'write bq' >> beam.io.Write(
        beam.io.BigQuerySink(
            known_args.output,
            schema='Path:STRING, Value:FLOAT, Timestamp:TIMESTAMP',
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
            write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE))

    # Actually run the pipeline (all operations above are deferred).
    result = pipe.run()
    result.wait_until_finish()

    total_invalid_filter = MetricsFilter().with_name('total_invalid')
    query_result = result.metrics().query(total_invalid_filter)
    if query_result['counters']:
        total_invalid_counter = query_result['counters'][0]
        logging.info('number of invalid documents: %d', total_invalid_counter.committed)
    else:
        logging.info('no invalid documents were found')


if __name__ == '__main__':
    logging.getLogger().setLevel(logging.INFO)
    run()
We do size estimation of sources at job submission so that the Dataflow service can use that information when initializing the job (for example, to determine the initial number of workers). To estimate the size of a glob we need to expand the glob. This can take some time (I believe several minutes for GCS) if the glob expands into more than 100k files. We'll look into ways in which we can improve the user experience here.
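If you want to check up front whether you are hitting that slow path, a rough diagnostic (a sketch; FileSystems.match is available in recent SDK versions) is to expand the glob yourself and time it before submitting the job:
import time
from apache_beam.io.filesystems import FileSystems

# Expand the same glob the pipeline will use and see how many files it matches.
start = time.time()
match_result = FileSystems.match(['gs://foobar-sink/*'])[0]
print('glob matched %d files in %.1fs'
      % (len(match_result.metadata_list), time.time() - start))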