The code below builds the pipeline and the DAG is generated, but execution fails with:
RuntimeError: NotImplementedError [while running 'generatedPtransform-438']
Please let me know if there is any direct connector for MySQL in Python for Beam.
from apache_beam.options.pipeline_options import PipelineOptions
from google.cloud import pubsub_v1
from google.cloud import bigquery
import mysql.connector
import apache_beam as beam
import logging
import argparse
import sys
import re
PROJECT="12344"
TOPIC = "projects/12344/topics/mytopic"
class insertfn(beam.DoFn):
    def process(self, data):  # a DoFn must implement process(); any other method name leaves it unimplemented and raises NotImplementedError
        db_conn = mysql.connector.connect(host="localhost", user="abc", passwd="root", database="new")
        db_cursor = db_conn.cursor()
        emp_sql = "INSERT INTO emp(ename,eid,dept) VALUES (%s,%s,%s)"
        db_cursor.execute(emp_sql, (data['ename'], data['eid'], data['dept']))
        db_conn.commit()
        print(db_cursor.rowcount, "record inserted")
class Split(beam.DoFn):
    def process(self, data):
        data = data.split(",")
        return [{
            'ename': data[0],
            'eid': data[1],
            'dept': data[2]
        }]
def main(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--input_topic")
    parser.add_argument("--output")
    known_args, pipeline_args = parser.parse_known_args(argv)
    p = beam.Pipeline(options=PipelineOptions(pipeline_args))
    (p
     | 'ReadData' >> beam.io.ReadFromPubSub(topic=TOPIC).with_output_types(bytes)
     | "Decode" >> beam.Map(lambda x: x.decode('utf-8'))
     | 'ParseCSV' >> beam.ParDo(Split())
     | 'WriteToMySQL' >> beam.ParDo(insertfn())
    )
    result = p.run()
    result.wait_until_finish()

if __name__ == '__main__':
    main()
After our discussion in the comment section, I noticed that you are not using the proper command to execute the Dataflow pipeline.
According to the documentation, there are mandatory flags which must be defined in order to run the pipeline on the Dataflow managed service. These flags are described below:
job_name - The name of the Dataflow job being executed.
project - The ID of your Google Cloud project.
runner - The pipeline runner that will parse your program and construct your pipeline. For cloud execution, this must be DataflowRunner.
staging_location - A Cloud Storage path for Dataflow to stage code packages needed by workers executing the job.
temp_location - A Cloud Storage path for Dataflow to stage temporary job files created during the execution of the pipeline.
In addition to these flags, there are others you can use; in your case, since you read from a Pub/Sub topic:
--input_topic: sets the input Pub/Sub topic to read messages from.
Therefore, an example to run a Dataflow pipeline would be as follows:
python RunPipelineDataflow.py \
    --job_name=jobName \
    --project=$PROJECT_NAME \
    --runner=DataflowRunner \
    --staging_location=gs://$BUCKET_NAME/staging \
    --temp_location=gs://$BUCKET_NAME/temp \
    --input_topic=projects/$PROJECT_NAME/topics/$TOPIC_NAME
I would like to point out the importance of using DataflowRunner: it lets you use the Cloud Dataflow managed service, giving you fully managed execution, autoscaling and dynamic work rebalancing. However, it is also possible to use DirectRunner, which executes your pipeline on your own machine and is designed for validating the pipeline.
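For example, to validate the pipeline locally before deploying it, a minimal sketch (reusing the script name and placeholders from the command above) would be:

python RunPipelineDataflow.py \
    --runner=DirectRunner \
    --input_topic=projects/$PROJECT_NAME/topics/$TOPIC_NAME

Since the unparsed arguments are forwarded to PipelineOptions, the pipeline can be launched the same way on either runner.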
I have set up a DAG that runs a Dataflow job. The DAG triggers it fine and the job runs successfully, yet the output file doesn't appear in the output location. The output location is a bucket in another project, and the service account being used has access to write to that bucket. Any idea why the file is not being generated?
DF Job:
import apache_beam as beam
from apache_beam.options.value_provider import StaticValueProvider
from apache_beam.options.pipeline_options import PipelineOptions
from datetime import datetime
import logging
class UserOptions(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        parser.add_value_provider_argument('--templated_int', type=int)
        parser.add_value_provider_argument('--input', type=str)
        parser.add_value_provider_argument('--output', type=str)
class process_file(beam.DoFn):
    def __init__(self, templated_int):
        self.templated_int = templated_int

    def process(self, an_int):
        yield self.templated_int.get() + an_int
def clean_file():
    pipeline_options = PipelineOptions()
    user_options = pipeline_options.view_as(UserOptions)
    tstmp = datetime.now().strftime("%Y%m%d%H")
    output = user_options.output
    logging.info('Input: %s', user_options.input)
    logging.info('Output: %s', output)
    with beam.Pipeline(options=pipeline_options) as p:
        (p
         | 'Read from a File' >> beam.io.ReadFromText(user_options.input, skip_header_lines=1)
         | 'Split into rows' >> beam.Map(lambda x: x.split(","))
         | 'Confirm index locations' >> beam.Map(lambda x: f'{x[0]},{x[1]}{x[2]}{x[3]}{x[4]},{x[5]}')
         | 'Write to clean file' >> beam.io.WriteToText(output))
        # The with-block runs the pipeline on exit, so an explicit p.run().wait_until_finish() here is redundant
if __name__ == "__main__":
clean_file()
When you select a step in your Dataflow pipeline graph, the logs panel toggles from displaying the Job Logs generated by the Dataflow service to showing the logs from the Compute Engine instances running your pipeline step.
Cloud Logging combines all the collected logs from your project's Compute Engine instances in one location. Additionally, see Logging pipeline messages for more information on using Dataflow's various logging capabilities.
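As a minimal illustration (the class and message below are just placeholders, not part of the original pipeline), log lines emitted from inside a DoFn with the standard logging module run on the workers and therefore show up in that step's worker logs:

import logging
import apache_beam as beam

class LogRow(beam.DoFn):
    def process(self, row):
        # Executed on the Dataflow workers, so this appears in the step's worker logs
        logging.info('Processing row: %s', row)
        yield row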
I would like to run spaCy lemmatization on a column within a ParDo on GCP Dataflow.
My Dataflow project is composed of 3 files: main.py, which contains the script; myfile.json, which contains the service account key; and setup.py, which contains the requirements for the project:
main.py
import apache_beam as beam
from apache_beam.io.gcp.internal.clients import bigquery
from apache_beam.options.pipeline_options import PipelineOptions
import unidecode
import string
import spacy
import os
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "myfile.json"
table_spec = bigquery.TableReference(
    projectId='scrappers-293910',
    datasetId='mydataset',
    tableId='mytable')

options = PipelineOptions(
    job_name="lemmatize-job-offers-description-2",
    project="myproject",
    region="europe-west6",
    temp_location="gs://mygcp/options/temp_location/",
    staging_location="gs://mygcp/options/staging_location/")
nlp = spacy.load("fr_core_news_sm", disable=["tagger", "parser", "attribute_ruler", "ner", "textcat"])
class CleanText(beam.DoFn):
    def process(self, row):
        # Strip accents, lowercase, replace punctuation with spaces and collapse whitespace
        row['descriptioncleaned'] = ' '.join(unidecode.unidecode(str(row['description'])).lower().translate(str.maketrans(string.punctuation, ' ' * len(string.punctuation))).split())
        yield row

class LemmaText(beam.DoFn):
    def process(self, row):
        # Lemmatize the cleaned description, keeping each unique lemma once
        doc = nlp(row['descriptioncleaned'])
        row['descriptionlemmatized'] = ' '.join(list(set([token.lemma_ for token in doc])))
        yield row
with beam.Pipeline(runner="DataflowRunner", options=options) as pipeline:
    soft = (
        pipeline
        | "ReadFromBigQuery" >> beam.io.ReadFromBigQuery(table=table_spec, gcs_location="gs://mygcp/gcs_location")
        | "CleanText" >> beam.ParDo(CleanText())
        | "LemmaText" >> beam.ParDo(LemmaText())
        | 'WriteToBigQuery' >> beam.io.WriteToBigQuery(
              'mybq.path',
              custom_gcs_temp_location="gs://mygcp/gcs_temp_location",
              create_disposition="CREATE_IF_NEEDED",
              write_disposition="WRITE_TRUNCATE")
    )
setup.py
import setuptools

setuptools.setup(
    name='PACKAGE-NAME',
    install_requires=[
        'spacy',
        'unidecode',
        # direct link to the spaCy French model release
        'fr_core_news_lg @ https://github.com/explosion/spacy-models/releases/download/fr_core_news_lg-3.2.0/fr_core_news_lg-3.2.0.tar.gz',
    ],
    packages=setuptools.find_packages()
)
and I send the job to Dataflow with the following command:
python3 main.py --setup_file ./setup.py
Locally it works fine, but as soon as I send it to Dataflow, after a few minutes I get an error.
I searched for the reason and it seems to be caused by the module dependencies.
Is it alright to import the spaCy model like I did? What am I doing wrong?
See https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/.
It seems that you can use a requirements file with the requirements_file pipeline option.
Additionally, if you run into a NameError, see https://cloud.google.com/dataflow/docs/resources/faq#how_do_i_handle_nameerrors.
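For instance, a minimal sketch of that option, assuming a requirements.txt next to main.py (the file name is an assumption; the model itself would still need to be installed, e.g. via setup.py, since requirements_file is meant for packages installable from PyPI):

# requirements.txt
spacy
unidecode

python3 main.py --requirements_file ./requirements.txt --setup_file ./setup.py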
I have been running Dataflow jobs based on a template created back in December that passes some arguments at runtime, without any issues.
I have had to make some changes to the template now and I seem to be having issues generating a working template, even when using the same code/versions of beam as before.
My jobs just hang indefinitely; I tried leaving one and it timed out after an hour or so.
There's certainly an issue, as even my first step, which just creates an empty PCollection, doesn't succeed; it just says Running.
I have abstracted the hell out of the function to work out what the issue might be, since there are no errors or oddities in the logs.
Sharing the very slimmed-down pipeline below; as soon as I comment out the 2nd and 3rd steps in the pipeline, which use the value provider arguments, the job succeeds (at creating an empty PCollection).
My use of the 'add_value_provider_argument' follows pretty closely the official snippet here: https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/snippets.py#L554
and
https://cloud.google.com/dataflow/docs/guides/templates/creating-templates#using-valueprovider-in-your-functions
I borrowed it from Pablo here: https://stackoverflow.com/a/58327762/5687904
I even tried building a completely fresh environment in a new VM, thinking that maybe my environment had something corrupting the template without causing the build to fail.
I've tried Dataflow SDK 2.15.0, which is what the original template used, as well as 2.24.0 (the most recent one).
Would really appreciate any ideas around debugging this as I'm starting to despair.
import logging
import pandas as pd
import argparse
import datetime
#================ Apache beam ======================
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.options.pipeline_options import StandardOptions
from apache_beam.options.pipeline_options import GoogleCloudOptions
from apache_beam.options.pipeline_options import SetupOptions
from apache_beam.options.pipeline_options import WorkerOptions
from apache_beam.options.pipeline_options import DebugOptions
from apache_beam.io import ReadFromText
from apache_beam.io import WriteToText
from apache_beam.io import fileio
import io
#======================
PROJECT_ID = 'my-project'
GCS_STAGING_LOCATION = 'gs://my-bucket/gcs_staging_location/'
GCS_TMP_LOCATION = 'gs://my-bucket/gcs_tmp_location/'
#======================================
# https://cloud.google.com/dataflow/docs/guides/templates/creating-templates#valueprovider
class FileIterator(beam.DoFn):
    def __init__(self, files_bucket):
        self.files_bucket = files_bucket

    def process(self, element):
        files = pd.read_csv(str(element), header=None).values[0].tolist()
        bucket = self.files_bucket.get()
        files = [str(bucket) + '/' + file for file in files]
        logging.info('Files list is: {}'.format(files))
        return files
#=========================================================
# https://stackoverflow.com/questions/58240058/ways-of-using-value-provider-parameter-in-python-apache-beam
class OutputValueProviderFn(beam.DoFn):
    def __init__(self, vp):
        self.vp = vp

    def process(self, unused_elm):
        yield self.vp.get()
#=========================================================
class RuntimeOptions(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        parser.add_value_provider_argument(
            '--files_bucket',
            help='Bucket where the raw files are',
            type=str)
        parser.add_value_provider_argument(
            '--complete_batch',
            help='Location of the text file with the filenames in it',
            type=str)
        parser.add_value_provider_argument(
            '--comp_table',
            required=False,
            help='BQ table to write to (dataset.table)',
            type=str)
#=========================================================
def run():
    #====================================
    # TODO PUT AS PARAMETERS
    #====================================
    dt_now = datetime.datetime.now().strftime('%Y%m%d')
    job_name = 'dataflow-test-{}'.format(dt_now)

    pipeline_options_batch = PipelineOptions()
    runtime_options = pipeline_options_batch.view_as(RuntimeOptions)
    setup_options = pipeline_options_batch.view_as(SetupOptions)
    setup_options.setup_file = './setup.py'
    google_cloud_options = pipeline_options_batch.view_as(GoogleCloudOptions)
    google_cloud_options.project = PROJECT_ID
    google_cloud_options.staging_location = GCS_STAGING_LOCATION
    google_cloud_options.temp_location = GCS_TMP_LOCATION
    pipeline_options_batch.view_as(StandardOptions).runner = 'DataflowRunner'
    pipeline_options_batch.view_as(WorkerOptions).autoscaling_algorithm = 'THROUGHPUT_BASED'
    pipeline_options_batch.view_as(WorkerOptions).max_num_workers = 10
    pipeline_options_batch.view_as(SetupOptions).save_main_session = True
    pipeline_options_batch.view_as(DebugOptions).experiments = ['use_beam_bq_sink']

    with beam.Pipeline(options=pipeline_options_batch) as pipeline_2:
        try:
            final_data = (
                pipeline_2
                | 'Create empty PCollection' >> beam.Create([None])
                | 'Get accepted batch file' >> beam.ParDo(OutputValueProviderFn(runtime_options.complete_batch))
                # | 'Read all filenames into a list' >> beam.ParDo(FileIterator(runtime_options.files_bucket))
            )
        except Exception as exception:
            logging.error(exception)
            pass

#=========================================================
if __name__ == "__main__":
    run()
It seems that when you created the template, the Apache Beam SDK you used was forward-compatible with the package versions in your setup.py file, so it worked fine; however, after the update, the SDK version may no longer be forward-compatible with the same versions listed in setup.py.
Based on this documentation, the Apache Beam SDK and the Dataflow workers must have forward-compatible libraries to avoid version collisions that can result in unexpected behavior in the service.
To find the required package versions for each Apache Beam SDK release, take a look at this page.
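For example, a minimal sketch of pinning versions in the template's setup.py so that the launch environment and the Dataflow workers resolve the same releases (the version numbers below are placeholders, not recommendations):

import setuptools

setuptools.setup(
    name='dataflow-template',  # placeholder package name
    version='0.0.1',
    install_requires=[
        # pin the SDK to the version used to build the template, and pin extra
        # libraries to versions listed as compatible with that SDK release
        'apache-beam[gcp]==2.24.0',
    ],
    packages=setuptools.find_packages(),
)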
I am trying to write a Python script to stream data from my Google Cloud Storage bucket to BigQuery with the help of a Dataflow pipeline. I am able to start a job, but that job runs as a batch job and not a streaming one, and we are not allowed to use Pub/Sub.
Below is the code I am trying, with the details made generic:
from __future__ import absolute_import
import argparse
import re
import logging
import apache_beam as beam
import json
from past.builtins import unicode
from apache_beam.io import ReadFromText
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.options.pipeline_options import SetupOptions
# This class has all the functions which facilitate data transposition
class WordExtractingDoFn(beam.DoFn):
    def process(self, element):
        # Create a BigQuery row: parse the JSON line into a dict matching the schema
        yield json.loads(element)

def run_bq(argv=None):
    parser = argparse.ArgumentParser()
    schema1 = 'your schema'  # e.g. 'field1:STRING,field2:INTEGER'
    # All command-line arguments being added to the parser
    parser.add_argument(
        '--input', dest='input', required=False,
        default='gs://your-bucket-path/')
    parser.add_argument('--output', dest='output', required=False,
                        default='yourdataset.yourtable')
    known_args, pipeline_args = parser.parse_known_args(argv)
    pipeline_args.extend([
        '--runner=DataflowRunner',
        '--project=your-project',
        '--staging_location=gs://your-staging-bucket-path/',
        '--temp_location=gs://your-temp-bucket-path/',
        '--job_name=pubsubbql1',
        '--streaming'
    ])
    pushtobq = WordExtractingDoFn()
    # Pipeline creation begins
    p = beam.Pipeline(options=PipelineOptions(pipeline_args))
    (p
     | 'Read from a File' >> beam.io.ReadFromText(known_args.input)
     | 'String To BigQuery Row' >> beam.ParDo(pushtobq)
     | 'Write to BigQuery' >> beam.io.WriteToBigQuery(
           known_args.output,
           schema=schema1)
    )
    # Run pipeline
    p.run().wait_until_finish()

# Main method to call
if __name__ == '__main__':
    logging.getLogger().setLevel(logging.INFO)
    run_bq()
With the above code I am able to create jobs, but they are batch jobs; my main goal is to take the data from the buckets, which is in JSON format, and insert it into BigQuery.
I've been trying to get a Dataflow job to work all day, without success. The worker just loads the job into Dataflow and does nothing for an hour.
Everything runs as expected locally. The process is:
Data from BQ Source -> Some data manipulation -> Writing TF Records
I think something goes wrong when reading data from BQ:
Job Type State Start Time Duration User Email Bytes Processed Bytes Billed Billing Tier Labels
---------- --------- ----------------- ---------- ---------------------------------------------------- ----------------- -------------- -------------- --------
extract SUCCESS 08 Nov 11:06:10 0:00:02 27xxxxxxx7565-compute#developer.gserviceaccount.com
Looks like nothing has been processed.
Basic Pipeline:
import apache_beam as beam
import datetime
import tensorflow_transform.beam.impl as beam_impl
from apache_beam.options.pipeline_options import PipelineOptions, GoogleCloudOptions, StandardOptions
@beam.ptransform_fn
def ReadDataFromBQ(pcoll, project, dataset, table):
    bq = beam.io.BigQuerySource(dataset=dataset,
                                table=table,
                                project=project,
                                validate=True,
                                use_standard_sql=True)
    return pcoll | "ReadFromBQ" >> beam.io.Read(bq)

# options / google_cloud_options are defined elsewhere (project, region, temp_location, runner, ...)
with beam.Pipeline(options=options) as pipeline:
    with beam_impl.Context(temp_dir=google_cloud_options.temp_location):
        train_data = pipeline | 'LoadTrainData' >> ReadDataFromBQ(dataset='d_name',
                                                                  project='project-name',
                                                                  table='table_name')
It still doesn't work.
I'm using version 2.7.0 of the SDK:
import apache_beam as beam
beam.__version__
'2.7.0' # local
My setup.py file is:
import setuptools
from setuptools import find_packages

REQUIRES = ['tensorflow_transform']

setuptools.setup(
    name='Beam',
    version='0.0.1',
    install_requires=REQUIRES,
    packages=find_packages(),
)
Workflow failed. Causes: The Dataflow job appears to be stuck because
no worker activity has been seen in the last 1h. You can get help with
Cloud Dataflow at https://cloud.google.com/dataflow/support.
Job_id: 2018-11-07_12_27_39-17873629436928290134 for full pipeline.
Job_id: 2018-11-08_04_30_38-16805982576734763423 for reduced pipeline (Just Read BQ and Write Txt to GCS)
Prior to this, everything seemed to be working correctly:
2018-11-07 (20:29:44) BigQuery export job "dataflow_job_5858975562210600855" started. You can check its status with the bq...
2018-11-07 (20:29:44) BigQuery export job "dataflow_job_5509154328514239323" started. You can check its status with the bq...
2018-11-07 (20:30:15) BigQuery export job finished: "dataflow_job_5858975562210600855"
2018-11-07 (20:30:15) BigQuery export job finished: "dataflow_job_5509154328514239323"
2018-11-07 (21:30:15) Workflow failed. Causes: The Dataflow job appears to be stuck because no worker activity has been se...