I have been running Dataflow jobs based on a template created back in December that passes some arguments at runtime, without any issues.
I have had to make some changes to the template now, and I seem to be having issues generating a working template, even when using the same code/versions of Beam as before.
My jobs just hang indefinitely - I tried leaving one running and it timed out after an hour or so.
There's certainly an issue, as even my first step, which just creates an empty PCollection, doesn't succeed; it just says running.
I have abstracted the hell out of the function to work out what the issue might be, since there are no errors or oddities in the logs.
Sharing below the very slimmed-down pipeline; as soon as I comment out the 2nd and 3rd lines in the pipeline, which use the value provider arguments, the job succeeds (at creating an empty PCollection).
My use of the 'add_value_provider_argument' follows pretty closely the official snippet here: https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/snippets.py#L554
and
https://cloud.google.com/dataflow/docs/guides/templates/creating-templates#using-valueprovider-in-your-functions
I borrowed it from Pablo here: https://stackoverflow.com/a/58327762/5687904
I even tried building a completely fresh environment in a new VM, thinking that maybe something in my environment was corrupting the template without causing the build to fail.
I've tried Dataflow SDK 2.15.0, which is what the original template used, as well as 2.24.0 (the most recent one).
Would really appreciate any ideas around debugging this as I'm starting to despair.
import logging
import pandas as pd
import argparse
import datetime

#================ Apache beam ======================
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.options.pipeline_options import StandardOptions
from apache_beam.options.pipeline_options import GoogleCloudOptions
from apache_beam.options.pipeline_options import SetupOptions
from apache_beam.options.pipeline_options import WorkerOptions
from apache_beam.options.pipeline_options import DebugOptions
from apache_beam.io import ReadFromText
from apache_beam.io import WriteToText
from apache_beam.io import fileio
import io

#======================
PROJECT_ID = 'my-project'
GCS_STAGING_LOCATION = 'gs://my-bucket/gcs_staging_location/'
GCS_TMP_LOCATION = 'gs://my-bucket/gcs_tmp_location/'

#======================================
# https://cloud.google.com/dataflow/docs/guides/templates/creating-templates#valueprovider
class FileIterator(beam.DoFn):

    def __init__(self, files_bucket):
        self.files_bucket = files_bucket

    def process(self, element):
        files = pd.read_csv(str(element), header=None).values[0].tolist()
        bucket = self.files_bucket.get()
        files = [str(bucket) + '/' + file for file in files]
        logging.info('Files list is: {}'.format(files))
        return files

#=========================================================
# https://stackoverflow.com/questions/58240058/ways-of-using-value-provider-parameter-in-python-apache-beam
class OutputValueProviderFn(beam.DoFn):

    def __init__(self, vp):
        self.vp = vp

    def process(self, unused_elm):
        yield self.vp.get()

#=========================================================
class RuntimeOptions(PipelineOptions):

    @classmethod
    def _add_argparse_args(cls, parser):
        parser.add_value_provider_argument(
            '--files_bucket',
            help='Bucket where the raw files are',
            type=str)

        parser.add_value_provider_argument(
            '--complete_batch',
            help='Text file with filenames in it location',
            type=str)

        parser.add_value_provider_argument(
            '--comp_table',
            required=False,
            help='BQ table to write to (dataset.table)',
            type=str)

#=========================================================
def run():
    #====================================
    # TODO PUT AS PARAMETERS
    #====================================
    dt_now = datetime.datetime.now().strftime('%Y%m%d')
    job_name = 'dataflow-test-{}'.format(dt_now)

    pipeline_options_batch = PipelineOptions()

    runtime_options = pipeline_options_batch.view_as(RuntimeOptions)

    setup_options = pipeline_options_batch.view_as(SetupOptions)
    setup_options.setup_file = './setup.py'
    google_cloud_options = pipeline_options_batch.view_as(GoogleCloudOptions)
    google_cloud_options.project = PROJECT_ID
    google_cloud_options.staging_location = GCS_STAGING_LOCATION
    google_cloud_options.temp_location = GCS_TMP_LOCATION
    pipeline_options_batch.view_as(StandardOptions).runner = 'DataflowRunner'
    pipeline_options_batch.view_as(WorkerOptions).autoscaling_algorithm = 'THROUGHPUT_BASED'
    pipeline_options_batch.view_as(WorkerOptions).max_num_workers = 10
    pipeline_options_batch.view_as(SetupOptions).save_main_session = True
    pipeline_options_batch.view_as(DebugOptions).experiments = ['use_beam_bq_sink']

    with beam.Pipeline(options=pipeline_options_batch) as pipeline_2:
        try:
            final_data = (
                pipeline_2
                | 'Create empty PCollection' >> beam.Create([None])
                | 'Get accepted batch file' >> beam.ParDo(OutputValueProviderFn(runtime_options.complete_batch))
                # | 'Read all filenames into a list' >> beam.ParDo(FileIterator(runtime_options.files_bucket))
            )
        except Exception as exception:
            logging.error(exception)
            pass

#=========================================================
if __name__ == "__main__":
    run()
It seems that when you created the template, the Apache Beam SDK used was forward-compatible with the package versions in your setup.py file and it was working okay; however, after you made the update, the SDK version may no longer be forward-compatible with the same versions listed in setup.py.
Based on this documentation, the Apache Beam SDK and the Dataflow workers must have forward-compatible libraries to avoid version collisions that can result in unexpected behavior in the service.
To find the required package versions for each Apache Beam SDK version, take a look at this page.
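As an illustration only, a minimal setup.py sketch that pins dependencies to versions compatible with the SDK could look like the following (the package names and version bounds are assumptions for the example; take the real constraints from the dependency page for your SDK version):

import setuptools

setuptools.setup(
    name='my-dataflow-pipeline',  # hypothetical package name
    version='0.0.1',
    # Illustrative pins only: align these with the dependency list published
    # for the Apache Beam SDK version used to build the template.
    install_requires=[
        'apache-beam[gcp]==2.24.0',
        'pandas>=0.25.1,<1.0.0',
    ],
    packages=setuptools.find_packages(),
)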
Related
I am trying to read from a Kafka topic using the KafkaIO Python module, which uses the Java expansion service.
However, similar to this question about the Java implementation, my pipeline gets stuck reading from Kafka and does not move to the next step in the pipeline.
import os
import logging
import apache_beam as beam
from apache_beam.io.kafka import ReadFromKafka
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions
from typing import List
from typing import Optional
from pprint import pprint

"""
Read from auth0-logs-json kafka topic
"""

CONSUMER_CONFIG = {
    "bootstrap.servers": os.environ["bootstrap_servers"],
    "security.protocol": "SASL_SSL",
    "sasl.mechanism": "PLAIN",
    "sasl.username": os.environ["sasl_username"],
    "sasl.password": os.environ["sasl_password"],
    "group.id": "ddp_staging_auth0_logs_os.environflow",
    "sasl.jaas.config": f'org.apache.kafka.common.security.plain.PlainLoginModule required serviceName="Kafka" username=\"{os.environ["sasl_username"]}\" password=\"{os.environ["sasl_password"]}\";',
    "auto.offset.reset": "earliest"
}

def run(beam_args: Optional[List[str]] = None) -> None:
    TEST_JSON_TOPIC = "test_json_ser_topic"

    ######## Kafka Streaming Pipeline ########
    beam_options = PipelineOptions(beam_args, save_main_session=True)
    beam_options.view_as(StandardOptions).streaming = True

    with beam.Pipeline(options=beam_options) as pipeline:
        (
            pipeline
            | 'Read_Kafka' >> ReadFromKafka(
                consumer_config=CONSUMER_CONFIG,
                topics=[TEST_JSON_TOPIC])
            | 'Log topic msg' >> beam.ParDo(logging.info)
        )

if __name__ == '__main__':
    logging.getLogger().setLevel(logging.INFO)
    run()
Here are the logs from the DirectRunner run:
https://drive.google.com/file/d/1QFAcRAoDr5ltPq6Dz0O9oA54wV9l0BqM/view?usp=sharing
I am trying to create a template for a Beam pipeline to run it on GCP Dataflow. The pipeline uses the Apache Beam dataframe module's read_csv to read the file. I want the file name to be passed in as an argument to the template. I figured out that we have to use RuntimeValueProvider for this.
I have written the code below, using this documentation as a reference: https://cloud.google.com/dataflow/docs/guides/templates/creating-templates#using-valueprovider-in-your-pipeline-options
import apache_beam as beam
from apache_beam.dataframe.io import read_csv
from apache_beam.options.pipeline_options import PipelineOptions

class MyOptions(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        parser.add_value_provider_argument('--file_name',
                                           type=str,
                                           default='gs://default-bucket/default-file.csv')

pipeline_options = PipelineOptions(
    runner='DataflowRunner',
    project='my-project',
    job_name='read-csv',
    temp_location='gs://dataflow-test-bucket/temp',
    region='us-central1')

p = beam.Pipeline(options=pipeline_options)
my_options = pipeline_options.view_as(MyOptions)

# Hardcoding the file works fine: df = p | read_csv('gs://default-bucket/default-file.csv')
df = p | read_csv(my_options.file_name)
beam.dataframe.convert.to_pcollection(df) | beam.Map(print)

p.run().wait_until_finish()
When I run the code, I get the following error:
Exception has occurred: WontImplementError
non-deferred
File "D:\WorkArea\dataflow_args_test_projects\read_csv.py", line 37, in
df = p | read_csv(my_options.file_name)
What is the correct way to access the RuntimeValueProvider when using read_csv?
The code below builds the pipeline and the DAG is generated, but it then fails with:
RuntimeError: NotImplementedError [while running 'generatedPtransform-438']
Please let me know if there is any direct connector for MySQL in Python for Beam.
from apache_beam.options.pipeline_options import PipelineOptions
from google.cloud import pubsub_v1
from google.cloud import bigquery
import mysql.connector
import apache_beam as beam
import logging
import argparse
import sys
import re

PROJECT = "12344"
TOPIC = "projects/12344/topics/mytopic"

class insertfn(beam.DoFn):
    def insertdata(self, data):
        db_conn = mysql.connector.connect(host="localhost", user="abc", passwd="root", database="new")
        db_cursor = db_conn.cursor()
        emp_sql = " INSERT INTO emp(ename,eid,dept) VALUES (%s,%s,%s)"
        db_cursor.executemany(emp_sql, (data[0], data[1], data[2]))
        db_conn.commit()
        print(db_cursor.rowcount, "record inserted")

class Split(beam.DoFn):
    def process(self, data):
        data = data.split(",")
        return [{
            'ename': data[0],
            'eid': data[1],
            'dept': data[2]
        }]

def main(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--input_topic")
    parser.add_argument("--output")
    known_args = parser.parse_known_args(argv)

    p = beam.Pipeline(options=PipelineOptions())

    (p
        | 'ReadData' >> beam.io.ReadFromPubSub(topic=TOPIC).with_output_types(bytes)
        | "Decode" >> beam.Map(lambda x: x.decode('utf-8'))
        | 'ParseCSV' >> beam.ParDo(Split())
        | 'WriteToMySQL' >> beam.ParDo(insertfn())
    )

    result = p.run()
    result.wait_until_finish()
After our discussion in the comment section, I noticed that you are not using the proper command to execute the Dataflow pipeline.
According to the documentation, there are mandatory flags which must be defined in order to run the pipeline on the Dataflow managed service. These flags are described below:
job_name - The name of the Dataflow job being executed.
project - The ID of your Google Cloud project.
runner - The pipeline runner that will parse your program and construct your pipeline. For cloud execution, this must be DataflowRunner.
staging_location - A Cloud Storage path for Dataflow to stage code packages needed by workers executing the job.
temp_location - A Cloud Storage path for Dataflow to stage temporary job files created during the execution of the pipeline.
In addition to these flags, there are others you can use; in your case, since you use a Pub/Sub topic:
--input_topic: sets the input Pub/Sub topic to read messages from.
Therefore, an example to run a Dataflow pipeline would be as follows:
python RunPipelineDataflow.py \
    --job_name=jobName \
    --project=$PROJECT_NAME \
    --runner=DataflowRunner \
    --staging_location=gs://YOUR_BUCKET_NAME/AND_STAGING_DIRECTORY \
    --temp_location=gs://$BUCKET_NAME/temp \
    --input_topic=projects/$PROJECT_NAME/topics/$TOPIC_NAME
I would like to point out the importance of using DataflowRunner: it lets you run on the Cloud Dataflow managed service, providing a fully managed service, autoscaling, and dynamic work rebalancing. However, it is also possible to use DirectRunner, which executes your pipeline on your own machine; it is designed to validate the pipeline.
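As a complementary sketch (all project, bucket, and job names below are placeholders, not taken from your code), the same mandatory flags can also be set programmatically when constructing the PipelineOptions object:

# Sketch: passing the mandatory Dataflow flags programmatically rather than on
# the command line. Project, bucket and job names are placeholders.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner='DataflowRunner',
    project='my-project-id',
    job_name='pubsub-to-mysql',
    staging_location='gs://my-bucket/staging',
    temp_location='gs://my-bucket/temp',
    streaming=True,  # needed when reading from Pub/Sub
)

with beam.Pipeline(options=options) as p:
    # Build the same steps as in the question here.
    pass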
I am getting multiple threads of the same process in CLOSE_WAIT, because of which I am getting a 'too many open files' error.
OSError: [Errno 24] Too many open files:
This happens when multiple calls to the Google Cloud Speech API are made.
I have gone through various answers on Stack Overflow, but I am unable to figure out the solution.
sudo lsof | grep -i close | wc -l
15180
The code I have shared is a trimmed version of the actual code. I am able to reproduce the error using the code below.
import os
import tornado.httpserver, tornado.ioloop, tornado.options, tornado.web, tornado.escape
import os.path
import string
import json
from google.cloud import speech
from google.cloud.speech import types, enums

tornado.options.parse_command_line()
tornado.options.define("port", default=8888, help="run on the given port", type=int)

SPEECH_TO_TEXT_CREDENTIALS = 'my_json_file.json'
UPLOAD_FOLDER = '/home/ubuntu/uploads'

class Application(tornado.web.Application):
    def __init__(self):
        os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = SPEECH_TO_TEXT_CREDENTIALS
        self.speech_client = speech.SpeechClient()
        handlers = [
            (r"/test_bug/client/googlestt2", GoogleSTTHandler)
        ]
        tornado.web.Application.__init__(self, handlers)

class GoogleSTTHandler(tornado.web.RequestHandler):
    def post(self):
        if 'audio' not in self.request.files:
            self.finish({'Error': "No audio provided"})
        audio_filename = 'test.wav'
        audio = self.request.files['audio'][0]
        with open(os.path.join(UPLOAD_FOLDER, audio_filename), 'wb') as f:
            f.write(audio['body'])
        with open(os.path.join(UPLOAD_FOLDER, audio_filename), 'rb') as audio_file:
            content = audio_file.read()
        audio = types.RecognitionAudio(content=content)
        config = types.RecognitionConfig(encoding=enums.RecognitionConfig.AudioEncoding.LINEAR16, language_code='en-IN')
        response = self.application.speech_client.recognize(config, audio)
        if not response.results:
            Transcript_Upload = "Empty Audio"
        else:
            for result in response.results:
                Transcript_Upload = 'Transcript: {}'.format(result.alternatives[0].transcript)
        self.finish(Transcript_Upload)

def main():
    http_server = tornado.httpserver.HTTPServer(Application())
    http_server.listen(tornado.options.options.port)
    tornado.ioloop.IOLoop.instance().start()

if __name__ == "__main__":
    main()
Please suggest if I am doing something wrong and how to fix this.
This is a known issue in google-cloud-python as well as gcloud-python: https://github.com/googleapis/google-cloud-python/issues/5570.
I dropped the client library, and since then I've been using the Google API directly.
As a side note, you are using the synchronous API, but to leverage Tornado (or actually any asynchronous framework) you should use async libs/calls, like google-cloud-python's Asynchronous Recognition.
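For illustration, one common pattern (a sketch only, not the Speech API's long-running recognition) is to move the blocking call off Tornado's event loop into a thread pool; blocking_recognize below is a hypothetical stand-in for the speech_client.recognize(config, audio) call:

# Sketch: offload a blocking API call to a thread pool so the Tornado IOLoop
# stays responsive. This only illustrates the non-blocking call pattern; it
# does not by itself resolve the CLOSE_WAIT / open-files issue above.
from concurrent.futures import ThreadPoolExecutor
import time
import tornado.ioloop
import tornado.web

executor = ThreadPoolExecutor(max_workers=4)

def blocking_recognize(payload):
    # Hypothetical stand-in for speech_client.recognize(config, audio).
    time.sleep(1)
    return 'Transcript placeholder for {} bytes'.format(len(payload))

class AsyncSTTHandler(tornado.web.RequestHandler):
    async def post(self):
        audio_body = self.request.files['audio'][0]['body']
        loop = tornado.ioloop.IOLoop.current()
        transcript = await loop.run_in_executor(executor, blocking_recognize, audio_body)
        self.finish(transcript)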
I'm trying to use Dataflow in GCP. The context is the following.
- I have created a pipeline that works correctly locally. This is the test.py script (I use a subprocess function which executes the script "script2.py"; that script lives locally and is also stored in a bucket in the cloud):
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.options.pipeline_options import GoogleCloudOptions
from apache_beam.options.pipeline_options import StandardOptions
from apache_beam.options.pipeline_options import SetupOptions

project = "titanium-index-200721"
bucket = "pipeline-operation-test"

class catchOutput(beam.DoFn):
    def process(self, element):
        import subprocess
        import sys
        s2_out = subprocess.check_output([sys.executable, "script2.py", "34"])
        return [s2_out]

def run():
    project = "titanium-index-200721"
    job_name = "test-setup-subprocess-newerr"
    staging_location = 'gs://pipeline-operation-test/staging'
    temp_location = 'gs://pipeline-operation-test/temp'
    setup = './setup.py'

    options = PipelineOptions()
    google_cloud_options = options.view_as(GoogleCloudOptions)
    options.view_as(SetupOptions).setup_file = "./setup.py"
    google_cloud_options.project = project
    google_cloud_options.job_name = job_name
    google_cloud_options.staging_location = staging_location
    google_cloud_options.temp_location = temp_location
    options.view_as(StandardOptions).runner = 'DataflowRunner'

    p = beam.Pipeline(options=options)

    input = 'gs://pipeline-operation-test/input2.txt'
    output = 'gs://pipeline-operation-test/OUTPUTsetup.csv'

    results = (
        p |
        'ReadMyFile' >> beam.io.ReadFromText(input) |
        'Split' >> beam.ParDo(catchOutput()) |
        'CreateOutput' >> beam.io.WriteToText(output)
    )

    p.run()

if __name__ == '__main__':
    run()
I have written a setup.py script so that I can include all the packages needed by future scripts that will also run on Dataflow in GCP.
Nevertheless, when I try to run all of this in the cloud I run into problems. To be more precise, when running the Dataflow job I get the following error:
RuntimeError: CalledProcessError: Command '['/usr/bin/python', 'script2.py', '34']' returned non-zero exit status 2 [while running 'Split']
I have tried placing the imports (subprocess, sys) in different places, and I have also tried modifying the path of script2.py, which is in the bucket, but nothing has worked.
Finally, one way to get rid of the error is to modify the script with:
try:
    s2_out = subprocess.check_output([sys.executable, "script2.py", "34"])
except subprocess.CalledProcessError as e:
    s2_out = e.output
But then my output is nothing, because doing that only lets the pipeline run; it still doesn't execute correctly.
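For what it's worth, a small debugging sketch (assuming the same subprocess call as above) that folds stderr into the captured output, so the real reason for exit status 2 shows up in the worker logs before re-raising:

import logging
import subprocess
import sys

try:
    s2_out = subprocess.check_output(
        [sys.executable, "script2.py", "34"],
        stderr=subprocess.STDOUT)
except subprocess.CalledProcessError as e:
    logging.error('script2.py failed with code %s: %s', e.returncode, e.output)
    raise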
Does anybody know how this could be fixed?
Thank you very much!
Guillem