I'd like to embed my Dataflow pipeline inside a Cloud Function WITHOUT USING A TEMPLATE. I ran into an error at first, and according to this answer I should package up my code as a dependency. This is the structure of my Cloud Function:
file wb_flow.py:
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
import cbsodata


def main(identifier, schema_file):
    """The main function which creates the pipeline and runs it."""
    table_name = f"wijken_en_buurten_{cbsodata.get_info(identifier)['Period']}"

    pipeline_options = PipelineOptions(
        [
            '--runner', 'DataflowRunner',
            '--project', 'veneficus',
            '--region', 'europe-west4',
            '--temp_location', 'gs://vf_etl/test',
            '--staging_location', 'gs://vf_etl/temp',
            '--setup_file', 'setup.py'
        ]
    )
    p = beam.Pipeline(options=pipeline_options)
    (p
     | 'Read from BQ Table' >> beam.Create(cbsodata.get_data(identifier))
     | 'Write Projects to BigQuery' >> beam.io.WriteToBigQuery(
            f"cbs.{table_name}",
            schema=schema,  # 'schema' is not defined in this snippet; presumably derived from schema_file
            # Creates the table in BigQuery if it does not yet exist.
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
            write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE)
     )
    p.run()  # .wait_until_finish()
file main.py:
import base64

from wb_flow import main


def run(event, context):
    """The main function which creates the pipeline and runs it."""
    message = base64.b64decode(event['data']).decode('utf-8').split(',')
    identifier, schema_file = message[0], message[1]
    main(identifier, schema_file)
and setup.py:
import setuptools

setuptools.setup(
    name='wb_flow',
    version='1.0.0',
    install_requires=[],
    packages=setuptools.find_packages(),
)
I got this error during the construction of the Dataflow job:
File "/layers/google.python.pip/pip/lib/python3.8/site-packages/apache_beam/runners/portability/stager.py", line 579, in _build_setup_package
os.chdir(os.path.dirname(setup_file))
FileNotFoundError: [Errno 2] No such file or directory: ''
I believe it means that it couldn't find my setup.py. How can I specify the path to my setup file?
Alternatively, I tried to do this without setup.py, and Dataflow said it couldn't find the wb_flow module.
Update
When I specified my setup path as /workspace/setup.py, I got this error:
subprocess.CalledProcessError: Command '['/layers/google.python.pip/pip/bin/python3', 'setup.py', 'sdist', '--dist-dir', '/tmp/tmp3rxejr4g']' returned non-zero exit status 1."
Try changing
'--setup_file', 'setup.py'
to
'--setup_file', './setup.py'
This worked for me :)
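For reference, here is a minimal sketch of the corrected options list; it is the same set of flags as in the question, with only the setup_file value changed to an explicit relative path, so the stager's os.path.dirname() call resolves to '.' instead of an empty string:

pipeline_options = PipelineOptions(
    [
        '--runner', 'DataflowRunner',
        '--project', 'veneficus',
        '--region', 'europe-west4',
        '--temp_location', 'gs://vf_etl/test',
        '--staging_location', 'gs://vf_etl/temp',
        # './setup.py' has a non-empty directory component ('.'),
        # which is what os.chdir() in the stager needs.
        '--setup_file', './setup.py',
    ]
)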
Related
When creating a pipeline with the Python SDK v2 for Azure ML, all contents of my current working directory are uploaded. Can I blacklist some files from being uploaded? E.g. I use load_env(".env") in order to read some credentials, but I don't want that file to be uploaded.
Directory content:
./src
utilities.py # contains helper function to get Azure credentials
.env # contains credentials
conda.yaml
script.py
A minimal pipeline example:
import mldesigner
import mlflow
from azure.ai.ml import MLClient
from azure.ai.ml.dsl import pipeline

from src.utilities import get_credential

credential = get_credential()  # calls load_env(".env") locally
ml_client = MLClient(
    credential=credential,
    subscription_id="foo",
    resource_group_name="bar",
    workspace_name="foofoo",
)


@mldesigner.command_component(
    name="testcomponent",
    display_name="Test Component",
    description="Test Component description.",
    environment=dict(
        conda_file="./conda.yaml",
        image="mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04",
    ),
)
def test_component():
    mlflow.log_metric("metric", 0)


cluster_name = "foobar"


@pipeline(default_compute=cluster_name)
def pipe():
    test_component()


pipeline_job = pipe()
pipeline_job = ml_client.jobs.create_or_update(
    pipeline_job, experiment_name="pipeline_samples"
)
After running python script.py, the pipeline job is created and runs in Azure ML. If I look at the pipeline in the Azure ML UI and inspect Test Component under the Code tab, I find all source files, including .env.
How can I prevent uploading this file using the SDK while creating a pipeline job?
You can use a .gitignore or .amlignore file in your working directory to specify files and directories to ignore. By default, these files will not be included when you run the pipeline.
Here is the documentation on how to prevent unnecessary files from being included.
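For example, a minimal .amlignore placed in the working directory next to script.py (the cache patterns are illustrative additions) could look like this:

# Do not upload credentials or local caches with the pipeline snapshot
.env
__pycache__/
*.pyc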
or

import os

# Get all files in the current working directory
all_files = os.listdir()

# Remove the ".env" file from the list of files
all_files.remove(".env")


@pipeline(default_compute=cluster_name, files=all_files)
def pipe():
    test_component()


pipeline_job = pipe()
pipeline_job = ml_client.jobs.create_or_update(
    pipeline_job, experiment_name="pipeline_samples"
)
I have written a Python package that trains a neural network. I then package it up using the command below.
python3 setup.py sdist --formats=gztar
When I run this job through the GCP console and manually click through all the options, I get logs from my program as expected (see example below).
Example successful logs:
However, when I run the exact same job programmatically, no logs appear, only the final error (if one occurs):
Example logs missing:
In both cases, the program is running; I just can't see any of its output. What could the reason for this be? For reference, I have also included the code I used to programmatically start the training process:
import datetime
import shutil
import subprocess
from pathlib import Path

from google.cloud import aiplatform

ENTRY_POINT = "projects.yaw_correction.yaw_correction"
TIMESTAMP = datetime.datetime.strftime(datetime.datetime.now(), "%y%m%d_%H%M%S")
PROJECT = "yaw_correction"
GCP_PROJECT = "our_gcp_project_name"
LOCATION = "europe-west1"
BUCKET_NAME = "our_bucket_name"
DISPLAY_NAME = "Training_Job_" + TIMESTAMP
CONTAINER_URI = "europe-docker.pkg.dev/vertex-ai/training/pytorch-xla.1-9:latest"
MODEL_NAME = "Model_" + TIMESTAMP
ARGS = ["/gcs/fotokite-training-data/yaw_correction/", "--cloud", "--gpu"]
TENSORBOARD = "projects/our_gcp_project_name/locations/europe-west4/tensorboards/yaw_correction"
MACHINE_TYPE = "n1-standard-4"
REPLICA_COUNT = 1
ACCELERATOR_TYPE = "ACCELERATOR_TYPE_UNSPECIFIED"
ACCELERATOR_COUNT = 0
SYNC = False

# Delete existing source distributions
def deleteDist():
    dirpath = Path('dist')
    if dirpath.exists() and dirpath.is_dir():
        shutil.rmtree(dirpath)

# Copy the distribution to the cloud bucket storage
deleteDist()
subprocess.run("python3 setup.py sdist --formats=gztar", shell=True)
filename = [x for x in Path('dist').glob('*')]
if len(filename) != 1:
    raise Exception("More than one distribution was found")
print(str(filename[0]))

PACKAGE_URI = f"gs://{BUCKET_NAME}/distributions/"
subprocess.run(f"gsutil cp {str(filename[0])} {PACKAGE_URI}", shell=True)
PACKAGE_URI += str(filename[0].name)
deleteDist()

# Initialise the compute instance
aiplatform.init(project=GCP_PROJECT, location=LOCATION, staging_bucket=BUCKET_NAME)

# Schedule the job
job = aiplatform.CustomPythonPackageTrainingJob(
    display_name=DISPLAY_NAME,
    #script_path="trainer/test.py",
    python_package_gcs_uri=PACKAGE_URI,
    python_module_name=ENTRY_POINT,
    #requirements=['tensorflow_datasets~=4.2.0', 'SQLAlchemy~=1.4.26', 'google-cloud-secret-manager~=2.7.2', 'cloud-sql-python-connector==0.4.2', 'PyMySQL==1.0.2'],
    container_uri=CONTAINER_URI,
)
model = job.run(
    dataset=None,
    #base_output_dir=f"gs://{BUCKET_NAME}/{PROJECT}/Train_{TIMESTAMP}",
    base_output_dir=f"gs://{BUCKET_NAME}/{PROJECT}/",
    service_account="vertex-ai-fotokite-service-acc@fotokite-cv-gcp-exploration.iam.gserviceaccount.com",
    environment_variables=None,
    args=ARGS,
    replica_count=REPLICA_COUNT,
    machine_type=MACHINE_TYPE,
    accelerator_type=ACCELERATOR_TYPE,
    accelerator_count=ACCELERATOR_COUNT,  # was ACCELERATOR_TYPE in the original, which looks like a typo
    #tensorboard=TENSORBOARD,
    sync=SYNC
)
print(model)
print("JOB SUBMITTED")
Usually this kind of error ("The replica workerpool0-0 exited with a non-zero status of 1") means that something went wrong while packaging the Python files, or that there is an error in the code itself.
You can look at these options:
You could check whether all the files are in the package (training files and dependencies), as in this example:
setup.py
demo/PKG
demo/SOURCES.txt
demo/dependency_links.txt
demo/requires.txt
demo/level.txt
trainer/__init__.py
trainer/metadata.py
trainer/model.py
trainer/task.py
trainer/utils.py
You could consult the official Google Cloud troubleshooting guide for this type of error, which explains how to get more information about it.
You can also see the official documentation about packaging.
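As a minimal sketch, a setup.py for the trainer layout listed above might look like the following (the package name and version are assumptions taken from that example, not from the question's actual project):

import setuptools

setuptools.setup(
    name='trainer',
    version='0.1',
    # List the runtime dependencies your training code needs here.
    install_requires=[],
    # find_packages() picks up trainer/ because it contains an __init__.py.
    packages=setuptools.find_packages(),
)

Building it with python3 setup.py sdist --formats=gztar and listing the contents of the resulting archive is a quick way to confirm that every module shown above actually made it into the package.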
I am trying to execute an Apache Beam pipeline as a Dataflow job on Google Cloud Platform.
My project structure is as follows:
root_dir/
    __init__.py
    setup.py
    main.py
    utils/
        __init__.py
        log_util.py
        config_util.py
Here's my setup.py
import setuptools

setuptools.setup(
    name='dataflow_example',
    version='1.0',
    install_requires=[
        "google-cloud-tasks==2.2.0",
        "google-cloud-pubsub>=0.1.0",
        "google-cloud-storage==1.39.0",
        "google-cloud-bigquery==2.6.2",
        "google-cloud-secret-manager==2.0.0",
        "google-api-python-client==2.3.0",
        "oauth2client==4.1.3",
        "apache-beam[gcp]>=2.20.0",
        "wheel>=0.36.2"
    ],
    packages=setuptools.find_packages()
)
Here's my pipeline code:
import math

import apache_beam as beam
from datetime import datetime
from apache_beam.options.pipeline_options import PipelineOptions

from utils.log_util import LogUtil
from utils.config_util import ConfigUtil


class DataflowExample:
    config = {}

    def __init__(self):
        self.config = ConfigUtil.get_config(module_config=["config"])
        self.project = self.config['project']
        self.region = self.config['location']
        self.bucket = self.config['core_bucket']
        self.batch_size = 10

    def execute_pipeline(self):
        try:
            LogUtil.log_n_notify(log_type="info", msg=f"Dataflow started")

            query = "SELECT id, name, company FROM `<bigquery_table>` LIMIT 10"

            beam_options = {
                "project": self.project,
                "region": self.region,
                "job_name": "dataflow_example",
                "runner": "DataflowRunner",
                "temp_location": f"gs://{self.bucket}/temp_location/"
            }

            options = PipelineOptions(**beam_options, save_main_session=True)

            with beam.Pipeline(options=options) as pipeline:
                data = (
                    pipeline
                    | 'Read from BQ ' >> beam.io.Read(beam.io.ReadFromBigQuery(query=query, use_standard_sql=True))
                    | 'Count records' >> beam.combiners.Count.Globally()
                    | 'Print ' >> beam.ParDo(PrintCount(), self.batch_size)
                )

            LogUtil.log_n_notify(log_type="info", msg=f"Dataflow completed")
        except Exception as e:
            LogUtil.log_n_notify(log_type="error", msg=f"Exception in execute_pipeline - {str(e)}")


class PrintCount(beam.DoFn):

    def __init__(self):
        self.logger = LogUtil()

    def process(self, row_count, batch_size):
        try:
            current_date = datetime.today().date()
            total = int(math.ceil(row_count / batch_size))

            self.logger.log_n_notify(log_type="info", msg=f"Records pulled from table on {current_date} is {row_count}")
            self.logger.log_n_notify(log_type="info", msg=f"Records per batch: {batch_size}. Total batches: {total}")
        except Exception as e:
            self.logger.log_n_notify(log_type="error", msg=f"Exception in PrintCount.process - {str(e)}")


if __name__ == "__main__":
    df_example = DataflowExample()
    df_example.execute_pipeline()
The functionality of the pipeline is:
Query against a BigQuery table.
Count the total records fetched by the query.
Print using the custom log module present in the utils folder.
I am running the job using Cloud Shell with the command - python3 - main.py
Though the Dataflow job starts, the worker nodes throw an error after a few minutes saying "ModuleNotFoundError: No module named 'utils'".
The "utils" folder is available, and the same code works fine when executed with the "DirectRunner".
log_util and config_util are custom utility files for logging and config fetching, respectively.
Also, I tried running with the setup_file option as python3 - main.py --setup_file </path/of/setup.py>, which makes the job just freeze; it does not proceed even after 15 minutes.
How do I resolve the ModuleNotFoundError with "DataflowRunner"?
Posting as community wiki. As confirmed by @GopinathS, the error and fix are as follows:
The error encountered by the workers is: Beam SDK base version 2.32.0 does not match Dataflow Python worker version 2.28.0. Please check Dataflow worker startup logs and make sure that correct version of Beam SDK is installed.
To fix this, "apache-beam[gcp]>=2.20.0" is removed from install_requires of setup.py, since '>=' pulls in the latest available version (2.32.0 as of this writing) while the workers' version is only 2.28.0.
Updated setup.py:
import setuptools

setuptools.setup(
    name='dataflow_example',
    version='1.0',
    install_requires=[
        "google-cloud-tasks==2.2.0",
        "google-cloud-pubsub>=0.1.0",
        "google-cloud-storage==1.39.0",
        "google-cloud-bigquery==2.6.2",
        "google-cloud-secret-manager==2.0.0",
        "google-api-python-client==2.3.0",
        "oauth2client==4.1.3",  # removed apache-beam[gcp]>=2.20.0
        "wheel>=0.36.2"
    ],
    packages=setuptools.find_packages()
)
Updated beam_options in the pipeline code:
beam_options = {
    "project": self.project,
    "region": self.region,
    "job_name": "dataflow_example",
    "runner": "DataflowRunner",
    "temp_location": f"gs://{self.bucket}/temp_location/",
    "setup_file": "./setup.py"
}
Also make sure that you pass all the pipeline options at once and not partially.
If you pass --setup_file </path/of/setup.py> on the command line, then make sure to read the setup file path with an argument parser and append it to the already defined beam_options variable in your code, as sketched below.
To avoid parsing the argument and appending it to beam_options, I instead added it directly to beam_options as "setup_file": "./setup.py".
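For completeness, a minimal sketch of the argument-parser variant (the flag name and default are illustrative; it reuses the beam_options fields from the answer above):

import argparse

from apache_beam.options.pipeline_options import PipelineOptions

parser = argparse.ArgumentParser()
parser.add_argument("--setup_file", default="./setup.py")
known_args, _ = parser.parse_known_args()

beam_options = {
    "project": self.project,
    "region": self.region,
    "job_name": "dataflow_example",
    "runner": "DataflowRunner",
    "temp_location": f"gs://{self.bucket}/temp_location/",
    # Forward the parsed path so the Dataflow workers install the local package.
    "setup_file": known_args.setup_file,
}
options = PipelineOptions(**beam_options, save_main_session=True)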
Dataflow might have problems installing packages that are platform-locked when it runs in an isolated network.
It won't be able to compile them if there is no network access.
Or maybe it tries to install them but, since it cannot compile, it downloads wheels? No idea.
Still, to be able to use packages like psycopg2 (binaries) or google-cloud-secret-manager (no binaries, BUT its dependencies have binaries), you need to install everything that has no binaries (none-any) AND no dependencies with binaries via requirements.txt, and the rest via the --extra_packages param with wheels. Example:
--extra_packages=package_1_needed_by_2-manylinux.whl \
--extra_packages=package_2_needed_by_3-manylinux.whl \
--extra_packages=what-you-need_needing_3-none-any.whl
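A sketch of how the two mechanisms might be combined on the command line (main.py and the wheel names are placeholders carried over from the example above, not real packages):

python3 main.py \
  --runner DataflowRunner \
  --requirements_file requirements.txt \
  --extra_packages package_1_needed_by_2-manylinux.whl \
  --extra_packages package_2_needed_by_3-manylinux.whl \
  --extra_packages what-you-need_needing_3-none-any.whl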
I am trying to use Dataflow to complete a task that requires the use of .csv and .json files. From what I understand, I should be able to create a setup.py file that will include these files and distribute them to the workers.
This is how my files are laid out:
pipline.py
setup.py
utils/
--> __init__.py
--> CSV.csv
--> JSON.json
This is my setup.py file:
import setuptools

setuptools.setup(
    name='utils',
    version='0.0.1',
    description='utils',
    packages=setuptools.find_packages(),
    package_data={'utils': ['CSV.csv', 'JSON.json']},
    include_package_data=True)
This is my beam.DoFn function:
class DoWork(beam.DoFn):
    def process(self, element):
        import pandas as pd

        df_csv = pd.read_csv('CSV.csv')
        df_json = pd.read_json('JSON.json')

        # Do other stuff with the dataframes
        yield [stuff]
My pipeline is set up like so:
dataflow_options = ['--job_name=pipline',
                    '--project=pipeline',
                    '--temp_location=gs://pipeline/temp',
                    '--staging_location=gs://pipeline/stage',
                    '--setup_file=./setup.py']
options = PipelineOptions(dataflow_options)
gcloud_options = options.view_as(GoogleCloudOptions)
options.view_as(StandardOptions).runner = 'DataflowRunner'

with beam.Pipeline(options=options) as p:
    update = p | beam.Create(files) | beam.ParDo(DoWork())
Basically I keep getting an:
IOError: File CSV.csv does not exist
It doesn't think the .json file exists either, but it errors out before it reaches that step. The files are possibly not making it to Dataflow, or I am referencing them incorrectly within the DoFn. Should I actually be putting the files into the data_files parameter of the setup function instead of package_data?
You need to upload the input files to Cloud Storage and reference a gs:// location rather than a bare file name like CSV.csv. I think you ran the code locally with the csv file in the same directory as the code, but running it with the DataflowRunner needs the files to be in GCS.
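A minimal sketch of that suggestion, reading both objects from a GCS path inside the DoFn via Beam's FileSystems API (the bucket path gs://pipeline/data/ is a placeholder):

import apache_beam as beam
from apache_beam.io.filesystems import FileSystems


class DoWork(beam.DoFn):
    def process(self, element):
        import pandas as pd

        # Open the files from GCS instead of relying on the worker's local filesystem.
        with FileSystems.open('gs://pipeline/data/CSV.csv') as f:
            df_csv = pd.read_csv(f)
        with FileSystems.open('gs://pipeline/data/JSON.json') as f:
            df_json = pd.read_json(f)

        # ... do other stuff with the dataframes ...
        yield element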
I'm trying to use Dataflow in GCP. The context is the following.
I have created a pipeline that works correctly locally. This is the test.py script (I use a subprocess call that executes the script "script2.py", which exists locally and is also stored in a bucket in the cloud):
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.options.pipeline_options import GoogleCloudOptions
from apache_beam.options.pipeline_options import StandardOptions
from apache_beam.options.pipeline_options import SetupOptions

project = "titanium-index-200721"
bucket = "pipeline-operation-test"


class catchOutput(beam.DoFn):
    def process(self, element):
        import subprocess
        import sys

        s2_out = subprocess.check_output([sys.executable, "script2.py", "34"])
        return [s2_out]


def run():
    project = "titanium-index-200721"
    job_name = "test-setup-subprocess-newerr"
    staging_location = 'gs://pipeline-operation-test/staging'
    temp_location = 'gs://pipeline-operation-test/temp'
    setup = './setup.py'

    options = PipelineOptions()
    google_cloud_options = options.view_as(GoogleCloudOptions)
    options.view_as(SetupOptions).setup_file = "./setup.py"
    google_cloud_options.project = project
    google_cloud_options.job_name = job_name
    google_cloud_options.staging_location = staging_location
    google_cloud_options.temp_location = temp_location
    options.view_as(StandardOptions).runner = 'DataflowRunner'

    p = beam.Pipeline(options=options)

    input = 'gs://pipeline-operation-test/input2.txt'
    output = 'gs://pipeline-operation-test/OUTPUTsetup.csv'

    results = (
        p |
        'ReadMyFile' >> beam.io.ReadFromText(input) |
        'Split' >> beam.ParDo(catchOutput()) |
        'CreateOutput' >> beam.io.WriteToText(output)
    )

    p.run()


if __name__ == '__main__':
    run()
I have written a "setup.py" script in order to include all the packages needed by future scripts that will also run in Dataflow on GCP.
Nevertheless, when I try to run all of that in the cloud I'm having some problems. To be more precise, when running the Dataflow job I get the following error:
RuntimeError: CalledProcessError: Command '['/usr/bin/python', 'script2.py', '34']' returned non-zero exit status 2 [while running 'Split']
I have tried placing the import calls (subprocess, sys) in different places, and I have also tried to modify the path of script2.py, which is in the bucket, but nothing has worked.
Finally, one way to get rid of the error is by modifying the script with:
try:
    s2_out = subprocess.check_output([sys.executable, "script2.py", "34"])
except subprocess.CalledProcessError as e:
    s2_out = e.output
But then my output is empty, because by doing that I only let the pipeline run; it is not correctly executed.
Does anybody know how this could be fixed?
Thank you very much!
Guillem