I am trying to execute an Apache Beam pipeline as a Dataflow job on Google Cloud Platform.
My project structure is as follows:
root_dir/
    __init__.py
    setup.py
    main.py
    utils/
        __init__.py
        log_util.py
        config_util.py
Here's my setup.py
import setuptools

setuptools.setup(
    name='dataflow_example',
    version='1.0',
    install_requires=[
        "google-cloud-tasks==2.2.0",
        "google-cloud-pubsub>=0.1.0",
        "google-cloud-storage==1.39.0",
        "google-cloud-bigquery==2.6.2",
        "google-cloud-secret-manager==2.0.0",
        "google-api-python-client==2.3.0",
        "oauth2client==4.1.3",
        "apache-beam[gcp]>=2.20.0",
        "wheel>=0.36.2"
    ],
    packages=setuptools.find_packages()
)
Here's my pipeline code:
import math
import apache_beam as beam
from datetime import datetime
from apache_beam.options.pipeline_options import PipelineOptions
from utils.log_util import LogUtil
from utils.config_util import ConfigUtil


class DataflowExample:
    config = {}

    def __init__(self):
        self.config = ConfigUtil.get_config(module_config=["config"])
        self.project = self.config['project']
        self.region = self.config['location']
        self.bucket = self.config['core_bucket']
        self.batch_size = 10

    def execute_pipeline(self):
        try:
            LogUtil.log_n_notify(log_type="info", msg=f"Dataflow started")

            query = "SELECT id, name, company FROM `<bigquery_table>` LIMIT 10"

            beam_options = {
                "project": self.project,
                "region": self.region,
                "job_name": "dataflow_example",
                "runner": "DataflowRunner",
                "temp_location": f"gs://{self.bucket}/temp_location/"
            }

            options = PipelineOptions(**beam_options, save_main_session=True)

            with beam.Pipeline(options=options) as pipeline:
                data = (
                    pipeline
                    | 'Read from BQ ' >> beam.io.Read(beam.io.ReadFromBigQuery(query=query, use_standard_sql=True))
                    | 'Count records' >> beam.combiners.Count.Globally()
                    | 'Print ' >> beam.ParDo(PrintCount(), self.batch_size)
                )

            LogUtil.log_n_notify(log_type="info", msg=f"Dataflow completed")
        except Exception as e:
            LogUtil.log_n_notify(log_type="error", msg=f"Exception in execute_pipeline - {str(e)}")


class PrintCount(beam.DoFn):

    def __init__(self):
        self.logger = LogUtil()

    def process(self, row_count, batch_size):
        try:
            current_date = datetime.today().date()
            total = int(math.ceil(row_count / batch_size))

            self.logger.log_n_notify(log_type="info", msg=f"Records pulled from table on {current_date} is {row_count}")
            self.logger.log_n_notify(log_type="info", msg=f"Records per batch: {batch_size}. Total batches: {total}")
        except Exception as e:
            self.logger.log_n_notify(log_type="error", msg=f"Exception in PrintCount.process - {str(e)}")


if __name__ == "__main__":
    df_example = DataflowExample()
    df_example.execute_pipeline()
The functionality of the pipeline is:
Query against a BigQuery table.
Count the total records fetched by the query.
Print the count using the custom log module present in the utils folder.
I am running the job from Cloud Shell using the command python3 main.py.
Though the Dataflow job starts, the worker nodes throw an error after a few minutes saying "ModuleNotFoundError: No module named 'utils'".
The "utils" folder is available, and the same code works fine when executed with "DirectRunner".
log_util and config_util are custom util files for logging and config fetching respectively.
I also tried running with the setup_file option as python3 main.py --setup_file </path/of/setup.py>, which makes the job just freeze; it does not proceed even after 15 minutes.
How do I resolve the ModuleNotFoundError with "DataflowRunner"?
Posting as community wiki. As confirmed by @GopinathS, the error and fix are as follows:
The error encountered by the workers is: Beam SDK base version 2.32.0 does not match Dataflow Python worker version 2.28.0. Please check Dataflow worker startup logs and make sure that correct version of Beam SDK is installed.
To fix this, "apache-beam[gcp]>=2.20.0" is removed from install_requires in setup.py, since '>=' pulls the latest available version (2.32.0 as of this writing) while the workers are only on 2.28.0.
Updated setup.py:
import setuptools

setuptools.setup(
    name='dataflow_example',
    version='1.0',
    install_requires=[
        "google-cloud-tasks==2.2.0",
        "google-cloud-pubsub>=0.1.0",
        "google-cloud-storage==1.39.0",
        "google-cloud-bigquery==2.6.2",
        "google-cloud-secret-manager==2.0.0",
        "google-api-python-client==2.3.0",
        "oauth2client==4.1.3",  # removed apache-beam[gcp]>=2.20.0
        "wheel>=0.36.2"
    ],
    packages=setuptools.find_packages()
)
Updated beam_options in the pipeline code:
beam_options = {
    "project": self.project,
    "region": self.region,
    "job_name": "dataflow_example",
    "runner": "DataflowRunner",
    "temp_location": f"gs://{self.bucket}/temp_location/",
    "setup_file": "./setup.py"
}
Also make sure that you pass all the pipeline options at once and not partially.
If you pass --setup_file </path/of/setup.py> on the command line, then make sure to read the setup file path and append it to the already defined beam_options variable using an argument parser in your code.
To avoid parsing the argument and appending it to beam_options, I instead added it directly to beam_options as "setup_file": "./setup.py", as shown above.
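For reference, a minimal sketch of the argument-parsing variant (the project, region, and bucket values below are placeholders, not taken from the original code):

import argparse

from apache_beam.options.pipeline_options import PipelineOptions

# Parse only the flag we care about; everything else is left for Beam.
parser = argparse.ArgumentParser()
parser.add_argument("--setup_file", default="./setup.py")
known_args, pipeline_args = parser.parse_known_args()

beam_options = {
    "project": "my-project",                            # placeholder
    "region": "us-central1",                            # placeholder
    "job_name": "dataflow_example",
    "runner": "DataflowRunner",
    "temp_location": "gs://my-bucket/temp_location/",   # placeholder
    "setup_file": known_args.setup_file,
}
options = PipelineOptions(pipeline_args, **beam_options, save_main_session=True)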
Dataflow might have problems installing platform-locked packages in an isolated network.
It won't be able to compile them if there is no network access; perhaps it tries to install them and, since it cannot compile, falls back to downloading wheels. I am not sure of the exact mechanism.
Still, to be able to use packages like psycopg2 (binaries), or google-cloud-secret-manager (no binaries, but with dependencies that have binaries), you need to install everything that has no binaries (none-any) and no dependencies with binaries via requirements.txt, and the rest via the --extra_packages param with wheels. Example:
--extra_packages=package_1_needed_by_2-manylinux.whl \
--extra_packages=package_2_needed_by_3-manylinux.whl \
--extra_packages=what-you-need_needing_3-none-any.whl
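A minimal sketch of the same split expressed through PipelineOptions (the project, region, bucket, and wheel names are placeholders that echo the example above):

from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",                  # placeholder
    region="us-central1",                  # placeholder
    temp_location="gs://my-bucket/temp/",  # placeholder
    # pure-Python (none-any) dependencies, pip-installed on the workers
    requirements_file="requirements.txt",
    # wheels with compiled binaries, shipped to the workers as-is
    extra_packages=[
        "package_1_needed_by_2-manylinux.whl",
        "package_2_needed_by_3-manylinux.whl",
    ],
)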
Related
I would like to run spaCy lemmatization on a column within a ParDo on GCP Dataflow.
My Dataflow project is composed of 3 files: main.py, which contains the script; myfile.json, which contains the service account key; and setup.py, which contains the requirements for the project:
main.py
import apache_beam as beam
from apache_beam.io.gcp.internal.clients import bigquery
from apache_beam.options.pipeline_options import PipelineOptions
import unidecode
import string
import spacy
import os

os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "myfile.json"

table_spec = bigquery.TableReference(
    projectId='scrappers-293910',
    datasetId='mydataset',
    tableId='mytable')

options = PipelineOptions(
    job_name="lemmatize-job-offers-description-2",
    project="myproject",
    region="europe-west6",
    temp_location="gs://mygcp/options/temp_location/",
    staging_location="gs://mygcp/options/staging_location/")

nlp = spacy.load("fr_core_news_sm", disable=["tagger", "parser", "attribute_ruler", "ner", "textcat"])


class CleanText(beam.DoFn):
    def process(self, row):
        row['descriptioncleaned'] = ' '.join(unidecode.unidecode(str(row['description'])).lower().translate(str.maketrans(string.punctuation, ' '*len(string.punctuation))).split())
        yield row


class LemmaText(beam.DoFn):
    def process(self, row):
        doc = nlp(row['descriptioncleaned'])
        row['descriptionlemmatized'] = ' '.join(list(set([token.lemma_ for token in doc])))
        yield row


with beam.Pipeline(runner="DataflowRunner", options=options) as pipeline:
    soft = pipeline \
        | "ReadFromBigQuery" >> beam.io.ReadFromBigQuery(table=table_spec, gcs_location="gs://mygcp/gcs_location") \
        | "CleanText" >> beam.ParDo(CleanText()) \
        | "LemmaText" >> beam.ParDo(LemmaText()) \
        | 'WriteToBigQuery' >> beam.io.WriteToBigQuery('mybq.path', custom_gcs_temp_location="gs://mygcp/gcs_temp_location", create_disposition="CREATE_IF_NEEDED", write_disposition="WRITE_TRUNCATE")
setup.py
import setuptools

setuptools.setup(
    name='PACKAGE-NAME',
    install_requires=['spacy', 'unidecode', 'fr_core_news_lg # git+https://github.com/explosion/spacy-models/releases/download/fr_core_news_lg-3.2.0/fr_core_news_lg-3.2.0.tar.gz'],
    packages=setuptools.find_packages()
)
and I send the job to Dataflow with the following command:
python3 main.py --setup_file ./setup.py
Locally it works fine, but as soon as I send it to Dataflow, after a few minutes I get an error.
I searched for the reason and it seems to be the module dependencies.
Is it alright to import the spaCy model like I did? What am I doing wrong?
See https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/: it seems that you can use a requirements file with the requirements_file pipeline option.
Additionally, if you run into a NameError, see https://cloud.google.com/dataflow/docs/resources/faq#how_do_i_handle_nameerrors.
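A minimal sketch of that approach, assuming a requirements.txt next to main.py that lists spacy, unidecode, and the model package (the option values mirror the question; the requirements.txt layout is an assumption):

from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    job_name="lemmatize-job-offers-description-2",
    project="myproject",
    region="europe-west6",
    temp_location="gs://mygcp/options/temp_location/",
    staging_location="gs://mygcp/options/staging_location/",
    # Workers pip-install the entries of this file before running the DoFns.
    requirements_file="requirements.txt",
)

Shipping the spaCy model itself as an installable archive (for instance via --extra_package pointing at the model's release tarball) may also be worth considering, rather than embedding its URL in a comment inside an install_requires string.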
I would like to be able to run an ad-hoc python script that would access and run analytics on the model calculated by a dbt run, are there any best practices around this?
We recently built a tool that caters very much to this scenario. It leverages the ease of referencing tables from dbt in Python land. It's called fal.
The idea is that you would define the python scripts you would like to run after your dbt models are run:
# schema.yml
models:
  - name: iris
    meta:
      owner: "@matteo"
      fal:
        scripts:
          - "notify.py"
And then the file notify.py is called if the iris model was run in the last dbt run:
# notify.py
import os
from slack_sdk import WebClient
from slack_sdk.errors import SlackApiError

CHANNEL_ID = os.getenv("SLACK_BOT_CHANNEL")
SLACK_TOKEN = os.getenv("SLACK_BOT_TOKEN")

client = WebClient(token=SLACK_TOKEN)
message_text = f"""Model: {context.current_model.name}
Status: {context.current_model.status}
Owner: {context.current_model.meta['owner']}"""

try:
    response = client.chat_postMessage(
        channel=CHANNEL_ID,
        text=message_text
    )
except SlackApiError as e:
    assert e.response["error"]
Each script is run with a reference, in a context variable, to the current model for which it is running.
To start using fal, just pip install fal and start writing your python scripts.
For production, I'd recommend an orchestration layer such as Apache Airflow.
See this blog post to get started, but essentially you'll have an orchestration DAG (note - not a dbt DAG) that does something like:
dbt run <with args> -> your python code
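For reference, a minimal sketch of such an orchestration DAG in Airflow 2.x (the DAG id, dbt paths, and the run_analytics callable are placeholders, not from the original answer):

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def run_analytics():
    # Placeholder for the ad-hoc analytics script that reads the dbt models.
    pass


with DAG(
    dag_id="dbt_then_analytics",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    dbt_run = BashOperator(
        task_id="dbt_run",
        bash_command="dbt run --project-dir /opt/dbt",  # placeholder path/args
    )
    analytics = PythonOperator(
        task_id="run_analytics",
        python_callable=run_analytics,
    )

    dbt_run >> analytics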
Fair warning, though, this can add a bit of complexity to your project.
I suppose you could get a similar effect with a CI/CD tool like GitHub Actions or CircleCI.
As part of upgrading Kedro from 0.16.2 to 0.17.3 in our organization, I've made changes to the Kedro-related files in our codebase based on the Kedro starter pyspark-iris on 0.17.3.
Now I get Error: No such command 'run' when running kedro run.
setup.py
from setuptools import find_packages, setup

entry_point = "kedro-project = kedro-package.__main__:main"

# get the dependencies and installs
with open("requirements.txt", "r", encoding="utf-8") as f:
    # Make sure we strip all comments and options (e.g "--extra-index-url")
    # that arise from a modified pip.conf file that configure global options
    # when running kedro build-reqs
    requires = []
    for line in f:
        req = line.split("#", 1)[0].strip()
        if req and not req.startswith("--"):
            requires.append(req)

setup(
    name="kedro-package",
    version="0.1",
    packages=find_packages(exclude=["tests"]),
    entry_points={"console_scripts": [entry_point]},
    install_requires=requires,
    extras_require={
        "docs": [
            "sphinx~=3.4.3",
            "sphinx_rtd_theme==0.5.1",
            "nbsphinx==0.8.1",
            "nbstripout==0.3.3",
            "recommonmark==0.7.1",
            "sphinx-autodoc-typehints==1.11.1",
            "sphinx_copybutton==0.3.1",
            "jupyter_client>=5.1.0, <6.0",
            "tornado>=4.2, <6.0",
            "ipykernel~=5.3",
        ]
    },
)
main.py
from pathlib import Path
from kedro.framework.project import configure_project
import logging
from .cli import run


def main():
    package_name = str(Path(__file__).resolve().parent.name)
    logging.getLogger(__name__).info(f"package name is: {package_name}")
    configure_project(package_name=package_name)
    run()


if __name__ == "__main__":
    main()
cli.py is at the same level as main.py; both are directly inside the package (altered to kedro-package here for anonymity).
This only happens when performing kedro run on the EMR. When we run it locally we don't see that error; rather, it errors out because it can't connect to S3, which is expected.
Additionally, I've tried running
I'd like to embed my Dataflow pipeline inside a Cloud Function WITHOUT USING A TEMPLATE. I ran into an error at first, and according to this answer I should package up my code as a dependency. This is the structure of my Cloud Function:
file wb_flow.py
import apache_beam as beam
import cbsodata
from apache_beam.options.pipeline_options import PipelineOptions


def main(identifier, schema_file):
    """The main function which creates the pipeline and runs it."""
    table_name = f"wijken_en_buurten_{cbsodata.get_info(identifier)['Period']}"

    pipeline_options = PipelineOptions(
        [
            '--runner', 'DataflowRunner',
            '--project', 'veneficus',
            '--region', 'europe-west4',
            '--temp_location', 'gs://vf_etl/test',
            '--staging_location', 'gs://vf_etl/temp',
            '--setup_file', 'setup.py'
        ]
    )

    p = beam.Pipeline(options=pipeline_options)
    (p
     | 'Read from BQ Table' >> beam.Create(cbsodata.get_data(identifier))
     | 'Write Projects to BigQuery' >> beam.io.WriteToBigQuery(
         f"cbs.{table_name}",
         schema=schema,
         # Creates the table in BigQuery if it does not yet exist.
         create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
         write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE)
     )
    p.run()  # .wait_until_finish()
file main.py:
import base64

from wb_flow import main


def run(event, context):
    """The main function which creates the pipeline and runs it."""
    message = base64.b64decode(event['data']).decode('utf-8').split(',')
    identifier, schema_file = message[0], message[1]
    main(identifier, schema_file)
and setup.py:
import setuptools

setuptools.setup(
    name='wb_flow',
    version='1.0.0',
    install_requires=[],
    packages=setuptools.find_packages(),
)
I got this error during construction of the Dataflow job:
File "/layers/google.python.pip/pip/lib/python3.8/site-packages/apache_beam/runners/portability/stager.py", line 579, in _build_setup_package
    os.chdir(os.path.dirname(setup_file))
FileNotFoundError: [Errno 2] No such file or directory: ''
I believe it means that it couldn't find my setup.py. How can I specify the path to my setup file?
Alternatively, I tried to do this without setup.py, and Dataflow said it couldn't find the wb_flow module.
Update
When I specified my setup path as /workspace/setup.py, I got this error:
subprocess.CalledProcessError: Command '['/layers/google.python.pip/pip/bin/python3', 'setup.py', 'sdist', '--dist-dir', '/tmp/tmp3rxejr4g']' returned non-zero exit status 1.
Try to change
'--setup_file', 'setup.py'
into
'--setup_file', './setup.py'
This worked for me :)
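In context, the corrected options list in wb_flow.py then reads (all other values are unchanged from the question):

pipeline_options = PipelineOptions(
    [
        '--runner', 'DataflowRunner',
        '--project', 'veneficus',
        '--region', 'europe-west4',
        '--temp_location', 'gs://vf_etl/test',
        '--staging_location', 'gs://vf_etl/temp',
        '--setup_file', './setup.py'  # relative path, per the fix above
    ]
)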
The code below builds the pipeline and the DAG is generated, but I get:
RuntimeError: NotImplementedError [while running 'generatedPtransform-438']
Please let me know if there is any direct connector for MySQL in Python for Beam.
from apache_beam.options.pipeline_options import PipelineOptions
from google.cloud import pubsub_v1
from google.cloud import bigquery
import mysql.connector
import apache_beam as beam
import logging
import argparse
import sys
import re

PROJECT = "12344"
TOPIC = "projects/12344/topics/mytopic"


class insertfn(beam.Dofn):
    def insertdata(self, data):
        db_conn = mysql.connector.connect(host="localhost", user="abc", passwd="root", database="new")
        db_cursor = db_conn.cursor()
        emp_sql = " INSERT INTO emp(ename,eid,dept) VALUES (%s,%s,%s)"
        db_cusror.executemany(emp_sql, (data[0], data[1], data[2]))
        db_conn.commit()
        print(db_cursor.rowcount, "record inserted")


class Split(beam.DoFn):
    def process(self, data):
        data = data.split(",")
        return [{
            'ename': data[0],
            'eid': data[1],
            'dept': data[2]
        }]


def main(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--input_topic")
    parser.add_argument("--output")
    known_args = parser.parse_known_args(argv)

    p = beam.Pipeline(options=PipelineOptions())

    (p
     | 'ReadData' >> beam.io.ReadFromPubSub(topic=TOPIC).with_output_types(bytes)
     | "Decode" >> beam.Map(lambda x: x.decode('utf-8'))
     | 'ParseCSV' >> beam.ParDo(Split())
     | 'WriteToMySQL' >> beam.ParDo(insertfn())
     )

    result = p.run()
    result.wait_until_finish()
After our discussion in the comment section, I noticed that you are not using the proper commands to execute the DataFlow pipeline.
According to the documentation, there are mandatory flags which must be defined in order to run the pipeline on the Dataflow managed service. These flags are described below:
job_name - The name of the Dataflow job being executed.
project - The ID of your Google Cloud project.
runner - The pipeline runner that will parse your program and construct your pipeline. For cloud execution, this must be DataflowRunner.
staging_location - A Cloud Storage path for Dataflow to stage code packages needed by workers executing the job.
temp_location - A Cloud Storage path for Dataflow to stage temporary job files created during the execution of the pipeline.
In addition to these flags, there are others you can use; in your case, since you read from a Pub/Sub topic:
--input_topic: sets the input Pub/Sub topic to read messages from.
Therefore, an example to run a Dataflow pipeline would be as follows:
python RunPipelineDataflow.py \
    --job_name=jobName \
    --project=$PROJECT_NAME \
    --runner=DataflowRunner \
    --staging_location=gs://YOUR_BUCKET_NAME/AND_STAGING_DIRECTORY \
    --temp_location=gs://$BUCKET_NAME/temp \
    --input_topic=projects/$PROJECT_NAME/topics/$TOPIC_NAME
I would like to point out the importance of using DataflowRunner: it allows you to use the Cloud Dataflow managed service, providing autoscaling and dynamic work rebalancing. However, it is also possible to use DirectRunner, which executes your pipeline on your own machine and is designed to validate the pipeline.
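As a hedged sketch of how those command-line flags typically reach the pipeline (the flag names mirror the question's code; forwarding the unparsed arguments to PipelineOptions is the part being illustrated):

import argparse

from apache_beam.options.pipeline_options import PipelineOptions


def main(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--input_topic")
    parser.add_argument("--output")
    # parse_known_args returns (known, remaining): keep our own flags and
    # forward everything else (--runner, --project, --staging_location, ...)
    # to Beam so the Dataflow service options actually take effect.
    known_args, pipeline_args = parser.parse_known_args(argv)

    # streaming=True is needed because the Pub/Sub source is unbounded.
    options = PipelineOptions(pipeline_args, streaming=True)
    # ...build the pipeline with beam.Pipeline(options=options) as before,
    # reading from known_args.input_topic.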