Unable to pinpoint issue with GCP Composer (Airflow) DAG task failure - python

I am new to Apache Airflow.
Some tasks of my DAG have a failed status, and I am trying to understand the origin of the error.
Here are the details of the problem:
My DAG is pretty big, and certain parts of it are composed of SubDAGs.
What I notice in the Composer UI is that the SubDAGs that failed all did so in a task with task_id download_file, which uses XCom with a GoogleCloudStorageDownloadOperator.
>> GoogleCloudStorageDownloadOperator(
       task_id='download_file',
       bucket="sftp_sef",
       object="{{task_instance.xcom_pull(task_ids='find_file') | first }}",
       filename="/home/airflow/gcs/data/zips/{{{{ds_nodash}}}}_{0}.zip".format(table)
   )
The logs in the said SubDAGs do not show anything useful.
LOG:
[2020-04-07 15:19:25,618] {models.py:1359} INFO - Dependencies all met for
[2020-04-07 15:19:25,660] {models.py:1359} INFO - Dependencies all met for
[2020-04-07 15:19:25,660] {models.py:1577} INFO -
-------------------------------------------------------------------------------
Starting attempt 10 of 1
[2020-04-07 15:19:25,685] {models.py:1599} INFO - Executing on 2020-04-06T11:44:31+00:00
[2020-04-07 15:19:25,685] {base_task_runner.py:118} INFO - Running: ['bash', '-c', 'airflow run datamart_integration.consentement_email download_file 2020-04-06T11:44:31+00:00 --job_id 156313 --pool integration --raw -sd DAGS_FOLDER/datamart/datamart_integration.py --cfg_path /tmp/tmpacazgnve']
I am not sure if there is somewhere I am not checking... Here are my questions:
How do I debug errors in my Composer DAGs in general?
Is it a good idea to create a local Airflow environment to run and debug my DAGs locally?
How do I verify if there are errors in XCom?

Regarding your three questions:
First, when using Cloud Composer you have several ways of debugging errors in your code. According to the documentation, you should:
Check the Airflow logs.
These logs are related to single DAG tasks. It is possible to view them in the Cloud Storage logs folder and in the Airflow web interface.
When you create a Cloud Composer environment, a Cloud Storage bucket is also created and associated with it. Cloud Composer stores the logs for single DAG tasks in the logs folder inside this bucket, and each workflow has a folder for its DAGs and SubDAGs, so you can browse them directly (see the sketch after this list). You can check the folder structure here.
Regarding the Airflow web interface, it is refreshed every 60 seconds. You can read more about it here.
Review the Google Cloud's operations suite.
You can use Cloud Monitoring and Cloud Logging with Cloud Composer. Cloud Monitoring provides visibility into the performance and overall health of cloud-powered applications, while Cloud Logging shows the logs that the scheduler and worker containers produce. You can use both, or just the one you find more useful based on your needs.
In the Cloud Console, check for errors on the pages for the Google Cloud components running your environment.
In the Airflow web interface, check in the DAG's Graph View for failed task instances.
Thus, these are the steps recommended when troubleshooting your DAG.
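For the first step, here is a minimal sketch (the bucket name and DAG/task ids are placeholders, not taken from your question) of browsing the per-task log files in the environment's bucket with the google-cloud-storage client:

from google.cloud import storage

client = storage.Client()
# List the log files written for one task; each attempt shows up as a separate object.
for blob in client.list_blobs("your-composer-bucket", prefix="logs/your_dag_id/download_file/"):
    print(blob.name)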
Second, regarding testing and debugging, it is recommended that you separate the production and test environments to avoid DAG interference.
Furthermore, it is possible to test your DAG locally; there is a tutorial about this topic in the documentation, here. Testing locally allows you to identify syntax and task errors. However, I must point out that it won't be possible to check/evaluate dependencies or communication with the database.
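As one lightweight local check (the dags/ path below is a placeholder for wherever you keep your DAG files), loading them through a DagBag surfaces import and syntax errors without needing a scheduler, a database, or a Composer environment:

from airflow.models import DagBag

# Parse every DAG file in the folder and collect any import/syntax errors.
dag_bag = DagBag(dag_folder="dags/", include_examples=False)
print("Import errors:", dag_bag.import_errors)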
Third, in general, in order to verify errors in XCom you should check:
Whether there is any error code/number;
Whether your syntax is correct, comparing it with sample code from the documentation;
Whether the packages you use are deprecated.
I would like to point out that, according to this documentation, the path to GoogleCloudStorageDownloadOperator was updated and the operator is now GCSToLocalOperator.
In addition, I encourage you to have a look at this code and documentation to check XCom syntax and errors.
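One hedged way to see what is actually in XCom (this is just a sketch reusing the task ids from your question, not something from the documentation above) is a small PythonOperator placed between find_file and download_file that pulls and logs the value:

from airflow.operators.python_operator import PythonOperator

def log_xcom(**context):
    # Pull exactly what 'find_file' pushed, before download_file consumes it.
    value = context["ti"].xcom_pull(task_ids="find_file")
    print("XCom value from find_file:", value)

inspect_xcom = PythonOperator(
    task_id="inspect_xcom",
    python_callable=log_xcom,
    provide_context=True,  # needed on Airflow 1.10.x to receive the context
    dag=dag,               # the existing (Sub)DAG object; variable name assumed
)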
Feel free to share the error code with me if you need further help.

Related

Is there a way to test and debug DAGs from a local device to Composer Cloud?

I want to deploy and edit Airflow DAGs from my local machine and find errors without having to upload the DAGs first.
You are able to test a single instance in a local environment and see the log output; viewing the output lets you check for syntax errors and task errors that might occur. Note, however, that testing in a local environment does not check dependencies or communication with the database.
I would recommend putting the DAGs in a data/test folder in your test environment and following this guide that Google provides.
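For instance, here is a small sketch (bucket and file names are placeholders) of copying a DAG file into the environment bucket's data/test folder with the google-cloud-storage client, in line with the recommendation above:

from google.cloud import storage

client = storage.Client()
bucket = client.bucket("your-composer-bucket")
# Upload the local DAG file into the bucket's data/test folder for testing.
bucket.blob("data/test/my_dag.py").upload_from_filename("my_dag.py")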

What is a convenient way to deploy and manage execution of a Python SDK Apache Beam pipeline for Google Cloud Dataflow

Once an Apache Beam pipeline has been designed and tested in Google Cloud Dataflow using the Python SDK and the DataflowRunner, what is a convenient way to have it in Google Cloud and manage its execution?
Should it be somehow packaged? Uploaded to Google Storage? Should a Dataflow template be created? How can one schedule its execution beyond a developer executing it from their development environment?
Update
Preferably without third-party tools or a need for additional management tools/infrastructure beyond Google Cloud, and Dataflow in particular.
Intuitively you'd expect the "Deploying a pipeline" section under the How-to guides of the Dataflow documentation to cover this. But you find an explanation only eight sections further down, in the "Templates overview" section.
According to that section:
Cloud Dataflow templates introduce a new development and execution workflow that differs from traditional job execution workflow. The template workflow separates the development step from the staging and execution steps.
Ordinarily, you do not deploy and execute your Dataflow pipeline from Google Cloud itself. But if you need to share the execution of a pipeline with non-technical members of your team, or simply want to trigger it without depending on a development environment or third-party tools, then Dataflow templates are what you need.
Once a pipeline has been developed and tested, you can create a Dataflow job template from it.
Please note that:
To create templates with the Cloud Dataflow SDK 2.x for Python, you must have version 2.0.0 or higher.
You will need to execute your pipeline using the DataflowRunner with pipeline options that generate a template on Google Cloud Storage rather than running the job (see the sketch below).
For more details, refer to the creating templates section of the documentation; to run a job from a template, refer to the executing templates section.
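A minimal sketch of that last point (the project id, bucket paths and the trivial transform are placeholders standing in for your real pipeline): setting template_location makes the DataflowRunner stage a template to GCS instead of launching the job.

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",                                        # placeholder project id
    temp_location="gs://my-bucket/temp",                         # placeholder bucket
    template_location="gs://my-bucket/templates/my_template",    # where the template is written
)

# The pipeline body is only illustrative; your tested pipeline goes here.
with beam.Pipeline(options=options) as p:
    (p | "Create" >> beam.Create(["hello"]) | "Print" >> beam.Map(print))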
I'd say the most convenient way is to use Airflow. This allows you to author, schedule, and monitor workflows. The Dataflow Operator can start your designed data pipeline. Airflow can be started either on a small VM, or by using Cloud Composer, which is a tool on the Google Cloud Platform.
There are more options to automate your workflow, such as Jenkins, Azkaban, Rundeck, or even running a simple cron job (which I'd discourage). You might want to take a look at these options as well, but Airflow probably fits your needs.
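As an illustration of the Dataflow Operator approach mentioned above, here is a minimal sketch (the template path, project, bucket and schedule are placeholders; the import path matches Airflow 1.10.x contrib operators) of a DAG that triggers a staged template:

from datetime import datetime
from airflow import DAG
from airflow.contrib.operators.dataflow_operator import DataflowTemplateOperator

with DAG("run_dataflow_template",
         start_date=datetime(2020, 4, 1),
         schedule_interval="@daily",
         catchup=False) as dag:
    start_job = DataflowTemplateOperator(
        task_id="start_template_job",
        template="gs://my-bucket/templates/my_template",   # the staged template
        dataflow_default_options={
            "project": "my-project",
            "tempLocation": "gs://my-bucket/temp",
        },
    )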

How to integrate Airflow with Github for running scripts

If we maintain our code/scripts in a GitHub repository, is there any way to copy these scripts from the GitHub repository and execute them on some other cluster (which can be Hadoop or Spark)?
Does Airflow provide any operator to connect to GitHub for fetching such files?
Maintaining scripts in GitHub will provide more flexibility, as every change in the code will be reflected and used directly from there.
Any ideas on this scenario would really help.
You can use GitPython as part of a PythonOperator task to run the pull as per a specified schedule.
import git

# git_dir should point at the local clone that the Airflow worker keeps up to date.
g = git.cmd.Git(git_dir)
g.pull()
Don't forget to make sure that you have added the relevant keys so that the Airflow workers have permission to pull the code.
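For example, a minimal sketch (the repository path, DAG id and schedule are placeholders) of wrapping that pull in a scheduled PythonOperator:

from datetime import datetime
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
import git

REPO_DIR = "/home/airflow/gcs/dags/my_repo"  # placeholder path to the local clone

def pull_repo():
    # Fetch and merge the latest scripts from the remote the clone tracks.
    git.cmd.Git(REPO_DIR).pull()

with DAG("sync_github_scripts",
         start_date=datetime(2020, 1, 1),
         schedule_interval="*/30 * * * *",  # placeholder schedule: every 30 minutes
         catchup=False) as dag:
    git_pull = PythonOperator(task_id="git_pull", python_callable=pull_repo)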

What does an Apache Beam Dataflow job do locally?

I'm having some issues with a Dataflow job defined with the Apache Beam Python SDK. If I step through my code, it reaches the pipeline.run() step, which I assume means the execution graph was successfully defined. However, the job never registers on the Dataflow monitoring tool, which makes me think it never reaches the pipeline validation step.
I'd like to know more about what happens between these two steps to help in debugging the issue. I see output indicating that the packages in my requirements.txt and apache-beam are getting pip-installed, and it seems like something is getting pickled before being sent to Google's servers. Why is that? If I already have apache-beam downloaded, why download it again? What exactly is getting pickled?
I'm not looking for a solution to my problem here, just trying to understand the process better.
During graph construction, Dataflow checks for errors and any illegal operations in the pipeline. Once the checks are successful, the execution graph is translated into JSON and transmitted to the Dataflow service, where the JSON graph is validated and becomes a job.
However, if the pipeline is executed locally, the graph is not translated to JSON or transmitted to the Dataflow service. So the graph would not show up as a job in the monitoring tool; it will run on the local machine [1]. You can follow the documentation to configure the local machine [2].
[1] https://cloud.google.com/dataflow/service/dataflow-service-desc#pipeline-lifecycle-from-pipeline-code-to-dataflow-job
[2] https://cloud.google.com/dataflow/pipelines/specifying-exec-params#configuring-pipelineoptions-for-local-execution
The packages from requirements.txt are downloaded using pip download and staged to the staging location. Dataflow uses this staging location as a cache and looks packages up there when pip install -r requirements.txt is called on the Dataflow workers, which reduces the calls to PyPI.
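A rough illustration of the options involved (project and bucket values are placeholders): requirements_file tells the SDK which extra packages to download, and staging_location is the GCS path they are cached in and installed from on the workers.

from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",
    staging_location="gs://my-bucket/staging",   # where packages and the pickled pipeline are staged
    temp_location="gs://my-bucket/temp",
    requirements_file="requirements.txt",        # extra packages to download and cache
)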

Apache Airflow Continous Integration Workflow and Dependency management

I'm thinking of starting to use Apache Airflow for a project and am wondering how people manage continuous integration and dependencies with Airflow. More specifically, say I have the following setup:
3 Airflow servers: dev, staging, and production.
I have two Python DAGs whose source code I want to keep in separate repos.
The DAGs themselves are simple: they basically just use a PythonOperator to call main(*args, **kwargs). However, the actual code that's run by main is very large and stretches across several files/modules.
Each Python code base has different dependencies;
for example,
Dag1 uses Python 2.7, pandas==0.18.1, requests==2.13.0
Dag2 uses Python 3.6, pandas==0.20.0 and Numba==0.27, as well as some Cythonized code that needs to be compiled
How do I manage Airflow running these two DAGs with completely different dependencies?
Also, how do I manage the continuous integration of the code for both of these DAGs into each Airflow environment (dev, staging, prod)? Do I just get Jenkins or something to SSH into the Airflow server and do something like git pull origin BRANCH?
Hopefully this question isn't too vague and people see the problems I'm having.
We use Docker to run the code with different dependencies, together with the DockerOperator in the Airflow DAG, which can run Docker containers, including on remote machines (with the Docker daemon already running). We actually have only one Airflow server to run jobs, but several machines with the Docker daemon running, which the Airflow executors call.
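A minimal sketch of that setup (the image name, command, docker_url and schedule are placeholders, not our actual configuration; import path matches Airflow 1.10.x): each DAG points at an image that bakes in its own Python version and dependencies.

from datetime import datetime
from airflow import DAG
from airflow.operators.docker_operator import DockerOperator

with DAG("dag1_in_docker",
         start_date=datetime(2020, 1, 1),
         schedule_interval="@daily",
         catchup=False) as dag:
    run_main = DockerOperator(
        task_id="run_main",
        image="registry.example.com/dag1:latest",    # image with Dag1's dependencies
        command="python main.py",
        docker_url="tcp://remote-docker-host:2375",  # a remote Docker daemon
        auto_remove=True,                            # clean up the container when done
    )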
For continuous integration we use GitLab CI with the GitLab Container Registry for each repository. This should be easily doable with Jenkins as well.
