How to integrate Airflow with Github for running scripts

If we maintain our code/scripts in a GitHub repository, is there any way to copy these scripts from the repository and execute them on some other cluster (which can be Hadoop or Spark)?
Does Airflow provide any operator to connect to GitHub for fetching such files?
Maintaining scripts in GitHub would provide more flexibility, as every change in the code would be reflected and used directly from there.
Any ideas on this scenario would really help.

You can use GitPython as part of a PythonOperator task to run the pull as per a specified schedule.
import git

# git_dir: path to the local clone of the repository on the Airflow worker
g = git.cmd.Git(git_dir)
g.pull()
Don't forget to make sure that you have added the relevant keys so that the airflow workers have permission to pull the data.
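A minimal sketch of how such a scheduled pull could be wired into a DAG, assuming Airflow 2.x (where `schedule_interval` and `airflow.operators.python` are available). The repository path, DAG id, and task id are hypothetical. To keep the sketch free of extra packages, the pull shells out to the `git` CLI through `subprocess` instead of GitPython; the `g.pull()` call above could be dropped into `pull_scripts` instead.

```python
import subprocess
from datetime import datetime

# Hypothetical path to the clone on the worker; adjust to your setup.
REPO_DIR = "/opt/airflow/scripts"


def pull_scripts(repo_dir=REPO_DIR, runner=subprocess.run):
    """Fast-forward the local clone so later tasks see the latest scripts."""
    cmd = ["git", "-C", repo_dir, "pull", "--ff-only"]
    # check=True makes a failed pull raise, which in turn fails the task.
    runner(cmd, check=True)
    return cmd


def build_dag():
    """Build the DAG; Airflow is imported lazily (assumes Airflow 2.x)."""
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    dag = DAG(
        dag_id="pull_scripts_from_github",  # hypothetical name
        start_date=datetime(2024, 1, 1),
        schedule_interval="@hourly",
        catchup=False,
    )
    PythonOperator(task_id="git_pull", python_callable=pull_scripts, dag=dag)
    return dag
```

In a real DAG file you would assign `dag = build_dag()` at module level so the scheduler can discover it, and add downstream tasks that submit the pulled scripts to the target cluster.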

Related

Adding Python Libraries to Airflow-Puckel on Docker

I am new to Docker and Airflow and am having trouble figuring out the correct place to add the httplib2 Python library to the container. I am using the Airflow-Puckel image. Do I need to add it to the Dockerfile or the docker-compose yml file or both and once added do I just need to rebuild the container with up and it will run?

From my own experience while learning Airflow and Docker, I strongly recommend using the official docker-compose file maintained by Airflow. If you are taking your first steps with Docker and Airflow, the guides and docs are comprehensive and may come in very handy. Also, the official images are more likely to be updated to the latest Airflow version.
For example, once you are done with the initialization, you can take a look at this article, which explains how to add packages to each of the services run by Compose and how to set it up as production-ready. You could check this answer for an example too.
Good luck!

How to upload packages to an instance in a Processing step in Sagemaker?

I have to do large scale feature engineering on some data. My current approach is to spin up an instance using SKLearnProcessor and then scale the job by choosing a larger instance size or increasing the number of instances. I require using some packages that are not installed on Sagemaker instances by default and so I want to install the packages using .whl files.
Another hurdle is that the Sagemaker role does not have internet access.
import boto3
import sagemaker
from sagemaker import get_execution_role
from sagemaker.sklearn.processing import SKLearnProcessor

sess = sagemaker.Session()
sess.default_bucket()
region = boto3.session.Session().region_name
role = get_execution_role()

sklearn_processor = SKLearnProcessor(framework_version='0.20.0',
                                     role=role,
                                     sagemaker_session=sess,
                                     instance_type="ml.t3.medium",
                                     instance_count=1)
sklearn_processor.run(code='script.py')
Attempted resolutions:
1. Upload the packages to a CodeCommit repository and clone the repo into the SKLearnProcessor instances. Failed with error fatal: could not read Username for 'https://git-codecommit.eu-west-1.amazonaws.com': No such device or address. I tried cloning the repo into a SageMaker notebook instance and it works, so it's not a problem with my script.
2. Use a bash script to copy the packages from s3 using the CLI. The bash script I used is based off this post. But the packages never get copied, and an error message is not thrown.
3. Also looked into using the package s3fs but it didn't seem suitable to copy the wheel files.
Alternatives
My client is hesitant to spin up containers from custom docker images. Any alternatives?

2. Use a bash script to copy the packages from s3 using the CLI. The bash script I used is based off this post. But the packages never get copied, and an error message is not thrown.
This approach seems sound.
You may be better off overriding the command field on the SKLearnProcessor to /bin/bash and running a bash script like install_and_run_my_python_code.sh that installs the wheel containing your Python dependencies and then runs your main Python entry point script.
Additionally, instead of downloading your code with AWS CLI calls in a bash script, you could use a ProcessingInput, which is what the SKLearnProcessor itself uses to download your entry point script.py across all the instances.
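A rough sketch of the two suggestions combined, assuming the SageMaker Python SDK v2 where `SKLearnProcessor` accepts a `command` argument. The container paths under `/opt/ml/processing/input/` and the S3 URIs passed to `run_job` are placeholders chosen for illustration, not values from the question. `--no-index` and `--find-links` are standard pip options that keep the install offline.

```python
# Hypothetical container paths; ProcessingInput destinations live under
# /opt/ml/processing/.
WHEEL_DIR = "/opt/ml/processing/input/wheels"
CODE_DIR = "/opt/ml/processing/input/src"


def render_install_script(wheel_dir, entry_point):
    """Return a bash script that installs local wheels, then runs the entry point."""
    return "\n".join([
        "#!/bin/bash",
        "set -euo pipefail",
        "# --no-index keeps pip off the network; wheels come from the mounted input.",
        f'pip install --no-index --find-links "{wheel_dir}" "{wheel_dir}"/*.whl',
        f'python "{entry_point}"',
        "",
    ])


def run_job(role, wheels_s3_uri, code_s3_uri):
    """Submit the processing job; sagemaker is imported lazily."""
    from sagemaker.processing import ProcessingInput
    from sagemaker.sklearn.processing import SKLearnProcessor

    with open("install_and_run_my_python_code.sh", "w") as fh:
        fh.write(render_install_script(WHEEL_DIR, f"{CODE_DIR}/script.py"))

    processor = SKLearnProcessor(
        framework_version="0.20.0",
        role=role,
        instance_type="ml.t3.medium",
        instance_count=1,
        command=["/bin/bash"],  # run the bash wrapper instead of python3
    )
    processor.run(
        code="install_and_run_my_python_code.sh",
        inputs=[
            # SageMaker copies these S3 prefixes onto every instance.
            ProcessingInput(source=wheels_s3_uri, destination=WHEEL_DIR),
            ProcessingInput(source=code_s3_uri, destination=CODE_DIR),
        ],
    )
```

Because SageMaker itself performs the S3 download for each ProcessingInput, the container does not need the AWS CLI or internet access, only S3 access through the execution role.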

Unable to pinpoint issue with GCP Composer (Airflow) DAG task failure

I am new at using Apache Airflow.
Some operators of my dag have a failed status. I am trying to understand the origin of the error.
Here are the details of the problem:
My dag is pretty big, and certain parts of it are composed of sub-dags.
What I notice in the Composer UI is that the sub-DAGs that failed all failed in a task with the task_id download_file, which uses XCom with a GoogleCloudStorageDownloadOperator.
>> GoogleCloudStorageDownloadOperator(
    task_id='download_file',
    bucket="sftp_sef",
    object="{{task_instance.xcom_pull(task_ids='find_file') | first }}",
    filename="/home/airflow/gcs/data/zips/{{{{ds_nodash}}}}_{0}.zip".format(table)
)
The logs in the said Subdag do not show anything useful.
LOG :
[2020-04-07 15:19:25,618] {models.py:1359} INFO - Dependencies all met for
[2020-04-07 15:19:25,660] {models.py:1359} INFO - Dependencies all met for
[2020-04-07 15:19:25,660] {models.py:1577} INFO -
-------------------------------------------------------------------------------
Starting attempt 10 of 1
[2020-04-07 15:19:25,685] {models.py:1599} INFO - Executing on 2020-04-06T11:44:31+00:00
[2020-04-07 15:19:25,685] {base_task_runner.py:118} INFO - Running: ['bash', '-c', 'airflow run datamart_integration.consentement_email download_file 2020-04-06T11:44:31+00:00 --job_id 156313 --pool integration --raw -sd DAGS_FOLDER/datamart/datamart_integration.py --cfg_path /tmp/tmpacazgnve']
I am not sure if there is somewhere I am not checking... Here are my questions :
How do I debug errors in my Composer DAGs in general?
Is it a good idea to create a local Airflow environment to run and debug my DAGs locally?
How do I verify if there are errors in XCom?

Regarding your three questions:
First, when using Cloud Composer you have several ways of debugging errors in your code. According to the documentation, you should:
Check the Airflow logs.
These logs are related to single DAG tasks. It is possible to view them in the Cloud Storage's logs folder and in the Web Airflow interface.
When you create a Cloud Composer environment, a Cloud Storage bucket is also created and associated with it. Cloud Composer stores the logs for single DAG tasks in the logs folder inside this bucket; each workflow folder has a folder for its DAGs and sub-DAGs. You can check its structure here.
Regarding the Airflow web interface, it is refreshed every 60 seconds. You can read more about it here.
Review Google Cloud's operations suite.
You can use Cloud Monitoring and Cloud Logging with Cloud Composer. Whereas Cloud Monitoring provides visibility into the performance and overall health of cloud-powered applications, Cloud Logging shows the logs that the scheduler and worker containers produce. Therefore, you can use both or just the one you find more useful based on your need.
In the Cloud Console, check for errors on the pages for the Google Cloud components running your environment.
In the Airflow web interface, check in the DAG's Graph View for failed task instances.
Thus, these are the steps recommended when troubleshooting your DAG.
Second, regarding testing and debugging, it is recommended that you separate production and test environments to avoid DAG interference.
Furthermore, it is possible to test your DAG locally; there is a tutorial in the documentation about this topic, here. Testing locally allows you to identify syntax and task errors. However, I must point out that it won't be possible to check or evaluate dependencies and communication with the database.
Third, in general, in order to verify errors in XCom you should check:
Whether there is any error code/number;
Whether your syntax is correct, by comparing it with sample code from the documentation;
Whether the packages are deprecated.
I would like to point out that, according to this documentation, GoogleCloudStorageDownloadOperator was renamed to GCSToLocalOperator.
In addition, I also encourage you to have a look at this code and documentation to check XCom syntax and errors.
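One way to check the XCom value itself is to add a small task between find_file and download_file that pulls the value and logs it. The sketch below is an assumption about how that could look, not something taken from the documentation: the function name `log_found_file` is hypothetical, and on Airflow 1.10 the PythonOperator would need `provide_context=True` for `ti` to be passed in.

```python
def log_found_file(ti, **_):
    """Pull the XCom pushed by `find_file` and fail loudly if it is unusable."""
    value = ti.xcom_pull(task_ids="find_file")
    # This line ends up in the task log, where it can be inspected.
    print(f"find_file returned: {value!r}")
    # The download template applies `| first`, so the value must be a
    # non-empty sequence; None or [] would break the rendered object name.
    if not value:
        raise ValueError("find_file pushed no XCom value (None or empty)")
    return value[0]
```

If this task fails, the problem lies in what find_file pushes rather than in the download operator, which narrows down where to look in the logs.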
Feel free to share the error code with me if you need further help.

Using gitlab cicd to automatically merge branches

GitLab has pipelines that execute code whenever you push to your project.
This is configured through the .gitlab-ci.yml file.
I am trying to make the pipeline merge all branches with the prefix "ready/".
I have written a Python program that does this locally, but it won't work on the remote GitLab Docker runner, because "git branch -a" only lists "* and master" as branches there.
I have tried checking out master, but that doesn't work.
Is this even possible in a GitLab pipeline? How should I go forward?

There are a couple of ways to achieve this, depending on what credentials you want to use, what you prefer, and what is better suited to your use case.
1. Use SSH in CI/CD (with SSH keys) to use your standard git commands to pull, do whatever, then push to the repo as part of a pipeline job.
2. Use the merge requests API, which requires a personal access token. The API allows you to create, accept, and merge a merge request.
If you have a lot of branches, then you may want to use the first method.
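As a rough illustration of the API route, the sketch below finds the "ready/" branches from the output of `git ls-remote --heads origin` (which asks the remote directly, so it does not depend on what the runner checked out) and builds the request that creates a merge request for one of them. The helper names are hypothetical; the endpoint and the `PRIVATE-TOKEN` header follow the GitLab merge requests API, and the target branch `master` is taken from the question.

```python
import json
import urllib.request


def ready_branches(ls_remote_output, prefix="ready/"):
    """Extract branch names starting with `prefix` from `git ls-remote --heads` output."""
    head = "refs/heads/"
    names = []
    for line in ls_remote_output.splitlines():
        # Each line looks like "<sha>\trefs/heads/<branch>".
        _sha, _, ref = line.partition("\t")
        if ref.startswith(head + prefix):
            names.append(ref[len(head):])
    return names


def create_mr_request(api_url, project_id, token, source, target="master"):
    """Build (but do not send) the POST that opens a merge request."""
    body = json.dumps({
        "source_branch": source,
        "target_branch": target,
        "title": f"Merge {source} into {target}",
    }).encode()
    return urllib.request.Request(
        f"{api_url}/projects/{project_id}/merge_requests",
        data=body,
        headers={"PRIVATE-TOKEN": token, "Content-Type": "application/json"},
        method="POST",
    )
```

Inside a job, `api_url` and `project_id` could come from the predefined `CI_API_V4_URL` and `CI_PROJECT_ID` variables, and the token from a masked CI/CD variable. Sending the request with `urllib.request.urlopen` returns JSON containing the merge request `iid`, which is then used in a `PUT` to `/projects/:id/merge_requests/:iid/merge` to accept it.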

What does an Apache Beam Dataflow job do locally?

I'm having some issues with a Dataflow pipeline defined with the Apache Beam Python SDK. If I step through my code it reaches the pipeline.run() step, which I assume means the execution graph was successfully defined. However, the job never registers on the Dataflow monitoring tool, which, from this, makes me think it never reaches the pipeline validation step.
I'd like to know more about what happens between these two steps to help in debugging the issue. I see output indicating packages in my requirements.txt and apache-beam are getting pip installed and it seems like something is getting pickled before being sent to Google's servers. Why is that? If I already have apache-beam downloaded, why download it again? What exactly is getting pickled?
I'm not looking for a solution to my problem here, just trying to understand the process better.

During graph construction, Dataflow checks for errors and any illegal operations in the pipeline. Once the checks are successful, the execution graph is translated into JSON and transmitted to the Dataflow service. In the Dataflow service the JSON graph is validated and it becomes a job.
However, if the pipeline is executed locally, the graph is not translated to JSON or transmitted to the Dataflow service. So the graph will not show up as a job in the monitoring tool; it will run on the local machine [1]. You can follow the documentation to configure the local machine [2].
[1] https://cloud.google.com/dataflow/service/dataflow-service-desc#pipeline-lifecycle-from-pipeline-code-to-dataflow-job
[2] https://cloud.google.com/dataflow/pipelines/specifying-exec-params#configuring-pipelineoptions-for-local-execution

The packages from requirements.txt are downloaded using pip download and staged to the staging location. Dataflow uses this staging location as a cache: packages are looked up there when pip install -r requirements.txt is called on the Dataflow workers, which reduces calls to PyPI.
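Whether a pipeline runs locally or is submitted as a Dataflow job is decided by its pipeline options. The helper below is a hypothetical sketch that only assembles the command-line style flags; the flag names (`--runner`, `--project`, `--region`, `--temp_location`, `--staging_location`, `--requirements_file`) are standard Beam options, while the project, region, and bucket values are placeholders.

```python
def beam_args(runner, project=None, region=None, bucket=None,
              requirements="requirements.txt"):
    """Assemble Beam pipeline flags for a local or a Dataflow run."""
    args = [f"--runner={runner}"]
    if runner == "DataflowRunner":
        args += [
            f"--project={project}",
            f"--region={region}",
            f"--temp_location=gs://{bucket}/tmp",
            # Downloaded requirement packages are uploaded here before the job starts.
            f"--staging_location=gs://{bucket}/staging",
            f"--requirements_file={requirements}",
        ]
    return args
```

The resulting list can be passed to `PipelineOptions` from `apache_beam.options.pipeline_options`. With `beam_args("DirectRunner")` the pipeline runs on the local machine and never appears in the monitoring tool; with `DataflowRunner` the requirements are staged and the job is submitted to the service.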
