I'm trying to run the SqlSensor locally under Docker on a Windows 10 machine. It runs fine on Linux, but I get the errors below when running the same simple DAG locally.
I'm setting this up so that I can develop and test locally and speed up the development cycle.
Error from Airflow log:
[2018-05-22 08:27:04,929] {{models.py:1428}} INFO - Executing <Task(SqlSensor): limits_test> on 2018-05-21 08:00:00
[2018-05-22 08:27:04,929] {{base_task_runner.py:115}} INFO - Running: ['bash', '-c', 'airflow run sql-sensor-test-dag limits_test 2018-05-21T08:00:00 --job_id 8 --raw -sd DAGS_FOLDER/sql_sensor_test.py']
[2018-05-22 08:27:05,685] {{base_task_runner.py:98}} INFO - Subtask: [2018-05-22 08:27:05,684] {{__init__.py:45}} INFO - Using executor CeleryExecutor
[2018-05-22 08:27:05,749] {{base_task_runner.py:98}} INFO - Subtask: [2018-05-22 08:27:05,749] {{models.py:189}} INFO - Filling up the DagBag from /usr/local/airflow/dags/sql_sensor_test.py
[2018-05-22 08:27:05,791] {{cli.py:374}} INFO - Running on host 0f8e7a60dbab
[2018-05-22 08:27:05,858] {{base_task_runner.py:98}} INFO - Subtask: [2018-05-22 08:27:05,858] {{base_hook.py:80}} INFO - Using connection to: LABCHGVA-SQL295
[2018-05-22 08:27:05,888] {{base_task_runner.py:98}} INFO - Subtask: [2018-05-22 08:27:05,888] {{sensors.py:111}} INFO - Poking: SELECT max(snapshot_id) FROM limits_run
[2018-05-22 08:27:05,896] {{base_task_runner.py:98}} INFO - Subtask: [2018-05-22 08:27:05,896] {{base_hook.py:80}} INFO - Using connection to: LABCHGVA-SQL295
[2018-05-22 08:27:05,924] {{models.py:1595}} ERROR - Connection to the database failed for an unknown reason.
Traceback (most recent call last):
File "pymssql.pyx", line 635, in pymssql.connect (pymssql.c:10734)
File "_mssql.pyx", line 1902, in _mssql.connect (_mssql.c:21821)
File "_mssql.pyx", line 638, in _mssql.MSSQLConnection.__init__ (_mssql.c:6594)
_mssql.MSSQLDriverException: Connection to the database failed for an unknown reason.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/airflow/models.py", line 1493, in _run_raw_task
result = task_copy.execute(context=context)
File "/usr/local/lib/python3.6/site-packages/airflow/operators/sensors.py", line 78, in execute
while not self.poke(context):
File "/usr/local/lib/python3.6/site-packages/airflow/operators/sensors.py", line 112, in poke
records = hook.get_records(self.sql)
File "/usr/local/lib/python3.6/site-packages/airflow/hooks/dbapi_hook.py", line 106, in get_records
with closing(self.get_conn()) as conn:
File "/usr/local/lib/python3.6/site-packages/airflow/hooks/mssql_hook.py", line 43, in get_conn
port=conn.port)
File "pymssql.pyx", line 644, in pymssql.connect (pymssql.c:10892)
pymssql.InterfaceError: Connection to the database failed for an unknown reason.
Using this Docker image:
FROM puckel/docker-airflow:1.9.0-2
USER root
RUN apt-get update
RUN apt-get install freetds-dev -yqq && \
pip install apache-airflow[mssql]
USER airflow
And the following simple DAG:
from datetime import timedelta

import airflow
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.sensors import SqlSensor

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'catchup': False,
    'start_date': airflow.utils.dates.days_ago(1),
    'email': ['myemail@company.com'],
    'email_on_failure': True,
    'email_on_retry': True,
    'retries': 10,
    'retry_delay': timedelta(minutes=15),
    'sla': timedelta(hours=3)
}

dag = DAG(
    'sql-sensor-test-dag',
    default_args=default_args,
    description='Sensor tests',
    schedule_interval='0 8 * * *'
    # schedule_interval='@once'
)

with dag:
    sql_sensor = SqlSensor(
        task_id='limits_test',
        conn_id='bpeak_limits_ro',
        sql="SELECT max(snapshot_id) FROM limits_run"
    )

    done = DummyOperator(task_id='done')

    sql_sensor >> done
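For reference, the MsSqlHook in that traceback just calls pymssql.connect with the connection's host, login, password, schema and port, so a quick way to narrow things down is to run the same call by hand inside the worker container (docker exec into it and paste a few lines of Python). This is only a diagnostic sketch; the credentials and database below are placeholders for whatever the bpeak_limits_ro connection actually stores:

import pymssql

# Placeholders -- substitute the values stored in the bpeak_limits_ro Airflow connection.
conn = pymssql.connect(
    server='LABCHGVA-SQL295',   # host shown in the log above
    user='my_user',             # placeholder login
    password='my_password',     # placeholder password
    database='my_database',     # placeholder schema
    port=1433                   # assumed default MSSQL port
)
cursor = conn.cursor()
cursor.execute("SELECT max(snapshot_id) FROM limits_run")
print(cursor.fetchone())
conn.close()

If this raises the same MSSQLDriverException, the problem is FreeTDS/network resolution from inside the container rather than anything Airflow-specific.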
Related
[2023-02-05 11:32:43,293] {{taskinstance.py:887}} INFO - Executing <Task(PythonOperator): download_from_s3> on 2023-02-05T11:32:34.016335+00:00
[2023-02-05 11:32:43,299] {{standard_task_runner.py:53}} INFO - Started process 87503 to run task
[2023-02-05 11:32:43,474] {{logging_mixin.py:112}} INFO - Running %s on host %s <TaskInstance: s3_download.download_from_s3 2023-02-05T11:32:34.016335+00:00 [running]> 67c7842be21b
[2023-02-05 11:32:43,555] {{taskinstance.py:1128}} ERROR - 'S3Hook' object has no attribute 'download_file'
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 966, in _run_raw_task
result = task_copy.execute(context=context)
File "/usr/local/lib/python3.7/site-packages/airflow/operators/python_operator.py", line 113, in execute
return_value = self.execute_callable()
File "/usr/local/lib/python3.7/site-packages/airflow/operators/python_operator.py", line 118, in execute_callable
return self.python_callable(*self.op_args, **self.op_kwargs)
File "/usr/local/airflow/dags/dwnld_frm_awss3.py", line 12, in download_from_s3
file_name = hook.download_file(key=key, bucket_name=bucket_name, local_path=local_path)
AttributeError: 'S3Hook' object has no attribute 'download_file'
[2023-02-05 11:32:43,570] {{taskinstance.py:1185}} INFO - Marking task as FAILED.dag_id=s3_download, task_id=download_from_s3, execution_date=20230205T113234, start_date=20230205T113243, end_date=20230205T113243
I'm getting the download_file error shown above.
My code is:
import os
from datetime import datetime

from airflow.models import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.hooks.S3_hook import S3Hook
from airflow.contrib.hooks.aws_hook import AwsHook

# Function of the DAG
def download_from_s3(key: str, bucket_name: str, local_path: str) -> str:
    hook = S3Hook('my_conn_S3')
    file_name = hook.download_file(key=key, bucket_name=bucket_name, local_path=local_path)
    return file_name

with DAG(
    dag_id='s3_download',
    schedule_interval='@daily',
    start_date=datetime(2023, 2, 4),
    catchup=False
) as dag:
    task_download_from_s3 = PythonOperator(
        task_id='download_from_s3',
        python_callable=download_from_s3,
        op_kwargs={
            'key': 'sample.txt',
            'bucket_name': 'airflow-sample-s3-bucket',
            'local_path': '/usr/local/airflow/'
        }
    )
The imports suggest that you are using an older version of Airflow.
You should install the Amazon backport provider and then import the hook as from airflow.providers.amazon.aws.hooks.s3 import S3Hook.
Note that Airflow 1.10 has been end-of-life for more than two years; you should upgrade your Airflow version as soon as possible. To upgrade Airflow you can follow this guide.
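Roughly, the fix amounts to installing the provider package (on Airflow 1.10.x the backport package, pip install apache-airflow-backport-providers-amazon) and switching the import; the provider's S3Hook does expose download_file. A minimal sketch of the adjusted DAG, keeping the original connection id and op_kwargs:

from datetime import datetime

from airflow.models import DAG
from airflow.operators.python_operator import PythonOperator
# Provider/backport import instead of airflow.hooks.S3_hook
from airflow.providers.amazon.aws.hooks.s3 import S3Hook


def download_from_s3(key: str, bucket_name: str, local_path: str) -> str:
    hook = S3Hook('my_conn_S3')
    # download_file is available on the provider version of S3Hook
    return hook.download_file(key=key, bucket_name=bucket_name, local_path=local_path)


with DAG(dag_id='s3_download', schedule_interval='@daily',
         start_date=datetime(2023, 2, 4), catchup=False) as dag:
    task_download_from_s3 = PythonOperator(
        task_id='download_from_s3',
        python_callable=download_from_s3,
        op_kwargs={'key': 'sample.txt',
                   'bucket_name': 'airflow-sample-s3-bucket',
                   'local_path': '/usr/local/airflow/'},
    )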
I'm trying to run a DAG that calls a Docker container and executes a command inside it, but Airflow cannot execute the task. The error is below:
*** Reading local file: /opt/airflow/logs/docker_operator_dag/docker_command_hello/2021-05-26T02:40:13.171571+00:00/2.log
[2021-05-26 02:45:26,001] {taskinstance.py:877} INFO - Dependencies all met for <TaskInstance: docker_operator_dag.docker_command_hello 2021-05-26T02:40:13.171571+00:00 [queued]>
[2021-05-26 02:45:26,030] {taskinstance.py:877} INFO - Dependencies all met for <TaskInstance: docker_operator_dag.docker_command_hello 2021-05-26T02:40:13.171571+00:00 [queued]>
[2021-05-26 02:45:26,031] {taskinstance.py:1068} INFO -
--------------------------------------------------------------------------------
[2021-05-26 02:45:26,031] {taskinstance.py:1069} INFO - Starting attempt 2 of 2
[2021-05-26 02:45:26,032] {taskinstance.py:1070} INFO -
--------------------------------------------------------------------------------
[2021-05-26 02:45:26,060] {taskinstance.py:1089} INFO - Executing <Task(DockerOperator): docker_command_hello> on 2021-05-26T02:40:13.171571+00:00
[2021-05-26 02:45:26,080] {standard_task_runner.py:52} INFO - Started process 67 to run task
[2021-05-26 02:45:26,083] {standard_task_runner.py:76} INFO - Running: ['airflow', 'tasks', 'run', 'docker_operator_dag', 'docker_command_hello', '2021-05-26T02:40:13.171571+00:00', '--job-id', '11', '--pool', 'default_pool', '--raw', '--subdir', 'DAGS_FOLDER/docker_job/docker-job.py', '--cfg-path', '/tmp/tmp3vl5dv7x', '--error-file', '/tmp/tmptazwx_tc']
[2021-05-26 02:45:26,084] {standard_task_runner.py:77} INFO - Job 11: Subtask docker_command_hello
[2021-05-26 02:45:26,181] {logging_mixin.py:104} INFO - Running <TaskInstance: docker_operator_dag.docker_command_hello 2021-05-26T02:40:13.171571+00:00 [running]> on host f1e5cfe4a07f
[2021-05-26 02:45:26,219] {taskinstance.py:1283} INFO - Exporting the following env vars:
AIRFLOW_CTX_DAG_OWNER=airflow
AIRFLOW_CTX_DAG_ID=docker_operator_dag
AIRFLOW_CTX_TASK_ID=docker_command_hello
AIRFLOW_CTX_EXECUTION_DATE=2021-05-26T02:40:13.171571+00:00
AIRFLOW_CTX_DAG_RUN_ID=manual__2021-05-26T02:40:13.171571+00:00
[2021-05-26 02:45:26,227] {taskinstance.py:1482} ERROR - Task failed with exception
Traceback (most recent call last):
File "/home/airflow/.local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 677, in urlopen
chunked=chunked,
File "/home/airflow/.local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 392, in _make_request
conn.request(method, url, **httplib_request_kw)
File "/usr/local/lib/python3.6/http/client.py", line 1287, in request
self._send_request(method, url, body, headers, encode_chunked)
File "/usr/local/lib/python3.6/http/client.py", line 1333, in _send_request
self.endheaders(body, encode_chunked=encode_chunked)
File "/usr/local/lib/python3.6/http/client.py", line 1282, in endheaders
self._send_output(message_body, encode_chunked=encode_chunked)
File "/usr/local/lib/python3.6/http/client.py", line 1042, in _send_output
self.send(msg)
File "/usr/local/lib/python3.6/http/client.py", line 980, in send
self.connect()
File "/home/airflow/.local/lib/python3.6/site-packages/docker/transport/unixconn.py", line 43, in connect
sock.connect(self.unix_socket)
ConnectionRefusedError: [Errno 111] Connection refused
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/airflow/.local/lib/python3.6/site-packages/requests/adapters.py", line 449, in send
timeout=timeout
File "/home/airflow/.local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 727, in urlopen
method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
File "/home/airflow/.local/lib/python3.6/site-packages/urllib3/util/retry.py", line 410, in increment
raise six.reraise(type(error), error, _stacktrace)
File "/home/airflow/.local/lib/python3.6/site-packages/urllib3/packages/six.py", line 734, in reraise
raise value.with_traceback(tb)
File "/home/airflow/.local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 677, in urlopen
chunked=chunked,
File "/home/airflow/.local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 392, in _make_request
conn.request(method, url, **httplib_request_kw)
File "/usr/local/lib/python3.6/http/client.py", line 1287, in request
self._send_request(method, url, body, headers, encode_chunked)
File "/usr/local/lib/python3.6/http/client.py", line 1333, in _send_request
self.endheaders(body, encode_chunked=encode_chunked)
File "/usr/local/lib/python3.6/http/client.py", line 1282, in endheaders
self._send_output(message_body, encode_chunked=encode_chunked)
File "/usr/local/lib/python3.6/http/client.py", line 1042, in _send_output
self.send(msg)
File "/usr/local/lib/python3.6/http/client.py", line 980, in send
self.connect()
File "/home/airflow/.local/lib/python3.6/site-packages/docker/transport/unixconn.py", line 43, in connect
sock.connect(self.unix_socket)
urllib3.exceptions.ProtocolError: ('Connection aborted.', ConnectionRefusedError(111, 'Connection refused'))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/airflow/.local/lib/python3.6/site-packages/docker/api/client.py", line 207, in _retrieve_server_version
return self.version(api_version=False)["ApiVersion"]
File "/home/airflow/.local/lib/python3.6/site-packages/docker/api/daemon.py", line 181, in version
return self._result(self._get(url), json=True)
File "/home/airflow/.local/lib/python3.6/site-packages/docker/utils/decorators.py", line 46, in inner
return f(self, *args, **kwargs)
File "/home/airflow/.local/lib/python3.6/site-packages/docker/api/client.py", line 230, in _get
return self.get(url, **self._set_request_timeout(kwargs))
File "/home/airflow/.local/lib/python3.6/site-packages/requests/sessions.py", line 555, in get
return self.request('GET', url, **kwargs)
File "/home/airflow/.local/lib/python3.6/site-packages/requests/sessions.py", line 542, in request
resp = self.send(prep, **send_kwargs)
File "/home/airflow/.local/lib/python3.6/site-packages/requests/sessions.py", line 655, in send
r = adapter.send(request, **kwargs)
File "/home/airflow/.local/lib/python3.6/site-packages/requests/adapters.py", line 498, in send
raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', ConnectionRefusedError(111, 'Connection refused'))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/airflow/.local/lib/python3.6/site-packages/airflow/models/taskinstance.py", line 1138, in _run_raw_task
self._prepare_and_execute_task_with_callbacks(context, task)
File "/home/airflow/.local/lib/python3.6/site-packages/airflow/models/taskinstance.py", line 1311, in _prepare_and_execute_task_with_callbacks
result = self._execute_task(context, task_copy)
File "/home/airflow/.local/lib/python3.6/site-packages/airflow/models/taskinstance.py", line 1341, in _execute_task
result = task_copy.execute(context=context)
File "/home/airflow/.local/lib/python3.6/site-packages/airflow/providers/docker/operators/docker.py", line 287, in execute
self.cli = self._get_cli()
File "/home/airflow/.local/lib/python3.6/site-packages/airflow/providers/docker/operators/docker.py", line 319, in _get_cli
return APIClient(base_url=self.docker_url, version=self.api_version, tls=tls_config)
File "/home/airflow/.local/lib/python3.6/site-packages/docker/api/client.py", line 190, in __init__
self._version = self._retrieve_server_version()
File "/home/airflow/.local/lib/python3.6/site-packages/docker/api/client.py", line 215, in _retrieve_server_version
'Error while fetching server API version: {0}'.format(e)
docker.errors.DockerException: Error while fetching server API version: ('Connection aborted.', ConnectionRefusedError(111, 'Connection refused'))
[2021-05-26 02:45:26,230] {taskinstance.py:1532} INFO - Marking task as FAILED. dag_id=docker_operator_dag, task_id=docker_command_hello, execution_date=20210526T024013, start_date=20210526T024526, end_date=20210526T024526
[2021-05-26 02:45:26,261] {local_task_job.py:146} INFO - Task exited with return code 1
I'm using the Airflow docker-compose found here https://airflow.apache.org/docs/apache-airflow/stable/start/docker.html and trying to run the following DAG:
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.docker_operator import DockerOperator
from airflow.operators.python_operator import BranchPythonOperator
from airflow.operators.dummy_operator import DummyOperator

default_args = {
    'owner': 'airflow',
    'description': 'Use of the DockerOperator',
    'depend_on_past': False,
    'start_date': datetime(2021, 5, 1),
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5)
}

with DAG('docker_operator_dag', default_args=default_args, schedule_interval="5 * * * *", catchup=False) as dag:
    start_dag = DummyOperator(
        task_id='start_dag'
    )

    end_dag = DummyOperator(
        task_id='end_dag'
    )

    t1 = BashOperator(
        task_id='print_current_date',
        bash_command='date'
    )

    t2 = DockerOperator(
        task_id='docker_command_sleep',
        image='docker_image_task',
        container_name='task___command_sleep',
        api_version='auto',
        auto_remove=True,
        command="/bin/sleep 30",
        docker_url="unix://var/run/docker.sock",
        network_mode="bridge"
    )

    t3 = DockerOperator(
        task_id='docker_command_hello',
        image='docker_image_task',
        container_name='task___command_hello',
        api_version='auto',
        auto_remove=True,
        command="/bin/sleep 40",
        docker_url="unix://var/run/docker.sock",
        network_mode="bridge"
    )

    t4 = BashOperator(
        task_id='print_hello',
        bash_command='echo "hello world"'
    )

    start_dag >> t1
    t1 >> t2 >> t4
    t1 >> t3 >> t4
    t4 >> end_dag
Also, I'm using Windows 10. I've tried adding the following to volumes: - "/var/run/docker.sock:/var/run/docker.sock", and it works on Ubuntu but not on Windows.
As requested, here is the docker-compose file:
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.
#
# Basic Airflow cluster configuration for CeleryExecutor with Redis and PostgreSQL.
#
# WARNING: This configuration is for local development. Do not use it in a production deployment.
#
# This configuration supports basic configuration using environment variables or an .env file
# The following variables are supported:
#
# AIRFLOW_IMAGE_NAME - Docker image name used to run Airflow.
# Default: apache/airflow:master-python3.8
# AIRFLOW_UID - User ID in Airflow containers
# Default: 50000
# AIRFLOW_GID - Group ID in Airflow containers
# Default: 50000
# _AIRFLOW_WWW_USER_USERNAME - Username for the administrator account.
# Default: airflow
# _AIRFLOW_WWW_USER_PASSWORD - Password for the administrator account.
# Default: airflow
#
# Feel free to modify this file to suit your needs.
---
version: '3'
x-airflow-common:
  &airflow-common
  image: ${AIRFLOW_IMAGE_NAME:-apache/airflow:2.0.2}
  environment:
    &airflow-common-env
    AIRFLOW__CORE__EXECUTOR: CeleryExecutor
    AIRFLOW__CORE__SQL_ALCHEMY_CONN: postgresql+psycopg2://airflow:airflow@postgres/airflow
    AIRFLOW__CELERY__RESULT_BACKEND: db+postgresql://airflow:airflow@postgres/airflow
    AIRFLOW__CELERY__BROKER_URL: redis://:@redis:6379/0
    AIRFLOW__CORE__FERNET_KEY: ''
    AIRFLOW__CORE__DAGS_ARE_PAUSED_AT_CREATION: 'true'
    AIRFLOW__CORE__LOAD_EXAMPLES: 'false'
    AIRFLOW__SCHEDULER__DAG_DIR_LIST_INTERVAL: 5 # Just to have a fast load in the front-end. Do not use it in production with those configurations.
    AIRFLOW__API__AUTH_BACKEND: 'airflow.api.auth.backend.basic_auth'
    AIRFLOW__CORE__ENABLE_XCOM_PICKLING: 'true' # "_run_image of the DockerOperator returns now a python string, not a byte string" Ref: https://github.com/apache/airflow/issues/13487
  volumes:
    - ./dags:/opt/airflow/dags
    - ./logs:/opt/airflow/logs
    - ./plugins:/opt/airflow/plugins
    - "/var/run/docker.sock:/var/run/docker.sock" # We will pass the Docker Deamon as a volume to allow the webserver containers start docker images. Ref: https://stackoverflow.com/q/51342810/7024760
  user: "${AIRFLOW_UID:-50000}:${AIRFLOW_GID:-50000}"
  depends_on:
    redis:
      condition: service_healthy
    postgres:
      condition: service_healthy

services:
  postgres:
    image: postgres:13
    environment:
      POSTGRES_USER: airflow
      POSTGRES_PASSWORD: airflow
      POSTGRES_DB: airflow
    volumes:
      - postgres-db-volume:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD", "pg_isready", "-U", "airflow"]
      interval: 5s
      retries: 5
    restart: always

  redis:
    image: redis:latest
    ports:
      - 6379:6379
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 5s
      timeout: 30s
      retries: 50
    restart: always

  airflow-webserver:
    <<: *airflow-common
    command: webserver
    ports:
      - 8080:8080
    healthcheck:
      test: ["CMD", "curl", "--fail", "http://localhost:8080/health"]
      interval: 10s
      timeout: 10s
      retries: 5
    restart: always

  airflow-scheduler:
    <<: *airflow-common
    command: scheduler
    restart: always

  airflow-worker:
    <<: *airflow-common
    command: celery worker
    restart: always

  airflow-init:
    <<: *airflow-common
    command: version
    environment:
      <<: *airflow-common-env
      _AIRFLOW_DB_UPGRADE: 'true'
      _AIRFLOW_WWW_USER_CREATE: 'true'
      _AIRFLOW_WWW_USER_USERNAME: ${_AIRFLOW_WWW_USER_USERNAME:-airflow}
      _AIRFLOW_WWW_USER_PASSWORD: ${_AIRFLOW_WWW_USER_PASSWORD:-airflow}

  flower:
    <<: *airflow-common
    command: celery flower
    ports:
      - 5555:5555
    healthcheck:
      test: ["CMD", "curl", "--fail", "http://localhost:5555/"]
      interval: 10s
      timeout: 10s
      retries: 5
    restart: always

volumes:
  postgres-db-volume:
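The traceback shows the task fails before any container is created: DockerOperator builds a docker APIClient against unix://var/run/docker.sock, and the very first version() call is refused. A quick way to reproduce that check from inside the worker (docker exec into the airflow-worker container and run it) is sketched below; if it raises the same connection error, the mounted socket is not backed by a reachable Docker daemon on the Windows host:

import docker

try:
    # Same call chain the DockerOperator uses (APIClient -> version), per the traceback above.
    client = docker.APIClient(base_url='unix://var/run/docker.sock')
    print(client.version()['ApiVersion'])
except docker.errors.DockerException as exc:
    print('Docker daemon not reachable through the mounted socket:', exc)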
I am pretty new to Airflow and am trying to run an ETL process every 5 minutes. I have an Airflow DAG that I am trying to schedule to run every 5 minutes, but the DAG fails with the error message ERROR - Bash command failed, Permission Denied.
The DAG is basically an ETL process with one BashOperator (which fails) and three PythonOperators that run downstream of the BashOperator.
from airflow import DAG
from datetime import datetime, timedelta
from airflow.operators.python_operator import PythonOperator
from airflow.operators.bash_operator import BashOperator
from airflow.contrib.sensors.file_sensor import FileSensor
from bin.int_medications import int_meds_auto_updt, storage, insert, del_stag, int_med_stag_clean

DAG_DEFAULT_ARGS = {
    'owner': 'airflow',
    'depends_on_past': False,
    'retires': 1,
}

dag3 = DAG(dag_id='int_meds_dag_v1',
           start_date=datetime(2019, 10, 10),
           default_args=DAG_DEFAULT_ARGS,
           schedule_interval='*/5 * * * *',
           catchup=False)

cmd_command = "/home/akash/airflow/dags/bin/int_medications/int_meds_auto_updt.py"

data_loading = BashOperator(
    task_id="int_meds",
    bash_command=cmd_command,
    dag=dag3)

data_cleaning = PythonOperator(task_id='data_cleaning', python_callable=int_med_stag_clean.clean_stag)
data_insert = PythonOperator(task_id='data_insert', python_callable=insert.insert_stag)
data_delete = PythonOperator(task_id='data_delete', python_callable=del_stag.delete_stag)

data_loading >> data_cleaning >> data_insert >> data_delete
The code for the DAG file is above, and the error message is below.
*** Reading local file: /home/akash/airflow/logs/int_meds_dag_v1/int_meds/2019-10-10T14:45:00+00:00/1.log
[2019-10-10 10:50:26,649] {__init__.py:1139} INFO - Dependencies all met for <TaskInstance: int_meds_dag_v1.int_meds 2019-10-10T14:45:00+00:00 [queued]>
[2019-10-10 10:50:26,652] {__init__.py:1139} INFO - Dependencies all met for <TaskInstance: int_meds_dag_v1.int_meds 2019-10-10T14:45:00+00:00 [queued]>
[2019-10-10 10:50:26,652] {__init__.py:1353} INFO -
--------------------------------------------------------------------------------
[2019-10-10 10:50:26,652] {__init__.py:1354} INFO - Starting attempt 1 of 1
[2019-10-10 10:50:26,652] {__init__.py:1355} INFO -
--------------------------------------------------------------------------------
[2019-10-10 10:50:26,659] {__init__.py:1374} INFO - Executing <Task(BashOperator): int_meds> on 2019-10-10T14:45:00+00:00
[2019-10-10 10:50:26,659] {base_task_runner.py:119} INFO - Running: ['airflow', 'run', 'int_meds_dag_v1', 'int_meds', '2019-10-10T14:45:00+00:00', '--job_id', '15495', '--raw', '-sd', 'DAGS_FOLDER/int_med_dag.py', '--cfg_path', '/tmp/tmpenegd6zi']
[2019-10-10 10:50:28,319] {base_task_runner.py:101} INFO - Job 15495: Subtask int_meds [2019-10-10 10:50:28,318] {__init__.py:51} INFO - Using executor SequentialExecutor
[2019-10-10 10:50:28,436] {base_task_runner.py:101} INFO - Job 15495: Subtask int_meds [2019-10-10 10:50:28,436] {__init__.py:305} INFO - Filling up the DagBag from /home/akash/airflow/dags/int_med_dag.py
[2019-10-10 10:50:29,739] {base_task_runner.py:101} INFO - Job 15495: Subtask int_meds [2019-10-10 10:50:29,739] {cli.py:517} INFO - Running <TaskInstance: int_meds_dag_v1.int_meds 2019-10-10T14:45:00+00:00 [running]> on host TRLPowerSpec.local
[2019-10-10 10:50:29,751] {bash_operator.py:81} INFO - Tmp dir root location:
/tmp
[2019-10-10 10:50:29,751] {bash_operator.py:90} INFO - Exporting the following env vars:
AIRFLOW_CTX_DAG_ID=int_meds_dag_v1
AIRFLOW_CTX_TASK_ID=int_meds
AIRFLOW_CTX_EXECUTION_DATE=2019-10-10T14:45:00+00:00
AIRFLOW_CTX_DAG_RUN_ID=scheduled__2019-10-10T14:45:00+00:00
[2019-10-10 10:50:29,751] {bash_operator.py:104} INFO - Temporary script location: /tmp/airflowtmp7a1q6w0c/int_medsykc0by4v
[2019-10-10 10:50:29,751] {bash_operator.py:114} INFO - Running command: /home/akash/airflow/dags/bin/int_medications/int_meds_auto_updt.py
[2019-10-10 10:50:29,756] {bash_operator.py:123} INFO - Output:
[2019-10-10 10:50:29,757] {bash_operator.py:127} INFO - /tmp/airflowtmp7a1q6w0c/int_medsykc0by4v: line 1: /home/akash/airflow/dags/bin/int_medications/int_meds_auto_updt.py: Permission denied
[2019-10-10 10:50:29,757] {bash_operator.py:131} INFO - Command exited with return code 126
[2019-10-10 10:50:29,760] {__init__.py:1580} ERROR - Bash command failed
Traceback (most recent call last):
File "/home/akash/miniconda3/lib/python3.7/site-packages/airflow/models/__init__.py", line 1441, in _run_raw_task
result = task_copy.execute(context=context)
File "/home/akash/miniconda3/lib/python3.7/site-packages/airflow/operators/bash_operator.py", line 135, in execute
raise AirflowException("Bash command failed")
airflow.exceptions.AirflowException: Bash command failed
[2019-10-10 10:50:29,761] {__init__.py:1611} INFO - Marking task as FAILED.
[2019-10-10 10:50:29,768] {base_task_runner.py:101} INFO - Job 15495: Subtask int_meds Traceback (most recent call last):
[2019-10-10 10:50:29,768] {base_task_runner.py:101} INFO - Job 15495: Subtask int_meds File "/home/akash/miniconda3/bin/airflow", line 32, in <module>
[2019-10-10 10:50:29,768] {base_task_runner.py:101} INFO - Job 15495: Subtask int_meds args.func(args)
[2019-10-10 10:50:29,768] {base_task_runner.py:101} INFO - Job 15495: Subtask int_meds File "/home/akash/miniconda3/lib/python3.7/site-packages/airflow/utils/cli.py", line 74, in wrapper
[2019-10-10 10:50:29,768] {base_task_runner.py:101} INFO - Job 15495: Subtask int_meds return f(*args, **kwargs)
[2019-10-10 10:50:29,769] {base_task_runner.py:101} INFO - Job 15495: Subtask int_meds File "/home/akash/miniconda3/lib/python3.7/site-packages/airflow/bin/cli.py", line 523, in run
[2019-10-10 10:50:29,769] {base_task_runner.py:101} INFO - Job 15495: Subtask int_meds _run(args, dag, ti)
[2019-10-10 10:50:29,769] {base_task_runner.py:101} INFO - Job 15495: Subtask int_meds File "/home/akash/miniconda3/lib/python3.7/site-packages/airflow/bin/cli.py", line 442, in _run
[2019-10-10 10:50:29,769] {base_task_runner.py:101} INFO - Job 15495: Subtask int_meds pool=args.pool,
[2019-10-10 10:50:29,769] {base_task_runner.py:101} INFO - Job 15495: Subtask int_meds File "/home/akash/miniconda3/lib/python3.7/site-packages/airflow/utils/db.py", line 73, in wrapper
[2019-10-10 10:50:29,769] {base_task_runner.py:101} INFO - Job 15495: Subtask int_meds return func(*args, **kwargs)
[2019-10-10 10:50:29,769] {base_task_runner.py:101} INFO - Job 15495: Subtask int_meds File "/home/akash/miniconda3/lib/python3.7/site-packages/airflow/models/__init__.py", line 1441, in _run_raw_task
[2019-10-10 10:50:29,769] {base_task_runner.py:101} INFO - Job 15495: Subtask int_meds result = task_copy.execute(context=context)
[2019-10-10 10:50:29,769] {base_task_runner.py:101} INFO - Job 15495: Subtask int_meds File "/home/akash/miniconda3/lib/python3.7/site-packages/airflow/operators/bash_operator.py", line 135, in execute
[2019-10-10 10:50:29,769] {base_task_runner.py:101} INFO - Job 15495: Subtask int_meds raise AirflowException("Bash command failed")
[2019-10-10 10:50:29,769] {base_task_runner.py:101} INFO - Job 15495: Subtask int_meds airflow.exceptions.AirflowException: Bash command failed
[2019-10-10 10:50:31,649] {logging_mixin.py:95} INFO - [2019-10-10 10:50:31,649] {jobs.py:2562} INFO - Task exited with return code 1
I also tried giving permissions to the Python file using
sudo chmod -R -f 777 /path/to/file
but it still throws the same error in Airflow.
I'd really appreciate knowing what the mistake is so that I can rectify it.
BashOperator expects either a bash script file in the bash_command argument (in that case the file extension should be .sh) or a Bash command. Try replacing cmd_command with:
cmd_command = "python /home/akash/airflow/dags/bin/int_medications/int_meds_auto_updt.py"
Alternatively, you could use PythonOperator instead and run the code from int_meds_auto_updt.py, as sketched below.
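A rough sketch of that alternative, assuming int_meds_auto_updt.py wraps its top-level logic in a callable (the main() name here is an assumption about that module):

from airflow.operators.python_operator import PythonOperator
from bin.int_medications import int_meds_auto_updt

data_loading = PythonOperator(
    task_id='int_meds',
    # Assumes the script exposes a main() entry point; adjust to the real function name.
    python_callable=int_meds_auto_updt.main,
    dag=dag3)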
I am unable to find the mistake I have made; the logs are shown below.
The DAG, connection, and Pig script that I have created are also shown below.
DAG:
from airflow.operators import BashOperator, PigOperator
from airflow.models import DAG
from datetime import datetime

default_args = {
    'owner': 'hadoop',
    'start_date': datetime.now()
}

dag = DAG(dag_id='ETL-DEMO', default_args=default_args, schedule_interval='@hourly')

fly_task_1 = BashOperator(
    task_id='fly_task_1',
    bash_command='sleep 10 ; echo "fly_task_2"',
    dag=dag)

fly_task_2 = PigOperator(
    task_id='fly_task_2',
    pig='/pig/sample.pig',
    pig_cli_conn_id='pig_cli',
    dag=dag)

fly_task_2.set_upstream(fly_task_1)
PIG SCRIPT:
rmf /onlyvinish/sample_out;
a_load = load '/onlyvinish/sample.txt' using PigStorage(',');
a_gen = foreach a_load generate (int)$0 as a;
b_gen = foreach a_gen generate a, a+1, a+2, a+3, a+4, a+5;
store b_gen into '/onlyvinish/sample_out' using PigStorage(',');
Connections:
Log for the failed task:
[2017-01-24 00:03:27,199] {models.py:168} INFO - Filling up the DagBag from /home/hadoop/airflow/dags/ETL.py
[2017-01-24 00:03:27,276] {jobs.py:2042} INFO - Subprocess PID is 8532
[2017-01-24 00:03:29,410] {models.py:168} INFO - Filling up the DagBag from /home/hadoop/airflow/dags/ETL.py
[2017-01-24 00:03:29,487] {models.py:1078} INFO - Dependencies all met for <TaskInstance: ETL-DEMO.fly_task_2 2017-01-24 00:03:07.199790 [queued]>
[2017-01-24 00:03:29,496] {models.py:1078} INFO - Dependencies all met for <TaskInstance: ETL-DEMO.fly_task_2 2017-01-24 00:03:07.199790 [queued]>
[2017-01-24 00:03:29,496] {models.py:1266} INFO -
--------------------------------------------------------------------------------
Starting attempt 1 of 1
--------------------------------------------------------------------------------
[2017-01-24 00:03:29,533] {models.py:1289} INFO - Executing <Task(PigOperator): fly_task_2> on 2017-01-24 00:03:07.199790
[2017-01-24 00:03:29,550] {pig_operator.py:64} INFO - Executing: rmf /onlyvinish/sample_out;
a_load = load '/onlyvinish/sample.txt' using PigStorage(',');
a_gen = foreach a_load generate (int)$0 as a;
b_gen = foreach a_gen generate a, a+1, a+2, a+3, a+4, a+5;
store b_gen into '/onlyvinish/sample_out' using PigStorage(',');
[2017-01-24 00:03:29,612] {pig_hook.py:67} INFO - pig -f /tmp/airflow_pigop_sm5bjE/tmpNP0ZXM
[2017-01-24 00:03:29,620] {models.py:1364} ERROR - [Errno 2] No such file or directory
Traceback (most recent call last):
File "/home/hadoop/anaconda2/lib/python2.7/site-packages/airflow-1.7.2.dev0-py2.7.egg/airflow/models.py", line 1321, in run
result = task_copy.execute(context=context)
File "/home/hadoop/anaconda2/lib/python2.7/site-packages/airflow-1.7.2.dev0-py2.7.egg/airflow/operators/pig_operator.py", line 66, in execute
self.hook.run_cli(pig=self.pig)
File "/home/hadoop/anaconda2/lib/python2.7/site-packages/airflow-1.7.2.dev0-py2.7.egg/airflow/hooks/pig_hook.py", line 72, in run_cli
cwd=tmp_dir)
File "/home/hadoop/anaconda2/lib/python2.7/subprocess.py", line 711, in __init__
errread, errwrite)
File "/home/hadoop/anaconda2/lib/python2.7/subprocess.py", line 1343, in _execute_child
raise child_exception
OSError: [Errno 2] No such file or directory
[2017-01-24 00:03:29,623] {models.py:1388} INFO - Marking task as FAILED.
[2017-01-24 00:03:29,636] {models.py:1409} ERROR - [Errno 2] No such file or directory
Airflow: 1.7.2
Python: 2.7
RHEL: 6.7
Please let me know what I am doing wrong.
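For what it's worth, an OSError: [Errno 2] No such file or directory raised from subprocess (here inside pig_hook.run_cli) usually means the executable being launched, pig, cannot be found on the PATH of the user running the Airflow worker, rather than anything being wrong with the script itself. A minimal check for the Python 2.7 environment described above:

import subprocess
from distutils.spawn import find_executable

# Where (if anywhere) the Airflow process can see the pig binary; prints None if it is not on PATH.
print(find_executable('pig'))

# Roughly the same kind of invocation the PigOperator's hook performs.
subprocess.check_call(['pig', '-version'])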
Running a worker on a different machine results in the errors specified below. I have followed the configuration instructions and have synced the dags folder.
I would also like to confirm that RabbitMQ and PostgreSQL only need to be installed on the Airflow core machine and do not need to be installed on the workers (the workers only connect to the core).
The specification of the setup is detailed below:
Airflow core/server computer
Has the following installed:
Python 2.7 with
airflow (AIRFLOW_HOME = ~/airflow)
celery
psycopg2
RabbitMQ
PostgreSQL
Configurations made in airflow.cfg:
sql_alchemy_conn = postgresql+psycopg2://username:password@192.168.1.2:5432/airflow
executor = CeleryExecutor
broker_url = amqp://username:password@192.168.1.2:5672//
celery_result_backend = postgresql+psycopg2://username:password@192.168.1.2:5432/airflow
Tests performed:
RabbitMQ is running
Can connect to PostgreSQL and have confirmed that Airflow has created tables
Can start and view the webserver (including custom dags)
Airflow worker computer
Has the following installed:
Python 2.7 with
airflow (AIRFLOW_HOME = ~/airflow)
celery
psycopg2
Configurations made in airflow.cfg are exactly the same as in the server:
sql_alchemy_conn = postgresql+psycopg2://username:password@192.168.1.2:5432/airflow
executor = CeleryExecutor
broker_url = amqp://username:password@192.168.1.2:5672//
celery_result_backend = postgresql+psycopg2://username:password@192.168.1.2:5432/airflow
Output from commands run on the worker machine:
When running airflow flower:
ubuntu@airflow_client:~/airflow$ airflow flower
[2016-06-13 04:19:42,814] {__init__.py:36} INFO - Using executor CeleryExecutor
Traceback (most recent call last):
File "/home/ubuntu/anaconda2/bin/airflow", line 15, in <module>
args.func(args)
File "/home/ubuntu/anaconda2/lib/python2.7/site-packages/airflow/bin/cli.py", line 576, in flower
os.execvp("flower", ['flower', '-b', broka, port, api])
File "/home/ubuntu/anaconda2/lib/python2.7/os.py", line 346, in execvp
_execvpe(file, args)
File "/home/ubuntu/anaconda2/lib/python2.7/os.py", line 382, in _execvpe
func(fullname, *argrest)
OSError: [Errno 2] No such file or directory
When running airflow worker:
ubuntu@airflow_client:~$ airflow worker
[2016-06-13 04:08:43,573] {__init__.py:36} INFO - Using executor CeleryExecutor
[2016-06-13 04:08:43,935: ERROR/MainProcess] Unrecoverable error: ImportError('No module named postgresql',)
Traceback (most recent call last):
File "/home/ubuntu/anaconda2/lib/python2.7/site-packages/celery/worker/__init__.py", line 206, in start
self.blueprint.start(self)
File "/home/ubuntu/anaconda2/lib/python2.7/site-packages/celery/bootsteps.py", line 119, in start
self.on_start()
File "/home/ubuntu/anaconda2/lib/python2.7/site-packages/celery/apps/worker.py", line 169, in on_start
string(self.colored.cyan(' \n', self.startup_info())),
File "/home/ubuntu/anaconda2/lib/python2.7/site-packages/celery/apps/worker.py", line 230, in startup_info
results=self.app.backend.as_uri(),
File "/home/ubuntu/anaconda2/lib/python2.7/site-packages/kombu/utils/__init__.py", line 325, in __get__
value = obj.__dict__[self.__name__] = self.__get(obj)
File "/home/ubuntu/anaconda2/lib/python2.7/site-packages/celery/app/base.py", line 626, in backend
return self._get_backend()
File "/home/ubuntu/anaconda2/lib/python2.7/site-packages/celery/app/base.py", line 444, in _get_backend
self.loader)
File "/home/ubuntu/anaconda2/lib/python2.7/site-packages/celery/backends/__init__.py", line 68, in get_backend_by_url
return get_backend_cls(backend, loader), url
File "/home/ubuntu/anaconda2/lib/python2.7/site-packages/celery/backends/__init__.py", line 49, in get_backend_cls
cls = symbol_by_name(backend, aliases)
File "/home/ubuntu/anaconda2/lib/python2.7/site-packages/kombu/utils/__init__.py", line 96, in symbol_by_name
module = imp(module_name, package=package, **kwargs)
File "/home/ubuntu/anaconda2/lib/python2.7/importlib/__init__.py", line 37, in import_module
__import__(name)
ImportError: No module named postgresql
When celery_result_backend is changed to the default db+mysql://airflow:airflow@localhost:3306/airflow and the airflow worker is run again the result is:
ubuntu@airflow_client:~/airflow$ airflow worker
[2016-06-13 04:17:32,387] {__init__.py:36} INFO - Using executor CeleryExecutor
-------------- celery@airflow_client2 v3.1.23 (Cipater)
---- **** -----
--- * *** * -- Linux-3.19.0-59-generic-x86_64-with-debian-jessie-sid
-- * - **** ---
- ** ---------- [config]
- ** ---------- .> app: airflow.executors.celery_executor:0x7f5cb65cb510
- ** ---------- .> transport: amqp://username:**@192.168.1.2:5672//
- ** ---------- .> results: mysql://airflow:**@localhost:3306/airflow
- *** --- * --- .> concurrency: 16 (prefork)
-- ******* ----
--- ***** ----- [queues]
-------------- .> default exchange=default(direct) key=celery
[2016-06-13 04:17:33,385] {__init__.py:36} INFO - Using executor CeleryExecutor
Starting flask
[2016-06-13 04:17:33,737] {_internal.py:87} INFO - * Running on http://0.0.0.0:8793/ (Press CTRL+C to quit)
[2016-06-13 04:17:34,536: WARNING/MainProcess] celery@airflow_client2 ready.
What am I missing? How can I diagnose this further?
The ImportError: No module named postgresql error is due to the invalid prefix used in your celery_result_backend. When using a database as a Celery backend, the connection URL must be prefixed with db+. See
https://docs.celeryproject.org/en/stable/userguide/configuration.html#conf-database-result-backend
So replace:
celery_result_backend = postgresql+psycopg2://username:password@192.168.1.2:5432/airflow
with something like:
celery_result_backend = db+postgresql://username:password@192.168.1.2:5432/airflow
You also need to ensure that Celery Flower is installed (pip install flower); the OSError from airflow flower above is raised because the flower executable cannot be found.
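If you want to confirm the corrected backend URL before restarting the worker, the same resolution Celery performs in the traceback (get_backend_by_url) can be exercised directly; this is only an illustrative check, reusing the placeholder credentials from the question:

from celery import Celery

# The db+ prefix tells Celery to use its SQLAlchemy database result backend.
app = Celery('check',
             broker='amqp://username:password@192.168.1.2:5672//',
             backend='db+postgresql://username:password@192.168.1.2:5432/airflow')

# Fails at this point if the backend URL cannot be resolved to a backend class.
print(app.backend)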