I am trying to schedule a spark job in airflow, here is my dag,
from __future__ import print_function
import airflow
from airflow.contrib.operators.spark_submit_operator import SparkSubmitOperator
from airflow.models import DAG
from datetime import datetime, timedelta
import os
APPLICATION_FILE_PATH = "/home/ubuntu/airflow/dags/scripts/"
default_args = {
'owner': 'Airflow',
'depends_on_past': False,
'retries': 0,
}
dag = DAG('datatodb', default_args=default_args, start_date=(datetime(2018,7,28)),schedule_interval='0 5 * * *')
data_to_db = SparkSubmitOperator(
task_id='data_to_db',
application=APPLICATION_FILE_PATH+"ds_load.py",
dag=dag,
run_as_user='ubuntu',
application_args=["{{ ds }}"]
)
data_to_db
And my python script is this,
from pyspark import SparkContext
from pyspark.sql import SQLContext, SparkSession
from datetime import datetime, timedelta
import sys
def write_to_db(previous_day,spark_session):
drop_cols = ['created_date','Year','Month']
datapath = "s3a://***/"
s3path = datapath + 'date=' + str(previous_day)
data_to_load_df = spark_session.read.parquet(s3path).drop(*drop_cols).withColumn('date',lit(previous_day))
data_to_load_df.write.format('jdbc').options(url='jdbc:mysql://servername:3306/dbname',
driver='com.mysql.jdbc.Driver',
dbtable='report_table',
user='****',
password="***").mode('append').save()
def main(previous_day,spark_session=None):
if spark_session is None:
spark_session = SparkSession.builder.appName("s3_to_db").getOrCreate()
write_to_db(previous_day,spark_session)
if __name__ == "__main__":
previous_day = sys.argv[1]
main(previous_day)
Not sure what is wrong with this, I keep getting this error,
[2018-08-01 02:08:37,278] {base_task_runner.py:98} INFO - Subtask: [2018-08-01 02:08:37,278] {base_hook.py:80} INFO - Using connection to: local
[2018-08-01 02:08:37,298] {base_task_runner.py:98} INFO - Subtask: Traceback (most recent call last):
[2018-08-01 02:08:37,299] {base_task_runner.py:98} INFO - Subtask: File "/home/ubuntu/anaconda2/bin/airflow", line 27, in <module>
[2018-08-01 02:08:37,299] {base_task_runner.py:98} INFO - Subtask: args.func(args)
[2018-08-01 02:08:37,299] {base_task_runner.py:98} INFO - Subtask: File "/home/ubuntu/anaconda2/lib/python2.7/site-packages/airflow/bin/cli.py", line 392, in run
[2018-08-01 02:08:37,299] {base_task_runner.py:98} INFO - Subtask: pool=args.pool,
[2018-08-01 02:08:37,299] {base_task_runner.py:98} INFO - Subtask: File "/home/ubuntu/anaconda2/lib/python2.7/site-packages/airflow/utils/db.py", line 50, in wrapper
[2018-08-01 02:08:37,299] {base_task_runner.py:98} INFO - Subtask: result = func(*args, **kwargs)
[2018-08-01 02:08:37,300] {base_task_runner.py:98} INFO - Subtask: File "/home/ubuntu/anaconda2/lib/python2.7/site-packages/airflow/models.py", line 1493, in _run_raw_task
[2018-08-01 02:08:37,300] {base_task_runner.py:98} INFO - Subtask: result = task_copy.execute(context=context)
[2018-08-01 02:08:37,300] {base_task_runner.py:98} INFO - Subtask: File "/home/ubuntu/anaconda2/lib/python2.7/site-packages/airflow/contrib/operators/spark_submit_operator.py", line 145, in execute
[2018-08-01 02:08:37,300] {base_task_runner.py:98} INFO - Subtask: self._hook.submit(self._application)
[2018-08-01 02:08:37,300] {base_task_runner.py:98} INFO - Subtask: File "/home/ubuntu/anaconda2/lib/python2.7/site-packages/airflow/contrib/hooks/spark_submit_hook.py", line 231, in submit
[2018-08-01 02:08:37,300] {base_task_runner.py:98} INFO - Subtask: **kwargs)
[2018-08-01 02:08:37,301] {base_task_runner.py:98} INFO - Subtask: File "/home/ubuntu/anaconda2/lib/python2.7/subprocess.py", line 394, in __init__
[2018-08-01 02:08:37,301] {base_task_runner.py:98} INFO - Subtask: errread, errwrite)
[2018-08-01 02:08:37,301] {base_task_runner.py:98} INFO - Subtask: File "/home/ubuntu/anaconda2/lib/python2.7/subprocess.py", line 1047, in _execute_child
[2018-08-01 02:08:37,301] {base_task_runner.py:98} INFO - Subtask: raise child_exception
[2018-08-01 02:08:37,301] {base_task_runner.py:98} INFO - Subtask: OSError: [Errno 2] No such file or directory
I checked the python script in its path, its there, also checked the mysql driver, its in the jars folder. This error message does not give me much information on which file is missing. Can anyone help me with this?
Answering my own question, since I resolved it. My bad, I just had to restart the airflow webserver and scheduler after placing the driver jars in spark/jar folder. It worked fine then.
Related
I am working with TFX and AIrflow to build the pipeline for image classification and segmentation. I am replicating the repo by Jan Marcel Kezmann (Link). While executing the classification_dag.py in dags/classification_pipeline and segmentation_dag.py in dags/segmentation_pipeline, I am getting a runtime error.
Error message is mentioned below.
[2023-01-23, 10:55:37 UTC] {taskinstance.py:1889} ERROR - Task failed with exception
Traceback (most recent call last):
File "/home/bhargavpatel/anaconda3/envs/ic_pipeline/lib/python3.7/site-packages/airflow/operators/python.py", line 171, in execute
return_value = self.execute_callable()
File "/home/bhargavpatel/anaconda3/envs/ic_pipeline/lib/python3.7/site-packages/airflow/operators/python.py", line 189, in execute_callable
return self.python_callable(*self.op_args, **self.op_kwargs)
File "/home/bhargavpatel/anaconda3/envs/ic_pipeline/lib/python3.7/site-packages/tfx/orchestration/airflow/airflow_component.py", line 76, in _airflow_component_launcher
launcher.launch()
File "/home/bhargavpatel/anaconda3/envs/ic_pipeline/lib/python3.7/site-packages/tfx/orchestration/launcher/base_component_launcher.py", line 209, in launch
copy.deepcopy(execution_decision.exec_properties))
File "/home/bhargavpatel/anaconda3/envs/ic_pipeline/lib/python3.7/site-packages/tfx/orchestration/launcher/in_process_component_launcher.py", line 74, in _run_executor
copy.deepcopy(input_dict), output_dict, copy.deepcopy(exec_properties))
File "/home/bhargavpatel/anaconda3/envs/ic_pipeline/lib/python3.7/site-packages/tfx/components/transform/executor.py", line 586, in Do
TransformProcessor().Transform(label_inputs, label_outputs, status_file)
File "/home/bhargavpatel/anaconda3/envs/ic_pipeline/lib/python3.7/site-packages/tfx/components/transform/executor.py", line 1145, in Transform
make_beam_pipeline_fn)
File "/home/bhargavpatel/anaconda3/envs/ic_pipeline/lib/python3.7/site-packages/tfx/components/transform/executor.py", line 1526, in _RunBeamImpl
output_path=dataset.materialize_output_path))
File "/home/bhargavpatel/anaconda3/envs/ic_pipeline/lib/python3.7/site-packages/apache_beam/pipeline.py", line 597, in __exit__
self.result = self.run()
File "/home/bhargavpatel/anaconda3/envs/ic_pipeline/lib/python3.7/site-packages/apache_beam/pipeline.py", line 574, in run
return self.runner.run_pipeline(self, self._options)
File "/home/bhargavpatel/anaconda3/envs/ic_pipeline/lib/python3.7/site-packages/apache_beam/runners/direct/direct_runner.py", line 131, in run_pipeline
return runner.run_pipeline(pipeline, options)
File "/home/bhargavpatel/anaconda3/envs/ic_pipeline/lib/python3.7/site-packages/apache_beam/runners/portability/fn_api_runner/fn_runner.py", line 201, in run_pipeline
options)
File "/home/bhargavpatel/anaconda3/envs/ic_pipeline/lib/python3.7/site-packages/apache_beam/runners/portability/fn_api_runner/fn_runner.py", line 212, in run_via_runner_api
return self.run_stages(stage_context, stages)
File "/home/bhargavpatel/anaconda3/envs/ic_pipeline/lib/python3.7/site-packages/apache_beam/runners/portability/fn_api_runner/fn_runner.py", line 443, in run_stages
runner_execution_context, bundle_context_manager, bundle_input)
File "/home/bhargavpatel/anaconda3/envs/ic_pipeline/lib/python3.7/site-packages/apache_beam/runners/portability/fn_api_runner/fn_runner.py", line 776, in _execute_bundle
bundle_manager))
File "/home/bhargavpatel/anaconda3/envs/ic_pipeline/lib/python3.7/site-packages/apache_beam/runners/portability/fn_api_runner/fn_runner.py", line 1000, in _run_bundle
data_input, data_output, input_timers, expected_timer_output)
File "/home/bhargavpatel/anaconda3/envs/ic_pipeline/lib/python3.7/site-packages/apache_beam/runners/portability/fn_api_runner/fn_runner.py", line 1393, in process_bundle
for ix, part in enumerate(input.partition(self._num_workers)):
AttributeError: 'NoneType' object has no attribute 'partition'
[2023-01-23, 10:55:37 UTC] {taskinstance.py:1400} INFO - Marking task as FAILED. dag_id=segmentation_dag, task_id=Transform, execution_date=20230123T052420, start_date=20230123T052509, end_date=20230123T052537
[2023-01-23, 10:55:37 UTC] {standard_task_runner.py:97} ERROR - Failed to execute job 43 for task Transform ('NoneType' object has no attribute 'partition'; 42847)
[2023-01-23, 10:55:37 UTC] {local_task_job.py:156} INFO - Task exited with return code 1
[2023-01-23, 10:55:37 UTC] {local_task_job.py:273} INFO - 0 downstream tasks scheduled from follow-on schedule check
Airflow Dag is as below
Segmentation Dag
I tried to check multiple versions of Airflow and TFX. I belive this error is due to incompatible versions as error is coming from the executor of transform component and airflow. Before the dag execution, I have compile the individule module and have not found any error in them.
Versions
TFX:- 1.12.0
Tensorflow:- 2.11.0
Airflow:- 2.3.0
I am pretty new to airflow and trying to run an ETL process for every 5 min. I have an airflow dag which I am trying to schedule to run for every 5 minutes but the dag fails with an error message ERROR-bash command failed, Permission Denied.
The dag is basically an ETL process with one BashOperator(which fails) and three PythonOperators which downstream process for BashOperator.
from airflow import DAG
from datetime import datetime, timedelta
from airflow.operators.python_operator import PythonOperator
from airflow.operators.bash_operator import BashOperator
from airflow.contrib.sensors.file_sensor import FileSensor
from bin.int_medications import int_meds_auto_updt, storage, insert, del_stag, int_med_stag_clean
DAG_DEFAULT_ARGS = {
'owner':'airflow',
'depends_on_past':False,
'retires':1,
}
dag3 = DAG(dag_id = 'int_meds_dag_v1',
start_date=datetime(2019, 10, 10),
default_args = DAG_DEFAULT_ARGS,
schedule_interval = '*/5 * * * *',
catchup = False)
cmd_command = "/home/akash/airflow/dags/bin/int_medications/int_meds_auto_updt.py"
data_loading = BashOperator(
task_id = "int_meds",
bash_command = cmd_command,
dag=dag3)
data_cleaning = PythonOperator(task_id = 'data_cleaning', python_callable = int_med_stag_clean.clean_stag)
data_insert = PythonOperator(task_id = 'data_insert', python_callable = insert.insert_stag)
data_delete = PythonOperator(task_id = 'data_delete', python_callable = del_stag.delete_stag)
data_loading >> data_cleaning >> data_insert >> data_delete
Attached is the code for the dag file and the error message is below.
*** Reading local file: /home/akash/airflow/logs/int_meds_dag_v1/int_meds/2019-10-10T14:45:00+00:00/1.log
[2019-10-10 10:50:26,649] {__init__.py:1139} INFO - Dependencies all met for <TaskInstance: int_meds_dag_v1.int_meds 2019-10-10T14:45:00+00:00 [queued]>
[2019-10-10 10:50:26,652] {__init__.py:1139} INFO - Dependencies all met for <TaskInstance: int_meds_dag_v1.int_meds 2019-10-10T14:45:00+00:00 [queued]>
[2019-10-10 10:50:26,652] {__init__.py:1353} INFO -
--------------------------------------------------------------------------------
[2019-10-10 10:50:26,652] {__init__.py:1354} INFO - Starting attempt 1 of 1
[2019-10-10 10:50:26,652] {__init__.py:1355} INFO -
--------------------------------------------------------------------------------
[2019-10-10 10:50:26,659] {__init__.py:1374} INFO - Executing <Task(BashOperator): int_meds> on 2019-10-10T14:45:00+00:00
[2019-10-10 10:50:26,659] {base_task_runner.py:119} INFO - Running: ['airflow', 'run', 'int_meds_dag_v1', 'int_meds', '2019-10-10T14:45:00+00:00', '--job_id', '15495', '--raw', '-sd', 'DAGS_FOLDER/int_med_dag.py', '--cfg_path', '/tmp/tmpenegd6zi']
[2019-10-10 10:50:28,319] {base_task_runner.py:101} INFO - Job 15495: Subtask int_meds [2019-10-10 10:50:28,318] {__init__.py:51} INFO - Using executor SequentialExecutor
[2019-10-10 10:50:28,436] {base_task_runner.py:101} INFO - Job 15495: Subtask int_meds [2019-10-10 10:50:28,436] {__init__.py:305} INFO - Filling up the DagBag from /home/akash/airflow/dags/int_med_dag.py
[2019-10-10 10:50:29,739] {base_task_runner.py:101} INFO - Job 15495: Subtask int_meds [2019-10-10 10:50:29,739] {cli.py:517} INFO - Running <TaskInstance: int_meds_dag_v1.int_meds 2019-10-10T14:45:00+00:00 [running]> on host TRLPowerSpec.local
[2019-10-10 10:50:29,751] {bash_operator.py:81} INFO - Tmp dir root location:
/tmp
[2019-10-10 10:50:29,751] {bash_operator.py:90} INFO - Exporting the following env vars:
AIRFLOW_CTX_DAG_ID=int_meds_dag_v1
AIRFLOW_CTX_TASK_ID=int_meds
AIRFLOW_CTX_EXECUTION_DATE=2019-10-10T14:45:00+00:00
AIRFLOW_CTX_DAG_RUN_ID=scheduled__2019-10-10T14:45:00+00:00
[2019-10-10 10:50:29,751] {bash_operator.py:104} INFO - Temporary script location: /tmp/airflowtmp7a1q6w0c/int_medsykc0by4v
[2019-10-10 10:50:29,751] {bash_operator.py:114} INFO - Running command: /home/akash/airflow/dags/bin/int_medications/int_meds_auto_updt.py
[2019-10-10 10:50:29,756] {bash_operator.py:123} INFO - Output:
[2019-10-10 10:50:29,757] {bash_operator.py:127} INFO - /tmp/airflowtmp7a1q6w0c/int_medsykc0by4v: line 1: /home/akash/airflow/dags/bin/int_medications/int_meds_auto_updt.py: Permission denied
[2019-10-10 10:50:29,757] {bash_operator.py:131} INFO - Command exited with return code 126
[2019-10-10 10:50:29,760] {__init__.py:1580} ERROR - Bash command failed
Traceback (most recent call last):
File "/home/akash/miniconda3/lib/python3.7/site-packages/airflow/models/__init__.py", line 1441, in _run_raw_task
result = task_copy.execute(context=context)
File "/home/akash/miniconda3/lib/python3.7/site-packages/airflow/operators/bash_operator.py", line 135, in execute
raise AirflowException("Bash command failed")
airflow.exceptions.AirflowException: Bash command failed
[2019-10-10 10:50:29,761] {__init__.py:1611} INFO - Marking task as FAILED.
[2019-10-10 10:50:29,768] {base_task_runner.py:101} INFO - Job 15495: Subtask int_meds Traceback (most recent call last):
[2019-10-10 10:50:29,768] {base_task_runner.py:101} INFO - Job 15495: Subtask int_meds File "/home/akash/miniconda3/bin/airflow", line 32, in <module>
[2019-10-10 10:50:29,768] {base_task_runner.py:101} INFO - Job 15495: Subtask int_meds args.func(args)
[2019-10-10 10:50:29,768] {base_task_runner.py:101} INFO - Job 15495: Subtask int_meds File "/home/akash/miniconda3/lib/python3.7/site-packages/airflow/utils/cli.py", line 74, in wrapper
[2019-10-10 10:50:29,768] {base_task_runner.py:101} INFO - Job 15495: Subtask int_meds return f(*args, **kwargs)
[2019-10-10 10:50:29,769] {base_task_runner.py:101} INFO - Job 15495: Subtask int_meds File "/home/akash/miniconda3/lib/python3.7/site-packages/airflow/bin/cli.py", line 523, in run
[2019-10-10 10:50:29,769] {base_task_runner.py:101} INFO - Job 15495: Subtask int_meds _run(args, dag, ti)
[2019-10-10 10:50:29,769] {base_task_runner.py:101} INFO - Job 15495: Subtask int_meds File "/home/akash/miniconda3/lib/python3.7/site-packages/airflow/bin/cli.py", line 442, in _run
[2019-10-10 10:50:29,769] {base_task_runner.py:101} INFO - Job 15495: Subtask int_meds pool=args.pool,
[2019-10-10 10:50:29,769] {base_task_runner.py:101} INFO - Job 15495: Subtask int_meds File "/home/akash/miniconda3/lib/python3.7/site-packages/airflow/utils/db.py", line 73, in wrapper
[2019-10-10 10:50:29,769] {base_task_runner.py:101} INFO - Job 15495: Subtask int_meds return func(*args, **kwargs)
[2019-10-10 10:50:29,769] {base_task_runner.py:101} INFO - Job 15495: Subtask int_meds File "/home/akash/miniconda3/lib/python3.7/site-packages/airflow/models/__init__.py", line 1441, in _run_raw_task
[2019-10-10 10:50:29,769] {base_task_runner.py:101} INFO - Job 15495: Subtask int_meds result = task_copy.execute(context=context)
[2019-10-10 10:50:29,769] {base_task_runner.py:101} INFO - Job 15495: Subtask int_meds File "/home/akash/miniconda3/lib/python3.7/site-packages/airflow/operators/bash_operator.py", line 135, in execute
[2019-10-10 10:50:29,769] {base_task_runner.py:101} INFO - Job 15495: Subtask int_meds raise AirflowException("Bash command failed")
[2019-10-10 10:50:29,769] {base_task_runner.py:101} INFO - Job 15495: Subtask int_meds airflow.exceptions.AirflowException: Bash command failed
[2019-10-10 10:50:31,649] {logging_mixin.py:95} INFO - [2019-10-10 10:50:31,649] {jobs.py:2562} INFO - Task exited with return code 1
I also tried giving permissions to the python file using
sudo chmod -R -f 777 /path/to/file
but still, it throws the same error in airflow.
I'd really appreciate it if I can know what the mistake is and I can rectify it.
Bash Operator expects either a bash file in bash_command argument(in that case file extension should be .sh) or a Bash command. Try replacing cmd_command with:
cmd_command = "python /home/akash/airflow/dags/bin/int_medications/int_meds_auto_updt.py"
Alternatively, you could use PythonOperator instead and run code from int_meds_auto_updt.py
I'm trying to run the SqlSensor locally under docker on a Windows 10 machine, it runs on Linux but get below errors when trying to run the same simple DAG locally.
The reason I'm trying to set this up is so that I can develop locally and test to speed up the development cycle.
Error from Airflow log:
[2018-05-22 08:27:04,929] {{models.py:1428}} INFO - Executing <Task(SqlSensor): limits_test> on 2018-05-21 08:00:00
[2018-05-22 08:27:04,929] {{base_task_runner.py:115}} INFO - Running: ['bash', '-c', 'airflow run sql-sensor-test-dag limits_test 2018-05-21T08:00:00 --job_id 8 --raw -sd DAGS_FOLDER/sql_sensor_test.py']
[2018-05-22 08:27:05,685] {{base_task_runner.py:98}} INFO - Subtask: [2018-05-22 08:27:05,684] {{__init__.py:45}} INFO - Using executor CeleryExecutor
[2018-05-22 08:27:05,749] {{base_task_runner.py:98}} INFO - Subtask: [2018-05-22 08:27:05,749] {{models.py:189}} INFO - Filling up the DagBag from /usr/local/airflow/dags/sql_sensor_test.py
[2018-05-22 08:27:05,791] {{cli.py:374}} INFO - Running on host 0f8e7a60dbab
[2018-05-22 08:27:05,858] {{base_task_runner.py:98}} INFO - Subtask: [2018-05-22 08:27:05,858] {{base_hook.py:80}} INFO - Using connection to: LABCHGVA-SQL295
[2018-05-22 08:27:05,888] {{base_task_runner.py:98}} INFO - Subtask: [2018-05-22 08:27:05,888] {{sensors.py:111}} INFO - Poking: SELECT max(snapshot_id) FROM limits_run
[2018-05-22 08:27:05,896] {{base_task_runner.py:98}} INFO - Subtask: [2018-05-22 08:27:05,896] {{base_hook.py:80}} INFO - Using connection to: LABCHGVA-SQL295
[2018-05-22 08:27:05,924] {{models.py:1595}} ERROR - Connection to the database failed for an unknown reason.
Traceback (most recent call last):
File "pymssql.pyx", line 635, in pymssql.connect (pymssql.c:10734)
File "_mssql.pyx", line 1902, in _mssql.connect (_mssql.c:21821)
File "_mssql.pyx", line 638, in _mssql.MSSQLConnection.__init__ (_mssql.c:6594)
_mssql.MSSQLDriverException: Connection to the database failed for an unknown reason.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/airflow/models.py", line 1493, in _run_raw_task
result = task_copy.execute(context=context)
File "/usr/local/lib/python3.6/site-packages/airflow/operators/sensors.py", line 78, in execute
while not self.poke(context):
File "/usr/local/lib/python3.6/site-packages/airflow/operators/sensors.py", line 112, in poke
records = hook.get_records(self.sql)
File "/usr/local/lib/python3.6/site-packages/airflow/hooks/dbapi_hook.py", line 106, in get_records
with closing(self.get_conn()) as conn:
File "/usr/local/lib/python3.6/site-packages/airflow/hooks/mssql_hook.py", line 43, in get_conn
port=conn.port)
File "pymssql.pyx", line 644, in pymssql.connect (pymssql.c:10892)
pymssql.InterfaceError: Connection to the database failed for an unknown reason.
Using this Docker image:
FROM puckel/docker-airflow:1.9.0-2
USER root
RUN apt-get update
RUN apt-get install freetds-dev -yqq && \
pip install apache-airflow[mssql]
USER airflow
And the following simple DAG:
from datetime import timedelta
import airflow
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.sensors import SqlSensor
default_args = {
'owner': 'airflow',
'depends_on_past': False,
'catchup': False,
'start_date': airflow.utils.dates.days_ago(1),
'email': ['myemail#company.com'],
'email_on_failure': True,
'email_on_retry': True,
'retries': 10,
'retry_delay': timedelta(minutes=15),
'sla': timedelta(hours=3)
}
dag = DAG(
'sql-sensor-test-dag',
default_args=default_args,
description='Sensor tests',
schedule_interval='0 8 * * *'
# schedule_interval='#once'
)
with dag:
sql_sensor = SqlSensor(
task_id='limits_test',
conn_id='bpeak_limits_ro',
sql="SELECT max(snapshot_id) FROM limits_run"
)
done = DummyOperator(task_id='done')
sql_sensor >> done
Am unable to find out the issue that i have made, logs are shown below.
The DAG, connection, pig script that i have created are also shown below.
DAG:
from airflow.operators import BashOperator, PigOperator
from airflow.models import DAG
from datetime import datetime
default_args = {
'owner': 'hadoop',
'start_date': datetime.now()
}
dag = DAG(dag_id='ETL-DEMO',default_args=default_args,schedule_interval='#hourly')
fly_task_1 = BashOperator(
task_id='fly_task_1',
bash_command='sleep 10 ; echo "fly_task_2"',
dag=dag)
fly_task_2 = PigOperator(
task_id='fly_task_2',
pig='/pig/sample.pig',
pig_cli_conn_id='pig_cli',
dag=dag)
fly_task_2.set_upstream(fly_task_1)
PIG SCRIPT:
rmf /onlyvinish/sample_out;
a_load = load '/onlyvinish/sample.txt' using PigStorage(',');
a_gen = foreach a_load generate (int)$0 as a;
b_gen = foreach a_gen generate a, a+1, a+2, a+3, a+4, a+5;
store b_gen into '/onlyvinish/sample_out' using PigStorage(',');
Connections:
Log for the failed task:
[2017-01-24 00:03:27,199] {models.py:168} INFO - Filling up the DagBag from /home/hadoop/airflow/dags/ETL.py
[2017-01-24 00:03:27,276] {jobs.py:2042} INFO - Subprocess PID is 8532
[2017-01-24 00:03:29,410] {models.py:168} INFO - Filling up the DagBag from /home/hadoop/airflow/dags/ETL.py
[2017-01-24 00:03:29,487] {models.py:1078} INFO - Dependencies all met for <TaskInstance: ETL-DEMO.fly_task_2 2017-01-24 00:03:07.199790 [queued]>
[2017-01-24 00:03:29,496] {models.py:1078} INFO - Dependencies all met for <TaskInstance: ETL-DEMO.fly_task_2 2017-01-24 00:03:07.199790 [queued]>
[2017-01-24 00:03:29,496] {models.py:1266} INFO -
--------------------------------------------------------------------------------
Starting attempt 1 of 1
--------------------------------------------------------------------------------
[2017-01-24 00:03:29,533] {models.py:1289} INFO - Executing <Task(PigOperator): fly_task_2> on 2017-01-24 00:03:07.199790
[2017-01-24 00:03:29,550] {pig_operator.py:64} INFO - Executing: rmf /onlyvinish/sample_out;
a_load = load '/onlyvinish/sample.txt' using PigStorage(',');
a_gen = foreach a_load generate (int)$0 as a;
b_gen = foreach a_gen generate a, a+1, a+2, a+3, a+4, a+5;
store b_gen into '/onlyvinish/sample_out' using PigStorage(',');
[2017-01-24 00:03:29,612] {pig_hook.py:67} INFO - pig -f /tmp/airflow_pigop_sm5bjE/tmpNP0ZXM
[2017-01-24 00:03:29,620] {models.py:1364} ERROR - [Errno 2] No such file or directory
Traceback (most recent call last):
File "/home/hadoop/anaconda2/lib/python2.7/site-packages/airflow-1.7.2.dev0-py2.7.egg/airflow/models.py", line 1321, in run
result = task_copy.execute(context=context)
File "/home/hadoop/anaconda2/lib/python2.7/site-packages/airflow-1.7.2.dev0-py2.7.egg/airflow/operators/pig_operator.py", line 66, in execute
self.hook.run_cli(pig=self.pig)
File "/home/hadoop/anaconda2/lib/python2.7/site-packages/airflow-1.7.2.dev0-py2.7.egg/airflow/hooks/pig_hook.py", line 72, in run_cli
cwd=tmp_dir)
File "/home/hadoop/anaconda2/lib/python2.7/subprocess.py", line 711, in __init__
errread, errwrite)
File "/home/hadoop/anaconda2/lib/python2.7/subprocess.py", line 1343, in _execute_child
raise child_exception
OSError: [Errno 2] No such file or directory
[2017-01-24 00:03:29,623] {models.py:1388} INFO - Marking task as FAILED.
[2017-01-24 00:03:29,636] {models.py:1409} ERROR - [Errno 2] No such file or directory
Airflow: 1.7.2
Python: 2.7
Rhel:6.7
Please let me know what am doing wrong.?
Running a worker on a different machine results in errors specified below. I have followed the configuration instructions and have sync the dags folder.
I would also like to confirm that RabbitMQ and PostgreSQL only needs to be installed on the Airflow core machine and does not need to be installed on the workers (the workers only connect to the core).
The specification of the setup is detailed below:
Airflow core/server computer
Has the following installed:
Python 2.7 with
airflow (AIRFLOW_HOME = ~/airflow)
celery
psycogp2
RabbitMQ
PostgreSQL
Configurations made in airflow.cfg:
sql_alchemy_conn = postgresql+psycopg2://username:password#192.168.1.2:5432/airflow
executor = CeleryExecutor
broker_url = amqp://username:password#192.168.1.2:5672//
celery_result_backend = postgresql+psycopg2://username:password#192.168.1.2:5432/airflow
Tests performed:
RabbitMQ is running
Can connect to PostgreSQL and have confirmed that Airflow has created tables
Can start and view the webserver (including custom dags)
.
.
Airflow worker computer
Has the following installed:
Python 2.7 with
airflow (AIRFLOW_HOME = ~/airflow)
celery
psycogp2
Configurations made in airflow.cfg are exactly the same as in the server:
sql_alchemy_conn = postgresql+psycopg2://username:password#192.168.1.2:5432/airflow
executor = CeleryExecutor
broker_url = amqp://username:password#192.168.1.2:5672//
celery_result_backend = postgresql+psycopg2://username:password#192.168.1.2:5432/airflow
Output from commands run on the worker machine:
When running airflow flower:
ubuntu#airflow_client:~/airflow$ airflow flower
[2016-06-13 04:19:42,814] {__init__.py:36} INFO - Using executor CeleryExecutor
Traceback (most recent call last):
File "/home/ubuntu/anaconda2/bin/airflow", line 15, in <module>
args.func(args)
File "/home/ubuntu/anaconda2/lib/python2.7/site-packages/airflow/bin/cli.py", line 576, in flower
os.execvp("flower", ['flower', '-b', broka, port, api])
File "/home/ubuntu/anaconda2/lib/python2.7/os.py", line 346, in execvp
_execvpe(file, args)
File "/home/ubuntu/anaconda2/lib/python2.7/os.py", line 382, in _execvpe
func(fullname, *argrest)
OSError: [Errno 2] No such file or directory
When running airflow worker:
ubuntu#airflow_client:~$ airflow worker
[2016-06-13 04:08:43,573] {__init__.py:36} INFO - Using executor CeleryExecutor
[2016-06-13 04:08:43,935: ERROR/MainProcess] Unrecoverable error: ImportError('No module named postgresql',)
Traceback (most recent call last):
File "/home/ubuntu/anaconda2/lib/python2.7/site-packages/celery/worker/__init__.py", line 206, in start
self.blueprint.start(self)
File "/home/ubuntu/anaconda2/lib/python2.7/site-packages/celery/bootsteps.py", line 119, in start
self.on_start()
File "/home/ubuntu/anaconda2/lib/python2.7/site-packages/celery/apps/worker.py", line 169, in on_start
string(self.colored.cyan(' \n', self.startup_info())),
File "/home/ubuntu/anaconda2/lib/python2.7/site-packages/celery/apps/worker.py", line 230, in startup_info
results=self.app.backend.as_uri(),
File "/home/ubuntu/anaconda2/lib/python2.7/site-packages/kombu/utils/__init__.py", line 325, in __get__
value = obj.__dict__[self.__name__] = self.__get(obj)
File "/home/ubuntu/anaconda2/lib/python2.7/site-packages/celery/app/base.py", line 626, in backend
return self._get_backend()
File "/home/ubuntu/anaconda2/lib/python2.7/site-packages/celery/app/base.py", line 444, in _get_backend
self.loader)
File "/home/ubuntu/anaconda2/lib/python2.7/site-packages/celery/backends/__init__.py", line 68, in get_backend_by_url
return get_backend_cls(backend, loader), url
File "/home/ubuntu/anaconda2/lib/python2.7/site-packages/celery/backends/__init__.py", line 49, in get_backend_cls
cls = symbol_by_name(backend, aliases)
File "/home/ubuntu/anaconda2/lib/python2.7/site-packages/kombu/utils/__init__.py", line 96, in symbol_by_name
module = imp(module_name, package=package, **kwargs)
File "/home/ubuntu/anaconda2/lib/python2.7/importlib/__init__.py", line 37, in import_module
__import__(name)
ImportError: No module named postgresql
When celery_result_backend is changed to the default db+mysql://airflow:airflow#localhost:3306/airflow and the airflow worker is run again the result is:
ubuntu#airflow_client:~/airflow$ airflow worker
[2016-06-13 04:17:32,387] {__init__.py:36} INFO - Using executor CeleryExecutor
-------------- celery#airflow_client2 v3.1.23 (Cipater)
---- **** -----
--- * *** * -- Linux-3.19.0-59-generic-x86_64-with-debian-jessie-sid
-- * - **** ---
- ** ---------- [config]
- ** ---------- .> app: airflow.executors.celery_executor:0x7f5cb65cb510
- ** ---------- .> transport: amqp://username:**#192.168.1.2:5672//
- ** ---------- .> results: mysql://airflow:**#localhost:3306/airflow
- *** --- * --- .> concurrency: 16 (prefork)
-- ******* ----
--- ***** ----- [queues]
-------------- .> default exchange=default(direct) key=celery
[2016-06-13 04:17:33,385] {__init__.py:36} INFO - Using executor CeleryExecutor
Starting flask
[2016-06-13 04:17:33,737] {_internal.py:87} INFO - * Running on http://0.0.0.0:8793/ (Press CTRL+C to quit)
[2016-06-13 04:17:34,536: WARNING/MainProcess] celery#airflow_client2 ready.
What am I missing? How can I diagnose this further?
The ImportError: No module named postgresql error is due to the invalid prefix used in your celery_result_backend. When using a database as a Celery backend, the connection URL must be prefixed with db+. See
https://docs.celeryproject.org/en/stable/userguide/configuration.html#conf-database-result-backend
So replace:
celery_result_backend = postgresql+psycopg2://username:password#192.168.1.2:5432/airflow
with something like:
celery_result_backend = db+postgresql://username:password#192.168.1.2:5432/airflow
You need to ensure to install Celery Flower. That is, pip install flower.