In my Airflow DAG, I have set up an on_failure_callback function that pushes exceptions to a Slack integration. I pass in the context of the task and extract the exception from the context using `context.get('exception')`, as described in this answer.
The problem is that it does not show the real cause of the exception. The real cause is in the INFO section of the Airflow logs, while the ERROR section only contains the follow-on errors that were caused by the real problem.
For example, below is a sample log printed when I throw a custom error:
[2021-12-22 13:53:46,006] {pod_launcher.py:173} INFO - Event: transform-file-schema-6ba2b26845da43daa1b59ca5b221c839 had an event of type Pending
[2021-12-22 13:53:46,006] {pod_launcher.py:139} WARNING - Pod not yet started: transform-file-schema-6ba2b26845da43daa1b59ca5b221c839
[2021-12-22 13:53:47,017] {pod_launcher.py:173} INFO - Event: transform-file-schema-6ba2b26845da43daa1b59ca5b221c839 had an event of type Running
[2021-12-22 13:53:47,063] {pod_launcher.py:156} INFO - b'ERROR:root:Apatha throw error\n'
[2021-12-22 13:53:47,064] {pod_launcher.py:156} INFO - b'Traceback (most recent call last):\n'
[2021-12-22 13:53:47,064] {pod_launcher.py:156} INFO - b' File "job_transform_file_schema.py", line 8, in <module>\n'
[2021-12-22 13:53:47,064] {pod_launcher.py:156} INFO - b' JobRunner().run(sys.argv[1:])\n'
[2021-12-22 13:53:47,064] {pod_launcher.py:156} INFO - b' File "/usr/local/lib/python3.7/dist-packages/id_intl_dataflow/transform_file_schema.py", line 448, in run\n'
[2021-12-22 13:53:47,065] {pod_launcher.py:156} INFO - b' raise ApathaError("throw custom error")\n'
[2021-12-22 13:53:47,065] {pod_launcher.py:156} INFO - b'id_intl_dataflow.transform_file_schema.ApathaError: throw custom error\n'
[2021-12-22 13:53:48,080] {pod_launcher.py:160} INFO - Container transform-file-schema-6ba2b26845da43daa1b59ca5b221c839 has state running
[2021-12-22 13:53:50,154] {pod_launcher.py:267} INFO - Running command... cat /airflow/xcom/return.json
[2021-12-22 13:53:50,201] {pod_launcher.py:274} INFO - cat: can't open '/airflow/xcom/return.json': No such file or directory
[2021-12-22 13:53:50,202] {pod_launcher.py:267} INFO - Running command... kill -s SIGINT
[2021-12-22 13:54:16,091] {taskinstance.py:1152} ERROR - Pod Launching failed: Failed to extract xcom from pod: transform-file-schema-6ba2b26845da43daa1b59ca5b221c839
Traceback (most recent call last):
  File "/usr/local/lib/airflow/airflow/providers/cncf/kubernetes/operators/kubernetes_pod.py", line 361, in execute
    final_state, _, result = self.create_new_pod_for_operator(labels, launcher)
  File "/usr/local/lib/airflow/airflow/providers/cncf/kubernetes/operators/kubernetes_pod.py", line 508, in create_new_pod_for_operator
    final_state, result = launcher.monitor_pod(pod=self.pod, get_logs=self.get_logs)
  File "/usr/local/lib/airflow/airflow/kubernetes/pod_launcher.py", line 162, in monitor_pod
    result = self._extract_xcom(pod)
  File "/usr/local/lib/airflow/airflow/kubernetes/pod_launcher.py", line 262, in _extract_xcom
    raise AirflowException('Failed to extract xcom from pod: {}'.format(pod.metadata.name))
airflow.exceptions.AirflowException: Failed to extract xcom from pod: transform-file-schema-6ba2b26845da43daa1b59ca5b221c839
During handling of the above exception, another exception occurred:
As you can see, the real cause of the error is in the INFO section:
[2021-12-22 13:53:47,063] {pod_launcher.py:156} INFO - b'ERROR:root:Apatha throw error\n'
Because of this, `context.get('exception')` does not return the true reason for the failure. What do I change so that `context.get('exception')` also includes the INFO logs? Alternatively, what other variable in the context can I use to get the INFO logs that contain the root cause of the issue?
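A minimal sketch of a partial workaround (not from the original post) is to include the task instance's log URL in the Slack message, so the full log, INFO lines included, is one click away. Here `send_to_slack` is a hypothetical helper standing in for the existing Slack integration:
def notify_slack_on_failure(context):
    ti = context.get('task_instance')
    exception = context.get('exception')
    message = (
        f"Task {ti.dag_id}.{ti.task_id} failed.\n"
        f"Exception: {exception}\n"
        f"Full task log (including INFO lines): {ti.log_url}"
    )
    send_to_slack(message)  # hypothetical Slack helper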
I have a process where I need to list failed workflows in Google Cloud Platform so they can be highlighted and fixed. I have managed to do this quite simply by writing a gcloud command and calling it from a shell script, but I need to port this to Python.
I have the following shell command as an example where I am able to filter on a specific workflow to pull back any failures using the --filter flag.
gcloud workflows executions list --project test-project "projects/test-project/locations/europe-west4/workflows/test-workflow" --location europe-west4 --filter STATE:FAILED
According to the documentation for filtering, it is not possible to do this on the executions, only on the workflows list, which is fine. You can see in the code snippet below that I am trying to filter on STATE:FAILED, as in the gcloud command.
The example below doesn't work, and there are no examples in the Google Cloud documents. I have checked the following page:
https://cloud.google.com/python/docs/reference/workflows/latest/google.cloud.workflows_v1.types.ListWorkflowsRequest
from google.cloud import workflows_v1
from google.cloud.workflows import executions_v1

# Create a client
workflow_client = workflows_v1.WorkflowsClient()
execution_client = executions_v1.ExecutionsClient()

project = "test-project"
location = "europe-west4"

# Initialize request argument(s)
request = workflows_v1.ListWorkflowsRequest(
    parent=f"projects/{project}/locations/{location}",
    filter="STATE:FAILED"
)

# Make the request
workflow_page_result = workflow_client.list_workflows(request=request)

# Handle the response
with open("./workflows.txt", "w") as workflow_file:
    for workflow_response in workflow_page_result:
        name = workflow_response.name
        request = executions_v1.ListExecutionsRequest(
            parent=name,
        )
        execution_page_result = execution_client.list_executions(request=request)
        # Handle the response
        for execution_response in execution_page_result:
            print(execution_response)
        workflow_file.write(name)
What is the correct syntax for filtering on the failed state within the Python code? Where would I look to find this information in the Google documentation?
I get the following error message:
/Users/richard.drury/code/gcp/git/service-level-monitoring/venv/bin/python /Users/richard.drury/code/gcp/git/service-level-monitoring/daily_checks/sensor_checks/workflows.py
Traceback (most recent call last):
File "/Users/richard.drury/code/gcp/git/service-level-monitoring/venv/lib/python3.10/site-packages/google/api_core/grpc_helpers.py", line 50, in error_remapped_callable
return callable_(*args, **kwargs)
File "/Users/richard.drury/code/gcp/git/service-level-monitoring/venv/lib/python3.10/site-packages/grpc/_channel.py", line 946, in __call__
return _end_unary_response_blocking(state, call, False, None)
File "/Users/richard.drury/code/gcp/git/service-level-monitoring/venv/lib/python3.10/site-packages/grpc/_channel.py", line 849, in _end_unary_response_blocking
raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
status = StatusCode.INVALID_ARGUMENT
details = "The request was invalid: invalid list filter: Field 'STATE' not found in 'resource'."
debug_error_string = "{"created":"#1670416298.880340000","description":"Error received from peer ipv4:216.58.213.10:443","file":"src/core/lib/surface/call.cc","file_line":967,"grpc_message":"The request was invalid: invalid list filter: Field 'STATE' not found in 'resource'.","grpc_status":3}"
>
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/Users/richard.drury/code/gcp/git/service-level-monitoring/daily_checks/sensor_checks/workflows.py", line 15, in <module>
workflow_page_result = workflow_client.list_workflows(request=request)
File "/Users/richard.drury/code/gcp/git/service-level-monitoring/venv/lib/python3.10/site-packages/google/cloud/workflows_v1/services/workflows/client.py", line 537, in list_workflows
response = rpc(
File "/Users/richard.drury/code/gcp/git/service-level-monitoring/venv/lib/python3.10/site-packages/google/api_core/gapic_v1/method.py", line 154, in __call__
return wrapped_func(*args, **kwargs)
File "/Users/richard.drury/code/gcp/git/service-level-monitoring/venv/lib/python3.10/site-packages/google/api_core/grpc_helpers.py", line 52, in error_remapped_callable
raise exceptions.from_grpc_error(exc) from exc
google.api_core.exceptions.InvalidArgument: 400 The request was invalid: invalid list filter: Field 'STATE' not found in 'resource'. [field_violations {
field: "filter"
description: "invalid list filter: Field \'STATE\' not found in \'resource\'."
}
]
Process finished with exit code 1
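A minimal sketch of a client-side workaround (it does not answer the server-side filter syntax, and reuses the placeholder project/location from above): list each workflow's executions and keep only those whose state is FAILED, using the Execution.State enum.
from google.cloud import workflows_v1
from google.cloud.workflows import executions_v1

workflow_client = workflows_v1.WorkflowsClient()
execution_client = executions_v1.ExecutionsClient()

project = "test-project"
location = "europe-west4"

with open("./workflows.txt", "w") as workflow_file:
    for workflow in workflow_client.list_workflows(
        parent=f"projects/{project}/locations/{location}"
    ):
        executions = execution_client.list_executions(parent=workflow.name)
        # Filter client-side on the execution state enum.
        failed = [
            e for e in executions
            if e.state == executions_v1.Execution.State.FAILED
        ]
        if failed:
            workflow_file.write(workflow.name + "\n")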
At random times, I notice errors like the one below in a Flask Celery application that uses SQLAlchemy.
When I check database connectivity, it is fine. The app seems to somehow lose its connection to the PostgreSQL database server.
[2021-02-08 00:01:00,124: ERROR/ForkPoolWorker-1] Task gbms.celery_service.create_recurring_tasks[04c258ab-9162-44ed-86ff-f90c4a214fd2] raised unexpected: DatabaseError('(psycopg2.DatabaseError) error with status PGRES_TUPLES_OK and no message from the libpq')
Traceback (most recent call last):
File "/gbmsenv/lib64/python3.8/site-packages/sqlalchemy/engine/base.py", line 1276, in _execute_context
self.dialect.do_execute(
File "/gbmsenv/lib64/python3.8/site-packages/sqlalchemy/engine/default.py", line 593, in do_execute
cursor.execute(statement, parameters)
psycopg2.DatabaseError: error with status PGRES_TUPLES_OK and no message from the libpq
The above exception was the direct cause of the following exception:
What steps should I take to identify the cause and fix the issue?
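One common cause of this class of error is database connections that were opened before Celery forked its worker processes and are then shared across forks. A minimal sketch of the usual mitigation, assuming a plain SQLAlchemy engine (the connection string and names here are illustrative, not from the original post):
from celery.signals import worker_process_init
from sqlalchemy import create_engine

# pool_pre_ping tests each connection before use, so stale or broken
# connections are recycled instead of failing mid-query.
engine = create_engine(
    "postgresql+psycopg2://user:password@db-host/dbname",
    pool_pre_ping=True,
)

@worker_process_init.connect
def reset_db_pool(**kwargs):
    # Discard connections inherited from the parent process after the fork,
    # so each Celery worker opens its own fresh connections.
    engine.dispose()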
I'm getting this SIGTERM error on Airflow 1.10.11 using the LocalExecutor.
[2020-09-21 10:26:51,210] {{taskinstance.py:955}} ERROR - Received SIGTERM. Terminating subprocesses.
The DAG task does the following:
reading some data from SQL Server (on Windows) into a pandas dataframe,
and then writing it to a file (it doesn't even get to this part).
The strange thing is that if I limit the number of rows returned by the query (say TOP 100), the DAG succeeds.
If I run the Python code locally on my machine, it succeeds. I'm using pyodbc and SQLAlchemy. It fails on this line after only 20 or 30 seconds:
df_query_results = pd.read_sql(sql_query, engine)
Airflow log
[2020-09-21 10:26:51,210] {{helpers.py:325}} INFO - Sending Signals.SIGTERM to GPID xxx
[2020-09-21 10:26:51,210] {{taskinstance.py:955}} ERROR - Received SIGTERM. Terminating subprocesses.
[2020-09-21 10:26:51,804] {{taskinstance.py:1150}} ERROR - Task received SIGTERM signal
Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/airflow/models/taskinstance.py", line 984, in _run_raw_task
result = task_copy.execute(context=context)
File "/usr/local/airflow/dags/operators/sql_to_avro.py", line 39, in execute
df_query_results = pd.read_sql(sql_query, engine)
File "/usr/local/lib64/python3.6/site-packages/pandas/io/sql.py", line 436, in read_sql
chunksize=chunksize,
File "/usr/local/lib64/python3.6/site-packages/pandas/io/sql.py", line 1231, in read_query
data = result.fetchall()
File "/usr/local/lib64/python3.6/site-packages/sqlalchemy/engine/result.py", line 1216, in fetchall
e, None, None, self.cursor, self.context
File "/usr/local/lib64/python3.6/site-packages/sqlalchemy/engine/base.py", line 1478, in _handle_dbapi_exception
util.reraise(*exc_info)
File "/usr/local/lib64/python3.6/site-packages/sqlalchemy/util/compat.py", line 153, in reraise
raise value
File "/usr/local/lib64/python3.6/site-packages/sqlalchemy/engine/result.py", line 1211, in fetchall
l = self.process_rows(self._fetchall_impl())
File "/usr/local/lib64/python3.6/site-packages/sqlalchemy/engine/result.py", line 1161, in _fetchall_impl
return self.cursor.fetchall()
File "/usr/local/lib/python3.6/site-packages/airflow/models/taskinstance.py", line 957, in signal_handler
raise AirflowException("Task received SIGTERM signal")
airflow.exceptions.AirflowException: Task received SIGTERM signal
[2020-09-21 10:26:51,813] {{taskinstance.py:1194}} INFO - Marking task as FAILED.
EDIT:
I missed this earlier, but there is a warning message about the hostname.
WARNING - The recorded hostname da2mgrl001d1.mycompany.corp does not match this instance's hostname airflow-mycompany-dev.i.mct360.com
I had a Linux/network engineer help out. Unfortunately, I don't know the full details, but the fix was to change the hostname_callable setting in airflow.cfg to hostname_callable = socket:gethostname. It was previously set to socket:getfqdn.
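For reference, the change described above would look roughly like this in airflow.cfg (Airflow 1.10.x uses the colon-separated form):
[core]
# previously: hostname_callable = socket:getfqdn
hostname_callable = socket:gethostname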
Note: I found a couple of different (maybe related?) questions where this was the resolution.
How to fix the error "AirflowException("Hostname of job runner does not match")"?
https://stackoverflow.com/a/59108743/220997
I'm running Kafka locally on my Mac Pro (Sierra; 10.12.6) just to get started with development. I've started ZooKeeper and a Kafka server (0.11.0.1):
bin/zookeeper-server-start.sh config/zookeeper.properties
bin/kafka-server-start.sh config/server.properties
I've got topics created:
bin/kafka-topics.sh --list --zookeeper localhost:2181
__consumer_offsets
access
my-topic
(not sure what __consumer_offsets is; I created the other two).
I've installed kafka-python (1.3.4).
My sample program is dead simple:
from kafka import KafkaProducer
producer = KafkaProducer(bootstrap_servers=['localhost:9092'])
producer.send('my-topic', 'Another message')
But it croaks with the following message:
Traceback (most recent call last):
File "produce.py", line 3, in <module>
producer = KafkaProducer(bootstrap_servers=['localhost:9092'])
File "/Library/Python/2.7/site-packages/kafka/producer/kafka.py", line 347, in __init__
**self.config)
File "/Library/Python/2.7/site-packages/kafka/client_async.py", line 220, in __init__
self.config['api_version'] = self.check_version(timeout=check_timeout)
File "/Library/Python/2.7/site-packages/kafka/client_async.py", line 861, in check_version
raise Errors.NoBrokersAvailable()
kafka.errors.NoBrokersAvailable: NoBrokersAvailable
Ideas? Any assistance appreciated.
Please ensure that you have this setting defined in the config/server.properties file:
advertised.listeners=PLAINTEXT://your.host.name:9092
It might be that host name resolution is returning some other host name; by default Kafka uses java.net.InetAddress.getCanonicalHostName().
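For a single broker running locally, a minimal sketch of the relevant config/server.properties lines might look like this (values are illustrative):
listeners=PLAINTEXT://0.0.0.0:9092
advertised.listeners=PLAINTEXT://localhost:9092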
If you're using wurstmeister/kafka, please note that in recent Kafka versions many parameters have been deprecated.
Instead of using -
KAFKA_HOST:
KAFKA_PORT: 9092
KAFKA_ADVERTISED_HOST_NAME: <IP-ADDRESS>
KAFKA_ADVERTISED_PORT: 9092
you need to use -
KAFKA_LISTENERS: PLAINTEXT://:9092
KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://<IP-ADDRESS>:9092
See this link for more details.
I have a problem configuring the Endpoints API. Any code I use, from my own to Google's examples on the site, fails with the same traceback:
WARNING 2016-11-01 06:16:48,279 client.py:229] no scheduler thread, scheduler.run() will be invoked by report(...)
Traceback (most recent call last):
File "/home/vladimir/projects/sb_fork/sb/lib/vendor/google/api/control/client.py", line 225, in start
self._thread.start()
File "/home/vladimir/sdk/google-cloud-sdk/platform/google_appengine/google/appengine/api/background_thread/background_thread.py", line 108, in start
start_new_background_thread(self.__bootstrap, ())
File "/home/vladimir/sdk/google-cloud-sdk/platform/google_appengine/google/appengine/api/background_thread/background_thread.py", line 87, in start_new_background_thread
raise ERROR_MAP[error.application_error](error.error_detail)
FrontendsNotSupported
INFO 2016-11-01 06:16:48,280 client.py:327] created a scheduler to control flushing
INFO 2016-11-01 06:16:48,280 client.py:330] scheduling initial check and flush
INFO 2016-11-01 06:16:48,288 client.py:804] Refreshing access_token
/home/vladimir/projects/sb_fork/sb/lib/vendor/urllib3/contrib/appengine.py:113: AppEnginePlatformWarning: urllib3 is using URLFetch on Google App Engine sandbox instead of sockets. To use sockets directly instead of URLFetch see https://urllib3.readthedocs.io/en/latest/contrib.html.
AppEnginePlatformWarning)
ERROR 2016-11-01 06:16:49,895 service_config.py:125] Fetching service config failed (status code 403)
ERROR 2016-11-01 06:16:49,896 wsgi.py:263]
Traceback (most recent call last):
File "/home/vladimir/sdk/google-cloud-sdk/platform/google_appengine/google/appengine/runtime/wsgi.py", line 240, in Handle
handler = _config_handle.add_wsgi_middleware(self._LoadHandler())
File "/home/vladimir/sdk/google-cloud-sdk/platform/google_appengine/google/appengine/runtime/wsgi.py", line 299, in _LoadHandler
handler, path, err = LoadObject(self._handler)
File "/home/vladimir/sdk/google-cloud-sdk/platform/google_appengine/google/appengine/runtime/wsgi.py", line 85, in LoadObject
obj = __import__(path[0])
File "/home/vladimir/projects/sb_fork/sb/main.py", line 27, in <module>
api_app = endpoints.api_server([SolarisAPI,], restricted=False)
File "/home/vladimir/projects/sb_fork/sb/lib/vendor/endpoints/apiserving.py", line 497, in api_server
controller)
File "/home/vladimir/projects/sb_fork/sb/lib/vendor/google/api/control/wsgi.py", line 77, in add_all
a_service = loader.load()
File "/home/vladimir/projects/sb_fork/sb/lib/vendor/google/api/control/service.py", line 110, in load
return self._load_func(**kw)
File "/home/vladimir/projects/sb_fork/sb/lib/vendor/google/api/config/service_config.py", line 78, in fetch_service_config
_log_and_raise(Exception, message_template.format(status_code))
File "/home/vladimir/projects/sb_fork/sb/lib/vendor/google/api/config/service_config.py", line 126, in _log_and_raise
raise exception_class(message)
Exception: Fetching service config failed (status code 403)
INFO 2016-11-01 06:16:49,913 module.py:788] default: "GET / HTTP/1.1" 500 -
My app.yaml is configured as the new Endpoints "Migrating to 2.0" document states:
- url: /_ah/api/.*
script: api.solaris.api_app
And main.py imports the API into the app:
api_app = endpoints.api_server([SolarisAPI,], restricted=False)
I use Google Cloud SDK with these versions:
Google Cloud SDK 132.0.0
app-engine-python 1.9.40
bq 2.0.24
bq-nix 2.0.24
core 2016.10.24
core-nix 2016.10.24
gcloud
gsutil 4.22
gsutil-nix 4.22
Have you tried generating and uploading the OpenAPI configuration for the service? See the sections named "Generating the OpenAPI configuration file" and "Deploying the OpenAPI configuration file" in the Python library documentation.
Note that in step 2 of the generation process, you may need to prepend python to the command (e.g. python lib/endpoints/endpointscfg.py get_swagger_spec ...), since the PyPI package doesn't preserve executable file permissions right now.
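Very roughly, the two steps look like this; the API class name, hostname, and output file are placeholders, and the exact sub-command name depends on the library version, so treat this as a sketch rather than exact syntax:
python lib/endpoints/endpointscfg.py get_swagger_spec main.SolarisAPI --hostname your-project.appspot.com
gcloud service-management deploy <generated-openapi-or-swagger-file>.json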
To get rid of the "FrontendsNotSupported" error, you need to use a "B*" instance class.
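A sketch of the app.yaml scaling settings that select a B* instance class (the exact class and instance count are illustrative):
instance_class: B4
basic_scaling:
  max_instances: 1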
The error "Exception: Fetching service config failed" should be gone if you follow the steps in https://cloud.google.com/endpoints/docs/frameworks/python/quickstart-frameworks-python. As already pointed out by Brad, the section "OpenAPI configuration" and the resulting environment variables are required to make the service configuration work.