Airflow error while using pandas - Task received SIGTERM signal - python

I'm getting this SIGTERM error on Airflow 1.10.11 using LocalExecutor.
[2020-09-21 10:26:51,210] {{taskinstance.py:955}} ERROR - Received SIGTERM. Terminating subprocesses.
The DAG task does this:
reads some data from SQL Server (on Windows) into a pandas DataFrame,
then writes it to a file (it never even gets to that part).
The strange thing is that if I limit the number of rows the query returns (say TOP 100), the DAG succeeds.
If I run the Python code locally on my machine, it succeeds. I'm using pyodbc and SQLAlchemy. It fails on this line after only 20 or 30 seconds:
df_query_results = pd.read_sql(sql_query, engine)
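For reference, the setup described above boils down to something like this (a minimal sketch; the connection string and query here are placeholders, not the real ones):
import pandas as pd
from sqlalchemy import create_engine

# SQL Server via SQLAlchemy + pyodbc; DSN and credentials are placeholders.
engine = create_engine("mssql+pyodbc://user:password@my_sqlserver_dsn")

# Only a row-limited query (e.g. TOP 100) makes it past this call in Airflow.
sql_query = "SELECT TOP 100 * FROM some_table"
df_query_results = pd.read_sql(sql_query, engine)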
Airflow log
[2020-09-21 10:26:51,210] {{helpers.py:325}} INFO - Sending Signals.SIGTERM to GPID xxx
[2020-09-21 10:26:51,210] {{taskinstance.py:955}} ERROR - Received SIGTERM. Terminating subprocesses.
[2020-09-21 10:26:51,804] {{taskinstance.py:1150}} ERROR - Task received SIGTERM signal
Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/airflow/models/taskinstance.py", line 984, in _run_raw_task
result = task_copy.execute(context=context)
File "/usr/local/airflow/dags/operators/sql_to_avro.py", line 39, in execute
df_query_results = pd.read_sql(sql_query, engine)
File "/usr/local/lib64/python3.6/site-packages/pandas/io/sql.py", line 436, in read_sql
chunksize=chunksize,
File "/usr/local/lib64/python3.6/site-packages/pandas/io/sql.py", line 1231, in read_query
data = result.fetchall()
File "/usr/local/lib64/python3.6/site-packages/sqlalchemy/engine/result.py", line 1216, in fetchall
e, None, None, self.cursor, self.context
File "/usr/local/lib64/python3.6/site-packages/sqlalchemy/engine/base.py", line 1478, in _handle_dbapi_exception
util.reraise(*exc_info)
File "/usr/local/lib64/python3.6/site-packages/sqlalchemy/util/compat.py", line 153, in reraise
raise value
File "/usr/local/lib64/python3.6/site-packages/sqlalchemy/engine/result.py", line 1211, in fetchall
l = self.process_rows(self._fetchall_impl())
File "/usr/local/lib64/python3.6/site-packages/sqlalchemy/engine/result.py", line 1161, in _fetchall_impl
return self.cursor.fetchall()
File "/usr/local/lib/python3.6/site-packages/airflow/models/taskinstance.py", line 957, in signal_handler
raise AirflowException("Task received SIGTERM signal")
airflow.exceptions.AirflowException: Task received SIGTERM signal
[2020-09-21 10:26:51,813] {{taskinstance.py:1194}} INFO - Marking task as FAILED.
EDIT:
I missed this earlier, but there is a warning message about the hostname.
WARNING - The recorded hostname da2mgrl001d1.mycompany.corp does not match this instance's hostname airflow-mycompany-dev.i.mct360.com

I had a Linux/network engineer help out. Unfortunately, I don't know the full details, but the fix was that they changed the hostname_callable setting in airflow.cfg to hostname_callable = socket:gethostname. It was previously set to socket:getfqdn.
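For context, hostname_callable points Airflow at a Python callable used to record the task instance's hostname. A quick way to see why the two settings can disagree on a given machine (a minimal check, nothing Airflow-specific):
import socket

# These can return different strings on the same host (short hostname vs.
# fully qualified domain name), which is what produces the "recorded hostname
# does not match" warning above.
print(socket.gethostname())
print(socket.getfqdn())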
Note: I found a couple of different (maybe related?) questions where this was the resolution.
How to fix the error "AirflowException("Hostname of job runner does not match")"?
https://stackoverflow.com/a/59108743/220997

Related

Faust-Streaming Crashes with error "Partition is not assigned"

We recently switched from faust 1.10.4 to faust-streaming (0.6.9). Since then, we have seen the applications crash with the exception below. The application has multiple layers, with aggregation and filtering of data at each stage. At each stage the processor sends the message to a Kafka topic, and the respective Faust app agent consumes it. We have kept the partition count of the Kafka topic the same at each layer.
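The layered pipeline described here looks roughly like this (a minimal sketch; the app name, broker address, and topic names are hypothetical, and the partition count matches the setup below):
import faust

app = faust.App('layered-pipeline', broker='kafka://localhost:9092')

# Same partition count at every layer, as described above.
raw_events = app.topic('raw-events', partitions=36)
filtered_events = app.topic('filtered-events', partitions=36)

@app.agent(raw_events)
async def filter_stage(stream):
    async for event in stream:
        # aggregate/filter here, then forward to the next layer's topic
        await filtered_events.send(value=event)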
Cluster Size = 12
Topic & Table Partition count = 36
faust-streaming version = 0.6.9
kafka-python version = 2.0.2
[2021-07-29 10:05:23,761] [18808] [ERROR] [^---Fetcher]: Crashed reason=AssertionError('Partition is not assigned')
Traceback (most recent call last):
File "/usr/local/lib/python3.8/site-packages/mode/services.py", line 802, in _execute_task
await task
File "/usr/local/lib/python3.8/site-packages/faust/transport/consumer.py", line 176, in _fetcher
await consumer._drain_messages(self)
File "/usr/local/lib/python3.8/site-packages/faust/transport/consumer.py", line 1104, in _drain_messages
async for tp, message in ait:
File "/usr/local/lib/python3.8/site-packages/faust/transport/consumer.py", line 714, in getmany
highwater_mark = self.highwater(tp)
File "/usr/local/lib/python3.8/site-packages/faust/transport/consumer.py", line 1367, in highwater
return self._thread.highwater(tp)
File "/usr/local/lib/python3.8/site-packages/faust/transport/drivers/aiokafka.py", line 923, in highwater
return self._ensure_consumer().highwater(tp)
File "/usr/local/lib/python3.8/site-packages/aiokafka/consumer/consumer.py", line 673, in highwater
assert self._subscription.is_assigned(partition), \
AssertionError: Partition is not assigned
[2021-07-29 10:05:23,764] [18808] [INFO] [^Worker]: Stopping...
[2021-07-29 10:05:23,765] [18808] [INFO] [^-App]: Stopping...
Please help us here.

ModuleNotFoundError("'kafka' is not a valid name. Did you mean one of aiokafka, kafka?")

I am using Celery and Kafka to run some jobs in order to push data to Kafka. I also use Faust to connect the workers. Unfortunately, I got an error after running faust -A project.streams.app worker -l info to start the pipeline. I wonder if anyone can help me.
/home/admin/.local/lib/python3.6/site-packages/faust/fixups/django.py:71: UserWarning: Using settings.DEBUG leads to a memory leak, never
use this setting in production environments!
warnings.warn(WARN_DEBUG_ENABLED)
Command raised exception: ModuleNotFoundError("'kafka' is not a valid name. Did you mean one of aiokafka, kafka?",)
File "/home/admin/.local/lib/python3.6/site-packages/mode/worker.py", line 67, in exiting
yield
File "/home/admin/.local/lib/python3.6/site-packages/faust/cli/base.py", line 528, in _inner
cmd()
File "/home/admin/.local/lib/python3.6/site-packages/faust/cli/base.py", line 611, in __call__
self.run_using_worker(*args, **kwargs)
File "/home/admin/.local/lib/python3.6/site-packages/faust/cli/base.py", line 620, in run_using_worker
self.on_worker_created(worker)
File "/home/admin/.local/lib/python3.6/site-packages/faust/cli/worker.py", line 57, in on_worker_created
self.say(self.banner(worker))
File "/home/admin/.local/lib/python3.6/site-packages/faust/cli/worker.py", line 97, in banner
self._banner_data(worker))
File "/home/admin/.local/lib/python3.6/site-packages/faust/cli/worker.py", line 127, in _banner_data
(' transport', app.transport.driver_version),
File "/home/admin/.local/lib/python3.6/site-packages/faust/app/base.py", line 1831, in transport
self._transport = self._new_transport()
File "/home/admin/.local/lib/python3.6/site-packages/faust/app/base.py", line 1686, in _new_transport
return transport.by_url(self.conf.broker_consumer[0])(
File "/home/admin/.local/lib/python3.6/site-packages/mode/utils/imports.py", line 101, in by_url
return self.by_name(URL(url).scheme)
File "/home/admin/.local/lib/python3.6/site-packages/mode/utils/imports.py", line 115, in by_name
f'{name!r} is not a valid name. {alt}') from exc
I don't know what was wrong with Faust, but I ran pip install faust on a whim and it solved the problem.
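For context, the traceback shows Faust resolving the transport driver from the scheme of the broker URL (transport.by_url(self.conf.broker_consumer[0])), so the error means the kafka scheme was not registered in that environment, and reinstalling faust evidently restored the registration. A minimal sketch of where that scheme comes from (broker address is hypothetical):
import faust

# The 'kafka://' scheme in the broker URL is the transport driver name that
# Faust looks up; that lookup is what raised "'kafka' is not a valid name".
app = faust.App('project_streams', broker='kafka://localhost:9092')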

Is it possible to deploy a daml smart contract with bots to Hyperledger Fabric?

I deployed the quickstart tutorial based on the example "daml-on-fabric" https://github.com/hacera/daml-on-fabric, and after that I tried to deploy the ping-pong example from dazl https://github.com/digital-asset/dazl-client/tree/master/samples/ping-pong. The bots from the example work fine on the DAML ledger. However, when I try to deploy this example on Fabric, the bots are unable to send the transactions. Everything works fine based on the README at https://github.com/hacera/daml-on-fabric/blob/master/README.md, and the smart contract looks to be deployed on Fabric. The error occurs when I try to use the bots from the ping-pong Python files https://github.com/digital-asset/dazl-client/blob/master/samples/ping-pong/README.md
I receive this error:
[ ERROR] 2020-03-10 15:40:57,475 | dazl | A command submission failed!
Traceback (most recent call last):
File "/home/vasisiop/.local/share/virtualenvs/ping-pong-sDNeps76/lib/python3.7/site-packages/dazl/client/_party_client_impl.py", line 415, in main_writer
await submit_command_async(client, p, commands)
File "/home/vasisiop/anaconda3/lib/python3.7/concurrent/futures/thread.py", line 57, in run
result = self.fn(*self.args, **self.kwargs)
File "/home/vasisiop/.local/share/virtualenvs/ping-pong-sDNeps76/lib/python3.7/site-packages/dazl/protocols/v1/grpc.py", line 42, in <lambda>
lambda: self.connection.command_service.SubmitAndWait(request))
File "/home/vasisiop/.local/share/virtualenvs/ping-pong-sDNeps76/lib/python3.7/site-packages/grpc/_channel.py", line 824, in __call__
return _end_unary_response_blocking(state, call, False, None)
File "/home/vasisiop/.local/share/virtualenvs/ping-pong-sDNeps76/lib/python3.7/site-packages/grpc/_channel.py", line 726, in _end_unary_response_blocking
raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
status = StatusCode.INVALID_ARGUMENT
details = "Party not known on ledger"
debug_error_string = "{"created":"#1583847657.473821297","description":"Error received from peer ipv6:[::1]:6865","file":"src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Party not known on ledger","grpc_status":3}"
>
[ ERROR] 2020-03-10 15:40:57,476 | dazl | An event handler in a bot has thrown an exception!
Traceback (most recent call last):
File "/home/vasisiop/.local/share/virtualenvs/ping-pong-sDNeps76/lib/python3.7/site-packages/dazl/client/bots.py", line 157, in _handle_event
await handler.callback(new_event)
File "/home/vasisiop/.local/share/virtualenvs/ping-pong-sDNeps76/lib/python3.7/site-packages/dazl/client/_party_client_impl.py", line 415, in main_writer
await submit_command_async(client, p, commands)
File "/home/vasisiop/anaconda3/lib/python3.7/concurrent/futures/thread.py", line 57, in run
result = self.fn(*self.args, **self.kwargs)
File "/home/vasisiop/.local/share/virtualenvs/ping-pong-sDNeps76/lib/python3.7/site-packages/dazl/protocols/v1/grpc.py", line 42, in <lambda>
lambda: self.connection.command_service.SubmitAndWait(request))
File "/home/vasisiop/.local/share/virtualenvs/ping-pong-sDNeps76/lib/python3.7/site-packages/grpc/_channel.py", line 824, in __call__
return _end_unary_response_blocking(state, call, False, None)
File "/home/vasisiop/.local/share/virtualenvs/ping-pong-sDNeps76/lib/python3.7/site-packages/grpc/_channel.py", line 726, in _end_unary_response_blocking
raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
status = StatusCode.INVALID_ARGUMENT
details = "Party not known on ledger"
debug_error_string = "{"created":"#1583847657.473821297","description":"Error received from peer ipv6:[::1]:6865","file":"src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Party not known on ledger","grpc_status":3}"
From the error message it looks like the parties defined in the quick start example have not been allocated on the ledger, hence the "Party not known on ledger" error.
You can follow the steps in https://docs.daml.com/deploy/index.html using daml deploy --host= --port=, which will both upload the DARs and allocate the parties on the ledger.
You can also run just the party-allocation command, daml ledger allocate-parties, which will allocate parties based on those defined in your daml.yaml.

Google Pub/Sub Python Subscriber gets ALREADY_EXISTS after 10 minutes

I deployed a Python subscriber Sunday morning. The subscriber restarts every day. Starting today (3 days later), it experiences an ALREADY_EXISTS error 10-20 minutes after starting.
I've restarted it several times now. Every time, it runs fine, retrieves previous messages, and processes them correctly. Then about 10-20 minutes later it dies again. At the time it dies, it hasn't received any messages and nothing significant has happened.
Subscriber code
from google.cloud import pubsub

project_id = '***'
subscription = 'pop'
subscription_id = 'projects/{}/subscriptions/{}'.format(project_id, subscription)

def subscribe(callback):
    # Open a streaming pull on the subscription and block until it fails.
    subscriber = pubsub.SubscriberClient()
    subscription = subscriber.subscribe(subscription_id)
    future = subscription.open(callback)
    future.result()
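For reference, the function above gets driven with a callback along these lines (a sketch against the same pre-1.0 google-cloud-pubsub API used in the question; the callback body is illustrative):
def callback(message):
    # Process the Pub/Sub message, then acknowledge it so it isn't redelivered.
    print(message.data)
    message.ack()

subscribe(callback)  # blocks on future.result() until the stream raises, as in the traceback below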
Error
Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/google/api_core/grpc_helpers.py", line 54, in error_remapped_callable
return callable_(*args, **kwargs)
File "/usr/local/lib64/python3.6/site-packages/grpc/_channel.py", line 341, in _next
raise self
grpc._channel._Rendezvous: <_Rendezvous of RPC that terminated with (StatusCode.ABORTED, The operation was aborted.)>
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "../monitors/bautopops.py", line 45, in <module>
main()
File "../monitors/bautopops.py", line 41, in main
gmail.start_pubsub_subscription(callback)
File "/home/ec2-user/bautopops/gmail.py", line 270, in start_pubsub_subscription
pubsub.subscribe(pubsub_callback)
File "/home/ec2-user/bautopops/pubsub.py", line 19, in subscribe
future.result()
File "/usr/local/lib/python3.6/site-packages/google/cloud/pubsub_v1/futures.py", line 103, in result
raise err
File "/usr/local/lib/python3.6/site-packages/google/cloud/pubsub_v1/subscriber/_consumer.py", line 336, in _blocking_consume
for response in responses:
File "/usr/local/lib/python3.6/site-packages/google/cloud/pubsub_v1/subscriber/_consumer.py", line 462, in _pausable_iterator
yield next(iterator)
File "/usr/local/lib64/python3.6/site-packages/grpc/_channel.py", line 347, in __next__
return self._next()
File "/usr/local/lib/python3.6/site-packages/google/api_core/grpc_helpers.py", line 56, in error_remapped_callable
six.raise_from(exceptions.from_grpc_error(exc), exc)
File "<string>", line 2, in raise_from
google.api_core.exceptions.Aborted: 409 The operation was aborted.
Guest worker-005 exited with error A process has ended with a probable error condition: process ended with exit code 1.
I found in the Google Pub/Sub Reference that error 409 means ALREADY_EXISTS:
The topic or subscription already exists. This is an error on creation operations.
This really doesn't help since I am not creating a subscription. I am only subscribing to one already created, as described in the Google Cloud Docs.
Lastly, I found this Github Issue which experiences the same error, but under totally different circumstances. It does not seem to be related to my problem.
Please help.

How to debug intermittent errors from Django app served with gunicorn (possible race condition)?

I have a Django app being served with nginx+gunicorn with 3 gunicorn worker processes. Occasionally (maybe once every 100 requests or so) one of the worker processes gets into a state where it starts failing most (but not all) requests that it serves, and then it throws an exception when it tries to email me about it. The gunicorn error logs look like this:
[2015-04-29 10:41:39 +0000] [20833] [ERROR] Error handling request
Traceback (most recent call last):
File "/home/django/virtualenvs/homestead_django/local/lib/python2.7/site-packages/gunicorn/workers/sync.py", line 130, in handle
File "/home/django/virtualenvs/homestead_django/local/lib/python2.7/site-packages/gunicorn/workers/sync.py", line 171, in handle_request
File "/home/django/virtualenvs/homestead_django/local/lib/python2.7/site-packages/django/core/handlers/wsgi.py", line 206, in __call__
File "/home/django/virtualenvs/homestead_django/local/lib/python2.7/site-packages/django/core/handlers/base.py", line 196, in get_response
File "/home/django/virtualenvs/homestead_django/local/lib/python2.7/site-packages/django/core/handlers/base.py", line 226, in handle_uncaught_exception
File "/usr/lib/python2.7/logging/__init__.py", line 1178, in error
File "/usr/lib/python2.7/logging/__init__.py", line 1271, in _log
File "/usr/lib/python2.7/logging/__init__.py", line 1281, in handle
File "/usr/lib/python2.7/logging/__init__.py", line 1321, in callHandlers
File "/usr/lib/python2.7/logging/__init__.py", line 749, in handle
File "/home/django/virtualenvs/homestead_django/local/lib/python2.7/site-packages/django/utils/log.py", line 122, in emit
File "/home/django/virtualenvs/homestead_django/local/lib/python2.7/site-packages/django/utils/log.py", line 125, in connection
File "/home/django/virtualenvs/homestead_django/local/lib/python2.7/site-packages/django/core/mail/__init__.py", line 29, in get_connection
File "/home/django/virtualenvs/homestead_django/local/lib/python2.7/site-packages/django/utils/module_loading.py", line 26, in import_by_path
File "/home/django/virtualenvs/homestead_django/local/lib/python2.7/site-packages/django/utils/module_loading.py", line 21, in import_by_path
File "/home/django/virtualenvs/homestead_django/local/lib/python2.7/site-packages/django/utils/importlib.py", line 40, in import_module
ImproperlyConfigured: Error importing module django.core.mail.backends.smtp: "No module named smtp"
So some uncaught exception is happening and then Django is trying to email me about it. The fact that it can't import django.core.mail.backends.smtp doesn't make sense because django.core.mail.backends.smtp should definitely be on the worker process's Python path. I can import it just fine from a manage.py shell, and I do get emails for other server errors (actual software bugs), so I know that works. It's like the worker process's environment is corrupted somehow.
Once a worker process enters this state it has a really hard time recovering; almost every request it serves ends up failing in this same manner. If I restart gunicorn everything is good (until another worker process falls into this weird state again).
I don't notice any obvious patterns, so I don't think this is being triggered by a bug in my app (the URLs erroring out are different, etc.). It seems like some sort of race condition.
Currently I'm using gunicorn's --max-requests option to mitigate this problem, but I'd like to understand what's going on here. Is this a race condition? How can I debug this?
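For reference, the --max-requests mitigation mentioned above can also be set in a gunicorn config file, which is plain Python (a sketch; the numbers are illustrative):
# gunicorn.conf.py
workers = 3
max_requests = 1000         # recycle each worker after this many requests
max_requests_jitter = 50    # stagger the restarts so workers don't all recycle at once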
I suggest you use Sentry, which gives you a smart way of handling errors.
You can use it as a cloud-based solution (getsentry) or you can install it on your own server (github).
Before, I was using the Django core log mailer; now I always use Sentry.
I do not work at Sentry, but their solution is pretty awesome!
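If you go the Sentry route, a minimal Django hookup with the current sentry-sdk package looks something like this (an assumption on my part, since this answer predates sentry-sdk; the DSN is a placeholder):
# settings.py
import sentry_sdk
from sentry_sdk.integrations.django import DjangoIntegration

sentry_sdk.init(
    dsn="https://<key>@<your-sentry-host>/<project-id>",  # placeholder DSN
    integrations=[DjangoIntegration()],
    send_default_pii=False,  # don't attach user data by default
)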
We discovered one particular view that pegged the CPU for a few seconds every time it was loaded, which seemed to be triggering this issue. I still don't understand how slamming a gunicorn worker could corrupt its execution environment, but fixing the high-CPU view seems to have gotten rid of the problem.
