I am using an ECS task which runs a Docker container to execute some terraform commands.
I would like to log the output of the terraform commands to CloudWatch, live if possible. I am using the logging package of Python 3.
The function I use to output the result of the command is the following:
import logging
import subprocess

def execute_command(command):
    """
    Execute the given command and log its output.
    :param command: The command to be executed
    :return decoded: The result of the command execution
    """
    logging.info('Executing: {}'.format(command))
    process = subprocess.Popen(command, stdout=subprocess.PIPE, stderr=subprocess.PIPE, shell=True)
    communicate = process.communicate()
    decoded = (communicate[0].decode('utf-8'), communicate[1].decode('utf-8'))
    for stdout in decoded[0].split('\n'):
        if stdout != '':
            logging.info(stdout)
    for stderr in decoded[1].split('\n'):
        if stderr != '':
            logging.warning(stderr)
    return decoded
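For the "live" part, here is a rough sketch of a streaming variant (an illustration only, not the code I currently run) that logs each line as soon as the command produces it, instead of waiting for communicate() to return:

def execute_command_streaming(command):
    # Sketch only: merge stderr into stdout and log lines as they arrive,
    # so CloudWatch receives output while the command is still running.
    process = subprocess.Popen(command, stdout=subprocess.PIPE,
                               stderr=subprocess.STDOUT, shell=True)
    lines = []
    for raw_line in iter(process.stdout.readline, b''):
        line = raw_line.decode('utf-8').rstrip()
        if line:
            logging.info(line)
            lines.append(line)
    return_code = process.wait()
    return return_code, lines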
The execute_command function is called the following way:
apply_command = 'terraform apply -input=false -auto-approve -no-color {}'.format(plan_path)
terraform_apply_output = utils.execute_command(apply_command)
if terraform_apply_output[1] != '':
    logging.info('Apply has failed. See above logs')
    aws_utils.remove_message_from_queue(metadata['receipt_handle'])
    utils.exit_pipeline(1)
When the terraform command succeeds, I can see its output after the command has been executed (i.e. I see the result of the apply command after the resources have been applied), which is what the code expects.
When the terraform command fails (let's say because some resources were already deployed and not saved in a .tfstate), I cannot see the logs and the ECS task quits without an error message.
I can see 2 reasons for it:
The failed terraform command returns a non-zero code, which means the ECS task exits before outputting the logs to stdout (and so to CloudWatch).
The output of the failed terraform command is sent to stderr, which is not correctly logged.
What is my error here, and how could I fix it? Any help greatly appreciated :)
This question sounds suspiciously familiar to me. Anyway.
Adding a sleep(10) just before exiting the task will fix the issue.
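For example, a minimal sketch of the workaround (assuming the utils.exit_pipeline helper from the question simply wraps sys.exit, which the question does not show):

import sys
import time

def exit_pipeline(exit_code):
    # Assumption: this is roughly what utils.exit_pipeline does.
    # Workaround: give the Fargate log driver ~10 seconds to flush the last
    # log events to CloudWatch before the container exits.
    time.sleep(10)
    sys.exit(exit_code)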
From AWS support:
I’ve been investigating the issue further and I noticed an internal
ticket regarding CloudWatch logs sometimes being truncated for Fargate
tasks. The problem was reported as a known issue in the latest Fargate
platform version (1.3.0). [1] Looking at our internal tickets for the
same, as you mentioned in the case description, the current workaround
to avoid this situation is extending the lifetime of the existing
container by adding a delay (~>10 seconds) between the logging output
of the application and the exit of the process (exit of the
container). I can confirm that our service team are still working to
get a permanent resolution for this reported issue. Unfortunately,
there is no ETA shared for when the fix will be deployed. However,
I've taken this opportunity to add this case to the internal ticket to
inform the team of the similar and try to expedite the process. In
addition, I'd recommend keeping an eye on the ECS release notes for
updates to the Fargate platform version which address this behaviour:
-- https://aws.amazon.com/new/
-- https://docs.aws.amazon.com/AmazonECS/latest/developerguide/document_history.html
"
Related
I am trying to mount a File Share (not a Blob Storage) during the JobPreparationTask. My node OS is Ubuntu 16.04.
To do this, I am doing the following:
job_user = batchmodels.AutoUserSpecification(
    scope=batchmodels.AutoUserScope.pool,
    elevation_level=batchmodels.ElevationLevel.admin)
start_task = batch.models.JobPreparationTask(
    command_line=start_commands,
    user_identity=batchmodels.UserIdentity(auto_user=job_user))
end_task = batch.models.JobReleaseTask(
    command_line=end_commands,
    user_identity=batchmodels.UserIdentity(auto_user=job_user))
job = batch.models.JobAddParameter(
    job_id,
    batch.models.PoolInformation(pool_id=pool_id),
    job_preparation_task=start_task,
    job_release_task=end_task)
My start_commands and end_commands are fine, but there is something wrong with the User permissions...
I get no output in the stderr.txt or in the stdout.txt file.
I do not see any logs whatsoever (where are they?). All I am able to find is a message showing this:
Exit code: 1
Retry count: 0
Failure info:
  Category: UserError
  Code: FailureExitCode
  Message: The task exited with an exit code representing a failure
  Details: Message: The task exited with an exit code representing a failure
Very detailed error message!
Anyway, I have also tried changing AutoUserScope.pool to AutoUserScope.task, but there is no change.
Anyone have any ideas?
I had this issue which was frustrating me because I couldn't get any logs from my application.
What I ended up doing is RDP'ing into the node my job ran on, going to %AZ_BATCH_TASK_WORKING_DIR% as specified in Azure Batch Compute Environment Variables and then checking the stdout.txt and stderr.txt in my job.
The error was that I formulated my CloudTask's commandline incorrectly, so it could not launch my application in the first place.
To RDP into your machine, in Azure Portal:
Batch Account
Pools (select your pool)
Nodes
Select the node that ran your job
Select "Connect" link at the top.
I'm using python kubernetes 3.0.0 library and kubernetes 1.6.6 on AWS.
I have pods that can disappear quickly. Sometimes when I try to exec into them I get an ApiException with a "Handshake status 500" error status.
This is happening with in-cluster configuration as well as with kube config.
When the pod/container doesn't exist I get a 404 error, which is reasonable, but 500 is an Internal Server Error. I don't get any 500 errors in kube-apiserver.log, where I do find the 404 ones.
What does it mean, and can someone point me in the right direction?
I know that this question is a little old, but I thought I would share what I found when trying to use python/kubernetes attach/exec for several debugging cases (since this isn't documented anywhere I can find).
As far as I can tell, it's all about making the keyword arguments match the actual container configuration as opposed to what you want the container to do.
When creating pods using kubectl run, if you don't use -i --tty flags (indicating interactive/TTY allocation), and then attempt to set either the tty or stdin flags to True in your function, then you'll get a mysterious 500 error with no other debug info. If you need to use stdin and tty and you are using a configuration file (as opposed to run), then make sure you set the stdin and tty flags to true in spec.containers.
While running resp.readline_stdout(), if you get an OverflowError: timestamp too large to convert to C _PyTime_t, set the keyword argument timeout=<any integer>. The timeout argument defaults to None, which is an invalid value in that function.
If you run the attach/exec command and get an ApiException with a status code of 0 and the error Reason: hostname 'X.X.X.X' doesn't match either of..., note that there appears to be an incompatibility with Python 2; it works in Python 3 and should be patched eventually.
I can confirm 404 code is thrown via an ApiException when the pod doesn't exist.
If you are getting a mysterious error saying upgrade request required, note that you need to use the kubernetes.stream.stream function to wrap the call to attach/exec. You can see this issue on GitHub and this example code to help you get past that part.
Here's my example:
resp = kubernetes.stream.stream(
    k8s.connect_get_namespaced_pod_attach,
    name='alpine-python-2',
    namespace="default",
    stderr=True, stdin=True, stdout=True, tty=True,
    _preload_content=False)
Note that the _preload_content=False is essential in the attach command or else the call will block indefinitely.
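Putting those pieces together, here is a rough, self-contained sketch of an exec call (the pod name, namespace and command are placeholders, and I am assuming a standard kubeconfig is available):

from kubernetes import client, config
from kubernetes.stream import stream

config.load_kube_config()  # or config.load_incluster_config() inside the cluster
api = client.CoreV1Api()

# The stdin/tty keyword arguments should match how the container was created,
# otherwise you can hit the opaque 500 handshake error described above.
resp = stream(api.connect_get_namespaced_pod_exec,
              name='alpine-python-2',   # placeholder pod name
              namespace='default',
              command=['/bin/sh', '-c', 'echo hello'],
              stderr=True, stdin=False, stdout=True, tty=False)
print(resp)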
I know that was probably more information than you wanted, but hopefully at least some of it will help you.
For me, the reason for the 500 was that the pod was unable to pull its image from GCR.
For me, the reason was that I had two pods with the same label attached: one pod was in Evicted state and the other was running. I deleted the Evicted pod and the issue was fixed.
I am working on an app using Google's compute engine and would like to use pre-emptible instances.
I need my code to respond to the 30s warning google gives via an ACPI G2 Soft Off signal that they send when they are going to take away your VM as described here: https://cloud.google.com/compute/docs/instances/preemptible.
How do I detect this event in my Python code that is running on the machine and react to it accordingly? (In my case I need to put the job the VM was working on back on a queue of open jobs so that a different machine can take it.)
I am not answering the question directly, but I think that your actual intent is different:
The G2 power button event is generated both by preemption of a VM and by the gcloud compute instances stop command (or the corresponding API, which it calls);
I am assuming that you want to react specially only on instance preemption.
Avoid a common misunderstanding
GCE does not send a "30s termination warning" with the power button event. It just sends the normal, honest power button soft-off event that immediately initiates shutdown of the system.
The "warning" part that comes with it is simple: “Here is your power button event, shutdown the OS ASAP, because you have 30s before we pull the plug off the wall socket. You've been warned!”
You have two system services that you can combine in different ways to get the desired behavior.
1. Use the fact that the system is shutting down upon ACPI G2
The most kosher (and, AFAIK, the only supported) way of handling the ACPI power button event is to let the system handle it, and execute what you want in the instance shutdown script. In a systemd-managed machine, the default GCP shutdown script is simply invoked by a Type=oneshot service's ExecStop= command (see systemd.service(5)). The script is run relatively late in the shutdown sequence.
If you must ensure that the shutdown script is ran after (or before) some of your services is sent a signal to terminate, you can modify some of service dependencies. Things to keep in mind:
After and Before are reversed on shutdown: if X is started after Y, then it's stopped before Y.
The After dependency ensures that the service in the sequence is told to terminate before the shutdown script is run. It does not ensure that the service has already terminated.
The shutdown script is run when the google-shutdown-scripts.service is stopped as part of system shutdown.
With all that in mind, you can do sudo systemctl edit google-shutdown-scripts.service. This will create an empty configuration override file and open your $EDITOR, where you can put your After and Before dependencies, for example,
[Unit]
# Make sure that shutdown script is run (synchronously) *before* mysvc1.service is stopped.
After=mysvc1.service
# Make sure that mysvc2.service is sent a command to stop before the shutdown script is run
Before=mysvc2.service
You may specify as many After or Before clauses as you want, 0 or more of each. Read systemd.unit(5) for more information.
2. Use GCP metadata
There is an instance metadatum, v1/instance/preempted. If the instance is preempted, its value is TRUE; otherwise it is FALSE.
GCP has a thorough documentation on working with instance metadata. In short, there are two ways you can use this (or any other) metadata value:
Query its value at any time, e. g. in the shutdown script. curl(1) equivalent:
curl -sfH 'Metadata-Flavor: Google' \
'http://169.254.169.254/computeMetadata/v1/instance/preempted'
Run an HTTP request that will complete (200) when the metadatum changes. The only change that can ever happen to it is from FALSE to TRUE, as preemption is irreversible.
curl -sfH 'Metadata-Flavor: Google' \
'http://169.254.169.254/computeMetadata/v1/instance/preempted?wait_for_change=true'
Caveat: The metadata server may return a 503 response if it is temporarily unavailable (this is very rare, but happens), so certain retry logic is required. This is especially true for the long-running second form (with ?wait_for_change=true), as the pending request may return at any time with the code 503. Your code should be ready to handle this and restart the query. curl does not return the HTTP error code directly, but you can use the fact that the x=$(curl ....) expression returns an empty string if you are scripting it; your criterion for positive detection of preemption is [[ $x == TRUE ]] in this case.
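As an illustration, here is a rough Python sketch of the long-polling form with that retry logic (using the requests library; the URL is the same one as in the curl examples above):

import time
import requests

METADATA_URL = ('http://169.254.169.254/computeMetadata/v1/'
                'instance/preempted?wait_for_change=true')

def wait_for_preemption():
    # Long-poll the metadata server; retry on 503s and dropped connections,
    # and only treat an explicit TRUE as positive detection of preemption.
    while True:
        try:
            resp = requests.get(METADATA_URL,
                                headers={'Metadata-Flavor': 'Google'},
                                timeout=3600)
            if resp.status_code == 200 and resp.text.strip() == 'TRUE':
                return
        except requests.RequestException:
            pass  # transient error, fall through and retry
        time.sleep(1)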
Summary
If you want to detect that the VM is shutting down for any reason, use the Google-provided shutdown script.
If you also need to distinguish whether the VM was in fact preempted, as opposed to stopped with gcloud compute instances stop <vmname> (which also sends the power button event!), query the preempted metadata in the shutdown script.
Run a pending HTTP request for metadata change, and react on it accordingly. This will complete successfully when VM is preempted only (but may complete with an error at any time too).
If the daemon that you run is your own, you can also directly query the preempted metadata from the code path which handles the termination signal, if you need to distinguish between different shutdown reasons.
It is not impossible that the real decision point is whether you have an "active job" that you want to return to the "queue", or not: if your service is requested to stop while holding on an active job, just return it, regardless of the reason why you are being stopped. But I cannot comment on this, not knowing your actual design.
I think the simplest way to handle GCP preemption is using SIGTERM.
The SIGTERM signal is a generic signal used to cause program
termination. Unlike SIGKILL, this signal can be blocked, handled, and
ignored. It is the normal way to politely ask a program to terminate. https://www.gnu.org/software/libc/manual/html_node/Termination-Signals.html
This does depend on shutdown scripts, which are run on a "best effort" basis. In practice, shutdown scripts are very reliable for short scripts.
In your shutdown script:
echo "Running shutdown script"
preempted = curl "http://metadata.google.internal/computeMetadata/v1/instance/preempted" -H "Metadata-Flavor: Google"
if $preempted; then
PID="$(pgrep -o "python")"
echo "Send SIGTERM to python"
kill "$PID"
sleep infinity
fi
echo "Shutting down"
In main.py:
import signal
import os

def sigterm_handler(sig, frame):
    print("Got SIGTERM")
    os.environ["IS_PREEMPTED"] = "True"  # environment values must be strings
    # Call cleanup functions

signal.signal(signal.SIGTERM, sigterm_handler)

if __name__ == "__main__":
    print("Main")
I am using Supervisor (process controller written in python) to start and control my web server and associated services. I find the need at times to enter into pdb (or really ipdb) to debug when the server is running. I am having trouble doing this through Supervisor.
Supervisor allows the processes to be started and controlled with a daemon called supervisord, and offers access through a client called supervisorctl. This client allows you to attach to one of the foreground processes that has been started using a 'fg' command. Like this:
supervisor> fg webserver
All logging data gets sent to the terminal. But I do not get any text from the pdb debugger. It does accept my input so stdin seems to be working.
As part of my investigation I was able to confirm that neither print nor raw_input sends any text out either; but in the case of raw_input, stdin is indeed working.
I was also able to confirm that this works:
sys.stdout.write('message')
sys.stdout.flush()
I thought that when I issued the fg command it would be as if I had run the process in the foreground in a standard terminal ... but it appears that supervisorctl is doing something more. Regular printing does not flush, for example. Any ideas?
How can I get pdb, standard prints, etc to work properly when connecting to the foreground terminal using the fg command in supervisorctl?
(Possible helpful ref: http://supervisord.org/subprocess.html#nondaemonizing-of-subprocesses)
It turns out that Python buffers its output stream by default. In certain cases (such as this one) this results in output being held back.
Idioms like this exist:
sys.stdout = os.fdopen(sys.stdout.fileno(), 'w', 0)
to force the buffer to zero.
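Note that the zero-buffering idiom above is Python 2 specific; on Python 3 an unbuffered text stream is not allowed, so the closest equivalent I know of (3.7+) is line buffering:

import sys

# Flush stdout on every newline instead of relying on the default block buffering.
sys.stdout.reconfigure(line_buffering=True)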
But the better alternative I think is to start the base python process in an unbuffered state using the -u flag. Within the supervisord.conf file it simply becomes:
command=python -u script.py
ref: http://docs.python.org/2/using/cmdline.html#envvar-PYTHONUNBUFFERED
Also note that this dirties up your log file - especially if you are using something like ipdb with ANSI coloring. But since it is a dev environment it is not likely that this matters.
If this is an issue - another solution is to stop the process to be debugged in supervisorctl and then run the process temporarily in another terminal for debugging. This would keep the logfiles clean if that is needed.
It could be that your webserver redirects its own stdout (internally) to a log file (i.e. it ignores supervisord's stdout redirection), and that prevents supervisord from controlling where its stdout goes.
To check if this is the case, you can tail -f the log, and see if the output you expected to see in your terminal goes there.
If that's the case, see if you can find a way to configure your webserver not to do that, or, if all else fails, try working with two terminals... (one for input, one for output).
I'm attempting to start a server app (in Erlang; it opens ports and listens for HTTP requests) via the command line using pexpect (or even directly using subprocess.Popen()).
The app starts fine, logs (via pexpect) to the screen fine, and I can interact with it via the command line as well...
The issue is that the server won't listen for incoming requests. The app listens when I start it up manually, by typing commands in the command line; starting it via subprocess/pexpect somehow stops the app from listening...
When I start it manually, "netstat -tlp" displays the app as listening; when I start it via Python (subprocess/pexpect), netstat does not register the app...
I have a feeling it has something to do with the environment, the way Python forks things, etc.
Any ideas?
thank you
basic example:
Note:
"-pz" just adds ./ebin to the module search path for the erl VM (library search path).
"-run" runs moduleName, without any parameters.
command_str = "erl -pz ./ebin -run moduleName"
child = pexpect.spawn(command_str)
child.interact() # Give control of the child to the user
All of this works correctly, which is strange. I have logging inside my code and all the log messages are output as they should be. The server wouldn't listen even if I started its process via a bash script, so I don't think it's the Python code that's causing it (that's why I have a feeling it's something about the way the new OS process is started).
It could be to do with the way that command line arguments are passed to the subprocess.
Without more specific code, I can't say for sure, but I had this problem working on sshsplit ( https://launchpad.net/sshsplit )
To pass arguments correctly (in this example "ssh -ND 3000"), you should use something like this:
openargs = ["ssh", "-ND", "3000"]
print "Launching %s" %(" ".join(openargs))
p = subprocess.Popen(openargs, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
This will not only allow you to see exactly what command you are launching, but should correctly pass the values to the executable. Although I can't say for sure without seeing some code, this seems the most likely cause of failure (could it also be that the program requires a specific working directory, or configuration file?).
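If a working directory or environment difference turns out to be the culprit, you can also set those explicitly on the spawned process. A rough sketch using the erl command from the question (the directory and variable here are placeholders):

import os
import subprocess

env = dict(os.environ)                     # start from the current environment
env.setdefault("HOME", "/home/appuser")    # placeholder for anything the app expects

p = subprocess.Popen(["erl", "-pz", "./ebin", "-run", "moduleName"],
                     cwd="/path/to/app",   # placeholder: the app's own directory
                     env=env,
                     stdout=subprocess.PIPE,
                     stderr=subprocess.PIPE)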