kubectl exec returning `Handshake status 500` - python

I'm using the Python kubernetes 3.0.0 library and Kubernetes 1.6.6 on AWS.
I have pods that can disappear quickly. Sometimes when I try to exec into them I get an ApiException with the error status Handshake status 500.
This happens with in-cluster configuration as well as with kube config.
When the pod/container doesn't exist I get a 404 error, which is reasonable, but 500 is an Internal Server Error. I don't see any 500 errors in kube-apiserver.log, where I do find the 404 ones.
What does it mean, and can someone point me in the right direction?

I know that this question is a little old, but I thought I would share what I found when trying to use python/kubernetes attach/exec for several debugging cases (since this isn't documented anywhere I can find).
As far as I can tell, it's all about making the keyword arguments match the actual container configuration as opposed to what you want the container to do.
When creating pods using kubectl run, if you don't use the -i --tty flags (indicating interactive/TTY allocation) and then attempt to set either the tty or stdin flags to True in your function, you'll get a mysterious 500 error with no other debug info. If you need to use stdin and tty and you are using a configuration file (as opposed to run), then make sure you set the stdin and tty flags to true in spec.containers.
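For illustration, here is a minimal sketch of creating such a pod with the Python client, with stdin and tty enabled in the container spec (the pod name and image are just placeholders):

from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

# Placeholder name/image; the important part is stdin=True and tty=True,
# which correspond to spec.containers[].stdin/tty in a YAML manifest.
pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="alpine-python-2"),
    spec=client.V1PodSpec(containers=[
        client.V1Container(
            name="alpine-python-2",
            image="python:2-alpine",
            command=["cat"],  # keeps the container alive while attached
            stdin=True,
            tty=True,
        )
    ]),
)
v1.create_namespaced_pod(namespace="default", body=pod)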
While running resp.readline_stdout(), if you get an OverflowError: timestamp too large to convert to C _PyTime_t, set the keyword argument timeout=<any integer>. The timeout argument defaults to None, which is an invalid value for that function.
If you run the attach/exec command and get an APIException with a status code of 0 and the error Reason: hostname 'X.X.X.X' doesn't match either of..., note that there appears to be an incompatibility with Python 2. It works in Python 3 and should be patched eventually.
I can confirm that a 404 code is thrown via an ApiException when the pod doesn't exist.
If you are getting a mysterious error saying upgrade request required, note that you need to use the kubernetes.stream.stream function to wrap the call to attach/exec. You can see this issue on GitHub and this example code to help you get past that part.
Here's my example:

resp = kubernetes.stream.stream(
    k8s.connect_get_namespaced_pod_attach,
    name='alpine-python-2',
    namespace='default',
    stderr=True, stdin=True, stdout=True, tty=True,
    _preload_content=False,
)
Note that the _preload_content=False is essential in the attach command or else the call will block indefinitely.
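Putting those pieces together, a rough end-to-end sketch (the pod name is a placeholder, and it assumes the container is running cat so that stdin is echoed back to stdout):

from kubernetes import client, config
from kubernetes.stream import stream

config.load_kube_config()
k8s = client.CoreV1Api()

resp = stream(
    k8s.connect_get_namespaced_pod_attach,
    name='alpine-python-2',
    namespace='default',
    stderr=True, stdin=True, stdout=True, tty=True,
    _preload_content=False,  # returns a WSClient instead of blocking
)

resp.write_stdin('hello\n')              # cat echoes this back on stdout
print(resp.readline_stdout(timeout=3))   # explicit integer timeout avoids the OverflowError mentioned above
resp.close()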
I know that was probably more information than you wanted, but hopefully at least some of it will help you.

For me, the reason for the 500 was that the pod was unable to pull its image from GCR.

For me the reason was that I had two pods with the same label attached: one pod was in the Evicted state and the other was running. I deleted the Evicted pod and the issue was fixed.
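For what it's worth, a rough sketch of the same cleanup with the Python client (the label selector is a made-up example):

from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

# "app=myapp" is a hypothetical label selector matching both pods
pods = v1.list_namespaced_pod(namespace="default", label_selector="app=myapp")
for pod in pods.items:
    if pod.status.reason == "Evicted":
        v1.delete_namespaced_pod(
            name=pod.metadata.name,
            namespace="default",
            body=client.V1DeleteOptions(),
        )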

Related

Live ECS logging into Cloudwatch

I am using an ECS task which runs a Docker container to execute some terraform commands.
I would like to log the results of the terraform commands into CloudWatch, live if possible. I am using the logging package of Python 3.
The function I use to output the result of the command is the following:
def execute_command(command):
    """
    This method is used to execute the several commands
    :param command: The command to be executed
    :return decoded: The result of the command execution
    """
    logging.info('Executing: {}'.format(command))
    process = subprocess.Popen(command, stdout=subprocess.PIPE, stderr=subprocess.PIPE, shell=True)
    communicate = process.communicate()
    decoded = (communicate[0].decode('utf-8'), communicate[1].decode('utf-8'))
    for stdout in decoded[0].split('\n'):
        if stdout != '':
            logging.info(stdout)
    for stderr in decoded[1].split('\n'):
        if stderr != '':
            logging.warning(stderr)
    return decoded
Which is called the following way:
apply_command = 'terraform apply -input=false -auto-approve -no-color {}'.format(plan_path)
terraform_apply_output = utils.execute_command(apply_command)
if terraform_apply_output[1] is not '':
    logging.info('Apply has failed. See above logs')
    aws_utils.remove_message_from_queue(metadata['receipt_handle'])
    utils.exit_pipeline(1)
When the terraform command succeeds, I can see its output after the command has been executed (i.e. I see the result of the apply command after the resources have been applied), which is what the code expects.
When the terraform command fails (let's say because some resources were already deployed and not saved in a .tfstate), I cannot see the logs and the ECS task quits without an error message.
I can see 2 reasons for it:
The failed terraform command returns a non-zero code, which means the ECS task exits before outputting the logs to stdout (and so, into CloudWatch).
The output of the failed terraform command is sent to stderr, which is not correctly logged.
What is my error here, and how could I fix it? Any help greatly appreciated :)
This question sounds suspiciously familiar to me. Anyway.
Adding a sleep(10) just before exiting the task will fix the issue.
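For example, a sketch of the workaround applied to the question's code (time.sleep is from the standard library; the helper names come from the question):

import time

if terraform_apply_output[1] != '':
    logging.info('Apply has failed. See above logs')
    aws_utils.remove_message_from_queue(metadata['receipt_handle'])
    time.sleep(10)  # give the Fargate log driver time to flush the last events to CloudWatch
    utils.exit_pipeline(1)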
From AWS support:

I’ve been investigating the issue further and I noticed an internal ticket regarding CloudWatch logs sometimes being truncated for Fargate tasks. The problem was reported as a known issue in the latest Fargate platform version (1.3.0). [1] Looking at our internal tickets for the same, as you mentioned in the case description, the current workaround to avoid this situation is extending the lifetime of the existing container by adding a delay (~>10 seconds) between the logging output of the application and the exit of the process (exit of the container). I can confirm that our service team are still working to get a permanent resolution for this reported issue. Unfortunately, there is no ETA shared for when the fix will be deployed. However, I've taken this opportunity to add this case to the internal ticket to inform the team of the similar and try to expedite the process. In addition, I'd recommend keeping an eye on the ECS release notes for updates to the Fargate platform version which address this behaviour:
-- https://aws.amazon.com/new/
-- https://docs.aws.amazon.com/AmazonECS/latest/developerguide/document_history.html

Fabric - Python 3 - What is context and what does it have to contain and why do I need to pass it?

This is my fabric code:
from fabric import Connection, task

server = Connection(host="username@server.com:22", connect_kwargs={"password": "mypassword"})

@task
def dostuff(somethingmustbehere):
    server.run("uname -a")
This code works just fine. When I execute fab dostuff it does what I want it to do.
When I remove somethingmustbehere, however, I get this error message:
raise TypeError("Tasks must have an initial Context argument!")
TypeError: Tasks must have an initial Context argument!
I never defined somethingmustbehere anywhere in my code. I just put it in and the error is gone and everything works. But why? What is this variable? Why do I need it? Why is it so important? And if it is so important why can it just be empty? I am really lost here. Yes it works, but I cannot run code that I don't understand. It drives me insane. :-)
Please be aware that I'm talking about the Python 3(!) version of Fabric!
The Fabric version is 2.4.0
To be able to run a @task you need a context argument. Fabric uses Invoke's task(), which expects to see a context object. Normally we name the variable c or ctx (I always use ctx to make it clearer; I prefer not to use c because I normally use it for a connection).
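For illustration, the question's task rewritten with a conventionally named context argument might look like this (same placeholder credentials as the question):

from fabric import Connection, task

server = Connection(host="username@server.com:22", connect_kwargs={"password": "mypassword"})

@task
def dostuff(ctx):           # ctx is the Context object Invoke passes to every task
    server.run("uname -a")  # the task can keep using the module-level Connection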
Check this line on GitHub from the invoke package repo; you will see that it raises an exception when the context argument is not present, but it doesn't explain why!
To learn more about the Context object, what it is and why we need it, you can read the following on the pyinvoke site:
Aside: what exactly is this ‘context’ arg anyway? A common problem task runners face is transmission of “global” data - values loaded from configuration files or other configuration vectors, given via CLI flags, generated in ‘setup’ tasks, etc.

Some libraries (such as Fabric 1.x) implement this via module-level attributes, which makes testing difficult and error prone, limits concurrency, and increases implementation complexity.

Invoke encapsulates state in explicit Context objects, handed to tasks when they execute. The context is the primary API endpoint, offering methods which honor the current state (such as Context.run) as well as access to that state itself.
Check both of these links:
Context
what exactly is this ‘context’ arg anyway?
To be honest, I wasted a lot of time figuring out what context is and why my code wouldn't run without it. But at some point I just gave up and started using it to make my code run without errors.

Templated Dataflow hangs over an hour at loading and quits with an error

When translating a Dataflow job from command-line execution to templated execution, I encountered the following problem.
The Dataflow starts but hangs at the loading state just after the following log message:
(71df0de383b642bd): Starting 1 workers in europe-west2-a...
After an hour, it seems like a time-out triggers and the Dataflow stops with the following log message:
(84598aaa4185b9a0): Workflow failed. Causes: (84598aaa4185b571): The Dataflow appears to be stuck. Please reach out to the Dataflow team at http://stackoverflow.com/questions/tagged/google-cloud-dataflow.
I followed the translation guide at https://cloud.google.com/dataflow/docs/templates/creating-templates and used the "RuntimeValueProvider" to catch all the arguments I normally pass on the command line.
Can I get any help on this?
It turns out that at creation time I had only set a --zone parameter, without a --region parameter. This results in a template file with the --region option set to its default ("us-central1" at the moment) and the --zone option left at whatever I had set it to. In my case, this was "europe-west1-b".
So a template with --region us-central1 and --zone europe-west1-b fails, which makes sense in the end. But there was no feedback on this: if you don't set the --region flag at template creation, not even the default value is rendered in the information pane. It would be handy if these values were rendered in the information pane on the right, or if mismatches that may cause such a failure were highlighted.
In the end, I create my templates with --region europe-west1 and --zone europe-west1-b to fix this. (It probably also works without the supplementary --zone option, but I picked one anyway.)
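For reference, a minimal sketch of creating the template with matching region and zone via the Python SDK's pipeline options (project, bucket, and template names are hypothetical):

from apache_beam.options.pipeline_options import PipelineOptions

# Hypothetical values; the point is that region and zone refer to the same location
options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",
    temp_location="gs://my-bucket/temp",
    staging_location="gs://my-bucket/staging",
    template_location="gs://my-bucket/templates/my-template",
    region="europe-west1",
    zone="europe-west1-b",
)

Running the pipeline once with these options should stage the template at template_location with a consistent region/zone pair.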
Thanks a lot Pablo, for pointing me to the regions. My templates work now.

How do I silence "VI_WARN_CONFIG_NLOADED" warning when using PyVisa?

I'm attempting to use PyVisa inside NodeJS using the Python-Shell module, and I have the Python code working, talking to the HPIB equipment. However, I'm getting this warning:
c:\python34\lib\site-packages\pyvisa\ctwrapper\functions.py:1222: VisaIOWarning:
VI_WARN_CONFIG_NLOADED (1073676407): The specified configuration either does not
exist or could not be loaded. VISA-specified defaults will be used.
ret = library.viOpenDefaultRM(byref(session))
It's only a warning, but because I want to use stdin/stdout to push data into, and receive data from, the Python code, this warning is causing the wrapper to stop and call back with an error.
At least I think that is what's happening.
Any ideas?
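One generic possibility (plain Python warnings filtering, not a PyVisa-specific setting) would be to suppress the VisaIOWarning category before creating the resource manager, for example:

import warnings
import pyvisa

# Filter the warning category rather than editing the VISA configuration;
# this only hides the message, it does not load the missing configuration.
warnings.simplefilter("ignore", pyvisa.errors.VisaIOWarning)

rm = pyvisa.ResourceManager()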

How should I indicate that my Python shell script is returning an error?

I’m writing a shell script in Python (#!/usr/bin/env python). I’m a bit new to shell scripting in general, so apologies if I’m misunderstanding something.
My current understanding is that if my shell script works successfully, I should call sys.exit() to indicate that it’s succeeded (i.e. return 0).
If I’ve encountered an error (specifically, that the user has passed in an argument that I’m not expecting), what should I return, and how?
Is it okay just to call sys.exit() with any non-zero value, e.g. sys.exit(1)?
Most shell utilities have various return values depending on the error that occurs.
The standard is that exiting with a status code of 0 means the execution ended successfully.
For other error codes, this is highly dependent on the utility itself. You're most likely to learn about error codes in the man pages of the aforementioned utilities.
Here's a simple example from the ls man page:
Exit status:
0 if OK,
1 if minor problems (e.g., cannot access subdirectory),
2 if serious trouble (e.g., cannot access command-line argument).
It's highly recommended that you properly document your utility's exit codes so that its users can use it correctly.
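For example, a minimal sketch of a script that follows the same convention (the script and argument names are made up):

import sys

def main(argv):
    # 0 = success, 2 = bad command-line argument (mirroring the ls convention above)
    if len(argv) != 2:
        print("usage: mytool.py <name>", file=sys.stderr)
        return 2
    print("Hello, {}".format(argv[1]))
    return 0

if __name__ == "__main__":
    sys.exit(main(sys.argv))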
Any non-zero value will do, so sys.exit(1) is correct. These error codes are useful when using scripts on the command line:
python test.py && echo 'success'
The above will not print 'success' when your script returns anything but 0.
