Tests fail when run by gitlab-ci, but not when run in bash - python

I'm using gitlab-ci to automatically build a C++ project and run unit tests written in Python (the job runs the daemon and then communicates with it via its network/socket-based interface).
The problem I'm finding is that when the tests are run by the GitLab CI runner, they fail for various reasons (one test stalls indefinitely on a particular network operation; another never receives a packet that should have been sent).
BUT: when I open up SSH and run the tests manually, they all pass (the tests also pass on all of our developers' machines [Linux/Windows/OSX]).
At this point I've been trying to replicate enough of the build/test conditions that gitlab-ci uses, but I don't know the exact details, and none of my experiments have reproduced the problem.
I'd really appreciate help with either of the following:
Guidance on running the tests manually outside of gitlab-ci while replicating its environment, so I can get the same errors/failures and debug the daemon and/or the tests, OR
Insight into why the tests would fail when run by the GitLab CI runner.
Sidetrack 1:
For some reason, not all the (mostly debugging) output that would normally be sent to the shell shows up in the gitlab-ci output.
Sidetrack 2:
I also tried setting this up with Jenkins, but there one of the tests fails to even connect to the daemon, while the rest connect fine.

- I usually replicate the problem by using a Docker container just for the runner and running the tests inside it; I don't know if you have it set up like this =(.
- Normally the test doesn't actually fail: if you log into the container you will see that it actually does everything, but it doesn't report back to GitLab CI. Don't freak out, it does its job, it simply doesn't say so.
PS: you can see whether it's actually running by checking the processes on the machine.
Example:
I'm running a GitLab CI job with Java and Docker:
gitlab-ci starts doing its thing and then hangs at a download; meanwhile I log into the container and can see that it is actually working and manages to upload my compiled Docker image.
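If it helps, here is a tiny sketch of that kind of process check in Python (the daemon name is a placeholder, and it assumes ps is available in the container):

# check_daemon.py -- a sketch for confirming the daemon really is running
# inside the runner's container; "mydaemon" is a placeholder process name.
import subprocess

def daemon_is_running(name="mydaemon"):
    # list the command names of all processes (requires ps in the container)
    out = subprocess.run(["ps", "-eo", "comm"], capture_output=True, text=True)
    return name in out.stdout.split()

if __name__ == "__main__":
    print("daemon running:", daemon_is_running())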

AWS Fargate Docker - How to print and see stdout/stderr from a headless ubuntu docker?

This may be a sort of 101 question, but in setting this up for the first time there are no hints about such a fundamental and common task. Basically I have a headless ubuntu running as a docker image inside AWS, which gets built via github actions CI/CD. All is running well.
Inside ubuntu I have some python scripts, let's say a custom server, cron jobs, some software running etc. How can I know, remotely, if there were any errors logged by any of these? Let's keep it simple: How can I print an error message, from a python server inside ubuntu, that I can read from outside docker? Does AWS have any kind of web interface for viewing stdout/stderr logs? Or at least an ssh console? Any examples somewhere?
Furthermore, I've set up my docker image with healthchecks, to confirm that the servers running inside ubuntu are online and serving. Those work locally: I can test them by running docker ps, which shows Status 'healthy'. How do I see the same thing when it's live on AWS?
Have I really missed something this big? It feels like this should be the first thing flashing on the main page of setting up a docker on AWS.
There are a few things to unpack here that you only learn after digging through a lot of stuff you don't need in order to get started, just so you can find out how to get started.
Docker will, by default, log the output of the startup process(es) you described in your Dockerfile, e.g. when you do ENTRYPOINT bash -c /home/ubuntu/my_dockerfile_sh_scripts/myStartupScripts.sh. If any subprocesses spawned by that process also log to stdout/stderr, the messages should bubble up to the main process and therefore show up in the docker logs. If they don't bubble up, look up how subprocess stdout/stderr redirection works in Linux.
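For instance, here is a minimal sketch of a Python entrypoint process that writes only to stdout/stderr (the messages and interval are made up), which is all it takes for the output to show up in docker logs:

# entrypoint.py -- a sketch: everything written to stdout/stderr here ends up
# in `docker logs` (and, with the awslogs driver, in CloudWatch Logs).
import logging
import sys
import time

logging.basicConfig(
    stream=sys.stdout,  # log to stdout instead of a file
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)

while True:
    logging.info("server heartbeat")                 # collected via stdout
    print("example error message", file=sys.stderr)  # stderr is collected too
    time.sleep(60)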
Ok, we know that, but where the heck is AWS's stats and logs page? Well, in Amazon CloudWatch™ of course. Didn't you already know about that term? Why, it says so right there when you create a docker image, or on your ECS console next to your docker Clusters, or next to your running docker image Service. OH WAIT! No, no it does not! There is no utterance of "CloudWatch" anywhere. Well, there is this one page that has "CloudWatch" on it, which you can get to if you know the URL, but hey, look at that, you don't actually see any sort of logs coming from your code in docker anywhere on there, so ...yeah. So where do you see your actual logs and output? There is this Logs tab, in your Service's page (the page of the currently running docker image): https://eu-central-1.console.aws.amazon.com/ecs/home?region=eu-central-1#/clusters/your-cluster-name/services/your-cluster-docker-image-service/logs. This generically named, undescribed tab does not show some AWS-side status log for the service; it actually shows you the docker logs I mentioned in point 1. Ok. How do I view this as a raw file, or access it remotely via a script? Well, I don't know. I guess you'll find out about that basic, common task after reading a couple of manuals about setting up the AWS CLI (another thing you didn't know existed).
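That said, a script that pulls the same log data out of CloudWatch Logs remotely can be sketched with boto3 along these lines (the log group name and region are assumptions; use whatever awslogs-group and region your task definition configures):

# fetch_logs.py -- a sketch of reading CloudWatch Logs remotely with boto3.
# "my-log-group" and the region are placeholders taken from the task definition.
import boto3

logs = boto3.client("logs", region_name="eu-central-1")

# grab the most recent events across all streams in the group
resp = logs.filter_log_events(logGroupName="my-log-group", limit=50)
for event in resp["events"]:
    print(event["timestamp"], event["message"])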
Like I said in point 1, docker does not log generic operating system messages, show you log files generated by your server, or pick up other software or jobs that are running but weren't described and started by your Dockerfile/config. So how do we get AWS to see those? Well, it's a bit of a pain in the ass: you either have to install and configure the awslogs agent in your image's OS (e.g. sudo yum install -y awslogs on Amazon Linux), or you can create symbolic links between specific log files and the stdout/stderr stream (the docker docs mention this). Also check Mark B's answer. But probably the easiest thing is to write your own little scripts with short messages that print the status of things to the main process. Usually that's all you need, unless you're an enterprise.
Is there any SSH or other AWS online command-line interface into the running docker container, like you get in your localhost docker desktop? So you could maybe cd and ls around, or search for files and see if everything's fine? No. Make your own. Or better yet, avoid needing that in the first place, even though that's inconvenient for R&D.
Healthchecks. Where the heck do I see my docker healthchecks? The equivalent of just running docker ps on localhost. Well, by default there aren't any healthchecks shown anywhere on AWS. Why would you need healthchecks anyway? So what if your Dockerfile has HEALTHCHECKs defined?.. 🙂 You have to set that up in Fargate™ (whatever Fargate even means, because the name isn't explained anywhere ("UX")). You have to create what is called a new Task Definition Revision. Go to your Clusters in Amazon ECS. Go to your cluster. Then click on your Service's entry in the Task Definition column of the services table at the bottom. Click Create New Revision (new task definition revision). On the new page, click on your container in the Container Definitions table. On the next page, scroll down to HEALTHCHECK: bingo! Now what is this? What commands do I paste in here? It's not automatically taking the HEALTHCHECK that I defined in my Dockerfile, so does that mean I must write something else here?? What environment are the healthchecks even run in? Is it my docker container? Is it Linux? Here's the answer: you paste into this box what you already wrote in your Dockerfile's HEALTHCHECK. Just use http://127.0.0.1 (localhost), as you would in your local docker desktop testing environment. Now click Update. Click Create. K, now we're still not done. Go back to Amazon ECS / Clusters / your cluster. Click on your service name in the services table. Click Update. Select the latest Revision. Check "force new deployment". Then keep clicking Next until finally you click Update Service. You can also define what triggers your image to be shut down on a healthcheck failure, for example if it ran out of RAM. Now #Amazon, I hope you take this answer and staple it to your shitty ass ECS experience.
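As a concrete (assumed, not official) example of what that box can run: if python is available inside your image, the command could be something along the lines of CMD python /healthcheck.py (as you would write it in a Dockerfile HEALTHCHECK), with /healthcheck.py being a sketch like this (the port and path are assumptions about your service):

# healthcheck.py -- a sketch; the port and path are assumptions about your app.
# Exit code 0 = healthy, non-zero = unhealthy (that's all Docker/ECS looks at).
import sys
import urllib.request

try:
    with urllib.request.urlopen("http://127.0.0.1:8080/health", timeout=5) as resp:
        sys.exit(0 if resp.getcode() == 200 else 1)
except Exception:
    sys.exit(1)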
I swear the relentlessly, exclusively bottom-up UX of platforms like AWS and Azure is what's keeping the tutorial blogger industry alive... How would I know what AWS CloudWatch is, or that it even exists? There are no hints about these things anywhere while you set up. You'd think the first thing that flashes on your screen after you complete a docker setup would be "hey, 99.9% of people right now need to set up logging. You should use CloudWatch. And here's how you connect healthchecks to CloudWatch". But no, of course not..! 🙃
Instead, AWS's "engineer" approach here seems to be: here's a grid of holes in the wall, and here's a mess of wires in a bucket next to it. Now, in order to do the common, frequently done tasks you want to do, you must first read the manual for each hole and the manual for each wire in the bucket, then find all of the holes and wires you need and plug them in in the right order (and to find the right order you need a blog post, because it always involves some level of not following the docs, and definitely also magic).
I guess it's called "job security" if you're an enterprise server engineer :)
I faced the same issue. I found the AWS wiki, but the /dev/stdout symbolic link doesn't work for me; the /proc/1/fd/1 symbolic link does.
Here is the solution:
Step 1. Add these commands to your Dockerfile:
# forward logs to docker log collector
RUN ln -sf /proc/1/fd/1 /var/log/console.log \
&& ln -sf /proc/1/fd/2 /var/log/error.log
Step 2. Refer to Step 2 of Mark B's answer below.
Step 1. Update your docker image by deleting all the log files you care about and replacing them with symbolic links to stdout or stderr. For example, to capture logs in an nginx container I might do the following in the Dockerfile:
RUN ln -sf /dev/stdout /var/log/nginx/access.log \
&& ln -sf /dev/stderr /var/log/nginx/error.log
Step 2. Configure the awslogs driver in the ECS Task Definition, like so:
"logConfiguration": {
"logDriver": "awslogs",
"options": {
"awslogs-group": "my-log-group",
"awslogs-region": "my-aws-region",
"awslogs-stream-prefix": "my-log-prefix"
}
}
And as long as you gave the ECS Execution Role permission to write to AWS Logs, log data will start appearing in CloudWatch Logs.

Error: Failed to Write, Broken Pipe when running python script on AWS EC2 instance

I'm new to cloud computing in general, and I've started a free trial with Amazon Web Services in hopes of using their EC2 servers to run some code for Kaggle competitions. I'm currently working on running a test Python script that does some image processing and tests a linear classifier (I don't suspect these details are relevant to my problem, but I wanted to provide some context).
Here are the steps I've gone through to run my script on an EC2 instance:
Log in to AWS and start an EC2 instance on which I've installed the relevant Python modules for my tasks (e.g. the Anaconda distribution). As a side note, all my data and the script I want to run are in the same directory on this server instance.
SSH to my EC2 instance from my laptop and cd to the directory with my script.
Run screen so the program can run in the background.
Run the script via python program.py and detach from the screen session (Ctrl+A, D).
Keep the EC2 instance running, but exit the SSH session connecting my laptop to the server.
I've followed these steps a number of times, and they result in either (a) "Broken Pipe" errors, or (b) an error where the connection appears to hang. In the case of (b), I've attempted to disconnect from the SSH session and reconnect to the server; however, I am unable to do so because of an error stating "connection has been reset by peer".
I'm not sure if I need to configure something differently on the EC2 instance, or if I need to specify different options when connecting to the server via SSH. Any help here would be appreciated. Thanks for reading.
EDIT: I've been successful in running some example scripts using scikit-learn by setting up an IPython notebook, launching it with nohup, and running the code in a notebook cell. However, when trying to do the same with my Kaggle competition code, the same "hanging" issue happens, the connection appears to be dropped, and the code stops running. The image dataset I'm running the code on in the second case is quite a bit larger than the dataset processed by the example code in the first case. I'm not sure whether dataset size alone is causing the issue, or how to solve this.

How to create a "watchdog" for a python script running on a shared host (no ssh access or shell scripts)

My friends and I have written a simple Telegram bot in Python. The script runs on a remote shared host. The problem is that, for some reason, the script stops from time to time, and we want some sort of mechanism to check whether it is running and to restart it if necessary.
However, we don't have SSH access, we can't run bash scripts, and I couldn't find a way to install supervisord. Is there a way to achieve the same result with a different method?
P.S. I would appreciate it if you gave a detailed explanation, as I'm a newbie hobbyist. However, I have no problem with researching and learning new things.
You can have a small supervisor Python script whose only purpose is to start (and restart) your main application script. When your application crashes, the supervisor notices and restarts it.
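A minimal sketch of that idea (bot.py is a placeholder for your actual bot script; the shared host only needs some way to launch this one watchdog script, e.g. from its control panel or a cron-like scheduler):

# watchdog.py -- a sketch of the supervisor idea: launch bot.py and restart it
# whenever it exits or crashes. "bot.py" is a placeholder filename.
import subprocess
import sys
import time

while True:
    print("starting bot.py ...", flush=True)
    exit_code = subprocess.call([sys.executable, "bot.py"])
    print("bot.py exited with code %d; restarting in 5s" % exit_code, flush=True)
    time.sleep(5)  # short pause so a crash loop doesn't spin the CPU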

How to diagnose failure to create process when running windows executable from a network share

I would like to run some Windows binaries from a network share, to avoid update and installation issues across our large collection of test and developer machines. The specific package in question is pytest. I have a copy of Python (2.7.6) installed on the network share and it works fine. Using this Python, I installed pytest onto the network share under my account. My account is special because it's the only one with read/write access to the network share; all other accounts have read-only access.
After installing pytest, I ran the py.test.exe executable in the Scripts directory and it was able to run:
X:\python\2.7.6\windows\x86>Scripts\py.test.exe
============================= test session starts =============================
platform win32 -- Python 2.7.6 -- py-1.4.20 -- pytest-2.5.2
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! KeyboardInterrupt !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
X:\python\2.7.6\windows\x86\lib\ntpath.py:401: KeyboardInterrupt
============================== in 18.30 seconds ==============================
X:\python\2.7.6\windows\x86>
However, other users cannot run this executable, and get the message
failed to create process
If pytest is installed on a local drive there is no problem. I have performed the following experiments:
Run py.test locally as an administrator: works
Run py.test locally as a regular user: fails to create process
Run from network share using my special id: works
Run from network share using 'regular' id (which has Administrator rights on that PC): fails to create process
Run from network share using 'regular' id and specifically run in a window as Administrator: fails to create process
I had a couple of theories: that py.test.exe must be
run as administrator, and possibly
run from a directory with read/write permissions.
I tried sorting all the files in the network folder by timestamp to see if perhaps some file (like a .pid) was being created and then stymied by the read-only nature of the network share, but I didn't see anything, so the read/write aspect may not be applicable. So it appears that binaries run from a network share are somehow not run with administrator privileges, except under my special id.
Does anyone know why, and how I can address this? The EnableLinkedConnections GPO seemed related, but the symptom is different and in fact it did not help.
Any suggestions on how to further diagnose the problem?

python: testing a server-deployed application

I've got a small application (https://github.com/tkoomzaaskz/cherry-api) and I would like to integrate it with Travis. In fact, Travis is probably not important here. My question is how to configure a build/job to execute the following sequence:
start the server that serves the application
run tests
close the server (which means close the build)
The application is written in Python/CherryPy (a basic webapp framework). On my localhost I do it using two consoles: one runs the server and the other runs the tests. It's pretty easy and works fine. But when I want to execute all this in the CI environment, I run into trouble: I can't regain control after the server is started, because the server process waits for requests... and waits... and waits... and the tests are never run (https://travis-ci.org/tkoomzaaskz/cherry-api/builds/10855029 - this build runs forever). Additionally, I don't know how to close the server. This is my .travis.yml:
before_script: python src/hello.py
script: nosetests
src/hello.py starts the built-in CherryPy server (listening on localhost:8080). I know I can move it to the background by adding an &: before_script: python src/hello.py & but then I would have to find the process ID in the CI environment and kill the process, which seems like a very dirty solution; I guess there's something better than that.
I'd appreciate any hints on how I can configure this.
edit: I've configured the dirty "run in the background and then kill the process" approach in this file. The build passes now. Still, I think it's ugly...
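For reference, one way to keep the whole sequence in a single process is a small Python driver that starts the server, waits for the port, runs the tests and then shuts the server down. This is only a sketch, assuming src/hello.py listens on localhost:8080 as described:

# run_tests.py -- a sketch: start the CherryPy app, wait until it accepts
# connections, run nosetests, then terminate the server so the build can finish.
import socket
import subprocess
import sys
import time

server = subprocess.Popen([sys.executable, "src/hello.py"])
result = 1
try:
    for _ in range(30):  # wait up to ~30s for localhost:8080 to come up
        try:
            socket.create_connection(("localhost", 8080), timeout=1).close()
            break
        except OSError:
            time.sleep(1)
    result = subprocess.call(["nosetests"])
finally:
    server.terminate()  # close the server so the job can end
    server.wait()

sys.exit(result)

With something like this, the Travis config only needs script: python run_tests.py.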
