I have a Python data load service. One of its steps refreshes multiple Oracle materialized views. We have noticed that the service often gets stuck at this step, and the issue is fixed by restarting the pod.
I want to configure a command-based OpenShift liveness probe here.
The purpose is to detect whether the service has been stuck at this step for more than, say, x hours; if so, the probe fails and the pod is restarted.
The service does not expose an HTTP endpoint.
We log extensively in the script that runs this step.
Is there a way to poll the latest OpenShift deployment log and look for certain messages?
example:
#msg1
print("Refreshing materialized views")
.
.
.
#msg2
print("materialized view refreshed")
msg1 marks the start of the potentially problematic step. My intent is to write a command that polls the log and looks for msg2 (which marks completion, exit status 0); if it doesn't find msg2 for more than, say, 5 hours, the command must return a non-zero exit status, causing the probe to fail.
How can I implement this?
Is this the best way to do it?
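A minimal sketch of such a probe command, assuming the service also tees its timestamped log to a file inside the container (the path /var/log/app.log, the ISO timestamp prefix, and the script name probe.py are all assumptions to adapt):

#!/usr/bin/env python3
# probe.py -- hypothetical liveness-probe command, not part of the service itself.
# Exits non-zero if "Refreshing materialized views" was logged more than
# MAX_AGE seconds ago with no "materialized view refreshed" after it.
import sys
import time
from datetime import datetime

LOG_PATH = "/var/log/app.log"  # assumed log file location inside the container
MAX_AGE = 5 * 60 * 60          # 5 hours, in seconds

start_ts = None
try:
    with open(LOG_PATH, errors="replace") as log:
        for line in log:
            if "Refreshing materialized views" in line:
                try:
                    # assumed line format: "2024-05-01T03:00:00 Refreshing ..."
                    start_ts = datetime.fromisoformat(line.split()[0]).timestamp()
                except (ValueError, IndexError):
                    start_ts = time.time()  # no parsable timestamp: treat as just seen
            elif "materialized view refreshed" in line:
                start_ts = None  # the step completed; nothing is pending
except FileNotFoundError:
    pass  # no log yet; treat as healthy

if start_ts is not None and time.time() - start_ts > MAX_AGE:
    sys.exit(1)  # looks stuck: fail the probe so the pod restarts
sys.exit(0)

This would be wired up as the exec command of the container's livenessProbe. Note that a probe command runs inside the container and cannot read the oc logs output, which is why this sketch assumes the log is also written to a file.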
Related
I built a chatbot on Dialogflow and host the Python webhook for it on Heroku (code on GitHub). How can I prevent Heroku from sleeping, using Python? Could you share the code with me?
If you are using the Free Plan you cannot prevent the dyno from sleeping (it sleeps after 30 minutes of inactivity).
There is a workaround if your chatbot runs on a web site: in that case you could send a dummy request (just to start the dyno) to your webhook hosted on Heroku when the user accesses the page, for example:
<body onload="wakeUpCall();">
  <script type="text/javascript">
    function wakeUpCall() {
      var xhr2 = new XMLHttpRequest();
      xhr2.open("GET", "https://mywebhook.herokuapp.com/", true);
      xhr2.send(null);
    }
  </script>
</body>
It is obviously not a perfect approach (it works only if you control the client, and it relies on the dyno starting before the chatbot sends data to the webhook), but it is an option if you want to keep working with the Free plan.
First, some things to keep in mind before you try to use the free dyno for something it wasn't intended for:
Heroku provides 1000 free hours a month. This is only enough to run a single Heroku dyno at the free tier level. If you need to avoid the startup delay for two apps, then you'll need to pay for at least one of them.
Heroku still only allows a single free dyno to run on your app. This means you might lose traffic while you are pushing new code (since the free web dyno has to shut down so a new one can be built).
There are undoubtedly other issues as well, but those are the main ones I can think of offhand.
Now the solution:
Really, you just need something to ping your site at least once every 30 minutes. You could write a script for this (see the sketch below), but there is an extremely useful kind of tool that already does this and provides more benefit to you.
That would be an availability (or uptime) monitoring tool. It ensures your site is still up and running by pinging a URL every X minutes and checking that the response is valid and expected (e.g., a 200 status code and/or certain text on the page). These tools often also contact you if they receive an unexpected response (almost certainly an error) for too long.
Here is an example of an availability monitor:
https://uptimerobot.com/
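If you would rather write that script yourself than use a hosted monitor, a minimal Python sketch could look like the following. It must run somewhere other than the free dyno itself, since a sleeping dyno cannot wake itself; the URL is the placeholder from the snippet above:

import time

import requests

URL = "https://mywebhook.herokuapp.com/"  # placeholder webhook URL

while True:
    try:
        requests.get(URL, timeout=10)  # any successful request keeps the dyno awake
    except requests.RequestException:
        pass  # transient failure; the next iteration retries
    time.sleep(25 * 60)  # ping well inside the 30-minute sleep window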
My company has an ArcGIS server, and I've been trying to geocode some addresses using the Python requests package.
However, as long as the input format is correct, response.status_code is always 200, meaning everything is OK, even if the server didn't process the request properly.
(For example, if the batch size limit is 1,000 records and I send a JSON input with 2,000 records, it still returns status_code 200, but half of the records get ignored.)
I'm just wondering whether there is a way for me to know if the server processed the request properly or not.
A great spot to start is the server logs. They are located in your ArcGIS server manager (https://gisserver.domain.com:6443/arcgis/manager). I would assume it logs some type of warning/info there if records were ignored, but since that is not technically an error, no error message would be returned anywhere.
I doubt you'd want to do this, but if you want to raise your limit you can follow this technical article on how to do that: https://support.esri.com/en/technical-article/000012383
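Beyond the logs, you can also validate the response body in code rather than trusting the status code alone. A minimal sketch, assuming a geocodeAddresses-style endpoint that returns its results in a "locations" array and reports failures in an "error" object (check your server's actual response shape before relying on these keys):

import requests

def geocode_checked(url, payload, n_sent):
    """POST a geocode batch and verify the server processed every record."""
    resp = requests.post(url, data=payload, timeout=60)
    resp.raise_for_status()          # only catches transport-level failures
    body = resp.json()
    if "error" in body:              # some failures are reported in the body
        raise RuntimeError("server-side error: %s" % body["error"])
    locations = body.get("locations", [])
    if len(locations) != n_sent:     # records silently dropped?
        raise RuntimeError("sent %d records, got back %d"
                           % (n_sent, len(locations)))
    return locations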
I have been working on a website for over a year now, using Django and Python 3 primarily. A few of my buddies and I built a front end where a user enters some parameters and submits; the request goes to GAE to run the job and return the results.
In my local dev environment, everything works well. I have two separate dev environments. One builds the entire service in a Docker container; this produces the desired results in roughly 11 seconds. The other runs the source files locally on my computer and connects to the Postgres database hosted in Google Cloud. The Python application runs locally and takes roughly 2 minutes, with a lot of latency between the cloud and the POSTs/GETs from my local machine.
Once I perform the gcloud app deploy and attempt to run in production, it never finishes. I have some print statements built into the code, so I know it gets to the part where the submitted parameters reach the Python code. I monitor via this command on my local computer: gcloud app logs read.
I suspect that since my local computer is a beast (i7-7770 processor with 64 GB of RAM), it runs the whole thing no problem. But in GAE, I don't think it's providing machines adequate to do the job efficiently (not enough compute, not enough RAM). That's my guess.
So, I need help in how to troubleshoot this. I tried changing my app.yaml file so that resources would scale to 16 GB of memory, but it would never deploy. I received an error 13.
One other note, after it spins around trying to run the job for 60 minutes, the website crashes and displays this message:
502 Server Error
Error: Server Error
The server encountered a temporary error and could not complete your request.
Please try again in 30 seconds.
OK, so just in case anybody in the future has a similar problem: the constant crashing of my Google App Engine workers was caused by using Pandas DataFrames in the production environment. I don't know exactly what Pandas was doing, but I kept getting memory errors that would crash the site, and they didn't appear to occur on any single line of code; the crash happened somewhere in a Pandas DataFrame operation, seemingly at random.
I am still using a Pandas DataFrame simply to read in a CSV file. I then use:
import pandas as pd

df = pd.read_csv("input.csv")  # hypothetical filename

# build a plain dict from two columns
data_I_care_about = dict(zip(df.col1, df.col2))
# or pull a single column out as a plain list
other_data = df.col3.values.tolist()
and then go to town with processing. As a note, on my local machine (basically my development environment) it took 6 seconds to run from start to finish. That's a long time for a web request, but I was in a hurry, which is why I used Pandas to begin with.
After refactoring, the same job completed in roughly 200ms using python lists and dicts (again, in my dev environment). The website is up and running very smoothly now. It takes a maximum of 7 seconds after pressing "Submit" for the back-end to return the data sets and render on the web page. Thanks for the help peeps!
I'm a newbie on App Engine and I really don't know how to phrase this question, which sadly means I don't know what keywords to google; I hope I get actual help rather than the bashing that a lot of people do.
I'm confused by the difference in behavior between App Engine in production and App Engine on the local development server.
Background info:
Btw this is in Python
Initially I assumed that, when needed or as configured, an instance of the app or module would be created, and that this instance would serve multiple requests from different clients. Under this behavior, any initialization code would only run once.
But on the local development server, every time I add something new, especially in main.py, the server picks up the changes and runs them on browser refresh. This made me think: wait, does it run the entire script over and over again on every request?
Question:
Does an instance/module run the entire code on every request, or is this just added behavior in the dev server to make development easier?
Both your assumptions - about behaviour in production and development - are wrong.
In production, GAE spins up instances as required. This may be in response to increased load, or the host may simply decide after a certain amount of time to recycle an instance by killing it and starting a new one. Initialization code will always be run whenever a new instance is started.
In development, you only get a single instance. However, the server watches your file system for changes. If it detects a change to the code itself, it will restart itself, and therefore re-run the initialization code. But if you don't make any code changes between requests, the existing process continues indefinitely, and init code will not be re-run.
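A small sketch illustrates the difference. Assuming a Flask app (the route and layout here are illustrative only), the module-level timestamp below is set once per instance start, not once per request:

import time

from flask import Flask

app = Flask(__name__)

# Module-level init: runs once per instance start, not once per request.
INSTANCE_STARTED = time.time()

@app.route("/")
def index():
    # In production this value only changes when GAE recycles the instance;
    # on the dev server it changes whenever the watcher restarts after an edit.
    return "instance started at %f" % INSTANCE_STARTED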
From Python, I am using knife to launch a server.
e.g.
knife ec2 server create -r "role[nginx_server]" --region ap-southeast-1 -Z ap-southeast-1a -I ami-ae1a5dfc --flavor t1.micro -G nginx -x ubuntu -S sg_development -i /home/ubuntu/.ec2/sg_development.pem -N webserver1
I will then use the Chef server API to check when the bootstrap is complete, so I can use boto and other tools to configure the newly created server. The pseudocode looks like this:
cmd = """knife ec2 server create -r "role[nginx_server]...."""
os.system(cmd)
boot = False
while boot==False:
chefTrigger = getStatusFromChefApi()
if chefTrigger==True:
boot=True
continue with code for further proccessing
My question is: what is the trigger in the Chef server that indicates when the node has been fully processed by Chef? Note that I used -N to name the server and will query its properties, but what do I look for? Is there a boolean? A status?
Thanks
TL;DR: Use a report/exception handler instead.
When the node has finished running chef-client successfully, it will save the node object to the Chef Server. One of the attributes automatically generated by ohai every time Chef runs is node['ohai_time'], which is the Unix epoch timestamp when ohai was executed (at the beginning of the Chef run). A node that has not successfully saved itself to the server will not have the ohai_time at all. However, this attribute merely tracks the time when ohai ran, not necessarily when chef-client saved to the server (since that can be a few seconds to minutes depending on what your recipes are doing). Note if the chef run exits due to an unhandled exception, it won't save to the server by default.
A more reliable way to be notified when a node has completed is to use a Report/Exception handler, which can send a message to a variety of places and APIs. See the documentation for more information.
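If you do decide to poll for ohai_time as described above, a minimal sketch using the PyChef library might look like this (autoconfigure(), the node name webserver1, and the timing values are assumptions; the handler approach remains the more reliable option):

import time

from chef import Node, autoconfigure

api = autoconfigure()  # reads your knife.rb / client credentials

def wait_for_first_run(node_name, timeout=3600, interval=30):
    # Poll until the node has saved an ohai_time attribute, i.e. has
    # finished at least one successful chef-client run.
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            node = Node(node_name, api=api)
            if node.attributes.get('ohai_time'):
                return True
        except Exception:
            pass  # the node may not be registered yet right after launch
        time.sleep(interval)
    return False

if wait_for_first_run('webserver1'):  # the -N name from the knife command
    print("bootstrap complete")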