I am developing an automation tool that is supposed to upgrade IP network devices.
For the sake of simplicity (I am not an expert developer), I developed two completely separate scripts: one for the core and aggregation nodes, and one for the access nodes.
The tool executes a software upgrade on the routers and verifies the result by running a set of post-check commands. The device role implies the "size" of the router, and bigger routers take much longer to finish the upgrade. Although the smaller ones come back up much earlier, their post-checks cannot be started until the bigger ones finish upgrading, because the devices are connected to each other.
I want to implement reliable signaling between the two scripts: the slower script (core devices) flips a switch when the core devices are up, while the other script keeps checking this value and then starts the checks for the access devices.
Both scripts run 200+ concurrent sessions; moreover, every access device (session) needs individual signaling, so all the sessions keep checking the same value in the DB.
First I used the keyring library, but noticed that the keys sometimes disappear. Now I am using a txt file to manipulate the signal values. It looks pretty amateurish, so I would like to use MongoDB.
Would it cause any performance issues or unexpected exceptions?
The script will be running for 90+ minutes. Is it OK to connect to the DB once at the beginning of the script, set the signal to False, and then, 20-30 minutes later, start checking it for an additional 20 minutes? Or is it advisable to establish a new connection to read the value in each parallel session?
The MongoDB server runs on the same VM as the script. What exceptions should I expect?
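Roughly, the pattern I have in mind looks like this. It is only a sketch using pymongo; the database, collection, and field names are placeholders:

import time
from pymongo import MongoClient
from pymongo.errors import PyMongoError

# A single MongoClient can be shared by all threads; it maintains its own connection pool.
client = MongoClient('mongodb://localhost:27017', serverSelectionTimeoutMS=5000)
signals = client['upgrade']['signals']

# Core script: clear the flag at start, set it once the core devices are back up.
signals.update_one({'_id': 'core_done'}, {'$set': {'value': False}}, upsert=True)
# ... core upgrade runs here ...
signals.update_one({'_id': 'core_done'}, {'$set': {'value': True}}, upsert=True)

# Access script: every session polls the same document until the flag flips or it times out.
def wait_for_core(timeout=20 * 60, interval=30):
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            doc = signals.find_one({'_id': 'core_done'})
            if doc and doc.get('value'):
                return True
        except PyMongoError:
            pass          # transient DB error: just retry on the next poll
        time.sleep(interval)
    return False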
Thank you!
I'm working on a project to learn Python, SQL, JavaScript, and running servers -- basically getting a grip on full-stack development. Right now my basic goal is this:
I want to run a Python script indefinitely, constantly making API calls to different services that have different rate limits (e.g. 200/hr, 1000/hr, etc.) and storing the results (ints) in a database (PostgreSQL). I want to collect these results over a period of time and then begin working with that data to display fun stuff on the front end. I need this to run 24/7. I'm trying to understand the general architecture here, and searching around has proven surprisingly difficult. My basic idea, roughly sketched out, is this:
import time
import psycopg2                        # assuming PostgreSQL via psycopg2

conn = psycopg2.connect("dbname=mydb")

def function1():                       # service A
    while True:
        result = makeAPIcallA()
        with conn, conn.cursor() as cur:
            cur.execute("INSERT INTO tableA VALUES (%s)", (result,))
        if hitRateLimitA():
            time.sleep(limitTimeA)

def function2():
    ...                                # same thing, different limits, etc.
And I would ssh into my server, run python myScript.py &, shut my laptop down, and wait for the data to roll in. Here are my questions:
Does this approach make sense, or should I be doing something completely different?
Is it considered "bad" or dangerous to open a database connection indefinitely like this? If so, how else do I manage the DB?
I considered using a scheduler like cron, but the rate limits are variable. I can't run the script every hour when my limit is hit, say, 5 minutes after start time and then has a 60-minute wait after that. Even running it at minute intervals seems messy: I need to sleep for persistent rate-limit wait times which will keep varying. Am I correct in assuming a scheduler is not the way to go here?
How do I gracefully handle any unexpected potentially fatal errors (namely, logging and restarting)? What about manually killing the script, or editing it?
I'm interested in learning different approaches and best practices here -- any and all advice would be much appreciated!
I actually do exactly what you describe for one of my personal applications, and I can explain how I do it.
I use Celery instead of cron because it allows finer adjustments in scheduling, and it is Python rather than bash, so it's easier to use. I have different tasks (basically a group of API calls and DB updates) against different sites running at different intervals to account for the various rate limits.
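As a rough sketch (Celery 4+ beat-schedule style; the broker URL, task names, and intervals below are placeholders, chosen to stay under 200/hr and 1000/hr limits):

# tasks.py
from celery import Celery

app = Celery('tasks', broker='redis://localhost:6379/0')

# One periodic entry per external service, each at its own interval.
app.conf.beat_schedule = {
    'poll-service-a': {'task': 'tasks.poll_service_a', 'schedule': 20.0},  # ~180/hr, under a 200/hr limit
    'poll-service-b': {'task': 'tasks.poll_service_b', 'schedule': 4.0},   # ~900/hr, under a 1000/hr limit
}

@app.task
def poll_service_a():
    pass    # make the API call and write the result to the DB

@app.task
def poll_service_b():
    pass

You then run a worker together with the beat scheduler, e.g. celery -A tasks worker --beat --loglevel=info.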
I have the Celery app run as a service so that even if the system restarts it's trivial to restart the app.
I use the logging library extensively in my application, because it is difficult to debug something when all you have is one difficult-to-read stack trace. I have INFO-level and DEBUG-level logs spread throughout my application, and any WARNING-level and above log gets printed to the console AND gets sent to my email.
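The setup is along these lines (the mail host and addresses are obviously placeholders):

import logging
from logging.handlers import SMTPHandler

logger = logging.getLogger('myapp')
logger.setLevel(logging.DEBUG)

# Everything (DEBUG and up) goes to a file.
file_handler = logging.FileHandler('myapp.log')
file_handler.setLevel(logging.DEBUG)

# WARNING and above also go to the console...
console_handler = logging.StreamHandler()
console_handler.setLevel(logging.WARNING)

# ...and to email.
mail_handler = SMTPHandler(mailhost='localhost',
                           fromaddr='myapp@example.com',
                           toaddrs=['me@example.com'],
                           subject='myapp warning')
mail_handler.setLevel(logging.WARNING)

for handler in (file_handler, console_handler, mail_handler):
    logger.addHandler(handler)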
For exception handling, the majority of what I prepare for are rate limit issues and random connectivity issues. Make sure to surround whatever HTTP request you send to your API endpoints in try-except statements and possibly just implement a retry mechanism.
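For example, a simple retry wrapper might look like this (the retry count, back-off, and URL are arbitrary placeholders):

import time
import logging
import requests

logger = logging.getLogger('myapp')

def call_with_retry(url, retries=3, backoff=5):
    for attempt in range(1, retries + 1):
        try:
            resp = requests.get(url, timeout=10)
            resp.raise_for_status()                 # treat HTTP 4xx/5xx as errors too
            return resp.json()
        except requests.RequestException as exc:
            logger.warning('API call failed (%s), attempt %d/%d', exc, attempt, retries)
            time.sleep(backoff * attempt)           # simple linear back-off
    raise RuntimeError('API call to %s failed after %d attempts' % (url, retries))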
As far as the DB connection goes, it shouldn't matter how long your connection stays open, but you need to surround your main application loop in a try-except statement and make sure it fails gracefully by closing the connection in the case of an exception. Otherwise you might end up with a lot of ghost connections, and your application won't be able to reconnect until those connections are gone.
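In its simplest form, that is something like this sketch, assuming psycopg2 and a main_loop() placeholder for whatever the application actually does:

import logging
import psycopg2

logger = logging.getLogger('myapp')

conn = psycopg2.connect('dbname=mydb')
try:
    main_loop(conn)                     # placeholder for the polling loop
except Exception:
    logger.exception('Fatal error in main loop')
    raise
finally:
    conn.close()                        # always release the connection, no ghosts left behind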
I've created a Python (and C, but the "controlling" part is Python) program for carrying out Bayesian inversion using Markov chain Monte Carlo methods. Unfortunately, MCMC can take several days to run. Part of my research is in reducing the time, but we can only reduce so much.
I'm running it on a headless CentOS 7 machine using nohup, since maintaining a connection and receiving prints for several days is not practical. However, I would like to be able to check in on my program occasionally to see how many iterations it's done, how many proposals have been accepted, whether it's out of burn-in, etc.
Is there something I can use to interact with the python process to get this info?
I would recommend SAWs (Scientific Application Web server). It creates a thread in your process to handle HTTP requests. The variables of interest are returned to the client in JSON format.
For the variables managed by the Python runtime, write them into a (JSON) file on the hard disk (or any shared memory) and use SimpleHTTPServer to serve it. The SAWs web interface on the client side can still be used as long as you follow the JSON format of SAWs.
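A minimal sketch of the file-dump side (the variable names are placeholders for whatever your sampler tracks):

import json
import os

def dump_status(iteration, n_accepted, burn_in, path='status.json'):
    status = {
        'iteration': iteration,
        'accepted': n_accepted,
        'acceptance_rate': n_accepted / float(iteration) if iteration else 0.0,
        'out_of_burn_in': iteration >= burn_in,
    }
    tmp = path + '.tmp'
    with open(tmp, 'w') as f:
        json.dump(status, f)
    os.rename(tmp, path)     # atomic on POSIX, so a reader never sees a half-written file

Call dump_status(...) every few hundred iterations, run python -m SimpleHTTPServer 8000 (or python -m http.server 8000 on Python 3) in that directory, and check it with curl http://host:8000/status.json.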
I have a Python (or Ruby, it doesn't really matter) script on a server which has to be reliable and run all the time. If something happens and it crashes or gets frozen, I need to know about that immediately. Previously I was thinking about another "script" such as a cron job which would check on it every minute by means of Linux -- whether or not it's in the list of active processes. However, now I think that even if it is in the list of active processes, it still might be frozen (it hasn't crashed yet, but it's about to).
Isn't that right? If so, I'm thinking of having it save some "heartbeat" data into a file every minute, because that's a more reliable way to know whether or not it's up AND whether or not it's frozen: if it's frozen it can't write to a file, but it can still be in memory.
What are your suggestions: should I go with that? Or is just checking whether its process is in memory (in the list of active processes) perfectly enough?
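To illustrate, the heartbeat idea would look roughly like this (the file path and the 2-minute threshold are arbitrary):

# In the monitored script's main loop, once per iteration/minute:
import time

def beat(path='/tmp/myscript.heartbeat'):
    with open(path, 'w') as f:
        f.write(str(time.time()))

# Separate checker, run from cron every minute:
import os
import sys

def check(path='/tmp/myscript.heartbeat', max_age=120):
    try:
        age = time.time() - os.path.getmtime(path)
    except OSError:
        age = float('inf')               # file missing: never started or was cleaned up
    if age > max_age:
        sys.exit(1)                      # non-zero exit so cron/monitoring can alert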
If there are bad consequences when the script is not running (if there weren't, you probably wouldn't care, would you?), it might be most reliable to check for distinct symptoms of those consequences.
For example, if the script is a web server, have a monitoring service make requests to it and notify you whenever that fails.
If the bad consequences can be observed remotely or even off-site, have the monitoring remote, or if possible, off-site from the machine running your script. Why? If the consequences occur because your script stopped running because the machine running it died ... you wouldn't get notified if it was that same machine's task to notify you. If it's a different machine, you'll be made aware of the situation. Unless the data center burnt down. Then your monitoring service needs to be in a different data center for you to get notified.
There are paid and free monitoring service offers for publicly accessible servers, e.g. Uptime Robot for web servers, in case you don't want to develop and host the monitoring yourself.
I have over 100 web server instances running a PHP application using APC, and we occasionally (on the order of once per week across the entire fleet) see corruption of one of the caches which results in a distinctive error log message.
Once this occurs, the application is dead on that node and any transactions routed to it will fail.
I've written a simple wrapper around tail -F which can spot the pattern any time it appears in the log file and evaluate a shell command (using bash eval) to react. I have this using the salt-call command from salt-stack to trigger processing a custom module which shuts down the nginx server, warms (refreshes) the cache, and, of course, restarts the web server. (Actually I have two forms of this wrapper, bash and Python.)
This is fine, and the frequency of events is such that it's unlikely to be an issue. However my boss is, quite reasonably, concerned about a common mode failure pattern ... that the regular expression might appear in too many of these logs at once and take down the entire site.
My first thought would be to wrap my salt-call in a redis check (we already have a Redis infrastructure used for caching and certain other data structures). That would be implemented as an integer, with an expiration. The check would call INCR, check the result, and sleep if more than N returned (or if the Redis server were unreachable). If the result were below the threshold then salt-call would be dispatched and a decrement would be called after the server is back up and running. (Expiration of the Redis key would kill off any stale increments after perhaps a day or even a few hours ... our alerting system will already have notified us of down servers and our response time is more than adequate for such time frames).
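Sketched out in Python, the check I have in mind would look roughly like this (the key name, threshold, and custom module name are placeholders):

import subprocess
import time
import redis

r = redis.Redis(host='redis.internal', port=6379)     # our existing Redis infrastructure
KEY = 'apc-recovery-count'
THRESHOLD = 5                                          # at most N nodes recovering at once

def try_recover():
    try:
        count = r.incr(KEY)
        r.expire(KEY, 6 * 3600)                        # stale increments die off after a few hours
    except redis.RedisError:
        time.sleep(300)                                # Redis unreachable: back off rather than pile on
        return
    if count > THRESHOLD:
        r.decr(KEY)                                    # give the slot back and wait our turn
        time.sleep(300)
        return
    try:
        subprocess.check_call(['salt-call', 'mymodule.recover_apc'])   # hypothetical custom module
    finally:
        r.decr(KEY)                                    # decrement once the node is back up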
However, I was reading about the SaltStack event handling features and wondering if it would be better to use that instead. (Advantage: the nodes don't have the redis-cli command-line tool nor the Python Redis libraries, but, obviously, salt-call is already there with its requisite support.) So using something in Salt would minimize the need to add additional packages and dependencies to these systems. (Alternatively I could just write all the Redis handling as a separate PHP command-line utility and have my shell script call that.)
Is there a HOWTO for writing simple Saltstack modules? The docs seem to plunge deeply into reference details without any orientation. Even some suggestions about which terms to search on would be helpful (because their use of terms like pillars, grains, minions, and so on seems somewhat opaque).
The main doc for writing a Salt module is here: http://docs.saltstack.com/en/latest/ref/modules/index.html
There are many modules shipped with Salt that might be helpful for inspiration. You can find them here: https://github.com/saltstack/salt/tree/develop/salt/modules
One thing to keep in mind is that the Salt Minion doesn't do anything unless you tell it to do something. So you could create a module that checks for the error pattern you mention, but you'd need to add it to the Salt Scheduler or cron to make sure it gets run frequently.
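As a minimal example, a custom execution module is just a Python file dropped into the _modules directory of your file roots (the module name, pattern, and log path below are placeholders):

# /srv/salt/_modules/apccheck.py
# Sync it to the minions with:  salt '*' saltutil.sync_modules
# Then call it with:            salt '*' apccheck.check_log

def check_log(pattern='apc corruption', logfile='/var/log/nginx/error.log'):
    '''
    Return True if the corruption pattern appears in the log file.
    '''
    try:
        with open(logfile) as f:
            return any(pattern in line for line in f)
    except IOError:
        return False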
If you need more help you'll find helpful people on IRC in #salt on freenode.
Background: I'm currently trying to develop a monitoring system at my job. All the nodes that need to be monitored are accessible via Telnet. Once a Telnet connection has been made, the system needs to execute a couple of commands on the node and process the output.
My problem is that both creating a new connection and running the commands take time. It takes approximately 10 s to get a connection up (the TCP connection is established instantly, but some commands need to be run to prepare the connection for use), and an almost equal amount of time to run the required command.
So I need to come up with a solution that allows me to execute 10-20 of these 10-second commands on the nodes without collectively taking more than 1 minute. I was thinking of creating a sort of connection pooler to which I could send the commands and which could then execute them in parallel, dividing them over the available Telnet sessions. I tried to find something similar that I could use (or even just look at to gain some understanding), but I was unable to find anything.
I'm developing on Ubuntu with Python. Any help would be appreciated!
Edit (Update Info):
@Aya, @Thomas: A bit more info. I already have a solution in Python that is working; however, it is getting difficult to manage the code. Currently I'm using the same approach that you advised, with a per-connection thread. However, the problem is that there is a 10-second delay each time a connection is made to a node, and I need to make at least 10 connections per node per iteration. The time limit for each iteration is 60 seconds, so making a new connection each time is not feasible. It needs to open 10 connections per node at startup and maintain those connections.
Can anyone point out examples of good architecture for something like this?
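Roughly, the kind of pool I'm imagining looks like this (the host, prompt, and commands are placeholders; each worker keeps one prepared Telnet session open and pulls commands from a shared queue):

import telnetlib
import threading
from queue import Queue                    # "import Queue" on Python 2

PROMPT = b'> '                             # the device's CLI prompt

class TelnetWorker(threading.Thread):
    # Holds one persistent Telnet session and executes commands taken from a shared queue.
    def __init__(self, host, commands, results):
        super().__init__(daemon=True)
        self.commands, self.results = commands, results
        self.tn = telnetlib.Telnet(host, timeout=30)
        # ... run the ~10 s session-preparation commands once, here ...

    def run(self):
        while True:
            cmd = self.commands.get()
            self.tn.write(cmd.encode('ascii') + b'\n')
            output = self.tn.read_until(PROMPT, timeout=30)
            self.results.put((cmd, output))
            self.commands.task_done()

commands, results = Queue(), Queue()
workers = [TelnetWorker('10.0.0.1', commands, results) for _ in range(10)]   # 10 sessions per node
for w in workers:
    w.start()
for cmd in ('show version', 'show interfaces'):     # the 10-20 slow commands
    commands.put(cmd)
commands.join()                                     # block until every queued command has run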