Run a Python script on Heroku on schedule (as a separate app) - python

OK so I'm working on an app that has 2 Heroku apps - one is the writer that writes to my DB after scraping a site, and one is the reader that consumes the said DB.
The former is just a Python script that has a kind of a while 1 loop - it's actually a Twitter stream. I want this to run every x minutes independent of what the reader is doing.
Now, running the script locally works fine, but I'm not sure how getting this to work on Heroku would work. I've tried looking it up, but could not find a solid answer. I read about background tasks, Redis queue, One-off dynos etc, but I'm not sure what to really use for my purpose. Some of my requirements are:
have the Python script keep logs of whatever I want.
in the future, I might want to add an admin panel for the writer, that will just show me stats of the script (and the logs). So hooking up this admin panel (flask) should be easy-ish and not break the script itself.
I would love any suggestions or pointers here.

I suggest writing the consumer as a server that waits around, then processes the stream on the timed interval. That is, you start it once and it runs forever, doing some processing every 10 minutes or so.
See: sched Python module, which handles scheduling events at certain times and running them.
Simpler: use Heroku's scheduler service.
This technique is simpler -- it's just straight-through code -- but can lead to problems if you have two of the same consumer running at the same time.

Related

How to do long running process in Django view?

I have a django web application and I have to create model for Machine Learning in a view.
It takes a long time so PythonAnyWhere does not allow it and it kills the process when it reaches 300 seconds. According to that, i want to ask two questions.
Without celery, django bg task or something else, my view which contains the long running process does not work in order. But when I use debugger, it works correctly. Probably, some lines of code try to work without waiting for each other without debugger. How can i fix this?
PythonAnyWhere does not support celery or other long running task packages. They suggest django-background-tasks but in its documentation, the usage is not explained clearly. So I could not integrate it. How can i integrate django-background-tasks?
Thank you.

Running simple python app with endless loop on openshift

I'm writing small & simple telegram bot on python. I never used this language in my work and decided that's a good way to learn by practice.
To get updates my app currently uses long polling called from an endless loop.
So I'm basically searching for the simplest way to run this app on openshift. I tried to use this example on flask but it didn't work. There are a lot of other options to implement background infinite processes with multiprocessing (from django and cerely to tornado) but it seems that all of them are way too advanced and complicated for my rather modest needs.
If the polling is not event driven, then you could use 'cron' (you can add cron cartridge to your gear) to periodically trigger your python script, that does the work and "dies".
However, keep in mind that Openshfit is not really intended to be your worker thread (unless you are on the bronze plan or higher). Unless you receive an external request to your gear within 24 hour period, your gear will be "idled" and your process will no longer run.
The way to get around this, "officially", is probably to get the bronze plan (you will not be charged unless you require the 4th gear instance),
"Unofficially", you can create a gear with python that will give you a default website. Then you create a new python script that does your job and trigger it using cron. To keep the gear from idling, use something like uptimerobot to ping your "website" every day.
How do I get SSL for my domains?
If you are still getting by on OpenShift Online's generous Free plan,
you'll see a warning message at the top of your application's SSL
configuration area. You can always take advantage of our *.rhcloud.com
wildcard certificate in order to securely connect to any application
via it's original, OpenShift-provided hostname URL.
Tornado is very simple, my first steps in telegram bot dev I did using this server on openshift platform.

Apache - Running Long Running Processes In Background

Before you go any further, I am currently working in a very restricted environment. Installing additional dll's/exe's, and other admin like activities are frustratingly difficult. I am fully aware that some of the methodology described in this post is far from best practice...
I would like to start a long running background process that start/stops with Apache. I have a cgi enabled python script that takes as input all of the parameters necessary to run a complex "job". It is not feasible to run this job in the cgi script itself - because a)cgi is already slow to begin with and b)multiple simultaneous requests would definitely cause trouble. The cgi script will do nothing more than enter the parameters into a "jobs" database.
Normally, I would set something up like MSMQ in conjunction with a Windows Service. I would have a web service add a job to the queue, and the windows service would be polling the queue at some standard interval - processing jobs in sequence...
How could I accomplish the same in Apache? I can easily enough create a python script to serve as the background job processor. My questions are:
how do I start it process up with, leave it running with, and stop with Apache?
how can i monitor the process - make sure stays alive with Apache?
Any tips or insight welcome.
Note. OS is Windows Server 2008
Heres a pretty hacky solution for anyone looking to do something similar.
Set up a windows scheduled task that does that background processing. set it to run once a day or whatever interval you want (it is irrelevant, as you'll see in next steps)
In the Settings tab of the Scheduled Task - make sure the "Allow task to be run on demand" option is checked. Also, under the "If the task is already running..." text, make sure the Do not start a new instance option in selected.
Then, from the cgi script - it is possible to invoke the scheduled task from the command line(subprocess module) see here. With the options set above - if the task is already running - any subsequent run on demands are ignored.

Monitor python scraper programs on multiple Amazon EC2 servers with a single web interface written in Django

I have a web-scraper (command-line scripts) written in Python that run on 4-5 Amazon-EC2 instances.
What i do is place the copy of these python scripts in these EC2 servers and run them.
So the next time when i change the program i have to do it for all the copies.
So, you can see the problem of redundancy, management and monitoring.
So, to reduce the redundancy and for easy management , I want to place the code in a separate server from which it can be executed on other EC2 servers and also monitor theses python programs, and logs created them through a Django/Web interface situated in this server.
There are at least two issues you're dealing with:
monitoring of execution of the scraping tasks
deployment of code to multiple servers
and each of them requires a different solution.
In general I would recommend using task queue for this kind of assignment (I have tried and was very pleased with Celery running on Amazon EC2).
One advantage of the task queue is that it abstracts the definition of the task from the worker which actually performs it. So you send the tasks to the queue, and then a variable number of workers (servers with multiple workers) process those tasks by asking for them one at a time. Each worker if it's idle will connect to the queue and ask for some work. If it receives it (a task) it will start processing it. Then it might send the results back and it will ask for another task and so on.
This means that a number of workers can change over time and they will process the tasks from the queue automatically until there are no more tasks to process. The use case for this is using Amazon's Spot instances which will greatly reduce the cost. Just send your tasks to the queue, create X spot requests and see the servers processing your tasks. You don't really need to care about the servers going up and down at any moment because the price went above your bid. That's nice, isn't it ?
Now, this implicitly takes care of monitoring - because celery has tools for monitoring the queue and processing, it can even be integrated with django using django-celery.
When it comes to deployment of code to multiple servers, Celery doesn't support that. The reasons behind this are of different nature, see e.g. this discussion. One of them might be that it's just difficult to implement.
I think it's possible to live without it, but if you really care, I think there's a relatively simple DIY solution. Put your code under VCS (I recommend Git) and check for updates on a regular basis. If there's an update, run a bash script which will kill your workers, make all the updates and start the workers again so that they can process more tasks. Given Celerys ability to handle failure this should work just fine.

Run a repeating task for a web app

This seems like a simple question, but I am having trouble finding the answer.
I am making a web app which would require the constant running of a task.
I'll use sites like Pingdom or Twitterfeed as an analogy. As you may know, Pingdom checks uptime, so is constantly checking websites to see if they are up and Twitterfeed checks RSS feeds to see if they;ve changed and then tweet that. I too need to run a simple script to cycle through URLs in a database and perform an action on them.
My question is: how should I implement this? I am familiar with cron, currently using it to do my server backups. Would this be the way to go?
I know how to make a Python script which runs indefinitely, starting back at the beginning with the next URL in the database when I'm done. Should I just run that on the server? How will I know it is always running and doesn't crash or something?
I hope this question makes sense and I hope I am not repeating someone else or anything.
Thank you,
Sam
Edit: To be clear, I need the task to run constantly. As in, check URL 1 in the database, check URl 2 in the database, check URL 3 and, when it reaches the last one, go right back to the beginning. Thanks!
If you need a repeatable running of the task which can be run from command line - that's what the cron is ideal for.
I don't see any demerits of this approach.
Update:
Okay, I saw the issue somewhat different. Now I see several solutions:
run the cron task at set intervals, let it process the data once per run, next time it will process the data on another run; use PIDs/Database/semaphores to avoid parallel processes;
update the processes that insert/update data in the database; let the information be processed when it is inserted/updated; c)
write a demon process which will reside in memory and check the data in real time.
cron would definitely be a way to go with this, as well as any other task scheduler you may prefer.
The main point is found in the title to your question:
Run a repeating task for a web app
The background task and the web application should be kept separate. They can share code, they can share access to a database, but they should be separate and discrete application contexts. (Consider them as separate UIs accessing the same back-end logic.)
The main reason for this is because web applications and background processes are architecturally very different and aren't meant to be mixed. Consider the structure of a web application being held within a web server (Apache, IIS, etc.). When is the application "running"? When it is "on"? It's not really a running task. It's a service waiting for input (requests) to handle and generate output (responses) and then go back to waiting.
Web applications are for responding to requests. Scheduled tasks or daemon jobs are for running repeated processes in the background. Keeping the two separate will make your management of the two a lot easier.

Categories