Heroku Scheduler vs Heroku Temporize Scheduler, what's the difference? - python

I'm wondering what is the difference between the Heroku Scheduler add-on and the Heroku Temporize Scheduler add-on. They both seem to be free and do scheduled jobs.
And how do they both compare to running Python's sched module on Heroku?
I would like to run just one cron job that scrapes some websites every minute with Python on Heroku, saving to Postgres. (I'm also trying to figure out what to write in a cron job to do so, but that's another question.)
Update with the solution:
Thanks to danneu's suggestions, the working solution was using the Heroku Scheduler. It was super simple to set up thanks to this tutorial.
(I tried using sched and twisted, but both times I got:
Application Error
An error occurred in the application and your page could not be served. Please try again in a few moments.
If you are the application owner, check your logs for details.
This was possibly due to my lack of experience putting them in the correct place. It didn't work with Heroku's sync gunicorn worker; I don't know the details.)

Temporize is a 3rd party service. You'd have to read what the limitations of their free plan are. It looks like the free plan only lets you run a task 20 times per day, which is a far cry from your need for a 1-minute interval.
Heroku Scheduler is Heroku's own add-on. It spins up a one-off dyno for each task and bills you for the runtime. Its minimum task interval is 10 minutes, which also won't give you what you want.
For a 1-minute interval that scrapes a page, I'd just run a concurrent loop in my Heroku app process. setTimeout in JavaScript is an example of a concurrent loop that can run alongside your application server (if you were using Node). It looks like the Twisted example in your Python sched link is the Python equivalent.
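For illustration, here's a minimal sketch of that kind of loop with Twisted, run as its own process rather than under a sync gunicorn worker. The scrape_sites() function and the 60-second interval are placeholders for your own code:

from twisted.internet import task, reactor

def scrape_sites():
    ...  # placeholder: your scraping + Postgres-saving logic

# LoopingCall fires scrape_sites immediately, then every 60 seconds.
task.LoopingCall(scrape_sites).start(60.0)
reactor.run()  # blocks and keeps the loop running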
If you're using Heroku's free tier (which I think you are after seeing you in IRC), you get 1,000 runtime hours each month once you verify your account (otherwise you only get 550), which I imagine means giving them your credit card number. 1,000 hours is enough for a single dyno to run all month for free.
However, the free-tier dyno will sleep (turn off) once it has gone 30 minutes without receiving an HTTP request. Obviously the Twisted/concurrent-loop approach will only work while the dyno is awake and running, since the loop runs inside your application process.
So if your dyno falls asleep, your concurrent loop will stop until your dyno wakes back up and resumes.
If you want your dyno to stay awake all month, you can upgrade your dyno for $7/mo.

Related

Heroku free tier for a non-stop web and worker process in one application

I have a Python app that pings an API every minute and records data into a database. I also have a Django web app that is the client for displaying this data (it should also not be idling, as I will explain below).
Heroku recently made changes again to their free tier, allowing 1,000 hrs/month for verified accounts. I have verified my account to take advantage of this. What is not clear to me is how the usage hours will be counted in my situation. Will my Heroku application accumulate ~750 hours per month, or 2 x 750 hours if it runs non-stop? Are the two lines inside the Procfile considered separate dynos, each accumulating 750 hours per month?
Setup
Procfile:
worker: python api_crawler.py
web: gunicorn api_data_client.wsgi --log-file -
I found out that if the web process begins to idle after 30 minutes of inactivity, it will also bring down the worker process with it. This is not a desired outcome for me, as I need the worker process to be running non-stop. After some reading I found that the 'New Relic' monitoring add-on can help keep the web process from idling, which is good, unless that means I will burn through the 1,000 monthly hours.
Each line in the Procfile will create a separate dyno, so if you can't let either process idle, 1,000 hours is not enough.
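(For reference: a 31-day month has 24 x 31 = 744 hours, so two always-on dynos would need roughly 1,488 dyno hours, well over the 1,000-hour allowance.)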
However, if you are ok with your web dyno idling after 30 minutes of inactivity, and presuming you don't have too much web traffic, you should be able to keep your worker process up 24/7, and your web process might consume less than roughly 250 dyno hours per month, so you could fit everything into the 1,000-hour free tier.
Heroku should not be idling your worker process when your web process idles; I'm not sure why you think it does.

How to architect a Heroku app for workers with wildly varying runtimes?

I'm building a web application that has some long-running jobs to do, and some very short. The jobs are invoked by users on the website, and can run anywhere from a few seconds to several hours. The spawned job needs to provide status updates to the user via the website.
I'm new to heroku, and am thinking the proper approach is to spawn a new dyno for each background task, and to communicate status via database or a memcache record. Maybe?
My question is whether this is a technically feasible and advisable approach.
I ask because the documentation has a different mindset: the worker dyno pulls jobs off a queue, and if you want things to go faster you run more dynos. I'm not sure that will work: could a 10-second job get blocked waiting for a couple of 10-hour jobs to finish? There is a way of determining the size of a job, but, again, a highly variable amount of work has to be done before it is known.
I've not found any examples suggesting it is even possible for the web dyno to spin up workers ad hoc. Is it? Is the solution to multi-thread the worker dyno? If so, what about potential memory-space issues?
What's my best approach?
thx.

Run a Python script on Heroku on schedule (as a separate app)

OK so I'm working on a project that has 2 Heroku apps - one is the writer that writes to my DB after scraping a site, and one is the reader that consumes said DB.
The former is just a Python script with a kind of while 1 loop - it's actually a Twitter stream. I want this to run every x minutes, independently of what the reader is doing.
Now, running the script locally works fine, but I'm not sure how getting this to work on Heroku would go. I've tried looking it up, but could not find a solid answer. I read about background tasks, Redis queues, one-off dynos, etc., but I'm not sure what to really use for my purpose. Some of my requirements are:
have the Python script keep logs of whatever I want.
in the future, I might want to add an admin panel for the writer that will just show me stats of the script (and the logs). So hooking up this admin panel (Flask) should be easy-ish and not break the script itself.
I would love any suggestions or pointers here.
I suggest writing the consumer as a server that waits around and processes the stream at a timed interval. That is, you start it once and it runs forever, doing some processing every 10 minutes or so.
See: the sched Python module, which handles scheduling events at certain times and running them.
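For example, a minimal sketch of that long-running approach using sched, where process_stream() is a placeholder for your own logic:

import sched
import time

scheduler = sched.scheduler(time.time, time.sleep)
INTERVAL = 600  # seconds between runs (10 minutes)

def run_periodically():
    process_stream()  # placeholder: your stream-processing logic
    # Re-enter the event so the scheduler keeps firing forever.
    scheduler.enter(INTERVAL, 1, run_periodically)

scheduler.enter(0, 1, run_periodically)  # kick off immediately
scheduler.run()  # blocks, running events as they come due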
Simpler: use Heroku's scheduler service.
This technique is simpler -- it's just straight-through code -- but can lead to problems if you have two of the same consumer running at the same time.

How should I schedule my task in Django

In my Django project, I need to collect data from about 50 remote servers into the local database every minute or every 30 seconds. Although this works with crontab on the remote servers, I want to do it within the project itself. At first I considered django-celery, but it is geared toward asynchronous processing, and the data-collection task cannot tolerate delays, so I don't think it fits. What if I did this with a Python timer, and what would I need to pay more attention to? Please excuse my inexperience with Python and Django; I'd appreciate any other advice or ideas. Many thanks.
Basically you can use Celery's periodic tasks with the expires option, which ensures that your tasks will not be executed twice.
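For example, a minimal sketch of such a periodic task; the collect_data() task, the module name collector, and the Redis broker URL are all assumptions to be adapted:

from celery import Celery

app = Celery("collector", broker="redis://localhost:6379/0")

@app.task
def collect_data():
    ...  # placeholder: pull data from the ~50 remote servers

app.conf.beat_schedule = {
    "collect-every-30s": {
        "task": "collector.collect_data",
        "schedule": 30.0,            # run every 30 seconds
        # Drop the task if it hasn't started within 25 seconds,
        # so overdue runs don't pile up and execute twice.
        "options": {"expires": 25},
    },
}

Running the worker with an embedded beat scheduler (celery -A collector worker -B) keeps the schedule and the execution in one process.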
You could also run your own script with an infinite loop that performs the collection. If one pass takes more than a minute, you can spawn your tasks using eventlet or gevent. Another option is to create Celery tasks from this script, so you can be sure your tasks execute every N seconds, as you prefer.

Apache - Running Long-Running Processes in the Background

Before you go any further, I am currently working in a very restricted environment. Installing additional DLLs/EXEs and other admin-like activities are frustratingly difficult. I am fully aware that some of the methodology described in this post is far from best practice...
I would like to start a long-running background process that starts/stops with Apache. I have a CGI-enabled Python script that takes as input all of the parameters necessary to run a complex "job". It is not feasible to run this job in the CGI script itself, because a) CGI is already slow to begin with, and b) multiple simultaneous requests would definitely cause trouble. The CGI script will do nothing more than enter the parameters into a "jobs" database.
Normally, I would set something up like MSMQ in conjunction with a Windows Service. I would have a web service add a job to the queue, and the windows service would be polling the queue at some standard interval - processing jobs in sequence...
How could I accomplish the same in Apache? I can easily enough create a python script to serve as the background job processor. My questions are:
how do I start the process with Apache, leave it running alongside Apache, and stop it with Apache?
how can I monitor the process and make sure it stays alive while Apache is up?
Any tips or insight welcome.
Note. OS is Windows Server 2008
Here's a pretty hacky solution for anyone looking to do something similar.
Set up a Windows scheduled task that does the background processing. Set it to run once a day or at whatever interval you want (the interval is irrelevant, as you'll see in the next steps).
In the Settings tab of the scheduled task, make sure the "Allow task to be run on demand" option is checked. Also, under the "If the task is already running..." setting, make sure the "Do not start a new instance" option is selected.
Then, from the CGI script, it is possible to invoke the scheduled task from the command line (via the subprocess module); see here. With the options set above, if the task is already running, any subsequent on-demand runs are ignored.
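As a sketch, assuming the scheduled task was created under the (hypothetical) name BackgroundJobProcessor, the CGI script could trigger it like this:

import subprocess

# Ask the Windows task scheduler to run the task on demand;
# "BackgroundJobProcessor" is a hypothetical task name.
subprocess.call(["schtasks", "/Run", "/TN", "BackgroundJobProcessor"])

Because "Do not start a new instance" is selected, this call is a no-op whenever the task is already running.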
