I have a fairly small cron job running every 24 hours on Google App Engine using Python. Since yesterday I have been getting DeadlineExceededError because the job exceeds 60 seconds. Like I said, the job is small; it will never take more than 5 minutes, but it does exceed the 60-second deadline.
I already know this is a common problem and have found plenty of links and workarounds on Google, but I can't solve it.
Does anybody know a good way to increase the deadline, or to schedule the work asynchronously to get around the 60-second limit?
Your cron job should simply enqueue a task; that takes well under a second. A task can then run for up to 10 minutes.
You may also want to learn about different types of scaling methods on App Engine.
You might want to set up a backend service (module), which has no deadline, and then add the target: backend-module parameter to your cron job.
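For reference, a minimal cron.yaml routing the job to a longer-running module might look something like this (the handler URL and module name are illustrative):

```yaml
cron:
- description: daily scrape job
  url: /tasks/daily
  schedule: every 24 hours
  target: backend-module   # module/service that handles the request, no 60s deadline
```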
I understand that you should essentially set your worker count to the number of cores your node has, and if you go beyond that, you'll probably overwhelm the node.
I have hundreds of web requests (each as their own task) that need to run every minute; currently I route all of them through apply_async().
If I set my concurrency with -c 10, does that mean it can only execute up to 10 of those requests at a time? Or is the concurrency count not necessarily equal to the number of tasks it can execute at once?
It would be a waste of resources and wildly inefficient to handle only 10 requests at a time when most of that time is spent just waiting for the request to finish. When I find articles on mixing asyncio and Celery, people seem to think it's not a great idea. So what would be the solution here? Was Celery the wrong move, or does a concurrency of 10 not mean only 10 simultaneous tasks?
I was using the wrong approach. I should've gone with a thread-based approach using Celery: https://www.technoarchsoftwares.com/blog/optimization-using-celery-asynchronous-processing-in-django/
Switched to gevent and it's working exactly as I'd imagined now.
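The underlying point, that an I/O-bound worker pool can usefully exceed the core count, can be illustrated with a stdlib-only sketch (no Celery or gevent required): 50 simulated requests that each block for 0.1 s finish in roughly 0.1 s with 50 threads, instead of ~5 s serially.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_request(i):
    # Simulate an I/O-bound web request: the thread just waits on "the network".
    time.sleep(0.1)
    return i

start = time.monotonic()
with ThreadPoolExecutor(max_workers=50) as pool:
    results = list(pool.map(fake_request, range(50)))
elapsed = time.monotonic() - start

print(len(results), round(elapsed, 2))  # all 50 finish in ~0.1 s, not ~5 s
```

This is why a gevent or eventlet pool with -c 100 is a reasonable setting for request-heavy workloads, while CPU-bound tasks really are limited by core count.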
We have been using Airflow for a while, and it is great.
Now we are considering moving some of our very frequent tasks to our Airflow server as well.
Let's say I have a script running every second.
What's the best practice to schedule it with airflow:
1. Run this script in a DAG that is scheduled every second. I highly doubt this is the solution; there is significant overhead for a DagRun.
2. Run this script in a while loop that stops after 6 hours, then schedule it in Airflow to run every 6 hours?
3. Create a DAG with no schedule and put the task in a while True loop with an appropriate sleep, so the task never terminates unless there is an error.
Any other suggestions?
Or is this kind of task just not suitable for Airflow? Should I do it with a Lambda function and an AWS scheduler instead?
Cheers!
What's the best practice to schedule it
... this kind of task is just not suitable for Airflow?
It is not suitable.
In particular, your Airflow instance is probably configured to re-examine the set of DAGs every 5 seconds, which doesn't sound like a good fit for a 1-second task. Plus, the ratio of scheduling overhead to work performed would not be attractive. I suppose you could schedule five simultaneous tasks, twelve times per minute, and have them sleep zero to four seconds, but that's just crazy. And you would likely need to "lock against yourself" to avoid having simultaneous sibling tasks step on each other's toes.
The six-hour suggestion (2.) is not crazy. I would view it as a sixty-minute @hourly task instead, since the overheads are similar. Exiting after an hour and letting Airflow respawn the task has several benefits: log rolling happens at regular intervals; if your program crashes, it is restarted before too long; if your host reboots, again your program is restarted before too long. The downside is that your business need may view "more than a minute" as "much too long", and coordinating overlapping tasks, or the gap between tasks, at the hour boundary may pose some issues.
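That respawn-friendly pattern can be sketched as a bounded loop; the duration and interval below are tiny for demonstration, where a real deployment might use duration_s=3600 and interval_s=1:

```python
import time

def run_bounded_loop(work, duration_s, interval_s):
    """Call work() every interval_s seconds, then exit after duration_s.

    The scheduler (Airflow, cron, ...) is expected to start the next run,
    which gives you free log rotation and crash recovery at each boundary.
    """
    deadline = time.monotonic() + duration_s
    iterations = 0
    while time.monotonic() < deadline:
        work()
        iterations += 1
        time.sleep(interval_s)
    return iterations

# Demo with tiny values so it returns quickly.
count = run_bounded_loop(lambda: None, duration_s=0.05, interval_s=0.01)
print(count)
```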
Your stated needs exactly match the problem that Supervisor addresses. Just use that. You will always have exactly one copy of your event loop running, even if the app crashes, even if the host crashes. Log rolling and other administrative details have already been addressed. The code base is mature and lots of folks have beat on it and incorporated their feature requests. It fits what you want.
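A minimal supervisord program entry for such an event loop might look like this (paths and the program name are illustrative):

```ini
[program:metrics-loop]
command=/usr/bin/python /opt/app/loop.py
autostart=true
autorestart=true                          ; respawn if the script crashes
stdout_logfile=/var/log/metrics-loop.log
stdout_logfile_maxbytes=10MB              ; built-in log rolling
stdout_logfile_backups=5
```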
I'm wondering what the difference is between the Heroku Scheduler add-on and the Heroku Temporize Scheduler add-on. They both seem to be free and both run scheduled jobs.
And how do they both compare to running Python's sched module on Heroku?
I would like to run just one cron job to scrape some websites every minute with Heroku Python saving to Postgres. (I'm also trying to figure out what to write in a cron job to do so, but that's another question.)
Update with the solution:
Thanks to danneu's suggestions, the working solution was using the Heroku Scheduler. It was super simple to set up thanks to this tutorial.
(I tried using sched and twisted, but both times I got:
Application Error
An error occurred in the application and your page could not be served. Please try again in a few moments.
If you are the application owner, check your logs for details.
This was possibly due to my lack of experience in putting them in the right place. It didn't work with Heroku's sync gunicorn worker. I don't know the details.)
Temporize is a third-party service; you'd have to read what the limitations of their free plan are. It looks like the free plan only lets you run a task 20 times per day, which is a far cry from your needed 1-minute interval.
Heroku's Scheduler is offered by Heroku itself. It spins up a dyno for each task and bills you for the runtime. The minimum task interval on Heroku Scheduler is 10 minutes, which also won't give you what you want.
For a 1-minute interval that scrapes a page, I'd just run a concurrent loop in my Heroku app process. setTimeout in JavaScript is an example of a concurrent loop that can run alongside your application server (if you were using Node). It looks like the Twisted example in your Python sched link is the Python equivalent.
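In Python, the equivalent of that setTimeout-style loop is a daemon thread that wakes up on an interval; here is a minimal sketch (the scrape function is a placeholder for your real scraping + Postgres code):

```python
import threading

def scrape():
    # Placeholder for the real scraping and Postgres insert.
    print("scraping...")

def start_periodic(task, interval_s):
    """Run task() every interval_s seconds on a background daemon thread."""
    stop = threading.Event()

    def loop():
        while not stop.wait(interval_s):  # wait() doubles as the sleep
            task()

    threading.Thread(target=loop, daemon=True).start()
    return stop  # call stop.set() to shut the loop down

# In a Heroku web process you would start this once at boot:
# stop = start_periodic(scrape, 60)
```

Because the loop lives inside the application process, it shares the dyno's lifetime, which is exactly why the dyno-sleep caveat below matters.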
If you're using Heroku's free-tier (which I think you are after seeing you in IRC), you get 1,000 runtime hours each month once you verify your account (else you only get 550 hours), which I imagine means giving them your credit card number. 1,000 hours is enough for a single dyno to run all month for free.
However, the free-tier dyno will sleep (turn off) if it has gone X amount of minutes without receiving an HTTP request. Obviously the Twisted/concurrent loop approach will only work while the dyno is awake and running since the loop runs inside your application process.
So if your dyno falls asleep, your concurrent loop will stop until your dyno wakes back up and resumes.
If you want your dyno to stay awake all month, you can upgrade your dyno for $7/mo.
I am new to App Engine and am using App Engine Python for my new project.
I want to schedule a task that starts and runs forever, and use it to check something like this:
if smthng.expiry == time_now:
smthng.expired = True
I want to perform this check every second.
Which is suitable for this, cron or the Task API? Is there any other way to do it?
It would be very helpful to see some examples or tutorials showing how it's done.
Any help is appreciated. Thank you.
As mentioned in the comments above, tasks cannot run forever (a task must finish its job within 10 minutes or it will be killed).
You could use cron, but you would need it to run every second, and cron jobs can only run as often as every minute.
You could also take a look at GAE's Modules, specifically Modules with Manual Scaling (see more info here: https://cloud.google.com/appengine/docs/python/modules/#Python_Instance_scaling_and_class ), which would let you create an infinite loop unaffected by the 10-minute limit.
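For reference, a module configuration with manual scaling might look something like this (the module name and script are illustrative):

```yaml
# worker.yaml -- a module whose instance can run an indefinitely-long loop
module: expiry-worker
runtime: python27
api_version: 1

manual_scaling:
  instances: 1

handlers:
- url: /_ah/start      # App Engine calls this when the instance starts
  script: worker.app
```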
In my Django project, I need to collect data from about 50 remote servers into the local database every minute or every 30 seconds. Although this works with crontab on the remote servers, I want to do it within the project. At first I considered django-celery; however, it is designed for asynchronous processing, and the data-collection task cannot be delayed, so it may not be a good fit. What if I use a Python timer instead, and what would I need to pay attention to? Excuse my ignorance of Python and Django. I'd appreciate any other advice or ideas. Many thanks.
Basically, you can use Celery's periodic tasks with the expires option, which ensures that your tasks will not be executed twice.
You could also run your own script with an infinite loop that performs the collection. If a collection run takes more than a minute, you can spawn your tasks using eventlet or gevent. As another option, you could create Celery tasks from this script and be sure that your tasks execute every N seconds, as you prefer.
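The point of the expires option is that a task picked up too late is skipped rather than run stale or doubled up behind schedule; the idea can be sketched in plain Python (no Celery required):

```python
import time

def run_if_fresh(task, enqueued_at, expires_s):
    """Run task only if it is picked up within expires_s of being enqueued.

    This mimics Celery's `expires` option: a stale task from a backlog is
    discarded, so each interval's data collection runs at most once.
    """
    if time.time() - enqueued_at > expires_s:
        return "revoked"   # Celery would mark the task REVOKED
    return task()

now = time.time()
print(run_if_fresh(lambda: "collected", enqueued_at=now, expires_s=30))       # fresh: runs
print(run_if_fresh(lambda: "collected", enqueued_at=now - 60, expires_s=30))  # stale: skipped
```

With a 30-second collection interval, an expires of roughly the same length means a worker that falls behind simply drops the outdated runs instead of replaying them.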