I'm working on a Baseball Simulator app with Dash. It uses a SGD model to simulate gameplay between a lineup and a pitcher. The app (under construction) can be found here: https://capstone-baseball-simulator.herokuapp.com/ and the repo: https://github.com/c-fried/capstone_heroku
To summarize the question: I want to be able to run the lineup optimizer on the heroku server.
There are potentially two parts to this: 1. Running the actual function while avoiding timeout. & 2. Displaying the progress of the function as it's running.
There are several issues I'm having with solving this:
The function is expensive and cannot be completed before the 30-second timeout. (It takes several minutes to complete.)
For this, I attempted to follow these instructions (https://devcenter.heroku.com/articles/python-rq) by creating a worker.py (still in the repo), moving the function to the external .py...etc. The problem I believe was that the process still was taking too long and therefore terminating.
I'm (knowingly) using a global variable in the function which works when I run locally but does not work when deployed (for reasons I somewhat understand - workers don't share memory https://dash.plotly.com/sharing-data-between-callbacks)
I was using a global to be able to see live updates of what the function was doing as it was running. Again, worked as a hack locally, but doesn't work on the server. I don't know how else I can watch the progress of the function without some kind of global operation going on. I'd love a clever solution to this, but I can't think of it.
I'm not experienced with web apps, so thanks for the advice in advance.
A common approach to solve this problem is to,
Run the long calculation asynchronously, e.g. using a background service
On completion, put the result in a shared storage space, e.g. a redis cache or an S3 bucket
Check for updates using an Interval component or a Websocket component
I can recommend Celery for keeping track of the tasks.
Related
I have a django web application and I have to create model for Machine Learning in a view.
It takes a long time so PythonAnyWhere does not allow it and it kills the process when it reaches 300 seconds. According to that, i want to ask two questions.
Without celery, django bg task or something else, my view which contains the long running process does not work in order. But when I use debugger, it works correctly. Probably, some lines of code try to work without waiting for each other without debugger. How can i fix this?
PythonAnyWhere does not support celery or other long running task packages. They suggest django-background-tasks but in its documentation, the usage is not explained clearly. So I could not integrate it. How can i integrate django-background-tasks?
Thank you.
I'm working on a project to learn Python, SQL, Javascript, running servers -- basically getting a grip of full-stack. Right now my basic goal is this:
I want to run a Python script infinitely, which is constantly making API calls to different services, which have different rate limits (e.g. 200/hr, 1000/hr, etc.) and storing the results (ints) in a database (PostgreSQL). I want to store these results over a period of time and then begin working with that data to display fun stuff on the front. I need this to run 24/7. I'm trying to understand the general architecture here, and searching around has proven surprisingly difficult. My basic idea in rough pseudocode is this:
database.connect()
def function1(serviceA):
while(True):
result = makeAPIcallA()
INSERT INTO tableA result;
if(hitRateLimitA):
sleep(limitTimeA)
def function2(serviceB):
//same thing, different limits, etc.
And I would ssh into my server, run python myScript.py &, shut my laptop down, and wait for the data to roll in. Here are my questions:
Does this approach make sense, or should I be doing something completely different?
Is it considered "bad" or dangerous to open a database connection indefinitely like this? If so, how else do I manage the DB?
I considered using a scheduler like cron, but the rate limits are variable. I can't run the script every hour when my limit is hit say, 5min into start time and has a wait time of 60min after that. Even running it on minute intervals seems messy: I need to sleep for persistent rate limit wait times which will keep varying. Am I correct in assuming a scheduler is not the way to go here?
How do I gracefully handle any unexpected potentially fatal errors (namely, logging and restarting)? What about manually killing the script, or editing it?
I'm interested in learning different approaches and best practices here -- any and all advice would be much appreciated!
I actually do exactly what you do for one of my personal applications and I can explain how I do it.
I use Celery instead of cron because it allows for finer adjustments in scheduling and it is Python and not bash, so it's easier to use. I have different tasks (basically a group of API calls and DB updates) to different sites running at different intervals to account for the various different rate limits.
I have the Celery app run as a service so that even if the system restarts it's trivial to restart the app.
I use the logging library in my application extensively because it is difficult to debug something when all you have is one difficult to read stack trace. I have INFO-level and DEBUG-level logs spread throughout my application, and any WARNING-level and above log gets printed to the console AND gets sent to my email.
For exception handling, the majority of what I prepare for are rate limit issues and random connectivity issues. Make sure to surround whatever HTTP request you send to your API endpoints in try-except statements and possibly just implement a retry mechanism.
As far as the DB connection, it shouldn't matter how long your connection is, but you need to make sure to surround your main application loop in a try-except statement and make sure it gracefully fails by closing the connection in the case of an exception. Otherwise you might end up with a lot of ghost connections and your application not being able to reconnect until those connections are gone.
I'm writing small & simple telegram bot on python. I never used this language in my work and decided that's a good way to learn by practice.
To get updates my app currently uses long polling called from an endless loop.
So I'm basically searching for the simplest way to run this app on openshift. I tried to use this example on flask but it didn't work. There are a lot of other options to implement background infinite processes with multiprocessing (from django and cerely to tornado) but it seems that all of them are way too advanced and complicated for my rather modest needs.
If the polling is not event driven, then you could use 'cron' (you can add cron cartridge to your gear) to periodically trigger your python script, that does the work and "dies".
However, keep in mind that Openshfit is not really intended to be your worker thread (unless you are on the bronze plan or higher). Unless you receive an external request to your gear within 24 hour period, your gear will be "idled" and your process will no longer run.
The way to get around this, "officially", is probably to get the bronze plan (you will not be charged unless you require the 4th gear instance),
"Unofficially", you can create a gear with python that will give you a default website. Then you create a new python script that does your job and trigger it using cron. To keep the gear from idling, use something like uptimerobot to ping your "website" every day.
How do I get SSL for my domains?
If you are still getting by on OpenShift Online's generous Free plan,
you'll see a warning message at the top of your application's SSL
configuration area. You can always take advantage of our *.rhcloud.com
wildcard certificate in order to securely connect to any application
via it's original, OpenShift-provided hostname URL.
Tornado is very simple, my first steps in telegram bot dev I did using this server on openshift platform.
OK so I'm working on an app that has 2 Heroku apps - one is the writer that writes to my DB after scraping a site, and one is the reader that consumes the said DB.
The former is just a Python script that has a kind of a while 1 loop - it's actually a Twitter stream. I want this to run every x minutes independent of what the reader is doing.
Now, running the script locally works fine, but I'm not sure how getting this to work on Heroku would work. I've tried looking it up, but could not find a solid answer. I read about background tasks, Redis queue, One-off dynos etc, but I'm not sure what to really use for my purpose. Some of my requirements are:
have the Python script keep logs of whatever I want.
in the future, I might want to add an admin panel for the writer, that will just show me stats of the script (and the logs). So hooking up this admin panel (flask) should be easy-ish and not break the script itself.
I would love any suggestions or pointers here.
I suggest writing the consumer as a server that waits around, then processes the stream on the timed interval. That is, you start it once and it runs forever, doing some processing every 10 minutes or so.
See: sched Python module, which handles scheduling events at certain times and running them.
Simpler: use Heroku's scheduler service.
This technique is simpler -- it's just straight-through code -- but can lead to problems if you have two of the same consumer running at the same time.
This seems like a simple question, but I am having trouble finding the answer.
I am making a web app which would require the constant running of a task.
I'll use sites like Pingdom or Twitterfeed as an analogy. As you may know, Pingdom checks uptime, so is constantly checking websites to see if they are up and Twitterfeed checks RSS feeds to see if they;ve changed and then tweet that. I too need to run a simple script to cycle through URLs in a database and perform an action on them.
My question is: how should I implement this? I am familiar with cron, currently using it to do my server backups. Would this be the way to go?
I know how to make a Python script which runs indefinitely, starting back at the beginning with the next URL in the database when I'm done. Should I just run that on the server? How will I know it is always running and doesn't crash or something?
I hope this question makes sense and I hope I am not repeating someone else or anything.
Thank you,
Sam
Edit: To be clear, I need the task to run constantly. As in, check URL 1 in the database, check URl 2 in the database, check URL 3 and, when it reaches the last one, go right back to the beginning. Thanks!
If you need a repeatable running of the task which can be run from command line - that's what the cron is ideal for.
I don't see any demerits of this approach.
Update:
Okay, I saw the issue somewhat different. Now I see several solutions:
run the cron task at set intervals, let it process the data once per run, next time it will process the data on another run; use PIDs/Database/semaphores to avoid parallel processes;
update the processes that insert/update data in the database; let the information be processed when it is inserted/updated; c)
write a demon process which will reside in memory and check the data in real time.
cron would definitely be a way to go with this, as well as any other task scheduler you may prefer.
The main point is found in the title to your question:
Run a repeating task for a web app
The background task and the web application should be kept separate. They can share code, they can share access to a database, but they should be separate and discrete application contexts. (Consider them as separate UIs accessing the same back-end logic.)
The main reason for this is because web applications and background processes are architecturally very different and aren't meant to be mixed. Consider the structure of a web application being held within a web server (Apache, IIS, etc.). When is the application "running"? When it is "on"? It's not really a running task. It's a service waiting for input (requests) to handle and generate output (responses) and then go back to waiting.
Web applications are for responding to requests. Scheduled tasks or daemon jobs are for running repeated processes in the background. Keeping the two separate will make your management of the two a lot easier.