Scheduling in Heroku while keeping track of a counter - Python

Might be a noob question, but I'm new to hosting code online. For my first project I've created a web scraper that logs into a website and updates a number every hour, incrementing it by one. Initially I was scheduling with the schedule library. It looked a little like this:
import schedule
import time

number = 4

def job():
    global number  # needed so the assignment updates the module-level counter
    number = number + 1
    element.clear()            # `element` is a Selenium element defined elsewhere
    element.send_keys(number)

schedule.every().hour.at(":00").do(job)

while True:
    schedule.run_pending()
    time.sleep(1)
This worked fine for the first 24 hours, until the dyno was cycled. After that, the value of number reverted to 4. I could use some built-in scheduler, but I don't see how that would remedy the issue, since it would run the number = 4 line anyway. What am I missing, and how can I keep track of the running count even when the dyno resets?

This isn't really about schedulers; it's about where that data lives. Right now it is in memory, so it (a) will be lost whenever your dyno restarts and (b) will not behave properly if you scale your application up or out.
You need to store it somewhere.
Heroku offers a number of addons that do data storage, many of which have free tiers.
Depending on your use case, you could also store it in an off-site blob store such as Amazon S3 or Azure Blob Storage, but a proper data store is likely a better choice.
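For example, here is a minimal sketch using Redis (available as a Heroku add-on). The key name "counter" and the seed value of 4 are assumptions taken from your snippet; REDIS_URL is the config var the Heroku Redis add-ons set:

import os
import redis

r = redis.from_url(os.environ["REDIS_URL"])
r.setnx("counter", 4)  # seed the starting value only if the key does not exist yet

def job():
    number = r.incr("counter")      # atomic increment; survives dyno restarts
    element.clear()                 # `element` is the Selenium element from your script
    element.send_keys(str(number))

Because INCR is atomic, this also stays correct if you later scale out to more than one dyno.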

Related

Delaying 1 second per request, not enough for 3600 per hour

The Amazon API limit is apparently 1 req per second or 3600 per hour. So I implemented it like so:
while True:
    # sql stuff
    time.sleep(1)
    result = api.item_lookup(row[0], ResponseGroup='Images,ItemAttributes,Offers,OfferSummary',
                             IdType='EAN', SearchIndex='All')
    # sql stuff
Error:
amazonproduct.errors.TooManyRequests: RequestThrottled: AWS Access Key ID: ACCESS_KEY_REDACTED. You are submitting requests too quickly. Please retry your requests at a slower rate.
Any ideas why?
This code looks correct, and it looks like the 1 request/second limit is still in effect:
http://docs.aws.amazon.com/AWSECommerceService/latest/DG/TroubleshootingApplications.html#efficiency-guidelines
You want to make sure that no other process is using the same associate account. Depending on where and how you run the code, there may be an old version of the VM still running, or another instance of your application, or perhaps one copy in the cloud and another on your laptop. If you are using a threaded web server, there may be multiple threads all running the same code.
If you still hit the query limit, you just want to retry, possibly with a TCP-like "additive increase/multiplicative decrease" back-off. You start by setting extra_delay = 0. When a request fails, you set extra_delay += 1 and sleep(1 + extra_delay), then retry. When it finally succeeds, set extra_delay = extra_delay * 0.9.
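A minimal sketch of that back-off, assuming your amazonproduct setup (the TooManyRequests exception comes from your traceback; rows stands in for your SQL result set):

import time
from amazonproduct.errors import TooManyRequests

def lookup_all(api, rows):
    extra_delay = 0.0
    for row in rows:
        while True:
            time.sleep(1 + extra_delay)  # base 1 s spacing plus any back-off
            try:
                result = api.item_lookup(row[0],
                                         ResponseGroup='Images,ItemAttributes,Offers,OfferSummary',
                                         IdType='EAN', SearchIndex='All')
            except TooManyRequests:
                extra_delay += 1         # additive increase on failure
                continue
            extra_delay *= 0.9           # multiplicative decrease on success
            # ... sql stuff with `result` ...
            break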
Computer time is funny
This post is correct in saying "it varies in a non-deterministic manner" (https://stackoverflow.com/a/1133888/5044893). Depending on a whole host of factors, the time measured by a processor can be quite unreliable.
This is compounded by the fact that Amazon's API runs on a different clock than your program does. They are certainly not in sync, and there's likely some overlap between their "1 second" measurement and your program's. Amazon probably tries to average out this inconsistency, and likely allows a small margin of error, maybe +/- 5%. Even so, the discrepancy between your clock and theirs is probably what triggers the RequestThrottled error.
Give yourself some buffer
Here are some thoughts to consider.
Do you really need to hit the Amazon API every single second? Would your program work with a 5-second interval? Even a 2-second interval halves your request rate, making a lockout far less likely. Also, Amazon may be charging you for every service call, so spacing them out could save you money.
This is really a question of optimization now. If you use a constant to control your API call rate (say, SLEEP = 2), then you can adjust that rate easily. Fiddle with it, increase and decrease it, and see how your program performs.
Push, not pull
Sometimes, hitting an API every second means that you're polling for new data. Polling is notoriously wasteful, which is why the Amazon API has a rate limit.
Instead, could you switch to a queue-based approach? Amazon SQS can deliver events to your programs. This is especially easy if you run them on AWS Lambda.
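As a rough illustration (the queue URL and the process() handler are placeholders, not from your post), consuming from SQS with long polling looks like this in boto3:

import boto3

sqs = boto3.client("sqs")
queue_url = "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue"  # placeholder

while True:
    resp = sqs.receive_message(QueueUrl=queue_url, WaitTimeSeconds=20)  # long polling
    for msg in resp.get("Messages", []):
        process(msg["Body"])  # your handler goes here
        sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])

Long polling (WaitTimeSeconds=20) means the loop sits idle until a message actually arrives, instead of burning requests.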

Run python job every x minutes

I have a small Python script that connects to a SQL Server (Microsoft) database, gets users from there, and then syncs them to another MySQL database. Basically I'm just running queries to check if a user exists; if not, I add that user to the MySQL database.
The script usually takes around 1 min to sync. I need the script to do its work every 5 mins (for example), exactly once (one sync per 5 mins).
What would be the best way to go about building this?
I have some test data for the users, but on the real site there are a lot more users, so I can't guarantee the script takes 1 min to execute; it might even take 20 mins. However, having an interval of, say, 15 mins between executions would be ideal for this problem...
Update:
I have the connection params for the SQL Server (Windows) db, so I'm using a small Ubuntu server to sync between the two databases located on different servers. So let's say db1 (Windows) and db2 (Linux) are the database servers; I'm using s1 (the Python server) with the pymssql and MySQL Python modules to sync.
Regards
I am not sure cron is right for the job. It seems to me that if you have it run every 15 minutes but a sync sometimes takes 20 minutes, you could have multiple processes running at once and possibly colliding.
If the driving force is a constant wait time between the variable execution times then you might need a continuously running process with a wait.
import time

def main():
    loopInt = 0
    while loopInt < 10000:
        synchDatabase()  # your MSSQL -> MySQL sync routine
        loopInt += 1
        print("call #" + str(loopInt))
        time.sleep(300)  # sleep 5 minutes

main()
(Obviously not continuous, but long-running.) You can make it truly continuous by changing the condition to while True: (or simply commenting out loopInt += 1, which keeps the condition permanently true).
Edited to add: Please see note in comments about monitoring the process as you don't want the script to hang or crash and you not be aware of it.
You might want to use a system that handles queues, for example RabbitMQ, and use Celery as the Python interface to implement it. With Celery, you can add tasks (like the execution of a script) to a queue, or run a schedule that performs a task after a given interval (just like cron).
Get started http://celery.readthedocs.org/en/latest/
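A minimal sketch of the periodic-task approach (the module name, broker URL, and task body are assumptions; adapt them to your setup):

from celery import Celery

app = Celery("sync", broker="amqp://localhost")  # e.g. a local RabbitMQ broker

@app.task
def sync_users():
    pass  # run the MSSQL -> MySQL sync here

app.conf.beat_schedule = {
    "sync-every-5-minutes": {
        "task": "sync.sync_users",
        "schedule": 300.0,  # seconds
    },
}

With a single-process worker, runs execute serially, so the overlapping-executions problem mentioned above is avoided even if a sync overruns its interval.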

How can I run a script constantly in background of App Engine website?

I'm trying to use Google App Engine (Python) to make a simple web app. I want to maintain one number x in the datastore that models a random walk. I need a script running 24 hours a day that, every second, randomly chooses to either increment or decrement x (saving the change to the datastore). Users should be able to go to a URL to see the current value of x.
I've thought of two ways to accomplish the constant script issue:
1) I can have an admin-access page that runs a continuous loop in JavaScript which, each second, makes an AJAX request to the server to update x. If I leave this page open on my computer 24 hours a day, this should work. The problem with this approach is that if my computer crashes, the script dies with it.
2) I can use a CRON job. But the interval between jobs cannot be smaller than 1 minute, so this doesn't really work.
It seems like there should be a simple way to just run a script constantly (that exists only server side) with Google App Engine.
I appreciate any advice. Thanks for your time!
Start a backend instance using Modules (either programmatically or by hitting a special URL accessible to admins only). Run the script for as long as the instance lives.
Note that an instance can die, just like your computer can crash. For this reason, you are probably better off with a Google Compute Engine instance (choose the smallest) than with an App Engine instance. Note that the Compute Engine instance will be many times cheaper.
Compute Engine instances can also fail, though it is much less likely. There are ways to create a fail-over implementation (when one instance is creating your random numbers while the other instance - which can run on some other platform - waits for the first one to fail), but this will obviously cost more.
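Whichever instance type you pick, the worker itself is just a loop. A minimal sketch (load_x and save_x are hypothetical stand-ins for your datastore reads and writes):

import random
import time

def random_walk():
    x = load_x()  # read the current value from the datastore
    while True:
        x += random.choice((-1, 1))  # step up or down with equal probability
        save_x(x)                    # persist each step so readers see it
        time.sleep(1)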

How to continuously read data from xively in a (python) heroku app?

I am trying to write a Heroku app in Python which will read and store data from a Xively feed in real time. I want the app to run independently, as a sort of 'backend process', simply storing the data in a database. (It does not need to serve anything to site visitors.)
Right now I am working on the 'continuous reading' part. I have included my code below. It simply reads the datastream once each time I hit my app's Heroku URL. How do I get it to operate continuously, so that it keeps reading the data from Xively?
import os
from flask import Flask
import xively

app = Flask(__name__)

@app.route('/')
def run_xively_script():
    key = 'FEED_KEY'
    feedid = 'FEED_ID'
    client = xively.XivelyAPIClient(key)
    feed = client.feeds.get(feedid)
    datastream = feed.datastreams.get("level")
    level = datastream.current_value
    return "level is %s" % level
I am new to web development, Heroku, and Python... I would really appreciate any help (pointers).
PS: I have read about the Heroku Scheduler, and from what I understand it can be used to schedule a task at specific time intervals; when it does so, it starts a one-off dyno for the task. But as I mentioned, my app is really meant to perform just one function: continuously reading and storing data from Xively. Is it necessary to schedule a separate task for that? And the one-off dyno that the scheduler starts will also consume dyno hours, which I think will exceed the free 750 dyno-hour limit (as my app's web dyno is already consuming 720 dyno-hours per month)...
Using the scheduler, as you and @Calumb have suggested, is one method to go about this.
Another method would be for you to set up a trigger on Xively. https://xively.com/dev/docs/api/metadata/triggers/
Have the trigger occur when your feed is updated. The trigger should POST to your Flask app, and the Flask app can then take the new data, manipulate it, and store it as you wish. This would be the closest to real time, I'd think, because Xively is pushing the update to your system.
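A rough sketch of such an endpoint (the route name, the payload shape, and the store_level helper are assumptions; check the trigger docs linked above for the actual POST body):

from flask import Flask, request

app = Flask(__name__)

@app.route('/xively-trigger', methods=['POST'])
def xively_trigger():
    payload = request.get_json(force=True)  # trigger payloads are JSON
    level = payload.get('triggering_datastream', {}).get('value', {}).get('value')
    store_level(level)  # hypothetical helper that writes to your database
    return '', 204      # acknowledge with no content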
This question is more about high-level architecture decisions and what you are trying to accomplish than about a specific thing you should do.
Ultimately, Flask is probably not the best choice for an app that does what you are describing. You would be better off with just pure Python or pure Ruby. That said, using the Heroku Scheduler (which you alluded to) makes it possible to do something like what you are trying to do.
The simplest way to accomplish your goal (assuming that you want to change a minimal amount of code, and that constantly reading data is really what you want to do; you should reconsider both) is to write a loop that runs when you call that task and grabs data for a few seconds. Just use a for loop and increment a counter for however many times you want to get the data.
Something like:
import time
import xively

for i in range(0, 5):
    key = 'FEED_KEY'
    feedid = 'FEED_ID'
    client = xively.XivelyAPIClient(key)
    feed = client.feeds.get(feedid)
    datastream = feed.datastreams.get("level")
    level = datastream.current_value
    time.sleep(1)
However, Heroku limits how long a web request can run before it must return a response: after about 30 seconds the router gives up and returns a 503. But you could use the scheduler to run this task at regular intervals instead.
Again, I think that Flask and Heroku are not the best solution for what it sounds like you are trying to do. I would review your use case and go back to the drawing board on the best method to accomplish it.

Python chat server when can't guarantee page not reset?

I'm relatively new to Python, CGI, and passenger-wsgi, so please bear with me.
I set up a Python script that's not much more than
import time

startTime = time.time()

def main():
    return time.time() - startTime
... just so I know how long the Passenger server has been running. I did this a few days ago, but it's only at about 12 minutes now.
Keeping track of state isn't important for any of the scripts I currently have, but I'm planning on writing a simple chat page. Keeping track of the various users online and the chat groups will be very important, and I wouldn't want everything to be reset every 12 minutes.
What can I do about this?
My only thought is to store any necessary variables inside an object, then serialize that object and store it in a file every time I change it, so that if the server restarts again I still have everything. Is this normally how it's done?
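To illustrate the approach described above, here is a minimal sketch (the file name and state shape are assumptions); writing to a temp file and renaming keeps a restart mid-write from corrupting the saved state:

import json
import os

STATE_FILE = "chat_state.json"  # hypothetical path

def load_state():
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            return json.load(f)
    return {"users": [], "groups": []}  # defaults on first run

def save_state(state):
    tmp = STATE_FILE + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, STATE_FILE)  # atomic rename, so readers never see a partial file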
