I'm new to web apps, so I'm not used to worrying about CPU limits, but it looks like I'm going to have a problem with this code. I read on Google's quotas page that I can use 6.5 CPU hours per day and 15 CPU minutes per minute.
Google says:
CPU time is reported in "seconds," which is equivalent to the number of CPU cycles that
can be performed by a 1.2 GHz Intel x86 processor in that amount of time. The actual
number of CPU cycles spent varies greatly depending on conditions internal to App Engine,
so this number is adjusted for reporting purposes using this processor as a reference
measurement.
And
            Per Day          Max Rate
CPU Time    6.5 CPU-hours    15 CPU-minutes/minute
What I want to know:
Is this script going over the limit?
(If yes) How can I keep it from going over the limit?
I use the urllib library; should I use Google's URL Fetch API? Why?
Absolutely any other helpful comment is welcome.
What it does:
It scrapes (crawls) Project Free TV. I will only run it completely once, then replace it with a shorter, faster script.
from urllib import urlopen
import re

# Level 1 (alpha): the category index page
alphaUrl = 'http://www.free-tv-video-online.me/movies/'
alphaPage = urlopen(alphaUrl).read()
patFinderAlpha = re.compile('<td width="97%" nowrap="true" class="mnlcategorylist"><a href="(.*)">')
findPatAlpha = re.findall(patFinderAlpha, alphaPage)
listIteratorAlpha = []
listIteratorAlpha[:] = range(len(findPatAlpha))
for ai in listIteratorAlpha:
    # Level 2 (beta): each category's listing page
    betaUrl = 'http://www.free-tv-video-online.me/movies/' + findPatAlpha[ai] + '/'
    betaPage = urlopen(betaUrl).read()
    patFinderBeta = re.compile('<td width="97%" class="mnlcategorylist"><a href="(.*)">')
    findPatBeta = re.findall(patFinderBeta, betaPage)
    listIteratorBeta = []
    listIteratorBeta[:] = range(len(findPatBeta))
    for bi in listIteratorBeta:
        # Level 3 (gamma): each title's page, plus its keywords meta tag
        gammaUrl = betaUrl + findPatBeta[bi]
        gammaPage = urlopen(gammaUrl).read()
        patFinderGamma = re.compile('<a href="(.*)" target="_blank" class="mnllinklist">')
        findPatGamma = re.findall(patFinderGamma, gammaPage)
        patFinderGamma2 = re.compile('<meta name="keywords"content="(.*)">')
        findPatGamma2 = re.findall(patFinderGamma2, gammaPage)
        listIteratorGamma = []
        listIteratorGamma[:] = range(len(findPatGamma))
        for gi in listIteratorGamma:
            # Level 4 (delta): the page that embeds the video iframe
            deltaUrl = findPatGamma[gi]
            deltaPage = urlopen(deltaUrl).read()
            patFinderDelta = re.compile("<iframe id='hmovie' .* src='(.*)' .*></iframe>")
            findPatDelta = re.findall(patFinderDelta, deltaPage)
            PutData(findPatGamma2[gi], findPatAlpha[ai], findPatDelta)
If I forgot anything please let me know.
Update:
This is roughly how many times each level will run, in case that's helpful in answering the question.
          per cycle    total
Alpha:        1            1
Beta:        16           16
Gamma:     ~250        ~4000
Delta:       ~6       ~24000
I don't like to optimize until I need to. First, just try it. It might just work. If you go over quota, shrug, come back tomorrow.
To split jobs into smaller parts, look at the Task Queue API. Maybe you can divide the workload into two queues, one that scrapes pages and one that processes them. You can put limits on the queues to control how aggressively they are run.
P.S. On Regex for HTML: Do what works. The academics will call you out on semantic correctness, but if it works for you, don't let that stop you.
I use the urllib library; should I use Google's URL Fetch API? Why?
urllib on App Engine production servers is implemented on top of the URLFetch API, so you are effectively already using it.
It's unlikely that this will go over the free limit, but it's impossible to say without seeing how big the list of URLs it needs to fetch is, and how big the resulting pages are. The only way to know for sure is to run it - and there's really no harm in doing that.
You're more likely to run into the limitations on individual request execution - 30 seconds for frontend requests, 10 minutes for backend requests like cron jobs - than run out of quota. To alleviate those issues, use the Task Queue API to split your job into many parts. As an additional benefit, they can run in parallel! You might also want to look into Asynchronous URLFetch - though it's probably not worth it if this is just a one-off script.
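For a rough idea of what the Task Queue split could look like, here is a minimal sketch; the /scrape-category handler path and parameter name are made up for the example, and older SDKs import the API from google.appengine.api.labs.taskqueue. The idea is that the front-end request only enqueues one task per category, and a separate handler does the per-category work:

from google.appengine.api import taskqueue  # older SDKs: google.appengine.api.labs.taskqueue

# Enqueue one task per category scraped from the alpha page; the queue's rate
# can be capped in queue.yaml to control how aggressively tasks run.
for category in findPatAlpha:
    taskqueue.add(url='/scrape-category', params={'category': category})

Each /scrape-category request then only has to fetch one category's pages, which keeps every individual request comfortably under the execution time limits.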
Related
I have a Flask application running on port 5000 with 7 different endpoints that accept GET requests. So I can do a
curl http://localhost:5000/get_species_interactions?q=tiger
And it returns a page after some computation. There are 6 other such endpoints each with varying degrees of computation at the back end. It works fine with one user but I want to get metrics for how well it can perform under load. I am trying to stress test this by simulating a large number of requests and I was thinking of using a python script. The rough algorithm I had in mind is the following:
num_tests = 0
while num_tests < 1000:
    e = get_random_end_point_to_test()   # pick one out of 7 endpoints
    d = get_random_data_for_get(e)       # pick relevant random data to send
    resp = curl('http://localhost:5000/%s?q=%s' % (e, d))  # issue the GET (e.g. via curl)
    num_tests += 1
My question is: is this general approach on the right track? Does it simulate a large number of simultaneous users? I was planning to store the amount of time it took to execute each request and compute stats. Otherwise, is there a free utility I can use to do this kind of stress test on Mac OS? I saw a tool called siege, but it's not easily available on Mac.
I would suggest Apache JMeter. The tool has everything you need for stress tests and is well documented online.
You'll need to install Java, though.
No, you need to parallelize your requests. libcurl can do this using the multi interface.
Check out pycurl, the Pythonic interface to libcurl.
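If you'd rather stay with plain Python than wire up pycurl's multi interface, a thread pool gives you similar concurrency. Here is a minimal sketch, assuming the get_random_end_point_to_test and get_random_data_for_get helpers from the question exist and the endpoints are served at localhost:5000; the pool size and request count are arbitrary starting points.

import time
import urllib2
from multiprocessing.dummy import Pool  # thread-backed Pool with the multiprocessing API

def timed_request(_):
    e = get_random_end_point_to_test()   # helper from the question
    d = get_random_data_for_get(e)       # helper from the question
    start = time.time()
    urllib2.urlopen('http://localhost:5000/%s?q=%s' % (e, d)).read()
    return time.time() - start

pool = Pool(20)                                   # 20 concurrent simulated users
durations = pool.map(timed_request, range(1000))  # 1000 requests in total
print 'mean response time: %.3fs' % (sum(durations) / len(durations))

Adjusting the pool size up and down lets you see how the endpoints behave as the number of simultaneous users grows.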
The Amazon API limit is apparently 1 req per second or 3600 per hour. So I implemented it like so:
while True:
    #sql stuff
    time.sleep(1)
    result = api.item_lookup(row[0], ResponseGroup='Images,ItemAttributes,Offers,OfferSummary',
                             IdType='EAN', SearchIndex='All')
    #sql stuff
Error:
amazonproduct.errors.TooManyRequests: RequestThrottled: AWS Access Key ID: ACCESS_KEY_REDACTED. You are submitting requests too quickly. Please retry your requests at a slower rate.
Any ideas why?
This code looks correct, and it looks like the 1 request/second limit is still current:
http://docs.aws.amazon.com/AWSECommerceService/latest/DG/TroubleshootingApplications.html#efficiency-guidelines
You want to make sure that no other process is using the same associate account. Depending on where and how you run the code, there may be an old version of the VM or another instance of your application running, or maybe there is one version in the cloud and another on your laptop, or, if you are using a threaded web server, there may be multiple threads all running the same code.
If you still hit the query limit, you just want to retry, possibly with a TCP-like "additive increase/multiplicative decrease" back-off. You start by setting extra_delay = 0. When a request fails, you set extra_delay += 1 and sleep(1 + extra_delay), then retry. When it finally succeeds, set extra_delay = extra_delay * 0.9.
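A sketch of that back-off loop, reusing the item_lookup call from the question; rows stands in for the result of the "#sql stuff", and the exception name is taken from the traceback above:

import time
from amazonproduct.errors import TooManyRequests

extra_delay = 0.0
for row in rows:  # placeholder for the question's "#sql stuff"
    while True:
        time.sleep(1 + extra_delay)
        try:
            result = api.item_lookup(row[0], ResponseGroup='Images,ItemAttributes,Offers,OfferSummary',
                                     IdType='EAN', SearchIndex='All')
        except TooManyRequests:
            extra_delay += 1      # throttled: back off harder
            continue
        extra_delay *= 0.9        # succeeded: relax the delay gradually
        break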
Computer time is funny
This post is correct in saying "it varies in a non-deterministic manner" (https://stackoverflow.com/a/1133888/5044893). Depending on a whole host of factors, the time measured by a processor can be quite unreliable.
This is compounded by the fact that Amazon's API has a different clock than your program does. They are certainly not in sync, and there's likely some overlap between their "1 second" time measurement and your program's. It's likely that Amazon tries to average out this inconsistency, and they probably also allow a small bit of error, maybe +/- 5%. Even so, the discrepancy between your clock and theirs is probably what triggers the RequestThrottled error.
Give yourself some buffer
Here are some thoughts to consider.
Do you really need to hit the Amazon API every single second? Would your program work with a 5-second interval? Even a 2-second interval makes you far less likely to trigger a lockout. Also, Amazon may be charging you for every service call, so spacing them out could save you money.
This is really a question of "optimization" now. If you use a constant variable to control your API call rate (say, SLEEP = 2), then you can adjust that rate easily. Fiddle with it, increase and decrease it, and see how your program performs.
Push, not pull
Sometimes, hitting an API every second means that you're polling for new data. Polling is notoriously wasteful, which is why the Amazon API has a rate limit.
Instead, could you switch to a queue-based approach? Amazon SQS can fire off events to your programs. This is especially easy if you host them with AWS Lambda.
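As a rough illustration of that push model (an assumption, not anything from the question): an AWS Lambda handler triggered by an SQS queue, where each message body is assumed to carry one EAN to look up and save_result stands in for the database code.

def handler(event, context):
    # Each SQS-triggered invocation delivers a batch of messages in event['Records'].
    for record in event['Records']:
        ean = record['body']  # assumed message format: one EAN per message
        result = api.item_lookup(ean, ResponseGroup='Images,ItemAttributes,Offers,OfferSummary',
                                 IdType='EAN', SearchIndex='All')
        save_result(result)   # hypothetical persistence helper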
I have a requirement where I need to hit up to 2000 URLs per minute and save the response to a database. The URLs need to be hit within 5 seconds of the start of every minute (but the response can wait). Then, at the next minute, the same will happen, and so on. So it's time-critical.
I've tried using Python multiprocessing and threading to solve the problem. However, some URLs may take up to 30 minutes to respond, which blocks all other URLs from being processed.
I'm also open to using something lower level such as C, but don't know where to start.
Any guidance in the right direction will help, thanks.
You need something lighter than a thread, since if each URL can block for a long time then you'll need to send them all simultaneously instead of via a thread pool.
gevent is a coroutine-based Python library built on greenlet and libevent (libev in newer versions) that's good at this sort of thing. From their docs:
>>> import gevent
>>> from gevent import socket
>>> urls = ['www.google.com', 'www.example.com', 'www.python.org']
>>> jobs = [gevent.spawn(socket.gethostbyname, url) for url in urls]
>>> gevent.joinall(jobs, timeout=2)
>>> [job.value for job in jobs]
['74.125.79.106', '208.77.188.166', '82.94.164.162']
I am not sure if I have understood the problem correctly, but if you are using 'n' processes and all 'n' of them get stuck waiting for a response, then changing the language will not solve your issue, since the bottleneck is the server you are requesting, not your local driver code. You can eliminate this dependency by switching to an asynchronous mechanism. Do not wait for the response! Let a callback handle it for you!
EDIT: You might want to have a look at https://github.com/kennethreitz/grequests
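A minimal sketch with grequests (the library linked above); urls stands for the ~2000 URLs to hit each minute, and save_to_database is a placeholder for your persistence code:

import grequests

# grequests builds on gevent, so a slow server only ties up a greenlet rather
# than a whole thread or process.
pending = (grequests.get(u, timeout=60) for u in urls)
responses = grequests.map(pending, size=200)  # at most 200 concurrent connections
for url, resp in zip(urls, responses):
    if resp is not None:                      # None means the request failed or timed out
        save_to_database(url, resp.content)   # hypothetical persistence helper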
For starters, I'm new to Python, so my code below may not be the cleanest. For a program I need to download about 500 web pages. The URLs are stored in an array which is populated by a previous function. The downloading part goes something like this:
def downloadpages(num):
    import urllib
    for i in range(0, numPlanets):
        urllib.urlretrieve(downloadlist[i], 'webpages/' + names[i] + '.htm')
Each file is only around 20 KB, but it takes at least 10 minutes to download all of them. Downloading a single file of the total combined size should only take a minute or two. Is there a way I can speed this up? Thanks.
Edit: To anyone who is interested, following the example at http://code.google.com/p/workerpool/wiki/MassDownloader and using 50 threads, the download time has been reduced to about 20 seconds from the original 10 minutes plus. The download speed continues to decrease as the threads are increased up until around 60 threads, after which the download time begins to rise again.
But you're not downloading a single file here. You're downloading 500 separate pages, and each connection involves overhead (for the initial connection), plus whatever else the server is doing (is it serving other people?).
Either way, downloading 500 x 20 KB is not the same as downloading a single file of that size.
You can speed up execution significantly by using threads (be careful, though, not to overload the server); a minimal sketch follows the links below.
Intro material/Code samples:
http://docs.python.org/library/threading.html
Python Package For Multi-Threaded Spider w/ Proxy Support?
http://code.google.com/p/workerpool/wiki/MassDownloader
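As a rough illustration of the worker-pool idea (assuming the same downloadlist and names lists from the question; 50 threads simply matches the figure from the question's edit):

import threading
import urllib
from Queue import Queue

def worker(q):
    # Each worker pulls (url, name) pairs off the queue until the program exits.
    while True:
        url, name = q.get()
        try:
            urllib.urlretrieve(url, 'webpages/' + name + '.htm')
        finally:
            q.task_done()

q = Queue()
for _ in range(50):  # 50 downloader threads
    t = threading.Thread(target=worker, args=(q,))
    t.daemon = True
    t.start()

for url, name in zip(downloadlist, names):
    q.put((url, name))
q.join()  # block until every page has been fetched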
You can use greenlets to do this.
E.g. with the eventlet library:
urls = [url1, url2, ...]

import eventlet
from eventlet.green import urllib2

def fetch(url):
    return urllib2.urlopen(url).read()

pool = eventlet.GreenPool()
for body in pool.imap(fetch, urls):
    print "got body", len(body)
All calls in the pool will be pseudo-simultaneous.
Of course, you must first install eventlet with pip or easy_install.
There are several implementations of greenlets in Python; you could do the same with gevent or another one.
In addition to using concurrency of some sort, make sure whatever method you're using to make the requests uses HTTP 1.1 connection persistence. That will allow each thread to open only a single connection and request all the pages over that, instead of having a TCP/IP setup/teardown for each request. Not sure if urllib2 does that by default; you might have to roll your own.
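For example (a sketch, assuming all 500 pages live on one host and paths is a hypothetical list of request paths), httplib's HTTPConnection speaks HTTP/1.1 and will reuse the same socket for consecutive requests as long as the server allows it:

import httplib

conn = httplib.HTTPConnection('www.example.com')  # hypothetical host
bodies = []
for path in paths:
    conn.request('GET', path)
    resp = conn.getresponse()
    bodies.append(resp.read())  # read each response fully before issuing the next request
conn.close()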
I'm pulling some RSS feeds into a datastore in App Engine to serve up to an iPhone app. I use cron to schedule updating the RSS every x minutes. Each task only parses one RSS feed (which has 15-20 items). I frequently get warnings about high CPU usage in the App Engine dashboard, so I'm looking for ways to optimise my code.
Currently, I use minidom (since it's already there on App Engine), but I suspect it's not very efficient!
Here's the code:
dom = minidom.parseString(urlfetch.fetch(url).content)
if dom:
    items = []
    for node in dom.getElementsByTagName('item'):
        item = RssItem(
            key_name = self.getText(node.getElementsByTagName('guid')[0].childNodes),
            title = self.getText(node.getElementsByTagName('title')[0].childNodes),
            description = self.getText(node.getElementsByTagName('description')[0].childNodes),
            modified = datetime.now(),
            link = self.getText(node.getElementsByTagName('link')[0].childNodes),
            categories = [self.getText(category.childNodes) for category in node.getElementsByTagName('category')]
        )
        items.append(item)
    db.put(items)
def getText(self, nodelist):
    rc = ''
    for node in nodelist:
        if node.nodeType == node.TEXT_NODE:
            rc = rc + node.data
    return rc
There isn't much going on, but the scripts often take 2-6 seconds CPU time, which seems a bit excessive for looping through 20ish items and reading a few attributes.
What can I do to make this faster? Is there anything particularly bad in the above code, or should I change to another way of parsing? Are there any libraries (that work on App Engine) that would be better, or would I be better off parsing the RSS myself?
Outsource feed parsing, for example via Superfeedr
You could also look into superfeedr.com. They have a reasonable free quota and paid plans. They will do the polling for you (you get updates within 15 minutes), etc. If the feeds also support PubSubHubbub, then you will receive the feeds in real time! This video will explain what PubSubHubbub is if you don't know it yet.
Improved feed parser written by Brett Slatkin
I would also advise you to watch this awesome video from Brett Slatkin explaining PubSubHubbub. I also remember that somewhere in the presentation he says that he does not use Universal Feed Parser because it just does too much work for his problem. He wrote his own SAX parser (he talks about it a little at 14:10 in the video presentation) which is lightning fast. I guess you should check out the PubSubHubbub code to find out how he accomplished this.
If you have a low amount of traffic coming to your site, you might be experiencing spin-up times for your app. If an app is idle for as little as a few minutes, App Engine will spin it down to save resources. When the next request comes in, the app has to be spun up before it can handle the request, and this all gets added to your CPU quota. If you search the App Engine newsgroup, you'll see that it is full of complaints about this.
I use Superfeedr for my site www.newsfacet.com, and I notice that when Superfeedr notifies me, most of the time I can handle a few RSS articles in a few hundred milliseconds. If it's been a while since the last input, this time can jump to 10 or 11 seconds as it incurs the spin-up cost.
In regards to using PubSubHubbub to let someone else do the work for you, you may find my blog post on using hubbub on App Engine to be useful.
I'd try ElementTree or the Universal Feed Parser and see if they're any better. ElementTree is in the stdlib as of Python 2.5, so it's available on App Engine.
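For instance, a sketch of the same loop with cElementTree (untested on App Engine; it assumes the usual rss/channel/item layout, keeps the RssItem model, urlfetch call and db.put from the question, and drops the getText helper because findtext returns the text directly):

from xml.etree import cElementTree as ET  # in the stdlib since Python 2.5

root = ET.fromstring(urlfetch.fetch(url).content)
items = []
for node in root.find('channel').findall('item'):
    items.append(RssItem(
        key_name=node.findtext('guid'),
        title=node.findtext('title'),
        description=node.findtext('description'),
        modified=datetime.now(),
        link=node.findtext('link'),
        categories=[c.text for c in node.findall('category')],
    ))
db.put(items)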
You probably should run a profiler to pinpoint where the code is spinning its wheels. It could be waiting on the connections as some RSS feeds are REAL slow.
Also, some RDF/RSS/ATOM libraries build in a governor to keep from beating the cr*p out of the host when retrieving multiple feeds from the same site. I've written several aggregators and being considerate to the server is important.
Universal Feed Parser is full-featured, at least from what I've seen by looking through the docs. I didn't use it because I wrote my aggregators in Ruby and had different needs but I was aware of it and would consider it for a Python-based solution.