I'm pulling some RSS feeds into a datastore in App Engine to serve up to an iPhone app. I use cron to schedule updating the RSS every x minutes. Each task only parses one RSS feed (which has 15-20 items). I frequently get warnings about high CPU usage in the App Engine dashboard, so I'm looking for ways to optimise my code.
Currently, I use minidom (since it's already there on App Engine), but I suspect it's not very efficient!
Here's the code:
dom = minidom.parseString(urlfetch.fetch(url).content)
if dom:
    items = []
    for node in dom.getElementsByTagName('item'):
        item = RssItem(
            key_name=self.getText(node.getElementsByTagName('guid')[0].childNodes),
            title=self.getText(node.getElementsByTagName('title')[0].childNodes),
            description=self.getText(node.getElementsByTagName('description')[0].childNodes),
            modified=datetime.now(),
            link=self.getText(node.getElementsByTagName('link')[0].childNodes),
            categories=[self.getText(category.childNodes) for category in node.getElementsByTagName('category')]
        )
        items.append(item)
    db.put(items)

def getText(self, nodelist):
    rc = ''
    for node in nodelist:
        if node.nodeType == node.TEXT_NODE:
            rc = rc + node.data
    return rc
There isn't much going on, but the scripts often take 2-6 seconds CPU time, which seems a bit excessive for looping through 20ish items and reading a few attributes.
What can I do to make this faster? Is there anything particularly bad in the above code, or should I change to another way of parsing? Are there any libraries (that work on App Engine) that would be better, or would I be better off parsing the RSS myself?
Outsource feed parsing, for example via Superfeedr
You could also look into superfeedr.com. They have a reasonable free quota and paid plans. They will do the polling for you (you get updates within 15 minutes). If the feeds also support PubSubHubbub, you will receive the feeds in realtime! This video will explain what PubSubHubbub is if you don't know it yet.
Improved feed parser written by Brett Slatkin
I would also advise you to watch this awesome video from Brett Slatkin explaining PubSubHubbub. I also remember that somewhere in the presentation he says that he does not use Universal Feed Parser because it just does too much work for his problem. He wrote his own SAX parser (he talks about it a little at 14:10 in the video presentation), which is lightning fast. I guess you should check out the PubSubHubbub code to find out how he accomplished this.
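As a rough illustration of that approach (this is not Brett's code, just a minimal sketch of a SAX-style item parser; the field names simply mirror the question's RssItem):

import xml.sax

class ItemHandler(xml.sax.ContentHandler):
    """Collects guid/title/description/link text for each <item>."""
    FIELDS = ('guid', 'title', 'description', 'link')

    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.items = []
        self.current = None   # dict for the <item> currently being parsed
        self.field = None     # name of the field whose text we are inside

    def startElement(self, name, attrs):
        if name == 'item':
            self.current = dict((f, '') for f in self.FIELDS)
        elif self.current is not None and name in self.FIELDS:
            self.field = name

    def characters(self, content):
        if self.current is not None and self.field is not None:
            self.current[self.field] += content

    def endElement(self, name):
        if name == 'item':
            self.items.append(self.current)
            self.current = None
        elif name == self.field:
            self.field = None

# Tiny self-contained example feed; in practice you would pass the fetched content.
rss_content = ('<rss><channel><item><guid>1</guid><title>Hi</title>'
               '<description>d</description><link>http://x</link></item></channel></rss>')
handler = ItemHandler()
xml.sax.parseString(rss_content, handler)
print handler.items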
If you have a low amount of traffic coming to your site, you might be experiencing spin-up times for your app. If an app is idle for as little as a few minutes, App Engine will spin down your app to save resources. When the next request comes in, the app has to be spun up before it can handle the request, and this all gets added to your CPU quota. If you search the App Engine newsgroup you'll see that it is full of complaints about this.
I use superfeedr for my site www.newsfacet.com, and I notice that when superfeedr notifies me, most of the time I can handle a few RSS articles in a few hundred milliseconds. If it's been a while since the last input, this can jump to 10 or 11 seconds as it incurs the spin-up cost.
Regarding using PubSubHubbub to let someone else do the work for you, you may find my blog post on using hubbub on App Engine useful.
I'd try ElementTree or the Universal Feed Parser and see if they're any better. ElementTree is in the stdlib as of Python 2.5, so it's available on App Engine.
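For instance, a minimal sketch with ElementTree (the field names follow the question's code; error handling and the datastore write are omitted):

from xml.etree import cElementTree as ET

def parse_items(rss_content):
    """Return a list of dicts with the fields the RssItem model needs."""
    root = ET.fromstring(rss_content)
    items = []
    for node in root.findall('.//item'):
        items.append({
            'guid': node.findtext('guid', ''),
            'title': node.findtext('title', ''),
            'description': node.findtext('description', ''),
            'link': node.findtext('link', ''),
            'categories': [c.text or '' for c in node.findall('category')],
        })
    return items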
You probably should run a profiler to pinpoint where the code is spinning its wheels. It could be waiting on the connections as some RSS feeds are REAL slow.
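If it helps, a commonly used pattern for this on App Engine is to wrap the handler's entry point with cProfile and log the stats; a sketch, where real_main stands in for your existing main function:

import cProfile
import cStringIO
import logging
import pstats

def profiled_main():
    profiler = cProfile.Profile()
    profiler.runcall(real_main)  # real_main is a placeholder for your handler's main()
    stream = cStringIO.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats('cumulative').print_stats(20)  # top 20 entries by cumulative time
    logging.info('Profile data:\n%s', stream.getvalue())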
Also, some RDF/RSS/ATOM libraries build in a governor to keep from beating the cr*p out of the host when retrieving multiple feeds from the same site. I've written several aggregators and being considerate to the server is important.
Universal Feed Parser is full-featured, at least from what I've seen by looking through the docs. I didn't use it because I wrote my aggregators in Ruby and had different needs but I was aware of it and would consider it for a Python-based solution.
I would like to measure the coverage of my Python code which gets executed in the production system.
I want an answer to this question:
Which lines get executed often (hot spots) and which lines are never used (dead code)?
Of course this must not slow down my production site.
I am not talking about measuring the coverage of tests.
I assume you are not talking about test suite code coverage, which the other answer is referring to; that is indeed a job for CI.
If you want to know which code paths are hit often in your production system, then you're going to have to do some instrumentation / profiling. This will have a cost. You cannot add measurements for free. You can do it cheaply though and typically you would only run it for short amounts of time, long enough until you have your data.
Python has cProfile to do full profiling, measuring call counts per function etc. This will give you the most accurate data but will likely have a relatively high impact on performance.
Alternatively, you can do statistical profiling, which basically means you sample the stack on a timer instead of instrumenting everything. This can be much cheaper, even with a high sampling rate! The downside, of course, is a loss of precision.
Even though it is surprisingly easy to do in Python, this stuff is still a bit much to put into an answer here. There is an excellent blog post by the Nylas team on this exact topic though.
The sampler below was lifted from the Nylas blog with some tweaks. After you start it, it fires an interrupt every millisecond and records the current call stack:
import collections
import signal

class Sampler(object):
    def __init__(self, interval=0.001):
        self.stack_counts = collections.defaultdict(int)
        self.interval = interval

    def start(self):
        # Fire SIGVTALRM after `interval` seconds of virtual (CPU) time.
        signal.signal(signal.SIGVTALRM, self._sample)
        signal.setitimer(signal.ITIMER_VIRTUAL, self.interval, 0)

    def _sample(self, signum, frame):
        # Walk the current call stack and record it as a ';'-joined string.
        stack = []
        while frame is not None:
            formatted_frame = '{}({})'.format(
                frame.f_code.co_name,
                frame.f_globals.get('__name__'))
            stack.append(formatted_frame)
            frame = frame.f_back

        formatted_stack = ';'.join(reversed(stack))
        self.stack_counts[formatted_stack] += 1
        # Re-arm the timer for the next sample.
        signal.setitimer(signal.ITIMER_VIRTUAL, self.interval, 0)
You inspect stack_counts to see what your program has been up to. This data can be plotted in a flame graph, which makes it really obvious which code paths your program is spending the most time in.
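For example, a hypothetical way to use it and dump the hottest stacks (run_my_workload is a placeholder for the code you actually want to observe):

sampler = Sampler()
sampler.start()

run_my_workload()  # placeholder for the real work

# Print the ten most frequently sampled stacks.
top = sorted(sampler.stack_counts.items(), key=lambda kv: kv[1], reverse=True)
for stack, count in top[:10]:
    print('{0:6d}  {1}'.format(count, stack))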
If I understand right, you want to learn which parts of your application are used most often by users.
TL;DR:
Use one of the metrics frameworks for Python if you do not want to do it by hand. Some of them are listed below:
DataDog
Prometheus
Prometheus Python Client
Splunk
It is usually done at the function level, and the right approach depends on the application:
If it is a desktop app with internet access:
You can create a simple DB and collect how many times your functions are called. To accomplish this, you can write a small helper and call it inside every function that you want to track (a rough sketch follows). After that you can define an asynchronous task to upload your data to the internet.
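A rough sketch of that idea (all names here are illustrative, and the upload step is reduced to writing a local file):

import atexit
import collections
import functools
import json

call_counts = collections.Counter()

def tracked(func):
    """Count every call to the decorated function."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        call_counts[func.__name__] += 1
        return func(*args, **kwargs)
    return wrapper

@atexit.register
def dump_counts():
    # A real app would upload this asynchronously instead of writing locally.
    with open('usage_counts.json', 'w') as fh:
        json.dump(call_counts, fh)

@tracked
def export_report():
    pass  # an example of a function you want to track

export_report()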
If it is a web application:
You can track which functions are called from JS (mostly preferred for user behaviour tracking) or from the web API. It is good practice to start from the outside and work inward: first detect which endpoints are frequently called (if you are using a proxy like nginx, you can analyze the server logs to gather this information; it is the easiest and cleanest way). After that, insert a logger into every other function that you want to track and simply analyze your logs every week or month.
But if you want to analyze your production code line by line (usually a very bad idea), you can start your application under a Python profiler. Python already ships with one: cProfile.
Maybe write markers to a text file: in every method of your program, append a line such as "Method one executed". Run the web application about 10 times the way a visitor would, then write a small Python program that reads the file, counts each marker (or matches a pattern), and prints the totals; a sketch of the counting part is below.
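A sketch of the counting part (the log file name and markers are made up):

import collections

# Count how many times each marker line appears in the log.
with open('method_log.txt') as fh:
    counts = collections.Counter(line.strip() for line in fh)

for marker, count in counts.most_common():
    print('{0:6d}  {1}'.format(count, marker))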
I am working on a personal project's initial stage of downloading 10-Q statements from EDGAR. Quick disclaimer: I am very new to programming and Python, so the code that I wrote is very basic, not even using custom functions and classes, just a very long script that I'm more comfortable editing. As a result, some solutions are quite rough (i.e. concatenating URLs using CIKs and other search options instead of doing requests with "browser" headers).
I keep running into a problem that those who have scraped EDGAR might be familiar with: every now and then my script just stops running. It doesn't raise any exceptions (I created some that append txt reports with links that can't be opened and so forth). I suspect that either the SEC servers have a certain limit of requests from one IP per some unit of time (if I wait some time after Ctrl-C'ing the script and run it again, it generates more output compared to rapid re-activation), or alternatively it could be TWC identifying me as a bot and limiting such requests.
If it's the SEC, what could potentially work? I tried learning how to work with Tor so I could get a new IP every now and then, but I can't really find a basic tutorial that would work for my level of expertise. Maybe someone can recommend something good on the topic?
Maybe timers would work? Like forcing the script to sleep every hour or so (I'm still trying to figure out how to make such timers and reset them if an event occurs). The main challenge with this particular problem is that I can't let it run at night.
Thank you in advance for any advice. I've been fighting with this for days, and at this stage it could take me more than a month to get what I want (before I even start tackling 10-Ks).
It seems like delays are pretty useful - sitting at 3.5k downloads with no interruptions thanks to a simple:
import random
import time

time.sleep(random.randint(0, 1) + abs(random.normalvariate(0, 0.2)))
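If the throttling really is on the SEC side, a slightly more defensive variant is to back off and retry when the server pushes back; here is a sketch using the requests library (the helper name, header values, and status-code handling are just examples, not SEC's documented policy):

import random
import time

import requests

def polite_get(url, max_retries=5):
    """Fetch a URL, sleeping between requests and backing off on errors."""
    for attempt in range(max_retries):
        response = requests.get(
            url,
            headers={'User-Agent': 'personal research script your_email@example.com'})
        if response.status_code == 200:
            # Small randomized pause so requests are not fired in a tight loop.
            time.sleep(random.uniform(0.5, 1.5))
            return response
        # 403/429/503 usually mean "slow down": wait longer after each failure.
        time.sleep(2 ** attempt + random.random())
    response.raise_for_status()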
This is probably a truly basic thing that I'm simply having an odd time figuring out in a Python 2.5 app.
I have a process that will take roughly an hour to complete, so I made a backend. To that end, I have a backends.yaml that has something like the following:
- name: mybackend
  options: dynamic
  start: /path/to/script.py
(The script is just raw computation. There's no notion of an active web session anywhere.)
On toy data, this works just fine.
This used to be public, so I would navigate to the page, the script would start, and it would time out after about a minute (HTTP timeout plus the 30s shutdown grace period, I assume). I figured this was a browser issue, so I repeated the same thing with a cron job. No dice. Then I switched to using a push queue and adding a targeted task, since on paper it looks like it would wait for 10 minutes. Same thing.
All 3 time out after that minute, which means I'm not decoupling the request from the backend like I believe I am.
I'm assuming that I need to write a proper handler for the backend to do the work, but I don't exactly know how to write the handler/webapp2 route. Do I handle _ah/start/ or make a new endpoint for the backend? How do I handle the subdomain? It still seems like the wrong thing to do (I'm sticking a long-running process directly into a request of sorts), but I'm at a loss otherwise.
So the root cause ended up being doing the following in the script itself:
models = MyModel.all()
for model in models:
    # Magic happens
I was basically taking for granted that the query would automatically batch my Query.all() over many entities, but it was dying at around the 1000th entity. I originally wrote that it was computation-only because I completely ignored the fact that the reads can fail.
The actual solution for solving the problem we wanted ended up being "Use the map-reduce library", since we were trying to look at each model for analysis.
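If pulling in the map-reduce library is overkill for a particular job, another option is to page through the query with cursors so no single fetch hits the limit; a sketch using the old db API, where process() is a placeholder for the real work:

from google.appengine.ext import db

def process_all(batch_size=500):
    query = MyModel.all()
    cursor = None
    while True:
        if cursor:
            query.with_cursor(cursor)
        batch = query.fetch(batch_size)
        if not batch:
            break
        for model in batch:
            process(model)  # placeholder for whatever "magic happens" did
        cursor = query.cursor()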
I'm new to web apps, so I'm not used to worrying about CPU limits, but it looks like I am going to have a problem with this code. I read on Google's quotas page that I can use 6.5 CPU-hours per day and 15 CPU-minutes per minute.
Google said:
CPU time is reported in "seconds," which is equivalent to the number of CPU cycles that can be performed by a 1.2 GHz Intel x86 processor in that amount of time. The actual number of CPU cycles spent varies greatly depending on conditions internal to App Engine, so this number is adjusted for reporting purposes using this processor as a reference measurement.
And
            Per Day          Max Rate
CPU Time    6.5 CPU-hours    15 CPU-minutes/minute
What I want to know:
Is this script going over the limit?
(If yes) How can I make it not go over the limit?
I use the urllib library; should I use Google's URL Fetch API? Why?
Absolutely any other helpful comments.
What it does:
It scrapes (crawls) Project Free TV. I will only run it completely once, then replace it with a shorter, faster script.
from urllib import urlopen
import re

# Level 1: the category index page.
alphaUrl = 'http://www.free-tv-video-online.me/movies/'
alphaPage = urlopen(alphaUrl).read()
patFinderAlpha = re.compile('<td width="97%" nowrap="true" class="mnlcategorylist"><a href="(.*)">')
findPatAlpha = re.findall(patFinderAlpha, alphaPage)
listIteratorAlpha = []
listIteratorAlpha[:] = range(len(findPatAlpha))

for ai in listIteratorAlpha:
    # Level 2: each category page.
    betaUrl = 'http://www.free-tv-video-online.me/movies/' + findPatAlpha[ai] + '/'
    betaPage = urlopen(betaUrl).read()
    patFinderBeta = re.compile('<td width="97%" class="mnlcategorylist"><a href="(.*)">')
    findPatBeta = re.findall(patFinderBeta, betaPage)
    listIteratorBeta = []
    listIteratorBeta[:] = range(len(findPatBeta))

    for bi in listIteratorBeta:
        # Level 3: each listing page.
        gammaUrl = betaUrl + findPatBeta[bi]
        gammaPage = urlopen(gammaUrl).read()
        patFinderGamma = re.compile('<a href="(.*)" target="_blank" class="mnllinklist">')
        findPatGamma = re.findall(patFinderGamma, gammaPage)
        patFinderGamma2 = re.compile('<meta name="keywords"content="(.*)">')
        findPatGamma2 = re.findall(patFinderGamma2, gammaPage)
        listIteratorGamma = []
        listIteratorGamma[:] = range(len(findPatGamma))

        for gi in listIteratorGamma:
            # Level 4: the page that embeds the video.
            deltaUrl = findPatGamma[gi]
            deltaPage = urlopen(deltaUrl).read()
            patFinderDelta = re.compile("<iframe id='hmovie' .* src='(.*)' .*></iframe>")
            findPatDelta = re.findall(patFinderDelta, deltaPage)
            PutData(findPatGamma2[gi], findPatAlpha[ai], findPatDelta)
If I forgot anything please let me know.
Update:
This is roughly how many times each level will run, in case it is helpful in answering the question.
        per cycle    total
Alpha:  1            1
Beta:   16           16
Gamma:  ~250         ~4000
Delta:  ~6           ~24000
I don't like to optimize until I need to. First, just try it. It might just work. If you go over quota, shrug, come back tomorrow.
To split jobs into smaller parts, look at the Task Queue API. Maybe you can divide the workload into two queues, one that scrapes pages and one that processes them. You can put limits on the queues to control how aggressively they are run.
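A sketch of what queueing the work might look like (the handler URL, queue name, and payload keys are made up):

from google.appengine.api import taskqueue

# Enqueue one task per category page; a handler at /tasks/scrape_category
# would fetch that page and enqueue further tasks for the pages it finds.
for category_path in findPatAlpha:
    taskqueue.add(
        url='/tasks/scrape_category',
        queue_name='scrape',           # rate limits go in queue.yaml
        params={'path': category_path})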
P.S. On Regex for HTML: Do what works. The academics will call you out on semantic correctness, but if it works for you, don't let that stop you.
I use the urllib library; should I use Google's URL Fetch API? Why?
urllib on App Engine production servers is the URLFetch API; it is implemented on top of it, so you get it either way.
It's unlikely that this will go over the free limit, but it's impossible to say without seeing how big the list of URLs it needs to fetch is, and how big the resulting pages are. The only way to know for sure is to run it - and there's really no harm in doing that.
You're more likely to run into the limitations on individual request execution - 30 seconds for frontend requests, 10 minutes for backend requests like cron jobs - than run out of quota. To alleviate those issues, use the Task Queue API to split your job into many parts. As an additional benefit, they can run in parallel! You might also want to look into Asynchronous URLFetch - though it's probably not worth it if this is just a one-off script.
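For completeness, a sketch of what asynchronous URLFetch looks like (urls and handle_page are placeholders):

from google.appengine.api import urlfetch

rpcs = []
for url in urls:  # urls: whatever list of pages you need this batch
    rpc = urlfetch.create_rpc(deadline=10)
    urlfetch.make_fetch_call(rpc, url)
    rpcs.append((url, rpc))

# Collect the results; the fetches have been running in parallel.
for url, rpc in rpcs:
    result = rpc.get_result()
    if result.status_code == 200:
        handle_page(url, result.content)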
I'm looking for a way to create pseudo-cronjobs, as I cannot use real cron jobs on my UNIX host.
Since Python scripts can run for an unlimited period, I thought Python would be a great solution.
On Google App Engine you can set up Python scripts and it's free. So I should use the App Engine.
App Engine allows 160,000 external URL accesses (right?), so I should have 160,000 / 31 / 24 / 60 = 3.6 accesses per minute.
So my script would be:
import time
import urllib

while time.clock() < 86400:
    # execute pseudo-cronjob file and then wait 60 seconds
    content = urllib.urlopen('http://www.example.org/cronjob_file.php').read()
    time.sleep(60)
Unfortunately, I have no possibility to test the script, so my questions are:
1) Do you think this would work?
2) Is it allowed (Google TOS) to use the service for such an activity?
3) Is my calculation for the URL accesses per minute right?
Thanks in advance!
Maybe I'm misunderstanding you, but the cron config files will let you do this (without Python).
You can add something like this to your cron.yaml file:
cron:
- description: job that runs every minute
  url: /cronjobs/job1
  schedule: every 1 minutes
See Google's documentation for more info on scheduling.
Google has some limits on how long a task can run.
URLFetch calls made in the SDK now have a 5-second timeout.
They allow you to schedule up to 20 cron tasks in any given day.
Duplicate, see cron jobs on google appengine
Cron jobs are now officially supported on GAE:
http://code.google.com/appengine/docs/python/config/cron.html
You may want to clarify which way around you want to do it.
Do you want to use App Engine to RUN the job? I.e., the job runs on Google's servers?
or
Do you want to use your OWN code on your server, and trigger it by using Google App Engine?
If it's the former: Google does cron now. Use that :)
If it's the latter: you could use Google's cron to trigger your own, even if it's indirect (i.e., google-cron calls google-app-engine, which calls your-app).
If you can, spin up a thread to do the job, so your page returns immediately. Don't forget: if you call http://whatever/mypage.php and your browser dies (or in this case, Google kills your process for running too long), the PHP script usually still runs to the end - the output just goes nowhere.
Failing that, try to spin up a thread (not sure if you can do that in PHP, though - I'm a C# guy new to PHP).
And if all else fails: get a better webhost! I pay $6/month or so for dreamhost.com, and I can run cron jobs on their servers - it's included. They do PHP, Rails et al. You could even ping me for a discount code :) (view profile for website etc)
Do what Nic Wise said, or outsource the cronjob using a service like www.guardiano.pm, so you can have www.yoursite.com/myjob.php called, and every time that URL is called, something you want will be executed.
P.S. It's free.
P.P.S. It's my pet project, and it's in beta.