I'm trying to "modularize" a section of an appengine website where a profile is requested as a small hunk of pre-rendered html
Sending a request to /userInfo?id=4992 sends down some html like:
<div>
(image of john) John
Information about this user
</div>
So, from my Google App Engine code, I need to be able to repeatedly fetch results from this URL when displaying a group of people.
The only way I can do it now is to send down a collection of <iframe> elements like:
<iframe src="/userInfo?id=4992"></iframe>
<iframe src="/userInfo?id=4993"></iframe>
<iframe src="/userInfo?id=4994"></iframe>
The iframes work to request the data.
I tried using urlfetch.fetch(), but it keeps timing out on me.
Am I doing this right? I thought this would be handy-dandy (a URL that serves up a snippet of HTML), but it's turning out to look like a design error.
You're currently serializing urlfetch requests, which ends up summing their wait times and may easily push you beyond your latency deadline. I'm afraid that you'll need to switch to async urlfetch requests -- an advanced technique which may suit your architecture better!
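For example, here's a minimal sketch of issuing all the profile fetches in parallel with create_rpc()/make_fetch_call() and only blocking when collecting the results. Note that urlfetch needs absolute URLs, so the your-app.appspot.com hostname below is a placeholder:

from google.appengine.api import urlfetch

def fetch_profiles(ids):
    # Kick off every request before waiting on any of them.
    rpcs = []
    for user_id in ids:
        rpc = urlfetch.create_rpc(deadline=5)
        urlfetch.make_fetch_call(
            rpc, 'http://your-app.appspot.com/userInfo?id=%d' % user_id)
        rpcs.append(rpc)
    # All requests are now in flight; gather the HTML snippets as they finish.
    return [rpc.get_result().content for rpc in rpcs]

snippets = fetch_profiles([4992, 4993, 4994])

This way the total wait is roughly the slowest single fetch, not the sum of all of them.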
Related
I am trying to retrieve query results from AJAX-based sites like www.snapbird.org using Python. Since the results don't show up in the page source, I am not sure how to proceed.
I am a Python newbie and hence it would be great if I could get a pointer in the right direction.
I am also open to some other approach to the task if that is easier
This is going to be complex, but as a start, open Firebug and find the URL that gets called when the AJAX request is made. You can call that directly from your Python program and parse the output.
You could use Selenium's Python client driver to parse the page source. I usually use this in conjunction with PyQuery to make web scraping easier.
Here's the basic tutorial for Selenium's Python driver. Be sure to follow the instructions for Selenium version 2 instead of version 1 (unless you're using version 1 for some reason).
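Here's a small sketch of that WebDriver + PyQuery combination; the .tweet selector is an illustrative guess, not Snapbird's real markup:

from selenium import webdriver
from pyquery import PyQuery as pq

driver = webdriver.Firefox()
driver.get('http://www.snapbird.org')
# ... drive the page here: fill in the search form, click, wait for results ...
doc = pq(driver.page_source)  # parses the rendered DOM, AJAX content included
for tweet in doc('.tweet'):   # hypothetical selector
    print pq(tweet).text()
driver.quit()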
You could also configure Chrome/Firefox to use an HTTP proxy and then log/extract the necessary content with the proxy. I've tinkered with Python proxies to save/log the requests/content based on content-type or URI globs.
For other projects I've written site-specific JavaScript bookmarklets which poll for new data and then POST it to my server (by dynamically creating both a form and an iframe, and setting myform.target=myiframe).
Other javascript scripts/bookmarklets simulate a user interacting with sites, so instead of polling every few seconds the javascript automates clicking buttons and form submissions, etc. These scripts are always very site-specific of course but they've been hugely useful for me, especially when iterating over all the paginated results for a given search.
Here is a stripped-down version of walking over a list of "paginated" results and preparing to send the data off to my server (which then further parses it with BeautifulSoup). In particular, this was designed for YouTube's Sent/Inbox messages.
var tables = [];
function process_and_repeat(){
    if(!(inbox && inbox.message_pane_ && inbox.message_pane_.innerHTML)){
        alert("We've got no data!");
        return; // nothing to process; bail out instead of falling through
    }
    if(inbox.message_pane_.innerHTML.indexOf('<table') === 0){
        // Save this page of results, advance, and poll again in 3 seconds.
        tables.push(inbox.message_pane_.innerHTML);
        inbox.next_page();
        setTimeout(process_and_repeat, 3000);
    } else {
        alert("Finished, [" + tables.length + " processed]");
        // POST the accumulated HTML to my server for further parsing.
        document.write('<form action="http://curl.sente.cc" method="POST"><textarea name="sent.html">' + escape(tables.join('\n')) + '</textarea><input type="submit"></form>');
    }
}
process_and_repeat(); // now we wait and watch as all the paginated pages are viewed :)
This is a stripped down example without any fancy iframes/non-essentials which just add complexity.
Adding to what Liam said, Selenium is a great tool, too, which has aided in my various scraping needs. I'd be more than happy to help you out with this if you'd like.
One easy solution might be using a programmatic browser library like Mechanize. With it you can browse a site, follow links, and make searches, nearly everything you can do with a browser that has a user interface.
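As a rough sketch (the form index and control name below are illustrative, not Snapbird's real ones):

import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)
br.open('http://www.snapbird.org')
br.select_form(nr=0)          # assume the search form is the first on the page
br['query'] = 'joyvalencia'   # hypothetical control name
response = br.submit()
print response.read()

Keep in mind Mechanize doesn't execute JavaScript, so this only helps when the data is reachable without it.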
But for a very specific job, you may not even need such a library; you can use the urllib and urllib2 Python libraries to make a connection and read the response... You can use Firebug to see the data structure of a search and the response body. Then use urllib to make a request with the relevant parameters...
With an example...
I made a search for joyvalencia and checked the request URL with Firebug to see:
http://api.twitter.com/1/statuses/user_timeline.json?screen_name=joyvalencia&count=100&page=2&include_rts=true&callback=twitterlib1321017083330
So calling this URL with urllib2.urlopen() is the same as making the query on Snapbird. The response body is:
twitterlib1321017083330([{"id_str":"131548107799396357","place":null,"geo":null,"in_reply_to_user_id_str":null,"coordinates":.......
When you use urlopen() and read the response, the string above is what you get... Then you can use Python's json library to parse it into a Pythonic data structure...
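Putting that together, here's a sketch of fetching the URL found via Firebug and unwrapping the JSONP callback before parsing:

import json
import urllib2

url = ('http://api.twitter.com/1/statuses/user_timeline.json'
       '?screen_name=joyvalencia&count=100&page=2&include_rts=true'
       '&callback=twitterlib1321017083330')
body = urllib2.urlopen(url).read()
# The body looks like twitterlib1321017083330([...]); strip the wrapper.
payload = body[body.index('(') + 1 : body.rindex(')')]
tweets = json.loads(payload)
for tweet in tweets:
    print tweet['id_str']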
I'm trying to read in info that is constantly changing from a website.
For example, say I wanted to read in the artist name that is playing on an online radio site.
I can grab the current artist's name but when the song changes, the HTML updates itself and I've already opened the file via:
f = urllib.urlopen("SITE")
So I can't see the updated artist name for the new song.
Can I keep closing and opening the URL in a while(1) loop to get the updated HTML code or is there a better way to do this? Thanks!
You'll have to periodically re-download the website. Don't do it constantly because that will be too hard on the server.
This is because HTTP, by nature, is not a streaming protocol. Once you connect to the server, it expects you to send an HTTP request, then it sends an HTTP response back containing the page. If your initial request is keep-alive (the default as of HTTP/1.1), you can send the same request again over the connection and get the page up to date.
What I'd recommend: depending on your needs, fetch the page every n seconds and extract the data you need. If the site provides an API, you can possibly capitalize on that. Also, if it's your own site, you might be able to implement Comet-style AJAX over HTTP and get a true stream.
Also note if it's someone else's page, it's possible the site uses Ajax via Javascript to make it up to date; this means there's other requests causing the update and you may need to dissect the website to figure out what requests you need to make to get the data.
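As a concrete sketch of that polling loop (the URL, span id, and extract_artist() parsing are placeholders for whatever your target page actually needs):

import time
import urllib

POLL_SECONDS = 30  # don't hammer the server

def extract_artist(html):
    # Placeholder: pull the artist name out however the page's markup requires.
    return html.split('<span id="artist">')[1].split('</span>')[0]

last_artist = None
while True:
    html = urllib.urlopen('http://example.com/radio').read()
    artist = extract_artist(html)
    if artist != last_artist:
        print 'Now playing:', artist
        last_artist = artist
    time.sleep(POLL_SECONDS)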
If you use urllib2 you can inspect the response when you make the request. If the server responds with a 304 Not Modified status, then the content hasn't changed.
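A sketch of that conditional-GET idea; urllib2 raises HTTPError for a 304, so we catch it rather than re-reading the page:

import urllib2

def fetch_if_changed(url, last_modified=None):
    request = urllib2.Request(url)
    if last_modified:
        request.add_header('If-Modified-Since', last_modified)
    try:
        response = urllib2.urlopen(request)
    except urllib2.HTTPError, e:
        if e.code == 304:
            return None, last_modified  # content unchanged since last fetch
        raise
    return response.read(), response.headers.get('Last-Modified')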
Yes, this is the correct approach. To see changes on the web, you have to send a new request each time; live AJAX sites do exactly the same thing internally.
Some sites provide an additional API, including long polling. Look for documentation on the site or ask their developers whether one exists.
I am building a mobile web app with Django and jQuery Mobile. My problem is that jQuery Mobile likes for all links to be prepended with a # so it can accurately keep track of browsing history.
Example: http://www.fest.com/#/foo/1/
I would like know how to automatically redirect all urls that point From: /foo/1/ To: /#/foo/1/
If I don't do that and someone goes directly to /foo/1/, then clicks a link pointing to /bar/2/, they'll end up with a URL path like this:
/foo/1/#/bar/2/
I would very much like to prevent that from happening because it causes lots of problems. What's the best way to do this?
You have misunderstood what the # does.
The # in a URL is the "fragment" separator. Nothing after it is sent to the server, so there is no such URL as "foo.com#/foo"; as far as the server is concerned, it's just "foo.com". That means you can't do any server-side redirection.
If your JS library is using the fragments to simulate navigation, you'll need to handle this with Javascript.
This is jQuery Mobile, so the answer is a bit different. jQuery Mobile uses #something for history when working with AJAX, and it intercepts every <a href=...> link.
So when you link to a page like <a href="some.html?var1=foo">, JQM fetches it via AJAX without reloading the page AND stores the result in the DOM so it isn't loaded again. The URL is updated to end with #some.html, and that's how the history is managed.
<a href="#something"> will NOT work as it would on a normal page, because jQuery Mobile takes over.
Read here to get all info on links in jquery mobile: http://jquerymobile.com/demos/1.0a2/#docs/pages/link-formats.html
As a newbie to App Engine and Python, I can follow the examples given by Google: I have created a Python application with a template HTML page where I can enter data, submit it to the datastore, and, by reading back the data just sent, recreate the sending page so I can continue adding data. However, what I would like to do is submit the data and have it stored in the datastore without the sending page being refreshed. It seems like a waste of traffic to have all the data sent back again.
Sounds like you want to look into AJAX. The simplest way to do this is probably to use the ajax functions in one of the popular Javascript libraries, like jQuery.
AJAX. If you want a specific resource concerning App Engine, http://code.google.com/appengine/articles/rpc.html (uses Python) is very good.
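For a flavor of what that article covers, here's a minimal sketch of an AJAX endpoint on the old Python webapp framework; the handler name, model, and /add path are illustrative:

import json  # on very old runtimes, use simplejson instead

from google.appengine.ext import db
from google.appengine.ext import webapp
from google.appengine.ext.webapp.util import run_wsgi_app

class Entry(db.Model):
    content = db.StringProperty()

class AddEntry(webapp.RequestHandler):
    def post(self):
        # Store the submitted data without rendering a full page.
        entry = Entry(content=self.request.get('content'))
        entry.put()
        # Return a small JSON payload for the client-side callback.
        self.response.headers['Content-Type'] = 'application/json'
        self.response.out.write(json.dumps({'ok': True}))

application = webapp.WSGIApplication([('/add', AddEntry)])

if __name__ == '__main__':
    run_wsgi_app(application)

The client side would then POST to /add with something like jQuery's $.post() and update the page from the JSON response instead of reloading.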
Here is a good link for understanding communication with the server on Google App Engine.
Have a look at Pyjamas (pyjs.org).
It's a Python Compiler for web browsers. Write your client side in Python and Pyjamas will compile it into JavaScript.
I'm trying to scrape a page on YouTube with Python, and the page has a lot of AJAX in it.
I have to call the JavaScript each time to get the info, but I'm not really sure how to go about it. I'm using the urllib2 module to open URLs. Any help would be appreciated.
YouTube (and everything else Google makes) has extensive APIs already in place that give you access to just about any and all data you could possibly want.
Take a look at the YouTube Data API for more information.
I use urllib to make the API requests and ElementTree to parse the returned XML.
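Here's a rough sketch of that approach against the (legacy) GData v2 search feed; the feed URL and Atom namespace are assumptions based on the documentation of the time:

import urllib
import xml.etree.ElementTree as ET

ATOM = '{http://www.w3.org/2005/Atom}'

def search_videos(query, max_results=10):
    url = ('http://gdata.youtube.com/feeds/api/videos?'
           + urllib.urlencode({'q': query, 'max-results': max_results}))
    root = ET.fromstring(urllib.urlopen(url).read())
    # Each Atom <entry> is one video; pull out its title.
    return [entry.find(ATOM + 'title').text
            for entry in root.findall(ATOM + 'entry')]

print search_videos('python tutorial')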
The main problem is that you're violating the TOS (terms of service) of the YouTube site. YouTube's engineers and lawyers will do their professional best to track you down and make an example of you if you persist. If you're happy with that prospect, then on your head be it -- technically, your best bets are python-spidermonkey and Selenium. I wanted to put the technical hints on record in case anybody in the future has needs like the ones your question's title indicates, without the legal issues you clearly have if you continue in this particular endeavor.
Here is how I would do it: install Firebug on Firefox, then enable the Net panel in Firebug and click the desired link on YouTube. Now watch what happens and which pages are requested, and find the ones responsible for the AJAX part of the page. You can then use urllib or Mechanize to fetch those links. If you CAN pull the same content this way, then you have what you are looking for; just parse the content. If you CAN'T pull the content this way, that suggests the requested page might be checking user login credentials, session info, or other header fields such as HTTP_REFERER, etc. Then you might want to look at something more extensive like Scrapy, etc. I would suggest that you always follow the simple path first. Good luck and happy "responsible" scraping! :)
As suggested, you should use the YouTube API to access the data made available legitimately.
Regarding the general question of scraping AJAX, you might want to consider the Scrapy framework. It provides extensive support for crawling and scraping web sites and uses python-spidermonkey under the hood to follow JavaScript links.
You could sniff the network traffic with something like Wireshark, then replay the HTTP calls via a scraping framework that is robust enough to deal with AJAX, such as Scrapy.