How can I create a Python crawler for an AJAX website whose URL stays the same the whole time?
Is it possible?
Should I go step by step from the index page to the page that I want and hope that everything works fine? (Or are there other ways to heaven, too?)
edit:
The url is http://192.168.1.1/
I actually want to access my router settings.
The short answer is "Yes, it's possible"
You can open your browser's developer tools (Network tab) and look for the API URLs that the frontend scripts call; those responses contain the data you need.
The responses are usually JSON or XML, so you can parse them easily.
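As a rough sketch of that approach in Python 3: once the Network tab shows which URL the page's scripts call, request that URL directly and decode the body. The endpoint path /api/status and the payload below are made up for illustration; the real ones are whatever your router's pages actually request.

```python
import json
import urllib.request


def parse_api_body(body: bytes, encoding: str = "utf-8"):
    """Decode a JSON response body into Python objects."""
    return json.loads(body.decode(encoding))


def fetch_api(url: str, timeout: float = 10.0):
    """GET the endpoint seen in the Network tab and parse its JSON reply."""
    request = urllib.request.Request(
        url,
        # Some backends only answer requests that look like AJAX calls.
        headers={"X-Requested-With": "XMLHttpRequest"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return parse_api_body(response.read())


# Offline demonstration with a made-up payload (no network access needed):
sample = b'{"wan": {"ip": "203.0.113.7", "up": true}}'
print(parse_api_body(sample)["wan"]["ip"])  # 203.0.113.7
```

A call such as fetch_api("http://192.168.1.1/api/status") would then return the decoded data, assuming the router needs no login. If it does, the login request shows up in the Network tab as well and has to be replayed first.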
Related
In my Django application I would like to know if the browser the client is using has AJAX or not. This is because I have, for example, profile editing. I have a version that edits the user's profile in-place and another one that redirects you to an edit page.
I know that most browsers have AJAX nowadays, but just to make sure, how can I check that in a Django application?
I believe the correct approach is some form of graceful degradation: check for AJAX in the request using Django's request.is_ajax() method.
https://docs.djangoproject.com/en/dev/ref/request-response/#django.http.HttpRequest.is_ajax
In your view there would be something like
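# needs: import simplejson; from django.http import HttpResponse; from django.shortcuts import redirect
if form.is_valid():
    if request.is_ajax():
        # A view has to return a response object, so wrap the JSON string.
        return HttpResponse(simplejson.dumps(something), content_type='application/json')
    return redirect('/some-url/')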
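For context, request.is_ajax() did nothing more than compare the X-Requested-With header against the string XMLHttpRequest, which jQuery and similar libraries set on their requests. The method was deprecated in Django 3.1 and removed in 4.0, so on newer versions the check has to be written by hand. A minimal sketch of that check, operating on a plain dict shaped like request.META (the helper name looks_like_ajax is made up here):

```python
def looks_like_ajax(meta: dict) -> bool:
    """Return True if the request headers carry the usual AJAX marker.

    `meta` is expected to look like Django's request.META, where the
    X-Requested-With header appears as HTTP_X_REQUESTED_WITH.
    """
    return meta.get("HTTP_X_REQUESTED_WITH") == "XMLHttpRequest"


print(looks_like_ajax({"HTTP_X_REQUESTED_WITH": "XMLHttpRequest"}))  # True
print(looks_like_ajax({}))  # False
```

The header is only a convention. A script has to set it explicitly (fetch() does not add it on its own), so its absence does not prove that the browser lacks JavaScript.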
User-agent sniffing and the like are not considered the best solution. If you can afford it, use a client-side project such as hasjs to check what the user's browser is really capable of, and send that information to the server somehow. For example, serve a checking page when there is no session, let it run the checks and post the results to the server, which then creates a session and remembers the capabilities for that session.
If you want to know whether a browser supports AJAX, you need to know the capabilities of the browser. This project provides that:
https://github.com/clement/django-wurfl/
I haven't found a way to do this, so what I did was prepare a JavaScript-free version and a JavaScript version of my template.
I load a .js file, and it replaces all the links to other pages with AJAX links. Therefore, if the user doesn't have JavaScript he will see all the original links and functionality, and if he has JavaScript he will see all AJAX functionality.
I am trying to retrieve query results on sites based on ajax like www.snapbird.org using Python. Since it doesn't show in the page source, I am not sure how to proceed.
I am a Python newbie and hence it would be great if I could get a pointer in the right direction.
I am also open to some other approach to the task if that is easier
This is going to be complex, but as a start, open Firebug and find the URL that gets called when the AJAX request is made. You can call that URL directly in your Python program and parse the output.
You could use Selenium's Python client driver to parse the page source. I usually use this in conjunction with PyQuery to make web scraping easier.
Here's the basic tutorial for Selenium's Python driver. Be sure to follow the instructions for Selenium version 2 instead of version 1 (unless you're using version 1 for some reason).
You could also configure Chrome/Firefox to use an HTTP proxy and then log/extract the necessary content with the proxy. I've tinkered with Python proxies to save/log the requests/content based on content-type or URI globs.
For other projects I've written site-specific JavaScript bookmarklets which poll for new data and then POST it to my server (by dynamically creating both a form and an iframe, and setting myform.target = myiframe).
Other javascript scripts/bookmarklets simulate a user interacting with sites, so instead of polling every few seconds the javascript automates clicking buttons and form submissions, etc. These scripts are always very site-specific of course but they've been hugely useful for me, especially when iterating over all the paginated results for a given search.
Here is a stripped down version of walking over a list of "paginated" results and preparing to send the data off to my server (which then further parses it with BeautifulSoup). In particular this was designed for Youtube's Sent/Inbox messages.
var tables = [];
function process_and_repeat(){
    if(!(inbox && inbox.message_pane_ && inbox.message_pane_.innerHTML)){
        alert("We've got no data!");
        return; // nothing to read, so stop instead of failing on the next line
    }
    if(inbox.message_pane_.innerHTML.indexOf('<table') === 0)
    {
        tables.push(inbox.message_pane_.innerHTML);
        inbox.next_page();
        setTimeout(process_and_repeat, 3000); // give the next page time to load
    }
    else{
        alert("Finished, [" + tables.length + " processed]");
        document.write('<form action=http://curl.sente.cc method=POST><textarea name=sent.html>'+escape(tables.join('\n'))+'</textarea><input type=submit></form>')
    }
}
process_and_repeat(); // now we wait and watch as all the paginated pages are viewed :)
This is a stripped down example without any fancy iframes/non-essentials which just add complexity.
Adding to what Liam said, Selenium is a great tool, too, which has aided in my various scraping needs. I'd be more than happy to help you out with this if you'd like.
One easy solution might be a programmatic browser like Mechanize. With it you can browse a site, follow links, run searches, and do nearly everything you can do with a browser that has a user interface.
But for a very specific job, you may not even need such a library; you can use Python's urllib and urllib2 libraries to make a connection and read the response. You can use Firebug to see the data structure of a search request and the response body, then use urllib to make a request with the relevant parameters.
With an example...
I searched for joyvalencia and checked the request URL with Firebug:
http://api.twitter.com/1/statuses/user_timeline.json?screen_name=joyvalencia&count=100&page=2&include_rts=true&callback=twitterlib1321017083330
So calling this URL with urllib2.urlopen() is the same as making the query on Snapbird. The response body is:
twitterlib1321017083330([{"id_str":"131548107799396357","place":null,"geo":null,"in_reply_to_user_id_str":null,"coordinates":.......
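When you use urlopen() and read the response, the string above is what you get. Because of the callback parameter, it is JSON wrapped in a function call (JSONP), so strip the twitterlib1321017083330( ... ) wrapper first; then you can use Python's json library to parse the data into a Python data structure.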
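A minimal Python 3 sketch of that last step. The body shown above is JSON wrapped in a callback call, which json.loads() rejects until the wrapper is removed. The helper names are made up for illustration, and the sample body is a shortened stand-in, not real output.

```python
import json
import re
from urllib.parse import urlencode

BASE = "http://api.twitter.com/1/statuses/user_timeline.json"


def build_url(screen_name: str, page: int = 1, count: int = 100) -> str:
    """Assemble the request URL from the parameters seen in Firebug."""
    params = {"screen_name": screen_name, "count": count,
              "page": page, "include_rts": "true"}
    return BASE + "?" + urlencode(params)


def strip_jsonp(body: str):
    """Turn 'callback([...])' into the Python value of '[...]'."""
    match = re.match(r"^\s*[\w$.]+\s*\((.*)\)\s*;?\s*$", body, re.DOTALL)
    payload = match.group(1) if match else body  # plain JSON passes through
    return json.loads(payload)


sample = 'twitterlib1321017083330([{"id_str":"131548107799396357","place":null}])'
tweets = strip_jsonp(sample)
print(tweets[0]["id_str"])  # 131548107799396357
print(build_url("joyvalencia", page=2))
```

The sketch leaves the callback parameter out of build_url. Many JSONP endpoints return plain JSON in that case, but that depends on the service, so strip_jsonp accepts both forms.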
I'm in the middle of a scraping project using Scrapy.
I realized that Scrapy strips the URL from the hash sign to the end.
Here's the output from the shell:
[s] request <GET http://www.domain.com/b?ie=UTF8&node=3006339011&ref_=pe_112320_20310580%5C#/ref=sr_nr_p_8_0?rh=n%3A165796011%2Cn%3A%212334086011%2Cn%3A%212334148011%2Cn%3A3006339011%2Cp_8%3A2229010011&bbn=3006339011&ie=UTF8&qid=1309631658&rnid=598357011>
[s] response <200 http://www.domain.com/b?ie=UTF8&node=3006339011&ref_=pe_112320_20310580%5C>
This really affects my scraping: after a couple of hours trying to find out why some item was not being selected, I realized that the HTML returned for the long URL differs from the HTML returned for the short one. Besides, after some observation, the content changes in some critical parts.
Is there a way to modify this behavior so Scrapy keeps the whole URL?
Thanks for your feedback and suggestions.
This isn't something Scrapy itself can change: the portion following the hash in the URL is the fragment identifier, which is used by the client (Scrapy here, usually a browser) and is never sent to the server.
What probably happens when you fetch the page in a browser is that the page includes some JavaScript that looks at the fragment identifier and loads some additional data via AJAX and updates the page. You'll need to look at what the browser does and see if you can emulate it--developer tools like Firebug or the Chrome or Safari inspector make this easy.
For example, if you navigate to http://twitter.com/also, you are redirected to http://twitter.com/#!/also. The actual URL loaded by the browser here is just http://twitter.com/, but that page then loads data (http://twitter.com/users/show_for_profile.json?screen_name=also) which is used to generate the page, and is, in this case, just JSON data you could parse yourself. You can see this happen using the Network Inspector in Chrome.
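To see which part of such a URL actually goes over the wire, the standard library can split it the same way a client does. A small Python 3 sketch using the Twitter URL above (the function name is made up for illustration):

```python
from urllib.parse import urldefrag


def split_for_server(url: str):
    """Return (what is sent to the server, the client-only fragment)."""
    sent, fragment = urldefrag(url)
    return sent, fragment


sent, fragment = split_for_server("http://twitter.com/#!/also")
print(sent)      # http://twitter.com/
print(fragment)  # !/also
```

Whatever the page's JavaScript does with the fragment (here, requesting show_for_profile.json) is the request a crawler has to reproduce.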
Looks like it's not possible. The problem is not the response, it's in the request, which chops the url.
It is retrievable from Javascript - as window.location.hash. From there you could send it to the server with Ajax for example, or encode it and put it into URLs which can then be passed through to the server-side.
Can I read the hash portion of the URL on my server-side application (PHP, Ruby, Python, etc.)?
Why do you need this part which is stripped if the server doesn't receive it from browser?
If you are working with Amazon, I haven't seen any problems with such URLs.
Actually, when you enter that URL in a web browser, it will also send only the part before the hash sign to the web server. If the content is different, it's probably because some JavaScript on the page changes the content after it has loaded, based on the part after the hash (most likely an XMLHttpRequest is made that loads additional content).
I'm trying to read in info that is constantly changing from a website.
For example, say I wanted to read in the artist name that is playing on an online radio site.
I can grab the current artist's name but when the song changes, the HTML updates itself and I've already opened the file via:
f = urllib.urlopen("SITE")
So I can't see the updated artist name for the new song.
Can I keep closing and opening the URL in a while(1) loop to get the updated HTML code or is there a better way to do this? Thanks!
You'll have to periodically re-download the website. Don't do it constantly because that will be too hard on the server.
This is because HTTP, by nature, is not a streaming protocol. Once you connect to the server, it expects you to send an HTTP request, and it then sends back an HTTP response containing the page. If your connection is keep-alive (the default as of HTTP/1.1), you can send the same request again over that connection and get the up-to-date page.
What I'd recommend? Depending on your needs, get the page every n seconds, get the data you need. If the site provides an API, you can possibly capitalize on that. Also, if it's your own site, you might be able to implement comet-style Ajax over HTTP and get a true stream.
Also note that if it's someone else's page, the site may use Ajax via JavaScript to keep itself up to date; this means there are other requests causing the update, and you may need to dissect the website to figure out which requests you need to make to get the data.
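The "get the page every n seconds" advice might look like this in Python 3. It is only a sketch: fetch stands for whatever function downloads the page (or the underlying Ajax URL) and extracts the artist name, and the interval is something to choose with the server's load in mind.

```python
import time
from typing import Callable, Iterator, Optional


def poll_changes(fetch: Callable[[], str], interval: float,
                 max_polls: Optional[int] = None) -> Iterator[str]:
    """Call fetch() every `interval` seconds; yield a value when it changes."""
    last = None
    polls = 0
    while max_polls is None or polls < max_polls:
        current = fetch()
        if current != last:
            last = current
            yield current
        polls += 1
        # Sleep only if another poll is still to come.
        if max_polls is None or polls < max_polls:
            time.sleep(interval)


# Offline demonstration: a fake fetch() standing in for "download and parse".
fake_pages = iter(["Artist A", "Artist A", "Artist B"])
for artist in poll_changes(lambda: next(fake_pages), interval=0, max_polls=3):
    print(artist)
```

With a real fetch function, something like poll_changes(get_artist, interval=30) would run until interrupted and report each new artist once.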
If you use urllib2 you can make the request conditional: send the ETag or Last-Modified value from the previous response back in an If-None-Match or If-Modified-Since header. If the server answers with the status "304 Not Modified", the content hasn't changed.
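A sketch of checking for a 304 in Python 3, where urllib2 has become urllib.request. A server only answers 304 when the request carries a validator from an earlier response, so the sketch sends If-None-Match and If-Modified-Since. urllib reports the 304 by raising HTTPError. The function names are made up for illustration.

```python
import urllib.error
import urllib.request
from typing import Optional


def conditional_request(url: str, etag: Optional[str] = None,
                        last_modified: Optional[str] = None) -> urllib.request.Request:
    """Build a GET that asks the server to reply 304 if nothing changed."""
    request = urllib.request.Request(url)
    if etag:
        request.add_header("If-None-Match", etag)
    if last_modified:
        request.add_header("If-Modified-Since", last_modified)
    return request


def fetch_if_changed(url, etag=None, last_modified=None):
    """Return (body, etag, last_modified); body is None on 304 Not Modified."""
    request = conditional_request(url, etag, last_modified)
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return (response.read(),
                    response.headers.get("ETag"),
                    response.headers.get("Last-Modified"))
    except urllib.error.HTTPError as error:
        if error.code == 304:
            return None, etag, last_modified  # unchanged; keep old validators
        raise
```

Feed the returned etag and last_modified back into the next call. Not every server sends these validators, in which case every call simply downloads the full page again.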
Yes, this is the correct approach. To see changes on the web, you have to send a new request each time. Live AJAX sites do exactly the same internally.
Some sites provide an additional API, including long polling. Look for documentation on the site or ask their developers whether one exists.
I'm trying to scrape some information from a web site, but am having trouble reading the relevant pages. The pages seem to first send a basic setup, then more detailed info. My download attempts only seem to capture the basic setup. I've tried urllib and mechanize so far.
Firefox and Chrome have no trouble displaying the pages, although I can't see the parts I want when I view page source.
A sample url is https://personal.vanguard.com/us/funds/snapshot?FundId=0542&FundIntExt=INT
I'd like, for example, average maturity and average duration from the lower right of the page. The problem isn't extracting that info from the page, it's downloading the page so that I can extract the info.
The page uses JavaScript to load the data. Firefox and Chrome are only working because you have JavaScript enabled - try disabling it and you'll get a mostly empty page.
Python isn't going to be able to do this by itself - your best compromise would be to control a real browser (Internet Explorer is easiest, if you're on Windows) from Python using something like Pamie.
The website loads the data via ajax. Firebug shows the ajax calls. For the given page, the data is loaded from https://personal.vanguard.com/us/JSP/Funds/VGITab/VGIFundOverviewTabContent.jsf?FundIntExt=INT&FundId=0542
See the corresponding javascript code on the original page:
<script>populator = new Populator({parentId:
"profileForm:vanguardFundTabBox:tab0",execOnLoad:true,
populatorUrl:"/us/JSP/Funds/VGITab/VGIFundOverviewTabContent.jsf?FundIntExt=INT&FundId=0542",
inline:false,type:"once"});
</script>
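A small Python 3 sketch of pulling that URL out of the downloaded page source so it can be fetched directly. It assumes the page keeps the populatorUrl:"..." pattern shown above; the function name is made up for illustration.

```python
import re
from urllib.parse import urljoin

PAGE_URL = "https://personal.vanguard.com/us/funds/snapshot?FundId=0542&FundIntExt=INT"


def find_populator_urls(html: str, page_url: str) -> list:
    """Return absolute URLs for every populatorUrl:"..." found in the page."""
    relative = re.findall(r'populatorUrl\s*:\s*"([^"]+)"', html)
    return [urljoin(page_url, path) for path in relative]


snippet = '''<script>populator = new Populator({parentId:
"profileForm:vanguardFundTabBox:tab0",execOnLoad:true,
populatorUrl:"/us/JSP/Funds/VGITab/VGIFundOverviewTabContent.jsf?FundIntExt=INT&FundId=0542",
inline:false,type:"once"});
</script>'''

for url in find_populator_urls(snippet, PAGE_URL):
    print(url)
```

Each URL found this way can then be downloaded with urllib or mechanize like any other page, and the fund details parsed from that response.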
The reason is that the page performs AJAX calls after it loads. You will need to track down those URLs and scrape their content as well.
As RichieHindle mentioned, your best bet on Windows is to use the WebBrowser class to create an instance of an IE rendering engine and then use that to browse the site.
The class gives you full access to the DOM tree, so you can do whatever you want with it.
http://msdn.microsoft.com/en-us/library/system.windows.forms.webbrowser(loband).aspx