I'm trying to make an http request using httplib2:
import httplib2, time, re, urllib`
conn = httplib2.Http(".cache")
page = conn.request(u"http://www.mydomain.com/search?q=cars#p=100","GET")
The response is ok, but the "#p=100" does not get passed over. Does anyone know how to pass this over with httplib2?
thanks
The fragment in the URL is not passed to the server.
+1 to Ignacio because he answered correctly first.
The relevant documentation, from https://www.rfc-editor.org/rfc/rfc2396#section-4.1
When a URI reference is used to perform a retrieval action on the identified resource, the optional fragment identifier, separated from the URI by a crosshatch ("#") character, consists of additional reference information to be interpreted by the user agent after the retrieval action has been successfully completed. As such, it is not part of a URI, but is often used in conjunction with a URI.
In the case of the link above, the browser uses the information after the crosshatch as a bookmark for a particular spot in the HTML.
If anyone else stumbles onto this question and wants an answer, I found an answer from another Stack Overflow question:
The fragment of the url after the hash (#) symbol is for client-side handling and isn't actually sent to the webserver. My guess is there is some javascript on the page that requests the correct data from the server using AJAX, and you need to figure out what URL is used for that.
If you use chrome you can watch the Network tab of the developer tools and see what URLs are requested when you click the link to go to page two in your browser.
To get the developer tools In Chrome press F11(Windows) or Apple+Alt+i(Mac). If you click on the option's gear in the bottom right corner, make sure the Preserve log upon navigation is checked.
Related
I'm trying to get some data from a webpage, but I found a problem. Whenever I want to go to the next page (i.e. page 2) to keep retrieving the data on it, I keep receiving the data from page 1. Apparently something goes wrong trying to switch to the next page.
The thing is, I haven't had problems with urls like this:
'http://www.webpage.com/index.php?page=' + str(pageno)
I can just start a while statement and I'll just jump to page 2 by adding 1 to "pageno"
My problem comes in when I try to open an url with this format:
'http://www.webpage.com/search/?show_all=1#sort_order=ASC&page=' + str(pageno)
As
urllib2.urlopen('http://www.webpage.com/search/?show_all=1#sort_order=ASC&page=4').read()
will retrieve the source code from http://www.webpage.com/search/?show_all=1
There is no other way to retrieve other pages without using the hash, as far as I'm concerned.
I guess it's just urllib2 ignoring the hash, as it is normally used to specify a starting point for a browser.
The fragment of the url after the hash (#) symbol is for client-side handling and isn't actually sent to the webserver. My guess is there is some javascript on the page that requests the correct data from the server using AJAX, and you need to figure out what URL is used for that.
If you use chrome you can watch the Network tab of the developer tools and see what URLs are requested when you click the link to go to page two in your browser.
that's because hash are not part of the url that is sent to the server, it's a fragment identifier that is used to identify elements inside the page. Some websites misused the hash fragment for JavaScript hook for identifying pages though. You'll either need to be able to execute the JavaScript on the page or you'll need to reverse engineer the JavaScript and emulate the true search request that is being made, presumably through ajax. Firebug's Net tab will be really useful for this.
I am trying to retrieve query results on sites based on ajax like www.snapbird.org using Python. Since it doesn't show in the page source, I am not sure how to proceed.
I am a Python newbie and hence it would be great if I could get a pointer in the right direction.
I am also open to some other approach to the task if that is easier
This is going to be complex but as a start, ppen firebug and find the URL that gets called when the AJAX request is handled. You can call that directly in your Python program and parse the output.
You could use Selenium's Python client driver to parse the page source. I usually use this in conjunction with PyQuery to make web scraping easier.
Here's the basic tutorial for Selenium's Python driver. Be sure to follow the instructions for Selenium version 2 instead of version 1 (unless you're using version 1 for some reason).
You could also configure chrome/firefox to an HTTP proxy and then log/extract the necessary content with the proxy. I've tinkered with python proxies to save/log the requests/content based on content-type or URI globs.
For other projects I've written site-specific javascript bookmarklets which poll for new data and then POST it to my server (by dynamically creating both a form and iframe, and setting myform.target=myiframe;
Other javascript scripts/bookmarklets simulate a user interacting with sites, so instead of polling every few seconds the javascript automates clicking buttons and form submissions, etc. These scripts are always very site-specific of course but they've been hugely useful for me, especially when iterating over all the paginated results for a given search.
Here is a stripped down version of walking over a list of "paginated" results and preparing to send the data off to my server (which then further parses it with BeautifulSoup). In particular this was designed for Youtube's Sent/Inbox messages.
var tables = [];
function process_and_repeat(){
if(!(inbox && inbox.message_pane_ && inbox.message_pane_.innerHTML)){
alert("We've got no data!");
}
if(inbox.message_pane_.innerHTML.indexOf('<table') === 0)
{
tables.push(inbox.message_pane_.innerHTML);
inbox.next_page();
setTimeout("process_and_repeat()",3000);
}
else{
alert("Fininshed, [" + tables.length + " processed]");
document.write('<form action=http://curl.sente.cc method=POST><textarea name=sent.html>'+escape(tables.join('\n'))+'</textarea><input type=submit></form>')
}
}
process_and_repeat(); // now we wait and watch as all the paginated pages are viewed :)
This is a stripped down example without any fancy iframes/non-essentials which just add complexity.
Adding to what Liam said, Selenium is a great tool, too, which has aided in my various scraping needs. I'd be more than happy to help you out with this if you'd like.
One easy solution might be using a browser like Mechanize. So you can browse site, follow links, make searches and nearly everything that you can do with a browser with user interface.
But for a very sepcific job, you may not even need a such library, you can use urllib and urllib2 python libraries to make a connection and read response... You can use Firebug to see data structure of a search and response body. Then use urllib to make a request with relevant parameters...
With an example...
I made a search with joyvalencia and check the request url with firebug to see:
http://api.twitter.com/1/statuses/user_timeline.json?screen_name=joyvalencia&count=100&page=2&include_rts=true&callback=twitterlib1321017083330
So calling this url with urllib2.urlopen() will be the same with making the query on Snapbird. Response body is:
twitterlib1321017083330([{"id_str":"131548107799396357","place":null,"geo":null,"in_reply_to_user_id_str":null,"coordinates":.......
When you use urlopen() and read the response, the upper string is what you get... Then you can use json library of python to read the data and parse it to a pythonic data structure...
It's my first question here.
Today, I've done a little application using wxPython: a simple Megaupload Downloader, but yet, it doesn't support premium accounts.
Now I would like to know how to download from MU with a login (free or premium user).
I'm very new to Python, so please don't be specific and "professional".
I used to download files with urlretrieve but, but is there a way to pass "arguments" or something to be able to log in as a premium user ?
Thank you. :D
EDIT =
News: new help needed xD
After trying with PyCUrl, htmllib2 and mechanize, I've done the login with urllib2 and cookiejar (the requested html says the username).
But when I start download a file, surely the server doesn't keep my login, in fact the downloaded file seems corrupted (I changed wait time from 45 to 25 seconds).
How can I download a file from MegaUpload keeping my previously done login? Thanks for your patient :D
Questions like this are usually frowned upon, they are very broad, and there are already an abundance of answers if you just search on google.
You can use urllib, or mechanize, or any library you can make an http post request with.
megaupload looks to have the form values
login:1
redir:1
username:
password:
just post those values at http://megaupload.com/?c=login
all you should have to do is set your username and password to the correct values!
For logging in using Python follow the following steps.
Find the list of parameters to be sent in the POST request and the url where the request has to be made by viewing the source of the login form. You may use a browser with "Inspect Element" feature to find it easily. [parameter name examples - userid, password]. Just check the tags name attribute.
Most of the sites set a cookie on logging in and the cookie has to be sent along with subsequent requests. To handle this download httllib2 (http://code.google.com/p/httplib2/ ) and read the wiki page on the link given. It has shown how to login with examples.
Now you can make subsequent requests for files, the cookies etc. will be handled automatically by httplib2.
i do alot of web stuff with python, i perfer using pycurl you can get it here
it is very simple to post data and login with curl, i've used it accross many languages such as PHP, python, and C++, hope this helps
You can use urllib this is a good example
I'm trying to read in info that is constantly changing from a website.
For example, say I wanted to read in the artist name that is playing on an online radio site.
I can grab the current artist's name but when the song changes, the HTML updates itself and I've already opened the file via:
f = urllib.urlopen("SITE")
So I can't see the updated artist name for the new song.
Can I keep closing and opening the URL in a while(1) loop to get the updated HTML code or is there a better way to do this? Thanks!
You'll have to periodically re-download the website. Don't do it constantly because that will be too hard on the server.
This is because HTTP, by nature, is not a streaming protocol. Once you connect to the server, it expects you to throw an HTTP request at it, then it will throw an HTTP response back at you containing the page. If your initial request is keep-alive (default as of HTTP/1.1,) you can throw the same request again and get the page up to date.
What I'd recommend? Depending on your needs, get the page every n seconds, get the data you need. If the site provides an API, you can possibly capitalize on that. Also, if it's your own site, you might be able to implement comet-style Ajax over HTTP and get a true stream.
Also note if it's someone else's page, it's possible the site uses Ajax via Javascript to make it up to date; this means there's other requests causing the update and you may need to dissect the website to figure out what requests you need to make to get the data.
If you use urllib2 you can read the headers when you make the request. If the server sends back a "304 Not Modified" in the headers then the content hasn't changed.
Yes, this is correct approach. To get changes in web, you have to send new query each time. Live AJAX sites do exactly same internally.
Some sites provide additional API, including long polling. Look for documentation on the site or ask their developers whether there is some.
I wrote a web crawler in Python 2.6 using the Bing API that searches for certain documents and then downloads them for classification later. I've been using string methods and urllib.urlretrieve() to download results whose URL ends in .pdf, .ps etc., but I run into trouble when the document is 'hidden' behind a URL like:
http://www.oecd.org/officialdocuments/displaydocument/?cote=STD/CSTAT/WPNA(2008)25&docLanguage=En
So, two questions. Is there a way in general to tell if a URL has a pdf/doc etc. file that it's linking to if it's not doing so explicitly (e.g. www.domain.com/file.pdf)? Is there a way to get Python to snag that file?
Edit:
Thanks for replies, several of which suggest downloading the file to see if it's of the correct type. Only problem is... I don't know how to do that (see question #2, above). urlretrieve(<above url>) gives only an html file with an href containing that same url.
There's no way to tell from the URL what it's going to give you. Even if it ends in .pdf it could still give you HTML or anything it likes.
You could do a HEAD request and look at the content-type, which, if the server isn't lying to you, will tell you if it's a PDF.
Alternatively you can download it and then work out whether what you got is a PDF.
In this case, what you refer to as "a document that's not explicitly referenced in a URL" seems to be what is known as a "redirect". Basically, the server tells you that you have to get the document at another URL. Normally, python's urllib will automatically follow these redirects, so that you end up with the right file. (and - as others have already mentioned - you can check the response's mime-type header to see if it's a pdf).
However, the server in question is doing something strange here. You request the url, and it redirects you to another url. You request the other url, and it redirects you again... to the same url! And again... And again... At some point, urllib decides that this is enough already, and will stop following the redirect, to avoid getting caught in an endless loop.
So how come you are able to get the pdf when you use your browser? Because apparently, the server will only serve the pdf if you have cookies enabled. (why? you have to ask the people responsible for the server...) If you don't have the cookie, it will just keep redirecting you forever.
(check the urllib2 and cookielib modules to get support for cookies, this tutorial might help)
At least, that is what I think is causing the problem. I haven't actually tried doing it with cookies yet. It could also be that the server is does not "want" to serve the pdf, because it detects you are not using a "normal" browser (in which case you would probably need to fiddle with the User-Agent header), but it would be a strange way of doing that. So my guess is that it is somewhere using a "session cookie", and in the case you haven't got one yet, keeps on trying to redirect.
As has been said there is no way to tell content type from URL. But if you don't mind getting the headers for every URL you can do this:
obj = urllib.urlopen(URL)
headers = obj.info()
if headers['Content-Type'].find('pdf') != -1:
# we have pdf file, download whole
...
This way you won't have to download each URL just it's headers. It's still not exactly saving network traffic, but you won't get better than that.
Also you should use mime-types instead of my crude find('pdf').
No. It is impossible to tell what kind of resource is referenced by a URL just by looking at it. It is totally up to the server to decide what he gives you when you request a certain URL.
Check the mimetype with the urllib.info() function. This might not be 100% accurate, it really depends on what the site returns as a Content-Type header. If it's well behaved it'll return the proper mime type.
A PDF should return application/pdf, but that may not be the case.
Otherwise you might just have to download it and try it.
You can't see it from the url directly. You could try to only download the header of the HTTP response and look for the Content-Type header. However, you have to trust the server on this - it could respond with a wrong Content-Type header not matching the data provided in the body.
Detect the file type in Python 3.x and webapp with url to the file which couldn't have an extension or a fake extension. You should install python-magic, using
pip3 install python-magic
For Mac OS X, you should also install libmagic using
brew install libmagic
Code snippet
import urllib
import magic
from urllib.request import urlopen
url = "http://...url to the file ..."
request = urllib.request.Request(url)
response = urlopen(request)
mime_type = magic.from_buffer(response.read())
print(mime_type)