Given a URL like cnn.com, when I feed it into a browser it resolves to http://www.cnn.com as the correct URL. However
r = requests.get('www.cnn.com')
gives the error
MissingSchema: Invalid URL u'www.cnn.com': No schema supplied
Is it possible to detect the right URL just like a browser does?
Obviously the module you are using does not want to guess the scheme, so you have to provide it. If you build an interface yourself and want your users to be able to omit the scheme, you need to implement some "intelligent" handling yourself. One way to do so is to use urlparse (http://docs.python.org/2/library/urlparse.html) and check whether a scheme was given in the URL. If no scheme was provided, add your desired default scheme (e.g. http) and get the modified URL back, for instance via ParseResult.geturl().
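A minimal sketch of that idea (shown here with a plain string fallback rather than rebuilding the ParseResult; the helper name is mine, and in Python 3 urlparse lives in urllib.parse):

from urlparse import urlparse  # Python 3: from urllib.parse import urlparse

def with_default_scheme(url, default_scheme='http'):
    # urlparse reports an empty scheme when the user omitted it.
    if not urlparse(url).scheme:
        url = '%s://%s' % (default_scheme, url)
    return url

print(with_default_scheme('www.cnn.com'))      # http://www.cnn.com
print(with_default_scheme('https://cnn.com'))  # https://cnn.com (unchanged)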
Yes, it's possible, or at least it's possible to make good guesses and test them.
To make a good guess, you could start by looking for "http://" at the start of the URL and add it if it's not there. To test that guess, you could try to hit the resulting domain and see whether you get a successful response.
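A rough sketch of that guess-and-test idea using requests (the helper name and the http-first guess are assumptions of mine):

import requests

def guess_url(url):
    # Prepend a scheme if the user left it out, then test the guess.
    if not url.startswith(('http://', 'https://')):
        url = 'http://' + url
    try:
        response = requests.get(url, timeout=5)
        response.raise_for_status()   # treat 4xx/5xx as a failed guess
    except requests.RequestException:
        return None
    return response.url               # final URL after any redirects

print(guess_url('www.cnn.com'))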
I am working on getting information from https://www.corporationwiki.com/search/results?term=jim%20smith (just a random name I picked, please don't mind). I want to filter the results by using the drop-down menu to select a state.
However, the web page doesn't seem to implement 'state' as a URL parameter, which means the URL doesn't change after I select a state.
I tried passing params into requests.get(), but the result didn't change.
Here's the code I used:
import requests

url = 'https://www.corporationwiki.com/search/results?term=jim%20smith'
r = requests.get(url,
                 params=dict(query="web scraping", page=2, states='Maryland'),
                 timeout=5)
There's no error message; however, it also didn't show me the filtered results.
Can anyone help me passing the right parameters so I can filter the result by states?
Thanks :)
Actually, it looks like the website does implement state as a parameter. The exact name is "stateFacet".
You can just send your get request to:
https://www.corporationwiki.com/search/withfacets?term=jim%20smith&stateFacet=state_code
Just replace state_code with the correct value. For example:
https://www.corporationwiki.com/search/withfacets?term=jim%20smith&stateFacet=de
This link will filter with the state Delaware.
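In requests that would look something like this (stateFacet comes from the answer above; the rest is just a guess at the query string):

import requests

# 'stateFacet' takes a state code, e.g. 'de' for Delaware.
r = requests.get(
    'https://www.corporationwiki.com/search/withfacets',
    params={'term': 'jim smith', 'stateFacet': 'de'},
    timeout=5,
)
print(r.url)          # the fully encoded URL that was actually requested
print(r.status_code)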
If the endpoint doesn't support it, then you cannot get it via URL. You will need to look into more complicated methods of doing so, or figure out the correct URL parameter if there is one.
You won't be able to do it with requests. You will probably need to use something like Selenium to simulate clicking the dropdown and picking the filter(s) you want. This is because that dropdown's logic is all JavaScript, which cannot be driven through the URL alone.
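A rough Selenium sketch of that approach; the element locator below is a placeholder, since the dropdown's real id would have to be looked up in the page source:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import Select

driver = webdriver.Chrome()   # or webdriver.Firefox()
driver.get('https://www.corporationwiki.com/search/results?term=jim%20smith')

# Hypothetical locator: replace 'state-filter' with the dropdown's actual id.
state_dropdown = Select(driver.find_element(By.ID, 'state-filter'))
state_dropdown.select_by_visible_text('Maryland')

print(driver.page_source)     # results after the page's JavaScript applies the filter
driver.quit()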
My application lists some game servers' IP addresses.
I want to add a simple search feature that takes a regular expression: I would type ^200. to list only the IP addresses beginning with 200.
The form would redirect me to the results page by sending a GET request like this:
/servers/search/^200./page/1/range/30/
This is the line I'm using in urls.py:
url(r'^servers/search/(?P<search>[a-zA-Z0-9.]+)/page/(?P<page>\d+)/range/(?P<count>\d+)/$', gameservers.views.index)
But it doesn't work the way I expected: no results are shown. I intentionally made a syntax error to see the local variables, and realized that the search variable's value is the following:
^200./page/1/range/30/
How can I fix this? I've thought about moving the search parameter to the end of the URL, but it would be interesting to know whether there is a way to stop the captured value at the next /.
Your regex doesn't match at all: you are not accepting the ^ character. But even if it did, there's no way the full URL could all be captured in the search variable, because then the rest of the URL pattern wouldn't match.
However, I wouldn't try to fix this. Trying to capture complicated patterns in the URL itself is usually a mistake. For a search value, it's perfectly acceptable to move that to a GET query parameter, so that your URL would look something like this:
/servers/search/?search=^200.&page=1&range=30
or, if you like, you could still capture the page and range values in the URL, but leave the search value as a query param.
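A minimal sketch of that setup, assuming a Django version where django.conf.urls.url is available (the view body is only illustrative):

# urls.py -- keep the path static; read search/page/range from the query string
from django.conf.urls import url
from gameservers import views

urlpatterns = [
    url(r'^servers/search/$', views.index),
]

# views.py
from django.http import HttpResponse

def index(request):
    search = request.GET.get('search', '')     # e.g. '^200.'
    page = int(request.GET.get('page', 1))
    count = int(request.GET.get('range', 30))
    # ... filter the IP list with the regex in `search` and render the results ...
    return HttpResponse('search=%s page=%d range=%d' % (search, page, count))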
I have a rather strange question regarding URLs which point to another URL. So, for example, I have a URL:
http://mywebpage/this/is/a/forward
which ultimately points to another url:
http://mynewpage/this/is/new
My question is, when I use for example urllib2 in Python to fetch the first page, it ultimately fetches the second page. I would like to know if it's possible to know what the original link is pointing to. Is there something like a "header" which tells me the second link when I request the first link?
Sorry if this is a really silly question!
When you issue a GET request for the first URL, the web server will return a 300-series reply code, with a Location header whose value is the second URL. You can find out what the second URL was from Python with the geturl method of the object returned by urlopen. If there is more than one redirection involved, it appears that urllib will tell you the last hop and there's no way to get the others.
This will not handle redirections via JavaScript or meta http-equiv="refresh", but you probably aren't in that situation or you wouldn't have asked the question the way you did.
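For example, with urllib2 (Python 2), using the forwarding URL from the question:

import urllib2

response = urllib2.urlopen('http://mywebpage/this/is/a/forward')
print(response.geturl())   # the URL that was ultimately fetched, e.g. http://mynewpage/this/is/new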
It's most commonly done via a redirection response code (3xx) as defined in RFC 2616, although a "pseudo redirect effect" can be achieved with some JavaScript in the original page.
This SO question is about how to prevent urllib2 from following redirects; it looks like something you might be able to use.
You can do this using requests:
>>> import requests
>>> url = 'http://ofa.bo/foagK7'
>>> r = requests.head(url)
>>> r.headers['location']
'https://my.barackobama.com/page/s/what-does-2000-mean-to-you'
I wrote a web crawler in Python 2.6 using the Bing API that searches for certain documents and then downloads them for classification later. I've been using string methods and urllib.urlretrieve() to download results whose URL ends in .pdf, .ps etc., but I run into trouble when the document is 'hidden' behind a URL like:
http://www.oecd.org/officialdocuments/displaydocument/?cote=STD/CSTAT/WPNA(2008)25&docLanguage=En
So, two questions. Is there a way, in general, to tell whether a URL links to a pdf/doc etc. file if the URL doesn't say so explicitly (e.g. www.domain.com/file.pdf)? And is there a way to get Python to snag that file?
Edit:
Thanks for the replies, several of which suggest downloading the file to see if it's of the correct type. The only problem is... I don't know how to do that (see question #2, above). urlretrieve(<above url>) gives only an HTML file with an href containing that same URL.
There's no way to tell from the URL what it's going to give you. Even if it ends in .pdf it could still give you HTML or anything it likes.
You could do a HEAD request and look at the content-type, which, if the server isn't lying to you, will tell you if it's a PDF.
Alternatively you can download it and then work out whether what you got is a PDF.
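For instance, a sketch of that check using requests for brevity (the same idea works with urllib2 and a HEAD request); the output filename is arbitrary:

import requests

url = ('http://www.oecd.org/officialdocuments/displaydocument/'
       '?cote=STD/CSTAT/WPNA(2008)25&docLanguage=En')

head = requests.head(url, allow_redirects=True, timeout=10)
content_type = head.headers.get('Content-Type', '')

if 'application/pdf' in content_type:
    # The server claims it is a PDF, so fetch the body for real.
    pdf_bytes = requests.get(url, timeout=30).content
    with open('document.pdf', 'wb') as f:
        f.write(pdf_bytes)
else:
    print('Not reported as a PDF: %r' % content_type)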
In this case, what you refer to as "a document that's not explicitly referenced in a URL" seems to be what is known as a "redirect". Basically, the server tells you that you have to get the document at another URL. Normally, python's urllib will automatically follow these redirects, so that you end up with the right file. (and - as others have already mentioned - you can check the response's mime-type header to see if it's a pdf).
However, the server in question is doing something strange here. You request the url, and it redirects you to another url. You request the other url, and it redirects you again... to the same url! And again... And again... At some point, urllib decides that this is enough already, and will stop following the redirect, to avoid getting caught in an endless loop.
So how come you are able to get the pdf when you use your browser? Because apparently, the server will only serve the pdf if you have cookies enabled. (why? you have to ask the people responsible for the server...) If you don't have the cookie, it will just keep redirecting you forever.
(Check the urllib2 and cookielib modules to get support for cookies; this tutorial might help.)
At least, that is what I think is causing the problem. I haven't actually tried doing it with cookies yet. It could also be that the server does not "want" to serve the pdf because it detects that you are not using a "normal" browser (in which case you would probably need to fiddle with the User-Agent header), but that would be a strange way of doing it. So my guess is that it uses a "session cookie" somewhere and, if you haven't got one yet, keeps on trying to redirect.
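A minimal sketch of that cookie-enabled fetch with Python 2's urllib2 and cookielib (the output filename is arbitrary):

import cookielib
import urllib2

# An opener that stores cookies and sends them back on each redirect hop.
cookie_jar = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie_jar))
opener.addheaders = [('User-Agent', 'Mozilla/5.0')]   # in case the server also sniffs the browser

url = ('http://www.oecd.org/officialdocuments/displaydocument/'
       '?cote=STD/CSTAT/WPNA(2008)25&docLanguage=En')
response = opener.open(url)

with open('document.pdf', 'wb') as f:
    f.write(response.read())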
As has been said there is no way to tell content type from URL. But if you don't mind getting the headers for every URL you can do this:
import urllib

obj = urllib.urlopen(URL)          # URL is the address you want to check
headers = obj.info()
if headers['Content-Type'].find('pdf') != -1:
    # we have a pdf file, download the whole thing
    ...
This way you won't have to download each URL, just its headers. It's still not exactly saving network traffic, but you won't get better than that.
Also you should use mime-types instead of my crude find('pdf').
No. It is impossible to tell what kind of resource is referenced by a URL just by looking at it. It is totally up to the server to decide what it gives you when you request a certain URL.
Check the mimetype via the info() method of the object returned by urllib.urlopen(). This might not be 100% accurate; it really depends on what the site returns as a Content-Type header. If it's well behaved it'll return the proper mime type.
A PDF should return application/pdf, but that may not be the case.
Otherwise you might just have to download it and try it.
You can't see it from the url directly. You could try to only download the header of the HTTP response and look for the Content-Type header. However, you have to trust the server on this - it could respond with a wrong Content-Type header not matching the data provided in the body.
To detect the file type in Python 3.x (e.g. in a web app) from a URL to a file that may have a missing or fake extension, you can use python-magic. Install it using
pip3 install python-magic
For Mac OS X, you should also install libmagic using
brew install libmagic
Code snippet
import magic
from urllib.request import Request, urlopen

url = "http://...url to the file ..."
response = urlopen(Request(url))
# A couple of kilobytes are enough for libmagic to identify the type.
mime_type = magic.from_buffer(response.read(2048), mime=True)
print(mime_type)   # e.g. 'application/pdf'
I am writing client-side Python unit tests to verify whether the HTTP 302 redirects on my Google App Engine site are pointing to the right pages. So far, I have been calling urllib2.urlopen(my_url).geturl(). However, I have encountered 2 issues:
the URL returned by geturl() does not appear to include URL query strings like ?k1=v1&k2=v2; how can I see these? (I need to check whether I correctly passed along the visitor's original URL query string to the redirect page.)
geturl() shows the final URL after any additional redirects. I just care about the first redirect (the one from my site); I am agnostic to anything after that. For example, let's assume my site is example.com. If a user requests http://www.example.com/somepath/?q=foo, I might want to redirect them to http://www.anothersite.com?q=foo. That other site might do another redirect to http://subdomain.anothersite.com?q=foo, which I can't control or predict. How can I make sure my redirect is correct?
Supply follow_redirects=False to the fetch function, then retrieve the location of the first redirect from the 'location' header in the response, like so:
from google.appengine.api import urlfetch

response = urlfetch.fetch(your_url, follow_redirects=False)
location = response.headers['Location']
Use httplib (and look at the return status and Location header of the response) to avoid the "auto-follow redirects" that's impeding your testing. There's a good example here.
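For instance, a small httplib (Python 2) check along those lines, using the example host and path from the question:

import httplib

conn = httplib.HTTPConnection('www.example.com')
conn.request('GET', '/somepath/?q=foo')
response = conn.getresponse()            # httplib does not follow redirects itself

print(response.status)                   # expect 302 for your redirect
print(response.getheader('Location'))    # e.g. http://www.anothersite.com?q=foo
conn.close()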