BeautifulSoup 4 issue reading URL-encoded links and files downloaded using wget - python

I am having an issue with BS4 on Python 2.7.12 reading links and files that were already URL-encoded when I downloaded them with wget to archive my Drupal website.
For example, a link that exists on the live website would be:
https://mywebsite.org/content/prime's-and-"doubleprimes"-in-it (I know this is incorrect grammar because the 's example is possessive not plural)
The downloaded file would be:
/content/prime%E2%80%99s-and-%E2%80%9Cdoubleprimes%E2%80%9D-in-it
(This is helpful in identifying different typography: http://www.w3schools.com/TAGS/ref_urlencode.asp)
My script loops through each file and flattens the site by adding ".html" to all links. However, when I use BS4 to do this, it actually changes the link path: it seems to re-interpret links that are already URL-encoded. As a result it changes the above link to:
/content/prime%2580%2599s-and-%2580%259Cdoubleprimes%2580%259D-in-it
And thus the link no longer works. You can see the %25 it inserts to encode the % signs that begin sequences like %E2.
There have been many questions about encoding with BS4, but most of them deal specifically with UTF-8. I understand that BS4 automatically reads the "soup" into UTF-8, but I'm unsure why it re-URL-encodes links that are already encoded. I have tried soup = BeautifulSoup(text.read().decode('utf-8','ignore')) as suggested here, which fixed an issue where BS4 was trying to interpret %E2 as a Unicode character, but I haven't found anything about re-encoding of already URL-encoded characters. I have also tried adding formatter="html" to my soup.prettify command, but that did not work either, since the files had already been read and interpreted by that point.
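For context, here is a minimal sketch of the kind of flattening loop described above (the archive directory, the /content/ prefix check, and the write-back step are assumptions; it illustrates the workflow rather than solving the re-encoding problem):

import os
from bs4 import BeautifulSoup

ARCHIVE_DIR = 'archive'  # hypothetical root of the wget mirror

for root, _, files in os.walk(ARCHIVE_DIR):
    for name in files:
        if not name.endswith('.html'):
            continue
        path = os.path.join(root, name)
        with open(path) as f:
            # decode explicitly so BS4 is not guessing at the %E2... byte sequences
            soup = BeautifulSoup(f.read().decode('utf-8', 'ignore'))
        for a in soup.find_all('a', href=True):
            href = a['href']
            if href.startswith('/content/') and not href.endswith('.html'):
                a['href'] = href + '.html'
        with open(path, 'w') as f:
            # str(soup) returns UTF-8 bytes on Python 2
            f.write(str(soup))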

Related

How can I use a library such as requests-html to rewrite image sources in HTML source code from a local file name to the server URL + the local file name

Remember when we were younger and we copied the source code of a website and did not understand why it broke so much?
Well, as I know now, it is because if you do not have webserverfile.png downloaded, it cannot render.
How can I use Python to replace
<img src='webserverfile.png'> with <img src='webserverurl/webserverfile.png'>
Note: I do not need help getting the source of the site, I know how to do that with requests.
Also note: if this can be done with native Python string methods like replace or startswith, then you do not need to use, for example, bs4 or requests-html to do it.
Also, if the site already uses webserverurl/webserverfile.png, I do not want it replaced with webserverurl/webserverurl/webserverfile.png.
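Here is a minimal sketch using BeautifulSoup (webserverurl and the sample HTML are placeholders; urljoin leaves sources that are already absolute untouched, which covers the double-prefix concern):

from urllib.parse import urljoin
from bs4 import BeautifulSoup

base = 'http://webserverurl/'              # placeholder server URL
html = "<img src='webserverfile.png'>"     # placeholder source, e.g. fetched with requests

soup = BeautifulSoup(html, 'html.parser')
for img in soup.find_all('img', src=True):
    # urljoin only prefixes relative sources; absolute ones come back unchanged,
    # so webserverurl/webserverfile.png is not turned into webserverurl/webserverurl/...
    img['src'] = urljoin(base, img['src'])

print(soup)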

BeautifulSoup object not matching a website's html markup in chrome's DeveloperTools

I am trying to crawl this link using Python's BeautifulSoup and urllib2 libraries. One problem I am running into is that the soup object does not match the page's HTML shown in Google Chrome's DeveloperTools. I checked multiple times and I am certain that I am passing the correct address. The reason I know they are different is that I printed the entire soup object into Sublime Text 2 and compared it against what is shown in Chrome's DeveloperTools. I also searched for very specific tags in the soup object. After debugging for hours, I am out of ideas. Does anyone know why this is happening? Is there some sort of redirection going on?
JavaScript runs in the browser and changes the page's DOM. A URL library such as urllib2 only downloads the HTML and does not execute included or linked JavaScript. That's why you see a difference.
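One quick way to confirm this (the URL and element id below are placeholders) is to search the raw HTML that urllib2 returns for an element you can see in DevTools:

import urllib2
from bs4 import BeautifulSoup

# urllib2 returns the HTML exactly as served; no JavaScript is executed
html = urllib2.urlopen('http://example.com/page').read()
soup = BeautifulSoup(html)

# Look for an element that is visible in Chrome's DeveloperTools; if it was
# inserted by JavaScript it will not be in the raw HTML and find() returns None
print(soup.find('div', id='some-dynamic-widget'))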

Parse HTML, 'ValueError: stat: path too long for Windows'

I'm trying to scrape data from NYSE's website, from this URL:
nyse = 'http://www1.nyse.com/about/listed/IPO_Index.html'
Using requests, I've set my request up like this:
import requests
import pandas
from bs4 import BeautifulSoup

page = requests.get(nyse)
soup = BeautifulSoup(page.text)
tables = soup.findAll('table')
test = pandas.io.html.read_html(str(tables))
However, I keep getting this error
'ValueError: stat: path too long for Windows'
I don't understand how to interpret this error, let alone solve the problem. I've seen one other post on this topic (Copy a file with a too long path to another directory in Python), but I don't fully understand the workaround, and I'm not sure which path is the problem in this case.
The error is thrown at the test = pandas.io.... line, but there is no obvious path involved; I'm not storing the table locally. Do I need to use pywin32? Why does this error only show up for some URLs and not others? How do I solve this problem?
For reference, I'm using python 3.4
Update:
The error only appears with the nyse website, and not for others that I'm also scraping. In all cases, I'm doing the str(tables) conversion.
The pandas read_html method accepts URLs, files, or raw HTML strings as its first argument. It looks like it's trying to interpret the str(tables) argument as a file path (hence the stat call), which would of course be far longer than whatever path-length limit Windows imposes.
Are you certain that str(tables) produces raw, parseable HTML? tables is a list of node objects, and it seems likely that calling str() on that whole list does not produce what you're looking for.
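Since read_html accepts a URL directly, one way to sidestep the issue entirely (a sketch, untested against this particular page) is to skip the BeautifulSoup step and let pandas fetch and parse the page itself:

import pandas

# pandas should recognise the string as a URL and fetch it itself,
# rather than treating a long HTML string as a local file path
nyse = 'http://www1.nyse.com/about/listed/IPO_Index.html'
tables = pandas.io.html.read_html(nyse)
print(len(tables))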

Matching contents of an html file with keyword python

I am making a download manager, and I want it to check the md5 hash of a file against a hash published on the page it was downloaded from. It needs to compute the md5 of the file (this is done), then search the whole contents of the HTML page for a matching string.
My question is: how do I make Python return the whole contents of the HTML page and find a match for my "md5 string"?
The requests library is what you want to use; it will save you a lot of trouble.
Import urllib and use urllib.urlopen to get the contents of the HTML page. Import re to search for the hash with a regex; you could also use the string's find method instead of a regex.
If you encounter problems, then you can ask more specific questions. Your question is too general.
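A minimal sketch that puts the pieces together (the file path and page URL below are placeholders):

import hashlib
import requests

file_path = 'downloaded_file.bin'                 # placeholder: the file just downloaded
page_url = 'http://example.com/downloads.html'    # placeholder: page that lists the hash

# Compute the md5 of the downloaded file in chunks
md5 = hashlib.md5()
with open(file_path, 'rb') as f:
    for chunk in iter(lambda: f.read(8192), b''):
        md5.update(chunk)
digest = md5.hexdigest()

# Fetch the whole contents of the HTML page and look for the digest in it
html = requests.get(page_url).text
if digest in html:
    print('md5 found on the page - download looks good')
else:
    print('md5 not found on the page')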

Using Python to download a document that's not explicitly referenced in a URL

I wrote a web crawler in Python 2.6 using the Bing API that searches for certain documents and then downloads them for classification later. I've been using string methods and urllib.urlretrieve() to download results whose URL ends in .pdf, .ps etc., but I run into trouble when the document is 'hidden' behind a URL like:
http://www.oecd.org/officialdocuments/displaydocument/?cote=STD/CSTAT/WPNA(2008)25&docLanguage=En
So, two questions. Is there a way in general to tell if a URL has a pdf/doc etc. file that it's linking to if it's not doing so explicitly (e.g. www.domain.com/file.pdf)? Is there a way to get Python to snag that file?
Edit:
Thanks for replies, several of which suggest downloading the file to see if it's of the correct type. Only problem is... I don't know how to do that (see question #2, above). urlretrieve(<above url>) gives only an html file with an href containing that same url.
There's no way to tell from the URL what it's going to give you. Even if it ends in .pdf it could still give you HTML or anything it likes.
You could do a HEAD request and look at the content-type, which, if the server isn't lying to you, will tell you if it's a PDF.
Alternatively you can download it and then work out whether what you got is a PDF.
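A minimal sketch of that HEAD-request check with urllib2 (the URL is a placeholder; urllib2 has no built-in HEAD support, so the request class is subclassed):

import urllib2

class HeadRequest(urllib2.Request):
    # urllib2 sends GET by default; override the method to send HEAD
    def get_method(self):
        return 'HEAD'

url = 'http://www.example.org/some/document'  # placeholder
response = urllib2.urlopen(HeadRequest(url))
content_type = response.info().getheader('Content-Type')
print(content_type)  # e.g. 'application/pdf' if the server is truthful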
In this case, what you refer to as "a document that's not explicitly referenced in a URL" seems to be what is known as a "redirect". Basically, the server tells you that you have to get the document at another URL. Normally, python's urllib will automatically follow these redirects, so that you end up with the right file. (and - as others have already mentioned - you can check the response's mime-type header to see if it's a pdf).
However, the server in question is doing something strange here. You request the url, and it redirects you to another url. You request the other url, and it redirects you again... to the same url! And again... And again... At some point, urllib decides that this is enough already, and will stop following the redirect, to avoid getting caught in an endless loop.
So how come you are able to get the pdf when you use your browser? Because apparently, the server will only serve the pdf if you have cookies enabled. (why? you have to ask the people responsible for the server...) If you don't have the cookie, it will just keep redirecting you forever.
(check the urllib2 and cookielib modules to get support for cookies, this tutorial might help)
At least, that is what I think is causing the problem. I haven't actually tried doing it with cookies yet. It could also be that the server does not "want" to serve the pdf because it detects you are not using a "normal" browser (in which case you would probably need to fiddle with the User-Agent header), but that would be a strange way of doing it. So my guess is that it uses a "session cookie" somewhere, and if you haven't got one yet, it keeps trying to redirect.
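A minimal sketch of the cookie-enabled request this answer describes (untested against that particular server):

import cookielib
import urllib2

# Build an opener that stores and resends cookies, so the redirect loop
# can terminate once the server has set its session cookie
cookies = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookies))

url = ('http://www.oecd.org/officialdocuments/displaydocument/'
       '?cote=STD/CSTAT/WPNA(2008)25&docLanguage=En')
response = opener.open(url)

print(response.info().getheader('Content-Type'))
with open('document.pdf', 'wb') as f:
    f.write(response.read())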
As has been said there is no way to tell content type from URL. But if you don't mind getting the headers for every URL you can do this:
import urllib

obj = urllib.urlopen(URL)
headers = obj.info()
if headers['Content-Type'].find('pdf') != -1:
    # we have a pdf file, download the whole thing
    ...
This way you won't have to download each URL, just its headers. It's still not exactly saving network traffic, but you won't get better than that.
Also, you should use MIME types instead of my crude find('pdf').
No. It is impossible to tell what kind of resource is referenced by a URL just by looking at it. It is totally up to the server to decide what it gives you when you request a certain URL.
Check the MIME type with the info() method on the response returned by urllib.urlopen(). This might not be 100% accurate; it really depends on what the site returns as a Content-Type header. If it's well behaved, it will return the proper MIME type.
A PDF should return application/pdf, but that may not be the case.
Otherwise you might just have to download it and try it.
You can't see it from the url directly. You could try to only download the header of the HTTP response and look for the Content-Type header. However, you have to trust the server on this - it could respond with a wrong Content-Type header not matching the data provided in the body.
To detect the file type in Python 3.x (for example in a web app, given a URL to a file that may have no extension or a fake one), you can use python-magic. Install it with
pip3 install python-magic
For Mac OS X, you should also install libmagic using
brew install libmagic
Code snippet
import urllib.request
import magic

url = "http://...url to the file ..."
request = urllib.request.Request(url)
response = urllib.request.urlopen(request)

# Read the start of the response and ask libmagic for the MIME type
mime_type = magic.from_buffer(response.read(2048), mime=True)
print(mime_type)
