Python urllib function reading wrong page

I'm using Python 2.7 (on Windows 7). I'm just trying to read a webpage using urllib and write it to a file. Below is my code.
import urllib

html = urllib.urlopen("http://www.sciencedirect.com/science/article/pii/S027252311730076X").readlines()
print len(html)
# raw string so the backslashes in the Windows path are not treated as escape sequences
g = open(r"D:\path\to\output\output.html", 'w')
for i in html:
    g.write(i)
g.close()
But when I compare the page source of the above link in a browser (right click -> View page source) with my output HTML file, they are different. A lot of information is missing from my output.html file. Why is that, and how can I get the original page source? I need it because I have to write more code to extract some specific info from this page.
Thanks for your help in advance.

Related

Download data from online database in python script

I am trying to download data from UniProt using Python from within a script. If you follow the previous link, you will see a Download button, and then the option of choosing the format of the data. I would like to download the Excel format, compressed. Is there a way to do this within a script?
You can easily see the URL for that if you monitor it in the Firefox "Network" tab or equivalent. For this page it seems to be https://www.uniprot.org/uniprot/?query=*&format=xlsx&force=true&columns=id,entry%20name,reviewed,protein%20names,genes,organism,length&fil=organism:%22Homo%20sapiens%20(Human)%20[9606]%22%20AND%20reviewed:yes&compress=yes. You should be able to download it using requests or any similar lib.
Example:
import requests

url = "https://www.uniprot.org/uniprot/?query=*&format=xlsx&force=true&columns=id,entry%20name,reviewed,protein%20names,genes,organism,length&fil=organism:%22Homo%20sapiens%20(Human)%20[9606]%22%20AND%20reviewed:yes&compress=yes"
with open("downloaded.xlsx.gz", "wb") as target:
    target.write(requests.get(url).content)
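For a large export like this it may be safer to stream the response to disk instead of holding the whole file in memory. A minimal sketch of that variant, using the same URL (the streaming calls are standard requests API, but this version is untested here):
import requests

url = "https://www.uniprot.org/uniprot/?query=*&format=xlsx&force=true&columns=id,entry%20name,reviewed,protein%20names,genes,organism,length&fil=organism:%22Homo%20sapiens%20(Human)%20[9606]%22%20AND%20reviewed:yes&compress=yes"
with requests.get(url, stream=True) as response:
    response.raise_for_status()  # fail early on HTTP errors
    with open("downloaded.xlsx.gz", "wb") as target:
        # write the body in chunks rather than buffering it all at once
        for chunk in response.iter_content(chunk_size=64 * 1024):
            target.write(chunk)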

How to download a webpage (mhtml format) using urllib in python

How can we save a webpage, including its content, so that it is viewable offline, using urllib in Python? Currently I am using the following code:
import urllib.request
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("http://www.yahoo.com")
urllib.request.urlretrieve("http://www.yahoo.com", "C:\\Users\\karanjuneja\\Downloads\\kj\\yahoo.mhtml")
This works and stores an mhtml version of the webpage in the folder, but when you open the file, you will only find the code and not the page as it appears online. Do we need to make changes to the code?
Also, is there an alternate way of saving the webpage in MHTML format with all the content as it appears online, and not just the source? Any suggestions?
Thanks Karan
I guess this site might help you~
Create an MHTML archive
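For reference, one way to produce an MHTML snapshot of the fully rendered page from Python (a sketch that assumes Selenium 4 with Chrome, not taken from the linked page) is to call the DevTools command Page.captureSnapshot through execute_cdp_cmd, which returns the page serialized as MHTML:
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("http://www.yahoo.com")

# ask Chrome itself to serialize the rendered page (with inlined resources) as MHTML
snapshot = driver.execute_cdp_cmd("Page.captureSnapshot", {"format": "mhtml"})

with open("yahoo.mhtml", "w", encoding="utf-8") as f:
    f.write(snapshot["data"])

driver.quit()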

Unable to read HTML content

I'm building a web crawler which needs to read links inside a webpage, for which I'm using Python's urllib2 library to open and read the websites.
I found a website where I'm unable to fetch any data.
The URL is "http://www.biography.com/people/michael-jordan-9358066"
My code:
import urllib2
response = urllib2.urlopen("http://www.biography.com/people/michael-jordan-9358066")
print response.read()
The content I get by running the above code and the content I see if I open the same page in a browser are very different. The content from the above code does not include any of the page data.
I thought it could be because of a delay in reading the web page, so I introduced a delay. Even after the delay, the response is the same.
import time
import urllib2
response = urllib2.urlopen("http://www.biography.com/people/michael-jordan-9358066")
time.sleep(20)
print response.read()
The web page opens perfectly fine in a browser.
However, the above code works fine for reading Wikipedia or some other websites.
I'm unable to find the reason behind this odd behaviour. Please help, thanks in advance.
What you are experiencing is most likely the effect of a dynamic web page. Such pages do not serve their content as static HTML for urllib or requests to fetch; the data is loaded by JavaScript after the page arrives. You can use Selenium from Python to solve this, as in the sketch below.
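A minimal sketch of that approach, assuming Selenium with Chrome and a chromedriver on the PATH (not part of the original answer):
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("http://www.biography.com/people/michael-jordan-9358066")

# page_source reflects the DOM after the browser has executed the page's JavaScript,
# so it contains the dynamically loaded data that urllib2 never sees
html = driver.page_source
driver.quit()

print(len(html))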

Windows Python How to download dropbox popup from a website?

Alright, so the issue is that I visit a site to download a file, but the site I'm downloading from doesn't host the actual file; instead it uses Dropbox to host it. As soon as you click download, you're redirected to a blank page where Dropbox pops up in a small window allowing you to download it. Things to note: there is no login, so I can point Python right at the link where Dropbox pops up, but it won't download the file.
import urllib
url = 'https://thewebsitedownload.com'
filename = 'filetobedownloaded.exe'
urllib.urlretrieve(url, filename)
That's the code I used to use, and it worked like a charm for direct downloads, but now when I try it on the site with the Dropbox popup, it just ends up downloading the HTML code of the site (from what I can tell) and does not actually download the file.
I am still relatively new to Python and coding in general, but I am loving it so far; this is just the first brick wall I've hit without finding any similar resolutions.
Thanks in advance! Sample code helps so much; that's how I have been learning so far.
Use BeautifulSoup to parse the HTML you get. You can then get the href link to the file. There are a lot of BeautifulSoup tutorials on the web, so I think you'll find it fairly easy to figure out how to get the link in your specific situation.
First, download the HTML with the code you already have, but without the filename:
import re
import urllib
from bs4 import BeautifulSoup

url = 'https://thewebsitedownload.com'
text = urllib.urlopen(url).read()

soup = BeautifulSoup(text, "html.parser")
# grab the first link whose href mentions dropbox
link = soup.find_all(href=re.compile("dropbox"))[0]['href']
print link

filename = 'filetobedownloaded.exe'
urllib.urlretrieve(link, filename)
I made this from the docs and haven't tested it, but I think you get the idea.
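One extra detail worth checking (an assumption about the target site, not something stated in the question): Dropbox share links normally end in ?dl=0, which serves Dropbox's preview page rather than the file, while ?dl=1 requests a direct download. A small standalone sketch with a hypothetical link:
import urllib

# hypothetical Dropbox share link of the kind the BeautifulSoup step above would find
link = "https://www.dropbox.com/s/abc123/filetobedownloaded.exe?dl=0"
# dl=0 serves the preview page; dl=1 asks for the file itself
link = link.replace("dl=0", "dl=1")
urllib.urlretrieve(link, "filetobedownloaded.exe")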

Save html as text

I have JavaScript code which just shows the source code of an HTML page:
javascript:h=document.getElementsByTagName('html')[0].innerHTML;function%20disp(h){h=h.replace(/</g,%20'\n&lt;');h=h.replace(/>/g,'&gt;');document.getElementsByTagName('body')[0].innerHTML='<pre>&lt;html&gt;'+h.replace(/(\n|\r)+/g,'\n')+'&lt;/html&gt;</pre>';}void(disp(h));
I saved the code as a bookmark in Firefox, so after loading a web page, when I select the bookmark, it shows the source code.
Now I'm trying to save the HTML file using Python.
import urllib2
from BeautifulSoup import BeautifulSoup

page = urllib2.urlopen("http://www.doctorisin.net/")
soup = BeautifulSoup(page)
print soup.prettify()

fp = open('file.txt', 'wb')
fp.write(soup.prettify())
fp.close()
But it does not have all the content that the JavaScript code shows. The saved file and the source shown by the JavaScript are not the same. Maybe the Python code does not get all the code (JavaScript/CSS tags) from the HTML page. What is the problem? Am I doing something wrong? Need help.
thank you
EDITED
As an example of my problem, take http://phpjunkyard.com/tutorials/cut-paste-code.php (a random site). Go to this site, right click and select View Page Source (Firefox), copy the source and save it in a text file. Now save the page (Save Page As). You can see that the two are not the same: the saved page (Save As) has something more. Python gives output like the source code (View Page Source); it is missing some scripts, forms etc.
If you want to save the exact HTML that the web server gives, don't use BeautifulSoup (which is an HTML parser and will likely modify the code when pretty-printing it back); this would be a better solution:
import urllib2

with open("my_file.txt", "w") as f:
    f.write(urllib2.urlopen("http://www.doctorisin.net/").read())
Firefox by default saves not only the HTML but also files that are needed to display the page (including css and scripts).
What you are seeing is the difference between static and dynamic webpages.
Unlike static webpages, dynamic webpages can modify the underlying HTML as they load. The JavaScript bookmarklet can dump the full HTML of the loaded page because it has access to the modified DOM created by the browser.
In contrast, if the same webpage is downloaded from the server and fed directly to BeautifulSoup, it will only be able to parse it as static HTML. To get the full, dynamic content, the page would need to be processed by a browser (or the equivalent) first.
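A rough way to see this difference for yourself (a sketch that assumes Selenium and Chrome are available; it is not part of the original answer): fetch the same URL once as raw server HTML and once through a real browser, and compare what comes back.
import urllib2
from selenium import webdriver

url = "http://www.doctorisin.net/"

# static HTML exactly as the server sends it
raw_html = urllib2.urlopen(url).read()

# HTML after the browser has built (and possibly modified) the DOM
driver = webdriver.Chrome()
driver.get(url)
rendered_html = driver.page_source
driver.quit()

print len(raw_html), len(rendered_html)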
