How can we save a webpage, including its content, so that it is viewable offline, using urllib in Python? Currently I am using the following code:
import urllib.request
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("http://www.yahoo.com")
urllib.request.urlretrieve("http://www.yahoo.com", "C:\\Users\\karanjuneja\\Downloads\\kj\\yahoo.mhtml")
This works and stores an MHTML version of the webpage in the folder, but when you open the file you will only find the source code, not the page as it appears online. Do we need to make changes to the code?
Also, is there an alternate way of saving the webpage in MHTML format with all the content as it appears online, and not just the source? Any suggestions?
Thanks, Karan
I guess this site might help you:
Create an MHTML archive
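If you want the rendered page (not just the raw source), one option is to let Chrome itself produce the MHTML snapshot through the DevTools protocol. This is an untested sketch assuming Selenium 4+ and a local chromedriver; the function name is illustrative, not from the original post:

```python
def save_mhtml(url, path):
    """Save a fully rendered page as MHTML via Chrome DevTools.

    Assumes Selenium 4+ and chromedriver on PATH.
    """
    from selenium import webdriver  # third-party: pip install selenium

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        # Page.captureSnapshot returns the rendered page, with resources
        # inlined, as a single MHTML string
        snapshot = driver.execute_cdp_cmd("Page.captureSnapshot",
                                          {"format": "mhtml"})
        with open(path, "w", newline="") as f:
            f.write(snapshot["data"])
    finally:
        driver.quit()
```

Because the snapshot is taken after the browser renders the page, the saved file opens as the page appears online rather than as bare source.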
I'm trying to download a .xls file from this site.
I need to somehow click the second button ("Exporta informácion diária") on the grid and download the .xls file.
I tried with requests and BeautifulSoup but it didn't work.
After that, I tried with Selenium just for some tests and I managed to do what I needed.
Can someone please explain how I can download the .xls file without using a headless browser?
Thank you.
To do this, you first need to understand the flow of network requests that performs the download.
The easiest way is to open the developer tools in the browser you are using and follow the relevant requests.
In your case, there is a POST request which returns the exact address of the file.
Then download it with a GET request.
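As a rough sketch of that two-step flow with requests (the endpoint URL, payload, and response shape below are placeholders; copy the real ones from the Network tab of your browser's developer tools):

```python
def download_export(filename="export.xls"):
    """Two-step download: POST to the export endpoint, then GET the file.

    The endpoint, payload, and response format are placeholders taken
    from nothing in particular -- replace them with what you see in the
    browser's network log.
    """
    import requests  # third-party: pip install requests

    session = requests.Session()  # keeps cookies between the two requests
    # Step 1: the POST request seen in the network log; on this site it
    # returns the file's address.
    resp = session.post("https://example.com/export", data={"report": "daily"})
    resp.raise_for_status()
    file_url = resp.text.strip()
    # Step 2: fetch the file itself with a plain GET and save it.
    data = session.get(file_url).content
    with open(filename, "wb") as f:
        f.write(data)
```

Using a `Session` matters when the site ties the export to cookies set during the POST.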
I'm using Python 2.7 (on Windows 7). I'm just trying to read a webpage using urllib and write it to a file. Below is my code.
import urllib

html = urllib.urlopen("http://www.sciencedirect.com/science/article/pii/S027252311730076X").readlines()
print len(html)
# use a raw string so backslashes (e.g. \t) aren't treated as escapes
g = open(r"D:\path\to\output\output.html", 'w')
for i in html:
    g.write(i)
g.close()
But when I compared the page source of the above-mentioned link in the browser (right click -> View page source) with my output HTML file, they are different. A lot of information is missing from my output.html file. Why is that, and how can I get the original page source? I have to write some more code to extract specific info from this page.
Thanks for your help in advance.
I want to write a Python script to automate uploading an image to https://cloud.google.com/vision/ and collect information from the JSON tab there. I need to know how to do it.
So far, I'm only able to open the website in Chrome using the following code:
import webbrowser
url = 'https://cloud.google.com/vision/'
webbrowser.open_new_tab(url + 'doc/')
I tried using urllib2 but couldn't get anything.
Help me out, please.
You have to use the google-cloud-vision library.
There is sample code in the docs:
https://cloud.google.com/vision/docs/reference/libraries#client-libraries-install-python
You can start from there.
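For instance, a minimal label-detection call with the client library looks roughly like this (untested sketch; it assumes `pip install google-cloud-vision` and the `GOOGLE_APPLICATION_CREDENTIALS` environment variable pointing at a service-account key, and the function name is just an illustration):

```python
def detect_labels(image_path):
    """Send a local image to the Cloud Vision API and return
    (label, score) pairs -- the same data shown in the demo's JSON tab.

    Requires google-cloud-vision and valid application credentials.
    """
    from google.cloud import vision  # third-party client library

    client = vision.ImageAnnotatorClient()
    with open(image_path, "rb") as f:
        image = vision.Image(content=f.read())
    response = client.label_detection(image=image)
    return [(label.description, label.score)
            for label in response.label_annotations]
```

This talks to the API directly, so there is no need to drive the web demo through a browser at all.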
Alright, so the issue is that I visit a site to download the file I want, but the website doesn't host the actual file itself; it uses Dropbox to host it. As soon as you click download, you're redirected to a blank page with a small Dropbox popup window that lets you download the file. Things to note: there is no login, so I can point Python right at the link where the Dropbox popup appears, but it won't download the file.
import urllib
url = 'https://thewebsitedownload.com'
filename = 'filetobedownloaded.exe'
urllib.urlretrieve(url, filename)
That's the code I used to use, and it worked like a charm for direct downloads, but now when I try it on the site with the Dropbox popup download, it just ends up downloading the HTML code of the site (from what I can tell) and does not actually download the file.
I am still relatively new to Python and coding in general, but I am loving it so far; this is just the first brick wall I've hit without finding any similar resolutions.
Thanks in advance! Sample code helps so much; that's how I have been learning so far.
Use BeautifulSoup to parse the HTML you get. You can then extract the href link to the file. There are a lot of BeautifulSoup tutorials on the web, so I think you'll find it fairly easy to figure out how to get the link in your specific situation.
First you download the HTML with the code you already have, but without the filename:
import urllib
import re

from bs4 import BeautifulSoup

url = 'https://thewebsitedownload.com'
text = urllib.urlopen(url).read()
soup = BeautifulSoup(text, "html.parser")  # name a parser explicitly
# grab the first link whose href mentions "dropbox"
link = soup.find_all(href=re.compile("dropbox"))[0]['href']
print link

filename = 'filetobedownloaded.exe'
urllib.urlretrieve(link, filename)
I made this from the docs but haven't tested it; I think you get the idea, though.
I would like to make a script (in any language, but preferably Python or Perl) that downloads a specific type of file being streamed by a web page. However, I do not know the file's location, so I will have to find it by listing all the files being streamed by the page and selecting the one I want based on file type.
A similar example would be to say I want to download a video off YouTube, but there is no pattern or way to find the URL except by finding the files being streamed to my computer.
The part I cannot figure out is how to find all the files being streamed by the page. The rest I can do myself. The file name is not mentioned anywhere in the source of the HTML page.
Example of the problem...
This works fine:
import urllib
urllib.urlretrieve ("http://example.com/anything.mp3", "a.mp3")
However this does not:
import urllib
urllib.urlretrieve ("http://example.com/page-where-the-mp3-file-is-being-streamed.html", "a.mp3")
If someone can help me figure out how to download all the files from a page, or how to find the files being streamed, I would really appreciate it. All I need to know is which language/library/method can accomplish this. Thanks.