How can we save a webpage, including its content, so that it is viewable offline, using urllib in Python? Currently I am using the following code:
import urllib.request
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("http://www.yahoo.com")
urllib.request.urlretrieve("http://www.yahoo.com", "C:\\Users\\karanjuneja\\Downloads\\kj\\yahoo.mhtml")
This works and stores an MHTML version of the webpage in the folder, but when you open the file you will only find the source code, not the page as it appears online. Do we need to make changes to the code?
Also, is there an alternate way of saving the webpage in MHTML format with all the content as it appears online, and not just the source? Any suggestions?
Thanks, Karan
I guess this site might help you:
Create an MHTML archive
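If you want the rendered page (not just the raw source), one option is to let Chrome itself produce the MHTML snapshot through the DevTools protocol. This is an untested sketch assuming Selenium 4+ and a local chromedriver; the function name is illustrative, not from the original post:

```python
def save_mhtml(url, path):
    """Save a fully rendered page as MHTML via Chrome DevTools.

    Assumes Selenium 4+ and chromedriver on PATH.
    """
    from selenium import webdriver  # third-party: pip install selenium

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        # Page.captureSnapshot returns the rendered page, with resources
        # inlined, as a single MHTML string
        snapshot = driver.execute_cdp_cmd("Page.captureSnapshot",
                                          {"format": "mhtml"})
        with open(path, "w", newline="") as f:
            f.write(snapshot["data"])
    finally:
        driver.quit()
```

Because the snapshot is taken after the browser renders the page, the saved file opens as the page appears online rather than as bare source.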
I'm trying to download a .xls file from this site.
I need to somehow click the second button ("Exporta informácion diária") on the grid and download the .xls file.
I tried with requests and BeautifulSoup but it didn't work.
After that, I tried with Selenium just for some tests and I managed to do what I needed.
Can someone please explain how I can download the .xls file without using a headless browser?
Thank you.
To do this, you first need to understand the flow of network requests that performs the download.
The easiest way is to open the developer tools in the browser you are using and follow the relevant requests.
In your case, there is a POST request which returns the exact address of the file.
Then download it with a GET request.
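As a rough sketch of that two-step flow with requests (the endpoint URL, payload, and response shape below are placeholders; copy the real ones from the Network tab of your browser's developer tools):

```python
def download_export(filename="export.xls"):
    """Two-step download: POST to the export endpoint, then GET the file.

    The endpoint, payload, and response format are placeholders taken
    from nothing in particular -- replace them with what you see in the
    browser's network log.
    """
    import requests  # third-party: pip install requests

    session = requests.Session()  # keeps cookies between the two requests
    # Step 1: the POST request seen in the network log; on this site it
    # returns the file's address.
    resp = session.post("https://example.com/export", data={"report": "daily"})
    resp.raise_for_status()
    file_url = resp.text.strip()
    # Step 2: fetch the file itself with a plain GET and save it.
    data = session.get(file_url).content
    with open(filename, "wb") as f:
        f.write(data)
```

Using a `Session` matters when the site ties the export to cookies set during the POST.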
I'm using Python 2.7 (on Windows 7). I'm just trying to read a webpage using urllib and write it to a file. Below is my code.
import urllib

html = urllib.urlopen("http://www.sciencedirect.com/science/article/pii/S027252311730076X").readlines()
print len(html)
# use a raw string so backslashes (e.g. \t) aren't treated as escapes
g = open(r"D:\path\to\output\output.html", 'w')
for i in html:
    g.write(i)
g.close()
But when I compared the page source of the above-mentioned link in the browser (right click -> View page source) with my output HTML file, they are different. A lot of information is missing from my output.html file. Why is that, and how can I get the original page source? I have to write some more code to extract specific info from this page.
Thanks for your help in advance.
I want to write a Python script to automate uploading an image to https://cloud.google.com/vision/ and collect information from the JSON tab there. I need to know how to do it.
So far, I'm only able to open the website in Chrome using the following code:
import webbrowser
url = 'https://cloud.google.com/vision/'
webbrowser.open_new_tab(url + 'doc/')
I tried using urllib2 but couldn't get anything.
Help me out, please.
You have to use the google-cloud-vision library.
There is sample code in the docs:
https://cloud.google.com/vision/docs/reference/libraries#client-libraries-install-python
You can start from there.
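For instance, a minimal label-detection call with the client library looks roughly like this (untested sketch; it assumes `pip install google-cloud-vision` and the `GOOGLE_APPLICATION_CREDENTIALS` environment variable pointing at a service-account key, and the function name is just an illustration):

```python
def detect_labels(image_path):
    """Send a local image to the Cloud Vision API and return
    (label, score) pairs -- the same data shown in the demo's JSON tab.

    Requires google-cloud-vision and valid application credentials.
    """
    from google.cloud import vision  # third-party client library

    client = vision.ImageAnnotatorClient()
    with open(image_path, "rb") as f:
        image = vision.Image(content=f.read())
    response = client.label_detection(image=image)
    return [(label.description, label.score)
            for label in response.label_annotations]
```

This talks to the API directly, so there is no need to drive the web demo through a browser at all.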
Alright, so the issue is that I visit a site to download the file I want, but the website doesn't host the actual file itself; it uses Dropbox to host it. As soon as you click download, you're redirected to a blank page with a small Dropbox popup window that lets you download the file. Things to note: there is no login, so I can point Python right at the link where the Dropbox popup appears, but it won't download the file.
import urllib
url = 'https://thewebsitedownload.com'
filename = 'filetobedownloaded.exe'
urllib.urlretrieve(url, filename)
That's the code I used to use, and it worked like a charm for direct downloads, but now when I try it on the site with the Dropbox popup download, it just ends up downloading the HTML code of the site (from what I can tell) and does not actually download the file.
I am still relatively new to Python and coding in general, but I am loving it so far; this is just the first brick wall I've hit without finding any similar resolutions.
Thanks in advance! Sample code helps so much; that's how I have been learning so far.
Use BeautifulSoup to parse the HTML you get. You can then extract the href link to the file. There are a lot of BeautifulSoup tutorials on the web, so I think you'll find it fairly easy to figure out how to get the link in your specific situation.
First you download the HTML with the code you already have, but without the filename:
import urllib
import re

from bs4 import BeautifulSoup

url = 'https://thewebsitedownload.com'
text = urllib.urlopen(url).read()
soup = BeautifulSoup(text, "html.parser")  # name a parser explicitly
# grab the first link whose href mentions "dropbox"
link = soup.find_all(href=re.compile("dropbox"))[0]['href']
print link

filename = 'filetobedownloaded.exe'
urllib.urlretrieve(link, filename)
I made this from the docs but haven't tested it; I think you get the idea, though.
I would like to make a script (in any language, but preferably Python or Perl) that downloads a specific type of file being streamed by a web page. However, I do not know the file's location, so I will have to find it by listing all the files being streamed by the page and selecting the one I want based on file type.
A similar example would be to say I want to download a video off YouTube, but there is no pattern or way to find the URL except by finding the files being streamed to my computer.
The part I cannot figure out is how to find all the files being streamed by the page. The rest I can do myself. The file name is not mentioned anywhere in the source of the HTML page.
Example of the problem...
This works fine:
import urllib
urllib.urlretrieve ("http://example.com/anything.mp3", "a.mp3")
However this does not:
import urllib
urllib.urlretrieve ("http://example.com/page-where-the-mp3-file-is-being-streamed.html", "a.mp3")
If someone can help me figure out how to download all the files from a page, or how to find the files being streamed, I would really appreciate it. All I need to know is which language/library/method can accomplish this. Thanks.