Alright, so the issue is that I visit a site to download the file I want, but the website doesn't host the actual file itself; it uses Dropbox to host it. As soon as you click download, you're redirected to a blank page where Dropbox pops up in a small window that lets you download the file. Things to note: there is no login, so I can point Python directly at the link where the Dropbox popup appears, but it won't download the file.
import urllib.request

url = 'https://thewebsitedownload.com'
filename = 'filetobedownloaded.exe'
# Fetch the URL and save the response body to filename
urllib.request.urlretrieve(url, filename)
That's the code I used to use, and it worked like a charm for direct downloads. But when I try it on the site with the Dropbox popup download, it just ends up downloading the site's HTML (from what I can tell) rather than the actual file.
I am still relatively new to Python and coding in general, but I am loving it so far; this is just the first brick wall I've hit without finding any similar resolutions.
Thanks in advance! Sample code helps so much; that's how I have been learning so far.
Use BeautifulSoup to parse the HTML you get; you can then pull the href link to the file out of it. There are a lot of BeautifulSoup tutorials on the web, so I think you'll find it fairly easy to figure out how to get the link in your specific situation.
First, download the HTML with the code you already have, but without the filename:
import re
import urllib.request
from bs4 import BeautifulSoup

url = 'https://thewebsitedownload.com'
# Download the page HTML
text = urllib.request.urlopen(url).read()
soup = BeautifulSoup(text, 'html.parser')
# Grab the first link whose href mentions "dropbox"
link = soup.find_all(href=re.compile("dropbox"))[0]['href']
print(link)

filename = 'filetobedownloaded.exe'
urllib.request.urlretrieve(link, filename)
I put this together from the docs and haven't tested it, but I think you get the idea.
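One caveat I'm fairly sure applies here: a Dropbox share link usually ends in ?dl=0, which returns Dropbox's HTML preview page rather than the file itself; rewriting it to dl=1 requests the raw file. A small untested sketch (the share URL below is made up for illustration):

from urllib.parse import urlsplit, urlunsplit

def force_direct_download(href):
    """Rewrite a Dropbox share link so it returns the file instead of the preview page."""
    parts = urlsplit(href)
    query = parts.query.replace('dl=0', 'dl=1')
    if 'dl=1' not in query:
        # No dl parameter at all: append one
        query = query + '&dl=1' if query else 'dl=1'
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, parts.fragment))

# Hypothetical share link, for illustration only
print(force_direct_download('https://www.dropbox.com/s/abc123/file.exe?dl=0'))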
Related
I was hoping someone could help me figure out how to scrape data from this page. I don't know where to start, as I've never worked with scraping or automating downloads in Python, but I'm just trying to find a way to automate downloading all the files on the linked page (and others like it -- just using this one as an example).
There is no discernible pattern in the file names linked; they appear to be random numbers that reference an ID-file name lookup table elsewhere.
For the URL provided above, you can download the zip files with the following code:
import re
import requests
from bs4 import BeautifulSoup

hostname = "http://mis.ercot.com"
r = requests.get(f'{hostname}/misapp/GetReports.do?reportTypeId=13060&reportTitle=Historical%20DAM%20Load%20Zone%20and%20Hub%20Prices&showHTMLView=&mimicKey')
soup = BeautifulSoup(r.text, 'html.parser')

# Every download link on the page goes through the mirDownload servlet
regex = re.compile('.*misdownload/servlets/mirDownload.*')
atags = soup.find_all("a", {"href": regex})
for link in atags:
    data = requests.get(f"{hostname}{link['href']}")
    # Name each file after its doclookupId query parameter
    filename = link["href"].split("doclookupId=")[1][:-1] + ".zip"
    with open(filename, "wb") as savezip:
        savezip.write(data.content)
    print(filename, "saved")
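A small, untested variation on the loop body above: for large zip files it may be worth streaming the response to disk in chunks rather than buffering the whole file in memory through data.content. Same names (hostname, link, filename) as in the code above:

# Stream the zip to disk in 8 KB chunks instead of buffering it all in memory
with requests.get(f"{hostname}{link['href']}", stream=True) as data:
    with open(filename, "wb") as savezip:
        for chunk in data.iter_content(chunk_size=8192):
            savezip.write(chunk)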
Let me know if you have any questions :)
I'm a noob at programming; I just started learning recently. I made a script to automate a process using Selenium for Python. The script logs in to a webpage, scrapes some info, and downloads some PDF files, some through wget and others through requests, because of the webpage's structure and my limitations.
It works well, but if I run it in headless mode it fails to download the file through wget. To get the URL for this file I click on an href and then use the current_url method. Here's where the problem seems to be: when I print the URL, it returns the previous URL, as if the link had never been clicked and the required page never loaded.
It might be important to point out that this href calls a script (I suppose) that builds the PDF and then redirects me to the actual URL, which is public. Here's the href:
https://WEBPAGE.COM/admin.php?method=buildPDF&scode=JT9UM1FL5MP0P57UXFP6R5FT6LPE
I did some research and thought it might be opening the PDF in another tab, so I tried switching tabs, with the same result.
Here's a sample of the code:
driver.get("https://webpage.com/")
time.sleep(5)
# Click the first link that follows the table cell containing `variable`
driver.find_element_by_xpath('//div[text()=' + variable + ']//ancestor::td[1]//following::a[1]').click()
time.sleep(2)
# Read the URL we should have been redirected to
pr = driver.current_url
print(pr)

url = pr
filename = variable + ".pdf"
wget.download(url, 'path')
All other "click()"s in the code work fine and I also tried increasing sleep up to 10 seconds but nothing works.
Any help/advice is really appreciated.
Thanks
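A hedged sketch of one thing to try (not a confirmed fix): instead of a fixed time.sleep, explicitly wait for the browser to navigate away from the old URL before reading current_url. This reuses the question's XPath and its older find_element_by_xpath API, and assumes Selenium's WebDriverWait/expected_conditions helpers:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

old_url = driver.current_url
driver.find_element_by_xpath('//div[text()=' + variable + ']//ancestor::td[1]//following::a[1]').click()

# Block (up to 10 seconds) until the redirect has actually happened
WebDriverWait(driver, 10).until(EC.url_changes(old_url))

pr = driver.current_url  # should now be the post-redirect, public PDF URL
wget.download(pr, 'path')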
How can we save a webpage, including the content in it, so that it is viewable offline, using urllib in Python? Currently I am using the following code:
import urllib.request
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("http://www.yahoo.com")
urllib.request.urlretrieve("http://www.yahoo.com", "C:\\Users\\karanjuneja\\Downloads\\kj\\yahoo.mhtml")
This works and stores an MHTML version of the webpage in the folder, but when you open the file you only find the source code, not the page as it appears online. Do we need to make changes to the code?
Also, is there an alternate way of saving the webpage in MHTML format with all the content as it appears online, and not just the source? Any suggestions?
Thanks, Karan
I guess this site might help you~
Create an MHTML archive
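For what it's worth, a hedged sketch of that idea in Python: Chrome's DevTools Protocol exposes a Page.captureSnapshot command that returns the rendered page as MHTML, and Selenium's Chrome driver can invoke it through execute_cdp_cmd. Untested here, and the output path is only an example:

from selenium import webdriver

driver = webdriver.Chrome()
driver.get("http://www.yahoo.com")

# Ask Chrome itself to serialize the fully rendered page as an MHTML archive
snapshot = driver.execute_cdp_cmd("Page.captureSnapshot", {"format": "mhtml"})

# The MHTML text comes back in the "data" field of the CDP response
with open("yahoo.mhtml", "w", newline="") as f:
    f.write(snapshot["data"])

driver.quit()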
After I connect to a website and get the necessary URLs, the last one automatically triggers the download and Chrome starts downloading the file.
However, in mechanize this doesn't seem to work:
br.click_link(link)
br.retrieve(link.base_url, '~/Documents/test.mp3')
I only get a 7 KB *.mp3 file in my Documents folder, which holds the HTML data in it.
Here's the link I am working on: http://www.mrtzcmp3.net/Ok4PxQ0.mrtzcmp3
It may go bad after a few minutes, but basically, when I click the URL in Chrome I get the mp3 file automatically.
I woke up today and tried this:
# Grab the last link on the page and follow it; the response body is the mp3
link = list(br.links())[-1]
response = br.follow_link(link)
# Write in binary mode, since the payload is an mp3, not text
with open('asd.mp3', 'wb') as f:
    f.write(response.read())
For anyone with the same problem, that works.
I would like to write a script (in any language, but preferably Python or Perl) to download a specific type of file being streamed by a web page. However, I do not know the file's location, so I will have to find it by discovering all the files being streamed by the page and selecting the one I want based on file type.
A similar example: say I want to download a video off YouTube, but there is no pattern or way to find the URL except by finding the files being streamed to my computer.
The part I cannot figure out is how to find all the files being streamed by the page; the rest I can do myself. The file name is not mentioned anywhere in the source of the HTML page.
Example of the problem...
This works fine:
import urllib.request
urllib.request.urlretrieve("http://example.com/anything.mp3", "a.mp3")
However, this does not:
import urllib.request
urllib.request.urlretrieve("http://example.com/page-where-the-mp3-file-is-being-streamed.html", "a.mp3")
If someone can help me figure out how to download all the files from a page, or find the files being streamed, I would really appreciate it. All I need to know is which language/library/method can accomplish this. Thanks
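A hedged sketch of one approach: since the media URL never appears in the HTML source, you have to watch the network traffic itself. The third-party selenium-wire package (pip install selenium-wire) wraps Selenium and records every request the browser makes, so you can filter by content type; the URL and the fixed 10-second wait below are illustrative only:

import time
from seleniumwire import webdriver  # selenium-wire, not plain selenium

driver = webdriver.Chrome()
driver.get("http://example.com/page-where-the-mp3-file-is-being-streamed.html")
time.sleep(10)  # give the page time to start streaming

# driver.requests holds every request the browser made, including script-initiated ones
for request in driver.requests:
    if request.response and "audio" in str(request.response.headers.get("Content-Type", "")):
        print(request.url)  # candidate media URL to download

driver.quit()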