Downloading a file without a direct link in mechanize - python

After I connect to a website and get the necessary URLs, the last one automatically triggers a download and Chrome starts downloading the file.
However, in mechanize this doesn't seem to work:
br.click_link(link)
br.retrieve(link.base_url, '~/Documents/test.mp3')
I only get a 7 KB *.mp3 file in my Documents folder, and it holds HTML data.
Here's the link I am working on: http://www.mrtzcmp3.net/Ok4PxQ0.mrtzcmp3
It may go bad after a few minutes, but basically, when I click the URL in Chrome I get the MP3 file automatically.

I woke up today and tried this:
link = list(br.links())[-1]
response = br.follow_link(link)  # follow_link performs the request; click_link alone is not needed
with open('asd.mp3', 'wb') as f:  # binary mode, or the MP3 bytes get corrupted
    f.write(response.read())
For anyone with the same problem, that works.
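The same idea can be sketched without mechanize, using only the standard library: derive a filename from the URL's last path segment and stream the response to disk in binary mode. This is a generic sketch, not the asker's exact setup; `download.mp3` is just an assumed fallback name.

```python
import shutil
import urllib.request
from pathlib import PurePosixPath
from urllib.parse import urlparse

def filename_from_url(url, default="download.mp3"):
    """Derive a local filename from the last path segment of a URL."""
    name = PurePosixPath(urlparse(url).path).name
    return name or default

def download_binary(url, dest=None):
    """Stream a URL to disk in binary mode so audio bytes are not mangled."""
    dest = dest or filename_from_url(url)
    with urllib.request.urlopen(url) as resp, open(dest, "wb") as out:
        shutil.copyfileobj(resp, out)
    return dest
```

The binary `"wb"` mode is the important part; writing a response in text mode is exactly what produces a corrupt few-KB file.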

Related

Not able to scrape the response from clicking on a Button using Python Scrapy

Thank you to all the wonderful people out there for reading this post and for their help.
For the URL below, I have been trying to understand how to get the Excel files that are downloaded after clicking on the "Download Data" hyperlink. On inspecting this element, I get something like "::before". Not sure what this is.
https://www.moneycontrol.com/mutual-funds/find-fund/returns?&amc=AXMF&EXCLUDE_FIXED_MATURITY_PLANS=Y
I have downloaded files in somewhat similar cases in the past, where such buttons contained a URL pointing directly to the file. I then made use of the requests library to get a bytes response and save the file locally.
However, in this case, I am not able to find the URL to send the request to.
Cheers,
Aakash
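For what it's worth, `::before` is just a CSS pseudo-element used to style the button; it is not the download link. In cases like this the file usually comes from an XHR/fetch request that shows up in the browser DevTools Network tab when you click "Download Data". Once you have that request's URL, a plain stdlib request with browser-like headers often works. The header values below are illustrative stand-ins, not something specific to this site:

```python
import urllib.request

def build_download_request(url, referer):
    """Mimic the browser request that a 'Download Data' button fires.
    Header values are illustrative stand-ins for what DevTools shows."""
    return urllib.request.Request(
        url,
        headers={
            "User-Agent": "Mozilla/5.0",
            "Referer": referer,
        },
    )

# Assumed usage (the endpoint is whatever the Network tab shows, not guessed here):
# req = build_download_request(xhr_url, "https://www.moneycontrol.com/")
# data = urllib.request.urlopen(req).read()
```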

Selenium ChromeDriver headless mode problem: does not load a URL

I'm a noob at programming; I just started learning recently. I made a script to automate a process using Selenium for Python. The script logs in to a webpage, scrapes some info, and downloads some PDF files, some through wget and others through requests, because of the webpage structure and my limitations.
It works well, but if I run it in headless mode it fails to download the file through wget. To get the URL for this file I click on an href and then use the current_url method. Here's where the problem seems to be: I've printed the URL and it returns the previous URL, as if it hadn't clicked on the link, and therefore the required page never loaded.
It might be important to point out that this href calls a script (I suppose) to create the PDF and then redirects me to the actual URL, which is public. Here's the href:
"https://"WEBPAGE.COM"/admin.php?method=buildPDF&scode=JT9UM1FL5MP0P57UXFP6R5FT6LPE"
I did some research and thought it might be opening the PDF in another tab, so I tried switching tabs, with the same result.
Here's a sample of the code:
driver.get("https://webpage.com/")
time.sleep(5)
# the XPath text() comparison needs its own quotes around the variable's value
driver.find_element_by_xpath('//div[text()="' + variable + '"]//ancestor::td[1]//following::a[1]').click()
time.sleep(2)
url = driver.current_url
print(url)
filename = variable + ".pdf"
wget.download(url, 'path')
All the other click()s in the code work fine, and I also tried increasing the sleep up to 10 seconds, but nothing works.
Any help/advice is really appreciated.
Thanks
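Two things worth checking here. First, fixed sleeps race against the redirect; polling until current_url actually changes is more reliable. The helper below is a plain-stdlib stand-in for Selenium's WebDriverWait, and the driver lines in the comment are the assumed usage, not something tested here. Second, headless Chrome blocks file downloads by default unless the download directory is explicitly enabled, which can be a separate cause of this kind of headless-only failure.

```python
import time

def wait_for(predicate, timeout=10.0, interval=0.25):
    """Poll until predicate() is truthy or the timeout expires; returns bool.
    A plain-stdlib stand-in for Selenium's WebDriverWait."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval)
    return bool(predicate())

# Assumed usage with a Selenium driver (not executed here):
# old_url = driver.current_url
# driver.find_element_by_xpath(...).click()
# if wait_for(lambda: driver.current_url != old_url):
#     wget.download(driver.current_url, 'path')
```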

Problems automating getting webpages to .pdf

I am trying to automate the process of downloading webpages with technical documentation which I need to update every year or so.
Here is an example page: http://prod.adv-bio.com/ProductDetail.aspx?ProdNo=1197
From this page, the desired end result would be having all the HTML links saved as PDFs.
I am using wget to download the .pdf files.
I can't use wget to download the HTML files, because the .html links on the page can only be accessed by clicking through from the previous page.
I tried using Selenium to open the links in Firefox and print them to PDFs, but the process is slow, frequently misses links, and my work proxy server forces me to re-authenticate every time I access a page for a different product.
I could open a Chrome browser using chromedriver, but could not handle the print dialog, even after trying pywinauto per an answer to a similar question here.
I tried taking screenshots of the HTML pages using Selenium, but could not find out how to capture the whole webpage without capturing the entire screen.
I have been through a ton of links related to this topic but have yet to find a satisfying solution to this problem.
Is there a cleaner way to do this?
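One way to skip the print dialog entirely is headless Chrome's built-in `--print-to-pdf` flag, driven from a subprocess: no Selenium and no GUI automation. A sketch, assuming Chrome is installed; the binary name varies by platform (`chrome`, `google-chrome`, `chromium`), so it is a parameter here.

```python
import subprocess

def chrome_pdf_cmd(url, out_path, chrome="google-chrome"):
    """Build the headless-Chrome command line that prints a page to PDF."""
    return [
        chrome,
        "--headless",
        "--disable-gpu",
        f"--print-to-pdf={out_path}",
        url,
    ]

# Assumed usage (requires Chrome on PATH; not executed here):
# subprocess.run(chrome_pdf_cmd(
#     "http://prod.adv-bio.com/ProductDetail.aspx?ProdNo=1197",
#     "1197.pdf"), check=True)
```

Looping this over the scraped links avoids the per-page browser session, though a proxy that forces re-authentication may still interfere.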

urllib.urlretrieve() cannot download the file

Hi, there's a button on the web page; if you click it, it'll download a file.
Say the corresponding URL is like this:
http://www.mydata.com/data/filedownload.aspx?e=MyArgu1&k=kfhk22wykq
If I put this URL in the address bar of the browser, it downloads the file properly as well.
Now I do this in Python:
urllib.urlretrieve(url, "myData.csv")
The CSV file is empty. Any suggestions, please?
This may not be possible with every website. If a link has a token, then Python is unlikely to be able to use the link, as it is tied to your browser session.
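If the token is tied to a session cookie rather than the browser itself, copying the cookies out of the browser's DevTools into the request sometimes works. A minimal stdlib sketch, assuming the cookie names and values are ones you read from your own logged-in browser:

```python
import urllib.request

def cookie_header(cookies):
    """Serialize a dict of browser cookies into a Cookie header value."""
    return "; ".join(f"{name}={value}" for name, value in cookies.items())

def fetch_with_cookies(url, cookies):
    """Fetch a URL while presenting the given cookies, as a browser would."""
    req = urllib.request.Request(url, headers={"Cookie": cookie_header(cookies)})
    with urllib.request.urlopen(req) as resp:
        return resp.read()

# Assumed usage (the cookie name here is a placeholder, not the real site's):
# data = fetch_with_cookies(url, {"ASP.NET_SessionId": "value from DevTools"})
# with open("myData.csv", "wb") as f:
#     f.write(data)
```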

Windows Python: How to download a Dropbox popup from a website?

Alright, so the issue is that I visit a site to download the file I want, but the website I try to download the file from doesn't host the actual file; instead it uses Dropbox to host it. So as soon as you click download, you're redirected to a blank page where Dropbox pops up in a small window allowing you to download the file. Things to note: there is no login, so I can point Python right at the link where Dropbox pops up, but it won't download the file.
import urllib
url = 'https://thewebsitedownload.com'
filename = 'filetobedownloaded.exe'
urllib.urlretrieve(url, filename)
That's the code I used to use, and it worked like a charm for direct downloads, but now when I try it on the site with the Dropbox popup, it just ends up downloading the HTML code of the site (from what I can tell) and does not actually download the file.
I am still relatively new to Python/coding in general, but I am loving it so far; this is just the first brick wall I've hit without finding any similar resolutions.
Thanks in advance! Sample code helps so much; that's how I have been learning so far.
Use BeautifulSoup to parse the HTML you get. You can then get the href link to the file. There are a lot of BeautifulSoup tutorials on the web, so I think you'll find it fairly easy to figure out how to get the link in your specific situation.
First you download the HTML with the code you already have, but without the filename:
import urllib
import re
from bs4 import BeautifulSoup

url = 'https://thewebsitedownload.com'
text = urllib.urlopen(url).read()
soup = BeautifulSoup(text, "html.parser")  # name a parser explicitly
# grab the first link whose href mentions dropbox
link = soup.find_all(href=re.compile("dropbox"))[0]['href']
print link
filename = 'filetobedownloaded.exe'
urllib.urlretrieve(link, filename)
I made this from the docs and haven't tested it, but I think you get the idea.
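One detail worth adding: a Dropbox share link whose query string says `dl=0` serves an HTML preview page, while `dl=1` serves the raw file bytes. Rewriting the scraped href before downloading it can make the difference; a sketch of that rewrite (Python 3 stdlib here, unlike the Python 2 code above):

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

def to_direct_dropbox_link(url):
    """Rewrite a Dropbox share link so it serves the file bytes (dl=1)
    instead of the HTML preview page (dl=0)."""
    parts = urlsplit(url)
    query = dict(parse_qsl(parts.query))
    query["dl"] = "1"
    return urlunsplit(parts._replace(query=urlencode(query)))
```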
