I'm new to programming; I just started learning recently. I made a script to automate a process using Selenium for Python. The script logs in to a webpage, scrapes some info, and downloads some PDF files, some through wget and others through requests, because of the webpage structure and my limitations.
It works well, but if I run it in headless mode it fails to download the file through wget. To get the URL for this file I click on an href and then use the current_url method. Here is where the problem seems to be, as I've printed the URL and it returns the previous URL, as if it hadn't clicked on the link and therefore hadn't loaded the required page.
It might be important to point out that this href calls a script (I suppose) to create the PDF and then redirects me to the actual URL, which is public. Here's the href:
https://WEBPAGE.COM/admin.php?method=buildPDF&scode=JT9UM1FL5MP0P57UXFP6R5FT6LPE
I did some research and thought it might be opening the PDF in another tab, so I tried switching tabs, with the same result.
Here's a sample of the code:
import time
import wget

driver.get("https://webpage.com/")
time.sleep(5)
# the variable's value has to be quoted inside the XPath string, otherwise the expression is invalid
driver.find_element_by_xpath('//div[text()="' + variable + '"]//ancestor::td[1]//following::a[1]').click()
time.sleep(2)
url = driver.current_url
print(url)  # still prints the previous URL in headless mode
filename = variable + ".pdf"
wget.download(url, 'path')
All the other click()s in the code work fine, and I also tried increasing the sleep to up to 10 seconds, but nothing works.
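A possible workaround (a sketch, not tested against the real site; the href is the one from the question, but reading it via get_attribute instead of clicking is my assumption) is to take the link's href attribute directly, rather than clicking and hoping current_url updates. The target address, including the scode, can then be inspected with the standard library:

```python
from urllib.parse import urlparse, parse_qs

# In the script this would come from the located element, e.g.:
#   href = element.get_attribute("href")
# Hard-coded here with the href from the question for illustration:
href = "https://WEBPAGE.COM/admin.php?method=buildPDF&scode=JT9UM1FL5MP0P57UXFP6R5FT6LPE"

# parse_qs turns the query string into a dict of lists
query = parse_qs(urlparse(href).query)
print(query["method"][0])
print(query["scode"][0])
```

If the buildPDF endpoint ends up at a public URL, passing this href straight to wget.download (or requests) may avoid depending on current_url at all.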
Any help/advice is really appreciated.
Thanks
I am trying to automate the process of downloading webpages with technical documentation, which I need to update every year or so.
Here is an example page: http://prod.adv-bio.com/ProductDetail.aspx?ProdNo=1197
From this page, the desired end result would be having all the HTML links saved as PDFs.
I am using wget to download the .pdf files
I can't use wget to download the html files, because the .html links on the page can only be accessed by clicking through from the previous page.
I tried using Selenium to open the links in Firefox and print them to PDFs, but the process is slow, frequently misses links, and my work proxy server forces me to re-authenticate every time I need to access a page for a different product.
I could open a Chrome browser using ChromeDriver, but could not handle the print dialog, even after trying pywinauto per an answer to a similar question here.
I tried taking screenshots of the HTML pages using Selenium, but could not figure out how to capture the whole webpage without capturing the entire screen.
I have been through a ton of links related to this topic but have yet to find a satisfying solution to this problem.
Is there a cleaner way to do this?
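One dialog-free alternative (a sketch only: the Chrome binary name and output filename are assumptions, and the command is built but deliberately not executed here) is headless Chrome's built-in --print-to-pdf flag, driven from Python with subprocess instead of automating a print dialog:

```python
import subprocess

def print_to_pdf_cmd(url, out_path, chrome="chrome"):
    """Build a headless-Chrome command line that prints a page straight
    to a PDF file, bypassing the print dialog entirely."""
    return [chrome, "--headless", "--disable-gpu",
            "--print-to-pdf=" + out_path, url]

cmd = print_to_pdf_cmd(
    "http://prod.adv-bio.com/ProductDetail.aspx?ProdNo=1197", "1197.pdf")
print(" ".join(cmd))
# subprocess.run(cmd, check=True)  # uncomment where Chrome is installed
```

Because each page is a separate short-lived Chrome process, there is no browser session to babysit, though the proxy re-authentication issue would still need handling (e.g. proxy flags or credentials in the URL).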
I'm building a web crawler which needs to read links inside a webpage, for which I'm using Python's urllib2 library to open and read the websites.
I found a website where I'm unable to fetch any data.
The URL is "http://www.biography.com/people/michael-jordan-9358066"
My code:
import urllib2
response = urllib2.urlopen("http://www.biography.com/people/michael-jordan-9358066")
print response.read()
The content I get from the above code is very different from what I see if I open the same page in a browser; the output from the code does not include any of the page's data.
I thought it could be because of a delay in loading the web page, so I introduced a delay. Even after the delay, the response is the same.
import time

response = urllib2.urlopen("http://www.biography.com/people/michael-jordan-9358066")
time.sleep(20)  # note: urlopen has already fetched the response, so this delay changes nothing
print response.read()
The web page opens perfectly fine in a browser.
However, the above code works fine for reading Wikipedia and some other websites.
I'm unable to find the reason behind this odd behaviour. Please help, thanks in advance.
What you are experiencing is most likely the effect of a dynamic web page. Such pages do not have static content for urllib or requests to fetch; the data is loaded by JavaScript after the page arrives in the browser. You can use Python's Selenium to solve this.
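For instance, a minimal sketch (the headless-Firefox setup is an assumption on my part; the selenium import is deferred into the function so the helper can be defined even where selenium isn't installed):

```python
def fetch_rendered_html(url):
    """Return the page's HTML after its JavaScript has run,
    using a headless Firefox driven by selenium."""
    from selenium import webdriver  # deferred: selenium is an optional dependency here
    options = webdriver.FirefoxOptions()
    options.add_argument('-headless')
    driver = webdriver.Firefox(options=options)
    try:
        driver.get(url)
        return driver.page_source  # rendered HTML, unlike urllib2's raw response
    finally:
        driver.quit()

# html = fetch_rendered_html("http://www.biography.com/people/michael-jordan-9358066")
```

The key difference from urllib2 is that page_source is read after the browser has executed the page's scripts, so the dynamically loaded data is present.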
After I connect to a website and get the necessary URLs, at the last one the download triggers automatically and Chrome starts to download the file.
However, in mechanize this doesn't seem to work:
br.click_link(link)
br.retrieve(link.base_url, '~/Documents/test.mp3')
I only get a 7 KB .mp3 file in my Documents folder, which holds HTML data in it.
Here's the link I am working on: http://www.mrtzcmp3.net/Ok4PxQ0.mrtzcmp3
It may go bad after a few minutes, but basically, when I click the URL in Chrome I get the mp3 file automatically.
I woke up today and tried this:
link = [l for l in br.links()][-1]  # the last link on the page
response = br.follow_link(link)  # follow_link actually fetches the target
open('asd.mp3', 'wb').write(response.read())  # 'wb': the mp3 is binary data
For anyone with the same problem, that works.
I've just started to learn coding this month, and started with Python. I would like to automate a simple task (my first project): visit a company's career website, retrieve all the jobs posted for the day, and store them in a file. So this is what I would like to do, in sequence:
Go to http://www.nov.com/careers/jobsearch.aspx
Select the option - 25 Jobs per page
Select the date option - Today
Click on Search for Jobs
Store results in a file (just the job titles)
I looked around and found that Selenium is the best way to go about handling .aspx pages.
I have done steps 1-4 using Selenium. However, there are two issues:
I do not want the browser opening up. I just need the output saved to a file.
Even if I am OK with the browser popping up, using the Python code (exported from Selenium as WebDriver) in IDLE (I have Windows OS) results in errors. When I run the Python code, the browser opens and the link loads, but none of the form selections happen, and I get the following error message (link below) before the browser closes. So what does the error message mean?
http://i.stack.imgur.com/lmcDz.png
Any help/guidance will be appreciated...Thanks!
First, about the error you've got: judging by the NoSuchElementException and the message Unable to locate element, the selector you provided is wrong and the web driver can't find the element.
Well, since you did not post your code and I can't open the link to the website you entered, I can only give you a sample code, and I will include as much detail as I can.
from selenium import webdriver

driver = webdriver.Firefox()
driver.get("url")

# select the "25 jobs per page" option
number_option = driver.find_element_by_id("id_for_25_option_indicator")
number_option.click()
# select the "Today" date option
date_option = driver.find_element_by_id("id_for_today_option_indicator")
date_option.click()
# click "Search for Jobs"
search_button = driver.find_element_by_id("id_for_search_button")
search_button.click()

# grab every job result and write the titles to a file
all_results = driver.find_elements_by_xpath("some_xpath_that_is_common_between_all_job_results")
result_file = open("result_file.txt", "w")
for result in all_results:
    result_file.write(result.text + "\n")
result_file.close()
driver.close()
Since you said you just started to learn coding recently, I think I should give some explanations:
I recommend using driver.find_element_by_id wherever an element has an ID property; it's more robust.
Instead of result.text, you can use result.get_attribute("value") or result.get_attribute("innerHTML").
That's all that came to mind for now, but it would be better if you posted your code so we can see what is wrong with it. Additionally, it would be great if you gave me a new link to the website, so I can add more detail to the code; your current link is broken.
Concerning the first issue, you can simply use a headless browser. This is possible with Chrome as well as Firefox.
Check Grey Li's answer here for example: Python - Firefox Headless
from selenium import webdriver

options = webdriver.FirefoxOptions()
options.add_argument('-headless')  # note the leading dash for Firefox's headless flag
driver = webdriver.Firefox(options=options)
Alright, so the issue is that I visit a site to download the file I want, but the website I try to download the file from doesn't host the actual file; instead, it uses Dropbox to host it. As soon as you click download, you're redirected to a blank page with a small Dropbox pop-up window that lets you download the file. Things to note: there is no login, so I can point Python right at the link where the Dropbox pop-up appears, but it won't download the file.
import urllib
url = 'https://thewebsitedownload.com'
filename = 'filetobedownloaded.exe'
urllib.urlretrieve(url, filename)
That's the code I used to use, and it worked like a charm for direct downloads, but now, when I try it on the site with the Dropbox pop-up download, it just ends up downloading the site's HTML code (from what I can tell) and does not actually download the file.
I am still relatively new to Python and coding in general, but I am loving it so far; this is just the first brick wall I have hit without finding any similar resolutions.
Thanks in advance! Sample code helps so much; that's how I have been learning so far.
Use BeautifulSoup to parse the HTML you get. You can then get the href link to the file. There are a lot of BeautifulSoup tutorials on the web, so I think you'll find it fairly easy to figure out how to get the link in your specific situation.
First you download the HTML with the code you already have, but without the filename:
import urllib
import re
from bs4 import BeautifulSoup

url = 'https://thewebsitedownload.com'
text = urllib.urlopen(url).read()

soup = BeautifulSoup(text, "html.parser")  # name a parser explicitly
# first link whose href mentions dropbox
link = soup.find_all(href=re.compile("dropbox"))[0]['href']
print link

filename = 'filetobedownloaded.exe'
urllib.urlretrieve(link, filename)
I made this from the docs and haven't tested it, but I think you get the idea.
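If BeautifulSoup isn't available, the same href hunt can be sketched with nothing but the standard library (Python 3 syntax here; the HTML snippet is made up for illustration, standing in for the real download page):

```python
from html.parser import HTMLParser

class DropboxLinkFinder(HTMLParser):
    """Collect every href on the page that points at dropbox."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value and "dropbox" in value:
                    self.links.append(value)

# Made-up stand-in for the downloaded page:
sample_html = '<p><a href="https://www.dropbox.com/s/abc123/file.exe?dl=1">Download</a></p>'
finder = DropboxLinkFinder()
finder.feed(sample_html)
print(finder.links[0])  # the Dropbox href to pass to urlretrieve
```

Once extracted, the link goes into urlretrieve exactly as in the snippet above; appending ?dl=1 to a Dropbox share link usually requests the raw file rather than the preview page.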