Selenium Webdriver Download CSV - python

I have been struggling for a couple of days to download a CSV file using Selenium. Please advise; much appreciated!
I am using the Selenium WebDriver language bindings for Python (2.4) with the HtmlUnit browser.
Code:
browser.find_element_by_id("generate_csv").click()
csv_file = browser.page_source
On that page, if I use Firefox, clicking the "generate_csv" button generates a CSV file and usually downloads it. Since it is difficult to implement file downloads with HtmlUnit, I use the page_source attribute to get the CSV content instead.
Sometimes this succeeds, but sometimes it throws an error:
org.openqa.selenium.NoSuchElementException: Returned node was not an HTML element
Could someone help me analyze why this happens? I am so confused; running the script is like rolling dice.
Thank you.
Update (part of the traceback):
14:29:15.913 INFO - Executing: [find element: By.selector: .controlbuttons > a > img[alt='CSV']])
14:29:16.404 WARN - Exception thrown
org.openqa.selenium.NoSuchElementException: Returned node was not an HTML element
For documentation on this error, please visit: ...
Driver info: driver.version: EventFiringWebDriver
at org.openqa.selenium.htmlunit.HtmlUnitDriver.findElementByCssSelector(HtmlUnitDriver.java:952)
at org.openqa.selenium.By$ByCssSelector.findElement(By.java:426)
at org.openqa.selenium.htmlunit.HtmlUnitDriver$5.call(HtmlUnitDriver.java:1565)
at org.openqa.selenium.htmlunit.HtmlUnitDriver$5.call(HtmlUnitDriver.java:1)
at org.openqa.selenium.htmlunit.HtmlUnitDriver.implicitlyWaitFor(HtmlUnitDriver.java:1241)
at org.openqa.selenium.htmlunit.HtmlUnitDriver.findElement(HtmlUnitDriver.java:1562)
at org.openqa.selenium.htmlunit.HtmlUnitDriver.findElement(HtmlUnitDriver.java:530)
at sun.reflect.GeneratedMethodAccessor129.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.openqa.selenium.support.events.EventFiringWebDriver$2.invoke(EventFiringWebDriver.java:101)
at com.sun.proxy.$Proxy14.findElement(Unknown Source)
at org.openqa.selenium.support.events.EventFiringWebDriver.findElement(EventFiringWebDriver.java:184)
at org.openqa.selenium.remote.server.handler.FindElement.call(FindElement.java:47)
at org.openqa.selenium.remote.server.handler.FindElement.call(FindElement.java:1)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at org.openqa.selenium.remote.server.DefaultSession$1.run(DefaultSession.java:169)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
at java.lang.Thread.run(Thread.java:695)
14:29:16.405 WARN - Exception: Returned node was not an HTML element

It sounds like your HTML doesn't finish loading before you click the generate-CSV button. This happens a lot with Selenium when the HTML is loaded by JavaScript, at least for me.
I'm not sure this is the greatest way to handle it, but I would use a recursive method that retries the click until it succeeds (see also the explicit-wait sketch below the code)...
import time
from selenium.common.exceptions import NoSuchElementException

def generateCsv(browser):
    try:
        browser.find_element_by_id("generate_csv").click()
        return browser.page_source
    except NoSuchElementException:
        # The button isn't in the DOM yet; wait and retry
        time.sleep(3)
        return generateCsv(browser)
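Alternatively, here is a sketch using Selenium's explicit waits instead of recursion (WebDriverWait and expected_conditions ship with the Python bindings; the element id is taken from your snippet):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def generate_csv(browser, timeout=30):
    # Block until the button is present and clickable, then click it
    button = WebDriverWait(browser, timeout).until(
        EC.element_to_be_clickable((By.ID, "generate_csv")))
    button.click()
    return browser.page_source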
Hope that helps

Related

Get HTML code of PDF file opened with firefox's built-in PDF Reader in Python

You're probably aware of Firefox's built-in PDF viewer. I'm trying to get the HTML code of the page displaying the file, since it would give me more information than the PyPDF library in Python.
Obviously requests does not work because it's not a real link, so I thought about using Selenium (maybe in headless mode) with the webdriver.page_source attribute:
from selenium.webdriver.firefox.service import Service
from selenium.webdriver.firefox.options import Options
from selenium import webdriver
import os
# Point Selenium at a local geckodriver binary
service = Service(os.path.abspath('Files/geckodriver'))
options = Options()
# options.headless = True
driver = webdriver.Firefox(service=service, options=options)
# Open the local PDF in Firefox's built-in viewer
driver.get(f'file://{os.path.abspath("Files/sample.pdf")}')
html = driver.page_source
print(html)
The problem is that this does not give me the complete source code, just the title and number of each page plus a reference to the PDF file. For example, here's page number 2:
<div class="thumbnail" data-page-number="2"><div class="thumbnailSelectionRing" style="width: 100px; height: 132px;"></div></div>.
I got the same thing for every page, with no way to know the content of each paragraph (note that I've cut out the code for the toolbar on the left side for explanation's sake), while in the Firefox viewer I could see all the content.
So I'd like to know whether some other method exists, or whether I have to fix something in my code.
Firefox is completely open source (its built-in viewer is the PDF.js project), so you can browse the source code to see how that part of it works. If you do not want to dig into the browser's source code, then please be clearer in your question.
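If the goal is the text content rather than the raw markup, one possible approach is to wait for the viewer's text layer and read it. A sketch, assuming the built-in viewer is PDF.js, which renders selectable text into span elements inside a .textLayer div (that selector is an assumption and may change between Firefox versions):

import os
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
driver.get(f'file://{os.path.abspath("Files/sample.pdf")}')

# Wait for PDF.js to finish rendering the text layer
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, '.textLayer span')))

text = '\n'.join(span.text for span in
                 driver.find_elements(By.CSS_SELECTOR, '.textLayer span'))
print(text)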

How to get the .pdf download at the end of a redirect chain with Selenium?

I have tried every method I can think of for getting the pdf from the link: http://apps.colorado.gov/dora/licensing/Lookup/LicenseLookup.aspx?docExternal=926241&docGuid=8DC9BB72-A921-45E7-9BCD-358846FCE54D
I have tried:
Clicking the button for this link
Opening the href manually in the webdriver
Using WebDriverWait and various commands to wait for URL switches or the appearance of certain URLs
Sleeping and re-getting the page_source
Using a try statement to override the TimeOut exception and trying to issue more commands from there
Every attempt at opening this link results in a timeout exception, even though it works just fine manually.
It looks like it runs through 2(?) redirects before landing on the PDF file I'd like to grab. Is there anyone out there with Selenium experience who can point me in the right direction for getting this PDF? I'm running Selenium with ChromeDriver in a Python script.
ANSWER:
import requests

download_buttons = self.browser.find_elements_by_link_text("External Document")
for button in download_buttons:
    new_file_path = f'{blah}.pdf'  # {blah} stands in for your own naming scheme
    link = button.get_attribute("href")
    # requests follows the redirect chain that kept timing out in Selenium
    download_link = requests.get(link, allow_redirects=True)
    try:
        with open(new_file_path, 'wb') as new_file:
            new_file.write(download_link.content)
    except Exception as e:
        self.print_error(f"Failed to write file: {e}")
I cannot comment yet as I'm new. When you find the URL that should contain the document, you can call the requests library similar to how this answer does it: Get a file from an ASPX webpage using Python
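If the endpoint only serves the file to an authenticated browser session (an assumption; many ASPX endpoints behave that way), a sketch of copying the Selenium cookies into the requests session first, where browser is the WebDriver instance and link and new_file_path come from the answer above:

import requests

session = requests.Session()
# Copy the browser's cookies so the download request is authorized
for cookie in browser.get_cookies():
    session.cookies.set(cookie['name'], cookie['value'])

response = session.get(link, allow_redirects=True)
with open(new_file_path, 'wb') as new_file:
    new_file.write(response.content)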

Facing issues while scraping data from a table using python with selenium

I've written a script using Python in combination with Selenium to parse a table from a target page, which can be reached by following the steps I've described below for clarity. The script does reach the destination, but when scraping data from the table it throws an error in the console: "Unable to locate element". I tried an online XPath tester to see whether the expression was wrong, but found that the XPath I've used for "td_data" is right. What I'm missing here is beyond my knowledge; I hope somebody can take a look into it and provide me with a workaround.
Btw, the site link is given in my script.
Link to see the html contents for the table: "https://www.dropbox.com/s/kaom5qzk78xndqn/Partial%20Html%20content%20for%20the%20table.txt?dl=0"
Steps to reach the target page which my script is able to maintain:
Selecting "I've read and understand above"
Putting the keyword "pump" in the input box located right below "Select medical devices".
Selecting the checkbox "Devices found for "pump".
Finally, pressing the search button
Script I've tried with so far:
from selenium import webdriver
import time
driver = webdriver.Chrome()
driver.get('http://apps.tga.gov.au/Prod/devices/daen-entry.aspx')
driver.find_element_by_id('disclaimer-accept').click()
time.sleep(5)
driver.find_element_by_id('medicine-name').send_keys('pump')
time.sleep(8)
driver.find_element_by_id('medicines-header-text').click()
driver.find_element_by_id('submit-button').click()
time.sleep(7)
for item in driver.find_elements_by_xpath('//div[@class="table-responsive"]'):
    for tr_data in item.find_elements_by_xpath('.//tr'):
        td_data = tr_data.find_element_by_xpath('.//span[@class="hovertext"]//a')
        print(td_data.text)
driver.close()
Why don't you just do this:
from selenium import webdriver
import time
driver = webdriver.Chrome()
driver.get('http://apps.tga.gov.au/Prod/devices/daen-entry.aspx')
driver.find_element_by_id('disclaimer-accept').click()
time.sleep(5)
driver.find_element_by_id('medicine-name').send_keys('pump')
time.sleep(8)
driver.find_element_by_id('medicines-header-text').click()
driver.find_element_by_id('submit-button').click()
time.sleep(7)
for item in driver.find_elements_by_xpath(
        '//table[@id]/tbody/tr/td[@class]/span[@class]/a[@id]'):
    print(item.text)
driver.close()
Output:
27233
27283
27288
27289
27390
27413
27441
27520
25445
27816
27866
27970
28033
28238
26999
28264
28407
28448
28437
28509
28524
28553
28647
28677
28646
You might also think about saving the page with driver.page_source, pulling out the table, and saving it as an HTML file. Then use pandas.read_html to load the table into a DataFrame.
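A sketch of that pandas approach (read_html needs lxml or html5lib installed, and the index 0 assumes the results table is the first table on the page):

import pandas as pd

# Parse every <table> in the rendered page into a DataFrame
tables = pd.read_html(driver.page_source)
df = tables[0]
print(df.head())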

selenium works on local and not on azure server

I am trying to get the video URL from links on this page. The video link can be seen at https://in.news.yahoo.com/video/jaguar-fighter-aircraft-crashes-near-084300217.html (open in Chrome).
For that I wrote the following Chrome WebDriver code:
import os
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from pyvirtualdisplay import Display

chromedriver = '/usr/local/bin/chromedriver'
os.environ['webdriver.chrome.driver'] = chromedriver
# Run Chrome inside a virtual display, since the server has no GUI
display = Display(visible=0, size=(800, 600))
display.start()
driver = webdriver.Chrome(chromedriver)
driver.get('https://in.news.yahoo.com/video/jaguar-fighter-aircraft-crashes-near-084300217.html')
try:
    # Wait up to 20 seconds for the video container to appear
    element = WebDriverWait(driver, 20).until(
        lambda driver: driver.find_elements_by_class_name('yvp-main'))
    self.yahoo_video_trend = []
    for s in driver.find_elements_by_class_name('yvp-main'):
        # `item` is defined by an outer loop in the original script (not shown)
        print "Processing link - ", item['link']
        trend = item
        print item['description']
        trend['video_link'] = s.find_element_by_tag_name('video').get_attribute('src')
        print
        print s.find_element_by_tag_name('video').get_attribute('src')
        self.yahoo_video_trend.append(trend)
except:
    # This snippet is excerpted from a method, hence self and return
    return
This works fine on my local system, but when I run it on my Azure server it does not give any result at s.find_element_by_tag_name('video').get_attribute('src').
I have installed Chrome on my Azure server.
Update :
Please note, I have already tried requests and BeautifulSoup, but as Yahoo loads the HTML content dynamically from JSON, I could not get it using them.
And yes, the Azure server is a plain Linux system with command-line access, not a web app.
I tried to reproduce your issue using your code. However, I found there was no tag named video on that page (https://in.news.yahoo.com/video/jaguar-fighter-aircraft-crashes-near-084300217.html), testing with both IE and Chrome.
I used the developer tools to check the HTML code (screenshot omitted).
It seems that this page uses a Flash player to play the video, not an HTML5 video control.
For this reason, I suggest you check whether your code uses the right tag name.
Any concerns, please feel free to let me know.
We tried to reproduce the error on our side. I was not able to get the Chrome driver to work, but I did try the Firefox driver and it worked fine: it was able to load the page and get the link via the URL.
Can you change your code to print the exception and send it to us, so we can see where the script is failing?
Change your code from:
except:
    return
to:
try:
    # do your work here
except Exception, e:
    print str(e)
Send us the exception, so we can take a look.

How to get data from inspect element of a webpage using Python

I'd like to get the data from "inspect element" using Python. I'm able to download the source code using BeautifulSoup, but now I need the text shown in the inspect-element view of a webpage. I'd truly appreciate it if you could advise me how to do it.
Edit:
By inspect element I mean: in Google Chrome, right-clicking gives us an option called "Inspect element", which shows the code related to each element of that particular page. I'd like to extract that code, or just its text strings.
If you want to automatically fetch a web page from Python in a way that runs Javascript, you should look into Selenium. It can automatically drive a web browser (even a headless web browser such as PhantomJS, so you don't have to have a window open).
In order to get the HTML, you'll need to evaluate some javascript. Simple sample code, alter to suit:
from selenium import webdriver
driver = webdriver.PhantomJS()
driver.get("http://google.com")
# This will get the initial html - before javascript
html1 = driver.page_source
# This will get the html after on-load javascript
html2 = driver.execute_script("return document.documentElement.innerHTML;")
Note 1: If you want a specific element or elements, you actually have a couple of options -- parse the HTML in Python, or write more specific JavaScript that returns what you want.
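For instance, a minimal sketch of that second option (the #content selector is only a placeholder for whatever element you are after):

# Return the outerHTML of one specific element via JavaScript
element_html = driver.execute_script(
    "return document.querySelector('#content').outerHTML;")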
Note 2: if you actually need specific information from Chrome's tools that is not just dynamically generated HTML, you'll need a way to hook into Chrome itself. No way around that.
I would like to update the answer from Jason S. I wasn't able to start PhantomJS on OS X:
driver = webdriver.PhantomJS()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/selenium/webdriver/phantomjs/webdriver.py", line 50, in __init__
self.service.start()
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/selenium/webdriver/phantomjs/service.py", line 74, in start
raise WebDriverException("Unable to start phantomjs with ghostdriver.", e)
selenium.common.exceptions.WebDriverException: Message: Unable to start phantomjs with ghostdriver.
Resolved by the answer here, by downloading the executable and pointing to it explicitly:
driver = webdriver.PhantomJS("phantomjs-2.0.0-macosx/bin/phantomjs")
Inspect element shows the HTML of the page, which for a static page is the same as fetching the HTML using urllib. Do something like this:
import urllib
from bs4 import BeautifulSoup as BS

html = urllib.urlopen(URL).read()
soup = BS(html)
# findAll returns a list, so call get_text() on each match
for element in soup.findAll(tag_name):
    print element.get_text()
BeautifulSoup can be used to parse the HTML document and extract anything you want; it's not designed for downloading. You can find the elements you want by their class and id.
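For example, a sketch (the tag name, class, and id here are placeholders):

from bs4 import BeautifulSoup

soup = BeautifulSoup(html)
# find() returns the first matching element, or None
element = soup.find('div', attrs={'class': 'content', 'id': 'main'})
if element is not None:
    print element.get_text()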
