I am trying to get the video URL from the links on this page. The video can be seen at https://in.news.yahoo.com/video/jaguar-fighter-aircraft-crashes-near-084300217.html (open it in Chrome).
For that I wrote the following ChromeDriver code:
import os
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from pyvirtualdisplay import Display

# Excerpt from a method: `self` and `item` are defined by the surrounding code.
chromedriver = '/usr/local/bin/chromedriver'
os.environ['webdriver.chrome.driver'] = chromedriver
# Run Chrome inside a virtual display, since the server has no screen.
display = Display(visible=0, size=(800,600))
display.start()
driver = webdriver.Chrome(chromedriver)
driver.get('https://in.news.yahoo.com/video/jaguar-fighter-aircraft-crashes-near-084300217.html')
try:
    # Wait up to 20 seconds for the video player container to appear.
    element = WebDriverWait(driver, 20).until(lambda driver: driver.find_elements_by_class_name('yvp-main'))
    self.yahoo_video_trend = []
    for s in driver.find_elements_by_class_name('yvp-main'):
        print "Processing link - ", item['link']
        trend = item
        print item['description']
        trend['video_link'] = s.find_element_by_tag_name('video').get_attribute('src')
        print
        print s.find_element_by_tag_name('video').get_attribute('src')
        self.yahoo_video_trend.append(trend)
except:
    return
This works fine on my local system, but when I run it on my Azure server it does not return any result at s.find_element_by_tag_name('video').get_attribute('src').
I have installed Chrome on my Azure server.
Update:
Please note that I have already tried requests and BeautifulSoup, but because Yahoo loads the HTML content dynamically from JSON, I could not get the URL with them.
And yes, the Azure server is a plain Linux system with command-line access. It is not an application.
I tried to reproduce your issue using your code. However, I found that there was no tag named video on that page ('https://in.news.yahoo.com/video/jaguar-fighter-aircraft-crashes-near-084300217.html'), testing with both IE and Chrome.
I used the developer tools to inspect the HTML.
It seems that this page uses the Flash player to play the video, not the HTML5 video element.
For this reason, I suggest checking whether your code uses the right tag name.
If you have any concerns, please feel free to let me know.
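One way to run that check from a script rather than by eye is to count the relevant tags in the page source. This is only a sketch in Python 3 using the standard library; the `sample` markup is a made-up stand-in for `driver.page_source`, and the tag list (`video`, `object`, `embed`) is an assumption about what a Flash or HTML5 player would use.

```python
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts how often each tag of interest opens in an HTML document."""

    def __init__(self, names):
        super().__init__()
        self.counts = {name: 0 for name in names}

    def handle_starttag(self, tag, attrs):
        # html.parser lower-cases tag names before calling this hook.
        if tag in self.counts:
            self.counts[tag] += 1


def count_player_tags(page_source, names=("video", "object", "embed")):
    """Return a dict mapping each tag name to its number of occurrences."""
    counter = TagCounter(names)
    counter.feed(page_source)
    return counter.counts


# Hypothetical markup standing in for driver.page_source.
sample = '<div class="yvp-main"><object type="application/x-shockwave-flash"></object></div>'
print(count_player_tags(sample))
```

If the count for `video` is zero on the server but not locally, the two browsers are probably being served different players.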
We tried to reproduce the error on our side. I was not able to get the Chrome driver to work, but I did try the Firefox driver and it worked fine: it loaded the page and got the link from the URL.
Can you change your code to print the exception and send it to us, so we can see where the script is failing?
Change this part of your code:
except:
    return
to:
except Exception, e:
    print str(e)
Send us the exception so we can take a look.
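The same idea in Python 3 syntax, as a small sketch: wrap the failing step so the exception type, message, and traceback are printed instead of being swallowed. The helper name `run_and_report` and the failing lambda are illustrative only, not part of the original script.

```python
import traceback


def run_and_report(step):
    """Run step(); on failure print the exception and traceback, return None."""
    try:
        return step()
    except Exception as exc:
        print("Step failed: %s: %s" % (type(exc).__name__, exc))
        traceback.print_exc()
        return None


# Hypothetical failing step: looking up a key that does not exist.
result = run_and_report(lambda: {}["video_link"])
print(result)
```

In the scraper, `step` would be a function containing the body of the original try block.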
Related
I am using Selenium WebDriver to try to scrape information from realestate.com.au. Here is my code:
from selenium.webdriver import Chrome
from bs4 import BeautifulSoup
path = r'C:\Program Files (x86)\Google\Chrome\Application\chromedriver.exe'  # raw string keeps the backslashes literal
url = 'https://www.realestate.com.au/buy'
url2 = 'https://www.realestate.com.au/property-house-nsw-castle+hill-134181706'
webdriver = Chrome(path)
webdriver.get(url)
soup = BeautifulSoup(webdriver.page_source, 'html.parser')
print(soup)
It works fine with url, but when I try to open url2 the same way, it opens a blank page. I checked the console and got the following:
"Failed to load resource: the server responded with a status of 429 ()
about:blank:1 Failed to load resource: net::ERR_UNKNOWN_URL_SCHEME
149e9513-01fa-4fb0-aad4-566afd725d1b/2d206a39-8ed7-437e-a3be-862e0f06eea3/fingerprint:1 Failed to load resource: the server responded with a status of 404 ()"
While url was open, I tried to search for anything, which also led to a blank page like url2.
It looks like the www.realestate.com.au website is using an Akamai security tool.
A quick DNS lookup shows that www.realestate.com.au resolves to dualstack.realestate.com.au.edgekey.net.
They are most likely using the Bot Manager product (https://www.akamai.com/us/en/products/security/bot-manager.jsp). I have encountered this on another website recently.
Typically, rotating user agents and IP addresses (ideally using residential proxies) should do the trick. You want to load the site with a "fresh" browser profile each time. You should also check out https://github.com/67-6f-64/akamai-sensor-data-bypass
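A rough sketch of the rotation idea: pick a user agent at random and build the Chrome command-line arguments for one session. The user-agent strings and the proxy address below are placeholders, and `build_chrome_args` is a hypothetical helper; `--user-agent` and `--proxy-server` are standard Chrome switches. As far as I know, ChromeDriver already starts each session with a temporary profile unless a user data directory is configured.

```python
import random

# Placeholder user agents; substitute strings from real, current browsers.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleBrowser/1.0",
    "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/2.0",
]


def build_chrome_args(user_agents, proxy=None, rng=random):
    """Return Chrome arguments for one session with a randomly chosen user agent."""
    args = ["--user-agent=" + rng.choice(user_agents)]
    if proxy:
        # Hypothetical proxy endpoint, e.g. "http://203.0.113.10:8080".
        args.append("--proxy-server=" + proxy)
    return args


print(build_chrome_args(USER_AGENTS, proxy="http://203.0.113.10:8080"))
```

Each returned string can be passed to `ChromeOptions.add_argument` before the driver is created. Whether this is enough to get past a given bot-detection setup is not guaranteed.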
I think you should try adding driver.implicitly_wait(10) before your get line. This sets an implicit wait, so element lookups keep retrying for up to 10 seconds in case the page loads too slowly for the driver. You should also consider trying the Firefox webdriver, since this bug appears to affect only Chromium-based browsers.
I'm trying to use Tor with Selenium, which works through tbselenium.
However, when loading a URL or clicking a web element, the page closes immediately after the action finishes, instead of remaining open as it would when using Selenium with Chrome.
Any ideas on how to keep the page open?
import tbselenium.common as cm
from tbselenium.tbdriver import TorBrowserDriver
from tbselenium.utils import launch_tbb_tor_with_stem
tbb_dir = "C:\\pathto\\Tor Browser\\"
tor_process = launch_tbb_tor_with_stem(tbb_path=tbb_dir)
for i in range(1):
    with TorBrowserDriver(tbb_dir, tor_cfg=cm.USE_STEM) as driver:
        driver.load_url("http://hln.be",3,wait_for_page_body=True)
        #driver.get('https://google.be')
        try:
            policypage=driver.find_element_by_xpath("//a[contains(@href,'members/join')]")
            policypage.click()
            usern=driver.find_element_by_xpath("//input[contains(@id,'user_member_username')]")
            usern.send_keys('Tryout')
        except:
            print('different look')
As Furas said, use the standard driver declaration.
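This presumably means creating the driver without a with block. A with statement calls the object's __exit__ method when the block ends, and Selenium drivers quit the browser there. The sketch below shows the effect using a stand-in class (`FakeDriver` is hypothetical, not part of Selenium or tbselenium).

```python
class FakeDriver:
    """Stand-in for a Selenium driver; records whether quit() was called."""

    def __init__(self):
        self.open = True

    def quit(self):
        self.open = False

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        # Selenium's WebDriver does the same: leaving the block quits the browser.
        self.quit()


with FakeDriver() as scoped:
    pass  # the browser is only alive inside this block

standalone = FakeDriver()  # stays open until quit() is called explicitly

print(scoped.open, standalone.open)
```

With tbselenium that would correspond to `driver = TorBrowserDriver(tbb_dir, tor_cfg=cm.USE_STEM)` followed by an explicit `driver.quit()` once you are done.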
I want to scrape Bing's search results. I am using Selenium: the idea is to click 'Next' automatically and scrape the URLs of the search results on each page. I got it running with the Chrome browser on my Ubuntu machine:
from selenium import webdriver
import os
import time
class bingURL(object):
    def __init__(self):
        self.driver=webdriver.Chrome(os.path.expanduser('./chromedriver'))
    def get_urls(self,url):
        driver=self.driver
        driver.get(url)
        elems = driver.find_elements_by_xpath("//a[@href]")
        href=[]
        for elem in elems:
            link=elem.get_attribute("href")
            try:
                if 'bing.com' not in link and 'http' in link and 'microsoft.com' not in link and 'smashboards.com' not in link:
                    href.append(link)
            except:
                pass
        return list(set(href))
    def search_urls(self,keyword,pagenum):
        driver=self.driver
        searchurl=self.lookup(keyword) ### url of first page of Bing search
        driver.get(searchurl)
        results=self.get_urls(searchurl)
        for i in range(pagenum):
            driver.find_elements_by_class_name("sb_pagN")[0].click() # click 'Next' of bing search result
            time.sleep(5) # wait to load page
            current_url=driver.current_url
            #print(current_url)
            #print(self.get_urls(current_url))
            results[0:0]=self.get_urls(current_url)
        driver.quit()
        return results
    def lookup(self,query):
        return "https://www.bing.com/search?q="+query
if __name__ == "__main__":
    g=bingURL()
    result=g.search_urls('Stackoverflow is good',10)
It works perfectly: when I run the code, it launches a Chrome browser, and I can see it go to the next page automatically and get the URLs for 10 pages of search results.
However, my goal is to run this code successfully on AWS. The original code failed with the error 'Chrome failed to start'. After some googling, it seems I need to use a headless browser like PhantomJS on AWS. So I installed PhantomJS and changed def __init__(self): to:
    def __init__(self):
        self.driver=webdriver.PhantomJS()
However, it can no longer click 'Next' and cannot scrape URLs using the old code. The error message is:
File ".../SEARCH_BING_MODULE.py", line 70, in search_urls
driver.find_elements_by_class_name("sb_pagN")[0].click()
IndexError: list index out of range
It looks like changing the browser completely changes the rules. How should I modify the original code to make it work again? Or how can I scrape the URLs of Bing search results using Selenium + PhantomJS?
Thanks for your help!
Yes, you can perform all of the operations in your three points using a headless browser. Don't use HtmlUnit, as it has many configuration issues.
PhantomJS was another headless-browser option, but it has bugs these days because it is poorly maintained.
You can use chromedriver itself for headless jobs.
You just need to pass one option to chromedriver, as below:
chromeOptions.addArguments("--headless");
The full code looks like this:
System.setProperty("webdriver.chrome.driver","D:\\Workspace\\JmeterWebdriverProject\\src\\lib\\chromedriver.exe");
ChromeOptions chromeOptions = new ChromeOptions();
chromeOptions.addArguments("--headless");
chromeOptions.addArguments("--start-maximized");
WebDriver driver = new ChromeDriver(chromeOptions);
driver.get("https://www.google.co.in/");
Hope it will help you :)
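Since the question uses Python, here is a rough Python equivalent of the Java snippet above. It is a sketch that assumes a Selenium version whose `webdriver.Chrome` accepts an `options` keyword and a chromedriver that can be found on the PATH; `make_headless_chrome` is a hypothetical helper name.

```python
# Same switches as the Java example above.
HEADLESS_ARGS = ["--headless", "--start-maximized"]


def make_headless_chrome(extra_args=()):
    """Start Chrome without a visible window (requires selenium and chromedriver)."""
    # Imported lazily so this module can be loaded without Selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    for arg in list(HEADLESS_ARGS) + list(extra_args):
        options.add_argument(arg)
    return webdriver.Chrome(options=options)
```

Usage would be along the lines of `driver = make_headless_chrome()` followed by `driver.get(...)`, replacing the `webdriver.PhantomJS()` line in `__init__`.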
I've written a script in Python with Selenium to parse a table from a target page, which can be reached by following the steps described below. The script does reach the destination, but when scraping data from the table it throws an "Unable to locate element" error in the console. I checked the XPath with an online XPath tester and found that the XPath I use for "td_data" is right. I suppose what I'm missing is beyond my knowledge. I hope somebody can take a look and provide a workaround.
Btw, the site link is given in my script.
Link to see the html contents for the table: "https://www.dropbox.com/s/kaom5qzk78xndqn/Partial%20Html%20content%20for%20the%20table.txt?dl=0"
Steps to reach the target page, which my script is able to perform:
Selecting "I've read and understand above"
Putting the keyword "pump" in the input box located right below "Select medical devices".
Selecting the checkbox "Devices found for 'pump'".
Finally, pressing the search button
Script I've tried so far:
from selenium import webdriver
import time
driver = webdriver.Chrome()
driver.get('http://apps.tga.gov.au/Prod/devices/daen-entry.aspx')
driver.find_element_by_id('disclaimer-accept').click()
time.sleep(5)
driver.find_element_by_id('medicine-name').send_keys('pump')
time.sleep(8)
driver.find_element_by_id('medicines-header-text').click()
driver.find_element_by_id('submit-button').click()
time.sleep(7)
for item in driver.find_elements_by_xpath('//div[@class="table-responsive"]'):
    for tr_data in item.find_elements_by_xpath('.//tr'):
        td_data = tr_data.find_element_by_xpath('.//span[@class="hovertext"]//a')
        print(td_data.text)
driver.close()
Why don't you just do this:
from selenium import webdriver
import time
driver = webdriver.Chrome()
driver.get('http://apps.tga.gov.au/Prod/devices/daen-entry.aspx')
driver.find_element_by_id('disclaimer-accept').click()
time.sleep(5)
driver.find_element_by_id('medicine-name').send_keys('pump')
time.sleep(8)
driver.find_element_by_id('medicines-header-text').click()
driver.find_element_by_id('submit-button').click()
time.sleep(7)
for item in driver.find_elements_by_xpath(
    '//table[@id]/tbody/tr/td[@class]/span[@class]/a[@id]'
):
    print(item.text)
driver.close()
Output:
27233
27283
27288
27289
27390
27413
27441
27520
25445
27816
27866
27970
28033
28238
26999
28264
28407
28448
28437
28509
28524
28553
28647
28677
28646
Maybe you want to think about saving the page with driver.page_source, pulling out the table, and saving it as an HTML file. Then use pandas (read_html) to load the table into a DataFrame.
I am working with Python and Selenium. I want to download a file by triggering a click event with Selenium. I wrote the following code.
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.keys import Keys
browser = webdriver.Firefox()
browser.get("http://www.drugcite.com/?q=ACTIMMUNE")
browser.close()
I want to download both files from the links named "Export Data" at the given URL. How can I achieve this, given that it only works through a click event?
Find the link using find_element(s)_by_*, then call its click method.
from selenium import webdriver
# To prevent download dialog
profile = webdriver.FirefoxProfile()
profile.set_preference('browser.download.folderList', 2) # custom location
profile.set_preference('browser.download.manager.showWhenStarting', False)
profile.set_preference('browser.download.dir', '/tmp')
profile.set_preference('browser.helperApps.neverAsk.saveToDisk', 'text/csv')
browser = webdriver.Firefox(profile)
browser.get("http://www.drugcite.com/?q=ACTIMMUNE")
browser.find_element_by_id('exportpt').click()
browser.find_element_by_id('exporthlgt').click()
Added profile manipulation code to prevent download dialog.
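If several file types need to skip the dialog, the preferences can be built in one place. This is a sketch: `firefox_download_prefs` is a hypothetical helper, the preference names are the ones used in the code above, and the MIME types are joined with commas, which is the format Firefox expects for browser.helperApps.neverAsk.saveToDisk as far as I know.

```python
def firefox_download_prefs(download_dir, mime_types):
    """Return Firefox preferences that save the given MIME types without a dialog."""
    return {
        "browser.download.folderList": 2,  # 2 = use the custom directory below
        "browser.download.manager.showWhenStarting": False,
        "browser.download.dir": download_dir,
        "browser.helperApps.neverAsk.saveToDisk": ",".join(mime_types),
    }


prefs = firefox_download_prefs("/tmp", ["text/csv", "application/pdf"])
for name, value in prefs.items():
    print(name, "=", value)
```

Each pair can then be applied with `profile.set_preference(name, value)` before creating the browser.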
I'll admit this solution is a little more "hacky" than the Firefox Profile saveToDisk alternative, but it works across both Chrome and Firefox, and doesn't rely on a browser-specific feature which could change at any time. And if nothing else, maybe this will give someone a little different perspective on how to solve future challenges.
Prerequisites: Ensure you have selenium and pyvirtualdisplay installed...
Python 2: sudo pip install selenium pyvirtualdisplay
Python 3: sudo pip3 install selenium pyvirtualdisplay
The Magic
import pyvirtualdisplay
import selenium
import selenium.webdriver
import time
import base64
import json
root_url = 'https://www.google.com'
download_url = 'https://www.google.com/images/branding/googlelogo/2x/googlelogo_color_272x92dp.png'
print('Opening virtual display')
display = pyvirtualdisplay.Display(visible=0, size=(1280, 1024,))
display.start()
print('\tDone')
print('Opening web browser')
driver = selenium.webdriver.Firefox()
#driver = selenium.webdriver.Chrome() # Alternately, give Chrome a try
print('\tDone')
print('Retrieving initial web page')
driver.get(root_url)
print('\tDone')
print('Injecting retrieval code into web page')
driver.execute_script("""
window.file_contents = null;
var xhr = new XMLHttpRequest();
xhr.responseType = 'blob';
xhr.onload = function() {
var reader = new FileReader();
reader.onloadend = function() {
window.file_contents = reader.result;
};
reader.readAsDataURL(xhr.response);
};
xhr.open('GET', %(download_url)s);
xhr.send();
""".replace('\r\n', ' ').replace('\r', ' ').replace('\n', ' ') % {
'download_url': json.dumps(download_url),
})
print('Looping until file is retrieved')
downloaded_file = None
while downloaded_file is None:
    # Returns the file retrieved base64 encoded (perfect for downloading binary)
    downloaded_file = driver.execute_script('return (window.file_contents !== null ? window.file_contents.split(\',\')[1] : null);')
    print(downloaded_file)
    if not downloaded_file:
        print('\tNot downloaded, waiting...')
        time.sleep(0.5)
print('\tDone')
print('Writing file to disk')
fp = open('google-logo.png', 'wb')
fp.write(base64.b64decode(downloaded_file))
fp.close()
print('\tDone')
driver.close() # close web browser, or it'll persist after python exits.
display.popen.kill() # close virtual display, or it'll persist after python exits.
Explanation
We first load a URL on the domain we're targeting a file download from. This allows us to perform an AJAX request on that domain without running into cross-origin request restrictions.
Next, we inject some JavaScript into the DOM which fires off an AJAX request. Once the AJAX request returns a response, we take the response and load it into a FileReader object. From there we can extract the base64-encoded content of the file by calling readAsDataURL(). We then take the base64-encoded content and attach it to window, a globally accessible variable.
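For reference, the value produced by readAsDataURL() is a data URL of the form data:<mime type>;base64,<payload>, which is why the script splits on the comma. A small Python 3 sketch of that decoding step is below; `decode_data_url` is a hypothetical helper and the sample value is made up.

```python
import base64


def decode_data_url(data_url):
    """Split a base64 data URL into (mime_type, raw_bytes)."""
    header, encoded = data_url.split(",", 1)
    # header looks like "data:image/png;base64"
    mime_type = header[len("data:"):].split(";")[0]
    return mime_type, base64.b64decode(encoded)


# Made-up sample; a real value would come from window.file_contents.
sample = "data:text/plain;base64," + base64.b64encode(b"hello").decode("ascii")
print(decode_data_url(sample))
```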
Finally, because the AJAX request is asynchronous, we enter a Python while loop waiting for the content to be appended to the window. Once it's appended, we decode the base64 content retrieved from the window and save it to a file.
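The loop in the script waits forever if the request never completes. A variant with a timeout might look like the sketch below; `poll_until` is a hypothetical helper, and the fake fetch function stands in for the `driver.execute_script(...)` call.

```python
import time


def poll_until(fetch, timeout=30.0, interval=0.5):
    """Call fetch() until it returns something other than None, or time out."""
    deadline = time.monotonic() + timeout
    while True:
        value = fetch()
        if value is not None:
            return value
        if time.monotonic() >= deadline:
            raise TimeoutError("no result within %.1f seconds" % timeout)
        time.sleep(interval)


# Fake fetch: returns None twice, then a base64 payload.
_answers = iter([None, None, "ZGF0YQ=="])
downloaded = poll_until(lambda: next(_answers), timeout=5.0, interval=0.01)
print(downloaded)
```

In the real script, `fetch` would be a lambda wrapping the execute_script call, which returns None while window.file_contents is still null.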
This solution should work across all modern browsers supported by Selenium, and works whether text or binary, and across all mime types.
Alternate Approach
While I haven't tested this, Selenium does afford you the ability to wait until an element is present in the DOM. Rather than looping until a globally accessible variable is populated, you could create an element with a particular ID in the DOM and use the binding of that element as the trigger to retrieve the downloaded file.
In Chrome, what I do is download the files by clicking the links, then open the chrome://downloads page and retrieve the list of downloaded files from the shadow DOM like this:
docs = document
.querySelector('downloads-manager')
.shadowRoot.querySelector('#downloads-list')
.getElementsByTagName('downloads-item')
This solution is limited to Chrome. The data also contains information such as the file path and download date. (Note that this code is JavaScript, not Python.)
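From Python, the JavaScript can be run through execute_script. The sketch below reuses the selectors from the snippet above unchanged; the structure of chrome://downloads is internal to Chrome and may differ between versions, so treat them as an assumption. `list_download_items` is a hypothetical helper, and the stub driver exists only so the example runs without a browser.

```python
DOWNLOAD_ITEMS_JS = """
return document
    .querySelector('downloads-manager')
    .shadowRoot.querySelector('#downloads-list')
    .getElementsByTagName('downloads-item');
"""


def list_download_items(driver):
    """Open chrome://downloads and return the download item elements."""
    driver.get("chrome://downloads")
    return driver.execute_script(DOWNLOAD_ITEMS_JS)


class _RecordingDriver:
    """Stand-in driver that records calls instead of driving a browser."""

    def __init__(self):
        self.visited = []
        self.scripts = []

    def get(self, url):
        self.visited.append(url)

    def execute_script(self, script):
        self.scripts.append(script)
        return []  # a real driver would return a list of WebElements


stub = _RecordingDriver()
items = list_download_items(stub)
print(stub.visited, len(items))
```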
Here is the full working code. You can use web scraping to enter the username, password, and other fields. To get the names of the fields that appear on the web page, use Inspect Element. An element (username, password, or button) can be located by its class or name.
from selenium import webdriver
# Using Chrome to access web
options = webdriver.ChromeOptions()
# Set the download path; Chrome reads it from the "prefs" experimental option
options.add_experimental_option("prefs", {"download.default_directory": r"C:\Test"})
driver = webdriver.Chrome(options=options)
# Open the website
try:
    driver.get('xxxx') # Your Website Address
    password_box = driver.find_element_by_name('password')
    password_box.send_keys('xxxx') #Password
    download_button = driver.find_element_by_class_name('link_w_pass')
    download_button.click()
    driver.quit()
except:
    driver.quit()
    print("Faulty URL")