I've used BeautifulSoup to find a specific div class in the page's HTML. I want to check whether this div has a span class inside it. If it does, I want to keep it in the page's code; if it doesn't, I want to delete it, possibly using Selenium.
For that I have two lists selecting the elements (div and span). I tried to check if one list is inside the other, and that sort of worked. But how can I delete the found element from the page's source code?
Edit
I've edited the code after a few exchanges in the comments section. With help, I was able to implement code that removes elements by executing JavaScript.
The code runs with no errors, but nothing is being deleted from the page.
# Import required module
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager
import time
# Option to launch browser in incognito
options = Options()
options.add_argument("--incognito")
#options.add_argument("--headless")
# Using chrome driver
driver = webdriver.Chrome(ChromeDriverManager().install(), options=options)
# Web page url request
driver.get('https://www.facebook.com/ads/library/?active_status=all&ad_type=all&country=BR&q=frete%20gr%C3%A1tis%20aproveite&sort_data[direction]=desc&sort_data[mode]=relevancy_monthly_grouped&search_type=keyword_unordered&media_type=all')
driver.maximize_window()
time.sleep(10)
driver.execute_script("""
for(let div of document.querySelectorAll('div._99s5')){
let match = div.innerText.match(/(\d+) ads? use this creative and text/)
let numAds = match ? parseInt(match[1]) : 0
if(numAds < 10){
div.querySelector(".tp-logo")?.remove()
}
}
""")
Since you're deleting them in JavaScript anyway:
driver.execute_script("""
for(let div of document.querySelectorAll('div._99s5')){
let match = div.innerText.match(/(\d+) ads? use this creative and text/)
let numAds = match ? parseInt(match[1]) : 0
if(numAds < 10){
div.querySelector(".tp-logo")?.remove()
}
}
""")
Note: the question and comments read a bit confusingly, so it would be great to improve them. Assuming you want to decompose() some elements, the reason why, and what to do after this action, is not clear. So this answer will only point out an approach.
To decompose() the elements that do not contain "ads use this creative and text", just negate your selection and iterate the ResultSet:
for e in soup.select('div._99s5:not(:has(:-soup-contains("ads use this creative and text")))'):
e.decompose()
Now these elements will no longer be included in your soup and you could process it for your needs.
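For completeness, a minimal sketch of the full round trip, assuming the soup is built from the Selenium driver's page_source as in the question:
from bs4 import BeautifulSoup

# Build the soup from the live page source fetched by Selenium
soup = BeautifulSoup(driver.page_source, 'html.parser')
for e in soup.select('div._99s5:not(:has(:-soup-contains("ads use this creative and text")))'):
    e.decompose()

# The removed divs are gone from the soup only, not from the live browser page
print(len(soup.select('div._99s5')))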
I'm working in Selenium with Chrome.
The webpage I'm accessing updates dynamically.
I need the HTML that shows the results; I can access it when I do 'inspect element'.
I don't get how to access that HTML from my code. I always get the original HTML.
I tried this: Get HTML Source of WebElement in Selenium WebDriver using Python
from selenium import webdriver

browser = webdriver.Chrome()
browser.get('http://bijsluiters.fagg-afmps.be/?localeValue=nl')
searchform = browser.find_element_by_class_name('iceInpTxt')
searchform.send_keys('cefuroxim')
button = browser.find_element_by_class_name('iceCmdBtn').click()
element = browser.find_element_by_class_name('contentContainer')
html = element.get_attribute('innerHTML')
browser.close()
print(html)
It seems that it works after some delay. If I were you, I would experiment with the delay time.
from selenium import webdriver
import time
browser = webdriver.Chrome()
browser.get('http://bijsluiters.fagg-afmps.be/?localeValue=nl')
searchform = browser.find_element_by_class_name('iceInpTxt')
searchform.send_keys('cefuroxim')
button = browser.find_element_by_class_name('iceCmdBtn').click()
time.sleep(10)
element = browser.find_element_by_class_name('contentContainer')
html = element.get_attribute('innerHTML')
browser.close()
print(html)
Addition: a nicer way is to let the script proceed as soon as an element is available (because JavaScript, for example, may take some time before a specific element has been added to the DOM). The element to look for in your example is the table with id iceDatTbl (from what I could find after a quick look).
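A hedged sketch of that explicit-wait approach, assuming the table id iceDatTbl mentioned above is still correct:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

browser = webdriver.Chrome()
browser.get('http://bijsluiters.fagg-afmps.be/?localeValue=nl')
browser.find_element_by_class_name('iceInpTxt').send_keys('cefuroxim')
browser.find_element_by_class_name('iceCmdBtn').click()

# Proceed as soon as the results table is in the DOM instead of sleeping blindly
element = WebDriverWait(browser, 20).until(
    EC.presence_of_element_located((By.ID, 'iceDatTbl')))
html = element.get_attribute('innerHTML')
browser.close()
print(html)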
Note: I am dealing specifically with this website
How can I use selenium with Python to get the reviews on this page to sort by 'Most recent'?
What I tried was:
driver.find_element_by_id('sort-order-dropdown').send_keys('Most recent')
I took this from this; it didn't cause any error, but it didn't work either.
Then I tried
from selenium.webdriver.support.ui import Select
select = Select(driver.find_element_by_id('sort-order-dropdown'))
select.select_by_value('recent')
select.select_by_visible_text('Most recent')
select.select_by_index(1)
I've got: Message: Element <select id="sort-order-dropdown" class="a-native-dropdown" name=""> is not clickable at point (66.18333435058594,843.7999877929688) because another element <span class="a-dropdown-prompt"> obscures it
This one
element = driver.find_element_by_id('sort-order-dropdown')
element.click()
li = driver.find_elements_by_css_selector('#sort-order-dropdown > option:nth-child(2)')
li.click()
It came from this and caused the same error message.
This one, from this, also caused the same error:
Select(driver.find_element_by_id('sort-order-dropdown')).select_by_value('recent').click()
So, I'm curious to know if there is any way that I can select the reviews to sort from the most recent first.
Thank you
This worked for me using Java:
@Test
public void amazonTest() throws InterruptedException {
    String URL = "https://www.amazon.com/Harry-Potter-Slytherin-Wall-Banner/product-reviews/B01GVT5KR6/ref=cm_cr_dp_d_show_all_top?ie=UTF8&reviewerType=all_reviews";
    String menuSelector = ".a-dropdown-prompt";
    String menuItemSelector = ".a-dropdown-common .a-dropdown-item";

    driver.get(URL);
    Thread.sleep(2000);

    WebElement menu = driver.findElement(By.cssSelector(menuSelector));
    menu.click();

    List<WebElement> menuItem = driver.findElements(By.cssSelector(menuItemSelector));
    menuItem.get(1).click();
}
You can reuse the element names and follow a similar path using Python.
The key points here are:
Click on the menu itself
Click on the second menu item
It is better practice not to hard-code the item number but to actually read the item names and select the correct one, so the code keeps working even if the menu changes. This is just a note for future improvement; see the sketch after the Python version below.
EDIT
This is how the same can be done in Python.
import time
from selenium import webdriver

URL = "https://www.amazon.com/Harry-Potter-Slytherin-Wall-Banner/product-reviews/B01GVT5KR6/ref=cm_cr_dp_d_show_all_top?ie=UTF8&reviewerType=all_reviews"
menuSelector = ".a-dropdown-prompt"
menuItemSelector = ".a-dropdown-common .a-dropdown-item"

driver = webdriver.Chrome()
driver.get(URL)

elem = driver.find_element_by_css_selector(menuSelector)
elem.click()
time.sleep(1)

elemItems = driver.find_elements_by_css_selector(menuItemSelector)
elemItems[1].click()
time.sleep(5)

driver.close()
Just to keep in mind, CSS selectors are a better alternative to XPath as they are much faster, more robust, and easier to read and change.
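Following up on the note above about not hard-coding the item index, here is a hedged sketch that picks the menu item by its visible text instead (same class names as above, which Amazon may change at any time):
# Open the dropdown, then click the item whose text matches the desired sort order
driver.find_element_by_css_selector('.a-dropdown-prompt').click()
for item in driver.find_elements_by_css_selector('.a-dropdown-common .a-dropdown-item'):
    if item.text.strip() == 'Most recent':
        item.click()
        break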
This is the simplified version of what I did to get the reviews sorted from the most recent ones. As "Eugene S" said above, the key point is to click on the button itself and then select/click the desired item from the list. However, my Python code uses XPath instead of CSS selectors.
# click on "Top rated" button
driver.find_element_by_xpath('//*[#id="a-autoid-4-announce"]').click()
# this one select the "Most recent"
driver.find_element_by_xpath('//*[#id="sort-order-dropdown_1"]').click()
I'm trying to scrape all the file paths in a Github repo from the File Finder page (example).
Beautiful Soup 4 fails to scrape the <tbody class="js-tree-finder-results js-navigation-container js-active-navigation-container"> element that wraps the list of file paths. I figured this is because bs4 can't scrape dynamic content, so I tried waiting for all the elements to load with Selenium:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.PhantomJS()
driver.get("https://github.com/chrisspen/weka/find/master")

# Explicitly wait for the element to become present
wait = WebDriverWait(driver, 10)
element = wait.until(
    EC.presence_of_element_located((
        By.CSS_SELECTOR, "js-tree-finder-results.js-navigation-container.js-active-navigation-container"
    ))
)

# Pass the found element into the script
items = driver.execute_script('return arguments[0].innerHTML;', element)
print(items)
But it's still failing to find the element.
What am I doing wrong?
P.S.
The file paths can be grabbed easily via the JS console with the following script:
window.onload = function() {
    var fileLinks = document.getElementsByClassName("css-truncate-target js-navigation-open js-tree-finder-path");
    var files = [];
    for (var i = 0; i < fileLinks.length; i++) {
        files.push(fileLinks[i].innerHTML);
    }
    return files;
}
Edit:
My program requires the use of a headless browser, like PhantomJS.
The current version of PhantomJS can't capture Github content because of Github's Content-Security-Policy setting. This is a known bug, documented here. According to that issue page, downgrading to PhantomJS version 1.9.8 is a way around the problem.
Another solution would be to use chromedriver:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome('chromedriver')
driver.get("https://github.com/chrisspen/weka/find/master")

wait = WebDriverWait(driver, 10)
element = wait.until(
    EC.presence_of_element_located((
        By.CLASS_NAME, "js-tree-finder-path"
    ))
)
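Since the question requires a headless browser, here is a hedged variant that runs chromedriver headlessly (assuming a Chrome version with headless support):
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--headless')  # no visible browser window
driver = webdriver.Chrome('chromedriver', options=options)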
I want to automate file download completion checking in chromedriver.
The HTML of each entry in the downloads list looks like:
<a is="action-link" id="file-link" tabindex="0" role="link" href="http://fileSource" class="">DownloadedFile#1</a>
So I use the following code to find the target elements:
driver.get('chrome://downloads/') # This page should be available to everyone who uses the Chrome browser
driver.find_elements_by_tag_name('a')
This returns an empty list while there are 3 new downloads.
As I found out, only the parent elements of a #shadow-root (open) node can be handled directly.
So how can I find elements inside this #shadow-root element?
Sometimes the shadow root elements are nested, and the second shadow root is not visible in the document root but is available inside its parent's accessed shadow root. I think it is better to use the Selenium selectors and inject the script just to take the shadow root:
def expand_shadow_element(element):
    shadow_root = driver.execute_script('return arguments[0].shadowRoot', element)
    return shadow_root
outer = expand_shadow_element(driver.find_element_by_css_selector("#test_button"))
inner = outer.find_element_by_id("inner_button")
inner.click()
To put this into perspective, I just added a testable example with Chrome's download page; clicking the search button requires opening 3 nested shadow root elements:
from selenium import webdriver

driver = webdriver.Chrome()

def expand_shadow_element(element):
    shadow_root = driver.execute_script('return arguments[0].shadowRoot', element)
    return shadow_root
driver.get("chrome://downloads")
root1 = driver.find_element_by_tag_name('downloads-manager')
shadow_root1 = expand_shadow_element(root1)
root2 = shadow_root1.find_element_by_css_selector('downloads-toolbar')
shadow_root2 = expand_shadow_element(root2)
root3 = shadow_root2.find_element_by_css_selector('cr-search-field')
shadow_root3 = expand_shadow_element(root3)
search_button = shadow_root3.find_element_by_css_selector("#search-button")
search_button.click()
Taking the approach suggested in the other answers has the drawback that it hard-codes the queries, is less readable, and doesn't let you use the intermediary selections for other actions:
search_button = driver.execute_script('return document.querySelector("downloads-manager").shadowRoot.querySelector("downloads-toolbar").shadowRoot.querySelector("cr-search-field").shadowRoot.querySelector("#search-button")')
search_button.click()
later edit:
I recently tried to access the content settings (see code below). It now has more than one shadow root element nested inside another, so you cannot access one without first expanding the other. When you also have dynamic content and more than 3 shadow elements nested one inside another, automation becomes nearly impossible. The answer above used to work a while ago, but it is enough for just one element to change position; then you always have to inspect the element, go up the tree, and check whether it is inside a shadow root. An automation nightmare.
Not only was it hard to find the content settings due to the shadow roots and the dynamic changes, but once you find the button, it is not clickable at that point.
driver = webdriver.Chrome()

def expand_shadow_element(element):
    shadow_root = driver.execute_script('return arguments[0].shadowRoot', element)
    return shadow_root
driver.get("chrome://settings")
root1 = driver.find_element_by_tag_name('settings-ui')
shadow_root1 = expand_shadow_element(root1)
root2 = shadow_root1.find_element_by_css_selector('[page-name="Settings"]')
shadow_root2 = expand_shadow_element(root2)
root3 = shadow_root2.find_element_by_id('search')
shadow_root3 = expand_shadow_element(root3)
search_button = shadow_root3.find_element_by_id("searchTerm")
search_button.click()
text_area = shadow_root3.find_element_by_id('searchInput')
text_area.send_keys("content settings")
root0 = shadow_root1.find_element_by_id('main')
shadow_root0_s = expand_shadow_element(root0)
root1_p = shadow_root0_s.find_element_by_css_selector('settings-basic-page')
shadow_root1_p = expand_shadow_element(root1_p)
root1_s = shadow_root1_p.find_element_by_css_selector('settings-privacy-page')
shadow_root1_s = expand_shadow_element(root1_s)
content_settings_div = shadow_root1_s.find_element_by_css_selector('#site-settings-subpage-trigger')
content_settings = content_settings_div.find_element_by_css_selector("button")
content_settings.click()
There is also the ready-to-use pyshadow pip module, which worked in my case. Example below:
from pyshadow.main import Shadow
from selenium import webdriver
driver = webdriver.Chrome('chromedriver.exe')
shadow = Shadow(driver)
element = shadow.find_element("#Selector_level1")
element1 = shadow.find_element("#Selector_level2")
element2 = shadow.find_element("#Selector_level3")
element3 = shadow.find_element("#Selector_level4")
element4 = shadow.find_element("#Selector_level5")
element5 = shadow.find_element('#control-button') #target selector
element5.click()
You can use the driver.executeScript() method to access the HTML elements and JavaScript objects in your web page.
In the example below, executeScript will return, in a Promise, the NodeList of all <a> elements present in the shadow tree of the element whose id is host. Then you can perform your assertion test:
it('check shadow root content', function () {
    return driver.executeScript(function () {
        return host.shadowRoot.querySelectorAll('a')
    }).then(function (n) {
        return expect(n).to.have.length(3)
    })
})
Note: I don't know Python so I've used the JavaScript syntax but it should work the same way.
I would add this as a comment but I don't have enough reputation points.
The answer by Eduard Florinescu works well, with the caveat that once you're inside a shadowRoot, you only have the Selenium methods available that correspond to the JS methods available on it; mainly select by id.
To get around this I wrote a longer JS function in a Python string and used native JS methods and attributes (find by id, children plus indexing, etc.) to get the element I ultimately needed.
You can use this method to also access shadowRoots of child elements, and so on, when the JS string is run using driver.execute_script(); see the sketch below.
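A minimal sketch of that idea; every id and child index here is a hypothetical placeholder you would replace after inspecting your own page:
# Walk the shadow DOM entirely in JS and hand back only the final element
js = """
var host = document.getElementById('host-element');        // hypothetical host id
var inner = host.shadowRoot.getElementById('inner-host');   // native JS lookup inside the root
return inner.shadowRoot.children[0].children[2];            // descend by child indexing
"""
target = driver.execute_script(js)
target.click()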
With selenium 4.1 there's a new attribute shadow_root for the WebElement class.
From the docs:
Returns a shadow root of the element if there is one or an error. Only works from Chromium 96 onwards. Previous versions of Chromium based browsers will throw an assertion exception.
Returns:
ShadowRoot object or
NoSuchShadowRoot - if no shadow root was attached to element
A ShadowRoot object has the methods find_element and find_elements but they're currently limited to:
By.ID
By.CSS_SELECTOR
By.NAME
By.CLASS_NAME
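A minimal usage sketch, assuming Selenium 4.1+ and Chromium 96+, reusing the downloads page from the earlier answers:
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("chrome://downloads")

host = driver.find_element(By.TAG_NAME, 'downloads-manager')
shadow = host.shadow_root  # ShadowRoot object, not a WebElement
toolbar = shadow.find_element(By.CSS_SELECTOR, 'downloads-toolbar')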
Shadow roots and explicit waits
You can also combine that with WebDriverWait and expected_conditions to obtain decent behaviour. The only caveat is that you must use expected conditions that accept WebElement objects. At the moment, those are just the following:
element_selection_state_to_be
element_to_be_clickable
element_to_be_selected
invisibility_of_element
staleness_of
visibility_of
Example
e.g. borrowing the example from eduard-florinescu
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Chrome()
timeout = 10

driver.get("chrome://settings")
root1 = driver.find_element(By.TAG_NAME, 'settings-ui')
shadow_root1 = root1.shadow_root

root2 = WebDriverWait(driver, timeout).until(EC.visibility_of(shadow_root1.find_element(by=By.CSS_SELECTOR, value='[page-name="Settings"]')))
shadow_root2 = root2.shadow_root

root3 = WebDriverWait(driver, timeout).until(EC.visibility_of(shadow_root2.find_element(by=By.ID, value='search')))
shadow_root3 = root3.shadow_root

search_button = WebDriverWait(driver, timeout).until(EC.visibility_of(shadow_root3.find_element(by=By.ID, value="searchTerm")))
search_button.click()

text_area = WebDriverWait(driver, timeout).until(EC.visibility_of(shadow_root3.find_element(by=By.ID, value='searchInput')))
text_area.send_keys("content settings")

root0 = WebDriverWait(driver, timeout).until(EC.visibility_of(shadow_root1.find_element(by=By.ID, value='main')))
shadow_root0_s = root0.shadow_root

root1_p = WebDriverWait(driver, timeout).until(EC.visibility_of(shadow_root0_s.find_element(by=By.CSS_SELECTOR, value='settings-basic-page')))
shadow_root1_p = root1_p.shadow_root

root1_s = WebDriverWait(driver, timeout).until(EC.visibility_of(shadow_root1_p.find_element(by=By.CSS_SELECTOR, value='settings-privacy-page')))
shadow_root1_s = root1_s.shadow_root

content_settings_div = WebDriverWait(driver, timeout).until(EC.visibility_of(shadow_root1_s.find_element(by=By.CSS_SELECTOR, value='#site-settings-subpage-trigger')))
content_settings = WebDriverWait(driver, timeout).until(EC.visibility_of(content_settings_div.find_element(by=By.CSS_SELECTOR, value="button")))
content_settings.click()
I originally implemented Eduard's solution, just slightly modified as a loop for simplicity. But when Chrome updated to 96.0.4664.45, Selenium started returning a dict instead of a WebElement when calling return arguments[0].shadowRoot.
I did a little hacking around and found out that I could get Selenium to return a WebElement by calling return arguments[0].shadowRoot.querySelector("tag") instead.
Here's what my final solution ended up looking like:
def get_balance_element(self):
    # Loop through nested shadow root tags
    tags = [
        "tag2",
        "tag3",
        "tag4",
        "tag5",
    ]
    root = self.driver.find_element_by_tag_name("tag1")
    for tag in tags:
        root = self.expand_shadow_element(root, tag)
    # Finally there. GOLD!
    return [root]

def expand_shadow_element(self, element, tag):
    shadow_root = self.driver.execute_script(
        f'return arguments[0].shadowRoot.querySelector("{tag}")', element)
    return shadow_root
Clean and simple, works for me.
Also, I could only get this working with Selenium 3.141.0. Version 4.1 has a half-baked shadow DOM implementation that just manages to break everything.
The items downloaded by google-chrome are nested within multiple #shadow-root (open) elements.
Solution
To extract the contents of the table, you have to use shadowRoot.querySelector(), and you can use the following locator strategy:
Code Block:
import time
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service

s = Service('./chromedriver')  # driver path assumed; s and options were undefined in the original
options = Options()
driver = webdriver.Chrome(service=s, options=options)
driver.execute("get", {'url': 'chrome://downloads/'})
time.sleep(5)
download = driver.execute_script("""return document.querySelector('downloads-manager').shadowRoot.querySelector('downloads-item').shadowRoot.querySelector('a#file-link')""")
print(download.text)
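To get back to the original goal of checking download completion, here is a heavily hedged sketch that polls until no item still shows a progress bar. The internal #progress element is an assumption to verify in DevTools, since Chrome's downloads page changes between versions:
import time

def downloads_finished(driver, timeout=60):
    # Poll chrome://downloads until no downloads-item still has a progress bar
    end = time.time() + timeout
    while time.time() < end:
        in_progress = driver.execute_script("""
            const items = document.querySelector('downloads-manager')
                .shadowRoot.querySelectorAll('downloads-item');
            return Array.from(items).some(
                i => i.shadowRoot.querySelector('#progress') !== null);
        """)
        if not in_progress:
            return True
        time.sleep(1)
    return False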