How to scrape dynamic content with BS4 or Selenium (Python)?

I'm trying to scrape all the file paths in a Github repo from the File Finder page (example).
Beautiful Soup 4 fails to scrape the <tbody class="js-tree-finder-results js-navigation-container js-active-navigation-container"> element that wraps the list of file paths. I figured this is because BS4 can't scrape dynamic content, so I tried waiting for all the elements to load with Selenium:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.PhantomJS()
driver.get("https://github.com/chrisspen/weka/find/master")

# Explicitly wait for the element to become present
wait = WebDriverWait(driver, 10)
element = wait.until(
    EC.presence_of_element_located((
        By.CSS_SELECTOR, "js-tree-finder-results.js-navigation-container.js-active-navigation-container"
    ))
)

# Pass the found element into the script
items = driver.execute_script('return arguments[0].innerHTML;', element)
print(items)
But it's still failing to find the element.
What am I doing wrong?
P.S.
The file paths can be grabbed easily via the JS console with the following script:
window.onload = function() {
    var fileLinks = document.getElementsByClassName("css-truncate-target js-navigation-open js-tree-finder-path");
    var files = [];
    for (var i = 0; i < fileLinks.length; i++) {
        files.push(fileLinks[i].innerHTML);
    }
    return files;
}
Edit:
My program requires the use of a headless browser, like PhantomJS.

The current version of PhantomJS can't capture Github content because of Github's Content-Security-Policy setting. This is a known bug and documented here. According to that issue page, downgrading to PhantomJS version 1.9.8 is a way around the problem.
Another solution would be to use chromedriver:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome('chromedriver')
driver.get("https://github.com/chrisspen/weka/find/master")
wait = WebDriverWait(driver, 10)
element = wait.until(
    EC.presence_of_element_located((
        By.CLASS_NAME, "js-tree-finder-path"
    ))
)
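If you also need the browser to stay headless (per the question's requirement), here is a minimal sketch with headless Chrome that collects the file paths; it assumes chromedriver is on your PATH and that the class names from the question's console script still apply:
# Minimal sketch: headless Chrome collecting the file-finder paths.
# Assumes chromedriver is on PATH and GitHub still uses these class names.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

options = webdriver.ChromeOptions()
options.add_argument("--headless")

driver = webdriver.Chrome(options=options)
driver.get("https://github.com/chrisspen/weka/find/master")

# Wait until at least one file-path element has been rendered by the page's JS
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CLASS_NAME, "js-tree-finder-path"))
)

# Collect the text of every file-path link
paths = [el.text for el in driver.find_elements(By.CLASS_NAME, "js-tree-finder-path")]
print(paths)

driver.quit()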

Related

Web scraping when scrolling down is needed

I want to scrape, e.g., the titles of the first 200 questions under the web page https://www.quora.com/topic/Stack-Overflow-4/all_questions. I tried the following code:
import requests
from bs4 import BeautifulSoup
url = "https://www.quora.com/topic/Stack-Overflow-4/all_questions"
print("url")
print(url)
r = requests.get(url) # HTTP request
print("r")
print(r)
html_doc = r.text # Extracts the html
print("html_doc")
print(html_doc)
soup = BeautifulSoup(html_doc, 'lxml') # Create a BeautifulSoup object
print("soup")
print(soup)
It gave me this text: https://pastebin.com/9dSPzAyX. If we search for href='/, we can see that the HTML does contain the titles of some questions. However, there aren't enough of them; on the actual page, a user has to scroll down manually to trigger extra loading.
Does anyone know how I could mimic "scrolling down" programmatically to load more of the page's content?
Infinite scrolling on a web page is driven by JavaScript. Therefore, to find out which URL we need to access and which parameters to use, we need to either thoroughly study the JS code running inside the page or, preferably, examine the requests the browser makes when you scroll down the page. We can study the requests using the Developer Tools.
See this example for Quora: the more you scroll down, the more requests are generated. Your requests should then go to that URL instead of the normal page URL, but keep in mind to send the correct headers and payload.
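As a sketch only, replaying such a request with requests could look like the following; the endpoint, headers, and payload are placeholders that you must copy from the captured request in the DevTools Network tab (including whether it is a GET or a POST):
# Sketch only: replay the XHR that the page fires when you scroll.
# Every value below is a placeholder - copy the real URL, headers and
# payload from the DevTools "Network" tab while scrolling the page.
import requests

url = "https://www.quora.com/some/ajax/endpoint"  # hypothetical endpoint
headers = {
    "User-Agent": "Mozilla/5.0",
    "Content-Type": "application/json",
    # plus any cookies / tokens the browser sends
}
payload = {"page": 2}  # hypothetical pagination parameter

response = requests.post(url, json=payload, headers=headers)
print(response.status_code)
print(response.text[:500])  # inspect the returned chunk of questions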
Another, easier solution is to use Selenium.
I couldn't find a way to do this with requests, but you can use Selenium. The code below first prints the number of questions on the initial load, then sends the End key to mimic scrolling down. You can see the number of questions go from 20 to 40 after the End key is sent.
I wait 5 seconds before reading the DOM again, in case the script runs faster than the new content loads. You can improve this by using explicit waits (EC) with Selenium.
The page loads 20 questions per scroll, so if you are looking to scrape 100 questions you need to send the End key 5 times.
To use the code below you need to install chromedriver.
http://chromedriver.chromium.org/downloads
import time

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait

CHROMEDRIVER_PATH = ""
CHROME_PATH = ""
WINDOW_SIZE = "1920,1080"

chrome_options = Options()
# chrome_options.add_argument("--headless")
chrome_options.add_argument("--window-size=%s" % WINDOW_SIZE)
chrome_options.binary_location = CHROME_PATH
prefs = {'profile.managed_default_content_settings.images': 2}
chrome_options.add_experimental_option("prefs", prefs)

url = "https://www.quora.com/topic/Stack-Overflow-4/all_questions"

def scrape(url, times):
    if not url.startswith('http'):
        raise Exception('URLs need to start with "http"')
    driver = webdriver.Chrome(
        executable_path=CHROMEDRIVER_PATH,
        chrome_options=chrome_options
    )
    driver.get(url)
    counter = 1
    while counter <= times:
        q_list = driver.find_element_by_class_name('TopicAllQuestionsList')
        questions = [x for x in q_list.find_elements_by_xpath('//div[@class="pagedlist_item"]')]
        q_len = len(questions)
        print(q_len)
        html = driver.find_element_by_tag_name('html')
        html.send_keys(Keys.END)
        wait = WebDriverWait(driver, 5)
        time.sleep(5)
        questions2 = [x for x in q_list.find_elements_by_xpath('//div[@class="pagedlist_item"]')]
        print(len(questions2))
        counter += 1
    driver.close()

if __name__ == '__main__':
    scrape(url, 5)
I recommend using Selenium rather than BS4.
Selenium can both control the browser and parse the page: scroll down, click buttons, etc.
This example scrolls down to collect all the users who liked a post on Instagram:
https://stackoverflow.com/a/54882356/5611675
If the content only loads on "scrolling down", this probably means the page uses JavaScript to load the content dynamically.
You can try using a web client such as PhantomJS to load the page and execute the JavaScript in it, and simulate the scroll by injecting some JS such as document.body.scrollTop = sY; (Simulate scroll event using Javascript).
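For instance, with Selenium the same idea can be applied by injecting a scroll in a loop until the page height stops growing; a minimal sketch (the sleep interval may need tuning for this site):
import time
from selenium import webdriver

driver = webdriver.Chrome()  # or any other driver that executes JS
driver.get("https://www.quora.com/topic/Stack-Overflow-4/all_questions")

last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    # Scroll to the bottom and give the page time to load more questions
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(3)
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:  # nothing new was loaded
        break
    last_height = new_height

html = driver.page_source  # can now be fed to BeautifulSoup
driver.quit()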

Scrapy/Splash Click on a button then get content from new page in new window

I'm facing a problem where, when I click on a button, JavaScript handles the action and then redirects to a new page in a new window (similar to clicking an <a> with target="_blank"). In Scrapy/Splash I don't know how to get the content from that new page (i.e. I don't know how to control that new page).
Can anyone help?
script = """
function main(splash)
assert(splash:go(splash.args.url))
splash:wait(0.5)
local element = splash:select('div.result-content-columns div.result-title')
local bounds = element:bounds()
element:mouse_click{x=bounds.width/2, y=bounds.height/2}
return splash:html()
end
"""
def start_requests(self):
for url in self.start_urls:
yield SplashRequest(url, self.parse, endpoint='execute', args={'lua_source': self.script})
Issue:
The problem is that you can't scrape HTML that is outside of your selection scope. When a new link is clicked, if there is an iframe involved, it rarely brings that iframe into scope for scraping.
Solution:
Choose a method of selecting the new iframe, and then proceed to parse the new HTML.
The Scrapy-Splash method
(This is an adaptation of Mikhail Korobov's solution from this answer)
If you are able to get the src link of the new page that pops up, that may be the most reliable route; however, you can also try selecting the iframe this way:
import parsel

# ...
yield SplashRequest(url, self.parse_result, endpoint='render.json',
                    args={'html': 1, 'iframes': 1})

def parse_result(self, response):
    iframe_html = response.data['childFrames'][0]['html']
    sel = parsel.Selector(iframe_html)
    item = {
        'my_field': sel.xpath(...),
        # ...
    }
The Selenium method
(Requires pip install selenium and bs4, and possibly a ChromeDriver download for your OS: Selenium Chromedrivers.) Supports JavaScript parsing! Woohoo!
With the following code, this will switch scopes to the new frame:
# Goes at the top
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import time

# Your path depends on where you downloaded/located your chromedriver.exe
CHROME_PATH = r'C:\Program Files (x86)\Google\Chrome\Application\chrome.exe'
CHROMEDRIVER_PATH = 'chromedriver.exe'
WINDOW_SIZE = "1920,1080"

chrome_options = Options()
chrome_options.add_argument("--log-level=3")
chrome_options.add_argument("--headless")  # Speeds things up if you don't need a GUI
chrome_options.add_argument("--window-size=%s" % WINDOW_SIZE)
chrome_options.binary_location = CHROME_PATH

browser = webdriver.Chrome(executable_path=CHROMEDRIVER_PATH, chrome_options=chrome_options)

url = "example_js_site.com"  # Your site goes here
browser.get(url)
time.sleep(3)  # An unsophisticated way to wait for the new page to load.
browser.switch_to.frame(0)
soup = BeautifulSoup(browser.page_source.encode('utf-8').strip(), 'lxml')

# This will return any content found in tags called '<table>'
table = soup.find_all('table')
My favorite of the two options is Selenium, but try the first solution if you are more comfortable with it!
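Note that if the click genuinely opens a new browser window or tab (target="_blank") rather than an iframe, the Selenium route can also switch to the new window handle; a minimal sketch, reusing the selector from the question's Lua script:
# Sketch: handle a click that opens a new tab/window (target="_blank").
# 'div.result-content-columns div.result-title' is taken from the question's
# Lua script; adjust the selector and URL for your own page.
import time
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("http://example.com")  # your start URL goes here

original_window = driver.current_window_handle
driver.find_element_by_css_selector("div.result-content-columns div.result-title").click()
time.sleep(2)  # crude wait for the new window to open

# Switch to whichever handle is not the original one
for handle in driver.window_handles:
    if handle != original_window:
        driver.switch_to.window(handle)
        break

print(driver.page_source[:500])  # content of the new page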

How to get URLs from website

I tried to get all URLs from this website:
https://www.bbvavivienda.com/es/buscador/venta/vivienda/todos/la-coruna/
There are a lot of links like https://www.bbvavivienda.com/es/unidades/UV_n_UV00121705 inside, but I'm not able to retrieve them with Selenium. Any idea how to do it?
I'm adding more info about how I tried it. Obviously, I'm just starting out with Python, Selenium, etc. Thanks in advance:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

driver = webdriver.Chrome(r"D:\Python27\selenium\webdriver\chrome\chromedriver.exe")
driver.implicitly_wait(30)
driver.maximize_window()
driver.get("https://www.bbvavivienda.com/es/buscador/venta/vivienda/todos/la-coruna/")
urls = driver.find_element_by_css_selector('a').get_attribute('href')
print urls
links = driver.find_elements_by_partial_link_text('_self')
for link in links:
    print link.get_attribute("href")
driver.quit()
The following code should work. You are using the wrong identifier for the link.
driver = webdriver.Chrome()
driver.implicitly_wait(30)
driver.maximize_window()
driver.get("https://www.bbvavivienda.com/es/buscador/venta/vivienda/todos/la-coruna/")
urls = driver.find_element_by_css_selector('a').get_attribute('href')
print urls
for link in driver.find_elements_by_xpath("//a[@target='_self']"):
    try:
        print link.get_attribute("href")
    except Exception:
        pass
driver.quit()
I don't know Python, but in Java we would normally find all the elements on the page with the tag "a" to find the links. You may find the code snippet below useful.
List<WebElement> links = driver.findElements(By.tagName("a"));
System.out.println(links.size());
for (int i = 0; i < links.size(); i++)
{
    System.out.println(links.get(i).getText());
}
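For completeness, a rough Python equivalent of that Java snippet (a sketch, untested against this particular site):
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://www.bbvavivienda.com/es/buscador/venta/vivienda/todos/la-coruna/")

# Collect every <a> element on the page, mirroring the Java example above
links = driver.find_elements_by_tag_name("a")
print(len(links))
for link in links:
    print(link.get_attribute("href"))

driver.quit()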

How to handle elements inside Shadow DOM from Selenium

I want to automate file download completion checking in chromedriver.
HTML of each entry in downloads list looks like
<a is="action-link" id="file-link" tabindex="0" role="link" href="http://fileSource" class="">DownloadedFile#1</a>
So I use the following code to find the target elements:
driver.get('chrome://downloads/')  # This page should be available to everyone who uses the Chrome browser
driver.find_elements_by_tag_name('a')
This returns an empty list while there are 3 new downloads.
As I found out, only elements above the #shadow-root (open) node can be handled directly.
So how can I find elements inside this #shadow-root element?
Sometimes the shadow root elements are nested, and the second shadow root is not visible in the document root but is available inside its parent's expanded shadow root. I think it is better to use the Selenium selectors and inject the script only to grab the shadow root:
def expand_shadow_element(element):
    shadow_root = driver.execute_script('return arguments[0].shadowRoot', element)
    return shadow_root

outer = expand_shadow_element(driver.find_element_by_css_selector("#test_button"))
inner = outer.find_element_by_id("inner_button")
inner.click()
To put this into perspective, here is a testable example with Chrome's download page; clicking the search button requires opening 3 nested shadow root elements:
import selenium
from selenium import webdriver

driver = webdriver.Chrome()

def expand_shadow_element(element):
    shadow_root = driver.execute_script('return arguments[0].shadowRoot', element)
    return shadow_root

driver.get("chrome://downloads")

root1 = driver.find_element_by_tag_name('downloads-manager')
shadow_root1 = expand_shadow_element(root1)

root2 = shadow_root1.find_element_by_css_selector('downloads-toolbar')
shadow_root2 = expand_shadow_element(root2)

root3 = shadow_root2.find_element_by_css_selector('cr-search-field')
shadow_root3 = expand_shadow_element(root3)

search_button = shadow_root3.find_element_by_css_selector("#search-button")
search_button.click()
Taking the approach suggested in the other answers has the drawback that it hard-codes the queries, is less readable, and you cannot use the intermediate selections for other actions:
search_button = driver.execute_script('return document.querySelector("downloads-manager").shadowRoot.querySelector("downloads-toolbar").shadowRoot.querySelector("cr-search-field").shadowRoot.querySelector("#search-button")')
search_button.click()
Later edit:
I recently tried to access the content settings (see code below). It now has more than one nested shadow root element, and you cannot access one without first expanding the other; when you also have dynamic content and more than 3 shadow elements nested inside each other, automation becomes nearly impossible. The answer above used to work a while ago, but it only takes one element changing position and you have to go back to inspecting elements and walking up the tree to see whether each one is inside a shadow root; an automation nightmare.
Not only was it hard to find the content settings because of the shadow roots and the dynamic changes, but once you do find the button it is not clickable at that point.
driver = webdriver.Chrome()

def expand_shadow_element(element):
    shadow_root = driver.execute_script('return arguments[0].shadowRoot', element)
    return shadow_root

driver.get("chrome://settings")

root1 = driver.find_element_by_tag_name('settings-ui')
shadow_root1 = expand_shadow_element(root1)

root2 = shadow_root1.find_element_by_css_selector('[page-name="Settings"]')
shadow_root2 = expand_shadow_element(root2)

root3 = shadow_root2.find_element_by_id('search')
shadow_root3 = expand_shadow_element(root3)

search_button = shadow_root3.find_element_by_id("searchTerm")
search_button.click()

text_area = shadow_root3.find_element_by_id('searchInput')
text_area.send_keys("content settings")

root0 = shadow_root1.find_element_by_id('main')
shadow_root0_s = expand_shadow_element(root0)

root1_p = shadow_root0_s.find_element_by_css_selector('settings-basic-page')
shadow_root1_p = expand_shadow_element(root1_p)

root1_s = shadow_root1_p.find_element_by_css_selector('settings-privacy-page')
shadow_root1_s = expand_shadow_element(root1_s)

content_settings_div = shadow_root1_s.find_element_by_css_selector('#site-settings-subpage-trigger')
content_settings = content_settings_div.find_element_by_css_selector("button")
content_settings.click()
There is also the ready-to-use pyshadow pip module, which worked in my case. Example below:
from pyshadow.main import Shadow
from selenium import webdriver
driver = webdriver.Chrome('chromedriver.exe')
shadow = Shadow(driver)
element = shadow.find_element("#Selector_level1")
element1 = shadow.find_element("#Selector_level2")
element2 = shadow.find_element("#Selector_level3")
element3 = shadow.find_element("#Selector_level4")
element4 = shadow.find_element("#Selector_level5")
element5 = shadow.find_element('#control-button') #target selector
element5.click()
You can use the driver.executeScript() method to access the HTML elements and JavaScript objects in your web page.
In the example below, executeScript will return, in a Promise, the NodeList of all <a> elements present in the shadow tree of the element whose id is host. Then you can perform your assertion test:
it('check shadow root content', function () {
    return driver.executeScript(function () {
        return host.shadowRoot.querySelectorAll('a')
    }).then(function (n) {
        return expect(n).to.have.length(3)
    })
})
Note: I don't know Python so I've used the JavaScript syntax but it should work the same way.
I would add this as a comment but I don't have enough reputation points.
The answer by Eduard Florinescu works well, with the caveat that once you're inside a shadowRoot you only have the Selenium methods that correspond to the available JS methods, mainly select by id.
To get around this I wrote a longer JS function in a Python string and used native JS methods and attributes (find by id, children plus indexing, etc.) to get the element I ultimately needed.
You can use this method to also access the shadowRoots of child elements, and so on, when the JS string is run with driver.execute_script(); see the sketch below.
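For illustration only, such a walker kept in a Python string might look like this; every selector in it is a hypothetical placeholder that depends on your page:
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://example.com")  # hypothetical page with nested shadow roots

# All selectors below are placeholders - replace them with your page's hosts/ids.
js_walker = """
var el = document.querySelector('outer-host');      // first shadow host
el = el.shadowRoot.querySelector('inner-host');     // nested shadow host
el = el.shadowRoot.getElementById('target-id');     // native JS lookup by id
return el.children[0];                              // index into children if needed
"""
target = driver.execute_script(js_walker)
target.click()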
With Selenium 4.1 there's a new attribute, shadow_root, on the WebElement class.
From the docs:
Returns a shadow root of the element if there is one or an error. Only works from Chromium 96 onwards. Previous versions of Chromium based browsers will throw an assertion exception.
Returns:
ShadowRoot object or
NoSuchShadowRoot - if no shadow root was attached to element
A ShadowRoot object has the methods find_element and find_elements but they're currently limited to:
By.ID
By.CSS_SELECTOR
By.NAME
By.CLASS_NAME
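As a quick illustration of the new attribute, here is a minimal sketch using the chrome://downloads hosts referenced elsewhere in this thread (requires Chromium 96+, assumes at least one item in the downloads list, and the element names may change between Chrome versions):
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("chrome://downloads/")

# Selenium 4.1+: .shadow_root returns a ShadowRoot that you can search inside
manager = driver.find_element(By.TAG_NAME, "downloads-manager")
item = manager.shadow_root.find_element(By.CSS_SELECTOR, "downloads-item")
link = item.shadow_root.find_element(By.CSS_SELECTOR, "a#file-link")
print(link.text)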
Shadow roots and explicit waits
You can also combine that with WebDriverWait and expected_conditions to obtain decent behaviour. The only caveat is that you must use expected conditions that accept WebElement objects. At the moment that is just one of the following:
element_selection_state_to_be
element_to_be_clickable
element_to_be_selected
invisibility_of_element
staleness_of
visibility_of
Example
e.g. borrowing the example from eduard-florinescu
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
driver = webdriver.Chrome()
timeout = 10
driver.get("chrome://settings")
root1 = driver.find_element_by_tag_name('settings-ui')
shadow_root1 = root1.shadow_root
root2 = WebDriverWait(driver, timeout).until(EC.visibility_of(shadow_root1.find_element(by=By.CSS_SELECTOR, value='[page-name="Settings"]')))
shadow_root2 = root2.shadow_root
root3 = WebDriverWait(driver, timeout).until(EC.visibility_of(shadow_root2.find_element(by=By.ID, value='search')))
shadow_root3 = root3.shadow_root
search_button = WebDriverWait(driver, timeout).until(EC.visibility_of(shadow_root3.find_element(by=By.ID, value="searchTerm")))
search_button.click()
text_area = WebDriverWait(driver, timeout).until(EC.visibility_of(shadow_root3.find_element(by=By.ID, value='searchInput')))
text_area.send_keys("content settings")
root0 = WebDriverWait(driver, timeout).until(EC.visibility_of(shadow_root1.find_element(by=By.ID, value='main')))
shadow_root0_s = root0.shadow_root
root1_p = WebDriverWait(driver, timeout).until(EC.visibility_of(shadow_root0_s.find_element(by=By.CSS_SELECTOR, value='settings-basic-page')))
shadow_root1_p = root1_p.shadow_root
root1_s = WebDriverWait(driver, timeout).until(EC.visibility_of(shadow_root1_p.find_element(by=By.CSS_SELECTOR, value='settings-privacy-page')))
shadow_root1_s = root1_s.shadow_root
content_settings_div = WebDriverWait(driver, timeout).until(EC.visibility_of(shadow_root1_s.find_element(by=By.CSS_SELECTOR, value='#site-settings-subpage-trigger')))
content_settings = WebDriverWait(driver, timeout).until(EC.visibility_of(content_settings_div.find_element(by=By.CSS_SELECTOR, value="button")))
content_settings.click()
I originally implemented Eduard's solution just slightly modified as a loop for simplicity. But when Chrome updated to 96.0.4664.45 selenium started returning a dict instead of a WebElement when calling 'return arguments[0].shadowRoot'.
I did a little hacking around and found out I could get Selenium to return a WebElement by calling return arguments[0].shadowRoot.querySelector("tag").
Here's what my final solution ended up looking like:
def get_balance_element(self):
    # Loop through nested shadow root tags
    tags = [
        "tag2",
        "tag3",
        "tag4",
        "tag5",
    ]
    root = self.driver.find_element_by_tag_name("tag1")
    for tag in tags:
        root = self.expand_shadow_element(root, tag)
    # Finally there. GOLD!
    return [root]

def expand_shadow_element(self, element, tag):
    shadow_root = self.driver.execute_script(
        f'return arguments[0].shadowRoot.querySelector("{tag}")', element)
    return shadow_root
Clean and simple; it works for me.
Also, I could only get this working on Selenium 3.141.0. Selenium 4.1 has a half-baked shadow DOM implementation that just manages to break everything.
The items downloaded by Google Chrome sit within multiple #shadow-root (open) elements.
Solution
To extract the contents of the table you have to use shadowRoot.querySelector() and you can use the following locator strategy:
Code Block:
driver = webdriver.Chrome(service=s, options=options)
driver.execute("get", {'url': 'chrome://downloads/'})
time.sleep(5)
download = driver.execute_script("""return document.querySelector('downloads-manager').shadowRoot.querySelector('downloads-item').shadowRoot.querySelector('a#file-link')""")
print(download.text)

Selenium Python Finding actual xPath of current node within loop

browser = webdriver.Firefox()
browser.get("http://www.example.com/admin/orders")
total_images = len(browser.find_elements_by_xpath('/html/body/div[5]/div[3]/div[1]/div[3]/div[4]/form[1]//img'))
for i in range(1, total_images):
    compiledData['item_url'] = browser.find_elements_by_xpath('/html/body/div[5]/div[3]/div[1]/div[3]/div[4]/form[1]//img[' + str(i) + ']')
Now what I need is the actual path of the element. As you can see, I am looping through the images using a dynamically built XPath, and within the loop I need the exact XPath of the image matched in the current iteration.
For example, if it is running for the first image it should give:
/html/body/div[5]/div[3]/div[1]/div[3]/div[4]/form[1]/img[1]
and if it's running for the second one:
/html/body/div[5]/div[3]/div[1]/div[3]/div[4]/form[1]/img[2]
I want to extract the actual XPath of the current node within that loop.
I want something like this PHP code:
$breakup = $xpath->query('/html/body/div[5]/div[3]/div[1]/div[3]//*[text()="Package total"]');
if($breakup->length != 0)
{
    $breakupPath = substr($breakup->item(0)->getNodePath(), 0, -2)."2]";
    $orderData['total'] = str_replace("Rs.", "", trim($xpath->query($breakupPath)->item(0)->nodeValue));
}
Here, $breakup->item(0)->getNodePath() gives the real XPath of the dynamically identified element.
One option here would be to use one of the XPath generating functions suggested here:
Javascript get XPath of a node
And execute them using execute_script(). Example code:
from selenium import webdriver
driver = webdriver.Firefox()
driver.get("http://www.w3schools.com/")
driver.execute_script("""
window.getPathTo = function (element) {
    if (element.id !== '')
        return 'id("' + element.id + '")';
    if (element === document.body)
        return element.tagName;

    var ix = 0;
    var siblings = element.parentNode.childNodes;
    for (var i = 0; i < siblings.length; i++) {
        var sibling = siblings[i];
        if (sibling === element)
            return window.getPathTo(element.parentNode) + '/' + element.tagName + '[' + (ix + 1) + ']';
        if (sibling.nodeType === 1 && sibling.tagName === element.tagName)
            ix++;
    }
}
""")
element = driver.find_element_by_class_name("w3-fluid")
generated_xpath = driver.execute_script("return window.getPathTo(arguments[0]);", element)
print(generated_xpath)
Another option would be to get the page source and feed it to the lxml.html HTML parser, whose element tree has a getpath() method:
from selenium import webdriver
import lxml.html
driver = webdriver.Firefox()
driver.get("http://www.w3schools.com/")
tree = lxml.html.fromstring(driver.page_source)
root = tree.getroottree()
driver.close()
element = tree.xpath("//*[#class='w3-fluid']")[0]
print(root.getpath(element))
