I want to scrape product listings from arbitrary websites. Some sites (for example Amazon and Alibaba) show at most 10 products per page while others show 20. I don't want to hard-code a loop over numbered XPaths for each and every website.
Is there a way to get all the XPaths related to a particular element on any website? For example, given the XPath of a table, it would list the XPaths of all the table's rows. Any help would be appreciated.
Here is the example I will use.
XPath of the <ul> tag:
/html/body/div[4]/div/aside[1]/div[2]/div[2]/div/ul
XPaths of the <li> tags:
/html/body/div[4]/div/aside[1]/div[2]/div[2]/div/ul/li[1]
/html/body/div[4]/div/aside[1]/div[2]/div[2]/div/ul/li[2]
/html/body/div[4]/div/aside[1]/div[2]/div[2]/div/ul/li[3]
/html/body/div[4]/div/aside[1]/div[2]/div[2]/div/ul/li[4]
/html/body/div[4]/div/aside[1]/div[2]/div[2]/div/ul/li[5]
What you can do is write a more general XPath that matches all of the elements you want.
So say you want to find all the elements in the list:
/html/body/div[4]/div/aside[1]/div[2]/div[2]/div/ul/li
Notice there is no [n] index at the end of that XPath, so it will match every li element at that location.
An example:
from selenium import webdriver

url = 'https://www.livesoccertv.com/'
driver = webdriver.Firefox()
driver.get(url)

# No trailing index on the final tr, so every matching row is returned
test = driver.find_elements_by_xpath('/html/body/div/div[5]/div[3]/div/table[2]/tbody/tr')
print(len(test))
driver.close()
This returns a result of 35
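If you then want the data out of each matched row rather than just the count, you can iterate over the list; a minimal sketch continuing from the code above:

for row in test:
    # .text returns the visible text of the whole row
    print(row.text)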
I had been trying to solve this issue for some time and tried multiple solutions posted here before opening this question.
I am currently attempting to run a scraper with the following code:
from pathlib import Path
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

website = 'https://www.abitareco.it/nuove-costruzioni-milano.html'
path = Path().joinpath('util', 'chromedriver')
driver = webdriver.Chrome(path)
driver.get(website)
main = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.NAME, "p1")))
The hyperlink I'm after has the word scheda in it:
i = driver.find_element_by_xpath('.//a[contains(@href, "scheda")]')
i.text
My first issue is that find_element_by_xpath only returns a single hyperlink, and my second issue is that it isn't extracting anything so far.
I'd appreciate any help and/or guidance.
You need to use find_elements instead:
for name in driver.find_elements(By.XPATH, ".//a[contains(@href, 'scheda')]"):
print(name.text)
Note that find_elements returns a list of web elements, whereas find_element returns a single web element.
If you are specifically looking for the href attribute, you can try the code below:
for name in driver.find_elements(By.XPATH, ".//a[contains(@href, 'scheda')]"):
print(name.get_attribute('href'))
Looking at the website, there are two issues:
You want to find all the elements, not just one, so you need to use find_elements, not find_element.
The anchors don't actually contain any text, so .text won't return anything.
Assuming what you want is to scrape the URLs of all these links, you can use .get_attribute('href') instead of .text, like so:
url_list = driver.find_elements(By.XPATH, './/a[contains(@href, "scheda")]')
for i in url_list:
print(i.get_attribute('href'))
It will collect all web elements that match your criteria and store them in a list. I just used print as an example, but you will probably want to do more than just print the links.
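If your next step is to visit each of those links, one pattern that avoids stale-element errors is to read the URLs first and navigate afterwards; a minimal sketch continuing from the code above:

# Read the hrefs up front so navigating doesn't invalidate the elements
urls = [a.get_attribute('href') for a in url_list]
for url in urls:
    driver.get(url)
    # ... scrape each detail page here ...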
I am trying to get the author and content of the comments from a website, but I found that its page source and its inspected elements are different. I tried BeautifulSoup but couldn't get anything back from it, so I tried Selenium and still couldn't get anything. I inspected the elements on the website and passed the class name to Selenium, but it still scraped nothing. Here is the code that I wrote:
import requests
from bs4 import BeautifulSoup
from selenium import webdriver

web = "https://www.regulations.gov/document?D=WHD-2020-0007-0609"

# Selenium
driver = webdriver.Chrome()
driver.get(web)
name = driver.find_elements_by_name("GIY1LSJBID")

# BeautifulSoup
page = requests.get(web)
soup = BeautifulSoup(page.text, 'html.parser')
quotes = soup.find_all('div')
I am wondering whether I did something wrong and how I can fix it.
You've already given the answer yourself. You are searching for an element by its class name, but you use find_elements_by_name, which searches by the name attribute of an element, not its class. Also, find_elements with an 's' on the end returns a list of elements, not a single element.
In your case you need find_element_by_class_name("GIY1LSJBID")
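Since the page is rendered by JavaScript (which is also why requests/BeautifulSoup came back empty), it helps to wait for the elements before reading them. A minimal sketch, assuming the class name stays stable across page loads:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://www.regulations.gov/document?D=WHD-2020-0007-0609")

# Wait until at least one element with that class is present, then read them all
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CLASS_NAME, "GIY1LSJBID"))
)
for element in driver.find_elements_by_class_name("GIY1LSJBID"):
    print(element.text)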
I have this html tree in a web app:
I have scraped all the text for all the league names.
But I also need an XPath or some other identifier so that I can tell Selenium: if I choose, for example, EFL League Two (ENG 4) from a drop-down menu in my GUI, use the corresponding XPath to select the right league in the web app.
I have no idea how to extract an XPath from that tree, nor any other approach that would work for my scenario.
Any idea how I could fix this?
If I try to extract an href, it just prints None.
This is my code so far:
import time

# driver is assumed to be an already-initialised Selenium WebDriver pointed at the web app
def scrape_test():
    leagues = []

    # Click the dropdown menu to open the folder with all the leagues
    league_dropdown_menu = driver.find_element_by_xpath('/html/body/main/section/section/div[2]/div/div[2]/div/div[1]/div[1]/div[7]/div')
    league_dropdown_menu.click()
    time.sleep(1)

    # Get all league names as text
    scrape_leagues = driver.find_elements_by_xpath("//li[@class='with-icon' and contains(text(), '')]")
    for league in scrape_leagues:
        leagues.append(league.text)
    print('\n')

    # HERE I NEED HELP! - I try to get a link/xpath for each corresponding league to use later with Selenium
    scrape_leagues_xpath = driver.find_elements_by_xpath("//li[@class='with-icon']")
    for xpath in scrape_leagues_xpath:
        leagues.append(xpath.get_attribute('xpath'))  # neither xpath, text, nor href works here
    print(leagues)
The li node doesn't have text, href, or xpath attributes (xpath isn't a valid HTML attribute at all). You can scrape and parse @style instead.
Try this approach to extract the background-image URL:
leagues.append(xpath.get_attribute('style').strip('background-image:url("').rstrip('");'))
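Be aware that str.strip() removes any of the listed characters from both ends rather than a literal prefix/suffix, so it can eat into the URL itself. A regex is more reliable; a sketch, assuming the usual background-image: url("...") form:

import re

style = xpath.get_attribute('style')
match = re.search(r'url\("?([^")]+)"?\)', style)
if match:
    leagues.append(match.group(1))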
I am attempting to scrape Kickstarter based on project names alone. Using the project name and the base URL I can get to the search page. In order to scrape the project page, I need to use Selenium to click on the URL. However, I cannot point Selenium to the correct element to click on. I would also like this to be dynamic so I do not need to enter the project name each time.
<div class="type-18 clamp-5 navy-500 mb3">
  <a href="https://www.kickstarter.com/projects/1980119549/knife-block-designed-by-if-and-red-dot-winner-jle?ref=discovery&term=Knife%20block%20-%20Designed%20by%20IF%20and%20Red%20dot%20winner%20JLE%20Design"
     class="soft-black hover-text-underline">Knife block - Designed by IF and Red dot winner JLE Design
  </a>
</div>
from selenium import webdriver

driver = webdriver.Chrome(chrome_path)
url = 'https://www.kickstarter.com/discover/advanced?ref=nav_search&term=Knife block - Designed by IF and Red dot winner JLE Design'
driver.get(url)
elem = driver.find_elements_by_link_text('Knife block - Designed by IF and Red dot winner JLE Design')
elem.click()
How can I get the elem to point to the correct link?
Regarding your attempt, your code has a typo: find_elements... returns a list of elements, so the method .click() will not work on it. You mean to use find_element.
To dynamically click links, use an XPath instead. The resulting code would be:
elem = driver.find_element_by_xpath('//div[contains(#class, "type-18")]/a')
elem.click()
This grabs the first match. You could use find_elements and iterate over the elements, but that is a bad approach here: clicking a link loads a new page, which makes the previously found elements stale. If there is more than one link, use the same XPath but indexed:
first_elem = driver.find_element_by_xpath('(//div[contains(#class, "type-18")]/a)[1]')
first_elem.click()
# ...
second_elem = driver.find_element_by_xpath('(//div[contains(#class, "type-18")]/a)[2]')
second_elem.click()
# And so forth...
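Alternatively, you can sidestep the staleness problem entirely by collecting every href up front and navigating with driver.get(); a minimal sketch using the same XPath:

links = driver.find_elements_by_xpath('//div[contains(@class, "type-18")]/a')
urls = [link.get_attribute('href') for link in links]
for url in urls:
    driver.get(url)
    # ... scrape each project page here ...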
I am trying to count the number of items in a list box on a webpage and then select multiple items from it. I can select the items fine; I am just struggling to count the items in the list box.
See the code:
from selenium import webdriver
from selenium.webdriver.support.ui import Select
... ...
accountListBox = Select(driver.find_element_by_id("ctl00_MainContent_accountItemsListBox"))
accountListBox.select_by_index(0)
print(len(accountListBox))
I have tried using len(), which results in the error "TypeError: object of type 'Select' has no len()".
I have also tried accountListBox.size(), and removing the Select from line 3, neither of which works.
Pretty new to this so would appreciate your feedback.
Thanks!
According to the docs, a list of a Select element's options can be obtained via select.options. In your particular case this would be accountListBox.options, and you need to call len() on that, not on the Select instance itself:
print(len(accountListBox.options))
Or, if you only want to print a list of currently selected options:
print(len(accountListBox.all_selected_options))
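Putting it together, a minimal sketch (the element ID is taken from your snippet; the URL is a placeholder):

from selenium import webdriver
from selenium.webdriver.support.ui import Select

driver = webdriver.Chrome()
driver.get('http://your-page-url')  # placeholder URL

accountListBox = Select(driver.find_element_by_id("ctl00_MainContent_accountItemsListBox"))
print(len(accountListBox.options))               # total number of items
print(len(accountListBox.all_selected_options))  # currently selected items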
You could also use find_elements with a selector common to all of the list box's items, store the found elements in a variable, and count them with Python's built-in len().
I usually use Selenium along with Beautiful Soup. Beautiful Soup is a Python package for parsing HTML and XML documents.
With Beautiful Soup you can get the count of items in a list box in the following way:
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.PhantomJS() # or webdriver.Firefox()
driver.get('http://some-website.com/some-page/')
html = driver.page_source.encode('utf-8')
b = BeautifulSoup(html, 'lxml')
items = b.find_all('p', attrs={'id':'ctl00_MainContent_accountItemsListBox'})
print(len(items))
I assumed that the DOM element you want to find is a paragraph (p tag), but you can replace this with whatever element you need to find.