I have this html tree in a web app:
I have scraped all the text for all the league names.
But I also need an XPath or some other indicator so that I can tell Selenium: if I choose, for example, EFL League Two (ENG 4) in my GUI from e.g. a drop-down menu, then use the corresponding XPath to choose the right league in the web app.
I have no idea how I could either extract an XPath from that tree or find any other solution that would work for my scenario.
Any idea how I could fix this?
If I try to extract the 'href' attribute, it just prints "None".
This is my code so far:
def scrape_test():
    leagues = []
    # click the dropdown menu to open the folder with all the leagues
    league_dropdown_menu = driver.find_element_by_xpath('/html/body/main/section/section/div[2]/div/div[2]/div/div[1]/div[1]/div[7]/div')
    league_dropdown_menu.click()
    time.sleep(1)
    # get all league names as text
    scrape_leagues = driver.find_elements_by_xpath("//li[@class='with-icon']")
    for league in scrape_leagues:
        leagues.append(league.text)
    print('\n')
    # HERE I NEED HELP! - I try to get a link/xpath for each corresponding league to use later with selenium
    scrape_leagues_xpath = driver.find_elements_by_xpath("//li[@class='with-icon']")
    for xpath in scrape_leagues_xpath:
        leagues.append(xpath.get_attribute('xpath'))  # neither xpath, text nor href works here
    print(leagues)
The li node doesn't have a text, href or xpath attribute (xpath isn't a valid HTML attribute at all). You can scrape and parse its style attribute instead.
Try this approach to extract the background-image URL (note that str.strip removes a set of characters from the ends, not a literal prefix/suffix, so splitting the string is more reliable):
leagues.append(xpath.get_attribute('style').split('url("')[1].split('")')[0])
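As a minimal sketch (the style strings below are hypothetical stand-ins for what get_attribute('style') might return), a regular expression handles optional spaces and both quote styles more robustly than string splitting:

```python
import re

# Hypothetical style values, as Selenium's get_attribute('style') might return them
styles = [
    'background-image:url("https://example.com/img/efl-league-two.png");',
    "background-image: url('https://example.com/img/bundesliga.png');",
]

def background_url(style):
    """Pull the URL out of a background-image declaration, or None if absent."""
    match = re.search(r'url\(["\']?(.*?)["\']?\)', style)
    return match.group(1) if match else None

for s in styles:
    print(background_url(s))
```

Pairing these URLs (or the elements themselves) with the league texts scraped earlier, e.g. in a dict keyed by league name, would give the name-to-element mapping the question asks for.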
Scraping links should be a simple feat, usually just grabbing the href value of the a tag.
I recently came across this website (https://sunteccity.com.sg/promotions) where the href value of the a tags of each item cannot be found, but the redirection still works. I'm trying to figure out a way to grab the items and their corresponding links. My typical Python Selenium code looks something like this:
all_items = bot.find_elements_by_class_name('thumb-img')
for promo in all_items:
    a = promo.find_elements_by_tag_name("a")
    print("a[0]: ", a[0].get_attribute("href"))
However, I can't seem to retrieve any href or onclick attributes, and I'm wondering if this is even possible. I noticed that I couldn't right-click and open the link in a new tab either.
Are there any ways around getting the links of all these items?
Edit: Are there any ways to retrieve all the links of the items on the pages?
i.e.
https://sunteccity.com.sg/promotions/724
https://sunteccity.com.sg/promotions/731
https://sunteccity.com.sg/promotions/751
https://sunteccity.com.sg/promotions/752
https://sunteccity.com.sg/promotions/754
https://sunteccity.com.sg/promotions/280
...
Edit:
Adding an image of one such anchor tag for better clarity:
Reverse-engineering the JavaScript that takes you to the promotions pages (seen in https://sunteccity.com.sg/_nuxt/d4b648f.js) gives you a way to get all the links, which are based on the HappeningID. You can verify by running this in the JS console, which gives you the first promotion:
window.__NUXT__.state.Promotion.promotions[0].HappeningID
Based on that, you can create a Python loop to get all the promotions:
items = driver.execute_script("return window.__NUXT__.state.Promotion;")
for item in items["promotions"]:
    base = "https://sunteccity.com.sg/promotions/"
    happening_id = str(item["HappeningID"])
    print(base + happening_id)
That generated the following output:
https://sunteccity.com.sg/promotions/724
https://sunteccity.com.sg/promotions/731
https://sunteccity.com.sg/promotions/751
https://sunteccity.com.sg/promotions/752
https://sunteccity.com.sg/promotions/754
https://sunteccity.com.sg/promotions/280
https://sunteccity.com.sg/promotions/764
https://sunteccity.com.sg/promotions/766
https://sunteccity.com.sg/promotions/762
https://sunteccity.com.sg/promotions/767
https://sunteccity.com.sg/promotions/732
https://sunteccity.com.sg/promotions/733
https://sunteccity.com.sg/promotions/735
https://sunteccity.com.sg/promotions/736
https://sunteccity.com.sg/promotions/737
https://sunteccity.com.sg/promotions/738
https://sunteccity.com.sg/promotions/739
https://sunteccity.com.sg/promotions/740
https://sunteccity.com.sg/promotions/741
https://sunteccity.com.sg/promotions/742
https://sunteccity.com.sg/promotions/743
https://sunteccity.com.sg/promotions/744
https://sunteccity.com.sg/promotions/745
https://sunteccity.com.sg/promotions/746
https://sunteccity.com.sg/promotions/747
https://sunteccity.com.sg/promotions/748
https://sunteccity.com.sg/promotions/749
https://sunteccity.com.sg/promotions/750
https://sunteccity.com.sg/promotions/753
https://sunteccity.com.sg/promotions/755
https://sunteccity.com.sg/promotions/756
https://sunteccity.com.sg/promotions/757
https://sunteccity.com.sg/promotions/758
https://sunteccity.com.sg/promotions/759
https://sunteccity.com.sg/promotions/760
https://sunteccity.com.sg/promotions/761
https://sunteccity.com.sg/promotions/763
https://sunteccity.com.sg/promotions/765
https://sunteccity.com.sg/promotions/730
https://sunteccity.com.sg/promotions/734
https://sunteccity.com.sg/promotions/623
You are using the wrong locator; it brings back a lot of irrelevant elements.
Instead of find_elements_by_class_name('thumb-img'), please try find_elements_by_css_selector('.collections-page .thumb-img'), so your code will be:
all_items = bot.find_elements_by_css_selector('.collections-page .thumb-img')
for promo in all_items:
    a = promo.find_elements_by_tag_name("a")
    print("a[0]: ", a[0].get_attribute("href"))
You can also get the desired links directly with the .collections-page .thumb-img a locator, so your code could be:
links = bot.find_elements_by_css_selector('.collections-page .thumb-img a')
for link in links:
    print(link.get_attribute("href"))
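If you then need the hrefs as data rather than printed output, a small post-processing step helps; this sketch uses plain strings standing in for what get_attribute("href") returns and keeps only unique, non-empty links in page order:

```python
# Stand-in values for link.get_attribute("href") results: an element without
# an href yields None, and the same promotion can appear under several anchors.
hrefs = [
    "https://sunteccity.com.sg/promotions/724",
    None,
    "https://sunteccity.com.sg/promotions/731",
    "https://sunteccity.com.sg/promotions/724",
]

seen = set()
unique_links = []
for href in hrefs:
    if href and href not in seen:  # drop missing attributes and duplicates
        seen.add(href)
        unique_links.append(href)

print(unique_links)
```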
I am practicing web scraping with Python, and I have a problem because of the structure of the webpage.
There are two <a> tags in the same list class.
I want to extract just the link of the post, and I want to know how I can extract this link without touching the other one.
Right now, my code looks like this:
def extract_job(post):
    title = post.find("span", {"class": "title"})
    company = post.find("span", {"class": "company"})
    location = post.find("span", {"class": "region"})
    link = post.find
What do I have to put after the find function?
Instead of using the find method, you can do it with CSS selectors via the select method. In the inspector, right-click the element you want, choose Copy > Copy selector, and use that selector in your code:
link = post.select(put the copied selector here)
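A minimal sketch of the idea, using hypothetical markup (the class names and URLs are made up for illustration); select_one with a selector specific to the post link skips the second anchor entirely:

```python
from bs4 import BeautifulSoup

# Hypothetical list item: two <a> tags, only the first is the post link.
html = """
<li class="post">
  <a class="post-link" href="https://example.com/jobs/123">
    <span class="title">Data Engineer</span>
  </a>
  <a class="company-link" href="https://example.com/companies/acme">
    <span class="company">ACME</span>
  </a>
</li>
"""

post = BeautifulSoup(html, "html.parser").find("li", {"class": "post"})

# select_one returns only the first element matching the CSS selector
link = post.select_one("a.post-link")["href"]
print(link)
```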
I have a list of job roles on this website that I want to scrape. The code I am using is below:
driver.get('https://jobs.ubs.com/TGnewUI/Search/home/HomeWithPreLoad?partnerid=25008&siteid=5012&PageType=searchResults&SearchType=linkquery&LinkID=6017#keyWordSearch=&locationSearch=')
job_roles = driver.find_elements(By.XPATH, '/html/body/div[2]/div[2]/div[1]/div[6]/div[3]/div/div/div[5]/div[2]/div/div[1]/ul/li[1]/div[2]/div[1]/span/a')
for job_role in job_roles:
    text = job_role.text
    print(text)
With this code, I am able to retrieve the first role, which is: Business Analyst - IB Credit Risk Change.
I am unable to retrieve the other roles; can someone kindly assist?
Thanks
In this case all the job names have the two CSS classes jobProperty and jobtitle.
So, since you want all the jobs, I recommend selecting by CSS selector.
The following example should work:
driver.get('https://jobs.ubs.com/TGnewUI/Search/home/HomeWithPreLoad?partnerid=25008&siteid=5012&PageType=searchResults&SearchType=linkquery&LinkID=6017#keyWordSearch=&locationSearch=')
job_roles = driver.find_elements_by_css_selector('.jobProperty.jobtitle')
for job_role in job_roles:
    text = job_role.text
    print(text)
If you want to use the XPath, you were very close. Your XPath specifically selects only the first li element (li[1]). By changing it to just li, it will find all matching elements:
driver.get('https://jobs.ubs.com/TGnewUI/Search/home/HomeWithPreLoad?partnerid=25008&siteid=5012&PageType=searchResults&SearchType=linkquery&LinkID=6017#keyWordSearch=&locationSearch=')
job_roles = driver.find_elements(By.XPATH, '/html/body/div[2]/div[2]/div[1]/div[6]/div[3]/div/div/div[5]/div[2]/div/div[1]/ul/li/div[2]/div[1]/span/a')
for job_role in job_roles:
    text = job_role.text
    print(text)
I want to scrape the product listings from any website. Some websites, for example Amazon and Alibaba, have at most 10 products on a page, while others have 20. I don't want to write for loops over hard-coded XPaths for each and every website.
Is there a way to get all the XPaths related to a particular element of any website? For example, if we have the XPath of a table, it would list all the XPaths inside that table. Any help would be appreciated...
Here is the example HTML I will use
XPath of the <ul> tag:
/html/body/div[4]/div/aside[1]/div[2]/div[2]/div/ul
XPaths of the <li> tags:
/html/body/div[4]/div/aside[1]/div[2]/div[2]/div/ul/li[1]
/html/body/div[4]/div/aside[1]/div[2]/div[2]/div/ul/li[2]
/html/body/div[4]/div/aside[1]/div[2]/div[2]/div/ul/li[3]
/html/body/div[4]/div/aside[1]/div[2]/div[2]/div/ul/li[4]
/html/body/div[4]/div/aside[1]/div[2]/div[2]/div/ul/li[5]
What you can do is make a more general XPath that will grab all of the XPaths you want.
So say you want to find all the elements in the list:
/html/body/div[4]/div/aside[1]/div[2]/div[2]/div/ul/li
Notice there is no index (like [1] or [2]) at the end of that XPath, so it will find all elements matching that path.
An example:
from selenium import webdriver

url = 'https://www.livesoccertv.com/'
driver = webdriver.Firefox()
driver.get(url)
test = driver.find_elements_by_xpath('/html/body/div/div[5]/div[3]/div/table[2]/tbody/tr')
print(len(test))
driver.close()
This returns a result of 35
I am attempting to scrape Kickstarter based on project names alone. Using the project name and the base URL, I can get to the search page. In order to scrape the project page, I need Selenium to click on the URL, but I cannot point Selenium to the correct element to click on. I would also like this to be dynamic, so I do not need to type in the project name each time.
<div class="type-18 clamp-5 navy-500 mb3">
  <a href="https://www.kickstarter.com/projects/1980119549/knife-block-designed-by-if-and-red-dot-winner-jle?ref=discovery&term=Knife%20block%20-%20Designed%20by%20IF%20and%20Red%20dot%20winner%20JLE%20Design"
     class="soft-black hover-text-underline">Knife block - Designed by IF and Red dot winner JLE Design
  </a>
</div>
driver = webdriver.Chrome(chrome_path)
url = 'https://www.kickstarter.com/discover/advanced?ref=nav_search&term=Knife block - Designed by IF and Red dot winner JLE Design'
driver.get(url)
elem = driver.find_elements_by_link_text('Knife block - Designed by IF and Red dot winner JLE Design')
elem.click()
How can I get the elem to point to the correct link?
Regarding your attempt: your code has a typo. find_elements_by_link_text returns a list of elements, so the .click() method will not work on it; you meant to use find_element.
To dynamically click links, use an XPath instead. The resulting code would be:
elem = driver.find_element_by_xpath('//div[contains(@class, "type-18")]/a')
elem.click()
This would grab the first match. You could use find_elements and iterate over the elements, but that would be a bad approach here: since you're clicking the links, each click renders the previous page stale. If there's more than one link, use the same XPath but indexed:
first_elem = driver.find_element_by_xpath('(//div[contains(@class, "type-18")]/a)[1]')
first_elem.click()
# ...
second_elem = driver.find_element_by_xpath('(//div[contains(@class, "type-18")]/a)[2]')
second_elem.click()
# And so forth...
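The indexed pattern above can also be generated in a loop. Here is a sketch that only builds the XPath strings (the driver calls are left as comments, since they need a live browser; the class name is taken from the snippet in the question):

```python
# Base XPath for the project links, per the markup shown in the question.
BASE_XPATH = '//div[contains(@class, "type-18")]/a'

def indexed_xpath(n):
    """XPath selecting the n-th (1-based) match of BASE_XPATH."""
    return '({})[{}]'.format(BASE_XPATH, n)

for n in range(1, 4):
    print(indexed_xpath(n))
    # In a real run, re-find and click on every iteration so the element
    # reference is never stale:
    # driver.find_element_by_xpath(indexed_xpath(n)).click()
    # ...scrape the project page...
    # driver.back()
```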