Python Selenium: how to scrape a list of values from a website

I have a list of job roles on this website that I want to scrape. The code I am using is below:
driver.get('https://jobs.ubs.com/TGnewUI/Search/home/HomeWithPreLoad?partnerid=25008&siteid=5012&PageType=searchResults&SearchType=linkquery&LinkID=6017#keyWordSearch=&locationSearch=')
job_roles = driver.find_elements(By.XPATH, '/html/body/div[2]/div[2]/div[1]/div[6]/div[3]/div/div/div[5]/div[2]/div/div[1]/ul/li[1]/div[2]/div[1]/span/a')
for job_roles in job_roles:
    text = job_roles.text
    print(text)
With this code, I am able to retrieve the first role, which is: Business Analyst - IB Credit Risk Change
I am unable to retrieve the other roles; can someone kindly assist?
Thanks

In this case all the job names have the two CSS classes jobProperty and jobtitle.
So, since you want all the jobs, I recommend selecting by CSS selector.
The following example should work:
driver.get('https://jobs.ubs.com/TGnewUI/Search/home/HomeWithPreLoad?partnerid=25008&siteid=5012&PageType=searchResults&SearchType=linkquery&LinkID=6017#keyWordSearch=&locationSearch=')
job_roles = driver.find_elements(By.CSS_SELECTOR, '.jobProperty.jobtitle')
for job_role in job_roles:
    text = job_role.text
    print(text)

If you want to use XPath, you were very close. Your XPath only selects the first li element (li[1]). By changing it to just li, it will match every list item:
driver.get('https://jobs.ubs.com/TGnewUI/Search/home/HomeWithPreLoad?partnerid=25008&siteid=5012&PageType=searchResults&SearchType=linkquery&LinkID=6017#keyWordSearch=&locationSearch=')
job_roles = driver.find_elements(By.XPATH, '/html/body/div[2]/div[2]/div[1]/div[6]/div[3]/div/div/div[5]/div[2]/div/div[1]/ul/li/div[2]/div[1]/span/a')
for job_role in job_roles:
    text = job_role.text
    print(text)

Related

Extract a hyperlink from a website - Selenium

I had been attempting to solve this issue for some time and tried multiple solutions posted here prior to opening this question.
I am currently attempting to run a scraper with the following code:
website = 'https://www.abitareco.it/nuove-costruzioni-milano.html'
path = Path().joinpath('util', 'chromedriver')
driver = webdriver.Chrome(path)
driver.get(website)
main = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.NAME, "p1")))
The hyperlink I am after has the word scheda in it:
i = driver.find_element_by_xpath('.//a[contains(@href, "scheda")]')
i.text
My first issue is that find_element_by_xpath only outputs a single hyperlink, and the second issue is that it is not extracting anything so far.
I'd appreciate any help and/or guidance.
You need to use find_elements instead:
for name in driver.find_elements(By.XPATH, ".//a[contains(@href, 'scheda')]"):
    print(name.text)
Note that find_elements will return a list of web elements, whereas find_element returns a single web element.
If you are specifically looking for the href attribute, then you can try the code below:
for name in driver.find_elements(By.XPATH, ".//a[contains(@href, 'scheda')]"):
    print(name.get_attribute('href'))
Looking at the website, there are two issues:
You want to find all elements, not just one, so you need to use find_elements, not find_element
The anchors actually don't have any text in them, so .text won't return anything.
Assuming what you want is to scrape the URLs of all these links, you can use .get_attribute('href') instead of .text, like so:
url_list = driver.find_elements(By.XPATH, './/a[contains(@href, "scheda")]')
for i in url_list:
    print(i.get_attribute('href'))
It will detect all web elements that match your criteria and store them in a list. I just used print as an example, but obviously you may want to do more than just print the links.
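For instance, a minimal sketch of collecting those URLs into a plain Python list for later processing (the variable name scheda_urls is just for illustration):
scheda_urls = []
for link in driver.find_elements(By.XPATH, './/a[contains(@href, "scheda")]'):
    href = link.get_attribute('href')
    if href:  # skip any anchor that has no href value
        scheda_urls.append(href)
print(len(scheda_urls), 'links collected')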

Scrape clickable link or xpath

I have this html tree in a web app:
I have scraped all the text for all the league names.
But I also need an XPath or some other indicator so that I can tell Selenium: if I choose, for example, EFL League Two (ENG 4) in my GUI from e.g. a drop-down menu, then use the corresponding XPath to choose the right league in the web app.
I have no idea how I could extract an XPath from that tree, nor any other solution that could be used for my scenario.
Any idea how I could fix this?
If I try to get an 'href' extracted, it just prints "None".
This is my code so far:
def scrape_test():
    leagues = []
    # click the dropdown menu to open the folder with all the leagues
    league_dropdown_menu = driver.find_element_by_xpath('/html/body/main/section/section/div[2]/div/div[2]/div/div[1]/div[1]/div[7]/div')
    league_dropdown_menu.click()
    time.sleep(1)
    # get all league names as text
    scrape_leagues = driver.find_elements_by_xpath("//li[@class='with-icon' and contains(text(), '')]")
    for league in scrape_leagues:
        leagues.append(league.text)
    print('\n')
    # HERE I NEED HELP! - I try to get a link/xpath for each corresponding league to use later with selenium
    scrape_leagues_xpath = driver.find_elements_by_xpath("//li[@class='with-icon']")
    for xpath in scrape_leagues_xpath:
        leagues.append(xpath.get_attribute('xpath'))  # neither xpath, text, nor href is working here
    print(leagues)
The li node doesn't have a text, href, or xpath attribute (I don't think xpath is a valid HTML attribute at all). You can scrape and parse @style instead.
Try this approach to extract the background-image URL:
leagues.append(xpath.get_attribute('style').strip('background-image:url("').rstrip('");'))
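Note that str.strip and rstrip remove any of the listed characters from the ends rather than a literal prefix or suffix, so a regex is a safer way to pull the URL out of the style attribute. A minimal sketch, assuming the style value looks like background-image:url("..."):
import re

style = xpath.get_attribute('style')
match = re.search(r'url\("?([^")]+)"?\)', style)  # capture whatever is inside url(...)
if match:
    leagues.append(match.group(1))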

How to access a text element in Selenium if it is split by b tags

I have a problem while trying to access some values on a website while web scraping the data. The problem is that the text I want to extract is in a class which contains several texts separated by b tags (these b tags also contain text which is important for me).
So firstly, I tried to look for the b tag with the text I needed ('Category' in this case) and then extract the exact category from the text right after this b tag. I could use a precise XPath, but that is not an option here because the other pages I need to scrape contain a different number of rows in this sidebar, so the locations, as well as the XPaths, are different.
The expected output is 'Utility', the category in the sidebar.
The website and the text I need to extract look like this (see the sidebar containing 'Category'):
The element looks like this:
And the code I tried:
driver = webdriver.Safari()
driver.get('https://www.statsforsharks.com/entry/MC_Squares')
element = driver.find_elements_by_xpath("//b[contains(text(), 'Category')]/following-sibling")
for value in element:
    print(value.text)
driver.close()
The link to the page with the data is https://www.statsforsharks.com/entry/MC_Squares.
Thank you!
You might be better off using regex here, as the whole text comes under the 'company-sidebar-body' class, where only some of the text is between b tags and some is not.
So, you can get the text of that class first:
sidebartext = driver.find_element_by_class_name("company-sidebar-body").text
That will give you the following:
"EOY Proj Sales: $1,000,000\r\nSales Prev Year: $200,000\r\nCategory: Utility\r\nAsking Deal\r\nEquity: 10%\r\nAmount: $300,000\r\nValue: $3,000,000\r\nEquity Deal\r\nSharks: Kevin O'Leary\r\nEquity: 25%\r\nAmount: $300,000\r\nValue: $1,200,000\r\nBite: -$1,800,000"
You can then use regex to target the category:
import re
c = re.search(r"Category:\s\w+", sidebartext).group()
print(c)
c will result in 'Category: Utility', which you can then work with. This will also work if the value of the category ('Utility') is different on other pages.
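If you need just the category value itself ('Utility', the expected output mentioned above), you can split off the label; a small follow-up on the c variable:
category = c.split(':', 1)[1].strip()
print(category)  # Utility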
There are easier ways when it's a MediaWiki website. You could, for instance, access the page data through the API with a JSON request and parse it with a much more limited DOM.
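For example, a minimal sketch of that idea using the standard MediaWiki api.php endpoint (the endpoint URL and page title below are assumptions; adjust them to whatever the wiki actually exposes):
import requests

API_URL = 'https://www.statsforsharks.com/api.php'  # assumed standard MediaWiki API location

params = {
    'action': 'parse',
    'page': 'MC_Squares',  # page title, assumed from the entry URL
    'prop': 'wikitext',
    'format': 'json',
}
data = requests.get(API_URL, params=params).json()
wikitext = data['parse']['wikitext']['*']
print(wikitext)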
Any particular reason you want to scrape my website?

Looping over follower names in Instagram

I am trying to generate a list of the follower names for a particular person using Selenium.
The XPath for the first user in the list is :
/html/body/div[3]/div/div/div[2]/ul/div/li[1]/div/div[2]/div[1]/div/div/a
I have been trying to loop over the li elements but got nothing useful.
Or maybe I can take the title attribute for each element of this class, but I cannot get that to work either.
What you are trying to accomplish can easily be done in plain python:
Copy the page HTML into a text file
Extract all anchor tags using regex
For each anchor tag, if the href value contains the anchor text value, then append the anchor text to the list of followers (see the sketch after these steps)
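A minimal sketch of that approach, assuming the followers dialog's HTML has been saved to a local file named followers.html (the file name and the exact anchor pattern are assumptions):
import re

with open('followers.html', encoding='utf-8') as f:
    html = f.read()

followers = []
# crude pattern: capture the href and anchor text of each <a> tag
for href, text in re.findall(r'<a[^>]*href="([^"]*)"[^>]*>([^<]+)</a>', html):
    if text and text in href:
        followers.append(text)

print(followers)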
Try to post your code, so it would be easier for us to view it and try to help you:
- HTML code
- Your code (Java/Python)
After the followers link is clicked, you need to wait until the follower dialog appears; use WebDriverWait:
# after the link is clicked
followers = WebDriverWait(driver, 5).until(
    lambda d: d.find_elements_by_xpath('//div[@role="dialog"]//a[@title]')
)
for follower in followers:
    print(follower.get_attribute('textContent'))
Note: your XPath only returns the first follower.

Want to extract a decimal number from a page with XPath, Selenium WebDriver in Python

I have a page having an item price as shown in the attached image. I want to extract this price as 64.99. What would be the XPath to get this number, as I am using Selenium WebDriver to find this price?
I have tried a lot of permutations of XPaths, but the problem is that this page has a lot of such products, so it is difficult to find a unique XPath for that price, e.g.:
//li[@class = 'price-current'] (gives 13 results on the page)
//*[@id = 'landingpage-price' and @class = 'price-current'] (gives no result)
Any help will be appreciated. Thanks
Since you mentioned there are a lot of such products, the problem you are asking about is framed the wrong way. You need to find out how to get to the product you are interested in and then find its price; you are trying to find the price directly.
Now, the issue in the below XPath
//*[@id = 'landingpage-price' and @class = 'price-current'] (gives no result)
is that you are trying to search inside landingpage-price while also putting the class condition on the container element. First, I would suggest doing this with CSS, but I will show both XPath and CSS.
XPath
elem = driver.find_element_by_xpath("//div[@id = 'landingpage-price']//li[@class = 'price-current']")
print (elem.text.replace("$",""))
CSS
elem = driver.find_element_by_css_selector("#landingpage-price .price-current")
print (elem.text.replace("$",""))
Your XPath would break if the developers add more classes to the price element, so using CSS is better, and it does work. As you can see in the image below, it uniquely identified the element.
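If you then need the price as an actual decimal number rather than a string, here is a small follow-up sketch building on the elem found above (the regex simply keeps the digits and the decimal point):
import re

price_text = elem.text  # e.g. "$64.99"
match = re.search(r'\d+(?:\.\d+)?', price_text)
if match:
    price = float(match.group())
    print(price)  # 64.99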
