Extract a hyperlink from a website - Selenium

Extract a hyperlink from a website - Selenium - python

I have been trying to solve this issue for a while and tried multiple solutions posted here before opening this question.
I am currently trying to run a scraper with the following code:
website = 'https://www.abitareco.it/nuove-costruzioni-milano.html'
path = Path().joinpath('util', 'chromedriver')
driver = webdriver.Chrome(path)
driver.get(website)
main = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.NAME, "p1")))
The hyperlinks I am after contain the word scheda:
i = driver.find_element_by_xpath('.//a[contains(@href, "scheda")]')
i.text
My first issue is that find_element_by_xpath returns only a single hyperlink, and my second issue is that it is not extracting anything so far.
I'd appreciate any help and/or guidance.

You need to use find_elements instead:
for name in driver.find_elements(By.XPATH, ".//a[contains(@href, 'scheda')]"):
    print(name.text)
Note that find_elements returns a list of web elements, whereas find_element returns a single web element.
If you are specifically looking for the href attribute, you can try the code below:
for name in driver.find_elements(By.XPATH, ".//a[contains(@href, 'scheda')]"):
    print(name.get_attribute('href'))

There are two issues, looking at the website:
You want to find all matching elements, not just one, so you need to use find_elements, not find_element.
The anchors don't actually have any text in them, so .text won't return anything.
Assuming you want to scrape the URLs of all these links, you can use .get_attribute('href') instead of .text, like so:
url_list = driver.find_elements(By.XPATH, './/a[contains(@href, "scheda")]')
for i in url_list:
    print(i.get_attribute('href'))
This finds all web elements that match your criteria and stores them in a list. I just used print as an example, but you may want to do more than print the links.
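As a rough sketch of the "do more than print" step: the helper below (the name collect_hrefs is made up for illustration) takes whatever find_elements returned, reads each href, and skips anchors without one as well as repeats. FakeAnchor is only a hypothetical stand-in so the snippet runs without a browser; with Selenium you would pass the list from driver.find_elements(By.XPATH, ...) instead.

```python
def collect_hrefs(elements):
    """Return the unique, non-empty href values of the given elements, in order."""
    seen = set()
    hrefs = []
    for element in elements:
        href = element.get_attribute("href")  # None when the attribute is absent
        if href and href not in seen:
            seen.add(href)
            hrefs.append(href)
    return hrefs


class FakeAnchor:
    """Hypothetical stand-in for a Selenium WebElement (illustration only)."""

    def __init__(self, href):
        self._href = href

    def get_attribute(self, name):
        return self._href if name == "href" else None


anchors = [
    FakeAnchor("https://example.com/scheda-1.html"),
    FakeAnchor(None),
    FakeAnchor("https://example.com/scheda-1.html"),
    FakeAnchor("https://example.com/scheda-2.html"),
]
print(collect_hrefs(anchors))
```

The example.com URLs are placeholders, not links from the site in the question.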

Related

Selenium - to make find_elements. readable

The basic concept as I understand it:
find_element finds a single element. We can use .text or get_attribute('href') to read that element. Since find_elements returns a list, we can't call .text or get_attribute('href') on it directly; doing so raises an error saying the list has no such attribute.
To read information from the result of find_elements, we can use a for loop:
vegetables_search = driver.find_elements(By.CLASS_NAME, "product-brief-wrapper")
for i in vegetables_search:
    print(i.text)
Here is my problem: when I use find_element, every row shows the same result. I searched online, and the answer was that find_element returns only a single result. Here is the line I hoped would grab different URLs:
links.append(driver.find_element(By.XPATH, ".//a[@rel='noopener']").get_attribute('href'))
But I don't know how to combine the results in pandas. When I run the code below, the links column contains the same URL on every row of the CSV file...
vegetables_search = driver.find_elements(By.CLASS_NAME, "product-brief-wrapper")
Product_name = []
links = []
for search in vegetables_search:
    Product_name.append(search.find_element(By.TAG_NAME, "h4").text)
    links.append(driver.find_element(By.XPATH, ".//a[@rel='noopener']").get_attribute('href'))
# use pandas to export the information
df = pd.DataFrame({'Product': Product_name, 'Link': links})
df.to_csv('name.csv', index=False)
print(df)
However, if I loop over find_elements separately, it prints different links. (Does that mean my XPath is correct?)
product_link = driver.find_elements(By.XPATH, "//a[@rel='noopener']")
for i in product_link:
    print(i.get_attribute('href'))
My questions:
Besides using a for loop, how can I read values from find_elements, the way find_element(By.attribute, 'content').text works?
How do I take my code a step further? I cannot print out different URLs.
Thanks so much. ORZ
This is the HTML inspected from the website:

This line:
links.append(driver.find_element(By.XPATH, ".//a[@rel='noopener']").get_attribute('href'))
should be changed to:
links.append(search.find_element(By.XPATH, ".//a[@rel='noopener']").get_attribute('href'))
driver.find_element(By.XPATH, ".//a[@rel='noopener']").get_attribute('href') always returns the first element in the DOM matching the .//a[@rel='noopener'] XPath locator, while you want to find the match inside another element.
To do so, replace the WebDriver driver object with the WebElement search object you want to search inside, as shown above.
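A minimal sketch of that scoped lookup, collecting one row per product card. The function name rows_from_cards and the Fake* classes are hypothetical; the fakes only mimic the two WebElement calls used here so the snippet runs without a browser. The locator strategies are written as the plain strings "tag name" and "xpath", which are the values behind By.TAG_NAME and By.XPATH.

```python
def rows_from_cards(cards):
    """Build one row per card, searching inside the card rather than the whole page."""
    rows = []
    for card in cards:
        # Both lookups are called on `card`, so each one is scoped to that card.
        name = card.find_element("tag name", "h4").text
        link = card.find_element("xpath", ".//a[@rel='noopener']").get_attribute("href")
        rows.append({"Product": name, "Link": link})
    return rows


class FakeNode:
    """Hypothetical stand-in for a child WebElement (illustration only)."""

    def __init__(self, text="", href=None):
        self.text = text
        self._href = href

    def get_attribute(self, name):
        return self._href if name == "href" else None


class FakeCard:
    """Hypothetical stand-in for one product card element (illustration only)."""

    def __init__(self, name, href):
        self._name = name
        self._href = href

    def find_element(self, by, value):
        if by == "tag name":
            return FakeNode(text=self._name)
        return FakeNode(href=self._href)


cards = [
    FakeCard("Carrot", "https://example.com/carrot"),
    FakeCard("Leek", "https://example.com/leek"),
]
rows = rows_from_cards(cards)
print(rows)
```

With real elements, the list of dictionaries can be handed to pd.DataFrame(rows) and exported with to_csv, which keeps each product name paired with its own link.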

Python/Selenium web scrap how to find hidden src value from a links?

Scraping links should be a simple feat, usually just grabbing the src value of the a tag.
I recently came across this website (https://sunteccity.com.sg/promotions) where the href value of the a tag of each item cannot be found, but the redirection still works. I'm trying to figure out a way to grab the items and their corresponding links. My typical Python Selenium code looks something like this:
all_items = bot.find_elements_by_class_name('thumb-img')
for promo in all_items:
    a = promo.find_elements_by_tag_name("a")
    print("a[0]: ", a[0].get_attribute("href"))
However, I can't seem to retrieve any href, onclick attributes, and I'm wondering if this is even possible. I noticed that I couldn't do a right-click, open link in new tab as well.
Are there any ways around getting the links of all these items?
Edit: Are there any ways to retrieve all the links of the items on the pages?
i.e.
https://sunteccity.com.sg/promotions/724
https://sunteccity.com.sg/promotions/731
https://sunteccity.com.sg/promotions/751
https://sunteccity.com.sg/promotions/752
https://sunteccity.com.sg/promotions/754
https://sunteccity.com.sg/promotions/280
...
Edit:
Adding an image of one such anchor tag for better clarity:

Reverse-engineering the JavaScript that takes you to the promotion pages (seen in https://sunteccity.com.sg/_nuxt/d4b648f.js) gives you a way to get all the links, which are based on the HappeningID. You can verify this by running the following in the JS console, which gives you the HappeningID of the first promotion:
window.__NUXT__.state.Promotion.promotions[0].HappeningID
Based on that, you can create a Python loop to get all the promotions:
items = driver.execute_script("return window.__NUXT__.state.Promotion;")
for item in items["promotions"]:
    base = "https://sunteccity.com.sg/promotions/"
    happening_id = str(item["HappeningID"])
    print(base + happening_id)
That generated the following output:
https://sunteccity.com.sg/promotions/724
https://sunteccity.com.sg/promotions/731
https://sunteccity.com.sg/promotions/751
https://sunteccity.com.sg/promotions/752
https://sunteccity.com.sg/promotions/754
https://sunteccity.com.sg/promotions/280
https://sunteccity.com.sg/promotions/764
https://sunteccity.com.sg/promotions/766
https://sunteccity.com.sg/promotions/762
https://sunteccity.com.sg/promotions/767
https://sunteccity.com.sg/promotions/732
https://sunteccity.com.sg/promotions/733
https://sunteccity.com.sg/promotions/735
https://sunteccity.com.sg/promotions/736
https://sunteccity.com.sg/promotions/737
https://sunteccity.com.sg/promotions/738
https://sunteccity.com.sg/promotions/739
https://sunteccity.com.sg/promotions/740
https://sunteccity.com.sg/promotions/741
https://sunteccity.com.sg/promotions/742
https://sunteccity.com.sg/promotions/743
https://sunteccity.com.sg/promotions/744
https://sunteccity.com.sg/promotions/745
https://sunteccity.com.sg/promotions/746
https://sunteccity.com.sg/promotions/747
https://sunteccity.com.sg/promotions/748
https://sunteccity.com.sg/promotions/749
https://sunteccity.com.sg/promotions/750
https://sunteccity.com.sg/promotions/753
https://sunteccity.com.sg/promotions/755
https://sunteccity.com.sg/promotions/756
https://sunteccity.com.sg/promotions/757
https://sunteccity.com.sg/promotions/758
https://sunteccity.com.sg/promotions/759
https://sunteccity.com.sg/promotions/760
https://sunteccity.com.sg/promotions/761
https://sunteccity.com.sg/promotions/763
https://sunteccity.com.sg/promotions/765
https://sunteccity.com.sg/promotions/730
https://sunteccity.com.sg/promotions/734
https://sunteccity.com.sg/promotions/623
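If the page state is not there (execute_script returns None when the JavaScript value is undefined) or an entry has no HappeningID, the loop above raises an error. A slightly more defensive sketch of the URL-building step follows; the function name promotion_urls and the sample dictionary are made up for illustration, and only the promotions / HappeningID shape comes from the console check above.

```python
BASE_URL = "https://sunteccity.com.sg/promotions/"


def promotion_urls(promotion_state, base_url=BASE_URL):
    """Build one URL per promotion from the dict returned by execute_script."""
    if not promotion_state:
        # Nothing was returned, e.g. the state object was not present on the page.
        return []
    urls = []
    for item in promotion_state.get("promotions", []):
        happening_id = item.get("HappeningID")
        if happening_id is not None:
            urls.append(base_url + str(happening_id))
    return urls


# Hand-written sample with the same shape as the state described above.
sample_state = {"promotions": [{"HappeningID": 724}, {"HappeningID": 731}, {}]}
print(promotion_urls(sample_state))
print(promotion_urls(None))
```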

You are using the wrong locator. It returns a lot of irrelevant elements.
Instead of find_elements_by_class_name('thumb-img'), please try find_elements_by_css_selector('.collections-page .thumb-img'), so your code will be:
all_items = bot.find_elements_by_css_selector('.collections-page .thumb-img')
for promo in all_items:
    a = promo.find_elements_by_tag_name("a")
    print("a[0]: ", a[0].get_attribute("href"))
You can also get the desired links directly with the .collections-page .thumb-img a locator, so that your code could be:
links = bot.find_elements_by_css_selector('.collections-page .thumb-img a')
for link in links:
    print(link.get_attribute("href"))

Getting href with Selenium and Python

I am trying to get the href with Selenium and Python.
This is my page:
Some of the class information changes from element to element, so I am basically trying to get every href for <a id="job____ .....
links.append(job.find_element_by_xpath('//a[@aria-live="polite"]//span').get_attribute(name="href"))
I tried a couple of things but can't figure out how. How can I get all the hrefs from the screenshot above?

Try this, but check your XPath:
"//a[@aria-live="polite"]//span"
selects a span, and I don't see any span with an href in your HTML. Maybe this XPath solves it:
//a[./span[@aria-live="polite"]]
links.append(job.find_element_by_xpath('//a[./span[@aria-live="polite"]]').get_attribute("href"))
But that won't get all the URLs. For that, use find_elements (which returns a list) and extend your URL list with a list comprehension:
links.extend([x.get_attribute("href") for x in job.find_elements_by_xpath('//a[./span[@aria-live="polite"]]')])
Edit 1, another XPath solution:
links.extend(["website_base_url"+x.get_attribute("href") for x in job.find_elements_by_xpath('//a[contains(@id, "job_")]')])
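Prefixing a base URL by hand only works when every href is a relative path. A small sketch using urljoin from the standard library, which leaves absolute URLs alone and resolves relative ones; the function name absolute_links, the base URL, and the sample paths are placeholders, not values from the page in the question.

```python
from urllib.parse import urljoin


def absolute_links(base_url, hrefs):
    """Resolve each href against base_url, skipping missing values."""
    return [urljoin(base_url, href) for href in hrefs if href]


hrefs = [
    "/rc/clk?jk=123",                      # relative path: gets the base prepended
    "https://example.com/viewjob?jk=456",  # already absolute: returned unchanged
    None,                                  # anchor without an href: skipped
]
links = absolute_links("https://example.com/jobs/", hrefs)
print(links)
```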

links = []
list_of_elements_with_href = wd.find_elements_by_xpath("//a[contains(@href,'')]")
for el_with_href in list_of_elements_with_href:
    links.append(el_with_href.get_attribute("href"))
or, if you need to be more specific:
list_of_elements_with_href = wd.find_elements_by_xpath("//a[contains(@href,'') and contains(@id,'job_')]")

Based on your description and the attached image, I think you have the wrong XPath. Try the following code:
find_links = driver.find_elements_by_xpath("//a[starts-with(@id,'job_')]")
links = []
for link in find_links:
    links.append(link.get_attribute("href"))
Please note that this uses find_elements_by_xpath (plural), not find_element_by_xpath.
I am unable to test this solution as you have not provided the website.

Python selenium xpath using contains and not contains

I am trying to get links whose title contains one word and, at the same time, does not contain another. I use the following code, but Selenium says it is not a valid XPath expression.
Please find my code here:
driver.get("http://www.csisc.cn/zbscbzw/isinbm/index_list_code.shtml")
while True:
    links = [link.get_attribute('href') for link in driver.find_elements_by_xpath("//a[(contains(@title,'公司债券')and not(contains(@title,'短期'))]")]
    for link in links:
        driver.get(link)
        # do something
Any help will be highly appreciated!

There is an extra bracket in your XPath. Use
links = [link.get_attribute('href') for link in driver.find_elements_by_xpath("//a[contains(@title,'公司债券')and not(contains(@title,'短期'))]")]
instead.
You can use Chrome developer tools to validate your XPaths first.
PS: I changed the XPath here a bit to be able to find some elements on my page.
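A stray bracket like this can also be caught before the expression ever reaches Selenium. The helper below is a rough sketch (the name brackets_balanced is made up) that only checks that ( ) and [ ] pair up; it is not an XPath validator and, as a simplifying assumption, it does not treat brackets inside quoted strings specially.

```python
def brackets_balanced(xpath):
    """Return True when every ( and [ in the expression has a matching close."""
    pairs = {")": "(", "]": "["}
    stack = []
    for ch in xpath:
        if ch in "([":
            stack.append(ch)
        elif ch in pairs:
            # A closer must match the most recently opened bracket.
            if not stack or stack.pop() != pairs[ch]:
                return False
    return not stack


broken = "//a[(contains(@title,'公司债券')and not(contains(@title,'短期'))]"
fixed = "//a[contains(@title,'公司债券') and not(contains(@title,'短期'))]"
print(brackets_balanced(broken))
print(brackets_balanced(fixed))
```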

There should be a space before and. There is also an extra leading bracket in your XPath. Try:
"//a[contains(@title,'公司债券') and not(contains(@title,'短期'))]"

Selenium Python - Store XPath in var and extract deeper hierarchy XPath from var

Sadly, I couldn't find any resources online for my problem. I'm trying to store elements found by XPath in a list and then loop over those elements to search inside each one. But instead of searching inside the given element, Selenium always seems to search the whole site again.
Does anyone have good knowledge about this? I've seen that:
// Selects nodes in the document from the current node that matches the selection no matter where they are
But I've also tried "/" and it didn't work either.
Instead of giving me the text for each div, it gives me the text from all divs.
My Code:
from selenium import webdriver
driver = webdriver.Chrome()
result_text = []
# Find all divs with a specific class and store them in a list
divs_found = driver.find_elements_by_xpath("//div[@class='a-fixed-right-grid-col a-col-left']")
# This seems to be the problem: instead of searching inside divs_found[1], it behaves like driver and searches the whole site
hrefs_matching_in_div = divs_found[1].find_elements_by_xpath("//a[contains(@href, '/gp/product/')]")
# Now collect the text of each matching link
for href in hrefs_matching_in_div:
    result_text.append(href.text)
print(result_text)

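You need to add a leading . so the XPath is evaluated relative to the current element and matches only its descendants, instead of the whole document. Try this:
hrefs_matching_in_div = divs_found[1].find_elements_by_xpath(".//a[contains(@href, '/gp/product/')]")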
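The effect of the leading dot can be tried without a browser. The sketch below uses the standard library's ElementTree as a stand-in; it supports only a subset of XPath and is not what Selenium uses, and the tiny page is made up for illustration. It shows a search started from one element staying inside that element, while a search from the document root sees every anchor.

```python
import xml.etree.ElementTree as ET

# A made-up page with two cards, each containing one product link.
page = ET.fromstring(
    "<body>"
    "<div class='card'><a href='/gp/product/1'>first</a></div>"
    "<div class='card'><a href='/gp/product/2'>second</a></div>"
    "</body>"
)

cards = page.findall(".//div[@class='card']")

# Searching from the second card with a leading "." stays inside that card.
inside_second = [a.text for a in cards[1].findall(".//a")]

# Searching from the document root sees every anchor on the page.
whole_page = [a.text for a in page.findall(".//a")]

print(inside_second)  # ['second']
print(whole_page)     # ['first', 'second']
```

In Selenium, an XPath that starts with // is evaluated from the document root even when it is called on an element, which is why the version without the dot returned the links from all divs.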
