How to click every link and extract content inside - Python Selenium - python

I wanna get content inside from all links with id = "LinkNoticia"
Actually my code join in first link and extract content, but i cant access to other.
How can i do it?
this is my code (its works for 1 link)
from selenium import webdriver
driver= webdriver.Chrome("/selenium/webdriver/chromedriver")
driver.get('http://www.emol.com/noticias/economia/todas.aspx')
driver.find_element_by_id("LinkNoticia").click()
title = driver.find_element_by_id("cuDetalle_cuTitular_tituloNoticia")
print(title.text)

First of all, the fact that page has multiple elements with the same ID is a bug on its own. The whole point of ID is to be unique for each element on the page. According to HTML specs:
id = name
This attribute assigns a name to an element. This name must be unique in a document.
A lengthy discussion is here.
Since ID is supposed to be unique, most (all?) implementations of Selenium will only have function to look for one element with given ID (e.g. find_element_by_id). I have never seen a function for finding multiple elements by ID. So you cannot use ID as your locator directly, you need to use one of the existing functions that allows location of multiple elements, and use ID as just some attribute which allows you to select a group of elements. Your choices are:
find_elements_by_xpath
find_elements_by_css_selector
For example, you could change your search like this:
links = driver.find_elements_by_xpath("//a[#id='LinkNoticia']");
That would give you the whole set of links, and you'd need to loop through them to retrieve the actual link (href). Note that if you just click on each link, you navigate away from the page and references in links will no longer be valid. So instead you can do this:
Build list of hrefs from the links:
hrefs=[]
for link in links:
hrefs.append(link.get_attribute("href"))
Navigate to eachhref to check its title:
for href in hrefs:
driver.get(href);
title = driver.find_element_by_id("cuDetalle_cuTitular_tituloNoticia")
# etc

Related

Python/Selenium web scrap how to find hidden src value from a links?

Scrapping links should be a simple feat, usually just grabbing the src value of the a tag.
I recently came across this website (https://sunteccity.com.sg/promotions) where the href value of a tags of each item cannot be found, but the redirection still works. I'm trying to figure out a way to grab the items and their corresponding links. My typical python selenium code looks something as such
all_items = bot.find_elements_by_class_name('thumb-img')
for promo in all_items:
a = promo.find_elements_by_tag_name("a")
print("a[0]: ", a[0].get_attribute("href"))
However, I can't seem to retrieve any href, onclick attributes, and I'm wondering if this is even possible. I noticed that I couldn't do a right-click, open link in new tab as well.
Are there any ways around getting the links of all these items?
Edit: Are there any ways to retrieve all the links of the items on the pages?
i.e.
https://sunteccity.com.sg/promotions/724
https://sunteccity.com.sg/promotions/731
https://sunteccity.com.sg/promotions/751
https://sunteccity.com.sg/promotions/752
https://sunteccity.com.sg/promotions/754
https://sunteccity.com.sg/promotions/280
...
Edit:
Adding an image of one such anchor tag for better clarity:
By reverse-engineering the Javascript that takes you to the promotions pages (seen in https://sunteccity.com.sg/_nuxt/d4b648f.js) that gives you a way to get all the links, which are based on the HappeningID. You can verify by running this in the JS console, which gives you the first promotion:
window.__NUXT__.state.Promotion.promotions[0].HappeningID
Based on that, you can create a Python loop to get all the promotions:
items = driver.execute_script("return window.__NUXT__.state.Promotion;")
for item in items["promotions"]:
base = "https://sunteccity.com.sg/promotions/"
happening_id = str(item["HappeningID"])
print(base + happening_id)
That generated the following output:
https://sunteccity.com.sg/promotions/724
https://sunteccity.com.sg/promotions/731
https://sunteccity.com.sg/promotions/751
https://sunteccity.com.sg/promotions/752
https://sunteccity.com.sg/promotions/754
https://sunteccity.com.sg/promotions/280
https://sunteccity.com.sg/promotions/764
https://sunteccity.com.sg/promotions/766
https://sunteccity.com.sg/promotions/762
https://sunteccity.com.sg/promotions/767
https://sunteccity.com.sg/promotions/732
https://sunteccity.com.sg/promotions/733
https://sunteccity.com.sg/promotions/735
https://sunteccity.com.sg/promotions/736
https://sunteccity.com.sg/promotions/737
https://sunteccity.com.sg/promotions/738
https://sunteccity.com.sg/promotions/739
https://sunteccity.com.sg/promotions/740
https://sunteccity.com.sg/promotions/741
https://sunteccity.com.sg/promotions/742
https://sunteccity.com.sg/promotions/743
https://sunteccity.com.sg/promotions/744
https://sunteccity.com.sg/promotions/745
https://sunteccity.com.sg/promotions/746
https://sunteccity.com.sg/promotions/747
https://sunteccity.com.sg/promotions/748
https://sunteccity.com.sg/promotions/749
https://sunteccity.com.sg/promotions/750
https://sunteccity.com.sg/promotions/753
https://sunteccity.com.sg/promotions/755
https://sunteccity.com.sg/promotions/756
https://sunteccity.com.sg/promotions/757
https://sunteccity.com.sg/promotions/758
https://sunteccity.com.sg/promotions/759
https://sunteccity.com.sg/promotions/760
https://sunteccity.com.sg/promotions/761
https://sunteccity.com.sg/promotions/763
https://sunteccity.com.sg/promotions/765
https://sunteccity.com.sg/promotions/730
https://sunteccity.com.sg/promotions/734
https://sunteccity.com.sg/promotions/623
You are using a wrong locator. It brings you a lot of irrelevant elements.
Instead of find_elements_by_class_name('thumb-img') please try find_elements_by_css_selector('.collections-page .thumb-img') so your code will be
all_items = bot.find_elements_by_css_selector('.collections-page .thumb-img')
for promo in all_items:
a = promo.find_elements_by_tag_name("a")
print("a[0]: ", a[0].get_attribute("href"))
You can also get the desired links directly by .collections-page .thumb-img a locator so that your code could be:
links = bot.find_elements_by_css_selector('.collections-page .thumb-img a')
for link in links:
print(link.get_attribute("href"))

xpath result from scrapy don't show the same result from a html page

I'm having some issues in crawling this website search:
https://www.simplyhired.com/search?q=data+engineer&l=United+States&pn=1&job=ZMzeXt6JW0jMuZc6H-3Af3sqOGzeQMLj7X5mnXXv9ZteeAoGm6oDdg
I'm trying to extract these elements from de SimplyHired search jobs for Data Engineer in US:
But when I try using xpath locator to any of them using selector module I'm getting different results and in different order.
Also the output for all of them isn't matching (The index corresponding to xpath job name is not the same index for ther location in xpath location for example).
Here is my code:
from scrapy import Selector
import requests
response = requests.get('https://www.simplyhired.com/search?q=data+engineer&l=united+states&mi=exact&sb=dd&pn=1&job=X1yGOt2Y8QTJm0tYqyptbgV9Pu19ge0GkVZK7Im5WbXm-zUr-QMM-A').content
sel=Selector(text=response)
#job name
sel.xpath('//main[#id="job-list"]/div/article[contains(#class,"SerpJob")]/div/div[#class="jobposting-title-container"]/h2/a/text()').extract()
#company
sel.xpath('//main[#id="job-list"]/div/article/div/h3[#class="jobposting-subtitle"]/span[#class="JobPosting-labelWithIcon jobposting-company"]/text()').extract()
#location
sel.xpath('//main[#id="job-list"]//div/article/div/h3[#class="jobposting-subtitle"]/span[#class="JobPosting-labelWithIcon jobposting-location"]/span/span/text()').extract()
#salary estimates
sel.xpath('//main[#id="job-list"]//div/article/div/div[#class="SerpJob-metaInfo"]//div[#class="SerpJob-metaInfoLeft"]/span/text()[2]').extract()
I'm not quite sure whether you're trying to use Scrapy or requests. Looks like you're wanting to use requests but with xpath selectors.
For websites like this, it's best to look at each individual job advert as a 'card'. You want to loop over each card with the XPATH selectors that you need to get the data you want.
Code Example
card = sel.xpath('//div[#class="SerpJob-jobCard card"]')
for a in card:
title = a.xpath('.//a[#class="card-link"]/text()').get()
company = a.xpath('.//span[#class="JobPosting-labelWithIcon jobposting-company"]/text()').get()
salary = a.xpath('.//span[#class="jobposting-salary"]/text()').get()
location = a.xpath('.//span[#class="jobposting-location"]/text()').get()
Explanation
You want to search each card with relative XPATH selectors. The .// searches within the chunk of HTML downstream of the card variable.
Always use get() instead of extract(). get() is used to get one value and returns a string always, here that's what we want when we're looping over each card. extract() extracts all values if there are multiple and if there's only one value for the XPATH selector it puts it into a list which is often not what you want. The ambiguity of extract() is not ideal, if you want multiple values to use getall(), this is explicit and will only give you multiple values.
Additional Information
If you're finding you're not getting the correct data in the right format, always look to see if javascript content is being added to the website. Turn off your browsers javascript to refresh the page. On this particular site, none of the data you require is loaded by javascript, this makes it much easier to scrape.

Selenium scraping

I am trying to come up with a way to scrape information on houses on Zillow and I am currently using xpath to look at data such as rent price, principal and mortgage costs, insurance costs.
I was able to find the information using xpath but I wanted to make it automatic and put it inside a for loop but I realized as I was using xpath, not all the data for each listing has the same xpath information. for some it would be off by 1 of a list or div. See code below for what I mean. How do I get it more specific? Is there a way to look up for a string like "principal and interest" and select the next value which would be the numerical value that I am looking for?
works for one listing:
driver.find_element_by_xpath("/html/body/div[1]/div[6]/div/div[1]/div[1]/div[1]/ul/li[1]/article/div[1]/div[2]/div")
a different listing would contain this:
driver.find_element_by_xpath("/html/body/div[1]/div[6]/div/div[1]/div[1]/div[2]/ul/li[1]/article/div[1]/div[2]/div")
The xpaths that you are using are specific to the elements of the first listing. To be able to access elements for each listing, you will need to use xpaths in a way that can help you access elements for each listing:
import pandas as pd
from selenium import webdriver
I searched for listing for sale in Manhattan and got the below URL
url = "https://www.zillow.com/homes/Manhattan,-New-York,-NY_rb/"
Asking selenium to open the above link in Chrome
driver = webdriver.Chrome()
driver.get(url)
I hovered my mouse on one of the house listings and clicked "inspect". This opened the HTML code and highlighted the item I am inspecting. I noticed that the elements having class "list-card-info" contain all the info of the house that we need. So, our strategy would be for each house access the element that has class "list-card-info". So, using the following code, I saved all such HTML blocks in house_cards variable
house_cards = driver.find_elements_by_class_name("list-card-info")
There are 40 elements in house_cards i.e. one for each house (each page has 40 houses listed)
I loop over each of these 40 houses and extract the information I need. Notice that I am now using xpaths which are specific to elements within the "list-card-info" element. I save this info in a pandas datagram.
address = []
price = []
bedrooms = []
baths = []
sq_ft = []
for house in house_cards:
address.append(house.find_element_by_class_name("list-card-addr").text)
price.append(house.find_element_by_class_name("list-card-price").text)
bedrooms.append(house.find_element_by_xpath('.//div[#class="list-card-heading"]/ul[#class="list-card-details"]/li[1]').text)
baths.append(house.find_element_by_xpath('.//div[#class="list-card-heading"]/ul[#class="list-card-details"]/li[2]').text)
sq_ft.append(house.find_element_by_xpath('.//div[#class="list-card-heading"]/ul[#class="list-card-details"]/li[3]').text)
driver.quit()
# print(address, price,bedrooms,baths, sq_ft)
Manahattan_listings = pd.DataFrame({"address":address,
"bedrooms": bedrooms,
"baths":baths,
"sq_ft":sq_ft,
"price":price},)
pandas dataframe output
Now, to extract info from more pages i.e. page2, page 3, etc, you can loop over website pages i.e. keep modifying your URL and keep extracting info
Happy Scraping!
selecting multiple elements using xpath is not a good idea. You can look into "css selector". Using this you can get similar elements.

Selenium Python - Store XPath in var and extract depther hirachy XPath from var

I sadly couldn't find any resources online for my problem. I'm trying to store elements found by XPath in a list and then loop over the XPath elements in a list to search in that object. But instead of searching in that given object, it seems that selenium is always again looking in the whole site.
Anyone with good knowledge about this? I've seen that:
// Selects nodes in the document from the current node that matches the selection no matter where they are
But I've also tried "/" and it didn't work either.
Instead of giving me the text for each div, it gives me the text from all divs.
My Code:
from selenium import webdriver
driver = webdriver.Chrome()
result_text = []
# I'm looking for all divs with a specific class and store them in a list
divs_found = driver.find_elements_by_xpath("//div[#class='a-fixed-right-grid-col a-col-left']")
# Here seems to be the problem as it seems like instead of "divs_found[1]" it behaves like "driver" an looking on the whole site
hrefs_matching_in_div = divs_found[1].find_elements_by_xpath("//a[contains(#href, '/gp/product/')]")
# Now I'm looking in the found href matches to store the text from it
for href in hrefs_matching_in_div:
result_text.append(href.text)
print(result_text)
You need to add . for immediate child.Try now.
hrefs_matching_in_div = divs_found[1].find_elements_by_xpath(".//a[contains(#href, '/gp/product/')]")

Using Selenium to select an anchor with specific content

I have a HTML element as follows:
<a class="country" href="/es-co">
Columbia
</a>
How do I select that anchor element based on the content 'Columbia'? I can't use find_element_by_class_css_selector because a.country represents half a dozen elements. How do I select that element and click it using Silenium with Python (through IE, if that has any bearing)?
As an aside, I could have any number of links with the same text and CSS selectors. How would Silenium differentiate?
There's no find_element_by_class_css_selector. But you are right, you can't use class names.
The best way is to use href="/es-co", if it's unique.
find_element_by_css_selector("a[href='/es-co']")
Otherwise you can find by text using XPath
find_element_by_xpath(".//a[contains(text(), 'Columbia')])
If you have many links with same locator, then you can index them, either by XPath directly or the list returned by Selenium.
For example, if you have ten Columbia
find_element_by_xpath(".//a[contains(text(), 'Columbia')][10]") # one-based index, one element only
find_elements_by_xpath(".//a[contains(text(), 'Columbia')]")[9] # find_elements_* gives you zero-base index list
In case of <a> with clickable text, Selenium provides API like find_with_link_text or find_with_partial_link_text (API name many be different but you got the idea).
If there are many <a> with same text/css-class, best bet to locate them is using XPath that is accepted by selenium APIs.

Categories