I'm new to Selenium with Python. I'm trying to scrape some data, but I can't figure out how to parse the output of commands like this:
driver.find_elements_by_css_selector("div.flightbox")
I was trying to google some tutorial but I've found nothing for Python.
Could you give me a hint?
find_elements_by_css_selector() returns a list of WebElement instances. Each web element has a number of methods and attributes available. For example, to get the inner text of an element, use .text:
for element in driver.find_elements_by_css_selector("div.flightbox"):
    print(element.text)
You can also make a context-specific search to find other elements inside the current element. Since I know what site you are working with, here is example code to get the departure and arrival times for the first-leg flight in a result box:
for result in driver.find_elements_by_css_selector("div.flightbox"):
    departure_time = result.find_element_by_css_selector("div.departure p.p05 strong").text
    arrival_time = result.find_element_by_css_selector("div.arrival p.p05 strong").text
    print([departure_time, arrival_time])
Make sure you study the Getting Started, Navigating and Locating Elements documentation pages.
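Besides .text, other data on each element can be read with get_attribute. A minimal sketch (the helper name and the idea of collecting hrefs are just an illustration, not something from the site in question):

```python
# Collect the link target of every matched element.
# Works on any list of Selenium WebElements, e.g. the result of
# driver.find_elements_by_css_selector("div.flightbox a").
def collect_hrefs(elements):
    return [el.get_attribute("href") for el in elements]
```

Called with the output of any find_elements_* method, this turns the list of elements into a plain list of strings.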
Basic concepts I know:
find_element finds a single element. We can use .text or get_attribute('href') to make the element readable. Since find_elements returns a list, we can't use .text or get_attribute('href') on it directly; otherwise it raises an attribute error.
To make the information from find_elements readable, we can use a for loop:
vegetables_search = driver.find_elements(By.CLASS_NAME, "product-brief-wrapper")
for i in vegetables_search:
    print(i.text)
Here is my problem: when I use find_element, it shows the same result every time. I searched for the problem online, and the answers said it's because find_element only ever returns a single result. Here is my code, which is meant to grab different URLs:
links.append(driver.find_element(By.XPATH, ".//a[@rel='noopener']").get_attribute('href'))
But I don't know how to combine the results into pandas. If I run this code, the links column prints the same URL on every row of the CSV file...
vegetables_search = driver.find_elements(By.CLASS_NAME, "product-brief-wrapper")
Product_name = []
links = []
for search in vegetables_search:
    Product_name.append(search.find_element(By.TAG_NAME, "h4").text)
    links.append(driver.find_element(By.XPATH, ".//a[@rel='noopener']").get_attribute('href'))
# use pandas to export the information
df = pd.DataFrame({'Product': Product_name, 'Link': links})
df.to_csv('name.csv', index=False)
print(df)
To be sure, if I use the loop on its own, it does show different links. (That means my XPath is correct, right?)
product_link = driver.find_elements(By.XPATH, "//a[@rel='noopener']")
for i in product_link:
    print(i.get_attribute('href'))
My questions:
Besides using a for loop, how can I make the results of find_elements readable, just like find_element(By.attribute, 'content').text?
How do I take my code a step further? I cannot print out the different URLs.
Thanks so much.
This is the HTML code as inspected from the website:
This line:
links.append(driver.find_element(By.XPATH, ".//a[@rel='noopener']").get_attribute('href'))
should be changed to be
links.append(search.find_element(By.XPATH, ".//a[@rel='noopener']").get_attribute('href'))
driver.find_element(By.XPATH, ".//a[@rel='noopener']").get_attribute('href') always searches the whole DOM and returns the first element matching the .//a[@rel='noopener'] XPath locator, while you want to find the match inside another element.
To do that, replace the WebDriver driver object with the WebElement search object you want to search inside, as shown above.
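A sketch of the whole corrected loop, with the per-card extraction pulled into a helper. The locators come from the question; note that By.TAG_NAME and By.XPATH are literally the strings "tag name" and "xpath", so the helper passes those strings directly:

```python
# Extract (product name, link) pairs, searching *inside* each card.
# Each card is a WebElement as returned by
# driver.find_elements(By.CLASS_NAME, "product-brief-wrapper").
def extract_products(cards):
    rows = []
    for card in cards:
        name = card.find_element("tag name", "h4").text
        # the leading "." keeps the XPath search scoped to this card
        link = card.find_element("xpath", ".//a[@rel='noopener']").get_attribute("href")
        rows.append((name, link))
    return rows
```

The resulting pairs can then go straight into pd.DataFrame(rows, columns=['Product', 'Link']).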
My goal is to be able to scrape definitions of words in python.
To begin with, I am trying to get just the first definition of the word "assist" which should be "to help". I am using dictionary.cambridge.org
# web driver goes to page
driver.get("https://dictionary.cambridge.org/dictionary/english/assist")
# give the page time to load
time.sleep(4)
# click "accept cookies"
driver.find_element_by_xpath("/html[@class='i-amphtml-singledoc i-amphtml-standalone']/body[@class='break default_layout amp-mode-mouse']/div[@id='onetrust-consent-sdk']/div[@id='onetrust-banner-sdk']/div[@class='ot-sdk-container']/div[@class='ot-sdk-row']/div[@id='onetrust-button-group-parent']/div[@id='onetrust-button-group']/div[@class='banner-actions-container']/button[@id='onetrust-accept-btn-handler']").click()
Up to this point, everything is working correctly. However, when I try to print the first definition using find_element_by_xpath, I get a NoSuchElementException. I'm pretty familiar with Selenium and have scraped web pages hundreds of times before, but on this page I don't know what I'm doing wrong. Here's the code I am using:
print(driver.find_element_by_xpath("/html[@class='i-amphtml-singledoc i-amphtml-standalone']/body[@class='break default_layout amp-mode-mouse']/div[@class='cc fon']/div[@class='pr cc_pgwn']/div[@class='x lpl-10 lpr-10 lpt-10 lpb-25 lmax lp-m_l-20 lp-m_r-20']/div[@class='hfr-m ltab lp-m_l-15']/article[@id='page-content']/div[@class='page']/div[@class='pr dictionary'][1]/div[@class='link']/div[@class='pr di superentry']/div[@class='di-body']/div[@class='entry']/div[@class='entry-body']/div[@class='pr entry-body__el'][1]/div[@class='pos-body']/div[@class='pr dsense dsense-noh']/div[@class='sense-body dsense_b']/div[@class='def-block ddef_block ']/div[@class='ddef_h']/div[@class='def ddef_d db']").text)
To print the scraped definitions of words, you can use either of the following locator strategies:
Using xpath and text attribute:
print(driver.find_element_by_xpath("//span[contains(@class, 'epp-xref dxref')]//following::div[1]").text)
Using xpath and innerText:
print(driver.find_element_by_xpath("//span[contains(@class, 'epp-xref dxref')]//following::div[1]").get_attribute("innerText"))
Console Output:
to help:
Instead of an absolute XPath, opt for relative XPaths. You can refer to this link.
I tried with the code below and it retrieved the data:
driver.get("https://dictionary.cambridge.org/dictionary/english/assist")
print(driver.find_element_by_xpath("(//div[@class='ddef_h'])[1]/div").get_attribute("innerText"))
to help:
I'm having some issues crawling this website's search:
https://www.simplyhired.com/search?q=data+engineer&l=United+States&pn=1&job=ZMzeXt6JW0jMuZc6H-3Af3sqOGzeQMLj7X5mnXXv9ZteeAoGm6oDdg
I'm trying to extract these elements from the SimplyHired search results for Data Engineer jobs in the US:
But when I try using an XPath locator for any of them with the Selector module, I get different results in a different order.
The outputs also don't line up with each other (the index of a job name in the name XPath results is not the same index as its location in the location XPath results, for example).
Here is my code:
from scrapy import Selector
import requests
response = requests.get('https://www.simplyhired.com/search?q=data+engineer&l=united+states&mi=exact&sb=dd&pn=1&job=X1yGOt2Y8QTJm0tYqyptbgV9Pu19ge0GkVZK7Im5WbXm-zUr-QMM-A').content
sel=Selector(text=response)
# job name
sel.xpath('//main[@id="job-list"]/div/article[contains(@class,"SerpJob")]/div/div[@class="jobposting-title-container"]/h2/a/text()').extract()
# company
sel.xpath('//main[@id="job-list"]/div/article/div/h3[@class="jobposting-subtitle"]/span[@class="JobPosting-labelWithIcon jobposting-company"]/text()').extract()
# location
sel.xpath('//main[@id="job-list"]//div/article/div/h3[@class="jobposting-subtitle"]/span[@class="JobPosting-labelWithIcon jobposting-location"]/span/span/text()').extract()
# salary estimates
sel.xpath('//main[@id="job-list"]//div/article/div/div[@class="SerpJob-metaInfo"]//div[@class="SerpJob-metaInfoLeft"]/span/text()[2]').extract()
I'm not quite sure whether you're trying to use Scrapy or requests. It looks like you want requests but with XPath selectors.
For websites like this, it's best to treat each individual job advert as a 'card'. You then loop over each card with the XPath selectors you need to get the data you want.
Code Example
card = sel.xpath('//div[@class="SerpJob-jobCard card"]')
for a in card:
    title = a.xpath('.//a[@class="card-link"]/text()').get()
    company = a.xpath('.//span[@class="JobPosting-labelWithIcon jobposting-company"]/text()').get()
    salary = a.xpath('.//span[@class="jobposting-salary"]/text()').get()
    location = a.xpath('.//span[@class="jobposting-location"]/text()').get()
Explanation
You want to search each card with relative XPath selectors. The leading .// restricts the search to the chunk of HTML inside the card variable.
Prefer get() over extract(). get() returns a single value, always as a string (or None if nothing matches), which is exactly what we want when looping over each card. extract() returns all matching values, and even when there is only one match it wraps it in a list, which is often not what you want. That ambiguity is not ideal; if you do want multiple values, use getall(), which is explicit and always returns a list.
Additional Information
If you find you're not getting the correct data in the right format, always check whether JavaScript is adding content to the page: turn off your browser's JavaScript and refresh the page. On this particular site, none of the data you need is loaded by JavaScript, which makes it much easier to scrape.
I'm trying to extract some odds from a page using Selenium ChromeDriver, since the data is dynamic. The "find elements by XPath expression" approach usually works for me on these kinds of websites, but this time it can't seem to find the element in question, nor any element belonging to the section of the page that shows the relevant odds.
I'm probably making a simple error - if anyone has time to check the page out I'd be very grateful! Sample page: Nordic Bet NHL Odds
driver.get("https://www.nordicbet.com/en/odds#?cat=&reg=&sc=50&bgi=36")
time.sleep(5)
dayElems = driver.find_elements_by_xpath("//div[#class='ng-scope']")
print(len(dayElems))
Output:
0
It was a problem I used to face...
It is in another frame whose id is SportsbookIFrame. You need to switch into the frame:
driver.switch_to.frame("SportsbookIFrame")
dayElems = driver.find_elements_by_xpath("//div[#class='ng-scope']")
len(dayElems)
Output:
26
To locate iframes themselves, note that they are ordinary elements:
iframes = driver.find_elements_by_xpath("//iframe")
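The switch-in, search, switch-back pattern can be wrapped in a small helper (a sketch; the helper name is made up, and it uses the current Selenium signature find_elements("xpath", ...), where "xpath" is the literal value of By.XPATH):

```python
# Count elements matching an XPath inside an iframe, then switch back
# so later lookups run against the top-level document again.
def count_in_frame(driver, frame_id, xpath):
    driver.switch_to.frame(frame_id)        # enter the iframe
    try:
        return len(driver.find_elements("xpath", xpath))
    finally:
        driver.switch_to.default_content()  # always leave the frame
```

For example, count_in_frame(driver, "SportsbookIFrame", "//div[@class='ng-scope']") would return the 26 matches from above; forgetting switch_to.default_content is a common source of later NoSuchElementExceptions.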
I am using the following code using Python 3.6 and selenium:
element = driver.find_element_by_class_name("first_result_price")
print(element)
On the website it looks like this:
<span class="first_result_price">712
However, if I print element I get completely different output.
Any suggestions?
many thanks!!
"element" is an object of the WebElement type that Selenium provides. If you want the text inside that element, you have to use
element.text
which should return what you're looking for, '712', albeit in string form.