I need to scrape some pages. The exact structure of the part that I want is as follows:
<div class="someclasses">
<h3>...</h3> # Not needed
<ul class="ul-class1 ul-class2">
<li id="li1-id" class="li-class1 li-class2">
<div id ="div1-id" class="div-class1 div-class2 ... div-class6">
<div class="div2-class">
<div class="div3-class">...</div> #Not needed
<div class="div4-class1 div4-class2 div4-class3">
<a href="href1" data-control-id="id1" data-control-name="name" id ="a1-id" class="a-class1 a-class2">
<h3 class="h3-class1 h3-class2 h3-class3">Text1</h3>
</a></div>
<div>...</div> # Not needed
</div>
</li>
<li id="li2-id" class="li-class1 li-class2">
<div id ="div2-id" class="div-class1 div-class2 ... div-class6">
<div class="div2-class">
<div class="div3-class">...</div> #Not needed
<div class="div4-class1 div4-class2 div4-class3">
<a href="href2" data-control-id="id2" data-control-name="name" id ="a2-id" class="a-class1 a-class2">
<h3 class="h3-class1 h3-class2 h3-class3">Text2</h3>
</a></div>
<div>...</div> # Not needed
</div>
</li>
# More <li> elements
</ul>
</div>
Now what I want is to get the texts as well as the hrefs. The names in the example above are realistic, i.e. names that repeat in the example also repeat on the real webpage. The code that I am currently using is:
elems = driver.find_elements_by_xpath("//div[@class='someclasses']/ul[@class='ul-class1']/li[@class='li-class1']")
print(len(elems))
for elem in elems:
    elem1 = driver.find_element_by_xpath("./a[@data-control-name='name']")
    names2.append(elem1.text)
    print(elem1.text)
    hrefs.append(elem.get_attribute("href"))
The first print statement above outputs 0, so the elements are not being found. Can anyone please tell me what I am doing wrong?
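You are using only part of the class name. In XPath, a test such as [@class='ul-class1'] compares against the whole attribute value, so it does not match class="ul-class1 ul-class2"; you need the full class value.
FYI: with CSS you can use just one of the class names.
If you want to use XPath, try: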
elems = driver.find_elements_by_xpath("//div[@class='someclasses']//li//a")
print(len(elems))
for elem in elems:
    names2.append(elem.text)
    print(elem.text)
    new_href = elem.get_attribute("href")
    print(new_href)
    hrefs.append(new_href)
For CSS use div.someclasses ul.ul-class1, which matches the ul even though it carries a second class:
elems = driver.find_elements_by_css_selector("div.someclasses ul.ul-class1 li a")
for elem in elems:
    names2.append(elem.text)
    print(elem.text)
    new_href = elem.get_attribute("href")
    print(new_href)
    hrefs.append(new_href)
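If XPath is preferred but only one of several classes is known, a common XPath 1.0 idiom is to pad the class attribute with spaces and test for the padded token. The following is a minimal sketch; the helper name class_token is made up for illustration and is not part of Selenium. The resulting string would be passed to the same find_elements_by_xpath call used above.

```python
def class_token(name):
    """Return an XPath 1.0 predicate that matches one class among several.

    Hypothetical helper, not part of Selenium. Padding both sides with
    spaces keeps 'li-class1' from matching inside 'li-class10'.
    """
    return "contains(concat(' ', normalize-space(@class), ' '), ' {} ')".format(name)


# Build a locator for the links in the question's markup.
xpath = "//div[{}]//ul[{}]/li[{}]//a[@data-control-name='name']".format(
    class_token("someclasses"),
    class_token("ul-class1"),
    class_token("li-class1"),
)
print(xpath)
```

A plain contains(@class, 'li-class1') is shorter but would also match a class such as li-class10, which is why the padded form is usually preferred.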
I have a table of search results in a Selenium browser, and each search result is defined in HTML like this:
<div class="item
itemWrapper
ItemPosition1
ItemMonitor
" data-position="1" data-it-name="NAME OF THE ITEM" data-it-category="Category" role="article">
<div class="item-image">
<a href="/some/link/" target="_blank" rel="noopener" class="itemRec">
<img src="https://some.jpg" alt="some name" class="img-responsive">
</a>
</div>
<h2 class="small-text item-title">
Link Text
</h2>
<div class="item-bottom">
<div class="pull-left item-price">
<span>999</span>
</div>
<div class="pull-right detail-link">
<a href="/link/to/detail" title="link title" class="detail">
Detail
</a>
</div>
</div>
</div>
I am able to find all the web elements by the class name item:
elements = driver.find_elements_by_class_name("item")
I need to iterate over the elements and get their position, name and price so that I can click on one of them:
for e in elements:
    position=e.get_attribute("data-position").value,
    name=e.get_attribute("data-it-name").value,
    price=e.find_element(By.CLASS_NAME,'item-price').value
But this does not work: get_attribute returns None and find_element does not find any child element.
Can you please advise me how to get the "data-" attributes and the child elements' values correctly?
Whole code using Webbot:
import webbot
from selenium.webdriver.common.by import By
web = webbot.Browser()
web.go_to('www.***.cz')
web.type('bed', classname='header-search-form')
web.press(web.Key.ENTER)
elements = web.find_elements(classname="product-item")
for e in elements:
    name = e.get_attribute("data-it-name").value
    price = e.find_element(By.CLASS_NAME, 'item-price').value
    print(name,price)
    break
The classname argument acts weirdly in webbot. The first element it returns is definitely not a product item:
In [56]: elements[0].get_attribute('outerHTML')
Out[56]: '\n\n\t\t\t\t\t\t<img src="https://s.favi.cz/static/frontend/_global/images/favi-logo/favi-logo.60d511aff13247dd52f15acf6bdf2af9.svg" role="banner">\n\n\t\t\t\t\t'
Works well with a CSS selector:
In [58]: elements = web.find_elements(css_selector=".product-item")
In [59]: elements[0].get_attribute('outerHTML')
Out[59]: '<div class="\n\t\t\tproduct-item\n\t\t\titemWrapper\n\t\t\tproductItemPosition1\n\t\t\tproductItemMonitor\n\t\t\tproductItemWrapper\n\t\t\tsendProductTransactionWrapper\n\t\t\t\t\t" data-position="1" data-pr-name="Moderní box spring postel Alvares 160x200, bílá" data-tr-id="04d62b60-9d00-4d1b-b03c-2258c50bfdb9" data-pr-category="Postele" data-tr-ob-id="2144583" data-m-ob-id="2345478" role="article">\n\n\t\t<div class="product-image">\n\n\t\t\t\n\t\t\t\t\t\t\t\t\t<img src="https://s.favi.cz/static/images/t/product/300/6f/92/6f922779-bc84-483e-b1cd-ad8522ef0c92.jpg" alt="Moderní box spring postel Alvares 160x200, bílá" class="img-responsive">\n\t\t\t\t\t\t\t\n\n\t\t\t\n\t\t\t\t\t\t\t\t\t<span class="count">485</span>\n\t\t\t\t\t\t\t\n\n\t\t\t\n\t\t\t\n\t\t</div>\n\n\t\t<div class="product-labels stickers-holder">\n\n\t\t\t\t\t\t\t<span class="sticker storage white">\n\t\t\t\t\t<span class="text">Skladem</span>\n\t\t\t\t</span>\n\t\t\t\n\t\t\t\n\t\t\t\n\t\t\t\n\t\t</div>\n\n\t\t<h2 class="small-text product-item-title">\n\t\t\tModerní box spring postel Alvares 160x200, bílá\n\t\t</h2>\n\n\t\t<div class="product-bottom">\n\n\t\t\t<div class="pull-left product-item-price">\n\t\t\t\t<span>15 599 Kč</span>\n\t\t\t\t\t\t\t</div>\n\n\t\t\t<div class="pull-right product-shop-link">\n\t\t\t\t\n\t\t\t\t\tDetail\n\t\t\t\t\n\n\t\t\t\t\n\t\t\t\t\t<strong>Do obchodu</strong>\n\t\t\t\t\n\t\t\t</div>\n\n\t\t</div>\n\n\t\t\n\t</div>'
In [60]: elements[0].get_attribute('data-position')
Out[60]: '1'
In [61]: elements[0].get_attribute('data-pr-name')
Out[61]: 'Moderní box spring postel Alvares 160x200, bílá'
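Note that get_attribute returns a plain string (or None when the attribute is missing), so there is no .value to read, and the visible text of a child element comes from its .text property. A minimal sketch of the loop under those assumptions follows. The attribute name data-pr-name and the class product-item-price are taken from the outerHTML above; the function name read_products is made up for illustration.

```python
CSS = "css selector"  # same string as selenium's By.CSS_SELECTOR


def read_products(elements):
    """Collect position, name and price from product-item elements."""
    rows = []
    for e in elements:
        price_box = e.find_element(CSS, ".product-item-price")
        rows.append({
            # get_attribute() returns a str (or None), so there is no .value
            "position": e.get_attribute("data-position"),
            "name": e.get_attribute("data-pr-name"),
            # visible text of a child element comes from .text
            "price": price_box.text.strip(),
        })
    return rows
```

It would be called as read_products(web.find_elements(css_selector=".product-item")), reusing the lookup that worked in the session above.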
Question:
Can I group found elements by the div class they're in and store them as lists within a list?
Is that possible?
Edit: I did some further testing. It seems that even if you store one div in a variable and then search within that stored div, the search covers the whole page content.
from selenium import webdriver
driver = webdriver.Chrome()
result_text = []
# Let's say this is the class of the different divs that I want to group by:
# class='a-fixed-right-grid a-spacing-top-medium'
# These are the texts from all the divs around the page that I'm looking for, but I can't tell which one belongs to which div
elements = driver.find_elements_by_xpath("//a[contains(@href, '/gp/product/')]")
for element in elements:
    result_text.append(element.text)
print(result_text)
Current result:
I'm already getting all the information I'm looking for from the different divs around the page, but I want it to be grouped by the topmost div.
['Text11', 'Text12', 'Text2', 'Text31', 'Text32']
Result I want to achieve:
The text is grouped by the div with @class='a-fixed-right-grid a-spacing-top-medium'.
[['Text11', 'Text12'], ['Text2'], ['Text31', 'Text32']]
HTML (looks something like this):
class="a-text-center a-fixed-left-grid-col a-col-left" is the first div that wraps a group; from there on up, any enclosing div could be used for grouping. At least I think so.
</div>
</div>
</div>
</div>
<div class="a-fixed-right-grid a-spacing-top-medium"><div class="a-fixed-right-grid-inner a-grid-vertical-align a-grid-top">
<div class="a-fixed-right-grid-col a-col-left" style="padding-right:3.2%;float:left;">
<div class="a-row">
<div class="a-fixed-left-grid a-spacing-base"><div class="a-fixed-left-grid-inner" style="padding-left:100px">
<div class="a-text-center a-fixed-left-grid-col a-col-left" style="width:100px;margin-left:-100px;float:left;">
<div class="item-view-left-col-inner">
<a class="a-link-normal" href="/gp/product/B07YCW79/ref=ppx_yo_dt_b_asin_image_o0_s00?ie=UTF8&psc=1">
<img alt="" src="https://images-eu.ssl-images-amazon.com/images/I/41rcskoL._SY90_.jpg" aria-hidden="true" onload="if (typeof uet == 'function') { uet('cf'); uet('af'); }" class="yo-critical-feature" height="90" width="90" title="Same as the text I'm looking for" data-a-hires="https://images-eu.ssl-images-amazon.com/images/I/41rsxooL._SY180_.jpg">
</a>
</div>
</div>
<div class="a-fixed-left-grid-col a-col-right" style="padding-left:1.5%;float:left;">
<div class="a-row">
<a class="a-link-normal" href="/gp/product/B07YCR79/ref=ppx_yo_dt_b_asin_title_o00_s0?ie=UTF8&psc=1">
Text I'm looking for
</a>
</div>
<div class="a-row">
I don't have the link to test it on but this might work for you:
from selenium import webdriver
driver = webdriver.Chrome()
result_text = [[a.text for a in div.find_elements_by_xpath(".//a[contains(@href, '/gp/product/')]")]
               for div in driver.find_elements_by_class_name('a-fixed-right-grid')]
print(result_text)
EDIT: added alternative function:
# if that doesn't work try:
def get_results(selenium_driver, div_class, a_xpath):
    div_list = []
    for div in selenium_driver.find_elements_by_class_name(div_class):
        a_list = []
        # a_xpath should start with "." so the search stays inside this div
        for a in div.find_elements_by_xpath(a_xpath):
            a_list.append(a.text)
        div_list.append(a_list)
    return div_list

get_results(driver,
            div_class='a-fixed-right-grid',
            a_xpath=".//a[contains(@href, '/gp/product/')]")
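If that doesn't work, check two things. An XPath that starts with // is evaluated against the whole document even when it is called on a div, so it returns EVERY matching element every time; it needs a leading dot (.//a[...]) to stay inside the div. Alternatively, another element farther up the document may have the same class name.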
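The grouping can also be sketched with the generic find_elements(by, value) call, which takes the locator strategy as a string. This is a hedged sketch, not tested against the real page: the function name group_link_texts is made up, and filtering out empty strings assumes that the image links (which have no visible text) should be dropped.

```python
XPATH = "xpath"       # same string as selenium's By.XPATH
CSS = "css selector"  # same string as selenium's By.CSS_SELECTOR


def group_link_texts(driver):
    """Return one list of non-empty link texts per top-level order div."""
    groups = []
    for div in driver.find_elements(CSS, "div.a-fixed-right-grid.a-spacing-top-medium"):
        # The leading "." anchors the search at this div, not the document root.
        links = div.find_elements(XPATH, ".//a[contains(@href, '/gp/product/')]")
        # Image links have no visible text, so skip empty strings.
        groups.append([a.text.strip() for a in links if a.text.strip()])
    return groups
```

Selecting on both classes in the CSS selector narrows the outer loop to the divs named in the question, which reduces the chance of picking up an unrelated element that only shares a-fixed-right-grid.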
I'm using Python and Selenium to scrape a website. I used find_element to get most of the values that I need, but I've run into something more challenging. The website's HTML uses exactly the same structure for two different values, and I cannot use a simple find_element_by_class_name because they have the same classes and ids. I don't want to use XPath or a CSS selector because I am iterating over many "flight-row" divs and it would make things more hardcoded.
<div class="flight-row">
<div class="row row-eq-heights">
<div class="col-xs-4 col-md-4 no-padding"><span class="airline-name">gol</span><span class="flight-number">AM-477</span></div>
<div class="col-xs-4 col-md-4">
<div class="flight-timming"><span class="flight-time">06:15</span><span class="flight-destination">IAH</span></div><span class="flight-data">01/10/19</span></div>
<div class="col-xs-4 col-md-4 no-padding">
<div class="duration"><span class="flight-duration">21:25</span><span class="flight-stops" aria-label="Paradas do voo">2 paradas</span></div>
</div>
<div class="col-xs-4 col-md-4">
<div class="flight-timming"><span class="flight-destination">GIG</span><span class="flight-time">05:40</span></div><span class="flight-data">02/10/19</span></div>
</div>
</div>
I want to get the values of flight-time, flight-destination and flight-data from both "col-xs-4 col-md-4" divs.
Here is part of my code:
outbound_flights = driver.find_elements_by_css_selector("div[class^='flight-item ']")
for outbound_flight in outbound_flights:
    airline = outbound_flight.find_element_by_css_selector("span[class='airline-name']")
Thank you!
Try the following css selector to get flight-time, flight-destination and flight-data
outbound_flights = driver.find_elements_by_css_selector("div.col-xs-4.col-md-4:not(.no-padding)")
for outbound_flight in outbound_flights:
    flight_time = outbound_flight.find_element_by_css_selector("div.flight-timming span.flight-time").text
    print(flight_time)
    flight_destination = outbound_flight.find_element_by_css_selector("div.flight-timming span.flight-destination").text
    print(flight_destination)
    flight_data = outbound_flight.find_element_by_css_selector("span.flight-data").text
    print(flight_data)
Output on console:
06:15
IAH
01/10/19
05:40
GIG
02/10/19
EDITED Answer:
outbound_flights = driver.find_elements_by_css_selector("div.col-xs-4.col-md-4:not(.no-padding)")
flighttime=[]
for outbound_flight in outbound_flights:
    flight_time = outbound_flight.find_element_by_css_selector("div.flight-timming span.flight-time").text
    print(flight_time)
    flighttime.append(flight_time)
    flight_destination = outbound_flight.find_element_by_css_selector("div.flight-timming span.flight-destination").text
    print(flight_destination)
    flight_data = outbound_flight.find_element_by_css_selector("span.flight-data").text
    print(flight_data)
departure_time=flighttime[0]
arrival_time=flighttime[1]
print("Departure time :" + departure_time)
print("Arrival time :" + arrival_time)
You can also get the values by index (XPath positions start at 1):
(//*[@class='flight-time'])[1] and (//*[@class='flight-time'])[2]
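Because the question iterates over many flight-row divs, it may help to scope every lookup to one row, so that "first" and "second" always refer to the departure and arrival of that row only. A minimal sketch under that assumption follows; the function name read_flight_row and the dictionary keys are made up for illustration, and it assumes each row has exactly two columns without the no-padding class, as in the HTML shown.

```python
CSS = "css selector"  # same string as selenium's By.CSS_SELECTOR


def read_flight_row(row):
    """Return (departure, arrival) dicts for one div.flight-row element."""
    legs = []
    # The two columns without .no-padding hold departure and arrival, in that order.
    for col in row.find_elements(CSS, "div.col-xs-4.col-md-4:not(.no-padding)"):
        legs.append({
            "time": col.find_element(CSS, "span.flight-time").text,
            "airport": col.find_element(CSS, "span.flight-destination").text,
            "date": col.find_element(CSS, "span.flight-data").text,
        })
    departure, arrival = legs  # raises ValueError unless exactly two columns matched
    return departure, arrival
```

It would be used as: for row in driver.find_elements(CSS, "div.flight-row"): departure, arrival = read_flight_row(row).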
I've used a selector in my Python code to get Soccer: Next To Play out of some HTML elements. It works fine when I use a for loop and .extract() the unwanted portion. However, is there a better way to get that text out of the elements than what I have done below, or at least a way to do the same with a one-liner expression?
from bs4 import BeautifulSoup
content='''
<div class="page-title-new">
<h1>
Soccer: Next To Play
<span aria-hidden="true" class="race-large ng-hide" ng-show="vm.hasRaceNumber()">
RACE
</span>
<span aria-hidden="true" class="race-small ng-hide" ng-show="vm.hasRaceNumber()">
R
</span>
<span aria-hidden="true" class="ng-hide" ng-show="vm.hasRaceNumber()">
</span>
</h1>
<div aria-hidden="true" class="page-info-new ng-hide" ng-show="vm.hasEventDetailItems()">
<!-- -->
</div>
</div>
'''
soup = BeautifulSoup(content,"lxml")
for item in soup.select(".page-title-new h1"):
    for elem in item.select("span"):
        elem.extract()
    print(item.text.strip())
# items = [item.text for item in soup.select(".page-title-new h1")] #what to do to finish it as a one-liner
# print(items)
What I get with the loop (this is what I want to get without a loop, or with a one-liner):
Soccer: Next To Play
What I get without the loop:
Soccer: Next To Play RACE R
With the soup.select_one() method (which returns only the first tag that matches the CSS selector), take the first child of the h1, which is the text node in front of the spans:
...
soup = BeautifulSoup(content,"lxml")
result = soup.select_one(".page-title-new > h1").contents[0].strip()
print(result)
The output:
Soccer: Next To Play
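To finish the list comprehension from the question for any number of matching h1 tags, one option is to ask each h1 for its first direct text child. This is a sketch on a shortened copy of the markup, using the built-in html.parser instead of lxml so that it needs no extra parser. It assumes every matched h1 starts with a text node; find would return None otherwise.

```python
from bs4 import BeautifulSoup

html = """
<div class="page-title-new">
  <h1>
    Soccer: Next To Play
    <span class="race-large ng-hide">RACE</span>
    <span class="race-small ng-hide">R</span>
  </h1>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
# recursive=False limits the search to direct children of each <h1>,
# so text inside the nested <span> tags is ignored.
items = [
    h1.find(string=True, recursive=False).strip()
    for h1 in soup.select(".page-title-new h1")
]
print(items)  # ['Soccer: Next To Play']
```

Unlike the extract() loop, this leaves the parsed tree unchanged, so the spans are still available afterwards.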
I'm trying to scrape data from a site with the structure below. I want to extract the information in each <li id="entry">, and each entry should also carry the category information from the <h2> of its enclosing <li id="category">.
<ul class="html-winners">
<li id="category">
<h2>Redaktionell Print - Dagstidning</h2>
<ul>
<li id="entry">
<div class="entry-info">
<div class="block">
<img src="bilder/tumme10/4.jpg" width="110" height="147">
<span class="gold">Guld: Svenska Dagbladet</span><br>
<strong><strong>Designer:</strong></strong> Anna W Thurfjell och SvD:s medarbetare<br>
<strong><strong>Motivering:</strong></strong> "Konsekvent design som är lätt igenkänningsbar. Små förändringar förnyar ständigt och blldmotiven utnyttjas föredömligt."
</div>
</div>
</li>
<li id="entry">
<div class="entry-info">
<div class="block"><img src="bilder/tumme10/3.jpg" width="110" height="147">
<span class="silver">Silver: K2 - Kristianstadsbladet</span>
</div>
</div>
</li>
</ul>
</li>
I use Scrapy with the following code:
start_urls = [
    "http://www.designpriset.se/vinnare.php?year=2010"
]
rules = (
    Rule(LinkExtractor(allow = "http://www.designpriset.se/", restrict_xpaths=('//*[@class="html-winners"]')), callback='parse_item'),
)
def parse(self, response):
    for sel in response.xpath('//*[@class="entry-info"]'):
        item = ByrauItem()
        annons_list = sel.xpath('//span[@class="gold"]/text()|//span[@class="silver"]/text()').extract()
        byrau_list = sel.xpath('//div/text()').extract()
        kategori_list = sel.xpath('/preceding::h2/text()').extract()
        for x in range(0,len(annons_list)):
            item['Annonsrubrik'] = annons_list[x]
            item['Byrau'] = byrau_list[x]
            item['Kategori'] = kategori_list[x]
            yield item
annons_list and byrau_list work perfectly; they use XPath to go down the hierarchy from the starting point //*[@class="entry-info"]. But kategori_list gives me "IndexError: list index out of range". Am I writing the XPath preceding axis the wrong way?
As mentioned by @kjhughes in a comment, you need to add . just before / or // to make your XPath expression relative to the current context element. Otherwise the expression is evaluated relative to the document root, and that is why the expression /preceding::h2/text() returned nothing.
In the case of /, you can alternatively remove it from the beginning of your XPath expression to make it relative to the current context element:
kategori_list = sel.xpath('preceding::h2/text()').extract()
Just a note: preceding::h2 returns all h2 elements located before the <div class="entry-info"> in the document. Given the HTML posted, I think the following XPath expression is safer against returning unwanted h2 elements (false positives):
query = 'parent::li/parent::ul/preceding-sibling::h2/text()'
kategori_list = sel.xpath(query).extract()
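Another way to avoid looking backwards at all is to loop over the categories first and then over the entries nested in each one, so every entry inherits the category it sits in. The sketch below illustrates only that nesting. It swaps Scrapy for the standard library's xml.etree.ElementTree and runs on a simplified, well-formed copy of the markup (ElementTree cannot parse the real page's unclosed img and br tags). In a spider the same two loops would be written with relative response.xpath(...) calls.

```python
import xml.etree.ElementTree as ET

# Simplified, well-formed copy of the markup in the question.
html = """
<ul class="html-winners">
  <li id="category">
    <h2>Redaktionell Print - Dagstidning</h2>
    <ul>
      <li id="entry"><span class="gold">Guld: Svenska Dagbladet</span></li>
      <li id="entry"><span class="silver">Silver: K2 - Kristianstadsbladet</span></li>
    </ul>
  </li>
</ul>
"""

root = ET.fromstring(html)
items = []
for category in root.findall(".//li[@id='category']"):
    kategori = category.find("h2").text  # read once per category
    for entry in category.findall(".//li[@id='entry']"):
        # every entry inherits the category it is nested in
        items.append({
            "Kategori": kategori,
            "Annonsrubrik": entry.find(".//span").text,
        })

print(items)
```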