I am trying to get all xpaths that start with something, but when I test it out, it says my xpath is invalid.
This is my xpath:
xpath = "//*[contains(@id,'ClubDetails_" + str(team) + "']"
Which outputs when I print:
//*[contains(@id,'ClubDetails_1']
Here is HTML:
<div id="ClubDetails_1_386" class="fadeout">
<div class="MapListDetail">
<div>Bev Garman</div>
armadaswim@aol.com
</div>
<div class="MapListDetail">
<div>Rolling Hills Country Day School</div>
<div>26444 Crenshaw Blvd</div>
<div>Rolling Hills Estates, CA 90274</div>
</div>
<div class="MapListDetailOtherLocs_1">
<div>This club also swims at other locations</div>
<span class="show_them_link">show them...</span>
</div>
</div>
What am I missing?
An alternative is to match IDs that start with some text, collect all matching elements in a list, then iterate over the list and check whether each element's id attribute contains the team you want.
You can match the pattern with a CSS selector like this:
div[id^='ClubDetails_']
And the equivalent in XPath:
//div[starts-with(@id, 'ClubDetails_')]
Code sample:
expectedid = "1_386"
clubdetails = driver.find_elements_by_css_selector("div[id^='ClubDetails_']")
for item in clubdetails:
    elementid = item.get_attribute('id')
    if expectedid in elementid:
        # further code
        pass
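For reference, the XPath from the question can also be fixed directly: the string was missing the closing quote and bracket of the contains() predicate. A minimal sketch of the corrected string building (plain Python, no Selenium needed to check the output):

```python
team = 1

# Broken:   "//*[contains(@id,'ClubDetails_" + str(team) + "']"
# produces  //*[contains(@id,'ClubDetails_1'] -- the contains() call is never closed.
# Corrected: close the quote and the contains() call before the bracket.
xpath = "//*[contains(@id,'ClubDetails_" + str(team) + "')]"
print(xpath)  # //*[contains(@id,'ClubDetails_1')]
```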
Related
Question:
Can I group found elements by the div class they're in and store them in lists within a list? Is that possible?
So I did some further testing, and as mentioned: it seems that even if you store one div in a variable, searching within that stored div still searches the whole page content.
from selenium import webdriver

driver = webdriver.Chrome()
result_text = []
# Let's say this is the class of the different divs I want to group by:
# class='a-fixed-right-grid a-spacing-top-medium'
# These are the texts from all divs around the page that I'm looking for,
# but I can't tell which one belongs to which div
elements = driver.find_elements_by_xpath("//a[contains(@href, '/gp/product/')]")
for element in elements:
    result_text.append(element.text)
print(result_text)
Current Result:
I'm already getting all the information I'm looking for from different divs around the page but I want it to be "grouped" by the topmost div.
['Text11', 'Text12', 'Text2', 'Text31', 'Text32']
Result I want to achieve:
The text is grouped by class='a-fixed-right-grid a-spacing-top-medium':
[['Text11', 'Text12'], ['Text2'], ['Text31', 'Text32']]
HTML: (looks something like this)
class="a-text-center a-fixed-left-grid-col a-col-left" is the first one that wraps the group; from there on we can use any div to group it. At least I think so.
</div>
</div>
</div>
</div>
<div class="a-fixed-right-grid a-spacing-top-medium"><div class="a-fixed-right-grid-inner a-grid-vertical-align a-grid-top">
<div class="a-fixed-right-grid-col a-col-left" style="padding-right:3.2%;float:left;">
<div class="a-row">
<div class="a-fixed-left-grid a-spacing-base"><div class="a-fixed-left-grid-inner" style="padding-left:100px">
<div class="a-text-center a-fixed-left-grid-col a-col-left" style="width:100px;margin-left:-100px;float:left;">
<div class="item-view-left-col-inner">
<a class="a-link-normal" href="/gp/product/B07YCW79/ref=ppx_yo_dt_b_asin_image_o0_s00?ie=UTF8&psc=1">
<img alt="" src="https://images-eu.ssl-images-amazon.com/images/I/41rcskoL._SY90_.jpg" aria-hidden="true" onload="if (typeof uet == 'function') { uet('cf'); uet('af'); }" class="yo-critical-feature" height="90" width="90" title="Same as the text I'm looking for" data-a-hires="https://images-eu.ssl-images-amazon.com/images/I/41rsxooL._SY180_.jpg">
</a>
</div>
</div>
<div class="a-fixed-left-grid-col a-col-right" style="padding-left:1.5%;float:left;">
<div class="a-row">
<a class="a-link-normal" href="/gp/product/B07YCR79/ref=ppx_yo_dt_b_asin_title_o00_s0?ie=UTF8&psc=1">
Text I'm looking for
</a>
</div>
<div class="a-row">
I don't have the link to test it on but this might work for you:
from selenium import webdriver
driver = webdriver.Chrome()
result_text = [[a.text for a in div.find_elements_by_xpath(".//a[contains(@href, '/gp/product/')]")]
               for div in driver.find_elements_by_class_name('a-fixed-right-grid')]
print(result_text)
EDIT: added alternative function:
# if that doesn't work, try:
def get_results(selenium_driver, div_class, a_xpath):
    div_list = []
    for div in selenium_driver.find_elements_by_class_name(div_class):
        a_list = []
        for a in div.find_elements_by_xpath(a_xpath):
            a_list.append(a.text)
        div_list.append(a_list)
    return div_list

get_results(driver,
            div_class='a-fixed-right-grid',
            a_xpath=".//a[contains(@href, '/gp/product/')]")
If that still doesn't work, the XPath may be returning every matching element on the page each time despite being called from the div (a leading . makes the expression relative to the div), or another element farther up the document shares the same class name.
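The relative-search pitfall described above can be demonstrated without a browser; lxml evaluates XPath the same way here. This is a sketch with made-up markup, not the original Amazon page:

```python
from lxml import html

doc = html.fromstring("""
<div>
  <div class="grp"><a href="/gp/product/1">Text11</a>
                   <a href="/gp/product/2">Text12</a></div>
  <div class="grp"><a href="/gp/product/3">Text2</a></div>
</div>
""")

first_group = doc.find_class("grp")[0]
# '//a' ignores the context node and searches the whole document:
everywhere = first_group.xpath("//a[contains(@href, '/gp/product/')]")
# './/a' only searches inside the context node:
inside = first_group.xpath(".//a[contains(@href, '/gp/product/')]")
print(len(everywhere), len(inside))  # 3 2
```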
I have some HTML that has the following structure:
<div class="article">
<h1 class="header">Birth Date between 1919-01-01 and 2019-01-01, Oscar-Winning, Oscar-Nominated, Males (Sorted by Popularity Ascending) </h1>
<br class="clear"/>
<div class="desc">
<span>1-100 of 716 names.</span> <!-- I WANT THIS ELEMENT -->
<span class="ghost">|</span> <a class="lister-page-next next-page" href="/search/name?birth_date=1919-01-01,2019-01-01&groups=oscar_winner,oscar_nominee&gender=male&count=100&start=101&ref_=rlm">Next »</a>
</div>
<br class="clear"/>
</div>
Now I am trying to get a specific element out of this HTML with bs4. I tried:
webSoup = BeautifulSoup(html, 'html.parser')
nextUrl = webSoup.findChildren()[2][0]
but this gives me the following error:
return self.attrs[key]
KeyError: 0
So, to summarize my question:
How do I get a specific child at a certain index from an html document with bs4?
If you want the first match for the span inside class desc, you can use a CSS child combinator to pair the parent class with the child element tag:
webSoup.select_one('.desc > span')
You can also specify that the parent must be a div:
div.desc > span
If there is more than one match, use webSoup.select and then index into the returned list.
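A quick sketch using a trimmed version of the HTML from the question (href shortened):

```python
from bs4 import BeautifulSoup

html = """
<div class="desc">
  <span>1-100 of 716 names.</span>
  <span class="ghost">|</span>
  <a class="lister-page-next next-page" href="/search/name?start=101">Next</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# first span child of .desc
first = soup.select_one("div.desc > span")
# all span children of .desc, as a list you can index into
spans = soup.select("div.desc > span")
print(first.get_text())  # 1-100 of 716 names.
print(len(spans))        # 2
```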
You can use:
nextUrl = webSoup.findChildren()[3].findChildren()[0]
print(nextUrl)
I have the following HTML content:
content = """
<div>
<div> <div>div A</div> </div>
<p>P A</p>
<div> <div>div B</div> </div>
<p> P B1</p>
<p> P B2</p>
<div> <div>div C</div> </div>
<p> P C1 <div>NODE</div> </p>
</div>
"""
Which can be visualized like this (not sure if it helps, but I like diagrams):
If I use the following code :
soup = bs4.BeautifulSoup(content, "lxml")
firstDiv = soup.div
allElem = firstDiv.findAll(recursive=False)
for i, el in enumerate(allElem):
    print "element ", i, " : ", el
I get this :
element 0 : <div> <div>div A</div> </div>
element 1 : <p>P A</p>
element 2 : <div> <div>div B</div> </div>
element 3 : <p> P B1</p>
element 4 : <p> P B2</p>
element 5 : <div> <div>div C</div> </div>
element 6 : <p> P C1 </p>
element 7 : <div>NODE</div>
As you can see, unlike elements 0, 2 or 5, element 6 doesn't contain its children. If I change its <p> to <b> or <div>, it acts as expected. Why this little difference with <p>? I'm still seeing this behaviour (if it is a problem?) after upgrading from 4.3.2 to 4.4.6.
p elements can only contain phrasing content, so what you have is actually invalid HTML. Here's an example from the HTML spec of how such markup is parsed:
For example, a form element isn't allowed inside phrasing content,
because when parsed as HTML, a form element's start tag will imply a
p element's end tag. Thus, the following markup results in two
paragraphs, not one:
<p>Welcome. <form><label>Name:</label> <input></form>
It is parsed exactly like the following:
<p>Welcome. </p><form><label>Name:</label> <input></form>
You can confirm that this is how browsers parse your HTML (pictured is Chrome 64):
lxml is handling this correctly, as is html5lib. html.parser doesn't implement much of the HTML5 spec and doesn't care about these quirks.
I suggest you stick to lxml and html5lib if you don't want to be frustrated in the future by these parsing differences. It's annoying when what you see in your browser's DOM inspector differs from how your code parses it.
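The difference can be sketched with the snippet from the question: html.parser keeps the literal (invalid) nesting, while lxml and html5lib close the <p> early the way a browser would (they must be installed to compare; the lxml call is left commented out):

```python
from bs4 import BeautifulSoup

content = "<div><p> P C1 <div>NODE</div> </p></div>"

# html.parser does not implement HTML5's implied end tags,
# so the inner <div> stays inside the <p>:
lenient = BeautifulSoup(content, "html.parser")
print(lenient.p.div)  # <div>NODE</div>

# lxml would close the <p> when it sees <div>, as a browser does:
# strict = BeautifulSoup(content, "lxml")
# print(strict.p.div)  # None
```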
I'm trying to scrape data from a site with the structure below. I want to extract the information in each of the <li id="entry"> elements, but both entries should also get the category information from the <li id="category"> / <h2>:
<ul class="html-winners">
<li id="category">
<h2>Redaktionell Print - Dagstidning</h2>
<ul>
<li id="entry">
<div class="entry-info">
<div class="block">
<img src="bilder/tumme10/4.jpg" width="110" height="147">
<span class="gold">Guld: Svenska Dagbladet</span><br>
<strong><strong>Designer:</strong></strong> Anna W Thurfjell och SvD:s medarbetare<br>
<strong><strong>Motivering:</strong></strong> "Konsekvent design som är lätt igenkänningsbar. Små förändringar förnyar ständigt och blldmotiven utnyttjas föredömligt."
</div>
</div>
</li>
<li id="entry">
<div class="entry-info">
<div class="block"><img src="bilder/tumme10/3.jpg" width="110" height="147">
<span class="silver">Silver: K2 - Kristianstadsbladet</span>
</div>
</div>
</li>
</ul>
</li>
I use Scrapy with the following code:
start_urls = [
    "http://www.designpriset.se/vinnare.php?year=2010"
]

rules = (
    Rule(LinkExtractor(allow="http://www.designpriset.se/", restrict_xpaths=('//*[@class="html-winners"]')), callback='parse_item'),
)

def parse(self, response):
    for sel in response.xpath('//*[@class="entry-info"]'):
        item = ByrauItem()
        annons_list = sel.xpath('//span[@class="gold"]/text()|//span[@class="silver"]/text()').extract()
        byrau_list = sel.xpath('//div/text()').extract()
        kategori_list = sel.xpath('/preceding::h2/text()').extract()
        for x in range(0, len(annons_list)):
            item['Annonsrubrik'] = annons_list[x]
            item['Byrau'] = byrau_list[x]
            item['Kategori'] = kategori_list[x]
            yield item
annons_list and byrau_list work perfectly; they use XPath to walk down the hierarchy from the starting point //*[@class="entry-info"]. But kategori_list gives me "IndexError: list index out of range". Am I writing the preceding axis the wrong way?
As mentioned by @kjhughes in a comment, you need to add . just before / or // to make your XPath expression relative to the current context element. Otherwise the expression is evaluated relative to the root document, and that's why the expression /preceding::h2/text() returned nothing.
In the case of /, you can also simply remove it from the beginning of your XPath expression as an alternative way to make it relative to the current context element:
kategori_list = sel.xpath('preceding::h2/text()').extract()
Just a note: preceding::h2 will return all h2 elements located before the <div class="entry-info">. Given the HTML posted, I think the following XPath expression is safer against returning unwanted h2 elements (false positives):
query = 'parent::li/parent::ul/preceding-sibling::h2/text()'
kategori_list = sel.xpath(query).extract()
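A sketch with a trimmed version of the posted HTML, using lxml directly (Scrapy selectors expose the same .xpath() behaviour):

```python
from lxml import html

doc = html.fromstring("""
<ul class="html-winners">
  <li id="category">
    <h2>Redaktionell Print - Dagstidning</h2>
    <ul>
      <li id="entry"><div class="entry-info">Guld</div></li>
      <li id="entry"><div class="entry-info">Silver</div></li>
    </ul>
  </li>
</ul>
""")

cats = []
for sel in doc.xpath('//*[@class="entry-info"]'):
    # relative axes: walk up to the inner <ul>, then to its sibling <h2>
    kategori = sel.xpath('parent::li/parent::ul/preceding-sibling::h2/text()')
    cats.append(kategori)
print(cats)  # [['Redaktionell Print - Dagstidning'], ['Redaktionell Print - Dagstidning']]
```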
I am trying to select the "(6)" in the tag below:
<a class="itemRating" href="http://www.newegg.com/Product/ProductReview.aspx?Item=N82E16834200347" title="Rating + 4">
<span class="eggs r4"> </span>
(6)
</a>
The XPath, which I will call review, is shown below:
review = site.xpath('/html/body/div[3]/div[2]/table/tr/td[2]/div/div[8]/div/div/div/a[3]')
When I try printing review[0].text, it prints 'None' instead of the (6).
Any ideas?
(6) is in the tail of the <span> element:
>>> review[0].find('span').tail
'\n(6)\n'
You can use:
review[0].text_content().strip()
or
review[0].xpath('string()').strip()
And I'd write your xpath as:
review = site.xpath('//a[@class="itemRating"]')
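The text/tail split is part of the ElementTree data model that lxml follows, so it can be sketched with the standard library alone (markup trimmed from the question, href replaced by a placeholder):

```python
import xml.etree.ElementTree as ET

a = ET.fromstring(
    '<a class="itemRating" href="...">'
    '<span class="eggs r4"> </span>\n(6)\n</a>'
)

span = a.find('span')
# .tail is the text after </span> but still inside <a>;
# a.text is None because <a> starts directly with <span>
print(repr(span.tail))                # '\n(6)\n'
print(''.join(a.itertext()).strip())  # (6)
```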