Python scraping xpath get <a> with specific <span> - python

I'm using Scrapy to get some data from a website.
I have the following list of links:
<li class="m-pagination__item">
  10
</li>
<li class="m-pagination__item">
  <a href="?isin=IT0000072618&lang=it&page=1">
    <span class="m-icon -pagination-right"></span>
  </a>
</li>
I want to extract the href attribute of only the a element that contains the span with class="m-icon -pagination-right".
I've been looking at some XPath examples, but I'm not an expert in XPath and I couldn't find a solution.
Thanks.

//a[span/@class = 'm-icon -pagination-right']/@href

With a Scrapy response:
response.css('span.m-icon').xpath('../@href')
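For reference, a minimal self-contained sketch using scrapy's Selector against the fragment from the question; it shows both the pure XPath and the CSS-plus-XPath route:

from scrapy.selector import Selector

# Fragment from the question
html = '''
<li class="m-pagination__item">
  <a href="?isin=IT0000072618&lang=it&page=1">
    <span class="m-icon -pagination-right"></span>
  </a>
</li>
'''
sel = Selector(text=html)

# Pure XPath: the <a> whose child <span> carries the target class
print(sel.xpath("//a[span/@class = 'm-icon -pagination-right']/@href").get())

# CSS + XPath: select the <span>, then step up to the parent <a>'s href
print(sel.css('span.m-icon').xpath('../@href').get())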

Related

scrapy xpath how to use?

Guys, I have a question about Scrapy selectors and XPath.
I would like to select the link in the a tag inside the last li tag in the HTML below; how do I write the XPath query for that?
I did it like this, but I believe there is a simpler way, such as doing it entirely in the XPath query instead of indexing into the extracted list, only I don't know how to write it:
from scrapy import Selector
sel = Selector(text=html)
print(sel.xpath('(//ul/li)').xpath('a/@href').extract()[-1])
The HTML looks like this:
<ul>
<li>
<a href="/info/page/" rel="follow">
<span class="page-numbers">
35
</span>
</a>
</li>
<li>
<a href="/info/page/" rel="follow">
<span class="next">
next page.
</span>
</a>
</li>
</ul>
I am assuming you want specifically the link to the "next" page. If this is the case, you can locate the a element by checking that its child span has the "next" class:
//a[span/@class = "next"]/@href
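For what it's worth, a small sketch showing that XPath with scrapy's Selector; the html string here is just the last two <li> entries from the question:

from scrapy.selector import Selector

html = '''
<ul>
  <li><a href="/info/page/" rel="follow"><span class="page-numbers">35</span></a></li>
  <li><a href="/info/page/" rel="follow"><span class="next">next page.</span></a></li>
</ul>
'''
sel = Selector(text=html)

# Pick the <a> whose child <span> has the "next" class and read its href
print(sel.xpath('//a[span/@class = "next"]/@href').extract_first())  # -> /info/page/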

Python splinter select by tag attribute

I am messing around with some web scraping using Splinter but have run into this issue. The HTML basically has loads of li elements, only some of which I am interested in. The ones I am interested in have a bid value. Now, I know that with Beautiful Soup I can do
tab = browser.find_by_css('li', {'bid': '18663145091'})
but this doesn't seem to work for splinter. I get an error saying:
find_by_css() takes exactly 2 arguments (3 given)
This is a sample of my html:
<li class="rugby" bid="18663145091">
<span class="info">
<div class="points">
12
</div>
<img alt="Leinster" height="19" src="..Leinster" width="26"/>
</span>
</li>
It looks like you are using the find_by_css() method as if it were a BeautifulSoup method. Instead, provide a valid CSS selector that checks the value of the bid attribute:
tab = browser.find_by_css('li[bid=18663145091]')
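A rough sketch of how that looks in a full splinter session; the URL is a placeholder, not from the question, and quoting the attribute value keeps the CSS selector valid even though it starts with a digit:

from splinter import Browser

with Browser('firefox') as browser:
    browser.visit('https://example.com/fixtures')  # placeholder URL
    matches = browser.find_by_css('li[bid="18663145091"]')
    if not matches.is_empty():
        print(matches.first['class'])  # -> "rugby" for the sample <li>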

Python Selenium Webdriver - Grab div after specified one

I am trying to use Python Selenium Firefox Webdriver to grab the h2 content 'My Data Title' from this HTML
<div class="box">
<ul class="navigation">
<li class="live">
<span>
Section Details
</span>
</li>
</ul>
</div>
<div class="box">
<h2>
My Data Title
</h2>
</div>
<div class="box">
<ul class="navigation">
<li class="live">
<span>
Another Section
</span>
</li>
</ul>
</div>
<div class="box">
<h2>
Another Title
</h2>
</div>
Each div has a class of box so I can't easily identify the one I want. Is there a way to tell Selenium to grab the h2 in the box class that comes after the one that has the span called 'Section Details'?
If you want to grab the h2 in the box div that comes after the one containing the span with the text Section Details, try the XPath below using preceding:
(//h2[preceding::span[normalize-space(text()) = 'Section Details']])[1]
or using following:
(//span[normalize-space(text()) = 'Section Details']/following::h2)[1]
and for Another Section, just change the span text in the XPath:
(//h2[preceding::span[normalize-space(text()) = 'Another Section']])[1]
or
(//span[normalize-space(text()) = 'Another Section']/following::h2)[1]
Here is an XPath to select the title following the text "Section Details":
//div[@class='box'][normalize-space(.)='Section Details']/following::h2
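A hedged end-to-end sketch, using the same legacy find_element_by_xpath style as the other answers here and assuming the page at the placeholder URL contains the markup from the question:

from selenium import webdriver

driver = webdriver.Firefox()
driver.get('https://example.com/page')  # placeholder URL

# First <h2> after the span whose text is 'Section Details'
title = driver.find_element_by_xpath(
    "(//span[normalize-space(text()) = 'Section Details']/following::h2)[1]")
print(title.text)  # -> My Data Title

driver.quit()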
Yeah, you need to do some complicated XPath searching:
referenceElementList = driver.find_elements_by_xpath("//span")
for eachElement in referenceElementList:
    if eachElement.get_attribute("innerHTML") == 'Section Details':
        # span -> li -> ul -> enclosing div.box, then the next sibling div's h2
        elementYouWant = eachElement.find_element_by_xpath("../../../following-sibling::div/h2")
elementYouWant.get_attribute("innerHTML") should then give you "My Data Title".
My code reads:
find all span elements regardless of where they are in the HTML and store them in a list called referenceElementList;
iterate over the span elements in referenceElementList one by one, looking for a span whose innerHTML attribute is 'Section Details';
if there is a match, we have found the span, and we navigate backwards three levels to locate the enclosing div[@class='box'], then find that div's next sibling, which is the second div element;
lastly, we locate the h2 element from its parent.
Can you please tell me if my code works? I might have gone wrong somewhere navigating backwards.
One potential difficulty you may encounter: the innerHTML attribute may contain tab, newline and space characters, in which case you need a regex to do some filtering first.
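As a sketch of that filtering, a hypothetical helper like this collapses the whitespace before comparing:

import re

def normalized(element):
    # Collapse tabs, newlines and runs of spaces in innerHTML
    return re.sub(r'\s+', ' ', element.get_attribute("innerHTML")).strip()

# then, in the loop above:
# if normalized(eachElement) == 'Section Details':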

Python Selenium save list items as array

I am using Python Selenium to look through some HTML and find elements. I have the following HTML saved into Python...
<section id="categories">
<ul id="category_list">
<li id="category84">
Sample Category
</li>
<li id="category984">
Another Category
</li>
<li id="category22">
My Sample Category
</li>
</ul>
</section>
I can find the categories section easily enough, but now I would like to loop through each list item and save its name and href link into an array.
Anyone got a similar example I can see?
Sure, let's use a CSS selector locator and a list comprehension calling .get_attribute("href") to get the link and .text to get the link text:
categories = driver.find_elements_by_css_selector("#categories #category_list li[id^=category] a")
result = [{"link": category.get_attribute("href"), "text": category.text}
for category in categories]
print(result)
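For completeness, a hedged sketch of the surrounding setup; note it assumes each <li> wraps its text in an <a href="..."> link, which the CSS selector requires but the sample HTML in the question omits:

from selenium import webdriver

driver = webdriver.Firefox()
driver.get('https://example.com/categories')  # placeholder URL

categories = driver.find_elements_by_css_selector(
    "#categories #category_list li[id^=category] a")
result = [{"link": c.get_attribute("href"), "text": c.text} for c in categories]
print(result)  # e.g. [{'link': '...', 'text': 'Sample Category'}, ...]

driver.quit()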

Python BeautifulSoup findAll by "class" attribute

I want to run the following code, which is what the BS documentation says to do. The only problem is that "class" isn't just a word that can be found inside HTML; it's also a Python keyword, which causes this code to throw an error.
So how do I do the following?
soup.findAll('ul', class="score")
Your problem seems to be that you expect find_all in the soup to find an exact match for your string. In fact, per the documentation:
When you search for a tag that matches a certain CSS class, you're matching against any of its CSS classes.
You can properly search for a class as #alKid said. You can also search with the class_ keyword argument:
soup.find_all('ul', class_="score")
Here is how to do it:
soup.find_all('ul', {'class':"score"})
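A quick runnable check showing that both forms match the multi-class ul (class matching in BeautifulSoup is per-class, so "score header" matches "score"):

from bs4 import BeautifulSoup

html = '<ul class="score header"><li class="finalScore">T</li></ul>'
soup = BeautifulSoup(html, 'html.parser')

print(soup.find_all('ul', class_="score"))      # keyword-argument form
print(soup.find_all('ul', {'class': "score"}))  # dict form; same result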
If the OP is interested in getting the finalScore by going through the ul, you could solve this with a couple of lines of gazpacho:
from gazpacho import Soup
html = """\
<div>
<ul class="score header" id="400488971-linescoreHeader" style="display: block">
<li>1</li>
<li>2</li>
<li>3</li>
<li>4</li>
<li id="400488971-lshot"> </li>
<li class="finalScore">T</li>
</ul>
</div>
"""
soup = Soup(html)
soup.find("ul", {"class": "score"}).find("li", {"class": "finalScore"}).text
Which would output:
'T'
