Guys,
I have a question about Scrapy's Selector and XPath.
I would like to select the link in the "a" tag inside the last "li" tag of the HTML below. How do I write the XPath query for that?
I did it like this, but I believe there is a simpler way, such as a pure XPath query instead of slicing the result list in Python; I just don't know how to write it:
from scrapy import Selector
sel = Selector(text=html)
print(sel.xpath('(//ul/li)').xpath('a/@href').extract()[-1])
'''
html
'''
<ul>
<li>
<a href="/info/page/" rel="follow">
<span class="page-numbers">
35
</span>
</a>
</li>
<li>
<a href="/info/page/" rel="follow">
<span class="next">
next page.
</span>
</a>
</li>
</ul>
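One way to fold the "take the last element" step into the query itself is XPath's last() function. A minimal sketch, using lxml as a stand-in (Scrapy's Selector accepts the same expression):

```python
# Sketch using lxml (assumption: lxml is installed; Scrapy's Selector
# accepts the same XPath) to pick the <a> href of the last <li> directly.
from lxml import html as lxml_html

html = """
<ul>
  <li><a href="/info/page/" rel="follow"><span class="page-numbers">35</span></a></li>
  <li><a href="/info/page/" rel="follow"><span class="next">next page.</span></a></li>
</ul>
"""

tree = lxml_html.fromstring(html)
# last() keeps the "take the last li" step inside the XPath itself.
last_href = tree.xpath('(//ul/li)[last()]/a/@href')
print(last_href)  # ['/info/page/']
```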
I am assuming you want specifically the link to the "next" page. If this is the case, you can locate the a element by checking that its child span has the "next" class:
//a[span/@class = "next"]/@href
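A quick way to sanity-check this expression, sketched with lxml (assumed installed; the "/info/page/2" href is made up for illustration so the two links differ):

```python
# Checking the //a[span/@class = "next"]/@href expression with lxml.
# The "/info/page/2" href is hypothetical, so the two links differ.
from lxml import html as lxml_html

html = """
<ul>
  <li><a href="/info/page/" rel="follow"><span class="page-numbers">35</span></a></li>
  <li><a href="/info/page/2" rel="follow"><span class="next">next page.</span></a></li>
</ul>
"""

tree = lxml_html.fromstring(html)
next_href = tree.xpath('//a[span/@class = "next"]/@href')
print(next_href)  # ['/info/page/2']
```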
Related
I was learning web scraping, and the 'li' tag is not showing when I run soup.findAll
Here's the html:
<label>
<input type="checkbox">
<ul class="dropdown-content">
<li>
<a href=stuff</a>
</li>
</ul>
</label>
I tried:
soup = BeautifulSoup(r.content,'html5lib')
dropdown = soup.findAll('ul', {'class':'dropdown-content'})
print(dropdown)
And it only shows:
[<ul class="dropdown-content"></ul>]
Any help will do. Thanks!
In this command: dropdown = soup.findAll('ul', {'class':'dropdown-content'}), you search for the ul with the dropdown-content class. To get the li elements, search within it:
dropdown = soup.find('ul').findAll('li')
Your selection per se is okay to find the <ul>; it just may not contain any <li>, because I assume these elements are generated dynamically by JavaScript. To validate this, the question should be improved and the URL of the website provided.
If the content is provided dynamically, one approach would be to work with Selenium, which renders the website like a browser and can return the "full" DOM.
Note: In new code, use find_all() instead of the old syntax findAll().
Example
The HTML in your example is broken, but your code works if there are any li elements inside the ul in your soup.
import requests
from bs4 import BeautifulSoup
html = '''
<label>
<input type="checkbox">
<ul class="dropdown-content">
<li>
</li>
</ul>
</label>
'''
soup = BeautifulSoup(html,'html5lib')
dropdown = soup.find_all('ul', {'class':'dropdown-content'})
print(dropdown)
Output
[<ul class="dropdown-content">
<li>
</li>
</ul>]
I have some Python code that retrieves data from a web page (web scraping).
At some point the code returns the following list:
<ul class="nav nav--stacked" id="designer-list">
<li>
<h2>
<a class="text-uppercase bold router-link-active" href="/en-ca/cars_all">
All Cars
</a>
</h2>
</li>
<li>
<a href="/en-ca/cars/c1">
<span>
The car c1
</span>
</a>
</li>
<li>
<a href="/en-ca/cars/c2">
<span>
The car c2
</span>
</a>
</li>
</ul>
I am using BeautifulSoup and I just want to retrieve the reference (href) and name for each car.
In this example I want to retrieve (/en-ca/cars/c1)=>(The car c1) AND (/en-ca/cars/c2)=>(The car c2). I want to skip the first element (All Cars).
I could use .find_all('li') and skip the first element inside the loop.
I was wondering if there is a way to reject that element through BeautifulSoup methods.
You can do it like this, though it is through list slicing rather than a BeautifulSoup method:
soup = BeautifulSoup(html, "html.parser")
content = soup.find_all('li')[1:]
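Putting that together with the HTML above, a minimal sketch (assuming bs4 is installed) that collects the href => name pairs while skipping the first "All Cars" item:

```python
# Sketch (assuming bs4 is installed): collect href => name for each car,
# skipping the first "All Cars" item via list slicing.
from bs4 import BeautifulSoup

html = """
<ul class="nav nav--stacked" id="designer-list">
  <li><h2><a class="text-uppercase bold router-link-active" href="/en-ca/cars_all">All Cars</a></h2></li>
  <li><a href="/en-ca/cars/c1"><span>The car c1</span></a></li>
  <li><a href="/en-ca/cars/c2"><span>The car c2</span></a></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")
cars = {}
for li in soup.find_all('li')[1:]:      # [1:] drops the "All Cars" entry
    a = li.find('a')
    cars[a['href']] = a.get_text(strip=True)
print(cars)  # {'/en-ca/cars/c1': 'The car c1', '/en-ca/cars/c2': 'The car c2'}
```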
My intention is to scrape the names of the top-selling products on Ali-Express.
I'm using the Requests library alongside Beautiful Soup to accomplish this.
# Imports assumed: from bs4 import BeautifulSoup as bs; import requests as req; import pprint as pp
url = "https://bestselling.aliexpress.com/en?spm=2114.11010108.21.3.qyEJ5m"
soup = bs(req.get(url).text, 'html.parser')
#pp.pprint(soup) Verify that the page has been found
all_items = soup.find_all('li',class_= 'top10-item')
pp.pprint(all_items)
# []
However this returns an empty list, indicating that soup.find_all() did not find any tags fitting that criteria.
Inspect Element in Chrome displays the list items as expected (screenshot of the rendered DOM omitted).
However, in the page source, the ul (class="top10-items") is empty and is followed by a script, which seems to be a template that gets filled in for each list item (I'm not familiar with HTML).
<div class="container">
<div class="top10-header"><span class="title">TOP SELLING</span> <span class="sub-title">This week's most popular products</span></div>
<ul class="top10-items loading" id="bestselling-top10">
</ul>
<script class="X-template-top10" type="text/mustache-template">
{{#topList}}
<li class="top10-item">
<div class="rank-orders">
<span class="rank">{{rank}}</span>
<span class="orders">{{productOrderNum}}</span>
</div>
<div class="img-wrap">
<a href="{{productDetailUrl}}" target="_blank">
<img src="{{productImgUrl}}" alt="{{productName}}">
</a>
</div>
<a class="item-desc" href="{{productDetailUrl}}" target="_blank">{{productName}}</a>
<p class="item-price">
<span class="price">US ${{productMinPrice}}</span>
<span class="uint">/ {{productUnitType}}</span>
</p>
</li>
{{/topList}}</script>
</div>
</div>
So this probably explains why soup.find_all() doesn't find the "li" tag.
My question is: How can I extract the item names from the script using Beautiful soup?
I'm using Scrapy to get some data from a website.
I have the following list of links:
<li class="m-pagination__item">
10
</li>
<li class="m-pagination__item">
<a href="?isin=IT0000072618&lang=it&page=1">
<span class="m-icon -pagination-right"></span>
</a>
I want to extract the href attribute only of the 'a' element that contains the span class="m-icon -pagination-right".
I've been looking for some examples of xpath but I'm not an expert of xpath and I couldn't find a solution.
Thanks.
//a[span/@class = 'm-icon -pagination-right']/@href
With a Scrapy response:
response.css('span.m-icon').xpath('../@href')
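A minimal check of the XPath variant, sketched with lxml (assumed installed; Scrapy's response.xpath accepts the same expression, and &amp; is written explicitly since this is a hand-made HTML string):

```python
# Sanity check of the XPath with lxml (assumption: lxml is installed;
# Scrapy's response.xpath accepts the same expression).
from lxml import html as lxml_html

html = """
<ul>
  <li class="m-pagination__item">10</li>
  <li class="m-pagination__item">
    <a href="?isin=IT0000072618&amp;lang=it&amp;page=1">
      <span class="m-icon -pagination-right"></span>
    </a>
  </li>
</ul>
"""

tree = lxml_html.fromstring(html)
next_href = tree.xpath("//a[span/@class = 'm-icon -pagination-right']/@href")
print(next_href)  # ['?isin=IT0000072618&lang=it&page=1']
```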
I am trying to use Python Selenium Firefox Webdriver to grab the h2 content 'My Data Title' from this HTML
<div class="box">
<ul class="navigation">
<li class="live">
<span>
Section Details
</span>
</li>
</ul>
</div>
<div class="box">
<h2>
My Data Title
</h2>
</div>
<div class="box">
<ul class="navigation">
<li class="live">
<span>
Another Section
</span>
</li>
</ul>
</div>
<div class="box">
<h2>
Another Title
</h2>
</div>
Each div has a class of box so I can't easily identify the one I want. Is there a way to tell Selenium to grab the h2 in the box class that comes after the one that has the span called 'Section Details'?
If you want to grab the h2 in the box class that comes after the one that has the span with text Section Details, try the below xpath using the preceding axis:
(//h2[preceding::span[normalize-space(text()) = 'Section Details']])[1]
or using the following axis:
(//span[normalize-space(text()) = 'Section Details']/following::h2)[1]
And for Another Section, just change the span text in the xpath:
(//h2[preceding::span[normalize-space(text()) = 'Another Section']])[1]
or
(//span[normalize-space(text()) = 'Another Section']/following::h2)[1]
Here is an XPath to select the first title following the text "Section Details":
(//div[@class='box'][normalize-space(.)='Section Details']/following::h2)[1]
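These expressions can be sanity-checked outside Selenium. A sketch with lxml (assumed installed), with the boxes wrapped in a single parent div so the string parses as one fragment:

```python
# Checking the "following" axis XPath with lxml (assumption: lxml is
# installed). A single parent div wraps the boxes so lxml parses the
# string as one fragment.
from lxml import html as lxml_html

html = """
<div id="page">
  <div class="box">
    <ul class="navigation"><li class="live"><span>Section Details</span></li></ul>
  </div>
  <div class="box"><h2>My Data Title</h2></div>
  <div class="box">
    <ul class="navigation"><li class="live"><span>Another Section</span></li></ul>
  </div>
  <div class="box"><h2>Another Title</h2></div>
</div>
"""

tree = lxml_html.fromstring(html)
title = tree.xpath(
    "(//span[normalize-space(text()) = 'Section Details']/following::h2)[1]/text()")
print(title)  # ['My Data Title']
```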
yeah, you need to do some complicated xpath searching:
referenceElementList = driver.find_elements_by_xpath("//span")
for eachElement in referenceElementList:
    if eachElement.get_attribute("innerHTML") == 'Section Details':
        elementYouWant = eachElement.find_element_by_xpath("../../../following-sibling::div/h2")
elementYouWant.get_attribute("innerHTML") should give you "My Data Title"
My code reads:
find all span elements regardless of where they are in the HTML, and store them in a list called referenceElementList;
iterate over the span elements in referenceElementList one by one, looking for a span whose innerHTML attribute is 'Section Details';
if there is a match, we have found the span; we then navigate backwards three levels to locate the enclosing div[@class='box'], and find this div element's next sibling, which is the second div element;
lastly, we locate the h2 element inside that sibling div.
Please tell me if the code works for you; I might have gone wrong somewhere navigating backwards.
One potential difficulty you may encounter: the innerHTML attribute may contain tab, newline, and space characters. In that case, you need a regex to do some filtering first.
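For that filtering step, a small sketch of whitespace normalization with the standard re module:

```python
# Sketch of the whitespace filtering mentioned above: collapse tabs,
# newlines and repeated spaces before comparing innerHTML text.
import re

def clean_inner_html(text):
    # Replace any run of whitespace with a single space, then trim.
    return re.sub(r'\s+', ' ', text).strip()

print(clean_inner_html('\n\t Section   Details \n'))  # Section Details
```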