How to skip the first element of a <ul> using BeautifulSoup (Python)?

I have Python code that retrieves some data from a web page (web scraping).
At some point the code returns the following list:
<ul class="nav nav--stacked" id="designer-list">
<li>
<h2>
<a class="text-uppercase bold router-link-active" href="/en-ca/cars_all">
All Cars
</a>
</h2>
</li>
<li>
<a href="/en-ca/cars/c1">
<span>
The car c1
</span>
</a>
</li>
<li>
<a href="/en-ca/cars/c2">
<span>
The car c2
</span>
</a>
</li>
</ul>
I am using BeautifulSoup and I just want to retrieve the reference (href) for each car along with its name.
In this example I want to retrieve (/en-ca/cars/c1)=>(The car c1) AND (/en-ca/cars/c2)=>(The car c2), skipping the first element (All Cars).
I could use .find_all('li') and skip the first element inside the loop.
I was wondering whether there is a way to exclude the element through BeautifulSoup methods.

You can do it like this, though it's list slicing rather than a BeautifulSoup method:
soup = BeautifulSoup(html, "html.parser")
content = soup.find_all('li')[1:]
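If you would rather stay inside BeautifulSoup's own query interface, here is a minimal sketch using a CSS selector (assuming bs4 with the soupsieve backend, which supports :nth-of-type):
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")
# :nth-of-type(n+2) matches every <li> except the first one
for li in soup.select("#designer-list li:nth-of-type(n+2)"):
    a = li.find("a")
    print(a["href"], "=>", a.get_text(strip=True))
For the markup above this should print /en-ca/cars/c1 => The car c1 and /en-ca/cars/c2 => The car c2.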

Related

Python 3 Beautifulsoup: Get span tag value with specific text which is also randomly placed within the html tree

I tried searching for this here but couldn't find an answer, to be honest. This should be fairly easy to do with Selenium, but since performance is an important factor I was thinking of doing it with BeautifulSoup instead.
Scenario: I need to scrape the prices of different items which are generated in a random fashion depending on user input; see the markup below:
<div class="sk-expander-content" style="display: block;">
<ul>
<li>
<span>Third Party Liability</span>
<span>€756.62</span>
</li>
<li>
<span>Fire & Theft</span>
<span>€15.59</span>
</li>
</ul>
</div>
If these options were static and always displayed in the same position within the HTML, it would be easy to scrape the prices. But since they could be placed anywhere within the div sk-expander-content, I'm not sure how to find them in a dynamic way.
The best approach would be to write a method that takes the text of the span we are looking for and returns the value in euros. The structure of the span tags is always the same: the first span is always the name of the item and the second one is always the price.
The first thing that came to mind is the following code, but I'm not sure if it is even robust enough or if it makes sense:
html = driver.page_source
soup = BeautifulSoup(html, "html.parser")
div_i_need = soup.find_all("div", class_="sk-expander-content")[1]

def price_scraper(text_to_find):
    for el in div_i_need.find_all(['ul', 'li', 'span']):
        if el.name == 'span':
            if el[0].text == text_to_find:
                return(el[1].text)
Your help will be much appreciated.
Use a regular expression.
from bs4 import BeautifulSoup
import re
html='''<div class="sk-expander-content" style="display: block;">
<ul>
<li>
<span>Third Party Liability</span>
<span>€756.62</span>
</li>
<li>
<span>Fire & Theft</span>
<span>€15.59</span>
</li>
</ul>
</div>
<div class="sk-expander-content" style="display: block;">
<ul>
<li>
<span>Fire & Theft</span>
<span>€756.62</span>
</li>
<li>
<span>Third Party Liability</span>
<span>€15.59</span>
</li>
</ul>
</div>'''
soup = BeautifulSoup(html, "html.parser")
for item in soup.find_all(class_="sk-expander-content"):
    for span in item.find_all('span', text=re.compile(r"€(\d+)\.(\d+)")):
        print(span.find_previous_sibling('span').text)
        print(span.text)
Output:
Third Party Liability
€756.62
Fire & Theft
€15.59
Fire & Theft
€756.62
Third Party Liability
€15.59
UPDATE:
If you only want values from the first matching node, use find() instead of find_all().
from bs4 import BeautifulSoup
import re
html='''<div class="sk-expander-content" style="display: block;">
<ul>
<li>
<span>Third Party Liability</span>
<span>€756.62</span>
</li>
<li>
<span>Fire & Theft</span>
<span>€15.59</span>
</li>
</ul>
</div>
<div class="sk-expander-content" style="display: block;">
<ul>
<li>
<span>Fire & Theft</span>
<span>€756.62</span>
</li>
<li>
<span>Third Party Liability</span>
<span>€15.59</span>
</li>
</ul>
</div>'''
soup = BeautifulSoup(html, "html.parser")
for span in soup.find(class_="sk-expander-content").find_all('span', text=re.compile(r"€(\d+)\.(\d+)")):
    print(span.find_previous_sibling('span').text)
    print(span.text)
from bs4 import BeautifulSoup
import re
html = """
<div class="sk-expander-content" style="display: block;">
<ul>
<li>
<span>Third Party Liability</span>
<span>€756.62</span>
</li>
<li>
<span>Fire & Theft</span>
<span>€15.59</span>
</li>
</ul>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')
target = soup.select("div.sk-expander-content")
for tar in target:
    data = [item.text for item in tar.find_all("span", text=re.compile("€"))]
    print(data)
Output:
['€756.62', '€15.59']
Note: I used select(), which returns a ResultSet, in order to find all matching divs.
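Building on the answers above, a minimal sketch of the lookup function the asker described might look like this (the function name and exact-match behavior are assumptions, not from the original answers):
from bs4 import BeautifulSoup

def price_scraper(soup, text_to_find):
    # Find the <span> whose text matches the item name, then return
    # the text of the price <span> that immediately follows it.
    for label in soup.select("div.sk-expander-content span"):
        if label.get_text(strip=True) == text_to_find:
            price = label.find_next_sibling("span")
            if price is not None:
                return price.get_text(strip=True)
    return None
With the first HTML sample above, price_scraper(soup, "Fire & Theft") would return '€15.59'. Note that this returns the first match only; if several divs contain the same label, collect all matches in a list instead of returning early.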

Getting repeats in beautifulsoup nested tags

I'm trying to parse HTML using BeautifulSoup (with the lxml parser).
On nested tags I'm getting repeated text.
I've tried only counting tags that have no children, but then I'm losing out on data.
given:
<div class="links">
<ul class="links inline">
<li class="comment_forbidden first last">
<span> to post comments</span>
</li>
</ul>
</div>
and running:
soup = BeautifulSoup(file_info, features="lxml")
soup.prettify().encode("utf-8")
for tag in soup.find_all(True):
    if check_text(tag.text):  # False on empty string / all numbers
        print(tag.text)
I get "to post comments" 4 times.
Is there a beautifulsoup way of just getting the result once?
Given an input like
<div class="links">
<ul class="links inline">
<li class="comment_forbidden first last">
<span> to post comments1</span>
</li>
</ul>
</div>
<div class="links">
<ul class="links inline">
<li class="comment_forbidden first last">
<span> to post comments2</span>
</li>
</ul>
</div>
<div class="links">
<ul class="links inline">
<li class="comment_forbidden first last">
<span> to post comments3</span>
</li>
</ul>
</div>
You could do something like
[x.span.string for x in soup.find_all("li", class_="comment_forbidden first last")]
which would give
[' to post comments1', ' to post comments2', ' to post comments3']
find_all() is used to find all the <li> tags of class comment_forbidden first last, and the content of each <li> tag's <span> child is obtained through its string attribute.
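The repetition happens because tag.text on a parent includes all of its descendants' text, so a string nested four tags deep is printed once per ancestor. If you need a generic walk rather than a specific selector, a sketch along these lines (reusing the asker's check_text helper) iterates the leaf strings instead:
from bs4 import BeautifulSoup

soup = BeautifulSoup(file_info, "lxml")
# Iterate the text nodes themselves rather than the tags, so each
# string is visited exactly once regardless of nesting depth.
for s in soup.find_all(string=True):
    text = s.strip()
    if text and check_text(text):
        print(text)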
For anyone struggling with this, try swapping out the parser. I switched to html5lib and no longer get repetitions. It is a costlier parser, though, so it may cause performance issues.
soup = BeautifulSoup(html, "html5lib")
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser
You can use find() instead of find_all() to get your desired result only once.

Python: XPATH search within node

I have HTML that looks something like this (shortened):
<div id="activities" class="ListItems">
<h2>Standards</h2>
<ul>
<li>
<a class="Title" href="http://www.google.com" >Guidelines on management</a>
<div class="Info">
<p>
text
</p>
<p class="Date">Status: Under development</p>
</div>
</li>
</ul>
</div>
<div class="DocList">
<h3>Reports</h3>
<p class="SupLink">+ <a href="http://www.google.com/test" >View More</a></p>
<ul>
<li class="pdf">
<a class="Title" href="document.pdf" target="_blank" >Document</a>
<span class="Size">
[1,542.3KB]
</span>
<div class="Info">
<p>
text <a href="http://www.google.com" >Read more</a>
</p>
<p class="Date">
14/03/2018
</p>
</div>
</li>
</ul>
</div>
I am trying to select the href value under a class="Title" using this code:
import requests
from lxml import html

def sub_path02(url):
    page = requests.get(url)
    tree = html.fromstring(page.content)
    url2 = []
    for node in tree.xpath('//a[@class="Title"]'):
        url2.append(node.get("href"))
    return url2
But I get two results; the one under div class="DocList" is also returned.
I have tried changing my XPath expression so that it only looks within the first node, but I cannot get it to work.
Could someone please help me understand how to search within a specific node? I have gone through multiple XPath documentation pages but I cannot seem to figure it out.
With // you are already selecting all the a elements in the document.
To search in a specific div, specify the parent first and then use //a again to look anywhere inside it:
//div[@class="ListItems"]//a[@class="Title"]
for node in tree.xpath('//div[@class="ListItems"]//a[@class="Title"]'):
    url2.append(node.get("href"))
Try this XPath expression to select the div with a specific id recursively:
'//div[@id="activities"]//a[@class="Title"]'
so:
def sub_path02(url):
    page = requests.get(url)
    tree = html.fromstring(page.content)
    url2 = []
    for node in tree.xpath('//div[@id="activities"]//a[@class="Title"]'):
        url2.append(node.get("href"))
    return url2
Note:
It's generally better to select an id than a class, because an id should be unique (in real life you sometimes see bad pages that reuse the same id, but a class can legitimately be repeated N times).
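As a quick check against the snippet from the question (the variable name snippet is assumed here), you can also pull the attribute directly in the XPath with /@href instead of calling node.get():
from lxml import html

tree = html.fromstring(snippet)  # snippet holds the HTML shown above
print(tree.xpath('//div[@id="activities"]//a[@class="Title"]/@href'))
# ['http://www.google.com']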

scrapy xpath how to use?

I have a question about Scrapy selectors and XPath.
I would like to select the link in the a tag inside the last li tag in the HTML below. How do I write the XPath query for that?
I did it as shown next, but I believe there is a simpler way, such as a pure XPath query instead of slicing the extracted list; I just don't know how to write it.
from scrapy import Selector
sel = Selector(text=html)
print(sel.xpath('(//ul/li)').xpath('a/@href').extract()[-1])
'''
html
'''
<ul>
<li>
<a href="/info/page/" rel="follow">
<span class="page-numbers">
35
</span>
</a>
</li>
<li>
<a href="/info/page/" rel="follow">
<span class="next">
next page.
</span>
</a>
</li>
</ul>
I am assuming you want specifically the link to the "next" page. If so, you can locate the a element by checking that its child span has the "next" class:
//a[span/@class = "next"]/@href
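For completeness, here is a minimal sketch of both approaches with a Scrapy selector (assuming a recent Scrapy where .get() is available; on older versions use extract_first()):
from scrapy import Selector

sel = Selector(text=html)

# Locate the "next" link by its child span's class
print(sel.xpath('//a[span/@class = "next"]/@href').get())

# Alternative that targets "the last <li>" directly, without Python slicing
print(sel.xpath('(//ul/li)[last()]/a/@href').get())
Both would print /info/page/ for the snippet above.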

How to scrape tags that appear within a script

My intention is to scrape the names of the top-selling products on Ali-Express.
I'm using the Requests library alongside Beautiful Soup to accomplish this.
from bs4 import BeautifulSoup as bs
import pprint as pp
import requests as req

url = "https://bestselling.aliexpress.com/en?spm=2114.11010108.21.3.qyEJ5m"
soup = bs(req.get(url).text, 'html.parser')
# pp.pprint(soup)  # verify that the page has been found
all_items = soup.find_all('li', class_='top10-item')
pp.pprint(all_items)
# []
However this returns an empty list, indicating that soup.find_all() did not find any tags fitting that criterion.
Inspect Element in Chrome displays the rendered list items (screenshot omitted here).
However, in the page source the ul class="top10-items" element is empty and is followed by a script, which seems to be a template for each list item (I'm not familiar with HTML):
<div class="container">
<div class="top10-header"><span class="title">TOP SELLING</span> <span class="sub-title">This week's most popular products</span></div>
<ul class="top10-items loading" id="bestselling-top10">
</ul>
<script class="X-template-top10" type="text/mustache-template">
{{#topList}}
<li class="top10-item">
<div class="rank-orders">
<span class="rank">{{rank}}</span>
<span class="orders">{{productOrderNum}}</span>
</div>
<div class="img-wrap">
<a href="{{productDetailUrl}}" target="_blank">
<img src="{{productImgUrl}}" alt="{{productName}}">
</a>
</div>
<a class="item-desc" href="{{productDetailUrl}}" target="_blank">{{productName}}</a>
<p class="item-price">
<span class="price">US ${{productMinPrice}}</span>
<span class="uint">/ {{productUnitType}}</span>
</p>
</li>
{{/topList}}</script>
</div>
</div>
So this probably explains why soup.find_all() doesn't find any "li" tags.
My question is: how can I extract the item names from the script using Beautiful Soup?
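The real product data is filled into that mustache template by JavaScript at runtime, so plain requests + BeautifulSoup can only recover the template itself (the placeholders, not the live names); for the live data you would need the underlying JSON endpoint or a JavaScript-capable tool such as Selenium. That said, since the template is just HTML stored as text, a minimal sketch of pulling tags out of the script with BeautifulSoup:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')

# The template lives inside the <script> tag as plain text, so parse
# that text as HTML in a second pass and query it like normal markup.
template = soup.find('script', class_='X-template-top10')
inner = BeautifulSoup(template.string, 'html.parser')
for li in inner.find_all('li', class_='top10-item'):
    print(li.find('a', class_='item-desc').text)  # prints "{{productName}}"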
