HTML content rewrite in Python

I'm having some issues rewriting HTML text. I want to change the text inside the tags.
For example:
<p>
Some Text
</p>
I want the output to look like this: <p>This is some text</p>
or
<ol>
<li>My Elm</li>
</ol>
<ul>
<li>My Elm</li>
</ul>
Output should be like this:
<ol>
<li>This is my elm</li>
</ol>
<ul>
<li>This is my elm</li>
</ul>
I want to change only the text nodes and keep the surrounding HTML unchanged.

Since the question is not that clear and needs some improvement, I would point you in a direction with BeautifulSoup's replace_with().
Example
from bs4 import BeautifulSoup
html = '''
<ol>
<li>My Elm</li>
</ol>
'''
soup = BeautifulSoup(html, 'html.parser')
soup.ol.li.string.replace_with('Some other text ...')
soup
Output
<ol>
<li>Some other text ...</li>
</ol>
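If you need to rewrite every text node while keeping the markup intact, as the question asks, you can iterate over the document's string nodes. A minimal sketch; the transform applied here (prefixing "This is" and lower-casing) is just an illustration of the asker's example, not a fixed API:

```python
from bs4 import BeautifulSoup

html = '''<ol>
<li>My Elm</li>
</ol>
<ul>
<li>My Elm</li>
</ul>'''

soup = BeautifulSoup(html, 'html.parser')

# Collect the text nodes first: replace_with() mutates the tree,
# so we should not modify it while iterating over it.
for node in list(soup.find_all(string=True)):
    text = node.strip()
    if text:  # skip whitespace-only nodes between tags
        node.replace_with(f'This is {text.lower()}')

print(soup)
```

Every tag stays in place; only the text content changes, producing the `<li>This is my elm</li>` items from the question.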

Beautiful Soup remove first appearance after selector

I am trying to use Beautiful Soup to remove some HTML from an HTML text.
This could be an example of my HTML:
<p>whatever</p><h2 class="myclass"><strong>fruit</strong></h2><ul><li>something</li></ul><div>whatever</div><h2 class="myclass"><strong>television</strong></h2><div>whatever</div><ul><li>test</li></ul>
Focus on those two elements:
<h2 class="myclass"><strong>television</strong></h2>
<ul>
I am trying to remove the first <ul> after <h2 class="myclass"><strong>television</strong></h2>. Also, if possible, I would like to remove this <ul> only if it appears one or two elements after that <h2>.
Is that possible?
You can search for the second <h2> tag using a CSS Selector: h2:nth-of-type(2), and if the next_sibling, or the next_sibling after that, is an <ul> tag, then remove it from the HTML using the .decompose() method:
from bs4 import BeautifulSoup
html = """<p>whatever</p><h2 class="myclass"><strong>fruit</strong></h2><ul><li>something</li></ul><div>whatever</div><h2 class="myclass"><strong>television</strong></h2><div>whatever</div><ul><li>test</li></ul>"""
soup = BeautifulSoup(html, "html.parser")
looking_for = soup.select_one("h2:nth-of-type(2)")
if (
    looking_for.next_sibling.name == "ul"
    or looking_for.next_sibling.next_sibling.name == "ul"
):
    soup.select_one("ul:nth-of-type(2)").decompose()
print(soup.prettify())
Output:
<p>
whatever
</p>
<h2 class="myclass">
<strong>
fruit
</strong>
</h2>
<ul>
<li>
something
</li>
</ul>
<div>
whatever
</div>
<h2 class="myclass">
<strong>
television
</strong>
</h2>
<div>
whatever
</div>
You can use a CSS selector (the adjacent sibling combinator +) and then .extract():
for tag in soup.select('h2.myclass+ul'):
    tag.extract()
If you want to extract every <ul> that follows such an <h2> as a sibling, use the general sibling combinator ~:
for tag in soup.select('h2.myclass~ul'):
    tag.extract()

Targeting the third list item with beautiful soup

I'm scraping a website with Beautiful Soup and am having trouble trying to target an item in a span tag nested within an li tag. The website I'm trying to scrape is using the same classes for each list item, which is making it harder. The HTML looks something like this:
<div class="bigger-container">
<div class="smaller-container">
<ul class="ulclass">
<li>
<span class="description"></span>
<span class="item"></span>
</li>
<li>
<span class="description"></span>
<span class="item"></span>
</li>
<li>
<span class="description"></span>
<span class="item">**This is the only tag I want to scrape**</span>
</li>
<li>
<span class="description"></span>
<span class="item"></span>
</li>
</ul>
My first thought was to try to target it using nth-of-type() - I found a similar question here but it hasn't helped. I've tried playing with it for a while now, but my code basically looks like this:
import requests
from bs4 import BeautifulSoup
url = "url of website I'm scraping"
headers = {User-Agent Header}
for page in range(1):
    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.content, features="lxml")
    scrape = soup.find_all('div', class_='even_bigger_container_not_included_in_html_above')
    for item in scrape:
        condition = soup.find('li:nth-of-type(2)', 'span:nth-of-type(1)').text
        print(condition)
Any help is greatly appreciated!
To use a CSS Selector, use the select() method, not find().
So to get the third <li>, use li:nth-of-type(3) as a CSS Selector:
from bs4 import BeautifulSoup
html = """<div class="bigger-container">
<div class="smaller-container">
<ul class="ulclass">
<li>
<span class="description"></span>
<span class="item"></span>
</li>
<li>
<span class="description"></span>
<span class="item"></span>
</li>
<li>
<span class="description"></span>
<span class="item">**This is the only tag I want to scrape**</span>
</li>
<li>
<span class="description"></span>
<span class="item"></span>
</li>
</ul>"""
soup = BeautifulSoup(html, "html.parser")
print(soup.select_one("li:nth-of-type(3)").get_text(strip=True))
Output:
**This is the only tag I want to scrape**
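Since the goal is specifically the <span class="item"> inside the third <li>, the positional selector can be combined with the span's class to target it directly. A sketch on a trimmed version of the question's HTML:

```python
from bs4 import BeautifulSoup

html = """<ul class="ulclass">
<li><span class="description"></span><span class="item"></span></li>
<li><span class="description"></span><span class="item"></span></li>
<li><span class="description"></span><span class="item">target</span></li>
<li><span class="description"></span><span class="item"></span></li>
</ul>"""

soup = BeautifulSoup(html, "html.parser")

# Descend from the third <li> straight to its <span class="item">.
condition = soup.select_one("li:nth-of-type(3) span.item").get_text(strip=True)
print(condition)
```

This avoids picking up the empty <span class="description"> that shares each <li>.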

Getting repeats in beautifulsoup nested tags

I'm trying to parse HTML using BeautifulSoup (called with lxml).
On nested tags I'm getting repeated text.
I've tried going through and only counting tags that have no children, but then I'm losing out on data.
given:
<div class="links">
<ul class="links inline">
<li class="comment_forbidden first last">
<span> to post comments</span>
</li>
</ul>
</div>
and running:
soup = BeautifulSoup(file_info, features="lxml")
soup.prettify().encode("utf-8")
for tag in soup.find_all(True):
    if check_text(tag.text):  # False on empty string / all numbers
        print(tag.text)
I get "to post comments" 4 times.
Is there a beautifulsoup way of just getting the result once?
Given an input like
<div class="links">
<ul class="links inline">
<li class="comment_forbidden first last">
<span> to post comments1</span>
</li>
</ul>
</div>
<div class="links">
<ul class="links inline">
<li class="comment_forbidden first last">
<span> to post comments2</span>
</li>
</ul>
</div>
<div class="links">
<ul class="links inline">
<li class="comment_forbidden first last">
<span> to post comments3</span>
</li>
</ul>
</div>
You could do something like
[x.span.string for x in soup.find_all("li", class_="comment_forbidden first last")]
which would give
[' to post comments1', ' to post comments2', ' to post comments3']
find_all() is used to find all the <li> tags with class comment_forbidden first last, and the text of each <li> tag's <span> child is obtained through its string attribute.
For anyone struggling with this, try swapping out the parser. I switched to html5lib and I no longer have repetitions. It is a costlier parser, though, so it may cause performance issues.
soup = BeautifulSoup(html, "html5lib")
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser
You can use find() instead of find_all() to get your desired result only once
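Another way to avoid the repeats, without switching parsers, is to iterate over the string nodes themselves rather than over every enclosing tag: each piece of text then appears exactly once, no matter how deeply it is nested. A sketch (using html.parser; the asker's lxml would behave the same for this input):

```python
from bs4 import BeautifulSoup

html = """<div class="links">
<ul class="links inline">
<li class="comment_forbidden first last">
<span> to post comments</span>
</li>
</ul>
</div>"""

soup = BeautifulSoup(html, "html.parser")

# stripped_strings yields each text node once, with surrounding
# whitespace removed, so nested tags cause no duplication.
texts = list(soup.stripped_strings)
print(texts)
```

This sidesteps the root cause of the duplication: tag.text on a <div> includes the text of every descendant, so the same string is reported again for the <ul>, the <li>, and the <span>.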

How to scrape tags that appear within a script

My intention is to scrape the names of the top-selling products on Ali-Express.
I'm using the Requests library alongside Beautiful Soup to accomplish this.
# Remember to import BeautifulSoup, requests and pprint
url = "https://bestselling.aliexpress.com/en?spm=2114.11010108.21.3.qyEJ5m"
soup = bs(req.get(url).text, 'html.parser')
#pp.pprint(soup) Verify that the page has been found
all_items = soup.find_all('li',class_= 'top10-item')
pp.pprint(all_items)
# []
However, this returns an empty list, indicating that soup.find_all() did not find any tags fitting that criterion.
Inspect Element in Chrome displays the list items as expected. [screenshot omitted]
However, the page source shows that the <ul class="top10-items"> element contains a script, which seems to iterate through each list item (I'm not familiar with HTML).
<div class="container">
<div class="top10-header"><span class="title">TOP SELLING</span> <span class="sub-title">This week's most popular products</span></div>
<ul class="top10-items loading" id="bestselling-top10">
</ul>
<script class="X-template-top10" type="text/mustache-template">
{{#topList}}
<li class="top10-item">
<div class="rank-orders">
<span class="rank">{{rank}}</span>
<span class="orders">{{productOrderNum}}</span>
</div>
<div class="img-wrap">
<a href="{{productDetailUrl}}" target="_blank">
<img src="{{productImgUrl}}" alt="{{productName}}">
</a>
</div>
<a class="item-desc" href="{{productDetailUrl}}" target="_blank">{{productName}}</a>
<p class="item-price">
<span class="price">US ${{productMinPrice}}</span>
<span class="uint">/ {{productUnitType}}</span>
</p>
</li>
{{/topList}}</script>
</div>
</div>
So this probably explains why soup.find_all() doesn't find the "li" tag.
My question is: How can I extract the item names from the script using Beautiful soup?
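No answer was recorded for this question, but as a hedged sketch: the <script class="X-template-top10"> holds only a Mustache template, so placeholders such as {{productName}} are in the page source while the real product data is fetched by the page's JavaScript after load. BeautifulSoup can still locate the template itself, which at least confirms what is, and is not, in the raw HTML:

```python
from bs4 import BeautifulSoup

html = """<div class="container">
<ul class="top10-items loading" id="bestselling-top10"></ul>
<script class="X-template-top10" type="text/mustache-template">
{{#topList}}
<li class="top10-item"><a class="item-desc" href="{{productDetailUrl}}">{{productName}}</a></li>
{{/topList}}
</script>
</div>"""

soup = BeautifulSoup(html, "html.parser")

# The <script> tag is found like any other tag; its .string is the raw
# template text, containing placeholders rather than product names.
template = soup.find("script", class_="X-template-top10").string
print("{{productName}}" in template)
```

To get the actual names you would need to find the request the page's JavaScript makes (visible in the browser's Network tab) and call that endpoint directly, or render the page with a browser-automation tool such as Selenium; requests plus BeautifulSoup alone cannot execute the script.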

Python : Extract HTML content

Is there any way to get the "Data to be extracted" content from the following HTML, using BeautifulSoup or any other library?
<div>
<ul class="main class">
<li>
<p class="class_label">User Name</p>
<p>"Data to be extracted"</p>
</li>
</ul>
</div>
Thanks in advance for any help !! :)
There are certainly multiple options. For starters, you can find the p element with class="class_label" and get the next p sibling:
from bs4 import BeautifulSoup
data = """
<div>
<ul class="main class">
<li>
<p class="class_label">User Name</p>
<p>"Data to be extracted"</p>
</li>
</ul>
</div>
"""
soup = BeautifulSoup(data, 'html.parser')
print(soup.find('p', class_='class_label').find_next_sibling('p').text)
Or, using a CSS selector:
soup.select('div ul.main li p.class_label + p')[0].text
Or, relying on the User Name text:
soup.find(text='User Name').parent.find_next_sibling('p').text
Or, relying on the p element's position inside the li tag:
soup.select('div ul.main li p')[1].text