So I'm attempting to clean some HTML. I've got the following function:
def clean_html(self, html):
    replaced_html = html.decode('utf-8').replace('<', ' <')
    tree = etree.HTML(replaced_html)
    etree.strip_elements(tree, 'script', 'style', 'img', 'noscript', 'svg')
    for el in tree.xpath('//*[@style]'):
        el.attrib.pop('style')
    for el in tree.xpath('//*[@class]'):
        el.attrib.pop('class')
    for el in tree.xpath('//*[@id]'):
        el.attrib.pop('id')
    etree.strip_tags(tree, etree.Comment)
    return etree.tostring(tree, encoding='unicode', method='html')
I also want to remove all data-* attributes, e.g.:
<li data-direction="ltr" data-listposition="center" data-data-id="dataItem-ifz7cqbs" data-state="menu idle link notMobile">sky</li>
But the attributes are unknown to me (above is just an example).
So I want to transform the above into just <li>sky</li>, and this should run on every element on the page.
In my code above I'm able to remove simple things like id, class but I'm not sure how to handle the dynamic attributes data-*. Possibly regex?
EDIT
I should clarify a bit about the input. My example above shows the use of <li> tags. But the actual input is the entire html of a page so it would be something like:
<html>
<ul>
<li data-i="sdfdsf">something</li>
<li data-i="dsfd">something</li>
</ul>
<p data-para="cvcv">content</p>
<div data-image-info='{"imageData":{"type":"Image","id":"dataItem-ifp35za1","metaData":{"pageId":"masterPage","isPreset":false,"schemaVersion":"2.0","isHidden":false},"title":"Black LinkedIn Icon","uri":"6ea5b4a88f0b4f91945b40499aa0af00.png","width":200,"height":200,"alt":"Black LinkedIn Icon","link":{"type":"ExternalLink","id":"dataItem-ig84dp5v","metaData":{"pageId":"masterPage","isPreset":false,"schemaVersion":"1.0","isHidden":false},"url":"https://www.linkedin.com/in/beth-liu-aba2b487?trk=hp-identity-name","target":"_blank"}},"displayMode":"fill"}' > </div> </a> </li> <li> <a href="https://www.pinterest.com/agencyb/" target="_blank" > <div data-image-info='{"imageData":{"type":"Image","id":"dataItem-ijxtrrjj","metaData":{"pageId":"masterPage","isPreset":false,"schemaVersion":"2.0","isHidden":false},"title":"Black Pinterest Icon","uri":"8f6f59264a094af0b46e9f6c77dff83e.png","width":200,"height":200,"alt":"Black Pinterest Icon","link":{"type":"ExternalLink","id":"dataItem-ikg674xm","metaData":{"pageId":"masterPage","isPreset":false,"schemaVersion":"1.0","isHidden":false},"url":"https://www.pinterest.com/agencyb/","target":"_blank"}},"displayMode":"fill"}' > </div> </a> </li> <li> <a href="http://www.twitter.com/lubecka" target="_blank" > <div data-image-info='{"imageData":{"type":"Image","id":"dataItem-ifp3554u","metaData":{"pageId":"masterPage","isPreset":false,"schemaVersion":"2.0","isHidden":false},"title":"Black Twitter Icon","uri":"c7d035ba85f6486680c2facedecdcf4d.png","description":"","width":200,"height":200,"alt":"Black Twitter Icon","link":{"type":"ExternalLink","id":"dataItem-ifp3554u1","metaData":{"pageId":"masterPage","isPreset":false,"schemaVersion":"1.0","isHidden":false},"url":"http://www.twitter.com/lubecka","target":"_blank"}},"displayMode":"fill"}' > </div> </a> </li> <li> <a href="https://www.instagram.com/" target="_blank">
<html>
Assuming that the names of the "data attributes" always start with "data-", you can remove them like this:
for el in tree.xpath("//*"):
    for attr in el.attrib:
        if attr.startswith("data-"):
            el.attrib.pop(attr)
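As a self-contained illustration of the loop above, here is a minimal sketch that parses a small fragment, removes every attribute whose name starts with "data-", and serializes the result. The function name strip_data_attributes and the sample markup are made up for this example; the attribute names are copied into a list before removal purely as a defensive habit.

```python
from lxml import etree

def strip_data_attributes(html_text):
    """Return html_text re-serialized with every data-* attribute removed."""
    tree = etree.HTML(html_text)
    for el in tree.iter():
        # Comments and processing instructions have a non-string tag; skip them.
        if not isinstance(el.tag, str):
            continue
        # Copy the attribute names first so removal never disturbs iteration.
        for name in list(el.attrib):
            if name.startswith("data-"):
                del el.attrib[name]
    return etree.tostring(tree, encoding="unicode", method="html")

# Hypothetical sample input, including a single-quoted JSON value.
sample = (
    '<ul><li data-direction="ltr" data-state="menu idle">sky</li></ul>'
    '<div data-image-info=\'{"a": 1}\' title="t"></div>'
)
cleaned = strip_data_attributes(sample)
print(cleaned)
```

Because the parser handles quoting, this works the same for double-quoted, single-quoted, and JSON-valued attributes. Note that etree.HTML wraps a fragment in html and body elements.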
You can clear the attributes like this:
import re
def strip_attribute(data):
    p = re.compile('data-[^=]*="[^"]*"')
    print(p)
    return p.sub('', data)
print(strip_attribute('with attribute'))
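The pattern above only matches double-quoted values, while the input in the question also contains single-quoted ones (data-image-info='...'). The following is a rough sketch of a broader pattern, not a robust HTML cleaner: the names DATA_ATTR and strip_data_attributes_regex are invented here, and the regex will also remove text that merely looks like a data-* attribute outside of a tag.

```python
import re

# Whitespace, a data-* name, then an optional value that is
# double-quoted, single-quoted, or unquoted.
DATA_ATTR = re.compile(
    r"""\s+data-[\w:.-]+(?:\s*=\s*(?:"[^"]*"|'[^']*'|[^\s>]+))?"""
)

def strip_data_attributes_regex(html_text):
    """Remove data-* attributes from html_text using a regular expression."""
    return DATA_ATTR.sub("", html_text)

# Hypothetical sample input mixing both quoting styles.
sample = """<li data-i="a" data-info='{"k":"v"}' class="x">sky</li>"""
result = strip_data_attributes_regex(sample)
print(result)  # <li class="x">sky</li>
```

For whole pages, a parser-based approach is usually the safer choice, since a regex cannot tell markup apart from text that happens to resemble it.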
Maybe this is what you're looking for:
from lxml import etree
code = """
<html>
<ul>
<li data-i="sdfdsf">something</li>
<li data-i="dsfd">something</li>
</ul>
<p data-para="cvcv">content</p>
</html>
"""
xml = etree.XML(code)
elements = list(xml.iter())
for element in elements:
    if len(element.text.strip())>0:
        print('<'+element.tag+'>'+element.text+'</'+element.tag+'>')
Output:
<li>something</li>
<li>something</li>
<p>content</p>
Related
I'm trying to parse through html using beautifulsoup (being called with lxml).
On nested tags I'm getting repeated text
I've tried going through and only counting tags that have no children, but then I'm losing out on data
given:
<div class="links">
<ul class="links inline">
<li class="comment_forbidden first last">
<span> to post comments</span>
</li>
</ul>
</div>
and running:
soup = BeautifulSoup(file_info, features = "lxml")
soup.prettify().encode("utf-8")
for tag in soup.find_all(True):
    if check_text(tag.text): #false on empty string/ all numbers
        print (tag.text)
I get "to post comments" 4 times.
Is there a beautifulsoup way of just getting the result once?
Given an input like
<div class="links">
<ul class="links inline">
<li class="comment_forbidden first last">
<span> to post comments1</span>
</li>
</ul>
</div>
<div class="links">
<ul class="links inline">
<li class="comment_forbidden first last">
<span> to post comments2</span>
</li>
</ul>
</div>
<div class="links">
<ul class="links inline">
<li class="comment_forbidden first last">
<span> to post comments3</span>
</li>
</ul>
</div>
You could do something like
[x.span.string for x in soup.find_all("li", class_="comment_forbidden first last")]
which would give
[' to post comments1', ' to post comments2', ' to post comments3']
find_all() finds every <li> tag whose class attribute is comment_forbidden first last. For each of those tags, .span selects its <span> child and .string returns that child's text.
For anyone struggling with this, try swapping out the parser. I switched to html5lib and I no longer have repetitions. It is a costlier parser though, so may cause performance issues.
soup = BeautifulSoup(html, "html5lib")
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser
You can use find() instead of find_all() to get your desired result only once.
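The repetition in the question arises because tag.text returns the text of a tag and all of its descendants, so every ancestor of the <span> reports the same string. One possible way around it, sketched below on the question's markup, is to visit text nodes rather than tags. The variable names are illustrative, and html.parser is used here only to keep the example free of extra dependencies.

```python
from bs4 import BeautifulSoup

html_doc = """
<div class="links">
  <ul class="links inline">
    <li class="comment_forbidden first last">
      <span> to post comments</span>
    </li>
  </ul>
</div>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# Each text node is visited exactly once, with surrounding whitespace stripped.
texts = list(soup.stripped_strings)
print(texts)

# Alternative: keep the per-tag loop, but read only each tag's own text nodes.
own_text = []
for tag in soup.find_all(True):
    for s in tag.find_all(string=True, recursive=False):
        if s.strip():
            own_text.append((tag.name, s.strip()))
print(own_text)
```

The second form is useful when you also need to know which tag directly contains each piece of text.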
I have some HTML that looks like this (shortened):
<div id="activities" class="ListItems">
<h2>Standards</h2>
<ul>
<li>
<a class="Title" href="http://www.google.com" >Guidelines on management</a>
<div class="Info">
<p>
text
</p>
<p class="Date">Status: Under development</p>
</div>
</li>
</ul>
</div>
<div class="DocList">
<h3>Reports</h3>
<p class="SupLink">+ <a href="http://www.google.com/test" >View More</a></p>
<ul>
<li class="pdf">
<a class="Title" href="document.pdf" target="_blank" >Document</a>
<span class="Size">
[1,542.3KB]
</span>
<div class="Info">
<p>
text <a href="http://www.google.com" >Read more</a>
</p>
<p class="Date">
14/03/2018
</p>
</div>
</li>
</ul>
</div>
I am trying to select the value in 'href=' under 'a class="Title"' by using this code:
def sub_path02(url):
    page = requests.get(url)
    tree = html.fromstring(page.content)
    url2 = []
    for node in tree.xpath('//a[@class="Title"]'):
        url2.append(node.get("href"))
    return url2
But I get two returns, the one under 'div class="DocList"' is also returned.
I am trying to change my xpath expressions so that I would only look within the node but I cannot get it to work.
Could someone please help me understand how to "search" within a specific node. I have gone through multiple xpath documentations but I cannot seem to figure it out.
Using // you are already selecting all the a elements in the document.
To search within a specific div, select that div first with // and then use //a again to match links anywhere inside it:
//div[@class="ListItems"]//a[@class="Title"]
for node in tree.xpath('//div[@class="ListItems"]//a[@class="Title"]'):
    url2.append(node.get("href"))
Try this XPath expression, which selects the div with a specific id and then searches all of its descendants:
'//div[@id="activities"]//a[@class="Title"]'
So:
def sub_path02(url):
    page = requests.get(url)
    tree = html.fromstring(page.content)
    url2 = []
    for node in tree.xpath('//div[@id="activities"]//a[@class="Title"]'):
        url2.append(node.get("href"))
    return url2
Note:
It is always better to select by id than by class, because an id should be unique, whereas a class can be repeated any number of times. (In real life, badly written pages sometimes reuse the same id several times.)
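Another way to "search within a specific node" is to select the node first and then run a relative XPath from it. The sketch below uses a shortened copy of the question's markup; the variable names are illustrative.

```python
from lxml import html

page = """
<html><body>
<div id="activities" class="ListItems">
  <ul><li><a class="Title" href="http://www.google.com">Guidelines on management</a></li></ul>
</div>
<div class="DocList">
  <ul><li class="pdf"><a class="Title" href="document.pdf">Document</a></li></ul>
</div>
</body></html>
"""

tree = html.fromstring(page)
container = tree.xpath('//div[@id="activities"]')[0]

# The leading "." makes the search relative to `container`;
# a bare "//a" would search the whole document again.
hrefs = container.xpath('.//a[@class="Title"]/@href')
print(hrefs)  # ['http://www.google.com']
```

Ending the expression with /@href returns the attribute values directly, so no separate node.get("href") call is needed.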
My intention is to scrape the names of the top-selling products on Ali-Express.
I'm using the Requests library alongside Beautiful Soup to accomplish this.
from bs4 import BeautifulSoup as bs
import requests as req
import pprint as pp
url = "https://bestselling.aliexpress.com/en?spm=2114.11010108.21.3.qyEJ5m"
soup = bs(req.get(url).text, 'html.parser')
#pp.pprint(soup) Verify that the page has been found
all_items = soup.find_all('li',class_= 'top10-item')
pp.pprint(all_items)
# []
However, this returns an empty list, indicating that soup.find_all() did not find any tags matching those criteria.
Inspect Element in Chrome does display the rendered list items.
However, in the page source, the list (ul class="top10-items") comes with a script that seems to iterate through each list item (I'm not familiar with HTML).
<div class="container">
<div class="top10-header"><span class="title">TOP SELLING</span> <span class="sub-title">This week's most popular products</span></div>
<ul class="top10-items loading" id="bestselling-top10">
</ul>
<script class="X-template-top10" type="text/mustache-template">
{{#topList}}
<li class="top10-item">
<div class="rank-orders">
<span class="rank">{{rank}}</span>
<span class="orders">{{productOrderNum}}</span>
</div>
<div class="img-wrap">
<a href="{{productDetailUrl}}" target="_blank">
<img src="{{productImgUrl}}" alt="{{productName}}">
</a>
</div>
<a class="item-desc" href="{{productDetailUrl}}" target="_blank">{{productName}}</a>
<p class="item-price">
<span class="price">US ${{productMinPrice}}</span>
<span class="uint">/ {{productUnitType}}</span>
</p>
</li>
{{/topList}}</script>
</div>
</div>
So this probably explains why soup.find_all() doesn't find the "li" tag.
My question is: How can I extract the item names from the script using Beautiful soup?
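For illustration, here is a minimal sketch of reading that template with Beautiful Soup, using a shortened copy of the markup above. An HTML parser keeps the contents of a script element as one opaque string, so the template has to be pulled out and parsed a second time. Note that the template holds only placeholders such as {{productName}}; the actual product names are presumably filled in by JavaScript from data that is not in this HTML (that part is an assumption about the site).

```python
from bs4 import BeautifulSoup

page_source = """
<ul class="top10-items loading" id="bestselling-top10"></ul>
<script class="X-template-top10" type="text/mustache-template">
{{#topList}}
<li class="top10-item">
  <a class="item-desc" href="{{productDetailUrl}}" target="_blank">{{productName}}</a>
</li>
{{/topList}}</script>
"""

soup = BeautifulSoup(page_source, "html.parser")

# Nothing is found: the <li> only exists as text inside the <script>.
items = soup.find_all("li", class_="top10-item")
print(items)  # []

# Pull out the template text and parse it as HTML in its own right.
template_text = soup.find("script", class_="X-template-top10").string
template = BeautifulSoup(template_text, "html.parser")
names = [a.get_text() for a in template.find_all("a", class_="item-desc")]
print(names)  # ['{{productName}}']
```

Since only the placeholder comes back, getting real names would require either the data the page loads separately or a tool that executes the page's JavaScript.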
I am trying to extract all content (tags and text) from one main tag on an HTML page. For example:
my_html_page = '''
<html>
<body>
<div class="post_body">
<span class="polor">
<a class="p-color">Some text</a>
<a class="p-color">another text</a>
</span>
<a class="p-color">hello world</a>
<p id="bold">
some text inside p
<ul>
<li class="list">one li</li>
<li>second li</li>
</ul>
</p>
some text 2
<div>
text inside div
</div>
some text 3
</div>
<div class="post_body">
<a>text inside second main div</a>
</div>
<div class="post_body">
<span>third div</span>
</div>
<div class="post_body">
<p>four div</p>
</div>
<div class="post">
other text
</div>
</body>
</html>'''
And I need to get the following using xpath('(//div[@class="post_body"])[1]'):
<div class="post_body">
<span class="polor">
<a class="p-color">Some text</a>
<a class="p-color">another text</a>
</span>
<a class="p-color">hello world</a>
<p id="bold">
some text inside p
<ul>
<li class="list">one li</li>
<li>second li</li>
</ul>
</p>
some text 2
<div>
text inside div
</div>
some text 3
</div>
That is, everything inside the <div class="post_body"> tag.
I read this topic, but it did not help.
I need to build the DOM with lxml's BeautifulSoup parser.
import lxml.html.soupparser
import lxml.html
text_inside_tag = lxml.html.soupparser.fromstring(my_html_page)
text = text_inside_tag.xpath('(//div[@class="post_body"])[1]/text()')
This way I can extract only the text inside the tag, but I need to extract the text together with the tags.
If I try this:
for elem in text_inside_tag.xpath('(//div[@class="post_body"])[1]/text()'):
    print lxml.html.tostring(elem, pretty_print=True)
I get an error: TypeError: Type '_ElementStringResult' cannot be serialized.
Help, please.
You can try it this way:
import lxml.html.soupparser
import lxml.html
my_html_page = '''...some html markup here...'''
root = lxml.html.soupparser.fromstring(my_html_page)
for elem in root.xpath("//div[@class='post_body']"):
    # Leading text of the <div>, then each child serialized with its tags.
    result = (elem.text or '') + ''.join(
        lxml.html.tostring(e, pretty_print=True, encoding='unicode') for e in elem)
    print(result)
The result variable is built by combining the text at the start of the parent <div> with the markup of each of its child nodes (tostring() also includes the text that follows each child).
Is there any way to get the "Data to be extracted" content from the following HTML, using BeautifulSoup or any other library?
<div>
<ul class="main class">
<li>
<p class="class_label">User Name</p>
<p>"Data to be extracted"</p>
</li>
</ul>
</div>
Thanks in advance for any help !! :)
There are certainly multiple options. For starters, you can find the p element with class="class_label" and get the next p sibling:
from bs4 import BeautifulSoup
data = """
<div>
<ul class="main class">
<li>
<p class="class_label">User Name</p>
<p>"Data to be extracted"</p>
</li>
</ul>
</div>
"""
soup = BeautifulSoup(data, "html.parser")
print(soup.find('p', class_='class_label').find_next_sibling('p').text)
Or, using a CSS selector:
soup.select('div ul.main li p.class_label + p')[0].text
Or, relying on the User Name text:
soup.find(text='User Name').parent.find_next_sibling('p').text
Or, relying on the p element's position inside the li tag:
soup.select('div ul.main li p')[1].text