Python 2.7 using lxml
I have some annoyingly formed html that looks like this:
<td>
<b>"John"
</b>
<br>
"123 Main st.
"
<br>
"New York
"
<b>
"Sally"
</b>
<br>
"101 California St.
"
<br>
"San Francisco
"
</td>
So basically it's a single td with a ton of stuff in it. I'm trying to compile a list or dict of the names and their addresses.
So far what I've done is gotten a list of nodes with names using tree.xpath('//td/b'). So let's assume I'm currently on the b node for John.
I'm trying to get whatever.xpath('string()') for everything following the current node but preceding the next b node (Sally). I've tried a bunch of different xpath queries but can't seem to get this right. In particular, any time I use an and operator in an expression that has no [] brackets, it returns a bool rather than a list of all nodes meeting the conditions. Can anyone help out?
This should work:
from lxml import etree

p = etree.HTMLParser()
html = open(r'./test.html', 'r')
data = html.read()
tree = etree.fromstring(data, p)

my_dict = {}
for b in tree.iter('b'):
    # the street lives in the tail of the <br> right after each <b>
    br = b.getnext().tail.replace('\n', '')
    my_dict[b.text.replace('\n', '')] = br

print my_dict
This code prints:
{'"John"': '"123 Main st."', '"Sally"': '"101 California St."'}
(You may want to strip the quotation marks out!)
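For example, a small optional tweak along those lines (a sketch, not part of the original code):

name = b.text.replace('\n', '').strip('"')  # drop newlines, then the surrounding double quotes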
Rather than using XPath, you could use one of lxml's parsers to navigate the HTML easily. The parser turns the HTML document into an "etree", which you can navigate with the methods it provides. Trees and elements have an iter() method that takes a tag name and yields every element in the tree with that name. In your case, you can use it to obtain all of the <b> elements, then manually navigate to the following <br> element and retrieve its tail text, which contains the information you need. You can find more about this under the "Elements contain text" header of the lxml.etree tutorial.
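Note that the code above only captures the street (the tail of the first <br>). If you also want the city, a hedged sketch of the same idea that walks the siblings up to the next <b>:

def text_between(b):
    # collect the text following this <b> up to the next <b>
    parts = []
    for sib in b.itersiblings():
        if sib.tag == 'b':  # stop at the next name
            break
        if sib.tail:  # the text after a <br> lives in its .tail
            parts.append(sib.tail.replace('\n', ''))
    return ' '.join(parts)  # e.g. '"123 Main st." "New York"' for John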
Why not use the getchildren() function on each td? For example:
from lxml import html
s = """
<td>
<b>"John"
</b>
<br>
"123 Main st.
"
<br>
"New York
"
<b>
"Sally"
</b>
<br>
"101 California St.
"
<br>
"San Francisco
"
</td>
"""
records = []
cur_record = -1
cur_field = 1

FIELD_NAME = 0
FIELD_STREET = 1
FIELD_CITY = 2

doc = html.fromstring(s)
td = doc.xpath('//td')[0]
for child in td.getchildren():
    if child.tag == 'b':
        # a <b> starts a new record; its text is the name
        cur_record += 1
        record = dict()
        record['name'] = child.text.strip()
        records.append(record)
        cur_field = 1
    elif child.tag == 'br':
        # the street and city are in the tails of the two <br> tags
        if cur_field == FIELD_STREET:
            records[cur_record]['street'] = child.tail.strip()
            cur_field += 1
        elif cur_field == FIELD_CITY:
            records[cur_record]['city'] = child.tail.strip()
And the results are:
records = [
    {'city': '"New York\n"', 'name': '"John"', 'street': '"123 Main st.\n"'},
    {'city': '"San Francisco\n"', 'name': '"Sally"', 'street': '"101 California St.\n"'}
]
Note that you should use tag.tail, not tag.text, if you want the text that follows a void (self-closing) HTML tag such as <br>.
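A quick standalone illustration of the difference:

from lxml import html
frag = html.fromstring('<div>name<br>street address</div>')
br = frag.find('br')
print br.text  # None - <br> has no inner text
print br.tail  # 'street address' - the text after the tag lives in .tail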
Hope this helps.
Related
So, I am trying to scrape data from a journal.
While I can successfully scrape titles of papers, keywords, and so on, and save them in the dataframe properly, when it comes to collecting authors' names, each author of a paper after the first one is stored in a new row. The same problem applies to affiliations.
Obviously, this makes the stored data unrelated and useless: instead of having the same number of rows in each column, I get stuck with a useless dataframe.
It is my understanding that the problem arises because the program doesn’t “know” to store all the data associated with each paper in separate rows. Additionally, some papers only have one author, while others have 3-4. For example, authors need to be stored in a "NameSurname, NameSurname, NameSurname..." format within separate rows containing information about each research paper: authors, affiliations, etc.
But when it comes to specifying classes that I intend to scrape, I am uncertain how to set up the Python (BS4) code properly.
Here's a snippet of the relevant code from the simple scraper:
title = []
authors = []
affiliations = []

for i in urls:
    page = requests.get(i)
    content = page.text
    soup = BeautifulSoup(content, "html.parser")
    for t in soup.select(".obj_article_details .page_title"):
        title.append(t.get_text(strip=True))
    for au in soup.select(".obj_article_details .authors .name"):
        authors.append(au.get_text(strip=True))
    for af in soup.select(".obj_article_details .item.authors .affiliation"):
        affiliations.append(af.get_text(strip=True))
    time.sleep(3)
Also, here is the structure of the section I intend to scrape:
...
<article class="obj_article_details">
<h1 class="page_title">
Lorem ipsum dolor sit amet
</h1>
<div class="row">
<div class="main_entry">
<section class="item authors">
<ul class="authors">
<li>
<span class="name">Brandon Scott </span>
<span class="affiliation"> Villanova University, Pennsylvania </span>
</li>
<li>
<span class="name">Alvaro Cote </span>
<span class="affiliation">Carleton College, Minnesota</span>
</li>
</ul>
</section>
...
What I am getting now:
|Authors       | Affiliation                        |
+--------------+------------------------------------+
|Brandon Scott | Villanova University, Pennsylvania |
+--------------+------------------------------------+
|Alvaro Cote   | Carleton College, Minnesota        |
+--------------+------------------------------------+
|...           | ...                                |
What I want:
|Authors                    | Affiliation           |
+---------------------------+-----------------------+
|Brandon Scott, Alvaro Cote | Villanova University..|
+---------------------------+-----------------------+
|...                        |...                    |
+---------------------------+-----------------------+
|...                        |...                    |
For cases like this, you should use nested loops - an outer loop for the containers ResultSet (soup.select('article.obj_article_details') here), and the inner loop/s for the details you want - title/author/affiliation/etc. And it's also better to build a dictionary of the details for each container and add it to a list of dictionaries than to try to bind together separate lists (you've already faced some of the issues that are caused by that approach).
Since you're doing the same thing for each detail (select followed by get_text), it would be more convenient to move those operations to a function like
def getText_bySelector(tagSoup, selector, sep=None):
    # select all matches, or fall back to the tag itself if no selector is given
    selTags = tagSoup.select(selector) if selector else [tagSoup]
    if type(sep) == str:
        # join the text of all matches into a single string
        return sep.join([s.get_text(' ').strip() for s in selTags])
    # otherwise return only the first match (or None if there are no matches)
    return selTags[0].get_text(' ').strip() if selTags else None
(This is a variation of this function, which I use in most of my bs4 projects.)
If you pass a string (like ", " / "; " / etc.) as sep, it will join all the results with it (or return an empty string "" if there are no results); otherwise, it will return the first result (or None if there are no results).
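For example (hypothetical usage, with a being one article container parsed from the HTML above):

a = soup.select_one('article.obj_article_details')
getText_bySelector(a, '.authors .name', ', ')  # -> 'Brandon Scott, Alvaro Cote'
getText_bySelector(a, '.page_title')           # -> 'Lorem ipsum dolor sit amet' (first match only)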
Another reason I like using functions like this is that they allow me to use a list comprehension instead of the innermost for loop.
Then, you just need to define a reference dictionary with the arguments you'll need to pass to getText_bySelector:
refDict = {
    'title': ('.page_title', None),
    'authors': ('.authors .name', ', '),
    'affiliations': ('.item.authors .affiliation', '; ')
}
Now you can build a list of dictionaries with:
dictList = []
for i in urls:
    page = requests.get(i)
    content = page.text
    soup = BeautifulSoup(content, "html.parser")
    dictList += [{
        k: getText_bySelector(a, vsel, vsep)
        for k, (vsel, vsep) in refDict.items()
    } for a in soup.select('article.obj_article_details')]
The items in dictList will look like
{
    'title': 'Lorem ipsum dolor sit amet',
    'authors': 'Brandon Scott, Alvaro Cote',
    'affiliations': 'Villanova University, Pennsylvania; Carleton College, Minnesota'
}
and you can easily use pandas to view dictList as a table.
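For example, a minimal sketch:

import pandas as pd
df = pd.DataFrame(dictList)
print(df)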
EDIT [PART 1]: Without a function, you'd just have to do the same operations in an inner for loop:
dictList = []
for i in urls:
    page = requests.get(i)
    content = page.text
    soup = BeautifulSoup(content, "html.parser")
    for a in soup.select('article.obj_article_details'):
        dets = {}
        for k, (sel, sep) in refDict.items():
            selTags = a.select(sel) if sel else [a]
            if type(sep) == str:
                dets[k] = sep.join([s.get_text(' ').strip() for s in selTags])
            else:
                dets[k] = selTags[0].get_text(' ').strip() if selTags else None
        dictList.append(dets)
EDIT [PART 2]: If you must have separate lists:
title = []
authors = []
affiliations = []

for i in urls:
    page = requests.get(i)
    content = page.text
    soup = BeautifulSoup(content, "html.parser")
    for a in soup.select('article.obj_article_details'):
        titleA = a.select_one('.page_title')
        if titleA: titleA = titleA.get_text(' ').strip()
        title.append(titleA)

        authorsA = a.select('.authors .name')
        # authors.append(', '.join([aa.get_text(' ').strip() for aa in authorsA]))
        listAuth = []
        for aa in authorsA: listAuth.append(aa.get_text(' ').strip())
        authors.append(', '.join(listAuth))
        # authors.append(listAuth) # if you want a list instead of a string

        affA = a.select('.item.authors .affiliation')
        # affiliations.append('; '.join([aa.get_text(' ').strip() for aa in affA]))
        listAff = []
        for aa in affA: listAff.append(aa.get_text(' ').strip())
        affiliations.append('; '.join(listAff))
        # affiliations.append(listAff) # if you want a list instead of a string
the DataFrame arguments would be a little different this time:
[I copied the html twice to create multiple rows and added a1/a2 to differentiate the copies.]
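Something like this sketch, built from the separate lists above:

import pandas as pd
df = pd.DataFrame({'title': title, 'authors': authors, 'affiliations': affiliations})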
Even that can be shortened with an inner for loop and list-comprehension
refDict = {
    'title': '.page_title', 'authors': '.authors .name',
    'affiliations': '.item.authors .affiliation'
}
listsDict = {k: [] for k in refDict}

for i in urls:
    page = requests.get(i)
    content = page.text
    soup = BeautifulSoup(content, "html.parser")
    for a in soup.select('article.obj_article_details'):
        for k in refDict:
            kvals = [t.get_text(' ').strip() for t in (
                a.select(refDict[k]) if refDict[k] else [a]
            )]
            listsDict[k].append('; '.join(kvals))
            # listsDict[k].append(kvals[0] if len(kvals) == 1 else kvals)
refDict was simplified, so you can't have different separators for different columns.
By the way, if you want multiple authors/affiliations as lists rather than joining them as strings, you can remove the listsDict[k].append('; '.join(kvals)) line and uncomment the next line:
listsDict[k].append(kvals[0] if len(kvals) == 1 else kvals)
Btw, with this last method, if there is more than one .page_title in a container, all of them will be included, whereas with all my other methods only the first title from each container would have been included. (I assumed that there would always be only one title per container.)
The important thing is that the title/authors/affiliations lists are appended to the same number of times for each container - that's why you need to separate the containers and append a fixed number of times from each one.
I want to scrape separate pieces of content: the text in the 'a' tag (i.e. only the name, "42mm Architecture"), and 'Scope of services, Types of Built Projects, Locations of Built Projects, Style of work, Website' as CSV file headers with their content, for the whole webpage.
The elements have no class or ID associated with them, so I am kind of stuck on how to extract those details properly; there are also those 'br' and 'b' tags in between.
There are multiple 'p' tags before and after the block of code provided. Here is the website.
<h2>
<a href="http://www.dezeen.com/tag/design-by-42mm-architecture" rel="noopener noreferrer" target="_blank">
42mm Architecture
</a>
|
<span style="color: #808080;">
Delhi | Top Architecture Firms/ Architects in India
</span>
</h2>
<!-- /wp:paragraph -->
<p>
<b>
Scope of services:
</b>
Architecture, Interiors, Urban Design.
<br/>
<b>
Types of Built Projects:
</b>
Residential, commercial, hospitality, offices, retail, healthcare, housing, Institutional
<br/>
<b>
Locations of Built Projects:
</b>
New Delhi and nearby states
<b>
<br/>
</b>
<b>
Style of work
</b>
<span style="font-weight: 400;">
: Contemporary
</span>
<br/>
<b>
Website
</b>
<span style="font-weight: 400;">
:
<a href="https://www.42mm.co.in/">
42mm.co.in
</a>
</span>
</p>
So how is it done using BeautifulSoup4?
This one was a bit of a time-consuming one! The webpage is not complete and has few tags and identifiers. On top of that, they haven't even spell-checked the content - e.g. one place has the heading Scope of Services while another has Scope of services, and there are many more like that! So what I have done is a crude extraction, and I'm sure it will help you, especially if you also plan to paginate.
import requests
from bs4 import BeautifulSoup
import csv

page = requests.get('https://www.re-thinkingthefuture.com/top-architects/top-architecture-firms-in-india-part-1/')
soup = BeautifulSoup(page.text, 'lxml')

# there are many h2 tags but we want the one without any class name
h2 = soup.find_all('h2', class_='')

headers = []
contents = []
header_len = []
a_tags = []

for i in h2:
    if i.find_next().name == 'a':  # to make sure we do not grab the wrong tag
        a_tags.append(i.find_next().text)
        p = i.find_next_sibling()
        contents.append(p.text)
        h = [j.text for j in p.find_all('strong')]  # some headings were bold in the website
        headers.append(h)
        header_len.append(len(h))

# since only some headings were in bold the max number of bold would give all headers
headers = headers[header_len.index(max(header_len))]
# removing the : from headings
headers = [i[:len(i)-1] for i in headers]
# inserted a new heading
headers.insert(0, 'Firm')

# n for traversing through headers list
# k for traversing through a_tags list
n = 1
k = 0

# this is the difficult part where the content will have all the details in one value including the heading like this
"""
Scope of services: Architecture, Interiors, Urban Design.Types of Built Projects: Residential, commercial, hospitality, offices, retail, healthcare, housing, InstitutionalLocations of Built Projects: New Delhi and nearby statesStyle of work: ContemporaryWebsite: 42mm.co.in
"""
# thus I am splitting it using the ':' and then splicing it from the start of each heading
contents = [i.split(':') for i in contents]

for i in contents:
    for j in i:
        h = headers[n][:5]
        if i.index(j) == 0:
            i[i.index(j)] = a_tags[k]
            n += 1
            k += 1
        elif h in j:
            i[i.index(j)] = j[:j.index(h)]
            j = j[:j.index(h)]
            if n < len(headers) - 1:
                n += 1
    n = 1
    # merging those extra values in the list if any
    if len(i) == 7:
        i[3] = i[3] + ' ' + i[4]
        i.remove(i[4])

# writing into csv file
# if you don't want a line space between each row then add newline = '' argument in the open function below
with open('output.csv', 'w') as f:
    writer = csv.writer(f)
    writer.writerow(headers)
    writer.writerows(contents)
This was the output:
If you want to paginate then just add the page number to the end of the url and you'll be good!
page_num = 1
while page_num < 13:
    page = requests.get(f'https://www.re-thinkingthefuture.com/top-architects/top-architecture-firms-in-india-part-1/{page_num}/')
    # paste the above code starting from soup = BeautifulSoup(page.text, 'lxml')
    page_num += 1
Hope this helps, let me know if there's any error.
EDIT 1:
Sorry, I forgot to mention the most important part: if a tag has no class name, you can still get it with what I used in the code above:
h2 = soup.find_all('h2', class_= '')
This just says: give me all the h2 tags that do not have a class name. The absence of a class can itself sometimes be a unique identifier, and here we use that no-class value to identify the tags we want.
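A quick standalone check of that filter (relying on the behaviour described above):

from bs4 import BeautifulSoup
s = BeautifulSoup('<h2 class="x">skip</h2><h2>keep</h2>', 'html.parser')
print(s.find_all('h2', class_=''))  # [<h2>keep</h2>] - only the class-less h2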
You can use this example as a basis for how to scrape the information from that page:
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "https://www.gov.uk/government/publications/endorsing-bodies-start-up/start-up"

soup = BeautifulSoup(requests.get(url).content, "html.parser")
parent = soup.select_one("div.govspeak")

mapping = {"sector": "sectors", "endorses businesses": "endorses businesses in"}

all_data = []
for h3 in parent.select("h3"):
    name = h3.text
    link = h3.a["href"] if h3.a else "-"
    ul = h3.find_next("ul")
    if ul and ul.find_previous("h3") == h3 and ul.parent == parent:
        li = [
            list(map(lambda x: mapping.get((i := x.strip()), i), v))
            for li in ul.select("li")
            if len(v := li.get_text(strip=True).split(":")) == 2
        ]
    else:
        li = []
    all_data.append({"name": name, "link": link, **dict(li)})

df = pd.DataFrame(all_data)
print(df)
df.to_csv("data.csv", index=False)
Creates data.csv (screenshot from LibreOffice):
I am using xpath to parse an xml file
from lxml import etree
example='''<div n="0001" type="car" xml:id="_3a327f0002">
<p xml:id="_3a327f0003">
1. A car is
<p xml:id="_3a327f0004"> - big, yellow and red;</p>
<p xml:id="_3a327f0005"> - has a big motor;</p>
<p xml:id="_3a327f0006"> - and also has <lb/>
big seats.
</p>
</p>
</div>'''
I want to serialize the above XML file in the following way:
{"_3a327f0003": "1. A car is",
"_3a327f0004":"- big, yellow and red;"
"_3a327f0005":"- has a big motor;"
"_3a327f0006":"- and also has big seats"
Basically, I am extracting the text and building a dictionary where every text belongs to its xml:id. My code is as follows:
parser = etree.XMLParser(resolve_entities=False, strip_cdata=False, recover=True, ns_clean=True)
XML_tree = etree.fromstring(example.encode(), parser=parser)

all_paras = XML_tree.xpath('.//p[@xml:id]')

list_of_paragraphs = []
for para in all_paras:
    mydict = {}
    mydict['text'] = para.text
    for att in para.attrib:
        mykey = att
        if 'id' in mykey:
            mykey = 'xmlid'
        mydict[mykey] = para.attrib[att]
    list_of_paragraphs.append(mydict)

PDM_XML_serializer(example)
It works except for the fact that if I have a node like:
<p xml:id="_3a327f0006"> - and also has <lb/>
big seats.
</p>
It will not extract the part from <lb/> onwards ("big seats.").
How should I modify:
XML_tree.xpath('.//p[@xml:id]')
in order to get all the text from <p> to </p>?
EDIT:
para.itertext() could be used, but then the first node will give back all the text of the other (nested) nodes as well.
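To illustrate (a small sketch against the example document above):

outer = XML_tree.xpath('.//p[@xml:id="_3a327f0003"]')[0]
print(''.join(outer.itertext()))         # includes the nested paragraphs' text too
print(outer.xpath('text()')[0].strip())  # '1. A car is' - direct text nodes only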
Using xml.etree.ElementTree
import xml.etree.ElementTree as ET
xml = '''<div n="0001" type="car" xml:id="_3a327f0002">
<p xml:id="_3a327f0003">
1. A car is
<p xml:id="_3a327f0004"> - big, yellow and red;</p>
<p xml:id="_3a327f0005"> - has a big motor;</p>
<p xml:id="_3a327f0006"> - and also has <lb/>
big seats.
</p>
</p>
</div>'''
def _get_element_txt(element):
    txt = element.text
    children = list(element)
    if children:
        # append the tail of the first child, e.g. the text after <lb/>
        txt += children[0].tail.strip()
    return txt

root = ET.fromstring(xml)
data = {p.attrib['{http://www.w3.org/XML/1998/namespace}id']: _get_element_txt(p)
        for p in root.findall('.//p/p')}

for k, v in data.items():
    print(f'{k} --> {v}')
output
_3a327f0004 --> - big, yellow and red;
_3a327f0005 --> - has a big motor;
_3a327f0006 --> - and also has big seats.
Using lxml.etree, parse all elements in all_paras in a dict comprehension. Since your XML uses the special xml prefix, and lxml does not yet support namespace prefixes in attribute lookups (see @mzjn's answer here), the code below uses a workaround with next + iter to retrieve the attribute value.
Additionally, to retrieve all the text values between nodes, xpath("text()") is used with str.strip and str.join to clean up whitespace and line breaks and concatenate the pieces together.
from lxml import etree
example='''<div n="0001" type="car" xml:id="_3a327f0002">
<p xml:id="_3a327f0003">
1. A car is
<p xml:id="_3a327f0004"> - big, yellow and red;</p>
<p xml:id="_3a327f0005"> - has a big motor;</p>
<p xml:id="_3a327f0006"> - and also has <lb/>
big seats.
</p>
</p>
</div>'''
XML_tree = etree.fromstring(example)

all_paras = XML_tree.xpath('.//p[@xml:id]')

output = {
    next(iter(t.attrib.values())): " ".join(i.strip() for i in t.xpath("text()")).strip()
    for t in all_paras
}
output
# {
# '_3a327f0003': '1. A car is',
# '_3a327f0004': '- big, yellow and red;',
# '_3a327f0005': '- has a big motor;',
# '_3a327f0006': '- and also has big seats.'
# }
You could use lxml's itertext() to get the text content of the p element:
mydict['text'] = ''.join(para.itertext())
See this question as well for a more generic solution.
This modifies the xpath to exclude the "A car is" text as per your example. It also uses the xpath functions string and normalize-space to evaluate the para node as a string and join its text nodes, as well as clean up the text to match your example.
from lxml import etree
example='''<div n="0001" type="car" xml:id="_3a327f0002">
<p xml:id="_3a327f0003">
1. A car is
<p xml:id="_3a327f0004"> - big, yellow and red;</p>
<p xml:id="_3a327f0005"> - has a big motor;</p>
<p xml:id="_3a327f0006"> - and also has <lb/>
big seats.
</p>
</p>
</div>'''
parser = etree.XMLParser(resolve_entities=False, strip_cdata=False, recover=True, ns_clean=True)
XML_tree = etree.fromstring(example.encode(), parser=parser)

all_paras = XML_tree.xpath('./p/p[@xml:id]')

list_of_paragraphs = []
for para in all_paras:
    mydict = {}
    # string(.) joins all the text nodes; normalize-space() cleans up the whitespace
    mydict['text'] = para.xpath('normalize-space(string(.))')
    for att in para.attrib:
        mykey = att
        if 'id' in mykey:
            mykey = 'xmlid'
        mydict[mykey] = para.attrib[att]
    list_of_paragraphs.append(mydict)

PDM_XML_serializer(example)
If these tags are just noise for you, you can simply remove them before reading the XML:
XML_tree = etree.fromstring(example.replace('<lb/>', '').encode() , parser=parser)
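Alternatively, a sketch using etree.strip_tags, which drops the elements after parsing while merging their text and tails into the parent:

XML_tree = etree.fromstring(example.encode(), parser=parser)
etree.strip_tags(XML_tree, 'lb')  # <lb/> is removed; 'big seats.' is kept as text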
I have a scraper that looks for pricing on particular product pages. I'm only interested in the current price - whether the product is on sale or not.
I store the identifying tags like this in a JSON file:
{
    "some_ecommerce_site" : {
        "product_name" : ["span", "data-test", "product-name"],
        "breadcrumb" : ["div", "class", "breadcrumbs"],
        "sale_price" : ["span", "data-test", "sale-price"],
        "regular_price" : ["span", "data-test", "product-price"]
    }
}
And I have these functions to select the current price and clean up the price text:
def get_pricing(self, rpi, spi):
    # prefer the sale price if present, otherwise fall back to the regular price
    sale_price = self.soup_object.find(spi[0], {spi[1]: spi[2]})
    regular_price = self.soup_object.find(rpi[0], {rpi[1]: rpi[2]})
    return sale_price if sale_price else regular_price

def get_text(self, obj):
    return re.sub(r'\s\s+', '', obj.text.strip()).encode('utf-8')
Which are called by:
def get_ids(self, name_of_ecommerce_site):
    with open('site_identifiers.json') as j:
        return json.load(j)[name_of_ecommerce_site]

def get_data(self):
    rpi = self.site_ids['regular_price']
    spi = self.site_ids['sale_price']
    product_price = self.get_text(self.get_pricing(rpi, spi))
This works for all but one site so far because their pricing is formatted like so:
<div class="product-price">
<h3>
£15.00
<span class="price-standard">
£35.00
</span>
</h3>
</div>
So what product_price returns is "£15£35" instead of the desired "£15".
Is there a simple way to exclude the nested <span> which won't break for the working sites?
I thought a solution would be to get a list and select index 0, but checking the tag's contents, that won't work as it's a single item in the list:
>> print(type(regular_price))
>> <class 'bs4.element.Tag'>
>> print(regular_price.contents)
>> [u'\n', <h3>\n\n\xa325.00\n\n<span class="price-standard">\n\n\xa341.00\n</span>\n</h3>, u'\n']
I've tried creating a list out of the result's NavigableString elements then filtering out the empty strings:
filter(None, [self.get_text(unicode(x)) for x in sale_price.find_all(text=True)])
This fixes that one case, but breaks a few of the others (since they often have the currency in a different tag than the value amount) - I get back "£".
If you want to get the text without the child element's text, you can do it like this:
from bs4 import BeautifulSoup, NavigableString

html = """
<div class="product-price">
<h3>
£15.00
<span class="price-standard">
£35.00
</span>
</h3>
</div>
"""
bs = BeautifulSoup(html, "xml")
result = bs.find("div", {"class": "product-price"})
# keep only the direct text children of <h3>, skipping the nested <span>
fr = [element for element in result.h3 if isinstance(element, NavigableString)]
print(fr[0])
This question may be a duplicate of this one.
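As a variation on the same idea (a hedged sketch, not tested against the asker's other sites): find can also return just the first text node directly under the <h3>, skipping nested tags:

from bs4 import BeautifulSoup
html = '<div class="product-price"><h3> £15.00 <span class="price-standard">£35.00</span></h3></div>'
soup = BeautifulSoup(html, 'html.parser')
h3 = soup.find('div', {'class': 'product-price'}).h3
print(h3.find(string=True, recursive=False).strip())  # £15.00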
I have an html file which looks like:
...
<p>
<strong>This is </strong>
<strong>a lin</strong>
<strong>e which I want to </strong>
<strong>join.</strong>
</p>
<p>
2.
<strong>But do not </strong>
<strong>touch this</strong>
<em>Maybe some other tags as well.</em>
bla bla blah...
</p>
...
What I need is, if all the tags in a 'p' block are 'strong', then combine them into one line, i.e.
<p>
<strong>This is a line which I want to join.</strong>
</p>
Without touching the other block since it contains something else.
Any suggestions? I am using lxml.
UPDATE:
So far I tried:
for p in self.tree.xpath('//body/p'):
    if p.tail is None:  # no text before first element
        children = p.getchildren()
        for child in children:
            if len(children) == 1 or child.tag != 'strong' or child.tail is not None:
                break
        else:
            etree.strip_tags(p, 'strong')
With this code I was able to strip off the strong tags in the desired part, giving:
<p>
This is a line which I want to join.
</p>
So now I just need a way to put the tag back in...
I was able to do this with bs4 (BeautifulSoup):
from bs4 import BeautifulSoup as bs

html = """<p>
<strong>This is </strong>
<strong>a lin</strong>
<strong>e which I want to </strong>
<strong>join.</strong>
</p>
<p>
<strong>But do not </strong>
<strong>touch this</strong>
</p>"""

soup = bs(html)
s = ''
# note that I use the 0th <p> block ...[0],
# so make the appropriate change in your code
for t in soup.find_all('p')[0].text:
    s = s + t.strip('\n')
s = '<p><strong>' + s + '</strong></p>'

print s  # prints: <p><strong>This is a line which I want to join.</strong></p>
Then use replace_with():
p_tag = soup.p
p_tag.replace_with(bs(s, 'html.parser'))
print soup
prints:
<html><body><p><strong>This is a line which I want to join.</strong></p>
<p>
<strong>But do not </strong>
<strong>touch this</strong>
</p></body></html>
I have managed to solve my own problem.
for p in self.tree.xpath('//body/p'):
    if p.tail is None:  # some conditions specifically for my doc
        children = p.getchildren()
        if len(children) > 1:
            for child in children:
                # if other stuff is present, break
                if child.tag != 'strong' or child.tail is not None:
                    break
            else:
                # if we did not break, we found a p block to fix:
                # get rid of the stuff inside p, and put a SubElement in
                etree.strip_tags(p, 'strong')
                tmp_text = p.text_content()
                p.clear()
                subtext = etree.SubElement(p, "strong")
                subtext.text = tmp_text
Special thanks to @Scott, who helped me come down to this solution. Although I cannot mark his answer correct, I have no less appreciation for his guidance.
Alternatively, you can use a more specific xpath to get the targeted p elements directly:
p_target = """
//p[strong]
   [not(*[not(self::strong)])]
   [not(text()[normalize-space()])]
"""

for p in self.tree.xpath(p_target):
    # the logic inside the loop can also be the same as your `else` block
    content = p.xpath("normalize-space()")
    p.clear()
    strong = etree.SubElement(p, "strong")
    strong.text = content
A brief explanation of the xpath being used:
//p[strong] : find p element, anywhere in the XML/HTML document, having child element strong...
[not(*[not(self::strong)])] : ..and not having child element other than strong...
[not(text()[normalize-space()])] : ..and not having non-empty text node child.
normalize-space(): gets all text nodes from the current context element, concatenated, with consecutive whitespace normalized to a single space.
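A quick illustration of that last point:

from lxml import etree
p = etree.fromstring('<p><strong>This is  </strong><strong>a line.</strong></p>')
print(p.xpath('normalize-space()'))  # 'This is a line.'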