lxml - how to remove an element but not its content? - python

Let's assume I have the following HTML:
<div id="first">
<div id="second">
<a></a>
<ul>...</ul>
</div>
</div>
Here's my code:
div_parents = root_element.xpath('//div[div]')
for div in reversed(div_parents):
    if len(div.getchildren()) == 1:
        # remove the second div and replace it with its content
I'm selecting divs that have div children, and then I want to remove the child div if it is the only child its parent has. The result should be:
<div id="first">
<a></a>
<ul>...</ul>
</div>
I wanted to do it with:
div.replace(div.getchildren()[0], div.getchildren()[0].getchildren())
but unfortunately, both arguments of replace must be single elements. Is there anything easier than reassigning all properties of the first div to the second div and then replacing both? - because I could do that easily with:
div.getparent().replace(div, div.getchildren()[0])

Consider using copy.deepcopy as suggested in the docs:
For example:
div_parents = root_element.xpath('//div[div]')
for outer_div in div_parents:
    if len(outer_div.getchildren()) == 1:
        inner_div = outer_div[0]
        # Copy the children of inner_div to outer_div
        for e in inner_div:
            outer_div.append(copy.deepcopy(e))
        # Remove inner_div from outer_div
        outer_div.remove(inner_div)
Full code used to test:
import copy
import lxml.etree
def pprint(e): print(lxml.etree.tostring(e, pretty_print=True))
xml = '''
<body>
<div id="first">
<div id="second">
<a>...</a>
<ul>...</ul>
</div>
</div>
</body>
'''
root_element = lxml.etree.fromstring(xml)
div_parents = root_element.xpath('//div[div]')
for outer_div in div_parents:
    if len(outer_div.getchildren()) == 1:
        inner_div = outer_div[0]
        # Copy the children of inner_div to outer_div
        for e in inner_div:
            outer_div.append(copy.deepcopy(e))
        # Remove inner_div from outer_div
        outer_div.remove(inner_div)
pprint(root_element)
Output:
<body>
<div id="first">
<a>...</a>
<ul>...</ul>
</div>
</body>
Note: The enclosing <body> tag in the test code is unnecessary; I was just using it to test multiple cases. The test code runs without issue on your input.

I'd just use list-replacement:
from lxml.etree import fromstring, tostring
xml = """<div id="first">
<div id="second">
<a></a>
<ul>...</ul>
</div>
</div>"""
doc = fromstring(xml)
outer_divs = doc.xpath("//div[div]")
for outer in outer_divs:
    outer[:] = list(outer[0])
print(tostring(doc))
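If lxml isn't a hard requirement, the same replace-a-div-with-its-children idea can be sketched with the standard library's xml.etree.ElementTree. This is a minimal sketch on a simplified input, not the answer's exact code:

```python
import xml.etree.ElementTree as ET

xml = """<div id="first">
<div id="second">
<a>link</a>
<ul>items</ul>
</div>
</div>"""

root = ET.fromstring(xml)
# Snapshot the divs first: mutating the tree while iter() is running is unsafe.
for parent in list(root.iter('div')):
    inner = parent.find('div')
    if inner is not None and len(parent) == 1:
        idx = list(parent).index(inner)
        parent.remove(inner)
        # Re-insert the inner div's children at the position it occupied.
        for offset, child in enumerate(list(inner)):
            parent.insert(idx + offset, child)

result = ET.tostring(root, encoding='unicode')
print(result)
```

The wrapper div with id="second" is gone while its children survive in place.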


extract tags from soup with BeautifulSoup

'''
<div class="kt-post-card__body">
<div class="kt-post-card__title">Example_1</div>
<div class="kt-post-card__description">Example_2</div>
<div class="kt-post-card__bottom">
<span class="kt-post-card__bottom-description kt-text-truncate" title="Example_3">Example_4</span>
</div>
</div>
'''
According to the picture I attached, I want to extract all "kt-post-card__body" elements and then, from each one of them, extract
("kt-post-card__title", "kt-post-card__description")
as a list.
I tried this:
ads = soup.find_all('div',{'class':'kt-post-card__body'})
but with ads[0].div I can only access "kt-post-card__title", while "kt-post-card__body" has three other sub-tags like "kt-post-card__description" and "kt-post-card__bottom" ... why is that?
Since your question is not that clear - to extract the classes:
for e in soup.select('.kt-post-card__body'):
    print([c for t in e.find_all() for c in t.get('class')])
Output:
['kt-post-card__title', 'kt-post-card__description', 'kt-post-card__bottom', 'kt-post-card__bottom-description', 'kt-text-truncate']
To get the texts you also have to iterate your ResultSet and could access each elements text to fill your list or use stripped_strings.
Example
from bs4 import BeautifulSoup
html_doc='''
<div class="kt-post-card__body">
<div class="kt-post-card__title">Example_1</div>
<div class="kt-post-card__description">Example_2</div>
<div class="kt-post-card__bottom">
<span class="kt-post-card__bottom-description kt-text-truncate" title="Example_3">Example_4</span>
</div>
</div>
'''
soup = BeautifulSoup(html_doc, 'html.parser')
for e in soup.select('.kt-post-card__body'):
    data = [
        e.select_one('.kt-post-card__title').text,
        e.select_one('.kt-post-card__description').text
    ]
    print(data)
Output:
['Example_1', 'Example_2']
or
print(list(e.stripped_strings))
Output:
['Example_1', 'Example_2', 'Example_4']
Try this:
ads = soup.find_all('div',{'class':'kt-post-card__body'})
ads[0]
I think you're getting only the first div because you called ads[0].div - attribute access like .div returns only the first matching child tag.

Parsing invalid HTML and retrieving a tag's text to replace it

I need to iterate over invalid HTML and obtain the text value of every tag so that I can change it.
from bs4 import BeautifulSoup
html_doc = """
<div class="oxy-toggle toggle-7042 toggle-7042-expanded" data-oxy-toggle-active-class="toggle-7042-expanded" data-oxy-toggle-initial-state="closed" id="_toggle-212-142">
<div class="oxy-expand-collapse-icon" href="#"></div>
<div class="oxy-toggle-content">
<h3 class="ct-headline" id="headline-213-142"><span class="ct-span" id="span-225-142">Sklizeň jahod 2019</span></h3> </div>
</div><div class="ct-text-block" id="text_block-214-142"><span class="ct-span" id="span-230-142"><p>Začátek sklizně: <strong>Zahájeno</strong><br>
Otevřeno: <strong>6 h – do otrhání</strong>, denně</p>
</span></div>
"""
soup = BeautifulSoup(html_doc, "html.parser")
for tag in soup.find_all():
    print(tag.name)
    if tag.string:
        tag.string.replace_with("1")
print(soup)
The result is
<div class="oxy-toggle toggle-7042 toggle-7042-expanded" data-oxy-toggle-active-class="toggle-7042-expanded" data-oxy-toggle-initial-state="closed" id="_toggle-212-142">
<div class="oxy-expand-collapse-icon" href="#"></div>
<div class="oxy-toggle-content">
<h3 class="ct-headline" id="headline-213-142"><span class="ct-span" id="span-225-142">1</span></h3> </div>
</div><div class="ct-text-block" id="text_block-214-142"><span class="ct-span" id="span-230-142"><p>Začátek sklizně: <strong>1</strong><br/>
Otevřeno: <strong>1</strong>, denně</p>
</span></div>
I know how to replace the text, but bs4 won't find the text of the paragraph tag, so the texts "Začátek sklizně:", "Otevřeno:" and ", denně" are not found and I cannot replace them.
I tried using different parsers such as lxml and html5lib; it doesn't make a difference.
I tried Python's HTML library, but that doesn't support changing HTML, only iterating over it.
On a Tag object, .string returns a NavigableString: if your tag has a single string child, the returned value is that string; if it has no children or more than one child, it returns None.
The scenario is not quite clear to me, but here is one last approach based on your comment:
I need generic code to iterate any html and find all texts so I can work with them.
for tag in soup.find_all(text=True):
    tag.replace_with('1')
Example
from bs4 import BeautifulSoup
html_doc = """<div class="oxy-toggle toggle-7042 toggle-7042-expanded" data-oxy-toggle-active-class="toggle-7042-expanded" data-oxy-toggle-initial-state="closed" id="_toggle-212-142">
<div class="oxy-expand-collapse-icon" href="#"></div>
<div class="oxy-toggle-content">
<h3 class="ct-headline" id="headline-213-142"><span class="ct-span" id="span-225-142">Sklizeň jahod 2019</span></h3> </div>
</div><div class="ct-text-block" id="text_block-214-142"><span class="ct-span" id="span-230-142"><p>Začátek sklizně: <strong>Zahájeno</strong><br>
Otevřeno: <strong>6 h – do otrhání</strong>, denně</p>
</span></div>"""
soup = BeautifulSoup(html_doc, 'html.parser')
for tag in soup.find_all(text=True):
    tag.replace_with('1')
Output
<div class="oxy-toggle toggle-7042 toggle-7042-expanded" data-oxy-toggle-active-class="toggle-7042-expanded" data-oxy-toggle-initial-state="closed" id="_toggle-212-142">1<div class="oxy-expand-collapse-icon" href="#"></div>1<div class="oxy-toggle-content">1<h3 class="ct-headline" id="headline-213-142"><span class="ct-span" id="span-225-142">1</span></h3>1</div>1</div><div class="ct-text-block" id="text_block-214-142"><span class="ct-span" id="span-230-142"><p>1<strong>1</strong><br/>1<strong>1</strong>1</p>1</span></div>
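The same "visit every text node" idea works outside bs4 too. In ElementTree terms, each text node is either an element's .text or a trailing sibling's .tail; this is a minimal stdlib sketch on a simplified, well-formed fragment (not the question's exact HTML):

```python
import xml.etree.ElementTree as ET

# Hypothetical well-formed fragment mirroring the structure above.
html = '<div><h3><span>Title</span></h3><p>Begin: <strong>Open</strong> daily</p></div>'
root = ET.fromstring(html)
for el in root.iter():
    # A text node is either an element's .text or the text after it (.tail).
    if el.text and el.text.strip():
        el.text = '1'
    if el.tail and el.tail.strip():
        el.tail = '1'

result = ET.tostring(root, encoding='unicode')
print(result)
```

Every piece of visible text, including the " daily" tail after the strong tag, is replaced.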

How to keep all html elements with selector but drop all others?

I would like to get an HTML string without certain elements. However, upfront I only know which elements to keep, not which ones to drop.
Let's say I just want to keep all p and a tags inside the div with class="A".
Input:
<div class="A">
<p>Text1</p>
<img src="A.jpg">
<div class="sub1">
<p>Subtext1</p>
</div>
<p>Text2</p>
link text
</div>
<div class="B">
ContentDiv2
</div>
Expected output:
<div class="A">
<p>Text1</p>
<p>Text2</p>
link text
</div>
If I knew all the selectors of all the other elements, I could just use lxml's drop_tree(). But the problem is that I don't know ['img', 'div.sub1', 'div.B'] upfront.
Example with drop_tree():
import lxml.cssselect
import lxml.html

tree = lxml.html.fromstring(html_str)
elements_drop = ['img', 'div.sub1', 'div.B']
for j in elements_drop:
    selector = lxml.cssselect.CSSSelector(j)
    for e in selector(tree):
        e.drop_tree()
output = lxml.html.tostring(tree)
I'm still not entirely sure I understand correctly, but it seems like you may be looking for something resembling this:
target = tree.xpath('//div[@class="A"]')[0]
to_keep = target.xpath('//p | //a')
for t in target.xpath('.//*'):
    if t not in to_keep:
        target.remove(t)  # I believe this method is better here than drop_tree()
print(lxml.html.tostring(target).decode())
The output I get is your expected output.
Try the below. The idea is to clean the root and re-append the required sub-elements.
Note that no external lib is required.
import xml.etree.ElementTree as ET
html = '''<div class="A">
<p>Text1</p>
<img src="A.jpg"/>
<div class="sub1">
<p>Subtext1</p>
</div>
<p>Text2</p>
link text
ContentDiv2
</div>'''
root = ET.fromstring(html)
p_lst = root.findall('./p')
a_lst = root.findall('./a')
children = list(root)
for c in children:
    root.remove(c)
for p in p_lst:
    p.tail = ''
    root.append(p)
for a in a_lst:
    a.tail = ''
    root.append(a)
root.text = ''
ET.dump(root)
output
<div class="A">
<p>Text1</p>
<p>Text2</p>
link text
</div>

Group in list by div class

Question:
Can I group found elements by the div class they're in, and store them as lists within a list? Is that possible?
So I did some further testing and, as said, it seems that even if you store one div in a variable, searching within that stored div still searches the whole site content.
from selenium import webdriver
driver = webdriver.Chrome()
result_text = []
# Let's say this is the class of the different divs I want to group by:
# class='a-fixed-right-grid a-spacing-top-medium'
# These are the texts from all divs around the page that I'm looking for,
# but I can't say which one belongs in which div
elements = driver.find_elements_by_xpath("//a[contains(@href, '/gp/product/')]")
for element in elements:
    result_text.append(element.text)
print(result_text)
Current Result:
I'm already getting all the information I'm looking for from different divs around the page but I want it to be "grouped" by the topmost div.
['Text11', 'Text12', 'Text2', 'Text31', 'Text32']
Result I want to achieve:
The text is grouped by class='a-fixed-right-grid a-spacing-top-medium':
[['Text11', 'Text12'], ['Text2'], ['Text31', 'Text32']]
HTML: (looks something like this)
class="a-text-center a-fixed-left-grid-col a-col-left" is the first one that wraps the group; from there on we can use any div to group it. At least I think so.
</div>
</div>
</div>
</div>
<div class="a-fixed-right-grid a-spacing-top-medium"><div class="a-fixed-right-grid-inner a-grid-vertical-align a-grid-top">
<div class="a-fixed-right-grid-col a-col-left" style="padding-right:3.2%;float:left;">
<div class="a-row">
<div class="a-fixed-left-grid a-spacing-base"><div class="a-fixed-left-grid-inner" style="padding-left:100px">
<div class="a-text-center a-fixed-left-grid-col a-col-left" style="width:100px;margin-left:-100px;float:left;">
<div class="item-view-left-col-inner">
<a class="a-link-normal" href="/gp/product/B07YCW79/ref=ppx_yo_dt_b_asin_image_o0_s00?ie=UTF8&psc=1">
<img alt="" src="https://images-eu.ssl-images-amazon.com/images/I/41rcskoL._SY90_.jpg" aria-hidden="true" onload="if (typeof uet == 'function') { uet('cf'); uet('af'); }" class="yo-critical-feature" height="90" width="90" title="Same as the text I'm looking for" data-a-hires="https://images-eu.ssl-images-amazon.com/images/I/41rsxooL._SY180_.jpg">
</a>
</div>
</div>
<div class="a-fixed-left-grid-col a-col-right" style="padding-left:1.5%;float:left;">
<div class="a-row">
<a class="a-link-normal" href="/gp/product/B07YCR79/ref=ppx_yo_dt_b_asin_title_o00_s0?ie=UTF8&psc=1">
Text I'm looking for
</a>
</div>
<div class="a-row">
I don't have the link to test it on but this might work for you:
from selenium import webdriver

driver = webdriver.Chrome()
result_text = [[a.text for a in div.find_elements_by_xpath(".//a[contains(@href, '/gp/product/')]")]
               for div in driver.find_elements_by_class_name('a-fixed-right-grid')]
print(result_text)
EDIT: added alternative function:
# if that doesn't work try:
def get_results(selenium_driver, div_class, a_xpath):
    div_list = []
    for div in selenium_driver.find_elements_by_class_name(div_class):
        a_list = []
        for a in div.find_elements_by_xpath(a_xpath):
            a_list.append(a.text)
        div_list.append(a_list)
    return div_list

get_results(driver,
            div_class='a-fixed-right-grid',
            a_xpath=".//a[contains(@href, '/gp/product/')]")
If that doesn't work, then the XPath may be matching every element in the document each time despite being called from the div (keep it relative with a leading .), or another element farther up the document may have that same class name.
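The grouping logic itself can be checked offline with a stdlib stand-in for the page. The detail that carries over to Selenium is the leading dot: a relative XPath only searches inside the element it is called on, while "//a" always searches the whole document. This sketch uses hypothetical simplified markup:

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for the page: two container divs with product links.
html = '''<body>
<div class="a-fixed-right-grid"><a href="/gp/product/B1">Text11</a><a href="/gp/product/B2">Text12</a></div>
<div class="a-fixed-right-grid"><a href="/gp/product/B3">Text2</a></div>
</body>'''
root = ET.fromstring(html)
# ".//a" keeps the link search scoped to each container div.
grouped = [[a.text for a in div.findall('.//a')]
           for div in root.findall('.//div[@class="a-fixed-right-grid"]')]
print(grouped)
```

The result is one inner list per container div, which is the grouping the question asks for.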

Unable to locate an lists of elements using selenium

I need to scrape some pages. The exact structure of the part that I want is as follows:
<div class="someclasses">
<h3>...</h3> # Not needed
<ul class="ul-class1 ul-class2">
<li id="li1-id" class="li-class1 li-class2">
<div id ="div1-id" class="div-class1 div-class2 ... div-class6">
<div class="div2-class">
<div class="div3-class">...</div> #Not needed
<div class="div4-class1 div4-class2 div4-class3">
<a href="href1" data-control-id="id1" data-control-name="name" id ="a1-id" class="a-class1 a-class2">
<h3 class="h3-class1 h3-class2 h3-class3">Text1</h3>
</a></div>
<div>...</div> # Not needed
</div>
</li>
<li id="li2-id" class="li-class1 li-class2">
<div id ="div2-id" class="div-class1 div-class2 ... div-class6">
<div class="div2-class">
<div class="div3-class">...</div> #Not needed
<div class="div4-class1 div4-class2 div4-class3">
<a href="href2" data-control-id="id2" data-control-name="name" id ="a2-id" class="a-class1 a-class2">
<h3 class="h3-class1 h3-class2 h3-class3">Text2</h3>
</a></div>
<div>...</div> # Not needed
</div>
</li>
# More <li> elements
</ul>
</div>
Now what I want is to get the texts as well as the hrefs. The naming in the above example is realistic, i.e. the names are the same as in the real webpage. The code that I am currently using is:
elems = driver.find_elements_by_xpath("//div[@class='someclasses']/ul[@class='ul-class1']/li[@class='li-class1']")
print(len(elems))
for elem in elems:
    elem1 = driver.find_element_by_xpath("./a[@data-control-name='name']")
    names2.append(elem1.text)
    print(elem1.text)
    hrefs.append(elem.get_attribute("href"))
The result of the print statement above is 0, so basically the elements are not found. Can anyone please tell me what I am doing wrong?
You are using only part of the class name... in XPath you need the full class name...
FYI: with CSS you can use part of the class name...
If you want to use XPath, try:
elems = driver.find_elements_by_xpath("//div[@class='someclasses']//li//a")
print(len(elems))
for elem in elems:
    names2.append(elem.text)
    print(elem.text)
    new_href = elem.get_attribute("href")
    print(new_href)
    hrefs.append(new_href)
For CSS use: div.someclasses ul.ul-class1
elems = driver.find_elements_by_css_selector("div.someclasses ul.ul-class1 li a")
for elem in elems:
    names2.append(elem.text)
    print(elem.text)
    new_href = elem.get_attribute("href")
    print(new_href)
    hrefs.append(new_href)
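The "full class name" pitfall is easy to demonstrate offline: in the markup, class is a single space-separated string, so an exact [@class='li-class1'] comparison misses multi-class elements, while CSS selectors match individual tokens. A stdlib sketch on a hypothetical fragment shaped like the question's:

```python
import xml.etree.ElementTree as ET

html = ('<ul class="ul-class1 ul-class2">'
        '<li class="li-class1 li-class2"><a href="href1">Text1</a></li>'
        '<li class="other"><a href="href2">Text2</a></li>'
        '</ul>')
root = ET.fromstring(html)
# An XPath like [@class='li-class1'] fails here: the attribute value is the
# full string "li-class1 li-class2". Match a single class token by splitting.
links = [li.find('.//a') for li in root.iter('li')
         if 'li-class1' in li.get('class', '').split()]
result = [(a.text, a.get('href')) for a in links]
print(result)
```

Only the li carrying the li-class1 token is matched, the way a CSS selector like li.li-class1 would behave.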
