Fellows, assuming the HTML below, how can I extract the paragraphs <p> that belong to the title <h3>?
<!DOCTYPE html>
<html>
<body>
...
<div class="main-div">
<h3>Title 1</h3>
<p></p>
<h3>Title 2</h3>
<p></p>
<p></p>
<p></p>
<h3>Title 3</h3>
<p></p>
<p></p>
...
</div>
</body>
As you can see, both the <h3> and <p> tags are children of the <div> tag, but they have no class or id that makes it possible to identify them and say that Title 1 has 1 paragraph, Title 2 has 3 paragraphs, Title 3 has 2 paragraphs, and so on. I can't see a way to tie the paragraphs to the titles...
I'm trying to do it using Python 2.7 + Selenium, but I'm not sure I'm working with the right tools. Maybe you can suggest a solution or a different combination, like BeautifulSoup or urllib2...
Any suggestion/direction will be much appreciated!
UPDATE
After the brilliant solution pointed out by @JustMe I came up with the solution below. Hope it helps someone else, or maybe someone can improve it to be more Pythonic. I come from the C/C++/Java/Perl world, so I keep hitting walls :)
import bs4
page = """
<!DOCTYPE html>
<html>
<body>
...
<div class="maincontent-block">
<h3>Title 1</h3>
<p>1</p>
<p>2</p>
<p>3</p>
<h3>Title 2</h3>
<p>2</p>
<p>3</p>
<p>4</p>
<h3>Title 3</h3>
<p>7</p>
<p>9</p>
...
</div>
</body>
"""
page = bs4.BeautifulSoup(page, "html.parser")
div = page.find('div', {'class': "maincontent-block"})
mydict = {}

# write to the dictionary: for each <h3>, collect the <p> strings
# that follow it, stopping at the next <h3>
for tag in div.findChildren():
    if tag.name == "h3":
        arr = []
        mydict[tag.string] = arr
        for nt in tag.findAllNext():
            if nt.name == "p":
                arr.append(nt.string)
            elif nt.name == "h3":
                break

# read from the dictionary, sorted by title
for k in sorted(mydict):
    print k
    for v in mydict[k]:
        print v
It's easy to do using BeautifulSoup:
import bs4
page = """
<!DOCTYPE html>
<html>
<body>
...
<div class="main-div">
<h3>Title 1</h3>
<p></p>
<h3>Title 2</h3>
<p></p>
<p></p>
<p></p>
<h3>Title 3</h3>
<p></p>
<p></p>
...
</div>
</body>
"""
page = bs4.BeautifulSoup(page, "html.parser")
h3_tag = page.div.find("h3")
print(h3_tag.string)
>>> Title 1
h3_tag.find_next_siblings("p")
>>> [<p></p>, <p></p>, <p></p>, <p></p>, <p></p>, <p></p>]
len(h3_tag.find_next_siblings("p"))/2
>>> 3
OK, since you want a separate count of paragraphs per title, I came up with this crude thing.
h_counters = []
count = -1
for child in page.div.findChildren():
    if "<h3>" in str(child):
        h_counters.append(count)
        count = 0
    else:
        count += 1
h_counters.append(count)
h_counters = h_counters[1:]
print(h_counters)
>>> [1, 3, 2]
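Since the question asked for something more Pythonic, here is a single-pass sketch of the same grouping idea (my own suggestion, assuming page still holds the raw HTML string from the UPDATE): walk the div's direct children once and collect each <p> under the most recently seen <h3>.
import bs4

soup = bs4.BeautifulSoup(page, "html.parser")
div = soup.find('div', {'class': "maincontent-block"})

grouped = {}
current = None
# direct children only, in document order
for child in div.find_all(['h3', 'p'], recursive=False):
    if child.name == 'h3':
        current = child.string
        grouped[current] = []
    elif current is not None:
        grouped[current].append(child.string)

for title in sorted(grouped):
    print(title)
    for text in grouped[title]:
        print(text)
This avoids the repeated findAllNext() scans: the document is walked once instead of once per heading.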
Related
Given something like the following html:
<div>
  <div>
    <meta ... />
    <img />
  </div>
  <div id="main">
    <p class="foo">Hello, World</p>
    <div>
      <div class="bar">Hey, there!</div>
    </div>
  </div>
</div>
How would I go about selecting only the elements that have text and outputting a generated, unique css selector for said element?
For this example, that would be:
# can be even more specific if there are other .foo's
[
  { "html": "Hello, World", "selector": ".foo" },
  { "html": "Hey, there!", "selector": ".bar" }
]
I was playing with BeautifulSoup and html_sanitizer but wasn't getting great results.
This should be a piece of cake with BeautifulSoup:
from bs4 import BeautifulSoup
html = """
<div>
  <div>
    <meta ... />
    <img />
  </div>
  <div id="main">
    <p class="foo">Hello, World</p>
    <div>
      <div class="bar">Hey, there!</div>
    </div>
  </div>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')
results = []
for element in soup.find_all(string=True):
    # climb until we reach an ancestor with an id or a class
    parent = element.parent
    while parent and not (parent.has_attr('id') or parent.has_attr('class')):
        parent = parent.parent
    if parent and element.strip() != '':
        if parent.has_attr('id'):
            results.append({
                "html": element.strip(),
                "selector": '#' + parent['id']
            })
        elif parent.has_attr('class'):
            results.append({
                "html": element.strip(),
                "selector": list(map(lambda cls: '.' + cls, parent['class']))
            })
print(results)
Came up with this by directing Copilot and ChatGPT:
from bs4 import BeautifulSoup

def get_css_selector(element):
    selector = element.name
    if element.has_attr("id"):
        selector += "#" + element["id"]
    else:
        classes = element.get("class", [])
        if classes:
            selector += "." + ".".join(classes)
        else:
            # no id or class: build a positional selector via the parent
            parent = element.find_parent()
            if parent:
                parent_selector = get_css_selector(parent)
                selector = parent_selector + " > " + selector
            index = 1
            for sibling in element.previous_siblings:
                if sibling.name == element.name:
                    index += 1
            selector += f":nth-of-type({index})"
    return selector

def get_html_segments(page):
    soup = BeautifulSoup(page, "html.parser")
    html_segments = []
    for tag in soup.find_all():
        if tag.name not in ["p", "h1", "h2", "h3", "h4", "h5", "h6", "li", "a", "span", "div"]:
            continue
        # only tags whose single child is a text node
        if len(tag.contents) == 1 and tag.contents[0].name is None:
            html_segments.append(
                {"text": str(tag.contents), "css_selector": get_css_selector(tag)}
            )
    return html_segments
Please let me know if someone comes up with something more effective.
At some point, it'd be cool to take the text from block elements and have inline text elements turned to their innerText.
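For that innerText idea, bs4's get_text() might be a useful starting point. A rough sketch (the element list and separator here are just illustrative, not a full solution):
from bs4 import BeautifulSoup

html = '<div id="main"><p class="foo">Hello, <b>World</b></p></div>'
soup = BeautifulSoup(html, 'html.parser')

# get_text() flattens inline children to their text, roughly like innerText
for tag in soup.find_all(['p', 'div']):
    text = tag.get_text(' ', strip=True)
    if text:
        print(tag.name, '->', text)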
I would like to get an HTML string without certain elements. However, I only know upfront which elements to keep, not which ones to drop.
Let's say I just want to keep all p and a tags inside the div with class="A".
Input:
<div class="A">
<p>Text1</p>
<img src="A.jpg">
<div class="sub1">
<p>Subtext1</p>
</div>
<p>Text2</p>
link text
</div>
<div class="B">
ContentDiv2
</div>
Expected output:
<div class="A">
<p>Text1</p>
<p>Text2</p>
link text
</div>
If I knew the selectors of all the other elements I could just use lxml's drop_tree(), but the problem is that I don't know ['img', 'div.sub1', 'div.B'] upfront.
Example with drop_tree():
import lxml.cssselect
import lxml.html

tree = lxml.html.fromstring(html_str)
elements_drop = ['img', 'div.sub1', 'div.B']
for j in elements_drop:
    selector = lxml.cssselect.CSSSelector(j)
    for e in selector(tree):
        e.drop_tree()
output = lxml.html.tostring(tree)
I'm still not entirely sure I understand correctly, but it seems like you may be looking for something resembling this:
target = tree.xpath('//div[@class="A"]')[0]
to_keep = target.xpath('//p | //a')
for t in target.xpath('.//*'):
    if t not in to_keep:
        target.remove(t)  # I believe this method is better here than drop_tree()
print(lxml.html.tostring(target).decode())
The output I get is your expected output.
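One caveat worth adding: in lxml, an XPath that starts with // is evaluated from the document root even when called on an element, so //p | //a can match tags outside target. If the document contained p or a tags elsewhere, a subtree-restricted expression would be safer (for this input both give the same result):
to_keep = target.xpath('.//p | .//a')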
Try the below. The idea is to clear the root and re-add the required sub-elements.
Note that no external lib is required.
import xml.etree.ElementTree as ET
html = '''<div class="A">
<p>Text1</p>
<img src="A.jpg"/>
<div class="sub1">
<p>Subtext1</p>
</div>
<p>Text2</p>
link text
ContentDiv2
</div>'''
root = ET.fromstring(html)
p_lst = root.findall('./p')
a_lst = root.findall('./a')
children = list(root)
for c in children:
    root.remove(c)
for p in p_lst:
    p.tail = ''
    root.append(p)
for a in a_lst:
    a.tail = ''
    root.append(a)
root.text = ''
ET.dump(root)
Output:
<?xml version="1.0" encoding="UTF-8"?>
<div class="A">
  <p>Text1</p>
  <p>Text2</p>
  <a href="...">link text</a>
</div>
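Note that the input here differs slightly from the question's HTML: xml.etree.ElementTree parses XML, not HTML, so it needs well-formed markup with a single root element (hence the self-closed <img .../> and everything wrapped in one div). Real-world HTML would generally need lxml or BeautifulSoup instead.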
I would like to get the wrapper of a key text. For example, in HTML:
…
<div class="target">chicken</div>
<div class="not-target">apple</div>
…
And based on the text "chicken", I would like to get back <div class="target">chicken</div>.
Currently, I have the following to fetch the HTML:
import requests
from bs4 import BeautifulSoup

r = requests.get(url).text
soup = BeautifulSoup(r, 'html.parser')
So far I'm having to do soup.find_all('div', ...) and loop through all the available divs to find the wrapper I'm looking for. But without having to loop through every div, what would be the proper and most optimal way of fetching the wrapper in HTML based on a defined text?
Thank you in advance and will be sure to accept/upvote answer!
# coding: utf-8
html_doc = """
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<title> Last chicken leg on stock! Only 500$ !!! </title>
</head>
<body>
<div id="layer1" class="class1">
<div id="layer2" class="class2">
<div id="layer3" class="class3">
<div id="layer4" class="class4">
<div id="layer5" class="class5">
<p>My chicken has <span style="color:blue">ONE</span> leg :P</p>
<div id="layer6" class="class6">
<div id="layer7" class="class7">
<div id="chicken_surname" class="chicken">eat me</div>
<div id="layer8" class="class8">
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</body>
</html>"""
from bs4 import BeautifulSoup as BS
import re
soup = BS(html_doc, "lxml")
# (tag -> text) direction is pretty obvious that way
tag = soup.find('div', class_="chicken")
tag2 = soup.find('div', {'id':"chicken_surname"})
print('\n###### by_cls:')
print(tag)
print('\n###### by_id:')
print(tag2)
# but can be tricky when need to find tag by substring
tag_by_str = soup.find(string="eat me")
tag_by_sub = soup.find(string="eat")
tag_by_resub = soup.find(string=re.compile("eat"))
print('\n###### tag_by_str:')
print(tag_by_str)
print('\n###### tag_by_sub:')
print(tag_by_sub)
print('\n###### tag_by_resub:')
print(tag_by_resub)
# there is more than one way to access the underlying strings,
# and they behave differently - see the results
tag = soup.find('p')
print('\n###### .text attr:')
print(tag.text, type(tag.text))
print('\n###### .strings generator:')
for s in tag.strings:  # .strings is a generator object
    print(s, type(s))
# note that the .strings generator yields bs4.element.NavigableString elements,
# so we can use them to navigate, for example by accessing their parents:
print('\n###### NavigableString parents:')
for s in tag.strings:
    print(s.parent)
# or even grandparents :)
print('\n###### grandparents:')
for s in tag.strings:
    print(s.parent.parent)
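To answer the original question directly: instead of looping over every div, you can let BeautifulSoup find the text node and then step up to its parent. A minimal sketch against the OP's snippet (note that find() returns the first match, so make the pattern specific enough):
import re
from bs4 import BeautifulSoup

html = '<div class="target">chicken</div><div class="not-target">apple</div>'
soup = BeautifulSoup(html, 'html.parser')

# find the NavigableString, then take the tag that directly wraps it
hit = soup.find(string=re.compile("chicken"))
print(hit.parent if hit else None)
# -> <div class="target">chicken</div>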
Let's assume I have the following code:
<div id="first">
<div id="second">
<a></a>
<ul>...</ul>
</div>
</div>
Here's my code:
div_parents = root_element.xpath('//div[div]')
for div in reversed(div_parents):
    if len(div.getchildren()) == 1:
        # remove the second div and replace it with its content
I'm reaching divs that have div children, and then I want to remove the child div if it's the only child its parent has. The result should be:
<div id="first">
<a></a>
<ul>...</ul>
</div>
I wanted to do it with:
div.replace(div.getchildren()[0], div.getchildren()[0].getchildren())
but unfortunately both arguments of replace must be single elements. Is there something easier than reassigning all the properties of the first div to the second div and then replacing both? Because I could do that easily with:
div.getparent().replace(div, div.getchildren()[0])
Consider using copy.deepcopy as suggested in the docs:
For example:
div_parents = root_element.xpath('//div[div]')
for outer_div in div_parents:
    if len(outer_div.getchildren()) == 1:
        inner_div = outer_div[0]
        # Copy the children of inner_div to outer_div
        for e in inner_div:
            outer_div.append(copy.deepcopy(e))
        # Remove inner_div from outer_div
        outer_div.remove(inner_div)
Full code used to test:
import copy
import lxml.etree

def pprint(e): print(lxml.etree.tostring(e, pretty_print=True))

xml = '''
<body>
  <div id="first">
    <div id="second">
      <a>...</a>
      <ul>...</ul>
    </div>
  </div>
</body>
'''

root_element = lxml.etree.fromstring(xml)
div_parents = root_element.xpath('//div[div]')
for outer_div in div_parents:
    if len(outer_div.getchildren()) == 1:
        inner_div = outer_div[0]
        # Copy the children of inner_div to outer_div
        for e in inner_div:
            outer_div.append(copy.deepcopy(e))
        # Remove inner_div from outer_div
        outer_div.remove(inner_div)
pprint(root_element)
Output:
<body>
  <div id="first">
    <a>...</a>
    <ul>...</ul>
  </div>
</body>
Note: The enclosing <body> tag in the test code is unnecessary; I was just using it to test multiple cases. The test code runs without issue on your input.
I'd just use list-replacement:
from lxml.etree import fromstring, tostring
xml = """<div id="first">
<div id="second">
<a></a>
<ul>...</ul>
</div>
</div>"""
doc = fromstring(xml)
outer_divs = doc.xpath("//div[div]")
for outer in outer_divs:
outer[:] = list(outer[0])
print tostring(doc)
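For what it's worth, the slice assignment outer[:] = list(outer[0]) replaces all of outer's children (here, just the inner div) with the inner div's own children in one step; since lxml moves elements rather than copying them, the inner div and its id attribute simply disappear.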
Code
#!/usr/bin/env python3
from bs4 import BeautifulSoup
test="""<!DOCTYPE html>
<html>
<head>
<meta content="text/html; charset=UTF-8" http-equiv="Content-Type"/>
<title>Test</title>
</head>
<body>
<table>
<tbody>
<tr>
<td>
<div>
<b>
Icon
</b>
</div>
</td>
</tr>
</tbody>
</table>
</body>
</html>"""
soup = BeautifulSoup(test, 'html.parser')
rows = soup.findAll('tr')
for r in rows:
    print(r.name)
    for c in r.children:
        print('>', c.name)
Output
tr
> None
> td
> None
Why are there nameless elements in the list of the row's children?
This occurs running Python 3.3.1 64-bit on Windows 8, with html.parser (that's Python's built-in one).
The elements of .children can be NavigableStrings as well as Tags. In the case of your example, they're the whitespace before and after the td element.
This variation on your code hopefully makes it clear:
>>> rows = soup.findAll('tr')
>>> for r in rows:
...     print("row:", r.name)
...     for c in r.children:
...         print("---")
...         print(type(c))
...         print(repr(c))
...
row: tr
---
<class 'bs4.element.NavigableString'>
'\n'
---
<class 'bs4.element.Tag'>
<td>
<div>
<b>
Icon
</b>
</div>
</td>
---
<class 'bs4.element.NavigableString'>
'\n'
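If you only want the tag children, you can either ask find_all() for direct children or filter .children by element type. A small sketch using the same soup:
from bs4.element import Tag

for r in soup.findAll('tr'):
    # find_all with recursive=False matches direct tag children only,
    # so the whitespace NavigableStrings are skipped
    for c in r.find_all(recursive=False):
        print('>', c.name)
    # or filter .children explicitly
    tag_children = [c for c in r.children if isinstance(c, Tag)]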