I'm parsing a page that has structure like this:
<pre class="asdf">content a</pre>
<pre class="asdf">content b</pre>
# returns
content a
content b
And I'm using the following XPath to get the content:
"//pre[#class='asdf']/text()"
It works well, except when there are elements nested inside the <pre> tag, in which case it doesn't concatenate them:
<pre class="asdf">content <a href="http://stackoverflow.com"</a>a</a></pre>
<pre class="asdf">content b</pre>
# returns
content
content b
If I use this XPath, I get the output that follows.
"//pre[#class='asdf']//text()"
content
a
content b
I don't want either of those. I want to get all the text inside a <pre>, even if it has children. I don't care whether the tags are stripped or not, but I want it concatenated together.
How do I do this? I'm using lxml.html.xpath in python2, but I don't think it matters. This answer to another question makes me think that maybe child:: has something to do with my answer.
Here's some code that reproduces it.
from lxml import html
tree = html.fromstring("""
<pre class="asdf">content a</pre>
<pre class="asdf">content b</pre>
""")
for row in tree.xpath("//*[@class='asdf']/text()"):
    print("row: ", row)
.text_content() is what you should use:
.text_content():
Returns the text content of the element, including the text content of its children, with no markup.
for row in tree.xpath("//*[@class='asdf']"):
    print("row: ", row.text_content())
Demo:
>>> from lxml import html
>>>
>>> tree = html.fromstring("""
... <pre class="asdf">content a</pre>
... <pre class="asdf">content b</pre>
... """)
>>> for row in tree.xpath("//*[@class='asdf']"):
...     print("row: ", row.text_content())
...
('row: ', 'content a')
('row: ', 'content b')
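If you'd rather keep everything in XPath, a minimal sketch that joins the descendant text nodes yourself (same tree as above; the join is my addition, not part of the original answer):

for row in tree.xpath("//*[@class='asdf']"):
    # .//text() collects every text node under the element, nested tags included
    print("row: ", "".join(row.xpath(".//text()")))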
I'm very new to Python and BeautifulSoup and trying to up my game. Let's say this is my HTML:
<div class="container">
<h4>Title 1</h4>
Text I want is here
<br /> # random break tags inserted throughout
<br />
More text I want here
<br />
yet more text I want
<h4>Title 2</h4>
More text here, but I do not want it
<br />
<ul> # More HTML that I do not want</ul>
</div> # End container div
My expected output is the text between the two H4 tags:
Text I want is here
More text I want here
yet more text I want
But I don't know in advance what this text will say or how much of it there will be. There might be only one line, or there might be several paragraphs. It is not tagged with anything: no p tags, no id, nothing. The only thing I know about it is that it will appear between those two H4 tags.
At the moment, what I'm doing is working backward from the second H4 tag by using .previous_siblings to get everything up to the container div.
text = soup.find('div', class_ = 'container').find_next('h4', text = 'Title 2')
text = text.previous_siblings
text_list = []
for line in text:
    text_list.append(line)
text_list.reverse()
full_text = ' '.join([str(line) for line in text_list])
text = full_text.strip().replace('<h4>Title 1</h4>', '').replace('<br />', '')
This gives me the content I want, but it also gives me a lot more that I don't want, plus it gives it to me backwards, which is why I need to use reverse(). Then I end up having to strip out a lot of stuff using replace().
What I don't like about this is that since my end result is a list, I'm finding it hard to clean up the output. I can't use get_text() on a list. In my real-life version of this I have about ten instances of replace() and it's still not getting rid of everything.
Is there a more elegant way for me to get the desired output?
You can filter the previous siblings for NavigableStrings.
For example:
from bs4 import NavigableString
text = soup.find('div', class_ = 'container').find_next('h4', text = 'Title 2')
text = text.previous_siblings
text_list = [t for t in text if type(t) == NavigableString]
text_list will look like:
>>> text_list
[u'\nyet more text I want\n', u'\nMore text I want here\n', u'\n', u'\nText I want is here\n', u'\n']
You can also filter out \n's:
text_list = [t for t in text if type(t) == NavigableString and t != '\n']
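If you then want a single string back in document order, a one-line follow-up (my addition; previous_siblings yields nodes in reverse):

full_text = ' '.join(t.strip() for t in reversed(text_list))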
Other solution: use .find_next_siblings() with text=True (which finds only NavigableString nodes in the tree). Then, on each iteration, check whether the preceding <h4> is the correct one:
from bs4 import BeautifulSoup
txt = '''<div class="container">
<h4>Title 1</h4>
Text I want is here
<br />
<br />
More text I want here
<br />
yet more text I want
<h4>Title 2</h4>
More text here, but I do not want it
<br />
</div>
'''
soup = BeautifulSoup(txt, 'html.parser')
out = []
first_h4 = soup.find('h4')
for t in first_h4.find_next_siblings(text=True):
    if t.find_previous('h4') != first_h4:
        break
    elif t.strip():
        out.append(t.strip())
print(out)
Prints:
['Text I want is here', 'More text I want here', 'yet more text I want']
There are lots of HTML pages which are structured as a sequence of such groups:
<p>
<b> Keywords/Category:</b>
"keyword_a, keyword_b"
</p>
The addresses of these pages are like https://some.page.org/year/0001, https://some.page.org/year/0002, etc.
How can I extract the keywords separately from each of these pages? I've tried to use BeautifulSoup, but unsuccessfully. I've only managed to write a program that prints the titles of the groups (between <b> and </b>).
from bs4 import BeautifulSoup
from urllib2 import urlopen
import re
html_doc = urlopen('https://some.page.org/2018/1234').read()
soup = BeautifulSoup(html_doc)
for link in soup.find_all('a'):
    print 'https://some.page.org' + link.get('href')

for node in soup.findAll('b'):
    print ''.join(node.findAll(text=True))
I can't test this without knowing the actual source format, but it seems you want the <p> tags' text value:
for node in soup.findAll('p'):
    print(node.text)
    # or: keywords = node.text.split(', ')
    #     print(keywords)
You need to split your string, which in this case is the URL, on /, and then you can pick the chunks you want. For example, if the URL is https://some.page.org/year/0001, I use the split function to split the URL on the / sign; that turns it into a list, from which I choose what I need (and can turn back into a string with ''.join()). You can read about the split method in this link.
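A minimal sketch of that idea (the URL is just the example address from the question):

url = 'https://some.page.org/year/0001'
parts = url.split('/')  # ['https:', '', 'some.page.org', 'year', '0001']
year = parts[-2]        # 'year'
page_id = parts[-1]     # '0001'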
There are different ways to HTML parse the desired categories and keywords from this kind of HTML structure, but here is one of the "BeautifulSoup" ways to do it:
- find b elements whose text ends with a colon (:)
- use .next_sibling to get to the next text node, which contains the keywords
Working example:
from bs4 import BeautifulSoup
data = """
<div>
<p>
<b> Category 1:</b>
"keyword_a, keyword_b"
</p>
<p>
<b> Category 2:</b>
"keyword_c, keyword_d"
</p>
</div>
"""
soup = BeautifulSoup(data, "html.parser")
for category in soup('b', text=lambda text: text and text.endswith(":")):
    keywords = category.next_sibling.strip('" \n').split(", ")
    print(category.get_text(strip=True), keywords)
Prints:
Category 1: ['keyword_a', 'keyword_b']
Category 2: ['keyword_c', 'keyword_d']
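If you'd rather collect the results than print them, a small variation (the categories dict is my addition, not part of the original answer):

categories = {}
for category in soup('b', text=lambda text: text and text.endswith(":")):
    keywords = category.next_sibling.strip('" \n').split(", ")
    # drop the trailing colon from the category label
    categories[category.get_text(strip=True).rstrip(':')] = keywords
# {'Category 1': ['keyword_a', 'keyword_b'], 'Category 2': ['keyword_c', 'keyword_d']}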
Assuming for each block
<p>
<b> Keywords/Category:</b>
"keyword_a, keyword_b"
</p>
you want to extract keyword_a and keyword_b for each Keywords/Category. So an example would be:
<p>
<b>Mammals</b>
"elephant, rhino"
</p>
<p>
<b>Birds</b>
"hummingbird, ostrich"
</p>
Once you have the HTML code, you can do:
from bs4 import BeautifulSoup
html = '''<p>
<b>Mammals</b>
"elephant, rhino"
</p>
<p>
<b>Birds</b>
"hummingbird, ostrich"
</p>'''
soup = BeautifulSoup(html, 'html.parser')
p_elements = soup.find_all('p')
for p_element in p_elements:
    # each <p> holds one <b> category followed by the quoted keywords
    b_element = p_element.find('b')
    b_element.extract()
    category = b_element.text.strip()
    keywords = p_element.text.strip()
    keyword_a, keyword_b = keywords[1:-1].split(', ')
    print('Category:', category)
    print('Keyword A:', keyword_a)
    print('Keyword B:', keyword_b)
Which prints:
Category: Mammals
Keyword A: elephant
Keyword B: rhino
Category: Birds
Keyword A: hummingbird
Keyword B: ostrich
I have an XML file from which I would like to extract heading tags (h1, h2, ... as well as their text) which sit between </span> and <span class='classy'> tags (this way around). I want to do this in Python 2.7, and I have tried BeautifulSoup and ElementTree but couldn't work it out.
The file contains sections like this:
<section>
<p>There is some text <span class='classy' data-id='234'></span> and there is more text now.</p>
<h1>A title</h1>
<p>A paragraph</p>
<h2>Some second title</h2>
<p>Another paragraph with random tags like <img />, <table> or <div></p>
<div>More random content</div>
<h2>Other title.</h2>
<p>Then more paragraph <span class='classy' data-id='235'></span> with other stuff.</p>
<h2>New title</h2>
<p>Blhablah, followed by a div like that:</p>
<div class='classy' data-id='236'></div>
<p>More text</p>
<h3>A new title</h3>
</section>
I would like to write in a csv file like this:
data-id,heading.name,heading.text
234,h1,A title
234,h2,Some second title
234,h2,Other title.
235,h2,New title
236,h3,A new title
and ideally I would write this:
id,h1,h2,h3
234,A title,Some second title,
234,A title,Another title,
235,A title,New title,
236,A title,New title,A new title
but then I guess I can always reshape it afterwards.
I have tried to iterate through the file, but I only seem to be able to keep all the text without the heading tags. Also, to make things more annoying, sometimes it is not a span but a div, which has the same class and attributes.
Any suggestion on what would be the best tool for this in Python?
I have two pieces of code that work:
- finding the text with itertools.takewhile
- finding all h1, h2, h3, but without the span id
soup = BeautifulSoup(open(xmlfile, 'r'), 'lxml')
spans = soup('span', {'class': 'page-break'})
for el in spans:
    els = [i for i in itertools.takewhile(lambda x: x not in [el, 'script'], el.next_siblings)]
    print els
This gives me a list of text contained between spans. I wanted to iterate through it, but there are no more html tags.
To find the h1,h2,h3 I used:
with open('titles.csv', 'wb') as f:
    csv_writer = csv.writer(f)
    for header in soup.find_all(['h1', 'h2', 'h3']):
        if header.name == 'h1':
            h1text = header.get_text()
        elif header.name == 'h2':
            h2text = header.get_text()
        elif header.name == 'h3':
            h3text = header.get_text()
        csv_writer.writerow([h1text, h2text, h3text, header.name])
I've now tried with xpath without much luck.
Since it's an xhtml document, I used:
from lxml import etree
with open('myfile.xml', 'rt') as f:
    tree = etree.parse(f)
root = tree.getroot()
spans = root.xpath('//xhtml:span',namespaces={'xhtml':'http://www.w3.org/1999/xhtml'})
This gives me the list of span objects, but I don't know how to iterate between two spans.
Any suggestion?
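For what it's worth, a minimal BeautifulSoup sketch of one way to get the first csv shape above: walk the section's tags in document order, remember the last data-id seen on a 'classy' span or div, and record each heading against it (the file name is assumed from the question):

from bs4 import BeautifulSoup

soup = BeautifulSoup(open('myfile.xml'), 'lxml')
rows = []
current_id = None
for el in soup.section.find_all(True):  # every tag, in document order
    # matches both <span class='classy'> and <div class='classy'>
    if el.get('class') == ['classy'] and el.has_attr('data-id'):
        current_id = el['data-id']
    elif el.name in ('h1', 'h2', 'h3') and current_id is not None:
        rows.append((current_id, el.name, el.get_text(strip=True)))
for row in rows:
    print(','.join(row))  # 234,h1,A title ... 236,h3,A new title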
I am trying to parse all elements under a div using BeautifulSoup. The issue is that I don't know all the elements underneath the div prior to parsing. For example, a div can have text data in paragraph mode and bullet format, along with some href elements. Each URL that I open can have different elements underneath the specific div class that I am looking at:
example:
url a can have following:
<div class='content'>
<p> Hello I have a link </p>
<li> I have a bullet point
foo
</div>
but url b
can have
<div class='content'>
<p> I only have paragraph </p>
</div>
I started as doing something like this:
content = souping_page.body.find('div', attrs={'class': 'content'})
but how to go beyond this is a little confusing. I was hoping to create one string from all the parsed data as an end result.
At the end I want the following string to be obtain from each example:
Example 1: Final Output
parse_data = Hello I have a link I have a bullet point
parse_links = foo.com
Example 2: Final Output
parse_data = I only have paragraph
You can get just the text of an element with element.get_text():
>>> from bs4 import BeautifulSoup
>>> sample1 = BeautifulSoup('''\
... <div class='content'>
... <p> Hello I have a link </p>
...
... <li> I have a bullet point
...
... foo
... </div>
... ''').find('div')
>>> sample2 = BeautifulSoup('''\
... <div class='content'>
... <p> I only have paragraph </p>
...
... </div>
... ''').find('div')
>>> sample1.get_text()
u'\n Hello I have a link \n I have a bullet point\n\nfoo\n'
>>> sample2.get_text()
u'\n I only have paragraph \n'
or you can strip it down a little using element.stripped_strings:
>>> ' '.join(sample1.stripped_strings)
u'Hello I have a link I have a bullet point foo'
>>> ' '.join(sample2.stripped_strings)
u'I only have paragraph'
To get all links, look for all a elements with href attributes and gather these in a list:
>>> [a['href'] for a in sample1.find_all('a', href=True)]
['foo.com']
>>> [a['href'] for a in sample2.find_all('a', href=True)]
[]
The href=True argument limits the search to <a> tags that have a href attribute defined.
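Putting the two pieces together for the exact output the question asks for (parse_data and parse_links are the question's names; sample1 is the soup from above):

parse_data = ' '.join(sample1.stripped_strings)
parse_links = [a['href'] for a in sample1.find_all('a', href=True)]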
Per the Beautiful Soup docs, to iterate over the children of a tag use either .contents to get them as a list or .children (a generator).
for child in title_tag.children:
    print(child)
So, in your case, you could grab the .text of each child tag and concatenate them together. I'm not clear on whether you want the link location or simply the label; if the former, refer to this SO question.
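A minimal sketch of that idea (the sample HTML is made up for illustration):

from bs4 import BeautifulSoup

div = BeautifulSoup("<div class='content'><p>Hello</p><li>bullet</li></div>",
                    'html.parser').find('div')
# .name is None for plain strings, so this keeps only child tags
combined = ' '.join(child.get_text(strip=True) for child in div.children if child.name)
print(combined)  # Hello bullet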
So I have a html like this:
...
<ul class="myclass">
<li>blah</li>
<li>blah2</li>
</ul>
...
I want to get the texts "blah" and "blah2" from the ul with the class name "myclass"
So I tried to use innerhtml(), but for some reason it doesn't work with lxml.
I'm using Python 3.
I would try:
doc.xpath('.//ul[@class = "myclass"]/li/text()')
# out: ["blah", "blah2"]
edit:
what if there was an <a> in the <li>? For example, how would I get "link" and "text" from <li><a href="link">text</a></li>?
link = doc.xpath('.//ul[@class = "myclass"]/li/a/@href')
txt = doc.xpath('.//ul[@class = "myclass"]/li/a/text()')
If you want you can combine those, and if we take @larsmans' example, you can use '//' to get the whole text, because I believe that lxml doesn't support the string() method in an expression.
doc.xpath('.//ul[@class="myclass"]/li[a]//text() | .//ul[@class="myclass"]/li/a/@href')
# out: ['I contain a ', 'http://example.com', 'link', '.']
Also, you can use the text_content() method:
html=\
"""
<html>
<ul class="myclass">
<li>I contain a <a href="http://example.com">link</a>.</li>
<li>blah</li>
<li>blah2</li>
</ul>
</html>
"""
import lxml.html as lh
doc=lh.fromstring(html)
for elem in doc.xpath('.//ul[@class="myclass"]/li'):
    print(elem.text_content())
prints:
#I contain a link.
#blah
#blah2