Headings from XML after span tag - python

I have an XML file from which I would like to extract heading tags (h1, h2, …, together with their text) that appear between `</span> <span class='classy'>` tags (this way around). I want to do this in Python 2.7, and I have tried BeautifulSoup and ElementTree but couldn't work it out.
The file contains sections like this:
<section>
<p>There is some text <span class='classy' data-id='234'></span> and there is more text now.</p>
<h1>A title</h1>
<p>A paragraph</p>
<h2>Some second title</h2>
<p>Another paragraph with random tags like <img />, <table> or <div></p>
<div>More random content</div>
<h2>Other title.</h2>
<p>Then more paragraph <span class='classy' data-id='235'></span> with other stuff.</p>
<h2>New title</h2>
<p>Blhablah, followed by a div like that:</p>
<div class='classy' data-id='236'></div>
<p>More text</p>
<h3>A new title</h3>
</section>
I would like to write in a csv file like this:
data-id,heading.name,heading.text
234,h1,A title
234,h2,Some second title
234,h2,Other title.
235,h2,New title
236,h3,A new title
and ideally I would write this:
id,h1,h2,h3
234,A title,Some second title,
234,A title,Other title,
235,A title,New title,
236,A title,New title,A new title
but then I guess I can always reshape it afterwards.
I have tried to iterate through the file, but I only seem to be able to keep all the text without the heading tags. Also, to make things more annoying, sometimes it is not a span but a div, which has the same class and attributes.
Any suggestion on what would be the best tool for this in Python?
I have two pieces of code that partly work:
- finding the text with itertools.takewhile
- finding all h1, h2, h3, but without the span id.
soup = BeautifulSoup(open(xmlfile, 'r'), 'lxml')
spans = soup('span', {'class': 'page-break'})
for el in spans:
    els = [i for i in itertools.takewhile(lambda x: x not in [el, 'script'], el.next_siblings)]
    print els
This gives me a list of text contained between spans. I wanted to iterate through it, but there are no more html tags.
To find the h1,h2,h3 I used:
with open('titles.csv', 'wb') as f:
    csv_writer = csv.writer(f)
    for header in soup.find_all(['h1', 'h2', 'h3']):
        if header.name == 'h1':
            h1text = header.get_text()
        elif header.name == 'h2':
            h2text = header.get_text()
        elif header.name == 'h3':
            h3text = header.get_text()
        csv_writer.writerow([h1text, h2text, h3text, header.name])
I've now tried with xpath without much luck.
Since it's an xhtml document, I used:
from lxml import etree
with open('myfile.xml', 'rt') as f:
    tree = etree.parse(f)
root = tree.getroot()
spans = root.xpath('//xhtml:span', namespaces={'xhtml': 'http://www.w3.org/1999/xhtml'})
This gives me the list of spans objects but I don't know how to iterate between two spans.
Any suggestion?
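One possible sketch, using the standard library's xml.etree.ElementTree and assuming the file parses as well-formed XML (real files with stray <img /> or unclosed tags would need lxml or BeautifulSoup instead): walk the tree in document order, remember the data-id of the last class='classy' marker seen (which handles both the span and the div variant), and emit a row for every heading. The sample string and the names rows / current_id are my own, abbreviated from the file above.

```python
import xml.etree.ElementTree as ET

# Abbreviated, well-formed sample based on the file in the question
xml = """<section>
<p>There is some text <span class="classy" data-id="234"></span> and more.</p>
<h1>A title</h1>
<h2>Some second title</h2>
<p>Then more <span class="classy" data-id="235"></span> text.</p>
<h2>New title</h2>
<div class="classy" data-id="236"></div>
<h3>A new title</h3>
</section>"""

root = ET.fromstring(xml)
rows = []
current_id = None
for el in root.iter():                  # depth-first, document order
    if el.get('class') == 'classy':     # the marker may be a span or a div
        current_id = el.get('data-id')
    elif el.tag in ('h1', 'h2', 'h3'):
        rows.append((current_id, el.tag, (el.text or '').strip()))

for row in rows:
    print(row)
```

Each row can then be written out with csv.writer, as in the titles.csv snippet above.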

Related

How to highlight text in complex html

I have an html that looks like this:
<h3>
Heading 3
</h3>
<ol>
<li>
<ol>
....
</li>
</ol>
I need to highlight the entire HTML starting from the first ol. I have found this solution:
soup = bs4.BeautifulSoup(open('temp.html').read(), 'lxml')
new_h1 = soup.new_tag('h1')
new_h1.string = 'Hello '
mark = soup.new_tag('mark')
mark.string = 'World'
new_h1.append(mark)
h1 = soup.h1
h1.replace_with(new_h1)
print(soup.prettify())
Is there any way to highlight entire html without having to find out the specific text?
Edit:
This is what I mean by highlighted text
Edit:
I have tried this code but it only highlights the very innermost li:
for node in soup2.findAll('li'):
    if not node.string:
        continue
    value = node.string
    mark = soup2.new_tag('mark')
    mark.string = value
    node.replace_with(mark)
Since I have no clear idea of what your HTML looks like, the following highlights all the <li> content; you can modify this code to suit your requirements.
from bs4 import BeautifulSoup

with open('index.html') as f:
    soup = BeautifulSoup(f.read(), 'html.parser')

# Highlight the <li> content
for li in soup.findAll('li'):
    newtag = soup.new_tag('mark')
    li.string.wrap(newtag)
print(soup)
After Highlighting: https://i.stack.imgur.com/iIbXk.jpg

How can I improve my solution for getting unknown text between tags?

I'm very new to Python and BeautifulSoup and am trying to up my game. Let's say this is my HTML:
<div class="container">
<h4>Title 1</h4>
Text I want is here
<br /> # random break tags inserted throughout
<br />
More text I want here
<br />
yet more text I want
<h4>Title 2</h4>
More text here, but I do not want it
<br />
<ul> # More HTML that I do not want</ul>
</div> # End container div
My expected output is the text between the two H4 tags:
Text I want is here
More text I want here
yet more text I want
But I don't know in advance what this text will say or how much of it there will be. There might be only one line, or there might be several paragraphs. It is not tagged with anything: no p tags, no id, nothing. The only thing I know about it is that it will appear between those two H4 tags.
At the moment, what I'm doing is working backward from the second H4 tag by using .previous_siblings to get everything up to the container div.
text = soup.find('div', class_ = 'container').find_next('h4', text = 'Title 2')
text = text.previous_siblings
text_list = []
for line in text:
    text_list.append(line)
text_list.reverse()
full_text = ' '.join([str(line) for line in text_list])
text = full_text.strip().replace('<h4>Title 1</h4>', '').replace('<br />', '')
This gives me the content I want, but it also gives me a lot more that I don't want, plus it gives it to me backwards, which is why I need to use reverse(). Then I end up having to strip out a lot of stuff using replace().
What I don't like about this is that since my end result is a list, I'm finding it hard to clean up the output. I can't use get_text() on a list. In my real-life version of this I have about ten instances of replace() and it's still not getting rid of everything.
Is there a more elegant way for me to get the desired output?
You can filter the previous siblings for NavigableStrings.
For example:
from bs4 import NavigableString
text = soup.find('div', class_ = 'container').find_next('h4', text = 'Title 2')
text = text.previous_siblings
text_list = [t for t in text if type(t) == NavigableString]
text_list will look like:
>>> text_list
[u'\nyet more text I want\n', u'\nMore text I want here\n', u'\n', u'\nText I want is here\n', u'\n']
You can also filter out \n's:
text_list = [t for t in text if type(t) == NavigableString and t != '\n']
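Since previous_siblings walks backwards through the document, the filtered list comes out in reverse order; a small follow-up sketch reverses and joins it into clean text. The sample list here is an assumption, shaped like the filtered output shown above rather than taken from a real page:

```python
# Sample shaped like the filtered previous_siblings output above
# (reverse document order, surrounding whitespace preserved)
text_list = [u'\nyet more text I want\n',
             u'\nMore text I want here\n',
             u'\nText I want is here\n']

full_text = '\n'.join(t.strip() for t in reversed(text_list))
print(full_text)
```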
Other solution: use .find_next_siblings() with text=True (which finds only NavigableString nodes in the tree), then on each iteration check whether the previous <h4> is the correct one:
from bs4 import BeautifulSoup
txt = '''<div class="container">
<h4>Title 1</h4>
Text I want is here
<br />
<br />
More text I want here
<br />
yet more text I want
<h4>Title 2</h4>
More text here, but I do not want it
<br />
</div>
'''
soup = BeautifulSoup(txt, 'html.parser')
out = []
first_h4 = soup.find('h4')
for t in first_h4.find_next_siblings(text=True):
    if t.find_previous('h4') != first_h4:
        break
    elif t.strip():
        out.append(t.strip())
print(out)
Prints:
['Text I want is here', 'More text I want here', 'yet more text I want']

Extract text from an href recursively with Scrapy

We have the following HTML:
<a class="link contact-info__link" href="tel:+99999999999">
<svg class="icon icon--telephone contact-info__link-icon contact-info__link-icon--phone">
<use xlink:href="/local/templates/.default/img/icon-font/icon-font.svg#icon-phone"></use>
</svg>
<span class="contact-info__link-text">+9 (999) 999-99-99</span>
</a>
I need to get this dictionary:
{"tel:+99999999999": "+9 (999) 999-99-99"}
That is, I need the href link and the respective text, regardless of how many "child" tags there are after the href. In this case, I need the href link itself and the text in the span, but consider that it could be span or any other type of tag.
I am currently using this code to get all href + text from any page (as this is the goal):
for r in response.css('a'):
    url = r.css('::attr(href)').get()
    txt = r.css('::text').get()
That works for the simple case where the link text sits directly inside the <a> tag (e.g. a plain link whose text is "This is my phone"). But when the text is nested in child tags, as in the first snippet, it just returns this:
{"tel:+99999999999": "\n"}
To get the whole text under a tag you can use the getall() method and then join all the text into one string. For example:
url = r.css('::attr(href)').get()
txt = r.css('::text').getall()
txt = ''.join([t.strip() for t in txt if t.strip()]) if txt else txt
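To see what that join line does on its own, here it is applied to a stand-alone sample. The list is an assumption, shaped like what getall() returns for the anchor above (every text node under the tag, including whitespace-only ones):

```python
# Plausible shape of r.css('::text').getall() for the <a> in the question
texts = ['\n', '\n', '+9 (999) 999-99-99', '\n']

# Strip each piece, drop whitespace-only pieces, join the rest
joined = ''.join(t.strip() for t in texts if t.strip()) if texts else texts
print(joined)
```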
Try this:
tel_s = response.css('.link.contact-info__link')
yield {tel_s.css('::attr(href)').get(): tel_s.css('span::text').get()}
output:
{"tel:+99999999999": "+9 (999) 999-99-99"}

Beautifulsoup get text based on nextSibling tag name

I'm scraping multiple pages that all have a similar format, but it changes a little here and there and there are no classes to use to search for what I need.
The format looks like this:
<div id="mainContent">
<p>Some Text I don't want</p>
<p>Some Text I don't want</p>
<p>Some Text I don't want</p>
<span> More text I don't want</span>
<ul>...unordered-list items..</ul>
<p>Text I WANT</p>
<ol>...ordered-list items..</ol>
<p>Text I WANT</p>
<ol>...ordered-list items..</ol>
</div>
The number of ordered/unordered lists and other tags changes depending on the page, but what stays the same is I always want the text from the <p> tag that is the previous sibling of the <ol> tag.
What I'm trying (and isn't working) is:
main = soup.find("div", {"id": "mainContent"})
for d in main.children:
    if d.name == 'p' and d.nextSibling.name == 'ol':
        print(d.text)
    else:
        print("fail")
The output of this is fail for every iteration. In trying to figure out why this isn't working, I tried:
for d in main.children:
    if d.name == 'p':
        print(d.nextSibling.name)
    else:
        print("fail")
The output of this is something like:
fail
None
fail
None
fail
None
fail
fail
fail
fail
fail
None
fail
etc...
Why is this not working like I think it would? How could I get the text from a <p> element only if the next tag is <ol>?
You want only the p tags that come immediately before an ol tag. Find the ol tags first, then walk back to the previous Tag object, which in this case is the p tag. Your code is not working because there are newlines between the Tag elements, and those newlines are NavigableString objects; d.nextSibling yields them as well. So you have to check the type of the object here.
from bs4 import Tag
# create soup
# find the ols
ols = soup.find_all('ol')
for ol in ols:
    prev = ol.previous_sibling
    while not isinstance(prev, Tag):
        prev = prev.previous_sibling
    print(prev.text)
This will give you the text you want.
Text I WANT
Text I WANT
You can use a CSS selector, i.e. ul ~ p, to find all the p tags preceded by the ul:
html = """<div id="mainContent">
<p>Some Text I don't want</p>
<p>Some Text I don't want</p>
<p>Some Text I don't want</p>
<span> More text I don't want</span>
<ul>...unordered-list items..</ul>
<p>Text I WANT</p>
<ol>...ordered-list items..</ol>
<p>Text I WANT</p>
<ol>...ordered-list items..</ol>
</div>"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
print([p.text for p in soup.select("#mainContent ul ~ p")])
Which will give you:
['Text I WANT', 'Text I WANT']
Or find the ol's and then look for the previous sibling p:
print([ol.find_previous_sibling("p").text for ol in soup.select("#mainContent ol")])
Which would also give you:
['Text I WANT', 'Text I WANT']

Combine multiple tags with lxml

I have an html file which looks like:
...
<p>
<strong>This is </strong>
<strong>a lin</strong>
<strong>e which I want to </strong>
<strong>join.</strong>
</p>
<p>
2.
<strong>But do not </strong>
<strong>touch this</strong>
<em>Maybe some other tags as well.</em>
bla bla blah...
</p>
...
What I need is, if all the tags in a 'p' block are 'strong', then combine them into one line, i.e.
<p>
<strong>This is a line which I want to join.</strong>
</p>
Without touching the other block since it contains something else.
Any suggestions? I am using lxml.
UPDATE:
So far I tried:
for p in self.tree.xpath('//body/p'):
    if p.tail is None:  # no text before first element
        children = p.getchildren()
        for child in children:
            if len(children) == 1 or child.tag != 'strong' or child.tail is not None:
                break
        else:
            etree.strip_tags(p, 'strong')
With this code I was able to strip the strong tags from the desired part, giving:
<p>
This is a line which I want to join.
</p>
So now I just need a way to put the tag back in...
I was able to do this with bs4 (BeautifulSoup):
from bs4 import BeautifulSoup as bs
html = """<p>
<strong>This is </strong>
<strong>a lin</strong>
<strong>e which I want to </strong>
<strong>join.</strong>
</p>
<p>
<strong>But do not </strong>
<strong>touch this</strong>
</p>"""
soup = bs(html)
s = ''
# note that I use the 0th <p> block ...[0],
# so make the appropriate change in your code
for t in soup.find_all('p')[0].text:
    s = s + t.strip('\n')
s = '<p><strong>' + s + '</strong></p>'
print s  # prints: <p><strong>This is a line which I want to join.</strong></p>
Then use replace_with():
p_tag = soup.p
p_tag.replace_with(bs(s, 'html.parser'))
print soup
prints:
<html><body><p><strong>This is a line which I want to join.</strong></p>
<p>
<strong>But do not </strong>
<strong>touch this</strong>
</p></body></html>
I have managed to solve my own problem.
for p in self.tree.xpath('//body/p'):
    if p.tail is None:  # some conditions specifically for my doc
        children = p.getchildren()
        if len(children) > 1:
            for child in children:
                # if other stuff is present, break
                if child.tag != 'strong' or child.tail is not None:
                    break
            else:
                # If we didn't break, we found a p block to fix:
                # strip the tags inside p, then put a SubElement back in
                etree.strip_tags(p, 'strong')
                tmp_text = p.text_content()
                p.clear()
                subtext = etree.SubElement(p, "strong")
                subtext.text = tmp_text
Special thanks to @Scott, who helped me come down to this solution. Although I cannot mark his answer correct, I appreciate his guidance no less.
Alternatively, you can use a more specific XPath to get the targeted p elements directly:
p_target = """
//p[strong]
[not(*[not(self::strong)])]
[not(text()[normalize-space()])]
"""
for p in self.tree.xpath(p_target):
    # logic inside the loop can also be the same as in your `else` block
    content = p.xpath("normalize-space()")
    p.clear()
    strong = etree.SubElement(p, "strong")
    strong.text = content
A brief explanation of the XPath being used:
- //p[strong]: find p elements, anywhere in the XML/HTML document, having a child element strong...
- [not(*[not(self::strong)])]: ...and not having any child element other than strong...
- [not(text()[normalize-space()])]: ...and not having a non-empty text node child.
- normalize-space(): get all text nodes from the current context element, concatenated, with runs of consecutive whitespace normalized to a single space.
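A self-contained run of that XPath, for illustration; the sample markup is made up, abbreviated from the question:

```python
from lxml import etree

# Abbreviated sample based on the question's markup (an assumption)
html = """<div>
<p><strong>This is </strong><strong>a line to join.</strong></p>
<p>2. <strong>But do not </strong><strong>touch this</strong></p>
</div>"""

root = etree.fromstring(html)
# Select p elements whose children are all strong and whose own text is blank
target = "//p[strong][not(*[not(self::strong)])][not(text()[normalize-space()])]"
for p in root.xpath(target):
    content = p.xpath("normalize-space()")   # merged text of the strongs
    p.clear()
    strong = etree.SubElement(p, "strong")
    strong.text = content

out = etree.tostring(root).decode()
print(out)
```

Only the first p matches; the second is skipped because of its leading "2. " text node.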
