python xpath parsing of xml avoiding <lb/> - python

I am using xpath to parse an xml file
from lxml import etree
example='''<div n="0001" type="car" xml:id="_3a327f0002">
<p xml:id="_3a327f0003">
1. A car is
<p xml:id="_3a327f0004"> - big, yellow and red;</p>
<p xml:id="_3a327f0005"> - has a big motor;</p>
<p xml:id="_3a327f0006"> - and also has <lb/>
big seats.
</p>
</p>
</div>'''
I want to serialize the above XML file in the following way:
{"_3a327f0003": "1. A car is",
"_3a327f0004": "- big, yellow and red;",
"_3a327f0005": "- has a big motor;",
"_3a327f0006": "- and also has big seats"}
Basically, I want to extract the text and build a dictionary in which each text is keyed by its xml:id. My code is as follows:
parser = etree.XMLParser(resolve_entities=False, strip_cdata=False, recover=True, ns_clean=True)
XML_tree = etree.fromstring(example.encode() , parser=parser)
all_paras = XML_tree.xpath('.//p[@xml:id]')
list_of_paragraphs = []
for para in all_paras:
    mydict = {}
    mydict['text'] = para.text
    for att in para.attrib:
        mykey=att
        if 'id' in mykey:
            mykey='xmlid'
        mydict[mykey] = para.attrib[att]
    list_of_paragraphs.append(mydict)
PDM_XML_serializer(example)
It works except for the fact that if I have a node like:
<p xml:id="_3a327f0006"> - and also has <lb/>
big seats.
</p>
It will not extract the text that follows <lb/> ("big seats.").
How should I modify:
XML_tree.xpath('.//p[@xml:id]')
in order to get all the text from <p> to </p>?
EDIT:
para.itertext() could be used but then the first node will give back all the text of the other nodes as well.

Using xml.etree.ElementTree
import xml.etree.ElementTree as ET
xml = '''<div n="0001" type="car" xml:id="_3a327f0002">
<p xml:id="_3a327f0003">
1. A car is
<p xml:id="_3a327f0004"> - big, yellow and red;</p>
<p xml:id="_3a327f0005"> - has a big motor;</p>
<p xml:id="_3a327f0006"> - and also has <lb/>
big seats.
</p>
</p>
</div>'''
def _get_element_txt(element):
    txt = element.text
    children = list(element)
    if children:
        txt += children[0].tail.strip()
    return txt
root = ET.fromstring(xml)
data = {p.attrib['{http://www.w3.org/XML/1998/namespace}id']: _get_element_txt(p)
        for p in root.findall('.//p/p')}
for k, v in data.items():
    print(f'{k} --> {v}')
Output:
_3a327f0004 --> - big, yellow and red;
_3a327f0005 --> - has a big motor;
_3a327f0006 --> - and also has big seats.

Using lxml.etree, parse all the elements in all_paras with a list/dict comprehension. Since your XML uses the special xml prefix and lxml does not yet support parsing a namespace prefix in attributes (see @mzjn's answer here), the code below uses a workaround with next + iter to retrieve the attribute value.
Additionally, to retrieve all text values between nodes, xpath("text()") is used with str.strip and .join to clean up whitespace and line breaks and concatenate together.
from lxml import etree
example='''<div n="0001" type="car" xml:id="_3a327f0002">
<p xml:id="_3a327f0003">
1. A car is
<p xml:id="_3a327f0004"> - big, yellow and red;</p>
<p xml:id="_3a327f0005"> - has a big motor;</p>
<p xml:id="_3a327f0006"> - and also has <lb/>
big seats.
</p>
</p>
</div>'''
XML_tree = etree.fromstring(example)
all_paras = XML_tree.xpath('.//p[@xml:id]')
output = {
    next(iter(t.attrib.values())): " ".join(i.strip()
                                            for i in t.xpath("text()")).strip()
    for t in all_paras
}
output
# {
# '_3a327f0003': '1. A car is',
# '_3a327f0004': '- big, yellow and red;',
# '_3a327f0005': '- has a big motor;',
# '_3a327f0006': '- and also has big seats.'
# }

You could use lxml itertext() to get text content of the p element:
mydict['text'] = ''.join(para.itertext())
See this question as well for more generic solution.
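As the EDIT in the question notes, itertext() on the outer p also returns the text of the nested p elements. A minimal sketch of one way around that is to join only the element's own text with the tail text of its direct children. It uses the standard-library xml.etree.ElementTree rather than lxml, and the names own_text and XML_ID are made up for this illustration:

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the fixed XML namespace URI.
XML_ID = '{http://www.w3.org/XML/1998/namespace}id'

example = '''<div n="0001" type="car" xml:id="_3a327f0002">
<p xml:id="_3a327f0003">
1. A car is
<p xml:id="_3a327f0004"> - big, yellow and red;</p>
<p xml:id="_3a327f0005"> - has a big motor;</p>
<p xml:id="_3a327f0006"> - and also has <lb/>
big seats.
</p>
</p>
</div>'''


def own_text(element):
    """Join an element's own text with the tail text of its direct children.

    Text inside the children themselves (e.g. nested <p>) is not included.
    """
    parts = [element.text or '']
    parts.extend(child.tail or '' for child in element)
    # split() + join collapses line breaks and runs of whitespace.
    return ' '.join(''.join(parts).split())


root = ET.fromstring(example)
result = {p.get(XML_ID): own_text(p) for p in root.iter('p')}
for key, value in result.items():
    print(key, '-->', value)
```

Because the tail of <lb/> belongs to the enclosing p, "big seats." ends up under _3a327f0006, while the outer p keeps only "1. A car is".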

This modifies the xpath to exclude the "A car is" text as per your example. It also uses the xpath functions string and normalize-space to evaluate the para node as a string and join its text nodes, as well as clean up the text to match your example.
from lxml import etree
example='''<div n="0001" type="car" xml:id="_3a327f0002">
<p xml:id="_3a327f0003">
1. A car is
<p xml:id="_3a327f0004"> - big, yellow and red;</p>
<p xml:id="_3a327f0005"> - has a big motor;</p>
<p xml:id="_3a327f0006"> - and also has <lb/>
big seats.
</p>
</p>
</div>'''
parser = etree.XMLParser(resolve_entities=False, strip_cdata=False, recover=True, ns_clean=True)
XML_tree = etree.fromstring(example.encode() , parser=parser)
all_paras = XML_tree.xpath('./p/p[@xml:id]')
list_of_paragraphs = []
for para in all_paras:
    mydict = {}
    mydict['text'] = para.xpath('normalize-space(string(.))')
    for att in para.attrib:
        mykey=att
        if 'id' in mykey:
            mykey='xmlid'
        mydict[mykey] = para.attrib[att]
    list_of_paragraphs.append(mydict)
PDM_XML_serializer(example)

If these tags are just noise for you, you can simply remove them before reading the xml
XML_tree = etree.fromstring(example.replace('<lb/>', '').encode() , parser=parser)
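Note that a literal replace only removes the exact spelling <lb/>. As a rough sketch, a regular expression also catches variants such as <lb />. This example uses the standard-library re and xml.etree.ElementTree instead of lxml, and the sample string is a shortened stand-in for the XML above:

```python
import re
import xml.etree.ElementTree as ET

# Shortened, assumed sample: one paragraph with a spaced-out <lb />.
sample = '<p> - and also has <lb />\nbig seats.\n</p>'

# Drop every empty <lb> element, with or without whitespace before the slash.
cleaned = re.sub(r'<lb\s*/>', '', sample)

p = ET.fromstring(cleaned)
# With the <lb/> gone, all of the text sits in p.text; collapse the whitespace.
text = ' '.join(p.text.split())
print(text)
```

Editing markup as a string is blunt: it would also touch a literal <lb/> inside a comment or CDATA section, so it only suits input you control.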

Related

How can I improve my solution for getting unknown text between tags?

I'm very new to Python and BeautifulSoup and trying to up my game. Let's say this is my HTML:
<div class="container">
<h4>Title 1</h4>
Text I want is here
<br /> # random break tags inserted throughout
<br />
More text I want here
<br />
yet more text I want
<h4>Title 2</h4>
More text here, but I do not want it
<br />
<ul> # More HTML that I do not want</ul>
</div> # End container div
My expected output is the text between the two H4 tags:
Text I want is here
More text I want here
yet more text I want
But I don't know in advance what this text will say or how much of it there will be. There might be only one line, or there might be several paragraphs. It is not tagged with anything: no p tags, no id, nothing. The only thing I know about it is that it will appear between those two H4 tags.
At the moment, what I'm doing is working backward from the second H4 tag by using .previous_siblings to get everything up to the container div.
text = soup.find('div', class_ = 'container').find_next('h4', text = 'Title 2')
text = text.previous_siblings
text_list = []
for line in text:
    text_list.append(line)
text_list.reverse()
full_text = ' '.join([str(line) for line in text_list])
text = full_text.strip().replace('<h4>Title 1</h4>', '').replace('<br />', '')
This gives me the content I want, but it also gives me a lot more that I don't want, plus it gives it to me backwards, which is why I need to use reverse(). Then I end up having to strip out a lot of stuff using replace().
What I don't like about this is that since my end result is a list, I'm finding it hard to clean up the output. I can't use get_text() on a list. In my real-life version of this I have about ten instances of replace() and it's still not getting rid of everything.
Is there a more elegant way for me to get the desired output?
You can filter the previous siblings for NavigableStrings.
For example:
from bs4 import NavigableString
text = soup.find('div', class_ = 'container').find_next('h4', text = 'Title 2')
text = text.previous_siblings
text_list = [t for t in text if type(t) == NavigableString]
text_list will look like:
>>> text_list
[u'\nyet more text I want\n', u'\nMore text I want here\n', u'\n', u'\nText I want is here\n', u'\n']
You can also filter out \n's:
text_list = [t for t in text if type(t) == NavigableString and t != '\n']
Another solution: use .find_next_siblings() with text=True (which finds only NavigableString nodes in the tree). Then, on each iteration, check whether the previous <h4> is still the correct one:
from bs4 import BeautifulSoup
txt = '''<div class="container">
<h4>Title 1</h4>
Text I want is here
<br />
<br />
More text I want here
<br />
yet more text I want
<h4>Title 2</h4>
More text here, but I do not want it
<br />
</div>
'''
soup = BeautifulSoup(txt, 'html.parser')
out = []
first_h4 = soup.find('h4')
for t in first_h4.find_next_siblings(text=True):
    if t.find_previous('h4') != first_h4:
        break
    elif t.strip():
        out.append(t.strip())
print(out)
Prints:
['Text I want is here', 'More text I want here', 'yet more text I want']
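If BeautifulSoup is not available, the same "collect text between two headings" idea can be sketched with the standard-library html.parser. This is a swapped-in technique, not what the answers above use, and the class name BetweenHeadings is invented for the example:

```python
from html.parser import HTMLParser

html = '''<div class="container">
<h4>Title 1</h4>
Text I want is here
<br />
<br />
More text I want here
<br />
yet more text I want
<h4>Title 2</h4>
More text here, but I do not want it
<br />
</div>
'''


class BetweenHeadings(HTMLParser):
    """Collect non-empty text chunks between two <h4> headings."""

    def __init__(self, start_title, end_title):
        super().__init__()
        self.start_title = start_title
        self.end_title = end_title
        self.in_h4 = False
        self.collecting = False
        self.lines = []

    def handle_starttag(self, tag, attrs):
        if tag == 'h4':
            self.in_h4 = True

    def handle_endtag(self, tag):
        if tag == 'h4':
            self.in_h4 = False

    def handle_data(self, data):
        text = data.strip()
        if self.in_h4:
            # Heading text switches collection on or off.
            if text == self.start_title:
                self.collecting = True
            elif text == self.end_title:
                self.collecting = False
        elif self.collecting and text:
            self.lines.append(text)


parser = BetweenHeadings('Title 1', 'Title 2')
parser.feed(html)
parser.close()
print(parser.lines)
```

The parser never builds a tree, so there is nothing to reverse or clean up afterwards; the trade-off is that it depends on the heading texts matching exactly.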

Python/BeautifulSoup - Can't match a tag containing specific text

I've read tons of posts on Stack Overflow about the same problem, but none of the solutions works for me.
Tag I need to select:
<p class="line">Actors: Actor 1</p>
The page is full of p tags with class="line" so I'm trying to match it by looking for the ones that contains the string "Actors: ":
data.find('p', attrs={'class':'line'}, text=re.compile(r'^Actors: $'))
This doesn't match anything. What would be the right syntax?
I don't think it can be done with a single expression.
import re
from bs4 import BeautifulSoup
s="""
<p class="line">Actors: Actor 1</p>
<p class="line">Other: Stunt 0</p>
<p class="line">Actors: Actor 3</p>
<p class="role-line">Actors: Actor 4</p>
"""
soup = BeautifulSoup(s, 'html.parser')
This works:
soup.findAll(attrs={'class':'line'})
This too:
soup.findAll(string=re.compile(r'^Actors'))
But combining both doesn't work; whether that is a bug or simply unsupported, I don't know:
soup.findAll(attrs={'class':'line'}, string=re.compile(r'^Actors'))
But you have alternatives.
Using set intersection:
(set([node.parent for node in soup.findAll(string=re.compile(r'^Actors'))]) &
 set(soup.findAll(attrs={'class':'line'})))
Result:
{<p class="line">Actors: Actor 3</p>,
<p class="line">Actors: Actor 1</p>}
Using findParents:
[node.findParents('p', class_='line') for node in \
soup.findAll(string=re.compile(r'^Actors'))]
Result: (needs some filtering)
[[<p class="line">Actors: Actor 1</p>],
[<p class="line">Actors: Actor 3</p>],
[]]
Using loops and conditions:
for p in [node.parent for node in soup.findAll(text=re.compile(r'^Actors'))]:
    if 'line' not in p.attrs['class']:
        continue
    print(p)
Result:
<p class="line">Actors: Actor 1</p>
<p class="line">Actors: Actor 3</p>
Note: string is the new name for the text parameter in BeautifulSoup 4.4+
import bs4
html = '''<p class="line">Actors: Actor 1</p>'''
soup = bs4.BeautifulSoup(html, 'lxml')
soup.p.text
out:
'Actors: Actor 1'
p tag has multiple text field.
soup.find('p', attrs={'class':'line'}, string=None)
soup.find('p', attrs={'class':'line'}, text=None)
out:
<p class="line">Actors: Actor 1</p>
The reason why text/string=None matches the p tag is this:
when text/string is used as a filter in find(), p.string is used to get the string of the p tag, and the p tag has multiple text fields.
If a tag contains more than one thing, then it’s not clear what
.string should refer to, so .string is defined to be None
But you can find the text first, and get the element before the text:
soup.find(text='Actors: ').previous_element
In this case find() is given only a text filter, so it behaves like the search function of a text editor.
I had a lot of confusion with this too!
soup.findAll('p', {'class': 'line'}, text='Actors: ')
That should return the right element. I believe you can also filter on an id instead of the class.
Hope that works. It did in my test.
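One point none of the answers spells out is why the pattern in the question cannot match: the trailing $ demands that the string end right after "Actors: ". A small standard-library check, independent of BeautifulSoup, makes the difference visible (the variable names are only for illustration):

```python
import re

text = 'Actors: Actor 1'

# Pattern from the question: "$" requires the string to END after "Actors: ".
strict = re.compile(r'^Actors: $')
# Without "$" the pattern only requires the string to START with "Actors: ".
prefix = re.compile(r'^Actors: ')

matches_strict = strict.search(text) is not None
matches_prefix = prefix.search(text) is not None

print(matches_strict)  # False
print(matches_prefix)  # True
```

Dropping the $ from the compiled pattern is therefore the first thing to try before reaching for the workarounds above.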

Retrieving tail text from html

Python 2.7 using lxml
I have some annoyingly formed html that looks like this:
<td>
<b>"John"
</b>
<br>
"123 Main st.
"
<br>
"New York
"
<b>
"Sally"
</b>
<br>
"101 California St.
"
<br>
"San Francisco
"
</td>
So basically it's a single td with a ton of stuff in it. I'm trying to compile a list or dict of the names and their addresses.
So far what I've done is gotten a list of nodes with names using tree.xpath('//td/b'). So let's assume I'm currently on the b node for John.
I'm trying to get whatever.xpath('string()') for everything following the current node but preceding the next b node (Sally). I've tried a bunch of different xpath queries but can't seem to get this right. In particular, any time I use an and operator in an expression that has no [] brackets, it returns a bool rather than a list of all nodes meeting the conditions. Can anyone help out?
This should work:
from lxml import etree
p = etree.HTMLParser()
html = open(r'./test.html','r')
data = html.read()
tree = etree.fromstring(data, p)
my_dict = {}
for b in tree.iter('b'):
    br = b.getnext().tail.replace('\n', '')
    my_dict[b.text.replace('\n', '')] = br
print(my_dict)
This code prints:
{'"John"': '"123 Main st."', '"Sally"': '"101 California St."'}
(You may want to strip the quotation marks out!)
Rather than using xpath, you could use one of lxml's parsers in order to easily navigate the HTML. The parser will turn the HTML document into an "etree", which you can navigate with provided methods. The lxml module provides a method called iter() which allows you to pass in a tag name and receive all elements in the tree with that name. In your case, if you use this to obtain all of the <b> elements, you can then manually navigate to the <br> element and retrieve its tail text, which contains the information you need. You can find information about this in the "Elements contain text" header of the lxml.etree tutorial.
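To make the .text versus .tail distinction concrete, here is a minimal sketch using the standard-library xml.etree.ElementTree, which models text the same way. It assumes a cleaned-up, well-formed version of the fragment (self-closed <br/>, no stray quotes), which the real page is not:

```python
import xml.etree.ElementTree as ET

# Assumed, simplified fragment: well-formed and without the quote noise.
fragment = (
    '<td><b>John</b><br/>123 Main st.<br/>New York'
    '<b>Sally</b><br/>101 California St.<br/>San Francisco</td>'
)

td = ET.fromstring(fragment)
records = {}
current = None
for child in td:
    if child.tag == 'b':
        # The name is the text INSIDE the <b> element.
        current = child.text.strip()
        records[current] = []
    elif child.tag == 'br' and current is not None:
        # Each address line is the text FOLLOWING a <br/>, i.e. its tail.
        records[current].append((child.tail or '').strip())

print(records)
```

Each <b> starts a new record and every following <br/> contributes its tail, so both address lines are kept rather than only the first.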
Why not use the getchildren function on each td? For example:
from lxml import html
s = """
<td>
<b>"John"
</b>
<br>
"123 Main st.
"
<br>
"New York
"
<b>
"Sally"
</b>
<br>
"101 California St.
"
<br>
"San Francisco
"
</td>
"""
records = []
cur_record = -1
cur_field = 1
FIELD_NAME = 0
FIELD_STREET = 1
FIELD_CITY = 2
doc = html.fromstring(s)
td = doc.xpath('//td')[0]
for child in td.getchildren():
    if child.tag == 'b':
        cur_record += 1
        record = dict()
        record['name'] = child.text.strip()
        records.append(record)
        cur_field = 1
    elif child.tag == 'br':
        if cur_field == FIELD_STREET:
            records[cur_record]['street'] = child.tail.strip()
            cur_field += 1
        elif cur_field == FIELD_CITY:
            records[cur_record]['city'] = child.tail.strip()
And the results are:
records = [
{'city': '"New York\n"', 'name': '"John"\n', 'street': '"123 Main st.\n"'},
{'city': '"San Francisco\n"', 'name': '\n"Sally"\n', 'street': '"101 California St.\n"'}
]
Note that you should use tag.tail to get the text that follows an unclosed HTML tag such as <br>.
Hope this would be helpful.

Headings from XML after span tag

I have an XML file from which I would like to extract heading tags (h1, h2, .. as well as its text) which are between </span> <span class='classy'> tags (this way around). I want to do this in Python 2.7, and I have tried beautifulsoup and elementtree but couldn't work it out.
The file contains sections like this:
<section>
<p>There is some text <span class='classy' data-id='234'></span> and there is more text now.</p>
<h1>A title</h1>
<p>A paragraph</p>
<h2>Some second title</h2>
<p>Another paragraph with random tags like <img />, <table> or <div></p>
<div>More random content</div>
<h2>Other title.</h2>
<p>Then more paragraph <span class='classy' data-id='235'></span> with other stuff.</p>
<h2>New title</h2>
<p>Blhablah, followed by a div like that:</p>
<div class='classy' data-id='236'></div>
<p>More text</p>
<h3>A new title</h3>
</section>
I would like to write in a csv file like this:
data-id,heading.name,heading.text
234,h1,A title
234,h2,Some second title
234,h2,Another title.
235,h2,New title
236,h3,A new title
and ideally I would write this:
id,h1,h2,h3
234,A title,Some second title,
234,A title,Another title,
235,A title,New title,
236,A title,New title,A new title
but then I guess I can always reshape it afterwards.
I have tried to iterate through the file, but I only seem to be able to keep all the text without the heading tags. Also, to make things more annoying, sometimes it is not a span but a div, which has the same class and attributes.
Any suggestion on what would be the best tool for this in Python?
I've two pieces of code that work:
- finding the text with itertools.takewhile
- finding all h1,h2,h3 but without the span id.
soup = BeautifulSoup(open(xmlfile,'r'),'lxml')
spans = soup('span',{'class':'page-break'})
for el in spans:
    els = [i for i in itertools.takewhile(lambda x:x not in [el,'script'],el.next_siblings)]
    print(els)
This gives me a list of text contained between spans. I wanted to iterate through it, but there are no more html tags.
To find the h1,h2,h3 I used:
with open('titles.csv','wb') as f:
    csv_writer = csv.writer(f)
    for header in soup.find_all(['h1','h2','h3']):
        if header.name == 'h1':
            h1text = header.get_text()
        elif header.name == 'h2':
            h2text = header.get_text()
        elif header.name == 'h3':
            h3text = header.get_text()
        csv_writer.writerow([h1text,h2text,h3text,header.name])
I've now tried with xpath without much luck.
Since it's an xhtml document, I used:
from lxml import etree
with open('myfile.xml', 'rt') as f:
    tree = etree.parse(f)
root = tree.getroot()
spans = root.xpath('//xhtml:span',namespaces={'xhtml':'http://www.w3.org/1999/xhtml'})
This gives me the list of spans objects but I don't know how to iterate between two spans.
Any suggestion?
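One possible approach, sketched here under several assumptions, is to walk the whole document in order, remember the data-id of the most recent element with class 'classy' (span or div), and record every heading seen after it. The sample below is a simplified, well-formed version of the section without a namespace; with real XHTML the tag names would carry the {http://www.w3.org/1999/xhtml} prefix. It uses the standard-library xml.etree.ElementTree and csv:

```python
import csv
import io
import xml.etree.ElementTree as ET

# Assumed, simplified sample: well-formed and without an XHTML namespace.
sample = """<section>
<p>There is some text <span class='classy' data-id='234'></span> and more.</p>
<h1>A title</h1>
<p>A paragraph</p>
<h2>Some second title</h2>
<h2>Other title.</h2>
<p>Then more <span class='classy' data-id='235'></span> stuff.</p>
<h2>New title</h2>
<div class='classy' data-id='236'></div>
<h3>A new title</h3>
</section>"""

HEADINGS = {'h1', 'h2', 'h3'}

rows = []
current_id = None
# iter() visits elements in document order, so a marker is always seen
# before the headings that follow it.
for el in ET.fromstring(sample).iter():
    if el.get('class') == 'classy':
        current_id = el.get('data-id')
    elif el.tag in HEADINGS and current_id is not None:
        rows.append([current_id, el.tag, ''.join(el.itertext()).strip()])

# Written to an in-memory buffer here; a real file would work the same way.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(['data-id', 'heading.name', 'heading.text'])
writer.writerows(rows)
print(buf.getvalue())
```

Because only the class attribute is checked, it does not matter whether the marker is a span or a div. The wide id,h1,h2,h3 layout could then be built by reshaping these rows.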

How to split the tags from html tree

This is my html tree
<li class="taf"><h3><a href="26eOfferCode%3DGSONESTP-----------" id="pa1">
Citibank <b>Credit Card</b> - Save over 5% on fuel | Citibank.co.in</a>
</h3>Get the IndianOil Citibank <b>Card</b>. Apply Now!
<br />
Get 10X Rewards On Shopping -
Save Over 5% On Fuel
<br />
<cite>www.citibank.co.in/<b>CreditCards</b></cite>
</li>
From this HTML I need to extract the lines before each <br> tag:
line1 : Get the IndianOil Citibank Card. Apply Now!
line2 : Get 10X Rewards On Shopping - Save Over 5% On Fuel
How would I do this in Python?
I think you just asked for the line before each <br/>.
The following code will do it for the sample you've provided, by stripping out the <b> and <a> tags and printing the .tail of each element whose following sibling is a <br/>.
from lxml import etree
doc = etree.HTML("""
<li class="taf"><h3><a href="26eOfferCode%3DGSONESTP-----------" id="pa1">
Citibank <b>Credit Card</b> - Save over 5% on fuel | Citibank.co.in</a>
</h3>Get the IndianOil Citibank <b>Card</b>. Apply Now!
<br />
Get 10X Rewards On Shopping -
Save Over 5% On Fuel
<br />
<cite>www.citibank.co.in/<b>CreditCards</b></cite>
</li>""")
etree.strip_tags(doc,'a','b')
for element in doc.xpath('//*[following-sibling::*[name()="br"]]'):
    print(repr(element.tail.strip()))
Yields:
'Get the IndianOil Citibank Card. Apply Now!'
'Get 10X Rewards On Shopping -\n Save Over 5% On Fuel'
As with all parsing of HTML you need to make some assumptions about the format of the HTML. If we can assume that the previous line is everything before the <br> tag up to a block level tag, or another <br> then we can do the following...
from BeautifulSoup import BeautifulSoup
doc = """
<li class="taf"><h3><a href="26eOfferCode%3DGSONESTP-----------" id="pa1">
Citibank <b>Credit Card</b> - Save over 5% on fuel | Citibank.co.in</a>
</h3>Get the IndianOil Citibank <b>Card</b>. Apply Now!
<br />
Get 10X Rewards On Shopping -
Save Over 5% On Fuel
<br />
<cite>www.citibank.co.in/<b>CreditCards</b></cite>
</li>
"""
soup = BeautifulSoup(doc)
Now we have parsed the HTML, next we define the list of tags we don't want to treat as part of the line. There are other block tags really, but this does for this HTML.
block_tags = ["div", "p", "h1", "h2", "h3", "h4", "h5", "h6", "br"]
We cycle through each <br> tag, stepping back through its siblings until we either have no more or hit a block-level tag. Each time we loop, we add the node to the front of our line. NavigableStrings don't have name attributes, but we want to include them, hence the two-part test in the while loop.
for node in soup.findAll("br"):
    line = ""
    sibling = node.previousSibling
    while sibling is not None and (not hasattr(sibling, "name") or sibling.name not in block_tags):
        line = unicode(sibling) + line
        sibling = sibling.previousSibling
    print line
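The same idea can be sketched walking forward instead of backward: keep a buffer of text, reset it at a block-level tag, and flush it as a line at each <br/>. This version uses the standard-library xml.etree.ElementTree, which works here only because the sample happens to be well-formed XML; the names BLOCK_TAGS, buffer and lines are illustrative:

```python
import xml.etree.ElementTree as ET

doc = """<li class="taf"><h3><a href="26eOfferCode%3DGSONESTP-----------" id="pa1">
Citibank <b>Credit Card</b> - Save over 5% on fuel | Citibank.co.in</a>
</h3>Get the IndianOil Citibank <b>Card</b>. Apply Now!
<br />
Get 10X Rewards On Shopping -
Save Over 5% On Fuel
<br />
<cite>www.citibank.co.in/<b>CreditCards</b></cite>
</li>"""

BLOCK_TAGS = {"div", "p", "h1", "h2", "h3", "h4", "h5", "h6"}

li = ET.fromstring(doc)
lines = []
buffer = [li.text or ""]
for child in li:
    if child.tag == "br":
        # A <br/> ends the current line; the text after it starts the next one.
        lines.append(" ".join("".join(buffer).split()))
        buffer = [child.tail or ""]
    elif child.tag in BLOCK_TAGS:
        # A block-level tag discards what came before; keep only its tail.
        buffer = [child.tail or ""]
    else:
        # Inline tag such as <b>: keep its text and the text that follows it.
        buffer.append("".join(child.itertext()))
        buffer.append(child.tail or "")

for line in lines:
    print(line)
```

Text that is never followed by a <br/>, such as the trailing <cite>, is simply left in the buffer and not reported.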
Solution without relying on <br> tags:
import lxml.html
html = "..."
tree = lxml.html.fromstring(html)
line1 = ''.join(tree.xpath('//li[@class="taf"]/text() | b/text()')[:3]).strip()
line2 = ' - '.join(tree.xpath('//li[@class="taf"]//a[not(@id)]/text()'))
I don't know whether you want to use lxml or Beautiful Soup, but for lxml using XPath, here is an example:
import lxml
from lxml import etree
import urllib2
response = urllib2.urlopen('your url here')
html = response.read()
imdb = etree.HTML(html)
titles = imdb.xpath('/html/body/li/a/text()')  # XPath for the "line 2" data (found with Firebug)
The xpath I used is for your given html snippet. It may change in the original context.
You can also give cssselect in lxml a try.
import lxml.html
import urllib
data = urllib.urlopen('your url').read()
doc = lxml.html.fromstring(data)
elements = doc.cssselect('your csspath here')  # CSS path (found with the Firebug extension)
for element in elements:
    print(element.text_content())
