Processing HTML text nodes with scrapy and XPath - python

I'm using scrapy to process documents like this one:
...
<div class="contents">
some text
<ol>
<li>
more text
</li>
...
</ol>
</div>
...
I want to collect all the text inside the contents area into a string.
I also need the '1., 2., 3....' from the <li> elements, so my result should be 'some text 1. more text...'
So, I'm looping over <div class="contents">'s children:
for n in response.xpath('//div[@class="contents"]/node()'):
    if n.xpath('self::ol'):
        result += process_list(n)
    else:
        result += n.extract()
If n is an ordered list, I loop over its elements and add a number before each li/text() (in process_list()). If n is a text node itself, I just read its value.
However, 'some text' doesn't seem to be part of the node set, since the loop doesn't get inside the else part. My result is '1. more text'
Finding text nodes relative to their parent node works:
response.xpath('//div[@class="contents"]//text()')
finds all the text, but this way I can't add the list item numbers.
What am I doing wrong and is there a better way to achieve my task?

Scrapy's Selectors use lxml under the hood, but lxml does not support making further XPath calls on text-node results.
>>> import scrapy
>>> s = scrapy.Selector(text='''<div class="contents">
... some text
... <ol>
... <li>
... more text
... </li>
... ...
... </ol>
... </div>''')
>>> s.xpath('.//div[@class="contents"]/node()')
[<Selector xpath='.//div[@class="contents"]/node()' data='\n some text\n '>, <Selector xpath='.//div[@class="contents"]/node()' data='<ol>\n <li>\n more text\n'>, <Selector xpath='.//div[@class="contents"]/node()' data='\n'>]
>>> for n in s.xpath('.//div[@class="contents"]/node()'):
... print(n.xpath('self::ol'))
...
[]
[<Selector xpath='self::ol' data='<ol>\n <li>\n more text\n'>]
[]
But you could hack into the underlying lxml object to test its type for a text node (it's "hidden" in the .root attribute of each scrapy Selector):
>>> for n in s.xpath('.//div[@class="contents"]/node()'):
... print([type(n.root), n.root])
...
[<class 'str'>, '\n some text\n ']
[<class 'lxml.etree._Element'>, <Element ol at 0x7fa020f2f9c8>]
[<class 'str'>, '\n']
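Putting that together, here is a sketch of the question's loop built on the .root type check; the inline numbering only approximates what process_list() presumably does, since its code isn't shown:
result = ''
for n in response.xpath('//div[@class="contents"]/node()'):
    if isinstance(n.root, str):
        # text node: .root is the raw string, so append it directly
        result += n.root
    elif n.root.tag == 'ol':
        # ordered list: number the <li> items ourselves
        for i, li in enumerate(n.xpath('li'), start=1):
            result += ' %d. %s' % (i, ' '.join(li.xpath('.//text()').extract()))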
An alternative is to use an HTML-to-text conversion library like html2text:
>>> import html2text
>>> html2text.html2text("""<div class="contents">
... some text
... <ol>
... <li>
... more text
... </li>
... ...
... </ol>
... </div>""")
'some text\n\n 1. more text \n...\n\n'
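If the hard line-wrapping in that output gets in the way, html2text also exposes an HTML2Text class whose body_width setting disables it; a small sketch using that documented option:
import html2text
h = html2text.HTML2Text()
h.body_width = 0  # don't hard-wrap output lines
text = h.handle('<div class="contents">some text<ol><li>more text</li></ol></div>')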

If n is not an ol element, self::ol yields an empty node set. What is n.xpath(...) supposed to return when the result of the expression is an empty node set?
An empty node set is "falsy" in XPath, but you're not evaluating it as a boolean in XPath, only in Python. Is an empty node set falsy in Python?
If that's the problem, you could fix it by changing the if statement to
if n.xpath('boolean(self::ol)'):
or
if n.xpath('count(self::ol) > 0'):
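For what it's worth, plain lxml returns a real Python bool for an expression wrapped in boolean(), so the truth test is unambiguous there; note that lxml hands back text nodes from node() as plain strings, which is why they must be guarded before calling .xpath() on them. A minimal sketch:
from lxml import html
doc = html.fromstring('<div class="contents">some text<ol><li>more text</li></ol></div>')
for n in doc.xpath('//div[@class="contents"]/node()'):
    if isinstance(n, str):
        print('text node:', n)  # lxml yields text nodes as str
    else:
        print('is ol?', n.xpath('boolean(self::ol)'))  # a real Python bool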

Related

Find unknown tag containing given text

My HTML is like:
<body>
<div class="afds">
<span class="dfsdf">mytext</span>
</div>
<div class="sdf dzf">
<h1>some random text</h1>
</div>
</body>
I want to find all tags containing "text" & their corresponding classes. In this case, I want:
span, "dfsdf"
h1, null
Next, I want to be able to navigate through the returned tags. For example, find the div parent tag & respective classes of all the returned tags.
If I execute the following
soupx.find_all(text=re.compile(".*text.*"))
it simply returns the text part of the tags:
['mytext', ' some random text']
Please help.
You are probably looking for something along these lines:
ts = soup.find_all(text=re.compile(".*text.*"))
for t in ts:
    if len(t.parent.attrs) > 0:
        for k in t.parent.attrs.keys():
            # class-like attribute values come back as lists, so [0] takes the first entry
            print(t.parent.name, t.parent.attrs[k][0])
    else:
        print(t.parent.name, "null")
Output:
span dfsdf
h1 null
find_all() does not return plain strings; it returns bs4.element.NavigableString objects.
That means you can call other beautifulsoup functions on those results.
Have a look at find_parent and find_parents: documentation
childs = soupx.find_all(text=re.compile(".*text.*"))
for c in childs:
    print(c.find_parent("div"))
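Combining the two answers, a sketch that prints each matching tag, its first class (or null), and its parent div's classes; html_doc stands in for the snippet above, and the [0] indexing relies on bs4 returning class values as lists:
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, "html.parser")  # html_doc: the HTML shown above
for t in soup.find_all(text=re.compile("text")):
    tag = t.parent
    print(tag.name, tag.get("class", ["null"])[0])  # bs4 returns class as a list
    div = tag.find_parent("div")
    if div is not None:
        print("parent div classes:", div.get("class"))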

BeautifulSoup search attributes-value

I'm trying to search in HTML documents for specific attribute values.
e.g.
<html>
<h2 itemprop="prio1"> TEXT PRIO 1 </h2>
<span id="prio2"> TEXT PRIO 2 </span>
</html>
I want to find all items with attribute values beginning with "prio".
I know that I can do something like:
soup.find_all(itemprop=re.compile('prio.*'))
Or
soup.find_all(id=re.compile('prio.*'))
But what I am looking for is something like:
soup.find_all(*=re.compile('prio.*'))
First off, your regex is wrong: to match only strings starting with prio you would anchor it with ^; as written, it matches prio anywhere in the string. And if you are going to test each attribute anyway, you can just use str.startswith:
h = """<html>
<h2 itemprop="prio1"> TEXT PRIO 1 </h2>
<span id="prio2"> TEXT PRIO 2 </span>
</html>"""
soup = BeautifulSoup(h, "lxml")
tags = soup.find_all(lambda t: any(a.startswith("prio") for a in t.attrs.values()))
If you just want to check for certain attributes:
tags = soup.find_all(lambda t: t.get("id","").startswith("prio") or t.get("itemprop","").startswith("prio"))
But if you want a more efficient solution, you might look at lxml, which allows you to use wildcards:
from lxml import html
xml = html.fromstring(h)
tags = xml.xpath("//*[starts-with(@*,'prio')]")
print(tags)
Or just id and itemprop:
tags = xml.xpath("//*[starts-with(@id,'prio') or starts-with(@itemprop, 'prio')]")
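One caveat worth noting: in XPath 1.0, passing the node-set @* to starts-with() converts only the first attribute node to a string, so a matching attribute that isn't first on the element can be missed. A variation (mine, not the original answer's) that tests every attribute node:
tags = xml.xpath("//*[@*[starts-with(., 'prio')]]")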
I don't know if this is the best way, but this works:
>>> soup.find_all(lambda element: any(re.search('prio.*', attr) for attr in element.attrs.values()))
[<h2 itemprop="prio1"> TEXT PRIO 1 </h2>, <span id="prio2"> TEXT PRIO 2 </span>]
Here, lambda element: gives us access to each element, and re.search('prio.*', attr) is run against every entry in element.attrs.values().
any() then tells us whether the element has at least one attribute whose value starts with 'prio'.
You can also use str.startswith here instead of a regex, since you're just checking whether an attribute value starts with 'prio':
soup.find_all(lambda element: any(attr.startswith('prio') for attr in element.attrs.values()))

Extract based on instances of nodes

I'm trying to extract some text using Beautiful Soup. The relevant portion looks something like this.
...
<p class="consistent"><strong>RecurringText</strong></p>
<p class="consistent">Text1</p>
<p class="consistent">Text2</p>
<p class="consistent">Text3</p>
<p class="consistent"><strong>VariableText</strong></p>
...
RecurringText, as the name implies, is consistent across all the files. However, VariableText changes; the only thing it has in common is that it is the next bolded section. I'd like to get Text1, Text2, and Text3 extracted. What comes before (up to and including RecurringText) and what comes after (VariableText onward) can be left behind. How to extract from RecurringText onward I have found elsewhere, but I am unsure how to cut things off at the next bolded item, if that makes sense.
In sum: how can I extract based on the characteristic of VariableText (whose string varies across the URLs) consistently coming after the last of Text1, Text2, ..., Textn (where n differs across files)?
You can basically collect the items between one p element containing a strong element and the next p element containing a strong element:
from bs4 import BeautifulSoup
data = """
<div>
<p class="consistent"><strong>RecurringText</strong></p>
<p class="consistent">Text1</p>
<p class="consistent">Text2</p>
<p class="consistent">Text3</p>
<p class="consistent"><strong>VariableText</strong></p>
</div>
"""
soup = BeautifulSoup(data, "html.parser")
for p in soup.find_all(lambda elm: elm and elm.name == "p" and elm.text == "RecurringText" and \
                       "consistent" in elm.get("class") and elm.strong):
    for item in p.find_next_siblings("p"):
        if item.strong:
            break
        print(item.text)
Prints:
Text1
Text2
Text3
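If RecurringText is unique in the document, a slightly more direct sketch of the same idea (an assumed equivalent, not the answerer's code) is to locate the bold marker first and walk forward from its paragraph; string= is bs4's exact-text match parameter:
start = soup.find("strong", string="RecurringText").find_parent("p")
for item in start.find_next_siblings("p"):
    if item.strong:  # the next bolded paragraph marks VariableText
        break
    print(item.text)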

Python lxml cssselect nth match

Say I have some HTML similar to this:
<div id="content">
<span class="green">something</span>
<span class="blue">something</span>
<span class="red">something</span>
<span class="green">something</span>
<span class="yellow">something</span>
</div>
What's the best way to get just the 2nd element using cssselect?
I can always do cssselect('span.green') and then take the 2nd element from the results, but on a big page with hundreds of elements I guess that's going to be much slower.
Although this is not an answer to your question, here is how I did it: use XPath instead of cssselect:
>>> from lxml.etree import tostring
>>> from lxml.html.soupparser import fromstring
>>> x = fromstring('<div id="content"><span class="green">something</span><span class="blue">something</span><span class="red">something</span><span class="green">something</span><span class="yellow">something</span></div>')
>>> x.xpath('//span[@class="green"][2]')
[<Element span at b6df71ac>]
>>> x.xpath('//span[@class="green"][2]')[0]
<Element span at b6df71ac>
>>> tostring(x.xpath('//span[@class="green"][2]')[0])
'<span class="green">something</span>'
or if you prefer a list of the elements in Python:
>>> x.xpath('//span[@class="green"]')
[<Element span at b6df71ac>, <Element span at b6df720c>]
>>> tostring(x.xpath('//span[@class="green"]')[1])
'<span class="green">something</span>'
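If you'd rather keep CSS syntax, lxml's cssselect compiles a selector to XPath once, so you can at least reuse the compiled selector across pages; a sketch (it still builds the full match list, like cssselect('span.green') does):
from lxml.cssselect import CSSSelector
green = CSSSelector('span.green')  # compiled to XPath once, reusable
second = green(x)[1]               # same result as x.cssselect('span.green')[1]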

Filtering xml file to remove lines with certain text in them?

For example, suppose I have:
<div class="info"><p><b>Orange</b>, <b>One</b>, ...
<div class="info"><p><b>Blue</b>, <b>Two</b>, ...
<div class="info"><p><b>Red</b>, <b>Three</b>, ...
<div class="info"><p><b>Yellow</b>, <b>Four</b>, ...
And I'd like to remove all lines containing words from a list, so I'll only run XPath on the lines that fit my criteria. For example, with the list ['Orange', 'Red'] marking the unwanted lines, in the above example I'd only want to use lines 2 and 4 for further processing.
How can I do this?
Use:
//div[not(p/b[contains('|Orange|Red|',
                       concat('|', ., '|'))])]
This selects any div element that has no p child whose b child's string value is one of the strings in the pipe-separated filter list.
This approach allows extensibility by just adding new filter values to the pipe-separated list, without changing anything else in the XPath expression.
Note: When the structure of the XML document is statically known, always avoid using the // XPath pseudo-operator, because it leads to significant inefficiency (slowdown).
import lxml.html as lh
# http://lxml.de/xpathxslt.html
# http://exslt.org/regexp/functions/match/index.html
content='''\
<table>
<div class="info"><p><b>Orange</b>, <b>One</b></p></div>
<div class="info"><p><b>Blue</b>, <b>Two</b></p></div>
<div class="info"><p><b>Red</b>, <b>Three</b></p></div>
<div class="info"><p><b>Yellow</b>, <b>Four</b></p></div>
</table>
'''
NS = 'http://exslt.org/regular-expressions'
tree = lh.fromstring(content)
exclude=['Orange','Red']
for elt in tree.xpath(
        "//div[not(re:test(p/b[1]/text(), '{0}'))]".format('|'.join(exclude)),
        namespaces={'re': NS}):
    print(lh.tostring(elt))
    print('-'*80)
yields
<div class="info"><p><b>Blue</b>, <b>Two</b></p></div>
--------------------------------------------------------------------------------
<div class="info"><p><b>Yellow</b>, <b>Four</b></p></div>
--------------------------------------------------------------------------------
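For completeness, the first answer's contains/concat expression can be checked against the same tree; binding the filter string as the XPath variable $f is an lxml convenience, not part of the original answer:
filters = '|' + '|'.join(exclude) + '|'  # '|Orange|Red|'
for elt in tree.xpath("//div[not(p/b[contains($f, concat('|', ., '|'))])]",
                      f=filters):
    print(lh.tostring(elt))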
