Python lxml cssselect nth match - python

Say I have some HTML similar to this:
<div id="content">
<span class="green">something</span>
<span class="blue">something</span>
<span class="red">something</span>
<span class="green">something</span>
<span class="yellow">something</span>
</div>
What's the best way to get just the 2nd matching element using cssselect?
I can always do cssselect('span.green') and then pick the 2nd element from the results, but on a big page with hundreds of elements I guess that is going to be much slower.

Although this is not an answer to your question, here is the way I did this:
Use XPath instead of cssselect:
>>> from lxml.etree import tostring
>>> from lxml.html.soupparser import fromstring
>>> x = fromstring('<div id="content"><span class="green">something</span><span class="blue">something</span><span class="red">something</span><span class="green">something</span><span class="yellow">something</span></div>')
>>> x.xpath('//span[@class="green"][2]')
[<Element span at b6df71ac>]
>>> x.xpath('//span[@class="green"][2]')[0]
<Element span at b6df71ac>
>>> tostring(x.xpath('//span[@class="green"][2]')[0])
'<span class="green">something</span>'
or if you prefer a list of the elements in Python:
>>> x.xpath('//span[@class="green"]')
[<Element span at b6df71ac>, <Element span at b6df720c>]
>>> tostring(x.xpath('//span[@class="green"]')[1])
'<span class="green">something</span>'
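The "list of the elements" variant can also be tried without lxml. The following is a minimal sketch, assuming the markup is well-formed XML so that the standard library's xml.etree.ElementTree can parse it; the span texts ("first green", "second green") are changed here only to make the result visible. It deliberately uses Python list indexing rather than an XPath position predicate, because ElementTree supports only a subset of XPath.

```python
import xml.etree.ElementTree as ET

# Same structure as the question's HTML; texts changed for illustration only.
html = (
    '<div id="content">'
    '<span class="green">first green</span>'
    '<span class="blue">something</span>'
    '<span class="red">something</span>'
    '<span class="green">second green</span>'
    '<span class="yellow">something</span>'
    '</div>'
)

root = ET.fromstring(html)

# Collect every green span, then pick the 2nd one by list index (0-based).
greens = root.findall(".//span[@class='green']")
second_green = greens[1]

print(second_green.text)
```

Selecting all matches first and indexing afterwards is the same trade-off the question describes: simple, but it materializes the full list of matches.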

Related

Python3 - Extract the text from a bs4.element.Tag and add to a dictionary

I am scraping a website which returns a bs4.element.Tag similar to the following:
<span class="attributes-value">
<span class="four-door">four door</span>
<span class="inline-4-engine">inline 4 engine</span>
<span class="24-gallons-per-mile">24 gallons per mile</span>
</span>
I am trying to extract just the text from this block and add it to a dictionary. All of the examples that I am seeing on the forum include some sort of common element like an 'id' or similar. I am not an HTML guy, so I may be using incorrect terms.
What I would like to do is get the text ("four door", "v6 engine", etc.) and add it as values to a dictionary, with the key being a pre-designated variable, car_model.
cars = {'528i':['four door', 'inline 4 engine']}
I can't figure out a universal way to pull out the text because there may be more or fewer span classes with different text. Thanks for your help!
You need to loop through all the elements by selector and extract text value from these elements.
A selector is a specific path to the element you want. In my case, the selector is .attributes-value span, where .attributes-value allows you to access the class, and span allows you to access the tags within that class.
The get_text() method retrieves the content between the opening and closing tags. This is exactly what you need.
I also recommend using lxml because it will speed up your code.
The full code is attached below:
from bs4 import BeautifulSoup
import lxml
html = '''
<span class="attributes-value">
<span class="four-door">four door</span>
<span class="inline-4-engine">inline 4 engine</span>
<span class="24-gallons-per-mile">24 gallons per mile</span>
</span>
'''
soup = BeautifulSoup(html, 'lxml')
cars = {
    '528i': []
}
for span in soup.select(".attributes-value span"):
    cars['528i'].append(span.get_text())
print(cars)
Output:
{'528i': ['four door', 'inline 4 engine', '24 gallons per mile']}
You can use:
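from collections import defaultdict
from bs4 import BeautifulSoup
# html_doc holds the HTML snippet from the question
out = defaultdict(list)
soup = BeautifulSoup(html_doc, 'html.parser')
for tag in soup.select(".attributes-value span"):
    out["528i"].append(tag.text)
print(dict(out))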
Prints:
{'528i': ['four door', 'inline 4 engine', '24 gallons per mile']}
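Since the question asks for a "universal" way with car_model as a variable key, here is a small sketch of the same loop wrapped in a function. The helper name add_features is hypothetical, and the standard library's xml.etree.ElementTree stands in for BeautifulSoup here, which assumes the snippet is well-formed XML.

```python
import xml.etree.ElementTree as ET

snippet = """<span class="attributes-value">
<span class="four-door">four door</span>
<span class="inline-4-engine">inline 4 engine</span>
<span class="24-gallons-per-mile">24 gallons per mile</span>
</span>"""


def add_features(cars, car_model, markup):
    """Append the text of every nested <span> to cars[car_model] (hypothetical helper)."""
    outer = ET.fromstring(markup)
    # setdefault creates the list the first time a model is seen.
    features = cars.setdefault(car_model, [])
    for span in outer.findall("span"):
        features.append(span.text.strip())
    return cars


cars = add_features({}, "528i", snippet)
print(cars)
```

Because the loop never looks at the class names, it does not matter how many inner spans there are or what they are called.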

xpath to get lists of element in Python

I am trying to scrape lists of elements from a page that looks like this:
<div class="container">
<b>1</b>
<b>2</b>
<b>3</b>
</div>
<div class="container">
<b>4</b>
<b>5</b>
<b>6</b>
</div>
I would like to get lists or tuples using xpath: [1,2,3],[4,5,6]...
Using for loop on the page I get either the first element of each list or all numbers as one list.
Could you please help me to solve the exercise?
Thank you in advance for any help!
For scraping static pages, bs4 is a convenient package to work with. Using bs4, you can achieve your goal as easily as below:
from bs4 import BeautifulSoup
source = """<div class="container">
<b>1</b>
<b>2</b>
<b>3</b>
</div>
<div class="container">
<b>4</b>
<b>5</b>
<b>6</b>
</div>"""
soup = BeautifulSoup(source, 'html.parser') # parse content/ page source
soup.find_all('div', {'class': 'container'}) # find all the div element (second argument is optional mentioned to scrape/find only element with attribute value)
print([[int(x.text) for x in i.find_all('b')] for i in soup.find_all('div', {'class': 'container'})]) # get list of all div's number list as you require
Output:
[[1, 2, 3], [4, 5, 6]]
You could use this XPath expression, which will give you two strings:
.//*[@class='container'] ➡ '1 2 3', '4 5 6'
If you would prefer 6 strings:
.//*[@class='container']/b ➡ '1','2','3','4','5','6'
To get exactly what you are looking for, though, you would have to separate the XPath expressions:
.//*[@class='container'][1]/b ➡ '1','2','3'
.//*[@class='container'][2]/b ➡ '4','5','6'
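Instead of writing one indexed expression per container, the grouping can be done by selecting the containers first and then running a relative query inside each one. A minimal sketch, assuming the page fragment is well-formed XML and wrapped in a made-up <root> element so the standard library's xml.etree.ElementTree (which supports only a subset of XPath) can parse it:

```python
import xml.etree.ElementTree as ET

# <root> is an assumption added so the fragment has a single top-level element.
page = """<root>
<div class="container">
<b>1</b>
<b>2</b>
<b>3</b>
</div>
<div class="container">
<b>4</b>
<b>5</b>
<b>6</b>
</div>
</root>"""

root = ET.fromstring(page)

# Outer query: one result per container. Inner query: relative to that container.
groups = [
    [int(b.text) for b in div.findall("b")]
    for div in root.findall(".//div[@class='container']")
]

print(groups)
```

The same two-step pattern (select the containers, then query relative to each) carries over to lxml or Scrapy selectors, where the inner expression would start with "./" or ".//".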

Processing html text nodes with scrapy and XPath

I'm using scrapy to process documents like this one:
...
<div class="contents">
some text
<ol>
<li>
more text
</li>
...
</ol>
</div>
...
I want to collect all the text inside the contents area into a string.
I also need the '1., 2., 3....' from the <li> elements, so my result should be 'some text 1. more text...'
So, I'm looping over <div class="contents">'s children
for n in response.xpath('//div[@class="contents"]/node()'):
    if n.xpath('self::ol'):
        result += process_list(n)
    else:
        result += n.extract()
If n is an ordered list, I loop over its elements and add a number to li/text() (in process_list()). If n is a text node itself, I just read its value.
However, 'some text' doesn't seem to be part of the node set, since the loop doesn't get inside the else part. My result is '1. more text'
Finding text nodes relative to their parent node works:
response.xpath('//div[@class="contents"]//text()')
finds all the text, but this way I can't add the list item numbers.
What am I doing wrong and is there a better way to achieve my task?
Scrapy's Selectors use lxml under the hood, but lxml doesn't work with XPath calls on text nodes.
>>> import scrapy
>>> s = scrapy.Selector(text='''<div class="contents">
... some text
... <ol>
... <li>
... more text
... </li>
... ...
... </ol>
... </div>''')
>>> s.xpath('.//div[@class="contents"]/node()')
[<Selector xpath='.//div[@class="contents"]/node()' data='\n some text\n '>, <Selector xpath='.//div[@class="contents"]/node()' data='<ol>\n <li>\n more text\n'>, <Selector xpath='.//div[@class="contents"]/node()' data='\n'>]
>>> for n in s.xpath('.//div[@class="contents"]/node()'):
... print(n.xpath('self::ol'))
...
[]
[<Selector xpath='self::ol' data='<ol>\n <li>\n more text\n'>]
[]
But you could hack into the underlying lxml object to test its type for a text node (it's "hidden" in a .root attribute of each scrapy Selector):
>>> for n in s.xpath('.//div[@class="contents"]/node()'):
... print([type(n.root), n.root])
...
[<class 'str'>, '\n some text\n ']
[<class 'lxml.etree._Element'>, <Element ol at 0x7fa020f2f9c8>]
[<class 'str'>, '\n']
An alternative is to use some HTML-to-text conversion library like html2text
>>> import html2text
>>> html2text.html2text("""<div class="contents">
... some text
... <ol>
... <li>
... more text
... </li>
... ...
... </ol>
... </div>""")
'some text\n\n 1. more text \n...\n\n'
If n is not an ol element, self::ol yields an empty node set. What is n.xpath(...) supposed to return when the result of the expression is an empty node set?
An empty node set is "falsy" in XPath, but you're not evaluating it as a boolean in XPath, only in Python. Is an empty node set falsy in Python?
If that's the problem, you could fix it by changing the if statement to
if n.xpath('boolean(self::ol)'):
or
if n.xpath('count(self::ol) > 0'):
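For comparison, the whole task (text plus numbered list items in one string) can be sketched without Scrapy. This is only an illustration under stated assumptions: the document is well-formed XML, the standard library's xml.etree.ElementTree is used instead of Scrapy selectors, the second list item is invented for the demo, and the function name flatten is hypothetical. ElementTree exposes text that sits between elements through the .text and .tail attributes rather than as separate nodes.

```python
import xml.etree.ElementTree as ET

doc = """<div class="contents">
some text
<ol>
<li>
more text
</li>
<li>
even more text
</li>
</ol>
</div>"""


def flatten(div):
    """Join the div's text, numbering the items of any <ol> child (hypothetical helper)."""
    parts = []
    # Text before the first child element lives in .text.
    if div.text and div.text.strip():
        parts.append(div.text.strip())
    for child in div:
        if child.tag == "ol":
            for number, li in enumerate(child.findall("li"), start=1):
                parts.append(f"{number}. {''.join(li.itertext()).strip()}")
        else:
            text = "".join(child.itertext()).strip()
            if text:
                parts.append(text)
        # Text following a child element lives in that child's .tail.
        if child.tail and child.tail.strip():
            parts.append(child.tail.strip())
    return " ".join(parts)


result = flatten(ET.fromstring(doc))
print(result)
```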

Can I do a findall regular expression like this?

So I need to grab the numbers after lines looking like this
<div class="gridbarvalue color_blue">79</div>
and
<div class="gridbarvalue color_red">79</div>
Is there a way I can do a findAll('div', text=re.compile('<>')) where I would find tags with gridbarvalue color_<red or blue>?
I'm using beautifulsoup.
Also sorry if I'm not making my question clear, I'm pretty inexperienced with this.
class is a Python keyword, so BeautifulSoup expects you to put an underscore after it when using it as a keyword parameter:
>>> soup.find_all('div', class_=re.compile(r'color_(?:red|blue)'))
[<div class="gridbarvalue color_blue">79</div>, <div class="gridbarvalue color_red">79</div>]
To also match the text, use
>>> soup.find_all('div', class_=re.compile(r'color_(?:red|blue)'), text='79')
[<div class="gridbarvalue color_blue">79</div>, <div class="gridbarvalue color_red">79</div>]
import re
elems = soup.findAll(attrs={'class': re.compile(r"color_(blue|red)")})
for e in elems:
    # pull the digits between the opening and closing tag
    m = re.search(r">(\d+)<", str(e))
    print("The number is %s" % m.group(1))
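To check what the pattern accepts independently of BeautifulSoup, it can be tried with the re module alone. The sketch below is an assumption-laden illustration: the sample markup (including the color_green line and the numbers 81 and 5) is made up, and matching raw HTML with re.findall only works while the markup stays this regular.

```python
import re

# Made-up sample in the shape of the question's lines.
html = (
    '<div class="gridbarvalue color_blue">79</div>\n'
    '<div class="gridbarvalue color_red">81</div>\n'
    '<div class="gridbarvalue color_green">5</div>\n'
)

# The same alternation used for the class filter above.
class_pattern = re.compile(r"color_(?:red|blue)")
class_values = ["gridbarvalue color_blue", "gridbarvalue color_red", "gridbarvalue color_green"]
accepted = [value for value in class_values if class_pattern.search(value)]

# One capture group, so findall returns just the captured digits.
numbers = re.findall(
    r'<div class="gridbarvalue color_(?:red|blue)">(\d+)</div>', html
)

print(accepted)
print(numbers)
```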

Want to get part of string using regular expression

I have a string:
<a class="x3-large" href="_ylt=Ats3LonepB5YtO8vbPyjYAWbvZx4;_ylu=X3oDMTVlanQ4dDV1BGEDMTIwOTI4IG5ld3MgZGFkIHNob290cyBzb24gdARjY29kZQNwemJ1ZmNhaDUEY3BvcwMxBGVkAzEEZwNpZC0yNjcyMDgwBGludGwDdXMEaXRjAzAEbWNvZGUDcHpidWFsbGNhaDUEbXBvcwMxBHBrZ3QDMQRwa2d2AzI1BHBvcwMyBHNlYwN0ZC1mZWEEc2xrA3RpdGxlBHRlc3QDNzAxBHdvZQMxMjc1ODg0Nw--/SIG=12uht5d19/EXP=1348942343/**http%3A//news.yahoo.com/conn-man-kills-masked-teen-learns-son-063653076.html" style="font-family: inherit;">Man kills masked teen, learns it's his son</a>
And I want to get only the last part of it, the actual message:
Man kills masked teen, learns it's his son
So far I made something like this:
pattern = '''<a class="x3-large" (.*)">(.*)</a>'''
But it doesn't do what I want: the first (.*) matches all the junk inside the link, and the second one is the actual message that I want to get.
In the spirit of answering the question you should have asked instead ;^), yes, you should use BeautifulSoup or lxml or a real parser to handle HTML. For example:
>>> s = '<a class="x3-large" href="_stuff--/SIG**morestuff" style="font-family: inherit;">Man learns not to give himself headaches using regex to deal with HTML</a>'
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(s)
>>> soup.get_text()
u'Man learns not to give himself headaches using regex to deal with HTML'
Or if there are multiple texts to be captured:
>>> s = '<a class="test" href="ignore1">First sentence</a><a class="test" href="ignore1">Second sentence</a>'
>>> soup = BeautifulSoup(s)
>>> soup.find_all("a")
[<a class="test" href="ignore1">First sentence</a>, <a class="test" href="ignore1">Second sentence</a>]
>>> [a.get_text() for a in soup.find_all("a")]
[u'First sentence', u'Second sentence']
Or if you only want certain values of class:
>>> s = '<a class="test" href="ignore1">First sentence</a><a class="x3-large" href="ignore1">Second sentence</a>'
>>> soup = BeautifulSoup(s)
>>> soup.find_all("a", {"class": "x3-large"})
[<a class="x3-large" href="ignore1">Second sentence</a>]
Use ([^>]*) instead of the first (.*) and ([^<]*) instead of the second, so that neither group can run past the end of the tag or into the next one. Or use non-greedy quantifiers like (.*?).
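The difference between greedy groups, negated character classes and non-greedy quantifiers shows up as soon as two links share one string. A short sketch with the re module; the two-link sample string is made up for illustration, and the patterns are variations on the one in the question rather than a recommendation to parse HTML this way.

```python
import re

# Made-up sample: two links on one line.
s = (
    '<a class="x3-large" href="a">First</a>'
    '<a class="x3-large" href="b">Second</a>'
)

# Greedy: the first (.*) swallows everything up to the LAST '">' in the string.
greedy = re.findall(r'<a class="x3-large" (.*)">(.*)</a>', s)

# Negated character classes: cannot cross '>' or '<', so each link matches separately.
negated = re.findall(r'<a class="x3-large" [^>]*>([^<]*)</a>', s)

# Non-greedy quantifiers: stop at the FIRST place the rest of the pattern fits.
lazy = re.findall(r'<a class="x3-large" .*?">(.*?)</a>', s)

print(greedy)
print(negated)
print(lazy)
```

The greedy version finds a single match whose first group spans from the first link's href into the second link, which is why only the last message appears to be captured.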
