I have a string:
<a class="x3-large" href="_ylt=Ats3LonepB5YtO8vbPyjYAWbvZx4;_ylu=X3oDMTVlanQ4dDV1BGEDMTIwOTI4IG5ld3MgZGFkIHNob290cyBzb24gdARjY29kZQNwemJ1ZmNhaDUEY3BvcwMxBGVkAzEEZwNpZC0yNjcyMDgwBGludGwDdXMEaXRjAzAEbWNvZGUDcHpidWFsbGNhaDUEbXBvcwMxBHBrZ3QDMQRwa2d2AzI1BHBvcwMyBHNlYwN0ZC1mZWEEc2xrA3RpdGxlBHRlc3QDNzAxBHdvZQMxMjc1ODg0Nw--/SIG=12uht5d19/EXP=1348942343/**http%3A//news.yahoo.com/conn-man-kills-masked-teen-learns-son-063653076.html" style="font-family: inherit;">Man kills masked teen, learns it's his son</a>
And I want to get only the last part of it, the actual message:
Man kills masked teen, learns it's his son
So far I made something like this:
pattern = '''<a class="x3-large" (.*)">(.*)</a>'''
But it doesn't do what I want: the first (.*) should match all the crap inside the link, and the second one the actual message that I want to get.
In the spirit of answering the question you should have asked instead ;^), yes, you should use BeautifulSoup or lxml or a real parser to handle HTML. For example:
>>> s = '<a class="x3-large" href="_stuff--/SIG**morestuff" style="font-family: inherit;">Man learns not to give himself headaches using regex to deal with HTML</a>'
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(s)
>>> soup.get_text()
u'Man learns not to give himself headaches using regex to deal with HTML'
Or if there are multiple texts to be captured:
>>> s = '<a class="test" href="ignore1">First sentence</a><a class="test" href="ignore1">Second sentence</a>'
>>> soup = BeautifulSoup(s)
>>> soup.find_all("a")
[<a class="test" href="ignore1">First sentence</a>, <a class="test" href="ignore1">Second sentence</a>]
>>> [a.get_text() for a in soup.find_all("a")]
[u'First sentence', u'Second sentence']
Or if you only want certain values of class:
>>> s = '<a class="test" href="ignore1">First sentence</a><a class="x3-large" href="ignore1">Second sentence</a>'
>>> soup = BeautifulSoup(s)
>>> soup.find_all("a", {"class": "x3-large"})
[<a class="x3-large" href="ignore1">Second sentence</a>]
Type ([^"]*) instead of first (.*) and ([^<]*) instead of second. Or use non greedy quantifiers like (.*?).
Suppose I have a list of keywords, e.g. ["apple", "dog", "cat"], and some HTML. I'd like to return all the elements that contain one of these keywords as an immediate descendant. How would I go about doing this?
I've tried using soup.find_all(text=keywords), but that yields nothing.
from bs4 import BeautifulSoup
source = """
<html>
<p>I like apples</p>
<p>I don't want to match this</p>
<div>Dogs are cool. I don't match this either.</div>
<div>I have a cat.</div>
</html>
"""
soup = BeautifulSoup(source, "html.parser")
keywords = ["apple", "dog", "cat"]
BeautifulSoup supports regex for text searching, so you can utilize that with the IGNORECASE flag (since your keyword is dog and your element contains Dogs):
import re
from bs4 import BeautifulSoup
source = """
<html>
<p>I like apples</p>
<p>I don't want to match this</p>
<div>Dogs are cool. I don't match this either.</div>
<div>I have a cat.</div>
</html>
"""
soup = BeautifulSoup(source, "html.parser")
keywords = ["apple", "dog", "cat"]
print(soup.find_all(text=re.compile("|".join(keywords), flags=re.IGNORECASE)))
# Output: ['I like apples', "Dogs are cool. I don't match this either.", 'I have a cat.']
As a note, you say "immediate descendant", and the element with dogs also contains "I don't match this either". Due to how your HTML is formatted, the search will pick this up. If that line were something like <div>Dogs are cool. <div>I don't match this either.</div></div>, then the output would be
['I like apples', 'Dogs are cool. ', 'I have a cat.']
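If you really do need to restrict matches to text that sits directly inside an element (rather than anywhere beneath it), one hedged option is a function filter that only looks at a tag's direct string children. This is a sketch, assuming the nested variant of the HTML from the note above:
import re
from bs4 import BeautifulSoup, NavigableString

source = """
<html>
<p>I like apples</p>
<p>I don't want to match this</p>
<div>Dogs are cool. <div>I don't match this either.</div></div>
<div>I have a cat.</div>
</html>
"""

soup = BeautifulSoup(source, "html.parser")
keywords = ["apple", "dog", "cat"]
pattern = re.compile("|".join(keywords), flags=re.IGNORECASE)

# Only check strings that are direct children of each tag, so nested text is ignored
matches = soup.find_all(
    lambda tag: any(isinstance(child, NavigableString) and pattern.search(child)
                    for child in tag.children)
)
print(matches)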
I'm working on getting the words between certain words in a string.
Referring to the article "Find string between two substrings", I succeeded in catching words in the following way.
s = 'asdf=5;iwantthis123jasd'
result = re.search('asdf=5;(.*)123jasd', s)
print(result.group(1))
But in the sentence below it failed.
s = ''' <div class="prod-origin-price ">
<span class="discount-rate">
4%
</span>
<span class="origin-price">'''
result = re.search('<span class="discount-rate">(.*)</span>', s)
print(result.group(1))
I'm trying to extract '4%'. Everything else succeeds, but I don't know why only this one fails.
Help
Try this (mind the whitespace and newlines):
import re
s = ''' <div class="prod-origin-price ">
<span class="discount-rate">
4%
</span>
<span class="origin-price">'''
result = re.search(r'<span class="discount-rate">\s*(.*)\s*</span>', s)
print(result.group(1))
Use re.DOTALL flag for matching new lines:
result = re.search('<span class="discount-rate">(.*)</span>', s, re.DOTALL)
Documentation: https://docs.python.org/3/library/re.html
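As a side note, if the string contained more than one closing </span>, the greedy (.*) combined with re.DOTALL would run all the way to the last one; a non-greedy (.*?) keeps the capture at the first closing tag. A small sketch using the same sample string:
import re

s = ''' <div class="prod-origin-price ">
    <span class="discount-rate">
        4%
    </span>
    <span class="origin-price">'''

# Non-greedy capture plus DOTALL: match across newlines but stop at the first </span>
result = re.search(r'<span class="discount-rate">(.*?)</span>', s, re.DOTALL)
print(result.group(1).strip())  # 4%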
This is structured data, not just a string, so we can use a library like Beautiful Soup to help us simplify such tasks:
from bs4 import BeautifulSoup
s = ''' <div class="prod-origin-price ">
<span class="discount-rate">
4%
</span>
<span class="origin-price">'''
soup = BeautifulSoup(s)
value = soup.find(class_='discount-rate').get_text(strip=True)
print(value)
# Output:
4%
I would like to extract the exact matching text value from HTML using BeautifulSoup, but I am getting some almost-matching text along with my exact text.
My code is:
from bs4 import BeautifulSoup
import urllib2
import re

url = "http://www.somesite.com"
page = urllib2.urlopen(url)
soup = BeautifulSoup(page, "lxml")
for elem in soup(text=re.compile("exact text")):
    print elem
For the above code, the output is like:
1. exact text
2. almost exact text
How can I get only the exact match using BeautifulSoup?
Note: the variable (elem) should be of type <class 'bs4.element.Comment'>.
You can search your soup for the desired element using its tag and any attribute value.
For example, this code will search for all a elements with id equal to some_id_value.
Then it'll loop over each element found, testing if its .text value is equal to "exact text".
If so, it'll print the whole element.
for elem in soup.find_all('a', {'id': 'some_id_value'}):
    if elem.text == "exact text":
        print(elem)
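Since the question notes that the matching nodes should be of type bs4.element.Comment, a hedged variant is to pass a function as the text filter and check the node type as well as the exact text. The markup below is only an illustration:
from bs4 import BeautifulSoup
from bs4.element import Comment

html = "<div><!-- exact text --><!-- almost exact text --><p>exact text</p></div>"
soup = BeautifulSoup(html, "html.parser")

# Keep only comment nodes whose stripped text is exactly the target string
comments = soup.find_all(string=lambda s: isinstance(s, Comment) and s.strip() == "exact text")
print(comments)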
Use BeautifulSoup's find_all method with its string argument for this.
As an example, here I parse a small page from Wikipedia about a place in Jamaica. I look for all strings whose text is 'Jamaica stubs', though I expect to find just one. When I find it, I display the text and its parent.
>>> url = 'https://en.wikipedia.org/wiki/Cassava_Piece'
>>> from bs4 import BeautifulSoup
>>> import requests
>>> page = requests.get(url).text
>>> soup = BeautifulSoup(page, 'lxml')
>>> for item in soup.find_all(string="Jamaica stubs"):
... item
... item.findParent()
...
'Jamaica stubs'
Jamaica stubs
On second thoughts, after reading the comment, a better way would be:
>>> url = 'https://en.wikipedia.org/wiki/Hockey'
>>> from bs4 import BeautifulSoup
>>> import requests
>>> import re
>>> page = requests.get(url).text
>>> soup = BeautifulSoup(page, 'lxml')
>>> for i, item in enumerate(soup.find_all(string=re.compile('women', re.IGNORECASE))):
... i, item.findParent().text[:100]
...
(0, "Women's Bandy World Championships")
(1, "The governing body is the 126-member International Hockey Federation (FIH). Men's field hockey has b")
(2, 'The governing body of international play is the 77-member International Ice Hockey Federation (IIHF)')
(3, "women's")
I use IGNORECASE in the regex so that both 'Women' and 'women' are found in the wikipedia article. I use enumerate in the for loop so that I can number the items that are displayed to make them easier to read.
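If the goal is still an exact match rather than a substring match, the same regex approach can be anchored so that near-matches are filtered out. A short sketch with made-up markup:
import re
from bs4 import BeautifulSoup

html = "<div><span>exact text</span><span>almost exact text</span></div>"
soup = BeautifulSoup(html, "html.parser")

# ^ and $ anchor the pattern, so 'almost exact text' no longer qualifies
exact = soup.find_all(string=re.compile(r"^\s*exact text\s*$", re.IGNORECASE))
print(exact)  # ['exact text']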
So I need to grab the numbers after lines looking like this
<div class="gridbarvalue color_blue">79</div>
and
<div class="gridbarvalue color_red">79</div>
Is there a way I can do a findAll('div', text=re.compile('<>')) where I would find tags with gridbarvalue color_<red or blue>?
I'm using beautifulsoup.
Also sorry if I'm not making my question clear, I'm pretty inexperienced with this.
class is a Python keyword, so BeautifulSoup expects you to put an underscore after it when using it as a keyword parameter:
>>> soup.find_all('div', class_=re.compile(r'color_(?:red|blue)'))
[<div class="gridbarvalue color_blue">79</div>, <div class="gridbarvalue color_red">79</div>]
To also match the text, use
>>> soup.find_all('div', class_=re.compile(r'color_(?:red|blue)'), text='79')
[<div class="gridbarvalue color_blue">79</div>, <div class="gridbarvalue color_red">79</div>]
import re

elems = soup.findAll(attrs={'class': re.compile("color_(blue|red)")})
for e in elems:
    m = re.search(r">(\d+)<", str(e))
    print "The number is %s" % m.group(1)
Say I have some HTML similar to this:
<div id="content">
<span class="green">something</span>
<span class="blue">something</span>
<span class="red">something</span>
<span class="green">something</span>
<span class="yellow">something</span>
</div>
What's the best way to just get the 2nd element using cssselect?
I can always do cssselect('span.green') and then choose the 2nd element from the results, but in a big page with hundreds of elements I guess it's going to be much slower.
Although this is not an answer to your question, here is the way I did this:
Use XPath instead of cssselect:
>>> from lxml.etree import tostring
>>> from lxml.html.soupparser import fromstring
>>> x = fromstring('<div id="content"><span class="green">something</span><span class="blue">something</span><span class="red">something</span><span class="green">something</span><span class="yellow">something</span></div>')
>>> x.xpath('//span[@class="green"][2]')
[<Element span at b6df71ac>]
>>> x.xpath('//span[@class="green"][2]')[0]
<Element span at b6df71ac>
>>> tostring(x.xpath('//span[@class="green"][2]')[0])
'<span class="green">something</span>'
or if you prefer a list of the elements in Python:
>>> x.xpath('//span[@class="green"]')
[<Element span at b6df71ac>, <Element span at b6df720c>]
>>> tostring(x.xpath('//span[@class="green"]')[1])
'<span class="green">something</span>'