Finding an exact text match using BeautifulSoup - Python

I would like to extract text from HTML that exactly matches a given string using BeautifulSoup, but I am also getting text that only partially matches it.
My code is:
from bs4 import BeautifulSoup
import re
import urllib2
url = "http://www.somesite.com"
page = urllib2.urlopen(url)
soup = BeautifulSoup(page, "lxml")
for elem in soup(text=re.compile("exact text")):
    print elem
For the code above, the output looks like this:
1. exact text
2. almost exact text
How can I get only the exact match using BeautifulSoup?
Note: the variable (elem) should be of type <class 'bs4.element.Comment'>.

You can search your soup for the desired element using its tag and any attribute value.
For example, this code searches for all a elements whose id equals some_id_value.
It then loops over each element found, testing whether its .text value equals "exact text".
If so, it prints the whole element.
for elem in soup.find_all('a', {'id':'some_id_value'}):
    if elem.text == "exact text":
        print(elem)

Use BeautifulSoup's find_all method with its string argument for this.
As an example, here I parse a small page from Wikipedia about a place in Jamaica. I look for all strings whose text is 'Jamaica stubs', although I expect to find just one. When I find it, I display the text and its parent.
>>> url = 'https://en.wikipedia.org/wiki/Cassava_Piece'
>>> from bs4 import BeautifulSoup
>>> import requests
>>> page = requests.get(url).text
>>> soup = BeautifulSoup(page, 'lxml')
>>> for item in soup.find_all(string="Jamaica stubs"):
...     item
...     item.findParent()
...
'Jamaica stubs'
Jamaica stubs
On second thoughts, after reading the comment, a better way would be:
>>> url = 'https://en.wikipedia.org/wiki/Hockey'
>>> from bs4 import BeautifulSoup
>>> import requests
>>> import re
>>> page = requests.get(url).text
>>> soup = BeautifulSoup(page, 'lxml')
>>> for i, item in enumerate(soup.find_all(string=re.compile('women', re.IGNORECASE))):
...     i, item.findParent().text[:100]
...
(0, "Women's Bandy World Championships")
(1, "The governing body is the 126-member International Hockey Federation (FIH). Men's field hockey has b")
(2, 'The governing body of international play is the 77-member International Ice Hockey Federation (IIHF)')
(3, "women's")
I use IGNORECASE in the regex so that both 'Women' and 'women' are found in the Wikipedia article. I use enumerate in the for loop to number the displayed items, which makes them easier to read.
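Coming back to the original question, here is a minimal sketch of the difference between a regex filter and a plain-string filter. The HTML snippet is made up for illustration, and it assumes a Beautiful Soup version that accepts the string argument (older releases call it text). A compiled regex is matched as a search, so it also hits near matches, whereas a plain string has to equal the whole node.

```python
import re
from bs4 import BeautifulSoup, Comment

# Hypothetical markup: two text nodes and two comments,
# one exact match and one near match of each kind.
html = (
    "<div><p>exact text</p><p>almost exact text</p>"
    "<!--exact text--><!--almost exact text--></div>"
)
soup = BeautifulSoup(html, "html.parser")

# A compiled regex is applied as a search, so "almost exact text" matches too.
substring_hits = soup.find_all(string=re.compile("exact text"))

# A plain string must equal the whole node, so only exact matches survive.
exact_hits = soup.find_all(string="exact text")

# Keep only the exact matches that are HTML comments.
exact_comments = [node for node in exact_hits if isinstance(node, Comment)]

print(len(substring_hits), len(exact_hits), len(exact_comments))
```

If a regex is still needed, anchoring it with ^ and $ should give the same effect as the plain-string filter.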

Related

Beautiful Soup filtering for more than one keyword

Code:
soup = BeautifulSoup(urllib.request.urlopen(link['href']).read(), 'lxml')
# Find CompanyA links
for link in soup.findAll('a', href=True, text='CompanyA'):
    print(link['href'])
Is it possible to filter for more than one, like this?
text='CompanyA' OR text='CompanyB' OR text='CompanyC'
This will give you all the a elements that have an href attribute and whose text is one of the names in your list.
soup.findAll('a', href=True, text=lambda value: value and value in ["CompanyA", "CompanyB", "CompanyC"])
Use a regular expression. Note that the pattern is applied as a search, so it also matches longer strings that merely contain one of the names.
import re
for link in soup.findAll("a", href=True, text=re.compile("CompanyA|CompanyB|CompanyC")):
    print(link['href'])
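The two approaches can be compared side by side in a short sketch. The markup and the CompanyAB link are invented to show the difference, and the string argument assumes a reasonably recent Beautiful Soup (older releases use text). Membership in a set gives exact matches only, while the unanchored regex also picks up CompanyAB unless the pattern is anchored.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup; the last link has no href and should never be returned.
html = """
<a href="/a">CompanyA</a>
<a href="/b">CompanyB</a>
<a href="/ab">CompanyAB</a>
<a href="/d">CompanyD</a>
<a>CompanyC</a>
"""
soup = BeautifulSoup(html, "html.parser")
wanted = {"CompanyA", "CompanyB", "CompanyC"}

# Exact matching: the link text must be one of the wanted names.
exact = soup.find_all("a", href=True, string=lambda s: s in wanted)

# Unanchored regex: matches any link text containing one of the names.
loose = soup.find_all("a", href=True, string=re.compile("CompanyA|CompanyB|CompanyC"))

# Anchored regex built from the same names, escaped to be safe.
pattern = re.compile("^(?:%s)$" % "|".join(re.escape(name) for name in sorted(wanted)))
anchored = soup.find_all("a", href=True, string=pattern)

exact_hrefs = [link["href"] for link in exact]
loose_hrefs = [link["href"] for link in loose]
anchored_hrefs = [link["href"] for link in anchored]
print(exact_hrefs, loose_hrefs, anchored_hrefs)
```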

Python, Beautiful soup, <br> tag

So I have looked through Stack Overflow but I cannot seem to find an answer to my problem. How do I get specific text after a <br> tag?
This is my code:
product_review_container = container.findAll("span", {"class": "search_review_summary"})
for product_review in product_review_container:
    prr = product_review.get('data-tooltip-html')
    print(prr)
This is the output:
Very Positive<br>86% of the 1,013 user reviews for this game are positive.
I want only the 86% from this string, and separately only the 1,013, so just the numbers. However, the value is a string, not an int, so I do not know what to do.
Here is where the text comes from:
[<span class="search_review_summary positive" data-tooltip-html="Very Positive<br>86% of the 1,013 user reviews for this game are positive.">
</span>]
Here is the link from where I am getting the information: https://store.steampowered.com/search/?specials=1&page=1
Thank you!
You need to use regex here!
import re
string = 'Very Positive<br>86% of the 1,013 user reviews for this game are positive.'
a = re.findall(r'(\d+%)|(\d+,\d+)', string)
print(a)
Output: [('86%', ''), ('', '1,013')]
Then a[0][0] will be '86%' and a[1][1] will be '1,013'.
Here \d matches any digit character, and + means one or more of the preceding item.
If you need a more specific regex, you can try it out at https://regex101.com
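Since the question mentions that the values are not ints, here is a small sketch that goes one step further and converts the captured pieces to integers. The pattern is an assumption based on the single tooltip shown in the question; other tooltips may be worded differently, so the no-match case is handled explicitly.

```python
import re

tooltip = "Very Positive<br>86% of the 1,013 user reviews for this game are positive."

# Capture the percentage digits and the (possibly comma-separated) review count.
match = re.search(r"(\d+)%\s+of\s+the\s+([\d,]+)\s+user reviews", tooltip)

if match:
    percent = int(match.group(1))
    # Drop the thousands separator before converting.
    review_count = int(match.group(2).replace(",", ""))
else:
    percent = review_count = None

print(percent, review_count)
```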
There's a non-regex way to do it; admittedly somewhat convoluted, but still fun:
First, we borrow (and modify) this nice function:
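def split_and_keep(s, sep):
    if not s: return [''] # consistent with string.split()
    # pick a marker character that cannot occur in s
    p = chr(ord(max(s)) + 1)
    # append the marker after every separator, then split on the marker
    return s.replace(sep, sep + p).split(p)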
Then we go through some standard steps:
html = """
[<span class="search_review_summary positive" data-tooltip-html="Very Positive<br>86% of the 1,013 user reviews for this game are positive."></span>]
"""
from bs4 import BeautifulSoup as bs4
soup = bs4(html, 'html.parser')
info = soup.select('span')[0].get("data-tooltip-html")
print(info)
Output so far:
Very Positive<br>86% of the 1,013 user reviews for this game are positive.
Next we go:
data = ''.join(c for c in info if (c.isdigit()) or c == '%')
print(data)
Output is a little better now:
86%1013
Almost there; now the pièce de résistance:
split_and_keep(data, '%')
Final output:
['86%', '1013']
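A simpler non-regex variant is also possible. This is only a sketch, and it assumes the tooltip always has the shape shown in the question: a summary, a <br>, then a sentence in which the percentage ends with % and the count is the only other numeric word.

```python
info = "Very Positive<br>86% of the 1,013 user reviews for this game are positive."

# Split once on the <br> that separates the summary from the detail sentence.
summary, _, detail = info.partition("<br>")
words = detail.split()

# The percentage is the first word ending in "%".
percent = next(w for w in words if w.endswith("%"))

# The count is the first word that is all digits once commas are removed.
count = next(w for w in words if w.replace(",", "").isdigit())

print(summary, percent, count)
```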

Extracting text from html tags with beautiful soup

I have an HTML page to scrape data from.
I need to get the item title, here 'Caliper Ball'.
I'm getting the data from the tag where that title appears:
item_title = base_page.find_all('h1', class_='itemTitle')
It contains this tag structure:
> [<h1 class="itemTitle"> <div class="l1">Caliper</div>
> Ball
> </h1>]
To extract 'Caliper Ball' I'm using
collector = []
for _ in item_title:
    collector.append(_.text)
so I'm getting ugly output like this in the collector list:
[u"\nCaliper\r\n Ball\r\n "]
How can I clean up the output so it reads "Caliper Ball"?
Don't use regex. You're adding too much overhead for something simple. BeautifulSoup4 already has something for this called stripped_strings. See my code below.
from bs4 import BeautifulSoup as bsoup
html = """[<h1 class="itemTitle"> <div class="l1">Caliper</div>
Ball
</h1>]"""
soup = bsoup(html, "html.parser")
item = soup.find("h1", class_="itemTitle")
base = list(item.stripped_strings)
print(" ".join(base))
Result:
Caliper Ball
Explanation: stripped_strings yields every string inside the specified tag with leading and trailing whitespace (spaces, line breaks, and so on) stripped, skipping strings that are whitespace only. It returns a generator, which we can pass to list to get a list instead. Once it's a list, it's just a matter of using " ".join.
Let us know if this helps.
PS: Just to correct myself -- there's actually no need to use list on the result of stripped_strings, but it's better to show the above as such so it's explicit.
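As a related sketch, get_text() accepts a separator and a strip flag, which should give the same result in one call. The HTML below is the question's snippet with the line breaks written out as \r\n, which is an assumption about the original page.

```python
from bs4 import BeautifulSoup

# The question's markup, with the whitespace made explicit.
html = '<h1 class="itemTitle"> <div class="l1">Caliper</div>\r\n Ball\r\n </h1>'
soup = BeautifulSoup(html, "html.parser")
item = soup.find("h1", class_="itemTitle")

# Join all strings with a single space, stripping each one first.
title = item.get_text(" ", strip=True)

# Equivalent spelling using stripped_strings directly (no list() needed).
same_title = " ".join(item.stripped_strings)

print(title)
```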
This regex will help you get the output (Caliper Ball):
import re
html = """[<h1 class="itemTitle"> <div class="l1">Caliper</div>
Ball
</h1>]"""
# capture the text before </div>, then the first word on the following line
regex = r'.*>([^<]*)</div>\s*\n\s*(\w*).*'
match = re.findall(regex, html)
new_data = (' '.join(w) for w in match)
print(''.join(new_data)) # => Caliper Ball
You can use the replace() method to replace \n and \r with nothing or a space, and then use strip() to remove the leading and trailing spaces.
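A short sketch of that idea on the string from the question. It also shows a variant that is not in the answer above: split() with no arguments followed by join, which collapses every run of whitespace to a single space.

```python
raw = "\nCaliper\r\n Ball\r\n "

# replace() + strip(): line breaks become spaces, but runs of spaces remain.
replaced = raw.replace("\r", " ").replace("\n", " ").strip()

# split() with no arguments splits on any whitespace run and drops empty
# pieces, so joining with a single space normalises everything at once.
cleaned = " ".join(raw.split())

print(repr(replaced))
print(repr(cleaned))
```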

Want to get part of string using regular expression

I have a string:
<a class="x3-large" href="_ylt=Ats3LonepB5YtO8vbPyjYAWbvZx4;_ylu=X3oDMTVlanQ4dDV1BGEDMTIwOTI4IG5ld3MgZGFkIHNob290cyBzb24gdARjY29kZQNwemJ1ZmNhaDUEY3BvcwMxBGVkAzEEZwNpZC0yNjcyMDgwBGludGwDdXMEaXRjAzAEbWNvZGUDcHpidWFsbGNhaDUEbXBvcwMxBHBrZ3QDMQRwa2d2AzI1BHBvcwMyBHNlYwN0ZC1mZWEEc2xrA3RpdGxlBHRlc3QDNzAxBHdvZQMxMjc1ODg0Nw--/SIG=12uht5d19/EXP=1348942343/**http%3A//news.yahoo.com/conn-man-kills-masked-teen-learns-son-063653076.html" style="font-family: inherit;">Man kills masked teen, learns it's his son</a>
And I want to get only the last part of it, the actual message:
Man kills masked teen, learns it's his son
So far I made something like this:
pattern = '''<a class="x3-large" (.*)">(.*)</a>'''
But it doesn't do what I want: the first (.*) matches all the junk inside the link, and the second one matches the actual message that I want to get.
In the spirit of answering the question you should have asked instead ;^), yes, you should use BeautifulSoup or lxml or a real parser to handle HTML. For example:
>>> s = '<a class="x3-large" href="_stuff--/SIG**morestuff" style="font-family: inherit;">Man learns not to give himself headaches using regex to deal with HTML</a>'
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(s)
>>> soup.get_text()
u'Man learns not to give himself headaches using regex to deal with HTML'
Or if there are multiple texts to be captured:
>>> s = '<a class="test" href="ignore1">First sentence</a><a class="test" href="ignore1">Second sentence</a>'
>>> soup = BeautifulSoup(s)
>>> soup.find_all("a")
[<a class="test" href="ignore1">First sentence</a>, <a class="test" href="ignore1">Second sentence</a>]
>>> [a.get_text() for a in soup.find_all("a")]
[u'First sentence', u'Second sentence']
Or if you only want certain values of class:
>>> s = '<a class="test" href="ignore1">First sentence</a><a class="x3-large" href="ignore1">Second sentence</a>'
>>> soup = BeautifulSoup(s)
>>> soup.find_all("a", {"class": "x3-large"})
[<a class="x3-large" href="ignore1">Second sentence</a>]
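Use non-greedy quantifiers such as (.*?) instead of (.*), or write the second group as ([^<]*). Note that ([^"]*) cannot stand in for the first (.*) here, because the attributes it has to skip contain double quotes themselves.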
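If a regex is used anyway, a minimal sketch could look like the following. The href is shortened to a placeholder, and the pattern assumes that no attribute value contains a > character. It skips the attributes with [^>]* instead of capturing them, so the message is the only group.

```python
import re

# Shortened, hypothetical version of the link from the question.
s = ('<a class="x3-large" href="_stuff--/SIG**morestuff" '
     'style="font-family: inherit;">Man kills masked teen, learns it\'s his son</a>')

# [^>]* consumes the remaining attributes; (.*?) lazily captures the link text.
pattern = re.compile(r'<a class="x3-large"[^>]*>(.*?)</a>')

match = pattern.search(s)
message = match.group(1) if match else None
print(message)
```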

Preserve space when stripping HTML with Beautiful Soup

from BeautifulSoup import BeautifulSoup
html = "<html><p>Para 1. Words</p><p>Merge. Para 2<blockquote>Quote 1<blockquote>Quote 2</p></html>"
print html
soup = BeautifulSoup(html)
print u''.join(soup.findAll(text=True))
The output of this code is "Para 1. WordsMerge. Para 2Quote 1Quote 2".
I don't want the last word of paragraph one merging with the first word of paragraph two.
E.g. "Para 1. Words Merge. Para 2 Quote 1 Quote 2".
Can this be achieved using the BeautifulSoup library?
And if you are using get_text() in version 4.x:
from bs4 import BeautifulSoup
...
...
soup.get_text(" ")
Just join the pieces with a space:
print u' '.join(soup.findAll(text=True))
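Putting both answers together for Beautiful Soup 4, here is a brief sketch using the HTML from the question. It assumes the built-in html.parser; another parser may repair the unclosed blockquote tags differently, though the text nodes should be the same.

```python
from bs4 import BeautifulSoup

html = "<html><p>Para 1. Words</p><p>Merge. Para 2<blockquote>Quote 1<blockquote>Quote 2</p></html>"
soup = BeautifulSoup(html, "html.parser")

# Default: text nodes are concatenated with nothing in between.
merged = soup.get_text()

# With a separator: one space is placed between adjacent text nodes.
spaced = soup.get_text(" ")

# Same idea with explicit joining; stripped_strings also trims each piece.
tidy = " ".join(soup.stripped_strings)

print(merged)
print(spaced)
```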
