I have looked through Stack Overflow but cannot find an answer to my problem. How do I get specific text that comes after a <br> tag?
This is my Code:
product_review_container = container.findAll("span",{"class":"search_review_summary"})
for product_review in product_review_container:
    prr = product_review.get('data-tooltip-html')
    print(prr)
This is the output:
Very Positive<br>86% of the 1,013 user reviews for this game are positive.
I want only the 86% from this string and, separately, only the 1,013, so just the numbers. However, the value is a string rather than an int, so I do not know what to do.
Here is where the text comes from:
[<span class="search_review_summary positive" data-tooltip-html="Very Positive<br>86% of the 1,013 user reviews for this game are positive.">
</span>]
Here is the link from where I am getting the information: https://store.steampowered.com/search/?specials=1&page=1
Thank you!
You need to use regex here!
import re
string = 'Very Positive<br>86% of the 1,013 user reviews for this game are positive.'
a = re.findall(r'(\d+%)|(\d+,\d+)', string)
print(a)
output: [('86%', ''), ('', '1,013')]
# Then a[0][0] will be '86%' and a[1][1] will be '1,013'
Here \d matches any digit character, and + means one or more of them.
If you need a more specific regex, you can try it out at https://regex101.com
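Since the question also asks for the values as numbers rather than strings, here is a minimal sketch that builds on the regex idea above. The named groups and variable names are my own assumptions, and the pattern only targets the exact sentence shape shown in the question.

```python
import re

tooltip = 'Very Positive<br>86% of the 1,013 user reviews for this game are positive.'

# Assumed sentence shape: "<percent>% of the <count> user reviews"
match = re.search(r'(?P<percent>\d+)% of the (?P<count>[\d,]+) user reviews', tooltip)
if match:
    # Drop the thousands separator before converting the count to int
    percent = int(match.group('percent'))
    count = int(match.group('count').replace(',', ''))
    print(percent, count)
```

If a tooltip has a different wording, match will be None, so the if guard is worth keeping.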
There's a non-regex way to do it; admittedly somewhat convoluted, but still fun:
First, we borrow (and modify) this nice function:
def split_and_keep(s, sep):
    if not s: return ['']  # consistent with string.split()
    p = chr(ord(max(s)) + 1)
    return s.replace(sep, sep + p).split(p)
Then we go through some standard steps:
html = """
[<span class="search_review_summary positive" data-tooltip-html="Very Positive<br>86% of the 1,013 user reviews for this game are positive."></span>]
"""
from bs4 import BeautifulSoup as bs4
soup = bs4(html, 'html.parser')
info = soup.select('span')[0].get("data-tooltip-html")
print(info)
Output so far, is:
Very Positive<br>86% of the 1,013 user reviews for this game are positive.
Next we go:
data = ''.join(c for c in info if (c.isdigit()) or c == '%')
print(data)
Output is a little better now:
86%1013
Almost there; now the pièce de résistance:
split_and_keep(data, '%')
Final output:
['86%', '1013']
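If the goal is literally "the text after the <br>", a plainer route may be to split the attribute value on that tag first. This is only a sketch under the assumption that the tooltip always contains exactly the "<br>" spelling shown above; the variable names are mine.

```python
tooltip = 'Very Positive<br>86% of the 1,013 user reviews for this game are positive.'

# Everything before the first <br> is the rating, everything after is the sentence.
rating, _, sentence = tooltip.partition('<br>')

# Keep only the whitespace-separated tokens that start with a digit.
numbers = [word for word in sentence.split() if word[0].isdigit()]

print(rating)
print(numbers)
```

Unlike the character filter above, this keeps the comma in the count, so it still has to be removed before calling int().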
Related
I found this link [and a few others] which talks a bit about BeautifulSoup for reading html. It mostly does what I want, grabs a title for a webpage.
import requests
from bs4 import BeautifulSoup

def get_title(url):
    html = requests.get(url).text
    if len(html) > 0:
        contents = BeautifulSoup(html, 'html.parser')
        title = contents.title.string
        return title
    return None
The issue that I run into is that sometimes articles will come back with metadata attached at the end with " - some_data". A good example is this link to a BBC Sport article which reports the title as
Jack Charlton: 1966 England World Cup winner dies aged 85 - BBC Sport
I could do something simple like cut off anything after the last '-' character
title = title.rsplit(' - ', 1)[0]
But that assumes that any metadata comes after a "-". I don't want to assume that there will never be an article whose title ends in " - part_of_title".
I found the Newspaper3k library but it's definitely more than I need - all I need is to grab a title and ensure that it's the same as what the user posted. My friend who pointed me to Newspaper3k also mentioned it could be buggy and didn't always find titles correctly, so I would be inclined to use something else if possible.
My current thought is to continue using BeautifulSoup and just add on fuzzywuzzy which would honestly also help with slight misspellings or punctuation differences. But, I would certainly prefer to start from a place that included comparing against accurate titles.
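To make the fuzzy-comparison idea concrete, here is a rough sketch that uses the standard library's difflib.SequenceMatcher as a stand-in for fuzzywuzzy (it is not fuzzywuzzy's API). The helper name titles_match and the 0.85 threshold are assumptions chosen for illustration and would need tuning.

```python
from difflib import SequenceMatcher

def titles_match(page_title, user_title, threshold=0.85):
    """Return True when the two titles are similar enough (hypothetical helper)."""
    # Normalise case and runs of whitespace before comparing
    a = " ".join(page_title.lower().split())
    b = " ".join(user_title.lower().split())
    return SequenceMatcher(None, a, b).ratio() >= threshold

page = "Jack Charlton: 1966 England World Cup winner dies aged 85"
print(titles_match(page, "Jack Charlton 1966 England World Cup winner dies aged 85"))
print(titles_match(page, "Completely different headline"))
```

A ratio-based check tolerates small punctuation or spelling differences, but it will not reliably ignore a long site-name suffix, so trimming the title first still helps.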
Here is how reddit handles getting title data.
https://github.com/reddit-archive/reddit/blob/40625dcc070155588d33754ef5b15712c254864b/r2/r2/lib/utils/utils.py#L255
def extract_title(data):
    """Try to extract the page title from a string of HTML.
    An og:title meta tag is preferred, but will fall back to using
    the <title> tag instead if one is not found. If using <title>,
    also attempts to trim off the site's name from the end.
    """
    bs = BeautifulSoup(data, convertEntities=BeautifulSoup.HTML_ENTITIES)
    if not bs or not bs.html.head:
        return
    head_soup = bs.html.head
    title = None
    # try to find an og:title meta tag to use
    og_title = (head_soup.find("meta", attrs={"property": "og:title"}) or
                head_soup.find("meta", attrs={"name": "og:title"}))
    if og_title:
        title = og_title.get("content")
    # if that failed, look for a <title> tag to use instead
    if not title and head_soup.title and head_soup.title.string:
        title = head_soup.title.string
        # remove end part that's likely to be the site's name
        # looks for last delimiter char between spaces in strings
        # delimiters: |, -, emdash, endash,
        #             left- and right-pointing double angle quotation marks
        reverse_title = title[::-1]
        to_trim = re.search(u'\s[\u00ab\u00bb\u2013\u2014|-]\s',
                            reverse_title,
                            flags=re.UNICODE)
        # only trim if it won't take off over half the title
        if to_trim and to_trim.end() < len(title) / 2:
            title = title[:-(to_trim.end())]
    if not title:
        return
    # get rid of extraneous whitespace in the title
    title = re.sub(r'\s+', ' ', title, flags=re.UNICODE)
    return title.encode('utf-8').strip()
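The reddit code above targets Python 2 and BeautifulSoup 3. The trimming heuristic on its own can be sketched in Python 3 with only the standard library; the function name trim_site_name is my own, and this is an adaptation of the idea rather than reddit's actual code.

```python
import re

# Delimiters: left/right double angle quotes, en dash, em dash, | and -,
# each surrounded by whitespace
DELIMITER = re.compile(r'\s[\u00ab\u00bb\u2013\u2014|-]\s')

def trim_site_name(title):
    """Drop a trailing ' - Site Name' style suffix unless it is over half the title."""
    # Searching the reversed string finds the LAST delimiter in the title
    to_trim = DELIMITER.search(title[::-1])
    if to_trim and to_trim.end() < len(title) / 2:
        title = title[:-to_trim.end()]
    return title

print(trim_site_name("Jack Charlton: 1966 England World Cup winner dies aged 85 - BBC Sport"))
```

The "less than half" guard is what protects a title whose real text contains a delimiter early on; it is a heuristic, so a short headline followed by a long site name would be left untrimmed.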
I would like to extract text from HTML that exactly matches a given string using BeautifulSoup, but I am also getting text that only almost matches.
My code is:
from bs4 import BeautifulSoup
import re
import urllib2
url="http://www.somesite.com"
page=urllib2.urlopen(url)
soup=BeautifulSoup(page,"lxml")
for elem in soup(text=re.compile("exact text")):
    print elem
For the above code, the output is like:
1.exact text
2.almost exact text
How can I get only the exact match using BeautifulSoup?
Note: the variable (elem) should be of type <class 'bs4.element.Comment'>.
You can search your soup for the desired element using its tag and any attribute value.
For example, this code searches for all a elements with id equal to some_id_value.
It then loops over each element found, testing whether its .text value is equal to "exact text".
If so, it prints the whole element.
for elem in soup.find_all('a', {'id':'some_id_value'}):
    if elem.text == "exact text":
        print(elem)
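As far as I know, BeautifulSoup applies a compiled pattern with search() semantics, which would explain why "almost exact text" was returned in the question. The sketch below shows the difference on plain strings with the standard re module only; the list of candidates is made up for illustration.

```python
import re

candidates = ["exact text", "almost exact text"]
pattern = re.compile("exact text")

# search() succeeds anywhere in the string, so both candidates are kept
loose_hits = [s for s in candidates if pattern.search(s)]

# fullmatch() succeeds only when the whole string is the pattern
strict_hits = [s for s in candidates if pattern.fullmatch(s)]

print(loose_hits)
print(strict_hits)
```

Inside a soup call, the equivalent would presumably be to anchor the pattern, for example re.compile(r"^exact text$"), or to compare the text with == as in the loop above.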
Use BeautifulSoup's find_all method with its string argument for this.
As an example, here I parse a small page from Wikipedia about a place in Jamaica. I look for all strings whose text is exactly 'Jamaica stubs', expecting to find just one. When I find it, I display the text and its parent.
>>> url = 'https://en.wikipedia.org/wiki/Cassava_Piece'
>>> from bs4 import BeautifulSoup
>>> import requests
>>> page = requests.get(url).text
>>> soup = BeautifulSoup(page, 'lxml')
>>> for item in soup.find_all(string="Jamaica stubs"):
...     item
...     item.findParent()
...
'Jamaica stubs'
Jamaica stubs
On second thoughts, after reading the comment, a better way would be:
>>> url = 'https://en.wikipedia.org/wiki/Hockey'
>>> from bs4 import BeautifulSoup
>>> import requests
>>> import re
>>> page = requests.get(url).text
>>> soup = BeautifulSoup(page, 'lxml')
>>> for i, item in enumerate(soup.find_all(string=re.compile('women', re.IGNORECASE))):
...     i, item.findParent().text[:100]
...
(0, "Women's Bandy World Championships")
(1, "The governing body is the 126-member International Hockey Federation (FIH). Men's field hockey has b")
(2, 'The governing body of international play is the 77-member International Ice Hockey Federation (IIHF)')
(3, "women's")
I use IGNORECASE in the regex so that both 'Women' and 'women' are found in the Wikipedia article. I use enumerate in the for loop to number the displayed items, which makes them easier to read.
import urllib2,sys
from bs4 import BeautifulSoup,NavigableString
obama_4427_url = 'http://www.millercenter.org/president/obama/speeches/speech-4427'
obama_4427_html = urllib2.urlopen(obama_4427_url).read()
obama_4427_soup = BeautifulSoup(obama_4427_html)
# find the speech itself within the HTML
obama_4427_div = obama_4427_soup.find('div',{'id': 'transcript'},{'class': 'displaytext'})
# convert soup to string for easier processing
obama_4427_str = str(obama_4427_div)
# list of characters to be removed from obama_4427_str
remove_char = ['<br/>','</p>','</div>','<div class="indent" id="transcript">','<h2>','</h2>','<p>']
remove_char
for char in obama_4427_str:
    if char in obama_4427_str:
        obama_4427_replace = obama_4427_str.replace(remove_char,'')
obama_4427_replace = obama_4427_str.replace(remove_char,'')
print(obama_4427_replace)
Using BeautifulSoup, I scraped one of Obama's speeches off of the above website. Now, I need to replace some residual HTML in an efficient manner. I've stored a list of elements I'd like to eliminate in remove_char. I'm trying to write a simple for statement, but am getting the error: TypeError: expected a character object buffer. It's a beginner question, I know, but how can I get around this?
Since you are already using BeautifulSoup, you can use obama_4427_div.text directly instead of str(obama_4427_div) to get the text. The result will then not contain any residual HTML elements.
Example -
>>> obama_4427_div = obama_4427_soup.find('div',{'id': 'transcript'},{'class': 'displaytext'})
>>> obama_4427_str = obama_4427_div.text
>>> print(obama_4427_str)
Transcript
To Chairman Dean and my great friend Dick Durbin; and to all my fellow citizens of this great nation;
With profound gratitude and great humility, I accept your nomination for the presidency of the United States.
Let me express my thanks to the historic slate of candidates who accompanied me on this ...
...
...
...
Thank you, God Bless you, and God Bless the United States of America.
For completeness, for removing elements from a string, I would create a list of elements to remove (like the remove_char list you have created) and then we can do str.replace() on the string for each element in the list. Example -
obama_4427_str = str(obama_4427_div)
remove_char = ['<br/>','</p>','</div>','<div class="indent" id="transcript">','<h2>','</h2>','<p>']
for char in remove_char:
    obama_4427_str = obama_4427_str.replace(char,'')
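To show why the original attempt raised a TypeError and how the per-item loop avoids it, here is a small self-contained sketch. The snippet string is a made-up miniature of the scraped markup, not the real page, and the names snippet, remove_items and cleaned are mine.

```python
# Hypothetical miniature of the scraped markup, not the real page
snippet = '<div class="indent" id="transcript"><h2>Transcript</h2><p>Thank you.<br/></p></div>'
remove_items = ['<br/>', '</p>', '</div>', '<div class="indent" id="transcript">',
                '<h2>', '</h2>', '<p>']

try:
    # str.replace() expects a string to look for; passing the whole list fails
    snippet.replace(remove_items, '')
except TypeError as exc:
    print('TypeError:', exc)

# Replace one item at a time, feeding each result into the next call
cleaned = snippet
for item in remove_items:
    cleaned = cleaned.replace(item, '')
print(cleaned)
```

Note that removing tags this way also removes the whitespace structure they implied, which is one more reason the .text approach above is usually the more comfortable one.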
I am trying to create a regex to extract the telephone, streetAddress and Pages values (9440717256, H.No. 3-11-62, RTC Colony, ...) from an HTML page in Python. These three fields are optional. I tried this regex, but the output is inconsistent:
telephone\S+>(.+)</em>.*(?:streetAddress\S+(.+)</span>)?.*(?:pages\S+>(.+)</a></span>)?
sample string
<em phone="**telephone**">9440717256</em></div></div></li><li class="row"><i class="icon-sm icon-address"></i><div class="profile-details"><strong>Address</strong><div class="profile-child"><address itemprop="address" itemscope="" itemtype="http://schema.org/PostalAddress" class="data-item"><span itemprop="**streetAddress**">H.No. 3-11-62, RTC Colony</span>, <span>Vastu Colony, </span><span class="text-black" itemprop="addressLocality">Lal Bahadur Nagar</span>
Can anyone help me build the regex, please?
Considering that your input is not valid HTML and that it may be subject to change, you can use an HTML parser like BeautifulSoup. If the markup changes, though, these simple selectors will have to be adapted.
from bs4 import BeautifulSoup
h = """<em phone="**telephone**">9440717256</em></div></div></li><li class="row"><i class="icon-sm icon-address"></i><div class="profile-details"><strong>Address</strong><div class="profile-child"><address itemprop="address" itemscope="" itemtype="http://schema.org/PostalAddress" class="data-item"><span itemprop="**streetAddress**">H.No. 3-11-62, RTC Colony</span>, <span>Vastu Colony, </span><span class="text-black" itemprop="addressLocality">Lal Bahadur Nagar</span>"""
soup = BeautifulSoup(h)
Edit: Since you now tell us that you want the text of the elements that have the specified attribute value, you can use a function as filter.
def find_phone(tag):
    return tag.has_attr("phone") and tag.get("phone") == "**telephone**"
def find_streetAddress(tag):
    return tag.has_attr("itemprop") and tag.get("itemprop") == "**streetAddress**"
def find_pages(tag):
    return tag.has_attr("title") and tag.get("title") == "**Pages**"
print(soup.find(find_phone).string)
print(soup.find(find_streetAddress).string)
print(soup.find(find_pages).string)
Output:
9440717256
H.No. 3-11-62, RTC Colony
Lal Bahadur Nagar
Regex is safe to use if you know the HTML provider and what the markup looks like.
In that case, just use alternation and named capture groups.
telephone[^>]*>(?P<Telephone>[^<]+)|streetAddress[^>]*>(?P<Address>[^<]+)|Pages[^>]*>(?P<Pages>[^<]+)
See demo
In case > is not serialized, you can use this regex (more universal one, edit: now, verbose):
telephone[^<]*> # Looking for telephone
(?P<Telephone>[^<]+) # Capture telephone (all text up to the next tag)
|
streetAddress[^<]*> # Looking for streetAddress
(?P<Address>[^<]+) # Capture address (all text up to the next tag)
|
Pages[^<]*> # Looking for Pages
(?P<Pages>[^<]+) # Capture Pages (all text up to the next tag)
Sample demo on IDEONE
Pasting regex code part:
p = re.compile(ur'''telephone[^<]*> # Looking for telephone
(?P<Telephone>[^<]+) # Capture telephone (all text up to the next tag)
|
streetAddress[^<]*> # Looking for streetAddress
(?P<Address>[^<]+) # Capture address (all text up to the next tag)
|
Pages[^<]*> # Looking for Pages
(?P<Pages>[^<]+) # Capture Pages (all text up to the next tag)''', re.IGNORECASE | re.VERBOSE)
test_str = "YOUR STRING"
print filter(None, [x.group("Telephone") for x in re.finditer(p, test_str)])
print filter(None, [x.group("Address") for x in re.finditer(p, test_str)])
print filter(None, [x.group("Pages") for x in re.finditer(p, test_str)])
Output (doubled results are the result of my duplicating the input string with different node order):
[u'9440717256', u'9440717256']
[u'H.No. 3-11-62, RTC Colony', u'H.No. 3-11-62, RTC Colony']
[u'Lal Bahadur Nagar', u'Lal Bahadur Nagar']
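The snippet above is Python 2. For reference, a Python 3 sketch of the same alternation with named groups could look like this. The shortened sample string is hypothetical (modelled on the question's markup), and collecting results by match.lastgroup is my own choice rather than part of the answer.

```python
import re

# Shortened, hypothetical sample modelled on the question's markup
sample = (
    '<em phone="**telephone**">9440717256</em>'
    '<span itemprop="**streetAddress**">H.No. 3-11-62, RTC Colony</span>'
)

pattern = re.compile(
    r"""
    telephone[^<]*>(?P<Telephone>[^<]+)      # text right after the telephone tag
    | streetAddress[^<]*>(?P<Address>[^<]+)  # text right after the streetAddress tag
    | Pages[^<]*>(?P<Pages>[^<]+)            # text right after the Pages tag
    """,
    re.IGNORECASE | re.VERBOSE,
)

found = {}
for match in pattern.finditer(sample):
    # lastgroup is the name of the alternative that matched
    found.setdefault(match.lastgroup, []).append(match.group(match.lastgroup))

print(found)
```

Because each alternative matches independently, a missing field simply produces no entry, which fits the "all three fields are optional" requirement.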
I have an HTML page to scrape data from.
I need to get the item title, here 'Caliper Ball'.
I'm getting the data from the tag where that title appears:
item_title = base_page.find_all('h1', class_='itemTitle')
It contains this tag structure:
> [<h1 class="itemTitle"> <div class="l1">Caliper</div>
> Ball
> </h1>]
To extract 'Caliper Ball' I'm using
collector = []
for _ in item_title:
    collector.append(_.text)
so I'm getting this ugly output in the collector list:
[u"\nCaliper\r\n Ball\r\n "]
How can I clean the output up so that it reads "Caliper Ball"?
Don't use regex. You're adding too much overhead for something simple. BeautifulSoup4 already has something for this called stripped_strings. See my code below.
from bs4 import BeautifulSoup as bsoup
html = """[<h1 class="itemTitle"> <div class="l1">Caliper</div>
Ball
</h1>]"""
soup = bsoup(html)
soup.prettify()
item = soup.find("h1", class_="itemTitle")
base = list(item.stripped_strings)
print " ".join(base)
Result:
Caliper Ball
[Finished in 0.5s]
Explanation: stripped_strings basically gets all the text inside a specified tag, strips them of all the spaces, line breaks, what have you. It returns a generator, which we can catch with list so it returns a list instead. Once it's a list, it's just a matter of using " ".join.
Let us know if this helps.
PS: Just to correct myself -- there's actually no need to use list on the result of stripped_strings, but it's better to show the above as such so it's explicit.
This regex will help you get the output (Caliper Ball):
import re
html = """[<h1 class="itemTitle"> <div class="l1">Caliper</div>
Ball
</h1>]"""
regex = r'.*>([^<]*)<\/div>\s*\n\s*(\w*).*'
match = re.findall(regex, html)
new_data = (' '.join(w) for w in match)
print ''.join(new_data) # => Caliper Ball
You can use the replace() method to replace \n and \r with nothing or a space, and then use strip() to remove the surrounding spaces.
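A compact way to do that kind of cleanup on the string from the question is to let split() break on any run of whitespace and then join the pieces with single spaces. This is a small sketch; the variable names raw and clean are mine, and the exact amount of padding in raw is assumed.

```python
# The text as returned by .text in the question (padding width is assumed)
raw = "\nCaliper\r\n                    Ball\r\n                "

# split() with no arguments splits on any whitespace run and drops empty pieces
clean = " ".join(raw.split())

print(clean)
```

This handles \n, \r, tabs and repeated spaces in one step, so no separate replace() calls are needed.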