How to grab accurate titles from web pages without including site data - python

I found this link (and a few others) that talks a bit about using BeautifulSoup to read HTML. It mostly does what I want: it grabs the title of a webpage.
import requests
from bs4 import BeautifulSoup

def get_title(url):
    html = requests.get(url).text
    if len(html) > 0:
        contents = BeautifulSoup(html, "html.parser")
        title = contents.title.string
        return title
    return None
The issue that I run into is that sometimes articles will come back with metadata attached at the end with " - some_data". A good example is this link to a BBC Sport article which reports the title as
Jack Charlton: 1966 England World Cup winner dies aged 85 - BBC Sport
I could do something simple like cutting off everything after the last " - " separator:
title = title.rsplit(' - ', 1)[0]
But that assumes metadata always follows a " - " separator, and I don't want to assume there will never be an article whose title itself ends in " - part_of_title".
I found the Newspaper3k library but it's definitely more than I need - all I need is to grab a title and ensure that it's the same as what the user posted. My friend who pointed me to Newspaper3k also mentioned it could be buggy and didn't always find titles correctly, so I would be inclined to use something else if possible.
My current thought is to continue using BeautifulSoup and just add on fuzzywuzzy, which would honestly also help with slight misspellings or punctuation differences. But I would certainly prefer to start from a place that compares against accurate titles.
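For the comparison step, fuzzywuzzy would work, but the standard library's difflib gives a similar similarity ratio without an extra dependency. A minimal sketch (the 0.9 threshold is an arbitrary choice, not something from the question):

```python
from difflib import SequenceMatcher

def titles_match(scraped, posted, threshold=0.9):
    """Return True if two titles are similar enough, ignoring case
    and surrounding whitespace. threshold is an arbitrary cutoff."""
    a, b = scraped.strip().lower(), posted.strip().lower()
    return SequenceMatcher(None, a, b).ratio() >= threshold

print(titles_match(
    "Jack Charlton: 1966 England World Cup winner dies aged 85",
    "jack charlton: 1966 England World Cup winner dies aged 85"))  # → True
```

This tolerates casing and small punctuation differences, though the threshold would need tuning against real titles.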

Here is how reddit handles getting title data.
https://github.com/reddit-archive/reddit/blob/40625dcc070155588d33754ef5b15712c254864b/r2/r2/lib/utils/utils.py#L255
def extract_title(data):
    """Try to extract the page title from a string of HTML.

    An og:title meta tag is preferred, but will fall back to using
    the <title> tag instead if one is not found. If using <title>,
    also attempts to trim off the site's name from the end.
    """
    bs = BeautifulSoup(data, convertEntities=BeautifulSoup.HTML_ENTITIES)
    if not bs or not bs.html.head:
        return
    head_soup = bs.html.head

    title = None

    # try to find an og:title meta tag to use
    og_title = (head_soup.find("meta", attrs={"property": "og:title"}) or
                head_soup.find("meta", attrs={"name": "og:title"}))
    if og_title:
        title = og_title.get("content")

    # if that failed, look for a <title> tag to use instead
    if not title and head_soup.title and head_soup.title.string:
        title = head_soup.title.string

        # remove end part that's likely to be the site's name
        # looks for last delimiter char between spaces in strings
        # delimiters: |, -, emdash, endash,
        #             left- and right-pointing double angle quotation marks
        reverse_title = title[::-1]
        to_trim = re.search(u'\s[\u00ab\u00bb\u2013\u2014|-]\s',
                            reverse_title,
                            flags=re.UNICODE)

        # only trim if it won't take off over half the title
        if to_trim and to_trim.end() < len(title) / 2:
            title = title[:-(to_trim.end())]

    if not title:
        return

    # get rid of extraneous whitespace in the title
    title = re.sub(r'\s+', ' ', title, flags=re.UNICODE)

    return title.encode('utf-8').strip()
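That snippet is written against the old BeautifulSoup 3 API (convertEntities, byte-string return value), so it won't run on bs4. A rough sketch of the same idea ported to bs4 / Python 3 (an approximation, not reddit's code):

```python
import re
from bs4 import BeautifulSoup

def extract_title(data):
    """Prefer og:title; fall back to <title> and trim a trailing site name."""
    soup = BeautifulSoup(data, "html.parser")
    og = (soup.find("meta", attrs={"property": "og:title"}) or
          soup.find("meta", attrs={"name": "og:title"}))
    title = og.get("content") if og else None
    if not title and soup.title and soup.title.string:
        title = soup.title.string
        # search the reversed string so the *last* " | ", " - ", etc. is found first
        to_trim = re.search(r'\s[\u00ab\u00bb\u2013\u2014|-]\s', title[::-1])
        # only trim if it won't take off over half the title
        if to_trim and to_trim.end() < len(title) / 2:
            title = title[:-to_trim.end()]
    if not title:
        return None
    # collapse runs of whitespace
    return re.sub(r'\s+', ' ', title).strip()
```

The half-length guard is what protects titles that legitimately contain a delimiter near the start.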

Related

How to cleanup ‘orphaned’ text in poorly written html using Python?

Let me explain: I scraped HTML off a poorly written website and wish to clean up the code by encapsulating each line within a <div> tag, keeping the existing bold, italic and other formatting, and keeping the images and links. I will then format everything and prettify it once cleaned.
Below are 3 sample lines from the website:
line1 = '1. O: upper border of 1st rib & cartilage.<div>2. I: inferior surface of middle third of the clavicle. </div><div>3. NS: nerve to subclavius. </div><div>4. A: anchors & depresses clavicle. </div><div><br></div><div><div><img src="paste-3461743641109.jpg"></div><div><span style="font-weight: bolder">Image: </span>Gray, Henry. <i>Anatomy of the Human Body.</i> Philadelphia: Lea & Febiger, 1918; Bartleby.com, 2000. www.bartleby.com/107/ [Accessed 15 Nov. 2018]. </div></div>'
line2 = '''<div><i>CVS</i></div><div>1. Cardiovascular conditioning & improves postural hypotension<br></div><div><b><span style="font-weight: 400;">2. Improves ventilation</span></b><br></div><div><b><span style="font-weight: 400;"><br></span></b></div><div><b><span style="font-weight: 400;"><i>BONES</i></span></b></div><div>3. Promote & maintain bone density, prevent osteoporosis<b><span style="font-weight: 400;"><br></span></b></div><div><br></div><div><i>MUSCLES & JOINTS</i></div>4. Safe reintroduction of the patient to vertical position<div><div>5. Facilitate early weight bearing</div><div><b><span style="font-weight: 400;">6. Prevent contractures</span></b><br></div><div><b><span style="font-weight: 400;"><br></span></b></div><div><b><span style="font-weight: 400;"><i>SKIN</i></span></b></div><div>7. Decreases prolonged bed rest & its complications</div></div><div><br></div><div><i>PSYCHOLOGY</i></div><div>8. Improves psychological outlook & motivation</div>'''
line3 = '''ORIGIN<div>1. Branch of the posterior cord of the brachial plexus - C5, C6.</div><div><br></div><div>COURSE</div>2. Passes out of the axilla, through the quadrangular space with posterior circumflex humeral vessels, to the upper arm where it's in contact with surgical neck of the humerus. </div><div><br></div><div>BRANCHES</div><div><i><font color="#ff086c">3. Sensory supply to small 'regimental patch' over shoulder.</font></i></div><div><i><font color="#ff086c">4. Anterior - supplies the deltoid. </font></i></div><div><i><font color="#ff086c">5. Posterior - supplies teres minor, becomes upper lateral cutaneous nerve of the arm. </font></i></div><i><font color="#ff086c"><img src="paste-6103148528016.jpg"></font></i></div><div><div><b style="font-weight: bold; ">Image: </b>Gray, Henry. <i>Anatomy of the Human Body.</i> Philadelphia: Lea & Febiger, 1918; Bartleby.com, 2000. www.bartleby.com/107/ [Accessed 16 Nov. 2018].</div></div>'''
You will notice that in line1 there is no <div> tag at all at the beginning, whereas line2 starts with a tag but point 4 within it is not enclosed in such a tag. line3 has multiple strings not enclosed in <div> tags.
I wrote the following to correct the first line (line1):
# 1. First, find all lines enclosed in <div> tags
temp_soup = BeautifulSoup(html.unescape(line), "html.parser")
soup = BeautifulSoup("", "html.parser")
for tag in temp_soup.find_all('div'):
    tag.extract()
    soup.append(tag)

# 2. Then, ensure that the first line starts with the <div> tag, else isolate the first sentence and enclose it between <div> tags
new_div = soup.new_tag("div")
new_div.string = str(temp_soup)
soup.insert(0, new_div)
print(soup)
However, the above code does not fix the second line. Moreover, it cannot correct lines with multiple strings not enclosed in <div> tags.
Could someone suggest an algorithm to clean up all 3 lines? I've tried BeautifulSoup.prettify() and lxml clean_html() to no avail.
> I've tried BeautifulSoup.prettify() and lxml clean_html() to no avail.
These generally just make sure that the html is valid, so even if you consider it poorly written, they are unlikely to correct anything unless it makes the html invalid.
It's probably not the most elegant solution, but I think this will do what you want:
# from bs4 import BeautifulSoup

## just a function I use often to reduce whitespace ##
def miniStr(obj, lineSep=' ', wordSep=' '):
    return lineSep.join(wordSep.join(
        w for w in l.split() if w) for l in str(obj).splitlines() if l.strip())

## the solution ##
def containStrings(inpObj, asHtmlStr=True, tCont='div'):
    ### initiate [main] container - "fixed" html will be added to this ###
    cDiv = BeautifulSoup('<div></div>', "html.parser").div

    ### format inputs ###
    tCont = str(tCont) if str(tCont).isalpha() else 'div'  # can only contain letters
    if not isinstance(inpObj, type(cDiv)):
        # parses html strings twice [first time to correct errors - like open tags]
        inpObj = BeautifulSoup(
            f'<div>{BeautifulSoup(inpObj, "html.parser")}</div>', "html.parser").div

    ### loop through input and fill up cDiv ###
    for c in inpObj.children:
        if isinstance(c, str):
            c = BeautifulSoup(f'<{tCont}>{c}</{tCont}>', "html.parser").find(tCont)
        cDiv.append(BeautifulSoup(c.prettify(), "html.parser").find(c.name))

    return cDiv.prettify()[6:-7] if asHtmlStr else cDiv
The input can be an HTML string or a bs4 Tag, and the output can be either as well (the default output is a string; set the asHtmlStr argument to False to get a Tag instead).
Also, the "orphan" strings don't necessarily have to be enclosed in div tags - you can specify a different Tag name via the tCont argument.
For test inputs, I copied your sample lines into a list named linesList and compared the HTML before and after being processed by the function. The comparison was printed with:
# the inputs [before processing]
print('\n\n'.join(f'<!--from line{i+1}-->\n' + '\n'.join(
    [miniStr(c, '') for c in BeautifulSoup(l, 'html.parser').children]
) for i, l in enumerate(linesList)))

# the outputs [after processing]
print('\n\n'.join(f'<!--from line{i+1}-->\n' + '\n'.join(
    [miniStr(c, '') for c in containStrings(l, False).children]
) for i, l in enumerate(linesList)))

Pulling Specific Text From Part of a Website

I'm new to web scraping and BeautifulSoup. I'm making a currency converter by scraping a site. I use this code to pull the currency rate:
import requests
from bs4 import BeautifulSoup
from_ = input("WHICH CURRENCY DO YOU WANT TO CONVERT: ").upper()
to = input("WHICH CURRENCY DO YOU WANT TO CONVERT TO: ").upper()
url = requests.get(f'https://www.xe.com/currencyconverter/convert/?Amount=1&From={from_}&To={to}').text
soup = BeautifulSoup(url, 'lxml')
currency = soup.find('p', class_ = 'result__BigRate-sc-1bsijpp-1 iGrAod').getText()
print(currency)
This works, but it returns the full text (e.g. "0.84311378 Euros"). I want to pull only the numbers marked in red in the picture:
Since the number will always be the first element of this tag, an easy way could be:
currency_tag = soup.find('p', class_ = 'result__BigRate-sc-1bsijpp-1 iGrAod')
print(next(iter(currency_tag)))
And result:
0.84
You can also use .contents and get the first item from it.
currency = soup.find('p', class_ = 'result__BigRate-sc-1bsijpp-1 iGrAod').contents
print(currency[0].strip())
0.84
From what I can see, the string you highlighted in the picture represents the first four characters of the resulting price.
This means that any time you convert one currency to another, the part marked in red will always be a string of length 4.
We can pull the information you need by getting a substring of the paragraph’s text. Just replace the last line you provided with:
print(currency[0:4])
This will always return a string containing exactly the characters you are looking for.
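A fixed-length slice breaks as soon as the rate has a different number of digits before the decimal point (e.g. a JPY rate like 110.25), so parsing the leading number as a float may be more robust. A sketch, assuming the scraped text looks like "0.84311378 Euros":

```python
text = "0.84311378 Euros"        # example of the scraped paragraph text (assumed format)
rate = float(text.split()[0])    # the rate is the first whitespace-separated token
print(f"{rate:.2f}")             # → 0.84
```

Formatting with :.2f then gives two decimal places regardless of the magnitude of the rate.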

Skipping XML elements using Regular Expressions in Python 3

I have an XML document where I wish to extract certain text contained in specific tags such as-
<title>Four-minute warning</title>
<categories>
<category>Nuclear warfare</category>
<category>Cold War</category>
<category>Cold War military history of the United Kingdom</category>
<category>disaster preparedness in the United Kingdom</category>
<category>History of the United Kingdom</category>
</categories>
<bdy>
some text
</bdy>
In this toy example, if I want to extract the text contained in the <title> tags, I use the following Regular Expression code in Python 3:
# Python 3 code using RE
import re

file = open("some_xml_file.xml", "r")
xml_doc = file.read()
file.close()

title_text = re.findall(r'<title>.+</title>', xml_doc)
if title_text:
    print("\nMatches found!\n")
    for title in title_text:
        print(title)
else:
    print("\nNo matches found!\n\n")
It gives me the text within the XML tags ALONG with the tags. An example of a single output would be-
<title>Four-minute warning</title>
My question is, how should I frame the pattern within the re.findall() or re.search() methods so that the <title> and </title> tags are skipped and all I get is the text between them?
Thanks for your help!
Just use a capture group in your regex (re.findall() takes care of the rest in this case). For example:
import re
s = '<title>Four-minute warning</title>'
title_text = re.findall(r'<title>(.+)</title>', s)
print(title_text[0])
# OUTPUT
# Four-minute warning
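Since the input is actual XML, a real parser sidesteps the usual pitfalls of regex on markup. A minimal sketch with the standard library's xml.etree.ElementTree, assuming the fragments are wrapped in a single root element (the <article> root here is made up for the example):

```python
import xml.etree.ElementTree as ET

doc = """<article>
  <title>Four-minute warning</title>
  <categories>
    <category>Nuclear warfare</category>
    <category>Cold War</category>
  </categories>
</article>"""

root = ET.fromstring(doc)
print(root.findtext("title"))                   # → Four-minute warning
print([c.text for c in root.iter("category")])  # → ['Nuclear warfare', 'Cold War']
```

Unlike the regex, this also handles attributes, nested tags, and entities without any changes to the pattern.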

Remove items in string paragraph if they belong to a list of strings?

import urllib2, sys
from bs4 import BeautifulSoup, NavigableString

obama_4427_url = 'http://www.millercenter.org/president/obama/speeches/speech-4427'
obama_4427_html = urllib2.urlopen(obama_4427_url).read()
obama_4427_soup = BeautifulSoup(obama_4427_html)
# find the speech itself within the HTML
obama_4427_div = obama_4427_soup.find('div',{'id': 'transcript'},{'class': 'displaytext'})
# convert soup to string for easier processing
obama_4427_str = str(obama_4427_div)
# list of characters to be removed from obama_4427_str
remove_char = ['<br/>','</p>','</div>','<div class="indent" id="transcript">','<h2>','</h2>','<p>']
remove_char
for char in obama_4427_str:
    if char in obama_4427_str:
        obama_4427_replace = obama_4427_str.replace(remove_char,'')
        obama_4427_replace = obama_4427_str.replace(remove_char,'')
        print(obama_4427_replace)
Using BeautifulSoup, I scraped one of Obama's speeches off of the above website. Now, I need to replace some residual HTML in an efficient manner. I've stored a list of elements I'd like to eliminate in remove_char. I'm trying to write a simple for statement, but am getting the error: TypeError: expected a character object buffer. It's a beginner question, I know, but how can I get around this?
Since you are using BeautifulSoup already, you can use obama_4427_div.text directly instead of str(obama_4427_div) to get the correctly formatted text. The text you get would then not contain any residual HTML elements, etc.
Example -
>>> obama_4427_div = obama_4427_soup.find('div',{'id': 'transcript'},{'class': 'displaytext'})
>>> obama_4427_str = obama_4427_div.text
>>> print(obama_4427_str)
Transcript
To Chairman Dean and my great friend Dick Durbin; and to all my fellow citizens of this great nation;
With profound gratitude and great humility, I accept your nomination for the presidency of the United States.
Let me express my thanks to the historic slate of candidates who accompanied me on this ...
...
...
...
Thank you, God Bless you, and God Bless the United States of America.
For completeness, for removing elements from a string, I would create a list of elements to remove (like the remove_char list you have created) and then we can do str.replace() on the string for each element in the list. Example -
obama_4427_str = str(obama_4427_div)
remove_char = ['<br/>','</p>','</div>','<div class="indent" id="transcript">','<h2>','</h2>','<p>']
for char in remove_char:
    obama_4427_str = obama_4427_str.replace(char,'')

Extracting text from html tags with beautiful soup

I have an HTML page to scrape data from.
So I need to get the item title, like here: 'Caliper Ball'.
I'm getting data from the tag where that title appears:
item_title = base_page.find_all('h1', class_='itemTitle')
It contains this tag structure:
> [<h1 class="itemTitle"> <div class="l1">Caliper</div>
> Ball
> </h1>]
To extract 'Caliper Ball' I'm using
collector = []
for _ in item_title:
    collector.append(_.text)
so I'm getting this ugly output in the collector list:
[u"\nCaliper\r\n Ball\r\n "]
How can I make the output clean, like "Caliper Ball"?
Don't use regex. You're adding too much overhead for something simple. BeautifulSoup4 already has something for this called stripped_strings. See my code below.
from bs4 import BeautifulSoup as bsoup
html = """[<h1 class="itemTitle"> <div class="l1">Caliper</div>
Ball
</h1>]"""
soup = bsoup(html)
soup.prettify()
item = soup.find("h1", class_="itemTitle")
base = list(item.stripped_strings)
print " ".join(base)
Result:
Caliper Ball
[Finished in 0.5s]
Explanation: stripped_strings basically gets all the text inside a specified tag, strips them of all the spaces, line breaks, what have you. It returns a generator, which we can catch with list so it returns a list instead. Once it's a list, it's just a matter of using " ".join.
Let us know if this helps.
PS: Just to correct myself -- there's actually no need to use list on the result of stripped_strings, but it's better to show the above as such so it's explicit.
This regex will help you get the output (Caliper Ball):
import re
str="""[<h1 class="itemTitle"> <div class="l1">Caliper</div>
Ball
</h1>]"""
regex = r'.*>([^<]*)<\/div>\s*\n\s*(\w*).*'
match = re.findall(regex, str)
new_data = (' '.join(w) for w in match)
print ''.join(new_data) # => Caliper Ball
You can use the replace() method to replace \n and \r with nothing or a space, and after that use strip() (Python's equivalent of trim()) to remove the surrounding spaces.
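A minimal sketch of that cleanup on the question's example output (str.split() with no arguments already discards \n, \r, and runs of spaces, so the replace steps can be collapsed into one join):

```python
raw = u"\nCaliper\r\n               Ball\r\n           "
# split() with no arguments breaks on any run of whitespace (spaces, \r, \n)
clean = " ".join(raw.split())
print(clean)  # → Caliper Ball
```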
