Pulling Specific Text from a Part of a Website - Python

I'm new to web scraping and BeautifulSoup. I'm making a currency converter using a site. I use this code to pull the currency rate:
import requests
from bs4 import BeautifulSoup

from_ = input("WHICH CURRENCY DO YOU WANT TO CONVERT: ").upper()
to = input("WHICH CURRENCY DO YOU WANT TO CONVERT TO: ").upper()

# fetch the converter page and parse out the rate element
html = requests.get(f'https://www.xe.com/currencyconverter/convert/?Amount=1&From={from_}&To={to}').text
soup = BeautifulSoup(html, 'lxml')
currency = soup.find('p', class_='result__BigRate-sc-1bsijpp-1 iGrAod').getText()
print(currency)
This works, but it returns the full text (e.g. 0.84311378 Euros). I want to pull only the numbers marked in red in the picture:

Since the number will always be the first element of this tag, an easy way could be:
currency_tag = soup.find('p', class_ = 'result__BigRate-sc-1bsijpp-1 iGrAod')
print(next(iter(currency_tag)))
And the result:
0.84

You can also use .contents and get the first item from it.
currency = soup.find('p', class_ = 'result__BigRate-sc-1bsijpp-1 iGrAod').contents
print(currency[0].strip())
0.84

From what I can see, the string you highlighted in the picture represents the first four characters of the resulting price.
This means that any time you convert one currency to another, the numbers marked in red will always form a string of length 4.
We can pull the information you need by taking a substring of the paragraph's text. Just replace the last line you provided with:
print(currency[0:4])
This will always return a string containing exactly the characters you are looking for.
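Note that a fixed-length slice silently breaks when the rate has a different number of digits before the decimal point (e.g. a JPY rate like 110.53). A slightly more robust sketch, assuming the element's text always starts with the numeric rate, is to parse the leading number and round it:

import re

# currency holds the element's text, e.g. "0.84311378 Euros"
match = re.match(r'\d+\.\d+', currency)
if match:
    print(f'{float(match.group()):.2f}')  # 0.84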

Related

Pandas: How to avoid duplicated values when the value is a URL?

I have a column in my dataframe for articles that looks like this:
id link
1 https://www.msn.com/rachat-de-soufflet-par-invivo-les-secrets-dun-deal-%C3%A0-2-milliards-deuros/ar-BB1cKCRg
2 https://www.msn.com/rachat-de-soufflet-par-invivo-les-secrets-d-un-deal-%C3%A0-2-milliards-deuros/ar-BB1cKCRg
3 other link
For example, the first two URLs look the same but differ here:
d-un-deal
In my dataframe I have some links that are almost identical: the content is the same but the link changes slightly. Sometimes the difference between two links is a letter that is uppercase in one of them, or just one other differing character.
Example:
url1 = https://site/presidency...
url2 = https://site/Presidency...
url3 = https://site/news-of-today
url4 = same as url3 but at the end ?autoplay
How can I check all the links and delete the duplicates (similar content but the link is changing a little) ?
Here is one solution:
Find the similarity metric between two strings
You could use a string-similarity metric for this: score each pair of links and treat pairs above a chosen threshold as duplicates. Decide which similarity measure best fits your data.
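As a minimal sketch, assuming the DataFrame is named df with a link column as in the question, and using difflib.SequenceMatcher from the standard library as the metric:

from difflib import SequenceMatcher

import pandas as pd

df = pd.DataFrame({
    'id': [1, 2, 3],
    'link': [
        'https://site/presidency',
        'https://site/Presidency',
        'https://site/news-of-today',
    ],
})

def is_near_duplicate(a: str, b: str, threshold: float = 0.9) -> bool:
    # compare case-insensitively so 'presidency' matches 'Presidency'
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

keep = []
for link in df['link']:
    # keep a link only if it is not a near-duplicate of one already kept
    if not any(is_near_duplicate(link, kept) for kept in keep):
        keep.append(link)

deduped = df[df['link'].isin(keep)]
print(deduped)

The threshold is a judgment call: for URL-length strings, something around 0.9 catches single-character differences without merging genuinely different links.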

How to grab accurate titles from web pages without including site data

I found this link [and a few others] which talks a bit about using BeautifulSoup to read HTML. It mostly does what I want: it grabs the title of a web page.
import requests
from bs4 import BeautifulSoup

def get_title(url):
    html = requests.get(url).text
    if len(html) > 0:
        contents = BeautifulSoup(html, 'html.parser')
        title = contents.title.string
        return title
    return None
The issue I run into is that sometimes articles come back with site metadata attached at the end, as in " - some_data". A good example is this link to a BBC Sport article, which reports the title as
Jack Charlton: 1966 England World Cup winner dies aged 85 - BBC Sport
I could do something simple like cutting off anything after the last ' - ' delimiter:
title = title.rsplit(' - ', 1)[0]
But that assumes any metadata comes after a ' - ' value. I don't want to assume there will never be an article whose title ends in ' - part_of_title'.
I found the Newspaper3k library but it's definitely more than I need - all I need is to grab a title and ensure that it's the same as what the user posted. My friend who pointed me to Newspaper3k also mentioned it could be buggy and didn't always find titles correctly, so I would be inclined to use something else if possible.
My current thought is to continue using BeautifulSoup and just add on fuzzywuzzy which would honestly also help with slight misspellings or punctuation differences. But, I would certainly prefer to start from a place that included comparing against accurate titles.
Here is how reddit handles getting title data.
https://github.com/reddit-archive/reddit/blob/40625dcc070155588d33754ef5b15712c254864b/r2/r2/lib/utils/utils.py#L255
def extract_title(data):
    """Try to extract the page title from a string of HTML.

    An og:title meta tag is preferred, but will fall back to using
    the <title> tag instead if one is not found. If using <title>,
    also attempts to trim off the site's name from the end.
    """
    bs = BeautifulSoup(data, convertEntities=BeautifulSoup.HTML_ENTITIES)
    if not bs or not bs.html.head:
        return
    head_soup = bs.html.head

    title = None

    # try to find an og:title meta tag to use
    og_title = (head_soup.find("meta", attrs={"property": "og:title"}) or
                head_soup.find("meta", attrs={"name": "og:title"}))
    if og_title:
        title = og_title.get("content")

    # if that failed, look for a <title> tag to use instead
    if not title and head_soup.title and head_soup.title.string:
        title = head_soup.title.string

        # remove end part that's likely to be the site's name
        # looks for last delimiter char between spaces in strings
        # delimiters: |, -, emdash, endash,
        #             left- and right-pointing double angle quotation marks
        reverse_title = title[::-1]
        to_trim = re.search(u'\s[\u00ab\u00bb\u2013\u2014|-]\s',
                            reverse_title,
                            flags=re.UNICODE)

        # only trim if it won't take off over half the title
        if to_trim and to_trim.end() < len(title) / 2:
            title = title[:-(to_trim.end())]

    if not title:
        return

    # get rid of extraneous whitespace in the title
    title = re.sub(r'\s+', ' ', title, flags=re.UNICODE)

    return title.encode('utf-8').strip()
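The code above targets Python 2 and the old BeautifulSoup 3 API (convertEntities was removed in bs4, and the function returns UTF-8 bytes). A rough Python 3 / BeautifulSoup 4 adaptation of the same logic, untested against reddit's original but following it step for step, might look like:

import re

from bs4 import BeautifulSoup

def extract_title(data):
    """Prefer og:title; fall back to <title> and trim a trailing site name."""
    soup = BeautifulSoup(data, 'html.parser')
    if soup.head is None:
        return None

    title = None
    og_title = (soup.head.find('meta', attrs={'property': 'og:title'}) or
                soup.head.find('meta', attrs={'name': 'og:title'}))
    if og_title:
        title = og_title.get('content')

    if not title and soup.title and soup.title.string:
        title = soup.title.string
        # search the reversed title for a spaced delimiter: |, -, en/em dash, « »
        to_trim = re.search(r'\s[\u00ab\u00bb\u2013\u2014|-]\s', title[::-1])
        # only trim if it won't take off over half the title
        if to_trim and to_trim.end() < len(title) / 2:
            title = title[:-to_trim.end()]

    if not title:
        return None
    # collapse extraneous whitespace
    return re.sub(r'\s+', ' ', title).strip()

For the BBC example above, extract_title(requests.get(url).text) should then give 'Jack Charlton: 1966 England World Cup winner dies aged 85' with the ' - BBC Sport' suffix trimmed.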

Finding an exact match of text using BeautifulSoup

I would like to extract an exact matching text value from HTML using BeautifulSoup, but I am getting some almost-matching text along with my exact text.
My code is:
from bs4 import BeautifulSoup
import urllib2
import re

url = "http://www.somesite.com"
page = urllib2.urlopen(url)
soup = BeautifulSoup(page, "lxml")
for elem in soup(text=re.compile("exact text")):
    print elem
For the above code, the output is like:
1. exact text
2. almost exact text
How can I get only the exact match using BeautifulSoup?
Note: the variable (elem) should be of type <class 'bs4.element.Comment'>.
You can search your soup for the desired element using its tag and any attribute value.
I.e.: this code will search for all a elements with id equal to some_id_value.
Then it will loop over each element found, testing whether its .text value is equal to "exact text".
If so, it will print the whole element.
for elem in soup.find_all('a', {'id': 'some_id_value'}):
    if elem.text == "exact text":
        print(elem)
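The question's note says elem should end up as <class 'bs4.element.Comment'>, and a plain tag search will not surface HTML comments. A minimal sketch for matching a comment's text exactly (the sample HTML here is hypothetical):

from bs4 import BeautifulSoup
from bs4.element import Comment

html = "<div><!-- exact text --><!-- almost exact text --></div>"
soup = BeautifulSoup(html, "lxml")

# find_all accepts a callable for string=: keep only Comment nodes
# whose stripped text equals the target exactly
for elem in soup.find_all(
        string=lambda node: isinstance(node, Comment)
                            and node.strip() == "exact text"):
    print(repr(elem))  # ' exact text '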
Use BeautifulSoup's find_all method with its string argument for this.
As an example, here I parse a small page from Wikipedia about a place in Jamaica. I look for all strings whose text is 'Jamaica stubs', though I expect to find just one. When I find it, I display the text and its parent.
>>> url = 'https://en.wikipedia.org/wiki/Cassava_Piece'
>>> from bs4 import BeautifulSoup
>>> import requests
>>> page = requests.get(url).text
>>> soup = BeautifulSoup(page, 'lxml')
>>> for item in soup.find_all(string="Jamaica stubs"):
...     item
...     item.findParent()
...
'Jamaica stubs'
Jamaica stubs
On second thoughts, after reading the comment, a better way would be:
>>> url = 'https://en.wikipedia.org/wiki/Hockey'
>>> from bs4 import BeautifulSoup
>>> import requests
>>> import re
>>> page = requests.get(url).text
>>> soup = BeautifulSoup(page, 'lxml')
>>> for i, item in enumerate(soup.find_all(string=re.compile('women', re.IGNORECASE))):
...     i, item.findParent().text[:100]
...
(0, "Women's Bandy World Championships")
(1, "The governing body is the 126-member International Hockey Federation (FIH). Men's field hockey has b")
(2, 'The governing body of international play is the 77-member International Ice Hockey Federation (IIHF)')
(3, "women's")
I use IGNORECASE in the regex so that both 'Women' and 'women' are found in the wikipedia article. I use enumerate in the for loop so that I can number the items that are displayed to make them easier to read.

Remove items from a string paragraph if they belong to a list of strings?

import urllib2, sys
from bs4 import BeautifulSoup, NavigableString

obama_4427_url = 'http://www.millercenter.org/president/obama/speeches/speech-4427'
obama_4427_html = urllib2.urlopen(obama_4427_url).read()
obama_4427_soup = BeautifulSoup(obama_4427_html)
# find the speech itself within the HTML
obama_4427_div = obama_4427_soup.find('div', {'id': 'transcript'}, {'class': 'displaytext'})
# convert soup to string for easier processing
obama_4427_str = str(obama_4427_div)
# list of characters to be removed from obama_4427_str
remove_char = ['<br/>', '</p>', '</div>', '<div class="indent" id="transcript">', '<h2>', '</h2>', '<p>']
remove_char
for char in obama_4427_str:
    if char in obama_4427_str:
        obama_4427_replace = obama_4427_str.replace(remove_char, '')
        obama_4427_replace = obama_4427_str.replace(remove_char, '')
print(obama_4427_replace)
Using BeautifulSoup, I scraped one of Obama's speeches from the website above. Now I need to remove some residual HTML in an efficient manner. I've stored a list of the elements I'd like to eliminate in remove_char. I'm trying to write a simple for statement, but I'm getting the error: TypeError: expected a character buffer object. It's a beginner question, I know, but how can I get around this?
Since you are already using BeautifulSoup, you can use obama_4427_div.text directly instead of str(obama_4427_div) to get correctly formatted text. The text you get would then not contain any residual HTML elements.
Example -
>>> obama_4427_div = obama_4427_soup.find('div',{'id': 'transcript'},{'class': 'displaytext'})
>>> obama_4427_str = obama_4427_div.text
>>> print(obama_4427_str)
Transcript
To Chairman Dean and my great friend Dick Durbin; and to all my fellow citizens of this great nation;
With profound gratitude and great humility, I accept your nomination for the presidency of the United States.
Let me express my thanks to the historic slate of candidates who accompanied me on this ...
...
...
...
Thank you, God Bless you, and God Bless the United States of America.
For completeness: to remove elements from a string, I would create a list of elements to remove (like the remove_char list you created) and then call str.replace() on the string for each element in the list. Example -
obama_4427_str = str(obama_4427_div)
remove_char = ['<br/>', '</p>', '</div>', '<div class="indent" id="transcript">', '<h2>', '</h2>', '<p>']
for char in remove_char:
    obama_4427_str = obama_4427_str.replace(char, '')
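If the list of leftover tags grows, a single regular-expression pass is an alternative to looping .replace(); the pattern below is only a sketch covering the same tags as remove_char:

import re

# strip the <br/>, <p>/</p>, <h2>/</h2> and <div ...>/</div> remnants in one pass
obama_4427_str = re.sub(r'</?(?:br/?|p|h2|div[^>]*)>', '', str(obama_4427_div))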

Extracting text from HTML tags with Beautiful Soup

I have an HTML page to scrape data from.
I need to get the item title, like 'Caliper Ball' here.
I'm getting the data from the tag where that title appears:
item_title = base_page.find_all('h1', class_='itemTitle')
It contains this tag structure:
> [<h1 class="itemTitle"> <div class="l1">Caliper</div>
> Ball
> </h1>]
To extract 'Caliper Ball' I'm using:
collector = []
for _ in item_title:
    collector.append(_.text)
so I'm getting this ugly output in the collector list:
[u"\nCaliper\r\n Ball\r\n "]
How can I make the output clean, like "Caliper Ball"?
Don't use regex. You're adding too much overhead for something simple. BeautifulSoup4 already has something for this called stripped_strings. See my code below.
from bs4 import BeautifulSoup as bsoup
html = """[<h1 class="itemTitle"> <div class="l1">Caliper</div>
Ball
</h1>]"""
soup = bsoup(html)
soup.prettify()
item = soup.find("h1", class_="itemTitle")
base = list(item.stripped_strings)
print " ".join(base)
Result:
Caliper Ball
[Finished in 0.5s]
Explanation: stripped_strings basically gets all the text inside a specified tag, strips them of all the spaces, line breaks, what have you. It returns a generator, which we can catch with list so it returns a list instead. Once it's a list, it's just a matter of using " ".join.
Let us know if this helps.
PS: Just to correct myself -- there's actually no need to use list on the result of stripped_strings, but it's better to show the above as such so it's explicit.
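The answer above is Python 2. A minimal Python 3 equivalent (explicit parser, print() as a function) might look like:

from bs4 import BeautifulSoup

html = """<h1 class="itemTitle"> <div class="l1">Caliper</div>
Ball
</h1>"""

soup = BeautifulSoup(html, "html.parser")
item = soup.find("h1", class_="itemTitle")

# stripped_strings yields each text fragment with whitespace trimmed
print(" ".join(item.stripped_strings))  # Caliper Ball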
This regex will also get you the output (Caliper Ball):
import re
str="""[<h1 class="itemTitle"> <div class="l1">Caliper</div>
Ball
</h1>]"""
regex = r'.*>([^<]*)<\/div>\s*\n\s*(\w*).*'
match = re.findall(regex, str)
new_data = (' '.join(w) for w in match)
print ''.join(new_data) # => Caliper Ball
You can also use the replace() method to replace \n and \r with nothing or a space, and then use strip() (Python's equivalent of trim()) to remove the surrounding spaces.
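A quick sketch of that approach, applied to the collector loop from the question (also collapsing the leftover runs of spaces):

collector = []
for tag in item_title:
    text = tag.text.replace("\n", " ").replace("\r", " ").strip()
    collector.append(" ".join(text.split()))  # collapse repeated spaces

print(collector)  # ['Caliper Ball']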
