Extract text from html string with beautiful soup - python

I wrote the following code to extract the price from a webpage:
from urllib.request import urlopen
from bs4 import BeautifulSoup
url = "https://www.teleborsa.it/azioni/intesa-sanpaolo-isp-it0000072618-SVQwMDAwMDcyNjE4"
html = urlopen(url)
soup = BeautifulSoup(html,'lxml')
prize = soup.select('.h-price')
print(prize)
The output is:
<span class="h-price fc0" id="ctl00_phContents_ctlHeader_lblPrice">1,384</span>
I want to extract the value 1,384.

Try this in the browser's JavaScript console:
document.getElementById("ctl00_phContents_ctlHeader_lblPrice").innerText
Or, if you are dealing with dynamically generated elements, you can iterate over each element and read its innerText.

You can use the .text property to get the desired text.
For example:
from urllib.request import urlopen
from bs4 import BeautifulSoup
url = "https://www.teleborsa.it/azioni/intesa-sanpaolo-isp-it0000072618-SVQwMDAwMDcyNjE4"
html = urlopen(url)
soup = BeautifulSoup(html,'lxml')
prize = soup.select_one('.h-price') # <- change to .select_one() to get only one element
print(prize.text) # <- use the .text property to get text of the tag
Prints:
1,384
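Note that on this Italian site the comma is a decimal separator, so "1,384" means 1.384. If you want a number rather than a string, a small conversion step helps (a sketch, assuming the scraped text looks like the output above):

```python
# The scraped text uses an Italian-style decimal comma ("1,384" means 1.384).
# Convert the string to a float before doing any arithmetic with it.
raw = "1,384"                         # e.g. prize.text from the snippet above
price = float(raw.replace(",", "."))  # swap the decimal comma for a point
print(price)                          # 1.384
```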


Python web scraping - Why do I get this error?

I want to get text from a website using bs4, but I keep getting this error and I don't know why: TypeError: slice indices must be integers or None or have an __index__ method.
This is my code:
from urllib.request import urlopen
import bs4
url = "https://www.oddsshark.com/nfl/dallas-pittsburgh-odds-august-5-2021-1410371"
page = urlopen(url)
html_bytes = page.read()
html = html_bytes.decode("utf-8")
text = html.find("div", {"class":"gc-score__title"})  # the error is in this line
print(text)
On this line:
text = html.find("div", {"class":"gc-score__title"})
you are calling the str.find method, not the bs4.BeautifulSoup.find method.
So if you do
soup = bs4.BeautifulSoup(html, 'html.parser')
text = soup.find("div", {"class":"gc-score__title"})
print(text)
you will get rid of the error.
That said, the site is using JavaScript, so this will not yield what you expect. You will need to use tools like Selenium to scrape this site.
First, if you want BeautifulSoup to parse the data, you need to ask it to do that.
from urllib.request import urlopen
from bs4 import BeautifulSoup
url = "https://www.oddsshark.com/nfl/dallas-pittsburgh-odds-august-5-2021-1410371"
page = urlopen(url)
html_bytes = page.read()
soup = BeautifulSoup(html_bytes, 'html.parser')
Then you can use soup.find to find <div> tags:
text = soup.find("div", {"class":"gc-score__title"})
That will eliminate the error. You were calling str.find because html is a string; to pick out tags, you need to call the find method of a bs4.BeautifulSoup object.
But besides eliminating the error, that line won't do what you want. It won't return anything, because the data at that url does not contain the tag <div class="gc-score__title">.
Copy the contents of html_bytes to a text editor to confirm this.
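The error itself comes from str.find's signature: its second argument is an integer start index, so passing a dict of attributes makes Python complain about slice indices. A standalone sketch of the failure mode:

```python
html = '<div class="gc-score__title">42</div>'  # a plain string, not a soup

# str.find(sub, start) expects an integer start index;
# passing a dict of attributes triggers the TypeError from the question.
try:
    html.find("div", {"class": "gc-score__title"})
except TypeError as exc:
    print(type(exc).__name__)  # TypeError
```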

BeautifulSoup and urllib to find data from website

Background
I am trying to understand the process in which specific data can be extracted from a website using beautifulsoup4 and urllib libraries.
How would I get the specific price of a DVD from a website, if:
The div class is <div class="productPrice" data-component="productPrice">
The p class is <p class="productPrice_price" data-product-price="price">£9.99 </p>
Code so far:
from bs4 import BeautifulSoup
from urllib.request import urlopen
html = urlopen("https://www.zavvi.com/dvd/rampage-includes-digital-download/11729469.html")
bsObj = BeautifulSoup(html.read(), features='html.parser')
all_divs = bsObj.find_all('div', {'class':'productPrice'}) # 1. get all divs
What is the remaining process of finding the price?
Website (https://www.zavvi.com/dvd/rampage-includes-digital-download/11729469.html)
You're almost there; just one more step. Loop through the elements, find the <p> tag with class="productPrice_price", and grab its text:
from bs4 import BeautifulSoup
from urllib.request import urlopen
html = urlopen("https://www.zavvi.com/dvd/rampage-includes-digital-download/11729469.html")
bsObj = BeautifulSoup(html.read(), features='html.parser')
all_divs = bsObj.find_all('div', {'class':'productPrice'}) # 1. get all divs
for ele in all_divs:
    price = ele.find('p', {'class':'productPrice_price'}).text
    print(price)
Output:
£9.99
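An equivalent, slightly shorter route is a CSS selector with select_one. The sketch below runs against an inline copy of the relevant markup so it works offline; the class names are taken from the question:

```python
from bs4 import BeautifulSoup

# Inline copy of the markup described in the question.
html = '''<div class="productPrice" data-component="productPrice">
  <p class="productPrice_price" data-product-price="price">£9.99 </p>
</div>'''

soup = BeautifulSoup(html, 'html.parser')
# One CSS selector replaces the find_all + loop: a <p> with the price class
# anywhere inside the productPrice <div>.
price = soup.select_one('div.productPrice p.productPrice_price').text.strip()
print(price)  # £9.99
```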

How to reach deeper divs inside a <span> tag using a Python crawler?

The body tag has a <span> tag, and there are many other divs inside that span. I want to go deeper, but when I try this code:
from bs4 import BeautifulSoup
from urllib.request import urlopen
url = 'https://www.instagram.com/artfido/'
data = urlopen(url)
soup = BeautifulSoup(data, 'html.parser')
result = soup.body.span
print (result)
the result was just this:
<span id="react-root"></span>
How can I reach the divs inside the span tag?
Can we parse the <span> tag? Is it possible? If yes, why am I not able to parse the span?
By using this:
result = soup.body.span.contents
The output was:
[]
As discussed in the comments, urlopen(url) returns a file-like object, which means that you need to read() from it if you want to get what's inside it.
from bs4 import BeautifulSoup
from urllib.request import urlopen
url = 'https://www.instagram.com/artfido/'
data = urlopen(url)
soup = BeautifulSoup(data.read(), 'html.parser')
result = soup.body.span
print (result)
The code I used for my python 2.7 setup:
from bs4 import BeautifulSoup
import urllib2
url = 'https://www.instagram.com/artfido/'
data = urllib2.urlopen(url)
soup = BeautifulSoup(data.read(), 'lxml')
result = soup.body.span
print result
EDIT
For future reference, if you want something simpler for handling the URL, there is a package called requests. In this case it is similar, but I find it easier to understand.
from bs4 import BeautifulSoup
import requests
url = 'https://www.instagram.com/artfido/'
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, 'lxml')
result = soup.body.span
print result
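The empty .contents result is expected here: in the HTML Instagram serves, <span id="react-root"> really is empty, and the inner divs are injected later by JavaScript, which BeautifulSoup never executes. As a sanity check, .contents does return children when they exist in the parsed source; a minimal offline sketch with hypothetical markup:

```python
from bs4 import BeautifulSoup

# Hypothetical markup where the span already contains divs in the raw HTML.
html = '<body><span id="react-root"><div><div>inner</div></div></span></body>'
soup = BeautifulSoup(html, 'html.parser')
span = soup.body.span
print(len(span.contents))             # 1 direct child: the outer div
print(span.find_all('div')[-1].text)  # inner
```

So if .contents comes back empty against the live site, the tags simply are not in the downloaded HTML, and a JavaScript-capable tool (such as Selenium) is needed.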

python beautifulsoup get html tag content

How can I get the content of an HTML tag with BeautifulSoup? For example, the content of the <title> tag?
I tried:
from bs4 import BeautifulSoup
url ='http://www.websiteaddress.com'
soup = BeautifulSoup(url)
result = soup.findAll('title')
for each in result:
    print(each.get_text())
But nothing happened. I'm using python3.
You need to fetch the website data first. You can do this with the urllib.request module. Note that HTML documents only have one title so there is no need to use find_all() and a loop.
from urllib.request import urlopen
from bs4 import BeautifulSoup
url ='http://www.websiteaddress.com'
data = urlopen(url)
soup = BeautifulSoup(data, 'html.parser')
result = soup.find('title')
print(result.get_text())
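As a further shortcut, BeautifulSoup exposes the first <title> tag directly as soup.title, so once the document is parsed you can skip find() entirely. A minimal offline sketch with made-up markup:

```python
from bs4 import BeautifulSoup

# Made-up document standing in for the fetched page.
html = "<html><head><title>Example Page</title></head><body></body></html>"
soup = BeautifulSoup(html, "html.parser")

# soup.title is shorthand for soup.find('title').
print(soup.title.get_text())  # Example Page
```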

How to extract specific URL from HTML using Beautiful Soup?

I want to extract specific URLs from an HTML page.
from urllib2 import urlopen
import re
from bs4 import BeautifulSoup
url = "http://bassrx.tumblr.com/tagged/tt" # nsfw link
page = urlopen(url)
html = page.read() # get the html from the url
# this works without BeautifulSoup, but it is slow:
image_links = re.findall("src.\"(\S*?media.tumblr\S*?tumblr_\S*?jpg)", html)
print image_links
The output of the above is exactly the URL, nothing else: http://38.media.tumblr.com/tumblr_ln5gwxHYei1qi02clo1_500.jpg
The only downside is it is very slow.
BeautifulSoup is extremely fast at parsing HTML, so that's why I want to use it.
The URLs that I want are actually the img src values. Here's a snippet from the HTML that contains the information I want:
<div class="media"><a href="http://bassrx.tumblr.com/image/85635265422">
<img src="http://38.media.tumblr.com/tumblr_ln5gwxHYei1qi02clo1_500.jpg"/>
</a></div>
So, my question is, how can I get BeautifulSoup to extract all of those 'img src' urls cleanly without any other cruft?
I just want a list of matching URLs. I've been trying to use the soup.findAll() function, but cannot get any useful results.
from urllib2 import urlopen
from bs4 import BeautifulSoup
url = 'http://bassrx.tumblr.com/tagged/tt'
soup = BeautifulSoup(urlopen(url).read())
for element in soup.findAll('img'):
    print(element.get('src'))
You can use the div.media > a > img CSS selector to find img tags inside an a tag that is inside a div tag with the media class:
from urllib2 import urlopen
from bs4 import BeautifulSoup
url = "<url_here>"
soup = BeautifulSoup(urlopen(url))
images = soup.select('div.media > a > img')
print [image.get('src') for image in images]
In order to make parsing faster, you can use the lxml parser:
soup = BeautifulSoup(urlopen(url), "lxml")
You need to install lxml module first, of course.
Also, you can make use of the SoupStrainer class to parse only the relevant part of the document.
Hope that helps.
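A SoupStrainer restricts parsing to the tags you care about, which can speed things up noticeably on large pages. A sketch against inline markup (the img URL here is a made-up stand-in for the real ones):

```python
from bs4 import BeautifulSoup, SoupStrainer

# Inline stand-in for the page, with one matching img and some other content.
html = ('<div class="media"><a href="/image/1">'
        '<img src="http://38.media.tumblr.com/example_500.jpg"/></a></div>'
        '<p>unrelated text</p>')

only_imgs = SoupStrainer("img")  # parse only <img> tags, skip everything else
soup = BeautifulSoup(html, "html.parser", parse_only=only_imgs)
print([img.get("src") for img in soup.find_all("img")])
```

Because parse_only discards every non-matching tag during parsing, the resulting soup contains only the img elements.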
Have a look at mixing BeautifulSoup.find_all with re.compile:
from urllib2 import urlopen
import re
from bs4 import BeautifulSoup
url = "http://bassrx.tumblr.com/tagged/tt" # nsfw link
page = urlopen(url)
html = page.read()
bs = BeautifulSoup(html)
a_tumblr = [a_element for a_element in bs.find_all(href=re.compile(r"media\.tumblr"))]
##[<link href="http://37.media.tumblr.com/avatar_df3a9e37c757_128.png" rel="shortcut icon"/>, <link href="http://37.media.tumblr.com/avatar_df3a9e37c757_128.png" rel="apple-touch-icon"/>]
