python beautifulsoup get html tag content - python

How can I get the content of an html tag with beautifulsoup? for example the content of <title> tag?
I tried:
from bs4 import BeautifulSoup
url ='http://www.websiteaddress.com'
soup = BeautifulSoup(url)
result = soup.findAll('title')
for each in result:
print(each.get_text())
But nothing happened. I'm using python3.

You need to fetch the website data first. You can do this with the urllib.request module. Note that HTML documents only have one title so there is no need to use find_all() and a loop.
from urllib.request import urlopen
from bs4 import BeautifulSoup
url ='http://www.websiteaddress.com'
data = urlopen(url)
soup = BeautifulSoup(data, 'html.parser')
result = soup.find('title')
print(result.get_text())

Related

Why is BeautifulSoup not extracting full HTML from a static website?

I am trying to webscrape a Chinese website https://bo.io.gov.mo/bo/ii/2021/43/avisosoficiais_cn.asp, but the code below is not returning the full html text. The strange thing is that the code is able to get me the full html from the Portuguese version of the same website https://bo.io.gov.mo/bo/ii/2021/43/avisosoficiais.asp. What is the problem?
from bs4 import BeautifulSoup
from urllib.request import urlopen
response = urlopen('https://bo.io.gov.mo/bo/ii/2021/43/avisosoficiais_cn.asp')
html_doc = response.read()
soup = BeautifulSoup(html_doc, 'lxml')
strhtm = soup.prettify()
print(strhtm)

Extract text from html string with beautiful soup

I write the following code to extract price from webpage:
from urllib.request import urlopen
from bs4 import BeautifulSoup
url = "https://www.teleborsa.it/azioni/intesa-sanpaolo-isp-it0000072618-SVQwMDAwMDcyNjE4"
html = urlopen(url)
soup = BeautifulSoup(html,'lxml')
prize = soup.select('.h-price')
print(prize)
output is:
<span class="h-price fc0" id="ctl00_phContents_ctlHeader_lblPrice">1,384</span>
i want to extract 1,384 value.
Try this
document.getElementById("ctl00_phContents_ctlHeader_lblPrice").innerText
Or if you are having dynamic elements, you can iterate over each element and get innerText from it.
You can use .text property to get the desired text.
For example:
from urllib.request import urlopen
from bs4 import BeautifulSoup
url = "https://www.teleborsa.it/azioni/intesa-sanpaolo-isp-it0000072618-SVQwMDAwMDcyNjE4"
html = urlopen(url)
soup = BeautifulSoup(html,'lxml')
prize = soup.select_one('.h-price') # <- change to .select_one() to get only one element
print(prize.text) # <- use the .text property to get text of the tag
Prints:
1,384

how to reach dipper divs inside <span> tag using python crawler?

the body tag has a <span> tag. There are many other divs inside the span tag. I want to go dipper but when I trying this code:
from bs4 import BeautifulSoup
from urllib.request import urlopen
url = 'https://www.instagram.com/artfido/'
data = urlopen(url)
soup = BeautifulSoup(data, 'html.parser')
result = soup.body.span
print (result)
the result was just this:
<span id="react-root"></span>
How can I reach to divs inside the span tag?
Can we parse the <span> tag? Is it possible? If yes so why I'm not able to parse the span?
By using this:
result = soup.body.span.contents
The output was:
[]
As talked in comments, urlopen(url) returns a file like object, which means that you need to read from it if you want to get what's inside it.
from bs4 import BeautifulSoup
from urllib.request import urlopen
url = 'https://www.instagram.com/artfido/'
data = urlopen(url)
soup = BeautifulSoup(data.read(), 'html.parser')
result = soup.body.span
print (result)
The code I used for my python 2.7 setup:
from bs4 import BeautifulSoup
import urllib2
url = 'https://www.instagram.com/artfido/'
data = urllib2.urlopen(url)
soup = BeautifulSoup(data.read(), 'lxml')
result = soup.body.span
print result
EDIT
for future reference, if you want something more simple for handling the url, there is a package called requests . In this case, it is similar but I find it easier to understand.
from bs4 import BeautifulSoup
import requests
url = 'https://www.instagram.com/artfido/'
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, 'lxml')
result = soup.body.span
print result

How to extract specific URL from HTML using Beautiful Soup?

I want to extract specific URLs from an HTML page.
from urllib2 import urlopen
import re
from bs4 import BeautifulSoup
url = http://bassrx.tumblr.com/tagged/tt # nsfw link
page = urlopen(url)
html = page.read() # get the html from the url
# this works without BeautifulSoup, but it is slow:
image_links = re.findall("src.\"(\S*?media.tumblr\S*?tumblr_\S*?jpg)", html)
print image_links
The output of the above is exactly the URL, nothing else: http://38.media.tumblr.com/tumblr_ln5gwxHYei1qi02clo1_500.jpg
The only downside is it is very slow.
BeautifulSoup is extremely fast at parsing HTML, so that's why I want to use it.
The urls that I want are actually the img src. Here's a snippet from the HMTL that contains that information that I want.
<div class="media"><a href="http://bassrx.tumblr.com/image/85635265422">
<img src="http://38.media.tumblr.com/tumblr_ln5gwxHYei1qi02clo1_500.jpg"/>
</a></div>
So, my question is, how can I get BeautifulSoup to extract all of those 'img src' urls cleanly without any other cruft?
I just want a list of matching urls. I've been trying to use soup.findall() function, but cannot get any useful results.
from urllib2 import urlopen
from bs4 import BeautifulSoup
url = 'http://bassrx.tumblr.com/tagged/tt'
soup = BeautifulSoup(urlopen(url).read())
for element in soup.findAll('img'):
print(element.get('src'))
You can use div.media > a > img CSS selector to find img tags inside a which is inside a div tag with media class:
from urllib2 import urlopen
from bs4 import BeautifulSoup
url = "<url_here>"
soup = BeautifulSoup(urlopen(url))
images = soup.select('div.media > a > img')
print [image.get('src') for image in images]
In order to make the parsing faster you can use lxml parser:
soup = BeautifulSoup(urlopen(url), "lxml")
You need to install lxml module first, of course.
Also, you can make use of a SoupStrainer class for parsing only relevant part of the document.
Hope that helps.
Have a look a BeautifulSoup.find_all with re.compile mix
from urllib2 import urlopen
import re
from bs4 import BeautifulSoup
url = "http://bassrx.tumblr.com/tagged/tt" # nsfw link
page = urlopen(url)
html = page.read()
bs = BeautifulSoup(html)
a_tumblr = [a_element for a_element in bs.find_all(href=re.compile("media\.tumblr"))]
##[<link href="http://37.media.tumblr.com/avatar_df3a9e37c757_128.png" rel="shortcut icon"/>, <link href="http://37.media.tumblr.com/avatar_df3a9e37c757_128.png" rel="apple-touch-icon"/>]

getting email ids using beautifulsoup

I am working on python. I am learning beautifulsoup & I am parsing a link.
my url :
http://www.dtemaharashtra.gov.in/approvedinstitues/StaticPages/frmInstituteSummary.aspx?InstituteCode=1002
I want to parse email id from that url.
How can I do that?
import urllib2
from bs4 import BeautifulSoup
html = urllib2.urlopen('http://www.dtemaharashtra.gov.in/approvedinstitues/StaticPages/frmInstituteSummary.aspx?InstituteCode=1002').read()
soup = BeautifulSoup(html)
print soup.find(id='ctl00_rightContainer_ContentBox1_lblEMailAddress').text
import requests
from bs4 import BeautifulSoup
r = requests.get("http://www.dtemaharashtra.gov.in/approvedinstitues/StaticPages/frmInstituteSummary.aspx?InstituteCode=1002")
soup = BeautifulSoup(r.text)
soup.find("span", {"id":"ctl00_rightContainer_ContentBox1_lblEMailAddress"}).text

Categories