getting email ids using beautifulsoup - python

I am working in Python and learning BeautifulSoup by parsing a link.
My URL:
http://www.dtemaharashtra.gov.in/approvedinstitues/StaticPages/frmInstituteSummary.aspx?InstituteCode=1002
I want to extract the email address from that page.
How can I do that?

import urllib2
from bs4 import BeautifulSoup
html = urllib2.urlopen('http://www.dtemaharashtra.gov.in/approvedinstitues/StaticPages/frmInstituteSummary.aspx?InstituteCode=1002').read()
soup = BeautifulSoup(html, 'html.parser')  # specify a parser to avoid the bs4 warning
# the email address sits in a span with this ASP.NET control id
print soup.find(id='ctl00_rightContainer_ContentBox1_lblEMailAddress').text

import requests
from bs4 import BeautifulSoup
r = requests.get("http://www.dtemaharashtra.gov.in/approvedinstitues/StaticPages/frmInstituteSummary.aspx?InstituteCode=1002")
soup = BeautifulSoup(r.text, 'html.parser')
print soup.find("span", {"id": "ctl00_rightContainer_ContentBox1_lblEMailAddress"}).text

Related

Web Scraping with Beautiful Soup Python - Seesaw - The output has no length error

I am trying to get the comments from a website called Seesaw but the output has no length. What am I doing wrong?
import requests
from bs4 import BeautifulSoup
html_text = requests.get("https://app.seesaw.me/#/activities/class/class.93a29acf-0eef-4d4e-9d56-9648d2623171").text
soup = BeautifulSoup(html_text, "lxml")
comments = soup.find_all("span", class_ = "ng-binding")
print(comments)
That's because there is no span element with class ng-binding in the page source; these elements are added later via JavaScript. You can verify it:
import requests
html_text = requests.get("https://app.seesaw.me/#/activities/class/class.93a29acf-0eef-4d4e-9d56-9648d2623171").text
print(f'{"ng-binding" in html_text=}')
So the output is:
"ng-binding" in html_text=False
You can also check this with the "View Page Source" function in your browser. To automate interaction with the site, you can try Selenium.
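A minimal Selenium sketch along those lines, assuming a ChromeDriver is available on PATH and that the comments really do end up in span.ng-binding elements once the JavaScript has run (the page may also require logging in first, which this does not handle):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Chrome()  # assumes chromedriver is on PATH
driver.get("https://app.seesaw.me/#/activities/class/class.93a29acf-0eef-4d4e-9d56-9648d2623171")
# wait until the JavaScript has rendered at least one target element
WebDriverWait(driver, 15).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "span.ng-binding")))
for span in driver.find_elements(By.CSS_SELECTOR, "span.ng-binding"):
    print(span.text)
driver.quit()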

Unable to retrieve crawling information

The class name is correct and there seems to be no mistake, but I'm not getting any results.
from bs4 import BeautifulSoup
from urllib.request import urlopen
response = urlopen("https://www.naver.com")
soup = BeautifulSoup(response, 'html.parser')
for anchor in soup.select('span .realtime_item'):
    print(anchor)
The site can no longer be crawled this way.
This worked for me:
from bs4 import BeautifulSoup
from urllib.request import urlopen
response = urlopen("https://www.naver.com")
soup = BeautifulSoup(response, 'html.parser')
for anchor in soup.select('.realtime_item'):
    print(anchor)
    print("\n\n")
You are not getting any data because no span element carries anything like realtime_item. Print the soup first, check whether the value is there at all, and only then do your select:
from bs4 import BeautifulSoup
from urllib.request import urlopen
response = urlopen("https://www.naver.com")
soup = BeautifulSoup(response, 'html.parser')
print(soup)

AttributeError: 'LXMLTreeBuilder' object has no attribute 'DEFAULT_NSMAPS_INVERTED' when using BeautifulSoup

I am trying to get some data from a URL using BeautifulSoup in Python, but when I run the last command,
soup = BeautifulSoup(content)
I consistently get this error telling me that 'LXMLTreeBuilder' object has no attribute 'DEFAULT_NSMAPS_INVERTED'.
How do I get around this problem?
Here is my code :
import urllib.request as urllib2
from bs4 import BeautifulSoup
import requests
url = 'https://www.ucf.edu/'
content = urllib2.urlopen(url).read()
soup = BeautifulSoup(content)
You imported requests, so use it. Try it this way:
url = 'https://www.ucf.edu/'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')  # specifying a parser sidesteps the broken lxml builder
You didn't specify a parser in the BeautifulSoup constructor. Try putting html.parser there:
import urllib.request as urllib2
from bs4 import BeautifulSoup
import requests
url = 'https://www.ucf.edu/'
content = urllib2.urlopen(url).read()
soup = BeautifulSoup(content, 'html.parser') # <-- specify parser here
print(soup.prettify())
EDIT: Make sure you have the latest version of BeautifulSoup installed (and optionally the latest lxml). I'm on beautifulsoup4==4.8.0 and lxml==4.3.4.
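A quick way to check which versions your interpreter actually sees, in case pip and the interpreter disagree (a sketch; bs4 exposes __version__ directly, and lxml's version lives on lxml.etree):
import bs4
from lxml import etree
print(bs4.__version__)    # e.g. 4.8.0
print(etree.__version__)  # e.g. 4.3.4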

python beautifulsoup get html tag content

How can I get the content of an HTML tag with BeautifulSoup? For example, the content of the <title> tag?
I tried:
from bs4 import BeautifulSoup
url ='http://www.websiteaddress.com'
soup = BeautifulSoup(url)
result = soup.findAll('title')
for each in result:
    print(each.get_text())
But nothing happened. I'm using Python 3.
You need to fetch the website data first. You can do this with the urllib.request module. Note that HTML documents only have one title so there is no need to use find_all() and a loop.
from urllib.request import urlopen
from bs4 import BeautifulSoup
url ='http://www.websiteaddress.com'
data = urlopen(url)
soup = BeautifulSoup(data, 'html.parser')
result = soup.find('title')
print(result.get_text())

Extracting table info using BeautifulSoup (bs4)

Could anyone please give me a snippet of BeautifulSoup code to extract some of the items in the table found here?
Here's my attempt:
from bs4 import BeautifulSoup
from urllib2 import urlopen
url = "http://biology.burke.washington.edu/conus/accounts/../recordview/record.php?ID=1ll&tabs=21100111&frms=1&res=&pglimit=A"
html = urlopen(url).read()
soup = BeautifulSoup(html,"lxml")
tables = soup.findAll("table")
However, this is failing -- tables turns out to be empty.
Sorry, I'm a BeautifulSoup noob.
Thanks!
The given URL's page does not contain any table element in its source; the table is generated by JavaScript inside an iframe. You can fetch the iframe's URL directly instead:
import urllib
from bs4 import BeautifulSoup
# this is the URL the iframe loads; fetch it directly
url = 'http://biology.burke.washington.edu/conus/recordview/description.php?ID=1l9l0l421l55llll&tabs=21100111&frms=1&pglimit=A&offset=&res=&srt=&sql2='
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html, 'html.parser')
tables = soup.find_all('table')
#print(tables)
Selenium solution:
from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
url = "http://biology.burke.washington.edu/conus/accounts/../recordview/record.php?ID=1ll&tabs=21100111&frms=1&res=&pglimit=A"
driver = webdriver.Firefox()
driver.get(url)
# the table lives inside an iframe, so switch into it first
driver.switch_to.frame(driver.find_elements(By.TAG_NAME, 'iframe')[0])
soup = BeautifulSoup(driver.page_source, 'html.parser')
tables = soup.find_all('table')
#print(tables)
driver.quit()
This is my current workflow:
from bs4 import BeautifulSoup
from urllib2 import urlopen
url = "http://somewebpage.com"
html = urlopen(url).read()
soup = BeautifulSoup(html, 'html.parser')
tables = soup.find_all('table')
