Extracting table info using BeautifulSoup (bs4)

Extracting table info using BeautifulSoup (bs4) - python

Could anyone please give me a snippet of BeautifulSoup code to extract some of the items in the table found here?
Here's my attempt:
from bs4 import BeautifulSoup
from urllib2 import urlopen
url = "http://biology.burke.washington.edu/conus/accounts/../recordview/record.php?ID=1ll&tabs=21100111&frms=1&res=&pglimit=A"
html = urlopen(url).read()
soup = BeautifulSoup(html,"lxml")
tables = soup.findAll("table")
However, this is failing -- tables turns out to be empty.
Sorry, I'm a BeautifulSoup noob.
Thanks!

The given url page does not contain any table element in the source.
table is generated by javascript inside an iframe.
import urllib
from bs4 import BeautifulSoup
url = 'http://biology.burke.washington.edu/conus/recordview/description.php?ID=1l9l0l421l55llll&tabs=21100111&frms=1&pglimit=A&offset=&res=&srt=&sql2='
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html)
tables = soup.find_all('table')
#print(tables)
selenium solution:
from selenium import webdriver
from bs4 import BeautifulSoup
url = "http://biology.burke.washington.edu/conus/accounts/../recordview/record.php?ID=1ll&tabs=21100111&frms=1&res=&pglimit=A"
driver = webdriver.Firefox()
driver.get(url)
driver.switch_to_frame(driver.find_elements_by_tag_name('iframe')[0])
soup = BeautifulSoup(driver.page_source)
tables = soup.find_all('table')
#print(tables)
driver.quit()

this is my current workflow:
from bs4 import beautifulsoup
from urllib2 import urlopen
url = "http://somewebpage.com"
html = urlopen(url).read()
soup = BeautifulSoup(html)
tables = soup.find_all('table')

Related

Web scraping several a href

I would like to scrape this page with Python: https://statusinvest.com.br/acoes/proventos/ibovespa.
With this code:
import requests
from bs4 import BeautifulSoup as bs
URL = "https://statusinvest.com.br/acoes/proventos/ibovespa"
page = 1
req = requests.get(URL+str(page))
soup = bs(req.text, 'html.parser')
container = soup.find('div', attrs={'class','list'})
dividends = container.find('a')
for dividend in dividends:
links = dividend.find_all('a')
print(links)
But it doesn't return anything.
Can someone help me please?

Edited: you can see the below updated code to access any data you mentioned in the comment, you can modify according to your needs as all the data on that page is inside data variable.
Updated Code:
import json
import requests
from bs4 import BeautifulSoup as bs
url = "https://statusinvest.com.br"
links = []
req = requests.get(f"{url}/acoes/proventos/ibovespa")
soup = bs(req.content, 'html.parser')
data = json.loads(soup.find('input', attrs={'id': 'result'})["value"])
print("Date Com Data")
for datecom in data["dateCom"]:
print(f"{datecom['code']}\t{datecom['companyName']}\t{datecom['companyNameClean']}\t{datecom['companyId']}\t{datecom['companyId']}\t{datecom['resultAbsoluteValue']}\t{datecom['dateCom']}\t{datecom['paymentDividend']}\t{datecom['earningType']}\t{datecom['dy']}\t{datecom['recentEvents']}\t{datecom['recentEvents']}\t{datecom['uRLClear']}")
print("\nDate Payment Data")
for datePayment in data["datePayment"]:
print(f"{datePayment['code']}\t{datePayment['companyName']}\t{datePayment['companyNameClean']}\t{datePayment['companyId']}\t{datePayment['companyId']}\t{datePayment['resultAbsoluteValue']}\t{datePayment['dateCom']}\t{datePayment['paymentDividend']}\t{datePayment['earningType']}\t{datePayment['dy']}\t{datePayment['recentEvents']}\t{datePayment['recentEvents']}\t{datePayment['uRLClear']}")
print("\nProvisioned Data")
for provisioned in data["provisioned"]:
print(f"{provisioned['code']}\t{provisioned['companyName']}\t{provisioned['companyNameClean']}\t{provisioned['companyId']}\t{provisioned['companyId']}\t{provisioned['resultAbsoluteValue']}\t{provisioned['dateCom']}\t{provisioned['paymentDividend']}\t{provisioned['earningType']}\t{provisioned['dy']}\t{provisioned['recentEvents']}\t{provisioned['recentEvents']}\t{provisioned['uRLClear']}")
Seeing to the source code of that website one could fetch the json directly and get your desired links follow the below code.
Code:
import json
import requests
from bs4 import BeautifulSoup as bs
url = "https://statusinvest.com.br"
links=[]
req = requests.get(f"{url}/acoes/proventos/ibovespa")
soup = bs(req.content, 'html.parser')
data = json.loads(soup.find('input', attrs={'id': 'result'})["value"])
for datecom in data["dateCom"]:
links.append(f"{url}{datecom['uRLClear']}")
for datePayment in data["datePayment"]:
links.append(f"{url}{datePayment['uRLClear']}")
for provisioned in data["provisioned"]:
links.append(f"{url}{provisioned['uRLClear']}")
print(links)
Output:
Let me know if you have any questions :)

Unable to retrieve crawling information

As you can see from the result screen in the picture, the class name is correct and there seems to be no mistake. But I'm not getting any results.
from bs4 import BeautifulSoup
from urllib.request import urlopen
response = urlopen("https://www.naver.com")
soup = BeautifulSoup(response, 'html.parser')
for anchor in soup.select('span .realtime_item'):
print(anchor)

enter image description here
Translate. The site can no longer be crawled this way.

it worked for me:
from bs4 import BeautifulSoup
from urllib.request import urlopen
response = urlopen("https://www.naver.com")
soup = BeautifulSoup(response, 'html.parser')
for anchor in soup.select('.realtime_item'):
print(anchor)
print("\n\n")

You are not getting any data because SPAN doesn't have anything like realtime_item. Try to print soup and find if the value is there or not and then do select
from bs4 import BeautifulSoup
from urllib.request import urlopen
response = urlopen("https://www.naver.com")
soup = BeautifulSoup(response, 'html.parser')
print(soup)

locating child element by BeautifulSoup

I am new to BeautifulSoup and I am praticing with little tasks. Here I try to get the "previous" link in this site. The html is
here
My code is
import requests, bs4
from bs4 import BeautifulSoup
url = 'https://www.xkcd.com/'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
result = soup.find('div', id="comic")
url2 = result.find('ul', class_='comicNav').find('a', rel='prev').find('href')
But it shows NoneType.. I have read some posts about the child elements in html, and I tried some different things. But it still does not work.. Thank you for your help in advance.

Tou could use a CSS Selector instead.
import requests, bs4
from bs4 import BeautifulSoup
url = 'https://www.xkcd.com/'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
result = soup.select('.comicNav a[rel~="prev"]')[0]
print(result)
if you want just the href change
result = soup.select('.comicNav a[rel~="prev"]')[0]["href"]

To get prev link.find ul tag and then find a tag. Try below code.
import requests, bs4
from bs4 import BeautifulSoup
url = 'https://www.xkcd.com/'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
url2 = soup.find('ul', class_='comicNav').find('a',rel='prev')['href']
print(url2)
Output:
/2254/

python beautifulsoup get html tag content

How can I get the content of an html tag with beautifulsoup? for example the content of <title> tag?
I tried:
from bs4 import BeautifulSoup
url ='http://www.websiteaddress.com'
soup = BeautifulSoup(url)
result = soup.findAll('title')
for each in result:
print(each.get_text())
But nothing happened. I'm using python3.

You need to fetch the website data first. You can do this with the urllib.request module. Note that HTML documents only have one title so there is no need to use find_all() and a loop.
from urllib.request import urlopen
from bs4 import BeautifulSoup
url ='http://www.websiteaddress.com'
data = urlopen(url)
soup = BeautifulSoup(data, 'html.parser')
result = soup.find('title')
print(result.get_text())

getting email ids using beautifulsoup

I am working on python. I am learning beautifulsoup & I am parsing a link.
my url :
http://www.dtemaharashtra.gov.in/approvedinstitues/StaticPages/frmInstituteSummary.aspx?InstituteCode=1002
I want to parse email id from that url.
How can I do that?

import urllib2
from bs4 import BeautifulSoup
html = urllib2.urlopen('http://www.dtemaharashtra.gov.in/approvedinstitues/StaticPages/frmInstituteSummary.aspx?InstituteCode=1002').read()
soup = BeautifulSoup(html)
print soup.find(id='ctl00_rightContainer_ContentBox1_lblEMailAddress').text

import requests
from bs4 import BeautifulSoup
r = requests.get("http://www.dtemaharashtra.gov.in/approvedinstitues/StaticPages/frmInstituteSummary.aspx?InstituteCode=1002")
soup = BeautifulSoup(r.text)
soup.find("span", {"id":"ctl00_rightContainer_ContentBox1_lblEMailAddress"}).text

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Extracting table info using BeautifulSoup (bs4) - python

this is my current workflow: from bs4 import beautifulsoup from urllib2 import urlopen url = "http://somewebpage.com" html = urlopen(url).read() soup = BeautifulSoup(html) tables = soup.find_all('table')

Related

Web scraping several a href

Unable to retrieve crawling information

locating child element by BeautifulSoup

python beautifulsoup get html tag content

getting email ids using beautifulsoup

Categories

Resources