Import table from website with BeautifulSoup - python

I am trying to import a table from a website and afterwards transform the data into a pandas dataframe.
The website is: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M
That's my code so far:
import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup

website_url = requests.get(
    'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup = BeautifulSoup(website_url, 'lxml')
My_table = soup.find('table', {'class': 'wikitable sortable'})
for x in soup.find_all('table', {'class': 'wikitable sortable'}):
    table = x.text
print(My_table)
print(table)
Output of print(My_table)
Output of print(table)
How do I convert this webpage table to a pandas DataFrame?

Have you tried pd.read_html()?
Also, since the table is quite standard, why not copy it directly into Excel and import it as a DataFrame?
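A minimal sketch of the pd.read_html route, run here on a stand-in table (for the question's page you can pass the Wikipedia URL straight to pd.read_html and it fetches and parses every table itself):

```python
import io
import pandas as pd

# Stand-in markup for the page; on the live page,
# pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')
# works the same way.
html = """
<table class="wikitable sortable">
  <tr><th>Postal Code</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""
# read_html returns a list of DataFrames, one per table it can parse
dfs = pd.read_html(io.StringIO(html))
df = dfs[0]
print(df)
```

Recent pandas versions want literal HTML wrapped in StringIO rather than passed as a bare string, which is why the sketch does so.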

Related

Read table from Web using Python

I'm new to Python and am working to extract data from the website https://www.screener.in/company/ABB/consolidated/, specifically from one table (the last table, which is Shareholding Pattern).
I'm using the BeautifulSoup library for this but I do not know how to go about it.
So far, here is my code snippet. I am failing to pick the right table because the page has multiple tables, and all of them share common classes and IDs, which makes it difficult for me to filter for the one table I want.
import requests
import urllib.request
from bs4 import BeautifulSoup

url = "https://www.screener.in/company/ABB/consolidated/"
r = requests.get(url)
print(r.status_code)
html_content = r.text
soup = BeautifulSoup(html_content, "html.parser")
# print(soup)
# data_table = soup.find('table', class_ = "data-table")
# print(data_table)
table_needed = soup.find("<h2>ShareholdingPattern</h2>")
# sub = table_needed.contents[0]
print(table_needed)
Just use requests and pandas. Grab the last table and dump it to a .csv file.
Here's how (note that read_html returns a list of DataFrames, so the result is indexed with [-1]):
import pandas as pd
import requests

dfs = pd.read_html(
    requests.get("https://www.screener.in/company/ABB/consolidated/").text,
    flavor="bs4",
)
dfs[-1].to_csv("last_table.csv", index=False)
Output from a .csv file:
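If you do want the BeautifulSoup route: soup.find("<h2>...</h2>") searches for a tag literally named that string and returns None. A sketch of selecting a table by its preceding heading instead (the markup below is a stand-in; the heading text is from the question, and on the real page you would pass requests.get(url).text to BeautifulSoup):

```python
from bs4 import BeautifulSoup

# Stand-in for the page structure
html = """
<h2>Quarterly Results</h2><table><tr><td>quarterly</td></tr></table>
<h2>Shareholding Pattern</h2><table><tr><td>shareholding</td></tr></table>
"""
soup = BeautifulSoup(html, "html.parser")
# find() matches tag names and text, not raw markup
heading = soup.find("h2", string="Shareholding Pattern")
table = heading.find_next("table")  # first table after that heading
print(table.td.text)
```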

pandas read_html / no tables found

I am having some problems with pandas read_html. When I tried to read the table using pandas, it wouldn't work, so I tried requests and BeautifulSoup and solved the problem. But I would like to know why I could not get the table using pandas the first time. Thank you.
First code:
import pandas as pd

url = 'https://finance.naver.com/item/sise_day.nhn?code=005930&page=1'
r = pd.read_html(url)[0]
Second code that I tried (the original had a typo, ulr, which would raise a NameError; fixed to url here):
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://finance.naver.com/item/sise_day.nhn?code=005930&page=1'
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')
table = str(soup.select("table"))
data = pd.read_html(table)[0]
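No diagnosis was confirmed here, but one thing worth checking (an assumption, not established from the question): some sites serve different content to the default urllib User-Agent that pd.read_html uses when given a URL, while requests with an explicit header gets the real page. A pattern that sidesteps this is fetching the HTML yourself and handing the text to read_html, sketched below on a stand-in table:

```python
import io
import pandas as pd

# Stand-in table; for the live page you would instead do
#   html = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'}).text
html = ("<table><tr><th>Date</th><th>Close</th></tr>"
        "<tr><td>2020.01.02</td><td>55200</td></tr></table>")
dfs = pd.read_html(io.StringIO(html))
df = dfs[0]
print(df)
```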

How can I pull webpage data into my DataFrame by referencing a specific HTML class or id using pandas read_html?

I'm trying to pull the data from the table at this site and save it in a CSV with the column 'ticker' included. Right now my code is this:
import requests
import pandas as pd

url = 'https://www.biopharmcatalyst.com/biotech-stocks/company-pipeline-database#marketCap=mid|stages=approved,crl'
html = requests.get(url).content
df_list = pd.read_html(html)
df = df_list[0]
print(df)
df.to_csv('my data.csv')
and it results in a file that looks like this.
I want to have the 'ticker' column in my CSV file with the corresponding ticker listed for each company. The ticker is in the HTML here (class="ticker--small"). The output should look like this.
I'm totally stuck on this. I've tried doing it in BeautifulSoup too but I can't get it working. Any help would be greatly appreciated.
The page has multiple tables; use BeautifulSoup to extract them, then loop to append each to the CSV.
from bs4 import BeautifulSoup
import requests
import pandas as pd

url = 'https://www.biopharmcatalyst.com/biotech-stocks/company-pipeline-database#marketCap=mid|stages=approved,crl'
html = requests.get(url).text
soup = BeautifulSoup(html, 'lxml')
tables = soup.findAll('table')
for table in tables:
    df = pd.read_html(str(table))[0]
    with open('my_data.csv', 'a+') as f:
        df.to_csv(f)
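On the title's actual question: pd.read_html accepts an attrs= argument that restricts parsing to tables whose HTML attributes match, which can replace the BeautifulSoup pre-filtering step. A sketch on stand-in markup (the class names below are illustrative, not from the target site):

```python
import io
import pandas as pd

html = """
<table class="company-table"><tr><th>Company</th></tr><tr><td>Acme</td></tr></table>
<table class="other"><tr><th>X</th></tr><tr><td>1</td></tr></table>
"""
# attrs= keeps only tables matching the given attributes
dfs = pd.read_html(io.StringIO(html), attrs={'class': 'company-table'})
print(len(dfs))
print(dfs[0])
```

Note that attrs= only selects tables; cell-level content like the class="ticker--small" spans still needs BeautifulSoup to pull out separately.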

Scraping table data after applying filter

I am trying to scrape data from this website: http://nepalstock.com/indices . I am not able to scrape the table data after changing the value of 'Select Indices or Sub Indices'.
https://i.stack.imgur.com/J0WMn.png
Since the URL does not change, I do not know how to proceed.
So far, this is my code:
import requests
import pandas
from bs4 import BeautifulSoup

html = requests.get("http://nepalstock.com/indices")
soup = BeautifulSoup(html.content, "html.parser")
lst = []
for row in soup.find_all("tr"):
    l = []
    for td in row.find_all("td"):
        l.append(td.get_text())
    lst.append(l)
df = pandas.DataFrame(lst[1:])
df.head()
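Separately from the filter problem, the loop above drops the header row instead of using it for column names. A self-contained sketch (stand-in table) that reads the th cells as headers:

```python
import pandas as pd
from bs4 import BeautifulSoup

# Stand-in for requests.get("http://nepalstock.com/indices").content
html = """
<table>
  <tr><th>Index</th><th>Value</th></tr>
  <tr><td>NEPSE</td><td>2000.5</td></tr>
  <tr><td>Float Index</td><td>140.2</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
# first row holds the <th> header cells
headers = [th.get_text() for th in soup.find("tr").find_all("th")]
# remaining rows hold the <td> data cells
rows = [[td.get_text() for td in tr.find_all("td")]
        for tr in soup.find_all("tr")[1:]]
df = pd.DataFrame(rows, columns=headers)
print(df)
```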

Webscraping results to data frame

I'm in the process of trying to figure out how to take my scraped data and convert it to a DataFrame using pandas.
As an experiment, I scraped data off a grocery website. Using BeautifulSoup, after fetching the URL I created a loop to pull anything within div tags of a certain class. Then I used the code below to pull the data:
import urllib2
import requests
import pandas as pd
from bs4 import BeautifulSoup
import pickle

link = requests.get("https://www.iga.net/en/online_grocery/frozen_grocery")
soup = BeautifulSoup(link.content, 'html.parser')
## print soup.prettify()
bowl = soup.find_all('div', class_='js-product js-equalized js-addtolist-container js-ga')
for bowls in bowl:
    list = bowls.get('data-product')
    print list
The list it printed:
{'ProductId':'00000_000000005500059917','BrandName':'Nestle','FullDisplayName':'10 Pack Mini Rolo Bars','IsAgeRequired':false,'SizeLabel':'','Size':'10 x 45 ml','ProductUrl':'/en/product/mini-rolo-bars10-pack/00000_000000005500059917','ProductImageUrl':'https://az836796.vo.msecnd.net/media/image/product/en/medium/0005500059917.jpg','HasNewPrice':false,'PromotionName':null,'RegularPrice':6.49000,'SalesPrice':null}
{'ProductId':'00000_000000005574253356','BrandName':'Compliments','FullDisplayName':'100% Pure Frozen Concentrate Pulp Free Juice','IsAgeRequired':false,'SizeLabel':'','Size':'283 ml','ProductUrl':'/en/product/juice100--pure-frozen-concentrate-pulp-free/00000_000000005574253356','ProductImageUrl':'https://az836796.vo.msecnd.net/media/image/product/en/medium/0005574253356.jpg','HasNewPrice':false,'PromotionName':null,'RegularPrice':1.79000,'SalesPrice':null}
I'm trying to take the ProductId, Size, and RegularPrice, for example, and dump that into a table. I'd even be OK with taking all the keys and values and dumping them into a DataFrame so I can play around with it in Excel.
I've tried the following but I get an error (I added the DataFrame call in the last block):
import urllib2
import requests
import pandas as pd
from bs4 import BeautifulSoup
import pickle

link = requests.get("https://www.iga.net/en/online_grocery/frozen_grocery")
soup = BeautifulSoup(link.content, 'html.parser')
## print soup.prettify()
bowl = soup.find_all('div', class_='js-product js-equalized js-addtolist-container js-ga')
for bowls in bowl:
    list = bowls.get('data-product')
    df = pd.DataFrame(list)
    print df
This results in an error. Any help is appreciated. I'm a rookie to this.
You need to convert each data-product value into a valid Python dictionary, then merge them all into one dictionary; after that, you can convert to a DataFrame like this:
import urllib2
import requests
import pandas as pd
from bs4 import BeautifulSoup
import pickle
import json
import collections

link = requests.get("https://www.iga.net/en/online_grocery/frozen_grocery")
soup = BeautifulSoup(link.content, 'html.parser')
## print soup.prettify()
bowl = soup.find_all('div', class_='js-product js-equalized js-addtolist-container js-ga')
super_dict = collections.defaultdict(list)
for bowls in bowl:
    data = bowls.get('data-product')
    data = data.replace("\'", "\"")  # json.loads accepts only double quotes, so replace ' with "
    dict_data = json.loads(data)  # convert to a valid Python dictionary
    for k, v in dict_data.iteritems():  # dict_data.items() in Python 3+
        super_dict[k].append(v)  # merge all dictionaries
df = pd.DataFrame(dict(super_dict))
df
Output will be the dataframe you want:
Update:
If you want to view the DataFrame in an Excel file, you can write it out with the code below:
writer = pd.ExcelWriter('pandas_simple.xlsx', engine='xlsxwriter')
# Convert the dataframe to an XlsxWriter Excel object.
df.to_excel(writer, sheet_name='Sheet1')
# Close the Pandas Excel writer and output the Excel file.
writer.save()
Then you can open file pandas_simple.xlsx to check the data in excel format.
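For what it's worth, in Python 3 the dictionary-merging step can be dropped entirely, since pd.DataFrame accepts a list of dicts directly. A minimal sketch on stand-in data-product strings (values abbreviated from the question):

```python
import json
import pandas as pd

# Stand-in for the data-product attribute strings scraped from the page
raw_products = [
    "{'ProductId':'00000_000000005500059917','Size':'10 x 45 ml','RegularPrice':6.49}",
    "{'ProductId':'00000_000000005574253356','Size':'283 ml','RegularPrice':1.79}",
]
# JSON requires double quotes; the real attributes also contain
# false/null literals, which json.loads already understands
records = [json.loads(s.replace("'", '"')) for s in raw_products]
df = pd.DataFrame(records)  # one row per product
print(df)
```

Note the quote replacement has the same caveat as the answer above: it breaks if a product name contains an apostrophe.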
