Extract a table without class or id - python

I am trying to scrape a table from http://marine-transportation.capitallink.com/indices/baltic_exchange_history.html?ticker=BDI
Although it seemed fairly easy, I'm not able to identify the table in a way that would let me scrape it, so I can't extract the data. Can anyone help with the right identification?
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'http://marine-transportation.capitallink.com/indices/baltic_exchange_history.html?ticker=BDI'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'lxml')

date = []
closing_rate = []

# Here I need a reference to the correct table
table = soup.find()

for row in table.find_all('tr')[1:]:
    col = row.find_all('td')
    column_1 = col[0].string.strip()
    date.append(column_1)
    column_2 = col[1].string.strip()
    closing_rate.append(column_2)

columns = {'date': date, 'closing_rate': closing_rate}
df = pd.DataFrame(columns)
df.to_csv('Baltic_Dry.csv')

You could use unique style attributes to identify the table you need.
For example, on this page here, it looks like the table containing index data is 550px wide. You can use:
soup.findAll('table', width="550")
Please note: I had to use another page on the same website because the one you posted requires a login. Hopefully the page structure will be similar.
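If the structure does match, here is a minimal sketch of how that attribute lookup could slot into your loop. The width value and the column order are assumptions, since the page behind the login could not be checked:
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'http://marine-transportation.capitallink.com/indices/baltic_exchange_history.html?ticker=BDI'
soup = BeautifulSoup(requests.get(url).text, 'lxml')

# Assumption: the history table is the one declared 550px wide,
# with the date and closing rate in the first two cells of each row.
table = soup.find('table', width='550')

date, closing_rate = [], []
for row in table.find_all('tr')[1:]:
    cols = row.find_all('td')
    if len(cols) >= 2:
        date.append(cols[0].get_text(strip=True))
        closing_rate.append(cols[1].get_text(strip=True))

pd.DataFrame({'date': date, 'closing_rate': closing_rate}).to_csv('Baltic_Dry.csv', index=False)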

Related

How to retrieve and store 2nd and 3rd row elements in a dataframe

I am new to Pandas, web scraping, and BeautifulSoup in Python.
While learning to do some basic web scraping in Python using requests and BeautifulSoup, I got stuck on the task of assigning the 2nd and 3rd column values of an HTML table to a pandas DataFrame.
Suppose I have this table:
Here is my code so far:
import pandas as pd
from bs4 import BeautifulSoup
import requests

html_data = requests.get('https://en.wikipedia.org/wiki/List_of_largest_banks').text
soup = BeautifulSoup(html_data, 'html.parser')
data = pd.DataFrame(columns=["Name", "Market Cap (US$ Billion)"])

for row in soup.find_all('tbody')[3].find_all('tr'):  # get to the "By market capitalization" table on the webpage and find all of its rows
    col = row.find_all('td')  # target the individual column values in this particular row
    for j, cell in enumerate(col):
        pass  # Further code here
As can be seen, I want to take the 2nd and 3rd column values of each row and append them to the empty DataFrame, data, so that data contains the bank names and market cap values. How can I achieve that?
For tables I would suggest pandas:
import pandas as pd
url = 'https://en.wikipedia.org/wiki/List_of_largest_banks'
tables = pd.read_html(url)
df = tables[1]
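If only those two columns are needed, they can be pulled out of that DataFrame by position; this is just a sketch, since the exact column labels on the current Wikipedia page may differ:
# Keep the 2nd and 3rd columns regardless of their labels, then rename them.
data = df.iloc[:, 1:3]
data.columns = ["Name", "Market Cap (US$ Billion)"]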
If you prefer using BeautifulSoup, you can try this to accomplish the same:
url = 'https://en.wikipedia.org/wiki/List_of_largest_banks'
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser').find_all('table')
table = soup[1]
table_rows = table.find_all('tr')
table_header = [th.text.strip() for th in table_rows[0].find_all('th')]
table_data = []
for row in table_rows[1:]:
    table_data.append([td.text.strip() for td in row.find_all('td')])
df = pd.DataFrame(table_data, columns=table_header)
When needed, you can set Rank as the index with df.set_index('Rank', inplace=True).

How can I get the data of this table from HackerRank, filter it by country of origin and score, and then export it as a csv file?

I'm learning web scraping in Python and decided to test my skills on the HackerRank Leaderboard page, so I wrote the code below expecting no errors, before adding the country restriction to the tester function and then exporting my csv file.
But then the Python console replied:
AttributeError: 'NoneType' object has no attribute 'find_all'
The error above corresponds to line 29 of my code (for i in table.find_all({'class':'ellipsis'}):), so I decided to come here to ask for assistance. I'm afraid there could be more syntax or logic errors, so it's better to clear up my doubts by getting feedback from experts.
from bs4 import BeautifulSoup
import requests
import pandas as pd
import numpy as np
from time import sleep
from random import randint

pd.set_option('display.max_columns', None)

# Declaring a variable for looping over all the pages
pages = np.arange(1, 93, 1)
a = pd.DataFrame()

# loop cycle
for url in pages:
    # get html for each new page
    url = 'https://www.hackerrank.com/leaderboard?page=' + str(url)
    page = requests.get(url)
    sleep(randint(3, 10))
    soup = BeautifulSoup(page.text, 'lxml')

    # get the table
    table = soup.find('header', {'class': 'table-header flex'})
    headers = []

    # get the headers of the table and delete the "white space"
    for i in table.find_all({'class': 'ellipsis'}):
        title = i.text.strip()
        headers.append(title)

    # set the headers to columns in a new dataframe
    df = pd.DataFrame(columns=headers)
    rows = soup.find('div', {'class': 'table-body'})

    # get the rows of the table but omit the first row (which are headers)
    for row in rows.find_all('table-row-wrapper')[1:]:
        data = row.find_all('table-row-column ellipsis')
        row_data = [td.text.strip() for td in data]
        length = len(df)
        df.loc[length] = row_data

    # set the data of the Txn Count column to float
    Txn = df['SCORE'].values

    # combine all the data rows in one single dataframe
    a = a.append(pd.DataFrame(df))

def tester(mejora):
    mejora = mejora[mejora['SCORE'] > 2250.0]
    return mejora.to_csv('new_test_Score_Count.csv')

tester(a)
Do you guys have any ideas or suggestions that could fix the problem?
The error states that your table element is None. I'm guessing here, but you can't get the table from the page with bs4 because it is loaded afterwards with JavaScript. I would recommend using Selenium for this instead.
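A rough sketch of what that could look like with Selenium, assuming the class names from the question ('table-header flex', 'table-body', 'ellipsis', 'table-row-wrapper', 'table-row-column') still exist in the rendered markup, that each row yields one cell per header, and that a recent Selenium with ChromeDriver is available; the fixed sleep is a crude stand-in for a proper wait:
from selenium import webdriver
from bs4 import BeautifulSoup
import pandas as pd
import time

driver = webdriver.Chrome()  # assumes chromedriver is on the PATH
driver.get('https://www.hackerrank.com/leaderboard?page=1')
time.sleep(5)  # wait for the JavaScript-rendered table to appear

soup = BeautifulSoup(driver.page_source, 'lxml')
driver.quit()

# Pull the header labels and the row cells out of the rendered markup.
header = soup.find('header', {'class': 'table-header flex'})
columns = [h.text.strip() for h in header.find_all(class_='ellipsis')]

body = soup.find('div', {'class': 'table-body'})
rows = [[cell.text.strip() for cell in r.find_all(class_='table-row-column')]
        for r in body.find_all(class_='table-row-wrapper')]

df = pd.DataFrame(rows, columns=columns)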

Python is giving me both columns of a table I am scraping, but I only want it to give me one of the columns

I am using Python to scrape the names of the Alaska Supreme Court justices from Ballotpedia (https://ballotpedia.org/Alaska_Supreme_Court). My current code is giving me both the names of the justices as well as the names of the persons in the "Appointed by" column. Here is my current code:
import requests
from bs4 import BeautifulSoup
import pandas as pd
list = ['https://ballotpedia.org/Alaska_Supreme_Court']
temp_dict = {}
for page in list:
    r = requests.get(page)
    soup = BeautifulSoup(r.content, 'html.parser')
    temp_dict[page.split('/')[-1]] = [item.text for item in soup.select("table.wikitable.sortable.jquery-tablesorter a")]

df = pd.DataFrame.from_dict(temp_dict, orient='index').transpose()
df.to_csv('18-TEST.csv')
I've been trying to work with this line:
temp_dict[page.split('/')[-1]] = [item.text for item in soup.select("table.wikitable.sortable.jquery-tablesorter a")]
I'm a little inexperienced with the inspect function on webpages, so I may be trying the wrong thing when I put "tr" or "td" (which I find under "tbody") after "tablesorter". I'm a bit lost at this point and am having trouble finding resources on this. Could you help me get Python to give me the judge column but not the appointed-by column? Thank you!
There are different options to get the result.
Option#1
Slice the list and pick every second element:
soup.select("table.wikitable.sortable.jquery-tablesorter a")[0::2]
Example:
import requests
from bs4 import BeautifulSoup
import pandas as pd
lst = ['https://ballotpedia.org/Alaska_Supreme_Court']
temp_dict = {}
for page in lst:
    r = requests.get(page)
    soup = BeautifulSoup(r.content, 'html.parser')
    temp_dict[page.split('/')[-1]] = [item.text for item in soup.select("table.wikitable.sortable.jquery-tablesorter a")][0::2]
pd.DataFrame.from_dict(temp_dict, orient='index').transpose().to_csv('18-TEST.csv', index=False)
Option#2
Make your selection more specific and select only the first td in a tr:
soup.select("table.wikitable.sortable.jquery-tablesorter tr > td:nth-of-type(1)")
Example
import requests
from bs4 import BeautifulSoup
import pandas as pd
lst = ['https://ballotpedia.org/Alaska_Supreme_Court']
temp_dict = {}
for page in lst:
    r = requests.get(page)
    soup = BeautifulSoup(r.content, 'html.parser')
    temp_dict[page.split('/')[-1]] = [item.text for item in soup.select("table.wikitable.sortable.jquery-tablesorter tr > td:nth-of-type(1)")]
pd.DataFrame.from_dict(temp_dict, orient='index').transpose().to_csv('18-TEST.csv', index=False)
Option#3
Use the pandas read_html() functionality:
Example
import pandas as pd
df = pd.read_html('https://ballotpedia.org/Alaska_Supreme_Court')[2]
df.Judge.to_csv('18-TEST.csv', index=False)
Firstly, please note that this is code cannibalised from here.
Now, if you don't know how many rows or columns you have, this gives you a dataframe with all the columns, corresponding to the table on the webpage. Feel free to drop one of the columns if you don't need it (a one-liner for that follows the code).
import requests
from bs4 import BeautifulSoup
import pandas as pd
# I'll do it for the one page example
page = 'https://ballotpedia.org/Alaska_Supreme_Court'
temp_dict = {}
r = requests.get(page)
soup = BeautifulSoup(r.content, 'html.parser')
# this finds the first table with the class specified
table = soup.find('table', attrs={'class':'wikitable sortable jquery-tablesorter'})
# get all rows of the above table
rows = table.find_all('tr')
data = []
for row in rows:
    cols = row.find_all('td')
    cols = [ele.text.strip() for ele in cols]
    data.append([ele for ele in cols if ele])
# turn it into a pandas dataframe
df = pd.DataFrame(data)
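For instance, to keep only the judge names you could drop the second positional column; this sketch assumes the scraped table has exactly the two columns described in the question, so the DataFrame ends up with integer labels 0 and 1:
# Column 1 is assumed to be the "Appointed by" column; column 0 holds the judge names.
df = df.drop(columns=[1])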
I would like to share another approach to get your table in the desired format:
import pandas as pd
# extracting table and making it dataframe
frame = pd.read_html('https://ballotpedia.org/Alaska_Supreme_Court',attrs={"class":"wikitable sortable jquery-tablesorter"})[0]
# drop unwanted columns
frame.drop("Appointed By", axis=1, inplace=True)
# save dataframe as csv
frame.to_csv("desired/path/output.csv", index=False)
Printing frame would give output like:
|Judge|
|-----|
| Daniel Winfree|
| Joel Harold Bolger|
| Peter Jon Maassen|
| Susan Carney|
| Dario Borghesan|

web scraping table from multiple pages from a search and creating a pandas dataframe

I got this code working for the first page, and I needed the user agent as it didn't work otherwise.
The problem is that the search only brings up the first page; the second page has "page=2" in the URL and so on, so I need to scrape all the pages (or as many as needed) from the search:
"https://www.vesselfinder.com/vessels?page=2&minDW=20000&maxDW=300000&type=4"
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
import pandas as pd
import numpy as np
import re

site = "https://www.vesselfinder.com/vessels?type=4&minDW=20000&maxDW=300000"
hdr = {'User-Agent': 'Chrome/70.0.3538.110'}
req = Request(site, headers=hdr)
page = urlopen(req)

soup = BeautifulSoup(page, 'lxml')
type(soup)

rows = soup.find_all('tr')
print(rows[:10])

for row in rows:
    row_td = row.find_all('td')
print(row_td)
type(row_td)

str_cells = str(row_td)
cleantext = BeautifulSoup(str_cells, "lxml").get_text()
print(cleantext)

list_rows = []
for row in rows:
    cells = row.find_all('td')
    str_cells = str(cells)
    clean = re.compile('<.*?>')
    clean2 = re.sub(clean, '', str_cells)
    list_rows.append(clean2)
print(clean2)
type(clean2)

df = pd.DataFrame(list_rows)
df.head(10)

df1 = df[0].str.split(',', expand=True)
df1.head(10)
The output is a pandas DataFrame; I need to scrape all the pages to output one large DataFrame.
Okay, so this problem ended up getting stuck in my head, so I worked it out.
import pandas as pd
import requests
hdr={'User-Agent':'Chrome/70.0.3538.110'}
table_dfs={}
for page_number in range(951):
    http = "https://www.vesselfinder.com/vessels?page={}&minDW=20000&maxDW=300000&type=4".format(page_number + 1)
    url = requests.get(http, headers=hdr)
    table_dfs[page_number] = pd.read_html(url.text)
It will return the first column (vessel) as a NaN value; that's the column for the image, so ignore it if you don't need it. The next column will be called 'built'; it has the ship's name and the type of ship in it. You'll need to .split() to separate them, and then you can replace the vessel column with the ship's name, roughly as sketched below.
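A sketch of that cleanup for a single page's table, assuming the first table on the page is the vessel listing and that the 'Built' column holds the combined name-and-type strings; the exact column labels and the split rule are guesses and will likely need adjusting against the real data:
df = table_dfs[0][0]                 # first page, first table returned by read_html
df = df.drop(columns=df.columns[0])  # drop the empty image column
name_and_type = df['Built'].astype(str).str.split(n=1, expand=True)  # placeholder split: name vs. type
df['Vessel'] = name_and_type[0]
df['Type'] = name_and_type[1]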
If it works for you I'd love to boost my reputation with a nice green check mark.
rows = soup.find_all('tr')
print(rows[:10])
for row in rows:
    row_td = row.find_all('td')
print(row_td)
type(row_td)
The code above is the same thing as:
urls=['some list of urls you want to scrape']
table_dfs= [pd.read_html(url) for url in urls]
You can crawl through the URLs you're looking for and apply that, and then if you want to do something with the tables you can just go:
for table in table_dfs:
    ...  # the thing you want to do with each table
Note that the in-line for loop builds table_dfs as a list, which means you might not be able to discern which URL each table came from if the scrape is big enough. Pieca seemed to have a solution that could be used to iterate over the website's URLs and create a dictionary key. Note that this solution may not apply to every website.
url_list = {page_number: "https://www.vesselfinder.com/vessels?page={}&minDW=20000&maxDW=300000&type=4".format(page_number)
            for page_number in range(1, 953)}

table_dfs = {}
for page_number, url in url_list.items():
    r = requests.get(url, headers=hdr)  # pd.read_html has no HTTP-header argument, so fetch with requests first
    table_dfs[page_number] = pd.read_html(r.text)

Trouble parsing HTML page with Python

I'm trying to get hold of the data under the columns with codes "SVENYXX", where "XX" are the numbers that follow (e.g. 01, 02, etc.), on the site http://www.federalreserve.gov/econresdata/researchdata/feds200628_1.html using Python. I am currently using the method prescribed by http://docs.python-guide.org/en/latest/scenarios/scrape/ . However, I don't know how to determine the divs for this page, so I'm unable to proceed, and I was hoping to get some help with this.
This is what I have so far:
from lxml import html
import requests
page = requests.get('http://www.federalreserve.gov/econresdata/researchdata/feds200628_1.html')
tree = html.fromstring(page.text)
Thank You
Have you tried using BeautifulSoup? I'm a pretty big fan. Using that you can easily iterate through all of the info you want, searching by tag.
Here's something I threw together, that prints out the values in each column you are looking at. Not sure what you want to do with the data, but hopefully it helps.
from bs4 import BeautifulSoup
from urllib import request

page = request.urlopen('http://www.federalreserve.gov/econresdata/researchdata/feds200628_1.html').read()
soup = BeautifulSoup(page, 'html.parser')

desired_table = soup.findAll('table')[2]

# Find the columns you want data from
headers = desired_table.findAll('th')
desired_columns = []
for th in headers:
    if 'SVENY' in th.string:
        desired_columns.append(headers.index(th))

# Iterate through each row grabbing the data from the desired columns
rows = desired_table.findAll('tr')
for row in rows[1:]:
    cells = row.findAll('td')
    for column in desired_columns:
        print(cells[column].text)
In response to your second request:
from bs4 import BeautifulSoup
from urllib import request

page = request.urlopen('http://www.federalreserve.gov/econresdata/researchdata/feds200628_1.html').read()
soup = BeautifulSoup(page, 'html.parser')

desired_table = soup.findAll('table')[2]
data = {}

# Find the columns you want data from
headers = desired_table.findAll('th')
desired_columns = []
column_count = 0
for th in headers:
    if 'SVENY' in th.string:
        data[th.string] = {'column': headers.index(th), 'data': []}
        column_count += 1

# Iterate through each row grabbing the data from the desired columns
rows = desired_table.findAll('tr')
for row in rows[1:]:
    date = row.findAll('th')[0].text
    cells = row.findAll('td')
    for header, info in data.items():
        column_number = info['column']
        cell_data = [date, cells[column_number].text]
        info['data'].append(cell_data)
This returns a dictionary where each key is the header for a column, and each value is another dictionary that has 1) the column it's in on the site, and 2) the actual data you want, in a list of lists.
As an example:
for year_number in data['SVENY01']['data']:
    print(year_number)
['2015-06-05', '0.3487']
['2015-06-04', '0.3124']
['2015-06-03', '0.3238']
['2015-06-02', '0.3040']
['2015-06-01', '0.3009']
['2015-05-29', '0.2957']
etc.
You can fiddle around with this to get the info how and where you want it, but hopefully this is helpful.
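As one example of that fiddling, here is a sketch that flattens the dictionary built above into a single DataFrame by turning each column's list of [date, value] pairs into its own frame and lining them up on the date; the output file name is just a placeholder:
import pandas as pd

# Build one DataFrame per SVENY column, indexed by date, then join them side by side.
frames = []
for header, info in data.items():
    frame = pd.DataFrame(info['data'], columns=['date', header]).set_index('date')
    frames.append(frame)

combined = pd.concat(frames, axis=1)
combined.to_csv('sveny_yields.csv')  # hypothetical output path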
