Is there a way to also print out the second table on this website? (the one that starts with CK Hutchison Holdings Ltd. )
Website link: https://www.hkex.com.hk/Products/Listed-Derivatives/Single-Stock/Stock-Options?sc_lang=en
This was my code:
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
headers = {
'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.79 Safari/537.36'
}
url = 'https://www.hkex.com.hk/Products/Listed-Derivatives/Single-Stock/Stock-Options?sc_lang=en'
r = requests.get(url, headers=headers)
soup = bs(r.text, 'html.parser')
spec_table = soup.select('table[class="table migrate"]')[1]
df = pd.read_html(str(spec_table))[0]
print(df[:5].to_markdown())
Whenever I tried spec_table = soup.select('table[class="table migrate"]')[0], it printed the first table (the one that starts with Sun Hung Kai Properties Limited). Whenever I tried spec_table = soup.select('table[class="table migrate"]')[1], it skipped the table in between (the one that starts with CK Hutchison Holdings Ltd.) and printed the contract summary table instead.
Could anyone explain how I can print out the second table?
This is one way to isolate and extract that second table from the page:
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
headers = {
'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.79 Safari/537.36'
}
url = 'https://www.hkex.com.hk/Products/Listed-Derivatives/Single-Stock/Stock-Options?sc_lang=en'
r = requests.get(url, headers=headers)
soup = bs(r.text, 'html.parser')
second_table = soup.select('table:-soup-contains("CK Hutchison Holdings Ltd.")')[0]
df = pd.read_html(str(second_table))[0]
print(df)
Result:
No. SEHK Code Underlying Stock Name HKATS Code Contract Size (shares) Tier No.* Position Limit ## (Effective from 1 April 2022) Approved by FSC Taiwan+
0 1 1 CK Hutchison Holdings Ltd. CKH 500 1 50000 ✓
1 2 2 CLP Holdings Limited CLP 500 1 50000 NaN
2 3 3 The Hong Kong and China Gas Company Limited HKG 1000 2 150000 NaN
3 4 4 The Wharf (Holdings) Limited WHL 1000 1 50000 NaN
4 5 5 HSBC Holdings Plc. HKB 400 2 150000 ✓
... ... ... ... ... ... ... ... ...
63 64 3323 China National Building Material Company Limited NBM 2000 2 100000 ✓
64 65 3328 Bank of Communications Co., Ltd. BCM 1000 3 150000 ✓
65 66 3968 China Merchants Bank Co., Ltd. CMB 500 1 150000 ✓
66 67 3988 Bank of China Limited XBC 1000 3 150000 ✓
67 68 6862 Haidilao International Holding Ltd. HDO 1000 1 100000 NaN
For documentation, refer to my response to your previous question. Also make sure your bs4 package is up to date: run pip install -U bs4.
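The :-soup-contains() pseudo-class comes from the SoupSieve package that BeautifulSoup uses for CSS selectors, so if the selector above raises an error, check the installed SoupSieve version (a minimal sketch):
import soupsieve

# :-soup-contains() is only available in newer SoupSieve releases
print(soupsieve.__version__)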
Apparently it is due to an HTML error in the webpage: that page was clearly written by several people, and if you check the HTML, the tables (which all look the same) each have slightly different markup.
Anyway, your problem is that the missing table is marked class="table migrate;" instead of class="table migrate" like all the other ones (the difference being the trailing ;).
To get your table, simply select this "other" class:
spec_table = soup.select('table[class="table migrate;"]')[0]
and you will have the table you are missing.
As a general tip: if you have problems with web scraping, search the HTML source directly (look for a string you can see on the rendered page to find the exact spot you are interested in, or just inspect it with your browser's dev tools).
Alternatively, if you want more elegant and general code (for example, if you need to iterate over all the tables), you can do this:
start = soup.find('p', {'class': 'spanHeading'})
spec_table = start.find_all_next('table')
And then do what you wanted to do before:
df = pd.read_html(str(spec_table[1]))[0]
print(df[:5].to_markdown())
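For example, to loop over every table that follows that heading and read each one into its own DataFrame (a small sketch reusing the spec_table list from above):
# spec_table is the list returned by find_all_next('table')
all_dfs = [pd.read_html(str(t))[0] for t in spec_table]
for i, table_df in enumerate(all_dfs):
    print(f'--- table {i} ---')
    print(table_df[:5].to_markdown())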
Related
I am attempting to scrape individual fund holdings from the SEC's N-PORT-P/A form using beautiful soup and xml. A typical submission, outlined below and [linked here][1], looks like:
<edgarSubmission xmlns="http://www.sec.gov/edgar/nport" xmlns:com="http://www.sec.gov/edgar/common" xmlns:ncom="http://www.sec.gov/edgar/nportcommon" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<headerData>
<submissionType>NPORT-P/A</submissionType>
<isConfidential>false</isConfidential>
<accessionNumber>0001145549-23-004025</accessionNumber>
<filerInfo>
<filer>
<issuerCredentials>
<cik>0001618627</cik>
<ccc>XXXXXXXX</ccc>
</issuerCredentials>
</filer>
<seriesClassInfo>
<seriesId>S000048029</seriesId>
<classId>C000151492</classId>
</seriesClassInfo>
</filerInfo>
</headerData>
<formData>
<genInfo>
...
</genInfo>
<fundInfo>
...
</fundInfo>
<invstOrSecs>
<invstOrSec>
<name>ARROW BIDCO LLC</name>
<lei>549300YHZN08M0H3O128</lei>
<title>Arrow Bidco LLC</title>
<cusip>042728AA3</cusip>
<identifiers>
<isin value="US042728AA35"/>
</identifiers>
<balance>115000.000000000000</balance>
<units>PA</units>
<curCd>USD</curCd>
<valUSD>114754.170000000000</valUSD>
<pctVal>0.3967552449</pctVal>
<payoffProfile>Long</payoffProfile>
<assetCat>DBT</assetCat>
<issuerCat>CORP</issuerCat>
<invCountry>US</invCountry>
<isRestrictedSec>N</isRestrictedSec>
<fairValLevel>2</fairValLevel>
<debtSec>
<maturityDt>2024-03-15</maturityDt>
<couponKind>Fixed</couponKind>
<annualizedRt>9.500000000000</annualizedRt>
<isDefault>N</isDefault>
<areIntrstPmntsInArrs>N</areIntrstPmntsInArrs>
<isPaidKind>N</isPaidKind>
</debtSec>
<securityLending>
<isCashCollateral>N</isCashCollateral>
<isNonCashCollateral>N</isNonCashCollateral>
<isLoanByFund>N</isLoanByFund>
</securityLending>
</invstOrSec>
Arrow Bidco LLC is a bond within the portfolio, and some of its characteristics are included within the filing (CUSIP, CIK, balance, maturity date, etc.). I am looking for the best way to iterate through each individual security (invstOrSec) and collect the characteristics of each security in a dataframe.
The code I am currently using is:
import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup
header = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36", "X-Requested-With": "XMLHttpRequest"}
n_port_file = requests.get("https://www.sec.gov/Archives/edgar/data/1618627/000114554923004968/primary_doc.xml", headers=header, verify=False)
n_port_file_xml = n_port_file.content
soup = BeautifulSoup(n_port_file_xml,'xml')
names = soup.find_all('name')
lei = soup.find_all('lei')
title = soup.find_all('title')
cusip = soup.find_all('cusip')
....
maturityDt = soup.find_all('maturityDt')
couponKind = soup.find_all('couponKind')
annualizedRt = soup.find_all('annualizedRt')
Then iterating through each list to create a dataframe based on the values in each row.
fixed_income_data = []
for i in range(0, len(names)):
    rows = [names[i].get_text(), lei[i].get_text(),
            title[i].get_text(), cusip[i].get_text(),
            balance[i].get_text(), units[i].get_text(),
            pctVal[i].get_text(), payoffProfile[i].get_text(),
            assetCat[i].get_text(), issuerCat[i].get_text(),
            invCountry[i].get_text(), couponKind[i].get_text()
            ]
    fixed_income_data.append(rows)
fixed_income_df = pd.DataFrame(fixed_income_data, columns = ['name',
                                                             'lei',
                                                             'title',
                                                             'cusip',
                                                             'balance',
                                                             'units',
                                                             'pctVal',
                                                             'payoffProfile',
                                                             'assetCat',
                                                             'issuerCat',
                                                             'invCountry',
                                                             'maturityDt',
                                                             'couponKind',
                                                             'annualizedRt'
                                                             ], dtype = float)
This works fine when all pieces of information are included, but often there is one variable that is not accounted for. A piece of the form might be blank, or an issuer category might not have been filled out correctly, leading to an IndexError. This portfolio has 127 securities that I was able to parse, but I might be missing an annualized return for a single security, throwing off the ability to neatly create a dataframe.
Additionally, for portfolios that hold both fixed income and equity securities, the equity securities do not return information for the debtSecs child. Is there a way to iterate through this data while simultaneously cleaning it in the easiest way possible? Even adding "NaN" for the debtSec children that equity securities don't reference would be a valid response. Any help would be much appreciated!
[1]: https://www.sec.gov/Archives/edgar/data/1618627/000114554923004968/primary_doc.xml
Here is the best way, in my opinion, to handle the problem. Generally speaking, EDGAR filings are notoriously difficult to parse, so the following may or may not work on other filings, even from the same filer.
To make it easier on yourself, since this is an XML file, you should use an xml parser and xpath. Given that you're looking to create a dataframe, the most appropriate tool would be the pandas read_xml() method.
Because the XML is nested, you will need to create two different dataframes and concatenate them (maybe others will have a better idea on how to approach it). And finally, although read_xml() can read directly from a url, in this case EDGAR requires a user-agent, meaning you also need to use the requests library.
So, all together:
#import required libraries
import pandas as pd
import requests
url = 'https://www.sec.gov/Archives/edgar/data/1618627/000114554923004968/primary_doc.xml'
#set headers with a user-agent
headers = {"User-agent":"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36"}
req = requests.get(url, headers=headers)
#define the columns you want to drop (based on the data in your question)
to_drop = ['identifiers', 'curCd','valUSD','isRestrictedSec','fairValLevel','debtSec','securityLending']
#the filing uses namespaces (too complicated to get into here), so you need to define that as well
namespaces = {"nport": "http://www.sec.gov/edgar/nport"}
#create the first df, for the securities which are debt instruments
invest = pd.read_xml(req.text,xpath="//nport:invstOrSec[.//nport:debtSec]",namespaces=namespaces).drop(to_drop, axis=1)
#create the 2nd df, for the debt details:
debt = pd.read_xml(req.text,xpath="//nport:debtSec",namespaces=namespaces).iloc[:,0:3]
#finally, concatenate the two into one df:
pd.concat([invest, debt], axis=1)
This should output your 126 debt securities (pardon the formatting):
name lei title cusip balance units pctVal payoffProfile assetCat issuerCat invCountry maturityDt couponKind annualizedRt
0 ARROW BIDCO LLC 549300YHZN08M0H3O128 Arrow Bidco LLC 042728AA3 115000.00 PA 0.396755 Long DBT CORP US 2024-03-15 Fixed 9.50000
1 CD&R SMOKEY BUYER INC NaN CD&R Smokey Buyer Inc 12510CAA9 165000.00 PA 0.505585 Long DBT CORP US 2025-07-15 Fixed 6.75000
You can then play with the final df, add or drop columns, etc
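For instance, a small follow-up sketch (the column names are taken from the output above, so treat them as assumptions):
df = pd.concat([invest, debt], axis=1)
# parse the maturity date and sort holdings by portfolio weight
df['maturityDt'] = pd.to_datetime(df['maturityDt'])
print(df.sort_values('pctVal', ascending=False)[['name', 'cusip', 'pctVal', 'maturityDt']].head())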
I'm trying to grab the tables for all pitchers batting against found on this page.
I believe the problem lies with the data being behind a comment.
For the sake of the example, I'd like to find, say, Sandy Alcantara's home runs allowed.
import requests
from bs4 import BeautifulSoup as bs
url="https://www.baseball-reference.com/leagues/majors/2022-batting-pitching.shtml"
page=requests.get(url)
soup=bs(page.content,"html.parser")
for tr in soup.find_all('tr'):
    td = tr.find_all('td')
    print(td)
This prints a lot of team data, but doesn't print the pitcher data.
How can I cleverly get it to print the pitcher data? Ideally, I'd have it in a list or something.
object[0]=Rk
object[1]=Name
object[4]=IP
object[13]=HR
The problem related to the extraction of the table content is that the table itself is stored inside a comment string.
After you have fetched your web page and loaded it into BeautifulSoup, you can solve this scraping issue by following these steps:
gather the div tagged id = 'all_players_batting_pitching', which contains your table
extract the table from the comments using the decode_contents function, then reloading your text into a soup
extract each record of your table by looking for the tr tag, then each value by looking for the td tag, keeping only the columns you need (the Rk value is not in a td, so your Name, IP and HR columns correspond to td indices [0, 3, 12] in the code below)
load your values into a Pandas.DataFrame, ready to be used
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
# fetching web page
url = "https://www.baseball-reference.com/leagues/majors/2022-batting-pitching.shtml"
page = requests.get(url)
# extracting table from html
soup = bs(page.content,"html.parser")
table = soup.find(id = 'all_players_batting_pitching')
tab_text = table.decode_contents().split('--')[1].strip()
tab_soup = bs(tab_text,"html.parser")
# extracting records from table
records = []
for i, row in enumerate(tab_soup.find_all('tr')):
    record = [ele.text.strip() for j, ele in enumerate(row.find_all('td')) if j in [0, 3, 12]]
    if record != []:
        records.append([row.a['href']] + [i] + record)
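To finish the last step and get the table shown below, load the records into a DataFrame (a short sketch; column names taken from the output):
# load records into a DataFrame (step 4)
df = pd.DataFrame(records, columns=['href', 'Rk', 'Name', 'IP', 'HR'])
print(df)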
Output:
href Rk Name IP HR
0 /players/a/abbotco01.shtml 1 Cory Abbott 48.0 12
1 /players/a/abreual01.shtml 2 Albert Abreu 38.2 5
2 /players/a/abreual01.shtml 3 Albert Abreu 8.2 2
3 /players/a/abreual01.shtml 4 Albert Abreu 4.1 1
4 /players/a/abreual01.shtml 5 Albert Abreu 25.2 2
... ... ... ... ... ...
1063 /players/z/zastrro01.shtml 1106 Rob Zastryzny* 1.0 0
1064 /players/z/zastrro01.shtml 1107 Rob Zastryzny* 3.0 0
1065 /players/z/zerpaan01.shtml 1108 Angel Zerpa* 11.0 2
1066 /players/z/zeuchtj01.shtml 1109 T.J. Zeuch 10.2 5
1067 /players/z/zimmebr02.shtml 1110 Bruce Zimmermann* 73.2 21
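As a side note, instead of splitting the div's contents on '--', you can pull the comment out explicitly with bs4's Comment type (an alternative sketch, not what the code above uses):
from bs4 import Comment

table_div = soup.find(id='all_players_batting_pitching')
comment = table_div.find(string=lambda text: isinstance(text, Comment))
tab_soup = bs(comment, 'html.parser')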
I've cobbled together the following code that scrapes a website table using Beautiful Soup.
The script is working as intended except for the first two entries.
Q1: The first entry consists of two empty brackets... how do I omit them?
Q2: The second entry has a hidden tab creating whitespace in the second element that I can't get rid of. How do I remove it?
Code:
import requests
import pandas as pd
from bs4 import BeautifulSoup
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
testlink = "https://www.crutchfield.com/p_13692194/JL-Audio-12TW3-D8.html?tp=64077"
r = requests.get(testlink, headers=headers)
soup = BeautifulSoup(r.content, 'lxml')
table = soup.find('table', class_='table table-striped')
df = pd.DataFrame(columns=['col1', 'col2'])
rows = []
for i, row in enumerate(table.find_all('tr')):
    rows.append([el.text.strip() for el in row.find_all('td')])

for row in rows:
    print(row)
Results:
[]
['Size', '12 -inch']
['Impedance (Ohms)', '4, 16']
['Cone Material', 'Mica-Filled IMPP']
['Surround Material', 'Rubber']
['Ideal Sealed Box Volume (cubic feet)', '1']
['Ideal Ported Box Volume (cubic feet)', '1.3']
['Port diameter (inches)', 'N/A']
['Port length (inches)', 'N/A']
['Free-Air', 'No']
['Dual Voice Coil', 'Yes']
['Sensitivity', '84.23 dB at 1 watt']
['Frequency Response', '24 - 200 Hz']
['Max RMS Power Handling', '400']
['Peak Power Handling (Watts)', '800']
['Top Mount Depth (inches)', '3 1/2']
['Bottom Mount Depth (inches)', 'N/A']
['Cutout Diameter or Length (inches)', '11 5/8']
['Vas (liters)', '34.12']
['Fs (Hz)', '32.66']
['Qts', '0.668']
['Xmax (millimeters)', '15.2']
['Parts Warranty', '1 Year']
['Labor Warranty', '1 Year']
Let's simplify, shall we?
import pandas as pd
df = pd.read_html('https://www.crutchfield.com/S-f7IbEJ40aHd/p_13692194/JL-Audio-12TW3-D8.html?tp=64077')[0]
df.columns = ['Property', 'Value', 'Not Needed']
print(df[['Property', 'Value']])
Result in terminal:
Property Value
0 Size 12 -inch
1 Impedance (Ohms) 4, 16
2 Cone Material Mica-Filled IMPP
3 Surround Material Rubber
4 Ideal Sealed Box Volume (cubic feet) 1
5 Ideal Ported Box Volume (cubic feet) 1.3
6 Port diameter (inches) NaN
7 Port length (inches) NaN
8 Free-Air No
9 Dual Voice Coil Yes
10 Sensitivity 84.23 dB at 1 watt
11 Frequency Response 24 - 200 Hz
12 Max RMS Power Handling 400
13 Peak Power Handling (Watts) 800
14 Top Mount Depth (inches) 3 1/2
15 Bottom Mount Depth (inches) NaN
16 Cutout Diameter or Length (inches) 11 5/8
17 Vas (liters) 34.12
18 Fs (Hz) 32.66
19 Qts 0.668
20 Xmax (millimeters) 15.2
21 Parts Warranty 1 Year
22 Labor Warranty 1 Year
Pandas documentation can be found here.
You can clean the results like this if you want.
rows = []
for i, row in enumerate(table.find_all('tr')):
    cells = [
        el.text.strip().replace("\t", "")  ## remove tabs
        for el
        in row.find_all('td')
    ]
    ## don't add a row with no tds
    if cells:
        rows.append(cells)
I think you can further simplify this with a walrus :=
rows = [
    [cell.text.strip().replace("\t", "") for cell in cells]
    for row in table.find_all('tr')
    if (cells := row.find_all('td'))
]
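Either way, you can then build the DataFrame the original code set up (a quick sketch using the question's column names; it assumes each remaining row has exactly two cells):
df = pd.DataFrame(rows, columns=['col1', 'col2'])
print(df)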
Let Pandas do it all
No need for anything else: pandas can read tables inside HTML.
import pandas as pd

url = 'https://www.crutchfield.com/p_13692194/JL-Audio-12TW3-D8.html?tp=64077'
df = pd.read_html(url, attrs={'class': 'table table-striped'})[0]
df.columns = ['Features', 'Specs', 'Blank']
df.drop('Blank', axis=1, inplace=True)  # get rid of the hidden column
That's it.
Seems to me it's all good, no spaces left.
If you still feel there are spaces left in some column:
df['Features'] = df['Features'].apply(lambda x: x.strip())  # not needed here
If you need to pass headers in the request, you can pass the requests response to pd.read_html (P.S.: it works without headers for the given URL):
df = pd.read_html(requests.get(url, headers=headers).content,
                  attrs={'class': 'table table-striped'})[0]
If you look at this page https://metals-api.com/currencies, there is an HTML table with 2 columns. I would like to extract all the rows from column 1 into a list/array. How do I go about this?
import requests
from bs4 import BeautifulSoup
URL = "https://metals-api.com/currencies"
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")
with open('outpu2t.txt', 'w', encoding='utf-8') as f:
    f.write(soup.text)
To clarify, I am not looking to run fetch-price commands against these tickers; I'm trying to compile a list of tickers so I can add them to a dropdown menu for my app.
If I understand the question correctly, you can try the following example:
import requests
from bs4 import BeautifulSoup
import pandas as pd
data=[]
URL = "https://metals-api.com/currencies"
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")
for code in soup.select('.table tbody tr td:nth-child(1)'):
    code = code.text
    data.append(code)
df=pd.DataFrame(data,columns=['code'])
#df.to_csv('code.csv',index=False)# to store data
print(df)
Output:
code
0 XAU
1 XAG
2 XPT
3 XPD
4 XCU
.. ...
209 LINK
210 XLM
211 ADA
212 BCH
213 LTC
[214 rows x 1 columns]
I sit corrected; I initially just tried pd.read_html("https://metals-api.com/currencies"), which normally works, but apparently with a very slight workaround it can still work just fine.
import pandas as pd
import requests
URL = "https://metals-api.com/currencies"
page = requests.get(URL)
df = pd.read_html(page.content)[0]
print(df)
Output:
Code Name
0 XAU 1 Ounce of 24K Gold. Use Carat endpoint to dis...
1 XAG Silver
2 XPT Platinum
3 XPD Palladium
4 XCU Copper
.. ... ...
209 LINK Chainlink
210 XLM Stellar
211 ADA Cardano
212 BCH Bitcoin Cash
213 LTC Litecoin
[214 rows x 2 columns]
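And since the goal was a list of tickers for a dropdown rather than a DataFrame, the first column converts straight to one:
tickers = df['Code'].tolist()
print(tickers[:5])  # ['XAU', 'XAG', 'XPT', 'XPD', 'XCU']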
I'm scraping a table from a website using Python. My code looks like this:
import requests
from bs4 import BeautifulSoup
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'}
print('Scraping NJ Dept of Banking...')
print()
NJurl = 'https://www.state.nj.us/dobi/bnk_depositories/appwebinfo.htm'
NJr = requests.get(NJurl, headers = headers)
NJsoup = BeautifulSoup(NJr.text, 'html.parser')
NJlist = []
for tr in NJsoup.find_all('td')[12:30]:
    NJresult = tr.text.strip()
    NJlist.append(' '.join(NJresult.split()))
    print(' '.join(NJresult.split()))

if not NJlist:
    NJresult = 'No Transactions'
    print(NJresult)
    NJlist.append(NJresult)

print()
The code above works well; the results look like this:
The Bank of Princeton
Branch
Application to establish a branch office on or about December 31, 2019 at 1839 Chestnut Street Philadelphia, PA (Philadelphia County County) 7/15/2019 Accepted
7/15/2019
Accepted
Two River Community Bank
Branch
Expedited application to establish a branch office at 442 Highway 35 Middletown, NJ (Monmouth County) 7/15/2019 Opened
7/15/2019
Opened
Two River Community Bank
Main Office Relocation
Application to relocate its main office from 1250 Route 35, Middletown, Middlesex County, 07748 to 766 Shrewsbury Avenue Tinton Falls, NJ (Monmouth County) 7/15/2019 Opened
7/15/2019
Opened
The problem is I would like to exclude certain types of events. For example, I don't need to see the first two events involving branches. I've adjusted a portion of my code to try to account for this:
for tr in NJsoup.find_all('td')[12:30]:
    NJresult = tr.text.strip()
    if 'branch' not in NJresult.lower():
        NJlist.append(' '.join(NJresult.split()))
        print(' '.join(NJresult.split()))
The problem with that adjustment is that it just eliminates specific sentences with the word 'branch' in them, not the entire event. For example the output with the adjusted code looks like this:
The Bank of Princeton
7/15/2019
Accepted
Two River Community Bank
7/15/2019
Opened
Two River Community Bank
Main Office Relocation
Application to relocate its main office from 1250 Route 35, Middletown, Middlesex County, 07748 to 766 Shrewsbury Avenue Tinton Falls, NJ (Monmouth County) 7/15/2019 Opened
7/15/2019
Opened
Is there a way I can go about skipping any output for a branch transaction so the output looks like this?
Two River Community Bank
Main Office Relocation
Application to relocate its main office from 1250 Route 35, Middletown, Middlesex County, 07748 to 766 Shrewsbury Avenue Tinton Falls, NJ (Monmouth County) 7/15/2019 Opened
7/15/2019
Opened
Although you named your iterator tr, you are not iterating through rows, but actually through <td>s. I would iterate through relevant rows and then decide in each row if I want to include it or not. Something like this:
find the rows
rows = NJsoup.find_all('tr')[6:-2:2]
then iterate through them:
for row in rows:
    new_record = []
    for td in row.find_all('td'):
        if td.text.strip() == 'Branch':
            break
        new_record.append(td.text.strip())
    else:
        NJlist.append('\n'.join(new_record))
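You can then print the filtered records to check them (a quick sketch):
for record in NJlist:
    print(record)
    print()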