Parse SEC EDGAR XML Form Data with child nodes using BeautifulSoup

I am attempting to scrape individual fund holdings from the SEC's N-PORT-P/A form using beautiful soup and xml. A typical submission, outlined below and [linked here][1], looks like:
<edgarSubmission xmlns="http://www.sec.gov/edgar/nport" xmlns:com="http://www.sec.gov/edgar/common" xmlns:ncom="http://www.sec.gov/edgar/nportcommon" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<headerData>
<submissionType>NPORT-P/A</submissionType>
<isConfidential>false</isConfidential>
<accessionNumber>0001145549-23-004025</accessionNumber>
<filerInfo>
<filer>
<issuerCredentials>
<cik>0001618627</cik>
<ccc>XXXXXXXX</ccc>
</issuerCredentials>
</filer>
<seriesClassInfo>
<seriesId>S000048029</seriesId>
<classId>C000151492</classId>
</seriesClassInfo>
</filerInfo>
</headerData>
<formData>
<genInfo>
...
</genInfo>
<fundInfo>
...
</fundInfo>
<invstOrSecs>
<invstOrSec>
<name>ARROW BIDCO LLC</name>
<lei>549300YHZN08M0H3O128</lei>
<title>Arrow Bidco LLC</title>
<cusip>042728AA3</cusip>
<identifiers>
<isin value="US042728AA35"/>
</identifiers>
<balance>115000.000000000000</balance>
<units>PA</units>
<curCd>USD</curCd>
<valUSD>114754.170000000000</valUSD>
<pctVal>0.3967552449</pctVal>
<payoffProfile>Long</payoffProfile>
<assetCat>DBT</assetCat>
<issuerCat>CORP</issuerCat>
<invCountry>US</invCountry>
<isRestrictedSec>N</isRestrictedSec>
<fairValLevel>2</fairValLevel>
<debtSec>
<maturityDt>2024-03-15</maturityDt>
<couponKind>Fixed</couponKind>
<annualizedRt>9.500000000000</annualizedRt>
<isDefault>N</isDefault>
<areIntrstPmntsInArrs>N</areIntrstPmntsInArrs>
<isPaidKind>N</isPaidKind>
</debtSec>
<securityLending>
<isCashCollateral>N</isCashCollateral>
<isNonCashCollateral>N</isNonCashCollateral>
<isLoanByFund>N</isLoanByFund>
</securityLending>
</invstOrSec>
Arrow Bidco LLC is a bond within the portfolio, and some of its characteristics are included in the filing (CUSIP, CIK, balance, maturity date, etc.). I am looking for the best way to iterate through each individual security (invstOrSec) and collect the characteristics of each security in a dataframe.
The code I am currently using is:
import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup
header = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36", "X-Requested-With": "XMLHttpRequest"}
n_port_file = requests.get("https://www.sec.gov/Archives/edgar/data/1618627/000114554923004968/primary_doc.xml", headers=header, verify=False)
n_port_file_xml = n_port_file.content
soup = BeautifulSoup(n_port_file_xml,'xml')
names = soup.find_all('name')
lei = soup.find_all('lei')
title = soup.find_all('title')
cusip = soup.find_all('cusip')
....
maturityDt = soup.find_all('maturityDt')
couponKind = soup.find_all('couponKind')
annualizedRt = soup.find_all('annualizedRt')
I then iterate through each list to create a dataframe based on the values in each row.
fixed_income_data = []
for i in range(len(names)):
    # one row per security; raises IndexError as soon as any list is shorter than names
    row = [names[i].get_text(), lei[i].get_text(),
           title[i].get_text(), cusip[i].get_text(),
           balance[i].get_text(), units[i].get_text(),
           pctVal[i].get_text(), payoffProfile[i].get_text(),
           assetCat[i].get_text(), issuerCat[i].get_text(),
           invCountry[i].get_text(), maturityDt[i].get_text(),
           couponKind[i].get_text(), annualizedRt[i].get_text()]
    fixed_income_data.append(row)
# dtype=float is omitted: most columns hold text and cannot be cast to float
fixed_income_df = pd.DataFrame(fixed_income_data, columns=['name',
                                                           'lei',
                                                           'title',
                                                           'cusip',
                                                           'balance',
                                                           'units',
                                                           'pctVal',
                                                           'payoffProfile',
                                                           'assetCat',
                                                           'issuerCat',
                                                           'invCountry',
                                                           'maturityDt',
                                                           'couponKind',
                                                           'annualizedRt'])
This works fine when all pieces of information are included, but often one variable is not accounted for. A piece of the form might be blank, or an issuer category might not have been filled out correctly, leading to an IndexError. This portfolio has 127 securities that I was able to parse, but a single security might be missing its annualized rate (annualizedRt), which makes the lists different lengths and prevents me from neatly creating a dataframe.
Additionally, for portfolios that hold both fixed income and equity securities, the equity securities do not return information for the debtSec child. Is there a way to iterate through this data while simultaneously cleaning it in the easiest way possible? Even adding "NaN" for the debtSec children that equity securities don't have would be a valid response. Any help would be much appreciated!
[1]: https://www.sec.gov/Archives/edgar/data/1618627/000114554923004968/primary_doc.xml

Here is the best way, in my opinion, to handle the problem. Generally speaking, EDGAR filings are notoriously difficult to parse, so the following may or may not work on other filings, even from the same filer.
To make it easier on yourself, since this is an XML file, you should use an XML parser and XPath. Given that you're looking to create a dataframe, the most appropriate tool would be the pandas read_xml() method.
Because the XML is nested, you will need to create two different dataframes and concatenate them (maybe others will have a better idea on how to approach it). And, finally, although read_xml() can read directly from a URL, EDGAR requires a user-agent, so you also need the requests library.
So, all together:
# import required libraries
from io import StringIO
import pandas as pd
import requests
url = 'https://www.sec.gov/Archives/edgar/data/1618627/000114554923004968/primary_doc.xml'
# set headers with a user-agent
headers = {"User-agent":"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36"}
req = requests.get(url, headers=headers)
# define the columns you want to drop (based on the data in your question)
to_drop = ['identifiers', 'curCd','valUSD','isRestrictedSec','fairValLevel','debtSec','securityLending']
# the filing uses namespaces (too complicated to get into here), so you need to define that as well
namespaces = {"nport": "http://www.sec.gov/edgar/nport"}
# create the first df, for the securities which are debt instruments
# (StringIO is used because passing a literal XML string to read_xml is deprecated since pandas 2.1)
invest = pd.read_xml(StringIO(req.text),xpath="//nport:invstOrSec[.//nport:debtSec]",namespaces=namespaces).drop(to_drop, axis=1)
# create the 2nd df, for the debt details (first three columns: maturityDt, couponKind, annualizedRt):
debt = pd.read_xml(StringIO(req.text),xpath="//nport:debtSec",namespaces=namespaces).iloc[:,0:3]
# finally, concatenate the two into one df:
pd.concat([invest, debt], axis=1)
This should output your 126 debt securities (pardon the formatting):
   name                   lei                   title                  cusip      balance    units  pctVal    payoffProfile  assetCat  issuerCat  invCountry  maturityDt  couponKind  annualizedRt
0  ARROW BIDCO LLC        549300YHZN08M0H3O128  Arrow Bidco LLC        042728AA3  115000.00  PA     0.396755  Long           DBT       CORP       US          2024-03-15  Fixed       9.50000
1  CD&R SMOKEY BUYER INC  NaN                   CD&R Smokey Buyer Inc  12510CAA9  165000.00  PA     0.505585  Long           DBT       CORP       US          2025-07-15  Fixed       6.75000
You can then play with the final df, add or drop columns, etc.
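For portfolios that mix debt and equity, one alternative is to walk each invstOrSec element and look up every field individually, so a missing child simply becomes an empty value instead of shifting the rows. The following is only a minimal sketch under stated assumptions: it uses the standard library's xml.etree.ElementTree rather than read_xml(), the inline SAMPLE document is a trimmed, made-up stand-in for a real filing (the second security is hypothetical), and the helper name holdings_to_frame is invented for illustration.

```python
import xml.etree.ElementTree as ET

import pandas as pd

# Namespace taken from the filing's root element
NS = {"n": "http://www.sec.gov/edgar/nport"}

# Trimmed, made-up sample: one debt security and one hypothetical equity security
SAMPLE = """<edgarSubmission xmlns="http://www.sec.gov/edgar/nport">
  <formData><invstOrSecs>
    <invstOrSec>
      <name>ARROW BIDCO LLC</name>
      <cusip>042728AA3</cusip>
      <assetCat>DBT</assetCat>
      <debtSec>
        <maturityDt>2024-03-15</maturityDt>
        <couponKind>Fixed</couponKind>
        <annualizedRt>9.5</annualizedRt>
      </debtSec>
    </invstOrSec>
    <invstOrSec>
      <name>EXAMPLE EQUITY CO</name>
      <cusip>000000000</cusip>
      <assetCat>EC</assetCat>
    </invstOrSec>
  </invstOrSecs></formData>
</edgarSubmission>"""

TOP_LEVEL = ["name", "cusip", "assetCat", "issuerCat"]
DEBT = ["maturityDt", "couponKind", "annualizedRt"]


def holdings_to_frame(xml_text):
    """Build one row per invstOrSec; fields that are absent become None."""
    root = ET.fromstring(xml_text)
    rows = []
    for sec in root.iterfind(".//n:invstOrSec", NS):
        # findtext returns the default when the child element does not exist
        row = {f: sec.findtext(f"n:{f}", default=None, namespaces=NS) for f in TOP_LEVEL}
        row.update(
            {f: sec.findtext(f"n:debtSec/n:{f}", default=None, namespaces=NS) for f in DEBT}
        )
        rows.append(row)
    return pd.DataFrame(rows, columns=TOP_LEVEL + DEBT)


df = holdings_to_frame(SAMPLE)
print(df)
```

Because every row is built from the same list of field names, the equity row ends up with empty debt columns and the dataframe keeps a fixed shape. For a real filing you would pass req.text (or req.content) instead of SAMPLE and extend the two field lists.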

Related

Why can't Pandas Webscraping print out certain tables from this website?

Is there a way to also print out the second table on this website? (the one that starts with CK Hutchison Holdings Ltd. )
Website link: https://www.hkex.com.hk/Products/Listed-Derivatives/Single-Stock/Stock-Options?sc_lang=en
This was my code:
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
headers = {
'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.79 Safari/537.36'
}
url = 'https://www.hkex.com.hk/Products/Listed-Derivatives/Single-Stock/Stock-Options?sc_lang=en'
r = requests.get(url, headers=headers)
soup = bs(r.text, 'html.parser')
spec_table = soup.select('table[class="table migrate"]')[1]
df = pd.read_html(str(spec_table))[0]
print(df[:5].to_markdown())
Whenever I tried to do spec_table = soup.select('table[class="table migrate"]')[0], it would print the first table (the one that starts with Sun Hung Kai Properties Limited) and whenever I tried to do spec_table = soup.select('table[class="table migrate"]')[1], it would skip the table in between (the one that starts with CK Hutchison Holdings Ltd. ) and print out the contract summary table.
Could anyone explain how I can print out the second table?

This is one way to isolate and extract that second table from page:
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
headers = {
'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.79 Safari/537.36'
}
url = 'https://www.hkex.com.hk/Products/Listed-Derivatives/Single-Stock/Stock-Options?sc_lang=en'
r = requests.get(url, headers=headers)
soup = bs(r.text, 'html.parser')
second_table = soup.select('table:-soup-contains("CK Hutchison Holdings Ltd.")')[0]
df = pd.read_html(str(second_table))[0]
print(df)
Result:
     No.  SEHK Code  Underlying Stock Name                             HKATS Code  Contract Size (shares)  Tier No.*  Position Limit ## (Effective from 1 April 2022)  Approved by FSC Taiwan+
0    1    1          CK Hutchison Holdings Ltd.                        CKH         500                     1          50000                                            ✓
1    2    2          CLP Holdings Limited                              CLP         500                     1          50000                                            NaN
2    3    3          The Hong Kong and China Gas Company Limited       HKG         1000                    2          150000                                           NaN
3    4    4          The Wharf (Holdings) Limited                      WHL         1000                    1          50000                                            NaN
4    5    5          HSBC Holdings Plc.                                HKB         400                     2          150000                                           ✓
...  ...  ...        ...                                               ...         ...                     ...        ...                                              ...
63   64   3323       China National Building Material Company Limited  NBM         2000                    2          100000                                           ✓
64   65   3328       Bank of Communications Co., Ltd.                  BCM         1000                    3          150000                                           ✓
65   66   3968       China Merchants Bank Co., Ltd.                    CMB         500                     1          150000                                           ✓
66   67   3988       Bank of China Limited                             XBC         1000                    3          150000                                           ✓
67   68   6862       Haidilao International Holding Ltd.               HDO         1000                    1          100000                                           NaN
For documentation, refer to my response to your previous question. Also make sure your bs4 package is up to date (the :-soup-contains selector needs a reasonably recent Soup Sieve): run pip install -U bs4.
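To see the text-based selection in isolation, here is a small offline sketch. The inline HTML is a made-up stand-in for the real page (only the company names come from the output above), and it assumes a Soup Sieve version that supports :-soup-contains.

```python
from bs4 import BeautifulSoup

# Made-up stand-in for the page: two tables with no distinguishing attributes
HTML = """
<table><tr><td>Sun Hung Kai Properties Limited</td></tr></table>
<table>
  <tr><td>CK Hutchison Holdings Ltd.</td></tr>
  <tr><td>CLP Holdings Limited</td></tr>
</table>
"""

soup = BeautifulSoup(HTML, "html.parser")

# Pick the table by a piece of text it contains rather than by position or class
table = soup.select_one('table:-soup-contains("CK Hutchison Holdings Ltd.")')
names = [td.get_text(strip=True) for td in table.select("td")]
print(names)
```

Selecting by content keeps working if tables are added or reordered, as long as the marker text stays on the page.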

Apparently this is due to an HTML error in the webpage. The page was clearly written by several people: if you check the HTML, the tables (which all look the same) are each coded differently.
Your problem is that this table has class="table migrate;" instead of class="table migrate" like all the other ones (the difference being the ;).
To get your table, simply use this "other" class value:
spec_table = soup.select('table[class="table migrate;"]')[0]
and you will have the table you are missing.
As a general tip: if you have a problem with web scraping, search the HTML source directly (you can look for a string you see on the webpage to find the exact point in the HTML you are interested in, or just inspect it with the browser tools).
Alternatively, if you want more elegant and general code (for example, if you need to iterate over all the tables), you can do this:
start = soup.find('p', {'class': 'spanHeading'})
spec_table = start.find_all_next('table')
And then do what you wanted to do before:
df = pd.read_html(str(spec_table[1]))[0]
print(df[:5].to_markdown())
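If you would rather keep a class-based selector, one option is to match on the beginning of the class attribute so that the stray ; no longer matters. This is a rough sketch on made-up HTML that mimics the inconsistent class values described above, not the real page.

```python
from bs4 import BeautifulSoup

# Made-up stand-in for the page: the middle table carries the stray ";" in its class
HTML = """
<table class="table migrate"><tr><td>Sun Hung Kai Properties Limited</td></tr></table>
<table class="table migrate;"><tr><td>CK Hutchison Holdings Ltd.</td></tr></table>
<table class="table migrate"><tr><td>Contract summary</td></tr></table>
"""

soup = BeautifulSoup(HTML, "html.parser")

# The exact match skips the table whose class value has the extra ";"
exact = soup.select('table[class="table migrate"]')

# The prefix match (^=) accepts any class value that starts with "table migrate"
prefix = soup.select('table[class^="table migrate"]')

first_cells = [t.td.get_text(strip=True) for t in prefix]
print(len(exact), len(prefix), first_cells)
```

The prefix selector returns the tables in document order, so index 1 is the one the exact match skipped.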

How to web-scrape data that may move indexes in the future

I am trying to web scrape NFL standings data and am interested in the categories "PCT" and "Net Pts" from the table from this url. https://www.nfl.com/standings/league/2021/REG
I have set up BeautifulSoup and printed all the 'td' elements on this page. The problem is that they come back ordered from the team with the worst record to the team with the best. Obviously this will cause problems in the future if I have identified a specific index as, for example, the Lions' PCT, because when their record changes that data will sit at a different index. In fact, the order of the teams on the website will change every week as more games are played.
Is there any way to say anything like if the name of the team is X do something? Like use the table data 4 indexes lower? I haven't seen how to deal with this problem on any youtube tutorial or book so I am wondering what the thought process is. I need a way to identify each team and their PCT and Net points instantaneously as this info will be put into another function.
Here is what I have so far for example:
When you do something like this...
import requests
from bs4 import BeautifulSoup
url = 'https://www.nfl.com/standings/league/2021/REG'
res = requests.get(url)
soup = BeautifulSoup(res.text, 'lxml')
data = soup.find_all('td')[0:10]
print(data)
#I am using just the first 10 indexes to keep it short here
...you get the table data info for the Detroit Lions as they are the worst team in the league at the time of posting this question. I have identified that their "PCT" data point would be
win_pct = soup.find_all('td')[4]
print(float(win_pct.text.strip()))
However, if another team becomes the worst team in the league this index would belong to them and not the Lions. How would I work around this? Thanks

You can use a dictionary to store the data about the clubs and then use the club name as the key to get the data (independent of the club's position). For example:
import requests
from bs4 import BeautifulSoup
url = "https://www.nfl.com/standings/league/2021/REG"
res = requests.get(url)
soup = BeautifulSoup(res.text, "lxml")
data = {}
# every table row that contains data cells is one club
for row in soup.select("tr:has(td)"):
    # skip the first cell (club name/logo) and keep the stats
    cells = [td.get_text(strip=True) for td in row.select("td")[1:]]
    club_name = row.select_one(".d3-o-club-fullname").get_text(strip=True)
    data[club_name] = cells
# print PCT/Net Pts of Detroit Lions:
print(data["Detroit Lions"][3], data["Detroit Lions"][6])
Prints:
0.000 -63
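A possible refinement is to key the stats by column header as well, so that neither the row order nor the column order matters. The sketch below runs on a tiny made-up table: the header labels, the shortened column set and the second team are assumptions for illustration, and only the Lions' PCT and Net Pts values come from the output above.

```python
from bs4 import BeautifulSoup

# Made-up, shortened stand-in for the standings table
HTML = """
<table>
  <tr><th>NFL Team</th><th>W</th><th>L</th><th>T</th><th>PCT</th><th>Net Pts</th></tr>
  <tr><td><div class="d3-o-club-fullname">Detroit Lions</div></td>
      <td>0</td><td>4</td><td>0</td><td>0.000</td><td>-63</td></tr>
  <tr><td><div class="d3-o-club-fullname">Example Team</div></td>
      <td>4</td><td>0</td><td>0</td><td>1.000</td><td>55</td></tr>
</table>
"""

soup = BeautifulSoup(HTML, "html.parser")

# Column names come from the header row, so positions never need to be hard-coded
headers = [th.get_text(strip=True) for th in soup.select("tr th")]

standings = {}
for row in soup.select("tr:has(td)"):
    cells = [td.get_text(strip=True) for td in row.select("td")]
    record = dict(zip(headers, cells))
    standings[record["NFL Team"]] = record

lions = standings["Detroit Lions"]
pct, net_pts = float(lions["PCT"]), int(lions["Net Pts"])
print(pct, net_pts)
```

With this layout another function can ask for standings[team]["PCT"] directly, whichever week the page was scraped.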

any way to download the data with custom queries from url in python?

I want to download data from the USDA site with custom queries. Instead of manually selecting the query options on the website, I would like to do this more conveniently in Python. I used requests to access the URL and read the content, but it is not clear to me how to pass the query parameters, make a selection, and download the data as CSV. Does anyone know an easy way to do this in Python? Is there a workaround for downloading the data from the URL with specific queries?
This is my current attempt, using the URL from which I want to select data with custom queries:
import io
import requests
import pandas as pd
url="https://www.marketnews.usda.gov/mnp/ls-report-retail?&repType=summary&portal=ls&category=Retail&species=BEEF&startIndex=1"
s=requests.get(url).content
df=pd.read_csv(io.StringIO(s.decode('utf-8')))
So before reading the response into pandas, I need to pass the following query values to select the correct data:
Category = "Retail"
Report Type = "Item"
Species = "Beef"
Region(s) = "National"
Start Dates = "2020-01-01"
End Date = "2021-02-08"
It is not clear to me how to pass these query values with the request and then download the filtered data as CSV. Is there an efficient way of doing this in Python? Any thoughts? Thanks

A few details
The simplest format is text rather than HTML. I got the URL for the text download from the HTML page.
The params= argument of requests.get() takes a dict. Build it up and pass it; there is no need to construct the complete URL string yourself.
The text is clearly space delimited, with a minimum of two spaces between columns.
import io
import requests
import pandas as pd
url="https://www.marketnews.usda.gov/mnp/ls-report-retail"
p = {"repType":"summary","species":"BEEF","portal":"ls","category":"Retail","format":"text"}
r = requests.get(url, params=p)
# raw string so that \s is passed to the regex engine instead of being read as a string escape
df = pd.read_csv(io.StringIO(r.text), sep=r"\s\s+", engine="python")
   Date        Region         Feature Rate  Outlets  Special Rate  Activity Index
0  02/05/2021  NATIONAL       69.40%        29,200   20.10%        81,650
1  02/05/2021  NORTHEAST      75.00%        5,500    3.80%         17,520
2  02/05/2021  SOUTHEAST      70.10%        7,400    28.00%        23,980
3  02/05/2021  MIDWEST        75.10%        6,100    19.90%        17,430
4  02/05/2021  SOUTH CENTRAL  57.90%        4,900    26.40%        9,720
5  02/05/2021  NORTHWEST      77.50%        1,300    2.50%         3,150
6  02/05/2021  SOUTHWEST      63.20%        3,800    27.50%        9,360
7  02/05/2021  ALASKA         87.00%        200      .00%          290
8  02/05/2021  HAWAII         46.70%        100      .00%          230
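The reason for splitting on two or more spaces is that some values, such as SOUTH CENTRAL, contain a single space themselves. The following offline sketch shows the idea on a few lines of text laid out like the report; the exact layout of the real download is an assumption, and the numeric clean-up at the end is an optional extra that is not part of the answer above.

```python
import io

import pandas as pd

# A few lines laid out like the text report (layout assumed, values from the output above)
SAMPLE_TEXT = "\n".join([
    "Date        Region         Feature Rate  Outlets",
    "02/05/2021  NATIONAL       69.40%        29,200",
    "02/05/2021  SOUTH CENTRAL  57.90%        4,900",
])

# Two or more whitespace characters separate columns; single spaces stay inside a value
df = pd.read_csv(io.StringIO(SAMPLE_TEXT), sep=r"\s{2,}", engine="python")

# Optional clean-up: turn "29,200" and "69.40%" into numbers
df["Outlets"] = df["Outlets"].str.replace(",", "").astype(int)
df["Feature Rate"] = df["Feature Rate"].str.rstrip("%").astype(float)
print(df)
```

A regex separator requires engine="python", which is why the answer sets it explicitly.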

Just format the query data into the URL; it's effectively a REST API.
To add more query data, as @mullinscr said, you can change the values on the left of the page and press submit, then read the query parameter names from the URL (for example, the start date is called repDate).
If you hover over the Download as XML link, you will also discover that you can specify the download format using format=<format_name>. Parsing the tabular data as XML with pandas might be easier, so I would append format=xml at the end as well.
category = "Retail"
report_type = "Item"
species = "BEEF"
regions = "NATIONAL"
start_date = "01-01-2019"
end_date = "01-01-2021"
# the website changes "-" to "%2F"
start_date = start_date.replace("-", "%2F")
end_date = end_date.replace("-", "%2F")
url = f"https://www.marketnews.usda.gov/mnp/ls-report-retail?runReport=true&portal=ls&startIndex=1&category={category}&repType={report_type}&species={species}&region={regions}&repDate={start_date}&endDate={end_date}&compareLy=No&format=xml"
# parse with pandas, etc...
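As a variation on the manual "%2F" replacement, the standard library can do the percent-encoding. This is a small sketch rather than a tested client for the site: the parameter names are the ones used in the URL above, build_report_url is a made-up helper name, and the dates are written with slashes on the assumption that the site expects MM/DD/YYYY.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.marketnews.usda.gov/mnp/ls-report-retail"


def build_report_url(category, report_type, species, region, start_date, end_date, fmt="xml"):
    """Assemble the report URL; urlencode percent-encodes "/" as %2F."""
    params = {
        "runReport": "true",
        "portal": "ls",
        "startIndex": 1,
        "category": category,
        "repType": report_type,
        "species": species,
        "region": region,
        "repDate": start_date,
        "endDate": end_date,
        "compareLy": "No",
        "format": fmt,
    }
    return f"{BASE_URL}?{urlencode(params)}"


url = build_report_url("Retail", "Item", "BEEF", "NATIONAL", "01/01/2019", "01/01/2021")
print(url)
```

Passing the same dictionary as params= to requests.get() would have the same effect without building the string at all.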

How to scrape data off morningstar

So I'm new to the world of web scraping, and so far I've only really been using BeautifulSoup to scrape text and images off websites. I thought I'd try to scrape some data points off a graph to test my understanding, but this graph confused me.
After inspecting the element of the piece of data I wanted to extract, I saw this:
<span id="TSMAIN">: 100.7490637</span>
The problem is, my original idea for scraping the data points would be to have iterated through some sort of id list containing all the different data points (if that makes sense?).
Instead, it seems that all the data points are contained within this same element, and the value depends on where your cursor is on the graph.
My problem is that if I use BeautifulSoup's find function with that specific element and its attribute id="TSMAIN", I get None back, because I am guessing that unless I have my cursor on the actual graph, nothing will show up there.
Code:
from bs4 import BeautifulSoup
import requests
headers={"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36"}
url = "https://www.morningstar.co.uk/uk/funds/snapshot/snapshot.aspx?id=F0GBR050AQ&tab=13"
source=requests.get(url,headers=headers)
soup = BeautifulSoup(source.content,'lxml')
data = soup.find("span",attrs={"id":"TSMAIN"})
print(data)
Output
None
How can I extract all the data points of this graph?

It seems the data can be pulled from an API. The only catch is that the values it returns are relative to the start date entered in the payload: it sets the output for the start date to 0, and the numbers after that are relative to that date.
import requests
import pandas as pd
from datetime import datetime
from dateutil import relativedelta
userInput = input('Choose:\n\t1. 3 Month\n\t2. 6 Month\n\t3. 1 Year\n\t4. 3 Year\n\t5. 5 Year\n\t6. 10 Year\n\n -->: ')
userDict = {'1':3,'2':6,'3':12,'4':36,'5':60,'6':120}
n = datetime.now()
n = n - relativedelta.relativedelta(days=1)
n = n - relativedelta.relativedelta(months=userDict[userInput])
dateStr = n.strftime('%Y-%m-%d')
url = 'https://tools.morningstar.co.uk/api/rest.svc/timeseries_cumulativereturn/t92wz0sj7c'
data = []
idDict = {
    'Schroder Managed Balanced Instl Acc':'F0GBR050AQ]2]0]FOGBR$$ALL',
    'GBP Moderately Adventurous Allocation':'EUCA000916]8]0]CAALL$$ALL',
    'Mixed Investment 40-85% Shares':'LC00000012]8]0]CAALL$$ALL',
    '':'F00000ZOR1]7]0]IXALL$$ALL',}
for k, v in idDict.items():
    payload = {
        'encyId': 'GBP',
        'idtype': 'Morningstar',
        'frequency': 'daily',
        'startDate': dateStr,
        'performanceType': '',
        'outputType': 'COMPACTJSON',
        'id': v,
        'decPlaces': '8',
        'applyTrackRecordExtension': 'false'}
    temp_data = requests.get(url, params=payload).json()
    df = pd.DataFrame(temp_data)
    # column 0 holds epoch milliseconds, column 1 the cumulative return
    df['timestamp'] = pd.to_datetime(df[0], unit='ms')
    df['date'] = df['timestamp'].dt.date
    df = df[['date',1]]
    df.columns = ['date', k]
    data.append(df)
final_df = pd.concat(
    (iDF.set_index('date') for iDF in data),
    axis=1, join='inner'
).reset_index()
final_df.plot(x="date", y=list(idDict.keys()), kind="line")
Output:
print (final_df.head(5).to_string())
date Schroder Managed Balanced Instl Acc GBP Moderately Adventurous Allocation Mixed Investment 40-85% Shares
0 2019-12-22 0.000000 0.000000 0.000000 0.000000
1 2019-12-23 0.357143 0.406784 0.431372 0.694508
2 2019-12-24 0.714286 0.616217 0.632422 0.667586
3 2019-12-25 0.714286 0.616217 0.632422 0.655917
4 2019-12-26 0.714286 0.612474 0.629152 0.664124
....
Getting those ids took a little investigating of the network requests. Searching through them, I was able to find the corresponding id values, and a little trial and error showed which values meant what.
Those are the "alternate" ids used above. The line graphs get their data from those 4 requests: look at the Preview pane for each one and you'll see the data in there.
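Because every series starts at 0 on the requested start date, comparing from a later date means rebasing the numbers. The sketch below works offline on made-up payloads shaped like the [epoch milliseconds, value] pairs the code above reads. It assumes the values are cumulative percentage returns, which the API output suggests but does not confirm, and to_series and rebase are made-up helper names.

```python
from datetime import date

import pandas as pd

# Made-up payloads shaped like [[epoch_ms, cumulative_return], ...]
fund = [[1576972800000, 0.0], [1577059200000, 0.357143], [1577145600000, 0.714286]]
benchmark = [[1576972800000, 0.0], [1577059200000, 0.406784], [1577145600000, 0.616217]]


def to_series(points, name):
    """Convert one payload into a Series indexed by calendar date."""
    df = pd.DataFrame(points, columns=["timestamp", name])
    df["date"] = pd.to_datetime(df["timestamp"], unit="ms").dt.date
    return df.set_index("date")[name]


# Inner join keeps only the dates present in every series
combined = pd.concat(
    [to_series(fund, "fund"), to_series(benchmark, "benchmark")], axis=1, join="inner"
)


def rebase(series, new_start):
    """Re-express cumulative % returns so that new_start becomes the 0 point."""
    growth = 1 + series / 100
    return (growth / growth.loc[new_start] - 1) * 100


rebased = rebase(combined["fund"], date(2019, 12, 23))
print(combined)
print(rebased)
```

If the values turn out not to be percentages, the rebase formula would need to change, while the timestamp conversion and the join stay the same.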

Trying to Structure BeautifulSoup to Flexibly Scrape Company Annual Reports

I am trying to use the United States Securities and Exchange Commission (SEC) database to look at company annual reports (known as 10-Ks) and pull out a list of the executive committee members for each filing. I am currently using the most recent filings for Microsoft (stock ticker: MSFT) and Walmart (stock ticker: WMT). I know I can look up this information elsewhere on finance websites, but I am trying to build a flexible database for personal use. My issues:
The table index position is different in each report. On one company's report the table I want may be table 38 and on another it may be table 45, so a static index/position count will not work across multiple filings.
The specific attributes in each HTML table tag change, so I cannot search for a common attribute. In some cases I find common attributes and sometimes I do not.
I am starting to think I may not be able to automate this due to the lack of identifiers that are unique within each file and common across all files. I've banged my head against many Python web scraping tutorials and videos over the last few weeks. Any suggestions are appreciated. Full automation would be ideal so that I can loop through multiple filings, but partial solutions help too; I'm here to learn. I might be trying to automate something that is too diverse.
Microsoft Link:
https://www.sec.gov/Archives/edgar/data/789019/000156459019027952/msft-10k_20190630.htm
Desired Table:
<table border="0" cellspacing="0" cellpadding="0" align="center" style="border-collapse:collapse; width:100%;">
Walmart Link:
https://www.sec.gov/Archives/edgar/data/104169/000010416919000016/wmtform10-kx1312019.htm
Desired Table:
<table cellpadding="0" cellspacing="0" style="font-family:Times New Roman;font-size:10pt;width:100%;border-collapse:collapse;text-align:left;">
Code to Count Number of Tables in Each Page:
from selenium import webdriver
from bs4 import BeautifulSoup
chrome_path = r"C:\webdrivers\chromedriver.exe"
browser = webdriver.Chrome(chrome_path)
#Microsoft
browser.get("https://www.sec.gov/Archives/edgar/data/789019/000156459019027952/msft-10k_20190630.htm")
msft = browser.page_source
page_msft = BeautifulSoup(msft, 'html.parser')
tables_msft = page_msft.find_all("table")
#Walmart
browser.get("https://www.sec.gov/Archives/edgar/data/104169/000010416919000016/wmtform10-kx1312019.htm")
wmt = browser.page_source
page_wmt = BeautifulSoup(wmt, 'html.parser')
tables_wmt = page_wmt.find_all("table")
print("MSFT Result Table Count: " + str(len(tables_msft)))
print("Walmart Result Table Count: " + str(len(tables_wmt)))
Results:
MSFT Result Table Count: 263
Walmart Result Table Count: 258
Process finished with exit code 0

First, you don't need Selenium; the requests library will be faster and avoids the overhead. I was able to partially figure out a way to extract the required data, but since the number of columns differs between Microsoft and Walmart, the two results cannot be combined.
The code below generates the two required dataframes, one for Microsoft and one for Walmart.
You still need to clean up the column names. The idea is to locate the table that contains a td with the value 'Age', since that cell is unique to the desired table. Let me know if you need any clarification:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import numpy as np
#Microsoft
page = requests.get("https://www.sec.gov/Archives/edgar/data/789019/000156459019027952/msft-10k_20190630.htm")
soup = BeautifulSoup(page.text, 'html')
resmsft = []
# find the cell whose text is "Age" and climb up to its enclosing table
tables_msft = soup.find(text="Age").find_parent("table")
for row in tables_msft.find_all("tr")[1:]:
    # print([cell.get_text(strip=True) for cell in row.find_all("td")])
    if row:
        resmsft.append([cell.get_text(strip=True) for cell in row.find_all("td")])
# drop rows in which every cell is empty
non_empty = [sublist for sublist in resmsft if any(sublist)]
df_msft = pd.DataFrame.from_records(non_empty)
df_msft[df_msft==''] = np.nan
df_msft=df_msft.dropna(axis=1,how='all')
#Walmart
page = requests.get("https://www.sec.gov/Archives/edgar/data/104169/000010416919000016/wmtform10-kx1312019.htm")
soup = BeautifulSoup(page.text, 'html')
#page_wmt = BeautifulSoup(soup, 'html.parser')
tables_wmt = soup.find(text="Age").find_parent("table")
reswmt = []
for row in tables_wmt.find_all("tr")[1:]:
    # print([cell.get_text(strip=True) for cell in row.find_all("td")])
    if row:
        reswmt.append([cell.get_text(strip=True) for cell in row.find_all("td")])
non_empty_wmt = [sublist for sublist in reswmt if any(sublist)]
df_wmt = pd.DataFrame.from_records(non_empty_wmt)
df_wmt[df_wmt==''] = np.nan
df_wmt=df_wmt.dropna(axis=1,how='all')
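Since the two blocks above differ only in the URL, the lookup can be wrapped in one function and reused for each filing. The sketch below illustrates that on made-up HTML, so it runs without contacting EDGAR: the function name table_rows_by_marker and the sample rows are invented for illustration, and real filings will likely need extra handling (for example, "Age" appearing in more than one table).

```python
from bs4 import BeautifulSoup

# Made-up stand-in for a filing: an unrelated table followed by the officers table
HTML = """
<table><tr><td>Revenue</td><td>100</td></tr></table>
<table style="width:100%;">
  <tr><td>Name</td><td></td><td>Age</td><td>Position</td></tr>
  <tr><td>Jane Doe</td><td></td><td>50</td><td>Chief Executive Officer</td></tr>
  <tr><td></td><td></td><td></td><td></td></tr>
</table>
"""


def table_rows_by_marker(html, marker):
    """Return the non-empty rows of the first table that has a cell equal to marker."""
    soup = BeautifulSoup(html, "html.parser")
    cell = soup.find("td", string=marker)
    if cell is None:
        return []  # marker not found in this document
    table = cell.find_parent("table")
    rows = []
    for tr in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        cells = [c for c in cells if c]  # drop empty spacer cells
        if cells:
            rows.append(cells)
    return rows


rows = table_rows_by_marker(HTML, "Age")
print(rows)
```

Dropping the empty spacer cells keeps this example tidy, but in a real filing it could misalign rows whose cells are genuinely blank, so the dropna approach in the answer above may be the safer choice there.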
