I'm scraping a table from a website using Python. My code looks like this:
import requests
from bs4 import BeautifulSoup
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'}
print('Scraping NJ Dept of Banking...')
print()
NJurl = 'https://www.state.nj.us/dobi/bnk_depositories/appwebinfo.htm'
NJr = requests.get(NJurl, headers = headers)
NJsoup = BeautifulSoup(NJr.text, 'html.parser')
NJlist = []
for tr in NJsoup.find_all('td')[12:30]:
    NJresult = tr.text.strip()
    NJlist.append(' '.join(NJresult.split()))
    print(' '.join(NJresult.split()))
if not NJlist:
    NJresult = 'No Transactions'
    print(NJresult)
    NJlist.append(NJresult)
print()
The code above works well; the results look like this:
The Bank of Princeton
Branch
Application to establish a branch office on or about December 31, 2019 at 1839 Chestnut Street Philadelphia, PA (Philadelphia County County) 7/15/2019 Accepted
7/15/2019
Accepted
Two River Community Bank
Branch
Expedited application to establish a branch office at 442 Highway 35 Middletown, NJ (Monmouth County) 7/15/2019 Opened
7/15/2019
Opened
Two River Community Bank
Main Office Relocation
Application to relocate its main office from 1250 Route 35, Middletown, Middlesex County, 07748 to 766 Shrewsbury Avenue Tinton Falls, NJ (Monmouth County) 7/15/2019 Opened
7/15/2019
Opened
The problem is that I would like to exclude certain types of events. For example, I don't need to see the first two events, which involve branches. I've adjusted a portion of my code to try to account for this:
for tr in NJsoup.find_all('td')[12:30]:
    NJresult = tr.text.strip()
    if 'branch' not in NJresult.lower():
        NJlist.append(' '.join(NJresult.split()))
        print(' '.join(NJresult.split()))
The problem with that adjustment is that it only eliminates the individual cells containing the word 'branch', not the entire event. For example, the output with the adjusted code looks like this:
The Bank of Princeton
7/15/2019
Accepted
Two River Community Bank
7/15/2019
Opened
Two River Community Bank
Main Office Relocation
Application to relocate its main office from 1250 Route 35, Middletown, Middlesex County, 07748 to 766 Shrewsbury Avenue Tinton Falls, NJ (Monmouth County) 7/15/2019 Opened
7/15/2019
Opened
Is there a way I can go about skipping any output for a branch transaction so the output looks like this?
Two River Community Bank
Main Office Relocation
Application to relocate its main office from 1250 Route 35, Middletown, Middlesex County, 07748 to 766 Shrewsbury Avenue Tinton Falls, NJ (Monmouth County) 7/15/2019 Opened
7/15/2019
Opened
Although you named your iterator tr, you are not iterating through rows but actually through <td>s. I would iterate through the relevant rows and then decide for each row whether to include it. Something like this:
First, find the rows:
rows = NJsoup.find_all('tr')[6:-2:2]
Then iterate through them:
for row in rows:
    new_record = []
    for td in row.find_all('td'):
        if td.text.strip() == 'Branch':
            break
        new_record.append(td.text.strip())
    else:
        NJlist.append('\n'.join(new_record))
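Putting the two pieces together, here is a minimal end-to-end sketch (the tr slice [6:-2:2] is taken from the snippet above and may need adjusting if the page layout changes):
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'}
NJurl = 'https://www.state.nj.us/dobi/bnk_depositories/appwebinfo.htm'
NJsoup = BeautifulSoup(requests.get(NJurl, headers=headers).text, 'html.parser')

NJlist = []
for row in NJsoup.find_all('tr')[6:-2:2]:  # assumed slice of the transaction rows
    new_record = []
    for td in row.find_all('td'):
        text = ' '.join(td.text.split())
        if text == 'Branch':  # skip the whole event when its type is Branch
            break
        new_record.append(text)
    else:  # runs only if the inner loop did not break
        NJlist.append('\n'.join(new_record))

if not NJlist:
    NJlist.append('No Transactions')
print('\n\n'.join(NJlist))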
I am attempting to scrape individual fund holdings from the SEC's N-PORT-P/A form using Beautiful Soup and XML. A typical submission, outlined below and [linked here][1], looks like:
<edgarSubmission xmlns="http://www.sec.gov/edgar/nport" xmlns:com="http://www.sec.gov/edgar/common" xmlns:ncom="http://www.sec.gov/edgar/nportcommon" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<headerData>
<submissionType>NPORT-P/A</submissionType>
<isConfidential>false</isConfidential>
<accessionNumber>0001145549-23-004025</accessionNumber>
<filerInfo>
<filer>
<issuerCredentials>
<cik>0001618627</cik>
<ccc>XXXXXXXX</ccc>
</issuerCredentials>
</filer>
<seriesClassInfo>
<seriesId>S000048029</seriesId>
<classId>C000151492</classId>
</seriesClassInfo>
</filerInfo>
</headerData>
<formData>
<genInfo>
...
</genInfo>
<fundInfo>
...
</fundInfo>
<invstOrSecs>
<invstOrSec>
<name>ARROW BIDCO LLC</name>
<lei>549300YHZN08M0H3O128</lei>
<title>Arrow Bidco LLC</title>
<cusip>042728AA3</cusip>
<identifiers>
<isin value="US042728AA35"/>
</identifiers>
<balance>115000.000000000000</balance>
<units>PA</units>
<curCd>USD</curCd>
<valUSD>114754.170000000000</valUSD>
<pctVal>0.3967552449</pctVal>
<payoffProfile>Long</payoffProfile>
<assetCat>DBT</assetCat>
<issuerCat>CORP</issuerCat>
<invCountry>US</invCountry>
<isRestrictedSec>N</isRestrictedSec>
<fairValLevel>2</fairValLevel>
<debtSec>
<maturityDt>2024-03-15</maturityDt>
<couponKind>Fixed</couponKind>
<annualizedRt>9.500000000000</annualizedRt>
<isDefault>N</isDefault>
<areIntrstPmntsInArrs>N</areIntrstPmntsInArrs>
<isPaidKind>N</isPaidKind>
</debtSec>
<securityLending>
<isCashCollateral>N</isCashCollateral>
<isNonCashCollateral>N</isNonCashCollateral>
<isLoanByFund>N</isLoanByFund>
</securityLending>
</invstOrSec>
Arrow Bidco LLC is a bond within the portfolio, with some of its characteristics included in the filing (CUSIP, CIK, balance, maturity date, etc.). I am looking for the best way to iterate through each individual security (invstOrSec) and collect the characteristics of each security in a dataframe.
The code I am currently using is:
import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup
header = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36", "X-Requested-With": "XMLHttpRequest"}
n_port_file = requests.get("https://www.sec.gov/Archives/edgar/data/1618627/000114554923004968/primary_doc.xml", headers=header, verify=False)
n_port_file_xml = n_port_file.content
soup = BeautifulSoup(n_port_file_xml,'xml')
names = soup.find_all('name')
lei = soup.find_all('lei')
title = soup.find_all('title')
cusip = soup.find_all('cusip')
....
maturityDt = soup.find_all('maturityDt')
couponKind = soup.find_all('couponKind')
annualizedRt = soup.find_all('annualizedRt')
I then iterate through each list to create a dataframe based on the values in each row:
fixed_income_data = []
for i in range(0, len(names)):
    rows = [names[i].get_text(), lei[i].get_text(),
            title[i].get_text(), cusip[i].get_text(),
            balance[i].get_text(), units[i].get_text(),
            pctVal[i].get_text(), payoffProfile[i].get_text(),
            assetCat[i].get_text(), issuerCat[i].get_text(),
            invCountry[i].get_text(), couponKind[i].get_text()
            ]
    fixed_income_data.append(rows)
fixed_income_df = pd.DataFrame(fixed_income_data,
                               columns=['name', 'lei', 'title', 'cusip',
                                        'balance', 'units', 'pctVal',
                                        'payoffProfile', 'assetCat',
                                        'issuerCat', 'invCountry',
                                        'maturityDt', 'couponKind',
                                        'annualizedRt'],
                               dtype=float)
This works fine when all pieces of information are included, but often there is a variable that is not accounted for. A piece of the form might be blank, or an issuer category might have been filled out incorrectly, leading to an IndexError. This portfolio has 127 securities that I was able to parse, but it might be missing an annualized return for a single security, throwing off the ability to neatly create a dataframe.
Additionally, for portfolios that hold both fixed income and equity securities, the equity securities do not return information for the debtSecs child. Is there a way to iterate through this data while simultaneously cleaning it in the easiest way possible? Even adding "NaN" for the debtSec children that equity securities don't reference would be a valid response. Any help would be much appreciated!
[1]: https://www.sec.gov/Archives/edgar/data/1618627/000114554923004968/primary_doc.xml
Here is the best way, in my opinion, to handle the problem. Generally speaking, EDGAR filings are notoriously difficult to parse, so the following may or may not work on other filings, even from the same filer.
To make it easier on yourself, since this is an XML file, you should use an xml parser and xpath. Given that you're looking to create a dataframe, the most appropriate tool would be the pandas read_xml() method.
Because the XML is nested, you will need to create two different dataframes and concatenate them (maybe others will have a better idea on how to approach it). And, finally, although read_xml() can read directly from a URL, EDGAR requires a user-agent, meaning you also need to use the requests library.
So, all together:
#import required libraries
import pandas as pd
import requests
url = 'https://www.sec.gov/Archives/edgar/data/1618627/000114554923004968/primary_doc.xml'
#set headers with a user-agent
headers = {"User-agent":"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36"}
req = requests.get(url, headers=headers)
#define the columns you want to drop (based on the data in your question)
to_drop = ['identifiers', 'curCd','valUSD','isRestrictedSec','fairValLevel','debtSec','securityLending']
#the filing uses namespaces (too complicated to get into here), so you need to define that as well
namespaces = {"nport": "http://www.sec.gov/edgar/nport"}
#create the first df, for the securities which are debt instruments
invest = pd.read_xml(req.text,xpath="//nport:invstOrSec[.//nport:debtSec]",namespaces=namespaces).drop(to_drop, axis=1)
#create the 2nd df, for the debt details:
debt = pd.read_xml(req.text,xpath="//nport:debtSec",namespaces=namespaces).iloc[:,0:3]
#finally, concatenate the two into one df:
pd.concat([invest, debt], axis=1)
This should output your 126 debt securities (pardon the formatting):
name lei title cusip balance units pctVal payoffProfile assetCat issuerCat invCountry maturityDt couponKind annualizedRt
0 ARROW BIDCO LLC 549300YHZN08M0H3O128 Arrow Bidco LLC 042728AA3 115000.00 PA 0.396755 Long DBT CORP US 2024-03-15 Fixed 9.50000
1 CD&R SMOKEY BUYER INC NaN CD&R Smokey Buyer Inc 12510CAA9 165000.00 PA 0.505585 Long DBT CORP US 2025-07-15 Fixed 6.75000
You can then play with the final df, add or drop columns, etc.
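As a side note on the IndexError and the equity holdings from the question: another common pattern is to iterate over each invstOrSec node with BeautifulSoup and fall back to NaN whenever a child tag is missing, so debt-only fields simply come out empty for equities. A rough sketch, reusing the soup object and imports from the question (the field list is an assumption and abbreviated):
fields = ['name', 'lei', 'title', 'cusip', 'balance', 'units', 'pctVal',
          'payoffProfile', 'assetCat', 'issuerCat', 'invCountry',
          'maturityDt', 'couponKind', 'annualizedRt']

records = []
for sec in soup.find_all('invstOrSec'):
    row = {}
    for tag in fields:
        el = sec.find(tag)
        row[tag] = el.get_text() if el is not None else np.nan  # NaN when the child is missing
    records.append(row)

all_secs_df = pd.DataFrame(records)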
Is there a way to also print out the second table on this website (the one that starts with CK Hutchison Holdings Ltd.)?
Website link: https://www.hkex.com.hk/Products/Listed-Derivatives/Single-Stock/Stock-Options?sc_lang=en
This was my code:
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
headers = {
'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.79 Safari/537.36'
}
url = 'https://www.hkex.com.hk/Products/Listed-Derivatives/Single-Stock/Stock-Options?sc_lang=en'
r = requests.get(url, headers=headers)
soup = bs(r.text, 'html.parser')
spec_table = soup.select('table[class="table migrate"]')[1]
df = pd.read_html(str(spec_table))[0]
print(df[:5].to_markdown())
Whenever I tried spec_table = soup.select('table[class="table migrate"]')[0], it printed the first table (the one that starts with Sun Hung Kai Properties Limited), and whenever I tried spec_table = soup.select('table[class="table migrate"]')[1], it skipped the table in between (the one that starts with CK Hutchison Holdings Ltd.) and printed the contract summary table instead.
Could anyone explain how I can print out the second table?
This is one way to isolate and extract that second table from the page:
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
headers = {
'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.79 Safari/537.36'
}
url = 'https://www.hkex.com.hk/Products/Listed-Derivatives/Single-Stock/Stock-Options?sc_lang=en'
r = requests.get(url, headers=headers)
soup = bs(r.text, 'html.parser')
second_table = soup.select('table:-soup-contains("CK Hutchison Holdings Ltd.")')[0]
df = pd.read_html(str(second_table))[0]
print(df)
Result:
No. SEHK Code Underlying Stock Name HKATS Code Contract Size (shares) Tier No.* Position Limit ## (Effective from 1 April 2022) Approved by FSC Taiwan+
0 1 1 CK Hutchison Holdings Ltd. CKH 500 1 50000 ✓
1 2 2 CLP Holdings Limited CLP 500 1 50000 NaN
2 3 3 The Hong Kong and China Gas Company Limited HKG 1000 2 150000 NaN
3 4 4 The Wharf (Holdings) Limited WHL 1000 1 50000 NaN
4 5 5 HSBC Holdings Plc. HKB 400 2 150000 ✓
... ... ... ... ... ... ... ... ...
63 64 3323 China National Building Material Company Limited NBM 2000 2 100000 ✓
64 65 3328 Bank of Communications Co., Ltd. BCM 1000 3 150000 ✓
65 66 3968 China Merchants Bank Co., Ltd. CMB 500 1 150000 ✓
66 67 3988 Bank of China Limited XBC 1000 3 150000 ✓
67 68 6862 Haidilao International Holding Ltd. HDO 1000 1 100000 NaN
For documentation, refer to my response to your previous question. Also make sure your bs4 package is up to date: run pip install -U bs4.
Apparently it is due to an HTML error in the webpage: that page was clearly written by several people, and if you check the HTML, the tables (which all look the same) each have different markup.
Anyway, your problem is due to that table being given class="table migrate;" instead of class="table migrate", like all the others (the difference being the trailing ;).
In order to get your table, simply use this "other" class value:
spec_table = soup.select('table[class="table migrate;"]')[0]
and you will have the table you are missing.
As a general tip: if you have problems with web scraping, look directly at the HTML source (you can search for a string you see on the page to find the exact spot you are interested in, or just inspect it with the browser's developer tools).
Alternatively, if you want more elegant and general code (for example, if you need to iterate over all the tables), you can do this:
start = soup.find('p', {'class': 'spanHeading'})
spec_table = start.find_all_next('table')
And then do what you wanted to do before:
df = pd.read_html(str(spec_table[1]))[0]
print(df[:5].to_markdown())
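If you do want to loop over every table that follows that heading, a small sketch building on the spec_table list above:
all_dfs = [pd.read_html(str(t))[0] for t in spec_table]
for i, tab in enumerate(all_dfs):
    print(i, tab.shape)  # index and (rows, columns) of each table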
I am attempting to scrape some data off of this website.
There are 30 entries per page and I'm currently trying to scrape information from within each of the links on each page. Here is my code:
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
driver = webdriver.Chrome()
print(driver.title)
driver.get("https://www.businesslist.com.ng/category/farming")
time.sleep(1)
select = driver.find_element_by_id("listings")
page_entries = [i.find_element_by_tag_name("a").get_attribute("href")
                for i in select.find_elements_by_tag_name("h4")]
columns = {"ESTABLISHMENT YEAR": [], "EMPLOYEES": [], "COMPANY MANAGER": [],
           "VAT REGISTRATION": [], "REGISTRATION CODE": []}
for i in page_entries:
    print(i)
    driver.get(i)
    listify_subentries = [i.text.strip().replace("\n", "") for i in
                          driver.find_elements_by_class_name("info")][:11]
Everything runs fine up to here. The problem is likely in the section below.
for i in listify_subentries:
    for q in columns.keys():
        if q in i:
            item = i.replace(q, "")
            print(item)
            columns[q].append(item)
        else:
            columns[q].append("None given")
            print("None given")
Here's a picture of the layout for one entry. Sorry I can't yet embed images.
I'm trying to scrape some of the information under the "Working Hours" box (i.e. establishment year, company manager etc) from every business's page. You can find the exact information under the columns variable.
Because not all pages have the same amount of information under the "Working Hours" box (here is one with more details underneath it), I tried using dictionaries + text manipulation to look up the available sub-entries and obtain the relevant information to their right. That is to say, obtain the name of the company manager, the year of establishment, and so on; and if a page did not have this, it would simply be tagged as "None given" under the relevant sub-entry.
The idea is to collate all this information and export it to a dataframe later on. Inputting "None given" when a page is lacking a particular sub-entry allows me to preserve the integrity of the data structure so that the entries are sure to align.
However, when I run the code the output I receive is completely off.
Here is the outer view of the columns dictionary once the code has run.
And if I click on the 'COMPANY MANAGER' section, you can see that there are multiple instances of it saying "None given" before it gives the name of the company manager on the page. This is repeated for every other sub-entry, as you'll see if you run the code and scroll down. I'm not sure what went wrong, but it seems that the size of the list has been inflated roughly tenfold, with extra "None given"s littered here and there. The size of each list should be 30, but now it's 330.
I would greatly appreciate any assistance with this. Thank you.
You can use the following example to iterate over all enterprises on that page and save the various info into a dataframe:
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = "https://www.businesslist.com.ng/category/farming"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
all_data = []
for a in soup.select("h4 > a"):
    u = "https://www.businesslist.com.ng" + a["href"]
    print(u)
    data = {"URL": u}
    s = BeautifulSoup(requests.get(u).content, "html.parser")
    for info in s.select("div.info:has(.label)"):
        label = info.select_one(".label")
        label.extract()
        value = info.get_text(strip=True, separator=" ")
        data[label.get_text(strip=True)] = value
    all_data.append(data)
df = pd.DataFrame(all_data)
print(df)
df.to_csv("data.csv", index=None)
Prints:
URL Company name Address Phone Number Mobile phone Working hours Establishment year Employees Company manager Share this listing Location map Description Products & Services Listed in categories Keywords Website Registration code VAT registration E-mail Fax
0 https://www.businesslist.com.ng/company/198846/macmed-integrated-farms Macmed Integrated Farms 1, Gani Street, Ijegun Imore. Satellite Town. Lagos , Lagos , Lagos , Nigeria View Map 08033316905 09092245349 Monday: 8am-5pm Tuesday: 8am-5pm Wednesday: 8am-5pm Thursday: 8am-5pm Friday: 8am-5pm Saturday: 8am-5pm Sunday: 10am-4pm 2013 1-5 Engr. Marcus Awoh Show Map Expand Map 1, Gani Street, Ijegun Imore. Satellite Town. Lagos , Lagos , Lagos , Nigeria Get Directions Macmed Integrated Farms is into Poultry, Fish Farming (eggs, meat,day old chicks,fingerlings and grow-out) and animal Husbandry and sales of Farmlands land and facilities We also provide feasibility studies and business planning for all kind of businesses. Day old chicks WE are receiving large quantity of Day old Pullets, Broilers and cockerel in December 2016.\nInterested buyers are invited. PRICE: N 100 - N 350 Investors/ Partners We Macmed Integrated Farms a subsidiary of Macmed Cafe Limited RC (621444) are into poultry farming situated at Iponsinyi, behind (Nigerian National Petroleum Marketing Company)NNPMC at Mosimi, along... Commercial Hatchery Macmed Integrated Farms is setting up a Hatchery for chicken and other birds. We have 2 nos of fully automatic incubator imported from China with combined capacity of 1,500 eggs per setting.\nPlease book in advance.\nMarcus Awoh.\nfarm Operations Manager. PRICE: N100 - N250 Business Services Business Services / Consultants Business Services / Small Business Business Services / Small Business / Business Plans Business Services / Animal Shelters Manufacturing & Industry Manufacturing & Industry / Farming Manufacturing & Industry / Farming / Poultry Housing Suppliers Catfish Day old chicks Farming FINGERLINGS Fishery grow out and aquaculture Meat Poultry eggs spent Pol MORE +4 NaN NaN NaN NaN NaN
...
And it saves the results to data.csv.
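If you only need the five fields from the original question, a hedged follow-up on the dataframe above (the column names are taken from the printed header, so verify them against your own output):
wanted = ["Establishment year", "Employees", "Company manager",
          "Registration code", "VAT registration"]
subset = df[["URL"] + [c for c in wanted if c in df.columns]]
print(subset)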
I would like the result to be a single list of individual strings, not the current output. Basically, it would be the last list printed, with all the strings together in one list.
Any help would be appreciated
import numpy as np
import requests
from bs4 import BeautifulSoup as bs

headers = {
    "User-Agent": "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Mobile Safari/537.36"
}
headlines = []
pages = np.arange(1, 3)
for page in pages:
    url = 'https://www.marketwatch.com/investing/stock/aapl/moreheadlines?channel=MarketWatch&pageNumber=' + str(page)
    results = requests.get(url, headers=headers)
    soup = bs(results.text, "html.parser")
    contents = soup.find_all("div", class_='article__content')
    for i in contents:
        headline = i.find("h3", class_='article__headline').text.strip()
        headlines.append(headline)
        print(headlines)
Then the output is this:
['The Dow Fell 179 Points Because Fauci Warned About Covid Mutations']
['The Dow Fell 179 Points Because Fauci Warned About Covid Mutations', 'Apple Inc. stock outperforms competitors on strong trading day']
['The Dow Fell 179 Points Because Fauci Warned About Covid Mutations', 'Apple Inc. stock outperforms competitors on strong trading day', 'Apple Stock Nears Another Record. Analysts’ iPhone Outlooks Keep Going Up.']
['The Dow Fell 179 Points Because Fauci Warned About Covid Mutations', 'Apple Inc. stock outperforms competitors on strong trading day', 'Apple Stock Nears Another Record. Analysts’ iPhone Outlooks Keep Going Up.', 'Charting a bull-trend pullback: S&P 500 retests the breakout point']
['The Dow Fell 179 Points Because Fauci Warned About Covid Mutations', 'Apple Inc. stock outperforms competitors on strong trading day', 'Apple Stock Nears Another Record. Analysts’ iPhone Outlooks Keep Going Up.', 'Charting a bull-trend pullback: S&P 500 retests the breakout point', 'Facebook and Amazon set records in annual spending on Washington lobbying']
['The Dow Fell 179 Points Because Fauci Warned About Covid Mutations', 'Apple Inc. stock outperforms competitors on strong trading day', 'Apple Stock Nears Another Record. Analysts’ iPhone Outlooks Keep Going Up.', 'Charting a bull-trend pullback: S&P 500 retests the breakout point', 'Facebook and Amazon set records in annual spending on Washington lobbying', 'The Dow Fell 12 Points Because Intel and Apple Stock Softened the Blow']
['The Dow Fell 179 Points Because Fauci Warned About Covid Mutations', 'Apple Inc. stock outperforms competitors on strong trading day', 'Apple Stock Nears Another Record. Analysts’ iPhone Outlooks Keep Going Up.', 'Charting a bull-trend pullback: S&P 500 retests the breakout point', 'Facebook and Amazon set records in annual spending on Washington lobbying', 'The Dow Fell 12 Points Because Intel and Apple Stock Softened the Blow', 'Apple Inc. stock outperforms market on strong trading day']
What happens?
The headlines are all in that list; the issue is the indentation of your print. It should be outside the loop so the list is printed only once.
for page in pages:
    url = 'https://www.marketwatch.com/investing/stock/aapl/moreheadlines?channel=MarketWatch&pageNumber=' + str(page)
    results = requests.get(url, headers=headers)
    soup = bs(results.text, "html.parser")
    contents = soup.find_all("div", class_='article__content')
    for i in contents:
        headline = i.find("h3", class_='article__headline').text.strip()
        headlines.append(headline)
print(headlines)
Btw you can improve your selection like this:
soup = bs(results.text, "html.parser")
for headline in soup.select('div.article__content h3.article__headline'):
    headlines.append(headline.get_text(strip=True))
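Putting both suggestions together, a minimal end-to-end sketch (imports included; the URL pattern and the two-page range are the ones from the question):
import numpy as np
import requests
from bs4 import BeautifulSoup as bs

headers = {"User-Agent": "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Mobile Safari/537.36"}
headlines = []

for page in np.arange(1, 3):
    url = 'https://www.marketwatch.com/investing/stock/aapl/moreheadlines?channel=MarketWatch&pageNumber=' + str(page)
    soup = bs(requests.get(url, headers=headers).text, "html.parser")
    for headline in soup.select('div.article__content h3.article__headline'):
        headlines.append(headline.get_text(strip=True))

print(headlines)  # printed once, after both pages have been scraped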
I want to retrieve the tables on the following website and store them in a pandas dataframe: https://www.acf.hhs.gov/orr/resource/ffy-2012-13-state-of-colorado-orr-funded-programs
However, the third table on the page returns an empty dataframe with all the table's data stored in tuples as the column headers:
Empty DataFrame
Columns: [(Service Providers, State of Colorado), (Cuban - Haitian Program, $0), (Refugee Preventive Health Program, $150,000.00), (Refugee School Impact, $450,000), (Services to Older Refugees Program, $0), (Targeted Assistance - Discretionary, $0), (Total FY, $600,000)]
Index: []
Is there a way to "flatten" the tuple headers into header + values, then append this to a dataframe made up of all four tables? My code is below -- it has worked on other similar pages but keeps breaking because of this table's formatting. Thanks!
import pandas as pd
import requests
from bs4 import BeautifulSoup

funds_df = pd.DataFrame()
url = 'https://www.acf.hhs.gov/programs/orr/resource/ffy-2011-12-state-of-colorado-orr-funded-programs'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
year = url.split('ffy-')[1].split('-orr')[0]
tables = page.content
df_list = pd.read_html(tables)
for df in df_list:
    df['URL'] = url
    df['YEAR'] = year
    funds_df = funds_df.append(df)
For this site, there's no need for BeautifulSoup or requests: pandas.read_html returns a list of DataFrames, one for each <table> at the URL.
import pandas as pd
url = 'https://www.acf.hhs.gov/orr/resource/ffy-2012-13-state-of-colorado-orr-funded-programs'
# read the url
dfl = pd.read_html(url)
# see each dataframe in the list; there are 4 in this case
for i, d in enumerate(dfl):
    print(i)
    display(d)  # display works in Jupyter; otherwise use print
    print('\n')
dfl[0]
Service Providers Cash and Medical Assistance* Refugee Social Services Program Targeted Assistance Program TOTAL
0 State of Colorado $7,140,000 $1,896,854 $503,424 $9,540,278
dfl[1]
WF-CMA 2 RSS TAG-F CMA Mandatory 3 TOTAL
0 $3,309,953 $1,896,854 $503,424 $7,140,000 $9,540,278
dfl[2]
Service Providers Refugee School Impact Targeted Assistance - Discretionary Services to Older Refugees Program Refugee Preventive Health Program Cuban - Haitian Program Total
0 State of Colorado $430,000 $0 $100,000 $150,000 $0 $680,000
dfl[3]
Volag Affiliate Name Projected ORR MG Funding Director
0 CWS Ecumenical Refugee & Immigration Services $127,600 Ferdi Mevlani 1600 Downing St., Suite 400 Denver, CO 80218 303-860-0128
1 ECDC ECDC African Community Center $308,000 Jennifer Guddiche 5250 Leetsdale Drive Denver, CO 80246 303-399-4500
2 EMM Ecumenical Refugee Services $191,400 Ferdi Mevlani 1600 Downing St., Suite 400 Denver, CO 80218 303-860-0128
3 LIRS Lutheran Family Services Rocky Mountains $121,000 Floyd Preston 132 E Las Animas Colorado Springs, CO 80903 719-314-0223
4 LIRS Lutheran Family Services Rocky Mountains $365,200 James Horan 1600 Downing Street, Suite 600 Denver, CO 80218 303-980-5400
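Coming back to the original question about the tuple headers: it looks like read_html treated both rows of that third table as header rows, so the values ended up in MultiIndex column labels and the body came back empty. A hedged way to push them back down into a one-row table, reusing pd and url from the snippet above and assuming each column label is a (header, value) pair as in the output shown in the question:
def unflatten(df):
    # Rebuild a one-row table when read_html has absorbed the data row into tuple headers.
    if df.empty and all(isinstance(c, tuple) and len(c) == 2 for c in df.columns):
        cols = [c[0] for c in df.columns]
        vals = [c[1] for c in df.columns]
        return pd.DataFrame([vals], columns=cols)
    return df

df_list = [unflatten(d) for d in pd.read_html(url)]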