I'm looking to do something with bills that have been delivered to the governor - collecting dates for when they were delivered and the date of the last legislative action before they were sent.
I'm doing this for a whole series of similar URLs. The problem is that my code (below) works for some URLs and not others. I'm writing the results to a pandas dataframe and then to a CSV file. When the code fails, it falls through to the else block even though the if or elif branch should have been triggered.
Here's a fail URL: https://www.nysenate.gov/legislation/bills/2011/s663
And a succeed URL: https://www.nysenate.gov/legislation/bills/2011/s333
Take the first URL for example. Underneath the "view actions" dropdown, it says it was delivered to the governor on Jul 29, 2011. Prior to that, it was returned to assembly on Jun 20, 2011.
Using "delivered to governor" location as td in the table, I'd like to collect both dates using Bs4.
Here's what I have in my code:
check_list = [item.text.strip() for item in tablebody.select("td")]
dtg = "delivered to governor"
dtg_regex = re.compile(
    '/.*(\S\S\S\S\S\S\S\S\S\s\S\S\s\S\S\S\S\S\S\S\S).*'
)
if dtg in check_list:
    i = check_list.index(dtg)
    transfer_list.append(check_list[i+1])  ## last legislative action date (not counting dtg)
    transfer_list.append(check_list[i-1])  ## dtg date
elif any(dtg_regex.match(dtg_check_list) for dtg_check_list in check_list):
    transfer_list.append(check_list[4])
    transfer_list.append(check_list[2])
else:
    transfer_list.append("no floor vote")
    transfer_list.append("not delivered to governor")
You could use :has and :contains to target the right first row, and find_next to move to the next row. You can use :last-of-type to get the last action in the first row and select_one to get the first action in the second row. You can use the class of each "column" to move between the first and second columns.
Your mileage may vary with other pages.
import requests
from bs4 import BeautifulSoup as bs
links = ['https://www.nysenate.gov/legislation/bills/2011/s663', 'https://www.nysenate.gov/legislation/bills/2011/s333']
transfer_list = []
with requests.Session() as s:
    for link in links:
        r = s.get(link)
        soup = bs(r.content, 'lxml')
        target = soup.select_one('.cbill--actions-table--row:has(td:contains("delivered"))')
        if target:
            print(target.select_one('.c-bill--actions-table-col1').text)
            # transfer_list.append(target.select_one('.c-bill--actions-table-col1').text)
            print(target.select_one('.c-bill--action-line-assembly:last-of-type, .c-bill--action-line-senate:last-of-type').text)
            # transfer_list.append(target.select_one('.c-bill--action-line-assembly:last-of-type, .c-bill--action-line-senate:last-of-type').text)
            print(target.find_next('tr').select_one('.c-bill--actions-table-col1').text)
            # append again
            print(target.find_next('tr').select_one('.c-bill--actions-table-col2 span').text)
            # append again
        else:
            transfer_list.append("no floor vote")
            transfer_list.append("not delivered to governor")
Make full use of XPath:
Get date of "delivered to governor"
//text()[contains(lower-case(.), 'delivered to governor')]/ancestor::tr/td[1]/text()
S663A - http://xpather.com/bUQ6Gva8
S333 - http://xpather.com/oTNfuH75
Get date of "returned to assembly/senate"
//text()[contains(lower-case(.), 'delivered to governor')]/ancestor::tr/following-sibling::tr/td//text()[contains(lower-case(.), 'returned to')]/ancestor::tr/td[1]/text()
S663A - http://xpather.com/Rnufm2TB
S333 - http://xpather.com/4x9UHn4L
Get date of action which precedes "delivered to governor" row regardless of what the action is
//text()[contains(lower-case(.), 'delivered to governor')]/ancestor::tr/following-sibling::tr[1]/td/text()
S663A - http://xpather.com/AUpWCFIz
S333 - http://xpather.com/u8LOCb0x
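If you want to evaluate these from Python, note that lxml only supports XPath 1.0, which lacks the lower-case() function used above. Here is a minimal sketch that substitutes translate() for the lowercasing and otherwise follows the same expressions; the selectors are carried over from the XPath above, so verify them against the live page:

```
import requests
from lxml import html

url = "https://www.nysenate.gov/legislation/bills/2011/s663"
tree = html.fromstring(requests.get(url).content)

# XPath 1.0 has no lower-case(); translate() maps A-Z to a-z instead
lower = "translate(., 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz')"

# date of the "delivered to governor" row
delivered = tree.xpath(
    f"//text()[contains({lower}, 'delivered to governor')]/ancestor::tr/td[1]/text()"
)

# date of the action preceding "delivered to governor" (the next tr in document order)
previous_action = tree.xpath(
    f"//text()[contains({lower}, 'delivered to governor')]"
    "/ancestor::tr/following-sibling::tr[1]/td/text()"
)

print(delivered, previous_action)
```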
I am trying to web scrape NFL standings data and am interested in the categories "PCT" and "Net Pts" from the table at this URL: https://www.nfl.com/standings/league/2021/REG
I have set up BeautifulSoup and printed all the 'td' elements on this page. The problem is that when you do so, the teams come back ordered from worst record to best. Obviously this will cause problems in the future if I have a specific index that I have identified as the Lions' PCT, for example, because when their record changes that data will have a different index. In fact, the order of the teams on the website will change every week as more games are played.
Is there any way to say something like: if the name of the team is X, do something (e.g. use the table data 4 indexes lower)? I haven't seen this problem dealt with in any YouTube tutorial or book, so I am wondering what the thought process is. I need a way to identify each team and their PCT and Net Pts directly, as this info will be put into another function.
Here is what I have so far for example:
When you do something like this...
import requests
from bs4 import BeautifulSoup
url = 'https://www.nfl.com/standings/league/2021/REG'
res = requests.get(url)
soup = BeautifulSoup(res.text, 'lxml')
data = soup.find_all('td')[0:10]
print(data)
#I am using just the first 10 indexes to keep it short here
...you get the table data info for the Detroit Lions as they are the worst team in the league at the time of posting this question. I have identified that their "PCT" data point would be
win_pct = soup.find_all('td')[4]
print(float(win_pct.text.strip()))
However, if another team becomes the worst team in the league this index would belong to them and not the Lions. How would I work around this? Thanks
You can use a dictionary to store the data about each club and then use the club name as a key to retrieve it (independent of the club's position in the table). For example:
import requests
from bs4 import BeautifulSoup
url = "https://www.nfl.com/standings/league/2021/REG"
res = requests.get(url)
soup = BeautifulSoup(res.text, "lxml")
data = {}
for row in soup.select("tr:has(td)"):
    cells = [td.get_text(strip=True) for td in row.select("td")[1:]]
    club_name = row.select_one(".d3-o-club-fullname").get_text(strip=True)
    data[club_name] = cells
# print PCT/Net Pts of Detroit Lions:
print(data["Detroit Lions"][3], data["Detroit Lions"][6])
Prints:
0.000 -63
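If the other function expects numbers rather than strings, a small hedged follow-up; indexes 3 and 6 assume the same column order as in the loop above:

```
# convert the scraped strings to numeric types before passing them on;
# indexes 3 and 6 assume the column order shown above (PCT and Net Pts)
pct = float(data["Detroit Lions"][3])
net_pts = int(data["Detroit Lions"][6])
print(pct, net_pts)
```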
I want to import the data from the Tesla quarterly revenues table into a pandas dataframe. I keep extracting the annual revenues table instead (both tables are side by side on the webpage). How do I need to modify my code to extract the quarterly revenues? Thanks in advance.
```
html_data = requests.get('https://www.macrotrends.net/stocks/charts/TSLA/tesla/revenue').text
soup = BeautifulSoup(html_data, 'html5lib')
tesla_revenue = pd.DataFrame(columns=['Date', 'Revenue'])
for row in soup.find("tbody").find_all("tr"):
    col = row.find_all("td")
    if col != []:
        date = col[0].text
        revenue = col[1].text
        tesla_revenue = tesla_revenue.append({"Date": date, "Revenue": revenue}, ignore_index=True)
tesla_revenue
```
To stick with your method, you can use the following CSS selector. Note that I skip the first row, which contains the headers.
Py requests:
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
r = requests.get('https://www.macrotrends.net/stocks/charts/TSLA/tesla/revenue')
soup = bs(r.content, 'html5lib')
tesla_revenue = pd.DataFrame(
    [{"Date": r.select_one('td:nth-of-type(1)').text,
      "Revenue": r.select_one('td:nth-of-type(2)').text}
     for r in soup.select('#style-1 div + div .historical_data_table tr')[1:]],
    columns=['Date', 'Revenue'])
print(tesla_revenue)
The selector #style-1 div + div .historical_data_table tr starts from the parent that has id style-1, moves to a child div via the descendant combinator ( ), then to the adjacent div via the adjacent sibling combinator (+), then to the table element that has the class (.) historical_data_table, and finally, via the descendant combinator ( ), selects all the rows, tr, within it.
You can test the CSS in the browser's Elements tab by pressing F12, then Ctrl + F, entering the CSS #style-1 div + div .historical_data_table tr and hitting Enter. You can then cycle through the matches.
You can read about css selectors here: https://developer.mozilla.org/en-US/docs/Web/CSS/CSS_Selectors
Py pandas read_html:
There is a lot of unnecessary work going on there though.
You could more easily just use pandas read_html and index into the list of returned tables for the right one:
import pandas as pd
table = pd.read_html('https://www.macrotrends.net/stocks/charts/TSLA/tesla/revenue')[1]
print(table)
You can find the right index by searching in the Elements tab, as described above, but using the CSS type selector table; you will see that the second actual table element match is the one you want.
If you examine the page source (right click, View Source) you will find the table there, meaning the content is static and read_html can read it (unless request headers are required).
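If the site does turn out to require request headers, one hedged workaround is to fetch the page with requests yourself and hand the HTML to read_html; the User-Agent value below is just an illustrative placeholder:

```
import requests
import pandas as pd

# illustrative headers; the exact User-Agent string is an assumption
headers = {'User-Agent': 'Mozilla/5.0'}
r = requests.get('https://www.macrotrends.net/stocks/charts/TSLA/tesla/revenue', headers=headers)

# same indexing as above: the second table on the page is the quarterly one
table = pd.read_html(r.text)[1]
print(table)
```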
Scraping:
As per the helpful comment by @MichaelLongstreth:
Please read https://www.macrotrends.net/robots.txt to check whether scraping is permitted. If it is not permitted, look for a public API serving the same data, or another website that does permit scraping.
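One way to automate that check is the standard library's robotparser; a minimal sketch, assuming you are testing the revenue page used above:

```
from urllib.robotparser import RobotFileParser

# fetch and parse the site's robots.txt
rp = RobotFileParser('https://www.macrotrends.net/robots.txt')
rp.read()

url = 'https://www.macrotrends.net/stocks/charts/TSLA/tesla/revenue'
# '*' asks about the rules that apply to any user agent
print(rp.can_fetch('*', url))
```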
I am trying to use the United States Securities and Exchange (SEC) database, to look at company financial reports (known as 10k’s) to pull out a list of the executive committee members for each filing. I am currently using the most recent files for Microsoft (stock ticker: MSFT) and Walmart (stock ticker: WMT). I know I can look up this information elsewhere on finance websites but I am trying to make a flexible database for personal use. My issue:
1. The table index position is different in each report; on one company's report the table I want may be table 38 and on another it may be table 45, so a static index/position count will not work across multiple filings.
2. The specific attributes in each HTML table tag change, so I cannot search for a common attribute. In some cases I find common attributes and sometimes I do not.
I am starting to think I may not be able to automate this due to the lack of identifiers that are unique within each file and common across all files. I've banged my head against many Python web-scraping tutorials and videos over the last few weeks. Any suggestions are appreciated; full automation would be ideal so I can loop through multiple filings, but partial help works too, I'm here to learn. I might be bumping up against trying to automate something that is too diverse.
Microsoft Link:
https://www.sec.gov/Archives/edgar/data/789019/000156459019027952/msft-10k_20190630.htm
Desired Table:
<table border="0" cellspacing="0" cellpadding="0" align="center" style="border-collapse:collapse; width:100%;">
Walmart Link:
https://www.sec.gov/Archives/edgar/data/104169/000010416919000016/wmtform10-kx1312019.htm
Desired Table:
<table cellpadding="0" cellspacing="0" style="font-family:Times New Roman;font-size:10pt;width:100%;border-collapse:collapse;text-align:left;">
Code to Count Number of Tables in Each Page:
from selenium import webdriver
from bs4 import BeautifulSoup
chrome_path = r"C:\webdrivers\chromedriver.exe"
browser = webdriver.Chrome(chrome_path)
#Microsoft
browser.get("https://www.sec.gov/Archives/edgar/data/789019/000156459019027952/msft-10k_20190630.htm")
msft = browser.page_source
page_msft = BeautifulSoup(msft, 'html.parser')
tables_msft = page_msft.find_all("table")
#Walmart
browser.get("https://www.sec.gov/Archives/edgar/data/104169/000010416919000016/wmtform10-kx1312019.htm")
wmt = browser.page_source
page_wmt = BeautifulSoup(wmt, 'html.parser')
tables_wmt = page_wmt.find_all("table")
print("MSFT Result Table Count: " + str(len(tables_msft)))
print("Walmart Result Table Count: " + str(len(tables_wmt)))
Results:
MSFT Result Table Count: 263
Walmart Result Table Count: 258
Process finished with exit code 0
Firstly, you don't need Selenium; the requests library will be faster and avoids the overhead. I was able to partially figure out a way to extract the required data, but since the number of columns differs, the two results (for Microsoft and Walmart) cannot be combined.
The below code generates two required dataframe one for Microsoft and one for Walmart.
You still need to manipulate the column names. The idea is to find the table that has a td with the value 'Age', since that is a unique piece of table data. Let me know if you need any clarification:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import numpy as np
#Microsoft
page = requests.get("https://www.sec.gov/Archives/edgar/data/789019/000156459019027952/msft-10k_20190630.htm")
soup = BeautifulSoup(page.text, 'html')
resmsft = []
tables_msft = soup.find(text="Age").find_parent("table")
for row in tables_msft.find_all("tr")[1:]:
    # print([cell.get_text(strip=True) for cell in row.find_all("td")])
    if row:
        resmsft.append([cell.get_text(strip=True) for cell in row.find_all("td")])
non_empty = [sublist for sublist in resmsft if any(sublist)]
df_msft = pd.DataFrame.from_records(non_empty)
df_msft[df_msft==''] = np.nan
df_msft=df_msft.dropna(axis=1,how='all')
#Walmart
page = requests.get("https://www.sec.gov/Archives/edgar/data/104169/000010416919000016/wmtform10-kx1312019.htm")
soup = BeautifulSoup(page.text, 'html')
#page_wmt = BeautifulSoup(soup, 'html.parser')
tables_wmt = soup.find(text="Age").find_parent("table")
reswmt = []
for row in tables_wmt.find_all("tr")[1:]:
    # print([cell.get_text(strip=True) for cell in row.find_all("td")])
    if row:
        reswmt.append([cell.get_text(strip=True) for cell in row.find_all("td")])
non_empty_wmt = [sublist for sublist in reswmt if any(sublist)]
df_wmt = pd.DataFrame.from_records(non_empty_wmt)
df_wmt[df_wmt==''] = np.nan
df_wmt=df_wmt.dropna(axis=1,how='all')
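For the column-name manipulation mentioned above, a hedged sketch: the integer keys and labels below are hypothetical, and since dropna keeps the original integer labels of whichever columns survive, check df_msft.columns / df_wmt.columns first:

```
# hypothetical labels keyed by positional column index; rename() simply ignores
# keys that are not present, so this will not fail if the shapes differ
df_msft = df_msft.rename(columns={0: "name", 1: "age", 2: "title"})
df_wmt = df_wmt.rename(columns={0: "name", 1: "age", 2: "title"})
print(df_msft.head())
print(df_wmt.head())
```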
I'm looking for help with two main things: (1) scraping a web page and (2) turning the scraped data into a pandas dataframe (mostly so I can output as .csv, but just creating a pandas df is enough for now). Here is what I have done so far for both:
(1) Scraping the web site:
I am trying to scrape this page: https://www.osha.gov/pls/imis/establishment.inspection_detail?id=1285328.015&id=1284178.015&id=1283809.015&id=1283549.015&id=1282631.015. My end goal is to create a dataframe that would ideally contain only the information I am looking for (i.e. I'd be able to select only the parts of the site that I am interested in for my df); it's OK if I have to pull in all the data for now.
As you can see from the URL as well as the ID hyperlinks underneath "Quick Link Reference" at the top of the page, there are five distinct records on this page. I would like each of these IDs/records to be treated as an individual row in my pandas df.
EDIT: Thanks to a helpful comment, I'm including an example of what I would ultimately want in the table below. The first row represents column headers/names and the second row represents the first inspection.
inspection_id open_date inspection_type close_conference close_case violations_serious_initial
1285328.015 12/28/2017 referral 12/28/2017 06/21/2018 2
Mostly relying on BeautifulSoup4, I've tried a few different options to get at the page elements I'm interested in:
# This is meant to give you the first instance of Case Status, which in the case of this page is "CLOSED".
case_status_template = html_soup.head.find('div', {"id" : "maincontain"},
    class_ = "container").div.find('table', class_ = "table-bordered").find('strong').text
# I wasn't able to get the remaining Case Statuses with find_next_sibling or find_all, so I used a different method:
for table in html_soup.find_all('table', class_= "table-bordered"):
    print(table.text)
# This gave me the output I needed (i.e. the Case Status for all five records on the page),
# but didn't give me the structure I wanted and didn't really allow me to connect to the other data on the page.
# I was also able to get to the same place with another page element, Inspection Details.
# This is the information reflected on the page after "Inspection: ", directly below Case Status.
insp_details_template = html_soup.head.find('div', {"id" : "maincontain"},
    class_ = "container").div.find('table', class_ = "table-unbordered")
for div in html_soup.find_all('table', class_ = "table-unbordered"):
    print(div.text)
# Unfortunately, although I could get these two pieces of information to print,
# I realized I would have a hard time getting the rest of the information for each record.
# I also knew that it would be hard to connect/roll all of these up at the record level.
So, I tried a slightly different approach. By focusing instead on a version of that page with a single inspection record, I thought maybe I could just hack it by using this bit of code:
url = 'https://www.osha.gov/pls/imis/establishment.inspection_detail?id=1285328.015'
response = get(url)
html_soup = BeautifulSoup(response.text, 'html.parser')
first_table = html_soup.find('table', class_ = "table-borderedu")
first_table_rows = first_table.find_all('tr')
for tr in first_table_rows:
    td = tr.find_all('td')
    row = [i.text for i in td]
    print(row)
# Then, actually using pandas to get the data into a df and out as a .csv.
dfs_osha = pd.read_html('https://www.osha.gov/pls/imis/establishment.inspection_detail?id=1285328.015',header=1)
for df in dfs_osha:
    print(df)
path = r'~\foo'
dfs_osha = pd.read_html('https://www.osha.gov/pls/imis/establishment.inspection_detail?id=1285328.015',header=1)
for df[1,3] in dfs_osha:
    df.to_csv(os.path.join(path,r'osha_output_table1_012320.csv'))
# This worked better, but didn't actually give me all of the data on the page,
# and wouldn't be replicable for the other four inspection records I'm interested in.
So, finally, I found a pretty handy example here: https://levelup.gitconnected.com/quick-web-scraping-with-python-beautiful-soup-4dde18468f1f. I was trying to work through it, and had gotten as far as coming up with this code:
for elem in all_content_raw_lxml:
    wrappers = elem.find_all('div', class_ = "row-fluid")
    for x in wrappers:
        case_status = x.find('div', class_ = "text-center")
        print(case_status)
        insp_details = x.find('div', class_ = "table-responsive")
        for tr in insp_details:
            td = tr.find_all('td')
            td_row = [i.text for i in td]
            print(td_row)
        violation_items = insp_details.find_next_sibling('div', class_ = "table-responsive")
        for tr in violation_items:
            tr = tr.find_all('tr')
            tr_row = [i.text for i in tr]
            print(tr_row)
        print('---------------')
Unfortunately, I ran into too many bugs with this to be able to use it so I was forced to abandon the project until I got some further guidance. Hopefully the code I've shared so far at least shows the effort I've put in, even if it doesn't do much to get to the final output! Thanks.
For this type of page you don't really need beautifulsoup; pandas is enough.
url = 'your url above'
import pandas as pd
#use pandas to read the tables on the page; there are lots of them...
tables = pd.read_html(url)
#Select from this list of tables only those tables you need:
incident = [] #initialize a list of inspections
for i, table in enumerate(tables): #we need to find the index position of this table in the list; more below
    if table.shape[1] == 5: #all relevant tables have this shape
        case = [] #initialize a list of inspection items you are interested in
        case.append(table.iat[1,0]) #this is the location in the table of this particular item
        case.append(table.iat[1,2].split(' ')[2]) #the string in the cell needs to be cleaned up a bit...
        case.append(table.iat[9,1])
        case.append(table.iat[12,3])
        case.append(table.iat[13,3])
        case.append(tables[i+2].iat[0,1]) #this particular item is in a table which is 2 positions down from the current one; this is where the index position of the current table comes in handy
        incident.append(case)
columns = ["inspection_id", "open_date", "inspection_type", "close_conference", "close_case", "violations_serious_initial"]
df2 = pd.DataFrame(incident,columns=columns)
df2
Output (pardon the formatting):
inspection_id open_date inspection_type close_conference close_case violations_serious_initial
0 Nr: 1285328.015 12/28/2017 Referral 12/28/2017 06/21/2018 2
1 Nr: 1283809.015 12/18/2017 Complaint 12/18/2017 05/24/2018 5
2 Nr: 1284178.015 12/18/2017 Accident 05/17/2018 09/17/2018 1
3 Nr: 1283549.015 12/13/2017 Referral 12/13/2017 05/22/2018 3
4 Nr: 1282631.015 12/12/2017 Fat/Cat 12/12/2017 11/16/2018 1
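Since the stated end goal was a .csv, you can finish with to_csv (the filename below is just an example):

```
# write the assembled dataframe to disk; the filename is arbitrary
df2.to_csv('osha_inspections.csv', index=False)
```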
I am trying to figure out how to print all tr elements from a table, but I can't quite get it working right.
Here is the link I am working with.
https://en.wikipedia.org/wiki/List_of_current_members_of_the_United_States_Senate
Here is my code.
import requests
from bs4 import BeautifulSoup
link = "https://en.wikipedia.org/wiki/List_of_current_members_of_the_United_States_Senate"
html = requests.get(link).text
# If you do not want to use requests, you could fetch the page with urllib
# instead; it should not cause any issue.
soup = BeautifulSoup(html, "lxml")
res = soup.findAll("span", {"class": "fn"})
for r in res:
    print("Name: " + r.find('a').text)

table_body = soup.find('senators')
rows = table_body.find_all('tr')
for row in rows:
    cols = row.find_all('td')
    cols = [x.text.strip() for x in cols]
    print(cols)
I am trying to print all tr elements from the table named 'senators'. Also, I am wondering if there is a way to click on links of senators, like 'Richard Shelby' which takes me to this:
https://en.wikipedia.org/wiki/Richard_Shelby
From each link, I want to grab the data under 'Assumed office'. In this case the value is: 'January 3, 2018'. So, ultimately, I want to end up with this:
Richard Shelby May 6, 1934 (age 84) Lawyer U.S. House
Alabama Senate January 3, 1987 2022
Assumed office: January 3, 2018
All I can get now is the name of each senator printed out.
In order to locate the "Senators" table, you can first find the corresponding "Senators" label and then get the first following table element:
soup.find(id='Senators').find_next("table")
Now, in order to get the data row by row, you would have to account for the cells with a "rowspan" which stretch across multiple rows. You can either follow the approaches suggested at What should I do when <tr> has rowspan, or the implementation I provide below (not ideal but works in your case).
import copy
import requests
from bs4 import BeautifulSoup
link = "https://en.wikipedia.org/wiki/List_of_current_members_of_the_United_States_Senate"
with requests.Session() as session:
    html = session.get(link).text
    soup = BeautifulSoup(html, "lxml")

    senators_table = soup.find(id='Senators').find_next("table")
    headers = [td.get_text(strip=True) for td in senators_table.tr('th')]
    rows = senators_table.find_all('tr')

    # pre-process table to account for rowspan, TODO: extract into a function
    for row_index, tr in enumerate(rows):
        for cell_index, td in enumerate(tr('td')):
            if 'rowspan' in td.attrs:
                rowspan = int(td['rowspan'])
                del td.attrs['rowspan']

                # insert same td into subsequent rows
                for index in range(row_index + 1, row_index + rowspan):
                    try:
                        rows[index]('td')[cell_index].insert_after(copy.copy(td))
                    except IndexError:
                        continue

    # extracting the desired data
    rows = senators_table.find_all('tr')[1:]
    for row in rows:
        cells = [td.get_text(strip=True) for td in row('td')]
        print(dict(zip(headers, cells)))
If you then want to follow the links to the senator "profile" pages, you would first need to extract the link out of the appropriate cell in a row and then use session.get() to "navigate" to it, something along these lines:
senator_link = row.find_all('td')[3].a['href']
senator_link = urljoin(link, senator_link)
response = session.get(senator_link)
soup = BeautifulSoup(response.content, "lxml")
# TODO: parse
where urljoin is imported as:
from urllib.parse import urljoin
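To fill in the # TODO: parse step for the "Assumed office" date the question asks about, here is a hedged sketch; it assumes the date sits in the profile page's infobox near an "Assumed office" label, which you should verify against the live page:

```
# hedged sketch: locate the "Assumed office" label in the profile page's infobox
# and read the text of the cell it sits in; the exact infobox layout is an
# assumption and may need adjusting
infobox = soup.find("table", class_="infobox")
if infobox:
    label = infobox.find(string=lambda t: t and "Assumed office" in t)
    if label:
        cell = label.find_parent(["th", "td"])
        if cell:
            print(cell.get_text(" ", strip=True))
```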
Also, FYI, one of the reasons to use requests.Session() here is to optimize making requests to the same host:
The Session object allows you to persist certain parameters across requests. It also persists cookies across all requests made from the Session instance, and will use urllib3’s connection pooling. So if you’re making several requests to the same host, the underlying TCP connection will be reused, which can result in a significant performance increase
There is also another way to get the tabular data parsed: .read_html() from pandas. You could do:
import pandas as pd
df = pd.read_html(str(senators_table))[0]
print(df.head())
to get the desired table as a dataframe.