I am completely new to web scraping and would like to parse a specific table that occurs in the SEC filing DEF 14A of companies. I was able to get the right URL and pass it to pandas.
Note: even though the desired table should occur in every DEF 14A, its layout may differ from company to company. Right now I am struggling with formatting the dataframe.
How do I manage to get the right header and join it into a single index (column)?
This is my code so far:
import bs4 as bs
import pandas as pd
import requests

url_to_use = "https://www.sec.gov/Archives/edgar/data/1000229/000095012907000818/h43371ddef14a.htm"
resp = requests.get(url_to_use)
soup = bs.BeautifulSoup(resp.text, "html.parser")
dfs = pd.read_html(resp.text, match="Salary")
pd.options.display.max_columns = None
df = dfs[0]
df.dropna(how="all", inplace=True)
df.dropna(axis=1, how="all", inplace=True)
display(df)
Right now the output of my code looks like this:
Dataframe output
Whereas the correct layout looks like this:
Original format
Is there some way to identify those rows that belong to the header and combine them as the header?
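What I imagine is something like joining the first few rows column-wise and using the result as the header; a rough, untested sketch on a toy frame (the number of header rows and the values are made up for illustration):

import pandas as pd

# toy frame mimicking what read_html returns when the header is spread over two rows
df = pd.DataFrame([['Name and', 'Salary', None],
                   ['Principal Position', '($)', 'Year'],
                   ['David M. Demshur', 504569, 2006]])

header_rows = 2  # how many leading rows hold header fragments
new_cols = (df.iloc[:header_rows]
            .fillna('')
            .astype(str)
            .agg(' '.join)   # join the fragments of each column top to bottom
            .str.strip())
df = df.iloc[header_rows:].set_axis(new_cols, axis=1).reset_index(drop=True)
print(df)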
The table HTML is rather messed up; the empty cells are actually there in the source. It would be easiest to do some post-processing:
import pandas as pd
import requests
r = requests.get("https://www.sec.gov/Archives/edgar/data/1000229/000095012907000818/h43371ddef14a.htm", headers={'User-agent': 'Mozilla/5.0'}).text #load with a user agent to avoid a 401 error
df = pd.read_html(r)
df = df[40] #get the right table from the list of dataframes
df = df[8:].rename(columns={i: ' '.join(df[i][:8].dropna()) for i in df.columns}) #generate column headers from the first 8 rows
df.dropna(how='all', axis=1, inplace=True) #remove empty columns and rows
df.dropna(how='all', axis=0, inplace=True)
df.reset_index(drop=True, inplace=True)
def sjoin(x): return ''.join(x[x.notnull()].astype(str))
df = df.groupby(level=0, axis=1).apply(lambda x: x.apply(sjoin, axis=1)) #concatenate columns with the same headers, taken from https://stackoverflow.com/a/24391268/11380795
Result
| | All Other Compensation ($)(4) | Change in Pension Value and Nonqualified Deferred Compensation Earnings ($) | Name and Principal Position | Non-Equity Incentive Plan Compensation ($) | Salary ($) | Stock Awards ($)(1) | Total ($) | Year |
|---|---|---|---|---|---|---|---|---|
| 0 | 8953 | (3) | David M. Demshur President and Chief Executive Officer | 766200(2) | 504569 | 1088559 | 2368281 | 2006 |
| 1 | 8944 | (3) | Richard L. Bergmark Executive Vice President, Chief Financial Officer and Treasurer | 330800(2) | 324569 | 799096 | 1463409 | 2006 |
| 2 | 8940 | (3) | Monty L. Davis Chief Operating Officer and Senior Vice President | 320800(2) | 314569 | 559097 | 1203406 | 2006 |
| 3 | 8933 | (3) | John D. Denson Vice President, General Counsel and Secretary | 176250(2) | 264569 | 363581 | 813333 | 2006 |
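Note that groupby(level=0, axis=1) is deprecated in recent pandas. An equivalent way to concatenate same-named columns (a sketch on a toy frame, not part of the original answer) is to transpose, group on the index, and transpose back:

import numpy as np
import pandas as pd

def sjoin(x):
    # join the non-null values of a Series into one string
    return ''.join(x[x.notnull()].astype(str))

# toy frame with duplicate column labels, like the one produced by the header step above
toy = pd.DataFrame([[np.nan, 'a', 1, np.nan],
                    ['b', np.nan, np.nan, 2]],
                   columns=['Name', 'Name', 'Year', 'Year'])

merged = toy.T.groupby(level=0).agg(sjoin).T
print(merged)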
I am a self-taught data science student, currently doing my first big Python portfolio project in several steps. The first step is using pandas to work with IMDb (Internet Movie Database)'s rather oddly structured .tsv files, in an effort to create a fully searchable big-data repository of all IMDb data (the officially supported searches, and even APIs like OMDb (Open Movie Database), don't allow the kinds of detailed queries I need for the larger project).
The structure of IMDb's public files is that they include all data on movies, TV shows, episodes, actors, directors, crew, the whole business, scattered rather haphazardly across seven massive TSV files. I've confirmed that pandas can, in fact, read in all of this data and that my computer's memory can handle it, but what I want to do is merge the seven TSV files into a single DataFrame object which can then be exported to (preferably) a SQL database, or even a huge spreadsheet or another, larger TSV file.
Each thing in the database (movie, actor, individual TV episode) has a tconst identifier, which in one file is labelled "titleId", a string. In every other file this is labelled "tconst", also a string. I'm going to need to change titleId to tconst when I read that file in; this is one of several challenges I haven't got to yet.
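For that particular mismatch, what I have in mind is just a column rename at read time; a rough, untested sketch, assuming the file has been unzipped to title.akas.tsv (title.akas is the file that uses titleId):

import pandas as pd

# IMDb uses "\N" for missing values in its TSV dumps
akas = pd.read_table("title.akas.tsv", sep="\t", na_values="\\N", low_memory=False)
akas = akas.rename(columns={"titleId": "tconst"})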
import pandas as pd

# set pandas formatting parameters
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 25)

# read in the data tables provided
showbiz_core = pd.read_table("name.basics.tsv", sep='\t')

# temporary hack - print the entire dataframe as a test
print(showbiz_core)
This works, but I'm not sure exactly how to proceed next. I want to import each of the other TSV files to locally reconstruct the IMDb database. This means that I don't want duplicate tconst strings; rather, I want new information about a tconst entry (like a film) appended to it as new columns.
Should I be looking to do a "for i in [new file]" type loop somehow? How would you go about this?
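What I imagine, roughly (an untested sketch; file names assume the unzipped dumps sit next to the script), is a merge on the shared key so new columns attach to existing tconst rows rather than adding duplicate rows:

import pandas as pd

basics = pd.read_table("title.basics.tsv", sep="\t", na_values="\\N", low_memory=False)
ratings = pd.read_table("title.ratings.tsv", sep="\t", na_values="\\N")

# one row per tconst, with the ratings columns appended as new columns
titles = basics.merge(ratings, on="tconst", how="left")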
The IMDb files are actually highly structured, and looping is always a bad structure for merging data.
- structured data sourcing - I used wget rather than sourcing the files manually
- the files are large, so work with a subset for modelling purposes. I have just used popular movies and actors as the driver
- the comma-separated columns in the TSV files are actually sub-tables. Treat them as such; I build a reference entity nmi to do this
- there are other associative relationships there as well (primaryProfession, genres)
- finally, join (merge) everything together from OMDb and IMDb, taking the first row where many items associate to a title
I have left the data as TSV for now; clearly it would be very simple to put it into a database using the to_sql() method. The main point is sourcing and transformation, aka ETL, which has become an unfashionable term. This can be further supplemented with web scraping. I looked at Box Office Mojo, however that would require Selenium to scrape as it's dynamic HTML.
IMDb sourcing
import requests, json, re, urllib.parse
from bs4 import BeautifulSoup
import pandas as pd
import wget,gzip
from pathlib import Path
import numpy as np
# find what IMDb has to give ...
resp = requests.get("https://datasets.imdbws.com")
soup = BeautifulSoup(resp.content.decode(), "html.parser")
files = {}
for f in soup.find_all("a", href=True):
    if f["href"].endswith('gz'):
        u = urllib.parse.urlparse(f["href"])
        fn = Path().cwd().joinpath(u.path.strip("/"))
        files[Path(fn.stem).stem] = fn.name
        if not fn.is_file():
            wget.download(f["href"])
IMDb transform
Set alldata=True on the first run to prep the data. On the second run set it to False and you have a manageable subset.
alldata = False
subsetdata = True
dfs={}
# work with a subset of data to speed up modelling and iterations. Take a few major actors and titles
# as criteria to build a manageable representative set of data
l = ["Tom Hanks","Will Smith","Clint Eastwood","Leonardo DiCaprio","Johnny Depp","Meryl Streep","Bruce Willis"]
tm = {'tconst': ['tt0111161','tt0468569','tt1375666','tt0137523','tt0110912','tt0109830','tt0944947','tt0133093','tt0120737','tt0167260','tt0068646'],
      'averageRating': [9.3, 9.0, 8.8, 8.8, 8.9, 8.8, 9.3, 8.7, 8.8, 8.9, 9.2],
      'numVotes': [2275837,2237966,1997918,1805137,1777920,1752954,1699318,1630083,1618100,1602417,1570167]}
# work with subset for modelling purpose
k = "name.basics"
if alldata:
dfs[k] = pd.read_csv(gzip.open(files[k]), sep="\t").replace({"\\N":np.nan})
if subsetdata:
# manage down size of nmi
dfs[k] = dfs[k].loc[(dfs[k]["primaryName"].isin(l)
| dfs[k]["knownForTitles"].str.contains(tm["tconst"][0])
| dfs[k]["knownForTitles"].str.contains(tm["tconst"][1])
| dfs[k]["knownForTitles"].str.contains(tm["tconst"][2])
| dfs[k]["knownForTitles"].str.contains(tm["tconst"][3])
| dfs[k]["knownForTitles"].str.contains(tm["tconst"][4])
| dfs[k]["knownForTitles"].str.contains(tm["tconst"][5])
| dfs[k]["knownForTitles"].str.contains(tm["tconst"][6])
| dfs[k]["knownForTitles"].str.contains(tm["tconst"][7])
| dfs[k]["knownForTitles"].str.contains(tm["tconst"][8])
| dfs[k]["knownForTitles"].str.contains(tm["tconst"][9])
)
&dfs[k]["knownForTitles"].str.contains("tt")]
dfs[k].to_csv(f"{files[k]}_subset.tsv", sep="\t", index=False)
else:
dfs[k] = pd.read_csv(f"{files[k]}_subset.tsv", sep="\t")
dfs[k] = dfs[k].astype({c:"Int64" for c in dfs[k].columns}, errors="ignore")
# birth year is a lot but getting data issues...
# dfs[k] = dfs[k].dropna(subset=["primaryProfession","birthYear"])
# comma separated - not good for joins and merges. rename for consistency
dfs["nmi"] = (dfs["name.basics"].loc[:,["nconst","knownForTitles"]]
.assign(knownForTitles=lambda x: x["knownForTitles"].str.split(","))
.explode("knownForTitles")
).rename(columns={"knownForTitles":"tconst"}).drop_duplicates()
# already extracted known titles so can drop and de-dup - e.g. Tom Hanks
dfs[k] = dfs[k].drop(columns=["knownForTitles"]).drop_duplicates()
for k in [k for k in files.keys() if k not in ["name.basics","omdb.titles"]]:
    if alldata:
        dfs[k] = pd.read_csv(gzip.open(files[k]), sep="\t").replace({"\\N":np.nan})
        if k=="title.akas": dfs[k]=dfs[k].rename(columns={"titleId":"tconst"})
        # subset titles to those we have names
        if subsetdata:
            c = "tconst" if k!= "title.episode" else "parentTconst"
            try:
                (dfs[k].loc[dfs[k][c].isin(dfs["nmi"]["tconst"])]
                 .to_csv(f"{files[k]}_subset.tsv", sep="\t", index=False))
            except KeyError as e:
                print(k, dfs[k].columns, e)
    else:
        dfs[k] = pd.read_csv(f"{files[k]}_subset.tsv", sep="\t")
    dfs[k] = dfs[k].astype({c:"Int64" for c in dfs[k].columns}, errors="ignore")
dfs["name.and.titles"] = dfs["nmi"].merge(dfs["name.basics"], on="nconst").merge(dfs["title.basics"], on="tconst")
OMDB sourcing
omdbcols = ['Title', 'Year', 'Rated', 'Released', 'Runtime', 'Genre', 'Director', 'Writer', 'Actors', 'Plot', 'Language', 'Country', 'Awards', 'Poster', 'Ratings', 'Metascore', 'imdbRating', 'imdbVotes', 'imdbID', 'Type', 'DVD', 'BoxOffice', 'Production', 'Website', 'Response']
omdbk = "omdb.titles"
files[omdbk] = f"{omdbk}.tsz"
if not Path().cwd().joinpath(files[omdbk]).is_file():
    dfs[omdbk] = pd.DataFrame(columns=omdbcols)
else:
    dfs[omdbk] = pd.read_csv(files[omdbk], sep="\t", thousands=",")
    dfs[omdbk] = dfs[omdbk].astype({c:"Int64" for c in dfs[omdbk].columns}, errors="ignore")

k = "title.basics"
# limited to 1000 API calls a day, so only fetch if have not done already
for tconst in dfs[k].loc[~(dfs[k]["tconst"].isin(dfs[omdbk]["imdbID"]))]["tconst"].values:
    # tt0109830 movie Forrest Gump
    # http://www.omdbapi.com/?i=tt3896198&apikey=xxx
    params = {"apikey":apikey, "i":tconst, "plot":"full"}
    res = requests.get("http://www.omdbapi.com/", params=params)
    if res.status_code!=200:
        print("breached API limit")
        break
    else:
        dfs[omdbk] = pd.concat([dfs[omdbk], pd.json_normalize(res.json())])
        dfs[omdbk].to_csv(files[omdbk], index=False, sep="\t")
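One thing to note: the fetch loop assumes apikey is already defined. A minimal way to supply it (the OMDB_APIKEY variable name is my own choice, not part of the original answer):

import os

apikey = os.environ.get("OMDB_APIKEY")  # hypothetical environment variable holding your OMDb key
if apikey is None:
    raise RuntimeError("set OMDB_APIKEY before running the OMDb fetch loop")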
sample analysis
# The Dark Knight tt0468569
# Game of Thrones tt0944947
# for demo purpose - just pick first association when there are many
mask = dfs[omdbk]["imdbID"].isin(["tt0468569","tt0944947"])
demo = (dfs[omdbk].loc[mask]
        .rename(columns={c:f"OMDB{c}" for c in dfs[omdbk].columns})
        .rename(columns={"OMDBimdbID":"tconst"})
        .merge(dfs["title.basics"], on="tconst")
        .merge(dfs["title.ratings"], on="tconst")
        .merge(dfs["title.akas"].groupby("tconst", as_index=False).first(), on="tconst")
        .merge(dfs["title.crew"].groupby("tconst", as_index=False).first(), on="tconst")
        .merge(dfs["title.principals"].groupby("tconst", as_index=False).first(), on="tconst")
        .merge(dfs["title.episode"].groupby("parentTconst", as_index=False).first(),
               left_on="tconst", right_on="parentTconst", how="left", suffixes=("","_ep"))
        .merge(dfs["nmi"]
               .merge(dfs["name.basics"], on="nconst")
               .groupby(["tconst"], as_index=False).first(), on="tconst", suffixes=("","_name"))
       ).T
output
| | 0 | 1 |
|---|---|---|
| OMDBTitle | The Dark Knight | Game of Thrones |
| OMDBYear | 2008 | 2011–2019 |
| OMDBRated | PG-13 | TV-MA |
| OMDBReleased | 18 Jul 2008 | 17 Apr 2011 |
| OMDBRuntime | 152 min | 57 min |
| OMDBGenre | Action, Crime, Drama, Thriller | Action, Adventure, Drama, Fantasy, Romance |
| OMDBDirector | Christopher Nolan | NaN |
| OMDBWriter | Jonathan Nolan (screenplay), Christopher Nolan (screenplay), Christopher Nolan (story), David S. Goyer (story), Bob Kane (characters) | David Benioff, D.B. Weiss |
| OMDBActors | Christian Bale, Heath Ledger, Aaron Eckhart, Michael Caine | Peter Dinklage, Lena Headey, Emilia Clarke, Kit Harington |
| OMDBLanguage | English, Mandarin | English |
| OMDBCountry | USA, UK | USA, UK |
| OMDBAwards | Won 2 Oscars. Another 153 wins & 159 nominations. | Won 1 Golden Globe. Another 374 wins & 602 nominations. |
| OMDBPoster | https://m.media-amazon.com/images/M/MV5BMTMxNTMwODM0NF5BMl5BanBnXkFtZTcwODAyMTk2Mw##._V1_SX300.jpg | https://m.media-amazon.com/images/M/MV5BYTRiNDQwYzAtMzVlZS00NTI5LWJjYjUtMzkwNTUzMWMxZTllXkEyXkFqcGdeQXVyNDIzMzcwNjc#._V1_SX300.jpg |
| OMDBRatings | [{'Source': 'Internet Movie Database', 'Value': '9.0/10'}, {'Source': 'Rotten Tomatoes', 'Value': '94%'}, {'Source': 'Metacritic', 'Value': '84/100'}] | [{'Source': 'Internet Movie Database', 'Value': '9.3/10'}] |
| OMDBMetascore | 84 | <NA> |
| OMDBimdbRating | 9 | 9.3 |
| OMDBimdbVotes | 2234169 | 1679892 |
| tconst | tt0468569 | tt0944947 |
| OMDBType | movie | series |
| OMDBDVD | 09 Dec 2008 | NaN |
| OMDBBoxOffice | $533,316,061 | NaN |
| OMDBProduction | Warner Bros. Pictures/Legendary | NaN |
| OMDBWebsite | <NA> | <NA> |
| OMDBResponse | 1 | 1 |
| OMDBtotalSeasons | <NA> | 8 |
| titleType | movie | tvSeries |
| primaryTitle | The Dark Knight | Game of Thrones |
| originalTitle | The Dark Knight | Game of Thrones |
| isAdult | 0 | 0 |
| startYear | 2008 | 2011 |
| endYear | <NA> | 2019 |
| runtimeMinutes | 152 | 57 |
| genres | Action,Crime,Drama | Action,Adventure,Drama |
| averageRating | 9 | 9.3 |
| numVotes | 2237966 | 1699318 |
| ordering_x | 10 | 10 |
| title | The Dark Knight | Taht Oyunları |
| region | GB | TR |
| language | en | tr |
| types | imdbDisplay | imdbDisplay |
| attributes | fake working title | literal title |
| isOriginalTitle | 0 | 0 |
| directors | nm0634240 | nm0851930,nm0551076,nm0533713,nm0336241,nm1888967,nm1047532,nm0764601,nm0007008,nm0617042,nm0787687,nm0687964,nm0070474,nm1125275,nm0638354,nm0002399,nm0806252,nm0755261,nm0887700,nm0590889 |
| writers | nm0634300,nm0634240,nm0333060,nm0004170 | nm1125275,nm0552333,nm1888967,nm4984276,nm2643685,nm7260047,nm2977599,nm0961827,nm0260870 |
| ordering_y | 10 | 10 |
| nconst | nm0746273 | nm0322513 |
| category | producer | actor |
| job | producer | creator |
| characters | ["Bruce Wayne"] | ["Jorah Mormont"] |
| parentTconst | NaN | tt0944947 |
| tconst_ep | NaN | tt1480055 |
| seasonNumber | <NA> | 1 |
| episodeNumber | <NA> | 1 |
| nconst_name | nm0000198 | nm0000293 |
| primaryName | Gary Oldman | Sean Bean |
| birthYear | 1958 | 1959 |
| deathYear | 1998 | 2020 |
| primaryProfession | actor,soundtrack,producer | actor,producer,animation_department |
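As noted above, pushing the result into a database is a one-liner per frame with to_sql(). A sketch assuming SQLite and the dfs dict built above (nested OMDb columns such as Ratings may need flattening first):

import sqlite3

con = sqlite3.connect("imdb_subset.db")
for name, frame in dfs.items():
    # "." is awkward in SQL table names, so swap it for "_"
    frame.to_sql(name.replace(".", "_"), con, if_exists="replace", index=False)
con.close()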
I am learning Python and trying to create a DataFrame with pandas. I want to take the data from the table on this website https://www.chilli-diy.com/chilikeimtabelle/ so I can later visualize it with Altair. I am having trouble with the column "Schärfe" because there are multiple kinds of values, I guess (1-10, 1 bis 4, 10+, ...).
So all I get is the "Brazilian Ghost", because it has no value?
Thanks in advance.
import bs4
import pandas as pd
import requests

permalink = ('https://www.chilli-diy.com/chilikeimtabelle/')
chilis = requests.get(permalink).text
soup = bs4.BeautifulSoup(chilis, "html.parser")
tables = soup.find_all('table')
names = []
peps = []
for table in tables:
    rows = table.find_all('tr')
    for row in rows:
        cells = row.find_all('td')
        if(peps is None):
            continue
peps.append(cells[2].text)
names.append(cells[0].text)
df = pd.DataFrame({"Chilisorte" : names, "Schärfe" : peps })
df
Out:
Chilisorte Schärfe
0 Brazilian Ghost
Your dataframe only holds the last row because names and peps are appended outside of the loop, so all they contain is the last element. Also, I guess you wanted to check whether the cells are empty, not peps:
names = []
peps = []
for table in tables:
    rows = table.find_all('tr')
    for row in rows:
        cells = row.find_all('td')
        if cells:
            peps.append(cells[2].text)
            names.append(cells[0].text)
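To get from the two lists back to the dataframe in your question (not part of the original answer), you can also turn the empty Schärfe cells into proper missing values:

import pandas as pd

# names and peps as filled by the corrected loop above
df = pd.DataFrame({"Chilisorte": names, "Schärfe": peps})
df["Schärfe"] = df["Schärfe"].str.strip().replace("", pd.NA)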
If you examine the DOM of the page in your browser, you'll see that all Chilisorte column cells (<td>) have the class column-1, and the Schärfe column cells have the class column-3.
The easiest way would be to extract all cells with the relevant classes into two lists, take their .text values, and put them into a dataframe.
import bs4
import pandas as pd
import requests

permalink = ('https://www.chilli-diy.com/chilikeimtabelle/')
chilis = requests.get(permalink).text
soup = bs4.BeautifulSoup(chilis, "html.parser")
tables = soup.find_all('table')

col1_tds = tables[0].findAll('td', {"class": "column-1"})
col1 = [ td.text for td in col1_tds ]
col3_tds = tables[0].findAll('td', {"class": "column-3"})
col3 = [ td.text for td in col3_tds ]

pd.DataFrame({
    "Chilisorte": col1,
    "Schärfe": col3
})
Output:
Chilisorte Schärfe
0 Anaheim 1 bis 4
1 Habanero Chocolate 10
2 Habanero White 10
3 Bird Pepper Wild 9
4 Bhut(Bih) Jolokia Yellow 10+
... ... ...
272 Naglah 10++
273 Dorset Naga 10+++
274 Jigsaw 10+++
275 Black Naga 10+
276 Brazilian Ghost
277 rows × 2 columns
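As an aside, for a plain HTML table like this one, pandas can often skip the manual cell handling entirely via read_html (a different approach from the answers above; it requires lxml or html5lib, and the resulting column names may still need cleaning):

import io
import pandas as pd
import requests

html = requests.get("https://www.chilli-diy.com/chilikeimtabelle/").text
# assumption: the chili table is the first <table> on the page
chilis = pd.read_html(io.StringIO(html))[0]
print(chilis.head())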
I am scraping data from espn.com for the upcoming NFL schedule. However, I am only able to get the first table and not the rest of the tables. I believe it is because of the structure of the HTML, where each date has a different 'td'. I can get Thursday's game data, but not the rest:
Thursday, September 5
MATCHUP TIME (ET) NAT TV TICKETS LOCATION
Green Bay
Chicago
8:20 PM NBC Tickets as low as $290 Soldier Field, Chicago
Sunday, September 8
MATCHUP TIME (ET) NAT TV TICKETS LOCATION
Tennessee
Cleveland
1:00 PM CBS Tickets as low as $121 FirstEnergy Stadium, Cleveland
Cincinnati
Seattle
4:05 PM CBS Tickets as low as $147 CenturyLink Field, Seattle
New York
Dallas
4:25 PM FOX Tickets as low as $50 AT&T Stadium, Arlington
Foxboro
Monday, September 9
MATCHUP TIME (ET) NAT TV TICKETS LOCATION
Houston
New Orleans
7:10 PM ESPN Tickets as low as $112 Mercedes-Benz Superdome, New Orleans
Denver
Oakland
10:20 PM ESPN Tickets as low as $72 Oakland Coliseum, Oakland
I have used BeautifulSoup and was easily able to get the data, but parsing it has been a challenge.
I have tried just continuing with a for loop, but I get a StopIteration traceback. After reading a previous article about that traceback, I realize that I need to try a different solution to the problem.
import requests
from bs4 import BeautifulSoup
from urllib.parse import urlparse
import pandas as pd
main_url = 'http://www.espn.com/nfl/schedule'
response = requests.get(main_url)
soup = BeautifulSoup(response.text, 'lxml')
table = soup.find('table')
rows = table.find_all('tr')
rows = iter(rows)
df = [td.text for td in next(rows).find_all('td') if td.text]
df2 = [td.text for td in next(rows).find_all('td') if td.text]
I believe that the problem lies in this line:
table = soup.find('table')
The thing is, the above-mentioned page consists of 3 table elements that have the class="schedule" attribute. However, in your code you used the find() function only, instead of find_all(). That's the major reason you ended up with only the contents of the first table. So, I believe that if you just handle that part correctly, you'll be good to go. Now, I'm not very familiar with the comprehension notation used to fill up the lists, hence the code below contains the good old for-loop style.
import requests
from bs4 import BeautifulSoup

response = requests.get('http://www.espn.com/nfl/schedule')
soup = BeautifulSoup(response.text, 'lxml')

#List to store the rows
df = []
#Collect all the tables
tables = soup.find_all('table', class_ = "schedule")
for table in tables:
    rows = table.find_all('tr')  # use table.find_all, not soup.find_all, so only this table's rows are read
    #rows = iter(rows)
    row_item = []
    for row in rows:
        #Collect all 'td' elements from the 'row' & append them to a list 'row_item'
        data_items = row.find_all('td')
        for data_item in data_items:
            row_item.append(data_item.text)
        #Append the list to the 'df'
        df.append(row_item)
        row_item = []
print(df)
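If you then want the collected rows in a dataframe rather than a list of lists (my addition, not part of the original answer):

import pandas as pd

schedule_df = pd.DataFrame(df)  # columns come out positional; rename them once you inspect the rows
print(schedule_df.head())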
If you're trying to pull <table> tags, you can use Pandas .read_html() to do that. It'll return a list of dataframes. In this case, you can append them all together into 1 table:
import pandas as pd
url = 'http://www.espn.com/nfl/schedule'
tables = pd.read_html(url)
df = pd.DataFrame()
for table in tables:
    df = df.append(table)
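Note that DataFrame.append was removed in pandas 2.0; on current versions the same result comes from pd.concat (my adjustment, not part of the original answer):

import pandas as pd

url = 'http://www.espn.com/nfl/schedule'
tables = pd.read_html(url)
df = pd.concat(tables, ignore_index=True)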
I'm using Pandas as a way to write data from Selenium.
Two example results from a search box ac_results on a webpage:
#Search for product_id = "01"
ac_results = "Orange (10)"
#Search for product_id = "02"
ac_results = ["Banana (10)", "Banana (20)", "Banana (30)"]
Orange returns only one price ($10), while Banana returns a variable number of prices from different vendors, in this example three prices: ($10), ($20), ($30).
The code uses a regex via re.findall to grab each price and put them into a list. The code works fine as long as re.findall finds only one list item, as for Oranges.
The problem is when there is a variable number of prices, as when searching for Bananas. I would like to create a new row for each price, and the rows should also include product_id and item_name.
Current output:
product_id prices item_name
01 10 Orange
02 [u'10', u'20', u'30'] Banana
Desired output:
product_id prices item_name
01 10 Orange
02 10 Banana
02 20 Banana
02 30 Banana
Current code:
import re
import pandas as pd

df = pd.read_csv("product_id.csv")

def crawl(product_id):
    #Enter search input here, omitted
    #Getting results (driver is the Selenium WebDriver created elsewhere):
    search_result = driver.find_element_by_class_name("ac_results")
    item_name = re.match(r"^.*(?=(\())", search_result.text).group().encode("utf-8")
    prices = re.findall(r"((?<=\()[0-9]*)", search_result.text)
    return pd.Series([prices, item_name])

df[["prices", "item_name"]] = df["product_id"].apply(crawl)
df.to_csv("write.csv", index=False)
FYI: Workable solution with csv module, but I want to use Pandas.
with open("write.csv", "a") as data_write:
    wr_data = csv.writer(data_write, delimiter = ",")
    for price in prices: #<-- This is the important part!
        wr_data.writerow([product_id, price, item_name])
import pandas as pd

# initializing here for reproducibility
pids = ['01','02']
prices = [10, [u'10', u'20', u'30']]
names = ['Orange','Banana']
df = pd.DataFrame({"product_id": pids, "prices": prices, "item_name": names})
The following snippet should work after your apply(crawl).
# convert all of the prices to lists (even if they only have one element)
df.prices = df.prices.apply(lambda x: x if isinstance(x, list) else [x])
# Create a new dataframe which splits the lists into separate columns.
# Then flatten using stack. The explicit MultiIndex allows us to keep
# the item_name and product_id associated with each price.
idx = pd.MultiIndex.from_tuples(list(zip(df['item_name'], df['product_id'])),
                                names=['item_name', 'product_id'])
df2 = pd.DataFrame(df.prices.tolist(), index=idx).stack()
# drop the hierarchical index and select columns of interest
df2 = df2.reset_index()[['product_id', 0, 'item_name']]
# rename back to prices
df2.columns = ['product_id', 'prices', 'item_name']
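On pandas 0.25 or newer, the same reshaping can be done with DataFrame.explode, which is a bit shorter (an alternative to the approach above, shown here on the toy df from the top of this answer):

import pandas as pd

df = pd.DataFrame({"product_id": ['01', '02'],
                   "prices": [10, [u'10', u'20', u'30']],
                   "item_name": ['Orange', 'Banana']})

# wrap scalar prices in one-element lists, then expand to one row per price
df.prices = df.prices.apply(lambda x: x if isinstance(x, list) else [x])
df2 = df.explode('prices').reset_index(drop=True)[['product_id', 'prices', 'item_name']]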
I was not able to run your code (probably missing inputs), but you can probably transform your prices list into a list of dicts and then build a DataFrame from there:
d = [{"price":10, "product_id":2, "item_name":"banana"},
{"price":20, "product_id":2, "item_name":"banana"},
{"price":10, "product_id":1, "item_name":"orange"}]
df = pd.DataFrame(d)
Then df is:
item_name price product_id
0 banana 10 2
1 banana 20 2
2 orange 10 1
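If you want to build that list of dicts from the crawl output itself, one way (a sketch; the input shape mirrors the question's examples) is:

import pandas as pd

# (product_id, prices, item_name) triples as produced by the crawl
rows = [("01", ["10"], "Orange"),
        ("02", ["10", "20", "30"], "Banana")]

records = []
for product_id, prices, item_name in rows:
    for price in prices:
        records.append({"product_id": product_id, "prices": price, "item_name": item_name})

df = pd.DataFrame(records)
print(df)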