Scraping tables using Pandas read_html and identifying headers - python

I am completely new to web scraping and would like to parse a specific table that occurs in companies' DEF 14A SEC filings. I was able to get the right URL and pass it to pandas.
Note: even though the desired table should occur in every DEF 14A, its layout may differ from company to company. Right now I am struggling with formatting the dataframe.
How do I get the right header rows and join them into a single index (column)?
This is my code so far:
import requests
import bs4 as bs
import pandas as pd

url_to_use = "https://www.sec.gov/Archives/edgar/data/1000229/000095012907000818/h43371ddef14a.htm"
resp = requests.get(url_to_use)
soup = bs.BeautifulSoup(resp.text, "html.parser")
dfs = pd.read_html(resp.text, match="Salary")
pd.options.display.max_columns = None
df = dfs[0]
df.dropna(how="all", inplace = True)
df.dropna(axis = 1, how="all", inplace = True)
display(df)
Right now the output of my code looks like this (screenshot: Dataframe output), whereas the correct layout looks like this (screenshot: Original format).
Is there some way to identify those rows that belong to the header and combine them as the header?

The table HTML is rather messed up; the empty cells really are in the source code. It is easiest to do some post-processing:
import pandas as pd
import requests
r = requests.get("https://www.sec.gov/Archives/edgar/data/1000229/000095012907000818/h43371ddef14a.htm", headers={'User-agent': 'Mozilla/5.0'}).text  # fetch with a user agent to avoid a 401 error
dfs = pd.read_html(r)  # parse every table on the page into a list of dataframes
df = dfs[40]  # pick the right table from that list
df = df[8:].rename(columns={i: ' '.join(df[i][:8].dropna()) for i in df.columns})  # build the column headers from the first 8 rows
df.dropna(how='all', axis=1, inplace=True)  # remove empty columns
df.dropna(how='all', axis=0, inplace=True)  # remove empty rows
df.reset_index(drop=True, inplace=True)
def sjoin(x): return ''.join(x[x.notnull()].astype(str))
df = df.groupby(level=0, axis=1).apply(lambda x: x.apply(sjoin, axis=1))  # concatenate columns with the same headers, taken from https://stackoverflow.com/a/24391268/11380795
Result

  | All Other Compensation ($)(4) | Change in Pension Value and Nonqualified Deferred Compensation Earnings ($) | Name and Principal Position | Non-Equity Incentive Plan Compensation ($) | Salary ($) | Stock Awards ($)(1) | Total ($) | Year
0 | 8953 | (3) | David M. Demshur President and Chief Executive Officer | 766200(2) | 504569 | 1088559 | 2368281 | 2006
1 | 8944 | (3) | Richard L. Bergmark Executive Vice President, Chief Financial Officer and Treasurer | 330800(2) | 324569 | 799096 | 1463409 | 2006
2 | 8940 | (3) | Monty L. Davis Chief Operating Officer and Senior Vice President | 320800(2) | 314569 | 559097 | 1203406 | 2006
3 | 8933 | (3) | John D. Denson Vice President, General Counsel and Secretary | 176250(2) | 264569 | 363581 | 813333 | 2006
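If you would rather not hardcode the 8 header rows, one option is to detect where the data starts and promote everything above it into the header. A minimal sketch, assuming the data rows can be recognised by a four-digit Year value; the promote_header_rows helper and the tiny demo frame are illustrative, not part of the original answer:
import pandas as pd

def promote_header_rows(df, is_data_row):
    """Join every row above the first data row into a single header."""
    first_data = next(i for i, (_, row) in enumerate(df.iterrows()) if is_data_row(row))
    header = {c: ' '.join(df[c].iloc[:first_data].dropna().astype(str)) for c in df.columns}
    return df.iloc[first_data:].rename(columns=header).reset_index(drop=True)

# tiny demo frame mimicking a header that is spread over two rows
raw = pd.DataFrame([
    ['Name and',           'Salary', None],
    ['Principal Position', '($)',    'Year'],
    ['David M. Demshur',   '504569', '2006'],
])

# treat a row as data once any cell holds a four-digit year
is_data = lambda r: r.astype(str).str.fullmatch(r'(19|20)\d{2}').any()
print(promote_header_rows(raw, is_data))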

Related

Pandas map two dataframes using regex

I have two dataframes, one with text information and another with patterns and their regexes. What I need to do is map a column from the second dataframe onto the first using the regexes.
Edit: what I need to do is apply each regex to every row of df['text'] and, if there is a match, add the corresponding Pattern to a new column.
Sample data
text_dict = {'text':['customer and increased repair and remodel activity as well as from other sales',
'sales for the overseas customers',
'marketing approach is driving strong play from top tier customers',
'employees in India have been the continuance of remote work will impact productivity',
'sales due to higher customer']}
regex_dict = {'Pattern':['Sales + customer', 'Marketing + customer', 'Employee * Productivity'],
'regex': ['(?:sales\\w*)(?:[^,.?])*(?:customer\\w*)|(?:customer\\w*)(?:[^,.?])*(?:sales\\w*)',
'(?:marketing\\w*)(?:[^,.?])*(?:customer\\w*)|(?:customer\\w*)(?:[^,.?])*(?:marketing\\w*)',
'(?:employee\\w*)(?:[^\n])*(?:productivity\\w*)|(?:productivity\\w*)(?:[^\n])*(?:employee\\w*)']}
df
text
0 customer and increased repair and remodel acti...
1 sales for the overseas customers
2 marketing approach is driving strong play from...
3 employees in India have been the continuance o...
4 sales due to higher customer
regex
Pattern regex
0 Sales + customer (?:sales\w*)(?:[^,.?])*(?:customer\w*)|(?:cust...
1 Marketing + customer (?:marketing\w*)(?:[^,.?])*(?:customer\w*)|(?:...
2 Employee * Productivity (?:employee\w*)(?:[^\n])*(?:productivity\w*)|(...
Desired output
text Pattern
0 customer and increased repair and remodel acti... Sales + customer
1 sales for the overseas customers Sales + customer
2 marketing approach is driving strong play from... Marketing + customer
3 employees in India have been the continuance o... Employee * Productivity
4 sales due to higher customer Sales + customer
I tried the following: I created a function that returns the Pattern when there is a match, then I iterate over all the rows of the regex dataframe:
def finding_keywords(regex, match, keyword):
    if re.search(regex, match):
        return keyword
    else:
        pass

for index, row in regex.iterrows():
    df['Pattern'] = df['text'].apply(lambda x: finding_keywords(regex['regex'][index], x, regex['Pattern'][index]))
The problem with this is that every iteration erases the previous mappings, as you can see below. Since "I'm foo foo" was the pattern applied in the last iteration, it is the only one remaining:
text Pattern
0 foo None
1 bar None
2 foo foo I'm foo foo
3 foo bar None
4 bar bar None
One solution could be to iterate over the regex dataframe and, inside that, iterate over df; this way I avoid losing information, but I'm looking for a faster solution.
You can loop through the unique values of the regex dataframe, apply each one to the text of the df frame, and store the matching regex in a new regex column. Then merge in the Pattern column and drop the regex column.
The key to my approach was to first create the column as NaN and then fillna with each iteration so the columns didn't get overwritten.
import re
import numpy as np
import pandas as pd

srs = regex['regex'].unique()
df['regex'] = np.nan
for reg in srs:
    # fill only the still-empty cells so earlier matches are not overwritten
    df['regex'] = df['regex'].fillna(df['text'].apply(lambda x: reg if re.search(reg, x) else np.nan))
df = pd.merge(df, regex, how='left', on='regex').drop('regex', axis=1)
df
Out[1]:
text Pattern
0 customer and increased repair and remodel acti... Sales + customer
1 sales for the overseas customers Sales + customer
2 marketing approach is driving strong play from... Marketing + customer
3 employees in India have been the continuance o... Employee * Productivity
4 sales due to higher customer Sales + customer

Python: Pivot up values in data frame using lambda

I am reading a text file with pandas and storing the details in a DataFrame. Below is the input file from which the data frame is created:
SourceID|OrganizationName|AddressLine1|AddressLine2|City
1|Manor Drug Medical And Pharma|5795 N 1st St||Uta
1|Manor Drug Medical And Pharma|23230 Red River|Dr Ste 104|Evanston
On this data frame I am trying to pivot up the address information, grouping by SourceID. Below is the expected output:
SourceID|OrganizationName|AddressLine1|AddressLine2|City
1|Manor Drug Medical And Pharma|5795 N 1st St^23230 Red River|^Dr Ste 104|Evanston^Uta
Below is the code I used for this:
import pandas as pd
df = pd.read_csv('PivotingValues.txt', sep="|")
a=df[df.groupby("SourceID")['AddressLine1'].apply(lambda tags: '^'.join(tags)), df.groupby("SourceID")['City'].apply(lambda tags: '^'.join(tags))]
print(a)
Can you please help in achieving this? Is there any method other than lambda that can be used to achieve the same result?
Use -
df.groupby(['SourceID', 'OrganizationName'], as_index = False).agg('^'.join)
OR
df.groupby(['SourceID', 'OrganizationName'], as_index = False).agg({'AddressLine1': '^'.join, 'AddressLine2': '^'.join, 'City': '^'.join})
Output
SourceID OrganizationName AddressLine1 AddressLine2 City
1 Manor Drug Medical And Pharma 5795 N 1st St^23230 Red River ^Dr Ste 104 Uta^Evanston
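For reference, a self-contained run of the first variant on the sample from the question, reading from an in-memory string here instead of PivotingValues.txt; the fillna("") is an addition so that the blank AddressLine2 cell, which read_csv loads as NaN, joins as an empty string and produces the leading ^ from the expected output:
import io
import pandas as pd

raw = """SourceID|OrganizationName|AddressLine1|AddressLine2|City
1|Manor Drug Medical And Pharma|5795 N 1st St||Uta
1|Manor Drug Medical And Pharma|23230 Red River|Dr Ste 104|Evanston"""

df = pd.read_csv(io.StringIO(raw), sep="|").fillna("")  # blank cells become "" instead of NaN
out = df.groupby(["SourceID", "OrganizationName"], as_index=False).agg("^".join)
print(out.to_csv(sep="|", index=False))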

How do I combine multiple rows of a CSV that share data into one row using Pandas?

I have downloaded the ASCAP database, giving me a CSV that is too large for Excel to handle. I'm able to chunk the CSV to open parts of it, but the problem is that the data isn't super helpful in its default format. Each song title has 3+ rows associated with it:
- The first row includes the % share that ASCAP has in that song.
- The rows after that include a character code (ROLE_TYPE) that indicates whether that row contains the writer or the performer of that song.
- The first column of each row contains the song title.
This structure makes the data confusing: the rows that list the % share have blank cells in the NAME column, since those rows have no writer/performer associated with them.
What I would like to do is transform this data from having 3+ rows per song to having 1 row per song with all relevant data.
So instead of:
TITLE, ROLE_TYPE, NAME, SHARES, NOTE
I would like to change the data to:
TITLE, WRITER, PERFORMER, SHARES, NOTE
Here is a sample of the data:
TITLE,ROLE_TYPE,NAME,SHARES,NOTE
SCORE MORE,ASCAP,Total Current ASCAP Share,100,
SCORE MORE,W,SMITH ANTONIO RENARD,,
SCORE MORE,P,SMITH SHOW PUBLISHING,,
PEOPLE KNO,ASCAP,Total Current ASCAP Share,100,
PEOPLE KNO,W,SMITH ANTONIO RENARD,,
PEOPLE KNO,P,SMITH SHOW PUBLISHING,,
FEEDBACK,ASCAP,Total Current ASCAP Share,100,
FEEDBACK,W,SMITH ANTONIO RENARD,,
I would like the data to look like:
TITLE, WRITER, PERFORMER, SHARES, NOTE
SCORE MORE, SMITH ANTONIO RENARD, SMITH SHOW PUBLISHING, 100,
PEOPLE KNO, SMITH ANTONIO RENARD, SMITH SHOW PUBLISHING, 100,
FEEDBACK, SMITH ANTONIO RENARD, SMITH SHOW PUBLISHING, 100,
I'm using python/pandas to try and work with the data. I am able to use groupby('TITLE') to group rows with matching titles.
import pandas as pd
data = pd.read_csv("COMMA_ASCAP_TEXT.txt", low_memory=False)
title_grouped = data.groupby('TITLE')
for TITLE, group in title_grouped:
    print(TITLE)
    print(group)
I was able to run groupby('TITLE') for each song, and the output I get seems close to what I want:
SCORE MORE
TITLE ROLE_TYPE NAME SHARES NOTE
0 SCORE MORE ASCAP Total Current ASCAP Share 100.0 NaN
1 SCORE MORE W SMITH ANTONIO RENARD NaN NaN
2 SCORE MORE P SMITH SHOW PUBLISHING NaN NaN
What do I need to do to take this group and produce a single row in a CSV file with all the data related to each song?
I would recommend:
1. Decompose the data by ROLE_TYPE.
2. Prepare the data for the merge (rename columns and drop unnecessary columns).
3. Merge everything back into one DataFrame.
The merge is performed automatically over the column that has the same name in the DataFrames being merged (TITLE in this case).
Seems to work nicely :)
data = pd.read_csv("data2.csv", sep=",")
# Create 3 individual DataFrames for different roles
data_ascap = data[data["ROLE_TYPE"] == "ASCAP"].copy()
data_writer = data[data["ROLE_TYPE"] == "W"].copy()
data_performer = data[data["ROLE_TYPE"] == "P"].copy()
# Remove unnecessary columns for ASCAP role
data_ascap.drop(["ROLE_TYPE", "NAME"], axis=1, inplace=True)
# Rename columns and remove unnecessary columns for WRITER role
data_writer.rename(index=str, columns={"NAME": "WRITER"}, inplace=True)
data_writer.drop(["ROLE_TYPE", "SHARES", "NOTE"], axis=1, inplace=True)
# Rename columns and remove unnecessary columns for PERFORMER role
data_performer.rename(index=str, columns={"NAME": "PERFORMER"}, inplace=True)
data_performer.drop(["ROLE_TYPE", "SHARES", "NOTE"], axis=1, inplace=True)
# Merge all together
result = data_ascap.merge(data_writer, how="left")
result = result.merge(data_performer, how="left")
# Print result
print(result)
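Since the goal was one row per song in a CSV file, the combined frame can be written straight back out; the filename below is just a placeholder:
result.to_csv("ascap_combined.csv", index=False)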

strip data frame cell then create columns

I'm trying to take the info from the dataframe and break it out into columns with the following header names; the info is all crammed into one cell.
I'm new to Python, so be gentle.
Thanks for the help.
My code:
import requests
import pandas as pd
from bs4 import BeautifulSoup as soup

r = requests.get('https://nclbgc.org/search/licenseDetails?licenseNumber=80479')
page_data = soup(r.text, 'html.parser')
company_info = [' '.join(' '.join(info.get_text(", ", strip=True).split()) for info in page_data.find_all('tr'))]
df = pd.DataFrame(company_info, columns = ['ic_number, status, renewal_date, company_name, address, county, telephon, limitation, residential_qualifiers'])
print(df)
The result I get:
['License Number, 80479 Status, Valid Renewal Date, n/a Name, DLR Construction, LLC Address, 3217 Vagabond Dr Monroe, NC 28110 County, Union Telephone, (980) 245-0867 Limitation, Limited Classifications, Residential Qualifiers, Arteaga, Vicky Rodriguez']
You can use read_html with some post processing:
import pandas as pd

url = 'https://nclbgc.org/search/licenseDetails?licenseNumber=80479'
#select first table form list of tables, remove only NaNs rows
df = pd.read_html(url)[0].dropna(how='all')
#forward fill NaNs in first column
df[0] = df[0].ffill()
#merge values in second column
df = df.groupby(0)[1].apply(lambda x: ' '.join(x.dropna())).to_frame().rename_axis(None).T
print (df)
Address Classifications County License Number \
1 3217 Vagabond Dr Monroe, NC 28110 Residential Union 80479
Limitation Name Qualifiers Renewal Date \
1 Limited DLR Construction, LLC Arteaga, Vicky Rodriguez
Status Telephone
1 Valid (980) 245-0867
Replace the df line like below:
df = pd.DataFrame(company_info, columns = ['ic_number', 'status', 'renewal_date', 'company_name', 'address', 'county', 'telephon', 'limitation', 'residential_qualifiers'])
Each column name listed under columns should be quoted separately; otherwise the whole string is treated as a single column name.

Rearrange CSV data

I have 2 CSV files with different column orders. For example, the first file starts with 10-digit mobile numbers, while that column is at position 4 in the second file.
I need to merge all the customer data into a single csv file. The order of the columns should be as follows:
mobile pincode model Name Address Location pincode date
mobile Name Address Model Location pincode Date
9845299999 Raj Shah nagar No 22 Rivi Building 7Th Main I Crz Mumbai 17/02/2011
9880877777 Managing Partner M/S Aitas # 1010, 124Th Main, Bk Stage. - Bmw 320 D Hyderabad 560070 30-Dec-11
Name Address Location mobile pincode Date Model
Asvi Developers pvt Ltd fantry Road Nariman Point, 1St Floor, No. 150 Chennai 9844066666 13/11/2011 Crz
L R Shiva Gaikwad & Sudha Gaikwad # 42, Suvarna Mansion, 1St Cross, 17Th Main, Banjara Hill, B S K Stage,- Bangalore 9844233333 560085 40859 Mercedes_E 350 Cdi
The second task, which may be slightly more difficult, is that new files may arrive with a totally different column sequence. In that case I need to extract the 10-digit mobile number and 6-digit pincode columns. I also need to write code that will guess the city column if its values match entries in a given city list. The new files are expected to have relevant column headings, but the headings may be slightly different, e.g. "customer address" instead of "address". How do I handle such data?
sed 's/.*\([0-9]\{10\}\).*/\1,&/' input
It has been suggested that I use sed to move the 10-digit column to the beginning, but I also need to rearrange the text columns. For example, if a column matches the entries in the following list, then it is undoubtedly the model column.
['Crz', 'Bmw 320 D', 'Benz', 'Mercedes_E 350 Cdi', 'Toyota_Corolla He 1.8']
If at least 10% of a column's entries match the above list, then it is the "model" column and should go at position 3, after mobile and pincode.
For your first question, I suggest using pandas to load both files and then concat. After that you can rearrange your columns.
import pandas as pd
dataframe1 = pd.read_csv('file1.csv')
dataframe2 = pd.read_csv('file2.csv')
combined = pd.concat([dataframe1, dataframe2]) #the columns will be ordered alphabetically
To get desired order,
result_df = combined[['mobile', 'pincode', 'model', 'Name', 'Address', 'Location', 'pincode', 'date']]
and then result_df.to_csv('output.csv', index=False) to export it to a CSV file.
For the second one, you can do something like this (assuming you have loaded a csv file into df like above)
match_model = lambda m: m in ['Crz', 'Bmw 320 D', 'Benz', 'Mercedes_E 350 Cdi', 'Toyota_Corolla He 1.8']

for c in df:
    if df[c].map(match_model).sum() / len(df) > 0.1:
        print("Column %s is 'Model'" % c)
        df.rename(columns={c: 'Model'}, inplace=True)
You can modify the matching function match_model to use regex instead if you want.
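For the mobile and pincode columns from the second task, the same fraction-of-matches idea works with regexes instead of a fixed list. A rough sketch; guess_columns, the patterns and the 10% threshold are assumptions rather than anything from the question's files:
import pandas as pd

# patterns for the columns we want to locate; a city list could be checked
# the same way the model list is checked above
patterns = {
    'mobile': r'\d{10}',   # 10-digit mobile number
    'pincode': r'\d{6}',   # 6-digit pincode
}

def guess_columns(df, patterns, threshold=0.1):
    """Rename columns whose values mostly match one of the known patterns."""
    renames = {}
    for target, pat in patterns.items():
        for c in df.columns:
            frac = df[c].astype(str).str.strip().str.fullmatch(pat).mean()
            if frac > threshold:
                renames[c] = target
                break
    return df.rename(columns=renames)

# df = guess_columns(pd.read_csv('file2.csv'), patterns)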
