How to include attributes of an HTML table as a MultiIndex using pandas?

I'm trying to read HTML from the following URL into a pandas dataframe:
https://antismash-db.secondarymetabolites.org/output/GCF_006385935.1/
The rendered page shows N tables I'm interested in and one (the last) that I'm not, i.e., I'm interested in the tables whose headers don't start with "No secondary metabolite".
When I read the HTML via pandas I get 3 tables. Note that the last table from pd.read_html isn't the "No secondary metabolite" table but a concatenation of the tables I'm interested in, whose headers are prefixed with "NZ_".
My question: is there a way to include the headers of the rendered tables as a MultiIndex?
For instance, I'm looking for a result like the one this manual workaround produces:
import pandas as pd

# Read HTML tables
dataframes = pd.read_html("https://antismash-db.secondarymetabolites.org/output/GCF_006385935.1/")
# Set Region as the index
dataframes = list(map(lambda df: df.set_index("Region"), dataframes))
# Manually prepend the title and table headers, respectively
dataframes[0].index = dataframes[0].index.map(lambda x: ("GCF_006385935.1", "NZ_CP041066.1", x))
dataframes[1].index = dataframes[1].index.map(lambda x: ("GCF_006385935.1", "NZ_CP041065.1", x))
# Concatenate tables
df_concat = pd.concat(dataframes[:-1], axis=0)
# Replace "&nbsp" artifacts with _
df_concat.index = df_concat.index.map(lambda x: (x[0], x[1], x[2].replace("&nbsp", "_")))
# MultiIndex labels
df_concat.index.names = ["level_0", "level_1", "level_2"]
df_concat

Try BeautifulSoup to parse the HTML and construct the final dataframe:
import requests
import pandas as pd
from bs4 import BeautifulSoup

id_ = "GCF_006385935.1"
url = f"https://antismash-db.secondarymetabolites.org/output/{id_}/"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

dfs = []
for table in soup.select(".record-overview-details table"):
    header = table.find_previous(class_="record-overview-header").text.split()[0]
    df = pd.read_html(str(table))[0].assign(level_1=header, level_0=id_)
    dfs.append(df)

final_df = pd.concat(dfs)
final_df = final_df.set_index(["level_0", "level_1", "Region"])
print(final_df)
Prints:
                                                        Type     From       To Most similar known cluster Most similar known cluster.1 Similarity
level_0         level_1       Region
GCF_006385935.1 NZ_CP041066.1 Region&nbsp1.1        terpene  1123901  1143342                 carotenoid                      Terpene        50%
                              Region&nbsp1.2    phosphonate  1252463  1293980                        NaN                          NaN        NaN
                              Region&nbsp1.3          T3PKS  1944360  1985445                        NaN                          NaN        NaN
                              Region&nbsp1.4        terpene  2690187  2709232                        NaN                          NaN        NaN
                              Region&nbsp1.5        terpene  4260236  4281054                  surfactin              NRP:Lipopeptide        13%
                              Region&nbsp1.6    siderophore  4446861  4463436                        NaN                          NaN        NaN
                NZ_CP041065.1 Region&nbsp3.1  lanthipeptide    98352   124802                        NaN                          NaN        NaN
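If the literal "&nbsp" artifact in the Region level bothers you, it can be cleaned afterwards, mirroring the replacement step from the question:

# replace the "&nbsp" artifact in the Region level, as the question does
final_df = final_df.rename(index=lambda x: x.replace("&nbsp", "_"), level="Region")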

Related

Remove NaN from table data in Python?

I'm using BS4 to pull a table from an HTML webpage and add it to a pandas DataFrame, but the result is very sloppy and I can't get it to print properly. Can anyone help?
There is only one table on the webpage. This is the code I'm using, and what it pulls:
soup = BeautifulSoup(driver.page_source,'html.parser')
df = pd.read_html(str(soup))
print (df)
results:
[   Unnamed: 0  Student Number  Student Name Placement Date
 0         NaN        20808456  Sandy Gurlow     01/13/2023
 1         NaN             NaN           NaN            NaN]
But I've tried to use:
df.dropna(inplace=True)
And I get the error code:
AttributeError: 'list' object has no attribute 'dropna'
pandas.read_html returns a list of dataframes, with as many dataframes as it found tables in the input.
You need to use:
df = pd.read_html(driver.page_source)[0]
Or, to avoid IndexError in case of no table:
l = pd.read_html(driver.page_source)
if l:
    df = l[0]
else:
    print('no table found')
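From there, dropping the all-NaN rows (and, if you like, the all-NaN Unnamed: 0 column) is straightforward:

df = pd.read_html(driver.page_source)[0]
df = df.dropna(how='all')          # drop rows where every value is NaN
df = df.dropna(axis=1, how='all')  # drop the all-NaN "Unnamed: 0" column too
print(df)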

How to pull specific columns from a Wikipedia table using python/Beautiful Soup

I've really been stumped for a while on this.
Link to table = https://en.wikipedia.org/wiki/List_of_Manchester_United_F.C._seasons
I want to pull the data in the columns highlighted in red below and put it in a pandas dataframe like this.
Here is my code
import urllib.request
import pandas as pd
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/List_of_Manchester_United_F.C._seasons"
page = urllib.request.urlopen(url)
soup = BeautifulSoup(page, "lxml")
# print(soup.prettify())

my_table = soup.find('table', {'class': 'wikitable sortable'})
season = []
data = []
for row in my_table.find_all('tr'):
    s = row.find('th')
    season.append(s)
    d = row.find('td')
    data.append(d)

c = {'Season': season, 'Data': data}
df = pd.DataFrame(c)
df
Here's my output. I'm completely lost on how to get to the simple 5-column table above. Thanks.
You are almost there, though you don't really need beautifulsoup for that; just pandas.
Try this:
url = "https://en.wikipedia.org/wiki/List_of_Manchester_United_F.C._seasons"
resp = requests.get(url)
tables = pd.read_html(resp.text)
target = tables[2].iloc[:,[0,2,3,4,5]]
target
Output:
       Season       P       W       D       L
       Season  League  League  League  League
0     1886–87     NaN     NaN     NaN     NaN
1  1888–89[9]      12       8       2       2
2     1889–90      22       9       2      11
etc. And you can take it from there.
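For instance, one possible cleanup sketch; the flat column names below are assumptions read off the desired 5-column table, so adjust them as needed:

# flatten the two-level header and coerce the numeric columns (names assumed)
target = target.copy()
target.columns = ['Season', 'P', 'W', 'D', 'L']
target[['P', 'W', 'D', 'L']] = target[['P', 'W', 'D', 'L']].apply(pd.to_numeric, errors='coerce')
print(target.head())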

Join multiple CSV files by using python pandas

I am trying to create a CSV file from multiple csv files by using python pandas.
accreditation.csv :-
"pid","accreditation_body","score"
"25799","TAAC","4.5"
"25796","TAAC","5.6"
"25798","DAAC","5.7"
ref_university.csv :-
"id","pid","survery_year","end_year"
"1","25799","2018","2018"
"2","25797","2016","2018"
I want to create a new table by reading the instructions from table_structure.csv: join the two tables and rewrite accreditation.csv. The REFERENCES ref_university(id, survey_year) entry connects to ref_university.csv and inserts the id and survery_year column values by matching on the pid column.
table_structure.csv :-
table_name,attribute_name,attribute_type,Description
,,,
accreditation,accreditation_body,varchar,
,grading,varchar,
,pid,int4, "REFERENCES ref_university(id, survey_year)"
,score,float8,
Modified CSV file should look like,
New accreditation.csv :-
"accreditation_body","grading","pid","id","survery_year","score"
"TAAC","","25799","1","2018","2018","4.5"
"TAAC","","25797","2","2016","2018","5.6"
"DAAC","","25798","","","","5.7"
I can read the CSV with pandas:
df = pd.read_csv("accreditation.csv")
But what is the recommended way to read the REFERENCES instruction and pick the column values? If there is no matching value, the column should be blank.
We can't hardcode pid in the pandas call; we have to read table_structure.csv and, if there is a REFERENCES entry, pull in the mentioned columns. The tables should not be fully merged; just the specific columns should be added.
A dynamic solution is possible, but not so easy:
df = pd.read_csv("table_structure.csv")
#remove only NaNs rows
df = df.dropna(how='all')
#repalce NaNs by forward filling
df['table_name'] = df['table_name'].ffill()
#create for each table_name one row
df = (df.dropna(subset=['Description'])
.join(df.groupby('table_name')['attribute_name'].apply(list)
.rename('cols'), 'table_name'))
#get name of DataFrame and new columns names
df['df1'] = df['Description'].str.extract('REFERENCES\s*(.*)\s*\(')
df['new_cols'] = df['Description'].str.extract('\(\s*(.*)\s*\)')
df['new_cols'] = df['new_cols'].str.split(', ')
#remove unnecessary columns
df = df.drop(['attribute_type','Description'], axis=1).set_index('table_name')
print (df)
              attribute_name                                        cols
table_name
accreditation            pid  [accreditation_body, grading, pid, score]

                          df1           new_cols
table_name
accreditation  ref_university  [id, survey_year]
# to select by name, create a dictionary of DataFrames
data = {'accreditation': pd.read_csv("accreditation.csv"),
        'ref_university': pd.read_csv("ref_university.csv")}
# select by index
v = df.loc['accreditation']
print(v)
attribute_name                                           pid
cols              [accreditation_body, grading, pid, score]
df1                                           ref_university
new_cols                                   [id, survey_year]
Name: accreditation, dtype: object
Selecting from the dictionary by the Series v:
df = pd.merge(data[v.name],
              data[v['df1']][v['new_cols'] + [v['attribute_name']]],
              on=v['attribute_name'],
              how='left')
is converted to:
df = pd.merge(data['accreditation'],
              data['ref_university'][['id', 'survey_year'] + ['pid']],
              on='pid',
              how='left')
and returns:
print(df)

     pid accreditation_body  score   id  survey_year
0  25799               TAAC    4.5  1.0       2018.0
1  25796               TAAC    5.6  NaN          NaN
2  25798               DAAC    5.7  NaN          NaN
Last, add the new columns by union and reindex:
df = df.reindex(columns=df.columns.union(v['cols']))
print(df)

  accreditation_body  grading   id    pid  score  survey_year
0               TAAC      NaN  1.0  25799    4.5       2018.0
1               TAAC      NaN  NaN  25796    5.6          NaN
2               DAAC      NaN  NaN  25798    5.7          NaN
Here is working code; try it! When files are huge, set low_memory=False in pd.read_csv().
import pandas as pd
import glob

# path to the data folder
path = r"C:\Users\data_folder"
# read all files with a .csv extension
filenames = glob.glob(path + r"\*.csv")
print('File names:', filenames)

df = pd.DataFrame()
# iterate over and concatenate the csv files
for file in filenames:
    temp = pd.read_csv(file, low_memory=False)
    df = pd.concat([df, temp], axis=1)  # set axis=0 if you want to join rows
df.to_csv('output.csv')
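A slightly more idiomatic variant of the same idea, as a sketch under the same assumed folder layout, is to collect the frames first and call pd.concat once (repeated concatenation inside a loop recopies the data on every iteration):

import glob
import pandas as pd

path = r"C:\Users\data_folder"  # assumed folder, as above
frames = [pd.read_csv(f, low_memory=False) for f in glob.glob(path + r"\*.csv")]
df = pd.concat(frames, axis=1)  # axis=0 to stack rows instead
df.to_csv('output.csv')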

How to reindex malformed columns retrieved from pandas read_html?

I am retrieving content from a website that has several tables with the same number of columns, using pandas read_html. When I read a single link whose page has several same-width tables, pandas effectively reads all the tables as one (something like a flat/normalized table). However, I want to do the same for a list of links (i.e., a single flat table for several links), so I tried the following:
In:
import multiprocessing
import pandas as pd

def process(url):
    df_url = pd.read_html(url)
    df = pd.concat(df_url, ignore_index=False)
    return df_url

links = ['link1.com', 'link2.com', 'link3.com', ..., 'linkN.com']
pool = multiprocessing.Pool(processes=6)
df = pool.map(process, links)
df
Nevertheless, I guess I am not specifying the columns correctly to read_html(), so I am getting this malformed list of lists:
Out:
[[ Form Disponibility \
0 290090 01780-500-01) Unavailable - no product available for release.
Relation \
Relation drawbacks
0 NaN Removed
1 NaN Removed ],
[ Form \
Relation \
0 American Regent is currently releasing the 0.4...
1 American Regent is currently releasing the 1mg...
drawbacks
0 Demand increase for the drug
1 Removed ,
Form \
0 0.1 mg/mL; 10 mL Luer-Jet Prefilled Syringe (N...
Disponibility Relation \
0 Product available NaN
2 Removed
3 Removed ]]
So my question is: which parameter should I change in order to get a flat pandas dataframe from the above nested list? I tried header=0, index_col=0, and match='"columns"'; none of them worked. Or do I need to do the flattening when I create the pandas dataframe with pd.DataFrame()? My main objective is to have a pandas dataframe with these columns:
form, Disponibility, Relation, drawbacks
1
2
...
n
IIUC you can do it this way:
First, you want to return a concatenated DF instead of a list of DFs (read_html returns a list of DFs):
def process(url):
    return pd.concat(pd.read_html(url), ignore_index=False)
and then concatenate them for all URLs:
df = pd.concat(pool.map(process, links), ignore_index=True)
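Putting it together, a minimal end-to-end sketch (the link list is the question's placeholder, so a real run needs reachable URLs):

import multiprocessing
import pandas as pd

def process(url):
    # read_html returns a list of DataFrames; flatten them per URL
    return pd.concat(pd.read_html(url), ignore_index=True)

if __name__ == '__main__':
    links = ['link1.com', 'link2.com', 'link3.com']  # placeholder URLs from the question
    with multiprocessing.Pool(processes=6) as pool:
        df = pd.concat(pool.map(process, links), ignore_index=True)
    print(df)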

How to convert a html table into pandas dataframe

pandas provides a useful to_html() to convert a DataFrame into an HTML table. Is there a function to read it back into a DataFrame?
The read_html utility was released in pandas 0.12.
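For example, a minimal round trip (a sketch; newer pandas versions prefer a file-like object, hence the StringIO wrapper):

import pandas as pd
from io import StringIO

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
html = df.to_html()
df2 = pd.read_html(StringIO(html))[0]  # read_html returns a list of tables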
In the general case it is not possible, but if you approximately know the structure of your table you could do something like this:
# Create a test df:
>>> import numpy as np
>>> from pandas import DataFrame
>>> df = DataFrame(np.random.rand(4, 5), columns=list('abcde'))
>>> df
          a         b         c         d         e
0  0.675006  0.230464  0.386991  0.422778  0.657711
1  0.250519  0.184570  0.470301  0.811388  0.762004
2  0.363777  0.715686  0.272506  0.124069  0.045023
3  0.657702  0.783069  0.473232  0.592722  0.855030
Now parse the html and reconstruct:
from pyquery import PyQuery as pq
d = pq(df.to_html())
columns = d('thead tr').eq(0).text().split()
n_rows = len(d('tbody tr'))
values = np.array(d('tbody tr td').text().split(), dtype=float).reshape(n_rows, len(columns))
>>> DataFrame(values, columns=columns)
          a         b         c         d         e
0  0.675006  0.230464  0.386991  0.422778  0.657711
1  0.250519  0.184570  0.470301  0.811388  0.762004
2  0.363777  0.715686  0.272506  0.124069  0.045023
3  0.657702  0.783069  0.473232  0.592722  0.855030
You could extend it for MultiIndex dfs or add automatic type detection using eval() if needed.
