I'm using BS4 to pull a table from an HTML webpage and trying to add it to a pandas DataFrame, but the result is very sloppy and I can't get it to print properly. Can anyone help?
There is only one table on the webpage. This is the code I'm using and what it pulls:
soup = BeautifulSoup(driver.page_source,'html.parser')
df = pd.read_html(str(soup))
print (df)
results:
[ Unnamed: 0 Student Number Student Name Placement Date
0 NaN 20808456 Sandy Gurlow 01/13/2023
1 NaN NaN NaN NaN]
But I've tried to use:
df.dropna(inplace=True)
And I get this error:
AttributeError: 'list' object has no attribute 'dropna'
pandas.read_html returns a list of DataFrames, one for each table it found in the input.
You need to use:
df = pd.read_html(driver.page_source)[0]
Or, to handle the case where no table is found (read_html raises a ValueError rather than returning an empty list):
try:
    df = pd.read_html(driver.page_source)[0]
except ValueError:
    print('no table found')
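Note that in the output above the "Unnamed: 0" column is entirely NaN, so a plain dropna() would drop every row. A small sketch, assuming the table shown in the question, that keeps the real data:
df = pd.read_html(driver.page_source)[0]
# the first column is all NaN in the table shown above, so drop it first,
# otherwise dropna() with the default how='any' removes every row
df = df.drop(columns=['Unnamed: 0']).dropna(how='all')
print(df)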
I've really been stumped for a while on this.
Link to table = https://en.wikipedia.org/wiki/List_of_Manchester_United_F.C._seasons
I want to pull the data in the columns highlighted in red below
And put it in a pandas dataframe like this
Here is my code
import urllib.request
url = "https://en.wikipedia.org/wiki/List_of_Manchester_United_F.C._seasons"
page = urllib.request.urlopen(url)
from bs4 import BeautifulSoup
soup = BeautifulSoup(page, "lxml")
# print(soup.prettify())
my_table = soup.find('table', {'class':'wikitable sortable'})
season = []
data = []
for row in my_table.find_all('tr'):
    s = row.find('th')
    season.append(s)
    d = row.find('td')
    data.append(d)
import pandas as pd
c = {'Season': season, 'Data': data}
df = pd.DataFrame(c)
df
Here's my output. I'm completely lost on how to get to the simple 5-column table above. Thanks
You are almost there, though you don't really need BeautifulSoup for that; just pandas.
Try this:
url = "https://en.wikipedia.org/wiki/List_of_Manchester_United_F.C._seasons"
resp = requests.get(url)
tables = pd.read_html(resp.text)
target = tables[2].iloc[:,[0,2,3,4,5]]
target
Output:
Season P W D L
Season League League League League
0 1886–87 NaN NaN NaN NaN
1 1888–89[9] 12 8 2 2
2 1889–90 22 9 2 11
etc. And you can take it from there.
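If the doubled header row gets in the way, here is a minimal follow-up sketch (assuming the columns really come back as a two-level MultiIndex, as the output above suggests):
# keep only the top level of the header: Season, P, W, D, L
target.columns = target.columns.get_level_values(0)
# drop seasons with no league record, e.g. the 1886-87 row above
target = target.dropna()
print(target.head())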
I am trying to create a CSV file from multiple csv files by using python pandas.
accreditation.csv :-
"pid","accreditation_body","score"
"25799","TAAC","4.5"
"25796","TAAC","5.6"
"25798","DAAC","5.7"
ref_university.csv :-
"id","pid","survery_year","end_year"
"1","25799","2018","2018"
"2","25797","2016","2018"
I want to create a new table by reading the instructions from table_structure.csv, join the two tables, and rewrite accreditation.csv. The REFERENCES ref_university(id, survey_year) instruction links to ref_university.csv and inserts the id and survery_year column values by matching on the pid column.
table_structure.csv :-
table_name,attribute_name,attribute_type,Description
,,,
accreditation,accreditation_body,varchar,
,grading,varchar,
,pid,int4, "REFERENCES ref_university(id, survey_year)"
,score,float8,
Modified CSV file should look like,
New accreditation.csv :-
"accreditation_body","grading","pid","id","survery_year","score"
"TAAC","","25799","1","2018","2018","4.5"
"TAAC","","25797","2","2016","2018","5.6"
"DAAC","","25798","","","","5.7"
I can read the CSV with pandas:
df = pd.read_csv("accreditation.csv")
But what is the recommended way to read the REFERENCES instruction and pick the column values? If there is no matching value, the column should be blank.
We cannot hardcode pid in the pandas call. We have to read table_structure.csv and, if there is a REFERENCES entry, pull in the mentioned columns. The tables should not be fully merged; just the specific columns should be added.
A dynamic solution is possible, but not so easy:
df = pd.read_csv("table_structure.csv")
#remove only NaNs rows
df = df.dropna(how='all')
#repalce NaNs by forward filling
df['table_name'] = df['table_name'].ffill()
#create for each table_name one row
df = (df.dropna(subset=['Description'])
.join(df.groupby('table_name')['attribute_name'].apply(list)
.rename('cols'), 'table_name'))
#get name of DataFrame and new columns names
df['df1'] = df['Description'].str.extract('REFERENCES\s*(.*)\s*\(')
df['new_cols'] = df['Description'].str.extract('\(\s*(.*)\s*\)')
df['new_cols'] = df['new_cols'].str.split(', ')
#remove unnecessary columns
df = df.drop(['attribute_type','Description'], axis=1).set_index('table_name')
print (df)
               attribute_name                                        cols  \
table_name
accreditation             pid  [accreditation_body, grading, pid, score]

                          df1           new_cols
table_name
accreditation  ref_university  [id, survey_year]
# to select by name, create a dictionary of DataFrames
data = {'accreditation' : pd.read_csv("accreditation.csv"),
        'ref_university': pd.read_csv("ref_university.csv")}
# select by index
v = df.loc['accreditation']
print (v)
attribute_name pid
cols [accreditation_body, grading, pid, score]
df1 ref_university
new_cols [id, survey_year]
Name: accreditation, dtype: object
Then select from the dictionary data and from the Series v:
df = pd.merge(data[v.name],
              data[v['df1']][v['new_cols'] + [v['attribute_name']]],
              on=v['attribute_name'],
              how='left')
is converted to:
df = pd.merge(data['accreditation'],
              data['ref_university'][['id', 'survey_year'] + ['pid']],
              on='pid',
              how='left')
and returns:
print (df)
pid accreditation_body score id survey_year
0 25799 TAAC 4.5 1.0 2018.0
1 25796 TAAC 5.6 NaN NaN
2 25798 DAAC 5.7 NaN NaN
Last, add the new columns by union and reindex:
df = df.reindex(columns=df.columns.union(v['cols']))
print (df)
accreditation_body grading id pid score survey_year
0 TAAC NaN 1.0 25799 4.5 2018.0
1 TAAC NaN NaN 25796 5.6 NaN
2 DAAC NaN NaN 25798 5.7 NaN
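If the result should also be written back to accreditation.csv, as the question asks, one last small step (column order is whatever the reindex above produced):
# overwrite the original file with the merged result, without the row index
df.to_csv('accreditation.csv', index=False)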
Here is the working code, try it! When the files are huge, set low_memory=False in pd.read_csv().
import pandas as pd
import glob
# gets path to the folder datafolder
path = r"C:\Users\data_folder"
# reads all files with the .csv extension
filenames = glob.glob(path + r"\*.csv")
print('File names:', filenames)
df = pd.DataFrame()
# loop to iterate over and concat the csv files
for file in filenames:
    temp = pd.read_csv(file, low_memory=False)
    df = pd.concat([df, temp], axis=1)  # set axis=0 if you want to join rows
df.to_csv('output.csv')
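A common alternative (a sketch, not the answer above, assuming the same folder layout) is to collect the frames in a list and concatenate once instead of growing df inside the loop:
import glob
import os
import pandas as pd

path = r"C:\Users\data_folder"  # same folder as above
filenames = glob.glob(os.path.join(path, "*.csv"))

# read every file, then concatenate in a single call
frames = [pd.read_csv(f, low_memory=False) for f in filenames]
df = pd.concat(frames, axis=1)  # axis=0 to stack rows instead
df.to_csv('output.csv')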
I am retrieving content from a website which has several tables with the same number of columns, using pandas read_html. When I read a single link that has several such tables, pandas effectively reads all the tables as one (something like a flat/normalized table). However, I am interested in doing the same for a list of links from the website (i.e. a single flat table across several links), so I tried the following:
In:
import multiprocessing
def process(url):
    df_url = pd.read_html(url)
    df = pd.concat(df_url, ignore_index=False)
    return df_url
links = ['link1.com','link2.com','link3.com',...,'linkN.com']
pool = multiprocessing.Pool(processes=6)
df = pool.map(process, links)
df
Nevertheless, I guess I am not specifying correctly to read_html() which the columns are, so I am getting this malformed list of lists:
Out:
[[ Form Disponibility \
0 290090 01780-500-01) Unavailable - no product available for release.
Relation \
Relation drawbacks
0 NaN Removed
1 NaN Removed ],
[ Form \
Relation \
0 American Regent is currently releasing the 0.4...
1 American Regent is currently releasing the 1mg...
drawbacks
0 Demand increase for the drug
1 Removed ,
Form \
0 0.1 mg/mL; 10 mL Luer-Jet Prefilled Syringe (N...
Disponibility Relation \
0 Product available NaN
2 Removed
3 Removed ]]
So my question is: which parameter should I adjust in order to get a flat pandas DataFrame from the above nested list? I tried header=0, index_col=0, and match='"columns"', but none of them worked. Or do I need to do the flattening when I create the DataFrame with pd.DataFrame()? My main objective is to have a pandas DataFrame with these columns:
form, Disponibility, Relation, drawbacks
1
2
...
n
IIUC you can do it this way:
First, you want to return a concatenated DataFrame instead of a list of DataFrames (read_html returns a list of DataFrames):
def process(url):
    return pd.concat(pd.read_html(url), ignore_index=False)
and then concatenate them for all URLs:
df = pd.concat(pool.map(process, links), ignore_index=True)
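Putting both pieces together as a self-contained sketch (the link1.com style URLs are the question's placeholders; the __main__ guard matters when multiprocessing uses the spawn start method):
import multiprocessing
import pandas as pd

def process(url):
    # read every table on the page and stack them into one frame
    return pd.concat(pd.read_html(url), ignore_index=False)

if __name__ == '__main__':
    links = ['link1.com', 'link2.com', 'link3.com']  # placeholders from the question
    with multiprocessing.Pool(processes=6) as pool:
        df = pd.concat(pool.map(process, links), ignore_index=True)
    print(df.head())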
pandas provides a useful to_html() to convert a DataFrame into an HTML table. Is there any useful function to read it back into a DataFrame?
The read_html utility was released in pandas 0.12.
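So with a recent enough pandas, a round trip looks roughly like this (a sketch; read_html returns a list of DataFrames, one per table in the HTML):
import numpy as np
import pandas as pd
from io import StringIO

df = pd.DataFrame(np.random.rand(4, 5), columns=list('abcde'))
html = df.to_html()
# take the first (only) table and reuse the first column as the index
# to recover the original frame
df2 = pd.read_html(StringIO(html), index_col=0)[0]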
In the general case it is not possible, but if you approximately know the structure of your table you could do something like this:
# Create a test df:
>>> import numpy as np
>>> from pandas import DataFrame
>>> df = DataFrame(np.random.rand(4, 5), columns=list('abcde'))
>>> df
a b c d e
0 0.675006 0.230464 0.386991 0.422778 0.657711
1 0.250519 0.184570 0.470301 0.811388 0.762004
2 0.363777 0.715686 0.272506 0.124069 0.045023
3 0.657702 0.783069 0.473232 0.592722 0.855030
Now parse the html and reconstruct:
from pyquery import PyQuery as pq
d = pq(df.to_html())
columns = d('thead tr').eq(0).text().split()
n_rows = len(d('tbody tr'))
values = np.array(d('tbody tr td').text().split(), dtype=float).reshape(n_rows, len(columns))
>>> DataFrame(values, columns=columns)
a b c d e
0 0.675006 0.230464 0.386991 0.422778 0.657711
1 0.250519 0.184570 0.470301 0.811388 0.762004
2 0.363777 0.715686 0.272506 0.124069 0.045023
3 0.657702 0.783069 0.473232 0.592722 0.855030
You could extend it for MultiIndex dfs or automatic type detection using eval() if needed.
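For instance, a hedged sketch of per-column type detection using pd.to_numeric (a safer swap-in for eval()):
import pandas as pd

def coerce_numeric(df):
    # try to convert each column to numbers; leave it unchanged if that fails
    out = df.copy()
    for col in out.columns:
        try:
            out[col] = pd.to_numeric(out[col])
        except (ValueError, TypeError):
            pass
    return out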