Remove NaN from table data in Python?

I'm using BS4 to pull a table from an HTML webpage and add it to a pandas DataFrame, but the result is very messy and I can't get it to print properly. Can anyone help?
There is only one table on the webpage. This is the code I'm using and what it pulls.
soup = BeautifulSoup(driver.page_source,'html.parser')
df = pd.read_html(str(soup))
print (df)
results:
[ Unnamed: 0 Student Number Student Name Placement Date
0 NaN 20808456 Sandy Gurlow 01/13/2023
1 NaN NaN NaN NaN]
But I've tried to use:
df.dropna(inplace=True)
And I get this error:
AttributeError: 'list' object has no attribute 'dropna'

pandas.read_html returns a list of dataframes, with as many dataframes as it found tables in the input.
You need to use:
df = pd.read_html(driver.page_source)[0]
Or, to avoid IndexError in case of no table:
l = pd.read_html(driver.page_source)
if l:
    df = l[0]
else:
    print('no table found')
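Once df is a single DataFrame rather than a list, dropna works as expected. A small sketch that drops the fully empty trailing row, and also the all-NaN "Unnamed: 0" column, could look like this:
df = pd.read_html(driver.page_source)[0]
# drop rows that are entirely NaN (the trailing empty row),
# then drop columns that are entirely NaN (the "Unnamed: 0" column)
df = df.dropna(how='all').dropna(axis='columns', how='all')
print(df)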

Related

How can I update my DataFrame with new columns and rows while web scraping?

I'm trying to create a web-scraping tool that will update a DataFrame with data from multiple tables.
The page I'm working on has a base table in which every row has a link that directs you to a new URL containing a secondary table with the data I'm looking for.
My objective is to create a single DataFrame comprising all the data present in all the secondary tables of the site.
The problem is that every secondary table can have a different set of columns from the previous one, depending on whether that secondary table has a value for a specific column or not, and I cannot know all the possible column types in advance.
I've tried multiple solutions. What I'm working on at the moment is a for loop that repeatedly creates a new DataFrame from each new table and merges it into the previous one.
But I'm stuck on trying to merge the two DataFrames on all the columns they have in common.
Please forgive me if I've made amateur mistakes; I've only been using Python for a week.
#create the main DataFrame
link1 = links[0]
url_linked = url_l + link1
page_linked = requests.get(url_linked)
soup_linked = BeautifulSoup(page_linked.text, 'lxml')
table_linked = soup_linked.find('table', class_="XXXXX")
headers_link = []
headers_unique = []
for i in table_linked.find_all('th'):
    title_link = i.text
    title_link = map(str, title_link)
    headers_link.append(title_link)
headers_unique = headers_link
mydata_link = pd.DataFrame(columns=headers_link)
count = 1
for link in links:
    url_linked = url_l + link
    page_linked = requests.get(url_linked)
    soup_linked = BeautifulSoup(page_linked.text, 'lxml')
    table_linked = soup_linked.find('table', class_="table table-directory-responsive")
    row2 = []
    n_columns = len(table_linked.find_all('th'))
    #populating the main dataframe
    if count == 1:
        for j in table_linked.find_all('tr'):
            row_data = j.find_all('td')
            row = [i.text for i in row_data]
            row2.append(row)
        lenght_link = len(mydata_link)
        row2.remove(['']) #To get rid of empty rows that have no th
        mydata_link.loc[lenght_link] = row2
        print(mydata_link)
        print('Completed link ' + str(count))
        count = count + 1
    #creating the secondary DataFrame
    else:
        headers_test = []
        for i in table_linked.find_all('th'):
            title_test = i.text
            title_test = map(str, title_test)
            headers_test.append(title_test)
        mydata_temp = pd.DataFrame(columns=headers_test)
        for j in table_linked.find_all('tr'):
            row_data = j.find_all('td')
            row = [i.text for i in row_data]
            row2.append(row)
        lenght_link = len(mydata_link)
        row2.remove(['']) #To get rid of empty rows that have no th
        mydata_temp.loc[lenght_link] = row2
        print(mydata_temp)
        #merge the two DataFrames based on the unique set of columns they both have
        headers_unique = set(headers_unique).intersection(headers_test)
        mydata_link = mydata_link.merge(mydata_temp, on=[headers_unique], how='outer')
        print(mydata_link)
        print('Completed link ' + str(count))
        count = count + 1
What I need is basically a function that, given these sample DataFrames:

A  B  C
1  2  3

C  A  D  E
4  5  6  7

will return the following DataFrame:

A    B    C    D    E
1    2    3    NaN  NaN
5    NaN  4    6    7
Just concatenating all the secondary tables should do - build a list of all the secondary DataFrames, and then pd.concat(dfList).
Btw, have you considered just using .read_html instead of looping through the cells?
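For instance, for the two sample frames above, pd.concat aligns on column names and fills the gaps with NaN; a minimal sketch:
import pandas as pd
df1 = pd.DataFrame([[1, 2, 3]], columns=['A', 'B', 'C'])
df2 = pd.DataFrame([[4, 5, 6, 7]], columns=['C', 'A', 'D', 'E'])
combined = pd.concat([df1, df2], ignore_index=True)
# combined now has columns A, B, C, D, E, with NaN wherever a frame had no such column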
#create the main DataFrame
link1 = links[0]
url_linked = url_l + link1
page_linked = requests.get(url_linked)
soup_linked = BeautifulSoup(page_linked.text, 'lxml')
table_linked = soup_linked.find('table', class_="XXXXX")
if table_linked:
    primaryDf = pd.read_html(table_linked.prettify())[0]
    headers_link = [h.get_text(' ').strip() for h in table_linked.find_all('th')]
    dfList = [pd.DataFrame(columns=headers_link if headers_link else primaryDf.columns)]
else:
    primaryDf, dfList = None, []

count = 0
for link in links:
    count += 1
    url_linked = url_l + link
    page_linked = requests.get(url_linked)
    soup_linked = BeautifulSoup(page_linked.text, 'lxml')
    table_linked = soup_linked.find('table', class_="table table-directory-responsive")
    if not table_linked:
        ## to see if any response errors or redirects
        print(f'[{page_linked.status_code} {page_linked.reason} from {page_linked.url}]')
        ## print error message and move to next link
        print(f'Found no tables with required class at link#{count}', url_linked)
        continue
    tempDf = pd.read_html(table_linked.prettify())[0] ## read table as df [if found]
    ## get rid of empty rows and empty columns
    tempDf = tempDf.dropna(axis='rows', how='all').dropna(axis='columns', how='all')
    dfList.append(tempDf.loc[:]) ## .loc[:] to append a copy, not original (just in case)
    print(f'Completed link#{count} with {len(tempDf)} rows from {url_linked}')

combinedDF = pd.concat(dfList)
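If the original row labels from the individual tables aren't meaningful, passing ignore_index=True to pd.concat gives combinedDF a fresh 0..n-1 index instead of repeating each table's own index.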

How to include attributes of HTML table as a multiindex using Pandas?

I'm trying to read HTML from the following URL into a pandas dataframe:
https://antismash-db.secondarymetabolites.org/output/GCF_006385935.1/
The rendered HTML tables look like the following where there are N tables I'm interested in and 1 (the last one) that I'm not (i.e., I'm interested in the ones that don't start with "No secondary metabolite"):
When I read HTML via pandas I get 3 tables. Note, the last table from pd.read_html isn't the "No secondary metabolite" table but a concatenated table of the ones I'm interested in prefixed with "NZ_" in the header.
My question is if there is a way to include the headers of the rendered table as a multiindex?
For instance, I'm looking for a resulting table whose index includes the assembly and record identifiers as extra levels. This is my current manual approach:
# Read HTML Tables
dataframes = pd.read_html("https://antismash-db.secondarymetabolites.org/output/GCF_006385935.1/")
# Set Region as the index
dataframes = list(map(lambda df: df.set_index("Region"), dataframes))
# Manual prepending of title and table headers, respectively
dataframes[0].index = dataframes[0].index.map(lambda x: ("GCF_006385935.1", "NZ_CP041066.1", x))
dataframes[1].index = dataframes[1].index.map(lambda x: ("GCF_006385935.1", "NZ_CP041065.1", x))
# Concatenate tables
df_concat = pd.concat(dataframes[:-1], axis=0)
# Replace &nbsp characters with _
df_concat.index = df_concat.index.map(lambda x: (x[0], x[1], x[2].replace("&nbsp","_")))
# Multiindex labels
df_concat.index.names = ["level_0", "level_1", "level_2"]
df_concat
Try BeautifulSoup to parse the HTML and construct the final DataFrame:
import requests
import pandas as pd
from bs4 import BeautifulSoup
id_ = "GCF_006385935.1"
url = f"https://antismash-db.secondarymetabolites.org/output/{id_}/"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
dfs = []
for table in soup.select(".record-overview-details table"):
    header = table.find_previous(class_="record-overview-header").text.split()[0]
    df = pd.read_html(str(table))[0].assign(level_1=header, level_0=id_)
    dfs.append(df)
final_df = pd.concat(dfs)
final_df = final_df.set_index(["level_0", "level_1", "Region"])
print(final_df)
Prints:
Type From To Most similar known cluster Most similar known cluster.1 Similarity
level_0 level_1 Region
GCF_006385935.1 NZ_CP041066.1 Region&nbsp1.1 terpene 1123901 1143342 carotenoid Terpene 50%
Region&nbsp1.2 phosphonate 1252463 1293980 NaN NaN NaN
Region&nbsp1.3 T3PKS 1944360 1985445 NaN NaN NaN
Region&nbsp1.4 terpene 2690187 2709232 NaN NaN NaN
Region&nbsp1.5 terpene 4260236 4281054 surfactin NRP:Lipopeptide 13%
Region&nbsp1.6 siderophore 4446861 4463436 NaN NaN NaN
NZ_CP041065.1 Region&nbsp3.1 lanthipeptide 98352 124802 NaN NaN NaN
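The key step here is find_previous(class_="record-overview-header"), which walks backwards through the document from each table to the nearest preceding header element, so each sub-table is tagged with the record name it belongs to (e.g. NZ_CP041066.1) before the frames are concatenated and indexed.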

How to edit a DataFrame row by row while iterating?

So I am using a script to read a CSV and create a data frame from which it then scrapes price data using tickers from said data frame. The original data frame has the following columns, note NO 'Price'.
df.columns = ['Ticker TV', 'Ticker YF', 'TV Name', 'Sector', 'Industry', 'URLTV']
Here are the first couple of rows printed from my "updated" data frame:
Ticker TV Ticker YF ... URLTV Price
1 100D 100D.L ... URL NaN
2 1GIS 1GIS.L ... URL NaN
3 1MCS 1MCS.L ... URL NaN
... ... ... ... ... ...
2442 ZYT ZYT.L ...URL NaN
100D.L NaN NaN .. NaN 9272.50
1GIS.L NaN NaN ...NaN 8838.50
1MCS.L NaN NaN ...NaN 5364.00
As you can see, it's not working as intended. I would like to create a new column named Price and attach each price to the correct ticker, so 100D.L should be 9272.50; then, when the script iterates to the next ticker, it adds the next price value to 1GIS, and so forth.
tickerList = df['Ticker YF']
for tick in tickerList:
    summarySoup = getSummary(tick)
    currentPriceData = priceData(summarySoup)
    print('The Price of ' + tick + ' is ' + str(currentPriceData))
    df.at[tick, 'Price'] = currentPriceData
Assign price using apply method:
df['Price'] = df['Ticker YF'].apply(lambda x: str(priceData(getSummary(x))))
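Note that this still calls getSummary/priceData once per ticker, just like the loop, but apply writes each price back to the row the ticker came from. In the original loop, df.at[tick, 'Price'] used the ticker string as the row label, and since that label is not in the index, .at enlarges the frame, which is why rows like 100D.L were appended to the bottom instead of filling the existing rows.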
tick is just the value from your 'Ticker YF' column, so you can use enumerate to also get the index. And if you want to access the previous price to add them up, you can then just use idx-1.
tickerList = df['Ticker YF']
for idx, tick in enumerate(tickerList):
    summarySoup = getSummary(tick)
    currentPriceData = priceData(summarySoup)
    print('The Price of ' + tick + ' is ' + str(currentPriceData))
    if idx != 0:
        df.at[idx + 1, 'Price'] = float(currentPriceData) + float(df.at[idx, 'Price'])
    else:
        df.at[idx + 1, 'Price'] = float(currentPriceData)
A more "elegant" idea could be something like:
df["Single_Price"]=df["Ticker YF"].apply(lambda x: priceData(getSummary(x)))
to get the value of the single prices. And then create the next column with the added prices:
df["Price"]=df["Ticker"].apply(lambda x: df["Single_Price"][df["Ticker"]<x["Ticker"]].sum())
this will add up every Single_Price (df["Single_Price"]) from every row that is before your current row Ticker x (df["Ticker"] < x["Ticker"]) and creates a new column Price in your dataframe.
after that cou can simply delete the single prices if you don't need them with:
del df["Single_Price"]

How to reindex malformed columns retrieved from pandas read_html?

I am retrieving some content from a website which has several tables with the same number of columns, using pandas read_html. When I read a single link that has several tables with the same number of columns, pandas effectively reads all the tables as one (something like a flat/normalized table). However, I want to do the same for a list of links from the website (i.e. a single flat table for several links), so I tried the following:
In:
import multiprocessing
def process(url):
    df_url = pd.read_html(url)
    df = pd.concat(df_url, ignore_index=False)
    return df_url
links = ['link1.com','link2.com','link3.com',...,'linkN.com']
pool = multiprocessing.Pool(processes=6)
df = pool.map(process, links)
df
Nevertheless, I guess I am not correctly specifying to read_html() which are the columns, so I am getting this malformed list of lists:
Out:
[[ Form Disponibility \
0 290090 01780-500-01) Unavailable - no product available for release.
Relation \
Relation drawbacks
0 NaN Removed
1 NaN Removed ],
[ Form \
Relation \
0 American Regent is currently releasing the 0.4...
1 American Regent is currently releasing the 1mg...
drawbacks
0 Demand increase for the drug
1 Removed ,
Form \
0 0.1 mg/mL; 10 mL Luer-Jet Prefilled Syringe (N...
Disponibility Relation \
0 Product available NaN
2 Removed
3 Removed ]]
So my question is: which parameter should I change in order to get a flat pandas DataFrame from the above nested list? I tried header=0, index_col=0, match='"columns"'; none of them worked. Or do I need to do the flattening when I create the pandas DataFrame with pd.DataFrame()? My main objective is to have a pandas DataFrame with these columns:
Form, Disponibility, Relation, drawbacks
1
2
...
n
IIUC you can do it this way:
First, you want to return the concatenated DF instead of a list of DFs (as read_html returns a list of DFs):
def process(url):
    return pd.concat(pd.read_html(url), ignore_index=False)
and then concatenate them for all URLs:
df = pd.concat(pool.map(process, links), ignore_index=True)
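Putting the two pieces together (keeping the placeholder URLs from the question), a minimal sketch could look like this:
import multiprocessing
import pandas as pd

def process(url):
    # read_html returns one DataFrame per <table> on the page; flatten them per URL
    return pd.concat(pd.read_html(url), ignore_index=False)

if __name__ == '__main__':
    links = ['link1.com', 'link2.com', 'link3.com']  # placeholder URLs from the question
    with multiprocessing.Pool(processes=6) as pool:
        frames = pool.map(process, links)
    df = pd.concat(frames, ignore_index=True)
    print(df)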

Extracting data from a column in a data frame in Python

I want to extract the rows where this column is "A". After doing that, I want to be able to print the data from the other columns in those same rows.
However, my code printed this instead:
outputs:
UniqueCarrier NaN
CancellationCode NaN
Name: CancellationCode, dtype: object
None
The column CancellationCode looks like this:
CancellationCode:
NaN
A
NaN
B
NaN
I want to get it to print in a data frame format with the filtered rows and columns.
Here is my code below:
cancellation_reason = (flight_data_finalcopy["CancellationCode"] == "A")
cancellation_reasons_filtered = cancellation_reason[["UniqueCarrier", "AirlineID", "Origin"]]
print(display(cancellation_reasons_filtered))
Try this:
cancellation_reason=flight_data_finalcopy[flight_data_finalcopy["CancellationCode"] == "A"]
cancellation_reasons_filtered = cancellation_reason[["UniqueCarrier", "AirlineID", "Origin"]]
print(display(cancellation_reasons_filtered))
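The key difference is that flight_data_finalcopy["CancellationCode"] == "A" only builds a boolean mask; you still need to use it to index the frame. An equivalent .loc version (using the same frame and columns as the question) would be:
mask = flight_data_finalcopy["CancellationCode"] == "A"
cancellation_reasons_filtered = flight_data_finalcopy.loc[mask, ["UniqueCarrier", "AirlineID", "Origin"]]
display(cancellation_reasons_filtered)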
