I am retrieving some content from a website that has several tables with the same number of columns, using pandas read_html. When I read a single link that has several tables with the same number of columns, pandas effectively reads all the tables as one (something like a flat/normalized table). However, I am interested in doing the same for a list of links from a website (i.e. a single flat table built from several links), so I tried the following:
In:
import multiprocessing
import pandas as pd

def process(url):
    df_url = pd.read_html(url)
    df = pd.concat(df_url, ignore_index=False)
    return df_url

links = ['link1.com','link2.com','link3.com',...,'linkN.com']
pool = multiprocessing.Pool(processes=6)
df = pool.map(process, links)
df
Nevertheless, I guess I am not correctly telling read_html() which columns to use, so I am getting this malformed list of lists:
Out:
[[ Form Disponibility \
0 290090 01780-500-01) Unavailable - no product available for release.
Relation \
Relation drawbacks
0 NaN Removed
1 NaN Removed ],
[ Form \
Relation \
0 American Regent is currently releasing the 0.4...
1 American Regent is currently releasing the 1mg...
drawbacks
0 Demand increase for the drug
1 Removed ,
Form \
0 0.1 mg/mL; 10 mL Luer-Jet Prefilled Syringe (N...
Disponibility Relation \
0 Product available NaN
2 Removed
3 Removed ]]
So my question is: which parameter should I use in order to get a flat pandas dataframe from the above nested list? I tried header=0, index_col=0, and match='"columns"', but none of them worked. Or do I need to do the flattening when I create the pandas dataframe with pd.DataFrame()? My main objective is to have a pandas dataframe with these columns:
form, Disponibility, Relation, drawbacks
1
2
...
n
IIUC you can do it this way:
first you want to return a concatenated DF instead of a list of DFs (read_html returns a list of DFs):
def process(url):
    return pd.concat(pd.read_html(url), ignore_index=False)
and then concatenate them for all URLs:
df = pd.concat(pool.map(process, links), ignore_index=True)
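Putting both pieces together, a minimal end-to-end sketch (the link1.com/link2.com URLs are just the placeholders from the question, and the __main__ guard is there because multiprocessing needs it on some platforms):
import multiprocessing
import pandas as pd

def process(url):
    # read_html returns a list of DataFrames; concatenate them into one per URL
    return pd.concat(pd.read_html(url), ignore_index=False)

if __name__ == '__main__':
    links = ['link1.com', 'link2.com', 'link3.com']  # placeholder URLs
    with multiprocessing.Pool(processes=6) as pool:
        per_link = pool.map(process, links)
    # one flat DataFrame across all links
    df = pd.concat(per_link, ignore_index=True)
    print(df)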
I'm trying to read HTML from the following URL into a pandas dataframe:
https://antismash-db.secondarymetabolites.org/output/GCF_006385935.1/
The rendered HTML tables look like the following where there are N tables I'm interested in and 1 (the last one) that I'm not (i.e., I'm interested in the ones that don't start with "No secondary metabolite"):
When I read HTML via pandas I get 3 tables. Note, the last table from pd.read_html isn't the "No secondary metabolite" table but a concatenated table of the ones I'm interested in prefixed with "NZ_" in the header.
My question is if there is a way to include the headers of the rendered table as a multiindex?
For instance, I'm looking for a resulting table that looks like this:
# Read HTML Tables
dataframes = pd.read_html("https://antismash-db.secondarymetabolites.org/output/GCF_006385935.1/")
# Set Region as the index
dataframes = list(map(lambda df: df.set_index("Region"), dataframes))
# Manual prepending of title and table headers, respectively
dataframes[0].index = dataframes[0].index.map(lambda x: ("GCF_006385935.1", "NZ_CP041066.1", x))
dataframes[1].index = dataframes[1].index.map(lambda x: ("GCF_006385935.1", "NZ_CP041065.1", x))
# Concatenate tables
df_concat = pd.concat(dataframes[:-1], axis=0)
# Replace space characters with _
df_concat.index = df_concat.index.map(lambda x: (x[0], x[1], x[2].replace(" ","_")))
# Multiindex labels
df_concat.index.names = ["level_0", "level_1", "level_2"]
df_concat
Try beautifulsoup to parse the HTML and construct the final dataframe:
import requests
import pandas as pd
from bs4 import BeautifulSoup

id_ = "GCF_006385935.1"
url = f"https://antismash-db.secondarymetabolites.org/output/{id_}/"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

dfs = []
for table in soup.select(".record-overview-details table"):
    header = table.find_previous(class_="record-overview-header").text.split()[0]
    df = pd.read_html(str(table))[0].assign(level_1=header, level_0=id_)
    dfs.append(df)

final_df = pd.concat(dfs)
final_df = final_df.set_index(["level_0", "level_1", "Region"])
print(final_df)
Prints:
Type From To Most similar known cluster Most similar known cluster.1 Similarity
level_0 level_1 Region
GCF_006385935.1 NZ_CP041066.1 Region 1.1 terpene 1123901 1143342 carotenoid Terpene 50%
Region 1.2 phosphonate 1252463 1293980 NaN NaN NaN
Region 1.3 T3PKS 1944360 1985445 NaN NaN NaN
Region 1.4 terpene 2690187 2709232 NaN NaN NaN
Region 1.5 terpene 4260236 4281054 surfactin NRP:Lipopeptide 13%
Region 1.6 siderophore 4446861 4463436 NaN NaN NaN
NZ_CP041065.1 Region 3.1 lanthipeptide 98352 124802 NaN NaN NaN
I've got a pandas dataframe like this:
Title Pgid Views30 Title_to
===============================================================================
30 Хо_Ен_Чон 9048639 284950 Чон_Хо_Ён
98 Mail.ru_Group 9018641 153082 VK_(компания)
105 Паша_Техник 9070663 143053 Техник,_Паша
303 Freeware 6166716 79998 Бесплатное_программное_обеспечение
399 СССР 1007 69349 Союз_Советских_Социалистических_Республик
The data contains over 1.2 million entries from Wikipedia page data:
Title = page title
Pgid = page ID
Views30 = monthly page views
Title_to = title of the page that this page redirects to (or NaN if no redirect)
Now I want to make a new column Pgid_to with the page IDs of the redirect target pages for all pages with Title_to != NaN. That is, collect Pgid from Title = Title_to for all entries.
My current solution is straightforward:
def cond(title_to):
    try:
        # get Pgid of page whose title == title_to
        return df.loc[df['Title'] == title_to, 'Pgid']
    except:
        # return NaN on failure to locate
        return np.NaN

# make new column by applying search element-wise
df['Pgid_to'] = df['Title_to'].apply(cond)
However, this algorithm is likely to take quadratic time (N^2), which for 1.2 MM entries means about 1.4 trillion operations! Is it possible to optimize this? Is there, perhaps, a vectorized solution?
np.where() is vectorized and hopefully will save the day. Kindly try:
df['Pgid_to'] = np.where(df['Title'] == df['Title_to'], df['Pgid'], np.nan)
If you want to compare it against nan:
df['Pgid_to'] = np.where(df['Title_to'].isna(), np.nan, df['Pgid'])
OK, found a solution by simply merging the dataframe with itself!
df_merged = df.merge(df, 'left', left_on='Title',
                     right_on='Title_to', suffixes=(None, '_from'))
This produces a new dataframe with the 'Pgid_from' column (among others), which can then be used to group the data.
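For illustration, here is a tiny reproducible sketch of that self-merge on toy data (column names taken from the question; the values are made up):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Title':    ['A', 'B', 'C'],
    'Pgid':     [1, 2, 3],
    'Views30':  [10, 20, 30],
    'Title_to': ['C', np.nan, np.nan],  # page A redirects to page C
})

df_merged = df.merge(df, 'left', left_on='Title',
                     right_on='Title_to', suffixes=(None, '_from'))

# Row 'C' now carries Pgid_from == 1, i.e. the ID of the page redirecting to it;
# rows with no incoming redirect get NaN in the *_from columns.
print(df_merged[['Title', 'Pgid', 'Pgid_from']])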
I am new to data science; your help is appreciated. My question is about grouping a dataframe by its columns so that a bar chart can be plotted showing the status counts per subject.
my csv file is something like this
Name,Maths,Science,English,sports
S1,Pass,Fail,Pass,Pass
S2,Pass,Pass,NA,Pass
S3,Pass,Fail,Pass,Pass
S4,Pass,Pass,Pass,NA
S5,Pass,Fail,Pass,NA
expected o/p:
Subject,Status,Count
Maths,Pass,5
Science,Pass,2
Science,Fail,3
English,Pass,4
English,NA,1
Sports,Pass,3
Sports,NA,2
You can do this with pandas, not exactly in the same output format in the question, but definitely having the same information:
import pandas as pd
# reading csv
df = pd.read_csv("input.csv")
# turning columns into rows
melt_df = pd.melt(df, id_vars=['Name'], value_vars=['Maths', 'Science', "English", "sports"], var_name="Subject", value_name="Status")
# filling NaN values, otherwise the below groupby will ignore them.
melt_df = melt_df.fillna("Unknown")
# counting per group of subject and status.
result_df = melt_df.groupby(["Subject", "Status"]).size().reset_index(name="Count")
Then you get the following result:
Subject Status Count
0 English Pass 4
1 English Unknown 1
2 Maths Pass 5
3 Science Fail 3
4 Science Pass 2
5 sports Pass 3
6 sports Unknown 2
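Since the end goal was a bar chart per subject, a possible follow-up (just a sketch, assuming matplotlib is available) is to pivot the counts and plot them:
import matplotlib.pyplot as plt

# one bar group per subject, one bar per status
result_df.pivot(index="Subject", columns="Status", values="Count").plot.bar()
plt.show()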
PS: Going forward, always include the code you've tried so far.
To match exactly your output, this is what you could do:
import pandas as pd

df = pd.read_csv('c:/temp/data.csv')  # Or wherever your csv file is
subjects = ['Maths', 'Science', 'English', 'sports']  # Or you could get that as df.columns and drop 'Name'

grouped_rows = []
for eachsub in subjects:
    rows = df.groupby(eachsub)['Name'].count()
    idx = list(rows.index)
    if 'Pass' in idx:
        grouped_rows.append([eachsub, 'Pass', rows['Pass']])
    if 'Fail' in idx:
        grouped_rows.append([eachsub, 'Fail', rows['Fail']])

new_df = pd.DataFrame(grouped_rows, columns=['Subject', 'Grade', 'Count'])
print(new_df)
I would suggest, though, avoiding the for loop; my approach would be just these two lines:
subjects = ['Maths', 'Science' , 'English' , 'sports']
grouped_rows = df.groupby(eachsub)['Name'].count()
Depending on your application, the data you need may already be available in grouped_rows.
I have downloaded the ASCAP database, giving me a CSV that is too large for Excel to handle. I'm able to chunk the CSV to open parts of it, the problem is that the data isn't super helpful in its default format. Each song title has 3+ rows associated with it:
The first row include the % share that ASCAP has in that song.
The rows after that include a character code (ROLE_TYPE) that indicates if that row contains the writer or performer of that song.
The first column of each row contains a song title.
This structure makes the data confusing because on the rows that list the % share there are blank cells in the NAME column because that row does not have a Writer/Performer associated with it.
What I would like to do is transform this data from having 3+ rows per song to having 1 row per song with all relevant data.
So instead of:
TITLE, ROLE_TYPE, NAME, SHARES, NOTE
I would like to change the data to:
TITLE, WRITER, PERFORMER, SHARES, NOTE
Here is a sample of the data:
TITLE,ROLE_TYPE,NAME,SHARES,NOTE
SCORE MORE,ASCAP,Total Current ASCAP Share,100,
SCORE MORE,W,SMITH ANTONIO RENARD,,
SCORE MORE,P,SMITH SHOW PUBLISHING,,
PEOPLE KNO,ASCAP,Total Current ASCAP Share,100,
PEOPLE KNO,W,SMITH ANTONIO RENARD,,
PEOPLE KNO,P,SMITH SHOW PUBLISHING,,
FEEDBACK,ASCAP,Total Current ASCAP Share,100,
FEEDBACK,W,SMITH ANTONIO RENARD,,
I would like the data to look like:
TITLE, WRITER, PERFORMER, SHARES, NOTE
SCORE MORE, SMITH ANTONIO RENARD, SMITH SHOW PUBLISHING, 100,
PEOPLE KNO, SMITH ANTONIO RENARD, SMITH SHOW PUBLISHING, 100,
FEEDBACK, SMITH ANTONIO RENARD, SMITH SHOW PUBLISHING, 100,
I'm using python/pandas to try and work with the data. I am able to use groupby('TITLE') to group rows with matching titles.
import pandas as pd

data = pd.read_csv("COMMA_ASCAP_TEXT.txt", low_memory=False)

title_grouped = data.groupby('TITLE')

for TITLE, group in title_grouped:
    print(TITLE)
    print(group)
I was able to groupby('TITLE') of each song, and the output I get seems close to what I want:
SCORE MORE
TITLE ROLE_TYPE NAME SHARES NOTE
0 SCORE MORE ASCAP Total Current ASCAP Share 100.0 NaN
1 SCORE MORE W SMITH ANTONIO RENARD NaN NaN
2 SCORE MORE P SMITH SHOW PUBLISHING NaN NaN
What do I need to do to take this group and produce a single row in a CSV file with all the data related to each song?
I would recommend:
Decompose the data by the ROLE_TYPE
Prepare the data for merge (rename columns and drop unnecessary columns)
Merge everything back into one DataFrame
Merge will be automatically performed over the column which has the same name in the DataFrames being merged (TITLE in this case).
Seems to work nicely :)
import pandas as pd

data = pd.read_csv("data2.csv", sep=",")
# Create 3 individual DataFrames for different roles
data_ascap = data[data["ROLE_TYPE"] == "ASCAP"].copy()
data_writer = data[data["ROLE_TYPE"] == "W"].copy()
data_performer = data[data["ROLE_TYPE"] == "P"].copy()
# Remove unnecessary columns for ASCAP role
data_ascap.drop(["ROLE_TYPE", "NAME"], axis=1, inplace=True)
# Rename columns and remove unnecessary columns for WRITER role
data_writer.rename(index=str, columns={"NAME": "WRITER"}, inplace=True)
data_writer.drop(["ROLE_TYPE", "SHARES", "NOTE"], axis=1, inplace=True)
# Rename columns and remove unnecessary columns for PERFORMER role
data_performer.rename(index=str, columns={"NAME": "PERFORMER"}, inplace=True)
data_performer.drop(["ROLE_TYPE", "SHARES", "NOTE"], axis=1, inplace=True)
# Merge all together
result = data_ascap.merge(data_writer, how="left")
result = result.merge(data_performer, how="left")
# Print result
print(result)
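For comparison, a more compact sketch using pivot instead of the three merges (an alternative, not part of the answer above; it assumes each title has at most one W row and one P row, otherwise pivot raises on duplicates):
import pandas as pd

data = pd.read_csv("data2.csv", sep=",")

# NAME spread into one column per ROLE_TYPE (ASCAP / W / P)
names = data.pivot(index="TITLE", columns="ROLE_TYPE", values="NAME")
shares = data[data["ROLE_TYPE"] == "ASCAP"].set_index("TITLE")[["SHARES", "NOTE"]]

result = (
    shares.join(names[["W", "P"]])
          .rename(columns={"W": "WRITER", "P": "PERFORMER"})
          .reset_index()
)
print(result)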
I have 2 datasets in CSV files; using pandas, each file is converted into a separate dataframe.
I want to find similar companies based on their URL. I'm able to find similar companies based on 1 field (Rule 1), but I want to compare more efficiently, as follows:
Dataset 1
uuid, company_name, website
YAHOO,Yahoo,yahoo.com
CSCO,Cisco,cisco.com
APPL,Apple,
Dataset 2
company_name, company_website, support_website, privacy_website
Yahoo,,yahoo.com,yahoo.com
Google,google.com,,
Cisco,,,cisco.com
Result Dataset
company_name, company_website, support_website, privacy_website, uuid
Yahoo,,yahoo.com,yahoo.com,YAHOO
Google,google.com,,,
Cisco,,,cisco.com,CSCO
Dataset1 contains ~50K records.
Dataset2 contains ~4M records.
Rules
If field website in dataset 1 is the same as field company_website in dataset 2, extract the identifier.
If there is no match, check whether field website in dataset 1 is the same as field support_website in dataset 2, and extract the identifier.
If there is no match, check whether field website in dataset 1 is the same as field privacy_website in dataset 2, and extract the identifier.
If there is no match, check whether field company_name in dataset 1 is the same as field company_name in dataset 2, and extract the identifier.
If there is no match, return the record with an empty identifier field (UUID).
Here is my current function:
def MatchCompanies(
        companies: pandas.DataFrame,
        competitor_companies: pandas.DataFrame) -> Optional[Sequence[str]]:
    """Find Competitor companies in companies dataframe and generate a new list.

    Args:
        companies: A dataframe with company information from CSV file.
        competitor_companies: A dataframe with Competitor information from CSV file.

    Returns:
        A sequence of matched companies and their UUID.

    Raises:
        ValueError: No companies found.
    """
    if _IsEmpty(companies):
        raise ValueError('No companies found')
    # Clean up empty fields. Use extra space to avoid matching on empty TLD.
    companies.fillna({'website': ' '}, inplace=True)
    competitor_companies = competitor_companies.fillna('')
    logging.info('Found: %d records.', len(competitor_companies))
    # Rename column to TLD to compare matching companies.
    companies.rename(columns={'website': 'tld'}, inplace=True)
    logging.info('Cleaning up company name.')
    companies.company_name = companies.company_name.apply(_NormalizeText)
    competitor_companies.company_name = competitor_companies.company_name.apply(
        _NormalizeText)
    # Rename column to TLD since Competitor already contains TLD in company_website.
    competitor_companies.rename(columns={'company_website': 'tld'}, inplace=True)
    logging.info('Extracting UUID')
    merge_tld = competitor_companies.merge(
        companies[['tld', 'uuid']], on='tld', how='left')
    # Extracts UUID for company name matches.
    competitor_companies = competitor_companies.merge(
        companies[['company_name', 'uuid']], on='company_name', how='left')
    # Combines dataframes.
    competitor_companies['uuid'] = competitor_companies['uuid'].combine_first(
        merge_tld['uuid'])
    match_companies = len(
        competitor_companies[competitor_companies['uuid'].notnull()])
    total_companies = len(competitor_companies)
    logging.info('Results found: %d out of %d', match_companies, total_companies)
    competitor_companies.rename(columns={'tld': 'company_website'}, inplace=True)
    return competitor_companies
Looking for advice on which function to use.
Use map by Series with combine_first, but one requirement is necessary - always unique values in df1['website'] and df1['company_name']:
df1 = df1.dropna()
s1 = df1.set_index('website')['uuid']
s2 = df1.set_index('company_name')['uuid']
w1 = df2['company_website'].map(s1)
w2 = df2['support_website'].map(s1)
w3 = df2['privacy_website'].map(s1)
c = df2['company_name'].map(s2)
df2['uuid'] = w1.combine_first(w2).combine_first(w3).combine_first(c)
print (df2)
company_name company_website support_website privacy_website uuid
0 Yahoo NaN yahoo.com yahoo.com YAHOO
1 Google google.com NaN NaN NaN
2 Cisco NaN NaN cisco.com CSCO
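As a follow-up on the uniqueness requirement: if df1 can contain duplicate websites or company names, the map calls above can raise on the duplicated index, so you may need to deduplicate first (a sketch, assuming keeping the first occurrence is acceptable):
s1 = (df1.dropna(subset=['website'])
         .drop_duplicates('website')
         .set_index('website')['uuid'])
s2 = df1.drop_duplicates('company_name').set_index('company_name')['uuid']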
Take a look at dataframe.merge. Rename the third column in A to company_website and do something like:
A.merge(B, on='company_website', indicator=True)
which should at least take care of the first rule.
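Expanded into a runnable sketch of that suggestion (A stands for dataset 1 and B for dataset 2, as in this answer; the rename just implements the "rename the third column" step described above):
# A = dataset 1, B = dataset 2
A = A.rename(columns={'website': 'company_website'})

# indicator=True adds a _merge column showing which rows matched on Rule 1
result = A.merge(B, on='company_website', indicator=True)
print(result)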