How to convert an HTML table into a pandas DataFrame - Python

pandas provides a useful to_html() method to convert a DataFrame into an HTML table. Is there a function to read it back into a DataFrame?

The read_html utility, released in pandas 0.12, does exactly this.
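A minimal round-trip sketch (read_html returns a list of DataFrames, one per <table> it finds):
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
html = df.to_html()

# index_col=0 restores the original index from the first column
df2 = pd.read_html(html, index_col=0)[0]
print(df2)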

In the general case it is not possible, but if you approximately know the structure of your table you could do something like this:
# Create a test df:
>>> import numpy as np
>>> from pandas import DataFrame
>>> df = DataFrame(np.random.rand(4, 5), columns=list('abcde'))
>>> df
a b c d e
0 0.675006 0.230464 0.386991 0.422778 0.657711
1 0.250519 0.184570 0.470301 0.811388 0.762004
2 0.363777 0.715686 0.272506 0.124069 0.045023
3 0.657702 0.783069 0.473232 0.592722 0.855030
Now parse the html and reconstruct:
import numpy as np
from pyquery import PyQuery as pq
d = pq(df.to_html())
# Column names come from the first header row
columns = d('thead tr').eq(0).text().split()
n_rows = len(d('tbody tr'))
# Flatten all body cells, then reshape to the table's dimensions
values = np.array(d('tbody tr td').text().split(), dtype=float).reshape(n_rows, len(columns))
>>> DataFrame(values, columns=columns)
a b c d e
0 0.675006 0.230464 0.386991 0.422778 0.657711
1 0.250519 0.184570 0.470301 0.811388 0.762004
2 0.363777 0.715686 0.272506 0.124069 0.045023
3 0.657702 0.783069 0.473232 0.592722 0.855030
You could extend it for MultiIndex DataFrames or add automatic type detection using eval() if needed.
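For instance, a rough sketch of per-cell type detection, using ast.literal_eval (which is safer than eval() for untrusted HTML):
from ast import literal_eval

def coerce(cell):
    # Best-effort conversion of a cell string to a Python literal;
    # fall back to the raw string otherwise
    try:
        return literal_eval(cell)
    except (ValueError, SyntaxError):
        return cell

print([coerce(c) for c in ['0.675006', '42', 'hello']])
# -> [0.675006, 42, 'hello']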

Related

.csv file data checking using python

I have a .csv file with the following data
Roll,Subject,Marks,Pass_Fail
1,A,50,P
1,B,50,P
1,C,30,F
1,D,50,P
2,A,40,P
2,B,30,F
2,C,30,F
2,D,50,P
3,A,50,P
3,B,30,F
3,C,40,P
3,D,20,F
4,A,50,P
4,B,50,P
4,C,50,P
4,D,50,P
Now, I would like to check if any person has failed in B and also in C or D.
Output -
2,B,30,F
2,C,30,F
3,B,30,F
3,D,20,F
I am new to Python. I have used pandas, but I am only able to get the unique Roll values.
My code is as below:
import pandas as pd

dataFrame = pd.read_csv("./students.csv")
unique_rolls = dataFrame['Roll'].unique()
for roll in unique_rolls:
    if (dataFrame['Pass_Fail'] == 'F').any():
        print(dataFrame)
"to check if any person has failed in both B & C or D"
Use dataframe filtering on specific conditions: filter the failed rows in B, C and D, then keep only the rolls that failed B as well as C or D:
failed = df[df['Subject'].isin(['B', 'C', 'D']) & df['Pass_Fail'].eq('F')]
rolls = set(failed.loc[failed['Subject'].eq('B'), 'Roll']) & set(failed.loc[failed['Subject'].ne('B'), 'Roll'])
print(failed[failed['Roll'].isin(rolls)])
    Roll Subject  Marks Pass_Fail
5      2       B     30         F
6      2       C     30         F
9      3       B     30         F
11     3       D     20         F
import pandas as pd

df = pd.read_csv("./students.csv")
# Rolls that failed subject B
failed_b = df[df['Subject'].eq('B') & df['Pass_Fail'].eq('F')]
# Rolls that failed subject C or D
failed_cd = df[df['Subject'].isin(['C', 'D']) & df['Pass_Fail'].eq('F')]
# Keep only the rolls that appear in both sets of failures
rolls = set(failed_b['Roll']) & set(failed_cd['Roll'])
result = pd.concat([failed_b, failed_cd])
print(result[result['Roll'].isin(rolls)].sort_values(['Roll', 'Subject']))
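An alternative sketch using groupby/filter on the failed rows only (assuming the column names from the CSV above):
import pandas as pd

df = pd.read_csv("./students.csv")
fails = df[df['Pass_Fail'].eq('F')]

# Keep a roll's failed rows only if it failed B plus at least one of C or D
def failed_b_and_cd(group):
    subjects = set(group['Subject'])
    return 'B' in subjects and bool({'C', 'D'} & subjects)

print(fails.groupby('Roll').filter(failed_b_and_cd))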

How to include attributes of HTML table as a multiindex using Pandas?

I'm trying to read HTML from the following URL into a pandas dataframe:
https://antismash-db.secondarymetabolites.org/output/GCF_006385935.1/
The rendered HTML page contains N tables I'm interested in and 1 (the last one) that I'm not (i.e., I'm interested in the ones that don't start with "No secondary metabolite").
When I read HTML via pandas I get 3 tables. Note, the last table from pd.read_html isn't the "No secondary metabolite" table but a concatenated table of the ones I'm interested in prefixed with "NZ_" in the header.
My question is if there is a way to include the headers of the rendered table as a multiindex?
For instance, I'm looking for a resulting table like the one produced by this manual approach:
# Read HTML tables
import pandas as pd
dataframes = pd.read_html("https://antismash-db.secondarymetabolites.org/output/GCF_006385935.1/")
# Set Region as the index
dataframes = list(map(lambda df: df.set_index("Region"), dataframes))
# Manual prepending of title and table headers, respectively
dataframes[0].index = dataframes[0].index.map(lambda x: ("GCF_006385935.1", "NZ_CP041066.1", x))
dataframes[1].index = dataframes[1].index.map(lambda x: ("GCF_006385935.1", "NZ_CP041065.1", x))
# Concatenate tables
df_concat = pd.concat(dataframes[:-1], axis=0)
# Replace &nbsp characters with _
df_concat.index = df_concat.index.map(lambda x: (x[0], x[1], x[2].replace("&nbsp","_")))
# Multiindex labels
df_concat.index.names = ["level_0", "level_1", "level_2"]
df_concat
Try beautifulsoup to parse the HTML and construct the final dataframe:
import requests
import pandas as pd
from bs4 import BeautifulSoup

id_ = "GCF_006385935.1"
url = f"https://antismash-db.secondarymetabolites.org/output/{id_}/"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

dfs = []
for table in soup.select(".record-overview-details table"):
    # The record header (e.g. "NZ_CP041066.1") precedes each table
    header = table.find_previous(class_="record-overview-header").text.split()[0]
    df = pd.read_html(str(table))[0].assign(level_1=header, level_0=id_)
    dfs.append(df)
final_df = pd.concat(dfs)
final_df = final_df.set_index(["level_0", "level_1", "Region"])
print(final_df)
Prints:
Type From To Most similar known cluster Most similar known cluster.1 Similarity
level_0 level_1 Region
GCF_006385935.1 NZ_CP041066.1 Region&nbsp1.1 terpene 1123901 1143342 carotenoid Terpene 50%
Region&nbsp1.2 phosphonate 1252463 1293980 NaN NaN NaN
Region&nbsp1.3 T3PKS 1944360 1985445 NaN NaN NaN
Region&nbsp1.4 terpene 2690187 2709232 NaN NaN NaN
Region&nbsp1.5 terpene 4260236 4281054 surfactin NRP:Lipopeptide 13%
Region&nbsp1.6 siderophore 4446861 4463436 NaN NaN NaN
NZ_CP041065.1 Region&nbsp3.1 lanthipeptide 98352 124802 NaN NaN NaN
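To clean up the non-breaking-space remnants in the Region labels afterwards, something like this sketch should work (depending on the parser, the remnant may be a literal "\xa0" character rather than the text "&nbsp"):
# Rebuild the Region level with the non-breaking spaces replaced
final_df.index = final_df.index.set_levels(
    final_df.index.levels[2].str.replace("\xa0", "_"), level=2
)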

Python Parse a Text file and convert it to a dataframe

I need help parsing a specific string from this text file and then converting it to a dataframe.
I am trying to parse this portion of the text file:
Graph Stats for Max-Clique:
|V|: 566834
|E|: 659570
d_max: 8
d_avg: 2
p: 4.10563e-06
|T|: 31315
T_avg: 0
T_max: 5
cc_avg: 0.0179651
cc_global: 0.0281446
After parsing the text file, I need to make it into a dataframe where the columns are |V|, |E|, |T|, T_avg, T_max, cc_avg, and cc_global. Please advise! Thanks :)
You can read this directly into a pandas dataframe via pd.read_csv. Just remember to use an appropriate sep parameter (a multi-character separator such as ': ' is treated as a regex and needs the Python parsing engine). You can set your index column as the first and transpose:
import pandas as pd
from io import StringIO
x = StringIO("""|V|: 566834
|E|: 659570
d_max: 8
d_avg: 2
p: 4.10563e-06
|T|: 31315
T_avg: 0
T_max: 5
cc_avg: 0.0179651
cc_global: 0.0281446""")
# replace x with 'file.txt' to read from a file
df = pd.read_csv(x, sep=': ', header=None, index_col=0, engine='python').T
Result
print(df)
0 |V| |E| d_max d_avg p |T| T_avg T_max \
1 566834.0 659570.0 8.0 2.0 0.000004 31315.0 0.0 5.0
0 cc_avg cc_global
1 0.017965 0.028145
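If you only want the columns named in the question, subset afterwards:
df = df[['|V|', '|E|', '|T|', 'T_avg', 'T_max', 'cc_avg', 'cc_global']]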

Make NetworkX node attributes into Pandas Dataframe columns

I have a Networkx graph called G created below:
import networkx as nx
G = nx.Graph()
G.add_node(1, job='teacher', boss='dee')
G.add_node(2, job='teacher', boss='foo')
G.add_node(3, job='admin', boss='dee')
G.add_node(4, job='admin', boss='lopez')
I would like to store the node number along with attributes, job and boss in separate columns of a pandas dataframe.
I have attempted to do this with the below code but it produces a dataframe with 2 columns, 1 with node number and one with all of the attributes:
graph = G.nodes(data = True)
import pandas as pd
df = pd.DataFrame(graph)
df
Out[19]:
0 1
0 1 {u'job': u'teacher', u'boss': u'dee'}
1 2 {u'job': u'teacher', u'boss': u'foo'}
2 3 {u'job': u'admin', u'boss': u'dee'}
3 4 {u'job': u'admin', u'boss': u'lopez'}
Note: I acknowledge that NetworkX has a to_pandas_dataframe function but it does not provide a dataframe with the output I am looking for.
Here's a one-liner:
pd.DataFrame.from_dict(dict(G.nodes(data=True)), orient='index')
I think this is even simpler, without having to convert to another dict first (in NetworkX 2.x, G.nodes already behaves like a mapping from node to attribute dict):
pd.DataFrame.from_dict(G.nodes, orient='index')
I don't know how representative your data is but it should be straightforward to modify my code to work on your real network:
In [32]:
data = {}
data['node'] = [x[0] for x in graph]
data['boss'] = [x[1]['boss'] for x in graph]
data['job'] = [x[1]['job'] for x in graph]
df1 = pd.DataFrame(data)
df1
Out[32]:
boss job node
0 dee teacher 1
1 foo teacher 2
2 dee admin 3
3 lopez admin 4
So here all I'm doing is constructing a dict from the graph data. pandas accepts dicts as data, where the keys become the column names and the values must be array-like; in this case, lists of values.
A more dynamic method:
In [42]:
def func(graph):
    data = {}
    data['node'] = [x[0] for x in graph]
    # graph must be indexable here, i.e. a list of (node, attrs) pairs
    other_cols = graph[0][1].keys()
    for key in other_cols:
        data[key] = [x[1][key] for x in graph]
    return data
pd.DataFrame(func(graph))
Out[42]:
boss job node
0 dee teacher 1
1 foo teacher 2
2 dee admin 3
3 lopez admin 4
I updated this solution to work with my updated version of NetworkX (2.0) and thought I would share. I also had the function return a Pandas DataFrame.
def nodes_to_df(graph):
    import pandas as pd
    data = {}
    data['node'] = [x[0] for x in graph.nodes(data=True)]
    # Take the attribute names from the first node's attribute dict
    other_cols = next(iter(graph.nodes(data=True)))[1].keys()
    for key in other_cols:
        data[key] = [x[1][key] for x in graph.nodes(data=True)]
    return pd.DataFrame(data)
I have solved this with a dictionary comprehension.
d = {n:dag.nodes[n] for n in dag.nodes}
df = pd.DataFrame.from_dict(d, orient='index')
Your dictionary d maps the nodes n to dag.nodes[n].
Each value of that dictionary dag.nodes[n] is a dictionary itself and contains all attributes: {attribute_name:attribute_value}
So your dictionary d has the form:
{node_id : {attribute_name : attribute_value} }
The advantage I see is that you do not need to know the names of your attributes.
If you wanted to have the node-IDs not as index but in a column, you could add as the last command:
df.reset_index(drop=False, inplace=True)
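Putting this together with the graph from the question, a quick sketch:
import pandas as pd
import networkx as nx

G = nx.Graph()
G.add_node(1, job='teacher', boss='dee')
G.add_node(2, job='teacher', boss='foo')
G.add_node(3, job='admin', boss='dee')
G.add_node(4, job='admin', boss='lopez')

d = {n: G.nodes[n] for n in G.nodes}
df = pd.DataFrame.from_dict(d, orient='index')
df.reset_index(drop=False, inplace=True)
print(df.rename(columns={'index': 'node'}))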

How to store formulas, instead of values, in pandas DataFrame

Is it possible to work with pandas DataFrame as with an Excel spreadsheet: say, by entering a formula in a column so that when variables in other columns change, the values in this column change automatically? Something like:
a b c
2 3 =a+b
And so when I update 2 or 3, the column c also updates automatically.
PS: It's clearly possible to write a function to return a+b, but is there any built-in functionality in pandas or in other Python libraries to work with matrices this way?
This will work in 0.13 (still in development)
In [19]: df = DataFrame(randn(10,2),columns=list('ab'))
In [20]: df
Out[20]:
a b
0 0.958465 0.679193
1 -0.769077 0.497436
2 0.598059 0.457555
3 0.290926 -1.617927
4 -0.248910 -0.947835
5 -1.352096 -0.568631
6 0.009125 0.711511
7 -0.993082 -1.440405
8 -0.593704 0.352468
9 0.523332 -1.544849
(Soon it will be possible to write the formula as just 'a + b'.)
In [21]: formulas = { 'c' : 'df.a + df.b' }
In [22]: def update(df, formulas):
             for k, v in formulas.items():
                 df[k] = pd.eval(v)
In [23]: update(df,formulas)
In [24]: df
Out[24]:
a b c
0 0.958465 0.679193 1.637658
1 -0.769077 0.497436 -0.271642
2 0.598059 0.457555 1.055614
3 0.290926 -1.617927 -1.327001
4 -0.248910 -0.947835 -1.196745
5 -1.352096 -0.568631 -1.920726
6 0.009125 0.711511 0.720636
7 -0.993082 -1.440405 -2.433487
8 -0.593704 0.352468 -0.241236
9 0.523332 -1.544849 -1.021517
You could implement a hook into __setitem__ on the DataFrame to have this type of function called automatically, but that would be pretty tricky. You didn't specify how the frame is updated in the first place; it would probably be easiest to simply call the update function after you change the values.
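As a rough illustration of that __setitem__ hook (FormulaFrame and its formulas attribute are invented for this sketch, not pandas API):
import pandas as pd

class FormulaFrame(pd.DataFrame):
    # _metadata lets pandas carry the custom attribute through operations
    _metadata = ['formulas']

    def __setitem__(self, key, value):
        super().__setitem__(key, value)
        # Re-evaluate every stored formula after any column update
        for col, expr in getattr(self, 'formulas', {}).items():
            if col != key:
                super().__setitem__(col, self.eval(expr))

ff = FormulaFrame({'a': [2], 'b': [3]})
ff.formulas = {'c': 'a + b'}
ff['a'] = [10]
print(ff)  # column c is recomputed automatically: 10 + 3 = 13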
I don't know if it is what you want, but I accidentally discovered that you can store xlwt.Formula objects in DataFrame cells and then, using the DataFrame.to_excel method, export the DataFrame to Excel with your formulas in it:
import pandas
import xlwt

formulae = []
formulae.append(xlwt.Formula('SUM(F1:F5)'))
formulae.append(xlwt.Formula('SUM(G1:G5)'))
formulae.append(xlwt.Formula('SUM(H1:I5)'))
formulae.append(xlwt.Formula('SUM(I1:I5)'))
df = pandas.DataFrame(formulae)
df.to_excel('FormulaTest.xls')
Try it...
There's currently no way to do this exactly in the way that you describe.
In pandas 0.13 there will be a new DataFrame.eval method that will allow you to evaluate an expression in the "context" of a DataFrame. For example, you'll be able to write df['c'] = df.eval('a + b').
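In later pandas versions, eval also supports assignment expressions directly:
# Creates (or overwrites) column c in place
df.eval('c = a + b', inplace=True)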
