I have a NetworkX graph called G, created below:
import networkx as nx
G = nx.Graph()
G.add_node(1, job='teacher', boss='dee')
G.add_node(2, job='teacher', boss='foo')
G.add_node(3, job='admin', boss='dee')
G.add_node(4, job='admin', boss='lopez')
I would like to store the node number along with the attributes job and boss in separate columns of a pandas dataframe.
I have attempted to do this with the code below, but it produces a dataframe with 2 columns: one with the node number and one with all of the attributes:
graph = G.nodes(data=True)
import pandas as pd
df = pd.DataFrame(graph)
df
Out[19]:
0 1
0 1 {u'job': u'teacher', u'boss': u'dee'}
1 2 {u'job': u'teacher', u'boss': u'foo'}
2 3 {u'job': u'admin', u'boss': u'dee'}
3 4 {u'job': u'admin', u'boss': u'lopez'}
Note: I acknowledge that NetworkX has a to_pandas_dataframe function but it does not provide a dataframe with the output I am looking for.
Here's a one-liner (here G is the graph object itself):
pd.DataFrame.from_dict(dict(G.nodes(data=True)), orient='index')
I think this is even simpler, without having to convert to another dict:
pd.DataFrame.from_dict(G.nodes, orient='index')
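For reference, a quick self-contained sketch of the first one-liner applied to the example graph from the question (assuming NetworkX 2.x and Python 3; the rename at the end is optional):

import networkx as nx
import pandas as pd

G = nx.Graph()
G.add_node(1, job='teacher', boss='dee')
G.add_node(2, job='teacher', boss='foo')
G.add_node(3, job='admin', boss='dee')
G.add_node(4, job='admin', boss='lopez')

# node IDs become the index, attribute names become the columns
df = pd.DataFrame.from_dict(dict(G.nodes(data=True)), orient='index')

# if you want the node IDs as a regular column instead of the index
df = df.reset_index().rename(columns={'index': 'node'})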
I don't know how representative your data is but it should be straightforward to modify my code to work on your real network:
In [32]:
data = {}
data['node'] = [x[0] for x in graph]
data['boss'] = [x[1]['boss'] for x in graph]
data['job'] = [x[1]['job'] for x in graph]
df1 = pd.DataFrame(data)
df1
Out[32]:
boss job node
0 dee teacher 1
1 foo teacher 2
2 dee admin 3
3 lopez admin 4
All I'm doing here is constructing a dict from the graph data. pandas accepts dicts as data, where the keys are the column names and the values have to be array-like; in this case they are lists of values.
A more dynamic method:
In [42]:
def func(graph):
    data = {}
    data['node'] = [x[0] for x in graph]
    other_cols = graph[0][1].keys()
    for key in other_cols:
        data[key] = [x[1][key] for x in graph]
    return data

pd.DataFrame(func(graph))
Out[42]:
boss job node
0 dee teacher 1
1 foo teacher 2
2 dee admin 3
3 lopez admin 4
I updated this solution to work with my updated version of NetworkX (2.0) and thought I would share. I also had the function return a Pandas DataFrame.
def nodes_to_df(graph):
    import pandas as pd
    data = {}
    nodes = list(graph.nodes(data=True))
    data['node'] = [x[0] for x in nodes]
    # take the attribute names from the first node's attribute dict
    other_cols = nodes[0][1].keys()
    for key in other_cols:
        data[key] = [x[1][key] for x in nodes]
    return pd.DataFrame(data)
I have solved this with a dictionary comprehension.
d = {n:dag.nodes[n] for n in dag.nodes}
df = pd.DataFrame.from_dict(d, orient='index')
Your dictionary d maps each node n to dag.nodes[n].
Each of those values dag.nodes[n] is itself a dictionary containing all of that node's attributes: {attribute_name: attribute_value}
So your dictionary d has the form:
{node_id : {attribute_name : attribute_value} }
The advantage I see is that you do not need to know the names of your attributes.
If you wanted to have the node-IDs not as index but in a column, you could add as the last command:
df.reset_index(drop=False, inplace=True)
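Applied to the example graph from the question, a short sketch of this approach (with dag replaced by G):

import pandas as pd

d = {n: G.nodes[n] for n in G.nodes}       # {1: {'job': 'teacher', 'boss': 'dee'}, ...}
df = pd.DataFrame.from_dict(d, orient='index')
df.reset_index(drop=False, inplace=True)   # node IDs move from the index into an 'index' column
df = df.rename(columns={'index': 'node'})  # optional: give that column a nicer name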
Related
I have been trying to update column values in a dataframe based on the values of another column, which contains strings.
import pandas as pd
import numpy as np
1. df=pd.read_excel('C:\\Users\\bahlrajesh23\\datascience\\Invoice.xlsx')
2. df1 =( df[df['Vendor'].str.contains('holding')] )
3. df['cat'] = pd.np.where(df['Vendor'].str.contains('holding'),"Yes",'' )
4. print(df[0:5])
The code up to line 4 above works well, but now I want to add more conditions to line 3, so I amended it like this:
df['cat'] = pd.np.where((df['Vendor'].str.contains('holding'),"Yes",''),
(df['Vendor'].str.contains('tech'),"tech",''))
I am getting the following error:
ValueError: either both or neither of x and y should be given
How can I achieve this?
Because you want to return a different answer for each condition, a single np.where() call won't work here, and map() would also be awkward.
You can use apply() and make the function as complex as you need.
df = pd.DataFrame({'Vendor':['techi', 'tech', 'a', 'hold', 'holding', 'holdingon', 'techno', 'b']})
df
def add_cat(x):
    if 'tech' in x:
        return 'tech'
    elif 'holding' in x:
        return 'Yes'
    else:
        return ''

df['cat'] = df['Vendor'].apply(add_cat)
      Vendor   cat
0      techi  tech
1       tech  tech
2          a
3       hold
4    holding   Yes
5  holdingon   Yes
6     techno  tech
7          b
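For completeness, if you prefer to stay vectorized, numpy's select maps several conditions to different values in one call (a sketch, not part of the original answer; the order of the conditions decides which match wins):

import numpy as np

conditions = [
    df['Vendor'].str.contains('tech'),
    df['Vendor'].str.contains('holding'),
]
choices = ['tech', 'Yes']

df['cat'] = np.select(conditions, choices, default='')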
I am new to data science, so your help is appreciated. My question is about grouping a dataframe by columns so that a bar chart can be plotted based on each subject's status.
My csv file is something like this:
Name,Maths,Science,English,sports
S1,Pass,Fail,Pass,Pass
S2,Pass,Pass,NA,Pass
S3,Pass,Fail,Pass,Pass
S4,Pass,Pass,Pass,NA
S5,Pass,Fail,Pass,NA
Expected output:
Subject,Status,Count
Maths,Pass,5
Science,Pass,2
Science,Fail,3
English,Pass,4
English,NA,1
Sports,Pass,3
Sports,NA,2
You can do this with pandas. It's not exactly the same output format as in the question, but it definitely carries the same information:
import pandas as pd
# reading csv
df = pd.read_csv("input.csv")
# turning columns into rows
melt_df = pd.melt(df, id_vars=['Name'], value_vars=['Maths', 'Science', "English", "sports"], var_name="Subject", value_name="Status")
# filling NaN values, otherwise the below groupby will ignore them.
melt_df = melt_df.fillna("Unknown")
# counting per group of subject and status.
result_df = melt_df.groupby(["Subject", "Status"]).size().reset_index(name="Count")
Then you get the following result:
Subject Status Count
0 English Pass 4
1 English Unknown 1
2 Maths Pass 5
3 Science Fail 3
4 Science Pass 2
5 sports Pass 3
6 sports Unknown 2
PS: Going forward, always include the code you've tried so far.
To match exactly your output, this is what you could do:
import pandas as pd
df = pd.read_csv('c:/temp/data.csv')  # Or wherever your csv file is
subjects = ['Maths', 'Science', 'English', 'sports']  # Or you could get these from df.columns and drop 'Name'
grouped_rows = []
for eachsub in subjects:
    rows = df.groupby(eachsub)['Name'].count()
    idx = list(rows.index)
    if 'Pass' in idx:
        grouped_rows.append([eachsub, 'Pass', rows['Pass']])
    if 'Fail' in idx:
        grouped_rows.append([eachsub, 'Fail', rows['Fail']])

new_df = pd.DataFrame(grouped_rows, columns=['Subject', 'Grade', 'Count'])
print(new_df)
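One caveat: read_csv converts the literal string NA to a missing value by default, so the NA entries in the sample data won't show up as a status when you group. If you want them counted, you could keep them as plain strings (a small sketch using the standard keep_default_na parameter):

import pandas as pd

# keep "NA" as a literal string instead of converting it to NaN
df = pd.read_csv('c:/temp/data.csv', keep_default_na=False)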
I must suggest, though, that I would avoid the for loop. The counts can be obtained in just two lines, e.g. with a dict comprehension over the subjects:
subjects = ['Maths', 'Science', 'English', 'sports']
grouped_rows = {sub: df.groupby(sub)['Name'].count() for sub in subjects}
Depending on your application, the data you need may already be available in grouped_rows.
I was using Python and pandas to do some statistical analysis on data, and at some point I needed to add some new columns with the assign function:
df_res = (
df
.assign(col1 = lambda x: np.where(x['event'].str.contains('regex1'),1,0))
.assign(col2 = lambda x: np.where(x['event'].str.contains('regex2'),1,0))
.assign(mycol = lambda x: np.where(x['event'].str.contains('regex3'),1,0))
.assign(newcol = lambda x: np.where(x['event'].str.contains('regex4'),1,0))
)
I wanted to know if there is any way to put the column names and my regexes in a dictionary and use a for loop or another lambda expression to assign these columns automatically:
Dic = {'col1':'regex1','col2':'regex2','mycol':'regex3','newcol':'regex4'}
df_res = (
df
.assign(...using Dic here...)
)
I need to add more columns later, and I think this would make that easier.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.assign.html
Assigning multiple columns within the same assign is possible. For Python 3.6 and above, later items in ‘**kwargs’ may refer to newly created or modified columns in ‘df’; items are computed and assigned into ‘df’ in order. For Python 3.5 and below, the order of keyword arguments is not specified, you cannot refer to newly created or modified columns. All items are computed first, and then assigned in alphabetical order.
Changed in version 0.23.0: Keyword argument order is maintained for Python 3.6 and later.
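A tiny illustration of that ordering behaviour (a sketch with a made-up column x, assuming Python 3.6+ and a recent pandas):

import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3]})

# the second keyword argument can refer to the column created by the first
res = df.assign(double=lambda d: d['x'] * 2,
                quadruple=lambda d: d['double'] * 2)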
If you map all your regexes so that each dictionary value holds a lambda instead of just the regex, you can simply unpack the dict into assign:
lambda_dict = {
    col: lambda x, regex=regex: (
        x['event']
        .str.contains(regex)
        .astype(int)
    )
    for col, regex in Dic.items()
}

res = df.assign(**lambda_dict)
EDIT
Here's an example:
import pandas as pd
import random
random.seed(0)
events = ['apple_one', 'chicken_one', 'chicken_two', 'apple_two']
data = [random.choice(events) for __ in range(10)]
df = pd.DataFrame(data, columns=['event'])
regex_dict = {
    'apples': 'apple',
    'chickens': 'chicken',
    'ones': 'one',
    'twos': 'two',
}

lambda_dict = {
    col: lambda x, regex=regex: (
        x['event']
        .str.contains(regex)
        .astype(int)
    )
    for col, regex in regex_dict.items()
}

res = df.assign(**lambda_dict)
print(res)
# Output
event apples chickens ones twos
0 apple_two 1 0 0 1
1 apple_two 1 0 0 1
2 apple_one 1 0 1 0
3 chicken_two 0 1 0 1
4 apple_two 1 0 0 1
5 apple_two 1 0 0 1
6 chicken_two 0 1 0 1
7 apple_two 1 0 0 1
8 chicken_two 0 1 0 1
9 chicken_one 0 1 1 0
The pitfall here is that, without the default argument, each lambda would look up regex only when it is called, after the loop has finished, so every column would end up using the last regex. Binding it as a default argument (regex=regex) fixes this.
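A minimal sketch of that pitfall, independent of pandas:

# Without a default argument every lambda closes over the same name,
# which holds the *last* value once the loop has finished.
funcs = [lambda: regex for regex in ['a', 'b', 'c']]
print([f() for f in funcs])                 # ['c', 'c', 'c']

# Binding the current value as a default argument captures it per lambda.
funcs = [lambda regex=regex: regex for regex in ['a', 'b', 'c']]
print([f() for f in funcs])                 # ['a', 'b', 'c']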
This can also do what you want:
pd.concat([df, pd.DataFrame({a: list(df["event"].str.contains(b)) for a, b in Dic.items()})], axis=1)
A plain for loop building the columns would achieve the same thing.
If I understand your question correctly, you're trying to rename the columns, in which case I think you could just use pandas' rename function. This would look like
df_res = df_res.rename(columns=Dic)
-Ben
I've been trying to create a dictionary of dataframes so I can store data coming from different files. I create one dataframe in the following loop, and I would like to add each of them to the dictionary. I will have to join them later by the date.
d = {}
for num in range(3, 14):
    nodeName = "rgs" + str(num).zfill(2)  # The key should be the nodeName
    # Bunch of stuff to get the data ...
    # Fill dataframe
    data = {'date': date_list, 'users': users_list}
    df = pd.DataFrame(data)
    df = df.convert_objects(convert_numeric=True)
    df = df.dropna(subset=['users'])
    df['users'] = df['users'].astype(int)
    d = {nodeName: df}
print d
The problem I have is that if I print the dictionary outside the loop, I only have one item, the last one.
{'rgs13': date users
0 2016-01-18 1
1 2016-01-19 1
2 2016-01-20 1
3 2016-01-21 1
4 2016-01-22 1
5 2016-01-23 1
6 2016-01-24 0
But I can clearly see that I can generate all the dataframes without problems inside the loop. How can I make the dictionary to keep all the df's? What am I doing wrong?
Thanks for the help.
It's because at the end of each iteration you are re-defining d instead of adding to it.
What you want is this:
d = {}
for num in range(3, 14):
    nodeName = "rgs" + str(num).zfill(2)  # The key should be the nodeName
    # Bunch of stuff to get the data ...
    # Fill dataframe
    data = {'date': date_list, 'users': users_list}
    df = pd.DataFrame(data)
    df = df.convert_objects(convert_numeric=True)
    df = df.dropna(subset=['users'])
    df['users'] = df['users'].astype(int)
    d[nodeName] = df
print d
Instead of d = {nodeName:df} use
d[nodeName] = df
This adds a key/value pair to d, whereas d = {nodeName: df} reassigns d to a new dict (with only the one key/value pair). Doing that in a loop spells death to all the previous key/value pairs.
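A minimal illustration of the difference:

d = {}
d['a'] = 1       # d is now {'a': 1}
d['b'] = 2       # d is now {'a': 1, 'b': 2}

d = {'c': 3}     # d is rebound to a brand-new dict; the previous entries are gone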
You may find Ned Batchelder's Facts and myths about Python names and values a useful read. It will give you the right mental model for thinking about the relationship between variable names and values, and help you see what statements modify values (e.g. d[nodeName] = df) versus reassign variable names (e.g. d = {nodeName:df}).
pandas provides a useful to_html() to convert a DataFrame into an HTML table. Is there any function to read it back into a DataFrame?
The read_html utility, released in pandas 0.12, does exactly this.
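For instance, a round trip might look like this (a quick sketch; read_html needs an HTML parser such as lxml or html5lib installed, and a file-like object is the safest way to pass raw HTML across pandas versions):

import pandas as pd
from io import StringIO

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
html = df.to_html()

# read_html returns a list of DataFrames, one per <table> found
round_tripped = pd.read_html(StringIO(html), index_col=0)[0]
print(round_tripped)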
In the general case it is not possible, but if you approximately know the structure of your table you could do something like this:
# Create a test df:
>>> import numpy as np
>>> from pandas import DataFrame
>>> df = DataFrame(np.random.rand(4, 5), columns=list('abcde'))
>>> df
a b c d e
0 0.675006 0.230464 0.386991 0.422778 0.657711
1 0.250519 0.184570 0.470301 0.811388 0.762004
2 0.363777 0.715686 0.272506 0.124069 0.045023
3 0.657702 0.783069 0.473232 0.592722 0.855030
Now parse the html and reconstruct:
from pyquery import PyQuery as pq
d = pq(df.to_html())
columns = d('thead tr').eq(0).text().split()
n_rows = len(d('tbody tr'))
values = np.array(d('tbody tr td').text().split(), dtype=float).reshape(n_rows, len(columns))
>>> DataFrame(values, columns=columns)
a b c d e
0 0.675006 0.230464 0.386991 0.422778 0.657711
1 0.250519 0.184570 0.470301 0.811388 0.762004
2 0.363777 0.715686 0.272506 0.124069 0.045023
3 0.657702 0.783069 0.473232 0.592722 0.855030
You could extend it for MultiIndex dfs or add automatic type detection using eval() if needed.