How to convert a co-occurrence matrix to networkx graph - python

I am using the following code to convert my list of lists to a co-occurrence matrix.
lst = [
['a', 'b'],
['b', 'c', 'd', 'e'],
['a', 'd'],
['b', 'e']
]
u = (pd.get_dummies(pd.DataFrame(lst), prefix='', prefix_sep='')
.groupby(level=0, axis=1)
.sum())
v = u.T.dot(u)
v.values[(np.r_[:len(v)], ) * 2] = 0
print(v)
My output is as follows:
   a  b  c  d  e
a  0  1  0  1  0
b  1  0  1  1  2
c  0  1  0  1  1
d  1  1  1  0  1
e  0  2  1  1  0
I would like to transform my co-occurrence matrix to a weighted undirected networkx graph where weights represent the co-occurrence count in the matrix.
Currently, I have tried to do it as follows. However, I am not sure how I can insert weights to the graph.
print("get x and y pairs")
#get (x,y) pairs from the cooccurrence matrix
arr = np.where(v>=1)
corrs = [(v.index[x], v.columns[y]) for x, y in zip(*arr)]
#get the unique pairs
final_arr = []
for x, y in corrs:
    if (y,x) not in final_arr:
        final_arr.append((x,y))
#construct the graph
G = nx.Graph()
nodes_vocabulary_list = ['a', 'b', 'c', 'd', 'e']
G.add_nodes_from(nodes_vocabulary_list)
G.add_edges_from(final_arr)
Is there an easier way to do this in networkx?
I am happy to provide more details if needed.

I believe you can use:
import numpy as np
import pandas as pd
lst = [
['a', 'b'],
['b', 'c', 'd', 'e'],
['a', 'd'],
['b', 'e']
]
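# sum(level=...) was removed in pandas 2.0; group the duplicated column labels through a transpose instead
u = pd.get_dummies(pd.DataFrame(lst), prefix='', prefix_sep='', dtype=int).T.groupby(level=0).sum().T
v = u.T.dot(u)
#set 0 to lower triangular matrix (np.bool was removed from NumPy, so use the builtin bool)
v = v.mask(np.tril(np.ones(v.shape, dtype=bool)), 0)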
print(v)
   a  b  c  d  e
a  0  1  0  1  0
b  0  0  1  1  2
c  0  0  0  1  1
d  0  0  0  0  1
e  0  0  0  0  0
#reshape and filter only count > 0
a = v.stack()
a = a[a >= 1].rename_axis(('source', 'target')).reset_index(name='weight')
print(a)
  source target  weight
0      a      b       1
1      a      d       1
2      b      c       1
3      b      d       1
4      b      e       2
5      c      d       1
6      c      e       1
7      d      e       1
Then create the graph with from_pandas_edgelist; edge_attr=True copies the remaining columns (here weight) onto each edge:
import networkx as nx
G = nx.from_pandas_edgelist(a, edge_attr=True)
print (nx.to_dict_of_dicts(G))
{'a': {'b': {'weight': 1}, 'd': {'weight': 1}},
'b': {'a': {'weight': 1}, 'c': {'weight': 1}, 'd': {'weight': 1}, 'e': {'weight': 2}},
'd': {'a': {'weight': 1}, 'b': {'weight': 1}, 'c': {'weight': 1}, 'e': {'weight': 1}},
'c': {'b': {'weight': 1}, 'd': {'weight': 1}, 'e': {'weight': 1}},
'e': {'b': {'weight': 2}, 'c': {'weight': 1}, 'd': {'weight': 1}}}
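A possible shortcut, assuming a networkx version that provides from_pandas_adjacency: because the co-occurrence matrix from the question is symmetric with a zero diagonal, it can be treated as a weighted adjacency matrix and passed in directly. The sketch below types the matrix out by hand (the labels variable is only for illustration) so that it does not depend on the get_dummies step.

```python
import networkx as nx
import pandas as pd

labels = ['a', 'b', 'c', 'd', 'e']

# The symmetric co-occurrence matrix printed in the question.
v = pd.DataFrame(
    [[0, 1, 0, 1, 0],
     [1, 0, 1, 1, 2],
     [0, 1, 0, 1, 1],
     [1, 1, 1, 0, 1],
     [0, 2, 1, 1, 0]],
    index=labels,
    columns=labels,
)

# Non-zero cells become edges; the cell value is stored as the 'weight' attribute.
G = nx.from_pandas_adjacency(v)

print(G['b']['e']['weight'])
print(G.number_of_edges())
```

Zero cells are treated as missing edges, so a label that never co-occurs with anything would still appear as an isolated node.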

Related

Cut Pandas dataframe based on unique values per column

I want to cut pandas data frame with duplicated values in a column into separate data frames.
So from this:
df = pd.DataFrame({'block': ['A', 'B', 'B', 'C'],
'd': [{'A': 1}, {'B': 'A'}, {'B': 3}, {'C': 'Z'}]})
block d
0 A {'A': 1 }
1 B {'B': 'A'}
2 B {'B': 3 }
3 C {'C': 'Z'}
I would like to achieve two separate data frames:
df1:
block d
0 A {'A': 1 }
1 B {'B': 'A'}
2 C {'C': 'Z'}
df2:
block d
0 A {'A': 1 }
1 B {'B': 3 }
2 C {'C': 'Z'}
Another example:
df = pd.DataFrame({'block': ['A', 'B', 'B', 'C', 'C'],
'd': [{'A': 1}, {'B': 'A'}, {'B': 3}, {'C': 'Z'}, {'C': 10}]})
Result:
block d
0 A {'A': 1 }
1 B {'B': 'A'}
2 C {'C': 'Z'}
block d
0 A {'A': 1 }
1 B {'B': 'A'}
2 C {'C': 10 }
block d
0 A {'A': 1 }
1 B {'B': 3 }
2 C {'C': 'Z'}
block d
0 A {'A': 1 }
1 B {'B': 3 }
2 C {'C': 10 }
I should add that I want to preserve the order of the column 'block'.
I tried pandas explode and itertools package but without good results. If someone knows how to solve this - please help.
One way using pandas.DataFrame.groupby, iterrows and itertools.product:
from itertools import product
prods = []
for _, d in df.groupby("block"):
    prods.append([s for _, s in d.iterrows()])
dfs = [pd.concat(ss, axis=1).T for ss in product(*prods)]
print(dfs)
Output:
[ block d
0 A {'A': 1}
1 B {'B': 'A'}
3 C {'C': 'Z'},
block d
0 A {'A': 1}
2 B {'B': 3}
3 C {'C': 'Z'}]
Output for second sample df:
[ block d
0 A {'A': 1}
1 B {'B': 'A'}
3 C {'C': 'Z'},
block d
0 A {'A': 1}
1 B {'B': 'A'}
4 C {'C': 10},
block d
0 A {'A': 1}
2 B {'B': 3}
3 C {'C': 'Z'},
block d
0 A {'A': 1}
2 B {'B': 3}
4 C {'C': 10}]
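A variation on the same idea, offered as a sketch rather than a drop-in replacement: take the product over row labels instead of row Series, and pass sort=False so the groups come out in the order the blocks first appear. The names label_groups and dfs are illustrative. The index is reset so each result is numbered 0, 1, 2 as in the question.

```python
from itertools import product

import pandas as pd

df = pd.DataFrame({'block': ['A', 'B', 'B', 'C'],
                   'd': [{'A': 1}, {'B': 'A'}, {'B': 3}, {'C': 'Z'}]})

# One list of row labels per block; sort=False keeps first-appearance order.
label_groups = [g.index.tolist() for _, g in df.groupby('block', sort=False)]

# Each combination picks exactly one row label from every block.
dfs = [df.loc[list(combo)].reset_index(drop=True)
       for combo in product(*label_groups)]

for part in dfs:
    print(part)
```

The number of resulting frames is the product of the group sizes, so this grows quickly when many blocks have duplicates.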

Is it possible to use two (non-nested) for loops inside a dictionary?

I have these two lists:
a = ['A', 'B', 'C']
b = [ 1 , 2 , 3 ]
And I want to merge them into a dictionary like this:
{'A': 1, 'B': 2, 'C': 3}
I already tried doing stuff like:
{i: j for i in a for j in b}
dict(*a: *b)
Which outputs
{'A': 3, 'B': 3, 'C': 3}
SyntaxError: invalid syntax
a = ['A', 'B', 'C']
b = [ 1 , 2 , 3 ]
print (dict(zip(a,b)))
Output:
{'A': 1, 'B': 2, 'C': 3}
It is better to use zip for this:
a = ['A', 'B', 'C']
b = [ 1 , 2 , 3 ]
{i:k for i,k in zip(a,b)}
#{'A': 1, 'B': 2, 'C': 3}
You can also use enumerate
d = {elem: b[i] for i, elem in enumerate(a)}
d
{'A': 1, 'B': 2, 'C': 3}
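To see why the first attempt gave 3 for every key, here is a small sketch (the variable names pairs, nested and paired are just for illustration). Two for clauses in one comprehension behave like nested loops, so every key is paired with every value and the last value wins.

```python
a = ['A', 'B', 'C']
b = [1, 2, 3]

# Two `for` clauses are nested loops: this visits all 3 * 3 combinations.
pairs = [(i, j) for i in a for j in b]
print(len(pairs))  # 9

# In a dict comprehension each key is overwritten until the last j (3) remains.
nested = {i: j for i in a for j in b}
print(nested)  # {'A': 3, 'B': 3, 'C': 3}

# zip walks both lists in lockstep, which is the "non-nested" pairing.
paired = dict(zip(a, b))
print(paired)  # {'A': 1, 'B': 2, 'C': 3}
```

Note that zip stops at the shorter input, so lists of unequal length are truncated silently.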

Convert multiple columns into dict after groupby in pandas

I have a dataframe
df = pd.DataFrame({"a":[1,1,1,2,2,2,3,3], "b":["a","a","a","b","b","b","c","c"], "c":[0,0,1,0,1,1,0,1], "d":["x","y","z","x","y","y","z","x"]})
a b c d
0 1 a 0 x
1 1 a 0 y
2 1 a 1 z
3 2 b 0 x
4 2 b 1 y
5 2 b 1 y
6 3 c 0 z
7 3 c 1 x
I want to groupby on column a and column b to get following output:
a b e
0 1 a [{'c': 0, 'd': 'x'}, {'c': 0, 'd': 'y'}, {'c': 1, 'd': 'z'}]
1 2 b [{'c': 0, 'd': 'x'}, {'c': 1, 'd': 'y'}, {'c': 1, 'd': 'y'}]
2 3 c [{'c': 0, 'd': 'z'}, {'c': 1, 'd': 'x'}]
My solution:
new_df = df.groupby(["a","b"])["c","d"].apply(lambda x: x.to_dict(orient="records")).reset_index(name="e")
But the issue is that it behaves inconsistently; sometimes I get the error below:
reset_index() got an unexpected keyword argument "name"
It would be helpful if someone could point out the issue in the above solution or suggest an alternative way of doing this.
You can do
new=df.groupby(['a','b'])[['c','d']].apply(lambda x : x.to_dict('records')).to_frame('e').reset_index()
Out[13]:
a b e
0 1 a [{'c': 0, 'd': 'x'}, {'c': 0, 'd': 'y'}, {'c':...
1 2 b [{'c': 0, 'd': 'x'}, {'c': 1, 'd': 'y'}, {'c':...
2 3 c [{'c': 0, 'd': 'z'}, {'c': 1, 'd': 'x'}]
Alternatively we can do:
df['e'] = df[['c', 'd']].agg(lambda s: dict(zip(s.index, s.values)), axis=1)
df1 = df.groupby(['a', 'b'])['e'].agg(list).reset_index()
# print(df1)
a b e
0 1 a [{'c': 0, 'd': 'x'}, {'c': 0, 'd': 'y'}, {'c':...
1 2 b [{'c': 0, 'd': 'x'}, {'c': 1, 'd': 'y'}, {'c':...
2 3 c [{'c': 0, 'd': 'z'}, {'c': 1, 'd': 'x'}]
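Another option, sketched here under the assumption that readability matters more than speed, is to skip apply entirely and build the rows in a plain comprehension over the groups. The names rows and out are illustrative. Because the result is assembled by hand, its shape does not depend on what apply decides to return.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 1, 1, 2, 2, 2, 3, 3],
                   "b": ["a", "a", "a", "b", "b", "b", "c", "c"],
                   "c": [0, 0, 1, 0, 1, 1, 0, 1],
                   "d": ["x", "y", "z", "x", "y", "y", "z", "x"]})

# One output row per (a, b) group; 'e' holds that group's records as a list of dicts.
rows = [
    {"a": a, "b": b, "e": g[["c", "d"]].to_dict("records")}
    for (a, b), g in df.groupby(["a", "b"])
]

# Passing columns keeps the layout stable even if there are no groups at all.
out = pd.DataFrame(rows, columns=["a", "b", "e"])
print(out)
```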

Keeping additional column when normalizing list of dicts

I have a dataframe containing id and list of dicts:
df = pd.DataFrame({
'list_of_dicts': [[{'a': 1, 'b': 2}, {'a': 11, 'b': 22}],
[{'a': 3, 'b': 4}, {'a': 33, 'b': 44}]],
'id': [100, 200]
})
and I want to normalize it like this:
id a b
0 100 1 2
0 100 3 4
1 200 11 22
1 200 33 44
This gets most of the way:
pd.concat([
pd.DataFrame.from_dict(item)
for item in df.list_of_dicts
])
but is missing the id column.
I'm most interested in readability.
How about something like this:
d = {
'list_of_dicts': [[{'a': 1, 'b': 2}, {'a': 11, 'b': 22}],
[{'a': 3, 'b': 4}, {'a': 33, 'b': 44}]],
'id': [100, 200]
}
df = pd.DataFrame([pd.Series(x) for ld in d['list_of_dicts'] for x in ld])
id = [[x]*len(l) for l,x in zip(d['list_of_dicts'],d['id'])]
df['id'] = pd.Series([x for l in id for x in l])
EDIT - Here's a simpler version
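# zip pairs each id with its own list of dicts; looping over both independently would combine every id with every dict
t = [[('id', i)]+list(l.items()) for i, ll in zip(d['id'], d['list_of_dicts']) for l in ll]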
df = pd.DataFrame([dict(x) for x in t])
And, if you really want the id column first, you can change dict to OrderedDict from the collections module.
This is what I call an incomprehension
pd.DataFrame(
*list(map(list, zip(
*[(d, i) for i, l in zip(df.id, df.list_of_dicts) for d in l]
)))
).rename_axis('id').reset_index()
id a b
0 100 1 2
1 100 11 22
2 200 3 4
3 200 33 44
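Since readability was the stated goal, here is one more sketch that spells the pairing out: walk the id and list_of_dicts columns together and merge each id into its dicts. The names rows and out are illustrative, and id_ is used to avoid shadowing the builtin id.

```python
import pandas as pd

df = pd.DataFrame({
    'list_of_dicts': [[{'a': 1, 'b': 2}, {'a': 11, 'b': 22}],
                      [{'a': 3, 'b': 4}, {'a': 33, 'b': 44}]],
    'id': [100, 200]
})

# One flat record per inner dict, with the owning row's id placed first.
rows = [
    {'id': id_, **item}
    for id_, items in zip(df['id'], df['list_of_dicts'])
    for item in items
]

out = pd.DataFrame(rows)
print(out)
```

Each inner dict stays attached to the id of the row it came from, and the id column comes first because dicts preserve insertion order.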

pandas how to fill NaN/None values based on the other columns?

Given the following, how can I set the NaN/None values in column B based on the other columns? Should I use apply?
d = [
{'A': 2, 'B': Decimal('628.00'), 'C': 1, 'D': 'blue'},
{'A': 1, 'B': None, 'C': 3, 'D': 'orange'},
{'A': 3, 'B': None, 'C': 1, 'D': 'orange'},
{'A': 2, 'B': Decimal('575.00'), 'C': 2, 'D': 'blue'},
{'A': 4, 'B': None, 'C': 1, 'D': 'blue'},
]
df = pd.DataFrame(d)
# Make sure types are correct
df['B'] = df['B'].astype('float')
df['C'] = df['C'].astype('int')
In : df
Out:
A B C D
0 2 628 1 blue
1 1 NaN 3 orange
2 3 NaN 1 orange
3 2 575 2 blue
4 4 NaN 1 blue
In : df.dtypes
Out:
A int64
B float64
C int64
D object
dtype: object
Here is an example of the "rules" to set B when the value is None:
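def make_B(c, d):
    """When B is None, the value of B depends on C and D."""
    if d == 'blue':
        return Decimal('1400.89') * 1 * c
    elif d == 'orange':
        return Decimal('2300.57') * 2 * c
    # a bare `raise` has no active exception to re-raise here, so raise explicitly
    raise ValueError(f"no rule for D={d!r}")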
Here is the way I solve it:
I define make_B as below:
def make_B(x):
    """When B is None, the value of B depends on C and D."""
    if np.isnan(x['B']):
        if x['D'] == 'blue':
            return Decimal('1400.89') * 1 * x['C']
        elif x['D'] == 'orange':
            return Decimal('2300.57') * 2 * x['C']
    else:
        return x['B']
Then I use apply:
# apply returns a new Series, so assign it back to keep the filled values
df['B'] = df.apply(make_B, axis=1)
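For comparison, a vectorized sketch that avoids apply. It assumes plain floats are acceptable in place of Decimal (column B was already cast to float in the question) and that the only colours are blue and orange. The name rate is illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [2, 1, 3, 2, 4],
    'B': [628.0, np.nan, np.nan, 575.0, np.nan],
    'C': [1, 3, 1, 2, 1],
    'D': ['blue', 'orange', 'orange', 'blue', 'blue'],
})

# Per-colour multiplier taken from the question's rules, as floats.
rate = df['D'].map({'blue': 1400.89 * 1, 'orange': 2300.57 * 2})

# fillna aligns on the index, so only the missing B values are replaced.
df['B'] = df['B'].fillna(rate * df['C'])
print(df)
```

A colour missing from the mapping yields NaN in rate, so the corresponding B would stay NaN instead of raising an error.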
