Construct NetworkX graph from Pandas DataFrame - python

I'd like to create some NetworkX graphs from a simple Pandas DataFrame:
      Loc 1  Loc 2  Loc 3  Loc 4  Loc 5  Loc 6  Loc 7
Foo       0      0      1      1      0      0      0
Bar       0      0      1      1      0      1      1
Baz       0      0      1      0      0      0      0
Bat       0      0      1      0      0      1      0
Quux      1      0      0      0      0      0      0
Where Foo… is the index, and Loc 1 to Loc 7 are the columns. But converting to NumPy matrices or recarrays doesn't seem to work for generating input for nx.Graph(). Is there a standard strategy for achieving this? I'm not averse to reformatting the data in Pandas --> dumping to CSV --> importing to NetworkX, but it seems as if I should be able to generate the edges from the index and the nodes from the values.

NetworkX expects a square matrix (of nodes and edges), perhaps* you want to pass it:
In [11]: df2 = pd.concat([df, df.T]).fillna(0)
Note: It's important that the index and columns are in the same order!
In [12]: df2 = df2.reindex(df2.columns)
In [13]: df2
Out[13]:
       Bar  Bat  Baz  Foo  Loc 1  Loc 2  Loc 3  Loc 4  Loc 5  Loc 6  Loc 7  Quux
Bar      0    0    0    0      0      0      1      1      0      1      1     0
Bat      0    0    0    0      0      0      1      0      0      1      0     0
Baz      0    0    0    0      0      0      1      0      0      0      0     0
Foo      0    0    0    0      0      0      1      1      0      0      0     0
Loc 1    0    0    0    0      0      0      0      0      0      0      0     1
Loc 2    0    0    0    0      0      0      0      0      0      0      0     0
Loc 3    1    1    1    1      0      0      0      0      0      0      0     0
Loc 4    1    0    0    1      0      0      0      0      0      0      0     0
Loc 5    0    0    0    0      0      0      0      0      0      0      0     0
Loc 6    1    1    0    0      0      0      0      0      0      0      0     0
Loc 7    1    0    0    0      0      0      0      0      0      0      0     0
Quux     0    0    0    0      1      0      0      0      0      0      0     0
In [14]: graph = nx.from_numpy_matrix(df2.values)
(Note: from_numpy_matrix was removed in NetworkX 3.0; on recent versions use nx.from_numpy_array(df2.values) instead.)
This doesn't pass the column/index names to the graph, if you wanted to do that you could use relabel_nodes (you may have to be wary of duplicates, which are allowed in pandas' DataFrames):
In [15]: graph = nx.relabel_nodes(graph, dict(enumerate(df2.columns))) # is there nicer way than dict . enumerate ?
*It's unclear exactly what the columns and index represent for the desired graph.
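A side note (not part of the original answer): newer NetworkX versions can consume the adjacency DataFrame directly and keep the index/column labels as node names, which avoids the relabel step. A minimal sketch, assuming df2 is the square frame built above:
import networkx as nx

# from_pandas_adjacency uses the DataFrame's index/columns as node names,
# so no relabel_nodes pass is needed afterwards
graph = nx.from_pandas_adjacency(df2)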

A little late answer, but NetworkX can now read data from Pandas DataFrames directly; in that case, the ideal format for a simple directed graph is the following:
+----------+---------+---------+
| Source | Target | Weight |
+==========+=========+=========+
| Node_1 | Node_2 | 0.2 |
+----------+---------+---------+
| Node_2 | Node_1 | 0.6 |
+----------+---------+---------+
If you are using adjacency matrices then Andy Hayden is right, you should take care to use the correct format. Since your question uses 0s and 1s, I guess you would like to see an undirected graph. It may seem counterintuitive at first, since you said the index represents e.g. a person and the columns represent the groups a given person belongs to, but it is also correct the other way around: a group (membership) belongs to a person. Following this logic, you could actually put the groups in the index and the persons in the columns as well.
Just a side note: you can also frame this problem as a directed graph, for example if you would like to visualize an association network of hierarchical categories. There, the association e.g. from Samwise Gamgee to Hobbits is usually stronger than in the other direction (since Frodo Baggins is a more likely Hobbit prototype).
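For completeness, a minimal sketch of loading an edge list in the Source/Target/Weight format above into a directed graph (the two example rows come from the table):
import pandas as pd
import networkx as nx

# the example edge list from the table above
edges = pd.DataFrame({'Source': ['Node_1', 'Node_2'],
                      'Target': ['Node_2', 'Node_1'],
                      'Weight': [0.2, 0.6]})

# create_using=nx.DiGraph() keeps the two directions as distinct edges
g = nx.from_pandas_edgelist(edges, source='Source', target='Target',
                            edge_attr='Weight', create_using=nx.DiGraph())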

You can also use scipy to create the square matrix, like this:
import scipy.sparse as sp
import pandas as pd

cols = df.columns
X = sp.csr_matrix(df.astype(int).values)
Xc = X.T * X  # sparse matrix multiplication: column-by-column co-occurrence counts
Xc.setdiag(0)  # zero out the diagonal (self co-occurrence)
# create a dataframe from the co-occurrence matrix in dense format
df = pd.DataFrame(Xc.todense(), index=cols, columns=cols)
Later on you can create an edge list from the dataframe and import it into NetworkX:
import networkx as nx

df = df.stack().reset_index()
df.columns = ['source', 'target', 'weight']
df = df[df['weight'] != 0]  # drop zero-weight pairs (locations that never co-occur)
g = nx.from_pandas_edgelist(df, 'source', 'target', ['weight'])
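If you want to run the above end to end, here is the question's DataFrame rebuilt (the data comes from the original question; only the construction code is new):
import pandas as pd

df = pd.DataFrame(
    [[0, 0, 1, 1, 0, 0, 0],
     [0, 0, 1, 1, 0, 1, 1],
     [0, 0, 1, 0, 0, 0, 0],
     [0, 0, 1, 0, 0, 1, 0],
     [1, 0, 0, 0, 0, 0, 0]],
    index=['Foo', 'Bar', 'Baz', 'Bat', 'Quux'],
    columns=['Loc %d' % i for i in range(1, 8)])

# after the scipy and stack steps above, g.edges(data=True) contains
# e.g. ('Loc 3', 'Loc 4', {'weight': 2}), since Foo and Bar share both locations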

Related

How to create two columns by default for every feature (One Hot Encoding)?

My feature engineering runs for different documents. For some documents some features do not exist, and consequently the sublist consists only of the same value, such as the third sublist [0,0,0,0,0]. One-hot encoding of this sublist leads to only one column, while the feature lists of other documents are transformed into two columns. Is there any way to tell the one-hot encoder to create two columns even if the sublist consists of only one and the same value, and to insert the column in the right spot? The main problem is that the feature dataframes of different documents end up with different numbers of columns, which makes them incomparable.
import pandas as pd

feature = [[0,0,1,0,0], [1,1,1,0,1], [0,0,0,0,0], [1,0,1,1,1], [1,1,0,1,1], [1,0,1,1,1], [0,1,0,0,0]]
df = pd.DataFrame(feature[0])
df_features_final = pd.get_dummies(df[0])
for feature in feature[1:]:
    df = pd.DataFrame(feature)
    df_enc = pd.get_dummies(df[0])
    print(df_enc)
    df_features_final = pd.concat([df_features_final, df_enc], axis=1, join='inner')
print(df_features_final)
The result is the following dataframe. As you can see from the column titles, the all-zero sublist produced only a 0 column, so no 1 column follows after the fifth column:
0 1 0 1 0 0 1 0 1 0 1 0 1
0 1 0 0 1 1 0 1 0 1 0 1 1 0
1 1 0 0 1 1 1 0 0 1 1 0 0 1
2 0 1 0 1 1 0 1 1 0 0 1 1 0
3 1 0 1 0 1 0 1 0 1 0 1 1 0
4 1 0 0 1 1 0 1 0 1 0 1 1 0
I don't see the functionality you want in pandas, at least. But in TensorFlow we do have
tf.one_hot(
    indices, depth, on_value=None, off_value=None, axis=None, dtype=None, name=None
)
Set depth to 2.
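A minimal usage sketch of that suggestion (assuming TensorFlow 2.x):
import tensorflow as tf

# an all-zero sublist still yields two columns, because depth=2 fixes
# the number of categories up front
feature = [0, 0, 0, 0, 0]
encoded = tf.one_hot(feature, depth=2)
print(encoded.numpy())  # shape (5, 2): column 0 all ones, column 1 all zeros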

I have a two-dimensional list in which I change one sub-item, but this change is happening to every sibling as well [duplicate]

This question already has answers here:
List of lists changes reflected across sublists unexpectedly
(17 answers)
Closed 3 years ago.
I am attempting to make an Abelian Sandpile Model (see: Rosetta Code) in Python 3.8, and I have not looked at the provided solution because this is a practice exercise for me. I am getting started with a possible solution, but I am not sure why the following issue is occurring.
I have a 5x5 grid, composed of lists.
# a list composed of 5 child-lists, each containing five 0's
area = [[0]*5]*5
My first step is to place the sandpile (i.e. change one (x, y) pair to a higher number), and I have written this function to accomplish that:
# changes one location to a greater height, takes a one-based (x, y) pair
def make_sandpile(area, loc, height):
    # accommodate zero-based indexing
    loc = [n - 1 for n in loc]
    # unpack into x and y coordinates
    x, y = loc
    # change the corresponding point in the area
    area[y][x] = height
calling make_sandpile(area, (3, 3), 4) should have the following outcome:
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 -> 0 0 4 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
However, the outcome is as follows:
0 0 0 0 0 0 0 4 0 0
0 0 0 0 0 0 0 4 0 0
0 0 0 0 0 -> 0 0 4 0 0
0 0 0 0 0 0 0 4 0 0
0 0 0 0 0 0 0 4 0 0
What can I do to correct this?
Instead of [[0]*5]*5, you should use something like [[0] * 5 for _ in range(5)]. Otherwise you are just copying references: the outer list holds five references to one and the same inner list, so mutating one row mutates all of them.
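A quick check of the difference, reusing make_sandpile from the question:
# five distinct inner lists, so each row can be mutated independently
area = [[0] * 5 for _ in range(5)]
make_sandpile(area, (3, 3), 4)
for row in area:
    print(row)  # only the middle row now contains a 4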

How to encode dummy variables in Python for sequential data such that the same order is always maintained?

A simple issue really: I have a dataset that is too large to hold in memory, and thus I must load it and perform machine learning on it sequentially. One of my features is categorical and I would like to convert it to dummy variables, but I have two issues:
1) Not all of the categories are present in every splice, so I would like to add the missing categories even if they do not appear in the current splice.
2) The columns have to maintain the same order as before.
This is an example of the problem:
In[1]: import pandas as pd
splice1 = pd.Series(list('bdcccb'))
Out[1]: 0 b
1 d
2 c
3 c
4 c
5 b
dtype: object
In[2]: splice2 = pd.Series(list('accd'))
Out[2]: 0 a
1 c
2 c
3 d
dtype: object
In[3]: splice1_dummy = pd.get_dummies(splice1)
Out[3]: b c d
0 1 0 0
1 0 0 1
2 0 1 0
3 0 1 0
4 0 1 0
5 1 0 0
In[4]: splice2_dummy = pd.get_dummies(splice2)
Out[4]: a c d
0 1 0 0
1 0 1 0
2 0 1 0
3 0 0 1
Edit: how to deal with the N-1 rule? A dummy variable has to be dropped, but which one? Every new splice would hold different categorical variables.
If you pass the categories in the exact order that you want, get_dummies will maintain that order regardless. The code below shows how it's done.
In[1]: from pandas.api.types import CategoricalDtype
splice1 = pd.Series(list('bdcccb'))
splice1 = splice1.astype(CategoricalDtype(categories=['a','c','b','d']))
splice2 = pd.Series(list('accd'))
splice2 = splice2.astype(CategoricalDtype(categories=['a','c','b','d']))
In[2]: splice1_dummy = pd.get_dummies(splice1)
Out[2]: a c b d
0 0 0 1 0
1 0 0 0 1
2 0 1 0 0
3 0 1 0 0
4 0 1 0 0
5 0 0 1 0
In[3]: splice2_dummy = pd.get_dummies(splice2)
Out[3]: a c b d
0 1 0 0 0
1 0 1 0 0
2 0 1 0 0
3 0 0 0 1
Although, I still haven't solved the issue of which variable to drop (one option is sketched below).
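On that open point: since the categories are now pinned in a fixed order, one option (a sketch, not part of the original answer) is drop_first=True, which always drops the same first category ('a' here), so every splice loses the same column and stays aligned:
# with categories pinned to ['a', 'c', 'b', 'd'], drop_first=True
# always removes the 'a' column
pd.get_dummies(splice1, drop_first=True)  # columns: c, b, d
pd.get_dummies(splice2, drop_first=True)  # columns: c, b, d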

Counting instances in dataframe that match to another instance

So, I am working with over 100 attributes. Clearly I cannot be writing this out for each one:
df['column_name'] >= 1 & df['column_name'] <= 1
Say my dataframe looks like this-
A B C D E F G H I
1 1 1 1 1 0 1 1 0
0 0 1 1 0 0 0 0 1
0 0 1 0 0 0 1 1 1
0 1 1 1 1 0 0 0 0
I wish to find the number of instances with value 1 for labels C and I. The answer here is two (the 2nd and 3rd rows). I am working with a lot of attributes and certainly cannot hardcode them. How can I find this frequency?
Assume I have access to the list of class labels I wish to work with, i.e. [C, I].
I think you want DataFrame.all:
df[['C','I']].eq(1).all(axis=1).sum()
#2
We can also use:
df[['C','I']].astype(bool).all(axis=1).sum()
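Since the class labels are available as a list, the same pattern generalizes; a small sketch with a hypothetical labels variable:
labels = ['C', 'I']  # any subset of the attribute columns
df[labels].eq(1).all(axis=1).sum()
# 2 for the sample data above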

Convert Dictionary to Pandas in Python

I have a dict as follows:
data_dict = {'1.160.139.117': ['712907', '742068'],
             '1.161.135.205': ['667386', '742068'],
             '1.162.51.21': ['326136', '663056', '742068']}
I want to convert the dict into a dataframe:
df= pd.DataFrame.from_dict(data_dict, orient='index')
How can I create a dataframe that has columns representing the values of the dictionary and rows representing the keys of the dictionary, as below?
The best option is #4:
pd.get_dummies(df.stack()).sum(level=0)
Option 1:
One way you could do it:
df.stack().reset_index(level=1)\
  .set_index(0, append=True)['level_1']\
  .unstack().notnull().mul(1)
Output:
326136 663056 667386 712907 742068
1.160.139.117 0 0 0 1 1
1.161.135.205 0 0 1 0 1
1.162.51.21 1 1 0 0 1
Option 2
Or with a litte reshaping and pd.crosstab:
df2 = df.stack().reset_index(name='Values')
pd.crosstab(df2.level_0,df2.Values)
Output:
Values 326136 663056 667386 712907 742068
level_0
1.160.139.117 0 0 0 1 1
1.161.135.205 0 0 1 0 1
1.162.51.21 1 1 0 0 1
Option 3
df.stack().reset_index(name="Values")\
  .pivot(index='level_0', columns='Values')['level_1']\
  .notnull().astype(int)
Output:
Values 326136 663056 667386 712907 742068
level_0
1.160.139.117 0 0 0 1 1
1.161.135.205 0 0 1 0 1
1.162.51.21 1 1 0 0 1
Option 4 (@Wen pointed out a short solution, the fastest so far)
pd.get_dummies(df.stack()).sum(level=0)
Output:
326136 663056 667386 712907 742068
1.160.139.117 0 0 0 1 1
1.161.135.205 0 0 1 0 1
1.162.51.21 1 1 0 0 1
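A version note (not part of the original answers): Series.sum(level=...) was deprecated and later removed from pandas, so on recent versions Option 4 becomes:
pd.get_dummies(df.stack()).groupby(level=0).sum()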
