Create a weighted graph based on Dataframe - python

Consider a data frame like this:
id  source  Target  Weight
1   A       B       1
2   A       C       2
3   A       D       3
4   A       E       4
I want to draw a graph with NetworkX that shows two things:
1. A node with more connections is drawn larger.
2. An edge with more weight is drawn as a thicker line.

We can set edge_attr to Weight when we create the graph with from_pandas_edgelist. When we draw the graph, we can then call get_edge_attributes and pass the values as the width argument of the drawing function.
For node_size, we can use nx.degree to get the degree of each node in the graph:
nx.degree(G)
[('A', 4), ('B', 1), ('C', 1), ('D', 1), ('E', 1)]
We can then scale up the degree by some factor since these values are going to be quite small. I've chosen a factor of 200 here, but this can be adjusted:
[d[1] * 200 for d in nx.degree(G)]
[800, 200, 200, 200, 200]
All together it can look like:
G = nx.from_pandas_edgelist(
    df,
    source='source',
    target='Target',
    edge_attr='Weight'  # Set Edge Attribute to Weight Column
)
# Get Degree values and scale
scaled_degree = [d[1] * 200 for d in nx.degree(G)]
nx.draw(G,
        # Weights Based on Column
        width=list(nx.get_edge_attributes(G, 'Weight').values()),
        # Node size based on degree
        node_size=scaled_degree,
        # Colour Based on Degree
        node_color=scaled_degree,
        # Set color map to determine colours
        cmap='rainbow',
        with_labels=True)
plt.show()
Setup Used:
import networkx as nx
import pandas as pd
from matplotlib import pyplot as plt
df = pd.DataFrame({
'id': [1, 2, 3, 4],
'source': ['A', 'A', 'A', 'A'],
'Target': ['B', 'C', 'D', 'E'],
'Weight': [1, 2, 3, 4]
})
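As a variant, here is a minimal sketch that builds the size and width lists by iterating over the graph itself, so each value lines up with the node and edge order that nx.draw uses by default. The names node_sizes, edge_widths and weighted_degree are my own, and the weighted degree is only an optional alternative to the plain connection count, not something the question asked for.

```python
import networkx as nx
import pandas as pd

df = pd.DataFrame({
    'source': ['A', 'A', 'A', 'A'],
    'Target': ['B', 'C', 'D', 'E'],
    'Weight': [1, 2, 3, 4],
})
G = nx.from_pandas_edgelist(df, source='source', target='Target',
                            edge_attr='Weight')

# One size per node, in G.nodes() order (scale factor 200 as in the answer)
node_sizes = [G.degree(n) * 200 for n in G.nodes()]

# One width per edge, in G.edges() order
edge_widths = [G[u][v]['Weight'] for u, v in G.edges()]

# Optional alternative: sum of incident edge weights instead of edge count
weighted_degree = dict(G.degree(weight='Weight'))
```

The two lists can then be passed as node_size=node_sizes and width=edge_widths to nx.draw.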

Related

How to create a Crosstab Plot?

I would like to create a 'Crosstab' plot like the below using matplotlib or seaborn:
Using the following dataframe:
import pandas as pd
data = [['A', 'C', 2], ['A', 'D', 8], ['B', 'C', 25], ['B', 'D', 30]]
df = pd.DataFrame(data = data, columns = ['col', 'row', 'val'])
col row val
0 A C 2
1 A D 8
2 B C 25
3 B D 30
One option in matplotlib is to add Rectangle patches at the origin via plt.gca and add_patch. The problem is that I did everything manually, like this:
from matplotlib.patches import Rectangle
import matplotlib.pyplot as plt
fig = plt.figure()
ax = fig.add_axes([0, 0, 1, 1])
plt.xlim(-10, 40)
plt.ylim(-40, 40)
plt.rcParams['figure.figsize'] = (10,16)
someX, someY = 0, 0
currentAxis = plt.gca()
currentAxis.add_patch(Rectangle((someX, someY), 30, 30, facecolor="purple"))
ax.text(15, 15, '30')
currentAxis.add_patch(Rectangle((someX, someY), 25, -25, facecolor="blue"))
ax.text(12.5, -12.5, '25')
currentAxis.add_patch(Rectangle((someX, someY), -2, -2, facecolor="red"))
ax.text(-1, -1, '2')
currentAxis.add_patch(Rectangle((someX, someY), -8, 8, facecolor="green"))
ax.text(-4, 4, '8')
Output:
As you can see, the plot doesn't look that nice. So I was wondering if it is possible to somehow automatically create 'Crosstab' plots using matplotlib or seaborn?
I am not sure whether matplotlib or seaborn has a dedicated function for this type of plot, but using bar and bar_label (available from matplotlib 3.4) instead of Rectangle and text might help automate things a little (label placement, etc.).
See the code below:
import matplotlib.pyplot as plt
data = [['A', 'C', 2], ['A', 'D', 8], ['B', 'C', 25], ['B', 'D', 30]]
pos={'A':-1,'B':0,'C':-1,'D':1}
fig,ax=plt.subplots(figsize=(10,10))
p=[ax.bar(pos[d[0]]*d[2],pos[d[1]]*d[2],width=d[2],align='edge') for d in data]
[ax.bar_label(p[i],labels=[data[i][2]], label_type='center',fontsize=18) for i in range(len(data))]
ax.set_aspect('equal')
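If you would rather not hard-code the positions, the quadrant signs can be derived from the dataframe. This is only a sketch: quadrant_rects is a hypothetical helper of mine, and it assumes exactly two distinct values in 'col' (left, right) and two in 'row' (bottom, top), sorted alphabetically, which matches the layout in the question.

```python
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle

data = [['A', 'C', 2], ['A', 'D', 8], ['B', 'C', 25], ['B', 'D', 30]]
df = pd.DataFrame(data, columns=['col', 'row', 'val'])

def quadrant_rects(df):
    """Return one signed (width, height) pair per row of df."""
    cols = sorted(df['col'].unique())   # first value goes left, second right
    rows = sorted(df['row'].unique())   # first value goes down, second up
    x_sign = {cols[0]: -1, cols[1]: 1}
    y_sign = {rows[0]: -1, rows[1]: 1}
    return [(x_sign[c] * v, y_sign[r] * v)
            for c, r, v in zip(df['col'], df['row'], df['val'])]

rects = quadrant_rects(df)

fig, ax = plt.subplots(figsize=(6, 6))
for i, ((w, h), v) in enumerate(zip(rects, df['val'])):
    # Every square is anchored at the origin and grows into its quadrant
    ax.add_patch(Rectangle((0, 0), w, h, facecolor=f'C{i}'))
    ax.text(w / 2, h / 2, str(v), ha='center', va='center')

# Symmetric limits so that all four quadrants are visible
limit = df['val'].max() * 1.1
ax.set_xlim(-limit, limit)
ax.set_ylim(-limit, limit)
ax.set_aspect('equal')
```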

How to change python pyplot legend with 4 legend instead of 2

Thanks for taking time on my question.
I have two DataFrames, each with several columns:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
df=pd.DataFrame([['A',10, 22], ['A',12, 15], ['A',0, 2], ['A', 20, 25], ['A', 5, 5], ['A',12, 11], ['B', 0 ,0], ['B', 9 ,0], ['B', 8 ,50], ['B', 0 ,0], ['B', 18 ,5], ['B', 7 ,6],['C', 10 ,11], ['C', 9 ,10], ['C', 8 ,2], ['C', 6 ,2], ['C', 8 ,5], ['C', 6 ,8]],
columns=['Name', 'Value_01','Value_02'])
df_agreement=pd.DataFrame([['A', '<66%', '>80'],['B', '>80%', '>66% & <80%'], ['C', '<66%', '<66%']], columns=['Name', 'Agreement_01', 'Agreement_02'])
My goal is to create a boxplot from this DataFrame, with ['Value_01', 'Value_02'] as the values and 'Name' on the x-axis. To do so, I draw a seaborn boxplot with the following code:
fig = plt.figure()
# Change seaborn plot size
fig.set_size_inches(60, 40)
plt.xticks(rotation=70)
plt.yticks(fontsize=40)
df_02=pd.melt(df, id_vars=['Name'],value_vars=['Value_01', 'Value_02'])
bp=sns.boxplot(x='Name',y='value',hue="variable",showfliers=True, data=df_02,showmeans=True,meanprops={"marker": "+",
"markeredgecolor": "black",
"markersize": "20"})
bp.set_xlabel("Name", fontsize=45)
bp.set_ylabel('Value', fontsize=45)
bp.legend(handles=bp.legend_.legendHandles, labels=['V_01', 'V_02'])
Okay, this part works: I get 6 boxplots, two for each name.
The tricky part is that I want to use df_agreement to change the color of my boxplots, depending on whether the agreement is <66% or not. So I added this to my code:
list_color_1=[]
list_color_2=[]
for i in range(0, len(df_agreement)):
    name=df_agreement.loc[i,'Name']
    if df_agreement.loc[i,'Agreement_01']=="<66%":
        list_color_1.append(i*2)
    if df_agreement.loc[i,'Agreement_02']=="<66%":
        list_color_2.append(i*2+1)
for k in list_color_1:
    mybox = bp.artists[k]
    # Change the appearance of that box
    mybox.set_facecolor("#D1DBE6") #facecolor is the inside color of the boxplot
    mybox.set_edgecolor('black') #edgecolor is the line color of the box
    mybox.set_linewidth(2)
for k in list_color_2:
    mybox = bp.artists[k]
    # Change the appearance of that box
    mybox.set_facecolor("#EFDBD1") #facecolor is the inside color of the boxplot
    mybox.set_edgecolor('black') #edgecolor is the line color of the box
    mybox.set_linewidth(2)
It works well: the boxplots change according to the values in df_agreement.
Unfortunately, I would also like the legend to read ["V_01", "V_02", "V_01 with less 66% agreement", "V_02 with less 66% agreement"], with the corresponding colors.
Do you have an idea how to do that?
Thank you very much! :)
You could add custom legend elements, extending the list of handles. Here is an example.
handles, labels = bp.get_legend_handles_labels()
new_handles = handles + [plt.Rectangle((0, 0), 0, 0, facecolor="#D1DBE6", edgecolor='black', linewidth=2),
plt.Rectangle((0, 0), 0, 0, facecolor="#EFDBD1", edgecolor='black', linewidth=2)]
bp.legend(handles=new_handles,
labels=['V_01', 'V_02', "V_01 with less\n than 66% agreement", "V_02 with less\n than 66% agreement"])
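The same idea can be tried without seaborn. Below is a small self-contained sketch that builds all four legend entries from proxy artists (matplotlib.patches.Patch). The 'C0' and 'C1' colours are placeholders I picked for the two default hue colours; in the real plot you would keep the handles returned by get_legend_handles_labels for those two entries.

```python
import matplotlib.pyplot as plt
from matplotlib.patches import Patch

fig, ax = plt.subplots()

# (label, facecolor) pairs; the first two colours are placeholders
entries = [
    ('V_01', 'C0'),
    ('V_02', 'C1'),
    ('V_01 with less\n than 66% agreement', '#D1DBE6'),
    ('V_02 with less\n than 66% agreement', '#EFDBD1'),
]

# A Patch is never drawn on the axes; it only supplies the legend swatch
handles = [Patch(facecolor=color, edgecolor='black', linewidth=2, label=label)
           for label, color in entries]
legend = ax.legend(handles=handles)
```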

plotting pandas groupby with x-axis in columns

I have a dataframe of time series where the columns are the time values (in order) and each row is a separate series. I also have extra columns that give the category of each row, which in turn determines the linestyle and the color.
Here's the dataframe:
>>> df
                 cat  (frac_norm, 2, 1)                                                                                             cluster
month_rel                   -5        -4        -3        -2        -1         0         1         2         3         4         5
user1   user2
3414845 4232621  -1b  0.760675  0.789854   0.95941  0.867755  0.790102         1  0.588729  0.719073  0.695572  0.647696  0.656323        4
4369232 3370279  -1b  0.580436  0.546761   0.71343  0.742033  0.802198  0.389957  0.861451  0.651786  0.798265  0.476305  0.896072        0
22771   3795428  -1b  0.946188  0.499531  0.834885  0.825772  0.754018   0.67823  0.430692  0.353989  0.333761  0.284759  0.260501        2
2660226 3126314  -1b  0.826701   0.81203  0.765182  0.680162  0.763475  0.802632         1  0.780186  0.844019  0.868698  0.722672        4
4154510 4348009  -1b         1  0.955656  0.677647  0.911556   0.76613  0.743759   0.61798  0.606536  0.715528  0.614902  0.482267        3
2860801 164553   -1b  0.870056  0.371981  0.640212  0.835185  0.673108  0.536585         1  0.850242  0.551198  0.873016  0.635556        4
120577  3480468  -1b    0.8197  0.879873  0.961178         1  0.855465  0.827824  0.827139  0.304011  0.574978  0.473996  0.358934        3
6692132 5095003  -1b         1  0.995859  0.738418  0.991217  0.854336  0.936518  0.910347  0.883205  0.987796  0.699433  0.815072        4
2515737 4263756  -1b  0.949047  0.990238  0.899524         1  0.961066   0.83703  0.835114  0.759142  0.749727  0.886913  0.936961        4
707596  2856619  -1b  0.780538  0.702179  0.568627         1  0.601382  0.789116         0 0.0714286         0  0.111969 0.0739796        2
I can make the following plot, where the x-axis holds the ordered values of ('frac_norm', 2, 1), the color depends on the value of cluster, and the linestyle depends on the value of cat. However, it is drawn row by row. Is there a way to vectorize this, say, by using groupby?
My code for generating image
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
colors = ['r','g','b','c','y','k']
lnst = ['-','--']
cats = np.sort(df['cat'].unique())
clusters = np.sort(df['cluster'].unique())
colordict = dict(zip(clusters, colors))
lnstdict = dict(zip(cats,lnst))
fig, ax = plt.subplots()
# I first do it by `cluster` value
for clus_val in clusters:
    clr = colordict[clus_val]
    subset = df[df['cluster'] == clus_val]
    # and then plot each row individually, setting the color and linestyle
    for row in subset.iterrows():
        ax.plot(row[1][('frac_norm', 2, 1)], color=clr,
                linestyle=lnstdict[row[1]['cat'][0]]
                )
Code for generating df
import pandas as pd
import numpy as np
vals = np.array([['-1b', 0.7606747496046389, 0.7898535589129476, 0.959409594095941,
0.8677546569280126, 0.7901020186672455, 1.0, 0.5887286145588728,
0.7190726452719073, 0.6955719557195572, 0.6476962793343348,
0.6563233814156323, 4],
['-1b', 0.5804363905325444, 0.5467611336032389,
0.7134300126103406, 0.7420329670329671, 0.8021978021978022,
0.389957264957265, 0.861451048951049, 0.6517857142857143,
0.798265460030166, 0.4763049450549451, 0.8960720130932898, 0],
['-1b', 0.9461875843454791, 0.49953095684803, 0.8348848603625673,
0.8257715338553662, 0.7540183696900115, 0.6782302664655606,
0.43069179143004643, 0.35398860398860393, 0.33376068376068374,
0.28475935828877, 0.260501012145749, 2],
['-1b', 0.8267008985879333, 0.8120300751879698,
0.7651821862348178, 0.680161943319838, 0.7634749524413443,
0.8026315789473684, 1.0, 0.7801857585139319, 0.8440191387559809,
0.8686980609418281, 0.7226720647773278, 4],
['-1b', 1.0, 0.955656108597285, 0.6776470588235294,
0.9115556882651537, 0.766129636568003, 0.7437589670014347,
0.6179800221975582, 0.6065359477124183, 0.715527950310559,
0.6149019607843138, 0.4822670674109059, 3],
['-1b', 0.8700564971751412, 0.3719806763285024,
0.6402116402116402, 0.8351851851851851, 0.6731078904991948,
0.5365853658536585, 1.0, 0.8502415458937197, 0.55119825708061,
0.873015873015873, 0.6355555555555555, 4],
['-1b', 0.8196997807387418, 0.879872907246731, 0.961178456344944,
1.0, 0.8554654738607772, 0.8278240873814314, 0.8271388025408839,
0.3040112596762843, 0.5749778172138421, 0.47399605003291634,
0.35893441346004046, 3],
['-1b', 1.0, 0.9958592132505176, 0.7384176764076977,
0.9912165129556433, 0.8543355440923606, 0.9365176566646254,
0.9103471520053926, 0.8832054560954816, 0.9877955758962623,
0.6994328922495274, 0.8150724637681159, 4],
['-1b', 0.9490474080638015, 0.9902376128200405,
0.8995240613432046, 1.0, 0.9610655737704917, 0.837029893924783,
0.8351136964569011, 0.759142496847415, 0.7497267759562841,
0.8869130313976105, 0.9369612979550449, 4],
['-1b', 0.7805383022774327, 0.7021791767554478,
0.5686274509803921, 1.0, 0.6013824884792627, 0.7891156462585033,
0.0, 0.07142857142857142, 0.0, 0.11196911196911197,
0.07397959183673469, 2]], dtype=object)
cols = pd.MultiIndex.from_tuples([( 'cat', ''),
(('frac_norm', 2, 1), -5),
(('frac_norm', 2, 1), -4),
(('frac_norm', 2, 1), -3),
(('frac_norm', 2, 1), -2),
(('frac_norm', 2, 1), -1),
(('frac_norm', 2, 1), 0),
(('frac_norm', 2, 1), 1),
(('frac_norm', 2, 1), 2),
(('frac_norm', 2, 1), 3),
(('frac_norm', 2, 1), 4),
(('frac_norm', 2, 1), 5),
( 'cluster', '')],
names=[None, 'month_rel'])
idx = pd.MultiIndex.from_tuples([(3414845, 4232621),
(4369232, 3370279),
( 22771, 3795428),
(2660226, 3126314),
(4154510, 4348009),
(2860801, 164553),
( 120577, 3480468),
(6692132, 5095003),
(2515737, 4263756),
( 707596, 2856619)],
names=['user1', 'user2'])
df = pd.DataFrame(vals, columns=cols, index=idx)
You should be able to use pandas to plot and avoid the loops:
from matplotlib.colors import LinearSegmentedColormap
colors = ['r','g','b','c','y','k']
lnst = ['-','--']
cats = np.sort(df['cat'].unique())
clusters = np.sort(df['cluster'].unique())
colordict = dict(zip(clusters, colors))
lnstdict = dict(zip(cats,lnst))
# transpose data frame
df1 = df.T
# map colors from colordict to cluster
cmap = df['cluster'].map(colordict).values.tolist()
# create a custom color map and line style
lscm = LinearSegmentedColormap.from_list('color', cmap)
lstyle = df['cat'].map(lnstdict).values.tolist()
# plot with pandas
df1.iloc[1:12].reset_index(level=0, drop=True).plot(figsize=(20,10),
colormap=lscm,
style=lstyle)
Update (assuming you want both on the same graph)
from matplotlib.colors import LinearSegmentedColormap
import matplotlib.pyplot as plt
%matplotlib inline
colors = ['r','g','b','c','y','k']
lnst = ['-','--']
cats = np.sort(df['cat'].unique())
clusters = np.sort(df['cluster'].unique())
colordict = dict(zip(clusters, colors))
lnstdict = dict(zip(cats,lnst))
# transpose data frame
df1 = df.T
# map colors from colordict to cluster
cmap = df['cluster'].map(colordict).values.tolist()
# create a custom color map and line style
lscm = LinearSegmentedColormap.from_list('color', cmap)
lstyle = df['cat'].map(lnstdict).values.tolist()
c = df.columns
# not needed for your actual dataframe;
# I am just converting your sample data to numeric
for i in range(len(df.columns[1:])-1):
    df[c[i+1]] = pd.to_numeric(df[c[i+1]])
# groupby and get mean of cluster
df2 = df[c[1:]].groupby('cluster').mean()
# create subplots object from matplotlib
fig, ax = plt.subplots()
# add a second x-axis that shares the y-axis
ax2 = ax.twiny()
# plot dataframe 1
df1.iloc[1:12].reset_index(level=0, drop=True).plot(ax=ax, figsize=(20,10),
colormap=lscm,
style=lstyle)
# create legend for ax
handles, labels = ax.get_legend_handles_labels()
ax.legend(handles, labels, loc='center left', borderaxespad=-20)
# subplot df2
df2.plot(ax=ax2, colormap='copper')
# create legend for ax2
handles, labels = ax2.get_legend_handles_labels()
ax2.legend(handles, labels, loc='center right', borderaxespad=-20)
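For the groupby idea from the question itself, here is a rough sketch on a small made-up frame (flat columns, random values, hypothetical category labels, so not your MultiIndex data). It groups on both category columns and draws all rows of a group in a single ax.plot call, relying on the fact that ax.plot draws one line per column of a 2-D array.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical sample: one row per series, month columns plus two categories
months = list(range(-2, 3))
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((6, len(months))), columns=months)
df['cat'] = ['-1b', '-1b', '+1b', '-1b', '+1b', '+1b']
df['cluster'] = [0, 0, 1, 1, 2, 2]

colordict = {0: 'r', 1: 'g', 2: 'b'}
lnstdict = {'-1b': '-', '+1b': '--'}

fig, ax = plt.subplots()
for (cluster, cat), group in df.groupby(['cluster', 'cat']):
    # Transpose so that months run down the rows: one line per series
    ax.plot(months, group[months].T.to_numpy(),
            color=colordict[cluster], linestyle=lnstdict[cat])
```

This still loops, but only once per (cluster, cat) combination rather than once per row.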

Is there a way to create a stacked bar graph from pandas?

I have an sqlite database setup with some data. I have imported it through sql statements via pandas:
df1 = pd.read_sql_query("Select avg(Duration),keyword,filename from keywords group by keyword,filename order by filename", con)
The data looks as follows:
Based on this I want to construct a stacked bar graph that looks like this:
I've tried various solutions, including matplotlib and pandas.plot, but I'm unable to construct this graph.
Thanks in advance.
This snippet should work:
import pandas as pd
import matplotlib.pyplot as plt
data = [[2, 'A', 'output.xml'], [5, 'B', 'output.xml'],
[3, 'A', 'output.xml'], [2, 'B', 'output.xml'],
[5, 'C', 'output2.xml'], [1, 'B', 'output2.xml'],
[6, 'C', 'output.xml'], [3, 'C', 'output2.xml'],
[3, 'A', 'output2.xml'], [3, 'B', 'output.xml'],
[2, 'C', 'output.xml'], [1, 'C', 'output2.xml']
]
df = pd.DataFrame(data, columns = ['duration', 'Keyword', 'Filename'])
df2 = df.groupby(['Filename', 'Keyword'])['duration'].sum().unstack('Keyword').fillna(0)
df2[['A','B', 'C']].plot(kind='bar', stacked=True)
It is similar to this question, with the difference that I sum the values of the field concerned instead of counting them.
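The reshaping step is what matters here, so it may help to look at the wide table before plotting. A small sketch on the sample data above, using pivot_table as an alternative to groupby/unstack: note that pivot_table averages duplicate (Filename, Keyword) pairs by default, so aggfunc='sum' has to be passed to get the same totals as the groupby version. The names summed and averaged are just mine.

```python
import pandas as pd

data = [[2, 'A', 'output.xml'], [5, 'B', 'output.xml'],
        [3, 'A', 'output.xml'], [2, 'B', 'output.xml'],
        [5, 'C', 'output2.xml'], [1, 'B', 'output2.xml'],
        [6, 'C', 'output.xml'], [3, 'C', 'output2.xml'],
        [3, 'A', 'output2.xml'], [3, 'B', 'output.xml'],
        [2, 'C', 'output.xml'], [1, 'C', 'output2.xml']]
df = pd.DataFrame(data, columns=['duration', 'Keyword', 'Filename'])

# One row per file, one column per keyword, values summed
summed = df.pivot_table(index='Filename', columns='Keyword',
                        values='duration', aggfunc='sum', fill_value=0)

# Same shape, but with the default aggregation (mean)
averaged = df.pivot_table(index='Filename', columns='Keyword',
                          values='duration')
```

Either frame can then be drawn with .plot(kind='bar', stacked=True).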
1. You just have to use:
ax=df.pivot_table(index='filename',columns='keyword',values='avg(duration)').plot(kind='bar',stacked=True,figsize=(15,15),fontsize=25)
ax.legend(fontsize=25)
2. Example
df=pd.DataFrame()
df['avg(duration)']=[7,4,5,9,3,2]
df['keyword']=['a','b','c','a','b','c']
df['filename']=['out1','out1','out1','out2','out2','out2']
df
2.1 Output df example:
   avg(duration) keyword filename
0              7       a     out1
1              4       b     out1
2              5       c     out1
3              9       a     out2
4              3       b     out2
5              2       c     out2
2.2 Drawing
ax=df.pivot_table(index='filename',columns='keyword',values='avg(duration)').plot(kind='bar',stacked=True,figsize=(15,15),fontsize=25)
ax.legend(fontsize=25)
2.3 Output image example:
3. In addition, you can use:
#set ylim
plt.ylim(-1, 20)
plt.xlim(-1,4)
#grid on
plt.grid()
# set y=0
ax.axhline(0, color='black', lw=1)
#change size of legend
ax.legend(fontsize=25,loc=(0.9,0.4))
#hiding upper and right axis layout
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
#changing the thickness
ax.spines['bottom'].set_linewidth(3)
ax.spines['left'].set_linewidth(3)
#setlabels
ax.set_xlabel('filename',fontsize=20,color='r')
ax.set_ylabel('avg(duration)',fontsize=20,color='r')
#rotation
plt.xticks(rotation=0)

Matplotlib histogram where each bin is colored by frequency of additional parameter

If I have
data = [(1, 'a'), (1, 'b'), (1, 'a'), (2, 'a'), (2, 'b'), (3, 'b'), (3, 'c'), (3, 'c'), (3, 'c')]
such that there are two properties for each datapoint:
x, y = zip(*data)
I can display x in a histogram, a la:
x = [1, 1, 1, 2, 2, 3, 3, 3, 3]
bins = [1, 2, 3]; f = [3, 2, 4]
Then, using the second property,
y = ['a', 'b', 'a', 'a', 'b', 'b', 'c', 'c', 'c']
each bin from the original histogram has frequency information for the secondary parameter:
bins[0] = {'a': 2, 'b': 1}
bins[1] = {'a': 1, 'b': 1}
bins[2] = {'b': 1, 'c': 3}
Using matplotlib, I can create the basic histogram of x:
import matplotlib.pyplot as plt
fig = plt.figure()
ax = fig.add_subplot(111)
n, bins, patches = ax.hist(x, 3)
Is there a clever way to iterate over the patches, perhaps, to break them up into appropriately sized rectangles that reflect the additional information, y?
In the example, if I wanted 'a' to be red, 'b' to be green and 'c' to be blue, then first bin (x = 1) would be two-thirds red and one-third green, the second bin (x = 2) would be half red and half green and the final bin (x = 3) would be one-fourth green and three-fourths blue.
Example illustration
I realize this is not quite a complete answer, but if you were to reformat your data, then you could use some built-in functionality of hist to avoid having to code everything up by hand.
For example, you might make a list that contains all of the x values whose y value equals 'a', another where y = 'b', and finally one where y = 'c'. You can then put those lists into another list and call hist on that data with stacked=True.
See http://matplotlib.org/1.3.1/examples/pylab_examples/histogram_demo_extended.html (5th panel down) for an illustration.
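A rough sketch of that suggestion follows. The sample points are written out so that they match the per-bin counts listed in the question ({'a': 2, 'b': 1}, {'a': 1, 'b': 1}, {'b': 1, 'c': 3}); the bin edges at half-integers and the groups dict are my own choices.

```python
import matplotlib.pyplot as plt

data = [(1, 'a'), (1, 'b'), (1, 'a'), (2, 'a'), (2, 'b'),
        (3, 'b'), (3, 'c'), (3, 'c'), (3, 'c')]

# Split the x values into one list per category
groups = {}
for x, y in data:
    groups.setdefault(y, []).append(x)

labels = sorted(groups)
colors = {'a': 'red', 'b': 'green', 'c': 'blue'}

fig, ax = plt.subplots()
# With stacked=True the returned counts are cumulative across categories
counts, edges, _ = ax.hist([groups[k] for k in labels],
                           bins=[0.5, 1.5, 2.5, 3.5],
                           stacked=True,
                           color=[colors[k] for k in labels],
                           label=labels)
ax.legend()
```

The last entry of counts holds the total height of each bar, which is the same as the plain histogram of x.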
