Is there a way to create a stacked bar graph from pandas? - python

I have an sqlite database setup with some data. I have imported it through sql statements via pandas:
df1 = pd.read_sql_query("Select avg(Duration),keyword,filename from keywords group by keyword,filename order by filename", con)
The data looks as follows:
Based on this I want to construct a stacked bar graph that looks like this:
I've tried various solutions, including matplotlib and pandas.plot, but I'm unable to construct this graph.
Thanks in advance.

This snippet should work:
import pandas as pd
import matplotlib.pyplot as plt
data = [[2, 'A', 'output.xml'], [5, 'B', 'output.xml'],
[3, 'A', 'output.xml'], [2, 'B', 'output.xml'],
[5, 'C', 'output2.xml'], [1, 'B', 'output2.xml'],
[6, 'C', 'output.xml'], [3, 'C', 'output2.xml'],
[3, 'A', 'output2.xml'], [3, 'B', 'output.xml'],
[2, 'C', 'output.xml'], [1, 'C', 'output2.xml']
]
df = pd.DataFrame(data, columns = ['duration', 'Keyword', 'Filename'])
df2 = df.groupby(['Filename', 'Keyword'])['duration'].sum().unstack('Keyword').fillna(0)
df2[['A','B', 'C']].plot(kind='bar', stacked=True)
It is similar to this question, with the difference that I sum the values of the relevant field instead of counting them.
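To check the reshaping step on its own before plotting, here is a minimal sketch that uses the same sample rows and only prints the wide table. The name `wide` is just an illustrative choice. If the SQL query has already aggregated the data, each (Filename, Keyword) group holds a single row, so `sum()` leaves the values unchanged.

```python
import pandas as pd

# Same sample rows as in the snippet above: duration, keyword, file name.
data = [[2, 'A', 'output.xml'], [5, 'B', 'output.xml'],
        [3, 'A', 'output.xml'], [2, 'B', 'output.xml'],
        [5, 'C', 'output2.xml'], [1, 'B', 'output2.xml'],
        [6, 'C', 'output.xml'], [3, 'C', 'output2.xml'],
        [3, 'A', 'output2.xml'], [3, 'B', 'output.xml'],
        [2, 'C', 'output.xml'], [1, 'C', 'output2.xml']]
df = pd.DataFrame(data, columns=['duration', 'Keyword', 'Filename'])

# One row per file, one column per keyword; missing combinations become 0.
wide = (df.groupby(['Filename', 'Keyword'])['duration']
          .sum()
          .unstack('Keyword')
          .fillna(0))
print(wide)
```

Each row of `wide` becomes one bar and each column one stacked segment when it is passed to `.plot(kind='bar', stacked=True)`.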

1. You just have to use (with the column names returned by the query in the question):
ax=df.pivot_table(index='filename',columns='keyword',values='avg(Duration)').plot(kind='bar',stacked=True,figsize=(15,15),fontsize=25)
ax.legend(fontsize=25)
2. Example
df=pd.DataFrame()
df['avg(duration)']=[7,4,5,9,3,2]
df['keyword']=['a','b','c','a','b','c']
df['filename']=['out1','out1','out1','out2','out2','out2']
df
2.1 Output df example:
   avg(duration) keyword filename
0              7       a     out1
1              4       b     out1
2              5       c     out1
3              9       a     out2
4              3       b     out2
5              2       c     out2
2.2 Drawing
ax=df.pivot_table(index='filename',columns='keyword',values='avg(duration)').plot(kind='bar',stacked=True,figsize=(15,15),fontsize=25)
ax.legend(fontsize=25)
2.3 Output image example:
3. In addition, you can use:
import matplotlib.pyplot as plt
# set the axis limits
plt.ylim(-1, 20)
plt.xlim(-1, 4)
# turn the grid on
plt.grid()
# draw a horizontal line at y=0
ax.axhline(0, color='black', lw=1)
# change the size and position of the legend
ax.legend(fontsize=25, loc=(0.9, 0.4))
# hide the top and right spines
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
# change the thickness of the bottom and left spines
ax.spines['bottom'].set_linewidth(3)
ax.spines['left'].set_linewidth(3)
# set the axis labels
ax.set_xlabel('filename', fontsize=20, color='r')
ax.set_ylabel('avg(duration)', fontsize=20, color='r')
# keep the x tick labels horizontal
plt.xticks(rotation=0)
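Putting the pieces of this answer together, a self-contained sketch could look like the following. It assumes the data is already aggregated (one row per file/keyword pair), so plain `pivot` is enough. The column names, the figure size and the `Agg` backend line are illustrative choices, not something the question prescribes.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: render off-screen; drop this line for an interactive window
import matplotlib.pyplot as plt
import pandas as pd

# Already-aggregated sample data: one row per (filename, keyword) pair.
df = pd.DataFrame({
    'avg(duration)': [7, 4, 5, 9, 3, 2],
    'keyword': ['a', 'b', 'c', 'a', 'b', 'c'],
    'filename': ['out1', 'out1', 'out1', 'out2', 'out2', 'out2'],
})

# Rows become bars, columns become the stacked segments.
wide = df.pivot(index='filename', columns='keyword', values='avg(duration)')

fig, ax = plt.subplots(figsize=(8, 6))
wide.plot(kind='bar', stacked=True, ax=ax)

# Styling through the Axes object instead of the pyplot state machine.
ax.set_xlabel('filename')
ax.set_ylabel('avg(duration)')
ax.tick_params(axis='x', labelrotation=0)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.legend(title='keyword')
```

Call `fig.savefig(...)` or, with an interactive backend, `plt.show()` to see the result.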

Related

How to create a Crosstab Plot?

I would like to create a 'Crosstab' plot like the below using matplotlib or seaborn:
Using the following dataframe:
import pandas as pd
data = [['A', 'C', 2], ['A', 'D', 8], ['B', 'C', 25], ['B', 'D', 30]]
df = pd.DataFrame(data = data, columns = ['col', 'row', 'val'])
col row val
0 A C 2
1 A D 8
2 B C 25
3 B D 30
One option in matplotlib is to add Rectangles at the origin via plt.gca and add_patch. The problem is that I did everything manually, like this:
from matplotlib.patches import Rectangle
import matplotlib.pyplot as plt
fig = plt.figure()
ax = fig.add_axes([0, 0, 1, 1])
plt.xlim(-10, 40)
plt.ylim(-40, 40)
plt.rcParams['figure.figsize'] = (10,16)
someX, someY = 0, 0
currentAxis = plt.gca()
currentAxis.add_patch(Rectangle((someX, someY), 30, 30, facecolor="purple"))
ax.text(15, 15, '30')
currentAxis.add_patch(Rectangle((someX, someY), 25, -25, facecolor="blue"))
ax.text(12.5, -12.5, '25')
currentAxis.add_patch(Rectangle((someX, someY), -2, -2, facecolor="red"))
ax.text(-1, -1, '2')
currentAxis.add_patch(Rectangle((someX, someY), -8, 8, facecolor="green"))
ax.text(-4, 4, '8')
Output:
As you can see, the plot doesn't look that nice. So I was wondering if it is possible to somehow automatically create 'Crosstab' plots using matplotlib or seaborn?
I am not sure whether matplotlib or seaborn has a dedicated function for this type of plot, but using plt.bar and plt.bar_label instead of Rectangle and plt.Text might help automate things a little (label placement etc.).
See code below:
import matplotlib.pyplot as plt
data = [['A', 'C', 2], ['A', 'D', 8], ['B', 'C', 25], ['B', 'D', 30]]
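# -1 puts a category on the left/bottom side of the origin, 0 or 1 on the right/top side
pos = {'A': -1, 'B': 0, 'C': -1, 'D': 1}
fig, ax = plt.subplots(figsize=(10, 10))
# left edge, height and width all derive from the value, so each bar is a square touching the origin
bars = [ax.bar(pos[col] * val, pos[row] * val, width=val, align='edge') for col, row, val in data]
for bar, (_, _, val) in zip(bars, data):
    ax.bar_label(bar, labels=[val], label_type='center', fontsize=18)
ax.set_aspect('equal')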
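The hard-coded `pos` dictionary can be avoided by deriving the bar geometry from the DataFrame itself. The sketch below is only an illustration: `quadrant_bars` is a made-up helper name, and it assumes exactly two categories per axis, with the first one (in order of appearance) drawn on the left/bottom side.

```python
import pandas as pd

df = pd.DataFrame([['A', 'C', 2], ['A', 'D', 8], ['B', 'C', 25], ['B', 'D', 30]],
                  columns=['col', 'row', 'val'])

def quadrant_bars(frame):
    """Return one (x, height, width) tuple per row so each square touches the origin."""
    # Assumption: two categories per axis; the first goes left/bottom,
    # the second goes right/top.
    left_cat = frame['col'].unique()[0]
    bottom_cat = frame['row'].unique()[0]
    bars = []
    for col, row, val in frame[['col', 'row', 'val']].itertuples(index=False):
        x = -val if col == left_cat else 0            # left edge of the square
        height = -val if row == bottom_cat else val   # negative height points down
        bars.append((x, height, val))
    return bars

bars = quadrant_bars(df)
print(bars)
```

Each tuple can then be drawn with `ax.bar(x, height, width=width, align='edge')`, as in the answer above.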

How to split a grouped plot in Seaborn Python?

I have a data frame like this:
df:
Type Col-1 Col-2
A 3 8
A 4 7
A 5 9
A 6 6
A 7 7
B 4 8
B 2 7
B 6 6
B 4 9
B 5 7
I have 2 violin plots for Col-1 & Col-2. Now, I want to create a single violin plot with 2 violin images for Type A & B. In the violin plot, I want to split every violin such that the left half of the violin denotes Col-1 & right half of the violin denotes Col-2. I created two separate violin plots for col-1 and col-2 but now I want to make it a single plot and represent 2 columns at a time by splitting. How can I do it?
This is my code for separate plots:
def violin(data):
    for col in data.columns:
        x = data[col].to_frame().reset_index()
        ax = sns.violinplot(data=x, x='type',y=col,inner='quart',split=True)
        plt.show()
violin(df)
This is what my current violin plots look like. I want to make them in single plot:
Can anyone help me with this?
Seaborn works easiest with data in "long form", combining the value columns.
Here is what the code could look like:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
df = pd.DataFrame({'Type': ['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B'],
'Col-1': [4, 3, 5, 6, 7, 4, 2, 6, 4, 5],
'Col-2': [7, 8, 9, 6, 7, 8, 7, 6, 9, 7]})
df_long = df.melt(id_vars=['Type'], value_vars=['Col-1', 'Col-2'], var_name='Col', value_name='Value')
plt.figure(figsize=(12, 5))
sns.set()
sns.violinplot(data=df_long, x='Type', y='Value', hue='Col', split=True, palette='spring')
plt.tight_layout()
plt.show()
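To see what "long form" means here, this small sketch prints the melted frame for a shortened version of the data (four rows instead of ten, chosen only to keep the output readable). Every original row turns into one row per value column, and the new `Col` column records where each value came from, which is what `hue='Col'` uses to split the violins.

```python
import pandas as pd

# Shortened sample: two rows per type.
df = pd.DataFrame({'Type': ['A', 'A', 'B', 'B'],
                   'Col-1': [3, 4, 4, 2],
                   'Col-2': [8, 7, 8, 7]})

# Wide -> long: 'Col' holds the former column name, 'Value' the measurement.
df_long = df.melt(id_vars=['Type'], value_vars=['Col-1', 'Col-2'],
                  var_name='Col', value_name='Value')
print(df_long)
```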

How to change python pyplot legend with 4 legend instead of 2

Thanks for taking time on my question.
I have 2 DataFrame composed of several columns:
df=pd.DataFrame([['A',10, 22], ['A',12, 15], ['A',0, 2], ['A', 20, 25], ['A', 5, 5], ['A',12, 11], ['B', 0 ,0], ['B', 9 ,0], ['B', 8 ,50], ['B', 0 ,0], ['B', 18 ,5], ['B', 7 ,6],['C', 10 ,11], ['C', 9 ,10], ['C', 8 ,2], ['C', 6 ,2], ['C', 8 ,5], ['C', 6 ,8]],
columns=['Name', 'Value_01','Value_02'])
df_agreement=pd.DataFrame([['A', '<66%', '>80'],['B', '>80%', '>66% & <80%'], ['C', '<66%', '<66%']], columns=['Name', 'Agreement_01', 'Agreement_02'])
My goal is to create a boxplot for this DataFrame, with ['Value_01', 'Value_02'] as values and 'Name' as x-values. To do so, I draw a seaborn boxplot with the following code:
fig = plt.figure()
# Change seaborn plot size
fig.set_size_inches(60, 40)
plt.xticks(rotation=70)
plt.yticks(fontsize=40)
df_02=pd.melt(df, id_vars=['Name'],value_vars=['Value_01', 'Value_02'])
bp=sns.boxplot(x='Name',y='value',hue="variable",showfliers=True, data=df_02,showmeans=True,meanprops={"marker": "+",
"markeredgecolor": "black",
"markersize": "20"})
bp.set_xlabel("Name", fontsize=45)
bp.set_ylabel('Value', fontsize=45)
bp.legend(handles=bp.legend_.legendHandles, labels=['V_01', 'V_02'])
Okay, this part works: I get 6 boxplots, two for each name.
What gets tricky is that I want to use df_agreement to change the color of my boxplots, depending on whether the agreement is <66% or not. So I added this to my code:
list_color_1=[]
list_color_2=[]
for i in range(0, len(df_agreement)):
    name=df_agreement.loc[i,'Name']
    if df_agreement.loc[i,'Agreement_01']=="<66%":
        list_color_1.append(i*2)
    if df_agreement.loc[i,'Agreement_02']=="<66%":
        list_color_2.append(i*2+1)
for k in list_color_1:
    mybox = bp.artists[k]
    # Change the appearance of that box
    mybox.set_facecolor("#D1DBE6") #facecolor is the inside color of the boxplot
    mybox.set_edgecolor('black') #edgecolor is the line color of the box
    mybox.set_linewidth(2)
for k in list_color_2:
    mybox = bp.artists[k]
    # Change the appearance of that box
    mybox.set_facecolor("#EFDBD1") #facecolor is the inside color of the boxplot
    mybox.set_edgecolor('black') #edgecolor is the line color of the box
    mybox.set_linewidth(2)
This works well: the boxplots change according to the values in df_agreement.
Unfortunately, I would also like to change the legend to ["V_01", "V_02", "V_01 with less 66% agreement", "V_02 with less 66% agreement"], with the corresponding colors in the legend.
Do you have an idea how to do that?
Thank you very much! :)
You could add custom legend elements, extending the list of handles. Here is an example.
handles, labels = bp.get_legend_handles_labels()
new_handles = handles + [plt.Rectangle((0, 0), 0, 0, facecolor="#D1DBE6", edgecolor='black', linewidth=2),
plt.Rectangle((0, 0), 0, 0, facecolor="#EFDBD1", edgecolor='black', linewidth=2)]
bp.legend(handles=new_handles,
labels=['V_01', 'V_02', "V_01 with less\n than 66% agreement", "V_02 with less\n than 66% agreement"])
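The same idea can be tried without the seaborn plot. In the sketch below two plain bars stand in for the boxplot (an assumption made purely to keep the example short), and the extra legend entries are built with `matplotlib.patches.Patch`, which carries its own label, so no separate `labels` list is needed.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: render off-screen; drop this line for an interactive window
import matplotlib.pyplot as plt
from matplotlib.patches import Patch

fig, ax = plt.subplots()

# Stand-in for the boxplot: two labelled bars provide the first two legend entries.
ax.bar([0], [1], color="C0", label="V_01")
ax.bar([1], [2], color="C1", label="V_02")
handles, labels = ax.get_legend_handles_labels()

# Extra legend-only entries; they are never drawn on the axes themselves.
extra = [
    Patch(facecolor="#D1DBE6", edgecolor="black", linewidth=2,
          label="V_01 with less\nthan 66% agreement"),
    Patch(facecolor="#EFDBD1", edgecolor="black", linewidth=2,
          label="V_02 with less\nthan 66% agreement"),
]
legend = ax.legend(handles=handles + extra)
```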

pandas data frame / numpy array - roll without aggregate function

rolling in python aggregates data:
x = pd.DataFrame([[1,'a'],[2,'b'],[3,'c'],[4,'d']], columns=['a','b'])
y = x.rolling(2).mean()
print(y)
gives:
     a  b
0  NaN  a
1  1.5  b
2  2.5  c
3  3.5  d
What I need instead are 3-dimensional DataFrames (or NumPy arrays), with a window of 3 samples shifted by 1 step (in this example):
[
[[1,'a'],[2,'b'],[3,'c']],
[[2,'b'],[3,'c'],[4,'d']]
]
What's the right way to do it for 900 samples, shifting by 1 at each step?
Using np.concatenate:
np.concatenate([x.values[:-1],
x.values[1:]], axis=1)\
.reshape([x.shape[0] - 1, x.shape[1], -1])
You can try concatenating shifted copies of the DataFrame, one per step of the chosen window length (2 here; df below is the DataFrame from the question):
length = df.dropna().shape[0]-1
cols = len(df.columns)
pd.concat([df.shift(1),df],axis=1).dropna().astype(int,errors='ignore').values.reshape((length,cols,2))
Out:
array([[[1, 'a'],
[2, 'b']],
[[2, 'b'],
[3, 'c']],
[[3, 'c'],
[4, 'd']]], dtype=object)
Let me know whether this solution suits your question.
p = x[['a','b']].values.tolist() # create a list of lists: [a, b] for every row of x
#### Output ####
[[1, 'a'], [2, 'b'], [3, 'c'], [4, 'd']]
# iterate over all positions except the last two; for every i, collect p[i], p[i+1], p[i+2] into a list
list_of_3 = [[p[i],p[i+1],p[i+2]] for i in range(len(p)-2)]
#### Output ####
[
[[1, 'a'], [2, 'b'], [3, 'c']],
[[2, 'b'], [3, 'c'], [4, 'd']]
]
# Use this if you need the result as a numpy ndarray
from numpy import array
a = array(list_of_3)
#### Output ####
[[['1' 'a']
['2' 'b']
['3' 'c']]
[['2' 'b']
['3' 'c']
['4' 'd']]
]
Since pandas 1.1 you can iterate over rolling objects:
[window.values.tolist() for window in x.rolling(3) if window.shape[0] == 3]
The if makes sure we only get full windows. This solution has the advantage that you can use any parameter of the handy rolling function of pandas.
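For a NumPy-only variant, the windows can also be gathered in one step with integer indexing. This is a sketch rather than a tuned implementation: `sliding_windows` is a hypothetical helper name, and the result is a copy of the data, which should be fine for a few hundred rows.

```python
import numpy as np
import pandas as pd

x = pd.DataFrame([[1, 'a'], [2, 'b'], [3, 'c'], [4, 'd']], columns=['a', 'b'])

def sliding_windows(frame, size):
    """Return an array of shape (n_windows, size, n_columns), shifting by 1 row per window."""
    values = frame.to_numpy()
    n_windows = len(values) - size + 1
    # Row i of idx holds the row positions i, i+1, ..., i+size-1.
    idx = np.arange(n_windows)[:, None] + np.arange(size)[None, :]
    return values[idx]

windows = sliding_windows(x, 3)
print(windows.shape)
```

Because the columns have mixed types, the result has dtype object and keeps the integers as integers.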

Plotting multiple columns of different sizes with Pandas

I'm fairly new to Pandas. Typically, when all columns are of equal size, I build np.zeros(count) matrices, then use a for loop to populate them with data from a text file (np.genfromtxt()) and do my graphing and analysis in matplotlib.
However, I am now trying to implement similar analysis with columns of different sizes on the same plot from a CSV file.
For instance:
data.csv:
A B C D E F
1 2 3 4 5 6
2 3 4 5 6 7
3 4 5 6
4 5
df = pandas.read_csv('data.csv')
ax = df.plot(x = 'A', y = 'B')
df.plot(x = 'C', y = 'D', ax = ax)
df.plot(x = 'E', y = 'F', ax = ax)
This code plots the first two columns on the same graph, but the rest of the information is lost (there are many more columns of mismatched sizes, but the x/y columns I am plotting are all the same size).
Is there an easier way to do all of this? Thanks!
Here is how you could generalize your solution:
I edited my answer to add error handling. If the last column has no partner, it will still work.
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
data = {
    'A' : [1, 2, 3, 4],
    'B' : [2, 3, 4, 5],
    'C' : [3, 4, 5, np.nan],
    'D' : [4, 5, 6, np.nan],
    'E' : [5, 6, np.nan, np.nan],
    'F' : [6, 7, np.nan, np.nan]
}
df = pd.DataFrame(data)
def Chris(df):
    ax = df.plot(x='A', y='B')
    df.plot(x='C', y='D', ax=ax)
    df.plot(x='E', y='F', ax=ax)
    plt.show()
def IMCoins(df):
    fig, ax = plt.subplots()
    try:
        for idx in range(0, df.shape[1], 2):
            df.plot(x = df.columns[idx],
                    y = df.columns[idx + 1],
                    ax= ax)
    except IndexError:
        print('Index Error: Log the error.')
    plt.show()
Chris(df)
IMCoins(df)
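A variation on the same loop, offered only as a sketch: pairing the columns with `zip` avoids the try/except, because `zip` stops at the shorter slice and an unpaired last column is simply skipped. The NaN padding is dropped per pair before plotting. `plot_column_pairs` is a made-up name, and the `Agg` backend line is an assumption so the example also runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: render off-screen; drop this line for an interactive window
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3, 4],
    'B': [2, 3, 4, 5],
    'C': [3, 4, 5, np.nan],
    'D': [4, 5, 6, np.nan],
    'E': [5, 6, np.nan, np.nan],
    'F': [6, 7, np.nan, np.nan],
})

def plot_column_pairs(frame, ax):
    """Plot columns 0/1, 2/3, ... as x/y pairs on the same axes."""
    cols = list(frame.columns)
    # zip stops at the shorter slice, so a lonely last column is ignored.
    for x_col, y_col in zip(cols[::2], cols[1::2]):
        pair = frame[[x_col, y_col]].dropna()  # remove the NaN padding of shorter columns
        ax.plot(pair[x_col], pair[y_col], label=y_col)
    ax.legend()
    return ax

fig, ax = plt.subplots()
plot_column_pairs(df, ax)
```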
