I am having trouble with a specific requirement for my graphs.
So far, I have done the following:
Read two dataframes
Create boxplots for the first dataframe and color the boxplots depending on the values of the second dataframe (the code is below, and more information is in my previous Stack Overflow question)
The code below works; my problem comes afterwards:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df=pd.DataFrame([['A',10, 22], ['A',12, 15], ['A',0, 2], ['A', 20, 25], ['A', 5, 5], ['A',12, 11], ['B', 0 ,0], ['B', 9 ,0], ['B', 8 ,50], ['B', 0 ,0], ['B', 18 ,5], ['B', 7 ,6],['C', 10 ,11], ['C', 9 ,10], ['C', 8 ,2], ['C', 6 ,2], ['C', 8 ,5], ['C', 6 ,8]],
                columns=['Name', 'Value_01','Value_02'])
df_agreement=pd.DataFrame([['A', '<66%', '>80'],['B', '>80%', '>66% & <80%'], ['C', '<66%', '<66%']], columns=['Name', 'Agreement_01', 'Agreement_02'])
fig = plt.figure()
# Change seaborn plot size
fig.set_size_inches(60, 40)
plt.xticks(rotation=70)
plt.yticks(fontsize=40)
df_02=pd.melt(df, id_vars=['Name'],value_vars=['Value_01', 'Value_02'])
bp=sns.boxplot(x='Name',y='value',hue="variable",showfliers=True, data=df_02,showmeans=True,meanprops={"marker": "+",
"markeredgecolor": "black",
"markersize": "20"})
bp.set_xlabel("Name", fontsize=45)
bp.set_ylabel('Value', fontsize=45)
handles, labels = bp.get_legend_handles_labels()
new_handles = handles + [plt.Rectangle((0, 0), 0, 0, facecolor="#D1DBE6", edgecolor='black', linewidth=2),
plt.Rectangle((0, 0), 0, 0, facecolor="#EFDBD1", edgecolor='black', linewidth=2)]
bp.legend(handles=new_handles,
labels=['V_01', 'V_02', "V_01 with less\n than 66% agreement", "V_02 with less\n than 66% agreement"])
list_color_1=[]
list_color_2=[]
for i in range(0, len(df_agreement)):
    name=df_agreement.loc[i,'Name']
    if df_agreement.loc[i,'Agreement_01']=="<66%":
        list_color_1.append(i*2)
    if df_agreement.loc[i,'Agreement_02']=="<66%":
        list_color_2.append(i*2+1)
for k in list_color_1:
    mybox = bp.artists[k]
    # Change the appearance of that box
    mybox.set_facecolor("#D1DBE6") #facecolor is the inside color of the boxplot
    mybox.set_edgecolor('black') #edgecolor is the line color of the box
    mybox.set_linewidth(2)
for k in list_color_2:
    mybox = bp.artists[k]
    # Change the appearance of that box
    mybox.set_facecolor("#EFDBD1") #facecolor is the inside color of the boxplot
    mybox.set_edgecolor('black') #edgecolor is the line color of the box
    mybox.set_linewidth(2)
Now I have a new DataFrame with the same structure as the first one (df), but with different values:
df_02=pd.DataFrame([['A',5, 20], ['A',15, 2], ['A',3, 5], ['A', 21, 24], ['A', 6, 6], ['A',10, 10], ['B', 0 ,0], ['B', 9 ,0], ['B', 9 ,5], ['B', -4 ,-2], ['B', 8 ,7], ['B', 8 ,9],['C', 10 ,15], ['C', 9 ,10], ['C', 8 ,2], ['C', 6 ,2], ['C', 8 ,5], ['C', 6 ,8]],
columns=['Name', 'Value_01','Value_02'])
What I would like to do is add a bar on each boxplot (and only on the boxplot) corresponding to the value from my second dataframe (df_02).
Does anyone have an idea how to do that?
I have a DataFrame with multiple columns and I would like to add two more: one for the highest number in the row, and another for the second highest. However, instead of the number, I would like to show the name of the column where it is found.
Assume the following data frame:
import pandas as pd
df = pd.DataFrame({'A': [1, 5, 10], 'B': [2, 6, 11], 'C': [3, 7, 12], 'D': [4, 8, 13], 'E': [5, 9, 14]})
To extract the highest number on every row, I can just apply max(axis=1) like this:
df['max1'] = df[['A', 'B', 'C', 'D', 'E']].max(axis = 1)
This gets me the max number, but not the column name itself.
How can this be applied to the second max number as well?
You can sort the values and assign the top 2 values:
import numpy as np
cols = ['A', 'B', 'C', 'D', 'E']
df[['max2','max1']] = np.sort(df[cols].to_numpy(), axis=1)[:, -2:]
print (df)
A B C D E max2 max1
0 1 2 3 4 5 4 5
1 5 6 7 8 9 8 9
2 10 11 12 13 14 13 14
# Or reverse the two sorted columns to assign them in max1, max2 order
df[['max1','max2']] = np.sort(df[cols].to_numpy(), axis=1)[:, -2:][:, ::-1]
EDIT: To get the top 2 column names and the top 2 values, use:
df = pd.DataFrame({'A': [1, 50, 10], 'B': [2, 6, 11],
'C': [3, 7, 12], 'D': [40, 8, 13], 'E': [5, 9, 14]})
cols = ['A', 'B', 'C', 'D', 'E']
#values in numpy array
vals = df[cols].to_numpy()
#columns names in array
cols = np.array(cols)
#get indices that would sort an array in descending order
arr = np.argsort(-vals, axis=1)
#top 2 columns names
df[['top1','top2']] = cols[arr[:, :2]]
#top 2 values
df[['max1','max2']] = vals[np.arange(arr.shape[0])[:, None], arr[:, :2]]
print (df)
A B C D E top1 top2 max1 max2
0 1 2 3 40 5 D E 40 5
1 50 6 7 8 9 A E 50 9
2 10 11 12 13 14 E D 14 13
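The argsort recipe above should generalize to any number of top entries. Here is a minimal sketch, assuming a hypothetical helper named `top_n` (it is not part of pandas or NumPy) and numeric columns so that negating the values is valid:

```python
import numpy as np
import pandas as pd


def top_n(df, cols, n=2):
    """Hypothetical helper: names and values of the n largest entries per row."""
    vals = df[cols].to_numpy()
    names = np.array(cols)
    # Column indices of the n largest values in each row, largest first.
    # A stable sort keeps the original column order when values tie.
    order = np.argsort(-vals, axis=1, kind="stable")[:, :n]
    rows = np.arange(len(df))
    out = pd.DataFrame(index=df.index)
    for rank in range(n):
        out[f"top{rank + 1}"] = names[order[:, rank]]
        out[f"max{rank + 1}"] = vals[rows, order[:, rank]]
    return out


df = pd.DataFrame({'A': [1, 50, 10], 'B': [2, 6, 11],
                   'C': [3, 7, 12], 'D': [40, 8, 13], 'E': [5, 9, 14]})
result = top_n(df, ['A', 'B', 'C', 'D', 'E'], n=2)
print(df.join(result))
```

Returning a separate frame and joining it keeps the original columns untouched, which avoids the new columns being picked up if the helper is called a second time.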
Another approach is to get the first max, then remove it and take the max again to get the second max:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': [1, 15, 10], 'B': [2, 89, 11], 'C': [80, 7, 12], 'D': [4, 8, 13], 'E': [5, 9, 14]})
max1=df.max(axis=1)
maxcolum1=df.idxmax(axis=1)
max2 = df.replace(np.array(df.max(axis=1)),0).max(axis=1)
maxcolum2=df.replace(np.array(df.max(axis=1)),0).idxmax(axis=1)
df2 =pd.DataFrame({ 'max1': max1, 'max2': max2 ,'maxcol1':maxcolum1,'maxcol2':maxcolum2 })
df.join(df2)
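One caveat with the replace-based version: `replace` substitutes every cell that equals any row's maximum, in every row, and it relies on 0 being smaller than the remaining values. A possible variant that hides only the one winning cell per row is sketched below; it assumes all columns are numeric, and the names `summary`, `first` and `second` are just illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 15, 10], 'B': [2, 89, 11], 'C': [80, 7, 12],
                   'D': [4, 8, 13], 'E': [5, 9, 14]})

vals = df.to_numpy(dtype=float)
rows = np.arange(len(df))
colnames = df.columns.to_numpy()

# Position of the largest value in each row.
first = vals.argmax(axis=1)

# Hide exactly that one cell per row, then look for the largest value again.
masked = vals.copy()
masked[rows, first] = -np.inf
second = masked.argmax(axis=1)

summary = pd.DataFrame({
    'max1': vals[rows, first],
    'max2': vals[rows, second],
    'maxcol1': colnames[first],
    'maxcol2': colnames[second],
}, index=df.index)
print(df.join(summary))
```

Because only one position per row is masked, a row whose maximum occurs twice reports the same value for both max1 and max2, which may or may not be what you want.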
I would like to create a 'Crosstab' plot like the one below using matplotlib or seaborn:
Using the following dataframe:
import pandas as pd
data = [['A', 'C', 2], ['A', 'D', 8], ['B', 'C', 25], ['B', 'D', 30]]
df = pd.DataFrame(data = data, columns = ['col', 'row', 'val'])
col row val
0 A C 2
1 A D 8
2 B C 25
3 B D 30
One option in matplotlib is to add Rectangles at the origin via plt.gca and add_patch. The problem is that I did everything manually here, like this:
from matplotlib.patches import Rectangle
import matplotlib.pyplot as plt
fig = plt.figure()
ax = fig.add_axes([0, 0, 1, 1])
plt.xlim(-10, 40)
plt.ylim(-40, 40)
plt.rcParams['figure.figsize'] = (10,16)
someX, someY = 0, 0
currentAxis = plt.gca()
currentAxis.add_patch(Rectangle((someX, someY), 30, 30, facecolor="purple"))
ax.text(15, 15, '30')
currentAxis.add_patch(Rectangle((someX, someY), 25, -25, facecolor="blue"))
ax.text(12.5, -12.5, '25')
currentAxis.add_patch(Rectangle((someX, someY), -2, -2, facecolor="red"))
ax.text(-1, -1, '2')
currentAxis.add_patch(Rectangle((someX, someY), -8, 8, facecolor="green"))
ax.text(-4, 4, '8')
Output:
As you can see, the plot doesn't look that nice. So I was wondering if it is possible to somehow automatically create 'Crosstab' plots using matplotlib or seaborn?
I am not sure whether matplotlib or seaborn has a dedicated function for this type of plot, but using plt.bar and plt.bar_label instead of Rectangle and ax.text might help automate things a little (label placement etc.).
See code below:
import matplotlib.pyplot as plt
data = [['A', 'C', 2], ['A', 'D', 8], ['B', 'C', 25], ['B', 'D', 30]]
pos={'A':-1,'B':0,'C':-1,'D':1}
fig,ax=plt.subplots(figsize=(10,10))
p=[ax.bar(pos[d[0]]*d[2],pos[d[1]]*d[2],width=d[2],align='edge') for d in data]
[ax.bar_label(p[i],labels=[data[i][2]], label_type='center',fontsize=18) for i in range(len(data))]
ax.set_aspect('equal')
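The hard-coded `pos` dictionary could also be derived from the dataframe itself. The following is only a sketch under an assumed convention (first 'col' category on the left, second on the right; first 'row' category below the axis, second above) and it only handles the 2x2 case:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
import pandas as pd

data = [['A', 'C', 2], ['A', 'D', 8], ['B', 'C', 25], ['B', 'D', 30]]
df = pd.DataFrame(data, columns=['col', 'row', 'val'])

# Assumed convention: exactly two categories on each axis.
col_left, col_right = df['col'].unique()
row_down, row_up = df['row'].unique()

fig, ax = plt.subplots(figsize=(8, 8))
for col, row, val in df.itertuples(index=False):
    x = -val if col == col_left else 0         # left edge of the square
    height = -val if row == row_down else val  # negative height draws downwards
    bars = ax.bar(x, height, width=val, align='edge')
    ax.bar_label(bars, labels=[val], label_type='center', fontsize=14)

# Draw the two axes through the origin and keep the squares square.
ax.axhline(0, color='black', lw=1)
ax.axvline(0, color='black', lw=1)
ax.set_aspect('equal')
```

Each value becomes a square with one corner at the origin, so the quadrant encodes the category pair and the side length encodes the value.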
Thanks for taking the time to look at my question.
I have two DataFrames, each composed of several columns:
df=pd.DataFrame([['A',10, 22], ['A',12, 15], ['A',0, 2], ['A', 20, 25], ['A', 5, 5], ['A',12, 11], ['B', 0 ,0], ['B', 9 ,0], ['B', 8 ,50], ['B', 0 ,0], ['B', 18 ,5], ['B', 7 ,6],['C', 10 ,11], ['C', 9 ,10], ['C', 8 ,2], ['C', 6 ,2], ['C', 8 ,5], ['C', 6 ,8]],
columns=['Name', 'Value_01','Value_02'])
df_agreement=pd.DataFrame([['A', '<66%', '>80'],['B', '>80%', '>66% & <80%'], ['C', '<66%', '<66%']], columns=['Name', 'Agreement_01', 'Agreement_02'])
My goal is to create boxplots for this DataFrame, with ['Value_01', 'Value_02'] as the values and 'Name' as the x-values. To do so, I draw a seaborn boxplot with the following code:
fig = plt.figure()
# Change seaborn plot size
fig.set_size_inches(60, 40)
plt.xticks(rotation=70)
plt.yticks(fontsize=40)
df_02=pd.melt(df, id_vars=['Name'],value_vars=['Value_01', 'Value_02'])
bp=sns.boxplot(x='Name',y='value',hue="variable",showfliers=True, data=df_02,showmeans=True,meanprops={"marker": "+",
"markeredgecolor": "black",
"markersize": "20"})
bp.set_xlabel("Name", fontsize=45)
bp.set_ylabel('Value', fontsize=45)
bp.legend(handles=bp.legend_.legendHandles, labels=['V_01', 'V_02'])
Okay, this part works: I get 6 boxplots, two for each name.
The tricky part is that I want to use df_agreement to change the color of my boxplots, depending on whether the agreement is <66% or not. So I added this to my code:
list_color_1=[]
list_color_2=[]
for i in range(0, len(df_agreement)):
    name=df_agreement.loc[i,'Name']
    if df_agreement.loc[i,'Agreement_01']=="<66%":
        list_color_1.append(i*2)
    if df_agreement.loc[i,'Agreement_02']=="<66%":
        list_color_2.append(i*2+1)
for k in list_color_1:
    mybox = bp.artists[k]
    # Change the appearance of that box
    mybox.set_facecolor("#D1DBE6") #facecolor is the inside color of the boxplot
    mybox.set_edgecolor('black') #edgecolor is the line color of the box
    mybox.set_linewidth(2)
for k in list_color_2:
    mybox = bp.artists[k]
    # Change the appearance of that box
    mybox.set_facecolor("#EFDBD1") #facecolor is the inside color of the boxplot
    mybox.set_edgecolor('black') #edgecolor is the line color of the box
    mybox.set_linewidth(2)
It works well: my boxplots change according to the values in df_agreement.
Unfortunately, I would also like to change the legend to ["V_01", "V_02", "V_01 with less 66% agreement", "V_02 with less 66% agreement"], with the corresponding colors in the legend.
Would you have an idea how to do that?
Thank you very much! :)
You could add custom legend elements, extending the list of handles. Here is an example.
handles, labels = bp.get_legend_handles_labels()
new_handles = handles + [plt.Rectangle((0, 0), 0, 0, facecolor="#D1DBE6", edgecolor='black', linewidth=2),
plt.Rectangle((0, 0), 0, 0, facecolor="#EFDBD1", edgecolor='black', linewidth=2)]
bp.legend(handles=new_handles,
labels=['V_01', 'V_02', "V_01 with less\n than 66% agreement", "V_02 with less\n than 66% agreement"])
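The same idea can be tried without seaborn. The sketch below swaps in a plain matplotlib boxplot as a stand-in for the seaborn figure (an assumption made only to keep the example self-contained) and uses `matplotlib.patches.Patch` objects as legend-only proxy handles:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
from matplotlib.patches import Patch

fig, ax = plt.subplots()

# Plain-matplotlib stand-in for the seaborn plot: two filled boxes.
box = ax.boxplot([[10, 12, 0, 20, 5, 12], [22, 15, 2, 25, 5, 11]],
                 patch_artist=True)
box['boxes'][0].set_facecolor('#D1DBE6')
box['boxes'][1].set_facecolor('#EFDBD1')

# Proxy artists: never added to the axes, only used as legend keys.
legend_handles = [
    Patch(facecolor='#D1DBE6', edgecolor='black', linewidth=2,
          label='V_01 with less\n than 66% agreement'),
    Patch(facecolor='#EFDBD1', edgecolor='black', linewidth=2,
          label='V_02 with less\n than 66% agreement'),
]
legend = ax.legend(handles=legend_handles)
```

Giving each proxy its own `label` means the handles and labels cannot drift out of sync, which can happen when two separate lists are passed to `legend`.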
I have a dataframe from a Stata file and I would like to add a new column to it which has a numeric list as the entry for each row. How can one accomplish this? I have been trying assignment, but it's complaining about index size.
I tried initializing a new column of strings (also tried integers) and then tried something like this, but it didn't work.
testdf['new_col'] = '0'
testdf['new_col'] = testdf['new_col'].map(lambda x : list(range(100)))
Here is a toy example resembling what I have:
data = {'col_1': [3, 2, 1, 0], 'col_2': ['a', 'b', 'c', 'd'], 'start_val': [1,7,9,10], 'end_val' : [3,11, 12,15]}
testdf = pd.DataFrame.from_dict(data)
This is what I would like to have:
data2 = {'col_1': [3, 2, 1, 0], 'col_2': ['a', 'b', 'c', 'd'], 'start_val': [1,7,9,10], 'end_val' : [3,11, 12,15], 'list' : [[1,2,3],[7,8,9,10,11],[9,10,11,12],[10,11,12,13,14,15]]}
testdf2 = pd.DataFrame.from_dict(data2)
My final goal is to use explode on that "list" column to duplicate the rows appropriately.
Try this bit of code:
testdf['list'] = pd.Series(np.arange(i, j) for i, j in zip(testdf['start_val'],
testdf['end_val']+1))
testdf
Output:
col_1 col_2 start_val end_val list
0 3 a 1 3 [1, 2, 3]
1 2 b 7 11 [7, 8, 9, 10, 11]
2 1 c 9 12 [9, 10, 11, 12]
3 0 d 10 15 [10, 11, 12, 13, 14, 15]
This uses a generator expression with zip inside the pd.Series constructor, and np.arange to create the per-row ranges (NumPy arrays, which display like lists).
If you'd rather stick with the apply function:
import pandas as pd
import numpy as np
data = {'col_1': [3, 2, 1, 0], 'col_2': ['a', 'b', 'c', 'd'], 'start_val': [1,7,9,10], 'end_val' : [3,11, 12,15]}
df = pd.DataFrame.from_dict(data)
df['range'] = df.apply(lambda row: np.arange(row['start_val'], row['end_val']+1), axis=1)
print(df)
Output:
col_1 col_2 start_val end_val range
0 3 a 1 3 [1, 2, 3]
1 2 b 7 11 [7, 8, 9, 10, 11]
2 1 c 9 12 [9, 10, 11, 12]
3 0 d 10 15 [10, 11, 12, 13, 14, 15]
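Since the stated goal is to call explode on the new column, the follow-up step might look like the sketch below. It builds plain Python lists with `range` rather than NumPy arrays (a small deviation from the answers above), and the name `exploded` is only illustrative:

```python
import pandas as pd

data = {'col_1': [3, 2, 1, 0], 'col_2': ['a', 'b', 'c', 'd'],
        'start_val': [1, 7, 9, 10], 'end_val': [3, 11, 12, 15]}
testdf = pd.DataFrame.from_dict(data)

# One inclusive range per row, stored as a Python list.
testdf['list'] = [list(range(start, end + 1))
                  for start, end in zip(testdf['start_val'], testdf['end_val'])]

# One output row per list element; the other columns are repeated.
exploded = testdf.explode('list', ignore_index=True)
print(exploded.head())
```

Assigning a list of lists directly works here because its length matches the number of rows; `ignore_index=True` gives the exploded frame a fresh 0..n-1 index instead of repeating the original labels.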
I have an SQLite database set up with some data. I have imported it through SQL statements via pandas:
df1 = pd.read_sql_query("Select avg(Duration),keyword,filename from keywords group by keyword,filename order by filename", con)
The data looks as follows:
Based on this I want to construct a stacked bar graph that looks like this:
I've tried various solutions, including matplotlib and pandas.plot, but I'm unable to construct this graph.
Thanks in advance.
This snippet should work:
import pandas as pd
import matplotlib.pyplot as plt
data = [[2, 'A', 'output.xml'], [5, 'B', 'output.xml'],
[3, 'A', 'output.xml'], [2, 'B', 'output.xml'],
[5, 'C', 'output2.xml'], [1, 'B', 'output2.xml'],
[6, 'C', 'output.xml'], [3, 'C', 'output2.xml'],
[3, 'A', 'output2.xml'], [3, 'B', 'output.xml'],
[2, 'C', 'output.xml'], [1, 'C', 'output2.xml']
]
df = pd.DataFrame(data, columns = ['duration', 'Keyword', 'Filename'])
df2 = df.groupby(['Filename', 'Keyword'])['duration'].sum().unstack('Keyword').fillna(0)
df2[['A','B', 'C']].plot(kind='bar', stacked=True)
It is similar to this question, with the difference that I sum the values of the field concerned instead of counting.
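To connect this with the SQL query from the question, here is a sketch that runs end to end against an in-memory SQLite database. The table contents are made-up sample rows, and the `AS avg_duration` alias is an addition (not in the original query) so that the resulting column has a predictable name:

```python
import sqlite3

import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import pandas as pd

# Made-up sample data standing in for the real database.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE keywords (Duration REAL, keyword TEXT, filename TEXT)")
con.executemany(
    "INSERT INTO keywords VALUES (?, ?, ?)",
    [(2, 'A', 'output.xml'), (4, 'A', 'output.xml'), (5, 'B', 'output.xml'),
     (3, 'A', 'output2.xml'), (1, 'B', 'output2.xml'), (5, 'C', 'output2.xml')],
)

df1 = pd.read_sql_query(
    "SELECT avg(Duration) AS avg_duration, keyword, filename "
    "FROM keywords GROUP BY keyword, filename ORDER BY filename",
    con,
)

# One row per file, one column per keyword; missing combinations become 0.
wide = df1.pivot(index='filename', columns='keyword',
                 values='avg_duration').fillna(0)
ax = wide.plot(kind='bar', stacked=True)
ax.set_ylabel('avg(Duration)')
```

Because the query already returns one row per (keyword, filename) pair, a plain `pivot` is enough and no further aggregation is needed.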
1. You just have to use (with the column names returned by your query):
ax=df.pivot_table(index='filename',columns='keyword',values='avg(Duration)').plot(kind='bar',stacked=True,figsize=(15,15),fontsize=25)
ax.legend(fontsize=25)
2. Example
import pandas as pd
df=pd.DataFrame()
df['avg(duration)']=[7,4,5,9,3,2]
df['keyword']=['a','b','c','a','b','c']
df['filename']=['out1','out1','out1','out2','out2','out2']
df
2.1 Output df example:
avg(duration) keyword filename
0 7 a out1
1 4 b out1
2 5 c out1
3 9 a out2
4 3 b out2
5 2 c out2
2.2 Drawing
ax=df.pivot_table(index='filename',columns='keyword',values='avg(duration)').plot(kind='bar',stacked=True,figsize=(15,15),fontsize=25)
ax.legend(fontsize=25)
2.3 Output image example:
3. In addition, you can customize the plot using:
#set ylim
plt.ylim(-1, 20)
plt.xlim(-1,4)
#grid on
plt.grid()
# set y=0
ax.axhline(0, color='black', lw=1)
#change size of legend
ax.legend(fontsize=25,loc=(0.9,0.4))
#hiding upper and right axis layout
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
#changing the thickness
ax.spines['bottom'].set_linewidth(3)
ax.spines['left'].set_linewidth(3)
#set labels
ax.set_xlabel('filename',fontsize=20,color='r')
ax.set_ylabel('avg(duration)',fontsize=20,color='r')
#rotation
plt.xticks(rotation=0)