customizing the legend in a plot derived from a pandas dataframe


I'm working on a Python implementation of an agent-based model using the 'mesa' framework (available on GitHub). In the model, each "agent" on a grid plays a Prisoner's Dilemma game against its neighbors. Each agent has a strategy that determines its move vs. other moves. Strategies with higher payoffs replace strategies with lower payoffs. In addition, strategies evolve through mutations, so new and longer strategies emerge as the model runs. The app produces a pandas dataframe that gets updated after each step. For example, after 106 steps, the df might look like this:
     step strategy  count  score
0       0       CC     34   2.08
1       0       DD   1143   2.18
2       0       CD   1261   2.24
3       0       DC     62   2.07
4       1       CC      6   1.88
..    ...      ...    ...    ...
485   106     DDCC     56   0.99
486   106       DD    765   1.00
487   106       DC   1665   1.31
488   106     DCDC     23   1.60
489   106     DDDD     47   0.98
Pandas/matplotlib creates a pretty good plot of this data, calling this simple plot function:
import matplotlib.pyplot as plt

def plot_counts(df):
    df1 = df.set_index('step')
    df1.groupby('strategy')['count'].plot()
    plt.ylabel('count')
    plt.xlabel('step')
    plt.title('Count of all strategies by step')
    plt.legend(loc='best')
    plt.show()
I get this plot:
Not bad, but here's what I can't figure out. The automatic legend quickly gets way too long and the low-frequency strategies are of little interest, so I want the legend to (1) include only the top 4 strategies listed in the above legend and (2) list those strategies in the order they appear in the last step of the model, based on their counts. Looking at the strategies in step 106 in the df, for example, I want the legend to show the top 4 strategies in the order DC, DD, DDCC, and DDDD, but not include DCDC (or any other lower-count strategies that might be active).
I have searched through tons of pandas and matplotlib plotting examples but haven't been able to find a solution to this specific problem. It's clear that these plots are extremely customizable, so I suspect there is a way to do this. Any help would be greatly appreciated.

This post is somewhat similar to what you have asked. I guess you should check the answer on this page: Show only certain items in legend Python Matplotlib. Hope this helps!

Here is an approach. I don't have the complete dataframe, so the test is only with the ones displayed in the question.
The pandas part of the question can be solved by assigning the last step to a variable, then querying for the strategies of that step and then getting the highest counts.
To find the handles, we ask matplotlib for all the handles and labels it generated. Then we search each of the strategies in the list of labels, taking its index to get the corresponding handle.
Please note that 'count' is an annoying name for a column. It also is the name of a pandas function, which prevents its use in the dot notation.
import pandas as pd
from matplotlib import pyplot as plt
df = pd.DataFrame(columns=['step', 'strategy', 'count', 'score'],
                  data=[[0, 'CC', 34, 2.08],
                        [0, 'DD', 1143, 2.18],
                        [0, 'CD', 1261, 2.24],
                        [0, 'DC', 62, 2.07],
                        [1, 'CC', 6, 1.88],
                        [106, 'DDCC', 56, 0.99],
                        [106, 'DD', 765, 1.00],
                        [106, 'DC', 1665, 1.31],
                        [106, 'DCDC', 23, 1.60],
                        [106, 'DDDD', 47, 0.98]])
last_step = df.step.max()
strategies_last_step = df.strategy[df['count'][df.step == last_step].nlargest(4).index]
df1 = df.set_index('step')
df1.groupby('strategy')['count'].plot()
plt.ylabel('count')
plt.xlabel('step')
plt.title('Count of all strategies by step')
handles, labels = plt.gca().get_legend_handles_labels()
selected_handles = [handles[labels.index(strategy)] for strategy in strategies_last_step]
legend = plt.legend(handles=selected_handles, loc='best')
plt.show()
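To illustrate the remark about `count` as a column name, here is a minimal sketch on a two-row frame (the frame itself is made up for the demonstration). It shows that attribute access resolves to the `DataFrame.count` method, while bracket access or a renamed column reaches the data.

```python
import pandas as pd

# Tiny illustrative frame with a column named 'count'.
df = pd.DataFrame({"strategy": ["CC", "DD"], "count": [34, 1143]})

# Attribute access finds the DataFrame.count method, not the column.
is_method = callable(df.count)
print(is_method)

# Bracket access always reaches the column.
counts = df["count"].tolist()
print(counts)

# After renaming the column, dot notation works as expected.
renamed = df.rename(columns={"count": "ncount"})
print(renamed.ncount.tolist())
```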

Thank you, JohanC, you really helped me see what was going on under the hood with this problem. (Also, good point about count as a col name. I changed it to ncount.)
I found your statement:
strategies_last_step = df.strategy[df['count'][df.step == last_step].nlargest(4).index]
wasn't working for me (nlargest got confused about dtypes) so I formulated a slightly different approach. I got a list of correctly ordered strategy names this way:
def plot_counts(df):
    # to customize plot legend, first get the last step in the df
    last_step = df.step.max()
    # next, make new df_last_step, reverse sorted by 'ncount' & limited to 4 items
    df_last_step = df[df['step'] == last_step].sort_values(by='ncount', ascending=False)[0:4]
    # put selected and reordered strategies in a list
    top_strategies = list(df_last_step.strategy)
Then, after indexing and grouping my original df and adding my other plot parameters ...
    dfi = df.set_index('step')
    dfi.groupby('strategy')['ncount'].plot()
    plt.ylabel('ncount')
    plt.xlabel('step')
    plt.title('Count of all strategies by step')
I was able to pick out the right handles from the default handles list and reorder them this way:
    handles, labels = plt.gca().get_legend_handles_labels()
    # get handles for top_strategies, in order, and replace default handles
    selected_handles = []
    for i in range(len(top_strategies)):
        # get the index of the labels object that matches this strategy
        ix = labels.index(top_strategies[i])
        # get matching handle w the same index, append it to a new handles list in right order
        selected_handles.append(handles[ix])
Then plot with the new selected_handles:
    plt.legend(handles=selected_handles, loc='best')
    plt.show()
Result is exactly as intended. Here is a plot after 300+ steps. Legend is in the right order and limited to top 4 strategies:
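The pieces above can be pulled together into one self-contained sketch. The helper names `top_strategies` and `plot_counts_top` are hypothetical, the data are just the rows shown in the question (with the column renamed to `ncount`), and the sketch assumes the count column is numeric. It plots one line per strategy and then passes only the selected handles to the legend.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Rows shown in the question, with 'count' renamed to 'ncount'.
df = pd.DataFrame(
    columns=["step", "strategy", "ncount", "score"],
    data=[[0, "CC", 34, 2.08], [0, "DD", 1143, 2.18], [0, "CD", 1261, 2.24],
          [0, "DC", 62, 2.07], [1, "CC", 6, 1.88], [106, "DDCC", 56, 0.99],
          [106, "DD", 765, 1.00], [106, "DC", 1665, 1.31],
          [106, "DCDC", 23, 1.60], [106, "DDDD", 47, 0.98]],
)


def top_strategies(df, n=4, count_col="ncount"):
    """Return the n highest-count strategies at the last step, largest first."""
    last = df[df["step"] == df["step"].max()]
    return last.sort_values(by=count_col, ascending=False)["strategy"].head(n).tolist()


def plot_counts_top(df, n=4, count_col="ncount"):
    """Plot every strategy, but list only the top n in the legend."""
    fig, ax = plt.subplots()
    for strategy, grp in df.groupby("strategy"):
        ax.plot(grp["step"], grp[count_col], label=strategy)
    handles, labels = ax.get_legend_handles_labels()
    by_label = dict(zip(labels, handles))  # look up each handle by its label
    ax.legend(handles=[by_label[s] for s in top_strategies(df, n, count_col)], loc="best")
    ax.set_xlabel("step")
    ax.set_ylabel(count_col)
    ax.set_title("Count of all strategies by step")
    return ax


ax = plot_counts_top(df)
legend_labels = [t.get_text() for t in ax.get_legend().get_texts()]
print(legend_labels)
# Call plt.show() to display the figure interactively.
```

Building a label-to-handle dictionary avoids the repeated `labels.index(...)` lookups, which is a small convenience rather than a change in approach.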

Related

Show how when values rise in one column, so does the values in another one

I'm working with a covid dataset for some Python exercises I am working through to try to learn. I've loaded it the normal way:
import pandas as pd
import numpy as np
df = pd.read_csv("C:/Users/Desktop/Python Short Course/diagnosis.csv")
In this dataset there are 2 columns called BodyTemp and SpO2. What I am trying to do is show how the values in the two columns are related: when the values rise in the BodyTemp column, so do the values in the SpO2 column, that sort of idea. I had thought of maybe doing a bar chart like:
import matplotlib.pyplot as plt
plt.xlabel("BodyTemp")
plt.ylabel("SpO2")
plt.bar(x = df["BodyTemp"], height = df["SpO2"])
plt.show()
but all the bars are very close together and it just doesn't look great, so what would be a better way to do this? Or would there be a better approach to show the visualisation of the distribution of values?
Edit: to show screenshot of graph
Edit to show data:
BodyTemp  SpO2
37.6      85
38.9      93
38.5      92
37        80
I've added a table showing the first few rows. There are a whole lot more, but it gives an idea of the data.

You need to change the scale of the y-axis. Try this:
plt.ylim((df['SpO2'].min()-.5, df['SpO2'].max()+.5))
If this didn't work, it's probably because there are very small values in the column SpO2. These gaps between the bars may be small values that are distorting the data. Try to remove them from the dataframe.
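As an alternative to the bar chart (not what the answer above proposes), a scatter plot plus a correlation coefficient is a common way to show that two numeric columns rise together. The sketch below uses only the four rows shown in the question, so the numbers are illustrative rather than representative of the full file.

```python
import pandas as pd
import matplotlib.pyplot as plt

# The four sample rows from the question.
df = pd.DataFrame({
    "BodyTemp": [37.6, 38.9, 38.5, 37.0],
    "SpO2": [85, 93, 92, 80],
})

# Pearson correlation between the two columns (Series.corr defaults to Pearson).
r = df["BodyTemp"].corr(df["SpO2"])

# One point per row: x is BodyTemp, y is SpO2.
ax = df.plot.scatter(x="BodyTemp", y="SpO2")
ax.set_title(f"BodyTemp vs SpO2 (Pearson r = {r:.2f})")
# Call plt.show() to display the figure interactively.
```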

Python data visualization: too small value to be visible - how to solve?

Here is the dataset I have:
Source      All Leads  Not Junks  Warms  Hots  Deals  Weighted Sum
web            281316     269490  10252  2508   1602        4376.5
telesales       30458      29732    431   138     85         316.2
networking       4249       4195    763   547    476         539.1
promos           1356       1308     30     1      0          10.8
I visualized it:
df.plot.bar()
And got this output:
Some columns have values so small that their bars are not visible. How can I tackle this problem?
Setting a bigger figure size doesn't help: it makes the chart bigger, but the ratio between the columns stays the same, so nothing changes.
Any ideas how to make it look more sophisticated? Or maybe I should try a different type of chart? Thank you!

Could try df.plot.bar(logy=True), but it's going to make useful interpretation of it messy. A Sankey diagram would probably be a better fit for showing how the data breaks down in each category.

Seaborn comes out a little nicer, but takes some transformation to produce the same type of output:
import matplotlib.pyplot as plt
import seaborn as sns
df2 = df.melt('Source').rename(columns={'variable': 'Category', 'value': 'Values'})
sns.barplot(x='Source', y='Values', data=df2, hue='Category')
plt.show()
Output:
Or with log=True
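For reference, here is a self-contained sketch of the two ideas above using the table from the question: a pandas bar chart on a log y-axis, and the long-format reshape that seaborn expects. Setting `Source` as the index before plotting is an assumption about how the asker's frame is laid out.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Data copied from the table in the question.
df = pd.DataFrame({
    "Source": ["web", "telesales", "networking", "promos"],
    "All Leads": [281316, 30458, 4249, 1356],
    "Not Junks": [269490, 29732, 4195, 1308],
    "Warms": [10252, 431, 763, 30],
    "Hots": [2508, 138, 547, 1],
    "Deals": [1602, 85, 476, 0],
    "Weighted Sum": [4376.5, 316.2, 539.1, 10.8],
})

# Log-scaled y-axis so small and large values are visible on the same chart.
# A zero (promos / Deals) has no bar on a log axis.
ax = df.set_index("Source").plot.bar(logy=True)

# Long format: one row per (Source, Category) pair, as used by sns.barplot with hue.
df2 = df.melt("Source", var_name="Category", value_name="Values")
print(df2.head())
# Call plt.show() to display the figure interactively.
```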

Plotting different lines for different states on the same chart

I am trying to create a distribution for the number of ___ across a few states.
I want to get all of the states on the same graph, represented by different lines.
Here is an example of what my data looks like: you have the state (which I want to filter lines by), the number of reviews (x axis), and the frequency of restaurants that have that many reviews (y axis).
State | num_of_reviews | Count_id
alaska 1 400
alaska 2 388
alaska 3 344
...
Wyoming 57 13
Whenever I try doing a simple line plot in seaborn or matplotlib, it just returns a messy graph.
Does anyone know a string of code where I easily can filter df['State']?

Assuming that you have 50+ states, I wouldn't plot the distribution for each on the same plot as it would get really messy and hard to read. Instead, I would suggest to use a FacetGrid (read more about it here).
Something like this should do.
import seaborn as sns
import matplotlib.pyplot as plt
g = sns.FacetGrid(df, col="State", col_wrap=5, height=1.5)
g = g.map(plt.hist, "num_of_reviews")
You can find other possible solutions and ideas on how to visualize your data here.
If none of these work for you then it might be helpful if you explain a bit better your problem and provide a desired output and a minimal, complete, and verifiable example.
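If only a handful of states are needed on one chart, one possible sketch is to filter with `isin` and pivot so that each state becomes its own column, which pandas then draws as one line per column. The alaska rows below come from the question; the wyoming rows and the `states` list are made up purely for illustration.

```python
import pandas as pd
import matplotlib.pyplot as plt

# alaska rows are from the question; wyoming rows are invented for the example.
df = pd.DataFrame({
    "State": ["alaska"] * 3 + ["wyoming"] * 3,
    "num_of_reviews": [1, 2, 3, 1, 2, 3],
    "Count_id": [400, 388, 344, 120, 95, 80],
})

# Keep only the states of interest (hypothetical selection).
states = ["alaska", "wyoming"]
subset = df[df["State"].isin(states)]

# One column per state, indexed by number of reviews.
wide = subset.pivot(index="num_of_reviews", columns="State", values="Count_id")

# DataFrame.plot draws one line per column and labels it with the state name.
ax = wide.plot()
ax.set_xlabel("num_of_reviews")
ax.set_ylabel("Count_id")
# Call plt.show() to display the figure interactively.
```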

Plotting boxplots for a groupby object

I would like to plot boxplots for several datasets based on a criterion.
Imagine a dataframe similar to the example below:
df = pd.DataFrame({'Group':[1,1,1,2,3,2,2,3,1,3],'M':np.random.rand(10),'F':np.random.rand(10)})
df = df[['Group','M','F']]
   Group         M         F
0      1  0.465636  0.537723
1      1  0.560537  0.727238
2      1  0.268154  0.648927
3      2  0.722644  0.115550
4      3  0.586346  0.042896
5      2  0.562881  0.369686
6      2  0.395236  0.672477
7      3  0.577949  0.358801
8      1  0.764069  0.642724
9      3  0.731076  0.302369
In this case, I have three groups, so I would like to make a boxplot for each group and for M and F separately having the groups on Y axis and the columns of M and F colour-coded.
This answer is very close to what I want to achieve, but I would prefer something more robust, applicable for larger dataframes with greater number of groups. I feel that groupby is the way to go, but I am not familiar with groupby objects and I am failing to even slice them.
The desirable output would look something like this:
Looks like years ago, someone had the same problem, but got no answers :( Having a boxplot as a graphical representation of the describe function of groupby
My questions are:
How to implement groupby to feed the desired data into the boxplot
What is the correct syntax for the box plot if I want to control what is displayed and not just use the default settings (I don't even know what they are, and I find the documentation rather vague)? To be specific, can I have the box cover the mean +/- standard deviation and keep the vertical line at the median value?

I think you should use the Seaborn library, which makes this kind of customized plot easy. In your case, I first melted your dataframe to convert it into the proper long format and then created the boxplot of your choice.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
dd=pd.melt(df,id_vars=['Group'],value_vars=['M','F'],var_name='sex')
sns.boxplot(y='Group',x='value',data=dd,orient="h",hue='sex')
The plot looks similar to your required plot.

Finally, I found a solution by slightly modifying this answer. It does not use groupby object, so it is more tedious to prepare the data, but so far it looks like the best solution to me. Here it is:
import numpy as np
import matplotlib.pyplot as plt

# here I prepare the data (group them manually and then store in lists)
Groups = [1, 2, 3]
Columns = df.columns.tolist()[1:]
print(Columns)
Mgroups = []
Fgroups = []
for g in Groups:
    dfgc = df[df['Group'] == g]
    m = dfgc['M'].dropna()
    f = dfgc['F'].dropna()
    Mgroups.append(m.tolist())
    Fgroups.append(f.tolist())
fig = plt.figure()
ax = plt.axes()
def setBoxColors(bp, cl):
    plt.setp(bp['boxes'], color=cl, linewidth=2.)
    plt.setp(bp['whiskers'], color=cl, linewidth=2.5)
    plt.setp(bp['caps'], color=cl, linewidth=2)
    plt.setp(bp['medians'], color=cl, linewidth=3.5)
# whis=(0, 100) extends the whiskers to the min and max (older matplotlib spelled this whis='range')
bpl = plt.boxplot(Mgroups, positions=np.arange(len(Mgroups))*3.0-0.4, vert=False, whis=(0, 100), sym='', widths=0.6)
bpr = plt.boxplot(Fgroups, positions=np.arange(len(Fgroups))*3.0+0.4, vert=False, whis=(0, 100), sym='', widths=0.6)
setBoxColors(bpr, '#D7191C') # colors are from http://colorbrewer2.org/
setBoxColors(bpl, '#2C7BB6')
# draw temporary red and blue lines and use them to create a legend
plt.plot([], c='#D7191C', label='F')
plt.plot([], c='#2C7BB6', label='M')
plt.legend()
plt.yticks(range(0, len(Groups) * 3, 3), Groups)
plt.ylim(-3, len(Groups)*3)
#plt.xlim(0, 8)
plt.show()
The result looks mostly like what I wanted (as far as I have been able to find, the box always ranges from first to third quartile, so it is not possible to set it to +/- standard deviation). So I am a bit disappointed there is no one-line solution, but I am glad it is possible. However, for hundreds of groups this would not be good enough...
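For larger numbers of groups, the manual data preparation can probably be replaced by iterating over a groupby object. This is a sketch under the assumption that the frame has the same `Group`, `M` and `F` columns as in the question; the random values are seeded toy data, and `whis=(0, 100)` (whiskers at min and max) is used as in current matplotlib.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Toy data shaped like the frame in the question (values are random).
rng = np.random.default_rng(0)
df = pd.DataFrame({"Group": [1, 1, 1, 2, 3, 2, 2, 3, 1, 3],
                   "M": rng.random(10), "F": rng.random(10)})

# Iterating a groupby yields (key, sub-frame) pairs in sorted key order.
grouped = df.groupby("Group")
groups = [key for key, _ in grouped]
m_groups = [sub["M"].dropna().tolist() for _, sub in grouped]
f_groups = [sub["F"].dropna().tolist() for _, sub in grouped]

# One slot of width 3 per group; M sits just below the centre, F just above.
centers = np.arange(len(groups)) * 3.0
fig, ax = plt.subplots()
bpl = ax.boxplot(m_groups, positions=centers - 0.4, vert=False,
                 whis=(0, 100), sym="", widths=0.6)
bpr = ax.boxplot(f_groups, positions=centers + 0.4, vert=False,
                 whis=(0, 100), sym="", widths=0.6)

# Label each slot with its group key.
ax.set_yticks(centers)
ax.set_yticklabels(groups)
ax.set_ylim(-3, len(groups) * 3)
# Call plt.show() to display the figure interactively.
```

Because the group keys and the per-group lists are all derived from the same groupby, nothing has to be edited by hand when the number of groups changes.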

Plot several densities on one plot

I have a data frame with a MultiIndex (expenditure, groupid):
coef stderr N
expenditure groupid
TOTEXPCQ 176 3745.124 858.1998 81
358 -1926.703 1036.636 75
109 239.3678 639.373 280
769 6406.512 1823.979 96
775 2364.655 1392.187 220
I can get the density using df['coef'].plot(kind='density'). I would like to group these densities by the outer level of the MultiIndex (expenditure), and draw the different densities for different levels of expenditure into the same plot.
How would I achieve this? Bonus: label the different expenditure graphs with the 'expenditure' value
Answer
My initial approach was to merge the different kdes by generating one ax object and passing that along, but the accepted answer inspired me to rather generate one df with the group identifiers as columns:
n = 25
df = pd.DataFrame({'expenditure' : np.random.choice(['foo','bar'], n),
'groupid' : np.random.choice(['one','two'], n),
'coef' : np.random.randn(n)})
df2 = df[['expenditure', 'coef']].pivot_table(index=df.index, columns='expenditure', values='coef')
df2.plot(kind='kde')

Wow, that ended up being much harder than I expected. Seemed easy in concept, but (yet again) concept and practice really differed.
Set up some toy data:
n = 25
df = pd.DataFrame({'expenditure' : np.random.choice(['foo','bar'], n),
'groupid' : np.random.choice(['one','two'], n),
'coef' : np.random.randn(n)})
Then group by expenditure, iterate through each expenditure, pivot the data, and plot the kde:
gExp = df.groupby('expenditure')
for exp in gExp:
    print(exp[0])
    gGroupid = exp[1].groupby('groupid')
    g = exp[1][['groupid','coef']].reset_index(drop=True)
    gpt = g.pivot_table(index = g.index, columns='groupid', values='coef')
    gpt.plot(kind='kde').set_title(exp[0])
plt.show()
Results in:
It took some trial and error to figure out the data had to be pivoted before plotting.
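To see what the pivot step produces, here is a small sketch with fixed group labels and seeded random coefficients (both made up for the illustration). It uses `DataFrame.pivot` with the existing row index rather than `pivot_table`. Each row ends up with a value in exactly one column and NaN elsewhere, which is the wide layout that `plot(kind='kde')` turns into one density curve per column.

```python
import numpy as np
import pandas as pd

# Toy data: alternating expenditure labels with random coefficients.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "expenditure": ["foo", "bar", "foo", "bar", "foo", "bar"],
    "coef": rng.standard_normal(6),
})

# Spread 'coef' into one column per expenditure level, keeping the row index.
wide = df.pivot(columns="expenditure", values="coef")
print(wide)

# Each row holds exactly one non-missing value.
values_per_row = wide.notna().sum(axis=1)

# wide.plot(kind="kde") would then draw one curve per column (this needs SciPy).
```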
