I have two DataFrames that I am plotting as a stripplot. I am able to plot them pretty much as I wish, but I would like to know if it is possible to add the category labels for the "hue".
The plot currently looks like this:
However, I would like to add the labels of the categories (there are only two of them) to each "column" for each letter. So that it looks something like this:
The DataFrames look like this (although these are just edited snippets):
Case Letter Size Weight
0 upper A 20 bold
1 upper A 23 bold
2 lower A 61 bold
3 lower A 62 bold
4 upper A 78 bold
5 upper A 95 bold
6 upper B 23 bold
7 upper B 40 bold
8 lower B 47 bold
9 upper B 59 bold
10 upper B 61 bold
11 upper B 99 bold
12 lower C 23 bold
13 upper D 23 bold
14 upper D 66 bold
15 lower D 99 bold
16 upper E 5 bold
17 upper E 20 bold
18 upper E 21 bold
19 upper E 22 bold
...and...
Case Letter Size Weight
0 upper A 4 normal
1 upper A 6 normal
2 upper A 7 normal
3 upper A 8 normal
4 upper A 9 normal
5 upper A 12 normal
6 upper A 25 normal
7 upper A 26 normal
8 upper A 38 normal
9 upper A 42 normal
10 lower A 43 normal
11 lower A 57 normal
12 lower A 90 normal
13 upper B 4 normal
14 lower B 6 normal
15 upper B 8 normal
16 upper B 9 normal
17 upper B 12 normal
18 upper B 21 normal
19 lower B 25 normal
The relevant code I have is:
fig, ax = plt.subplots(figsize=(10, 7.5))
plt.tight_layout()
sns.stripplot(x=new_df_normal['Letter'], y=new_df_normal['Size'],
hue=new_df_normal['Case'], jitter=False, dodge=True,
size=8, ax=ax, marker='D',
palette={'upper': 'red', 'lower': 'red'})
plt.setp(ax.get_legend().get_texts(), fontsize='16') # for legend text
plt.setp(ax.get_legend().get_title(), fontsize='18') # for legend title
ax.set_xlabel("Letter", fontsize=20)
ax.set_ylabel("Size", fontsize=20)
ax.set_ylim(0, 105)
ax.tick_params(labelsize=20)
ax2 = ax.twinx()
sns.stripplot(x=new_df_bold['Letter'], y=new_df_bold['Size'],
hue=new_df_bold['Case'], jitter=False, dodge=True,
size=8, ax=ax2, marker='D',
palette={'upper': 'green', 'lower': 'green'})
ax.legend_.remove()
ax2.legend_.remove()
ax2.set_xlabel("", fontsize=20)
ax2.set_ylabel("", fontsize=20)
ax2.set_ylim(0, 105)
ax2.tick_params(labelsize=20)
Is it possible to add those category labels ("bold" and "normal") for each column?
Using seaborn's scatterplot, you could use the style (or even size) parameter, but you might not end up with your intended layout. See the scatterplot documentation.
Or you could use catplot and play with rows and columns (see the seaborn documentation for catplot).
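A minimal sketch of the catplot idea, assuming the two DataFrames can be stacked into one so that Weight becomes a regular column to facet on. The small frames below are stand-ins for the real data, and the layout (one panel per weight) differs from the single-axes figure in the question:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import pandas as pd
import seaborn as sns

# Small stand-ins for the two DataFrames shown in the question
new_df_bold = pd.DataFrame({
    "Case": ["upper", "lower", "upper", "lower"],
    "Letter": ["A", "A", "B", "B"],
    "Size": [20, 61, 23, 47],
    "Weight": "bold",
})
new_df_normal = pd.DataFrame({
    "Case": ["upper", "lower", "upper", "lower"],
    "Letter": ["A", "A", "B", "B"],
    "Size": [4, 43, 4, 6],
    "Weight": "normal",
})

# Stack both frames so "Weight" becomes an ordinary column to facet on
combined = pd.concat([new_df_bold, new_df_normal], ignore_index=True)

# One panel per weight; hue still separates upper and lower case
g = sns.catplot(data=combined, x="Letter", y="Size", hue="Case",
                col="Weight", kind="strip", dodge=True, jitter=False)
g.set(ylim=(0, 105))
```

Each panel title then carries the weight ("Weight = bold", "Weight = normal"), which labels the category without any manual annotation.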
Unfortunately, seaborn does not natively provide what you are looking for: another level of nesting beyond the hue parameter in stripplot (see the stripplot documentation). Some open seaborn tickets might be related (e.g. this ticket), but I have come across similar feature requests in seaborn that were refused (see this ticket).
One last possibility is to dive into the matplotlib primitives to manipulate your seaborn diagram (since seaborn is built on top of matplotlib). Needless to say, this would require a lot of effort and might end up defeating the purpose of using seaborn in the first place ;)
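As a rough illustration of the matplotlib route, here is a sketch that skips seaborn entirely: each weight gets its own x offset, and minor ticks carry the "bold"/"normal" labels under each column. The stand-in data, the offset value, and the padding are all assumptions chosen for illustration, not values from the question:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Tiny stand-ins for the two DataFrames in the question
df_bold = pd.DataFrame({"Letter": ["A", "A", "B", "B"], "Size": [20, 61, 23, 47]})
df_normal = pd.DataFrame({"Letter": ["A", "A", "B", "B"], "Size": [4, 43, 6, 25]})

letters = sorted(set(df_bold["Letter"]) | set(df_normal["Letter"]))
base = {letter: i for i, letter in enumerate(letters)}
offset = 0.2  # hypothetical half-distance between the two columns

fig, ax = plt.subplots(figsize=(10, 7.5))
minor_pos, minor_lab = [], []
for name, frame, shift, color in [("normal", df_normal, -offset, "red"),
                                  ("bold", df_bold, offset, "green")]:
    # Shift every point of this weight left or right of its letter position
    xs = frame["Letter"].map(base) + shift
    ax.scatter(xs, frame["Size"], marker="D", s=64, color=color)
    for letter in letters:
        minor_pos.append(base[letter] + shift)
        minor_lab.append(name)

# Major ticks show the letters, minor ticks show the weight of each column
ax.set_xticks(list(range(len(letters))))
ax.set_xticklabels(letters)
ax.set_xticks(minor_pos, minor=True)
ax.set_xticklabels(minor_lab, minor=True)
# Push the letters below the weight labels
ax.tick_params(axis="x", which="major", pad=25, length=0)
ax.set_ylim(0, 105)
```

The Case split from the question is left out to keep the sketch short; it would need a second, smaller offset per case.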
Setting dodge=True enables this:
import seaborn as sns
tips = sns.load_dataset("tips")
sns.violinplot(x="day", y="total_bill", hue="smoker",
data=tips, palette="muted")
sns.stripplot(x="day", y="total_bill", hue="smoker",
data=tips, palette="muted", dodge=True)
EDIT:
And with the df provided by the OP:
import pandas as pd
df = pd.read_csv('./ongenz.tsv', sep='\t')
sns.stripplot(x='Letter', y='Size', hue='Case', data=df, dodge=True)
I am trying to plot two DataFrames with seaborn in one figure.
Given these test data:
df = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD'))
df['Name'] = 'Adam'
df.iloc[::5, 4] = 'Berta'
df.head(10)
A B C D Name
0 40 75 45 6 Berta
1 52 98 55 44 Adam
2 57 61 70 17 Adam
3 52 5 20 28 Adam
4 63 53 74 49 Adam
5 53 28 97 26 Berta
6 64 38 73 56 Adam
7 25 65 34 64 Adam
8 95 91 92 60 Adam
9 6 54 5 58 Adam
and
df1 = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD'))
df1['Location'] = 'New York'
df1.iloc[::5, 4] = 'Tokyo'
df1.head(10)
A B C D Location
0 89 16 23 15 Tokyo
1 7 35 26 21 New York
2 64 94 51 61 New York
3 84 16 15 36 New York
4 55 62 0 2 New York
5 73 93 4 1 Tokyo
6 93 11 27 69 New York
7 14 52 50 45 New York
8 26 77 86 32 New York
9 21 10 68 11 New York
A) For the first plot, I would like a relplot or scatterplot where both DataFrames share the same x and y axes but have a different "hue". If I try:
sb.relplot(data=df, x='Name', y='C', hue="Name", height=8.27, aspect=11.7/8.27)
sb.relplot(data=df1, x='Location', y='C', hue="Location", height=8.27, aspect=11.7/8.27)
plt.show()
The latter plot either overwrites the first or creates a new figure. Any ideas?
B) Now we have the same y-axes (let's say "amount"), but with different x-axes (strings).
I found this here: How to overlay two seaborn relplots? and it looks pretty good, but if I try:
fig, ax = plt.subplots()
sb.scatterplot(x="Name", y='A', data=df, hue="Name", ax=ax)
ax2 = ax.twinx()
sb.scatterplot(data=df1, x='Location', y='A', hue="Location", ax =ax2)
plt.show()
then the second scatterplot is drawn over the values of the first one, overwriting the names on the x-axis. But I would like to add the second scatterplot on the right. Is this possible?
In my opinion it doesn't make sense to concatenate the two dataframes.
Thanks very much!
Having gathered all the questions you asked, I assume you want either to plot two subplots in one row for the two DataFrames or to plot two sets of data on one figure.
As for the 'A' plot:
fig, ax = plt.subplots(1, 2, figsize=(8, 4), sharey=True)
sb.scatterplot(data=df, x='Name', y='A', hue='Name',
ax=ax[0])
sb.scatterplot(data=df1, x='Location', y='A', hue='Location',
ax=ax[1])
plt.show()
Here I created both fig and ax using plt.subplots() so that I could place each scatter plot on a separate subplot, specifying the number of rows (1), the number of columns (2), and a shared y-axis. Here's what I got (sorry for not bothering with the legend location and other decorations):
As for the 'B' plot, if you want everything on one plot, you may try:
fig, ax = plt.subplots(1, 1, figsize=(8, 4))
sb.scatterplot(data=df, x='Name', y='A', hue='Name', palette=['blue', 'orange'],
ax=ax)
sb.scatterplot(data=df1, x='Location', y='A', hue='Location', palette=['red', 'green'],
ax=ax)
ax.set_xlabel('Name/Location')
plt.show()
Here I made a single subplot and assigned both scatter plots to it. It might require color mapping and renaming the x-axis:
I am using a bar chart to plot query frequencies, but I consistently see uneven spacing between the bars. The gaps look like they should be related to the ticks, but they're in different positions.
This shows up in larger plots
And smaller ones
import matplotlib.pyplot as plt

def TestPlotByFrequency(df, f_field, freq, description):
    # Plot the first `freq` rows of column `f_field` as a bar chart
    top = df[f_field][0:freq]
    fig, ax = plt.subplots()
    ax.bar(top.index, top.values)
    plt.show()
This is not related to the data either; none of the rows at the top have the same frequency count:
count
0 8266
1 6603
2 5829
3 4559
4 4295
5 4244
6 3889
7 3827
8 3769
9 3673
10 3606
11 3479
12 3086
13 2995
14 2945
15 2880
16 2847
17 2825
18 2719
19 2631
20 2620
21 2612
22 2590
23 2583
24 2569
25 2503
26 2430
27 2287
28 2280
29 2234
30 2138
Is there any way to make these consistent?
The problem has to do with aliasing as the bars are too thin to really be separated. Depending on the subpixel value where a bar starts, the white space will be visible or not. The dpi of the plot can either be set for the displayed figure or when saving the image. However, if you have too many bars increasing the dpi will only help a little.
As suggested in this post, you can also save the image as SVG to get a vector format. Depending on where you want to use it, it can then be rendered perfectly.
import matplotlib
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
matplotlib.rcParams['figure.dpi'] = 300
t = np.linspace(0.0, 2.0, 50)
s = 1 + np.sin(2 * np.pi * t)
df = pd.DataFrame({'time': t, 'voltage': s})
fig, ax = plt.subplots()
ax.bar(df['time'], df['voltage'], width = t[1]*.95)
plt.savefig("test.png", dpi=300)
plt.show()
Image with 100 dpi:
Image with 300 dpi:
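The SVG suggestion above can be sketched as follows. The data mirrors the sine example; writing to an in-memory buffer is only a choice to keep the sketch free of side effects, and a filename such as "test.svg" (a placeholder name) can be passed to savefig instead:

```python
import io
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

t = np.linspace(0.0, 2.0, 50)
s = 1 + np.sin(2 * np.pi * t)

fig, ax = plt.subplots()
ax.bar(t, s, width=t[1] * 0.95)

# Vector output: bar edges are stored as coordinates, not rasterized pixels
buffer = io.BytesIO()
fig.savefig(buffer, format="svg")
svg_bytes = buffer.getvalue()
```

How crisp the gaps look then depends on the viewer that renders the SVG rather than on a fixed dpi.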
I have a data frame that looks like -
id age_bucket state gender duration category1 is_active
1 (40, 70] Jammu and Kashmir m 123 ABB 1
2 (17, 24] West Bengal m 72 ABB 0
3 (40, 70] Bihar f 109 CA 0
4 (17, 24] Bihar f 52 CA 1
5 (24, 30] MP m 23 ACC 1
6 (24, 30] AP m 103 ACC 1
7 (30, 40] West Bengal f 182 GF 0
I want to create a bar plot showing how many people are active for each age_bucket and state (top 10). For gender and category1, I want to create a pie chart with the proportion of active people. The top of each bar should display the total count of active and inactive members, and similarly the percentage should be displayed on the pie chart based on is_active.
How to do it in python using seaborn or matplotlib?
I have done so far -
import seaborn as sns
%matplotlib inline
sns.barplot(x='age_bucket',y='is_active',data=df)
sns.barplot(x='category1',y='is_active',data=df)
It sounds like you want to count the observations rather than plot a value from a column along the y-axis. In seaborn, the function for this is countplot():
sns.countplot(x='age_bucket', hue='is_active', data=df)
Since the returned object is a matplotlib Axes, you could assign it to a variable (e.g. ax) and then use ax.annotate to place text in the figure manually:
ax = sns.countplot(x='age_bucket', hue='is_active', data=df)
ax.annotate('1 1', (0, 1), ha='center', va='bottom', fontsize=12)
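Rather than hard-coding each label, the annotation can be driven by the bars themselves. The sketch below uses a pandas crosstab and its bar plot as a stand-in for countplot (an assumption made to keep the example small); a similar loop over ax.patches should work on the Axes returned by seaborn, possibly after skipping zero-height patches:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in built from the sample rows in the question
df = pd.DataFrame({
    "age_bucket": ["(40, 70]", "(17, 24]", "(40, 70]", "(17, 24]",
                   "(24, 30]", "(24, 30]", "(30, 40]"],
    "is_active": [1, 0, 0, 1, 1, 1, 0],
})

# Rows: age buckets, columns: is_active (0 / 1), values: counts
counts = pd.crosstab(df["age_bucket"], df["is_active"])
ax = counts.plot.bar(rot=0)

# Write the count above every bar that has a non-zero height
for patch in ax.patches:
    height = patch.get_height()
    if height > 0:
        ax.annotate(f"{int(height)}",
                    (patch.get_x() + patch.get_width() / 2, height),
                    ha="center", va="bottom", fontsize=12)
```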
Seaborn has no way of creating pie charts, so you would need to use matplotlib directly. However, it is often easier to tell counts and proportions from bar charts so I would generally recommend that you stick to those unless you have a specific constraint that forces you to use a pie chart.
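If a pie chart is required after all, a minimal matplotlib sketch might look like this. It assumes the wanted proportion is each gender's share of the active members; the column names come from the question, and the rows are the sample data:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in built from the sample rows in the question
df = pd.DataFrame({
    "gender": ["m", "m", "f", "f", "m", "m", "f"],
    "is_active": [1, 0, 0, 1, 1, 1, 0],
})

# Number of active members per gender (is_active is 0 or 1, so sum counts them)
active_by_gender = df.groupby("gender")["is_active"].sum()

fig, ax = plt.subplots()
wedges, texts, autotexts = ax.pie(active_by_gender.values,
                                  labels=list(active_by_gender.index),
                                  autopct="%1.1f%%")  # percentage on each wedge
ax.set_title("Active members by gender")
```

The same pattern applies to category1 by changing the groupby column.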
I need to compare different sets of daily data between 4 shifts (categorical / groupby), using bar graphs and line graphs. I have looked everywhere and have not found a working solution that doesn't involve generating new pivots and such.
I've used both matplotlib and seaborn, and while I can do one or the other (different colored bars or lines for each shift), once I incorporate the other, either one disappears or other anomalies happen, such as only one plot point showing. There are solutions for representing a single series of data on both chart types, but none that cover multi-category or grouped data for both.
Data Example:
report_date wh_id shift Head_Count UTL_R
3/17/19 55 A 72 25%
3/18/19 55 A 71 10%
3/19/19 55 A 76 20%
3/20/19 55 A 59 33%
3/21/19 55 A 65 10%
3/22/19 55 A 54 20%
3/23/19 55 A 66 14%
3/17/19 55 1 11 10%
3/17/19 55 2 27 13%
3/17/19 55 3 18 25%
3/18/19 55 1 23 100%
3/18/19 55 2 16 25%
3/18/19 55 3 12 50%
3/19/19 55 1 28 10%
3/19/19 55 2 23 50%
3/19/19 55 3 14 33%
3/20/19 55 1 29 25%
3/20/19 55 2 29 25%
3/20/19 55 3 10 50%
3/21/19 55 1 17 20%
3/21/19 55 2 29 14%
3/21/19 55 3 30 17%
3/22/19 55 1 12 14%
3/22/19 55 2 10 100%
3/22/19 55 3 17 14%
3/23/19 55 1 16 10%
3/23/19 55 2 11 100%
3/23/19 55 3 13 10%
tm_daily_df = pd.read_csv('fg_TM_Daily.csv')
tm_daily_df = tm_daily_df.set_index('report_date')
fig2, ax2 = plt.subplots(figsize=(12,8))
ax3 = ax2.twinx()
group_obj = tm_daily_df.groupby('shift')
g = group_obj['Head_Count'].plot(kind='bar', x='report_date', y='Head_Count',ax=ax2,stacked=False,alpha = .2)
g = group_obj['UTL_R'].plot(kind='line',x='report_date', y='UTL_R', ax=ax3,marker='d', markersize=12)
plt.legend(tm_daily_df['shift'].unique())
This code has gotten me the closest I've been able to get. Notice that even with stacked=False, the bars are still stacked. I changed the setting to True, and nothing changed.
All I need is for the bars to be next to each other, with the same color scheme representing each shift.
The graph:
Here are two solutions (stacked and unstacked). Based on your questions we will:
plot Head_Count in the left y axis and UTL_R in the right y axis.
report_date will be our x axis
shift will represent the hue of our graph.
The stacked version uses pandas default plotting feature, and the unstacked version uses seaborn.
EDIT
Following your request, I added a 100% stacked graph. While it is not exactly what you asked for in the comment, the graph type you asked for may create some confusion when reading (are the values based on the upper line of the stack or the width of the stack?). An alternative solution may be using a 100% stacked graph.
Stacked
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
dfg = df.set_index(['report_date', 'shift']).sort_index(level=[0,1])
fig, ax = plt.subplots(figsize=(12,6))
ax2 = ax.twinx()
dfg['Head_Count'].unstack().plot.bar(stacked=True, ax=ax, alpha=0.6)
dfg['UTL_R'].unstack().plot(kind='line', ax=ax2, marker='o', legend=None)
ax.set_title('My Graph')
plt.show()
Stacked 100%
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
dfg = df.set_index(['report_date', 'shift']).sort_index(level=[0,1])
# Create `Head_Count_Pct` column
for date in dfg.index.get_level_values('report_date').unique():
    for shift in dfg.loc[date, :].index.get_level_values('shift').unique():
        dfg.loc[(date, shift), 'Head_Count_Pct'] = dfg.loc[(date, shift), 'Head_Count'].sum() / dfg.loc[(date, 'A'), 'Head_Count'].sum()
fig, ax = plt.subplots(figsize=(12,6))
ax2 = ax.twinx()
pal = sns.color_palette("Set1")
dfg[dfg.index.get_level_values('shift').isin(['1','2','3'])]['Head_Count_Pct'].unstack().plot.bar(stacked=True, ax=ax, alpha=0.5, color=pal)
dfg['UTL_R'].unstack().plot(kind='line', ax=ax2, marker='o', legend=None, color=pal)
ax.set_title('My Graph')
plt.show()
Unstacked
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
dfg = df.set_index(['report_date', 'shift']).sort_index(level=[0,1])
fig, ax = plt.subplots(figsize=(15,6))
ax2 = ax.twinx()
sns.barplot(x=dfg.index.get_level_values('report_date'),
y=dfg.Head_Count,
hue=dfg.index.get_level_values('shift'), ax=ax, alpha=0.7)
sns.lineplot(x=dfg.index.get_level_values('report_date'),
y=dfg.UTL_R,
hue=dfg.index.get_level_values('shift'), ax=ax2, marker='o', legend=None)
ax.set_title('My Graph')
plt.show()
EDIT #2
Here is the graph you requested the second time (stacked, but stack n+1 does not start where stack n ends).
It is slightly more involved, as we have to do multiple things:
- we need to manually assign a color to each shift in our df
- once our colors are assigned, we iterate through each date and 1) sort our Head_Count values in descending order (so that the largest stack is in the back when we plot the graph), and 2) plot the data and assign the color to each stack
- then we can create our second y axis and plot our UTL_R values
- then we need to assign the correct color to our legend labels
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
def assignColor(shift):
    # Matplotlib's single-letter color codes are lowercase
    if shift == 'A':
        return 'r'
    if shift == '1':
        return 'b'
    if shift == '2':
        return 'g'
    if shift == '3':
        return 'y'
# map a color to a shift
df['color'] = df['shift'].apply(assignColor)
fig, ax = plt.subplots(figsize=(15,6))
# plot our Head_Count values
for date in df.report_date.unique():
    d = df[df.report_date == date].sort_values(by='Head_Count', ascending=False)
    y = d.Head_Count.values
    x = date
    color = d.color
    b = plt.bar(x, y, color=color)
# Plot our UTL_R values
ax2 = ax.twinx()
sns.lineplot(x=df.report_date, y=df.UTL_R, hue=df['shift'], marker='o', legend=None)
# Assign the color label color to our legend
leg = ax.legend(labels=df['shift'].unique(), loc=1)
legend_maping = dict()
for shift in df['shift'].unique():
    legend_maping[shift] = df[df['shift'] == shift].color.unique()[0]
for i, leg_lab in enumerate(leg.texts):
    leg.legendHandles[i].set_color(legend_maping[leg_lab.get_text()])
How about this?
tm_daily_df['UTL_R'] = tm_daily_df['UTL_R'].str.replace('%', '').astype('float') / 100
pivoted = tm_daily_df.pivot_table(values=['Head_Count', 'UTL_R'],
index='report_date',
columns='shift')
pivoted
# Head_Count UTL_R
# shift 1 2 3 A 1 2 3 A
# report_date
# 3/17/19 11 27 18 72 0.10 0.13 0.25 0.25
# 3/18/19 23 16 12 71 1.00 0.25 0.50 0.10
# 3/19/19 28 23 14 76 0.10 0.50 0.33 0.20
# 3/20/19 29 29 10 59 0.25 0.25 0.50 0.33
# 3/21/19 17 29 30 65 0.20 0.14 0.17 0.10
# 3/22/19 12 10 17 54 0.14 1.00 0.14 0.20
# 3/23/19 16 11 13 66 0.10 1.00 0.10 0.14
fig, ax = plt.subplots()
pivoted['Head_Count'].plot.bar(ax=ax)
pivoted['UTL_R'].plot.line(ax=ax, legend=False, secondary_y=True, marker='D')
ax.legend(loc='upper left', title='shift')
I'm trying to plot the following data with matplotlib.
Month A B C
0 2014/06 41 17 3
1 2014/07 48 11 7
2 2014/08 58 20 4
3 2014/09 43 16 6
4 2014/10 73 13 7
5 2014/11 69 22 16
6 2014/12 65 34 9
7 2015/01 69 27 12
I'm having the following code:
x = np.arange(len(df["Month"].values))
y1=df["A"].values.astype(int)
y2=df["B"].values.astype(int)
y3=df["C"].values.astype(int)
my_xticks = df["Month"].values
plt.xticks(x, my_xticks)
plt.plot(x,y1)
plt.plot(x,y2)
plt.plot(x,y3)
plt.show()
The problem is that the months overlap each other on the x-axis. Can Python adjust this automatically? I need not only to rotate the labels but also to skip some months automatically. Otherwise, it's too crowded.
Matplotlib has a function which can automatically format your x axis when they are dates - autofmt_xdate. This automatically rotates the labels, and positions the ticks. They can be changed from the defaults by passing arguments to this function. They can also, of course, be changed manually, but this requires (slightly) more effort.
You can easily reduce the number of dates shown by sampling every 2nd element of the list, using the slice notation [::2]
# Code here that creates a list of dates called list_of_dates...
print(list_of_dates)
# ['2016-08', '2016-09', '2016-10', '2016-11', '2016-12', '2017-01',
# '2017-02', '2017-03', '2017-04', '2017-05', '2017-06', '2017-07',
# '2017-08', '2017-09', '2017-10', '2017-11', '2017-12', '2018-01']
x = np.arange(0, len(list_of_dates), 1)
plt.xticks(x[::2], list_of_dates[::2])
plt.plot(x, np.random.randn(len(list_of_dates)))
# plt.gcf() means "get current figure"
plt.gcf().autofmt_xdate(ha="center")
plt.show()
Which gives:
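An alternative sketch, assuming the Month strings from the question can be parsed into real timestamps: matplotlib then treats the axis as dates, and a date locator decides which months get a tick. The two-month interval is an arbitrary choice for illustration:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import pandas as pd

# Data from the question
df = pd.DataFrame({
    "Month": ["2014/06", "2014/07", "2014/08", "2014/09",
              "2014/10", "2014/11", "2014/12", "2015/01"],
    "A": [41, 48, 58, 43, 73, 69, 65, 69],
    "B": [17, 11, 20, 16, 13, 22, 34, 27],
    "C": [3, 7, 4, 6, 7, 16, 9, 12],
})

# Parse the strings into timestamps so the x-axis is a true date axis
df["Month"] = pd.to_datetime(df["Month"], format="%Y/%m")

fig, ax = plt.subplots()
for column in ["A", "B", "C"]:
    ax.plot(df["Month"], df[column], label=column)

# Show a tick every second month and keep the original label format
ax.xaxis.set_major_locator(mdates.MonthLocator(interval=2))
ax.xaxis.set_major_formatter(mdates.DateFormatter("%Y/%m"))
fig.autofmt_xdate()  # rotate and align the date labels
ax.legend()
```

Leaving out the explicit locator lets matplotlib's default date locator choose the tick spacing on its own.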