I'm trying to overlay a seaborn lineplot on a seaborn boxplot.
The result is somewhat "shocking" :)
The two graphs seem to be drawn in the same figure but kept separate:
the box plot is compressed on the left side and the line plot on the right side.
Note that if I run the two graphs separately, they work fine.
I cannot figure out how to make it work.
Thank you in advance for any help.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
mydata = pd.DataFrame({
'a':[2012, 2012, 2012, 2012, 2013, 2013, 2013, 2013, 2014, 2014, 2014, 2014, 2015, 2015, 2015, 2015, 2016, 2016, 2016, 2016, 2017, 2017, 2017, 2017, 2018, 2018, 2018, 2018, 2019, 2019, 2019, 2019, 2020, 2020, 2020, 2020],
'v':[383.00, 519.00, 366.00, 436.00, 1348.00, 211.00, 139.00, 614.00, 365.00, 365.00, 383.00, 602.00, 994.00, 719.00, 589.00, 365.00, 990.00, 1142.00, 262.00, 1263.00, 507.00, 222.00, 363.00, 274.00, 195.00, 730.00, 730.00, 592.00, 479.00, 607.00, 292.00, 657.00, 453.00, 691.00, 673.00, 705]
})
means =mydata.groupby('a').v.mean().reset_index()
fig, ax = plt.subplots(figsize=(15,8))
sns.boxplot(data=mydata, x='a', y='v', ax=ax, showfliers=False)
sns.lineplot(data=means, x='a', y='v', ax=ax)
plt.show()
Surprisingly, I did not find a duplicate for this question with a good answer, so I elevate my comment to one. Arise, Sir Comment:
Instead of lineplot, you should use pointplot
...
sns.boxplot(data=mydata, x='a', y='v', ax=ax, showfliers=False)
sns.pointplot(data=means, x='a', y='v', ax=ax)
plt.show()
Sample output:
pointplot is the categorical counterpart of lineplot, and boxplot treats its x variable as categorical. See the seaborn documentation on relational and categorical plotting for more.
The question came up why there is no problem with lineplot for the following data:
mydata = pd.DataFrame({'a':["m1", "m1", "m1", "m2", "m2", "m2", "m2", "m3", "m3", "m3", "m3", "m4", "m4", "m4", "m4"], 'v':[11.37, 11.31, 10.93, 9.43, 9.62, 6.61, 9.31, 11.27, 8.47, 11.86, 8.77, 8.8, 9.58, 12.26, 10] })
means =mydata.groupby('a').v.mean().reset_index()
print(means)
fig, ax = plt.subplots(figsize=(15,8))
sns.boxplot(data=mydata, x='a', y='v', ax=ax, showfliers=False)
sns.lineplot(data=means, x='a', y='v', ax=ax)
plt.show()
Output:
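The difference is that this example leaves no ambiguity for lineplot. Seaborn's lineplot accepts both categorical and numerical data. It seems to try plotting x as numerical data first and to fall back to treating it as categorical only when that is not possible (I have not checked the source code). Here the x values are strings, so they can only be categorical and they line up with the boxes. In the original example the years are numeric, so lineplot draws them at 2012 to 2020 while boxplot places its boxes at the categorical positions 0, 1, 2, ..., which is why the two plots end up at opposite sides of the figure. This is probably a good design decision by seaborn, because the alternative (not accepting categorical data) would cause far more problems than the rare case in which people plot both categorical and numerical data in the same figure. A warning from seaborn would be a good thing, though.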
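If you want to keep a plain line rather than switch to pointplot, one option is to draw the means with matplotlib at the positions the boxes occupy. This is only a minimal sketch and not part of the original answer: it assumes seaborn's default behaviour of placing the boxes at integer positions 0, 1, 2, ... in sorted order, it uses a shortened copy of the question's data, and the names positions and mean_line are illustrative.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Shortened copy of the question's data (first three years only).
mydata = pd.DataFrame({
    'a': [2012, 2012, 2012, 2012, 2013, 2013, 2013, 2013, 2014, 2014, 2014, 2014],
    'v': [383.0, 519.0, 366.0, 436.0, 1348.0, 211.0, 139.0, 614.0,
          365.0, 365.0, 383.0, 602.0],
})
means = mydata.groupby('a').v.mean().reset_index()  # groupby sorts by year

fig, ax = plt.subplots(figsize=(15, 8))
sns.boxplot(data=mydata, x='a', y='v', ax=ax, showfliers=False)

# Assumption: the boxes sit at x = 0, 1, 2, ..., so draw the means there
# instead of at the numeric year values.
positions = list(range(len(means)))
(mean_line,) = ax.plot(positions, means['v'], marker='o', color='black')

# plt.show()  # uncomment when running interactively
```

The tick labels still show the years, because they come from the boxplot; only the line's x coordinates change.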
I am looking to add a shaded box to my plot below. I want the box to go from Aug 25-Aug 30 and to run the length of the Y axis.
The following is my code for the two plots I have made...
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates
df = pd.read_excel('salinity_temp.xlsx')
dates = df['Date']
sal = df['Salinity']
temp = df['Temperature']
fig, axes = plt.subplots(2, 1, figsize=(8,8), sharex=True)
axes[0].plot(dates, sal, lw=5, color="red")
axes[0].set_ylabel('Salinity (PSU)')
axes[0].set_title('Salinity', fontsize=14)
axes[1].set_title('Temperature', fontsize=14)
axes[1].plot(dates, temp, lw=5, color="blue")
axes[1].set_ylabel('Temperature (C)')
axes[1].set_xlabel('Dates, 2017', fontsize=12)
axes[1].xaxis.set_major_formatter(matplotlib.dates.DateFormatter('%b %d'))
axes[0].xaxis.set_major_formatter(matplotlib.dates.DateFormatter('%b %d'))
axes[1].xaxis_date()
axes[0].xaxis_date()
I want the shaded box to highlight when Hurricane Harvey hit Houston, Texas (Aug 25- Aug 30). My data looks like:
Date Salinity Temperature
20-Aug 15.88144647 31.64707184
21-Aug 18.83088846 31.43848419
22-Aug 19.51015264 31.47655487
23-Aug 23.41655369 31.198349
24-Aug 25.16410124 30.63014984
25-Aug 25.2273574 28.8677597
26-Aug 28.35557667 27.49458313
27-Aug 18.52829235 25.92834473
28-Aug 7.423231661 24.06635284
29-Aug 0.520394177 23.47881317
30-Aug 0.238508327 23.90857697
31-Aug 0.143210364 24.30892944
1-Sep 0.206473387 25.20442963
2-Sep 0.241343182 26.32663727
3-Sep 0.58000503 26.93431854
4-Sep 1.182055098 27.8212738
5-Sep 3.632014919 28.23947906
6-Sep 4.672006985 27.29686737
7-Sep 5.938766377 26.8693161
8-Sep 9.107671159 26.48963928
9-Sep 8.180587303 26.05213165
10-Sep 6.200532091 25.73104858
11-Sep 5.144526191 25.60035706
12-Sep 5.106032451 25.73139191
13-Sep 4.279492562 26.06132507
14-Sep 5.255868992 26.74919128
15-Sep 8.026764063 27.23724365
I have tried using the rectangle function from this link (https://discuss.analyticsvidhya.com/t/how-to-add-a-patch-in-a-plot-in-python/5518), but I can't seem to get it to work properly.
Independent of your specific data, it sounds like you need axvspan. Try running this after your plotting code:
for ax in axes:
    ax.axvspan('2017-08-25', '2017-08-30', color='black', alpha=0.5)
This will work if dates = df['Date'] is stored as type datetime64. It might not work with other datetime data types, and it won't work if dates contains date strings.
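If the Date column holds strings such as '25-Aug', one way to meet the datetime64 requirement is to parse it first. The following is a rough sketch under stated assumptions, not the asker's exact setup: it uses a few rows copied from the question instead of the Excel file, it assumes the year is 2017 (taken from the axis label), and it passes pd.Timestamp objects to axvspan rather than strings.

```python
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import pandas as pd

# A few rows from the question, with dates stored as strings like '24-Aug'.
df = pd.DataFrame({
    'Date': ['23-Aug', '24-Aug', '25-Aug', '26-Aug', '27-Aug',
             '28-Aug', '29-Aug', '30-Aug', '31-Aug'],
    'Salinity': [23.41655369, 25.16410124, 25.2273574, 28.35557667, 18.52829235,
                 7.423231661, 0.520394177, 0.238508327, 0.143210364],
    'Temperature': [31.198349, 30.63014984, 28.8677597, 27.49458313, 25.92834473,
                    24.06635284, 23.47881317, 23.90857697, 24.30892944],
})

# Assumption: all dates are in 2017, so append the year before parsing.
dates = pd.to_datetime(df['Date'] + '-2017', format='%d-%b-%Y')

fig, axes = plt.subplots(2, 1, figsize=(8, 8), sharex=True)
axes[0].plot(dates, df['Salinity'], lw=5, color='red')
axes[1].plot(dates, df['Temperature'], lw=5, color='blue')

# Shade Aug 25 to Aug 30 over the full height of each subplot.
start, end = pd.Timestamp('2017-08-25'), pd.Timestamp('2017-08-30')
spans = [ax.axvspan(start, end, color='black', alpha=0.5) for ax in axes]

axes[1].xaxis.set_major_formatter(mdates.DateFormatter('%b %d'))
```

axvspan spans the whole y range by default, so no y limits are needed for the box to run the length of the axis.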
I want to create a scatter plot with only one trendline. Plotly express creates a different trendline for each color in the points list.
import plotly.express as px
value = [15, 20, 35, 40, 48]
years = [2010, 2011, 2012, 2013, 2014]
colors = ['red', 'red', 'blue', 'blue', 'blue']
fig = px.scatter(
x=years,
y=value,
trendline='ols',
color=colors
)
fig.show()
Is there a way to create just one trendline for all the points?
Plot:
Thanks in advance!
With the release of Plotly 5.2.1 (2021-08-13), px.scatter() lets you specify:
trendline_scope = 'overall'
Plot 1 - trendline_scope = 'overall'
If the greenish color of the trendline is not to your liking, you can change that through:
trendline_color_override = 'black'
Plot 2 - trendline_color_override = 'black'
The other option for trendline_scope is 'trace', which produces:
Plot 3 - trendline_scope = 'trace'
Complete code:
import plotly.express as px
df = px.data.tips()
fig = px.scatter(df, x="total_bill", y="tip",
color="sex",
trendline="ols",
trendline_scope = 'overall',
# trendline_scope = 'trace'
trendline_color_override = 'black'
)
fig.show()
Previous answer for older versions:
Since you're not specifically asking for a built-in plotly express feature, you can easily build on px.scatter() and obtain what you want using statsmodels' OLS together with add_traces(go.Scatter()):
Plot:
Code:
import plotly.express as px
import plotly.graph_objs as go
import statsmodels.api as sm
value = [15, 20, 35, 40, 48]
years = [2010, 2011, 2012, 2013, 2014]
colors = ['red', 'red', 'blue', 'blue', 'blue']
# your original setup
fig = px.scatter(
x=years,
y=value,
color=colors
)
# linear regression
regline = sm.OLS(value,sm.add_constant(years)).fit().fittedvalues
# add linear regression line for whole sample
fig.add_traces(go.Scatter(x=years, y=regline,
mode = 'lines',
marker_color='black',
name='trend all')
)
fig
And you can have it both ways:
Plot:
Change in code: Just add trendline='ols'
fig = px.scatter(
x=years,
y=value,
trendline='ols',
color=colors
)
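If statsmodels is not available, the fitted values for the single overall trendline can also be computed with NumPy. This is a swapped-in technique rather than the answer's method: a minimal sketch that uses np.polyfit for a degree-1 least-squares fit on the question's data, with the years shifted to start at zero (an assumption made here to keep the fit well conditioned).

```python
import numpy as np

value = [15, 20, 35, 40, 48]
years = [2010, 2011, 2012, 2013, 2014]

# Fit value = intercept + slope * (year - first_year) by least squares.
x = np.asarray(years, dtype=float) - years[0]
slope, intercept = np.polyfit(x, value, deg=1)

# Fitted values at the original years; these can be passed as y to
# go.Scatter in place of the statsmodels fittedvalues.
regline = intercept + slope * x
```

The resulting regline array has one value per year and can replace the statsmodels line in the add_traces call without further changes.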
There's no built-in feature for this at the moment, no, unfortunately! But it's a good idea and I've created an issue to suggest it as an addition: https://github.com/plotly/plotly.py/issues/1846
I have a dataframe with two columns, VOL and INVOL, and for a particular year the values are the same. Hence, when plotting in seaborn, I cannot see the value of one column where the two lines converge.
For example:
My dataframe is
When I use seaborn with the code below,
df5_test = df5_test.melt('FY', var_name='cols', value_name='vals')
g = sns.catplot(x="FY", y="vals", hue='cols', data=df5_test, kind='point')
the chart does not show both series at the shared value of 0.06.
I tried using pandas plotting, having the same result.
Please advise what I should do. Thanks in advance.
Your plot looks legitimate. The two lines overlap perfectly because the data from 2016 to 2018 are exactly the same. One option is to plot the two lines separately and add or subtract a small offset from one of them so that it shifts slightly. For example:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.DataFrame({'FY': [2012, 2013, 2014, 2015, 2016, 2017, 2018],
'VOL_PCT': [0, 0.08, 0.07, 0.06, 0, 0, 0.06],
'INVOL_PC': [0, 0, 0, 0, 0, 0, 0.06]})
# plot
fig, ax = plt.subplots()
# recent seaborn versions require x and y as keyword arguments
sns.lineplot(x=df.FY, y=df.VOL_PCT, ax=ax)
sns.lineplot(x=df.FY+.01, y=df.INVOL_PC-.001, ax=ax)
In addition, given the type of your data, you could also consider using stack plots. For example:
fig, ax = plt.subplots()
labels = ['VOL_PCT', 'INVOL_PC']
ax.stackplot(df.FY, df.VOL_PCT, df.INVOL_PC, labels=labels)
ax.legend(loc='upper left');
Ref. Stackplot
I have a Pandas series with values for which I'd like to plot counts. This creates roughly what I want:
dy = sns.countplot(x=rated.year, color="#53A2BE")
axes = dy.axes
dy.set(xlabel='Release Year', ylabel = "Count")
dy.spines['top'].set_color('none')
dy.spines['right'].set_color('none')
plt.show()
The problem comes with missing data. There are 31 years with ratings, but over a timespan of 42 years. That means there should be some empty bins, which are not being displayed. Is there a way to configure this in Seaborn/Matplotlib? Should I use another type of graph, or is there another fix for this?
I thought about looking into whether it is possible to configure it as a time series, but I have the same problem with rating scales. So, on a 1-10 scale the count for e.g. 4 might be zero, and therefore '4' is not in the Pandas data series, which means it also does not show up in the graph.
The result I'd like is the full scale on the x-axis, with counts (for steps of one) on the y-axis, and showing zero/empty bins for missing instances of the scale, instead of simply showing the next bin for which data is available.
EDIT:
The data (rated.year) looks something like this:
import pandas as pd
rated = pd.DataFrame(data = [2016, 2004, 2007, 2010, 2015, 2016, 2016, 2015,
2011, 2010, 2016, 1975, 2011, 2016, 2015, 2016,
1993, 2011, 2013, 2011], columns = ["year"])
It has more values, but the format is the same. As you can see in..
rated.year.value_counts()
..there are quite a few x values for which count would have to be zero in the graph. Currently plot looks like:
I solved the problem with the solution suggested by @mwaskom in the comments on my question: pass an 'order' to countplot that lists all valid values for year, including those whose count is zero. This is the code that produces the graph:
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
rated = pd.DataFrame(data = [2016, 2004, 2007, 2010, 2015, 2016, 2016, 2015,
2011, 2010, 2016, 1975, 2011, 2016, 2015, 2016,
1993, 2011, 2013, 2011], columns = ["year"])
dy = sns.countplot(x=rated.year, color="#53A2BE", order=list(range(rated.year.min(), rated.year.max()+1)))
axes = dy.axes
dy.set(xlabel='Release Year', ylabel = "Count")
dy.spines['top'].set_color('none')
dy.spines['right'].set_color('none')
plt.show()
Consider a seaborn barplot built from a reindexed series cast to a dataframe:
# REINDEXED DATAFRAME (explicit column names, so the result does not
# depend on how the pandas version names value_counts output)
rated_ser = (rated['year'].value_counts()
             .reindex(range(rated.year.min(), rated.year.max()+1), fill_value=0)
             .rename_axis('year')
             .reset_index(name='count'))
# SNS BAR PLOT
dy = sns.barplot(x='year', y='count', data=rated_ser, color="#53A2BE")
dy.set_xticklabels(dy.get_xticklabels(), rotation=90) # ROTATE LABELS, 90 DEG.
axes = dy.axes
dy.set(xlabel='Release Year', ylabel = "Count")
dy.spines['top'].set_color('none')
dy.spines['right'].set_color('none')
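The same reindexing idea carries over to the rating-scale case mentioned in the question. The sketch below is illustrative only: the ratings are made-up values on a 1-10 scale (chosen so that 4 never occurs), and the names ratings, full_scale and rating_counts are hypothetical.

```python
import pandas as pd

# Hypothetical ratings on a 1-10 scale; note that 4 never occurs.
ratings = pd.Series([7, 8, 8, 3, 10, 7, 5, 8, 1, 7], name='rating')

# Count each value, then reindex over the full scale so that
# values missing from the data appear with a count of zero.
full_scale = range(1, 11)
counts = ratings.value_counts().reindex(full_scale, fill_value=0)

# Tidy frame with explicit column names, suitable for a bar plot
# with x='rating' and y='count'.
rating_counts = counts.rename_axis('rating').reset_index(name='count')
```

Alternatively, list(full_scale) can be passed as the order argument of countplot, as in the accepted answer above, which avoids building the counts by hand.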