Highlight specific sample in stripplot from pandas dataframe

I have a pandas dataframe like the following (although with more rows and columns):
Index  LOC1   LOC2  LOC3
A      0.054  1.2   0.00
B      0.38   3.89  0.027
C      3.07   2.67  1.635
D      7.36   6.2   0.23
I was wondering if it's possible to highlight the stripplot dots that belong to a specific sample. In my dataframe, the samples are the index names ('A', 'B', ...). So, for example, I would like to use a different color for the values in the 'C' row.
Because I pass my dataset in wide form (https://seaborn.pydata.org/generated/seaborn.stripplot.html), I guess I can't use hue, and I wasn't able to figure out any other way.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.DataFrame(
{
"Index": list("ABCD"),
"LOC1": [0.054, 0.38, 3.07, 7.36],
"LOC2": [1.2, 3.89, 2.67, 6.2],
"LOC3": [0.0, 0.027, 1.635, 0.23]
}
)
fig = plt.figure()
ax=sns.boxplot(data=df, showfliers=False, medianprops=dict(color='red', linewidth=3))
ax=sns.stripplot(data=df,jitter=True, size=12, color=".3")
plt.show()

You could reshape your dataframe to long form and then use hue. Assuming 'Index' is the dataframe index, you need to call reset_index before melt:
import seaborn as sns
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(10,10))
title = 'TEST'
fig.suptitle(title,y=0.92,fontsize=36)
ax=sns.boxplot(data=df, showfliers=False, medianprops=dict(color='red', linewidth=3))
dfm = df.reset_index().melt('Index')
ax=sns.stripplot(data=dfm, x='variable', y='value', hue='Index', jitter=True, size=12, linewidth=1)
Output:
To give all but one sample the same color, relabel the other samples as 'Other' before melting (here only 'B' keeps its own label):
import seaborn as sns
import matplotlib.pyplot as plt
df = df.replace({'A':'Other', 'C':'Other','D':'Other'})
fig = plt.figure(figsize=(10,10))
title = 'TEST'
fig.suptitle(title,y=0.92,fontsize=36)
ax=sns.boxplot(data=df, showfliers=False, medianprops=dict(color='red', linewidth=3))
dfm = df.reset_index().melt('Index')
ax=sns.stripplot(data=dfm, x='variable', y='value', hue='Index', jitter=True, size=12, linewidth=1)
Output:

Related

how to set x_axis label(not xtick label) for all subplots in relplot?

I tried drawing subplots with seaborn's relplot method. Because the original dataset varies, I sometimes don't know how many subplots there will be in the end.
I set col_wrap to limit the number of columns, but sometimes the result doesn't look good. For example, I set col_wrap = 3 while there are 5 subplots, as below:
As the figure shows, the x-axis label only appears on C, D and E, which seems strange. I want the x-axis label to be shown on all subplots (from A to E).
I already know that facet_kws={'sharex': 'col'} allows plots to have independent axis scales (according to "set axis limits on individual facets of seaborn facetgrid").
But I want to set x-axis labels on all subplots, and I haven't found a solution for that.
Methods like set_xlabels on the FacetGrid object don't seem to help, because the official documentation says they only label "the bottom row of the grid".
FacetGrid.set_xlabels(label=None, clear_inner=True, **kwargs)
Label the x axis on the bottom row of the grid.
The following are my example data and my code:
  city  date  value
0    A     1      9
1    B     1     20
2    C     1      4
3    D     1     33
4    E     1      2
5    A     2     22
6    B     2     32
7    C     2     27
8    D     2     32
9    E     2     18
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
df = pd.read_excel("data/example_data.xlsx")
# print(df)
g = sns.relplot(data=df, x="date", y="value", kind="line", col="city", col_wrap=3,
errorbar=None, facet_kws={'sharex': 'col'})
(g.set_axis_labels("x_axis", "y_axis", )
.set_titles("{col_name}")
.tight_layout()
.add_legend()
)
plt.subplots_adjust(top=0.94, wspace=None, hspace=0.4)
plt.show()
Thanks in advance.

In order to reduce superfluous information, Seaborn makes these inner labels invisible. You can make them visible again:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
df = pd.DataFrame({'date': np.repeat([1, 2], 5),
'value': np.random.randint(1, 20, 10),
'city': np.tile([*'abcde'], 2)})
# print(df)
g = sns.relplot(data=df, x="date", y="value", kind="line", col="city", col_wrap=3,
errorbar=None, facet_kws={'sharex': 'col'})
g.set_titles("{col_name}")
g.add_legend()
for ax in g.axes.flat:
    ax.set_xlabel('x axis', visible=True)
    ax.set_ylabel('y axis', visible=True)
plt.subplots_adjust(top=0.94, wspace=None, hspace=0.4)
plt.show()

How to merge two plots in Pandas?

I want to merge two plots. This is my dataframe:
df_inc.head()
id  date  real_exe_time   mean  mean+30%  mean-30%
0   Jan              31  33.14      43.0      23.0
1   Jan              30  33.14      43.0      23.0
2   Jan              33  33.14      43.0      23.0
3   Jan              38  33.14      43.0      23.0
4   Jan              36  33.14      43.0      23.0
My first plot:
df_inc.plot.scatter(x = 'date', y = 'real_exe_time')
Then my second plot:
df_inc.plot(x='date', y=['mean','mean+30%','mean-30%'])
When I try to merge with:
fig=plt.figure()
ax = df_inc.plot(x='date', y=['mean','mean+30%','mean-30%']);
df_inc.plot.scatter(x = 'date', y = 'real_exe_time', ax=ax)
plt.show()
I got the following:
How can I merge them the right way?

You should not repeat your mean values as an extra column. df.plot() for categorical data will be plotted against the index - hence you will see the original scatter plot (also plotted against the index) squeezed into the left corner.
Instead, you could create an additional aggregated dataframe and plot it into the same graph:
import matplotlib.pyplot as plt
import pandas as pd
#test data generation
import numpy as np
n=30
np.random.seed(123)
df = pd.DataFrame({"date": np.random.choice(list("ABCDEF"), n), "real_exe_time": np.random.randint(1, 100, n)})
df = df.sort_values(by="date").reindex()
#aggregate data for plotting
df_agg = df.groupby("date")["real_exe_time"].agg(mean="mean").reset_index()
df_agg["mean+30%"] = df_agg["mean"] * 1.3
df_agg["mean-30%"] = df_agg["mean"] * 0.7
#plot both into the same subplot
ax = df.plot.scatter(x = 'date', y = 'real_exe_time')
df_agg.plot(x='date', y=['mean','mean+30%','mean-30%'], ax=ax)
plt.show()
Sample output:
You could also consider using seaborn that has, for instance, pointplots for categorical data aggregation.

I'm guessing that you haven't transformed the date column to datetime objects, so the first thing you should do is this:
#Transform the date to datetime object
df_inc['date']=pd.to_datetime(df_inc['date'],format='%b')
fig=plt.figure()
ax = df_inc.plot(x='date', y=['mean','mean+30%','mean-30%']);
df_inc.plot.scatter(x = 'date', y = 'real_exe_time', ax=ax)
plt.show()
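One thing worth checking before relying on this conversion, shown here as a small sketch with made-up month values: when only a month abbreviation is parsed with format='%b', the missing year and day are filled with defaults (1900 and 1), so every row of the same month ends up at the same timestamp.

```python
import pandas as pd

# Hypothetical month-only values, similar to the 'date' column in the question.
months = pd.Series(["Jan", "Jan", "Feb", "Mar"])

# Only the month is given, so year and day fall back to 1900 and 1.
parsed = pd.to_datetime(months, format="%b")

print(parsed.dt.strftime("%Y-%m-%d").tolist())
# → ['1900-01-01', '1900-01-01', '1900-02-01', '1900-03-01']
```

If the real data carries a day or year somewhere else, combining those fields before parsing would keep the points from collapsing onto one x position.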

how plot multiples dataframe csv in same plot

I have 4 dataframes in 4 CSV files. I need to plot the time series (Date, mean) in the same plot.
This is my script:
cc = Series.from_csv('D:/python/means2000_2001.csv' , header=0)
fig = plt.figure()
plt.plot(cc , color='red')
fig.suptitle('test title', fontsize=20)
plt.xlabel('Date', fontsize=15)
plt.ylabel('MEANS ', fontsize=15)
plt.xticks(rotation=90)
The 4 dataframes look like this (x=Date and y=mean):
Out[307]:
Date
07-28 0.17
08-13 0.18
08-29 0.17
09-14 0.19
09-30 0.19
10-16 0.20
11-01 0.18
11-17 0.22
12-03 0.21
12-19 0.82
01-02 0.59
01-18 0.52
02-03 0.54
02-19 0.53
03-07 0.33
03-23 0.32
04-08 0.31
04-24 0.39
05-10 0.40
05-26 0.40
06-11 0.37
06-27 0.33
07-13 0.29
Name: mean, dtype: float64
When I plot the time series I get this graph:
How can I plot all the dataframes in the same plot with different colors?
I need something like this:

You can do both:
plot all curves with one single command, see: plt.plot()
address each single curve separately, see the for-loop with plt.fill_between()
If you have 2 DataFrames, say df1 and df2, then use plt.plot() twice:
plt.plot(t,df1); plt.plot(t,df2); plt.show()
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
#--- generate data and DataFrame --
nt = 100
t= np.linspace(0,1,nt)*3*np.pi
y1 = np.sin(t); y2 = np.cos(t); y3 = y1*y2
df = pd.DataFrame({'y1':y1,'y2':y2,'y3':y3 })
#--- graphics ---
plt.style.use('fast')
fig, ax0 = plt.subplots(figsize=(20,4))
plt.plot(t,df, lw=4, alpha=0.6); # plot all curves with 1 command
for j in range(len(df.columns)): # add on: fill_between for each curve
    plt.fill_between(t,df.values[:,j],label=df.columns[j],alpha=0.2)
plt.legend(prop={'size':15});plt.grid(axis='y');plt.show()

The answer
You can plot multiple dataframes on a single graph by capturing the Axes object that df.plot returns and then reusing it. Here's an example with two dataframes, df1 and df2:
ax = df1.plot(x='dates', y='vals', label='val 1')
df2.plot(x='dates', y='vals', label='val 2', ax=ax)
plt.show()
Output:
Details
Here's the code I used to generate random example values for df1 and df2:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
def random_dates(start, end, n=10):
    if isinstance(start, str): start = pd.to_datetime(start)
    if isinstance(end, str): end = pd.to_datetime(end)
    start_u = start.value//10**9
    end_u = end.value//10**9
    return pd.to_datetime(np.random.randint(start_u, end_u, n), unit='s')
# generate two random dfs
df1 = pd.DataFrame({'dates': random_dates('2016-01-01', '2016-12-31'), 'vals': np.random.rand(10)})
df2 = pd.DataFrame({'dates': random_dates('2016-01-01', '2016-12-31'), 'vals': np.random.rand(10)})
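For the CSV case in the question, the same shared-Axes idea might look like the sketch below. Series.from_csv was removed in pandas 1.0, so pd.read_csv is used instead. The in-memory strings stand in for the real files, and the labels and the values of the second series are made up for illustration.

```python
import io
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-ins for the CSV files; in practice these would be file paths.
csv_sources = {
    "2000_2001": io.StringIO("Date,mean\n07-28,0.17\n08-13,0.18\n08-29,0.17\n"),
    "2001_2002": io.StringIO("Date,mean\n07-28,0.21\n08-13,0.25\n08-29,0.19\n"),
}

fig, ax = plt.subplots()
series_by_label = {}
for label, source in csv_sources.items():
    # Read one file and keep the 'mean' column as a Series indexed by Date.
    s = pd.read_csv(source, index_col="Date")["mean"]
    series_by_label[label] = s
    # Each call on the same Axes picks the next color from the color cycle.
    ax.plot(s.index, s.values, label=label)

ax.set_xlabel("Date", fontsize=15)
ax.set_ylabel("MEANS", fontsize=15)
ax.tick_params(axis="x", rotation=90)
ax.legend()
plt.show()
```

Passing an explicit color= to each ax.plot call would pin the colors instead of relying on the default cycle.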

Plotting Date fails on x-axis and extension to subplots

I am trying to plot some data, but the dates shown on the x-axis are not in the proper format. Instead of 2018-01-03 etc. I am getting 0028-02-23. The dates already have the proper format when the data is loaded from the csv file.
In addition, I would like to plot the data in separate subplots, i.e. Valuegroup A in subplot 1, Valuegroup B in subplot 2, etc.
The figure looks like this:
The code looks like this:
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.dates import DateFormatter
csv_loader = pd.read_csv('C:/Users/micha/Desktop/Test.csv', encoding='cp1252', parse_dates=['Date'], sep=';', index_col=0).dropna()
fig, ax = plt.subplots()
csv_loader.groupby('Valuegroup').plot(x='Date', y='Value', ax=ax, legend=False, kind='line')
plt.grid(True)
myFmt = DateFormatter("%Y-%m-%d")
ax.xaxis.set_minor_formatter(myFmt)
plt.show()
The data looks like:
Calcgroup;Valuegroup;id;Date;Value
Group1;A;1;20080103;0.1
Group1;A;1;20080104;0.3
Group1;A;1;20080107;0.5
Group1;A;1;20080108;0.9
Group1;B;1;20080103;0.5
Group1;B;1;20080104;1.3
Group1;B;1;20080107;2.0
Group1;B;1;20080108;0.15
Group1;C;1;20080103;1.9
Group1;C;1;20080104;2.1
Group1;C;1;20080107;2.9
Group1;C;1;20080108;0.45
and after importing I have this dataframe:
csv_loader
          Valuegroup  id       Date  Value
Calcgroup
Group1             A   1 2008-01-03   0.10
Group1             A   1 2008-01-04   0.30
Group1             A   1 2008-01-07   0.50
Group1             A   1 2008-01-08   0.90
Group1             B   1 2008-01-03   0.50
Group1             B   1 2008-01-04   1.30
Group1             B   1 2008-01-07   2.00
Group1             B   1 2008-01-08   0.15
Group1             C   1 2008-01-03   1.90
Group1             C   1 2008-01-04   2.10
Group1             C   1 2008-01-07   2.90
Group1             C   1 2008-01-08   0.45

Try out this solution
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.dates import DateFormatter
import matplotlib.dates as mdates
csv_loader = pd.read_csv('C:/Users/micha/Desktop/Test.csv', encoding='cp1252', parse_dates=['Date'], sep=';', index_col=0)
fig, ax = plt.subplots()
csv_loader.groupby('Valuegroup').plot(x='Date', y='Value', ax=ax, legend=False, kind='line')
plt.grid(True)
myFmt = DateFormatter('%Y-%m-%d')
ax.xaxis.set_major_locator(mdates.YearLocator())
ax.xaxis.set_major_formatter(myFmt)
fig.autofmt_xdate()
plt.show()

To be honest, I was not able to find what's going wrong with the date format in your code. However, while testing my approach for plotting in separate subplots, which you also asked for, I saw that the formatting problem was gone and the automatic format was already the one you want:
fig, axs = plt.subplots(3, sharex=True, sharey=True)
for i, (name, grp) in enumerate(csv_loader.groupby('Valuegroup')):
    axs[i].plot(grp.Date, grp.Value)
    axs[i].set_title(name)
plt.tight_layout()
See for yourself:

Wrong Dates in Dataframe and Subplots

I am trying to plot the data in my csv file. Currently my dates are not shown properly in the plot, even though I am converting them. How can I change it to show the proper date format as defined (Y-m-d)? My second question: I am currently plotting all the data in one plot, but I want one subplot for every Valuegroup.
My code looks like the following:
import pandas as pd
import matplotlib.pyplot as plt
csv_loader = pd.read_csv('C:/Test.csv', encoding='cp1252', sep=';', index_col=0).dropna()
csv_loader['Date'] = pd.to_datetime(csv_loader['Date'], format="%Y-%m-%d")
print(csv_loader)
fig, ax = plt.subplots()
csv_loader.groupby('Valuegroup').plot(x='Date', y='Value', ax=ax, legend=False, kind='line')
plt.grid(True)
The csv file looks like the following:
Calcgroup;Valuegroup;id;Date;Value
Group1;A;1;20080103;0.1
Group1;A;1;20080104;0.3
Group1;A;1;20080107;0.5
Group1;A;1;20080108;0.9
Group1;B;1;20080103;0.5
Group1;B;1;20080104;1.3
Group1;B;1;20080107;2.0
Group1;B;1;20080108;0.15
Group1;C;1;20080103;1.9
Group1;C;1;20080104;2.1
Group1;C;1;20080107;2.9
Group1;C;1;20080108;0.45

You can just tell pandas to parse that column as a datetime and it will just work:
In[151]:
import io
import pandas as pd
import matplotlib.pyplot as plt
t="""Calcgroup;Valuegroup;id;Date;Value
Group1;A;1;20080103;0.1
Group1;A;1;20080104;0.3
Group1;A;1;20080107;0.5
Group1;A;1;20080108;0.9
Group1;B;1;20080103;0.5
Group1;B;1;20080104;1.3
Group1;B;1;20080107;2.0
Group1;B;1;20080108;0.15
Group1;C;1;20080103;1.9
Group1;C;1;20080104;2.1
Group1;C;1;20080107;2.9
Group1;C;1;20080108;0.45"""
df = pd.read_csv(io.StringIO(t), parse_dates=['Date'], sep=';', index_col=0)
df
Out[151]:
          Valuegroup  id       Date  Value
Calcgroup
Group1             A   1 2008-01-03   0.10
Group1             A   1 2008-01-04   0.30
Group1             A   1 2008-01-07   0.50
Group1             A   1 2008-01-08   0.90
Group1             B   1 2008-01-03   0.50
Group1             B   1 2008-01-04   1.30
Group1             B   1 2008-01-07   2.00
Group1             B   1 2008-01-08   0.15
Group1             C   1 2008-01-03   1.90
Group1             C   1 2008-01-04   2.10
Group1             C   1 2008-01-07   2.90
Group1             C   1 2008-01-08   0.45
fig, ax = plt.subplots()
df.groupby('Valuegroup').plot(x='Date', y='Value', ax=ax, legend=False, kind='line')
plt.grid(True)
plt.show()
results in:
Besides, your format string was incorrect anyway; it should be:
csv_loader['Date'] = pd.to_datetime(csv_loader['Date'], format="%Y%m%d")
However, that column will have been loaded with an int dtype, so to be safe convert it to string first:
csv_loader['Date'] = pd.to_datetime(csv_loader['Date'].astype(str), format="%Y%m%d")
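A short sketch of that conversion on two rows of the question's data, read from an in-memory string rather than the real file; the variable names raw and dtype_before are illustrative only.

```python
import io
import pandas as pd

csv_text = """Calcgroup;Valuegroup;id;Date;Value
Group1;A;1;20080103;0.1
Group1;A;1;20080104;0.3
"""

# Without parse_dates, the Date column is read as plain integers.
raw = pd.read_csv(io.StringIO(csv_text), sep=";", index_col=0)
dtype_before = raw["Date"].dtype
print(dtype_before)

# Convert to string first, then parse with the hyphen-free format.
raw["Date"] = pd.to_datetime(raw["Date"].astype(str), format="%Y%m%d")
print(raw["Date"].dt.strftime("%Y-%m-%d").tolist())
# → ['2008-01-03', '2008-01-04']
```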
To format the dates on the x-axis, you can use DateFormatter from matplotlib (see the related question "Editing the date formatting of x-axis tick labels in matplotlib"):
from matplotlib.dates import DateFormatter
fig, ax = plt.subplots()
df.groupby('Valuegroup').plot(x='Date', y='Value', ax=ax, legend=False, kind='line')
plt.grid(True)
myFmt = DateFormatter("%d-%m-%Y")
ax.xaxis.set_minor_formatter(myFmt)
plt.show()
now gives plot:

You're parsing your dates wrong; "%Y-%m-%d" would work for dates like 2017-12-11 (which is Dec 11, 2017). Your dates are of the form "%Y%m%d", without the hyphens.
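The difference between the two format strings can be checked with the standard library alone. This is a minimal sketch using one date value from the question's CSV; the variable names are illustrative.

```python
from datetime import datetime

raw = "20080103"  # date as it appears in the CSV file

# The hyphen-free format matches the raw value.
parsed = datetime.strptime(raw, "%Y%m%d")

# The hyphenated format does not match and raises ValueError.
try:
    datetime.strptime(raw, "%Y-%m-%d")
    hyphen_format_ok = True
except ValueError:
    hyphen_format_ok = False

print(parsed.date(), hyphen_format_ok)
# → 2008-01-03 False
```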

We Keep Coding
