How to draw a shaded approximation curve - python

Using the data in test.csv, I'd like to plot the values in the Time column on the x-axis against the values in the A_x columns (x=1,2,3) on the y-axis, and do the following:
・Draw an approximate curve for each of the three data series.
・Draw the values in the A_xsd columns (x=1,2,3) as the standard deviation, not as error bars but as a shaded band.
However, due to my lack of knowledge, I'm only halfway there. I'd be grateful if someone could show me a correct approach. Thank you very much.
(This is a simplified version of the original data, which has more than 1000 lines.)
#!/usr/bin/python
# -*- coding: utf-8 -*-
import pandas as pd
from pandas import DataFrame
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# test.csv has a header row, so let pandas read the column names from it
df = pd.read_csv('test.csv')
x1 = df['Time']
y1 = df['A_1']
x2 = df['Time']
y2 = df['A_2']
x3 = df['Time']
y3 = df['A_3']
sd1 = df['A_1sd']
sd2 = df['A_2sd']
sd3 = df['A_3sd']
fig, ax = plt.subplots()
ax.set_xlim(0, 5)
ax.set_ylim(0, 150)
ax.set_xlabel("Time", fontsize=10)
ax.set_ylabel("OD600", fontsize=10)
ax.grid()
ax.tick_params(labelsize=10)
test.csv
Time,A_1,A_1sd,A_2,A_2sd,A_3,A_3sd
1,6,76,23159,125,23239,40
2,20,85,22709,99,22809,50
3,46,20,22629,89,22749,62
4,12,81,22729,85,22859,86
5,1,75,23029,90,23219,112

One of the three data columns has a range very different from the other two, so we should use subplots. You could use a loop to create the same kind of subplot for each of the three data columns.
The shading can be done with matplotlib's fill_between() method:
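import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('test.csv')
x = df['Time']
fig, ax = plt.subplots(3, 1, sharex=True)
for n in (1, 2, 3):
    # Look the columns up by name instead of using eval()
    y = df[f'A_{n}']
    sd = df[f'A_{n}sd']
    ax[n-1].plot(x, y)
    # Shade one standard deviation above and below the data
    ax[n-1].fill_between(x, y - sd, y + sd, alpha=0.3)
plt.show()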
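The plot above connects the raw points. If a fitted ("approximate") curve is wanted, one possible sketch is to fit a low-order polynomial with np.polyfit and shade one standard deviation around it. The polynomial degree (2), the linear interpolation of the A_xsd values onto a dense grid with np.interp, and the output file name are assumptions made for illustration. The sample rows from test.csv are inlined so the block runs on its own.

```python
import io

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Sample rows from test.csv, inlined so the example is self-contained.
csv_text = """Time,A_1,A_1sd,A_2,A_2sd,A_3,A_3sd
1,6,76,23159,125,23239,40
2,20,85,22709,99,22809,50
3,46,20,22629,89,22749,62
4,12,81,22729,85,22859,86
5,1,75,23029,90,23219,112
"""
df = pd.read_csv(io.StringIO(csv_text))

x = df["Time"].to_numpy(dtype=float)
x_dense = np.linspace(x.min(), x.max(), 200)  # fine grid for a smooth curve

fig, axes = plt.subplots(3, 1, sharex=True)
fits = {}
for ax, n in zip(axes, (1, 2, 3)):
    y = df[f"A_{n}"].to_numpy(dtype=float)
    sd = df[f"A_{n}sd"].to_numpy(dtype=float)

    # Assumption: a degree-2 polynomial is an adequate approximation.
    coeffs = np.polyfit(x, y, deg=2)
    y_fit = np.polyval(coeffs, x_dense)
    sd_dense = np.interp(x_dense, x, sd)  # linear interpolation of the SD
    fits[n] = y_fit

    ax.plot(x, y, "o", ms=3, label="data")
    ax.plot(x_dense, y_fit, label="degree-2 fit")
    ax.fill_between(x_dense, y_fit - sd_dense, y_fit + sd_dense, alpha=0.3)
    ax.set_ylabel(f"A_{n}")
axes[-1].set_xlabel("Time")
fig.savefig("shaded_fit.png")  # hypothetical output name; use plt.show() interactively
```

For the real data with more than 1000 rows, a smoothing spline or a rolling mean may describe the trend better than a single polynomial; the shading step stays the same.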

Related

How to plot on exactly rows of a dataframe

This is hard to describe in words, so here is a picture to illustrate:
As the image shows, I want to plot a line on each row separately based on their values on a data frame. Is it possible with Python libraries?
Here's an example to get you started: it uses table to plot the dataframe and overplots the stacked lines. The line for each row is shifted by ymax, the maximum value in the dataframe, to prevent overlapping.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
# make sample data
np.random.seed(0)
df = pd.DataFrame(np.random.rand(41,5))
df.index = [f'Row {i}' for i in df.index]
fig, ax = plt.subplots(figsize=(4,10))
ax.set_axis_off()
# plot data as table
ax.table(cellText=[[f'{v:.1f}' for v in row] for row in df.to_numpy()], rowLabels=list(df.index), bbox=[0,0,1,1])
# plot curve over table
ymax = df.max().max()
ax.set_ylim(0, ymax * len(df))
ax.plot((df.to_numpy() + ((len(df) - 1 - df.reset_index(drop=True).index.to_numpy()) * ymax)[:, None]).T, color='C0')
To use alternating colors, you can set the color cycler:
from cycler import cycler
# ...
ax.set_prop_cycle(cycler(color='rg'))
ax.plot((df.to_numpy() + ((len(df) - 1 - df.reset_index(drop=True).index.to_numpy()) * ymax)[:, None]).T)
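The offset expression inside the ax.plot call is dense. As a rough sketch of what it computes, the helper below (stack_rows is a hypothetical name, not part of pandas or matplotlib) shifts row i up by (n_rows - 1 - i) * ymax, so that the first row lands in the top band and matches the table order.

```python
import numpy as np
import pandas as pd


def stack_rows(df: pd.DataFrame) -> np.ndarray:
    """Return the values of df with each row shifted into its own band.

    Row 0 is placed in the top band and the last row in the bottom band,
    each band being ymax (the global maximum) tall.
    """
    values = df.to_numpy(dtype=float)
    n_rows = len(df)
    ymax = values.max()
    # Offsets n_rows-1, ..., 1, 0 (in units of ymax), one per row.
    offsets = (n_rows - 1 - np.arange(n_rows)) * ymax
    return values + offsets[:, None]


rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((4, 5)))
stacked = stack_rows(df)
# ax.plot(stacked.T, color='C0') would then draw one line per row.
```

The transpose in the plot call is needed because matplotlib treats each column of a 2D array as a separate line.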

Plot CDF of columns from a CSV file using pandas

I want to plot CDF value of columns from a CSV file using pandas as follows:
I have tried some codes, but they are not reporting the correct plot. Can you help with an easy way?
df = pd.read_csv('pathfile.csv')
def compute_distrib(df, col):
    stats_df = df.groupby(col)[col].agg('count')\
                 .pipe(pd.DataFrame).rename(columns={col: 'frequency'})
    # PDF
    stats_df['pdf'] = stats_df['frequency'] / sum(stats_df['frequency'])
    # CDF
    stats_df['CDF'] = stats_df['pdf'].cumsum()
    # modifications
    stats_df = stats_df.reset_index()\
                       .rename(columns={col:"X"})
    stats_df[" "] = col
    return stats_df
cdf = []
for col in ['1','2','3','4']:
    cdf.append(compute_distrib(df, col))
cdf = pd.concat(cdf, ignore_index=True)
import seaborn as sns
sns.lineplot(x=cdf["X"],
             y=cdf["CDF"],
             hue=cdf[" "]);
Due to the lack of runnable code on your post, I created my own code for plotting the CDF of the columns of a dataframe df:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from itertools import accumulate
# GENERATE EXAMPLE DATA
df = pd.DataFrame()
df['x1'] = np.random.uniform(-1,1, size=1000)
df['x2'] = df['x1'] + np.random.uniform(-1,1, size=1000)
df['x3'] = df['x2'] + np.random.uniform(-1,1, size=1000)
df['x4'] = df['x3'] + np.random.uniform(-1, 1, size=1000)
# START A PLOT
fig,ax = plt.subplots()
for col in df.columns:
    # SKIP IF IT HAS ANY INFINITE VALUES
    if not all(np.isfinite(df[col].values)):
        continue
    # USE numpy's HISTOGRAM FUNCTION TO COMPUTE BINS
    # (density=True replaces the normed argument, which recent NumPy no longer accepts)
    xh, xb = np.histogram(df[col], bins=60, density=True)
    # COMPUTE THE CUMULATIVE SUM WITH accumulate
    xh = list(accumulate(xh))
    # NORMALIZE THE RESULT
    xh = np.array(xh) / max(xh)
    # PLOT WITH LABEL
    ax.plot(xb[1:], xh, label=f"$CDF$({col})")
ax.legend()
plt.title("CDFs of Columns")
plt.show()
The resulting plot from this code is below:
To put in your own data, just replace the # GENERATE EXAMPLE DATA section with df = pd.read_csv('path/to/sheet.csv')
Let me know if anything in the example is unclear to you or if it needs more explanation.
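If binning is not wanted, an empirical CDF can also be computed directly by sorting each column. The sketch below is an alternative to the histogram approach, not a replacement for it; the helper name ecdf and the tiny example column are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def ecdf(values):
    """Return the sorted finite values and the fraction of samples <= each one."""
    x = np.sort(np.asarray(values, dtype=float))
    x = x[np.isfinite(x)]  # drop NaN and infinite entries
    y = np.arange(1, x.size + 1) / x.size
    return x, y


df = pd.DataFrame({"x1": [3.0, 1.0, 2.0, 2.0]})
x, y = ecdf(df["x1"])
# x is [1, 2, 2, 3] and y is [0.25, 0.5, 0.75, 1.0]
# ax.step(x, y, where="post") would draw it as a step function.
```

Because no bins are involved, the curve reaches exactly 1 at the largest observation and does not depend on a bin count.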

how to rotate a seaborn lineplot

How can I rotate a seaborn.lineplot so that the result is a function of y rather than a function of x?
For example, this code:
import pandas as pd
import seaborn as sns
df = pd.DataFrame([[0,1],[0,2],[0,1.5],[1,1],[1,5]], columns=['group','val'])
sns.lineplot(x='group',y='val',data=df)
creates this figure:
But is there a way to rotate the figure by 90°, so that "val" is on the x-axis, "group" is on the y-axis, and the std band runs from left to right instead of from bottom to top?
Thanks
EDIT: I've opened a ticket in seaborn to ask for this feature: https://github.com/mwaskom/seaborn/issues/1661
Per the seaborn docs on lineplot, the dataframe passed to data must be
Tidy (“long-form”) dataframe where each column is a variable and each row is an observation.
This seems to imply that there is no way to force the axes to switch, even by manipulating the data. If there is a way to do that, I haven't found it. There is probably a more elegant solution, but one option is to do it by hand, so to speak. Something like this would do the trick:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
df = pd.DataFrame([[0,1],[0,2],[0,1.5],[1,1],[1,5]], columns=['group','val'])
group = df['group'].tolist()
val = df['val'].tolist()
yl = list()
yu = list()
avg = list()
ii = 0
while ii < len(group): #Loop through all the groups
    g = group[ii]
    y0 = val[ii]
    y1 = val[ii]
    s = 0
    jj = ii
    while (jj < len(group) and group[jj] == g):
        s += val[jj]
        #This takes the min and max, but could easily take the standard deviation
        if val[jj] > y1:
            y1 = val[jj]
        if val[jj] < y0:
            y0 = val[jj]
        jj += 1
    avg.append(s/(jj - ii))
    ii = jj
    yl.append(y0)
    yu.append(y1)
x = np.linspace(min(group), max(group), len(yl))
plt.ylabel(df.columns[0])
plt.xlabel(df.columns[1])
plt.plot(avg, x, color="#5a9edd", linestyle="-", linewidth=1.5)
plt.fill_betweenx(x, yl, yu, alpha=0.3)
This will give you the following plot:
For brevity this uses the minimum and maximum from each group to give the error band, but that can be easily changed to standard error or standard deviation as needed.
Consider what you'd do if you were not using seaborn. You would calculate the mean and standard deviation and plot those as a function of the group. It is then quite straightforward to exchange x and y: instead of plot(x,y), call plot(y,x). For the filled region, you can use fill_betweenx instead of fill_between.
Below are the two cases for comparison.
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame([[0,1],[0,2],[0,1.5],[1,1],[1,5]], columns=['group','val'])
mean = df.groupby("group").mean()
std = df.groupby("group").std()
fig, (ax, ax2) = plt.subplots(ncols=2)
ax.plot(mean.index, mean["val"].values)
ax.fill_between(mean.index, (mean-std)["val"].values, (mean+std)["val"].values, alpha=.5)
ax.set(xlabel="group", ylabel="val")
ax2.plot(mean["val"].values, mean.index)
ax2.fill_betweenx(mean.index, (mean-std)["val"].values, (mean+std)["val"].values, alpha=.5)
ax2.set(ylabel="group", xlabel="val")
fig.tight_layout()
plt.show()

How to add axis offset in matplotlib plot?

I'm drawing several point plots in seaborn on the same graph. The x-axis is ordinal, not numerical; the ordinal values are the same for each point plot. I would like to shift each plot a bit to the side, the way pointplot(dodge=...) parameter does within multiple lines within a single plot, but in this case for multiple different plots drawn on top of each other. How can I do that?
Ideally, I'd like a technique that works for any matplotlib plot, not just seaborn specifically. Adding an offset to the data won't work easily, since the data is not numerical.
Example that shows the plots overlapping and making them hard to read (dodge within each plot works okay)
import pandas as pd
import seaborn as sns
df1 = pd.DataFrame({'x':list('ffffssss'), 'y':[1,2,3,4,5,6,7,8], 'h':list('abababab')})
df2 = df1.copy()
df2['y'] = df2['y']+0.5
sns.pointplot(data=df1, x='x', y='y', hue='h', ci='sd', errwidth=2, capsize=0.05, dodge=0.1, markers='<')
sns.pointplot(data=df2, x='x', y='y', hue='h', ci='sd', errwidth=2, capsize=0.05, dodge=0.1, markers='>')
I could use something other than seaborn, but the automatic confidence / error bars are very convenient so I'd prefer to stick with seaborn here.
Answering this for the most general case first.
A dodge can be implemented by shifting the artists in the figure by some amount. It might be useful to use points as units of that shift. E.g. you may want to shift your markers on the plot by 5 points.
This shift can be accomplished by adding a translation to the data transform of the artist. Here I propose a ScaledTranslation.
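To make the units concrete, here is a small sketch of what such a ScaledTranslation does: a shift of 10 points is 10/72 inch, which fig.dpi_scale_trans converts to pixels. The figure DPI of 100 and the probe point (0.5, 0.5) are arbitrary choices for illustration.

```python
import matplotlib.pyplot as plt
from matplotlib import transforms

fig, ax = plt.subplots(dpi=100)

# 10 points = 10/72 inch; dpi_scale_trans scales inches to pixels.
offset = transforms.ScaledTranslation(10 / 72., 0, fig.dpi_scale_trans)
shifted = ax.transData + offset

# Compare where the same data point lands with and without the offset.
p0 = ax.transData.transform((0.5, 0.5))
p1 = shifted.transform((0.5, 0.5))
shift_px = p1 - p0  # horizontal shift of 10/72 * dpi pixels, no vertical shift
```

Because the offset is applied after the data transform, it is a fixed physical distance on the page and does not depend on the axis limits or on whether the x values are numeric.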
Now to keep this most general, one may write a function which takes the plotting method, the axes and the data as input, and in addition some dodge to apply, e.g.
draw_dodge(ax.errorbar, X, y, yerr =y/4., ax=ax, dodge=d, marker="d" )
The full functional code:
import matplotlib.pyplot as plt
from matplotlib import transforms
import numpy as np
import pandas as pd
def draw_dodge(*args, **kwargs):
    func = args[0]
    dodge = kwargs.pop("dodge", 0)
    ax = kwargs.pop("ax", plt.gca())
    trans = ax.transData + transforms.ScaledTranslation(dodge/72., 0,
                                   ax.figure.dpi_scale_trans)
    artist = func(*args[1:], **kwargs)
    def iterate(artist):
        if hasattr(artist, '__iter__'):
            for obj in artist:
                iterate(obj)
        else:
            artist.set_transform(trans)
    iterate(artist)
    return artist
X = ["a", "b"]
Y = np.array([[1,2],[2,2],[3,2],[1,4]])
Dodge = np.arange(len(Y),dtype=float)*10
Dodge -= Dodge.mean()
fig, ax = plt.subplots()
for y,d in zip(Y,Dodge):
    draw_dodge(ax.errorbar, X, y, yerr =y/4., ax=ax, dodge=d, marker="d" )
ax.margins(x=0.4)
plt.show()
You may use this with ax.plot, ax.scatter, etc., but not with any of the seaborn functions, because they don't return any useful artist to work with.
Now for the case in question, the remaining problem is to get the data in a useful format. One option would be the following.
df1 = pd.DataFrame({'x':list('ffffssss'),
'y':[1,2,3,4,5,6,7,8],
'h':list('abababab')})
df2 = df1.copy()
df2['y'] = df2['y']+0.5
N = len(np.unique(df1["x"].values))*len([df1,df2])
Dodge = np.linspace(-N,N,N)/N*10
fig, ax = plt.subplots()
k = 0
for df in [df1,df2]:
    for (n, grp) in df.groupby("h"):
        # Select the numeric column explicitly; pandas 2.x raises on mean() of string columns
        x = grp.groupby("x")["y"].mean()
        std = grp.groupby("x")["y"].std()
        draw_dodge(ax.errorbar, x.index, x.values,
                   yerr =std.values, ax=ax,
                   dodge=Dodge[k], marker="o", label=n)
        k+=1
ax.legend()
ax.margins(x=0.4)
plt.show()
You can use linspace to shift your graphs to where you want them to start and end. It also makes it easy to scale the graphs so that they span the same visual width:
import numpy as np
import matplotlib.pyplot as plt
start_offset = 3
end_offset = start_offset
y1 = np.random.randint(0, 10, 20) # y1 has 20 random ints from 0 to 9 (the upper bound is exclusive)
y2 = np.random.randint(0, 10, 10) # y2 has 10 random ints from 0 to 9
x1 = np.linspace(0, 20, y1.size) # y1.size evenly spaced x positions from 0 to 20
x2 = np.linspace(0, 20, y2.size)
plt.plot(x1, y1)
plt.plot(x2, y2)
plt.show()
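Note that start_offset and end_offset are defined above but never used. One plausible reading (an assumption, since the answer does not show it) is that they were meant to move the endpoints passed to linspace, so that the second curve starts later and ends earlier than the first:

```python
import numpy as np

start_offset = 3
end_offset = start_offset

rng = np.random.default_rng(0)
y1 = rng.integers(0, 10, 20)  # 20 random ints from 0 to 9
y2 = rng.integers(0, 10, 10)  # 10 random ints from 0 to 9

x1 = np.linspace(0, 20, y1.size)
# Assumed use of the offsets: the second curve spans [3, 17] instead of [0, 20].
x2 = np.linspace(0 + start_offset, 20 - end_offset, y2.size)
# plt.plot(x1, y1); plt.plot(x2, y2) would then draw both curves.
```

This only works when the x positions are numeric, so it does not directly address the ordinal axis described in the question.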

Normalizing pandas DataFrame rows by their sums

What is the most idiomatic way to normalize each row of a pandas DataFrame? Normalizing the columns is easy, so one (very ugly!) option is:
(df.T / df.T.sum()).T
Pandas broadcasting rules prevent df / df.sum(axis=1) from doing this, because the division aligns the index of the row sums with the columns of df.
To overcome the broadcasting issue, you can use the div method:
df.div(df.sum(axis=1), axis=0)
See pandas User Guide: Matching / broadcasting behavior
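A small worked example of the div approach, using made-up numbers:

```python
import pandas as pd

df = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 6.0]})

# Divide each row by its own sum; axis=0 aligns the row sums with the index.
normalized = df.div(df.sum(axis=1), axis=0)
# Row 0 becomes [0.25, 0.75] and row 1 becomes [0.25, 0.75].
```

Each row of normalized now sums to 1, which is the same result the transpose trick in the question produces.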
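I would suggest using the scikit-learn preprocessing module and transposing your dataframe as required. Note that MinMaxScaler, used below, rescales each row to the range [0, 1] rather than dividing it by its sum; for sum normalization, preprocessing.normalize(df, norm='l1') divides each row by the sum of its absolute values.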
'''
Created on 05/11/2015
#author: rafaelcastillo
'''
import matplotlib.pyplot as plt
import pandas
import random
import numpy as np
from sklearn import preprocessing
def create_cos(number_graphs,length,amp):
    # This function is used to generate cos-kind graphs for testing
    # number_graphs: to plot
    # length: number of points included in the x axis
    # amp: Y domain modifications to draw different shapes
    x = np.arange(length)
    amp = np.pi*amp
    xx = np.linspace(np.pi*0.3*amp, -np.pi*0.3*amp, length)
    for i in range(number_graphs):
        iterable = (2*np.cos(x) + random.random()*0.1 for x in xx)
        # np.float was removed from NumPy; the built-in float is equivalent
        y = np.fromiter(iterable, float)
        if i == 0:
            yfinal = y
            continue
        yfinal = np.vstack((yfinal,y))
    return x,yfinal
x,y = create_cos(70,24,3)
data = pandas.DataFrame(y)
x_values = data.columns.values
num_rows = data.shape[0]
fig, ax = plt.subplots()
for i in range(num_rows):
    ax.plot(x_values, data.iloc[i])
ax.set_title('Raw data')
plt.show()
std_scale = preprocessing.MinMaxScaler().fit(data.transpose())
df_std = std_scale.transform(data.transpose())
data = pandas.DataFrame(np.transpose(df_std))
fig, ax = plt.subplots()
for i in range(num_rows):
    ax.plot(x_values, data.iloc[i])
ax.set_title('Data Normalized')
plt.show()
