Plotting boxplots for a groupby object - python

I would like to plot boxplots for several datasets based on a criterion.
Imagine a dataframe similar to the example below:
df = pd.DataFrame({'Group':[1,1,1,2,3,2,2,3,1,3],'M':np.random.rand(10),'F':np.random.rand(10)})
df = df[['Group','M','F']]
Group M F
0 1 0.465636 0.537723
1 1 0.560537 0.727238
2 1 0.268154 0.648927
3 2 0.722644 0.115550
4 3 0.586346 0.042896
5 2 0.562881 0.369686
6 2 0.395236 0.672477
7 3 0.577949 0.358801
8 1 0.764069 0.642724
9 3 0.731076 0.302369
In this case, I have three groups, so I would like to make a boxplot for each group and for M and F separately having the groups on Y axis and the columns of M and F colour-coded.
This answer is very close to what I want to achieve, but I would prefer something more robust, applicable for larger dataframes with greater number of groups. I feel that groupby is the way to go, but I am not familiar with groupby objects and I am failing to even slice them.
The desired output would look something like this:
Looks like years ago, someone had the same problem, but got no answers :( Having a boxplot as a graphical representation of the describe function of groupby
My questions are:
How to implement groupby to feed the desired data into the boxplot
What is the correct syntax for the boxplot if I want to control what is displayed rather than rely on the default settings (I don't even know what those are, as I find the documentation rather vague)? To be specific, can I have the box cover the mean +/- standard deviation and keep the vertical line at the median value?

I think you should use the Seaborn library, which makes this kind of customized plot easy to create. In your case, I first melted the dataframe into long format and then created the boxplot you asked for.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
dd = pd.melt(df, id_vars=['Group'], value_vars=['M', 'F'], var_name='sex')
sns.boxplot(y='Group', x='value', data=dd, orient="h", hue='sex')
The plot looks similar to your required plot.
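To see what the melt step hands to seaborn, here is a minimal sketch on a three-row stand-in for the question's dataframe (the fixed numbers are made up for illustration). Each original row becomes one row per measured column, with the column name stored in `sex` and the number in `value`.

```python
import pandas as pd

# Made-up values standing in for the random numbers in the question.
df = pd.DataFrame({'Group': [1, 1, 2],
                   'M': [0.1, 0.2, 0.3],
                   'F': [0.4, 0.5, 0.6]})

# Wide -> long: 'M' and 'F' are stacked into a single 'value' column,
# and 'sex' records which column each value came from.
dd = pd.melt(df, id_vars=['Group'], value_vars=['M', 'F'], var_name='sex')
print(dd)
```

The long table has one row per (original row, column) pair, which is the layout that `hue='sex'` relies on.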

Finally, I found a solution by slightly modifying this answer. It does not use groupby object, so it is more tedious to prepare the data, but so far it looks like the best solution to me. Here it is:
import numpy as np
import matplotlib.pyplot as plt

# here I prepare the data (group them manually and then store in lists)
Groups = [1, 2, 3]
Columns = df.columns.tolist()[1:]
print(Columns)
Mgroups = []
Fgroups = []
for g in Groups:
    dfgc = df[df['Group'] == g]
    m = dfgc['M'].dropna()
    f = dfgc['F'].dropna()
    Mgroups.append(m.tolist())
    Fgroups.append(f.tolist())

fig = plt.figure()
ax = plt.axes()

def setBoxColors(bp, cl):
    plt.setp(bp['boxes'], color=cl, linewidth=2.)
    plt.setp(bp['whiskers'], color=cl, linewidth=2.5)
    plt.setp(bp['caps'], color=cl, linewidth=2)
    plt.setp(bp['medians'], color=cl, linewidth=3.5)

# whis=(0, 100) extends the whiskers to the min and max (older matplotlib: whis='range')
bpl = plt.boxplot(Mgroups, positions=np.arange(len(Mgroups))*3.0-0.4, vert=False, whis=(0, 100), sym='', widths=0.6)
bpr = plt.boxplot(Fgroups, positions=np.arange(len(Fgroups))*3.0+0.4, vert=False, whis=(0, 100), sym='', widths=0.6)
setBoxColors(bpr, '#D7191C')  # colors are from http://colorbrewer2.org/
setBoxColors(bpl, '#2C7BB6')

# draw temporary red and blue lines and use them to create a legend
plt.plot([], c='#D7191C', label='F')
plt.plot([], c='#2C7BB6', label='M')
plt.legend()
plt.yticks(range(0, len(Groups) * 3, 3), Groups)
plt.ylim(-3, len(Groups)*3)
#plt.xlim(0, 8)
plt.show()
The result looks mostly like what I wanted (as far as I have been able to find, plt.boxplot always draws the box from the first to the third quartile, so I could not set it to +/- standard deviation). I am a bit disappointed that there is no one-line solution, but I am glad it is possible. However, for hundreds of groups this would not be good enough...
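For many groups and a box that spans mean +/- standard deviation, one option is matplotlib's `Axes.bxp`, which draws boxes from statistics you supply instead of computing quartiles itself. The sketch below is not from the answers above: `mean_std_stats` is a hypothetical helper, the whiskers are arbitrarily set to the group minimum and maximum, and the data is a seeded stand-in for the question's random dataframe.

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove to get a window with plt.show()
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
df = pd.DataFrame({'Group': [1, 1, 1, 2, 3, 2, 2, 3, 1, 3],
                   'M': rng.random(10), 'F': rng.random(10)})

def mean_std_stats(frame, column, by='Group'):
    """Hypothetical helper: one bxp stats dict per group.

    The box edges ('q1', 'q3') are set to mean -/+ std instead of quartiles;
    the median line is kept. Whiskers go to the min and max, so with very
    few points a whisker end can fall inside the box.
    """
    stats = []
    for name, values in frame.groupby(by)[column]:
        stats.append({
            'label': str(name),
            'med': values.median(),
            'q1': values.mean() - values.std(),
            'q3': values.mean() + values.std(),
            'whislo': values.min(),
            'whishi': values.max(),
            'fliers': [],
        })
    return stats

m_stats = mean_std_stats(df, 'M')
fig, ax = plt.subplots()
artists = ax.bxp(m_stats, showfliers=False)
```

Because the groups come from `groupby`, the same call works unchanged for hundreds of groups. A second call with `mean_std_stats(df, 'F')` and shifted `positions` would add the other column.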


Matplotlib time-based heatmap [duplicate]

Background: I picked up Python about a month ago, so my experience level is pretty slim. I'm pretty comfortable with VBA through years of data analysis in Excel and PI ProcessBook.
I have 27 thermocouples that I pull data for in 1s intervals. I would like to heatmap them from hottest to coldest at a given instance in time. I've leveraged seaborn heatmaps, but the problem with those is that they compare temperatures across time as well and the aggregate of these thermocouples changes dramatically over time. See chart below:
Notice how in the attached, the pink one is colder than the rest when all of them are cold, but when they all heat up, the cold spot transfers to the orange and green ones (and even the blue one for a little bit at the peak).
In excel, I would write a do loop to apply conditional formatting to each individual timestamp (row), however in Python I can't figure it out for the life of me. The following is the code that I used to develop the above chart, so I'm hoping I can modify this to make it work.
tsStartTime = pd.Timestamp(strStart_Time)
tsEndTime = pd.Timestamp(strEnd_Time)
t = np.linspace(tsStartTime.value,tsEndTime.value, 150301)
TimeAxis = pd.to_datetime(t)
fig,ax = plt.subplots(figsize=(25,5))
plt.subplots_adjust(bottom = 0.25)
x = TimeAxis
i = 1
while i < 28:
    globals()['y' + str(i)] = forceconvert_v(globals()['arTTXD' + str(i)])
    ax.plot(x,globals()['y' + str(i)])
    i += 1
I've tried to use seaborn heatmaps, but when I slice the data by timestamp, the output array has shape (27,) instead of (27,1), so it gets rejected.
Ultimately, I'm looking for an output that looks like this:
Notice how the values of 15 in the middle are blue despite being higher than the red 5s in the beginning. I didn't fill out every cell, but hopefully you get the gist of what I'm trying to accomplish.
This data is being pulled from OSISoft PI via the PIConnect library. PI leverages their own classes, but they are essentially either series or dataframes, but I can manipulate them into whatever they need to be if someone has any awesome ideas to handle this.
Here's the link to the data: https://file.io/JS0RoQvDL6AB
Thanks!
Using globals is the wrong approach here. I suggest using a pandas.DataFrame instead.
What you are looking for can be produced like this:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Settings
col_number = 5
start = '1/1/2022 10:00:00'
end = '1/1/2022 10:10:00'
# prepare a data frame
index = pd.date_range(start=start, end=end, freq="s")  # one-second steps (older pandas also spelled this "S")
columns = [f'y{i}' for i in range(col_number)]
df = pd.DataFrame(index=index, columns=columns)
# fill in the data
for n, col in enumerate(df.columns):
    df[col] = np.array([n + np.sin(2*np.pi*i/len(df)) for i in range(len(df))])
# drawing a heatmap
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(18, 5))
ax1.plot(df)
ax1.legend(df.columns)
ax2.imshow(df.T, aspect='auto', cmap='plasma')
ax2.set_yticks(range(len(df.columns)))
ax2.set_yticklabels(df.columns)
plt.show()
Here:
Since you didn't supply data to reproduce your case, I use sine values for illustration.
Transposing with df.T is needed to lay the records out horizontally. Of course, the data can be written horizontally from the start; it's up to you.
set_yticks is there to avoid trouble when changing the y-labels on the second plot.
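On the point about globals: a dictionary of arrays passed to `pandas.DataFrame` replaces the `globals()['y' + str(i)]` lookups. The sketch below uses made-up constant arrays as stand-ins for the question's `arTTXD1` ... `arTTXD27`, and the column names are hypothetical. It also shows one way to get a single timestamp as a 2-D (27, 1) table rather than a 1-D (27,) array.

```python
import numpy as np
import pandas as pd

# Made-up stand-ins for the 27 thermocouple arrays in the question.
n_points = 5
readings = {f'TTXD{i}': np.full(n_points, float(i)) for i in range(1, 28)}

# One column per thermocouple, one row per second.
index = pd.date_range('2022-01-01 10:00:00', periods=n_points, freq='s')
df = pd.DataFrame(readings, index=index)

# Selecting with a list keeps the result two-dimensional: (1, 27),
# and transposing gives the (27, 1) layout a heatmap expects.
snapshot = df.iloc[[0]].T
print(snapshot.shape)
```

Selecting with `df.iloc[0]` (no inner list) would return a Series of shape (27,), which is the case described in the question.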
seaborn.heatmap(...) can be used as well:
import seaborn as sns
data = df.T
data.columns = df.index.strftime('%H:%M:%S')
plt.figure(figsize=(15,3))
sns.heatmap(data, cmap='plasma', xticklabels=60)
Update
To compare values at each point in time:
data = (data - data.min())/(data.max() - data.min())
sns.heatmap(data, cmap='plasma', xticklabels=60)
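A small worked example of this per-timestamp scaling, using made-up readings with sensors as rows and timestamps as columns (the same orientation as `data` above). Because `min()` and `max()` work column by column, every timestamp is rescaled to the range 0 to 1 independently of the others.

```python
import pandas as pd

# Made-up readings: rows are sensors, columns are timestamps.
data = pd.DataFrame(
    {'10:00:00': [5.0, 4.0, 6.0], '10:00:01': [15.0, 17.0, 19.0]},
    index=['y0', 'y1', 'y2'],
)

# Column-wise min-max scaling: 0 is the coldest sensor at that instant, 1 the hottest.
scaled = (data - data.min()) / (data.max() - data.min())
print(scaled)
```

Sensor `y0` reads 5 at the first timestamp and 15 at the second, yet it scales to 0.5 and then 0.0, because at the second timestamp it is the coldest of the three. Note that a timestamp where all sensors read the same value divides by zero and yields NaN.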

Plotting different lines for different states on the same chart

I am trying to create a distribution for the number of ___ across a few states.
I want to get all of the states on the same graph, represented by different lines.
Here is an example of what my data looks like: you have the state (which I want to filter lines by), the number of reviews (x axis), and the frequency of restaurants that have that many reviews (y axis)
State | num_of_reviews | Count_id
alaska 1 400
alaska 2 388
alaska 3 344
...
Wyoming 57 13
Whenever I try doing a simple line plot in seaborn or matplotlib, it just returns a messy graph.
Does anyone know a string of code where I easily can filter df['State']?
Assuming that you have 50+ states, I wouldn't plot the distribution for each on the same plot as it would get really messy and hard to read. Instead, I would suggest to use a FacetGrid (read more about it here).
Something like this should do.
import seaborn as sns
import matplotlib.pyplot as plt
g = sns.FacetGrid(df, col="State", col_wrap=5, height=1.5)
g = g.map(plt.hist, "num_of_reviews")
You can find other possible solutions and ideas on how to visualize your data here.
If none of these work for you, then it might help to explain your problem a bit better and provide a desired output and a minimal, complete, and verifiable example.
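If only a handful of states is needed on one chart, a different route from the FacetGrid above is to filter the rows and pivot them so each state becomes a column. This is a sketch on made-up counts; the column names follow the question's table.

```python
import pandas as pd

# Made-up counts in the question's long layout.
df = pd.DataFrame({
    'State': ['alaska', 'alaska', 'alaska', 'wyoming', 'wyoming', 'wyoming'],
    'num_of_reviews': [1, 2, 3, 1, 2, 3],
    'Count_id': [400, 388, 344, 90, 75, 60],
})

# Keep only the states of interest, then spread them into columns.
subset = df[df['State'].isin(['alaska', 'wyoming'])]
wide = subset.pivot(index='num_of_reviews', columns='State', values='Count_id')
print(wide)
```

Calling `wide.plot()` then draws one line per state against `num_of_reviews`. The messy graph in the question is typical of plotting the long table directly, where all states are joined into a single line.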

pandas plot line segments for each row

I have dataframes with columns containing x,y coordinates for multiple points. One row can consist of several points.
I'm trying to find out an easy way to be able to plot lines between each point generating a curve for each row of data.
Here is a simplified example where two lines are represented by two points each.
line1 = {'p1_x':1, 'p1_y':10, 'p2_x':2, 'p2_y':11 }
line2 = {'p1_x':2, 'p1_y':9, 'p2_x':3, 'p2_y':12 }
df = pd.DataFrame([line1,line2])
df.plot(y=['p1_y','p2_y'], x=['p1_x','p2_x'])
When I plot them, I expect line 1 to start where x=1 and line 2 to start where x=2.
Instead, the x axis contains two value pairs, (1,2) and (2,3), and both lines have the same start and end point on the x axis.
How do I get around this problem?
Edit:
If using matplotlib, the following hardcoded values generates the plot i'm interested in
plt.plot([[1,2],[2,3]],[[10,9],[11,12]])
While I'm sure that there should be a more succinct way using pure pandas, here's a simple approach using matplotlib and some dataframes derived from the original df. (I hope I understood the question correctly.)
Assumption: In df, you place x values in even columns and y values in odd columns
Obtain x values
x = df.loc[:, df.columns[::2]]
x
p1_x p2_x
0 1 2
1 2 3
Obtain y values
y = df.loc[:, df.columns[1::2]]
y
p1_y p2_y
0 10 11
1 9 12
Then plot using a for loop
for i in range(len(df)):
    plt.plot(x.iloc[i,:], y.iloc[i,:])
One does not need to create additional data frames. One can loop through the rows to plot these lines:
line1 = {'p1_x':1, 'p1_y':10, 'p2_x':2, 'p2_y':11 }
line2 = {'p1_x':2, 'p1_y':9, 'p2_x':3, 'p2_y':12 }
df = pd.DataFrame([line1,line2])
for i in range(len(df)): # for each row:
    # plt.plot([list of Xs], [list of Ys])
    plt.plot([df.iloc[i,0],df.iloc[i,2]],[df.iloc[i,1],df.iloc[i,3]])
plt.show()
The lines will be drawn in different colors. To get lines of same color, one can add option c='k' or whatever color one wants.
plt.plot([df.iloc[i,0],df.iloc[i,2]],[df.iloc[i,1],df.iloc[i,3]], c='k')
I generally don't use the pandas plotting because I think it is rather limited. If using matplotlib is not an issue, the following code works:
from matplotlib import pyplot as plt
# each column of the 2-D inputs is one line: row i runs from (p1_x, p1_y) to (p2_x, p2_y)
plt.plot(df[['p1_x', 'p2_x']].to_numpy().T, df[['p1_y', 'p2_y']].to_numpy().T)
plt.show()
Every row becomes its own line, so this also works when you have lots of lines to plot.
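When each row holds more than two points, the x and y columns can be picked by name instead of by position. The sketch below assumes, as in the question, that columns end in `_x` or `_y` and appear in matching order; the third point per row is made up for illustration.

```python
import pandas as pd

# The question's two rows, extended with a made-up third point each.
rows = [
    {'p1_x': 1, 'p1_y': 10, 'p2_x': 2, 'p2_y': 11, 'p3_x': 3, 'p3_y': 9},
    {'p1_x': 2, 'p1_y': 9, 'p2_x': 3, 'p2_y': 12, 'p3_x': 4, 'p3_y': 13},
]
df = pd.DataFrame(rows)

# Select columns by suffix, then transpose so that each column of the
# resulting arrays holds the points of one dataframe row.
xs = df.filter(regex='_x$').to_numpy().T
ys = df.filter(regex='_y$').to_numpy().T
print(xs.shape, ys.shape)
```

Passing these to `plt.plot(xs, ys)` draws one polyline per dataframe row, since matplotlib treats each column of 2-D input as a separate line.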

Heatmap with specific axis labels coloured

I am trying to plot a heatmap with 2 columns of data from a pandas dataframe. However, I would like to use a 3rd column to label the x axis, ideally by colour though another method such as an additional axis would be equally suitable. My dataframe is:
MUT SAMPLE VAR GROUP
True s1 1_1334442_T CC002
True s2 1_1334442_T CC006
True s1 1_1480354_GAC CC002
True s2 1_1480355_C CC006
True s2 1_1653038_C CC006
True s3 1_1730932_G CC002
...
Just to give a better idea of the data; there are 9 different types of 'GROUP', ~60,000 types of 'VAR' and 540 'SAMPLE's. I am not sure if this is the best way to build a heatmap in python but here is what I figured out so far:
pivot = pd.crosstab(df_all['VAR'],df_all['SAMPLE'])
sns.set(font_scale=0.4)
g = sns.clustermap(pivot, row_cluster=False, yticklabels=False, linewidths=0.1, cmap="YlGnBu", cbar=False)
plt.show()
I am not sure how to get 'GROUP' to display along the x-axis, either as an additional axis or just colouring the axis labels? Any help would be much appreciated.
I'm not sure if the 'MUT' column being a boolean variable is an issue here, df_all is 'TRUE' on every 'VAR' but as pivot is made, any samples which do not have a particular 'VAR' are filled as 0, others are filled with 1. My aim was to try and cluster samples with similar 'VAR' profiles. I hope this helps.
Please let me know if I can clarify anything further? Many thanks
Take a look at this example. You can give a list or a dataframe column to the clustermap function. By specifying either the col_colors argument or the row_colors argument, you can colour either the columns or the rows based on that list.
In the example below I use the iris dataset and make a pandas series object that specifies which colour the specific row should have. That pandas series is given as an argument for row_colors.
iris = sns.load_dataset("iris")
species = iris.pop("species")
lut = dict(zip(species.unique(), "rbg"))
row_colors = species.map(lut)
g = sns.clustermap(iris, row_colors=row_colors,row_cluster=False)
This code results in the following image.
You may need to tweak a bit further to also include a legend for the colouring for groups.
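Applied to the question's data, the colours go on the columns, because the samples end up as columns of the crosstab. The sketch below rebuilds the six rows shown in the question and assumes that every sample belongs to exactly one group; the colour names are arbitrary choices.

```python
import pandas as pd

# The six rows shown in the question.
df_all = pd.DataFrame({
    'MUT': [True] * 6,
    'SAMPLE': ['s1', 's2', 's1', 's2', 's2', 's3'],
    'VAR': ['1_1334442_T', '1_1334442_T', '1_1480354_GAC',
            '1_1480355_C', '1_1653038_C', '1_1730932_G'],
    'GROUP': ['CC002', 'CC006', 'CC002', 'CC006', 'CC006', 'CC002'],
})
pivot = pd.crosstab(df_all['VAR'], df_all['SAMPLE'])

# One group per sample (assumption: a sample never appears in two groups).
sample_group = df_all.drop_duplicates('SAMPLE').set_index('SAMPLE')['GROUP']

# Map each group to a colour, then order the colours like the heatmap columns.
lut = dict(zip(sample_group.unique(), ['tab:blue', 'tab:orange']))
col_colors = sample_group.reindex(pivot.columns).map(lut)
print(col_colors)
```

The resulting Series can be passed as `sns.clustermap(pivot, col_colors=col_colors, row_cluster=False, ...)`. With nine groups, `lut` needs nine colours.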

Producing boxplot figures dynamically depending on number of columns in data

I'm writing a Python function to produce boxplots of data using matplotlib. I require this function to dynamically determine the number of figures and subplots based on the number of columns in the data. To make the figures readable, I want the maximum number of subplots per figure to be 4 by 4. If there are more than 16 columns, I want the function to fill as many 4 by 4 figures as necessary and then partially fill a final figure with the remainder.
For example, the data I currently have has 43 columns. I therefore want this function to produce 2 full figures containing 16 subplots each and one partially filled figure containing the remaining subplots. My problem is that I am stuck on the logic of writing such a function.
My code:
import matplotlib.pyplot as plt
def boxplot_data(self,parameters_file):
    data = read_csv(parameters_file)
    header = data.keys()
    number_of_full_subplots = len(header)/16
    remainder = len(header)-(16*number_of_full_subplots)
    for i in range(0,number_of_full_subplots):
        plt.figure(i)
        for j in range(0,16):
            plt.subplot(4,4,j)
            boxplot(data[header[0:16]])
My plan was to iterate over the 'full subplots' first then iterate over the remainder however this method produces two identical figures.
Does anybody have any suggestions?
Thanks
the figures are identical because this line
boxplot(data[header[0:16]])
does not change when i and j are incremented.
Replace it with
boxplot(data[header[16*i+j]])
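The function for the full figures then becomes:
import matplotlib.pyplot as plt
from pandas import read_csv

def boxplot_data(self, parameters_file):
    data = read_csv(parameters_file)
    header = data.keys()
    # integer division: number of completely filled 4x4 figures
    number_of_full_subplots = len(header) // 16
    for i in range(number_of_full_subplots):
        plt.figure(i)
        for j in range(16):
            plt.subplot(4, 4, j + 1)  # subplot indices start at 1
            plt.boxplot(data[header[16*i + j]])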
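The partially filled last figure asked for in the question can be handled by slicing the column list in steps of 16, so the final slice is simply shorter. This is a sketch under assumptions: `boxplot_figures` is a hypothetical standalone function (not the method from the question), and random data with 43 columns stands in for the CSV file.

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove to display the figures
import matplotlib.pyplot as plt

def boxplot_figures(data, rows=4, cols=4):
    """Hypothetical helper: one boxplot per column, at most rows*cols per figure."""
    per_figure = rows * cols
    columns = list(data.columns)
    figures = []
    # Each slice holds up to per_figure columns; the last one may be shorter.
    for start in range(0, len(columns), per_figure):
        chunk = columns[start:start + per_figure]
        fig = plt.figure(figsize=(12, 12))
        for k, name in enumerate(chunk):
            ax = fig.add_subplot(rows, cols, k + 1)  # subplot indices start at 1
            ax.boxplot(data[name].dropna().to_numpy())
            ax.set_title(name)
        figures.append(fig)
    return figures

# Made-up data with 43 columns, as in the question's example.
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.normal(size=(20, 43)),
                    columns=[f'p{i}' for i in range(43)])
figures = boxplot_figures(data)
print([len(fig.axes) for fig in figures])
```

With 43 columns this gives two full figures and a third with the remaining 11 subplots, without a separate remainder calculation.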
