Do not sort variable in lineplot - python

I'm trying to make a plot of mean slope x elevation for a given area, but I'm a bit lost with the sorting of data in plotnine. The dataframe has 3 columns: Elevation (already ordered from low to high), Slope (unordered, must remain as is), and DEM (used for grouping). When plotting in seaborn, I can set the sort option and it works fine:
sns.lineplot(data=pd_areas, x="Slope", y="Elevation", hue="DEM", sort=False)
seaborn plot
but with plotnine, the values are sorted and the result is wrong:
import plotnine as p9
(p9.ggplot(pd_areas)
 + p9.geom_line(mapping=p9.aes(x='Slope', y='Elevation', color='DEM', group='DEM'))
)
plotnine plot
Sorry for not providing an MVE, but I can't send the DEMs at this moment.
thanks

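Use geom_path instead of geom_line. geom_line connects the points in order of the x variable, whereas geom_path connects them in the order they appear in the data.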

Related

Using Seaborn Catplot scatterplot creates a numerically unordered y-axis

Using this dataset, I tried to make a categorical scatterplot with two mutation types (WES + RNA-Seq & WES) shown on the x-axis and a small set of numerically ordered, evenly spaced numbers on the y-axis. Although I was able to get the x-axis the way I intended, the y-axis instead shows every single value in the Mutation Count column as its own tick. In addition, the axis follows the descending order of the dataset, so it isn't numerically ordered either. How can I fix these aspects of the graph?
The code and its output are shown below:
import seaborn as sns
g = sns.catplot(x="Sequencing Type", y="Mutation Count", hue="Sequencing Type", data=tets, height=16, aspect=0.8)
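No answer was recorded for this question. One common cause of these symptoms (every value getting its own tick, in dataset order) is that the column holds strings rather than numbers, so it is treated as categorical. The sketch below assumes that is the case here; the sample values are hypothetical.

```python
import pandas as pd

# Hypothetical sample; the real dataset is not shown in the question.
tets = pd.DataFrame({
    "Sequencing Type": ["WES", "WES + RNA-Seq", "WES"],
    "Mutation Count": ["120", "35", "7"],  # strings, not numbers
})

# A non-numeric dtype here (object or string) would point to this cause.
print(tets["Mutation Count"].dtype)

# Convert to numbers; values that cannot be parsed become NaN.
tets["Mutation Count"] = pd.to_numeric(tets["Mutation Count"], errors="coerce")
```

If the column does turn out to be text, the same sns.catplot call should draw a continuous, numerically ordered y-axis after the conversion.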

How to change scale of a plot with Matplotlib and add increments?

I am making a simple plot in Python with Matplotlib that shows populations of different regions over time. I have a CSV file that has columns of each region's population over the years, so the years are on the x-axis and population is on the y-axis. The plot looks okay except for the y-axis. As you can see in the image, every single population value is included on the y-axis, which is too many values and is unnecessary. I would like the y-axis to have some increments (such as 100 million). Is there a simple way to do that or would I have to manually add my own increments?
I also tried scaling it linearly and logarithmically, but I would still prefer to have increments on the y-axis.
This is what the plot looks like right now.
(I took out unnecessary code such as legend and formatting):
import pandas as pd
import matplotlib.pyplot as plt
data2 = pd.read_csv('data02_world.csv')
for region in data2:
    if region != 'Year':
        plt.plot(data2.Year, data2[region], marker='.', label=region)
plt.xlabel('Year')
plt.ylabel('Population')
plt.show()
I think you can simply do this with pandas:
data2 = pd.read_csv('data02_world.csv')
data2.set_index('Year', inplace=True)
data2.plot()
If you would rather stay with matplotlib, plt.yticks is what you need.
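A minimal sketch of the plt.yticks route, assuming made-up population figures for a single hypothetical region (the real CSV is not shown) and a 100 million step:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical population figures for one region.
years = np.arange(1960, 2021, 10)
population = np.linspace(2.0e8, 7.5e8, len(years))

fig, ax = plt.subplots()
ax.plot(years, population, marker=".", label="Region A")

# Place a tick every 100 million, from 0 up to just past the maximum.
step = 100_000_000
ticks = np.arange(0, population.max() + step, step)
plt.yticks(ticks, [f"{t / 1e6:.0f}M" for t in ticks])

ax.set_xlabel("Year")
ax.set_ylabel("Population")
```

The second argument to plt.yticks is optional; it is used here only to label the ticks in millions.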

Center nested boxplots in Python/Seaborn with unequal classes

I have grouped data from which I want to generate boxplots using seaborn. However, not every group has all classes. As a result, the boxplots are not centered if classes are missing within one group:
Figure
The graph is generated using the following code:
sns.boxplot(x="label2", y="value", hue="variable", palette="Blues")
Is there any way to force seaborn to center these boxes? I didn't find an appropriate way.
Thank you in advance.
Yes there is but you are not going to like it.
Centering these will mean that you will have the same y value for median values, so normalize your data so that the median is 0.5 for each y value for each value of x. That will give you the plot you want, but you should note that somewhere in the plot so people will not be confused.

Discrepancy between Seaborn plotted mean and calculated mean. (Python/Pandas)

Hello there wonderful people of StackOverflow!
I have been getting to grips with Python and was starting to feel pretty confident that I knew what I was doing until this doozy came up:
I am plotting and comparing two subselections of a dataframe where "Type" = "area" and "". Seaborn plots a boxplot of these and marks the mean, but when I calculate the mean using .mean() it gives a different answer. Here's the code:
plotdata = df[df['Type'].isin(['A','B'])]
g = sns.violinplot(x="Type", y="value", data=plotdata, inner="quartile")
plt.ylim(ymin=-4, ymax=4) # This is to zoom on the plot to make the 0 line clearer
This is the resulting plot, note how the means are ~-0.1 and ~1.5
But when I calculate them with:
print(df_long[df_long['charttype'].isin(['area'])]['error'].mean())
print(df_long[df_long['charttype'].isin(['angle'])]['error'].mean())
It returns:
0.014542483333332705
-2.024809368191722
So my question is, why don't these numbers match?
Total misunderstanding of basic statistics was the problem!
Box plots (which are inside the seaborn violin plots) plot the interquartile range and the MEDIAN, whereas I was later calculating the MEAN.
Just needed to sleep on it and hey presto all becomes clear.
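A small illustration of how far the two statistics can drift apart, using a hypothetical skewed sample rather than the data from the question:

```python
import pandas as pd

# Hypothetical skewed sample: one large value pulls the mean upward.
errors = pd.Series([1, 2, 3, 4, 100])

print(errors.mean())    # 22.0
print(errors.median())  # 3.0
```

To compare against the central line drawn inside the violin, compute .median() on each subselection instead of .mean().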

How to connect boxplot median values

It seems like plotting a line connecting the mean values of box plots would be a simple thing to do, but I couldn't figure out how to do this plot in pandas.
I'm using this syntax for the boxplot so that it automatically generates the box plot for Y vs. X device without having to do external manipulation of the data frame:
df.boxplot(column='Y_Data', by="Category", showfliers=True, showmeans=True)
One way I thought of doing this is to draw a line plot from the mean values of the boxplot, but I'm not sure how to extract that information from the plot.
You can save the axis object that gets returned from df.boxplot(), and plot the means as a line plot using that same axis. I'd suggest using Seaborn's pointplot for the lines, as it handles a categorical x-axis nicely.
First let's generate some sample data:
import pandas as pd
import numpy as np
import seaborn as sns
N = 150
values = np.random.random(size=N)
groups = np.random.choice(['A','B','C'], size=N)
df = pd.DataFrame({'value':values, 'group':groups})
print(df.head())
group value
0 A 0.816847
1 A 0.468465
2 C 0.871975
3 B 0.933708
4 A 0.480170
...
Next, make the boxplot and save the axis object:
ax = df.boxplot(column='value', by='group', showfliers=True,
                positions=range(df.group.unique().shape[0]))
Note: There's a curious positions argument in Pyplot/Pandas boxplot(), which can cause off-by-one errors. See more in this discussion, including the workaround I've employed here.
Finally, use groupby to get category means, and then connect mean values with a line plot overlaid on top of the boxplot:
sns.pointplot(x='group', y='value', data=df.groupby('group', as_index=False).mean(), ax=ax)
Your title mentions "median" but you talk about category means in your post. I used means here; change the groupby aggregation to median() if you want to plot medians instead.
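For the median variant, here is a sketch that avoids seaborn and draws the line with matplotlib directly. It assumes the boxes sit at pandas' default positions 1..n (no positions argument is passed) and uses randomly generated stand-in data:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Randomly generated stand-in data, as in the sample above.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "value": rng.random(150),
    "group": rng.choice(["A", "B", "C"], size=150),
})

fig, ax = plt.subplots()
df.boxplot(column="value", by="group", ax=ax)

# groupby sorts the group labels, matching the left-to-right box order.
medians = df.groupby("group")["value"].median()

# Default box positions are 1..n, so plot the medians at those x values.
ax.plot(range(1, len(medians) + 1), medians.to_numpy(), marker="o", color="C1")
```

Swapping median() for mean() gives the mean line instead.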
You can get the value of the medians by using the .get_data() method of the matplotlib.lines.Line2D objects that draw them, without having to use seaborn.
Let bp be your boxplot created as bp=plt.boxplot(data). Then, bp is a dict containing the medians key, among others. That key contains a list of matplotlib.lines.Line2D, from which you can extract the (x,y) position as follows:
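import numpy as np
import matplotlib.pyplot as plt
bp = plt.boxplot(data)
X = []
Y = []
for m in bp['medians']:
    # Each median is a short horizontal segment; take its midpoint.
    [[x0, x1], [y0, y1]] = m.get_data()
    X.append(np.mean((x0, x1)))
    Y.append(np.mean((y0, y1)))
plt.plot(X, Y, c='C1')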
For an arbitrary dataset (data), this script generates this figure. Hope it helps!
