Discrepancy between Seaborn plotted mean and calculated mean. (Python/Pandas) - python

Hello there wonderful people of StackOverflow!
I have been getting to grips with Python and was starting to feel pretty confident that I knew what I was doing until this doozy came up:
I am plotting and comparing two subselections of a dataframe where "Type" = "area" and "". Seaborn plots a boxplot of these and marks the mean, but when I calculate the mean using .mean() it gives a different answer. Here's the code:
plotdata = df[df['Type'].isin(['A','B'])]
g = sns.violinplot(x="Type", y="value", data=plotdata, inner="quartile")
plt.ylim(ymin=-4, ymax=4) # This is to zoom on the plot to make the 0 line clearer
This is the resulting plot, note how the means are ~-0.1 and ~1.5
But when I calculate them with:
print(df_long[df_long['charttype'].isin(['area'])]['error'].mean())
print(df_long[df_long['charttype'].isin(['angle'])]['error'].mean())
It returns:
0.014542483333332705
-2.024809368191722
So my question is, why don't these numbers match?

Total misunderstanding of basic statistics was the problem!
Box plots (which are inside the seaborn violin plots) plot the interquartile range and the MEDIAN, whereas I was later calculating the MEAN.
Just needed to sleep on it and hey presto all becomes clear.
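To see the difference in numbers, here is a minimal sketch on made-up data (the `df_demo` frame and its values are assumptions for illustration, not the original dataset). A small number of large negative outliers pulls the mean well below the median, which is the kind of gap described in the question.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical skewed sample: most values sit near 1.5,
# a small share are large negative outliers.
values = np.concatenate([
    rng.normal(1.5, 1.0, 950),
    rng.normal(-70.0, 5.0, 50),
])
df_demo = pd.DataFrame({"charttype": "angle", "error": values})

errors = df_demo.loc[df_demo["charttype"] == "angle", "error"]

mean_error = errors.mean()      # what .mean() reports
median_error = errors.median()  # what the middle quartile line in the violin marks
q1, q3 = errors.quantile([0.25, 0.75])  # the other two quartile lines

print("mean:  ", mean_error)
print("median:", median_error)
print("IQR:   ", q1, "to", q3)
```

Comparing `.median()` (rather than `.mean()`) against the plot is the like-for-like check.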

Related

Cannot see the points in my scatter plot despite having R value

I am facing this exact same issue. I am using a CSV file with the rows containing missing values dropped to make my scatter plot. I am also using matplotlib, yet I am getting no points in the output despite having the R value.
mc_corr=cars2_1["City_Mileage_km_litre"].corr(cars2_1['Fuel_Tank_Capacity_litre'])
plt.scatter(cars2_1["City_Mileage_km_litre"],cars2_1['Fuel_Tank_Capacity_litre'],color='orange')
plt.title('Mileage vs Fuel Tank Capacity')
plt.xlim(5,35)
plt.ylim(3.5,10.0)
plt.xlabel("R = "+str(mc_corr))
plt.show()
The data set:
Your ylim range is too narrow for any of the values to fall inside it. Change:
plt.ylim(3.5,10.0)
to
plt.ylim(24,88)
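Rather than hard-coding the limits, one option is to derive them from the data. The sketch below uses made-up stand-ins for the two columns (the `mileage` and `tank` arrays and their ranges are assumptions based on the limits mentioned above, not the real CSV) and counts how many points fall inside the y-limits before and after the change.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stand-ins for the two columns in the question.
mileage = rng.uniform(5, 35, 200)
tank = rng.uniform(24, 88, 200)

fig, ax = plt.subplots()
ax.scatter(mileage, tank, color="orange")
ax.set_xlim(5, 35)

# With the original limits, no point lies inside the visible y-range.
old_limits = (3.5, 10.0)
visible_before = int(np.sum((tank >= old_limits[0]) & (tank <= old_limits[1])))

# Derive the limits from the data, with a small margin on each side.
pad = 0.05 * (tank.max() - tank.min())
new_limits = (tank.min() - pad, tank.max() + pad)
ax.set_ylim(*new_limits)
visible_after = int(np.sum((tank >= new_limits[0]) & (tank <= new_limits[1])))

print("points visible with old limits:", visible_before)
print("points visible with new limits:", visible_after)
```

Leaving out `plt.ylim` entirely also works, since matplotlib then picks limits that cover the data automatically.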

Python: Histogram return wrong values for counts (EDIT: more general with example)

EDIT: I've found a general example where it doesn't work either!
I am trying to extract the data for a histogram, but some of the counts seem wrong. As an example:
import matplotlib.pyplot as plt
import numpy as np
data = np.random.rand(1000000)
bins = np.arange(0,1,0.0001)
a,b,c = plt.hist(data,bins)
This gives me this rather messy histogram, and I've saved the counts as a and the bin edges as b. Now, plotting a against b, I should expect the same histogram, right? But that's not what I get:
plt.scatter(b[0:len(b)-1],a,s=2)
which gives me this, which doesn't match at all! Furthermore, when I try to find the maximum value of a, it gives me 144, which fits fine with the scatter plot, but not with the displayed histogram.
If I count the numbers myself with the following code:
len(np.intersect1d(np.where(data>=b[np.argmax(a)]),np.where(data<b[np.argmax(a)+1])))
then it also gives me 144, in accordance with the values. So is the displayed histogram just wrong for some reason, and I should ignore it and just take the extracted data?
Old, unedited post:
For a physics course I am trying to bin my results in the following way:
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as ss
from scipy.optimize import curve_fit
plt.rc("font", family=["Helvetica", "Arial"])
plt.rc("axes", labelsize=18)
plt.rc("xtick", labelsize=16, top=True, direction="in")
plt.rc("ytick", labelsize=16, right=True, direction="in")
plt.rc("axes", titlesize=22)
plt.rc("legend", fontsize=16)
data_Ra = np.loadtxt('Ra226_cal2_ch001.txt',skiprows=5)
t_Ra = data_Ra[:,0]*10**-8 # time in seconds
channels_Ra = data_Ra[:,1]
channels_Ra = channels_Ra[np.where(channels_Ra>0)] # removing all the measurements at channel = 0
intervalspace = 2 #The intervals in which we count
bins=np.arange(0,4000,intervalspace)
counts, intervals , stuff = plt.hist(channels_Ra,bins)
plt.xlabel('Channels')
plt.ylabel('Counts')
plt.show()
Here, the histogram plot looks totally fine, with a max near 13000 counts. But when I then use np.max(counts), I am given about 24000, and when I try and just plot the values it gives me with:
plt.scatter(intervals[0:len(intervals)-1]+intervalspace/2,counts,s=1)
plt.xlabel('Channels')
plt.ylabel('Counts')
plt.title('Ra226')
plt.show()
it looks like this, which is totally different, and I can't figure out why. I am expecting the scatter plot to resemble the histogram, and while the peaks are located at the same x-values, the heights do not match.
This problem is in other large datasets as well.
I don't think I'm allowed to drop the txt file here? So I'm not sure how much more I can show, but any help will be appreciated!
I don't know why you interpret the results in that way.
If you look at the histogram plot, you will be able to see the maximum value of the y-axis is 25,000. That means that there are some values close to 25,000. This fact can be verified in the scatter plot.
Your scatter plot shows the actual values. It would be clearer if you described what you expect the plot to look like.
If you want to discard some outlier points, you should apply some filtering before plotting the data.
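The returned counts can be checked independently of any drawing. `plt.hist` computes its counts with `np.histogram`, so the sketch below calls `np.histogram` directly on the general example from the question and recounts the tallest bin by hand. One plausible reason the drawn bars look lower (offered as an assumption, not something confirmed in this thread) is that with roughly ten thousand bins each bar is far narrower than a screen pixel, so the picture cannot show every bar at its true height.

```python
import numpy as np

rng = np.random.default_rng(2)
data = rng.random(1_000_000)
bins = np.arange(0, 1, 0.0001)

# Same computation plt.hist performs, without drawing anything.
counts, edges = np.histogram(data, bins)

# There is always one more bin edge than there are bins.
print(len(counts), len(edges))

# Recount the tallest bin by hand: left edge inclusive, right edge exclusive.
peak = int(np.argmax(counts))
manual = int(np.sum((data >= edges[peak]) & (data < edges[peak + 1])))

print("tallest bin count from np.histogram:", int(counts[peak]))
print("tallest bin count by hand:          ", manual)
```

If the two numbers agree, the extracted data can be trusted. Drawing it with `plt.step` or with fewer, wider bins usually gives a picture that is easier to read.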

Using Seaborn Catplot scatterplot creates a numerically unordered y-axis

Using this dataset, I tried to make a categorical scatterplot with two mutation types (WES + RNA-Seq & WES) being shown on the x-axis and a small set of numerically ordered numbers spaced apart by a scale on the y-axis. Although I was able to get the x-axis the way I intended it to be, the y-axis instead used every single value in the Mutation Count column as a number on the y-axis. In addition to that, the axis is ordered from the descending order on the dataset, meaning it isn't numerically ordered either. How can I go about fixing these aspects of the graph?
The code and its output are shown below:
import seaborn as sns
g = sns.catplot(x="Sequencing Type", y="Mutation Count", hue="Sequencing Type", data=tets, height=16, aspect=0.8)
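A y-axis that shows every single value as its own tick, in dataset order, is the typical symptom of the column being stored as text rather than numbers. Whether that is the case here is an assumption, since the data is not shown. The sketch below uses a made-up miniature of the frame (the values are hypothetical) and converts the column with `pd.to_numeric` before plotting.

```python
import pandas as pd

# Hypothetical miniature of the dataset; the real values are not shown in the question.
tets = pd.DataFrame({
    "Sequencing Type": ["WES", "WES + RNA-Seq", "WES", "WES + RNA-Seq"],
    "Mutation Count": ["120", "35", "8", "NA"],
})

# A text dtype here means the plot would treat each value as a category.
print(tets["Mutation Count"].dtype)

# Convert to numbers; entries that cannot be parsed become NaN and are dropped.
tets["Mutation Count"] = pd.to_numeric(tets["Mutation Count"], errors="coerce")
tets = tets.dropna(subset=["Mutation Count"])

print(tets["Mutation Count"].dtype)
```

After the conversion, the same `sns.catplot` call should draw a continuous, numerically ordered y-axis.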

Do not sort variable in lineplot

I'm trying to make a plot of mean slope x elevation for a given area, but I'm a bit lost with the sorting of data in plotnine. The dataframe has 3 columns: Elevation (already ordered from low to high), Slope (unordered, must remain as is), DEM (used for grouping). When plotting in seaborn, I can set the sort option and it works fine:
sns.lineplot(data=pd_areas, x="Slope", y="Elevation", hue="DEM", sort=False)
seaborn plot
but with plotnine, the values are sorted and the result is wrong:
(p9.ggplot(pd_areas)
+p9.geom_line(mapping=p9.aes(x='Slope', y='Elevation', color='DEM', group='DEM'))
)
plotnine plot
Sorry for not providing a minimal working example, but I can't share the DEMs at the moment.
Thanks.
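Use geom_path instead of geom_line. geom_line connects the observations in order of the x variable, whereas geom_path connects them in the order in which they appear in the data.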

Center nested boxplots in Python/Seaborn with unequal classes

I have grouped data from which I want to generate boxplots using seaborn. However, not every group has all classes. As a result, the boxplots are not centered if classes are missing within a group:
Figure
The graph is generated using the following code:
sns.boxplot(x="label2", y="value", hue="variable",palette="Blues")
Is there any way to force seaborn to center these boxes? I didn't find any appropriate way.
Thank you in advance.
Yes, there is, but you are not going to like it.
Centering these will mean that you will have the same y value for median values, so normalize your data so that the median is 0.5 for each y value for each value of x. That will give you the plot you want, but you should note that somewhere in the plot so people will not be confused.
