I would like to plot boxplots of DataFrames (see the sample code below). What I'm wondering is: how can I disable outlier detection? I don't want to remove the outliers; I just want a plot that visualizes the data by marking 0%, 25%, 50% and 75% of the data points, without applying any outlier criteria.
How do I have to modify my code to achieve this? Can I change the outlier detection criteria so that detection behaves as if it were disabled?
I would be very grateful for any help, and if there is already another thread about this (which I didn't find), I would be happy to get a link to it.
Many thanks!
Jordin
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
np.random.seed(1234)
df = pd.DataFrame(np.random.randn(10, 4),
columns=['Col1', 'Col2', 'Col3', 'Col4'])
plt.figure()
plt.boxplot(df.values)
plt.show()
EDIT:
I would like the whiskers to include this outlier, rather than merely hiding it.
You're looking for the whis parameter.
From the documentation:
whis : float, sequence, or string (default = 1.5)
As a float, determines the reach of the
whiskers beyond the first and third quartiles. In other words,
where IQR is the interquartile range (Q3-Q1), the upper whisker will
extend to the last datum less than Q3 + whis*IQR. Similarly, the lower
whisker will extend to the first datum greater than Q1 - whis*IQR.
Beyond the whiskers, data are considered outliers and are plotted as
individual points. Set this to an unreasonably high value to force the
whiskers to show the min and max values. Alternatively, set this to an
ascending sequence of percentiles (e.g., [5, 95]) to set the whiskers
at specific percentiles of the data. Finally, whis can be the string
'range' to force the whiskers to the min and max of the data.
Add it like so:
df.boxplot(whis=99)
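As a rough sketch of the "whiskers at min and max" idea from the quoted documentation: passing the percentile pair (0, 100) as whis should put the whiskers at the extremes, so that no point is flagged as a flier. matplotlib.cbook.boxplot_stats is used here only to inspect the numbers the plot is drawn from; the variable names are illustrative.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import cbook

np.random.seed(1234)
df = pd.DataFrame(np.random.randn(10, 4),
                  columns=['Col1', 'Col2', 'Col3', 'Col4'])

# Whiskers at the 0th and 100th percentiles, i.e. at min and max.
stats = cbook.boxplot_stats(df.values, whis=(0, 100))
for name, s in zip(df.columns, stats):
    print(name, s['whislo'], s['q1'], s['med'], s['q3'], s['whishi'],
          'fliers:', len(s['fliers']))

# The same setting passed to the plotting call itself.
fig, ax = plt.subplots()
ax.boxplot(df.values, whis=(0, 100))
```

Call plt.show() afterwards to display the figure. As far as I know, newer Matplotlib releases no longer accept the string 'range', so the percentile pair is probably the safer spelling.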
If you add sym='' inside your plot function I think you will get what you ask for:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
np.random.seed(1234)
df = pd.DataFrame(np.random.randn(10, 4),
columns=['Col1', 'Col2', 'Col3', 'Col4'])
df.boxplot(sym='')
Related
Here is my problem
This is a sample of my two DataFrames (I have 30 columns in reality)
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
df = pd.DataFrame({"Marc":[6,0,8,-30,-15,0,-3],
"Elisa":[0,1,0,-1,0,-2,-4],
"John":[10,12,24,-20,7,-10,-30]})
df1 = pd.DataFrame({"Marc":[8,2,15,-12,-8,0,-35],
"Elisa":[4,5,7,0,0,1,-2],
"John":[20,32,44,-30,15,-10,-50]})
I would like to create a scatter plot with two different colors:
one color where the scores in df1 are negative and another where they are positive, but I don't really know how to do it.
I have already plotted the data with matplotlib:
plt.scatter(df,df1);
I also checked this link, but the problem is that I have two pandas DataFrames
and not NumPy arrays as in that link. Hence I can't use the c=np.sign(df.y) method.
I would like to keep the pandas DataFrames because I have many columns, but I'm really stuck on this.
If anyone has a solution, you are welcome!
You can pass a color array in, but it seems to work only with a 1D array:
import numpy as np
# one color where df1 is negative, another everywhere else
colors = np.where(df1 < 0, 'C0', 'C1')
# stack and ravel to turn everything into 1D
plt.scatter(df.stack(), df1.stack(), c=colors.ravel())
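A variation on the answer above, as a sketch only: flattening both frames with to_numpy().ravel() keeps the points in the same order, and a boolean mask lets each sign get its own scatter call and legend entry. The names x, y and negative are made up for this example, and treating zero as non-negative is an assumption.

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({"Marc": [6, 0, 8, -30, -15, 0, -3],
                   "Elisa": [0, 1, 0, -1, 0, -2, -4],
                   "John": [10, 12, 24, -20, 7, -10, -30]})
df1 = pd.DataFrame({"Marc": [8, 2, 15, -12, -8, 0, -35],
                    "Elisa": [4, 5, 7, 0, 0, 1, -2],
                    "John": [20, 32, 44, -30, 15, -10, -50]})

# Flatten both frames in the same (row-major) order so points stay paired.
x = df.to_numpy().ravel()
y = df1.to_numpy().ravel()
negative = y < 0  # True where the df1 score is negative

fig, ax = plt.subplots()
ax.scatter(x[negative], y[negative], color='C0', label='df1 < 0')
ax.scatter(x[~negative], y[~negative], color='C1', label='df1 >= 0')
ax.legend()
```

This works for any number of columns as long as both frames have the same shape and column order.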
I am creating a barplot by using groupby showing the success rate of an individual for Calendar year 2012. This works well. X axis= S_L's and Y axis is the success rate%. I have a column in my dataset for the success (1 or 0).
ax=df[df['CY']==2012].groupby('S_L').success.mean().sort_values(ascending=False).plot(kind='bar',stacked=False)
Instead of showing the value of each bar, I want to show the calculation behind the mean, i.e. the total count for each group and the count where success (which is a flag) = 1, i.e. the numerator. For example: if a bar shows 90%, calculated as 9 successes (numerator) out of 10 (overall count for the given S_L group), I want to show n=9 and n=10 for that bar.
I looked at the post "Add labels to barplots", and it works when I display the values of the bars.
However, I don't know how to add the values behind the calculation, especially since I am also sorting the bars in descending order. Please help.
My code:
import pandas as pd
from os import path
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
fname=path.expanduser(r'Test file.xlsx')
df=pd.read_excel(io=fname,sheet_name='Sheet1')
ax=df.groupby('S_L').success.mean().sort_values(ascending=False).plot(kind='bar',stacked=False)
vals = ax.get_yticks()
ax.set_ylabel('Success Rate')
ax.set_yticklabels(['{:,.2%}'.format(x) for x in vals])
[Dataset image not included]
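One possible way to get both numbers onto the bars, as a sketch: compute the mean, the sum (the successes) and the count per group in a single agg call, sort that table once, and annotate each bar from the same sorted table so labels and bars stay aligned. The small DataFrame below is made-up stand-in data, since the Excel file is not available; only the column names S_L and success come from the question.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-in for the Excel data.
df = pd.DataFrame({
    'S_L': ['A', 'A', 'A', 'A', 'B', 'B', 'C', 'C', 'C', 'C'],
    'success': [1, 1, 1, 0, 1, 0, 1, 1, 1, 1],
})

# Rate, numerator and denominator per group, sorted like the bars.
summary = (df.groupby('S_L')['success']
             .agg(rate='mean', successes='sum', total='count')
             .sort_values('rate', ascending=False))

ax = summary['rate'].plot(kind='bar')
ax.set_ylabel('Success Rate')

# Bars are drawn in the row order of `summary`, so zip keeps them aligned.
for bar, (_, row) in zip(ax.patches, summary.iterrows()):
    ax.annotate(f"n={int(row['successes'])} / n={int(row['total'])}",
                (bar.get_x() + bar.get_width() / 2, bar.get_height()),
                ha='center', va='bottom')
```

Because the labels are read from the already sorted table, the descending sort does not need any special handling.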
I'm looking to make a stacked area plot over time, based on summary data created by groupby and sum.
The groupby and sum part correctly groups and sums the data I want, but it seems the resultant format is nonsense in terms of plotting it.
I'm not sure where to go from here:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df=pd.DataFrame({'invoice':[1,2,3,4,5,6],'year':[2016,2016,2017,2017,2017,2017],'part':['widget','wonka','widget','wonka','wonka','wonka'],'dollars':[10,20,30,10,10,10]})
#drop the invoice number from the data since we don't need it
df=df[['dollars','part','year']]
#group by year and part, and add them up
df=df.groupby(['year','part']).sum()
#plotting this is nonsense:
df.plot.area()
plt.show()
To chart multiple series, it's easiest to have each series organized as a separate column, i.e. replace
df=df.groupby(['year','part']).sum()
with
df=df.groupby(['year', 'part']).sum().unstack(-1)
Then the rest of the code should work. But, I'm not sure if this is what you need because the desired output is not shown.
df.plot.area() then produces a stacked area chart with one band per part.
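To illustrate the reshaping step, here is a small sketch of what unstack does to the grouped result. Selecting the dollars column before summing (so that the columns are just the part names), fill_value=0 and the name wide are my additions, not part of the answer above.

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({'invoice': [1, 2, 3, 4, 5, 6],
                   'year': [2016, 2016, 2017, 2017, 2017, 2017],
                   'part': ['widget', 'wonka', 'widget', 'wonka', 'wonka', 'wonka'],
                   'dollars': [10, 20, 30, 10, 10, 10]})

# One row per year, one column per part.
wide = df.groupby(['year', 'part'])['dollars'].sum().unstack(fill_value=0)
print(wide)

ax = wide.plot.area()
ax.set_xticks(list(wide.index))  # show whole years only
```

Each column of wide becomes one band of the area chart, and the index (year) becomes the x axis.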
In a classifieds website I maintain, I'm comparing classifieds that receive greater-than-median views vs classifieds that are below median in this criterion. I call the former "high performance" classifieds. Here's a simple countplot showing this:
The hue is simply the number of photos the classified had.
My question is - is there a plot type in seaborn or matplotlib which shows proportions instead of absolute counts?
I essentially want the same countplot, but with each bar as a % of the total items in that particular category. For example, notice that in the countplot, classifieds with 3 photos make up a much larger proportion of the high perf category. It takes a while to glean that information. If each bar's height was instead represented by its % contribution to its category, it'd be a much easier comparison. That's why I'm looking for what I'm looking for.
An illustrative example would be great.
Instead of trying to find a special-case plotting function that does exactly what you want, I would suggest keeping data generation and visualization separate. In the end, what you want is to plot a bar graph of some values, so the idea is to generate the data in such a way that they can easily be plotted.
To this end, you may crosstab the two columns in question and divide each row (or column) in the resulting table by its sum. This table can then easily be plotted using the pandas plotting wrapper.
import matplotlib.pyplot as plt
import numpy as np; np.random.seed(42)
import pandas as pd
plt.rcParams["figure.figsize"] = 5.6, 7.0
n = 100
df = pd.DataFrame({"performance": np.random.choice([0,1], size=n, p=[0.7,0.3]),
"photo" : np.random.choice(range(4), size=n, p=[0.6,0.1,0.2,0.1]),
"someothervalue" : np.random.randn(n) })
fig, (ax,ax2, ax3) = plt.subplots(nrows=3)
freq = pd.crosstab(df["performance"],df["photo"])
freq.plot(kind="bar", ax=ax)
relative = freq.div(freq.sum(axis=1), axis=0)
relative.plot(kind="bar", ax=ax2)
relative = freq.div(freq.sum(axis=0), axis=1)
relative.plot(kind="bar", ax=ax3)
ax.set_title("countplot of absolute frequency")
ax2.set_title("barplot of relative frequency by performance")
ax3.set_title("barplot of relative frequency by photo")
for a in [ax, ax2, ax3]: a.legend(title="Photo", loc=6, bbox_to_anchor=(1.02,0.5))
plt.subplots_adjust(right=0.8,hspace=0.6)
plt.show()
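As a side note, pd.crosstab can do the row-wise division itself through its normalize argument. The sketch below uses a tiny made-up table (not the data from the answer) and compares normalize='index' with the manual freq.div(...) approach.

```python
import pandas as pd

# Made-up example data.
df = pd.DataFrame({"performance": [0, 0, 0, 0, 1, 1, 1, 1],
                   "photo":       [0, 0, 1, 3, 3, 3, 2, 0]})

# Manual route: absolute counts, then divide each row by its total.
freq = pd.crosstab(df["performance"], df["photo"])
manual = freq.div(freq.sum(axis=1), axis=0)

# normalize='index' divides each row by its row total in one step.
relative = pd.crosstab(df["performance"], df["photo"], normalize='index')
print(relative)
```

relative.plot(kind="bar") then gives the proportions per performance category; normalize='columns' would correspond to dividing by the column sums instead.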
Let's say I have a DataFrame that looks (simplified) like this
>>> df
   freq
2     2
3    16
1    25
where the index column represents a value, and the freq column represents the frequency of occurrence of that value, as in a frequency table.
I'd like to plot a density plot for this table like the one obtained from plot kind kde. However, this kind is apparently only meant for pd.Series. My df is too large to flatten out to a 1D Series, i.e. df = [2, 2, 3, 3, 3, ..., 1, 1].
How can I plot such a density plot under these circumstances?
I know you have asked for the case where df is too large to flatten out, but the following answer works where this isn't the case:
pd.Series(df.index.repeat(df.freq)).plot.kde()
Or more generally, when the values are in a column called val and not the index:
df.val.repeat(df.freq).plot.kde()
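If repeating the values really is too expensive, one alternative is to weight each distinct value by its frequency instead. The following is only a minimal sketch of a weighted Gaussian kernel density estimate in plain NumPy; the function name weighted_kde and the fixed bandwidth of 0.4 are assumptions made for illustration, not a pandas or Matplotlib API.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({"freq": [2, 16, 25]}, index=[2, 3, 1])

def weighted_kde(values, weights, grid, bandwidth):
    """Gaussian KDE where each value counts in proportion to its weight."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()  # frequencies -> probabilities
    z = (grid[:, None] - values[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2 * np.pi))
    return kernels @ weights  # weighted sum of one kernel per distinct value

grid = np.linspace(-2, 6, 801)
density = weighted_kde(df.index, df["freq"], grid, bandwidth=0.4)

fig, ax = plt.subplots()
ax.plot(grid, density)
```

The cost scales with the number of distinct values rather than the total count. The bandwidth has to be chosen by hand here; I believe recent SciPy versions of scipy.stats.gaussian_kde also accept a weights argument, which would handle that choice automatically.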
You can plot a density distribution using a bar plot if you normalize the y values by the size of the population. With bars of width 1, this makes the total area covered by the bars equal to 1.
import matplotlib.pyplot as plt
plt.bar(
    df.index,
    df.freq / df.freq.sum(),
    width=-1,
    align='edge'
)
The width and align parameters are to make sure each bar covers the interval (k-1, k].
Somebody with better knowledge of statistics should answer whether kernel density estimation actually makes sense for discrete distributions.
Maybe this will work:
import matplotlib.pyplot as plt
plt.plot(df.index, df['freq'])
plt.show()
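Seaborn is built on top of Matplotlib for this kind of plot and can calculate kernel density estimates automatically if you want.
import numpy as np
import pandas as pd
import seaborn as sns
x = pd.Series(np.random.randint(0, 20, size=10000), name='freq')
# distplot is deprecated in newer seaborn releases; histplot with kde=True is its replacement
sns.histplot(x, kde=True)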