Boxplot and data outliers

I have data in dictionary form that I convert to a pandas DataFrame, and I am attempting to box plot only the data that falls outside the range of 68 to 72. Ultimately I am trying to rotate the titles of the box plot (the column headers) 90 degrees and also exclude outlier data if possible. In my real-world scenario it's impossible to read the column headers, and it's also not necessary to show the box plot if only a few outliers are outside the range 68 to 72. Any tips are greatly appreciated.
I'll make up some code that mimics my real-world application.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame(dict(a=[71.5,72.8,79.3],b=[70.2,73.3,74.9],c=[63.1,64.9,65.9],d=[70.1,70.9,70.9]))
Flag too hot:
TooHot = df.apply(lambda x: not (x > 72).any())
print('These zones are too warm')
df[TooHot[~TooHot].index].boxplot()
plt.show()
Flag too cool:
TooCool = df.apply(lambda x: not (x < 68).any())
print('These zones are too cool')
df[TooCool[~TooCool].index].boxplot()
plt.show()

Passing the keyword argument showfliers=False to .boxplot() will hide the outliers on the plot.
Using vert=False will make the boxplots horizontal (which I think is what you are asking for?).
The matplotlib boxplot documentation is a good place to start: https://matplotlib.org/api/_as_gen/matplotlib.pyplot.boxplot.html
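As a rough sketch of how these options could fit together with the question's sample data: the snippet below selects the too-warm and too-cool columns with a boolean mask, hides the fliers, and rotates the column labels with the rot argument of DataFrame.boxplot. The variable names too_hot and too_cool and the side-by-side layout are my own choices, not something from the question.

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame(dict(a=[71.5, 72.8, 79.3], b=[70.2, 73.3, 74.9],
                       c=[63.1, 64.9, 65.9], d=[70.1, 70.9, 70.9]))

# Keep only the columns that have at least one reading outside the 68-72 band.
too_hot = df.loc[:, (df > 72).any()]
too_cool = df.loc[:, (df < 68).any()]

# Two axes side by side (layout is an assumption made for this sketch).
fig, axes = plt.subplots(1, 2, figsize=(8, 4))

# rot=90 rotates the column labels; showfliers=False hides the outlier markers.
too_hot.boxplot(ax=axes[0], rot=90, showfliers=False)
axes[0].set_title('These zones are too warm')

too_cool.boxplot(ax=axes[1], rot=90, showfliers=False)
axes[1].set_title('These zones are too cool')

fig.tight_layout()
# Call plt.show() here to display the figure interactively.
```

With the sample data, columns a and b land in the warm plot and column c in the cool plot; column d stays inside the band and is not plotted at all.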

Related

Adjust y axis when using parallel_coordinates

I'm working on plotting some data with pandas in the form of parallel coordinates, and I'm not too sure how to go about setting the y-axis scaling.
Here is my code:
def show_means(df: pd.DataFrame):
    plt.figure('9D-parallel_coordinates')
    plt.title('continents & features')
    parallel_coordinates(df, 'continent', color=['blue', 'green', 'red', 'yellow', 'orange', 'black'])
    plt.show()
And I got this:
[image]
As shown in the graph, the values of "tempo" are much larger than the others. I want to scale all feature values between 0 and 1 and get a line chart. How could I do that? Also, I want to make the x-axis labels vertical so that readers can understand the chart more easily.
This is my data frame:
[image]
Thanks

To normalize your values between 0 and 1, you have multiple choices. One of them is min-max scaling (what scikit-learn's MinMaxScaler does): the lowest value of each column becomes 0 and the highest becomes 1:
df = (df - df.min()) / (df.max() - df.min())
To get vertical labels, use df.plot(rot=90)
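One thing to watch: the data frame in the question also holds the non-numeric 'continent' column, so the arithmetic probably has to be limited to the feature columns. Here is a minimal sketch under that assumption. Only 'continent' and 'tempo' come from the question; the other column names and all the values are made up. Since parallel_coordinates returns the Axes, the labels can be rotated on it directly.

```python
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates

# Made-up stand-in for the question's data frame.
df = pd.DataFrame({
    'continent': ['Asia', 'Europe', 'Africa'],
    'tempo': [120.0, 98.0, 140.0],
    'energy': [0.6, 0.8, 0.4],
    'loudness': [-7.0, -5.0, -9.0],
})

# Scale only the numeric feature columns; leave the class column alone.
features = df.columns.drop('continent')
scaled = df.copy()
scaled[features] = (df[features] - df[features].min()) / (df[features].max() - df[features].min())

plt.figure('9D-parallel_coordinates')
ax = parallel_coordinates(scaled, 'continent', color=['blue', 'green', 'red'])
ax.set_title('continents & features')

# Rotate the feature names on the x axis so they read vertically.
ax.tick_params(axis='x', labelrotation=90)
# Call plt.show() here to display the figure interactively.
```

Note that a column with a single constant value would make the denominator zero, so such columns would need separate handling.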

Matplotlib -- UserWarning: Attempting to set identical left == right == 737342.0 results in singular transformations;

Using Matplotlib, I am trying to create a line chart, but I am facing the issue below. Here is the code. Can someone help me with any suggestions?
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates

Head = ['Date','Count1','Count2','Count3']
df9 = pd.DataFrame(Data, columns=Head)
df9.set_index('Date',inplace=True)
fig,ax = plt.subplots(figsize=(15,10))
df9.plot(ax=ax)
ax.xaxis.set_major_locator(mdates.WeekdayLocator(mdates.SATURDAY))
ax.xaxis.set_major_formatter(mdates.DateFormatter('%b %d'))
plt.legend()
plt.xticks(fontsize= 15)
plt.yticks(fontsize= 15)
plt.savefig('Line.png')
I am getting the warning below:
Matplotlib UserWarning: Attempting to set identical left == right == 737342.0 results in singular transformations; automatically expanding (ax.set_xlim(left, right))
Sample data:
01-10-2010, 100, 0, 100
X axis: I am trying to display the date, with a tick every Saturday
Y axis: the other 3 counts
Can someone please help me understand what this issue is and how I can fix it?

The issue is caused by the fact that pandas.DataFrame.plot explicitly sets the x- and y-limits of your plot to the limits of your data. This is normally fine, and no one notices. But when the data contains a single row, the left and right limits are identical, which triggers the warning. In fact, I had a lot of trouble finding references to your warning anywhere at all, much less on the pandas bug list.
The workaround is to set your own limits manually in your call to DataFrame.plot:
if len(df9) == 1:
    delta = pd.Timedelta(days=1)
    lims = [df9.index[0] - delta, df9.index[0] + delta]
else:
    lims = [None, None]
df9.plot(ax=ax, xlim=lims)
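For completeness, here is a self-contained sketch of the same workaround built on the one sample row from the question. It assumes the Date column is parsed into real datetimes first (otherwise subtracting a Timedelta from the index would not work), and the day-month-year format is a guess on my part.

```python
import pandas as pd
import matplotlib.pyplot as plt

# The single sample row from the question.
Data = [['01-10-2010', 100, 0, 100]]
Head = ['Date', 'Count1', 'Count2', 'Count3']
df9 = pd.DataFrame(Data, columns=Head)

# Assumption: the dates are day-month-year strings.
df9['Date'] = pd.to_datetime(df9['Date'], format='%d-%m-%Y')
df9 = df9.set_index('Date')

# With one row, pad the x limits by a day on each side so left != right.
if len(df9) == 1:
    delta = pd.Timedelta(days=1)
    lims = [df9.index[0] - delta, df9.index[0] + delta]
else:
    lims = [None, None]

fig, ax = plt.subplots(figsize=(15, 10))
df9.plot(ax=ax, xlim=lims, marker='o')  # marker so the lone point is visible
```

The one-day padding is arbitrary; any nonzero width should avoid the singular transformation.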

This issue can also arise in a trickier situation, when you have more than one point but only one of them can be drawn on your plot: typically, when only one point is > 0 and your plot's y-scale is logarithmic.
One should always set limits on a log scale when there are 0 values, because the program has no way to decide on a good lower limit for the scale.
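A small sketch of that situation, with made-up counts where only one value is positive. The limits of 1 and 1000 are arbitrary picks for illustration; the point is only that they are set explicitly.

```python
import matplotlib.pyplot as plt

# Made-up data: only one value is > 0, so only one point fits on a log axis.
counts = [0, 0, 100, 0]

fig, ax = plt.subplots()
ax.plot(range(len(counts)), counts, marker='o')
ax.set_yscale('log')

# Zeros cannot be placed on a log axis, so choose the limits by hand
# instead of letting them be derived from a single drawable point.
ax.set_ylim(1, 1000)
```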

How to make the confidence interval (error bands) show on seaborn lineplot

I'm trying to create a plot of classification accuracy for three ML models, depending on the number of features used from the data (the number of features used is from 1 to 75, ranked according to a feature selection method). I did 100 iterations of calculating the accuracy output for each model and for each "# of features used". Below is what my data looks like (clsf from 0 to 2, timepoint from 1 to 75):
[image: data]
I then call the seaborn function as shown in the documentation:
sns.lineplot(x="timepoint", y="acc", hue="clsf", data=ttest_df, ci="sd", err_style="band")
The plot comes out like this:
[image: plot]
I wanted there to be confidence intervals for each point on the x-axis, and don't know why it is not working. I have 100 y values for each x value, so I don't see why it cannot calculate/show it.

You could try plotting your data set with Seaborn's pointplot function instead. It's specifically for showing an indication of uncertainty around a scatter plot of points. By default, pointplot connects the values with a line. This is fine if the categorical variable is ordinal in nature, but for nominal data it can be a good idea to remove the line via linestyles="". (I used join=False in my example.)
I tried to recreate your notebook to give a visual, but wasn't able to get the confidence interval in my plot exactly as you describe. I hope this is helpful for you.
import seaborn as sb

sb.set(style="darkgrid")
sb.pointplot(x='timepoint', y='acc', hue='clsf',
             data=ttest_df, ci='sd', palette='magma',
             join=False)
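If the band still refuses to show, one fallback (a different technique from the seaborn calls above, not a fix for them) is to compute the mean and standard deviation per group yourself and draw the band with matplotlib's fill_between. The sketch below uses synthetic data shaped like the description in the question: 3 classifiers, 75 timepoints, 100 accuracy values each. The accuracy values themselves are random and purely illustrative.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in for ttest_df (values are made up).
rng = np.random.default_rng(0)
rows = [
    {'clsf': clsf, 'timepoint': t, 'acc': 0.5 + 0.005 * t + rng.normal(0, 0.05)}
    for clsf in range(3)
    for t in range(1, 76)
    for _ in range(100)
]
ttest_df = pd.DataFrame(rows)

# Mean and standard deviation of acc for every (clsf, timepoint) pair.
stats = (ttest_df.groupby(['clsf', 'timepoint'])['acc']
         .agg(['mean', 'std'])
         .reset_index())

fig, ax = plt.subplots()
for clsf, grp in stats.groupby('clsf'):
    ax.plot(grp['timepoint'], grp['mean'], label=f'clsf {clsf}')
    # Shaded band of one standard deviation around the mean.
    ax.fill_between(grp['timepoint'],
                    grp['mean'] - grp['std'],
                    grp['mean'] + grp['std'],
                    alpha=0.3)
ax.set_xlabel('timepoint')
ax.set_ylabel('acc')
ax.legend()
```

Looking at the stats table also makes it easy to check whether the standard deviations in the real data are simply too small to be visible at the plot's scale.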

How to find the correct condition for my matplotlib scatterplot?

I'm trying to correlate two measures (DD & DRE) from a data set which contains many more columns. I created a data frame and called it 'Data'.
Within this Data, I want to create a scatterplot of DD (x axis) against DRE (y axis), and I want to include only DD values between 0 and 100.
Please help me with the first line of my code to get the condition of DD between 0 and 100.
Also, when I plot the scatterplot, I get dots beyond 100% (the y axis is DRE in %) though I don't have any value > 100%.
Data1 = Data[Data['DD'] < 100]
plt.scatter(Data1.DD, Data1.DRE)
tick_val = [0,10,20,30,40,50,60,70,80,90,100]
tick_lab = ['0%','10%','20%','30%','40%','50%','60%','70%','80%','90%','100%']
plt.yticks(tick_val, tick_lab)
plt.show()
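One possible way to write the "between 0 and 100" condition is Series.between, which includes both endpoints by default. The sketch below uses a tiny made-up Data frame, and it swaps the hand-written tick labels for matplotlib's PercentFormatter, which is my own substitution rather than anything from the question.

```python
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.ticker import PercentFormatter

# Made-up stand-in for the real data set.
Data = pd.DataFrame({'DD': [-5, 0, 40, 100, 150],
                     'DRE': [10, 20, 55, 90, 95]})

# Keep rows where 0 <= DD <= 100 (both ends inclusive by default).
Data1 = Data[Data['DD'].between(0, 100)]

fig, ax = plt.subplots()
ax.scatter(Data1['DD'], Data1['DRE'])
ax.set_ylim(0, 100)
# Format the y ticks as percentages, treating 100 as 100%.
ax.yaxis.set_major_formatter(PercentFormatter(xmax=100))
```

The same filter can be spelled with two comparisons, Data[(Data['DD'] >= 0) & (Data['DD'] <= 100)]; the parentheses around each comparison are required.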

matplotlib y-adjustment of texts using adjustText

This is a plot I generated using pyplot, where I attempted to adjust the text using the adjustText library, which I also found here.
As you can see, it gets pretty crowded in the parts where 0 < x < 0.1. I was thinking that there's still ample space in 0.8 < y < 1.0, such that the labels could all fit and mark the points pretty well.
My attempt was:
plt.plot(df.fpr,df.tpr,marker='.',ls='-')
texts = [plt.text(df.fpr[i],df.tpr[i], str(df.thr1[i])) for i in df.index]
adjust_text(texts,
            expand_text=(2,2),
            expand_points=(2,2),
            expand_objects=(2,2),
            force_objects = (2,20),
            force_points = (0.1,0.25),
            lim=150000,
            arrowprops=dict(arrowstyle='-',color='red'),
            autoalign='y',
            only_move={'points':'y','text':'y'}
            )
where my df is a pandas dataframe which can be found here.
From what I understood in the docs, I tried varying the bounding boxes and the y-force by making them larger, thinking that it would push the labels further up, but that does not seem to be the case.

I'm the author of adjustText; sorry, I just noticed this question. You are having this problem because you have a lot of overlapping texts with exactly the same y-coordinate. It's easy to solve by adding a tiny random shift along y to the labels (and you do need to increase the force for texts, otherwise it works very slowly along one dimension), like so:
import numpy as np
import matplotlib.pyplot as plt
from adjustText import adjust_text

np.random.seed(0)
f, ax = plt.subplots(figsize=(12, 6))
plt.plot(df.fpr,df.tpr,marker='.',ls='-')
texts = [plt.text(df.fpr[i], df.tpr[i]+np.random.random()/100, str(df.thr1[i])) for i in df.index]
plt.margins(y=0.125)
adjust_text(texts,
            force_text=(2, 2),
            arrowprops=dict(arrowstyle='-',color='red'),
            autoalign='y',
            only_move={'points':'y','text':'y'},
            )
Also notice that I increased the margins along the y axis; it helps a lot with the corners. The result is not quite perfect, since limiting the algorithm to just one axis makes life more difficult, but it's OK-ish already.
I have to mention that the size of the figure is very important; I don't know what yours was.
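To see what the random shift does on its own, here is a tiny sketch without adjustText. The data frame is made up (four labels that all share tpr = 1.0), and I use a NumPy Generator instead of np.random.seed, which is just a style preference.

```python
import numpy as np
import pandas as pd

# Made-up stand-in: several labels sitting at exactly the same y-coordinate.
df = pd.DataFrame({'fpr': [0.0, 0.02, 0.05, 0.08],
                   'tpr': [1.0, 1.0, 1.0, 1.0],
                   'thr1': [0.9, 0.8, 0.7, 0.6]})

rng = np.random.default_rng(0)

# Shift every label up by a random amount below 0.01 so that no two labels
# start from an identical y position.
label_y = df['tpr'] + rng.random(len(df)) / 100

print(label_y.is_unique)
```

The jittered label_y values would then be used as the y argument of plt.text in place of df.tpr[i], while the plotted points themselves stay where they are.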
