I have some data which is organized as follows:
x y score
1 1 0.951
1 1 0.956
1 1 0.976
1 1 0.875
1.5 1.5 0.94
1.5 1.5 0.76
1.5 1.5 0.88
...
10 10 0.51
10 10 0.66
What I want to do is aggregate the data by (x, y) value and show a box plot of the scores at each (x, y) pair.
I realize that I will need two y-axes, and I think matplotlib allows that; I see an example here: https://matplotlib.org/gallery/api/two_scales.html
However, I am not sure whether it is even possible to arrange this so that the scales correspond to the means of these datasets.
So my question is whether this can be done, or whether there is a recommended way to visualise this sort of data. One possible approach is sketched below.
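A minimal sketch of one approach, assuming the data is in a pandas DataFrame named df with the x, y, and score columns shown above: group by the (x, y) pair and pass each group's scores to a single boxplot call, labelling each box with its coordinate pair.
import matplotlib.pyplot as plt

# df is assumed to be a DataFrame with the x, y, score columns shown above
grouped = df.groupby(['x', 'y'])['score']
labels = [f"({x}, {y})" for (x, y), _ in grouped]
data = [scores.values for _, scores in grouped]

fig, ax = plt.subplots()
ax.boxplot(data, labels=labels)  # one box per (x, y) pair
ax.set_xlabel('(x, y)')
ax.set_ylabel('score')
plt.show()
This sidesteps the two-y-axes question entirely: every (x, y) pair becomes one position on a shared axis, and the boxes themselves show the spread around each group's mean.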
I am trying to use PuLP to solve the following problem.
I want it to change the values in the value column, either freely or, where the max_decrease / max_increase fields are populated, by at most those amounts, to achieve a target average of 0.27.
band term value max_decrease max_increase target
A    1    0.259 -0.01        0            0.27
A    2    0.239  0           0.01         0.27
B    1    0.17   0           0.01         0.27
C    1    0.245 -0.01        0            0.27
I have tried setting this problem up but am not making any progress. I am currently trying to formulate it as:
minimize |target - avg(value)|, so that by changing the values it achieves an average value of 0.27. A sketch of one possible formulation follows.
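A minimal PuLP sketch, under the assumption that each adjusted value must stay within [value + max_decrease, value + max_increase] and that the quantity to minimise is the absolute gap between the average and the target (the row data comes from the table above; variable names are illustrative):
import pulp

target = 0.27
rows = [
    # (band, term, value, max_decrease, max_increase)
    ("A", 1, 0.259, -0.01, 0.00),
    ("A", 2, 0.239,  0.00, 0.01),
    ("B", 1, 0.170,  0.00, 0.01),
    ("C", 1, 0.245, -0.01, 0.00),
]
n = len(rows)

prob = pulp.LpProblem("hit_target_average", pulp.LpMinimize)

# one decision variable per row, bounded by the allowed decrease/increase
adj = [
    pulp.LpVariable(f"adj_{band}_{term}", lowBound=v + dec, upBound=v + inc)
    for band, term, v, dec, inc in rows
]

# dev >= |average(adj) - target|, linearised as two constraints
# (multiplied through by n to avoid dividing an expression)
dev = pulp.LpVariable("dev", lowBound=0)
prob += dev  # objective: minimise the absolute deviation
prob += pulp.lpSum(adj) - n * target <= n * dev
prob += n * target - pulp.lpSum(adj) <= n * dev

prob.solve()
for var in adj:
    print(var.name, var.value())
print("average:", sum(v.value() for v in adj) / n)
Note that with the bounds in the table the average cannot actually reach 0.27, so the solver will push every value to its upper bound and report the smallest achievable deviation rather than zero.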
I have a pandas DataFrame that looks like this with 12 clusters in total. Certain clusters don't appear in a certain season.
I want to create a multi-line graph over the seasons of the percent of a specific cluster over each season. So if there are 30 teams in the 97-98 season and there are 10 teams in Cluster 1, then that value would be .33 since cluster 1 has one third of the total possible spots.
It'll look like this
And I want the dataset to look like this, where each cluster has its own share of the total number of teams in that season, expressed as a percentage. I've tried using the pandas groupby method to get a bunch of lists and then calling value_counts() on them, but that doesn't work, since looping through df.groupby(['SEASON']) returns tuples, not a Series.
Thanks so much
Use .groupby combined with .value_counts and .unstack:
# within each season, take each cluster's share of teams; unstack to one column per cluster, filling absent clusters with 0
temp_df = df.groupby(['SEASON'])['Cluster'].value_counts(normalize=True).unstack().fillna(0.0)
temp_df.plot()
print(temp_df.round(2))
Cluster 0 1 2 4 5 6 7 10 11
SEASON
1996-97 0.1 0.21 0.17 0.21 0.07 0.1 0.03 0.07 0.03
1997-98 0.2 0.00 0.20 0.20 0.00 0.0 0.20 0.20 0.00
I have the following classification problem. The training set has 50,000 rows, and Y has 60 labels. But the data is imbalanced: one class has 35,000 values, while the other 59 classes share 15,000 values, some of them with only 30. For example, suppose X has columns (column_1, column_2, column_3) and Y:
column_1 column_2 column_3 Y
0.5 1 2 1
0.5 1.1 2 1
0.55 0.95 3 1
0.1 1 2 2
2 0.9 3 3
And I need to add "noisy" data so that there is no imbalance, i.e. so that all classes end up with the same number of values:
column_1 column_2 column_3 Y
0.5 1 2 1
0.5 1.1 2 1
0.55 0.95 3 1
0.1 1 2 2
0.15 0.99 2 2
0.05 1.01 2 2
2 0.9 3 3
1.95 0.95 3 3
2.05 0.85 3 3
This is only a toy example; my real data has many more values.
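For reference, a hand-rolled sketch of exactly this kind of noisy duplication, assuming the data is in a pandas DataFrame df with a label column Y and only numeric feature columns (oversample_with_noise and noise_scale are names invented here for illustration):
import numpy as np
import pandas as pd

def oversample_with_noise(df, label_col="Y", noise_scale=0.05, seed=42):
    # Upsample every class to the size of the largest class by resampling
    # its rows with replacement and jittering the feature columns.
    rng = np.random.default_rng(seed)
    target_n = df[label_col].value_counts().max()
    features = df.columns.drop(label_col)
    parts = [df]
    for label, group in df.groupby(label_col):
        extra = target_n - len(group)
        if extra > 0:
            sampled = group.sample(extra, replace=True, random_state=seed).copy()
            noise = rng.normal(0.0, noise_scale, size=(extra, len(features)))
            sampled[features] = sampled[features] + noise
            parts.append(sampled)
    return pd.concat(parts, ignore_index=True)

balanced = oversample_with_noise(df)
print(balanced["Y"].value_counts())
The noise_scale here is a single global setting; in practice you may want it proportional to each column's standard deviation, which is essentially what SMOTE-style methods (below) do more carefully.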
Although the question is not exactly clear, I think you're looking for help with oversampling the minority classes. A common approach would be the SMOTE algorithm, which you can find in the imblearn package.
from imblearn.over_sampling import SMOTE

# in older imblearn releases this was SMOTE(ratio=1.0) and sm.fit_sample(...)
sm = SMOTE(random_state=42)
X_res, Y_res = sm.fit_resample(X_train, Y_train)
Just make sure you divide your data up into train and test groups first, and then over-sample only the training group, so you don't end up with the same data in both. A fuller description here.
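A short sketch of that split-then-oversample workflow, assuming scikit-learn is available and X, Y hold the full feature matrix and labels (all variable names here are illustrative):
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

# split first, stratifying so rare classes appear in both sets
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, stratify=Y, random_state=42
)

# oversample the training portion only; the test set stays untouched
sm = SMOTE(random_state=42)
X_res, Y_res = sm.fit_resample(X_train, Y_train)
One caveat: SMOTE interpolates between a sample and its k nearest same-class neighbours (k_neighbors=5 by default), so classes with very few rows may need a smaller k_neighbors setting.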
I am working with stock indices. I have a NumPy array containing the daily returns for the index over the last 25 years or so. I have plotted the empirical PDF along with the corresponding normal PDF to show how far the actual data deviates from a normal distribution.
My questions are:
Is there a Pythonic way to test whether my left tail is actually a fat tail?
And in the above graph, how do I mark a point/threshold beyond which I can say the tail is fat?
Consider scipy.stats.kurtosistest and scipy.stats.skewtest.
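As a hedged sketch of how those tests read, assuming returns is the NumPy array of daily returns described in the question:
from scipy import stats

# null hypothesis for each test: the sample's kurtosis/skewness
# matches that of a normal distribution
k_stat, k_p = stats.kurtosistest(returns)
s_stat, s_p = stats.skewtest(returns)
print(f"kurtosis test p-value: {k_p:.4g}")
print(f"skew test p-value:     {s_p:.4g}")
A small p-value on the kurtosis test indicates heavier-than-normal tails overall, and a negative skew statistic alongside it is consistent with the extra weight sitting in the left tail.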
To your second question, use .axvline to mark your line there. Depending on how granular the bins are, try finding the first point left of zero that meets the following condition:
df
Out[20]:
Normal Empirical
Bin
-1.0 0 2.0
-0.9 1 2.5
-0.8 2 3.0
-0.7 3 3.5
-0.6 4 4.0
-0.5 5 4.5
-0.4 6 5.0
-0.3 7 6.0
-0.2 8 8.0
-0.1 9 10.0
0.0 10 12.0
df.index[(df.Normal.shift() < df.Empirical.shift())
& (df.Normal == df.Empirical)].values
Out[38]: array([-0.6])
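Then, assuming ax is the Axes object of your existing plot, something like this marks the threshold (a sketch; the colour and label are arbitrary choices):
# first bin left of zero where the empirical curve crosses above the normal one
cut = df.index[(df.Normal.shift() < df.Empirical.shift())
               & (df.Normal == df.Empirical)].values[0]
ax.axvline(cut, color='red', linestyle='--', label='fat-tail threshold')
ax.legend()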
And lastly, you could consider plotting the actual histogram in addition to the fitted distribution, and using an inset, as is done here.
I have a table of values which aren't logs, but to find their relation I think I need to create a log-log plot. The values I have are:
R C
----------
0.2 103
2 13.9
20 2.72
200 0.800
2000 0.401
20000 0.433
How do I plot the logs of these values?
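A minimal sketch using matplotlib, with the table above typed in directly:
import matplotlib.pyplot as plt

R = [0.2, 2, 20, 200, 2000, 20000]
C = [103, 13.9, 2.72, 0.800, 0.401, 0.433]

fig, ax = plt.subplots()
ax.loglog(R, C, marker='o')  # logarithmic scale on both axes
ax.set_xlabel('R')
ax.set_ylabel('C')
plt.show()
If the underlying relation is a power law C = a * R**b, the points will fall on a straight line on this plot, with slope b.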