How to draw a seaborn or matplotlib plot - python

I have a dataframe which looks like this:
            A    B    C
Above 90%   30   50   60
80% - 90%   39   72   79
65% - 80%   80  150  130
Below 65%   80   70   90
And I would like to get a bar chart like the one below.
I also have another dataset, which has one more sub-category for A, B and C:
Company  Category     Above 90%  80% - 90%  65% - 80%  Below 65%
A        Fast Movers          1         11         21          9
A        Med Movers           2         13         42         40
A        Slow Movers         34         26         44         50
B        Fast Movers          1         11         21          9
B        Med Movers          16         48         92         45
B        Slow Movers         33         13         37         45
C        Fast Movers          0          0          2         30
C        Med Movers          11         37         74         50
C        Slow Movers         41         42         65         30
For the above table I want a stacked bar chart with colors only for the categories; Company and accuracy should be labels, not colors.
I was not able to achieve this even with other software.
Any help would be appreciated. Thank you in advance.

For the first question, you can use this:
import matplotlib.pyplot as plt
ax = df.plot(kind='bar', title="Accuracy distribution", figsize=(15, 10), legend=True, fontsize=12)
for p in ax.patches:
    ax.annotate(str(p.get_height()), (p.get_x() * 1.005, p.get_height() * 1.005))
plt.show()
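For reference, the dataframe from the question can be reconstructed like this (a minimal sketch, assuming the percentage buckets are the index):
import pandas as pd

df = pd.DataFrame(
    {'A': [30, 39, 80, 80], 'B': [50, 72, 150, 70], 'C': [60, 79, 130, 90]},
    index=['Above 90%', '80% - 90%', '65% - 80%', 'Below 65%'])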

You can use:
df.set_index(['Company','Category']).stack().unstack(1).plot.bar(stacked=True)
Or for each company separate graph:
for i, x in df.set_index(['Company','Category']).stack().unstack(1).reset_index(level=0).groupby('Company'):
    x.plot.bar(stacked=True)
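A minimal end-to-end sketch for the second table, assuming it is loaded as df with Company and Category as ordinary columns:
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({
    'Company':   ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
    'Category':  ['Fast Movers', 'Med Movers', 'Slow Movers'] * 3,
    'Above 90%': [1, 2, 34, 1, 16, 33, 0, 11, 41],
    '80% - 90%': [11, 13, 26, 11, 48, 13, 0, 37, 42],
    '65% - 80%': [21, 42, 44, 21, 92, 37, 2, 74, 65],
    'Below 65%': [9, 40, 50, 9, 45, 45, 30, 50, 30],
})

# (Company, accuracy bucket) pairs become the x-axis labels;
# the Category level becomes the stacked colors
df.set_index(['Company', 'Category']).stack().unstack(1).plot.bar(stacked=True)
plt.tight_layout()
plt.show()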

Related

Is there a way to use a for-loop to quickly create sublots in matplotlib and pandas?

I'm saving the daily stock price for several stocks in a Pandas Dataframe. I'm using python and Jupyter notebook.
Once saved, I'm using matplotlib to graph the prices to check the data.
The idea is to graph 9 stocks at a time in a 3 x 3 subplot.
When I want to check other stock tickers I have to manually change each ticker in each subplot, which takes a long time and seems inefficient.
Is there a way to do this with some sort of list and for loop?
Here is my current code. It works, but it seems too long and hard to update. (Stock tickers are only examples from a Vanguard model portfolio.)
x = price_df.index
a = price_df["P_VOO"]
b = price_df["P_VGK"]
c = price_df["P_VPL"]
d = price_df["P_IEMG"]
e = price_df["P_MCHI"]
f = price_df["P_VNQ"]
g = price_df["P_GDX"]
h = price_df["P_BND"]
i = price_df["P_BNDX"]
# Plot a figure with various axes scales
fig = plt.figure(figsize=(15,10))
# Subplot 1
plt.subplot(331)
plt.plot(x, a)
plt.title("VOO")
plt.ylim([0,550])
plt.grid(True)
plt.subplot(332)
plt.plot(x, b)
plt.title("VGK")
plt.ylim([0,400])
plt.grid(True)
plt.subplot(333)
plt.plot(x, c)
plt.title('VPL')
plt.ylim([0,110])
plt.grid(True)
plt.subplot(334)
plt.plot(x, d)
plt.title('IEMG')
plt.ylim([0,250])
plt.grid(True)
plt.subplot(335)
plt.plot(x, e)
plt.title('MCHI')
plt.ylim([0,75])
plt.grid(True)
plt.subplot(336)
plt.plot(x, f)
plt.title('P_VNQ')
plt.ylim([0,55])
plt.grid(True)
plt.subplot(337)
plt.plot(x, g)
plt.title('P_GDX')
plt.ylim([0,8])
plt.grid(True)
plt.subplot(338)
plt.plot(x, h)
plt.title('P_BND')
plt.ylim([0,200])
plt.grid(True)
plt.subplot(339)
plt.plot(x, i)
plt.title('P_BNDX')
plt.ylim([0,350])
plt.grid(True)
plt.tight_layout()
Try DataFrame.plot with subplots enabled, setting the layout and figsize:
axes = df.plot(subplots=True, title=df.columns.tolist(),
               grid=True, layout=(3, 3), figsize=(15, 10))
plt.tight_layout()
plt.show()
Or use plt.subplots to set the layout then plot on those axes with DataFrame.plot:
# setup subplots
fig, axes = plt.subplots(nrows=3, ncols=3, figsize=(15, 10))
# Plot DataFrame on axes
df.plot(subplots=True, ax=axes, title=df.columns.tolist(), grid=True)
plt.tight_layout()
plt.show()
Sample Data and imports:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
np.random.seed(5)
df = pd.DataFrame(np.random.randint(10, 100, (10, 9)),
                  columns=list("ABCDEFGHI"))
df:
    A   B   C   D   E   F   G   H   I
0  88  71  26  83  18  72  37  40  90
1  17  86  25  63  90  37  54  87  85
2  75  57  40  94  96  28  19  51  72
3  11  92  26  88  15  68  10  90  14
4  46  61  37  41  12  78  48  93  29
5  28  17  40  72  21  77  75  65  13
6  88  37  39  43  99  95  17  26  24
7  41  19  48  57  26  15  44  55  69
8  34  23  41  42  86  54  15  24  57
9  92  10  17  96  26  74  18  54  47
Does this implementation not work out in your case?
x = price_df.index
cols = ["P_VOO","P_VGK",...]   # Populate before running
ylims = [[0,550],...]          # Populate before running
# Plot a figure with various axes scales
fig = plt.figure(figsize=(15,10))
for i, (col, ylim) in enumerate(zip(cols, ylims)):
    plt.subplot(331 + i)
    plt.plot(x, price_df[col])
    plt.title(col.split('_')[1])
    plt.ylim(ylim)
    plt.grid(True)
I haven't run this code locally, so it could have some minor bugs, but you get the general idea.
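For completeness, the two lists can be populated straight from the manual subplot code in the question:
cols = ["P_VOO", "P_VGK", "P_VPL", "P_IEMG", "P_MCHI", "P_VNQ", "P_GDX", "P_BND", "P_BNDX"]
ylims = [[0, 550], [0, 400], [0, 110], [0, 250], [0, 75], [0, 55], [0, 8], [0, 200], [0, 350]]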

Plotting of dot points based on np.where condition

I have a lot of data points (in .CSV form) that I am trying to visualize. I would like to read the CSV and check the "result" column; if the value in that column is positive (I was trying to use an np.where condition), I would like to plot the A B C D E F G parameters of that row, with the y-axis being the value of each parameter and the x-axis being the parameter's name (something like a dot/scatter plot). I would like to plot all the values in the same graph. Furthermore, if there are more than 20 matching rows, I would like to use only the first 20 for the plotting.
An example of the type of dataset is below. (Mine contains around 12000 rows)
  A    B    C   D    E   F   G  result
 23  -54   36  27   98  39  80   -0.86
 14   44  -16  47   28  29  26    1.65
 67   84   26  67  -88  29  10    0.5
-45   14   76  37   68  59  90    0
 24   34   56  27   38  79  48   -1.65
Any guidance on this would be appreciated!
From your question I assume that your data is a pandas dataframe. In this case you can do the selection with pandas and use its built-in plotting function:
df.loc[df.result>0, df.columns[:-1]].T.plot(ls='', marker='o')
If you want to plot only the first 20 matching rows, append [:20] (or better, .iloc[:20]) to the df.loc selection.
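Putting it together, a minimal sketch (assuming the data sits in a hypothetical file named data.csv with the columns shown above):
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('data.csv')  # hypothetical file name

# rows with a positive result, all columns except 'result',
# capped at the first 20 matches; transposing puts the
# parameter names (A..G) on the x-axis
subset = df.loc[df.result > 0, df.columns[:-1]].iloc[:20]
subset.T.plot(ls='', marker='o', legend=False)
plt.show()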

Wrong scipy fit even with good initial guess

The model to fit is the equation
def func(x, b):
    return b*np.exp(-b*x)*(1.0 + b*x)/4.0
I know that b=0.1 is a good guess for my data:
0 0.1932332495855138
1 0.0283534527253836
2 0.0188036856033853
3 0.0567007258167565
4 0.0704161703188139
5 0.0276463443409273
6 0.0144509808494943
7 0.0188027609145469
8 0.0049573500626925
9 0.0064589075683206
10 0.0118522499082115
11 0.0087201376939245
12 0.0055855004231049
13 0.0110355379801288
14 0.0024829496736532
15 0.0050982312687186
16 0.0041032075307342
17 0.0063991465281368
18 0.0047195530453669
19 0.0028479431829209
20 0.0177577032522473
21 0.0082863863356967
22 0.0057720347102372
23 0.0053694769677398
24 0.017408417311084
25 0.0023307847797263
26 0.0014090741613788
27 0.0019007144648791
28 0.0043599058193019
29 0.004435997067249
30 0.0015569027316533
31 0.0016127575928092
32 0.00120222948697
33 0.0006851723909766
34 0.0014497504163
35 0.0014245210449107
36 0.0011375555693977
37 0.0007939973846594
38 0.0005707034948325
39 0.0007890519641431
40 0.0006274139241806
41 0.0005899624312505
42 0.0003989619799181
43 0.0002212632688891
44 0.0001465605806698
45 0.000188075040325
46 0.0002779076010181
47 0.0002941294723591
48 0.0001690581072228
49 0.0001448055157076
50 0.0002734759385405
51 0.0003228484365634
52 0.0002120441778252
53 0.0002383276583408
54 0.0002156310534404
55 0.0004499244488764
56 0.0001408465706883
57 0.000135998586104
58 0.00028706917157
59 0.0001788548683777
But whether I set p0=0.1 or p0=1.0, the fitted parameter comes out as popt=[0.42992594] or popt=[0.42994105] respectively, which is almost the same value. Why doesn't curve_fit work in this case?
from scipy.optimize import curve_fit

popt, pcov = curve_fit(func, xdata, ydata, p0=[0.1])
There's nothing too mysterious going on here. 0.4299... is just a better fit to the data, in the least-squares sense.
With b = 0.1, the first few points are not well fit at all. Least-squares heavily weights outliers, so the optimizer tries very hard to fit those better, even if it means doing slightly worse at other points. In other words, "most" points are fit "pretty well", and there is a very high penalty for fitting any point very badly (that's the "square" in least-squares).
Below is a plot of the data (blue) and your model function with b = 0.1 and b = 0.4299 in orange and green respectively. The value returned by curve_fit is better subjectively and objectively. Computing the MSE to the data in both cases gives about 0.18 using b = 0.1, and 0.13 using b = 0.4299.
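A sketch of that comparison, assuming xdata = np.arange(60) and that ydata holds the 60 values listed above (the MSE figures quoted are from the original answer, not from this sketch):
import numpy as np
from scipy.optimize import curve_fit

def func(x, b):
    return b*np.exp(-b*x)*(1.0 + b*x)/4.0

xdata = np.arange(60)
ydata = np.array([...])  # paste the 60 values from the question here

# fit, then compare the squared error for both candidate values of b
popt, pcov = curve_fit(func, xdata, ydata, p0=[0.1])
for b in (0.1, popt[0]):
    print('b = %.4f  MSE = %.6g' % (b, np.mean((ydata - func(xdata, b))**2)))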

get rid of tiny fraction in bar plot scale (Pandas/Python)

When I try to plot a bar plot (of histograms) using pd.cut, I get a funny (and very annoying!) 0.001 added to the axis on the left, making it start from -1.001 instead of -1. How do I get rid of this? (Please see the figure.)
My code is:
out_i = pd.cut(df, bins=np.arange(-1,1.2,0.2), include_lowest=True)
out_i.value_counts(sort=False).plot.bar(rot=45, figsize=(6,6))
plt.tight_layout()
with df:
a
0 -0.402203
1 -0.019031
2 -0.979292
3 -0.701221
4 -0.267261
5 -0.563602
7 -0.454961
8 0.632456
9 -0.843081
10 -0.629253
11 -0.946188
12 -0.628178
13 -0.776933
14 -0.717091
15 -0.392144
16 -0.799408
17 -0.897951
18 0.255321
19 -0.641854
20 -0.356393
21 -0.507321
22 -0.698238
23 -0.985097
25 -0.661444
26 -0.751593
27 -0.437505
28 -0.413451
29 -0.798745
30 -0.736440
31 -0.672727
32 -0.807688
33 -0.087085
34 -0.393203
35 -0.979730
36 -0.902951
37 -0.454231
38 -0.561951
39 -0.388580
40 -0.706501
41 -0.408248
42 -0.377235
43 -0.283110
44 -0.517428
45 -0.949603
46 -0.268667
47 -0.376199
48 -0.472293
49 -0.211781
50 -0.921520
51 -0.345870
53 -0.542487
55 -0.597996
If it is acceptable to chop off the extra decimals in the interval labels, generate a custom list of labels and set it as the xticklabels of the plot:
out_i = pd.cut(df['a'], bins=np.arange(-1,1.2,0.2), include_lowest=True)
intervals = out_i.cat.categories
labels = ['(%.1f, %.1f]' % (int(interval.left*100)/100, interval.right) for interval in intervals]
ax = out_i.value_counts(sort=False).plot.bar(rot=45, figsize=(6,6))
ax.set_xticklabels(labels)
plt.tight_layout()
Which results in the following plot:
Note: this will always output a half-closed interval (a,b]. It can be improved by making the brackets dynamic as per the parameters of pd.cut.
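As a sketch of that improvement, pandas Interval objects expose closed_left and closed_right, so the brackets can be picked per interval instead of being hard-coded:
labels = ['%s%.1f, %.1f%s' % ('[' if iv.closed_left else '(',
                              iv.left, iv.right,
                              ']' if iv.closed_right else ')')
          for iv in intervals]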

Python/Pandas Select Columns based on Best Value Distribution

I have a dataframe (df) in pandas/python with ['Product','OrderDate','Sales'].
I noticed that for some rows the values have a better distribution (as in a histogram) than for others. By "best" I mean the shape is more spread out, or the spread of values makes the shape look wider than for other rows.
If I want to pick, from say 700+ Products, those with more spread-out values, is there a way to do that easily in pandas/python?
Thanks in advance.
The caveat here is that I'm not a stats expert, but basically scipy has a number of tests you can conduct on your data to test whether it can be considered a normal (Gaussian) distribution.
Here I create two series: one is simply a linear range, and the other is a random normal sample with mean 50 and standard deviation 25.
In [48]:
import numpy as np
import pandas as pd
import scipy.stats as stats

df = pd.DataFrame({'linear': np.arange(100), 'normal': np.random.normal(50, 25, 100)})
df
Out[48]:
linear normal
0 0 66.565374
1 1 63.453899
2 2 65.736406
3 3 65.848908
4 4 56.916032
5 5 93.870682
6 6 89.513998
7 7 9.949555
8 8 9.727099
9 9 47.072785
10 10 62.849321
11 11 33.263309
12 12 42.168484
13 13 38.488933
14 14 51.833459
15 15 54.911915
16 16 62.372709
17 17 96.928452
18 18 65.333546
19 19 26.341462
20 20 41.692790
21 21 22.852561
22 22 15.799415
23 23 50.600141
24 24 14.234088
25 25 72.428607
26 26 45.872601
27 27 80.783253
28 28 29.561586
29 29 51.261099
.. ... ...
70 70 32.826052
71 71 35.413106
72 72 49.415386
73 73 28.998378
74 74 32.237667
75 75 86.622402
76 76 105.098296
77 77 53.176413
78 78 -7.954881
79 79 60.313761
80 80 42.739641
81 81 56.667834
82 82 68.046688
83 83 72.189683
84 84 67.125708
85 85 24.798553
86 86 58.845761
87 87 54.559792
88 88 93.116777
89 89 30.209895
90 90 80.952444
91 91 57.895433
92 92 47.392336
93 93 13.136111
94 94 26.624532
95 95 53.461421
96 96 28.782809
97 97 16.342756
98 98 64.768579
99 99 68.410021
[100 rows x 2 columns]
From the scipy documentation, there are a number of tests we can use, which are combined to form normaltest: namely skewtest and kurtosistest. I cannot explain these in depth, but you can see that the p-value is poor for the linear series and relatively closer to 1 for the normal data:
In [49]:
print('linear skewtest teststat = %6.3f pvalue = %6.4f' % stats.skewtest(df['linear']))
print('normal skewtest teststat = %6.3f pvalue = %6.4f' % stats.skewtest(df['normal']))
print('linear kurtosis teststat = %6.3f pvalue = %6.4f' % stats.kurtosistest(df['linear']))
print('normal kurtosis teststat = %6.3f pvalue = %6.4f' % stats.kurtosistest(df['normal']))
print('linear normaltest teststat = %6.3f pvalue = %6.4f' % stats.normaltest(df['linear']))
print('normal normaltest teststat = %6.3f pvalue = %6.4f' % stats.normaltest(df['normal']))
linear skewtest teststat = 1.022 pvalue = 0.3070
normal skewtest teststat = -0.170 pvalue = 0.8652
linear kurtosis teststat = -5.799 pvalue = 0.0000
normal kurtosis teststat = -1.113 pvalue = 0.2656
linear normaltest teststat = 34.674 pvalue = 0.0000
normal normaltest teststat = 1.268 pvalue = 0.5304
From the scipy site:
When testing for normality of a small sample of t-distributed
observations and a large sample of normal distributed observation,
then in neither case can we reject the null hypothesis that the sample
comes from a normal distribution. In the first case this is because
the test is not powerful enough to distinguish a t and a normally
distributed random variable in a small sample.
So you'll have to try the above and see if it fits what you want. Hope this helps.
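To apply this idea to the original question, one rough sketch (the 'Product'/'Sales' layout comes from the question; the per-group test is my own suggestion) would run the test for each product:
import scipy.stats as stats

# p-value of the normality test for each product's Sales
# (normaltest warns for samples smaller than 20 observations)
pvals = df.groupby('Product')['Sales'].apply(lambda s: stats.normaltest(s)[1])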
Sure. What you'd like to do here is find the 700 entries with the largest standard deviation.
pandas.DataFrame.std() will return the standard deviation for an axis, and then you just need to keep track of the entries with the highest corresponding values.
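A minimal sketch of that approach, assuming df has the ['Product','OrderDate','Sales'] columns from the question:
# standard deviation of Sales per product, largest spread first
spread = df.groupby('Product')['Sales'].std().nlargest(700)

# keep only the rows belonging to the most spread-out products
df_top = df[df['Product'].isin(spread.index)]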
[Figure: large standard deviation vs. small standard deviation]
