I'm trying to plot a probability distribution using a pandas.Series and I'm struggling to set different yerr for each bar. In summary, I'm plotting the following distribution:
The plot comes from a Series and works fine, except for the yerr: since the values are probabilities, the error bars must not extend above 1 or below 0. So I'd like to set different errors for each bar. Therefore, I went to the documentation, which is available here and here.
According to them, I have three options for either yerr or xerr:
scalar: Symmetric +/- values for all data points.
shape(N,): Symmetric +/- values for each data point.
shape(2, N): Separate - and + values for each bar. The first row contains the lower errors, the second row contains the upper errors.
The case I need is the last one, and I can pass it as a DataFrame, Series, array-like, dict or str. Thus I set separate arrays for the lower and upper errors; however, it's not working as expected. To reproduce what's happening, I prepared the following examples:
First I set a pandas.Series:
import pandas as pd

se = pd.Series(data=[0.1, 0.2, 0.3, 0.4, 0.4, 0.5, 0.2, 0.1, 0.1],
               index=list('abcdefghi'))
Then, I'm replicating each case:
This works as expected:
err1 = [0.2]*9
se.plot(kind="bar", width=1.0, yerr=err1)
This also works as expected:
err2 = err1.copy()
err2[3] = 0.5
se.plot(kind="bar", width=1.0, yerr=err2)
Now the problem: this doesn't work as expected!
err_up = [0.3]*9
err_low = [0.1]*9
err3 = [err_low, err_up]
se.plot(kind="bar", width=1.0, yerr=err3)
It's not setting different lower and upper errors. I found an example here and a similar SO question here; although they use matplotlib instead of pandas, the same approach should work here.
I'd be glad to hear of any solution to this. Thank you.
Strangely, plt.bar works as expected:
import matplotlib.pyplot as plt

err_up = [0.3]*9
err_low = [0.1]*9
err3 = [err_low, err_up]
fig, ax = plt.subplots()
ax.bar(se.index, se, width=1.0, yerr=err3)
plt.show()
Output:
A bug/feature/design-decision of pandas maybe?
Based on @Quanghoang's comment, I started to think it was a bug. So I tried changing the yerr shape and, surprisingly, the following code worked:
err_up = [0.3]*9
err_low = [0.1]*9
err3 = [[err_low, err_up]]
print (err3)
se.plot(kind="bar", width=1.0, yerr=err3)
Note that I added a new axis to err3, so it is now a (1, 2, N) array, even though the documentation says it should be (2, N).
In addition, a possible workaround I found was to call ax.set_ylim(0, 1). It doesn't fix the error bars themselves, but at least the graph is clipped to the valid range.
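A plausible explanation (an assumption on my part, not something the documentation states) is that pandas expects one (2, N) error specification per plotted column, so for a single Series the outer axis indexes the one and only column. A minimal self-contained sketch of the working shape, combined with the ylim workaround:

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the example runs headless

se = pd.Series(data=[0.1, 0.2, 0.3, 0.4, 0.4, 0.5, 0.2, 0.1, 0.1],
               index=list('abcdefghi'))
err_low = [0.1] * 9
err_up = [0.3] * 9

# one (2, N) block per plotted column -> shape (n_columns, 2, N) = (1, 2, 9)
err3 = np.array([[err_low, err_up]])

ax = se.plot(kind="bar", width=1.0, yerr=err3)
ax.set_ylim(0, 1)  # keep the probability axis inside [0, 1]
```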
Using Matplotlib, I am trying to create a line chart, but I am facing the issue below. Here is the code; can someone help me with any suggestions?
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates

# Data: rows of [Date, Count1, Count2, Count3], e.g. ['01-10-2010', 100, 0, 100]
Head = ['Date', 'Count1', 'Count2', 'Count3']
df9 = pd.DataFrame(Data, columns=Head)
df9.set_index('Date', inplace=True)
fig, ax = plt.subplots(figsize=(15, 10))
df9.plot(ax=ax)
ax.xaxis.set_major_locator(mdates.WeekdayLocator(mdates.SATURDAY))
ax.xaxis.set_major_formatter(mdates.DateFormatter('%b %d'))
plt.legend()
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.savefig('Line.png')
I am getting the error below:
Error: Matplotlib UserWarning: Attempting to set identical left == right == 737342.0 results in singular transformations; automatically expanding (ax.set_xlim(left, right))
Sample data:
01-10-2010, 100, 0, 100
X axis: I am trying to display a date tick every Saturday.
Y axis: the other three counts.
Can someone please explain what this issue is and how I can fix it?
The issue is caused by the fact that somehow, pandas.DataFrame.plot explicitly sets the x- and y- limits of your plot to the limits of your data. This is normally fine, and no one notices. In fact, I had a lot of trouble finding references to your warning anywhere at all, much less the Pandas bug list.
The workaround is to set your own limits manually in your call to DataFrame.plot:
if len(df9) == 1:
    delta = pd.Timedelta(days=1)
    lims = [df9.index[0] - delta, df9.index[0] + delta]
else:
    lims = [None, None]
df9.plot(ax=ax, xlim=lims)
This issue can also arise in a trickier situation, when you do not have only one point, but only one point can actually appear on your plot: typically, when only one point is > 0 and your plot's y-scale is logarithmic.
One should always set limits explicitly on a log scale when there are zero values, because there is no way for the program to decide on a good lower limit for the scale.
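A minimal sketch of that situation (toy data of my own, not from the question): with a single positive value and a log y-scale, setting explicit limits avoids leaving matplotlib to guess a degenerate range:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend for the example
import matplotlib.pyplot as plt

y = np.array([0.0, 0.0, 5.0, 0.0])  # only one point is > 0
fig, ax = plt.subplots()
ax.plot(np.arange(len(y)), y, marker='o')
ax.set_yscale("log")
# with a single positive value there is no way to pick a sensible
# lower limit automatically, so set the limits explicitly
ax.set_ylim(1e-1, 1e1)
```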
I am doing a multiple regression in Stan.
I want a trace plot of the beta vector parameter for the regressors/design matrix.
When I do the following:
fit = model.sampling(data=data, iter=2000, chains=4)
fig = fit.plot('beta')
I get a pretty horrid image:
I was after something a little more user-friendly. I have managed to hack together the following, which is closer to what I am after.
My hack plugs into the back of pystan as follows.
import matplotlib.pyplot as plt
from pystan.external.pymc import plots

r = fit.extract()  # r for results
param = 'beta'
beta = r[param]
name = df.columns.values.tolist()
(rows, cols) = beta.shape
assert len(df.columns) == cols
values = {param + '[' + str(k + 1) + '] ' + name[k]: beta[:, k]
          for k in range(cols)}
fig = plots.traceplot(values, values.keys())
for a in fig.axes:
    # shorten the y-labels
    l = a.get_ylabel()
    if l == 'frequency':
        a.set_ylabel('freq')
    if l == 'sample value':
        a.set_ylabel('val')
fig.set_size_inches(8, 12)
fig.tight_layout(pad=1)
fig.savefig(g_dir + param + '-trace.png', dpi=125)
plt.close()
My question - surely I have missed something - but is there an easier way to get the kind of output I am after from pystan for a vector parameter?
I discovered that the ArviZ module does this pretty well.
ArviZ can be found here: https://arviz-devs.github.io/arviz/
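For example (a sketch: I fake the posterior draws with az.from_dict so the snippet is self-contained; with a real PyStan fit you would build the data with az.from_pystan(posterior=fit) instead):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend for the example
import arviz as az

# fake posterior draws shaped (chains, draws, *dims), standing in for a
# real fit; with PyStan: idata = az.from_pystan(posterior=fit)
rng = np.random.default_rng(0)
idata = az.from_dict(posterior={"beta": rng.normal(size=(4, 500, 3))})

# density + trace panels for the whole beta vector in one call
axes = az.plot_trace(idata, var_names=["beta"])
```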
I also struggled with this and just found a way to extract the parameters for the traceplot (the betas, I already knew).
When you do your fit, you can save it to a dataframe:
fit_df = fit.to_dataframe()
Now you have a new variable, your dataframe. Yes, it took me a while to find that pystan had a straightforward way to save the fit to a dataframe.
With that at hand you can check your dataframe. You can see its header by printing the keys:
fit_df.keys()
the output is something like this:
Index([u'chain', u'chain_idx', u'warmup', u'accept_stat__', u'energy__',
u'n_leapfrog__', u'stepsize__', u'treedepth__', u'divergent__',
u'beta[1,1]', ...
u'eta05[892]', u'eta05[893]', u'eta05[894]', u'eta05[895]',
u'eta05[896]', u'eta05[897]', u'eta05[898]', u'eta05[899]',
u'eta05[900]', u'lp__'],
dtype='object', length=9037)
Now, you have everything you need! The betas are in columns, as well as the chain ids. That's all you need to plot the betas and the traceplot. Therefore, you can manipulate it in any way you want and customize your figures as you wish. I'll show you an example of how I did it:
import matplotlib.pyplot as plt
import seaborn as sns

chain_idx = fit_df['chain_idx']
beta11 = fit_df['beta[1,1]']
beta12 = fit_df['beta[1,2]']
plt.subplots(figsize=(15,3))
plt.subplot(1,4,1)
sns.kdeplot(beta11)
plt.subplot(1,4,2)
plt.plot(chain_idx, beta11)
plt.subplot(1,4,3)
sns.kdeplot(beta12)
plt.subplot(1,4,4)
plt.plot(chain_idx, beta12)
plt.tight_layout()
plt.show()
The image from the above plot!
I hope it helps (if you still need it) ;)
I receive the error ValueError: A value in x_new is above the interpolation range from scipy's interp1d function. Normally, this error would be generated if x was not monotonically increasing.
import numpy as np
import scipy.interpolate as spi

def refine(coarsex, coarsey, step):
    finex = np.arange(min(coarsex), max(coarsex) + step, step)
    intfunc = spi.interp1d(coarsex, coarsey, axis=0)
    finey = intfunc(finex)
    return finex, finey

for num, tfile in enumerate(files):
    tfile = tfile.dropna(how='any')
    x = np.array(tfile['col1'])
    y = np.array(tfile['col2'])
    finex, finey = refine(x, y, 0.01)
The code is correct, because it successfully worked on 6 data files and threw the error for the 7th. So there must be something wrong with the data. But as far as I can tell, the data increase all the way down.
I am sorry for not providing an example, because I am not able to reproduce the error on an example.
There are two things that could help me:
1. Some brainstorming: if the data are indeed monotonically increasing, what else could produce this error? Another hint, regarding the decimals, could be in this question, but I think my solution (the min and max of x) is robust enough to avoid it. Or isn't it?
2. Is it possible (and how?) to return the value of x_new and its index when throwing the ValueError: A value in x_new is above the interpolation range, so that I could actually see where in the file the problem is?
UPDATE
So the problem is that, for some reason, max(finex) is larger than max(coarsex) (one is .x39 and the other is .x4). I hoped rounding the original values to 2 significant digits would solve the problem, but it didn't: it displays fewer digits but still computes with the ones not displayed. What can I do about it?
If you are running Scipy v0.17.0 or newer, then you can pass fill_value='extrapolate' to spi.interp1d, and it will extrapolate to accommodate these values of yours that lie outside the interpolation range. So define your interpolation function like so:
intfunc = spi.interp1d(coarsex, coarsey,axis=0, fill_value="extrapolate")
Be forewarned, however!
Depending on what your data look like and the type of interpolation you are performing, the extrapolated values can be erroneous. This is especially true if you have noisy or non-monotonic data. In your case you might be OK, because your x_new value is only slightly beyond your interpolation range.
Here's a simple demonstration of how this feature can work nicely but can also give erroneous results.
import scipy.interpolate as spi
import numpy as np
x = np.linspace(0,1,100)
y = x + np.random.randint(-1,1,100)/100
x_new = np.linspace(0,1.1,100)
intfunc = spi.interp1d(x,y,fill_value="extrapolate")
y_interp = intfunc(x_new)
import matplotlib.pyplot as plt
plt.plot(x_new,y_interp,'r', label='interp/extrap')
plt.plot(x,y, 'b--', label='data')
plt.legend()
plt.show()
So the interpolated portion (in red) worked well, but the extrapolated portion clearly fails to follow the otherwise linear trend in this data because of the noise. So have some understanding of your data and proceed with caution.
A quick test of your finex calculation shows that it can (always?) get into the extrapolation region.
In [124]: coarsex = np.random.rand(100)

In [125]: max(coarsex)
Out[125]: 0.97393109991816473

In [126]: step = .01; finex = np.arange(min(coarsex), max(coarsex)+step, step); (max(finex), max(coarsex))
Out[126]: (0.98273730602114795, 0.97393109991816473)

In [127]: step = .001; finex = np.arange(min(coarsex), max(coarsex)+step, step); (max(finex), max(coarsex))
Out[127]: (0.97473730602114794, 0.97393109991816473)
Again it is a quick test, and may be missing some critical step or value.
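One way to avoid the overshoot (a sketch of my own, not from either answer) is to clip the fine grid back to the data range before interpolating, so interp1d never sees a point above the interpolation range:

```python
import numpy as np

coarsex = np.random.rand(100)
step = 0.01

finex = np.arange(coarsex.min(), coarsex.max() + step, step)
# drop any grid points that np.arange pushed past the data maximum
finex = finex[finex <= coarsex.max()]
```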
I am trying to fill between two lines using Bokeh. My two datasets contain sections of NaNs. The patch renders correctly for the last section of data, but fails for sections before any NaN blocks. The following example illustrates the problem:
from bokeh.plotting import figure, output_file, show
import numpy as np
p = figure(plot_width=400, plot_height=300)
mx = np.array(np.random.randint(20, 25, 30), dtype=float)
mx[7:11] = np.nan
mx[19:23] = np.nan
mn = mx-10
x = np.arange(0, len(mn))
wX = np.append(x, x[::-1])
wY = np.append(mx, mn[::-1])
p.patch(wX, wY)
show(p)
This produces the following figure:
I would like the first two parallel line sections to plot with a fill-between as the final section is plotting. Instead, these sections seem to be applying the patch just to the line segments themselves. I have a solution that creates individual patches by looping over each contiguous section of data, but it is too slow over many 100s of patches.
As far as I can tell, Bokeh renders the patches you provide correctly. Keep in mind that passing NaNs separates individual patches. That makes it a bit strange that you pass multiple consecutive NaNs, which doesn't add anything. It's also a bit confusing to specify a valid x-coordinate together with a NaN y-coordinate.
Just as with line() and multi_line(), NaN values can be passed to patch() and patches() glyphs. In this case, you end up with single logical patch objects, that have multiple disjoint components when rendered
http://docs.bokeh.org/en/latest/docs/user_guide/plotting.html
I have added the x, y coordinates to the first patch I get when running your code. Perhaps you intend something different, but Bokeh is rendering what you specify correctly.
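If looping over individual patch() calls is too slow, one option (my own sketch, not from the answer) is to build one closed polygon per contiguous run of valid data and pass them all to a single patches() call. The run-splitting can be vectorised with NumPy; band_polygons is a hypothetical helper name:

```python
import numpy as np

def band_polygons(x, upper, lower):
    """Split a band with NaN gaps into one closed polygon per contiguous run."""
    xs, ys = [], []
    good = ~np.isnan(upper) & ~np.isnan(lower)
    # pad with zeros so diff marks run starts (+1) and one-past-run ends (-1)
    edges = np.flatnonzero(np.diff(np.concatenate(([0], good.astype(int), [0]))))
    for start, stop in edges.reshape(-1, 2):
        seg = slice(start, stop)
        xs.append(np.concatenate((x[seg], x[seg][::-1])))
        ys.append(np.concatenate((upper[seg], lower[seg][::-1])))
    return xs, ys

# with the question's data: runs 0..6, 11..18 and 23..29 -> three polygons
mx = np.array(np.random.randint(20, 25, 30), dtype=float)
mx[7:11] = np.nan
mx[19:23] = np.nan
mn = mx - 10
x = np.arange(len(mx), dtype=float)
xs, ys = band_polygons(x, mx, mn)
# p.patches(xs, ys)  # one Bokeh call instead of many patch() calls
```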
This question seems to have been asked a few times already, but for some reason it doesn't work for me.
I am making a plt.errorbar plot from the arrays of points results['logF'], results['R'], which are in a pandas DataFrame. I want to scale the colour of the points with a third variable, results['M']. I've tried various things but I always get some kind of error; I'm clearly doing something wrong, but I can't find any place that explains exactly what is required.
So firstly, results['M'] is a bunch of floats in the range 0 - 13. So as I understand it, I need to normalise them, which I did with matplotlib.colors.Normalize(vmin=0.0, vmax=13.0).
When I try plotting with the following code:
results = get_param_results(totP)
colormap = mlb.colors.Normalize(vmin=0.0, vmax=13.0)
mass_color = np.array(colormap(results['M']))
#import pdb; pdb.set_trace()
plt.errorbar(results['logF'], results['R'], marker='x',
             mew=1.2, ms=3.5, capsize=2, c=mass_color,
             yerr=[results['R_l'], results['R_u']],
             xerr=[results['logF_l'], results['logF_u']],
             elinewidth=1.2)
I get an error ValueError: Color array must be two-dimensional. Not sure why it should be two dimensional. In other stackoverflow threads, they pass one dimensional arrays and it's fine.
Using a different form (basically just copying the style from another stackoverflow thread), I write:
results = get_param_results(totP)
colormap = mlb.colors.Normalize(vmin=0.0, vmax=13.0)
#import pdb; pdb.set_trace()
plt.errorbar(results['logF'], results['R'], marker='x',
             mew=1.2, ms=3.5, capsize=2, c=results['M'],
             cmap=mlb.cm.jet, norm=colormap,
             yerr=[results['R_l'], results['R_u']],
             xerr=[results['logF_l'], results['logF_u']],
             elinewidth=1.2)
I get a different error, TypeError: There is no Line2D property "cmap"
I don't understand this either (it also doesn't recognise norm), scatter should definitely have the norm and cmap arguments.
Basically, I can't find any great explanations or tutorials on how to get the color scale with an errorbar plot. Can someone help?
Thanks.
EDIT:
Been asked to post the data I'm using. This is the .head() table of the results DataFrame (the full one has 257 rows).
R R_l R_u F F_l F_u \
0 1.486045 0.068775 0.068508 2.999561e+06 488301.994185 496244.025108
1 0.992957 0.062303 0.062664 4.583829e+04 6652.971755 6636.980813
2 1.422328 0.029163 0.029323 2.068257e+06 186692.732530 187685.738474
3 1.326820 0.094840 0.093995 1.049490e+06 185012.117516 184290.913875
4 0.887831 0.013825 0.013939 5.883107e+05 52537.237452 52492.326206
M M_l M_u logF logF_l logF_u
0 1.030471 0.122698 0.123368 6.471041 0.071150 0.072506
1 2.753916 0.157837 0.160584 4.656550 0.063427 0.063404
2 2.344767 0.340987 0.345171 6.313780 0.039261 0.039548
3 0.918979 0.069931 0.069984 6.014049 0.077296 0.077189
4 1.310289 0.076565 0.076805 5.767830 0.038848 0.038895
So basically:
results['M'] = array([ 1.03047146, 2.75391626, 2.34476658, 0.91897949, 1.31028926])
results['logF'] = array([ 6.47104102, 4.65655021, 6.31377955, 6.01404944, 5.76782953])
results['R'] = array([ 1.48604489, 0.99295713, 1.42232837, 1.3268205 , 0.88783067])
and etc... (for the error bars, just use an array([1,1,1,1,1]) to save time or something).
I reran the code, by replacing results with the above, and it still gives me ValueError: Color array must be two-dimensional
I'm not sure what the second dimension should be. Is there something obvious that I'm doing wrong when I'm calling the errorbar plot function?
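A common workaround (a sketch of my own, with toy stand-ins for the question's data): errorbar builds Line2D artists that accept only a single colour, so draw the bars colourless with fmt='none' and overlay the colour-mapped markers with scatter, which does accept c, cmap and norm:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend for the example
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors

# toy stand-ins for results['logF'], results['R'], results['M']
logF = np.array([6.47, 4.66, 6.31, 6.01, 5.77])
R = np.array([1.49, 0.99, 1.42, 1.33, 0.89])
M = np.array([1.03, 2.75, 2.34, 0.92, 1.31])
err = np.full(5, 0.05)

fig, ax = plt.subplots()
# errorbar takes only one colour, so draw the bars alone, without markers ...
ax.errorbar(logF, R, yerr=err, xerr=err, fmt="none",
            ecolor="gray", elinewidth=1.2, capsize=2)
# ... then overlay colour-mapped points with scatter
norm = mcolors.Normalize(vmin=0.0, vmax=13.0)
sc = ax.scatter(logF, R, c=M, cmap="viridis", norm=norm, marker="x")
fig.colorbar(sc, ax=ax, label="M")
```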