The code below takes a dataframe, filters it by a string in one column, and then plots the values of another column.
I plot the values as a histogram, and that worked fine until I added the mean, median and standard deviation. Now I just get an empty graph, where instead all of the variables mentioned below should be plotted together in one graph with their labels.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
df = pd.read_csv(r'C:/Users/output.csv', delimiter=";", encoding='unicode_escape')
df['Plot_column'] = df['Plot_column'].str.split(',').str[0]
df['Plot_column'] = df['Plot_column'].astype('int64', copy=False)
X=df[df['goal_colum']=='start running']['Plot_column'].values
dev_x= X
mean_=np.mean(dev_x)
median_=np.median(dev_x)
standard_=np.std(dev_x)
plt.hist(dev_x, bins=5)
plt.plot(mean_, label='Mean')
plt.plot(median_, label='Median')
plt.plot(standard_, label='Std Deviation')
plt.title('Data')
https://matplotlib.org/3.1.1/gallery/statistics/histogram_features.html
There are two major ways to plot in matplotlib: the pyplot interface (the easy way) and the Axes interface (the hard way). The Axes interface lets you customize your plot more, and you should work towards using it. Try something like the following:
num_bins = 50
fig, ax = plt.subplots()
# the histogram of the data
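n, bins, patches = ax.hist(dev_x, num_bins, density=True)
# plot() with a single scalar draws nothing visible, so mark each statistic with a vertical line
ax.axvline(np.mean(dev_x), color='r', label='Mean')
ax.axvline(np.median(dev_x), color='g', label='Median')
# the standard deviation is a spread, not a position, so show it as an offset from the mean
ax.axvline(np.mean(dev_x) + np.std(dev_x), color='k', linestyle='--', label='Mean + Std Deviation')
ax.legend()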
# Tweak spacing to prevent clipping of ylabel
fig.tight_layout()
plt.show()
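For a version that runs without the CSV, here is a minimal sketch. It assumes synthetic data in place of the filtered column (dev_x is randomly generated here), and the colors, bin count and the shaded band for the standard deviation are arbitrary choices of mine, not part of the original code.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-in for the filtered column (an assumption, not the original CSV).
rng = np.random.default_rng(0)
dev_x = rng.normal(loc=50, scale=10, size=1000)

mean_ = np.mean(dev_x)
median_ = np.median(dev_x)
std_ = np.std(dev_x)

fig, ax = plt.subplots()
ax.hist(dev_x, bins=20, alpha=0.6)

# Vertical lines mark positions on the x-axis.
mean_line = ax.axvline(mean_, color="red", linestyle="--",
                       label=f"Mean = {mean_:.1f}")
median_line = ax.axvline(median_, color="black", linestyle=":",
                         label=f"Median = {median_:.1f}")
# The standard deviation is a width, so show it as a band around the mean.
std_band = ax.axvspan(mean_ - std_, mean_ + std_, color="orange", alpha=0.2,
                      label=f"Mean ± 1 std ({std_:.1f})")

ax.set_title("Data")
ax.legend()
fig.tight_layout()
```

Call plt.show() afterwards to display the figure.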
I frequently find myself working in log units for my plots, for example taking np.log10(x) of data before binning it or creating contour plots. The problem is, when I then want to make the plots presentable, the axes are in ugly log units, and the tick marks are evenly spaced.
If I let matplotlib do all the conversions, i.e. by calling ax.set_xscale('log'), then I get very nice looking axes. However, I can't do that to my data since it is, e.g., already binned in log units. I could manually change the tick labels, but that wouldn't make the tick spacing logarithmic. I suppose I could also manually specify the position of every minor tick so that it has log spacing, but is that the only way to achieve this? That is a bit tedious, so it would be nice if there were a better way.
For concreteness, here is a plot:
I want to have the tick labels as 10^x and 10^y (so '1' is '10', '2' is '100', etc.), and I want the minor ticks to be drawn as ax.set_xscale('log') would draw them.
Edit: For further concreteness, suppose the plot is generated from an image, like this:
import matplotlib.pyplot as plt
import scipy.misc
img = scipy.misc.face()
x_range = [-5,3] # log10 units
y_range = [-55, -45] # log10 units
p = plt.imshow(img,extent=x_range+y_range)
plt.show()
and all we want to do is change the axes appearance as I have described.
Edit 2: Ok, ImportanceOfBeingErnest's answer is very clever but it is a bit more specific to images than I wanted. I have another example, of binned data this time. Perhaps their technique still works on this, though it is not clear to me if that is the case.
import numpy as np
import pandas as pd
import datashader as ds
from matplotlib import pyplot as plt
import scipy.stats as sps
v1 = sps.lognorm(loc=0, scale=3, s=0.8)
v2 = sps.lognorm(loc=0, scale=1, s=0.8)
x = np.log10(v1.rvs(100000))
y = np.log10(v2.rvs(100000))
x_range=[np.min(x),np.max(x)]
y_range=[np.min(y),np.max(y)]
df = pd.DataFrame.from_dict({"x": x, "y": y})
#------ Aggregate the data ------
cvs = ds.Canvas(plot_width=30, plot_height=30, x_range=x_range, y_range=y_range)
agg = cvs.points(df, 'x', 'y')
# Create contour plot
fig = plt.figure()
ax = fig.add_subplot(111)
ax.contourf(agg, extent=x_range+y_range)
ax.set_xlabel("x")
ax.set_ylabel("y")
plt.show()
The general answer to this question is probably given in this post:
Can I mimic a log scale of an axis in matplotlib without transforming the associated data?
However, an easy option here might be to scale the content of the axes and then set the axes to a log scale.
A. image
You may plot your image on a logarithmic scale but make all pixels the same size in log units. Unfortunately, imshow no longer supports this kind of image, but pcolormesh can be used for that purpose.
import numpy as np
import matplotlib.pyplot as plt
import scipy.misc
img = scipy.misc.face()
extx = [-5,3] # log10 units
exty = [-45, -55] # log10 units
x = np.logspace(extx[0],extx[-1],img.shape[1]+1)
y = np.logspace(exty[0],exty[-1],img.shape[0]+1)
X,Y = np.meshgrid(x,y)
c = img.reshape((img.shape[0]*img.shape[1],img.shape[2]))/255.0
m = plt.pcolormesh(X,Y,X[:-1,:-1], color=c, linewidth=0)
m.set_array(None)
plt.gca().set_xscale("log")
plt.gca().set_yscale("log")
plt.show()
B. contour
The same concept can be used for a contour plot.
import numpy as np
from matplotlib import pyplot as plt
x = np.linspace(-1.1,1.9)
y = np.linspace(-1.4,1.55)
X,Y = np.meshgrid(x,y)
agg = np.exp(-(X**2+Y**2)*2)
fig, ax = plt.subplots()
plt.gca().set_xscale("log")
plt.gca().set_yscale("log")
exp = lambda x: 10.**(np.array(x))
cf = ax.contourf(exp(X), exp(Y),agg, extent=exp([x.min(),x.max(),y.min(),y.max()]))
ax.set_xlabel("x")
ax.set_ylabel("y")
plt.show()
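If the data has to stay in log10 units, a different route (the manual one mentioned in the question, not the technique used above) is to keep linear axes and only restyle the ticks. The following is a rough sketch: the helper names log_minor_positions and style_axis_as_log are hypothetical, and the contour data is just a stand-in for something already binned in log units.

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.ticker import FixedLocator, FuncFormatter, MultipleLocator

def log_minor_positions(lo, hi):
    """Positions, in log10 units, of the 2..9 multiples of each decade in [lo, hi]."""
    decades = np.arange(np.floor(lo), np.ceil(hi))
    ticks = (decades[:, None] + np.log10(np.arange(2, 10))[None, :]).ravel()
    return ticks[(ticks >= lo) & (ticks <= hi)]

def style_axis_as_log(axis, lo, hi):
    """Make a linear axis that holds log10 values look like a log axis."""
    axis.set_major_locator(MultipleLocator(1))  # one major tick per decade
    axis.set_major_formatter(
        FuncFormatter(lambda v, pos: rf"$10^{{{int(round(v))}}}$"))
    axis.set_minor_locator(FixedLocator(log_minor_positions(lo, hi)))

# Stand-in for data that is already in log10 units (an assumption).
x = np.linspace(-1.1, 1.9)
y = np.linspace(-1.4, 1.55)
X, Y = np.meshgrid(x, y)
agg = np.exp(-(X**2 + Y**2) * 2)

fig, ax = plt.subplots()
ax.contourf(X, Y, agg)
style_axis_as_log(ax.xaxis, x.min(), x.max())
style_axis_as_log(ax.yaxis, y.min(), y.max())
```

The data is never transformed; only the locators and the formatter change, so this should work the same way for imshow or any other artist. Call plt.show() to display the result.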
I'm trying to create a CDF, but at the end of the graph there is a vertical line, shown below:
I've read that this is because matplotlib uses the end of the bins to draw the vertical lines, which makes sense, so I added this to my code:
bins = sorted(X) + [np.inf]
where X is the data set I'm using, and passed it as the bins when plotting:
plt.hist(X, bins = bins, cumulative = True, histtype = 'step', color = 'b')
This does remove the line at the end and produces the desired effect. However, when I now normalise this graph it raises an error:
ymin = max(ymin*0.9, minimum) if not input_empty else minimum
UnboundLocalError: local variable 'ymin' referenced before assignment
Is there any way to either normalise the data with
bins = sorted(X) + [np.inf]
in my code or is there another way to remove the line on the graph?
An alternative way to plot a CDF would be as follows (in my example, X is a bunch of samples drawn from the unit normal):
import numpy as np
import matplotlib.pyplot as plt
X = np.random.randn(10000)
n = np.arange(1, len(X) + 1) / len(X)  # np.float was removed in NumPy 1.24; true division already yields floats
Xs = np.sort(X)
fig, ax = plt.subplots()
ax.step(Xs,n)
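The same idea can be wrapped in a small reusable function. This is only a sketch: the name ecdf is hypothetical, the seeded generator is there for reproducibility, and where="post" is my own choice so that each level is held until the next sample.

```python
import numpy as np
import matplotlib.pyplot as plt

def ecdf(samples):
    """Return the sorted samples and the fraction of samples <= each of them."""
    xs = np.sort(np.asarray(samples, dtype=float))
    fractions = np.arange(1, xs.size + 1) / xs.size
    return xs, fractions

rng = np.random.default_rng(0)
X = rng.standard_normal(10000)
xs, fractions = ecdf(X)

fig, ax = plt.subplots()
# where="post" keeps each level until the next sample (right-continuous CDF).
ax.step(xs, fractions, where="post")
ax.set_xlabel("x")
ax.set_ylabel("cumulative fraction")
```

Because no bins are involved, there is no final bin edge and therefore no closing vertical line, and the curve is already normalised to end at 1.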
I needed a solution where I would not need to alter the rest of my code (using plt.hist(...) or, with pandas, dataframe.plot.hist(...)) and that I could reuse easily many times in the same jupyter notebook.
I now use this little helper function to do so:
def fix_hist_step_vertical_line_at_end(ax):
    # requires: import matplotlib as mpl
    axpolygons = [poly for poly in ax.get_children() if isinstance(poly, mpl.patches.Polygon)]
    for poly in axpolygons:
        # drop the last vertex, which draws the closing vertical line
        poly.set_xy(poly.get_xy()[:-1])
Which can be used like this (without pandas):
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
X = np.sort(np.random.randn(1000))
fig, ax = plt.subplots()
plt.hist(X, bins=100, cumulative=True, density=True, histtype='step')
fix_hist_step_vertical_line_at_end(ax)
Or like this (with pandas):
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
df = pd.DataFrame(np.random.randn(1000))
fig, ax = plt.subplots()
ax = df.plot.hist(ax=ax, bins=100, cumulative=True, density=True, histtype='step', legend=False)
fix_hist_step_vertical_line_at_end(ax)
This works well even if you have multiple cumulative density histograms on the same axes.
Warning: this may not lead to the wanted results if your axes contain other patches falling under the mpl.patches.Polygon category. That was not my case so I prefer using this little helper function in my plots.
Assuming that your intentions are purely aesthetic, add a vertical line of the same color as your plot background:
ax.axvline(x = value, color = 'white', linewidth = 2)
Where "value" stands for the right extreme of the rightmost bin.
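One way to obtain "value" without hard-coding it is to take the last bin edge returned by hist. A small sketch, assuming random sample data and using the axes face color instead of a fixed 'white' (both are my assumptions):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
X = rng.standard_normal(1000)

fig, ax = plt.subplots()
# hist returns the bin edges as its second value; the last edge is the right extreme.
n, bin_edges, patches = ax.hist(X, bins=100, cumulative=True, density=True,
                                histtype='step', color='b')
value = bin_edges[-1]

# Paint over the closing vertical segment with the axes background color.
cover = ax.axvline(x=value, color=ax.get_facecolor(), linewidth=2)
```

Note that the covering line spans the full height of the axes, so it will also hide anything else drawn at that x position.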
I am trying to plot a bar plot where each bin has a different length, and as a result I end up with a very ugly result. What I would like to do is still be able to define bins of different lengths but have all the bars plotted with the same fixed width. How can I do that? Here is what I have done so far:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_palette("deep", desat=.6)
sns.set_context(rc={"figure.figsize": (8, 4)})
np.random.seed(9221999)
data = [0,2,30,40,50,10,50,40,150,70,150,10,3,70,70,90,10,2]
bins = [0,1,2,3,4,5,6,7,8,9,10,20,30,40,50,60,70,80,90,100,200]
plt.hist(data, bins=bins);
EDIT
This question has been marked as a duplicate, but in fact none of the proposed links solved my problem; the 1st is a very crappy workaround and the 2nd doesn't solve the problem at all, as it sets all bars' width to a certain number.
Here you go, with seaborn, as you please. But you have to understand that seaborn itself uses matplotlib to create plots.
AND: Please delete your other question, now it really is a duplicate.
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_palette("deep", desat=.6)
sns.set_context(rc={"figure.figsize": (8, 4)})
data = [0,2,30,40,50,10,50,40,150,70,150,10,3,70,70,90,10,2]
bins = [0,1,2,3,4,5,6,7,8,9,10,20,30,40,50,60,70,80,90,100,200]
bin_middles = bins[:-1] + np.diff(bins)/2.
bar_width = 1.
m, bins = np.histogram(data, bins)
plt.bar(np.arange(len(m)) + (1-bar_width)/2., m, width=bar_width)
ax = plt.gca()
ax.set_xticks(np.arange(len(bins)))
ax.set_xticklabels(['{:.0f}'.format(i) for i in bins])
plt.show()
Personally, I think that plotting your data like this is confusing. Non-linear (or non-log) axis scaling is usually not a good idea.
Do you want to place a bar with a fixed width at the center of each bin?
If so, try something similar to this:
import numpy as np
import matplotlib.pyplot as plt
data = [0,2,30,40,50,10,50,40,150,70,150,10,3,70,70,90,10,2]
bins = [0,1,2,3,4,5,6,7,8,9,10,20,30,40,50,60,70,80,90,100,200]
counts, _ = np.histogram(data, bins)
centers = np.mean([bins[:-1], bins[1:]], axis=0)
plt.bar(centers, counts, width=5, align='center')
plt.show()
As far as I know, the option log=True in the histogram function only refers to the y-axis.
P.hist(d,bins=50,log=True,alpha=0.5,color='b',histtype='step')
I need the bins to be equally spaced in log10. Is there something that can do this?
Use logspace() to create a geometric sequence and pass it to the bins parameter. Then set the scale of the x-axis to log.
import pylab as pl
import numpy as np
data = np.random.normal(size=10000)
pl.hist(data, bins=np.logspace(np.log10(0.1),np.log10(1.0), 50))
pl.gca().set_xscale("log")
pl.show()
The most direct way is to just compute the log10 of the limits, compute linearly spaced bins, and then convert back by raising to the power of 10, as below:
import pylab as pl
import numpy as np
data = np.random.normal(size=10000)
MIN, MAX = .01, 10.0
pl.figure()
pl.hist(data, bins = 10 ** np.linspace(np.log10(MIN), np.log10(MAX), 50))
pl.gca().set_xscale("log")
pl.show()
The following code indicates how you can use bins='auto' with the log scale.
import numpy as np
import matplotlib.pyplot as plt
data = 10**np.random.normal(size=500)
_, bins = np.histogram(np.log10(data + 1), bins='auto')
plt.hist(data, bins=10**bins);
plt.gca().set_xscale("log")
In addition to what was stated, performing this on pandas dataframes works as well:
some_column_hist = dataframe['some_column'].plot(bins=np.logspace(-2, np.log10(max_value), 100), kind='hist', loglog=True, xlim=(0,max_value))
I would caution that there may be an issue with normalizing the bins. Each bin is wider than the previous one, and therefore its count must be divided by its width to normalize the frequencies before plotting. It seems that neither my solution nor HYRY's solution accounts for this.
Source: https://arxiv.org/pdf/cond-mat/0412004.pdf
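The normalization described above could look roughly like this. It is a sketch under stated assumptions: the data is synthetic log-normal noise, the number of bins (30 edges) is arbitrary, and the counts are computed with np.histogram and drawn with bar.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
data = 10 ** rng.normal(size=5000)  # positive, log-normally distributed samples

# Bin edges equally spaced in log10.
edges = np.logspace(np.log10(data.min()), np.log10(data.max()), 30)
counts, edges = np.histogram(data, bins=edges)
widths = np.diff(edges)

# Divide by bin width and by the number of counted samples,
# so that the bars integrate to 1 (a probability density).
density = counts / (widths * counts.sum())

fig, ax = plt.subplots()
ax.bar(edges[:-1], density, width=widths, align="edge")
ax.set_xscale("log")
ax.set_yscale("log")
```

Passing density=True to np.histogram (or to plt.hist) performs the same division by bin width, so the manual step is mainly useful when the raw counts are needed as well.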