Setting specific display range for scatter plot axes in Matplotlib - python

I am trying to make a scatter plot showing the housing prices in Manhattan using the longitude and latitude from a data set. When I create the scatter plot, the output only shows the extreme values of the longitude, although all the other values are grouped in the range -74 to -72 (longitude). I don't know how to set a specific range on the x axis so that the plot shows the relevant data from the data set.
x = dataset_noise['Lon']
y = dataset_noise['Lat']
no_of_values = len(dataset_noise['Lon'])
index = np.arange(no_of_values)
plt.figure(figsize=(6,6))
plt.scatter(x, y, cmap=plt.get_cmap("jet"),linewidths=0.5,marker='.',alpha=0.2,label='Prices')
plt.title('House prices in Manhattan')
plt.show()
This is what I coded and the output

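You can use the following two Axes methods provided by Matplotlib:
1. set_xlim
2. set_ylim
Each accepts the lower (L) and upper (U) limits either as two separate arguments or as a single list/tuple, [L, U] or (L, U). With the pyplot interface, plt.xlim and plt.ylim do the same for the current axes.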
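As a minimal sketch of how these limits could be applied to the scatter plot from the question: the data below is a made-up stand-in for dataset_noise (a cluster of points plus two outliers), and the latitude window is an assumption chosen only for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical stand-in for dataset_noise: most points near Manhattan, plus two outliers
rng = np.random.default_rng(0)
lon = np.concatenate([rng.uniform(-74.0, -73.9, 200), [-120.0, 0.0]])
lat = np.concatenate([rng.uniform(40.70, 40.88, 200), [35.0, 50.0]])

fig, ax = plt.subplots(figsize=(6, 6))
ax.scatter(lon, lat, marker=".", alpha=0.2, label="Prices")

# Restrict the visible range; the outliers are still in the data, just outside the view
ax.set_xlim(-74, -72)        # longitude range mentioned in the question
ax.set_ylim((40.5, 41.0))    # assumed latitude window, passed as a tuple
ax.set_title("House prices in Manhattan")

xlim = tuple(float(v) for v in ax.get_xlim())
ylim = tuple(float(v) for v in ax.get_ylim())
```

With the pyplot calls used in the question, plt.xlim(-74, -72) placed before plt.show() would have the same effect.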

Related

Creating a pseudo color plot with a linear and nonlinear axis and computing values based on the center of grid values

I have the equation: z(x,y)=1+x^(2/3)y^(-3/4)
I would like to calculate values of z for x=[0,100] and y=[10^1,10^4]. I will do this for 100 points in each axis direction. My grid, then, will be 100x100 points. In the x-direction I want the points spaced linearly. In the y-direction I want the points spaced logarithmically.
Were I to need these values I could easily go through the following:
x=np.linspace(0,100,100)
y=np.logspace(1,4,100)
z=np.zeros( (len(x), len(y)) )
for i in range(len(x)):
    for j in range(len(y)):
        z[i,j]=1+x[i]**(2/3)*y[j]**(-3/4)
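For reference, the same grid can be computed without the double loop by using NumPy broadcasting. This is only a sketch of an equivalent formulation, assuming Python 3 (where 2/3 is a float division); the index convention z[i, j] for x[i], y[j] is kept from the loop.

```python
import numpy as np

x = np.linspace(0, 100, 100)   # linear spacing in x
y = np.logspace(1, 4, 100)     # logarithmic spacing in y, from 10^1 to 10^4

# x[:, None] has shape (100, 1) and y[None, :] has shape (1, 100),
# so the product broadcasts to shape (100, 100) with z[i, j] = f(x[i], y[j])
z = 1 + x[:, None] ** (2 / 3) * y[None, :] ** (-3 / 4)
```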
The problem for me comes with visualizing these results. I know that I would need to create a grid of points. I feel my options are to create a meshgrid with the values and then use pcolor.
My issue here is that the values at the center of the block do not coincide with the calculated values. In the x-direction I could fix this by shifting the x-vector by half of dx (the step between successive values). I'm not so sure how I would do this for the y-axis. Furthermore, if I wanted to compute values for each of the y-direction values, including the end points, they would not all show up.
In the final visualization I would like to have the y-axis as a log scale and the x-axis as a linear scale. I would also like the tick marks to fall in the center of the cells, correlating with the correct value. Can someone point me to the correct plotting functions for this? I have to resolve the issue using pcolor or pcolormesh.
Should you require more details, please let me know.
In current matplotlib, you can use pcolormesh with shading='nearest', and it will center the blocks with the values:
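import numpy as np
import matplotlib.pyplot as plt
# x, y and z are the arrays computed in the question; z has shape (len(x), len(y))
y_plot = np.log10(y)
z[5, 5] = 0 # to make it more evident
# pcolormesh expects the values with shape (len(y), len(x)), hence the transpose
plt.pcolormesh(x, y_plot, z.T, shading="nearest")
plt.colorbar()
ax = plt.gca()
ax.set_xticks(x)
ax.set_yticks(y_plot)
# mark the cell that was set to zero
plt.axvline(x[5])
plt.axhline(y_plot[5])
plt.show()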
Output:

Gridding irregular scatter data on basemap, trying different resolutions

I am trying to regrid/interpolate my dataset of irregularly scattered values, each tied to a location (lat, lon), onto a grid of a given size. The data is available as a dataframe with separate columns for the variable value, latitude and longitude.
I first have to grid this data, optimizing the grid size, and then find the best method to average the varying number of points that fall within each grid box.
I have tried some code following an online example. I use the histogram2d function to grid the latitudes and longitudes. I fill each grid box that contains scatter points with a density count (equal to the average of all points lying within that box). (I will then have to compare this newly gridded data, generated from the scatter points, with another dataset that has a different grid resolution.)
Ideally this should work, but grid boxes without scatter points are getting filled while those with points are being left out. The mismatch is greater at finer resolutions, i.e. smaller bin sizes.
I have looked up these examples - example 1, example 2.
Here is a part of my code:
df #Dataframe as a csv file opened in pandas
y = df['lon']
x = df['lat']
z = df['var']
# Bin the data onto a 10x10 grid or into any other size
# Have to reverse x & y due to row-first indexing
zi, yi, xi = np.histogram2d(y, x, bins=(5,5), weights=z, normed=False)
counts, _, _ = np.histogram2d(y, x, bins=(5,5))
zi = zi / counts
zi = np.ma.masked_invalid(zi)
m = Basemap(llcrnrlat=45,urcrnrlat=55,llcrnrlon=25,urcrnrlon=30)
m.drawcoastlines(linewidth =0.75, color ="black")
m.drawcountries(linewidth =0.75, color ="black")
m.drawmapboundary()
p,q = m(yi,xi)
#cs=m.pcolormesh(xi, yi, zi, edgecolors='black',cmap = 'jet')
cs=m.pcolormesh(p, q, zi, edgecolors='black',cmap = 'jet')
m.colorbar(cs)
#scat = m.scatter(x,y, c=z, s=200,edgecolors='red')
scat=m.scatter(y,x, latlon=True,c=z, s =80)
The following is the image getting generated.
Any help will be much appreciated.
A friend helped me figure this out.
I had to transpose the array returned by histogram2d before passing it to pcolormesh:
cs=m.pcolormesh(p, q, zi.T, edgecolors='black',cmap = 'jet')
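Why the transpose is needed can be seen from the array shapes alone. The sketch below uses three made-up points and made-up bin edges (none of this is the asker's data) and leaves out Basemap entirely: np.histogram2d returns an array indexed as [first-argument bin, second-argument bin], whereas pcolormesh(X, Y, C) expects the rows of C to run along Y.

```python
import numpy as np

# Hypothetical sample: three points given as (lon, lat, value)
lon = np.array([25.5, 25.6, 29.5])
lat = np.array([45.5, 45.6, 54.5])
val = np.array([1.0, 3.0, 10.0])

lon_edges = np.linspace(25, 30, 6)   # 5 bins along longitude
lat_edges = np.linspace(45, 55, 3)   # 2 bins along latitude

# Sum of values and number of points per grid box
sums, _, _ = np.histogram2d(lon, lat, bins=(lon_edges, lat_edges), weights=val)
counts, _, _ = np.histogram2d(lon, lat, bins=(lon_edges, lat_edges))

# Mean per box; empty boxes give 0/0 = NaN and are masked out
with np.errstate(invalid="ignore", divide="ignore"):
    mean = np.ma.masked_invalid(sums / counts)

# mean has shape (n_lon_bins, n_lat_bins) = (5, 2).
# For pcolormesh(lon_edges, lat_edges, C), C must have shape (n_lat_bins, n_lon_bins).
grid = mean.T
```

Here grid[0, 0] holds the mean of the two points in the south-west box, and every box without points stays masked, so it would not be colored by pcolormesh.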

Scaling the y axis in matplotlib

I want to draw multiple curves in the same plot. I store the data in a 2D list where the values for each parameter are kept as strings, and I loop over it to plot each row. However, when I plot, the larger values on the y axis end up below the smaller values.
Here is the code snippet that might help to understand
m=['H','.','<','^','*','+','x','#']
cnt=0
for i in all:# here all has the row wise data to plot
    matplotlib.pyplot.plot(l3,i,m[cnt])# l3 contains the values about the x axis
    cnt=cnt+1
plt.xlabel("x")
plt.ylabel("y")
plt.legend(para,loc='best')# para contains the info about the y parameters
plt.show()
The graph comes out like this. How can I get 12000 to appear above 0 in the graph?
This is the plot I got. How can I rescale it so that all values come in ascending order on the y axis?
You have to convert the strings to floats before plotting:
for i in all:# here all has the row wise data to plot
    y = [float(ii) for ii in i]
    matplotlib.pyplot.plot(l3, y, m[cnt])# l3 contains the values about the x axis
    cnt=cnt+1
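The underlying reason is that Matplotlib treats strings as categories and places them on the axis in the order they first appear, not by numeric value. A small sketch of the conversion step on its own, with a hypothetical helper name (rows_to_float) and made-up data:

```python
def rows_to_float(rows):
    """Convert a 2D list of numeric strings into a 2D list of floats."""
    return [[float(value) for value in row] for row in rows]

# Hypothetical stand-in for the 2D list called `all` in the question
all_rows = [["0", "12000", "500"], ["3.5", "20", "7"]]
numeric_rows = rows_to_float(all_rows)
```

Each row of numeric_rows can then be passed to plot in place of the original strings, and the y axis will be ordered numerically.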

Python 2d Ratio Plot with weighted mean trendline

Hello and thanks in advance. I am starting with a pandas dataframe and I would like to make a 2d plot with a trendline showing the weighted mean y value, with error bars for the uncertainty on the mean. The mean should be weighted by the total number of events in each bin. I start by grouping the df into a "photon" group and a "total" group, where "photon" is a subset of the total. In each bin, I am plotting the ratio of photon events to total. On the x axis and y axis I have two unrelated variables, "cluster energy" and "perimeter energy".
My attempt:
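import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
from uncertainties import ufloat # assumption: ufloat comes from the uncertainties package
#make the 2d binning and total hist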
energybins=[11,12,13,14,15,16,17,18,19,20,21,22]
ybins = [0,.125,.25,.5,.625,.75,1.,1.5,2.5]
total_hist,x,y,i = plt.hist2d(train['total_energy'].values,train['max_perimeter'].values,[energybins,ybins])
total_hist = np.array(total_hist)
#make the photon 2d hist with same bins
groups = train.groupby(['isPhoton'])
prompt_hist,x,y,i = plt.hist2d(groups.get_group(1)['total_energy'].values,groups.get_group(1)['max_perimeter'].values,bins=[energybins,ybins])
prompt_hist = np.array(prompt_hist)
ratio = np.divide(prompt_hist,total_hist,out=np.zeros_like(prompt_hist),where = total_hist!=0)
#plot the ratio
fig, ax = plt.subplots()
ratio=np.transpose(ratio)
p = ax.pcolormesh(ratio,)
for i in range(len(ratio)):
    for j in range(len(ratio[i])):
        text = ax.text(j+1, i+1, round(ratio[i, j], 2),ha="right", va="top", color="w")
ax.set_xticklabels(energybins)
ax.set_yticklabels(ybins)
plt.xlabel("Cluster Energy")
plt.ylabel("5x5 Perimeter Energy")
plt.title("Prompt Photon Fraction")
def myBinnedStat(x,v,bins):
    means,_,_ = stats.binned_statistic(x,v,'mean',bins)
    std,_ ,_= stats.binned_statistic(x,v,'std',bins)
    count,_,_ = stats.binned_statistic(x,v,'count',bins)
    return [ufloat(m,s/(c**(1./2))) for m,s,c in zip(means,std,count)]
I can then plot an errorbar plot, but I have not been able to plot the errorbar on the same axis as the pcolormesh. I was able to do this with hist2d. I am not sure why that is. I feel like there is a cleaner way to do the whole thing.
This yields a plot
pcolormesh plots each element as a unit on the x axis. That is, if you plot 8 columns, the data will span 0-8 on the x axis. However, you also redefined the x-axis tick labels so that 0-10 is labeled as 11-21.
For your error bars, you specified x values at 11-21, or so it looks, and that is where the data is plotted. That region is not labeled as such, however, since you changed the tick labels to correspond to pcolormesh.
This discrepancy is why your two plots do not align. Instead, you could use "default" x values for errorbar or define x values for pcolormesh. For example, use:
ax.errorbar(range(11), means[0:11], yerr=uncertainties[0:11])
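The other option, defining x values for pcolormesh, could look roughly like the sketch below. The bin edges are taken from the question, but ratio, means and uncertainties are random or made-up placeholders, since the real data is not available; passing the edges to pcolormesh puts the mesh in data coordinates, so the error bars can be drawn at the bin centers without relabeling ticks.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

energybins = np.arange(11, 23)                                  # 12 edges -> 11 bins
ybins = np.array([0, .125, .25, .5, .625, .75, 1., 1.5, 2.5])   # 9 edges -> 8 bins

# Placeholder data (assumption): ratio has one row per y bin and one column per x bin
rng = np.random.default_rng(0)
ratio = rng.uniform(0, 1, size=(len(ybins) - 1, len(energybins) - 1))
means = np.linspace(0.5, 1.5, len(energybins) - 1)
uncertainties = np.full_like(means, 0.1)

# Bin centers along x, where the trendline points belong
centers = 0.5 * (energybins[:-1] + energybins[1:])

fig, ax = plt.subplots()
mesh = ax.pcolormesh(energybins, ybins, ratio)   # edges given explicitly
ax.errorbar(centers, means, yerr=uncertainties, color="w", marker="o")
fig.colorbar(mesh, ax=ax)
ax.set_xlabel("Cluster Energy")
ax.set_ylabel("5x5 Perimeter Energy")
```

With this layout set_xticklabels and set_yticklabels are no longer needed, because the axes already show the actual energy values.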

Plotting data points on where they fall in a distribution

Let's say I have a large data set that I can manipulate in some sort of analysis, for example by looking at values in a probability distribution.
Now that I have this large data set, I want to compare known, actual data to it. Primarily, I want to know how many of the values in my data set share the same value or property as the known data. For example:
This is a cumulative distribution. The continuous lines are from generated data from simulations and the decreasing intensities are just predicted percentages. The stars are then observational (known) data, plotted against generated data.
Another example I have made is how visually the points could possibly be projected on a histogram:
I'm having difficulty marking where the known data points fall in the generated data set and plotting them cumulatively alongside the distribution of the generated data.
If I were to try and retrieve the number of points that fall in the vicinity of the generated data, I would start out like this (it's not right):
def SameValue(SimData, DefData, uncert):
    numb = [(DefData-uncert) < i < (DefData+uncert) for i in SimData]
    return sum(numb)
But I am having trouble accounting for the points falling in the value ranges and then having it all set up to where I can plot it. Any idea on how to gather this data and project this onto a cumulative distribution?
The question is pretty chaotic, with lots of irrelevant information while staying vague on the essential points. I will try to interpret it as best I can.
I think what you are after is the following: Given a finite sample from an unknown distribution, what is the probability to obtain a new sample at a fixed value?
I'm not sure if there is a general answer to it, but in any case that would be a question to be asked to statistics or mathematics people. My guess is that you would need to make some assumptions about the distribution itself.
For the practical case however, it might be sufficient to find out in which bin of the sampled distribution the new value would lie.
So assuming we have a distribution x, which we divide into bins. We can compute the histogram h, using numpy.histogram. The probability to find a value in each bin is then given by h/h.sum().
Having a value v=0.77, of which we want to know the probability according to the distribution, we can find out the bin in which it would belong by looking for the index ind in the bin array where this value would need to be inserted for the array to stay sorted. This can be done using numpy.searchsorted.
import numpy as np; np.random.seed(0)
x = np.random.rayleigh(size=1000)
bins = np.linspace(0,4,41)
h, bins_ = np.histogram(x, bins=bins)
prob = h/float(h.sum())
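ind = np.searchsorted(bins, 0.77, side="right") - 1 # searchsorted returns the position of the bin's right edge, so subtract 1 to get the bin index
print(prob[ind])
This prints the estimated probability of sampling a value in the bin around 0.77. For a Rayleigh distribution with unit scale, the exact probability of the bin from 0.7 to 0.8 is exp(-0.7^2/2) - exp(-0.8^2/2), about 5.7%, so the estimate should come out close to that.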
A different option would be to interpolate the histogram between the bin centers to find the probability.
In the code below we plot a distribution similar to the one from the picture in the question and use both methods, the first for the frequency histogram, the second for the cumulative distribution.
import numpy as np; np.random.seed(0)
import matplotlib.pyplot as plt
x = np.random.rayleigh(size=1000)
y = np.random.normal(size=1000)
bins = np.linspace(0,4,41)
h, bins_ = np.histogram(x, bins=bins)
hcum = np.cumsum(h)/float(np.cumsum(h).max())
points = [[.77,-.55],[1.13,1.08],[2.15,-.3]]
markers = ['$\u2660$', '$\u2665$', '$\u263B$'] # Python 3 strings; the ur'' prefix only exists in Python 2
colors = ["k", "crimson" , "gold"]
labels = list("ABC")
kws = dict(height_ratios=[1,1,2], hspace=0.0)
fig, (axh, axc, ax) = plt.subplots(nrows=3, figsize=(6,6), gridspec_kw=kws, sharex=True)
cbins = np.zeros(len(bins)+1)
cbins[1:-1] = bins[1:]-np.diff(bins[:2])[0]/2.
cbins[-1] = bins[-1]
hcumc = np.linspace(0,1, len(cbins))
hcumc[1:-1] = hcum
axc.plot(cbins, hcumc, marker=".", markersize="2", mfc="k", mec="k" )
axh.bar(bins[:-1], h, width=np.diff(bins[:2])[0], alpha=0.7, ec="C0", align="edge")
ax.scatter(x,y, s=10, alpha=0.7)
for p, m, l, c in zip(points, markers, labels, colors):
    kw = dict(ls="", marker=m, color=c, label=l, markeredgewidth=0, ms=10)
    # plot points in scatter distribution
    ax.plot(p[0],p[1], **kw)
    # plot points in bar histogram, find bin in which to plot point
    # shift by half the bin width to plot it in the middle of bar
    pix = np.searchsorted(bins, p[0], side="right")
    axh.plot(bins[pix-1]+np.diff(bins[:2])[0]/2., h[pix-1]/2., **kw)
    # plot in cumulative histogram, interpolate, such that point is on curve.
    yi = np.interp(p[0], cbins, hcumc)
    axc.plot(p[0],yi, **kw)
ax.legend()
plt.tight_layout()
plt.show()
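The interpolation option mentioned earlier, and the counting the question attempted with SameValue, can also be sketched without any plotting. The names prob_interp and count_within are hypothetical, the tolerance of 0.05 is an arbitrary assumption, and the sample is drawn with NumPy's newer Generator API, so the numbers will differ from those produced with np.random.seed(0).

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.rayleigh(size=1000)              # simulated data, as in the answer
bins = np.linspace(0, 4, 41)
h, _ = np.histogram(x, bins=bins)
prob = h / h.sum()                       # probability per bin

# Interpolate the per-bin probability between bin centers
centers = 0.5 * (bins[:-1] + bins[1:])
v = 0.77
prob_interp = np.interp(v, centers, prob)

def count_within(sim_data, value, uncert):
    """Count simulated values lying strictly within value +/- uncert."""
    sim_data = np.asarray(sim_data)
    return int(np.sum(np.abs(sim_data - value) < uncert))

n_close = count_within(x, v, 0.05)       # assumed tolerance of 0.05
```

Because 0.77 lies between the centers of the bins starting at 0.7 and 0.8, prob_interp is a weighted average of the probabilities of those two bins.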
