Python curve for a lot of points - python

I'm using Python and matplotlib.
I have a lot of Points, generated with arrays.
fig, ax = plt.subplots(ncols=1, nrows=1, figsize=Groesse_cm/2.54)
ax.set_title(title)
ax.set_xlabel(xlabel) # x-axis label
ax.set_ylabel(ylabel) # y-axis label
ax.plot(xWerte, yWerte, 'ro', label=kurveName)
ax.plot(xWerte, y2Werte, 'bo', label=kurveName2)
plt.show()
So I have arrayX for the x values, arrayYmax for the y values (red) and arrayYmin for the y values (blue). I can't give you my arrays, because that is much too complicated.
My question is:
How can I get a spline/fit like in the upper picture? I do not know the function behind my points, so I just have points with [x / y] values. I don't want to connect the points, I want a fit through them. So yes, I call this a fit :D
Here is an example of what I don't want:
The code for this is:
fig, ax = plt.subplots(ncols=1, nrows=1, figsize=Groesse_cm/2.54)
degree = 7
np.poly1d(np.polyfit(arrayX,arrayYmax,degree))
ax.plot(arrayX, arrayYmax, 'r')
np.poly1d(np.polyfit(arrayX,arrayYmin,degree))
ax.plot(arrayX, arrayYmin, 'b')
# points
ax.plot(arrayX, arrayYmin, 'bo')
ax.plot(arrayX, arrayYmax, 'ro')
plt.show()

you're pretty close, you just need to use the polynomial model you're estimating/fitting.
start with pulling in packages and defining your data:
import numpy as np
import matplotlib.pyplot as plt
arr_x = [-0.8, 2.2, 5.2, 8.2, 11.2, 14.2, 17.2]
arr_y_min = [65, 165, 198, 183, 202, 175, 97]
arr_y_max = [618, 620, 545, 626, 557, 626, 555]
then we estimate the polynomial fit, as you were doing, but saving the result into a variable that we can use later:
poly_min = np.poly1d(np.polyfit(arr_x, arr_y_min, 2))
poly_max = np.poly1d(np.polyfit(arr_x, arr_y_max, 1))
next we plot the data:
plt.plot(arr_x, arr_y_min, 'bo:')
plt.plot(arr_x, arr_y_max, 'ro:')
next we use the polynomial fit from above to plot estimated value at a set of sampled points:
poly_x = np.linspace(-1, 18, 101)
plt.plot(poly_x, poly_min(poly_x), 'b')
plt.plot(poly_x, poly_max(poly_x), 'r')
giving us:
note that I'm using much lower-degree polynomials (1 and 2) than you (7). a degree-7 polynomial is certainly overfitting this small amount of data, and these look like reasonable fits
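To make the overfitting point concrete, here is a small sketch using the arr_x and arr_y_min values from above. With seven points, a degree-6 polynomial already passes through every one of them, so a near-zero residual says nothing about how well the curve describes the trend. The helper name max_abs_residual is made up for this illustration.

```python
import numpy as np

# Data copied from the answer above.
arr_x = np.array([-0.8, 2.2, 5.2, 8.2, 11.2, 14.2, 17.2])
arr_y_min = np.array([65, 165, 198, 183, 202, 175, 97])


def max_abs_residual(x, y, degree):
    """Largest absolute gap between the data and a least-squares polynomial fit."""
    poly = np.poly1d(np.polyfit(x, y, degree))
    return float(np.max(np.abs(poly(x) - y)))


# Degree 2 smooths the data; degree 6 interpolates all seven points.
res_deg2 = max_abs_residual(arr_x, arr_y_min, 2)
res_deg6 = max_abs_residual(arr_x, arr_y_min, 6)
print(f"degree 2: {res_deg2:.3f}   degree 6: {res_deg6:.2e}")
```

A high-degree curve that hits every point usually swings widely between and beyond the points, which is why the low-degree fits are the safer choice here.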

Related

python violin plot regular axis

I want to do a violin plot of binned data but at the same time be able to plot a model prediction and visualize how well the model describes the main part of the individual data distributions. My problem here is, I guess, that the x-axis after the violin plot does not behave like a regular axis with numbers, but more like string values that just happen to be numbers. Maybe not a good description, but in the example I would like to have a "normal" plot of a function, e.g. f(x) = 2*x**2, and at x=1, x=5.2, x=18.3 and x=27 I would like to have the violin in the background.
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
np.random.seed(10)
collectn_1 = np.random.normal(1, 2, 200)
collectn_2 = np.random.normal(802, 30, 200)
collectn_3 = np.random.normal(90, 20, 200)
collectn_4 = np.random.normal(70, 25, 200)
ys = [collectn_1, collectn_2, collectn_3, collectn_4]
xs = [1, 5.2, 18.3, 27]
sns.violinplot(x=xs, y=ys)
xx = np.arange(0, 30, 10)
plt.plot(xx, 2*xx**2)
plt.show()
Somehow this code actually does not plot violins but only bars, this is only a problem in this example and not in the original code though. In my real code I want to have different "half-violins" on both sides, therefore I use sns.violinplot(x="..", y="..", hue="..", data=.., split=True).
I think that would be hard to do with seaborn because it does not provide an easy way to manipulate the artists that it creates, particularly if there are other things plotted on the same Axes. Matplotlib's violinplot allows setting the position of the violins, but does not provide an option for plotting only half violins. Therefore, I would suggest using statsmodels.graphics.boxplots.violinplot, which does both.
import matplotlib.ticker
from statsmodels.graphics.boxplots import violinplot
df = sns.load_dataset('tips')
x_col = 'day'
y_col = 'total_bill'
hue_col = 'smoker'
xs = [1, 5.2, 18.3, 27]
xx = np.arange(0, 30, 1)
yy = 0.1*xx**2
cs = ['C0','C1']
fig, ax = plt.subplots()
ax.plot(xx,yy)
# one pass per hue level: the first group is drawn as left halves, the second as right halves
for (_,gr0),side,c in zip(df.groupby(hue_col),['left','right'],cs):
    print(side)
    data = [gr1 for (_,gr1) in gr0.groupby(x_col)[y_col]]
    violinplot(ax=ax, data=data, positions=xs, side=side, show_boxplot=False, plot_opts=dict(violin_fc=c))
# violinplot above messes up which ticks are shown, the line below restores a sensible tick locator
ax.xaxis.set_major_locator(matplotlib.ticker.MaxNLocator())

How do I add a trend line to this data frame (Python)

I was wondering how I could add a trend line (or line of best fit) to my bar graph. I have tried searching but have only found myself confused. Here is the code that I use to create the bar graph:
dfY = pd.DataFrame(data={"Year":dataYVote['Year'], "Vote": dataYVote['Vote']})
year_list = []
avg_votes = []
for year in np.unique(dfY["Year"]):
    year_list.append(year)
    avg_votes.append(dfY.loc[dfY["Year"]==year, "Vote"].mean())
plt.bar(year_list, avg_votes, width = 0.5)
plt.xlabel('Year')
plt.ylabel('Amount of Votes')
plt.title('Average Amount of Votes for Each Year')
plt.show()
Any help is much appreciated :)
You could use numpy.polyfit, which minimizes the squared error and, for degree 1, returns the gradient and intercept.
import numpy as np
x_data = np.asarray(x_data)
slope, intercept = np.polyfit(x_data, y_data, 1) # degree 1 for a straight line
plt.plot(x_data, slope*x_data + intercept)
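Applied to the bar chart from the question, this could look roughly like the sketch below. The numbers are invented stand-ins for year_list and avg_votes, since the original data is not shown.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical stand-ins for year_list and avg_votes from the question.
year_list = np.array([2010, 2011, 2012, 2013, 2014])
avg_votes = np.array([120.0, 135.0, 150.0, 165.0, 180.0])

# Least-squares straight line through the bar heights.
slope, intercept = np.polyfit(year_list, avg_votes, 1)
trend = slope * year_list + intercept

plt.bar(year_list, avg_votes, width=0.5)
plt.plot(year_list, trend, color='red', label='trend line')
plt.xlabel('Year')
plt.ylabel('Amount of Votes')
plt.legend()
plt.show()
```

Converting the lists to NumPy arrays first matters: slope * year_list on a plain Python list would raise a TypeError instead of scaling each element.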

Plotting negative values using matplotlib scatter

I want to plot scatter points corresponding to 6 different datasets over global maps of the Earth. The problem is that some of these quantities have negative values and they don't appear in the maps. I have tried to overcome this problem by taking absolute values of the data and multiplying (or taking the power of) them by some factors, but nothing seems to work the way I want. The problem is that the datasets have very different ranges. Ideally, I want them all to have the same scale so everything will be more organized, but I don't know how to do this.
I created some synthetic data to illustrate this issue
import numpy as np
import matplotlib.pyplot as plt
import matplotlib
from mpl_toolkits.basemap import Basemap, addcyclic, shiftgrid
from matplotlib.pyplot import cm
np.random.seed(100)
VarReTx = np.random.uniform(low=-0.087, high=0.0798, size=(52,))
VarReTy = np.random.uniform(low=-0.076, high=0.1919, size=(52,))
VarImTx = np.random.uniform(low=-0.0331, high=0.0527, size=(52,))
VarImTy = np.random.uniform(low=-0.0311, high=0.2007, size=(52,))
eTx = np.random.uniform(low=0.0019, high=0.0612, size=(52,))
eTy = np.random.uniform(low=0.0031, high=0.0258, size=(52,))
obslat = np.array([18.62, -65.25, -13.8, -7.95, -23.77, 51.84, 40.14, 58.07,
-12.1875, -35.32, 36.37, -46.43, 40.957, -43.474, 38.2 , 37.09,
48.17, 0.6946, 13.59, 28.32, 51., -25.88, -34.43, 21.32,
-12.05, 52.27, 36.23, -12.69, 31.42, 5.21, -22.22, 36.1,
14.38, -54.5, 43.91, 61.16, 48.27, 52.07, 54.85, 45.403,
52.971, -17.57, -51.7, 18.11, 39.55, 47.595, 22.79, -37.067,
-1.2, 32.18, 51.933, 48.52])
obslong = np.array([-287.13, -64.25, -171.78, -14.38, -226.12, -339.21, -105.24,
-321.77, -263.1664, -210.64, -233.146, -308.13, -359.667, -187.607,
-77.37, -119.72, -348.72, -287.8463, -215.13, -16.43, -4.48,
-332.29, -340.77, -158., -75.33, -255.55, -219.82, -227.53,
-229.12, -52.73, -245.9, -256.16, -16.97, -201.05, -215.81,
-45.442, -117.12, -347.32, -276.77, -75.552, -201.752, -149.58,
-57.89, -66.15, -4.35, -52.677, -354.47, -12.315, -48.5,
-110.73, -10.25, -123.42, ])
fig, ([ax1, ax2], [ax3, ax4], [eax1, eax2]) = plt.subplots(3,2, figsize=(24,23))
matplotlib.rc('xtick', labelsize=12)
matplotlib.rc('ytick', labelsize=12)
plots = [ax1, ax2, ax3, ax4, eax1, eax2]
Vars = [VarReTx, VarReTy, VarImTx, VarImTy, eTx, eTy]
titles = [r'$\Delta$ ReTx', r'$\Delta$ ReTy', r'$\Delta$ ImTx', r'$\Delta$ ImTy', 'Error (X)', 'Error (Y)']
colors = iter(cm.jet(np.reshape(np.linspace(0.0, 1.0, len(plots)), ((len(plots), 1)))))
for j in range(len(plots)):
    c3 = next(colors)
    lat = np.arange(-91, 91, 0.5)
    long = np.arange(-0.1, 360.1, 0.5)
    longrid, latgrid = np.meshgrid(long, lat)
    plots[j].set_title(titles[j], fontsize=48, y=1.05)
    condmap = Basemap(projection='robin', llcrnrlat=-90, urcrnrlat=90,\
                      llcrnrlon=-180, urcrnrlon=180, resolution='c', lon_0=0, ax=plots[j])
    maplong, maplat = condmap(longrid, latgrid)
    condmap.drawcoastlines()
    condmap.drawmapboundary(fill_color='white')
    parallels = np.arange(-90, 90, 15)
    condmap.drawparallels(parallels,labels=[False,True,True,False], fontsize=15)
    x,y = condmap(obslong, obslat)
    w = []
    for m in range(obslong.size):
        w.append(Vars[j][m])
    w = np.array(w)
    condmap.scatter(x, y, s = w*1e+4, c=c3)
    r = np.linspace(np.min(Vars[j]), np.max(Vars[j]), 4)
    for n in r:
        condmap.scatter([], [], c=c3, s=n*1e+4, label=str(np.round(n, 4)))
    plots[j].legend(bbox_to_anchor=(0., -0.2, 1., .102), loc='lower left',
                    ncol=4, mode="expand", borderaxespad=0., fontsize=16, frameon = False)
plt.show()
plt.close('all')
As you can see in the map, negative data are not being shown. I want them all to appear in the maps, and I want all the scatter plots to share the same scale across their respective ranges. Thanks!
It looks like you are trying to map your dataset to dot size. Obviously you cannot have negative size dots, so that won't work.
Instead, you need to normalize your dataset to a strictly positive range and use those normalized values for the size parameter. A simple way to do this would be to use matplotlib.colors.Normalize(vmin, vmax), which allows you to map any values in the interval [vmin, vmax] to the interval [0,1].
If you want to have a shared scale for all your datasets, first find the global min and max, and use that to instantiate your normalization, then normalize each dataset when plotting:
datasets = [VarReTx,VarReTy,VarImTx,VarImTy,eTx,eTy]
min_val = min([d.min() for d in datasets])
max_val = max([d.max() for d in datasets])
norm = matplotlib.colors.Normalize(vmin=min_val, vmax=max_val)
plt.scatter(x,y,s=norm(VarReTx)*100) # choose an appropriate scaling factor instead of 100 to get nicely sized dots
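One thing to watch with this approach: the global minimum is mapped to 0, so that point would be drawn with zero size. A possible workaround, sketched below, is to rescale the normalized values into a strictly positive size range. The two datasets are random stand-ins for the ones in the question, and min_size and max_size are assumed values to tune by eye.

```python
import numpy as np
import matplotlib.colors as mcolors

rng = np.random.default_rng(0)
# Hypothetical stand-ins for two of the datasets in the question.
data_a = rng.uniform(-0.087, 0.0798, size=52)
data_b = rng.uniform(0.0019, 0.0612, size=52)
datasets = [data_a, data_b]

# One shared normalization across all datasets.
norm = mcolors.Normalize(vmin=min(d.min() for d in datasets),
                         vmax=max(d.max() for d in datasets))

# Assumed marker areas (points**2); the smallest value stays visible.
min_size, max_size = 10.0, 200.0
sizes_a = min_size + np.asarray(norm(data_a)) * (max_size - min_size)
sizes_b = min_size + np.asarray(norm(data_b)) * (max_size - min_size)
```

Each array can then be passed as the s argument, for example condmap.scatter(x, y, s=sizes_a, c=c3). The legend entries would need the same mapping applied to the values in r.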

plotting/marking selected points from a 1D array

This seems like a simple question, but I have been trying for a really long time.
I have a 1D data array (named 'hightemp_unlocked'). After finding its peaks (an array of the locations where the peaks are), I want to mark the peaks on the plot.
import matplotlib
from matplotlib import pyplot as plt
.......
plt.plot([x for x in range(len(hightemp_unlocked))],hightemp_unlocked,label='200 mk db ramp')
plt.scatter(peaks, hightemp_unlocked[x in peaks], marker='x', color='y', s=40)
for some reason, it keeps telling me that x, y must be the same size
it shows:
  File "period.py", line 86, in <module>
    plt.scatter(peaks, hightemp_unlocked[x in peaks], marker='x', color='y', s=40)
  File "/usr/local/lib/python2.6/dist-packages/matplotlib/pyplot.py", line 2548, in scatter
    ret = ax.scatter(x, y, s, c, marker, cmap, norm, vmin, vmax, alpha, linewidths, faceted, verts, **kwargs)
  File "/usr/local/lib/python2.6/dist-packages/matplotlib/axes.py", line 5738, in scatter
    raise ValueError("x and y must be the same size")
I don't think hightemp_unlocked[x in peaks] is what you want. Here x in peaks reads as the conditional statement "is x in peaks?" and will return True or False depending on what was last stored in x. When parsing hightemp_unlocked[x in peaks], True or False is interpreted as 1 or 0, which returns only the second or first element of hightemp_unlocked. This explains the array size error.
If peaks is an array of indexes, then simply hightemp_unlocked[peaks] will return the corresponding values.
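A minimal sketch of that indexing, assuming hightemp_unlocked is a NumPy array and peaks holds integer positions (both filled with made-up values here):

```python
import numpy as np

# Hypothetical stand-ins for the data and peak indices in the question.
hightemp_unlocked = np.array([10, 11, 14, 12, 10, 8, 5, 7, 10, 12, 15, 13])
peaks = np.array([2, 10])

# Integer-array indexing returns one value per index, so the
# x and y arrays passed to scatter end up the same length.
peak_values = hightemp_unlocked[peaks]
print(peak_values)
```

The call from the question then becomes plt.scatter(peaks, peak_values, marker='x', color='y', s=40). Note that this only works on a NumPy array; a plain Python list cannot be indexed with a list of positions.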
You are almost on the right track, but hightemp_unlocked[x in peaks] is not what you are looking for. How about something like:
from matplotlib import pyplot as plt
# dummy temperatures
temps = [10, 11, 14, 12, 10, 8, 5, 7, 10, 12, 15, 13, 12, 11, 10]
# list of x-values for plotting
xvals = list(range(len(temps)))
# say our peaks are at indices 2 and 10 (temps of 14 and 15)
peak_idx = [2, 10]
# make a new list of just the peak temp values
peak_temps = [temps[i] for i in peak_idx]
# repeat for x-values
peak_xvals = [xvals[i] for i in peak_idx]
# now we can plot the temps
plt.plot(xvals, temps)
# and add the scatter points for the peak values
plt.scatter(peak_xvals, peak_temps)

How to plot cdf in matplotlib in Python?

I have a disordered list named d that looks like:
[0.0000, 123.9877,0.0000,9870.9876, ...]
I simply want to plot a CDF graph based on this list using Matplotlib in Python, but I don't know if there's a function I can use.
d = []
d_sorted = []
for line in fd.readlines():
    (addr, videoid, userag, usertp, timeinterval) = line.split()
    d.append(float(timeinterval))
d_sorted = sorted(d)
class discrete_cdf:
    def __init__(data):
        self._data = data # must be sorted
        self._data_len = float(len(data))
    def __call__(point):
        return (len(self._data[:bisect_left(self._data, point)]) /
                self._data_len)
cdf = discrete_cdf(d_sorted)
xvalues = range(0, max(d_sorted))
yvalues = [cdf(point) for point in xvalues]
plt.plot(xvalues, yvalues)
Now I am using this code, but the error message is :
Traceback (most recent call last):
File "hitratioparea_0117.py", line 43, in <module>
cdf = discrete_cdf(d_sorted)
TypeError: __init__() takes exactly 1 argument (2 given)
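The TypeError itself comes from the two methods not declaring self: Python passes the instance as the first argument, so __init__(data) receives two arguments while accepting only one. Below is a minimal sketch of a corrected class. It is renamed DiscreteCDF here, uses the four sample values shown in the question (the elided rest is left out), and converts the maximum to an int because range() only accepts integers.

```python
from bisect import bisect_left


class DiscreteCDF:
    """Empirical CDF over a sorted list: fraction of values strictly below a point."""

    def __init__(self, data):
        self._data = data                  # must already be sorted
        self._data_len = float(len(data))

    def __call__(self, point):
        # bisect_left gives the number of entries smaller than point.
        return bisect_left(self._data, point) / self._data_len


# Sample values taken from the list shown in the question.
d_sorted = sorted([0.0, 123.9877, 0.0, 9870.9876])
cdf = DiscreteCDF(d_sorted)

xvalues = range(0, int(max(d_sorted)) + 1)   # range() needs integers
yvalues = [cdf(point) for point in xvalues]
```

Plotting then works as in the original code with plt.plot(xvalues, yvalues). The slice-and-len expression in the original __call__ is equivalent to using the bisect_left result directly.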
I know I'm late to the party. But, there is a simpler way if you just want the cdf for your plot and not for future calculations:
plt.hist(put_data_here, density=True, cumulative=True, label='CDF',
         histtype='step', alpha=0.8, color='k')
As an example,
plt.hist(dataset, bins=bins, density=True, cumulative=True, label='CDF DATA',
         histtype='step', alpha=0.55, color='purple')
# bins and (lognormal / normal) datasets are pre-defined
# density=True replaces the older normed=True, which newer Matplotlib releases no longer accept
EDIT: This example from the matplotlib docs may be more helpful.
As mentioned, cumsum from numpy works well. Make sure that your data is a proper PDF (ie. sums to one), otherwise the CDF won't end at unity as it should. Here is a minimal working example:
import numpy as np
import matplotlib.pyplot as plt
# Create some test data
dx = 0.01
X = np.arange(-2, 2, dx)
Y = np.exp(-X ** 2)
# Normalize the data to a proper PDF
Y /= (dx * Y).sum()
# Compute the CDF
CY = np.cumsum(Y * dx)
# Plot both
plt.plot(X, Y)
plt.plot(X, CY, 'r--')
plt.show()
The numpy function to compute cumulative sums cumsum can be useful here
In [1]: from numpy import cumsum
In [2]: cumsum([.2, .2, .2, .2, .2])
Out[2]: array([ 0.2, 0.4, 0.6, 0.8, 1. ])
Nowadays, you can just use seaborn's kdeplot function with cumulative as True to generate a CDF.
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
X1 = np.arange(100)
X2 = (X1 ** 2) / 100
sns.kdeplot(data = X1, cumulative = True, label = "X1")
sns.kdeplot(data = X2, cumulative = True, label = "X2")
plt.legend()
plt.show()
For an arbitrary collection of values, x:
import numpy as np
import matplotlib.pyplot as plt
def cdf(x, plot=True, *args, **kwargs):
    x, y = sorted(x), np.arange(len(x)) / len(x)
    return plt.plot(x, y, *args, **kwargs) if plot else (x, y)
(If you're new to Python, *args and **kwargs let you pass positional and named arguments through without declaring and managing them explicitly.)
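The function above assigns the fractions 0, 1/n, ..., (n-1)/n, so the curve stops just short of 1. A common variant, sketched here under the name ecdf_points (a made-up helper, not part of any library), counts each point itself and therefore ends exactly at 1:

```python
import numpy as np


def ecdf_points(values):
    """Return sorted values and the fraction of data at or below each one."""
    x = np.sort(np.asarray(values, dtype=float))
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y


x, y = ecdf_points([3.0, 1.0, 2.0, 2.0])
print(x, y)
```

Which convention to use is a matter of definition. For a staircase look, the result can be drawn with plt.step(x, y, where='post').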
What works best for me is the quantile function of pandas.
Say I have 71 participants, each with a certain number of interruptions. I want to plot the CDF of the number of interruptions per participant. The goal is to tell what percentage of participants have at least 30 interventions.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
step=0.05
indices = np.arange(0,1+step,step)
num_interruptions_per_participant = [32,70,52,52,39,20,37,31,60,57,31,71,24,23,38,4,77,37,79,43,63,43,75,13
,45,31,57,28,61,29,30,52,65,11,76,37,65,28,33,73,65,43,50,33,45,40,50,44
,33,49,24,69,55,47,22,45,54,11,30,13,32,52,31,50,10,46,10,25,47,51,83]
CDF = pd.DataFrame({'dummy':num_interruptions_per_participant})['dummy'].quantile(indices)
plt.plot(CDF,indices,linewidth=9, label='#interventions', color='blue')
According to the graph, almost 25% of the participants have fewer than 30 interventions.
You can use this statistic for further analysis. For instance, in my case I need at least 30 interventions per participant to meet the minimum sample requirement for leave-one-subject-out evaluation. The CDF tells me that I have a problem with 25% of the participants.
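The value read off the graph can be cross-checked by counting directly. The sketch below reuses the list from above. The names counts and threshold are chosen for this illustration only.

```python
import numpy as np

# Values copied from num_interruptions_per_participant above.
counts = np.array([
    32, 70, 52, 52, 39, 20, 37, 31, 60, 57, 31, 71, 24, 23, 38, 4, 77, 37, 79, 43, 63, 43, 75, 13,
    45, 31, 57, 28, 61, 29, 30, 52, 65, 11, 76, 37, 65, 28, 33, 73, 65, 43, 50, 33, 45, 40, 50, 44,
    33, 49, 24, 69, 55, 47, 22, 45, 54, 11, 30, 13, 32, 52, 31, 50, 10, 46, 10, 25, 47, 51, 83,
])

threshold = 30  # minimum count required per participant

# Empirical CDF evaluated just below the threshold.
n_below = int(np.sum(counts < threshold))
fraction_below = n_below / counts.size
print(f"{n_below} of {counts.size} participants are below {threshold} ({fraction_below:.1%})")
```

This gives the exact share rather than a value interpolated between the 5% quantile steps used for the plot.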
import matplotlib.pyplot as plt
# data (the values) and c (a line color) are assumed to be defined already
X=sorted(data)
Y=[]
l=len(X)
Y.append(float(1)/l)
for i in range(2,l+1):
    Y.append(float(1)/l+Y[i-2])
plt.plot(X,Y,color=c,marker='o',label='xyz')
I guess this would do. For the procedure, refer to http://www.youtube.com/watch?v=vcoCVVs0fRI
