So basically I have some data that I need to smooth out, so that the line produced from it is smooth rather than jittery. When plotted, the data currently looks like this:
and what I want it to look like is this:
I tried using this numpy method to get an equation for the line, but it did not work for me because the graph repeats: there are multiple readings, so the graph rises, saturates, then falls, and then repeats that cycle several times, so there isn't really a single equation that can represent it.
I also tried this but it did not work for the same reason as above.
The graph is set up as follows:
import matplotlib.pyplot as plt

gx = []  # x is already taken, so gx -> graph x
gy = []  # same as above
#Put in data
#Get nice data #[this is what I need help with]
#Plot nice data and original data
plt.xlabel('x')
plt.ylabel('y')
plt.legend()
plt.show()
The method I think would be most applicable is taking the average of every 2 points and assigning that average to both points, but this idea doesn't sit right with me, since potentially useful values may be lost.
You could use an infinite horizon filter, i.e. an exponential moving average:
import numpy as np
import matplotlib.pyplot as plt

x = 0.85  # adjust x to use more or less of the previous value
k = np.sin(np.linspace(0.5, 1.5, 100)) + np.random.normal(0, 0.05, 100)

filtered = np.zeros_like(k)
# filtered = oldvalue*x + newvalue*(1-x)
filtered[0] = k[0]
for i in range(1, len(k)):
    # uses a fraction x of the previous filtered value and (1-x) of the new value
    filtered[i] = filtered[i-1]*x + k[i]*(1-x)

plt.plot(k)
plt.plot(filtered)
plt.show()
I figured it out: by averaging 4 results at a time I was able to significantly smooth out the graph. Here is a demonstration:
Hope this helps whoever needs it.
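For anyone who wants to reproduce this, here is one way to do that kind of averaging (a 4-point moving average via np.convolve; the data below is made up):
import numpy as np
import matplotlib.pyplot as plt

# made-up jittery data standing in for the original readings
gy = np.sin(np.linspace(0, 6, 200)) + np.random.normal(0, 0.1, 200)

window = 4
# each output point is the mean of `window` consecutive input points
smoothed = np.convolve(gy, np.ones(window) / window, mode='valid')

plt.plot(gy, label='original')
plt.plot(smoothed, label='smoothed (4-point average)')
plt.legend()
plt.show()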
I have a series of simple mass-radius relationships (so a 2d plot) that I'd like to include in one plot, colored according to how well each one fits my data. I have the radii (x), the masses (y), and a separate 1d array that quantifies how well each M-R relationship fits my data. This 1d array can be likened to an error, but it isn't calculated using a standard Python function (I calculate it myself).
Ideally, my end result is a series of ~2000 mass-radius relationships on one plot, where each mass-radius relationship is color coded according to its agreement with my data. So something like this, but instead of two colors, it's on a grayscale:
Here's a snippet of what I'm trying to do, which obviously isn't working since I haven't even defined a colormap:
for i in range(10):
    plt.plot(x, y, c=error[i])
plt.colorbar()
plt.show()
And again, I'd like to have each element in error correspond to a color in greyscale.
I know this is simple so I'm definitely outing myself as an amateur here, but I really appreciate any help!
EDIT: Here is the code snippet where I made the plot:
for i in range(2396):
    if eps[i] == 0.:
        plt.plot(f[i,:,1], f[i,:,0], c='g', linewidth=0.1)
    else:
        plt.plot(f[i,:,1], f[i,:,0], c='r', linewidth=0.1)
plt.xlabel('Radius')
plt.ylabel('Mass')
plt.title('Neutron Star Mass-Radius Relationships')
You have one fit value for each series of points:
Here is a script to plot multiple series on a single plot, where each series (i.e. each line) is colored based on a third fit variable:
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt

fit = np.random.rand(25)
cmap = mpl.cm.get_cmap('binary')
color_gradients = cmap(fit)  # this line changed! it was incorrect before

fig, (ax1, ax2) = plt.subplots(1, 2, gridspec_kw={'width_ratios': [30, 1]})

for i, _ in enumerate(fit):
    x = sorted(np.random.randint(100, size=25))
    y = sorted(np.random.randint(100, size=25))
    ax1.plot(x, y, c=color_gradients[i])

cb = mpl.colorbar.ColorbarBase(ax2, cmap=cmap,
                               orientation='vertical',
                               ticks=[0, 1])
Now responding to your questions from the comments:
How does fit play into the rest of the plot?
fit is an array of random decimals between 0 and 1, corresponding to the "error" values for each series:
>>>fit
array([0.76458568, 0.15017328, 0.70686393, 0.98885091, 0.18449953,
0.62506401, 0.49513702, 0.69138913, 0.96844495, 0.48937011,
0.09878352, 0.68965829, 0.13524182, 0.95419698, 0.39844843,
0.63095159, 0.95933663, 0.00693236, 0.98212815, 0.16262205,
0.26274884, 0.56880703, 0.68233984, 0.18304883, 0.66759496])
fit is used to generate the divisions of the color gradient in these lines:
cmap = mpl.cm.get_cmap('binary')
color_gradients = cmap(fit)
I'm not sure where the specific documentation for this is, but basically, passing an array of numbers to the cmap will return an array of RGBA color values spaced according to the numbers passed:
>>>color_gradients
array([[0.23529412, 0.23529412, 0.23529412, 1. ],
[0.85098039, 0.85098039, 0.85098039, 1. ],
[0.29411765, 0.29411765, 0.29411765, 1. ],
[0.00784314, 0.00784314, 0.00784314, 1. ],
.
.
.
So this array can be used to assign a specific color to each line based on its fit. This assumes that higher numbers are better fits and that you want better fits to be colored darker.
Note that before I had color_gradient_divisions = [(1/len(fit))*i for i in range(len(fit))], which was incorrect as it evenly divides the color map into 25 pieces, not actually returning values corresponding to the fit.
The cmap is also passed to the colorbar when constructing it. Often you can just call plt.colorbar to simply create one, but here matplotlib doesn't automatically know what to create a color bar for as the lines are separate and manually colored. So instead, we create 2 axes, one for the plot and one for the colorbar (spacing them accordingly with the gridspec_kw argument), and then using mpl.colorbar.ColorbarBase to make the colorbar (I also removed a norm argument b/c I don't think it is needed).
why have you used an underscore in the for loop?
This is a pattern in Python, typically meaning "I'm not using this thing". enumerate returns an iterator of tuples with the structure (index, value). So enumerate(fit) returns (0, 0.76458568), (1, 0.15017328), etc. (based on the data shown above). I am only using the index (i) to get the corresponding position (and color) in color_gradients (ax1.plot(x, y, c=color_gradients[i])). Even though the values from fit are being returned by enumerate, I am not using them, so I point them to _ instead. If I were using them within the loop, I would use a typical variable name instead.
enumerate is the encouraged way to loop over an iterable if you need to access both the count of the values and the values themselves. People tend to use for i in range(len(fit)) also to do this (which works fine!) but the further I've gone with Python the more I've seen people avoiding that.
This was a little bit of a confusing example; I set my loop to iterate over fit b/c I was conceptualizing "creating one graph for each value in fit". But I could have just looped over color_gradients (for c in color_gradients) which might be more clear.
But in your real data, something like enumerate may be helpful if you are looping over multiple aligned arrays. In my example, I just create new random data within each loop. But you will likely want to have an array of fit values, an array of color values, an array (of series) of radii, and an array (of series) of masses, such that the ith element of each array corresponds to the same star. You may be iterating over one array and want to access the same position in another (zip is used for this also).
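For instance, here is a minimal sketch of that aligned-array setup (all names and data here are made up):
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt

# hypothetical aligned arrays: row i of each describes the same star
radii = np.sort(np.random.rand(5, 25), axis=1)    # each row is one series of radii
masses = np.sort(np.random.rand(5, 25), axis=1)   # each row is one series of masses
fit = np.random.rand(5)                           # one fit value per star
colors = mpl.cm.get_cmap('binary')(fit)           # one color per star

fig, ax = plt.subplots()
for radius_series, mass_series, color in zip(radii, masses, colors):
    ax.plot(radius_series, mass_series, c=color)
plt.show()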
I'll leave this second answer here, even though it wasn't what OP was getting at:
You have one fit value for each point:
Here, each pair of x,y coordinates has its own fit value:
import numpy as np
import matplotlib.pyplot as plt
x = np.random.randint(100, size=25)
y = np.random.randint(100, size=25)
fit = np.random.rand(25)
plt.scatter(x, y, c=fit, cmap='binary')
plt.colorbar()
Note that with either approach, poorly fitting points or lines may be nearly invisible, since the 'binary' colormap maps low fit values to white.
I plotted a scatter plot on my dataframe which looks like this:
with code
from scipy import stats
import pandas as pd
import seaborn as sns
df = pd.read_csv('/content/drive/My Drive/df.csv', sep=',')
subset = df.iloc[:, 1:10080]
df['mean'] = subset.mean(axis=1)
df.plot(x='mean', y='Result', kind = 'scatter')
sns.lmplot('mean', 'Result', df, order=1)
I wanted to find the slope of the regression in the graph using this code:
scipy.stats.mstats.linregress(Result,average)
but from the output it seems like the slope magnitude is too small:
LinregressResult(slope=-0.0001320534706614152, intercept=27.887336813241845, rvalue=-0.16776138446214162, pvalue=3.0450456899520655e-07, stderr=2.55977061451773e-05)
If I switch the Result and average positions,
scipy.stats.mstats.linregress(average,Result)
it still doesn't look right, as the intercept seems too large:
LinregressResult(slope=-213.12489536011773, intercept=7138.48783135982, rvalue=-0.16776138446214162, pvalue=3.0450456899520655e-07, stderr=41.31287437069993)
Why is this happening? Do these output values need to be rescaled?
The signature for scipy.stats.mstats.linregress is linregress(x, y), so your second ordering, linregress(average, Result), is the one consistent with the way your graph is drawn. And on that graph an intercept of 7138 doesn't seem unreasonable: are you getting confused by the fact that the x-axis limits you're showing don't go down to 0, where the intercept would actually occur?
In any case, your data really don't look like they follow a linear law, so the slope (or any parameter from a completely misspecified model) will not actually tell you much. Are the x and y values all strictly positive? And is there a particular reason why x can never logically go below 25? The data points certainly seem to be piling up against that vertical asymptote. If so, I would probably subtract 25 from x and then fit a linear model to the logged data. In other words, do your plot and your linregress with x=numpy.log(average-25) and y=numpy.log(Result).
EDIT: since you say x is temperature, there is no logical reason why x can't go below 25 (it is meaningful to want to extrapolate below 25, for example, and even below 0). Therefore don't subtract 25, and don't log x. Just log y.
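For concreteness, here is a minimal sketch of that last suggestion, using synthetic stand-ins for your columns (swap in df['mean'] and df['Result'] for the real data):
import numpy as np
from scipy import stats

# synthetic stand-ins for your columns; replace with your actual data
average = np.random.uniform(26, 40, 200)                                          # temperature-like x values
Result = 5000 * np.exp(-0.2 * average) * np.exp(np.random.normal(0, 0.1, 200))    # strictly positive y values

# log-transform the response only (per the edit above: don't shift or log x)
res = stats.mstats.linregress(average, np.log(Result))
print(res.slope, res.intercept)  # fitted model: log(Result) ~ slope * average + intercept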
In your comments you talk about rescaling the slope, and eventually the suspicion emerges that you think this will give you a correlation coefficient. These are different things. The correlation coefficient is about the spread of the points around the line as well as the slope. If what you want is correlation, look up the relevant tools using that keyword.
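If correlation is what you actually want, note that the rvalue field linregress already returns is the Pearson correlation coefficient; a quick check, reusing the synthetic data from the sketch above:
r_from_fit = stats.mstats.linregress(average, Result).rvalue
r_direct = np.corrcoef(average, Result)[0, 1]
print(r_from_fit, r_direct)  # these agree: rvalue is the Pearson correlation, not a rescaled slope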
I have two very sparse arrays whose values look like this:
Array A: min = -68093253945.0, max = 8.54631971208e+13
Array B: min = -1e+15, max = 1.87343e+14
Each array also has values concentrated at certain levels, e.g. near 2000, near 1 million, near 0.05, and so on.
I am trying to compare these two arrays in terms of concentration, and want to do so in a way that is invariant to the number of entries in each. I also want to account for huge outliers if possible and maybe compress the bins to be between 0 and 1 or something of this sort.
The aim is to make a histogram via:
import matplotlib.pyplot as plt

plt.ion()
plt.hist(A, alpha=0.5, label='A')  # plt.hist passes its arguments to np.histogram
plt.hist(B, alpha=0.5, label='B')
plt.title("Histogram of Values")
plt.legend(loc='upper right')
plt.savefig('valuecomp.png')
How do I do this? I have experimented with:
from scipy import stats
from sklearn import preprocessing

A = stats.zscore(A)
B = stats.zscore(B)

A = preprocessing.scale(A)
B = preprocessing.scale(B)

A = preprocessing.scale(A, axis=0, with_mean=True, with_std=True, copy=True)
B = preprocessing.scale(B, axis=0, with_mean=True, with_std=True, copy=True)
And then for my histograms, adding normed=True and range=(0,100). All of these methods give me a histogram with a massive vertical chunk near 0.0 instead of distributing the values smoothly. range=(0,100) looks good, but it ignores any values outside of 100, like those near 1 million.
Perhaps I need to remove outliers from my data first and then do a histogram?
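That is one reasonable option; for example, a minimal sketch (with made-up heavy-tailed data standing in for A) that clips at the 1st and 99th percentiles before histogramming:
import numpy as np
import matplotlib.pyplot as plt

A = np.random.standard_cauchy(10000) * 1e10   # made-up heavy-tailed stand-in for the real A
lo, hi = np.percentile(A, [1, 99])            # cut off the extreme 1% tails on each side
A_clipped = A[(A >= lo) & (A <= hi)]

plt.hist(A_clipped, bins=50, density=True, alpha=0.5, label='A (1st-99th percentile)')
plt.legend(loc='upper right')
plt.show()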
#sascha's suggestion of using AstroML was a good one, but the knuth and freedman versions seem to take astronomically long (excuse the pun), and the blocks version simply thinned the blocks.
I took the sigmoid of each value (via from scipy.special import expit) and then plotted the histogram that way. It was the only way I could get this to work.
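For anyone curious, that approach looks roughly like this (a sketch with made-up data in place of A and B; note that very large magnitudes saturate to 0 or 1 under expit):
import numpy as np
import matplotlib.pyplot as plt
from scipy.special import expit

A = np.random.normal(0, 3, 10000)   # made-up stand-ins for the real arrays
B = np.random.normal(1, 5, 10000)

# expit squashes every value into (0, 1), giving both arrays one comparable scale
plt.hist(expit(A), bins=50, alpha=0.5, label='A')
plt.hist(expit(B), bins=50, alpha=0.5, label='B')
plt.title("Histogram of Values")
plt.legend(loc='upper right')
plt.show()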
I am trying to plot a curve from molecular dynamics potential energy data stored in a numpy array. As you can see in the attached figure, a large number appears at the top left, and it is related to the label on the y-axis.
Even if I rescale the data, a number still appears there. I do not want it. Could you please suggest how to sort out this issue? Thank you very much.
This is likely happening because your data is a small value offset by a large one. That's what the - sign means at the front of the number, "take the plotted y-values and subtract this number to get the actual values". You can remove it by plotting with the mean subtracted. Here's an example:
import numpy as np
import matplotlib.pyplot as plt
y = -1.5*1e7 + np.random.random(100)
plt.plot(y)
plt.ylabel("units")
gives the form you don't like:
but subtracting the mean (or some other number close to that, like min or max, etc) will remove the large offset:
plt.figure()
plt.plot(y - np.mean(y))
plt.ylabel("offset units")
plt.show()
You can remove the offset by using:
plt.ticklabel_format(useOffset=False)
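Applied to the y array from the example in the previous answer, that looks like:
plt.figure()
plt.plot(y)                             # same offset-prone data as above
plt.ticklabel_format(useOffset=False)   # print full tick values instead of an offset
plt.ylabel("units")
plt.show()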
It seems your data is being displayed in exponential form, like 1e+10, 2e+10, etc.
This question here might help:
How to prevent numbers being changed to exponential form in Python matplotlib figure
I have some data over a 2D range that I am interested in analyzing. These data were originally in lists x,y, and z where z[i] was the value for the point located at (x[i],y[i]). I then interpolated this data onto a regular grid using
x=np.array(x)
y=np.array(y)
z=np.array(z)
xi=np.linspace(minx,maxx,100)
yi=np.linspace(miny,maxy,100)
zi=griddata(x,y,z,xi,yi)
I then plotted the xi,yi,zi data using
plt.contour(xi,yi,zi)
plt.pcolormesh(xi,yi,zi,cmap=plt.get_cmap('PRGn'),norm=plt.Normalize(-10,10),vmin=-10,vmax=10)
This produced this plot:
In this plot you can see the S-like curve where the values are equal to zero (aside: the data doesn't vary as rapidly as shown in the colorbar -- that's simply a result of me normalizing the data to the range -10 to 10 when it actually extends far beyond that; I did this to make the zero-valued region show up better -- maybe there's a better way of doing this too...).
The scattered dots are simply the points at which I have original data (yes, in this case my data was already on a regular grid). What I'm curious about is whether there is a good way for me to extract the values for which the curve is zero and obtain x,y pairs that, if plotted as a line, would trace that zero-region in the colormesh. I could interpolate to a really fine grid and then just brute force search for the values which are closest to zero. But is there a more automatic way of doing this, or a more automatic way of plotting this "zero-line"?
And a secondary question: I am using griddata correctly, right? I have these simple 1D arrays although elsewhere people use various meshgrids, loading texts, etc., before calling griddata.
Here is a full example:
import numpy as np
import matplotlib.pyplot as plt

y, x = np.ogrid[-1.5:1.5:200j, -1.5:1.5:200j]
f = (x**2 + y**2)**4 - (x**2 - y**2)**2

plt.figure(figsize=(9, 4))
plt.subplot(121)
extent = [np.min(x), np.max(x), np.min(y), np.max(y)]
cs = plt.contour(f, extent=extent, levels=[0, 0.1],
                 colors=["b", "r"], linestyles=["solid", "dashed"], linewidths=[2, 2])

plt.subplot(122)
# get the points on the lines
for c in cs.collections:
    data = c.get_paths()[0].vertices
    plt.plot(data[:, 0], data[:, 1],
             color=c.get_color()[0], linewidth=c.get_linewidth()[0])

plt.show()
here is the output:
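As an aside (not part of the original answer), the same vertices can also be read off the ContourSet directly via its allsegs attribute, without going through the collections; a minimal sketch, assuming the cs object from the example above:
# cs.allsegs[i] holds the segments for cs.levels[i];
# each segment is an (N, 2) array of x, y vertices
for level, segs in zip(cs.levels, cs.allsegs):
    for seg in segs:
        print(level, seg.shape)  # seg[:, 0] is x, seg[:, 1] is y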