Understanding log scale vs. actually taking np.log of the data - python

I am currently working up some experimental data and am having a hard time understanding whether I should be using a log scale or actually applying np.log to the data.
Here is the plot I have made.
Blue represents using plt.yscale('log'), whereas orange is from creating a new column and applying np.log to the data.
My question
Why are their magnitudes so different? Which is correct? And if using plt.yscale('log') is the optimal way to do it, is there a way I can get those values, as I need to do a curve fit afterwards?
Thanks in advance for anyone that can provide some answers!
edit(1)
I understand that plt.yscale('log') uses base 10, whereas np.log is the natural log. I have tried using np.log10 on the data instead, and it gives smaller values that do not correspond to what the log scale shows.

Your data is getting log-ified, but it "points" in the wrong direction.
Consider this toy data
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 1, 100)[:-1]
y = np.log(1 - x) + 5
Then we plot
plt.plot(x, y)
If I log scale it:
It's just more exaggerated
plt.plot(x, y)
plt.xscale('log')
You need to point your data in the other direction, like normal log data:
plt.plot(-x, y)
But you also have to make sure the data is positive or ... you know ... logs and stuff ¯\_(ツ)_/¯
plt.plot(-x + 1, y)
plt.xscale('log')
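On the base-10 question from the edit: plt.yscale('log') never changes the y values, it only changes the spacing of the axis, so there is nothing to "extract" from it. If you want numbers you can hand to a curve fit, transform the data yourself with np.log10; a plot of those transformed values on a linear axis has the same shape as the raw values on a log-scaled axis, only the tick labels differ. A minimal sketch with a positive variant of the toy data above:
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 1, 100)[:-1]
y = -np.log(1 - x) + 5          # positive variant of the toy data

# Left: axis-only log scale, the y values themselves are untouched
fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.plot(x, y)
ax1.set_yscale('log')

# Right: log10-transformed values on a linear axis (these are what you would fit)
ax2.plot(x, np.log10(y))
plt.show()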


lmfit fits 'wrong' peak in data with multiple peaks

With this code (before the snippet I just read in the data, which works fine, and after the snippet I just do labels etc.)
from lmfit import models
import matplotlib.pyplot as plt

plt.errorbar(xdata, ydata, yerr, fmt='.', label='Data')
model = models.GaussianModel()
params = model.make_params()
params['center'].set(6.5)
#params['center'].vary = False
fit = model.fit(ydata, params=params, x=xdata, weights=1/yerr)
print(fit.fit_report())
plt.plot(xdata, fit.best_fit, label='Fit')
I try to fit the last peak (approximately at x=6.5). But as you can see in the picture, the code does not do that. Can anyone tell me why that is?
Edit: If I uncomment and run the line params['center'].vary = False, the "fit" just becomes zero everywhere.
I have never used lmfit, but the problem is most likely that you are trying to fit the whole data region. Considering the entire region of data you handed to the .fit call, the resulting fit is probably the best one, and correct.
You should pass only the relevant data to the fit. In your case xdata should only contain the data points from roughly 5.5 up to 7.5 (or somewhere around those numbers), and ydata has to be restricted to the same range as well, of course. Then the fit should work nicely.
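A minimal sketch of that restriction using a boolean mask, assuming xdata, ydata and yerr are NumPy arrays, and reusing the model and params from the question (the 5.5 and 7.5 limits are just the rough numbers mentioned above):
# keep only the points around the last peak
mask = (xdata > 5.5) & (xdata < 7.5)
x_peak = xdata[mask]
y_peak = ydata[mask]
err_peak = yerr[mask]

fit = model.fit(y_peak, params=params, x=x_peak, weights=1/err_peak)
print(fit.fit_report())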

How do you smooth out values in an array (without polynomial equations)?

So basically I have some data and I need to find a way to smooth it out (so that the line produced from it is smooth and not jittery). When plotted, the data currently looks like this:
and what I want it to look like is this:
I tried using this numpy method to get the equation of the line, but it did not work for me because the graph repeats (there are multiple readings, so the graph rises, saturates, then falls, then repeats that multiple times), so there isn't really a single equation that can represent it.
I also tried this but it did not work for the same reason as above.
The graph is defined as such:
gx = [] #x is already taken so gx -> graphx
gy = [] #same as above
#Put in data
#Get nice data #[this is what I need help with]
#Plot nice data and original data
plt.xlabel('x')
plt.ylabel('y')
plt.legend()
plt.show()
The method I think would be most applicable to my solution is getting the average of every 2 points and setting that to the value of both points, but this idea doesn't sit right with me - potential values may be lost.
You could use an infinite horizon filter (an exponentially weighted moving average):
import numpy as np
import matplotlib.pyplot as plt

x = 0.85  # adjust x to use more or less of the previous value
k = np.sin(np.linspace(0.5, 1.5, 100)) + np.random.normal(0, 0.05, 100)

# filtered = oldvalue*x + newvalue*(1-x)
filtered = np.zeros_like(k)
filtered[0] = k[0]
for i in range(1, len(k)):
    # uses x% of the previous filtered value and (1-x)% of the new value
    filtered[i] = filtered[i-1]*x + k[i]*(1-x)

plt.plot(k)
plt.plot(filtered)
plt.show()
I figured it out: by averaging every 4 results I was able to significantly smooth out the graph. Here is a demonstration:
Hope this helps whoever needs it
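For reference, a minimal sketch of that kind of 4-point moving average, using np.convolve on made-up stand-in data (replace gy with the real data from the question):
import numpy as np
import matplotlib.pyplot as plt

# stand-in data: rises, saturates, falls, with noise added
gy = np.concatenate([np.linspace(0, 1, 50), np.ones(50), np.linspace(1, 0, 50)])
gy = gy + np.random.normal(0, 0.05, gy.size)

# average over a sliding window of 4 samples
window = 4
gy_smooth = np.convolve(gy, np.ones(window) / window, mode='same')

plt.plot(gy, label='original')
plt.plot(gy_smooth, label='4-point average')
plt.legend()
plt.show()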

Linregress output seems incorrect

I plotted a scatter plot on my dataframe which looks like this:
with code
from scipy import stats
import pandas as pd
import seaborn as sns
df = pd.read_csv('/content/drive/My Drive/df.csv', sep=',')
subset = df.iloc[:, 1:10080]
df['mean'] = subset.mean(axis=1)
df.plot(x='mean', y='Result', kind = 'scatter')
sns.lmplot('mean', 'Result', df, order=1)
I wanted to find the slope of the regression in the graph using code
scipy.stats.mstats.linregress(Result,average)
but from the output it seems like the slope magnitude is too small:
LinregressResult(slope=-0.0001320534706614152, intercept=27.887336813241845, rvalue=-0.16776138446214162, pvalue=3.0450456899520655e-07, stderr=2.55977061451773e-05)
If I switch the Result and average positions,
scipy.stats.mstats.linregress(average,Result)
it still doesn't look right, as the intercept seems too large:
LinregressResult(slope=-213.12489536011773, intercept=7138.48783135982, rvalue=-0.16776138446214162, pvalue=3.0450456899520655e-07, stderr=41.31287437069993)
Why is this happening? Do these output values need to be rescaled?
The signature for scipy.stats.mstats.linregress is linregress(x,y) so your second ordering, linregress(average, Result) is the one that is consistent with the way your graph is drawn. And on that graph, an intercept of 7138 doesn't seem unreasonable—are you getting confused by the fact that the x-axis limits you're showing don't go down to 0, where the intercept would actually happen?
In any case, your data really don't look like they follow a linear law, so the slope (or any parameter from a completely-misspecified model) will not actually tell you much. Are the x and y values all strictly positive? And is there a particular reason why x can never logically go below 25? The data-points certainly seem to be piling up against that vertical asymptote. If so, I would probably subtract 25 from x, then fit a linear model to logged data. In other words, do your plot and your linregress with x=numpy.log(average-25) and y=numpy.log(Result). EDIT: since you say x is temperature there’s no logical reason why x can’t go below 25 (it is meaningful to want to extrapolate below 25, for example—and even below 0). Therefore don’t subtract 25, and don’t log x. Just log y.
In your comments you talk about rescaling the slope, and eventually the suspicion emerges that you think this will give you a correlation coefficient. These are different things. The correlation coefficient is about the spread of the points around the line as well as slope. If what you want is correlation, look up the relevant tools using that keyword.
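A minimal sketch of that last recommendation (log the y values only, keep temperature untransformed), assuming average and Result are the same arrays used in the question's linregress call and that all Result values are positive:
import numpy as np
from scipy import stats

# regress log(Result) on the raw temperature values
res = stats.mstats.linregress(average, np.log(Result))
print(res.slope, res.intercept, res.rvalue)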

Plotting high precision data as axis ticks or transforming it to different values?

I am trying to plot some data as a histogram in matplotlib using high precision values as x-axis ticks. The data is between 0 and 0.4, but most values are really close, like:
0.05678, 0.05879, 0.125678, 0.129067
I used np.around() to round the values (and it rounded them, as it should, to between 0 and 0.4), but it didn't work quite right for all of the data.
Here is an example of the one that worked somewhat right ->
and one that didn't ->
You can see there are points after 0.4, which is just not right.
Here is the code I used in Jupyter Notebook:
plt.hist(x=[advb_ratios, adj_ratios, verb_ratios], color=['r', 'y', 'b'], bins=10, label=['adverbs', 'adjectives', 'verbs'])
plt.xticks(np.around(ranks, 1))
plt.xlabel('Argument Rank')
plt.ylabel('Frequency')
plt.legend()
plt.show()
It is the same code for both histograms, only a different x being plotted; all x values used are between 0 and 1.
So my questions are:
Is there a way to fix that and reflect my data as it is?
Is it better to give my rank values different labels that separate them more from one another, for example 1, 2, 3, 4, or will I lose the precision of my data and some useful info?
What is the general approach in such situations? Would a different graphic help? What?
I don't understand your problem; the fact that your data is between 0 and 0.4 should not influence the way it is displayed. I don't see why you need to do anything other than call plt.hist().
In addition, you can pass an array to the bins argument to indicate which bins you want, so you could do something like this to force your bin edges to always be the same:
import numpy as np
import matplotlib.pyplot as plt

# Fake data
x1 = np.random.normal(loc=0, scale=0.1, size=(1000,))
x2 = np.random.normal(loc=0.2, scale=0.1, size=(1000,))
x3 = np.random.normal(loc=0.4, scale=0.1, size=(1000,))

plt.hist([x1, x2, x3], bins=np.linspace(0, 0.4, 10))
plt.show()

Fourier smoothing of data set

I am following this link to do a smoothing of my data set.
The technique is based on the principle of removing the higher order terms of the Fourier Transform of the signal, and so obtaining a smoothed function.
This is part of my code:
N = len(y)
y = y.astype(float) # fix issue, see below
yfft = fft(y, N)
yfft[31:] = 0.0 # set higher harmonics to zero
y_smooth = fft(yfft, N)
ax.errorbar(phase, y, yerr = err, fmt='b.', capsize=0, elinewidth=1.0)
ax.plot(phase, y_smooth/30, color='black') #arbitrary normalization, see below
However some things do not work properly.
Indeed, you can check the resulting plot:
The blue points are my data, while the black line should be the smoothed curve.
First of all I had to convert my array of data y by following this discussion.
Second, I just normalized arbitrarily to compare the curve with data, since I don't know why the original curve had values much higher than the data points.
Most importantly, the curve looks like a mirror image ("specular") of the data points, and I don't know why this happens.
It would be great to have some advice, especially on this third point, and more generally on how to optimize the smoothing with this technique for my particular data set shape.
Your problem is probably due to the shifting that the standard FFT does. You can read about it here.
Your data is real, so you can take advantage of symmetries in the FT and use the special function np.fft.rfft
import numpy as np
import matplotlib.pyplot as plt
x = np.arange(40)
y = np.log(x + 1) * np.exp(-x/8.) * x**2 + np.random.random(40) * 15
rft = np.fft.rfft(y)
rft[5:] = 0 # Note, rft.shape = 21
y_smooth = np.fft.irfft(rft)
plt.plot(x, y, label='Original')
plt.plot(x, y_smooth, label='Smoothed')
plt.legend(loc=0)
plt.show()
If you plot the absolute value of rft, you will see that there is almost no information in frequencies beyond 5, so that is why I chose that threshold (and a bit of playing around, too).
Here are the results:
From what I can gather you want to build a low pass filter by doing the following:
Move to the frequency domain. (Fourier transform)
Remove undesired frequencies.
Move back to the time domain. (Inverse fourier transform)
Looking at your code, instead of doing 3) you're just doing another Fourier transform. Instead, try doing an actual inverse Fourier transform to move back to the time domain:
y_smooth = ifft(yfft, N)
Have a look at scipy.signal to see a bunch of already available filters.
(Edit: I'd be curious to see the results, do share!)
I would be very cautious in using this technique. By zeroing out frequency components of the FFT you are effectively constructing a brick wall filter in the frequency domain. This will result in convolution with a sinc in the time domain and likely distort the information you want to process. Look up "Gibbs phenomenon" for more information.
You're probably better off designing a low pass filter or using a simple N-point moving average (which is itself an LPF) to accomplish the smoothing.
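A minimal sketch of both of those alternatives on made-up data, using scipy.signal for a Butterworth low-pass filter and np.convolve for the N-point moving average (the cutoff of 0.1 and the window of 5 are arbitrary choices for illustration):
import numpy as np
from scipy import signal
import matplotlib.pyplot as plt

# made-up noisy data standing in for the real signal
t = np.linspace(0, 1, 200)
y = np.sin(2 * np.pi * 3 * t) + np.random.normal(0, 0.2, t.size)

# Option 1: Butterworth low-pass filter applied forwards and backwards (zero phase)
b, a = signal.butter(4, 0.1)      # order 4, cutoff at 0.1 of the Nyquist frequency
y_lp = signal.filtfilt(b, a, y)

# Option 2: simple N-point moving average
N = 5
y_ma = np.convolve(y, np.ones(N) / N, mode='same')

plt.plot(t, y, label='noisy data')
plt.plot(t, y_lp, label='Butterworth low-pass')
plt.plot(t, y_ma, label='5-point moving average')
plt.legend()
plt.show()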
