I'm very new to python and programming in general (This is my first programming language, I started about a month ago).
I have a CSV file with data ordered like this (CSV file data at the bottom). There are 31 columns of data. The first column (wavelength) must be read in as the independent variable (x), and for the first iteration the second column (i.e. the first column labelled "observation") must be read in as the dependent variable (y). I am then trying to fit a Gaussian+line model to the data and extract the value of the mean of the Gaussian (mu), which should be stored in an array for further analysis. This process should be repeated for each set of observations, whilst the x values read in stay the same (i.e. always from the Wavelength column).
Here is the code for how I am currently reading in the data:
import numpy as np #importing necessary packages
import matplotlib.pyplot as plt
import pandas as pd
import scipy as sp
from scipy.optimize import curve_fit
e=np.exp
spectral_data=np.loadtxt(r'C:/Users/Sidharth/Documents/Computing Labs/Project 1/Halpha_spectral_data.csv', delimiter=',', skiprows=2) #importing data file
print(spectral_data)
x=spectral_data[:,0] #selecting column 0 to be x-axis data
y=spectral_data[:,1] #selecting column 1 to be y-axis data
So I need to automate that process, instead of manually changing y=spectral_data[:,1] to y=spectral_data[:,2] and so on, all the way up to y=spectral_data[:,30], for each iteration.
My code for producing the Gaussian fit is as follows:
plt.scatter(x,y) #produce scatter plot
plt.title('Observation 1')
plt.ylabel('Intensity (arbitrary units)')
plt.xlabel('Wavelength (m)')
plt.plot(x,y,'*')
plt.plot(x,c+m*x,'-') #plots the fit
print('The slope and intercept of the regression is,', m,c)
m_best=m
c_best=c
def fit_gauss(x, a, mu, sig, m, c):
    gaus = a*np.exp(-(x-mu)**2/(2*sig**2))  # Gaussian component
    line = m*x + c                          # linear background
    return gaus + line

initial_guess = [160, 7.1*10**-7, 0.2*10**-7, m_best, c_best]
po, po_cov = curve_fit(fit_gauss, x, y, initial_guess)
The Gaussian seems to fit fine (as shown in the image of the plot) and so the mean value of this gaussian (i.e. the x-coordinate of its peak) is the value I must extract from it. The value of the mean is given in the console (denoted by mu):
The slope and intercept of the regression is, -731442221.6844947 616.0099144830941
The signal parameters are
Gaussian amplitude = 19.7 +/- 0.8
mu = 7.1e-07 +/- 2.1e-10
Gaussian width (sigma) = -0.0 +/- 0.0
and the background estimate is
m = 132654859.04 +/- 6439349.49
c = 40 +/- 5
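For reference, a minimal sketch of how mu and its uncertainty can be read off the values returned by curve_fit, assuming po and po_cov as defined above (the parameter order follows fit_gauss: a, mu, sig, m, c):

mu = po[1]                        # mean of the Gaussian, the second fit parameter
mu_err = np.sqrt(po_cov[1, 1])    # 1-sigma uncertainty from the covariance matrix diagonal
print('mu =', mu, '+/-', mu_err)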
So my questions are, how can I iterate the process of reading in data from the csv so that I don't have to manually change the column y takes data from, and then how do I store the value of mu from each iteration of the read-in so that I can do further analysis/calculations with that mean later?
My thoughts are I should use a for-loop but I'm not sure how to do it.
The orange line shown in the plot is a result of some code I tried earlier. I think it's irrelevant, which is why it isn't in the main part of the question, but if necessary, this is all it is:
x=spectral_data[:,0] #selecting column 0 to be x-axis data
y=spectral_data[:,1] #selecting column 1 to be y-axis data
plt.scatter(x,y) #produce scatter plot
plt.title('Observation 1')
plt.ylabel('Intensity (arbitrary units)')
plt.xlabel('Wavelength (m)')
plt.plot(x,y,'*')
plt.plot(x,c+m*x,'-') #plots the fit
Usually when you encounter a problem like this, try to break it down into: what has to be kept unchanged (in your example, the x data and the analysis code), what has to be changed (the y data, or more specifically the index that tells the rest of the code which column holds the y data), and how to keep the values you want to use further down the road.
Once you figure this out, we need to formalise the right loop and how to store the values we wish to keep. For the latter, an easy way is to store them in a list, so we'll initialise an empty list and at the end of each loop iteration append the value to that list.
mu_list = []  # will store our mu's in this list
for i in range(1, 31):  # each iteration i gets a different value, starting at 1 and ending at 30 (not 31)
    x = spectral_data[:, 0]
    y = spectral_data[:, i]
    # Your analysis and plot code here #
    mu = po[1]  # Not sure po[1] is the right place where your mu is, please change it appropriately...
    mu_list.append(mu)  # store mu at the end of our growing mu_list
And you will have a list of 30 mu's under mu_list.
Now, notice we don't have to do everything inside the loop. For example, x is the same regardless of i (loading x only once improves performance), and the analysis code is basically the same except for a different input (the y data), so we can define a function for it (a good practice that makes bigger code much more readable). So we can take both out of the loop: write x = spectral_data[:, 0] before the loop, and define a function which analyzes the data and returns mu:
def analyze(x, y):
    # Your analysis and plot code here #
    mu = po[1]
    return mu

x = spectral_data[:, 0]
mu_list = []  # will store our mu's in this list
for i in range(1, 31):
    y = spectral_data[:, i]
    mu_list.append(analyze(x, y))  # will calculate mu using our function, and store it at the end of our growing mu_list
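As a small follow-up for the further analysis mentioned in the question, a sketch assuming mu_list now holds the 30 fitted means (the statistics computed here are only examples):

mu_array = np.array(mu_list)                 # a NumPy array is handier for further calculations
print('mean of the fitted mu values:', mu_array.mean())
print('sample standard deviation:', mu_array.std(ddof=1))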
Related
I have a small set of data. I used python3 to read it and created a scatter plot. My next task is to set the slope a to 10 and the intercept b to 0 and calculate y for every value of x. The task says I should not use any existing linear regression functions. I have been stuck for some time on this. How can I do that?
If your slope is already set to 10, I don't see why you need to use Linear Regression. I hope I'm not missing anything from your task.
However, that aside, if you need to get a list in Python with all elements of x multiplied by your slope a, you can use a list comprehension to build this new list in the following way:
y_computed = [item*a for item in x]
You can literally just draw a line with a constant slope (10) on the same plot, then calculate the predicted y-value based on that line "estimate" (you can also find the error of the estimate if you want). That can be done using the following:
import numpy as np
from matplotlib import pyplot as plt

def const_line(x):
    y = 10 * x + 0  # just to illustrate that the intercept is zero
    return y

x = np.linspace(0, 1)
y = const_line(x)
plt.plot(x, y, c='m')
plt.show()

# Find the y-values for each sample point in your data:
y_pred = []
for xi in data:
    y_pred.append(const_line(xi))
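If you also want the error of the estimate mentioned above, a minimal sketch that compares those predictions with the measured y values; data and y_true here are hypothetical stand-ins for your own arrays:

import numpy as np

data = np.array([0.1, 0.3, 0.5, 0.7, 0.9])     # hypothetical measured x values
y_true = np.array([1.2, 2.8, 5.1, 6.9, 9.2])   # hypothetical measured y values

y_pred = np.array([const_line(xi) for xi in data])   # predictions from the fixed slope-10 line
residuals = y_true - y_pred                          # error of the estimate at each point
rmse = np.sqrt(np.mean(residuals**2))                # overall root-mean-square error
print('residuals:', residuals)
print('RMSE:', rmse)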
I have data from distinct curves, and want to fit each of them individually. However, the data is mixed into a single array, so first I believe I need a way to separate the data.
I know that each of the individual curves is under the family A/x+B. As of now I cut out each of the curves by hand and curve fit, but would like to automate this process and have the computer separate these curves and fit them. I attempted to use machine learning, but didn't know where to start or what packages to use. I am using Python, but can also use C++; in fact I hope to transfer it to C++ by the end. Where do you think I should start? Is it worth it to use unsupervised machine learning, or is there a better way to separate the data?
(Images in the original post showed the expected curves and an example of the data.)
Well, you sure do have an interesting problem.
I see that there are curves with Y-axis values that are considerably larger than the rest of them. I would simply take the first N values with the largest Y-axis values and then fit them to an exponential decay curve (or that other curve you mention). You can then simply take the points that best fit that curve and leave the other points alone.
Except...
This is a terrible way to extrapolate data. Doing this, you are cherry-picking the data you want. This is falsifying information and is very bad.
Your best bet is to create a single curve that all points fit to, if you cannot isolate all of those points into separate curves with external information.
But...
We do know some information: a valid function must have only 1 output given a single input.
If the X-axis is discrete, this means you can create a lookup table of outputs given the input. This allows you to count how many curves are associated with a specific X-value (which could be a time unit). In other words, you have to have external information to separate points locally. You can then reorder the points in increasing Y-value, and now you have your separate curves defined in discrete points.
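A minimal sketch of that lookup-table idea, assuming the x values are discrete, every curve is sampled at every x, and the curves never cross; all names and example values here are illustrative:

from collections import defaultdict

# points: all curves mixed together, as (x, y) pairs (made-up example values)
points = [(1, 10.0), (1, 3.0), (2, 8.0), (2, 2.5), (3, 6.5), (3, 2.0)]

lookup = defaultdict(list)            # lookup table: x value -> all y outputs observed at that x
for x, y in points:
    lookup[x].append(y)

num_curves = max(len(ys) for ys in lookup.values())   # how many curves pass through each x value
print('curves per x value:', num_curves)

curves = defaultdict(list)            # curve index -> list of (x, y) points
for x in sorted(lookup):
    for rank, y in enumerate(sorted(lookup[x])):   # reorder the y values at this x in increasing order
        curves[rank].append((x, y))                # rank k is assigned to curve k

print(dict(curves))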
Basically, this is an unsolvable problem in the general sense, but in your specific application, there might be extra rules that further define the domain and range such that you can do data filtering.
One more thing...
I am making these statements with the assumption that the (X,Y) values are floats that cannot maintain accuracy after some mathematical operations.
If you are using things like unum numbers, you might be able to keep enough information in the decimal such that your fitting functions can differentiate between points without extra filtering.
This case is more of a hope than anything, as adopting a new number representation to get more accuracy to isolate sampled points is a stretch at best.
Just for completeness, there are some mathematical libraries that might help you.
Boost.uBLAS
Eigen
LAPACK++
Hopefully, I have given you enough information to allow you to solve your problem.
I extracted data from the plot for analysis. Here is example code that loads, separates, fits and plots the three data sets. It works when the separate data files are appended into a single text file.
import numpy, scipy, matplotlib
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
##########################################################
# data load and separation section
datafilename = 'temp.dat'
textdata = open(datafilename, 'rt').read()
xLists = [[], [], []]
yLists = [[], [], []]
previousY = 0.0 # initialize
whichList = -1 # initialize
datalines = textdata.split('\n')
for line in datalines:
    if not line:  # allow for blank lines in data file
        continue
    spl = line.split()
    x = float(spl[0])
    y = float(spl[1])
    if y > previousY + 50.0:  # this separator must be greater than max noise
        whichList += 1
    previousY = y
    xLists[whichList].append(x)
    yLists[whichList].append(y)
##########################################################
# curve fitting section
def func(x, a, b):
    return a / x + b
parameterLists = []
for curveIndex in range(len(xLists)):
    # these are the same as the scipy defaults
    initialParameters = numpy.array([1.0, 1.0])

    xData = numpy.array(xLists[curveIndex], dtype=float)
    yData = numpy.array(yLists[curveIndex], dtype=float)

    # curve fit the test data
    fittedParameters, pcov = curve_fit(func, xData, yData, initialParameters)
    parameterLists.append(fittedParameters)
##########################################################
# graphics output section
def ModelAndScatterPlot(graphWidth, graphHeight):
    f = plt.figure(figsize=(graphWidth/100.0, graphHeight/100.0), dpi=100)
    axes = f.add_subplot(111)

    for curveIndex in range(len(xLists)):
        # first the raw data as a scatter plot
        axes.plot(xLists[curveIndex], yLists[curveIndex], 'D')

        # create data for each fitted equation plot
        xModel = numpy.linspace(min(xLists[curveIndex]), max(xLists[curveIndex]))
        yModel = func(xModel, *parameterLists[curveIndex])

        # now the model as a line plot
        axes.plot(xModel, yModel)

    axes.set_xlabel('X Data')  # X axis data label
    axes.set_ylabel('Y Data')  # Y axis data label

    plt.show()
    plt.close('all')  # clean up after using pyplot
graphWidth = 800
graphHeight = 600
ModelAndScatterPlot(graphWidth, graphHeight)
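If you also want the fitted parameters printed rather than only plotted, a small follow-up to the code above, assuming parameterLists as filled in the curve fitting section:

# report the fitted a and b of a / x + b for each separated curve
for curveIndex, params in enumerate(parameterLists):
    print('curve', curveIndex, ': a =', params[0], ', b =', params[1])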
The idea:
create N naive, easy-to-calculate, sufficiently precise (for clustering) approximations. Then "classify" each data point to the closest such approximation.
This is done like this:
The approximations are analytical, using these two equations I derived for a curve of the form y = A/(x + B):

A = (y1*y2*(x1 - x2)) / (y2 - y1)
B = A/y1 - x1

where (x1, y1) and (x2, y2) are the coordinates of two points on the curve.
To get these two points I assumed that (1) the first points (according to the x-axis) are distributed equally between the different real curves, and (2) the first 2 points of each real curve are all smaller or all bigger than the first 2 points of every other real curve. Thus sorting them and dividing them into N groups will successfully cluster the first 2·N points. If these assumptions are false, you can still manually classify the first 2 points of each real curve and the rest will be classified automatically (this is actually the first approach I implemented).
Then cluster the rest of the points, assigning each point to its closest approximation ("closest" meaning the one with the smallest error).
Edit: A stronger approach for the initial approximation could be to calculate A and B for several pairs of points and use their mean A and B as the approximation, and possibly even to run K-means on these points/approximations (see the sketch after the code below).
The Code:
import numpy as np
import matplotlib.pyplot as plt
# You should probably edit this variable
NUM_OF_CURVES = 4
# <data> should be a 1-D array containing the Y values of the series
# <x_of_data> should be a 1-D array containing the corresponding X values of the series
data, x_of_data = np.loadtxt('...')
# clustering of first 2*num_of_curves points
# I started at NUM_OF_CURVES instead of 0 because my xs started at 0.
# The range (0:NUM_OF_CURVES*2) will probably be better for you.
raw_data = data[NUM_OF_CURVES:NUM_OF_CURVES*3]
raw_xs = x_of_data[NUM_OF_CURVES:NUM_OF_CURVES*3]
sort_ind = np.argsort(raw_data)
Y = raw_data[sort_ind].reshape(NUM_OF_CURVES,-1).T
X = raw_xs[sort_ind].reshape(NUM_OF_CURVES,-1).T
# approximation of A and B for each curve
A = ((Y[0]*Y[1])*(X[0]-X[1]))/(Y[1]-Y[0])
B = (A / Y[0]) - X[0]
# creating approximating curves
f = []
for i in range(NUM_OF_CURVES):
    f.append(A[i]/(x_of_data+B[i]))
curves = np.vstack(f)
# clustering the points to the approximating curves
raw_clusters = [[] for _ in range(NUM_OF_CURVES)]
for i in range(len(data)):
    raw_clusters[np.abs(curves[:,i]-data[i]).argmin()].append((x_of_data[i],data[i]))
# changing the clusters to np.arrays of the shape (2,-1)
# where row 0 contains the X coordinates and row 1 the Y coordinates
clusters = []
for i in range(len(raw_clusters)):
    clusters.append(np.array(list(zip(*raw_clusters[i]))))
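Regarding the edit above about a stronger initial approximation: a minimal sketch of averaging A and B over several point pairs, reusing the A/(x+B) formulas and the np import from the code above (approx_AB is an illustrative helper, not part of the original code):

from itertools import combinations

def approx_AB(xs, ys):
    # xs, ys: a few points believed to belong to the same curve y = A/(x + B)
    A_vals, B_vals = [], []
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        A_pair = (y1 * y2 * (x1 - x2)) / (y2 - y1)   # same pairwise formula as above
        B_pair = A_pair / y1 - x1
        A_vals.append(A_pair)
        B_vals.append(B_pair)
    return np.mean(A_vals), np.mean(B_vals)          # averaged estimates of A and B

These averaged estimates could then replace the single-pair A and B used to build the approximating curves.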
(Images in the original post showed an example: the raw mixed series and the separated series.)
So basically I have some data and I need to find a way to smooth it out (so that the line produced from it is smooth and not jittery). When plotted, the data currently looks like this:
and what I want it to look like is this:
I tried using this numpy method to get the equation of the line, but it did not work for me because the graph repeats (there are multiple readings, so the graph rises, saturates, then falls, and then repeats that multiple times), so there isn't really a single equation that can represent it.
I also tried this but it did not work for the same reason as above.
The graph is defined as such:
gx = [] #x is already taken so gx -> graphx
gy = [] #same as above
#Put in data
#Get nice data #[this is what I need help with]
#Plot nice data and original data
plt.xlabel('x')
plt.ylabel('y')
plt.legend()
plt.show()
The method I think would be most applicable to my solution is taking the average of every 2 points and setting that as the value of both points, but this idea doesn't sit right with me, as potential values may be lost.
You could use an infinite horizon filter:
import numpy as np
import matplotlib.pyplot as plt
x = 0.85 # adjust x to use more or less of the previous value
k = np.sin(np.linspace(0.5,1.5,100))+np.random.normal(0,0.05,100)
filtered = np.zeros_like(k)
# filtered[i] = oldvalue*x + newvalue*(1-x)
filtered[0] = k[0]
for i in range(1, len(k)):
    # uses a fraction x of the previous filtered value and (1-x) of the new value
    filtered[i] = filtered[i-1]*x + k[i]*(1-x)
plt.plot(k)
plt.plot(filtered)
plt.show()
I figured it out: by averaging 4 results I was able to significantly smooth out the graph (a demonstration plot was included in the original post).
Hope this helps whoever needs it
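For reference, a minimal sketch of that kind of averaging as a simple moving window (4 points here), assuming gy has been filled with the noisy y values from the question's skeleton:

import numpy as np

window = 4
kernel = np.ones(window) / window                  # equal weights -> plain average over 4 points
gy_smooth = np.convolve(gy, kernel, mode='same')   # smoothed copy with the same length as gy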
I have a large simulated data set in which I have passed through values and whatnot for an analysis. My main objective is to take actual, real recorded values and compare them to the simulated data via a cumulative distribution.
I start out by defining the method of going through each bin of the data set, taking values that have a certain value x and matching them to the "real" data analyzed at the same value x:
bins = np.linspace(SimData.min(), SimData.max(), 24)

def CumuProb(SimData, bins, x, realValue):
    h, bins_ = np.histogram(SimData, bins=bins)
    hcum = np.cumsum(h) / float(np.cumsum(h).max())

    cbins = np.zeros(len(bins) + 1)
    cbins[1:-1] = bins[1:] - np.diff(bins[:2])[0] / 2.
    cbins[-1] = bins[-1]

    hcumc = np.linspace(0, 1, len(cbins))
    hcumc[1:-1] = hcum

    p = [x, realValue]
    yi = np.interp(p[1], cbins, hcumc)
    return [p[1], yi]
This method works fine for large values. But if I pass in values << 1 but > 0, it fails miserably.
For example, using this method on my project gives the plot shown in the original post, where you can see at the very bottom that there are 2 points when there should be about 10 points, all on the blue line (the actual data).
The main culprit is found from this traceback:
RuntimeWarning: invalid value encountered in divide hcum = np.cumsum(h)/float(np.cumsum(h).max())
So this most likely has to do with how I am defining my bin size, which is defined as bin=np.linspace(np.log(binding).min(),np.log(binding).max(),24), i.e. binning over the logarithmic x-axis values in the plot above.
How do I fix this?
I can't be 100% sure, since the question lacks a lot of the relevant information needed, but judging from how you intend to use this function, it seems odd to put realValue into the interpolation. If, as the name suggests, x is the x-axis value of the data point to be investigated, the interpolation should take x instead:
yi = np.interp(x,cbins, hcumc)
return [x,yi]
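For completeness, a sketch of how the corrected function could then be called; SimData and the query values here are made up just to illustrate the intended signature:

import numpy as np

SimData = np.random.lognormal(mean=-2.0, sigma=1.0, size=10000)  # hypothetical simulated values << 1
bins = np.linspace(SimData.min(), SimData.max(), 24)

x_point, real_value = 0.05, 0.07      # hypothetical data point to look up
xq, cum_prob = CumuProb(SimData, bins, x_point, real_value)
print('cumulative probability at x = %.3g: %.3f' % (xq, cum_prob))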
I have two datasets of a specific region: the first is the rainfall and the second a vegetation measure (npp) of that region. So, the first two dimensions (x, y) represent the geographical location. The third dimension is the time (8 time steps). What I want to do is perform a linear regression for each location of the 8 rainfall values versus the 8 vegetation values. The result should be either several two-dimensional arrays in which, for each location, the p-value, the r², the slope and ideally the residuals are calculated, or all values together in a 3D array.
import glob
import numpy as np
from osgeo import gdal
from scipy import stats

nppList = glob.glob(nppPath+"*.img")
rainList = glob.glob(rainPath+"*.img")
nppImg = [gdal.Open(i) for i in nppList]
rainImg = [gdal.Open(i) for i in rainList]
nppFiles = [i.ReadAsArray() for i in nppImg]
rainFiles = [i.ReadAsArray() for i in rainImg]
# get nodata
nppNodata = nppImg[1].GetRasterBand(1).GetNoDataValue()
rainNodata = rainImg[1].GetRasterBand(1).GetNoDataValue()
# stack the 2-D rasters into 3-D arrays with time as the third axis
nppStack = np.dstack(nppFiles)
rainStack = np.dstack(rainFiles)

# convert to float and set no data
nppStack = nppStack.astype(float)
nppStack[nppStack == nppNodata] = np.nan
rainStack = rainStack.astype(float)
rainStack[rainStack == rainNodata] = np.nan
# instead of range(0,8) there should be the rainfall variable, but on a pixel basis
def linReg(a):
    return stats.linregress(a, range(0, 8))
lm = np.apply_along_axis(linReg, axis=2, arr=nppStack)
I know the function numpy.apply_along_axis(), but there a function can only be applied to one array. I am searching for a possibility to apply a function to two arrays along an axis, preferably without looping through the arrays.
The source for scipy.stats.linregress indicates that arrays with more than two dimensions are not supported (and even the two-dimensional case only works when your x and y data happen to be in the same data structure).
Honestly, in your case I would use a Python loop -- it is unlikely that the slowest part of the code is looping over the data points; rather, the regression itself will be determining the speed.
In that case, you could flatten your positional axes, use a single loop, and then reshape the regression results back to 3D. Something like:
nx, ny = nppStack.shape[:2]   # spatial dimensions of the stacks
n = nx * ny
frain = rainStack.reshape((n, 8))
fnpp = nppStack.reshape((n, 8))
reg_results = np.empty((n, 5))   # linregress returns slope, intercept, rvalue, pvalue, stderr
for i in range(n):
    reg_results[i] = stats.linregress(frain[i], fnpp[i])
reg_results = reg_results.reshape((nx, ny, 5))  # back to 3D: one result tuple per pixel
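Since linregress returns (slope, intercept, rvalue, pvalue, stderr), a small follow-up sketch to pull out the per-pixel maps asked for in the question, assuming reg_results has been reshaped as above and nppStack and rainStack as before:

slope_map = reg_results[:, :, 0]           # slope per pixel
intercept_map = reg_results[:, :, 1]       # intercept per pixel
r_squared_map = reg_results[:, :, 2]**2    # r^2 per pixel (rvalue is the third element)
p_value_map = reg_results[:, :, 3]         # p-value per pixel

# residuals per pixel and time step (observed npp minus the value predicted from rainfall)
residual_map = nppStack - (slope_map[:, :, None] * rainStack + intercept_map[:, :, None])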