I'm working with some data, trying to create a 2D polynomial fit just like IRAF's surfit (see here). I have 16 data points distributed in a grid pattern (i.e. pixel values at 16 different x- and y-coordinates) that need to be fitted to produce a 1024x1024 array. I've tried a bunch of different methods, starting with things like astropy.modeling and scipy.interpolate, but nothing gives quite the right result compared to IRAF's surfit. I imagine it's because I'm only using 16 data points, but that's all I have! The result should look something like this:
But what I'm getting looks more like this:
or this:
If you have any suggestions for how best to accomplish this task, I would very much appreciate your input! Thank you.
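For concreteness, here's the sort of thing I've been trying: a plain least-squares power-series surface, a rough stand-in for surfit's low-order polynomial fits rather than a reimplementation. The 4x4 grid coordinates and pixel values below are made-up placeholders for my 16 points:

import numpy as np

# placeholder 4x4 grid of coordinates and pixel values (not my real data)
x = np.repeat(np.linspace(0, 1023, 4), 4)
y = np.tile(np.linspace(0, 1023, 4), 4)
z = np.random.rand(16)

order = 2                                   # keep the order low with only 16 points
xs, ys = x / 1023.0, y / 1023.0             # normalize coordinates for conditioning
terms = [(i, j) for i in range(order + 1) for j in range(order + 1)]
A = np.column_stack([xs**i * ys**j for i, j in terms])
coef, _, _, _ = np.linalg.lstsq(A, z, rcond=None)

yy, xx = np.mgrid[0:1024, 0:1024] / 1023.0  # full-frame evaluation grid
surface = sum(c * xx**i * yy**j for c, (i, j) in zip(coef, terms))
print(surface.shape)                        # (1024, 1024)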
I have a dataset that you see below. The data is pretty noisy, but there is a clear linear trend that goes up and to the right. I'd like to transform the data with y = m * x to make the lines horizontal. Essentially, I'd like to do a regression on the orange lines to pull out the slope, but I don't know how to extract the different linear clusters. Is there a good method for transforming data like this? I'm using python/pandas/numpy.
It looks like you'll want to try clustering the orange points. Some clustering methods will cope with the parallel clusters. I would probably start with DBSCAN.
For more on clustering, check out the tutorial on this scikit-learn page. Your situation is a bit like the 4th row here:
If you provide your data, I expect several people will take a look at it.
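For instance, a minimal sketch on synthetic stand-in data (two noisy parallel lines; you would need to tune eps and min_samples to the scale of your data):

import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 400)
y = 2.0 * x + rng.normal(0, 0.2, 400)
y[200:] += 5                               # second, parallel line
xy = np.column_stack([x, y])

labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(xy)
for k in set(labels) - {-1}:               # -1 is DBSCAN's noise label
    m, b = np.polyfit(xy[labels == k, 0], xy[labels == k, 1], 1)
    print("cluster %d: slope %.2f" % (k, m))

Once you have a slope per cluster, subtracting m * x (or dividing y by m) should flatten the lines as you describe.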
Let's say the price of houses (the target variable) can easily be plotted against the area of houses (a single predictor variable): we can see the data plotted and draw a best-fit line through it.
However, suppose we have several predictor variables (size, no. of bedrooms, locality, no. of floors, etc.). How am I going to plot all of these against the target variable and visualize them on a 2-D figure?
The computation shouldn't be an issue (the math works regardless of dimensionality), but the plotting definitely gets tricky. PCA can be hard to interpret and forcing orthogonality might not be appropriate here. I'd check out some of the advice provided here: https://stats.stackexchange.com/questions/73320/how-to-visualize-a-fitted-multiple-regression-model
Fundamentally, it depends on what you are trying to communicate. Goodness of fit? Maybe throw together multiple plots of residuals.
If you truly want a 2D figure, that's certainly not easy. One possible approach would be to reduce the dimensionality of your data to 2 using something like Principal Component Analysis. Then you can plot it in two dimensions again. Reducing to 3 dimensions instead of 2 might also still work; humans can understand 3D plots drawn on a 2D screen fairly well.
You don't normally need to do linear regression by hand though, so you don't need a 2D drawing of your data either. You can just let your computer compute the linear regression, and that works perfectly fine with way more than 2 or 3 dimensions.
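As an illustration, a hedged sketch with made-up housing features: the regression itself is fit in all four dimensions without any trouble, and PCA is used only to get a 2-D picture of the samples, colored by the target:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))               # stand-ins for size, bedrooms, locality, floors
price = X @ np.array([50.0, 20.0, 10.0, 5.0]) + rng.normal(0, 5, 200)

model = LinearRegression().fit(X, price)    # the fit itself works fine in 4-D
print("R^2:", model.score(X, price))

X2 = PCA(n_components=2).fit_transform(X)   # reduce to 2-D for plotting only
plt.scatter(X2[:, 0], X2[:, 1], c=price)    # color encodes the target
plt.colorbar(label="price")
plt.show()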
I have a set of data that I would like to get an interpolating function for. MATLAB's interpolating functions seem to only return values at a finer set of discrete points. However, for my purposes, I need to be able to look up the function value for any input. What I'm looking for is something like SciPy's "interp1d."
That appears to be what ppval is for. It looks like many of the 1D interpolation functions have a pp variant that plugs into this.
Disclaimer: I haven't actually tried this.
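For what it's worth, the usual pattern would look something like this (equally untested, so treat it as a sketch):

x = 0:10;
y = sin(x);
pp = spline(x, y);      % spline/pchip with no query points return a pp struct
v = ppval(pp, 3.7)      % evaluate the interpolant at any input you like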
I want to compare two distributions using the two-sample K-S test.
I'm using SciPy's ks_2samp (on Python 2.7), but I'm having some trouble.
First of all, I don't understand whether I should pass just the arrays with my data as parameters, or build cumulative distributions of them first. I'm guessing the former...
Secondly, when I use ks_2samp on my data, the p-values it returns don't look realistic...
For example, for a couple of distributions that look like this:
CDF of 2 datasets
ks_2samp returns:
D-value = 0.038629201101928384
P-value = 0.0
That would mean, roughly speaking, that the two samples don't come from the same distribution, which seems very strange for these data. A result of exactly 0.0 also looks odd, since the test usually returns values with many decimal places...
With similar input data I get, for example, p-value = 6.65e-136, which also seems very strange.
What could be the problem? Or is it all right?
My arrays contain many NaNs, but I also ran ks_2samp on data with the NaNs masked out and got the same result, so I don't think they're the cause...
Thank you very much in advance!
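For reference, a minimal sketch of how ks_2samp is usually called: you pass the raw samples (not pre-built CDFs), with NaNs masked out first. Note also that the K-S p-value shrinks rapidly with sample size, so with tens of thousands of points even a small D can give a p-value that rounds to 0.0. The arrays below are made up:

import numpy as np
from scipy.stats import ks_2samp

np.random.seed(0)
a = np.random.normal(0.0, 1.0, 50000)      # made-up sample 1
b = np.random.normal(0.05, 1.0, 50000)     # made-up sample 2, slightly shifted
a[::100] = np.nan                          # sprinkle in some NaNs

# pass the raw samples, with the NaNs masked out
d, p = ks_2samp(a[~np.isnan(a)], b[~np.isnan(b)])
print(d, p)                                # at n = 50000 even a tiny shift gives a minuscule p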
I'm supposed to be doing a kmeans clustering implementation with some data. The example I looked at from http://glowingpython.blogspot.com/2012/04/k-means-clustering-with-scipy.html shows their test data in 2 columns... however, the data I'm given is 68 subjects with 78 features (so a 68x78 matrix). How am I supposed to create an appropriate input for this?
I've basically just tried inputting the matrix anyway, but it doesn't seem to do what I want... and I don't know why it would. I'm pretty confused as to what to do.
import numpy as np
from scipy.cluster.vq import kmeans, vq
from matplotlib.pyplot import plot, show

# kmeans expects one observation per row, so make the rows the 68 subjects
data = np.rot90(data)          # data is my matrix, loaded earlier
centroids, _ = kmeans(data, 2)
# assign each sample to a cluster
idx, _ = vq(data, centroids)
# plot the first two features, using numpy's logical indexing
plot(data[idx == 0, 0], data[idx == 0, 1], 'ob',
     data[idx == 1, 0], data[idx == 1, 1], 'or')
plot(centroids[:, 0], centroids[:, 1], 'sg', markersize=8)
show()
I honestly don't know what other code to show you: the data format is as I described above, and otherwise it's the same as the tutorial I linked.
Your visualization only uses the first two dimensions.
That is why these points appear to be "incorrect" - they are closer in a different dimension.
Have a look at the next two dimensions:
plot(data[idx == 0, 2], data[idx == 0, 3], 'ob',
     data[idx == 1, 2], data[idx == 1, 3], 'or')
plot(centroids[:, 2], centroids[:, 3], 'sg', markersize=8)
show()
... and repeat for the rest of your 78 dimensions ...
With this many features, (squared) Euclidean distance becomes nearly meaningless, and k-means results tend to be little better than random convex partitions.
To get a more representative view, consider using MDS to project the data into 2d for visualization. It should work reasonably fast with just 68 subjects.
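A rough sketch with random placeholder data (since we don't have yours):

import numpy as np
from scipy.cluster.vq import kmeans, vq
from sklearn.manifold import MDS
from matplotlib.pyplot import scatter, show

data = np.random.rand(68, 78)                  # placeholder for your 68x78 matrix
centroids, _ = kmeans(data, 2)
idx, _ = vq(data, centroids)                   # cluster labels, as in your snippet
emb = MDS(n_components=2).fit_transform(data)  # all 78 features feed into the 2-D embedding
scatter(emb[:, 0], emb[:, 1], c=idx)           # color by cluster assignment
show()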
Please include visualizations in your questions. We don't have your data.