How to divide two data sets (spectra) with different sizes?

I want to compute Spectrum_3 = Spectrum_1/Spectrum_2, but the two spectra have different sizes. How should I proceed? Since I am dealing with spectra, my approach is to decrease the resolution of Spectrum_1 so that the data sizes match (if you come from astrophysics, is this a correct approach?). In any case, I (think I) need to bin the data from Spectrum_1 so that its size matches the size of Spectrum_2.
arr1.size is 313136
synth_spec2.size is 102888
arr1_new = arr1.reshape(-1,2).mean(axis=1) # should be the answer but
# I don't fully understand it.
I need
len(arr1_new) == len(synth_spec2) #True

Generally you need to interpolate the two spectra onto a common wavelength grid, paying careful attention to the ends of the spectra if they don't overlap fully. I would suggest having a look at the synphot package and in particular the SourceSpectrum classes. Despite the name, it supports a variety of spectra: synthetic photometry is normally done by assembling a suitable source spectrum, applying reddening/extinction etc. to it, multiplying by a filter bandpass (which is also spectrum-like, being transmission against wavelength) and integrating to derive a flux.
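A minimal sketch of the interpolation idea described above, assuming each spectrum comes with its own wavelength array. The names wave1, flux1, wave2, flux2 and the function divide_on_common_grid are hypothetical, and so is the toy data. It uses plain NumPy linear interpolation rather than synphot, and it simply drops the parts of the spectra that do not overlap.

```python
import numpy as np

def divide_on_common_grid(wave1, flux1, wave2, flux2):
    """Interpolate spectrum 1 onto spectrum 2's wavelengths, then divide."""
    # Keep only the wavelength range covered by both spectra
    lo = max(wave1.min(), wave2.min())
    hi = min(wave1.max(), wave2.max())
    mask = (wave2 >= lo) & (wave2 <= hi)
    common_wave = wave2[mask]
    # np.interp expects wave1 to be sorted in increasing order
    flux1_on_grid = np.interp(common_wave, wave1, flux1)
    return common_wave, flux1_on_grid / flux2[mask]

# Hypothetical toy spectra with different sizes and partial overlap
wave1 = np.linspace(4000.0, 7000.0, 3001)
flux1 = 2.0 + 0.001 * wave1
wave2 = np.linspace(4500.0, 7500.0, 1001)
flux2 = np.full_like(wave2, 2.0)

wave3, spectrum_3 = divide_on_common_grid(wave1, flux1, wave2, flux2)
```

Linear interpolation does not conserve flux when the resolution is degraded, so for real data a flux-conserving rebinning may be the better choice.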

Related

Clustering on large, mixed type data

I'm dealing with a dataframe of dimension 4 million x 70. Most columns are numeric, some are categorical, and there are occasional missing values. It is essential that the clustering is run on all data points, and we look to produce around 400,000 clusters (so subsampling the dataset is not an option).
I have looked at using Gower's distance metric for mixed type data, but this produces a dissimilarity matrix of dimension 4 million x 4 million, which is just not feasible to work with since it has 10^13 elements. So, the method needs to avoid dissimilarity matrices entirely.
Ideally, we would use an agglomerative clustering method, since we want a large number of clusters.
What would be a suitable method for this problem? I am struggling to find a method which meets all of these requirements, and I realise it's a big ask.
Plan B is to use a simple rules-based grouping method based on categorical variables alone, handpicking only a few variables to cluster on since we will suffer from the curse of dimensionality otherwise.

The first step is going to be turning those categorical values into numbers somehow, and the second step is going to be putting the now all numeric attributes into the same scale.
Clustering is computationally expensive, so you might try a third step of representing this data by the top 10 components of a PCA (or however many components have an eigenvalue > 1) to reduce the columns.
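The three preprocessing steps above (encode, scale, reduce) can be sketched as follows. This is an illustration in plain NumPy on hypothetical toy data: the helper names one_hot, standardize and pca_reduce are made up for the example, one-hot encoding is only one of several ways to turn categories into numbers, and missing values are not handled.

```python
import numpy as np

def one_hot(column):
    """Turn a 1-D array of category labels into 0/1 indicator columns."""
    categories, codes = np.unique(column, return_inverse=True)
    return np.eye(len(categories))[codes]

def standardize(x):
    """Put every column on the same scale (zero mean, unit variance)."""
    std = x.std(axis=0)
    std[std == 0] = 1.0  # avoid dividing by zero for constant columns
    return (x - x.mean(axis=0)) / std

def pca_reduce(x, n_components):
    """Project the rows of x onto the top principal components."""
    centered = x - x.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

# Hypothetical toy data: 5 numeric columns and 1 categorical column
rng = np.random.default_rng(0)
numeric = rng.normal(size=(1000, 5))
category = rng.choice(np.array(["a", "b", "c"]), size=1000)

features = standardize(np.hstack([numeric, one_hot(category)]))
reduced = pca_reduce(features, n_components=3)
```

The reduced array is what would then be handed to the clustering step.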
For the clustering step, you'll have your choice of algorithms. I would think something hierarchical would be helpful for you, since even though you expect a high number of clusters, it makes intuitive sense that those clusters would fall under larger clusters that continue to make sense all the way down to a small number of "parent" clusters. A popular choice might be HDBSCAN, but I tend to prefer trying OPTICS. The implementation in the free ELKI toolkit seems to be the fastest (it takes some messing around to figure it out) because it runs in Java. The output of ELKI is a little strange: it writes a file for every cluster, so you then have to use Python to loop through the files and create your final mapping, unfortunately. But it's all doable (including executing the ELKI command) from Python if you're building an automated pipeline.

Comparing feature extractors (or comparing aligned images)

I'd like to compare ORB, SIFT, BRISK, AKAZE, etc. to find which works best for my specific image set. I'm interested in the final alignment of images.
Is there a standard way to do it?
I'm considering this solution: take each algorithm, extract the features, compute the homography and transform the image.
Now I need to check which transformed image is closer to the target template.
Maybe I can repeat the process with the target template and the transformed image and look for the homography matrix closest to the identity, but I'm not sure how to compute this closeness exactly. And I'm not sure which algorithm I should use for this check; I suppose a fixed one.
Or I could do some pixel-level comparison between the images using a perceptual difference hash (dHash). But I suspect the resulting Hamming distance may not be very good for images that will be nearly identical.
I could blur them and do a simple subtraction but sounds quite weak.
Thanks for any suggestions.
EDIT: I have thousands of images to test. These are real-world pictures. The images are of documents of different kinds, some with a lot of graphics, others mostly geometrical. I have about 30 different templates. I suspect different templates work best with different algorithms (I know the template in advance, so I could pick the best one).
Right now I use cv2.matchTemplate to find some reference patches in the transformed images and I compare their locations to the reference ones. It works but I'd like to improve over this.

From your question, it seems like the task is not to compare the feature extractors themselves, but rather to find which type of feature extractor leads to the best alignment.
For this, you need two things:
1) a way to perform the alignment using the features from different extractors
2) a way to check the accuracy of the alignment
The algorithm you suggested is a good approach for doing the alignment. To check its accuracy, you need to know what a good alignment is.
You may start with an alignment you already know. The easiest way to know the alignment between two images is to apply the transformation yourself. For example, starting with one image, you rotate it by some amount, you translate/crop/scale it, or you combine all these operations. Knowing how you obtained the image, you can derive your ideal alignment (the one that undoes your operations).
Then, having the ideal alignment and the alignment generated by your algorithm, you can use one metric to evaluate its accuracy, depending on your definition of "good alignment".
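One possible metric, sketched under assumptions: map the four image corners through both the ideal homography and the estimated one, and take the mean distance between the results in pixels. The function names, the image size and the numbers below are hypothetical. In a real pipeline h_est would come from the feature-matching step (for instance from cv2.findHomography), whereas here it is simply a perturbed copy of the known transform.

```python
import numpy as np

def apply_homography(h, points):
    """Map an (n, 2) array of points through a 3x3 homography."""
    homogeneous = np.hstack([points, np.ones((len(points), 1))])
    mapped = homogeneous @ h.T
    return mapped[:, :2] / mapped[:, 2:3]

def corner_error(h_true, h_est, width, height):
    """Mean distance (in pixels) between the image corners mapped by each homography."""
    corners = np.array(
        [[0, 0], [width, 0], [width, height], [0, height]], dtype=float
    )
    diff = apply_homography(h_true, corners) - apply_homography(h_est, corners)
    return float(np.mean(np.linalg.norm(diff, axis=1)))

# Known transform: 5 degree rotation plus a translation (hypothetical values)
theta = np.deg2rad(5.0)
h_true = np.array([
    [np.cos(theta), -np.sin(theta), 12.0],
    [np.sin(theta),  np.cos(theta), -7.0],
    [0.0,            0.0,            1.0],
])

# Stand-in for an estimated homography: off by 1.5 pixels in x
h_est = h_true.copy()
h_est[0, 2] += 1.5

error = corner_error(h_true, h_est, width=640, height=480)
```

Computing this error for each extractor over many synthetic transforms gives one number per extractor and template, which makes the comparison straightforward.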

Why should I discard half of what a FFT returns?

Looking at this answer:
Python Scipy FFT wav files
The technical part is obvious and working, but I have three theoretical questions (the code mentioned is below):
1) Why do I have to normalize (b=...) the frames? What would happen if I used the raw data?
2) Why should I only use half of the FFT result (d=...)?
3) Why should I abs(c) the FFT result?
Perhaps I'm missing something due to inadequate understanding of WAV format or FFT, but while this code works just fine, I'd be glad to understand why it works and how to make the best use of it.
Edit: in response to the comment by @Trilarion:
I'm trying to write a simple, not 100% accurate but more like a proof-of-concept Speaker Diarisation in Python. That means taking a wav file (right now I am using this one for my tests) and in each second (or any other resolution) saying whether the speaker is person #1 or person #2. I know in advance that there are 2 persons and I am not trying to link them to any known voice signatures, just to separate them. Right now I take each second, FFT it (and thus get a list of frequencies), and cluster the results using KMeans with the number of clusters between 2 and 4 (A, B [,Silence [,A+B]]).
I'm still new to analyzing wav files and audio in general.
import matplotlib.pyplot as plt
import scipy.fft as sfft # FFT routines
from scipy.io import wavfile # get the api
fs, data = wavfile.read('test.wav') # load the data
a = data.T[0] # this is a two channel soundtrack, I get the first track
b = [(ele/2**8.)*2-1 for ele in a] # this is 8-bit track, b is now normalized on [-1,1)
c = sfft.fft(b) # create a list of complex numbers
d = len(c)//2 # you only need half of the fft list; integer division so d can be used as an index
plt.plot(abs(c[:(d-1)]),'r')
plt.show()

To address these in order:
1) You don't need to normalize, but the raw sample values mirror the digitized encoding of the waveform, so the numbers are unintuitive. For example, how loud is a value of 67? It's easier to interpret the values once they are normalized to the range -1 to 1. (But if you wanted to implement a filter, for example, where you do an FFT, modify the FFT values, and follow with an IFFT, normalizing would be an unnecessary hassle.)
2) and 3) are similar in that they both have to do with the math living primarily in the space of complex numbers. That is, an FFT takes a sequence of complex numbers (e.g., [.5+.1j, .4+.7j, .4+.6j, ...]) to another sequence of complex numbers.
So in detail:
2) It turns out that if the input waveform is real instead of complex, then the FFT is conjugate-symmetric about frequency 0: the value at -f is the complex conjugate of the value at f. So only the values with a frequency >= 0 are uniquely interesting.
3) The values output by the FFT are complex, so they have a real and an imaginary part, but each can also be expressed as a magnitude and a phase. For audio signals, it's usually the magnitude that's the most interesting, because this is primarily what we hear. Therefore people often use abs (which is the magnitude), but the phase can be important for different problems as well.
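Points 2) and 3) can be checked numerically. The sketch below uses a hypothetical random real signal and NumPy's FFT routines (not the scipy calls from the question) to show that the second half of the FFT mirrors the first, and that each coefficient splits into a magnitude and a phase.

```python
import numpy as np

rng = np.random.default_rng(0)
signal = rng.normal(size=16)  # hypothetical real-valued input

spectrum = np.fft.fft(signal)

# For real input, bin k and bin n-k are complex conjugates of each other
mirrored = np.conj(spectrum[1:][::-1])
is_symmetric = np.allclose(spectrum[1:], mirrored)

# rfft returns only the non-redundant half: n//2 + 1 bins
half = np.fft.rfft(signal)

# Each complex coefficient can be written as magnitude and phase
magnitude = np.abs(half)
phase = np.angle(half)
```

Here is_symmetric evaluates to True, and half holds the same values as the first 9 entries of spectrum.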

That depends on what you're trying to do. It looks like you're only looking to plot the spectral density and then it's OK to do so.
In general, each coefficient in the DFT depends on the phase at that frequency, so if you want to keep phase information you have to keep the argument of the complex numbers.
The symmetry you see is only guaranteed if the input is a real-valued sequence (IIRC). It's related to the mirroring distortion you'll get if you have frequencies above the Nyquist frequency (half the sampling frequency): the original frequency shows up in the DFT, but so does the mirrored frequency.
If you're going to inverse DFT you should keep the full data and also keep the arguments of the DFT-coefficients.
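A small illustration of why the arguments matter for the inverse transform, again with a hypothetical random signal and NumPy's FFT: inverting the full complex spectrum recovers the signal, while inverting only the magnitudes does not.

```python
import numpy as np

rng = np.random.default_rng(1)
signal = rng.normal(size=32)  # hypothetical real-valued input

spectrum = np.fft.fft(signal)

# Inverting the full complex spectrum gives the original signal back
restored = np.fft.ifft(spectrum).real
round_trip_ok = np.allclose(restored, signal)

# Throwing away the phase (keeping abs only) loses information
magnitude_only = np.fft.ifft(np.abs(spectrum)).real
magnitude_only_ok = np.allclose(magnitude_only, signal)
```

round_trip_ok is True and magnitude_only_ok is False for this signal.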

Classify into one of two sets (without learning)

I am dealing with a problem where I would like to automatically divide a set into two subsets, knowing that ALMOST ALL of the objects in the set A will have greater values in all of the dimensions than objects in the set B.
I know I could use machine learning but I need it to be fully automated, as in various instances of a problem objects of set A and set B will have different values (so values in set B of the problem instance 2 might be greater than values in set A of the problem instance 1!).
I imagine the solution could be something like finding objects which are the best representatives of those two sets (the density of the objects around them is the highest).
Finding N best representatives of both sets would be sufficient for me.
Does anyone know the name of the problem and/or could propose the implementation for that? (Python is preferable).
Cheers!

You could try some of the clustering methods, which belong to unsupervised machine learning. The result depends on your data and how it is distributed. Judging by your picture, I think the k-means algorithm could work. The Python machine learning library scikit-learn already contains a k-means implementation: http://scikit-learn.org/stable/modules/clustering.html#k-means
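To show what k-means does with two clusters, here is a bare-bones version in plain NumPy. It is a stand-in for illustration, not the scikit-learn implementation linked above. The function name two_means, the initialisation from the points with the smallest and largest coordinate sums, and the toy data are all assumptions made for this example.

```python
import numpy as np

def two_means(points, n_iter=50):
    """Very small k-means with k=2; returns labels and cluster centers."""
    # Start from the points with the smallest and largest coordinate sums
    sums = points.sum(axis=1)
    centers = np.array([points[sums.argmin()], points[sums.argmax()]], dtype=float)
    for _ in range(n_iter):
        # Assign every point to its nearest center
        distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Move each center to the mean of its assigned points
        new_centers = np.array([points[labels == k].mean(axis=0) for k in range(2)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Hypothetical data: set B has small values, set A has larger values in every dimension
rng = np.random.default_rng(0)
set_b = rng.normal(loc=0.0, scale=1.0, size=(100, 3))
set_a = rng.normal(loc=6.0, scale=1.0, size=(100, 3))
points = np.vstack([set_b, set_a])

labels, centers = two_means(points)
```

With this initialisation, label 0 corresponds to the low-valued set and label 1 to the high-valued set.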

If your data is as easy as you explained, then there are some rather obvious approaches.
Center and count:
Center your data set, and count for each object how many values are positive. If more values are positive than negative, it will likely be in the red class.
Length histogram:
Compute the sum of each vector. Make a histogram of the values. Split at the largest gap: vectors whose sum is above the threshold go in one group, the others in the lower group.
I have made an IPython notebook demonstrating this approach available.
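Both heuristics fit in a few lines of NumPy. The sketch below is not the notebook mentioned above. The function names and the toy data are hypothetical, centering uses the per-dimension mean (which assumes the two sets are of roughly similar size), and the "histogram" step is done by sorting the sums and looking for the widest gap between neighbours.

```python
import numpy as np

def center_and_count(points):
    """True where a majority of the centered coordinates are positive."""
    centered = points - points.mean(axis=0)
    positives = (centered > 0).sum(axis=1)
    return positives > points.shape[1] / 2

def split_at_largest_gap(points):
    """True where the coordinate sum lies above the widest gap in the sorted sums."""
    sums = points.sum(axis=1)
    ordered = np.sort(sums)
    gap_index = np.diff(ordered).argmax()
    threshold = (ordered[gap_index] + ordered[gap_index + 1]) / 2
    return sums > threshold

# Hypothetical data: first 100 rows are set B (low), last 100 are set A (high)
rng = np.random.default_rng(0)
set_b = rng.normal(loc=0.0, scale=1.0, size=(100, 3))
set_a = rng.normal(loc=6.0, scale=1.0, size=(100, 3))
points = np.vstack([set_b, set_a])

in_a_by_count = center_and_count(points)
in_a_by_gap = split_at_largest_gap(points)
```

On well-separated data like this the two boolean masks agree; they can disagree when the sets overlap or are very unbalanced.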

Classifying a Distribution of Points for Object Identification

I have some points that I need to classify. Given the collection of these points, I need to say which other (known) distribution they match best. For example, given the points in the top left distribution, my algorithm would have to say whether they are a better match to the 2nd, 3rd, or 4th distribution. (Here the bottom-left would be correct due to the similar orientations)
I have some background in Machine Learning, but I am no expert. I was thinking of using Gaussian Mixture Models, or perhaps Hidden Markov Models (as I have previously classified signatures with these, a similar problem).
I would appreciate any help as to which approach to use for this problem. As background information, I am working with OpenCV and Python, so I would most likely not have to implement the chosen algorithm from scratch, I just want a pointer to know which algorithms would be applicable to this problem.
Disclaimer: I originally wanted to post this on the Mathematics section of StackExchange, but I lacked the necessary reputation to post images. I felt that my point could not be made clear without showing some images, so I posted it here instead. I believe that it is still relevant to Computer Vision and Machine Learning, as it will eventually be used for object identification.
EDIT:
I read and considered some of the answers given below, and would now like to add some new information. My main reason for not wanting to model these distributions as a single Gaussian is that eventually I will also have to be able to discriminate between distributions. That is, there might be two different and separate distributions representing two different objects, and then my algorithm should be aware that only one of the two distributions represents the object that we are interested in.

I think this depends on where exactly the data comes from and what sort of assumptions you would like to make as to its distribution. The points above can easily be drawn even from a single Gaussian distribution, in which case the estimation of parameters for each one and then the selection of the closest match are pretty simple.
Alternatively you could go for the discriminative option, i.e. calculate whatever statistics you think may be helpful in determining the class a set of points belongs to and perform classification using SVM or something similar. This can be viewed as embedding these samples (sets of 2d points) in a higher-dimensional space to get a single vector.
Also, if the data is actually as simple as in this example, you could just do a principal component analysis and match by the first eigenvector.
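The PCA idea can be sketched as follows, assuming 2-D point sets whose main difference is orientation. The helper names and the synthetic elongated clouds are hypothetical. The first eigenvector is taken from the covariance matrix, and two sets are compared by the absolute cosine between their first eigenvectors (absolute because the sign of an eigenvector is arbitrary).

```python
import numpy as np

def principal_axis(points):
    """Unit vector along the direction of largest variance."""
    centered = points - points.mean(axis=0)
    eigenvalues, eigenvectors = np.linalg.eigh(np.cov(centered.T))
    return eigenvectors[:, eigenvalues.argmax()]

def best_match(query, candidates):
    """Index of the candidate whose principal axis is most parallel to the query's."""
    query_axis = principal_axis(query)
    scores = [abs(query_axis @ principal_axis(c)) for c in candidates]
    return int(np.argmax(scores))

def elongated_cloud(rng, angle_deg, n=300):
    """Hypothetical test data: a stretched Gaussian cloud rotated by angle_deg."""
    angle = np.deg2rad(angle_deg)
    rotation = np.array([[np.cos(angle), -np.sin(angle)],
                         [np.sin(angle),  np.cos(angle)]])
    return (rng.normal(size=(n, 2)) * [5.0, 1.0]) @ rotation.T

rng = np.random.default_rng(0)
query = elongated_cloud(rng, 40)
candidates = [elongated_cloud(rng, a) for a in (0, 45, 90)]

match = best_match(query, candidates)
```

This only compares orientation, so it cannot tell apart distributions that share a main axis but differ in position or spread.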

You should just fit the distributions to the data, determine the chi^2 deviation for each one, and look at an F-test. See for instance these notes on model fitting etc.

You might also want to consider non-parametric techniques (e.g. multivariate kernel density estimation on each of your data sets) in order to compare statistics of, or distances between, the estimated distributions. In Python, scipy.stats.gaussian_kde is one implementation.
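One way to use kernel density estimates for the matching, sketched under assumptions: fit a KDE to each known distribution and pick the one under which the new points have the highest mean log-density. The helper names and the synthetic clouds are hypothetical, and this scoring rule is just one of several reasonable choices.

```python
import numpy as np
from scipy.stats import gaussian_kde

def best_kde_match(query, candidates):
    """Index of the candidate whose KDE gives the query points the highest mean log-density."""
    scores = []
    for candidate in candidates:
        kde = gaussian_kde(candidate.T)  # gaussian_kde expects shape (dims, n_points)
        scores.append(kde.logpdf(query.T).mean())
    return int(np.argmax(scores))

def elongated_cloud(rng, angle_deg, n=300):
    """Hypothetical test data: a stretched Gaussian cloud rotated by angle_deg."""
    angle = np.deg2rad(angle_deg)
    rotation = np.array([[np.cos(angle), -np.sin(angle)],
                         [np.sin(angle),  np.cos(angle)]])
    return (rng.normal(size=(n, 2)) * [5.0, 1.0]) @ rotation.T

rng = np.random.default_rng(0)
query = elongated_cloud(rng, 40)
candidates = [elongated_cloud(rng, a) for a in (0, 45, 90)]

match = best_kde_match(query, candidates)
```

Unlike the single-Gaussian approach, a KDE makes no assumption about the shape of each distribution, at the cost of needing enough points per set.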
