Pytorch/numpy/python: efficient interpolation of corrupted data

I have discrete temporal data which may have missing values.
I also have a mask indicating where those values are.
How can I perform an efficient interpolation to fill those values?
In practice I have a TxCxJ tensor (Q). Some elements are, let's say, corrupted. Given a corrupted element Q[t,c,j], I would like to fill that value with an interpolation between Q[t-1,c,j] and Q[t+1,c,j].
Also, in the worst case I may find several consecutive corrupted elements:
Q[t_0:t_1,c,j]
to be filled with the interpolation between
Q[t_0-1,c,j] and Q[t_1+1,c,j]
It is OK to use linear interpolation using numpy or pytorch (or any other suitable library I don't need to study for five months :/). I can code it with for loops, but I was looking for an efficient library call that accepts a mask, or some sort of clever indexing/masking, so the algorithm doesn't have to run over the known points.
Thanks!
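One possible approach (not from the original thread, just a sketch under the stated assumptions): np.interp handles exactly this kind of gap filling along one axis, so Python loops are only needed over the C and J dimensions, never over time. The function name fill_corrupted and the boolean corrupted mask (True where a value is bad) are illustrative choices.

import numpy as np

def fill_corrupted(Q, corrupted):
    # Q: array of shape (T, C, J); corrupted: boolean mask of the same shape,
    # True where the value should be replaced by interpolation over time.
    Q = Q.astype(float)                      # astype makes a copy, so the input stays untouched
    t = np.arange(Q.shape[0])
    for c in range(Q.shape[1]):
        for j in range(Q.shape[2]):
            bad = corrupted[:, c, j]
            if bad.any() and not bad.all():
                # np.interp fills each bad frame from its nearest valid neighbours,
                # so runs of consecutive bad frames are handled automatically.
                Q[bad, c, j] = np.interp(t[bad], t[~bad], Q[~bad, c, j])
    return Q

Note that np.interp clamps to the nearest valid value when a corrupted run touches the start or end of the sequence; a PyTorch tensor could be converted with Q.cpu().numpy() first and back with torch.from_numpy afterwards.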

Related

How do I apply a mean filter on a data set in Python?

I have a data set of 15,497 samples of the angle of a pendulum vs. sample number. The raw data, plotted, obviously looks awful; it should look like the filtered data in the second plot. Part of the assignment is introducing a mean filter to smooth the data, making it look like the data on the second graph. The data is put into np.arrays in Python, but using np.arrays I can't seem to figure out how to introduce a mean filter.
I'm interested in applying a mean filter to theta, as theta holds the values on the y-axis of the plots.
There is a whole world of filtering techniques, and there is not a single unique 'mean filter'. Moreover, there are causal and non-causal filters (i.e. filters that use only past values vs. filters that also use future values). I'm going to assume you want a mean filter of size N, as that is pretty standard. Then, to apply this filter, convolve your theta vector with a mean kernel.
I suggest printing the mean kernel and studying how it looks with different N. Then you may understand how it is averaging the values in the signal. I also urge you to think about why convolution is applying this filter to theta. I'll help you by telling you to think about the equivalent multiplication in the frequency domain. Also, investigate the different modes in the convolution function, as this may be more tailored for the specific solution you desire.
import numpy as np

N = 2                                      # filter length; larger N smooths more strongly
mean_kernel = np.ones(N) / N               # each tap weighs 1/N, so the output is a local average
filtered_sig = np.convolve(sig, mean_kernel, mode='same')   # sig is your theta array

How to perform clustering on a dataset containing TRUE/FALSE values in Python?

My dataset contains columns describing abilities of certain characters, filled with True/False values. There are no empty values. My ultimate goal is to make groups of characters with similar abilities. And here's the question:
Should I change the True/False values to 1 and 0, or is there no need for that?
What clustering model should I use? Is K-means okay for that?
How do I interpret the results (output)? Can I visualize them?
The thing is, I always see people perform clustering on numeric datasets that you can visualize, which looks much easier to do. With True/False values I just don't know how to approach it.
Thanks.
In general there is no need to change True/False to 0/1. This is only necessary if you want to apply a specific algorithm for clustering that cannot deal with boolean inputs, like K-means.
K-means is not a preferred option. K-means requires continuous features as input, as it is based on computing distances, like many clustering algorithms. So no boolean inputs. And although binary (0-1) input works, the distances it computes are not very meaningful (many points will be at the same distance from each other). For 0-1 data only, I would not use clustering but would recommend tabulating the data to see which cells occur frequently. If you have a large data set you might use the Apriori algorithm to find the frequently occurring cells.
In general, a clustering algorithm returns a cluster number for each observation. In low dimensions, this number is frequently used to give a color to an observation in a scatter plot. In your case of boolean values, however, I would just list the most frequently occurring cells.
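As a rough illustration of that tabulation idea (not part of the original answer), assuming the abilities live in a pandas DataFrame df of boolean columns; the column names below are made up:

import pandas as pd

# Hypothetical boolean ability table standing in for the real data.
df = pd.DataFrame({
    "can_fly":    [True, False, True, True, False],
    "can_swim":   [False, False, True, True, False],
    "has_shield": [True, True, True, True, False],
})

# One row per unique True/False combination ("cell"), most frequent first.
cell_counts = df.value_counts()
print(cell_counts)

Each index entry of cell_counts is one combination of abilities, so the top rows are the natural groups of similar characters.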

Clustering on large, mixed type data

I'm dealing with a dataframe of dimension 4 million x 70. Most columns are numeric and some are categorical, in addition to the occasional missing values. It is essential that the clustering is run on all data points, and we are looking to produce around 400,000 clusters (so subsampling the dataset is not an option).
I have looked at using Gower's distance metric for mixed type data, but this produces a dissimilarity matrix of dimension 4 million x 4 million, which is just not feasible to work with since it has 10^13 elements. So, the method needs to avoid dissimilarity matrices entirely.
Ideally, we would use an agglomerative clustering method, since we want a large number of clusters.
What would be a suitable method for this problem? I am struggling to find a method which meets all of these requirements, and I realise it's a big ask.
Plan B is to use a simple rules-based grouping method based on categorical variables alone, handpicking only a few variables to cluster on since we will suffer from the curse of dimensionality otherwise.
The first step is going to be turning those categorical values into numbers somehow, and the second step is going to be putting the now all-numeric attributes onto the same scale.
Clustering is computationally expensive, so you might try a third step of representing this data by the top 10 components of a PCA (or however many components have an eigenvalue > 1) to reduce the columns.
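A minimal sketch of those steps with scikit-learn, under the assumption that the data sits in a pandas DataFrame; the tiny frame and column names below are made up for illustration, and missing values would need an extra imputation step (e.g. SimpleImputer) before scaling:

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny stand-in for the real 4 million x 70 frame.
df = pd.DataFrame({
    "age":    [23, 45, 31, 52],
    "income": [40e3, 85e3, 60e3, 72e3],
    "region": ["north", "south", "south", "east"],
})
numeric_cols = ["age", "income"]
categorical_cols = ["region"]

preprocess = ColumnTransformer(
    [
        ("num", StandardScaler(), numeric_cols),                           # numerics onto one scale
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols), # categories into numbers
    ],
    sparse_threshold=0.0,  # keep the output dense so PCA can consume it
)
pipeline = make_pipeline(preprocess, PCA(n_components=2))  # keep only the top components
X_reduced = pipeline.fit_transform(df)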
For the clustering step, you'll have your choice of algorithms. I would think something hierarchical would be helpful, since even though you expect a high number of clusters, it makes intuitive sense that those clusters would fall under larger clusters that continue to make sense all the way up to a small number of "parent" clusters. A popular choice might be HDBSCAN, but I tend to prefer OPTICS. The implementation in the free ELKI toolkit seems to be the fastest (it takes some experimenting to figure out) because it runs in Java. ELKI's output is a little strange: it writes one file per cluster, so you then have to loop through the files in Python and build your final mapping, unfortunately. But it's all doable (including executing the ELKI command) from Python if you're building an automated pipeline.
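If you would rather stay in Python for the clustering step itself, scikit-learn also ships an OPTICS implementation; a quick sketch on random stand-in data (it will not match ELKI's speed on 4 million rows):

import numpy as np
from sklearn.cluster import OPTICS

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))                    # stand-in for the PCA-reduced matrix
labels = OPTICS(min_samples=10).fit_predict(X)    # label -1 marks noise points
print(np.unique(labels))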

Python: How to interpolate errors using scipy interpolate.interp1d

I have a number of data sets, each containing x, y, and y_error values, and I'm simply trying to calculate the average value of y at each x across these data sets. However, the data sets are not quite the same length. I thought the best way to get them to an equal length would be to use scipy's interpolate.interp1d on each data set. However, I still need to be able to calculate the error on each of these averaged values, and I'm quite lost on how to accomplish that after doing an interpolation.
I'm pretty new to Python and coding in general, so I appreciate your help!
As long as you can assume that your errors represent one-sigma intervals of normal distributions, you can always generate synthetic datasets, resample and interpolate those, and compute the 1-sigma errors of the results.
Or just interpolate values+err and values-err, if all you need is a quick and dirty rough estimate.
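A rough sketch of that Monte Carlo idea with interp1d (the arrays below are made up, and this is a quick illustration rather than a full error analysis):

import numpy as np
from scipy.interpolate import interp1d

rng = np.random.default_rng(0)

# One hypothetical data set with 1-sigma errors.
x     = np.array([0.0, 1.0, 2.0, 3.0])
y     = np.array([1.0, 2.0, 1.5, 3.0])
y_err = np.array([0.1, 0.2, 0.1, 0.3])
x_new = np.linspace(0.0, 3.0, 7)       # common grid shared by all data sets

# Draw synthetic realisations of y consistent with the errors, interpolate
# them all at once, and take the spread of the curves as the propagated error.
samples   = rng.normal(y, y_err, size=(1000, y.size))
resampled = interp1d(x, samples, axis=1)(x_new)    # shape (1000, len(x_new))
y_new     = resampled.mean(axis=0)
y_new_err = resampled.std(axis=0)

Repeating this for every data set puts them all on x_new, after which the per-point averages and their errors combine in the usual way.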

Fitting a bimodal distribution to a set of values

Given a 1D array of values, what is the simplest way to figure out what the best fit bimodal distribution to it is, where each 'mode' is a normal distribution? Or in other words, how can you find the combination of two normal distributions that bests reproduces the 1D array of values?
Specifically, I'm interested in implementing this in python, but answers don't have to be language specific.
Thanks!
What you are trying to do is fit a Gaussian mixture model. The standard approach to solving this is Expectation Maximization (EM); scipy's svn includes a section on machine learning and EM called scikits. I use it a fair bit.
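That scikits code has since grown into scikit-learn, whose GaussianMixture runs the EM fit directly; a minimal sketch on synthetic data (the sample below is made up):

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic bimodal sample drawn from two normal components.
values = np.concatenate([rng.normal(-2.0, 0.5, 500),
                         rng.normal(3.0, 1.0, 500)])

gmm = GaussianMixture(n_components=2).fit(values.reshape(-1, 1))
print(gmm.means_.ravel())                   # fitted component means
print(np.sqrt(gmm.covariances_.ravel()))    # fitted component standard deviations
print(gmm.weights_)                         # mixing proportions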
I suggest using the awesome scipy package.
It provides a few methods for optimisation.
There's a big fat caveat with simply applying a pre-defined least-squares fit or something along those lines (see the sketch after this list).
Here are a few problems you will run into:
Noise larger than the second peak, or than both peaks.
Partial peak - your data is cut off at one of the borders.
Sampling - the peaks are narrower than your sampling spacing.
It isn't normal - you'll still get some result ...
Overlap - if the peaks overlap you'll often find that one peak is fitted correctly but the second approaches zero...
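To make that least-squares route concrete, here is a hedged sketch with scipy.optimize.curve_fit on a histogram of synthetic data; the model function, bin count, and initial guesses are all illustrative choices, and the caveats above still apply:

import numpy as np
from scipy.optimize import curve_fit

def bimodal(x, a1, mu1, s1, a2, mu2, s2):
    # Sum of two Gaussian bumps.
    return (a1 * np.exp(-0.5 * ((x - mu1) / s1) ** 2)
            + a2 * np.exp(-0.5 * ((x - mu2) / s2) ** 2))

rng = np.random.default_rng(0)
values = np.concatenate([rng.normal(-2.0, 0.5, 500), rng.normal(3.0, 1.0, 500)])

# Fit the normalised histogram of the values rather than the raw samples.
counts, edges = np.histogram(values, bins=50, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
p0 = [0.5, -2.0, 1.0, 0.5, 3.0, 1.0]        # rough initial guesses matter a lot here
params, _ = curve_fit(bimodal, centers, counts, p0=p0)
print(params)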
