I am trying to find a procedure to fit data monotonically in Python.
The data won't necessarily be monotonic, but the fit must be because of theoretical assumptions: the underlying signal is monotonic, but the measurements are taken with noise.
I imagine that a way of doing that would be to run an isotonic regression and then interpolate using a cubic spline. Are there easier alternatives?
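For concreteness, here is a rough sketch of what I have in mind, using scikit-learn's IsotonicRegression followed by SciPy's PchipInterpolator (a monotone cubic interpolant) on made-up data; I am not sure this is the best approach:

```python
import numpy as np
from scipy.interpolate import PchipInterpolator
from sklearn.isotonic import IsotonicRegression

# made-up noisy measurements of a monotonic signal
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = np.log1p(x) + rng.normal(scale=0.1, size=x.size)

# step 1: isotonic regression gives a monotone, piecewise-constant fit
y_iso = IsotonicRegression(increasing=True).fit_transform(x, y)

# step 2: interpolate that fit with a monotone cubic (PCHIP), which stays
# monotonic, unlike an unconstrained cubic spline through the same points
smooth = PchipInterpolator(x, y_iso)
y_fit = smooth(np.linspace(0, 10, 500))
```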
In R, for example, I would use the cobs package for constrained splines. Does anything similar exist in Python?
Other ways of achieving the same result would also be fine if effective (e.g. fitting curves on monotonic transformations of the data that would maintain the overall shape of the relationship). I already know there are ways of achieving a similar result with GBM but I am looking for an alternative.
Thank you
I am trying to cluster a data set containing mixed data (nominal and ordinal) using k-prototypes clustering, based on Huang, Z., "Clustering Large Data Sets with Mixed Numeric and Categorical Values".
My question is: how do I find the optimal number of clusters?
There is not one optimal number of clusters, but dozens. Every heuristic will suggest a different "optimal" number, based on another poorly defined notion of what "optimal" means that likely has no relevance for the problem you are trying to solve in the first place.
Rather than being overly concerned with "optimality", explore and experiment more. Study what you are actually trying to achieve, and how to get that into mathematical form, so that you can tell what solves your problem and what solves someone else's...
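If you still want to experiment with such heuristics, a common one is to compute the clustering cost for a range of k and look for an "elbow". A minimal sketch, assuming the kmodes package (which provides a KPrototypes implementation of Huang's algorithm) and made-up data:

```python
import numpy as np
from kmodes.kprototypes import KPrototypes

# made-up mixed data: two numeric columns and one categorical column
rng = np.random.default_rng(42)
n = 200
X = np.empty((n, 3), dtype=object)
X[:, 0] = rng.normal(size=n)
X[:, 1] = rng.normal(size=n)
X[:, 2] = rng.choice(['a', 'b', 'c'], size=n)

costs = {}
for k in range(2, 8):
    kp = KPrototypes(n_clusters=k, init='Cao')
    kp.fit_predict(X, categorical=[2])   # column index 2 is categorical
    costs[k] = kp.cost_                  # within-cluster cost for this k

# plot costs vs. k and look for an "elbow"; as argued above, treat this
# as one heuristic among many, not as a definition of "optimal"
print(costs)
```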
I would like to know if in Python, and more precisely in the lmfit library, there is an option for fitting data by parts. I would like to fit data defined in different ranges and then obtain a single, combined fit.
Thank you
Without a more concrete example, it is hard to give a concrete answer. But, if I understand your question correctly, you are looking to do a fit to one specific region of your data, then a fit (probably with a different functional form) to another region of your data, and then perhaps combine the multiple regions to get a final fit.
If that is correct, then yes, this can be done with lmfit (and probably with other libraries as well). Let's say you want to fit data that is sort of peak-like on top of an exponentially decaying background. First, isolate a region around that peak (it doesn't have to be perfect) and fit a peak (say, a Gaussian) to it. Then fit an exponential decay to all the data except the peak area. (Aside: numpy.where can be very useful in identifying the regions.) Finally, combine the two and fit the whole curve to peak + background.
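For instance, here is a rough sketch of that workflow with lmfit's built-in GaussianModel and ExponentialModel; the data is synthetic and the region boundaries are arbitrary, so you would adjust them to your own data:

```python
import numpy as np
from lmfit.models import ExponentialModel, GaussianModel

# synthetic "peak on an exponentially decaying background", for illustration only
rng = np.random.default_rng(1)
x = np.linspace(0, 20, 401)
y = (5.0 * np.exp(-x / 7.0)
     + 3.0 * np.exp(-(x - 9.0) ** 2 / (2 * 0.8 ** 2))
     + rng.normal(scale=0.1, size=x.size))

bkg = ExponentialModel(prefix='exp_')
peak = GaussianModel(prefix='g_')

# 1) fit the background using only the data away from the peak
i_bkg = np.where((x < 6) | (x > 12))
bkg_result = bkg.fit(y[i_bkg], bkg.guess(y[i_bkg], x=x[i_bkg]), x=x[i_bkg])

# 2) get starting values for the peak from the region around it,
#    after subtracting the background estimate
i_pk = np.where((x >= 6) & (x <= 12))
pars = bkg_result.params.copy()
pars.update(peak.guess(y[i_pk] - bkg_result.eval(x=x[i_pk]), x=x[i_pk]))

# 3) fit peak + background to the whole curve, starting from the partial fits
result = (peak + bkg).fit(y, pars, x=x)
print(result.fit_report())
```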
If that is too vague and doesn't point you in the right direction, please make the question more specific.
I have some 2D data:
The data is labeled and shown in different colors. An unsupervised process will definitely not yield any correct prediction, because the data is pretty mixed (although the colors seem to have regions of preference). I want to see if it is possible to measure how mixed the points from different sets are.
For this I need to define a measure of how mixed they are (I think such a measure should exist). It would also be nice to have these algorithms already implemented. I am also looking for a simple predictive model that can be trained using the data shown. Thanks for your help. If possible, I'm looking for these implementations in Python.
Can someone please tell me if there is a good (easy) way to visualize high-dimensional data? My data is currently 21 dimensions, and I would like to see whether it is dense or sparse. Are there techniques to achieve this?
Parallel coordinates are a popular method for visualizing high-dimensional data.
What kind of visualization is best for your data in particular will depend on its characteristics: how correlated are the different dimensions?
Principal component analysis could be helpful if the dimensions are correlated.
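For example, a minimal sketch of both ideas with pandas and scikit-learn; the DataFrame below (the column names and the 'label' column) is made up for illustration:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates
from sklearn.decomposition import PCA

# placeholder 21-dimensional data with a made-up 'label' column
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(200, 21)),
                  columns=[f'x{i}' for i in range(21)])
df['label'] = rng.choice(['a', 'b'], size=len(df))

# parallel coordinates: one line per observation, one vertical axis per dimension
parallel_coordinates(df, class_column='label', alpha=0.3)
plt.show()

# PCA: project onto the first two principal components
coords = PCA(n_components=2).fit_transform(df.drop(columns='label'))
plt.scatter(coords[:, 0], coords[:, 1], s=5)
plt.show()
```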
The buzzword I would search for is multidimensional scaling. It is a technique to develop a projection from the high dimensional space to a lower space (2 or 3 dimensional) in such a way that points which are close in the full space will be close in the projection.
It is often used for visualising the output of clustering algorithms (i.e. if your clusters are compact in the MDS projection, there is a good chance they are compact in the full space as well).
Edit: This wouldn't necessarily help with determining whether the data is dense or sparse, because you lose the scale in the projection, but it would show whether it is uniform or clumpy (perhaps that's what you mean).
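A minimal sketch with scikit-learn's MDS, assuming your data is in an array X of shape (n_samples, 21); the array below is just a placeholder:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import MDS

X = np.random.default_rng(0).normal(size=(200, 21))   # placeholder for your data
emb = MDS(n_components=2, random_state=0).fit_transform(X)
plt.scatter(emb[:, 0], emb[:, 1], s=5)
plt.show()
```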
Not sure what kind of patterns you would like to see in the data. t-SNE and its faster variant, Barnes-Hut SNE, do a very good job of visualizing groups of related concepts in high-dimensional data. It is available through R.
There is a short tutorial on using it against high-dimensional data with about 300 dimensions.
http://www.codeproject.com/Tips/788739/Visualizing-High-Dimensional-Vector-using-T-SNE-wi
I was looking for ways to visualize high dimensional data and found this t-SNE technique that has been used effectively. Might help others as well.
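For what it's worth, t-SNE is also available in Python through scikit-learn (which uses the Barnes-Hut approximation by default); a minimal sketch on placeholder data:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

X = np.random.default_rng(0).normal(size=(500, 21))   # placeholder data
emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
plt.scatter(emb[:, 0], emb[:, 1], s=5)
plt.show()
```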
Take a look at http://www.ggobi.org (tours, parallel coordinates, scatterplot matrices), which can be used for real-valued variables. Also see http://cranvas.org for a more recent alternative, and the tourr package in R.
Try using http://hypertools.readthedocs.io/en/latest/.
HyperTools is a library for visualizing and manipulating high-dimensional data in Python.
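A minimal sketch, assuming hypertools is installed (pip install hypertools); the array below is a stand-in for your 21-dimensional data:

```python
import numpy as np
import hypertools as hyp

data = np.random.default_rng(0).normal(size=(200, 21))   # placeholder data
hyp.plot(data, '.')   # reduces the data to 3 dimensions and plots it
```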
Star Schema.
http://en.wikipedia.org/wiki/Star_schema
Works well for high-dimensional data.
If the cardinality of your fact table is close to the product of your dimension sizes, you have dense data.
If the cardinality of your fact table is smaller than the product of your dimension sizes, you have sparse data.
In the middle you have a judgement call.
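A toy illustration of that check in Python (all numbers are made up, just to show the arithmetic):

```python
dim_sizes = [100, 50, 20]        # sizes of the dimension tables
fact_rows = 10_000               # rows actually present in the fact table

possible = 1
for size in dim_sizes:
    possible *= size             # 100 * 50 * 20 = 100,000 possible combinations

density = fact_rows / possible   # 0.1, i.e. only 10% of combinations occur: sparse
print(f"density = {density:.1%}")
```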
The curios.IT data exploration software is designed for the visualization of high-dimensional data: data is shown as a collection of 3D objects (one per data group) which can show up to 13 variables at the same time. The relationships between data variables and visual features are much easier to remember than with other techniques (like parallel coordinates).