I have a dataset with numerical values. Those values are used alongside constants to calculate different factors. Based on these factors, decisions are made that ultimately lead to a single numerical value.
My goal is to find the maximum (or minimum when multiplied by -1) of this numerical value by changing those constants.
I have been looking at the SciPy library, hyperparameter optimization (machine learning), and minimize(). But after reading through several tutorials I am a little bit lost on which one to use and how to implement it.
Would someone be so kind as to guide me onto the right track?
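If I understand the setup correctly, scipy.optimize.minimize on the negated objective might be what you are after. A minimal sketch, assuming your whole pipeline can be wrapped in a function of the constants (the names objective, data and constants0 below are made up for illustration):

```python
import numpy as np
from scipy.optimize import minimize

def objective(constants, data):
    # Hypothetical stand-in for your pipeline: compute the factors from the
    # data and constants, apply the decision rules, return the final value.
    factors = data * constants
    return factors.sum()

data = np.array([1.0, 2.0, 3.0])         # your dataset
constants0 = np.array([0.5, 0.5, 0.5])   # initial guess for the constants

# minimize() only minimizes, so negate the objective to maximize it.
result = minimize(lambda c: -objective(c, data), constants0, method="Nelder-Mead")
print(result.x, -result.fun)             # best constants, maximized value
```

Nelder-Mead is used here because decision rules often make the objective non-smooth; a gradient-based method may struggle in that case.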
I would like to calculate the total variation distance (TVD) between two continuous probability distributions. I would like to point out that while there are two relevant questions (see here and here), they both deal with discrete distributions.
For those not familiar with TVD,
Informally, this is the largest possible difference between the
probabilities that the two probability distributions can assign to the
same event.
as it is described in the respective Wikipedia page. In the case of continuous distributions, TVD is equal to half the integral of the absolute difference between the two densities (since I cannot add math notation, see this for a proof and for the notation).
So far, I haven't been able to find a tool for this job in Python. I would be interested in one if it exists. Also, while I have no experience in R, I understand that it is commonly used for such tasks, so I would be interested in an R tool as well (TVD calculation is the final step of my algorithm, so I guess it won't be hard to read some data from a file, do the calculation, and print a number even if I am completely new to R).
I would like to add that I am mainly interested in normal distributions, so a tool strictly for those is more than welcome.
If no such tools exist, then any help adapting answers from this question to use the built-in probability functions would be of great help as well.
Thank you in advance.
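For what it's worth, the half-integral definition quoted above can be evaluated numerically. A minimal sketch for two normal densities using SciPy's quad (the parameters are made up, and this is not an existing dedicated TVD tool):

```python
from scipy.integrate import quad
from scipy.stats import norm

def tvd_normal(mu1, sigma1, mu2, sigma2):
    # TVD = 0.5 * integral of |p(x) - q(x)| dx over the real line;
    # integrate over a wide finite interval that covers both densities.
    lower = min(mu1 - 10 * sigma1, mu2 - 10 * sigma2)
    upper = max(mu1 + 10 * sigma1, mu2 + 10 * sigma2)
    integrand = lambda x: abs(norm.pdf(x, mu1, sigma1) - norm.pdf(x, mu2, sigma2))
    value, _ = quad(integrand, lower, upper, limit=200)
    return 0.5 * value

print(tvd_normal(0.0, 1.0, 1.0, 2.0))    # example parameters
```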
I am trying to cluster a data set containing mixed data (nominal and ordinal) using k-prototypes clustering, based on Huang, Z.: Clustering large data sets with mixed numeric and categorical values.
My question is: how do I find the optimal number of clusters?
There is not one optimal number of clusters, but dozens. Every heuristic will suggest a different "optimal" number, for another poorly defined notion of what is "optimal" that likely has no relevance for the problem you are trying to solve in the first place.
Rather than being overly concerned with "optimality", explore and experiment more. Study what you are actually trying to achieve, and how to put it into mathematical form, so you can tell what solves your problem and what solves someone else's...
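To make "explore and experiment" concrete: one common (and still only heuristic) exercise is to run k-prototypes for several values of k and look at the cost curve yourself. A minimal sketch, assuming the kmodes package and a small made-up mixed matrix X whose categorical columns sit at known indices:

```python
import numpy as np
from kmodes.kprototypes import KPrototypes

# Toy mixed data: columns 0-1 are numeric, columns 2-3 are categorical.
X = np.array([[1.2, 3.4, 'a', 'x'],
              [0.9, 2.8, 'b', 'x'],
              [1.1, 3.0, 'a', 'x'],
              [5.1, 7.7, 'a', 'y'],
              [4.8, 8.0, 'b', 'y'],
              [5.0, 7.9, 'b', 'y'],
              [4.9, 7.6, 'a', 'y'],
              [1.0, 3.1, 'b', 'x']], dtype=object)
categorical_cols = [2, 3]

# Compare the clustering cost for several candidate k and eyeball the "elbow";
# this is one heuristic among many, not a definition of optimality.
for k in range(2, 5):
    kp = KPrototypes(n_clusters=k, init='Huang', random_state=0)
    kp.fit_predict(X, categorical=categorical_cols)
    print(k, kp.cost_)
```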
I would like to create a function that, given a list of integers as input, returns a boolean based on that number. I would like it to use an algorithm to find the optimum cut-off value that optimizes the number of correct returns.
Is there some tool built-in with Python for this? Otherwise, how would I approach such a problem using Python? Preferably, I would want to learn how to do both.
This appears to be something that a linear machine learning algorithm could solve. In fact, the Ordinary Least Squares linear classification model seems to follow the exact outline you provide: it uses an algorithm to match its output to your examples based on the numerical input, with the heuristic it attempts to minimize being the number of answers it gets wrong. If this is indeed the case, I believe scikit-learn is the library you want. As for learning how this is done, the document linked above will at least get you started.
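As a hedged sketch of that idea with scikit-learn (using logistic regression rather than plain least squares, and with made-up training data), a linear model on the single numeric feature effectively learns the cut-off from your labelled examples:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical examples: integers and the boolean the function should return.
numbers = np.array([[1], [3], [4], [7], [8], [10]])   # shape (n_samples, 1)
labels  = np.array([False, False, False, True, True, True])

clf = LogisticRegression().fit(numbers, labels)

def decide(n):
    # Return a boolean for a new integer using the learned decision boundary.
    return bool(clf.predict([[n]])[0])

print(decide(5), decide(9))
```

For a single feature, the implied cut-off is where the predicted probability crosses 0.5, roughly -clf.intercept_[0] / clf.coef_[0][0].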
I'm trying to use the well-known multilateration algorithm to pinpoint a radiation emission source, given a set of arrival times at various detectors. I have the necessary data, but I'm still having trouble implementing this calculation; I am relatively new to Python.
I know that, if I were to do this by hand, I would use matrices and carry out elementary row operations in order to find my 3 unknowns (x,y,z), but I'm not sure how to code this. Is there a way to have Python implement ERO, or is there a better way to carry out my computation?
Depending on your needs, you could try:
NumPy, if you're interested in numerical solutions. As far as I remember, it can solve linear equations; I don't know how well it deals with non-linear problems.
SymPy for symbolic math. It solves linear equations symbolically ... according to its main page.
The two above are "generic" math packages. I doubt you will (easily) find any dedicated (and maintained) library for your specific need. There was already a question on that topic here: Multilateration of GPS Coordinates
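For the linear-system part specifically, a minimal NumPy sketch looks like this (the coefficients below are placeholders, not a real multilateration setup):

```python
import numpy as np

# Solve A @ [x, y, z] = b directly instead of doing elementary row operations by hand.
A = np.array([[2.0,  1.0, -1.0],
              [1.0,  3.0,  2.0],
              [3.0, -2.0,  1.0]])
b = np.array([3.0, 12.0, 1.0])

x, y, z = np.linalg.solve(A, b)
print(x, y, z)
```

If your time-of-arrival formulation turns out to be non-linear, scipy.optimize.least_squares is one option for fitting it numerically instead.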
I have some points that I need to classify. Given the collection of these points, I need to say which other (known) distribution they match best. For example, given the points in the top left distribution, my algorithm would have to say whether they are a better match to the 2nd, 3rd, or 4th distribution. (Here the bottom-left would be correct due to the similar orientations)
I have some background in Machine Learning, but I am no expert. I was thinking of using Gaussian Mixture Models, or perhaps Hidden Markov Models (as I have previously used these to classify signatures, which is a similar problem).
I would appreciate any help as to which approach to use for this problem. As background information, I am working with OpenCV and Python, so I would most likely not have to implement the chosen algorithm from scratch; I just want a pointer to which algorithms would be applicable to this problem.
Disclaimer: I originally wanted to post this on the Mathematics section of StackExchange, but I lacked the necessary reputation to post images. I felt that my point could not be made clear without showing some images, so I posted it here instead. I believe that it is still relevant to Computer Vision and Machine Learning, as it will eventually be used for object identification.
EDIT:
I read and considered some of the answers given below, and would now like to add some new information. My main reason for not wanting to model these distributions as a single Gaussian is that eventually I will also have to be able to discriminate between distributions. That is, there might be two different and separate distributions representing two different objects, and then my algorithm should be aware that only one of the two distributions represents the object that we are interested in.
I think this depends on where exactly the data comes from and what sort of assumptions you would like to make as to its distribution. The points above can easily be drawn even from a single Gaussian distribution, in which case the estimation of parameters for each one and then the selection of the closest match are pretty simple.
Alternatively you could go for the discriminative option, i.e. calculate whatever statistics you think may be helpful in determining the class a set of points belongs to and perform classification using SVM or something similar. This can be viewed as embedding these samples (sets of 2d points) in a higher-dimensional space to get a single vector.
Also, if the data is actually as simple as in this example, you could just do principal component analysis and match by the first eigenvector.
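As a rough sketch of the single-Gaussian / first-eigenvector suggestion with NumPy (the point sets here are synthetic stand-ins for your distributions):

```python
import numpy as np

def principal_axis(points):
    # Fit a Gaussian (mean, covariance) to the 2D point set and return the
    # dominant eigenvector of the covariance, i.e. the cloud's orientation.
    cov = np.cov(np.asarray(points).T)
    eigvals, eigvecs = np.linalg.eigh(cov)
    return eigvecs[:, np.argmax(eigvals)]

rng = np.random.default_rng(0)
query = rng.normal(size=(200, 2)) @ np.diag([3.0, 0.5])         # elongated along x
candidates = [rng.normal(size=(200, 2)) @ np.diag([0.5, 3.0]),  # elongated along y
              rng.normal(size=(200, 2)) @ np.diag([3.0, 0.5])]  # elongated along x

# Pick the candidate whose orientation is most aligned with the query's.
q_axis = principal_axis(query)
scores = [abs(q_axis @ principal_axis(c)) for c in candidates]
print(int(np.argmax(scores)))   # index of the best-matching candidate
```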
You could also just fit the distributions to the data, determine the chi^2 deviation for each one, and compare them with an F-test. See for instance these notes on model fitting.
You might also want to consider non-parametric techniques (e.g. multivariate kernel density estimation on each of your new data sets) in order to compare the statistics or distances of the estimated distributions. In Python, scipy.stats.gaussian_kde provides such an implementation.
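A minimal sketch of that route with SciPy (the two samples here are synthetic, and the mean-absolute-difference score is just one possible way to compare the estimated densities):

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(1)
sample_a = rng.normal(0.0, 1.0, size=(2, 500))   # gaussian_kde expects shape (n_dims, n_points)
sample_b = rng.normal(0.5, 1.2, size=(2, 500))

kde_a = gaussian_kde(sample_a)
kde_b = gaussian_kde(sample_b)

# Evaluate both estimated densities on common points and compare them.
grid = rng.normal(0.0, 2.0, size=(2, 1000))
print(np.mean(np.abs(kde_a(grid) - kde_b(grid))))
```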