I am working in Python 2.7. I want to create nomograms based on the data of various variables in order to predict one variable. I have been looking into the PyNomo package and have installed it.
However, from the documentation here and here and the examples, it seems that nomograms can only be made when you have equation(s) relating the variables, not from the data alone. For example, the examples here show how to use equations to create nomograms. What I want is to create a nomogram from the data and use that to predict things. How do I do that? In other words, how do I make the nomogram take data as input rather than a function? Is it even possible?
Any input would be helpful. If PyNomo cannot do it, please suggest some other package (in any language). For example, I am trying the nomogram function from the rms package in R, but I am not having much luck figuring out how to use it properly. I have asked a separate question about that here.
The term "nomogram" has become somewhat confused of late as it now refers to two entirely different things.
A classic nomogram performs a full calculation: you mark two scales, draw a straight line across the marks and read your answer from a third scale. This is the type of nomogram that PyNomo produces, and as you correctly say, you need a formula. Producing a nomogram like this from data is therefore a two-step process: first fit an equation to your data, then build the nomogram from that equation.
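For illustration, a minimal sketch of that two-step idea, fitting a simple linear formula to made-up data with NumPy; the functional form, the numbers and the variable names are all assumptions, and the PyNomo block definition itself is not shown:

    import numpy as np

    # Hypothetical measurements: two predictors and the response to be predicted.
    x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    x2 = np.array([0.5, 1.5, 2.5, 3.5, 4.5])
    y = np.array([2.1, 4.9, 9.2, 15.8, 24.6])

    # Step 1: assume a functional form, e.g. y = a*x1 + b*x2 + c, and fit it
    # to the data by least squares.
    A = np.column_stack([x1, x2, np.ones_like(x1)])
    a, b, c = np.linalg.lstsq(A, y)[0]
    print("fitted formula: y = %.3f*x1 + %.3f*x2 + %.3f" % (a, b, c))

    # Step 2: this explicit formula (not the raw data) is what the PyNomo
    # block definition would then be built around.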
The other use of the term (very popular recently) is to refer to regression nomograms. These are graphical depictions of regression models (usually logistic regression models). In these, a group of predictor variables is laid out as parallel scales with a common points scale at the bottom; for each predictor you read off the 'score' from the scale and then add the scores up. These nomograms have become very popular in the last few years, and that's what the rms package in R will draw. I haven't used it, but my understanding is that it works directly from the data.
Hope this is of some use! :-)
Related
I want to build a model that describes a curve fitting the data shown in the scatterplot. I thought it would be straightforward using sklearn, but the choice and application of the different methods get rather confusing.
Which algorithms would you use to tackle this problem?
This is really a question for CrossValidated rather than a Python question.
Your data seems to strongly indicate a simple underlying model which is linear until the very end, when it perhaps becomes polynomial.
As a first step, if possible, I would investigate this phenomenon. It's unusual. Perhaps there's something wrong with the data source. But maybe not. For example, a physical phenomenon with two distinct phases might produce data like these.
As for models, I would suggest natural cubic splines for these data. They are simple: you cut the data into windows and fit each window with a cubic polynomial (of which a straight line is a special case).
You might also consider smoothing splines, and local regression.
For information on these, see the free online textbook, An Introduction to Statistical Learning.
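For concreteness, here is a minimal sketch of a cubic smoothing-spline fit with SciPy's UnivariateSpline; the data below is synthetic (roughly linear with a polynomial tail), since the original scatterplot is not available, and the smoothing parameter s is just a placeholder:

    import numpy as np
    from scipy.interpolate import UnivariateSpline

    rng = np.random.RandomState(0)
    x = np.linspace(0, 10, 200)
    y = 2.0 * x + 0.05 * np.maximum(x - 8, 0) ** 3 + rng.normal(scale=0.5, size=x.size)

    # k=3 gives cubic pieces; s controls the amount of smoothing and would
    # normally be tuned (e.g. by cross-validation or visual inspection).
    spline = UnivariateSpline(x, y, k=3, s=len(x))

    x_new = np.linspace(0, 10, 50)
    y_hat = spline(x_new)
    print(y_hat[:5])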
I would like to understand machine learning techniques better. I have read and watched a lot about Python, sklearn and supervised feed-forward networks, but I am still struggling to see how I can apply all this to my project and where to start. Maybe it is a little too ambitious yet.
I have the following algorithm, which generates nice patterns as binary inputs in a CSV file. The goal is to predict the next row.
The simplified logic of this algorithm is that the prediction for the next line (the top line being the most recent one) would be 0,0,1,1,1,0, and the one after that would either become 0,0,0,1,1,0 or go back to the previous step, 0,1,1,1,0. However, as you can see, the model is slightly more complex and noisy, which is why I would like to introduce some machine learning here. I am aware that to get a reliable prediction I will need to introduce other relevant inputs afterwards.
Would someone please help me to get started and stand on my feet here?
I don't like throwing this out here without being able to provide a single piece of code, but I am slightly confused as to where to start.
Should I pass each (line-1) as an input vector, with the associated output being the top line? Should I build the array manually from my whole dataset?
I guess I have to use the sigmoid function, and Python seems the most common way to approach this, but for the synapses (or weights) I understand I also need to provide a constant; should this be 1?
Finally, assuming you want this to run continuously, what would be required?
Please could you share readings or simpler practice tasks that could help me increase my knowledge of all this.
Many thanks.
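As a starting point only, here is one hedged sketch of the framing described in this question (each previous row as the input vector, the row that follows it as a multi-label target), using scikit-learn's MLPClassifier; the rows below are invented and the whole setup is just one of several reasonable ones:

    import numpy as np
    from sklearn.neural_network import MLPClassifier

    # Invented rows standing in for the CSV (most recent row first, as in the
    # question); each row is one 6-bit pattern.
    data = np.array([[0, 0, 1, 1, 1, 0],
                     [0, 1, 1, 1, 0, 0],
                     [1, 1, 1, 0, 0, 0],
                     [1, 1, 0, 0, 0, 1],
                     [1, 0, 0, 0, 1, 1],
                     [0, 0, 0, 1, 1, 1],
                     [0, 0, 1, 1, 1, 0],
                     [0, 1, 1, 1, 0, 0]])

    # Input: each older row; target: the row that followed it.
    X = data[1:]
    y = data[:-1]

    # A small feed-forward net with logistic (sigmoid) activations; the bias
    # term (the "constant" mentioned in the question) is handled internally.
    model = MLPClassifier(hidden_layer_sizes=(12,), activation="logistic",
                          max_iter=2000, random_state=0)
    model.fit(X, y)

    # Predict the row expected to follow the most recent one.
    print(model.predict(data[:1]))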
I have a bunch of contact data listing which members were contacted with which offer, which can be summarized something like this:
To make sense of it (and to make it more scalable) I was considering creating dummy variables for each offer and then using a logistic model to see how different offers impact performance:
Before I embark too far on this journey, I wanted to get some input on whether this is a sensible way to approach the problem (I have started playing around and got a model output, but haven't dug into it yet). Someone suggested I use linear regression instead, but I'm not really sure about the approach for that in this case.
What I'm hoping to get are interpretable coefficients, so I can see that mailing the 50%-off offer in the 3rd mailing is not as impactful as the $25 gift card, etc., and then do this at scale (lots of mailings with lots of different offers) to draw some conclusions about the impact of the timing of different offers.
My concern is that I will end up with a fairly sparse matrix where only some of the many possible combinations are represented, and I am not sure what problems may arise from that. I've taken some online courses in ML but am new to it, and this is one of my first chances to work with it directly, so I'm hoping I can create something useful here. I have access to lots and lots of data; it's just a matter of getting something basic out that can show some value. Maybe there's already some work on this, or even some kind of library I can use?
Thanks for any help!
If your target variable is binary (1 or 0), as in the second chart, then a classification model is appropriate. Logistic regression is a good first option; you could also use a tree-based model such as a decision tree classifier or a random forest.
Creating dummy variables is a good move; you could also convert the discounts to numerical values if you want to keep them in a single column. However, that may not work so well for a linear model like logistic regression, since the relationship will probably not be linear.
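As a hedged illustration of the dummy-variable-plus-logistic-regression idea (the column names and values below are invented, not the asker's actual data):

    import pandas as pd
    from sklearn.linear_model import LogisticRegression

    # Invented stand-in for the contact data; the real columns will differ.
    df = pd.DataFrame({
        "offer":       ["50pct_off", "25_giftcard", "50pct_off",
                        "25_giftcard", "50pct_off", "25_giftcard"],
        "mailing_nbr": [1, 1, 2, 2, 3, 3],
        "converted":   [0, 1, 0, 1, 0, 1],
    })

    # One dummy column per offer and per mailing number.
    X = pd.get_dummies(df[["offer", "mailing_nbr"]],
                       columns=["offer", "mailing_nbr"])
    y = df["converted"]

    model = LogisticRegression()
    model.fit(X, y)

    # Each coefficient is the log-odds contribution of that dummy variable.
    for name, coef in zip(X.columns, model.coef_[0]):
        print("%-25s %+.3f" % (name, coef))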
If you wanted to model the first chart directly, you could use a linear regression to predict the conversion rate. I'm not sure what the difference between the two approaches amounts to; it's actually something I've been wondering about for a while, and you've motivated me to post a question on stats.stackexchange.com.
My question is as follows:
I know a little bit about ML in Python (using NLTK), and it works OK so far: I can get predictions given certain features. But I want to know: is there a way to display the best features for achieving a given label? I mean the direct opposite of what I've been doing so far (putting in all the circumstances and getting a label back).
I'll try to make my question clear with an example:
Let's say I have a database with Soccer games.
The Labels are e.g. 'Win', 'Loss', 'Draw'.
The Features are e.g. 'Windspeed', 'Rain or not', 'Daytime', 'Fouls committed' etc.
Now I want to know: Under which circumstances will a Team achieve a Win, Loss or Draw? Basically I want to get back something like this:
Best conditions for Win: Windspeed=0, No Rain, Afternoon, Fouls=0 etc
Best conditions for Loss: ...
Is there a way to achieve this?
My paint skills aren't the best!
All I know is the theory, so you'll have to look elsewhere for the code.
If you have only one case (the best conditions for each situation), the diagram becomes something like the following (it won't really be 2-D, but something like this):
Green (Win), Orange (Draw), Red (Lose)
Now, if you want to predict whether the team wins, loses or draws, you have (at least) two ways to classify:
A linear separator (a nearest-mean style boundary): the separator is the perpendicular bisector of the line joining the two points:
K-nearest-neighbours: calculate the distance from the new point to all the known points and classify it the same as the closest one.
So, for example, if you have a new data point and have to classify it, here's how:
We have a new point with certain attributes.
We classify it by seeing/calculating which side of the line the point falls on (or by seeing how far it is from our benchmark situations).
Note: you will have to give some weight to each factor for better accuracy.
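To make the nearest-neighbour idea above concrete, here is a toy sketch with invented soccer features and benchmark rows; scikit-learn's KNeighborsClassifier is an assumption, not something this answer specifies:

    from sklearn.neighbors import KNeighborsClassifier

    # [windspeed, rain (0/1), afternoon (0/1), fouls]
    X = [[ 0, 0, 1, 0],   # benchmark "win" conditions
         [10, 1, 0, 8],   # benchmark "loss" conditions
         [ 5, 0, 1, 4]]   # benchmark "draw" conditions
    y = ["Win", "Loss", "Draw"]

    clf = KNeighborsClassifier(n_neighbors=1)
    clf.fit(X, y)

    # Classify a new match by its distance to the benchmark situations.
    # Feature weighting (the note above) would normally be done by rescaling
    # the columns before fitting.
    print(clf.predict([[2, 0, 1, 1]]))   # closest benchmark is "Win"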
You could compute how representative each feature is for separating the classes via feature weighting. The most common method for feature selection (and therefore feature weighting) in text classification is chi^2. This measure tells you which features are more informative; based on that information you can analyse which specific values work best for each class. I hope this helps.
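As a brief, hedged sketch of the chi^2 feature-weighting idea (the feature names and values are invented; scikit-learn's chi2 scorer is one possible implementation, and it requires non-negative feature values):

    import numpy as np
    from sklearn.feature_selection import chi2

    # [windspeed, rain, afternoon, fouls]
    X = np.array([[ 0, 0, 1, 0],
                  [12, 1, 0, 9],
                  [ 3, 0, 1, 2],
                  [ 8, 1, 0, 7]])
    y = np.array(["Win", "Loss", "Win", "Loss"])

    # Higher chi^2 scores indicate features that separate the labels better.
    scores, p_values = chi2(X, y)
    for name, score in zip(["windspeed", "rain", "afternoon", "fouls"], scores):
        print("%-10s chi2=%.2f" % (name, score))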
Regards,
Not sure if you have to do this in python, but if not, I would suggest Weka. If you're unfamiliar with it, here's a link to a set of tutorials: https://www.youtube.com/watch?v=gd5HwYYOz2U
Basically, you'd just need to write a program to extract your features and labels and then output a .arff file. Once you've generated a .arff file, you can feed this to Weka and run myriad different classifiers on it to figure out what model best fits your data. If necessary, you can then program this model to operate on your data. Weka has plenty of ways to analyze your results and to graphically display said results. It's truly amazing.
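For reference, a minimal sketch of the "extract your features and labels and output a .arff file" step; the relation name, attributes and rows are placeholders carried over from the soccer example, not anything Weka itself mandates:

    # Rows of (windspeed, rain, fouls, result); invented values.
    rows = [
        (0.0, 0, 0, "Win"),
        (12.0, 1, 9, "Loss"),
        (5.0, 0, 4, "Draw"),
    ]

    with open("matches.arff", "w") as f:
        f.write("@relation matches\n")
        f.write("@attribute windspeed numeric\n")
        f.write("@attribute rain numeric\n")
        f.write("@attribute fouls numeric\n")
        f.write("@attribute result {Win,Loss,Draw}\n")
        f.write("@data\n")
        for windspeed, rain, fouls, result in rows:
            f.write("%s,%s,%s,%s\n" % (windspeed, rain, fouls, result))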
I have some points that I need to classify. Given the collection of these points, I need to say which other (known) distribution they match best. For example, given the points in the top left distribution, my algorithm would have to say whether they are a better match to the 2nd, 3rd, or 4th distribution. (Here the bottom-left would be correct due to the similar orientations)
I have some background in Machine Learning, but I am no expert. I was thinking of using Gaussian Mixture Models, or perhaps Hidden Markov Models (I have previously classified signatures with these, which is a similar problem).
I would appreciate any help as to which approach to use for this problem. As background information, I am working with OpenCV and Python, so I would most likely not have to implement the chosen algorithm from scratch; I just want a pointer as to which algorithms would be applicable to this problem.
Disclaimer: I originally wanted to post this on the Mathematics section of StackExchange, but I lacked the necessary reputation to post images. I felt that my point could not be made clear without showing some images, so I posted it here instead. I believe that it is still relevant to Computer Vision and Machine Learning, as it will eventually be used for object identification.
EDIT:
I read and considered some of the answers given below, and would now like to add some new information. My main reason for not wanting to model these distributions as a single Gaussian is that eventually I will also have to be able to discriminate between distributions. That is, there might be two different and separate distributions representing two different objects, and then my algorithm should be aware that only one of the two distributions represents the object that we are interested in.
I think this depends on where exactly the data comes from and what sort of assumptions you would like to make as to its distribution. The points above can easily be drawn even from a single Gaussian distribution, in which case the estimation of parameters for each one and then the selection of the closest match are pretty simple.
Alternatively you could go for the discriminative option, i.e. calculate whatever statistics you think may be helpful in determining the class a set of points belongs to and perform classification using SVM or something similar. This can be viewed as embedding these samples (sets of 2d points) in a higher-dimensional space to get a single vector.
Also, if the data is actually as simple as in this example, you could just do principal component analysis and match by the first eigenvector.
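As a small sketch of the first-eigenvector suggestion, using NumPy; the point sets below are randomly generated stand-ins for the real data:

    import numpy as np

    def principal_direction(points):
        # Unit eigenvector of the covariance matrix with the largest eigenvalue.
        centered = points - points.mean(axis=0)
        cov = np.cov(centered, rowvar=False)
        eigvals, eigvecs = np.linalg.eigh(cov)
        return eigvecs[:, np.argmax(eigvals)]

    rng = np.random.RandomState(0)
    query = rng.multivariate_normal([0, 0], [[3, 2], [2, 2]], size=200)
    candidates = [rng.multivariate_normal([0, 0], c, size=200)
                  for c in ([[3, 2], [2, 2]], [[3, -2], [-2, 2]], [[1, 0], [0, 3]])]

    q_dir = principal_direction(query)
    # |cosine| is used because an eigenvector's sign is arbitrary.
    similarities = [abs(np.dot(q_dir, principal_direction(c))) for c in candidates]
    print("best match: candidate %d" % int(np.argmax(similarities)))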
You should just fit the distributions to the data, determine the chi^2 deviation for each one, and look at an F-test. See for instance these notes on model fitting, etc.
You might also want to consider non-parametric techniques (e.g. multivariate kernel density estimation on each of your point sets) in order to compare statistics or distances between the estimated distributions. In Python, scipy.stats.gaussian_kde is an implementation in SciPy.
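For instance, a hedged sketch of that idea using scipy.stats.gaussian_kde, scoring a new point set by its average log-density under a KDE fitted to each known set (the data here is synthetic):

    import numpy as np
    from scipy.stats import gaussian_kde

    rng = np.random.RandomState(1)
    known_sets = [rng.multivariate_normal([0, 0], [[3, 2], [2, 2]], size=300).T,
                  rng.multivariate_normal([0, 0], [[1, 0], [0, 3]], size=300).T]
    new_points = rng.multivariate_normal([0, 0], [[3, 2], [2, 2]], size=100).T

    # gaussian_kde expects arrays of shape (n_dims, n_points).
    kdes = [gaussian_kde(s) for s in known_sets]
    scores = [np.mean(np.log(k(new_points))) for k in kdes]
    print("closest distribution: %d" % int(np.argmax(scores)))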