I came across a new GBDT algorithm named NGBoost, invented by stanfordmlgroup. I want to use it, so I run
pip install ngboost==0.2.0
to install it.
I then train on a dataset where I have not imputed or deleted the missing values.
However, I get an error:
Input contains NaN, infinity or a value too large for dtype('float32').
Does this mean NGBoost cannot handle missing values automatically the way XGBoost does?
There are two possibilities with this error.
1- You have some really large values. Check the max of your columns.
2- The algorithm doesn't support NaN and inf values, so you have to handle them as you would for some other regression models.
Here's a response from one of the ngboost creators about that:
Hey #omsuchak, thanks for the suggestion. There is no one "natural" or good way to generically handle missing data. If ngboost were to do this for you, we would be making a number of choices behind the scenes that would be obscured from the user.
If we limited ourselves to use cases where the base learner is a regression tree (like we do with the feature importances) there are some reasonable default choices for what to do with missing data. Implementing those strategies here is probably not crazy hard to do but it's also not a trivial task. Either way, I'd want the user to have a transparent choice about what is going on. I'd be open to review pull requests on that front as they satisfy that requirement, but it's not something I plan on working on myself in the foreseeable future. I'll close for now but if anyone wants to try to add this please feel free to comment.
Then you can look at other answers about how to handle this, for example with the sklearn.impute.MissingIndicator module (to indicate to the model the presence of missing values) or one of the imputer modules.
If you need a practical example, you can try the survival example (located in the repo!).
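If it helps, here is a minimal sketch of that idea (not from the ngboost docs; the data and the choice of median imputation are just placeholders): impute with sklearn before fitting, and keep the missingness flags as extra features so the model still "sees" where values were absent.

import numpy as np
from sklearn.impute import SimpleImputer, MissingIndicator
from ngboost import NGBRegressor

# toy data with NaNs
X = np.array([[1.0, np.nan, 3.0],
              [4.0, 5.0, np.nan],
              [7.0, 8.0, 9.0],
              [np.nan, 2.0, 1.0]])
y = np.array([10.0, 20.0, 30.0, 40.0])

imputer = SimpleImputer(strategy="median")
indicator = MissingIndicator(features="all")

X_imputed = imputer.fit_transform(X)
X_missing_flags = indicator.fit_transform(X)      # boolean mask of missingness
X_full = np.hstack([X_imputed, X_missing_flags])  # model sees values + flags

ngb = NGBRegressor(n_estimators=50, verbose=False).fit(X_full, y)
print(ngb.predict(X_full))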
Related
I am using "implicit" package (https://github.com/benfred/implicit) to create a recommender system in python. More preciseling, I am using the implicit least square algorithm.
The library is pretty easy to use, I was able to make predictions for already existing users, or to find similar items, no prob. But how can I make predictions for a new user which was not in input data? My goal is to get prediction from a new vector of items (~a new user). All items exist in input data.
This library and other equivalent ones usually provide a predict method for user already existing in dataset.
My first attempt was to get a prediction vector for each item and sum them all, but that does not feel right, does it?
This seems like a common usage, so I think I am missing something. What would be the method to use? Thank you for your help.
It depends on what you're recommending, but if it is something like movies, then for a brand-new user we would generally just recommend the most popular ones. Then, as we get to know more about the user, we can switch to the usual matrix factorization.
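As a rough illustration (toy data, names made up), the cold-start fallback can be as simple as ranking items by total interactions in the same sparse matrix you would feed to implicit:

import numpy as np
from scipy.sparse import csr_matrix

# rows = users, cols = items, values = interaction counts (toy data)
user_items = csr_matrix(np.array([
    [1, 0, 3, 0],
    [0, 2, 1, 0],
    [4, 0, 0, 1],
]))

def most_popular_items(user_items, n=3):
    # total interactions per item across all users
    popularity = np.asarray(user_items.sum(axis=0)).ravel()
    return np.argsort(-popularity)[:n]

print(most_popular_items(user_items))  # e.g. [0, 2, 1]

If I remember correctly, implicit's ALS model also has a recalculate_user option on its recommend method that can fold in a new user's interaction vector once you have one, but double-check the current API before relying on it.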
My company's hosting a friendly data science competition based on the PhysioNet challenge here
But with a much simpler scoring system and unfortunately no prize XD. Some quick background: I'm not a data science/machine learning expert by any means. I've done some simple projects messing around with standard library models like kNN, regression, etc., and I've worked with data with missing values, but in those cases it wasn't 95+% missing and you could safely impute using the mean or median.
A quick overview of the problem: much of the data is missing because the values measured are test results, and due to costs, tests are taken infrequently and only when ordered for specific reasons (most likely because a physician suspects something's wrong with the patient). I've read a bunch of the papers from the PhysioNet challenge submissions and have come up with a few points. I've chosen features based on those used to identify sepsis and then used PCA to see if any of them are highly correlated with other sparse features that I could drop. Missing values are imputed with forward-fill if possible, but if there's no previous value they remain NaN. For features that are decently populated, I can fill missing values with the mean or median depending on the shape of the data.
Also, I'm working in Python, if that makes a difference for modules and such.
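Here's roughly what my per-patient forward-fill step looks like (column names simplified to a toy example):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "patient_id": [1, 1, 1, 2, 2],
    "lactate":    [np.nan, 2.1, np.nan, np.nan, np.nan],
})

# remember where values were originally missing (could be a useful feature)
df["lactate_missing"] = df["lactate"].isna().astype(int)

# forward-fill within each patient only, so values never leak across patients
df["lactate"] = df.groupby("patient_id")["lactate"].ffill()

print(df)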
But that's where I hit a bit of a problem. There are still a lot of null values left, and imputing doesn't make sense (if a field is 98% null, imputing with the mean would introduce tremendous bias), but discarding them also seems bad. For example, a patient who stayed for a very short time and left without getting any tests taken, because they had a quick recovery, would be missing all their test values. So, ironically, the lack of data actually tells you something in some cases.
I was trying to do some research on which models can handle null values, and so far the internet has given me regression trees and gradient boosting models, both of which I've never used, so I don't know how well they work with missing values. Also, I've read some conflicting information suggesting that some of them actually just use mean imputation under the hood to fill in the values (though I might be wrong since, again, I don't have any first-hand experience with these models; but see for example this post).
So, tl;dr: if your dataset has null values you don't want to throw out, what are some models that would handle that well? Or, for regression trees and gradient boosting models, how do they handle null values? It seems like they replace them in some cases, but sources conflict on how. The most popular model I seem to be running into is XGBoost, which also does well with structured data (which this is).
EDIT: Some relevant analysis - the data is highly skewed. There are 400+K entries and 90% of them are non-sepsis. Also, the fields with high sparsity are 90-99% null, which is why imputing with the mean or median doesn't make sense to me. Forward-filling lowers that number by quite a bit, but there's still a significant amount. There are also cases where a patient has 100% null values for a field because they never had that test requested (so there's no way to impute even if you wanted to).
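From what I've read so far (please correct me if this is wrong), XGBoost can be fed NaNs directly and learns a default split direction for missing values rather than imputing them. A toy example of what I mean:

import numpy as np
from xgboost import XGBClassifier

# toy data containing NaNs; no imputation is done before fitting
X = np.array([
    [1.0, np.nan],
    [2.0, 0.5],
    [np.nan, 0.7],
    [3.0, np.nan],
    [2.5, 0.9],
    [0.5, 0.1],
])
y = np.array([0, 0, 1, 1, 1, 0])

model = XGBClassifier(n_estimators=20, max_depth=2, eval_metric="logloss")
model.fit(X, y)
print(model.predict(X))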
I'm working on a model that would predict an exam schedule for a given course and term. My input would be the term and the course name, and the output would be the date. I'm currently done with the data cleaning and preprocessing step; however, I can't wrap my head around a way to make a model whose input is two strings and whose output is two numbers (the day and month of the exam). One approach I thought of would be encoding my course names and writing the term as a binary list, i.e. input: encoded(course), [0,0,1]; output: day, month, and then feeding that to a regression model.
I hope someone more experienced can tell me a better approach.
Before I start answering your question:
/rant
I know this sounds dumb and doesn't really help with your question, but why are you using neural networks for this?!
To me, this seems like the classic case of "everybody uses ML/AI in their area, so now I have to, too!" (which is simply not true). /rant over
For string-like inputs, there are several encoding methods; choosing the right one may depend on your specific task. Since you have a very "simple" (and predictable) input - i.e., you know in advance that there will not be any new/unseen course titles during testing/inference, and you do not need contextual/semantic information - you can resort to something like scikit-learn's LabelEncoder, which will turn the course names into distinct classes.
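A minimal sketch of that encoding step (the course and term values here are just invented placeholders):

from sklearn.preprocessing import LabelEncoder

courses = ["Linear Algebra", "Databases", "Linear Algebra", "Ethics"]
terms = ["Fall", "Spring", "Spring", "Fall"]

course_encoder = LabelEncoder()
term_encoder = LabelEncoder()

course_ids = course_encoder.fit_transform(courses)  # e.g. [2, 0, 2, 1]
term_ids = term_encoder.fit_transform(terms)        # e.g. [0, 1, 1, 0]

print(course_ids, term_ids)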
Alternatively, you could throw a more heavy-weight encoding structure at the problem, one that embeds the values in a matrix. Most DL frameworks offer an internal function for this, which basically requires you to pass a unique index for each input value and then actively learns a k-dimensional embedding vector for it. Intuitively, these embeddings correspond to a semantic or topical direction. If you have, for example, 3-dimensional embeddings, the first dimension could represent "social sciences course", the second "technical course", and the third "seminar".
Of course, this is a simplification, but it helps to imagine how it works.
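For example, with PyTorch (just one possible framework; the dimensions are chosen arbitrarily) that embedding layer looks like this:

import torch
import torch.nn as nn

num_courses = 50      # size of the course vocabulary
embedding_dim = 3     # k-dimensional embedding per course

embedding = nn.Embedding(num_embeddings=num_courses, embedding_dim=embedding_dim)

course_indices = torch.tensor([4, 17, 4])   # each course gets a unique index
course_vectors = embedding(course_indices)  # learned during training
print(course_vectors.shape)                 # torch.Size([3, 3])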
For the output, predicting a specific date is actually a really good question. As I have personally never predicted dates myself, I can only point to tips by other users. A nice answer on dates (as input) is given here.
If you can sacrifice a little bit of accuracy in the result, predicting the calendar week in which the exam happens might be a good idea. Otherwise, you could simply treat it as two regressed values, but you might end up with invalid combinations (i.e. "negative days/months", or something like "31st of February").
Depending on how much high-quality training data you have, results might vary quite heavily. Lastly, I would again recommend that you reconsider whether you actually need a neural network for this task, or whether there are simpler methods that would do the job.
Create dummy variables from the course names, or use a RandomForest on the encoded features. Between them you can map text input to numerical output.
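As a rough sketch of that route (all data made up): dummy variables plus a multi-output RandomForest predicting day and month together.

import pandas as pd
from sklearn.ensemble import RandomForestRegressor

df = pd.DataFrame({
    "course": ["Algebra", "Databases", "Algebra", "Ethics"],
    "term":   ["Fall", "Spring", "Spring", "Fall"],
    "day":    [12, 3, 15, 22],
    "month":  [1, 6, 6, 1],
})

X = pd.get_dummies(df[["course", "term"]])  # dummy variables for the text columns
y = df[["day", "month"]]                    # multi-output target: day and month

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X, y)
print(model.predict(X[:1]))  # predicted [day, month] for the first row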
I am working in Python 2.7. I want to create nomograms based on data for various variables in order to predict one variable. I have looked into, and installed, the PyNomo package.
However, from the documentation here and here, and from the examples, it seems that nomograms can only be made when you have equation(s) relating these variables, not from data alone. For example, the examples here show how to use equations to create nomograms. What I want is to create a nomogram from the data and use that to predict things. How do I do that? In other words, how do I make the nomograph take data as input and not a function as input? Is it even possible?
Any input would be helpful. If PyNomo cannot do it, please suggest some other package (in any language). For example, I am trying the nomogram function from the rms package in R, but I am not having luck figuring out how to use it properly. I have asked a separate question for that here.
The term "nomogram" has become somewhat confused of late as it now refers to two entirely different things.
A classic nomogram performs a full calculation - you mark two scales, draw a straight line across the marks, and read your answer off a third scale. This is the type of nomogram that PyNomo produces, and as you correctly say, you need a formula. So producing a nomogram like this from data is definitely a two-step process: first fit an equation (e.g., a regression model) to the data, then build the nomogram from that equation.
The other use of the term (very popular recently) is to refer to regression nomograms. These are graphical depictions of regression models (usually logistic regression models). For these, a group of parallel predictor variables is depicted with a common scale at the bottom; for each predictor you read off the 'score' from the scale and add these up. These types of nomograms have become very popular in the last few years, and that's what the rms package will draw. I haven't used it, but my understanding is that it works directly from the data.
Hope this is of some use! :-)
I'm trying to do a PCA analysis on a masked array. From what I can tell, matplotlib.mlab.PCA doesn't work if the original 2D matrix has missing values. Does anyone have recommendations for doing a PCA with missing values in Python?
Thanks.
Imputing the data will skew the result in ways that can bias the PCA estimates. A better approach is to use a PPCA (probabilistic PCA) algorithm, which gives the same result as PCA but, in some implementations, can deal with missing data more robustly.
I have found two libraries:
The PPCA package on PyPI, which is called PCA-magic on GitHub
The PyPPCA package, which has the same name on PyPI and GitHub
Since the packages are not actively maintained, you might want to implement PPCA yourself instead. The code in both libraries builds on the theory presented in the well-cited (and well-written!) paper by Tipping and Bishop (1999). It is available on Tipping's home page if you want guidance on how to implement PPCA properly.
As an aside, the sklearn implementation of PCA is actually a PPCA implementation based on Tipping and Bishop (1999), but they have chosen not to implement it in a way that handles missing values.
EDIT: both of the libraries above had issues, so I could not use them directly myself. I forked PyPPCA and fixed the bugs; it is available on GitHub.
I think you will probably need to do some preprocessing of the data before doing PCA.
You can use:
sklearn.impute.SimpleImputer
https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html#sklearn.impute.SimpleImputer
With this class you can automatically replace the missing values with the mean, median, or most frequent value. Which of these options is best is hard to say; it depends on many factors, such as what the data looks like.
By the way, you can also run PCA with the same library:
sklearn.decomposition.PCA
http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
And many other statistical functions and machine learning techniques.
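Put together, a minimal sketch (on toy data) of imputing and then running PCA in one pipeline might look like this:

import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline

X = np.array([
    [1.0, np.nan, 3.0],
    [2.0, 4.0, np.nan],
    [3.0, 5.0, 7.0],
    [np.nan, 6.0, 9.0],
])

pipeline = make_pipeline(
    SimpleImputer(strategy="mean"),  # could also be "median" or "most_frequent"
    PCA(n_components=2),
)

X_reduced = pipeline.fit_transform(X)
print(X_reduced.shape)  # (4, 2)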