I am in 10th grade and I am looking to use a machine learning model on patient data to find a correlation between the time of week and patient adherence. I have separated the week into 21 time slots, three for each time of day (1 is Monday morning, 2 is monday afternoon, etc.). Adherence values will be binary (0 means they did not take the medicine, 1 means they did). I will simulate training, validation and test data for my model. From my understanding, I can use a logistic regression model to output the probability of the patient missing their medication on a certain time slot given past data for that time slot. This is because logistic regression outputs binary values when given a threshold and is good for problems dealing with probability and binary classes, which is my scenario. In my case, the two classes I am dealing with are yes they will take their medicine, and no they will not. But the major problem with this is that this data will be non-linear, at least to my understanding. To make this more clear, let me give a real life example. If a patient has yoga class on Sunday mornings, (time slot 19) and tends to forget to take their medication at this time, then most of the numbers under time slot 19 would be 0s, while all the other time slots would have many more 1s. The goal is to create a machine learning model which can realize given past data that the patient is very likely going to miss their medication on the next time slot 19. I believe that logistic regression must be used on data that still has an inherently linear data distribution, however I am not sure. I also understand that neural networks are ideal for non-linear distributions, but neural networks require a lot of data to function properly, and ideally the goal of my model is to be able to function decently with simply a few weeks of data. Of course any model becomes more accurate with more data, but it seems to me that generally neural networks need thousands of data sets to truly become decently accurate. Again, I could very well be wrong.
My question is really what model type would work here. I do know that I will need some form of supervised classification. But can I use logistic regression to make predictions when given time of week about adherence?
Really any general feedback on my project is greatly appreciated! Please keep in mind I am only 15, and so certain statements I made were possibly wrong and I will not be able to fully understand very complex replies.
I also have to complete this within the next two weeks, so please do not hesitate to respond as soon as you can! Thank you so much!
In my opinion a logistic regression won't be enough for this as u are going to use a single parameter as input. When I imagine a decision line for this problem, I don't think it can be achieved by a single neuron(a logistic regression). It may need few more neurons or even few layers of them to do so. And u may need a lot of data set for this purpose.
It's true that you need a lot of data for applying neural networks.
It would have been helpful if you could be more precise about your dataset and the features. You can also try implementing K-Means-Clustering for your project. If your aim is to find out that did the patient took medicine or not then it can be done using logistic regression.
Related
If you want to check an anomaly in stock data many studies use a linear regression. Let's say you want to check if there is a Monday effect, meaning that monday is significantly worse than other days.
I understood that we can use a regression like: return = a + b DummyMon + e
a is the constant, b the regression coefficient, we have the Dummy for Monday and the error term e.
That's what I used in python:
First you add a constant to the anomaly:
anomaly = sm.add_constant(anomaly)
Then you build the model:
model = sm.OLS(return, anomaly)
The you fit the model:
results = model.fit()
I wonder if this is the correct model setup.
In this case a plot of the linear regression would just show two vertical areas above 0 (for no Monday) and 1 (for Monday) with all the returns. It looks pretty strange. Is this correct?
Should I somehow try to use the time (t) in the regression? If so, how can I do it with python? I thought about giving each date an increasing number, but then I wondered how to treat weekends.
I would assume that with many data points both approaches are similar, if the time series is stationary, right? In the end I do a cross section anaylsis and don't care about the aspect of the time series in this case, correct? ( I heard about GARCH models etc, where this is a different)
Well, I am just learning and hope someone could give me some ideas about the topic.
Thank you very much in advance.
For time series analysis tasks (such as forecasting or anomaly detection), you may need a more advanced model, such as Recurrent Neural Networks (RNN) in deep learning. You can assign any time step to an RNN Cell, in your case, every RNN Cell can represent a day or maybe an hour or half a day etc.
The main purpose of the RNNs is to make the model understand the time dependencies in the data. For example, if monday has a bad affect, then corresponding RNN Cells will have reasonable parameters. I would recommend you to do some further research about it. Here there are some documentations that may help:
https://colah.github.io/posts/2015-08-Understanding-LSTMs/
(Also includes different types of RNN)
https://towardsdatascience.com/understanding-rnn-and-lstm-f7cdf6dfc14e
And you can use tensorflow, keras or PyTorch libraries.
This is a question about a general approach rather than a specific coding problem. I'm trying to do time series forecasting with Tensorflow where features of the label timestep are known to the model. E.g. a human trying to predict a variable a week from now would know things that are going to happen in the next week that will affect that variable. So a window of 20 timesteps where the label is the 20th timestep would look something like this:
Timesteps 1-19 would each have a set of features plus the timeseries data
Timestep 20 would have a set of features which are known, plus the timeseries label which is unknown
Is there a model that could handle this sort of data? I've gone through the Tensorflow time series forecasting tutorial, done a Coursera course on Tensorflow time series forecasting and searched elsewhere but I can't find anything. I'm fairly new to this so apologies for any imprecise language.
I once tried to do this kind of TS problem by stacking a multivariate model and another machine learning model. My idea was that I use the normal TS model's output, add it as another feature in the other model that only takes the last time step's info as input. But it is complicated and might overfit a lot even if I carefully regularized the second model. The idea is that I use step 1 to window_size - 1 info to predict a rough output at step window_size, then use the info at step window_size to reduce the residual between my TS model output and the actual label; But I don't think this approach is theoretically correct and the result might be worse than using a TS model without feeding the target step's info.
I don't think tensorflow have any API for your problem because this type of problem is not a normal TS problem. Usually people would just treat this kind of problem as a regression or classification problem.
I am not an expert on this problem as well, but I just happened to attempt to solve the exact problem so this is just my personal experience...
I need to train a model with scikit-learn to predict possible time for less people in a room.
Here is how my dataset looks like:
Time PeopleCount
---------------------------------------------
2019-12-29 12:40:10 50
2019-12-29 12:42:10 30
2019-12-29 12:44:10 10
2019-12-29 12:46:10 10
2019-12-29 12:48:10 80
and so on...
This data will be available for 30 days.
Once the model is trained, I will query the model to get the possible time when there will be fewer people in the room between 10.AM and 8.PM. I expect the machine learning model to respond back with the 30-minute accuracy, ie. "3.00 PM to 3.30PM"
What algorithm can I use for this problem and how can I achieve the goal? Or are there any other Python libraries than SciKit-Learn which can be used for this purpose?
I am new to machine learning, sorry for a naive question.
First of all, time-series prediction is on the base of theory that current value more or less depend on the past ones. For instance, 80 of people count as of 2019-12-29 12:48:10 has to be strongly influenced on the people count at the time of 12:46:10, 12:44:20 or previous ones, correlating with past values. If not, you would be better off using the other algorithm for prediction.
While the scikit package contains a various modules as the machine learning algorithm, most of them specialize in the classification algorithm. I think the classification algorithm certainly satisfy your demand if your date is not identified as the type of time series. Actually, scikit also has some regression module, even though I think that seem not to be well suitable for prediction of time series data.
In the case of prediction of time series data, RNN or LSTM algorithm (Deep Learning) has been widely utilized, but scikit does not provide the build-in algorithm of it. So, you might be better off studying Tensorflow or Pytorch framework which are common tools to be enable you to build the RNN or LSTM model.
SciKitLearn models do not recognize timestamps, so you will have to break down your timestamp column into a number of features, ie. day of week, hour, etc. If you need 30-minute accuracy then you will have to aggregate your data from the PeopleCount column somehow, ie. record average, minimum or maximum number of people within each 30-minute time interval. It may be a good idea to also create lagged features, ie. what was the people count in a previous time slot or even 2, 3 or n time slots ago.
Once you have you have your time features and labels (corresponding people counts) ready you can start training your models in standard way:
split your data into training and validation sets,
train each model that you want to try and compare the results.
Any regressor should be suitable for this task, ie. Ridge, Lasso, DecisionTreeRegressor, SVR etc. Note however that if you need to get the best time slot from the given range you will need to make predictions for every slot from the range and pick the one which fits the criteria, although there may be cases where the smallest predicted value is not smaller then value you compare it with.
If you do not get satisfying results with regressors, ie. every time the mean or median squared errors are too high, you could come up with a classification case, ie. instead of training a regressor to predict the number of people you can train a classifier to predict whether the count is greater than 50 or not.
There are many ways to approach this problem. Once try different models and examine the results you will come up with ways to optimize the parameters, engineer features, pre-process the input etc.
I’m relatively new to the topic of machine learning, so naturally I have a couple of issues that I hope you can help me with or lead me in the right direction. I had a project before, during which we collected data of people walking normally and also with a stone in their shoe. We measured Acceleration and also with a gyroscope sensor. Based on this data I build a neural network that can classify the signals into normal or impaired walking. So two possible outputs.
Now my idea is this: I want to, using the same data, build a network that can predict the weights of the participants (it was also recorded).
Based on this my three questions:
- What kind of network structure is most suitable for such a task? (Dense, CNN, LSTM,…)
- Before the network basically had two options to answer from (normal or impaired walking) but now I have a continuous range of answers… How can this be approached?
- How can I make sure the network initializes with a sensible prediction?
I hope all the questions make sense. Any help will be much appreciated!
You can use the NNa architecture you prefer:
If you work with sequences use 1d convolutionals or RNNs.
As you are dealing with a regression problem you have to have a single neuron as output without activation function.
Take a.look here to learn to solve a regression problem with RNNs
Sorry for the weird title, I don't know how to better express my problem. I'm working with an insurance dataset to predict future claim costs for a given policy.
For anyone who has worked with insurance claim data, you know that the claims are heavily 0-weighted. I've run into the issue before where regression on the entire dataset does not perform well, due to the skew of the data, and the continuous-discrete distribution mix.
I've tried some Tweedie distributions in R to help with this disconnect, but I ended up going a different route.
I first decided to classify the data into "Claim Amount = 0" and "Claim amount != 0", by using a support vector classifier sklearn.svm.svc(with 98% training and 95% test accuracy), where if a claim amount is predicted to be != 0, it will be fed into a regression model to predict the incurred claim amount. I decided to go with ridge regression sklearn.linear_model.Ridge for this part, and achieved a relatively good $R^2$ of 0.67 for the test set (real world data, so I'm not expecting anything extraordinary).
So my question is, what is the best way to evaluate this composite model, specifically in python? Do you think the MSE would be a good metric? The only other model I can compare it to is a basic linear regression (on the entire dataset, without the pre-classification).
Of course, feel free to suggest alternatives to this two-part classification-regression model.
EDIT: To clarify, I chose these specific models (over neural networks, for example) because of their ability to be translated into simple math for different applications.