I have access to a dataframe of 100 people and how they performed on a certain motion test. The frame contains about 25,000 rows per person, since each person's performance is recorded approximately every centisecond (10^-2 s). We want to use this data to predict a binary y-label: whether someone has a motor problem or not.
Neural networks trained on the means and variances of certain columns per person classified ~72% of the data correctly.
A naive Bayes classifier on the same per-person means and variances classified ~80% correctly.
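The per-person aggregation described above can be sketched roughly like this (the column names, data, and labels are made up purely for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)

# Hypothetical raw data: many timesteps per person, a few sensor columns.
n_people, n_steps = 20, 500
df = pd.DataFrame({
    "person_id": np.repeat(np.arange(n_people), n_steps),
    "speed": rng.normal(0, 1, n_people * n_steps),
    "tremor": rng.normal(0, 1, n_people * n_steps),
})
labels = pd.Series(rng.integers(0, 2, n_people), name="y")  # one binary label per person

# Collapse each person's time series into per-person means and variances.
features = df.groupby("person_id").agg(["mean", "var"])
features.columns = ["_".join(c) for c in features.columns]  # flatten the MultiIndex

clf = GaussianNB().fit(features, labels)
preds = clf.predict(features)
```

The key step is the `groupby(...).agg(...)` call, which turns 25,000 rows per person into one feature row per person.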
Now, since this is time-based data ('performance on this test through time'), we were advised to use recurrent neural networks. I've looked into this and found that RNNs are mostly used to predict future events, i.e. the events happening in the next centiseconds.
The question is: is it generally feasible to use RNNs on (in a way, time-based) data like this to predict a binary label? If not, what is?
Yes, it is definitely feasible, and also very common. Search for any document classification task (e.g. sentiment analysis) for examples of this kind of task.
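To illustrate why this works: a many-to-one RNN reads the whole sequence, carries a hidden state through time, and maps only the final state to a single probability. A minimal numpy sketch of the forward pass (the weights here are random, purely for illustration; in practice you would train an LSTM/GRU in a framework like Keras or PyTorch):

```python
import numpy as np

rng = np.random.default_rng(42)

def rnn_classify(x, Wxh, Whh, Why, bh, by):
    """Many-to-one RNN: consume a (T, d) sequence, return one probability."""
    h = np.zeros(Whh.shape[0])
    for x_t in x:                          # step through time
        h = np.tanh(Wxh @ x_t + Whh @ h + bh)
    logit = Why @ h + by                   # only the final hidden state is used
    return 1.0 / (1.0 + np.exp(-logit))    # sigmoid -> P(label = 1)

d, hidden = 3, 8                           # 3 sensor columns, 8 hidden units (arbitrary)
Wxh = rng.normal(0, 0.1, (hidden, d))
Whh = rng.normal(0, 0.1, (hidden, hidden))
Why = rng.normal(0, 0.1, hidden)
bh, by = np.zeros(hidden), 0.0

seq = rng.normal(0, 1, (25_000, d))        # one person's ~25,000 timesteps
p = rnn_classify(seq, Wxh, Whh, Why, bh, by)
```

The point is structural: however long the sequence, the output is one number per person, which is exactly the binary-label setting in the question.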
Related
I've been given a YouTube trending dataset with the assignment to build a predictive model that outputs the probability of a video getting into trending, with at least 60% accuracy.
I have the title, channel, thumbnail_link, views, likes, dislikes, comments, date, ...
I've done some analysis, and unsurprisingly the important columns are category and tags (a "|"-separated list).
The problem is that all videos in the dataset are assumed to have trended, so I can't fit a classifier on training data to predict a trending yes/no column, or use a regression algorithm, without changing the goal to something like "predict how liked it will be".
So it sounds like what I'm looking for is a clustering algorithm. I've looked into K-Means, but as far as I can tell it won't do the trick.
I'm thinking I could compare videos pairwise by the categories and tags they contain and score them by the popularity of those, or write a distance-based similarity function, but the implication is that I should use scikit-learn.
This sounds like a one-class classification problem. Some options are:
fit a representative distribution to the data, then for a new observation (video) check how likely it is to have come from that distribution
fit a classifier that essentially finds the boundary of the data, then for a new observation tells you how far inside/outside the boundary it is, for example sklearn.svm.OneClassSVM
fit cluster centers, or find archetypal examples, then for a new observation measure how far it is from the nearest cluster center compared to an average observation in the training data
Just some ideas, there are certainly other approaches. :)
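A quick sketch of the OneClassSVM option (the feature values are synthetic; in the real task they would come from the video's encoded category/tag features):

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)

# Hypothetical per-video feature vectors; all training videos are "trending",
# so we model only that one class.
X_train = rng.normal(0, 1, (200, 5))

ocsvm = OneClassSVM(kernel="rbf", nu=0.1).fit(X_train)

# decision_function: positive = inside the learned boundary, negative = outside.
new_videos = np.vstack([np.zeros(5), np.full(5, 10.0)])  # one typical point, one far-out point
scores = ocsvm.decision_function(new_videos)
preds = ocsvm.predict(new_videos)  # +1 = looks like the training data, -1 = outlier
```

`nu` roughly bounds the fraction of training points allowed outside the boundary, so it controls how strict "looks like a trending video" is.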
I need to train a model with scikit-learn to predict the times when there are fewer people in a room.
Here is what my dataset looks like:
Time                 PeopleCount
--------------------------------
2019-12-29 12:40:10           50
2019-12-29 12:42:10           30
2019-12-29 12:44:10           10
2019-12-29 12:46:10           10
2019-12-29 12:48:10           80
and so on...
This data will be available for 30 days.
Once the model is trained, I will query it for the time when there will be the fewest people in the room between 10 AM and 8 PM. I expect the model to respond with 30-minute granularity, e.g. "3:00 PM to 3:30 PM".
What algorithm can I use for this problem, and how can I achieve the goal? Or are there Python libraries other than scikit-learn that can be used for this purpose?
I am new to machine learning, so sorry for the naive question.
First of all, time-series prediction rests on the assumption that the current value depends, more or less, on past ones. For instance, the people count of 80 at 2019-12-29 12:48:10 would have to be strongly influenced by the counts at 12:46:10, 12:44:10, and earlier, i.e. correlated with past values. If that is not the case, you would be better off using a different kind of algorithm for prediction.
While scikit-learn contains a wide range of machine learning modules, many of them specialize in classification. A classification algorithm may well satisfy your needs if your data does not really behave like a time series. scikit-learn also has regression modules, though those are not particularly well suited to forecasting time-series data.
For forecasting time-series data, RNN and LSTM algorithms (deep learning) are widely used, but scikit-learn does not provide built-in implementations of them. So you might be better off studying the TensorFlow or PyTorch frameworks, which are the common tools for building RNN or LSTM models.
scikit-learn models do not recognize timestamps, so you will have to break your timestamp column down into a number of features, e.g. day of week, hour, etc. If you need 30-minute granularity, you will have to aggregate the PeopleCount column somehow, e.g. record the average, minimum, or maximum number of people within each 30-minute interval. It may also be a good idea to create lagged features, e.g. the people count in the previous time slot, or even 2, 3, or n slots ago.
Once you have your time features and labels (the corresponding people counts) ready, you can train your models in the standard way:
split your data into training and validation sets,
train each model that you want to try and compare the results.
Any regressor should be suitable for this task, e.g. Ridge, Lasso, DecisionTreeRegressor, SVR, etc. Note, however, that to find the best time slot in a given range you will need to make predictions for every slot in the range and pick the one that fits the criteria; there may be cases where even the smallest predicted value is not smaller than the value you compare it against.
If you do not get satisfactory results with regressors, e.g. the mean or median squared error is consistently too high, you could recast this as a classification problem: instead of training a regressor to predict the number of people, train a classifier to predict whether the count is greater than 50 or not.
There are many ways to approach this problem. Once you try different models and examine the results, you will find ways to optimize parameters, engineer features, pre-process the input, etc.
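As a concrete sketch of the feature-engineering and slot-picking steps above (all data is synthetic, with an artificial mid-afternoon dip built in so there is something to learn; the 30-minute aggregation is assumed to have already happened):

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)

# Synthetic 30-day log, one row per 30-minute slot between 10 AM and 8 PM.
idx = pd.date_range("2019-12-01 10:00", "2019-12-30 20:00", freq="30min")
df = pd.DataFrame({"time": idx})
df = df[(df["time"].dt.hour >= 10) & (df["time"].dt.hour < 20)]
# People count: a daily dip around 15:00 plus noise.
df["count"] = (60 - 40 * np.exp(-((df["time"].dt.hour - 15) ** 2) / 4)
               + rng.normal(0, 5, len(df))).clip(0)

# Break the timestamp into plain numeric features, plus one lagged feature.
df["dow"] = df["time"].dt.dayofweek
df["slot"] = df["time"].dt.hour * 2 + df["time"].dt.minute // 30
df["lag1"] = df["count"].shift(1).bfill()        # count in the previous slot

X, y = df[["dow", "slot", "lag1"]], df["count"]
model = DecisionTreeRegressor(max_depth=5).fit(X, y)

# Predict every slot of one day and pick the quietest one.
day = df[df["time"].dt.date == df["time"].dt.date.iloc[-1]]
best = day.iloc[model.predict(day[["dow", "slot", "lag1"]]).argmin()]["time"]
```

The last two lines implement the "predict every slot, then pick the minimum" loop described in the answer; in a real application you would of course split off a validation set before trusting the model.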
I've read a lot of tutorials on the web and threads on Stack Overflow, but one question is still foggy for me. Considering just the stage of collecting data for multi-label training, which of the following approaches is better, and are both of them acceptable and effective?
Try to find 'pure' single-label examples at any cost.
Allow every example to be multi-labeled.
For instance, I have articles about war, politics, economics, and culture. Usually politics is tied to economics, war is connected to politics, economic issues may appear in culture articles, etc. I could assign strictly one main theme to each example and drop uncertain cases, or assign 2-3 topics each.
I'm going to train using spaCy; the volume of data will be about 5-10 thousand examples per topic.
I'd be grateful for any explanation and/or a link to some relevant discussion.
You can try the OneVsAll / OneVsRest strategy. It lets you do both: predict exactly one category, without needing to strictly assign a single label during data collection. From the scikit-learn documentation:
Also known as one-vs-all, this strategy consists in fitting one classifier per class. For each classifier, the class is fitted against all the other classes. In addition to its computational efficiency (only n_classes classifiers are needed), one advantage of this approach is its interpretability. Since each class is represented by one and one classifier only, it is possible to gain knowledge about the class by inspecting its corresponding classifier. This is the most commonly used strategy for multiclass classification and is a fair default choice.

This strategy can also be used for multilabel learning, where a classifier is used to predict multiple labels for instance, by fitting on a 2-d matrix in which cell [i, j] is 1 if sample i has label j and 0 otherwise.
Link to docs:
https://scikit-learn.org/stable/modules/generated/sklearn.multiclass.OneVsRestClassifier.html
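A minimal sketch of the multilabel case with OneVsRestClassifier (the features and label columns are invented for illustration, with economics deliberately correlated with politics to mimic the article example):

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy article features and a multilabel target matrix:
# columns = (politics, economics, war, culture); cell [i, j] = 1 if article i has label j.
X = rng.normal(0, 1, (200, 10))
Y = np.zeros((200, 4), dtype=int)
Y[:, 0] = (X[:, 0] > 0).astype(int)             # politics driven by feature 0
Y[:, 1] = (X[:, 0] + X[:, 1] > 0).astype(int)   # economics correlated with politics
Y[:, 2] = (X[:, 2] > 0.5).astype(int)           # war
Y[:, 3] = (X[:, 3] > 0.5).astype(int)           # culture

clf = OneVsRestClassifier(LogisticRegression()).fit(X, Y)
pred = clf.predict(X)          # 0/1 matrix: several labels per article are allowed
proba = clf.predict_proba(X)   # per-label probabilities; take the argmax for a single main topic
```

This is the "both" from above: `predict` gives a full multilabel assignment, while the argmax over `predict_proba` gives exactly one main topic per article.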
I am in 10th grade and I am looking to use a machine learning model on patient data to find a correlation between the time of week and patient adherence. I have separated the week into 21 time slots, three for each part of the day (1 is Monday morning, 2 is Monday afternoon, etc.). Adherence values are binary (0 means they did not take the medicine, 1 means they did). I will simulate training, validation, and test data for my model.

From my understanding, I can use a logistic regression model to output the probability of the patient missing their medication in a certain time slot, given past data for that slot. Logistic regression outputs binary values when given a threshold and is good for problems dealing with probability and binary classes, which is my scenario: the two classes are "yes, they will take their medicine" and "no, they will not".

But the major problem is that this data will be non-linear, at least to my understanding. To make this clearer, here is a real-life example: if a patient has yoga class on Sunday mornings (time slot 19) and tends to forget their medication at this time, then most of the values in time slot 19 would be 0s, while all the other time slots would have many more 1s. The goal is a model which, given past data, realizes that the patient is very likely to miss their medication in the next occurrence of time slot 19.

I believe that logistic regression must be used on data that has an inherently linear distribution, though I am not sure. I also understand that neural networks are ideal for non-linear distributions, but they require a lot of data to work properly, and ideally my model should function decently with just a few weeks of data.
Of course any model becomes more accurate with more data, but it seems to me that neural networks generally need thousands of samples to become decently accurate. Again, I could very well be wrong.
My question is really what type of model would work here. I know I will need some form of supervised classification. But can I use logistic regression to predict adherence given the time of week?
Really, any general feedback on my project is greatly appreciated! Please keep in mind that I am only 15, so some of my statements may be wrong and I won't be able to fully understand very complex replies.
I also have to complete this within the next two weeks, so please do not hesitate to respond as soon as you can! Thank you so much!
In my opinion, a logistic regression won't be enough for this, since you are going to use a single parameter as input. When I imagine a decision boundary for this problem, I don't think it can be achieved by a single neuron (which is what a logistic regression is). It may need a few more neurons, or even a few layers of them, and you may need a lot of data for that.
It's true that you need a lot of data to apply neural networks.
It would be helpful if you could be more precise about your dataset and features. You could also try K-Means clustering for your project. But if your aim is to find out whether the patient took the medicine or not, that can be done with logistic regression.
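On the nonlinearity worry in the question: if each of the 21 time slots is one-hot encoded, logistic regression learns a separate weight per slot, so a pattern like "misses the dose in slot 19" is exactly the kind of thing it can represent. A sketch with simulated adherence data (the slot-19 miss rate is invented to mirror the yoga example):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)

# Simulate 12 weeks of data: 21 slots per week, adherence mostly 1,
# except slot 19 (Sunday-morning yoga), where the patient usually forgets.
weeks = 12
slots = np.tile(np.arange(1, 22), weeks).reshape(-1, 1)
took = rng.random(len(slots)) < np.where(slots.ravel() == 19, 0.1, 0.9)

enc = OneHotEncoder()
X = enc.fit_transform(slots)           # one indicator column per time slot
clf = LogisticRegression().fit(X, took.astype(int))

# P(takes medicine) for slot 19 vs. an ordinary slot.
p = clf.predict_proba(enc.transform([[19], [3]]))[:, 1]
```

Because each slot gets its own coefficient, no nonlinear model is needed for this particular pattern; the "linearity" of logistic regression is in the features, and one-hot features make the slots independent.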
I have IMU data (accelerometer, magnetometer, and gyroscope) recorded during a variety of exercises (squats, push-ups, sit-ups, burpees). The exercises occur back to back in a single 1D time-series signal, and I would like to use a machine learning method to identify the different exercises within the signal. I do not want to condense the signal into 0D peaks and build my features that way; I would rather keep the time domain intact. Below is a figure showing example accelerometer data containing the four exercises. My question is therefore: which method would be most effective at doing so? K-means clustering would be perfect in the 0D sense, so is there a 1D equivalent? Any pointers to Python (sklearn) resources would be greatly appreciated!
Thanks in advance!
I think that rather than classification, you want to do clustering. Classification puts data into predefined categories (usually based on some training data), whereas clustering groups the data into previously unknown classes. Here is a short table showing the difference between classification and clustering.
One thing you can do is chop the time series into overlapping windows (perhaps 1000 timesteps each) and calculate some statistics for each window (mean, variance, etc.). Then perform K-Means clustering on those statistics.
After performing clustering, you could use the classes identified during clustering to create training data for a classifier.
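The chop-and-summarize idea might look like this (the signal is synthetic, with four segments that differ in amplitude and frequency; real IMU data would replace it, and the window sizes are arbitrary):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Synthetic 1D accelerometer stream: four back-to-back "exercises".
t = np.arange(1000)
signal = np.concatenate([
    1.0 * np.sin(t * 0.05),   # exercise A: slow, medium amplitude
    3.0 * np.sin(t * 0.30),   # exercise B: fast, large
    0.3 * np.sin(t * 0.10),   # exercise C: tiny
    2.0 * np.sin(t * 0.02),   # exercise D: very slow, large
]) + rng.normal(0, 0.05, 4000)

# Chop into overlapping windows and summarize each with simple statistics.
win, hop = 200, 100
feats = np.array([
    [w.mean(), w.std(), np.abs(np.diff(w)).mean()]   # level, spread, "speed"
    for w in (signal[i:i + win] for i in range(0, len(signal) - win + 1, hop))
])

labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(feats)
```

Each window now has a cluster label, so runs of identical labels along the stream correspond to segments of the same exercise, and those runs can then serve as training labels for a classifier.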
For time-series data the standard method is bag-of-frames: chopping the signal into small chunks called frames. The frames can be overlapping and windowed, or disjoint. The frame size is an important hyperparameter and depends on the task. Features like min, max, median, variance, and RMS are calculated on each frame. To let the classifier use variation over time, one adds lagged or delta features: lagged features are the values from previous frames, and delta features are computed as the difference between the current frame and the previous one.
For classification you will need to label the segments of the different activities. Note that for human-activity detection on accelerometer data there are also many public datasets available, such as UCI: Human Activity Recognition Using Smartphones.
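A sketch of the bag-of-frames pipeline with lagged and delta features (the two "activities" are synthetic sinusoids standing in for labeled IMU segments, and the frame size is arbitrary):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)

# Two synthetic activities, alternating every 500 samples.
sig = np.concatenate([np.sin(np.arange(500) * (0.05 if k % 2 == 0 else 0.4)) * (1 + k % 2)
                      for k in range(8)]) + rng.normal(0, 0.05, 4000)
truth = np.repeat([0, 1, 0, 1, 0, 1, 0, 1], 500)   # activity label per sample

# Disjoint frames with per-frame statistics (min, max, median, variance, RMS)...
frame = 100
stats = np.array([[f.min(), f.max(), np.median(f), f.var(), np.sqrt((f ** 2).mean())]
                  for f in sig.reshape(-1, frame)])
labels = truth.reshape(-1, frame)[:, 0]            # one label per frame

# ...plus lagged (previous frame) and delta (current minus previous) features.
lag = np.vstack([stats[:1], stats[:-1]])
X = np.hstack([stats, lag, stats - lag])

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, labels)
acc = clf.score(X, labels)
```

This is training accuracy only; for a real evaluation the frames should be split by time (e.g. train on early segments, test on later ones) so overlapping information does not leak between sets.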