I am trying to predict hits in a sequence of numbers, but it is important that I do not predict hits that already exist. For example, I have a sequence of numbers from 1 to 100000, and for some of these numbers I have a hit. Now I would like to predict even more hits. Which method is suitable for this?
Your problem statement is very high level. From what I understand, neural networks do what you need. A neural network takes input data that is already classified as hit or not hit and learns what makes a hit. Then you can feed it as much new data as you want, and the learned rules will be applied to it. If you want to learn more, just go here, or update your question.
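To make that workflow concrete: the data and the underlying "hit" rule below are entirely made up, and instead of a full neural network this sketch uses a hand-rolled logistic regression, which is the simplest model that follows the same pattern (learn from labeled hits, then predict on new numbers).

```python
import numpy as np

# Entirely made-up data: numbers in [1, 100000), with a hypothetical
# rule that "hits" tend to be large numbers. Labels mark known hits.
rng = np.random.default_rng(0)
numbers = rng.integers(1, 100_000, size=500)
labels = (numbers > 60_000).astype(float)  # stand-in ground truth

# Learn the hit/not-hit rule with a tiny hand-rolled logistic regression.
x = (numbers - numbers.mean()) / numbers.std()  # normalized feature
w, b = 0.0, 0.0
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-(w * x + b)))      # predicted probability
    w -= 0.1 * np.mean((p - labels) * x)        # gradient step
    b -= 0.1 * np.mean(p - labels)

# Apply the learned rule to new, unlabeled numbers.
new = np.array([1_000, 50_000, 90_000])
xn = (new - numbers.mean()) / numbers.std()
predicted_hit = 1.0 / (1.0 + np.exp(-(w * xn + b))) > 0.5
```

A neural network would replace the single weight-and-bias pair with several layers, but the train-on-labeled, predict-on-new loop is the same.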
I am enquiring about assistance regarding regression analysis. I have a continuous time series that fluctuates between positive and negative integers. I also have events occurring throughout this time series at seemingly random time points. Essentially, when an event occurs I grab the respective integer. I then want to test whether this integer influences the event at all; that is, are there more positive or negative integers?

I originally thought of logistic regression with the positive/negative number, but that would require at least two distinct groups, whereas I only have information on events that have occurred. I can't really include the events that don't occur, as the process is somewhat continuous and random; the number of times an event doesn't occur is impossible to measure.

So my distinct group is all "true" in a sense, as I don't have any results from something that didn't occur. What I am trying to classify is:

When an outcome occurs, does the positive or negative integer influence this outcome?
It sounds like you are interested in determining the underlying forces that are producing a given stream of data. Such mathematical models are called Markov Models. A classic example is the study of text.
For example, if I run a Hidden Markov Model algorithm on a paragraph of English text, I will find that there are two driving categories determining the probabilities of what letters show up in the paragraph. Those categories can be roughly broken into the two groups "aeiouy " and "bcdfghjklmnpqrstvwxz". Neither the mathematics nor the HMM "knew" what to call those categories, but they are what the analysis statistically converges to on a paragraph of text. We might call those categories "vowels" and "consonants". So, yes, vowels and consonants are not just 1st-grade categories to learn; they follow from how text is statistically written. Interestingly, a space behaves more like a vowel than a consonant. I didn't give the probabilities for the example above, but it is interesting to note that "y" ends up with a probability of roughly 0.6 vowel and 0.4 consonant, meaning that "y" is statistically the most consonant-behaving vowel.

A great paper is https://www.cs.sjsu.edu/~stamp/RUA/HMM.pdf, which goes over the basic ideas of this kind of time-series analysis and even provides some pseudo-code for reference.

I don't know much about the data you are dealing with, and I don't know whether the concepts of "positive" and "negative" are playing a determining factor in what you see. But if you ran an HMM on your data and found the two hidden categories to be the collection of positive numbers and the collection of negative numbers, then your answer would be confirmed: yes, the two most influential categories driving your data are the concepts of positive and negative. If the categories don't split along those lines, then your answer is that those concepts are not an influential factor driving the data. Either way, the algorithm ends with several probability matrices that show you how much each integer in your data is influenced by each category, so you would have much greater insight into the behavior of your time-series data.
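To make the HMM machinery concrete, here is a small numpy sketch of the forward algorithm, which computes the likelihood of an observation sequence under a two-state model. All the probabilities below are made up for illustration; they are not fitted to any text.

```python
import numpy as np

# Toy two-state HMM (think "vowel-like" vs "consonant-like" states).
pi = np.array([0.6, 0.4])            # initial state distribution
A = np.array([[0.3, 0.7],            # state transition probabilities
              [0.8, 0.2]])
B = np.array([[0.9, 0.1],            # emission probabilities:
              [0.2, 0.8]])           # P(symbol | state), 2 symbols: 0 and 1

def forward_likelihood(obs):
    """P(observation sequence) via the forward algorithm."""
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
    return alpha.sum()

p = forward_likelihood([0, 1, 0])
```

Training an HMM (the Baum-Welch algorithm in the paper above) iterates this forward pass together with a backward pass to re-estimate A and B until the hidden categories emerge on their own.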
Admittedly, the question is quite difficult to understand after the first paragraph, but let me help based on what I could gather from it.

I assume you want to understand whether there is a relationship between the events happening and the integers in the data.

1st approach: Plot the data on a 2D scale and check visually whether there is a relationship.

2nd approach: Make the event data continuous, separate it from the rest of the data, smooth both out using a rolling window, and then compare the two trends.

The approaches above only work well if I am understanding your problem correctly.

There is also one more thing to watch for, known as survivorship bias. You might be missing data, so please check that part as well.
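The rolling-window smoothing in the 2nd approach can be sketched with a simple moving average; the series below is made up for illustration.

```python
import numpy as np

# Made-up series fluctuating between positive and negative integers.
series = np.array([3, -1, 4, -2, 5, -3, 6, -1, 2, -4], dtype=float)

# Rolling-window smoothing: simple moving average with a window of 3.
window = 3
kernel = np.ones(window) / window
smoothed = np.convolve(series, kernel, mode="valid")  # length 10 - 3 + 1 = 8
```

You would apply the same smoothing to both the event series and the remaining series before comparing the trends.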
Maybe I am misunderstanding your problem, but I don't believe you can perform any kind of meaningful regression without more information.

Regression is usually used to find a relationship between two or more variables; however, it appears that you only have one variable (whether the integers are positive or negative) and one constant (the outcome is always true in your data). Maybe you could compute some statistics on the distribution of the numbers (mean, median, standard deviation), but I am unsure how you would do regression.
https://en.wikipedia.org/wiki/Regression_analysis
You might want to consider that there might be some strong survivorship bias if you are missing a large chunk of your data.
https://en.wikipedia.org/wiki/Survivorship_bias
Hope this is at least a bit helpful in getting you steered in the right direction.
I have a neural network program that is designed to take input variables and output variables and then use forecasted data to predict what the output variables should be. After running this program, I end up with an output vector. Let's say, for example, that my input matrix has 100 rows and 10 columns and my output is a vector with 100 values. How do I determine which of my 10 variables (columns) had the most impact on my output?

I've done a correlation analysis between each of my variables (columns) and my output and ranked the variables by the strength of that correlation, but I'm wondering if there is a better way to go about this.
If what you want is model selection, it's not as simple as studying the correlation of your features with your target. For an in-depth, well-explained look at model selection, I'd recommend you read chapter 7 of The Elements of Statistical Learning. If what you're looking for is how to explain your network, then you're in for a treat as well, and I'd recommend reading this article for starters, though I won't go into the matter myself.
Naive approaches to model selection:
There are a number of ways to do this.

The naïve way is to estimate all possible models, i.e. every combination of features. With 10 features that is already 2^10 = 1024 models, and the count grows exponentially with the number of features, so this quickly becomes computationally unfeasible.

Another way is to take a variable you think is a good predictor and train the model only on that variable. Compute the error on the training data. Then pick another variable at random, retrain the model, and recompute the error on the training data. If the error drops, keep the variable; otherwise discard it. Keep going through all the features.

A third approach is the opposite. Start by training the model on all features and sequentially drop variables (a less naïve approach would be to drop the variables you intuitively think have little explanatory power), compute the error on the training data, and compare to decide whether to keep each feature.

There are a million ways of going about this. I've laid out three of the simplest, but again, you can go really deep into this subject and find all kinds of different information (which is why I highly recommend you read that chapter :) ).
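The second (forward) approach can be sketched as follows. The data is synthetic, and note one caveat: training error never increases when you add a feature to a least-squares fit, so the sketch requires a minimum relative improvement before keeping a feature (held-out data would be the more principled choice, as chapter 7 of the book explains).

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data: only features 0 and 3 actually drive the target.
X = rng.normal(size=(200, 10))
y = 2.0 * X[:, 0] - 3.0 * X[:, 3] + 0.1 * rng.normal(size=200)

def train_error(cols):
    """Mean squared error of a least-squares fit on the chosen columns."""
    A = X[:, cols]
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return np.mean((y - A @ coef) ** 2)

# Greedy forward selection, starting from a feature we believe in.
selected = [0]
best = train_error(selected)
for j in range(1, 10):
    err = train_error(selected + [j])
    if err < best * 0.99:       # require a meaningful improvement
        selected.append(j)
        best = err
```

With a neural network instead of least squares the loop is identical; only `train_error` changes.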
ipdb> np.count_nonzero(test==0) / len(ytrue) * 100
76.44815766923736
I have a data file containing 24000 prices which I use for a time series forecasting problem. Instead of trying to predict the price, I tried to predict the log-return, i.e. log(P_t/P_{t-1}). I applied the log-return to the prices as well as to all the features. The predictions are not bad, but they tend toward zero. As you can see above, ~76% of the data are zeros.
Now the idea is probably to "look up for a zero-inflated estimator : first predict whether it's gonna be a zero; if not, predict the value".
In detail, what is the right way to deal with an excessive number of zeros? How can a zero-inflated estimator help me with that? Be aware that I am not originally a probabilist.

P.S. I am trying to predict the log-return where the time unit is seconds, for a high-frequency trading study. Be aware that this is a regression problem (not a classification problem).
Update
That picture is probably the best prediction I have of the log-return, i.e. log(P_t/P_{t-1}). Although it is not bad, the remaining predictions tend toward zero. As you can see in the question above, there are too many zeros. I probably have the same problem inside the features, as I take the log-return of the features as well: if F is a particular feature, then I apply log(F_t/F_{t-1}).

Here is one day of data, log_return_with_features.pkl, with shape (23369, 30, 161). Sorry, but I cannot tell you what the features are. Since I apply log(F_t/F_{t-1}) to all the features and to the target (i.e. the price), be aware that I added 1e-8 to all the features before applying the log-return operation, to avoid division by zero.
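To make the quoted two-stage idea concrete, here is a deliberately trivial sketch with made-up data. The "classifier" and "regressor" are stand-ins (an empirical rate and a conditional mean), not recommended estimators; in practice each stage would be its own fitted model.

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up log-return target: ~76% exact zeros, occasional small moves.
y = np.where(rng.random(1000) < 0.76, 0.0, rng.normal(0.0, 1e-3, 1000))

# Stage 1: model whether the return is zero (stand-in: empirical rate).
p_nonzero = np.mean(y != 0.0)

# Stage 2: model the value given that it is non-zero (stand-in: mean
# of the non-zero subset only, so the zeros cannot drag it down).
conditional_mean = y[y != 0.0].mean()

# Combined zero-inflated prediction: E[y] = P(y != 0) * E[y | y != 0].
expected_return = p_nonzero * conditional_mean
```

The point of the decomposition is that the regression stage is trained only on non-zero targets, so it is no longer pulled toward predicting zero everywhere.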
Ok, so judging from your plot: it's the nature of the data; the price doesn't really change that often.

Try subsampling your original data a bit (perhaps by a factor of 5; just look at the data), so that you generally see a price movement with every time tick. This should make any modeling much MUCH easier.

For the subsampling, I suggest you do simple regular downsampling in the time domain. So if you have price data with one-second resolution (i.e. one price tag every second), simply take every fifth data point. Then proceed as you usually do; specifically, compute the log-increase in the price from this subsampled data. Remember that whatever you do, it must be reproducible at test time.
If that is not an option for you for whatever reasons, have a look at something that can handle multiple time scales, e.g. WaveNet or Clockwork RNN.
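A minimal numpy sketch of that recipe; the prices are made up and the factor of 5 is just the example value from above.

```python
import numpy as np

# Made-up second-resolution prices (one price tag per second).
prices = np.array([100.0, 100.0, 100.0, 100.1, 100.1,
                   100.2, 100.2, 100.2, 100.3, 100.3])

subsampled = prices[::5]                          # every fifth data point
log_returns = np.log(subsampled[1:] / subsampled[:-1])
```

The same slicing must be applied identically at test time so the model always sees returns on the coarser time scale.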
I'm working on a model that would predict an exam schedule for a given course and term. My input would be the term and the course name, and the output would be the date. I'm currently done with the data cleaning and preprocessing step; however, I can't wrap my head around a way to make a model whose input is two strings and whose output is two numbers (the day and month of the exam). One approach I thought of would be encoding my course names and writing the term as a binary list, i.e. input: encoded(course), [0,0,1]; output: day, month, and then feeding this to a regression model.

I hope someone more experienced can suggest a better approach.
Before I start answering your question:
/rant
I know this sounds dumb and doesn't really help your question, but why are you using Neural Networks for this?!
To me, this seems like the classical case of "everybody uses ML/AI in their area, so now I have to, too!" (which is completely not true) /rant over
For string-like inputs, there exist several encoding methods; choosing the right one may depend on your specific task. Since you have a very "simple" (and predictable) input (i.e., you know in advance that there will not be any new/unseen course titles during testing/inference) and you do not need contextual/semantic information, you can resort to something like scikit-learn's LabelEncoder, which will turn the course names into distinct classes.
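For illustration, here is a dict-based sketch of what such a label encoding boils down to; the course names are made up. scikit-learn's LabelEncoder does essentially this, plus the inverse mapping and error handling.

```python
# Made-up course names; a label encoder maps each distinct string to an
# integer class index.
courses = ["Calculus I", "Data Mining", "Calculus I", "Algorithms"]

classes = sorted(set(courses))                    # fixed vocabulary
to_index = {name: i for i, name in enumerate(classes)}

encoded = [to_index[c] for c in courses]          # e.g. feed this to a model
```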
Alternatively, you could throw a more heavyweight encoding structure at the problem, one that embeds the values in a matrix. Most DL frameworks offer some form of internal function for this, which basically requires you to pass a unique index for each input value and actively learns a k-dimensional embedding vector for it. Intuitively, these embeddings correspond to semantic or topical directions. If you have, for example, 3-dimensional embeddings, the first dimension could represent "social sciences course", the second "technical course", and the third "seminar".
Of course, this is just a simplification of it, but helps imagining how it works.
For the output, predicting a specific date is actually a really good question. As I have personally never predicted dates myself, I can only point to tips by other users. A nice answer on dates (as input) is given here.

If you can sacrifice a little bit of accuracy in the result, predicting the calendar week in which the exam is happening might be a good idea. Otherwise, you could simply treat it as two regressed values, but you might end up with invalid combinations (i.e. "negative days/months", or something like "31st February").

Depending on how much high-quality training data you have, results might vary quite heavily. Lastly, I would again recommend that you reconsider whether you actually need a neural network for this task, or whether there are simpler methods to do it.
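If you do go the regression route, one way to sidestep invalid combinations like "31st February" is to predict a single day-of-year number and map it back to a date afterwards. A minimal sketch (the 2023 date is just an example); this is an additional idea, not something prescribed above.

```python
from datetime import date, timedelta

# Encode an exam date as a single day-of-year target (1..365/366), so a
# regressor cannot produce impossible day/month pairs.
def to_day_of_year(d: date) -> int:
    return d.timetuple().tm_yday

def from_day_of_year(day_of_year: int, year: int) -> date:
    return date(year, 1, 1) + timedelta(days=day_of_year - 1)

target = to_day_of_year(date(2023, 6, 15))        # regression target
decoded = from_day_of_year(round(target), 2023)   # round a regressed value
```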
Create dummy variables or use a RandomForest; with the text encoded as categories, both handle this kind of input together with a numerical output.
Is it possible to pass a vector to a trained neural network so that it only chooses from a subset of the classes it was trained to recognize? For example, I have a network trained to recognize numbers and letters, but I know that the images I'm running it on next will not contain lowercase letters (such as images of serial numbers). I then pass it a vector telling it not to guess any lowercase letters. Since the classes are mutually exclusive, the network ends in a softmax function. The following are just examples of what I'd thought of trying, but none really work.
import numpy as np

def softmax(arr):
    return np.exp(arr) / np.exp(arr).sum()

# Stand-ins for the previous layer / NN output and the vector of allowed answers.
output = np.array([0.15885351, 0.94527385, 0.33977026, -0.27237907, 0.32012873,
                   0.44839673, -0.52375875, -0.99423903, -0.06391236, 0.82529586])
restrictions = np.array([1, 1, 0, 0, 1, 1, 1, 0, 1, 1])

# Ideas -----

'''First: Multiply by the restrictions before sending it through softmax.
I stupidly tried this one.'''
results = softmax(output * restrictions)

'''Second: Multiply the results of the softmax by the restrictions.'''
results = softmax(output)
results = results * restrictions

'''Third: Remove invalid entries before calculating the softmax.'''
result = output * restrictions
result[result != 0] = softmax(result[result != 0])
All of these have issues. The first one causes invalid choices to default to:
1/np.exp(arr).sum()
since inputs to the softmax can be negative, this can raise the probability given to an invalid choice and make the answer worse. (I should've looked into it before I tried it.)

The second and third both have a similar issue in that they wait until right before an answer is given to apply the restriction. For example, if the network is looking at the letter l but starts to determine that it's the number 1, this won't be corrected until the very end with these methods. So if it was on its way to giving the output 1 with 0.80 probability, but that option is then removed, it seems the remaining options get redistributed and the highest valid answer won't be as confident as 80%. The remaining options end up a lot more homogeneous.
An example of what I'm trying to say:
output
Out[75]: array([ 5.39413513, 3.81445419, 3.75369546, 1.02716988, 0.39189373])
softmax(output)
Out[76]: array([ 0.70454877, 0.14516581, 0.13660832, 0.00894051, 0.00473658])
softmax(output[1:])
Out[77]: array([ 0.49133596, 0.46237183, 0.03026052, 0.01603169])
(Arrays were ordered to make it easier.)
In the original output the softmax gives .70 for the answer [1,0,0,0,0], but if that's an invalid answer and thus removed, the redistribution now assigns each of the 4 remaining options under 50% probability, which could easily be ignored as too low to use.

I've considered passing a vector into the network earlier as another input, but I'm not sure how to do this without requiring it to learn what the vector is telling it to do, which I think would increase the time required to train.

EDIT: I was writing way too much in the comments, so I'll just post updates here. I did eventually try giving the restrictions as an input to the network. I took the one-hot-encoded answer and randomly added extra enabled classes to simulate an answer key, ensuring the correct answer was always in the key. When the key had very few enabled categories, the network relied heavily on it, and it interfered with learning features from the image. When the key had a lot of enabled categories, it seemingly ignored the key completely. This could have been a problem that needed optimization, an issue with my network architecture, or something that just needed a tweak to training, but I never got around to solving it.

I did find out that removing answers and zeroing them out were almost the same once I subtracted np.inf instead of multiplying by 0. I was aware of ensembles, but as mentioned in a comment on the first response, my network was dealing with CJK characters (the alphabet was just to make the example easier) and had 3000+ classes. The network was already overly bulky, which is why I wanted to look into this method. Using binary networks for each individual category was something I hadn't thought of, but 3000+ networks seems problematic too (if I understood what you were saying correctly), though I may look into it later.
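The equivalence noted in the edit, i.e. that masking logits with -inf matches zeroing the softmax outputs and renormalizing, can be checked directly with numpy (the logits below are made up):

```python
import numpy as np

def softmax(arr):
    e = np.exp(arr - arr.max())   # subtract the max for numerical stability
    return e / e.sum()

output = np.array([5.394, 3.814, 3.754, 1.027, 0.392])
allowed = np.array([False, True, True, True, True])

# Mask by setting disallowed logits to -inf before the softmax...
p_masked = softmax(np.where(allowed, output, -np.inf))

# ...which matches zeroing the probabilities and renormalizing,
# since exp(-inf) = 0 drops those terms from the denominator.
p_zeroed = softmax(output) * allowed
p_zeroed = p_zeroed / p_zeroed.sum()
```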
First of all, I will loosely go through the options you have listed and add some viable alternatives, with their pros and cons. It's kind of hard to structure this answer, but I hope you'll get what I'm trying to put across:
1. Multiply restricted before sending it through softmax.
Obviously this may give a higher chance to the zeroed-out entries, as you have written, and seems like a flawed approach from the start.

Alternative: replace impossible values with the smallest logit value. This is similar to softmax(output[1:]), though the network will be even more uncertain about the results. Example PyTorch implementation:
import torch
logits = torch.Tensor([5.39413513, 3.81445419, 3.75369546, 1.02716988, 0.39189373])
minimum, _ = torch.min(logits, dim=0)
logits[0] = minimum
print(torch.nn.functional.softmax(logits, dim=0))
which yields:
tensor([0.0158, 0.4836, 0.4551, 0.0298, 0.0158])
Discussion
Citing you: "In the original output the softmax gives .70 that the answer is [1,0,0,0,0] but if that's an invalid answer and thus removed the redistribution now assigns the 4 remaining options with under 50% probability which could easily be ignored as too low to use."

Yes, and you would be right in doing that. Even more so, the actual probabilities for this class are far lower, around 14% (tensor([0.7045, 0.1452, 0.1366, 0.0089, 0.0047])). By manually changing the output you are essentially destroying the properties this NN has learned (and its output distribution), rendering some part of your computations pointless. This points to another problem, stated in the bounty this time:
2. NN are known to be overconfident for classification problems
I can imagine this being solved in multiple ways:
2.1 Ensemble
Create multiple neural networks and ensemble them by summing the logits and taking the argmax at the end (or applying softmax first and then argmax). Hypothetical situation with 3 different models making different predictions:
import torch
predicted_logits_1 = torch.Tensor([5.39413513, 3.81419, 3.7546, 1.02716988, 0.39189373])
predicted_logits_2 = torch.Tensor([3.357895, 4.0165, 4.569546, 0.02716988, -0.189373])
predicted_logits_3 = torch.Tensor([2.989513, 5.814459, 3.55369546, 3.06988, -5.89473])
combined_logits = predicted_logits_1 + predicted_logits_2 + predicted_logits_3
print(combined_logits)
print(torch.nn.functional.softmax(combined_logits))
This gives us the following probabilities after the softmax:
[0.11291057 0.7576356 0.1293983 0.00005554 0.]
(notice the first class is now the most probable)
You can use bootstrap aggregating and other ensembling techniques to improve predictions. This approach makes the classification decision surface smoother and fixes mutual errors between classifiers (given that their predictions vary quite a lot). It would take many posts to describe in any greater detail (or a separate question with a specific problem would be needed); here or here are some that might get you started.

Still, I would not mix this approach with manual selection of outputs.
2.2 Transform the problem into binary
This approach might yield better inference time and maybe even better training time if you can distribute it over multiple GPUs.
Basically, each of your classes can either be present (1) or absent (0). In principle you could train N neural networks for N classes, each outputting a single unbounded number (logit). This single number tells you whether the network thinks the example belongs to its class or not.

If you are sure a certain class won't be the outcome, you simply do not run the network responsible for detecting that class.

After obtaining predictions from all the networks (or a subset of them), you choose the highest value (or the highest probability if you use a sigmoid activation, though that would be computationally wasteful).

An additional benefit is the simplicity of said networks (easier training and fine-tuning) and easy switch-like behavior when needed.
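A tiny numpy sketch of that selection step; the per-class logits are made up and stand in for the outputs of N separately trained binary networks.

```python
import numpy as np

# Made-up logits standing in for the outputs of N separately trained
# binary "is it this class?" networks.
all_class_scores = np.array([2.1, -0.3, 4.7, 0.9, -1.5])

# Only consult (in practice: only run) the networks for allowed classes.
allowed_classes = [0, 1, 3]
scores = all_class_scores[allowed_classes]

# Pick the allowed class whose network is most confident.
predicted = allowed_classes[int(np.argmax(scores))]
```

Note that class 2, although it has the highest score overall, is never even evaluated; that is the switch-like behavior described above.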
Conclusions
If I were you, I would go with the approach outlined in 2.2, as you could easily save yourself some inference time and it would allow you to "choose outputs" in a sensible manner.

If this approach is not enough, you may consider N ensembles of networks, i.e. a mix of 2.2 and 2.1, together with bootstrapping or other ensembling techniques. This should improve your accuracy as well.
First ask yourself: what is the benefit of excluding certain outputs based on external data? In your post, I don't see why exactly you want to exclude them.

Excluding them won't save computation, as one connection (or one neuron) has an effect on multiple outputs: you can't disable connections/neurons.

Is it really necessary to exclude certain classes? If your network is trained well enough, it will know whether the character is a capital or not.

So my answer: I don't think you should fiddle with any operation before the softmax; that will give you false conclusions. You have the following options:
Multiply the results of the softmax by the restrictions.
Don't multiply; if the highest class is 'a', convert it to 'A' in the output (map lowercase predictions to uppercase)
Train a network that sees no difference between capital and non-capital letters