I am trying to build an AI that composes possible winning bets, but I don't know how to approach this kind of problem. I've made simple AIs that can, for example, tell the difference between humans and animals, but this one is a lot more complex.
Which model should I use? I don't think linear regression or k-nearest neighbours will work in this situation. I'm trying to experiment with neural networks, but I don't have any experience with them.
To make things a little bit more clear.
I have a MongoDB with fixtures, leagues, countries and predictions
A fixture contains two teams, a league id, and some more values
A league contains a country and some more values
A country is just an ID with for example a flag in SVG format
A prediction is a collection of different markets* with their probability
market = a way to place a bet, for example: home wins, away wins, both teams score
I also have a collection that contains information about which leagues' predictions are the most accurate.
How would I go about creating this AI? The data is in a fairly decent form; I just don't know how to begin, for example which model to use and which inputs to feed it. Also, how would I go about saving the model and training it with new data that enters MongoDB? (I have multiple cron jobs inserting data into the MongoDB.)
Note:
The AI should compose bets containing X number of fixtures
Because there is no "right" way to do this, I'll describe the most generic approach.
The first thing you want to figure out is the target of the model:
The label/target that you want to classify is the market. For simplicity, you can use -1 for a home win, 0 for a draw, and 1 for an away win.
Data cleaning: remove outliers, complete/interpolate missing values, etc.
Feature extraction:
Convert categorical values using one-hot encoding.
Scale the values of numeric features to the range 0 to 1.
Remove non-relevant features: those with very low entropy over the whole dataset, or very high entropy within each of the labels.
Try to extract logical features from the raw data that might help the classifier distinguish between the classes.
Select features using, for example, mutual information gain.
Try a simple model such as Naive Bayes first; if you have more time, you can also try an SVM. And remember the no-free-lunch theorem, and that less is more: always prefer simple features and models.
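To make that concrete, here is a minimal sketch in scikit-learn, assuming the fixtures have already been pulled out of MongoDB into a flat table; the column names (league, home_form, away_form) and the toy values are placeholders of mine, not fields from the question.

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# target: -1 = home win, 0 = draw, 1 = away win
df = pd.DataFrame({
    "league": ["premier_league", "la_liga", "premier_league", "la_liga"],
    "home_form": [2.1, 0.7, 1.4, 1.1],   # hypothetical numeric features
    "away_form": [1.0, 1.9, 0.8, 2.2],
    "outcome": [-1, 1, -1, 1],
})

# one-hot encode the categorical column, scale the numeric ones to 0-1
preprocess = ColumnTransformer(
    [
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["league"]),
        ("num", MinMaxScaler(), ["home_form", "away_form"]),
    ],
    sparse_threshold=0.0,  # force a dense matrix for GaussianNB
)

model = Pipeline([("prep", preprocess), ("nb", GaussianNB())])
model.fit(df.drop(columns="outcome"), df["outcome"])
print(model.predict(df.drop(columns="outcome")))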
I have a database full of items and I've been tasked with classifying them (they could be books, stationery, etc.). The options are either to go through 100k records manually and figure out what they are, or to automate the task.
The codes for each type of item follow some kind of pattern, so I'm hoping to use machine learning to solve this (I do not want to use regular expressions). Though I'm quite good at Python, my ML knowledge only goes as far as random forests and logistic regression.
Is this at all possible? The data looks like this:
    Item code              type
1   4S2BDANC5L3247151      book
2   1N4AL3AP1JC236284      book
3   3R4BTTGC3L3237430      book
4   KNMAT2MT1KP546287      book
5   97806773062273208      pen
6   07356196706378892      Pen
7   97807345361169253      pen
8   01008130715194136      chair
9   01076305063010CCE44    chair
etc
I'm happy to look up and learn whatever is necessary; I just don't know where to start.
Thanks!
From what I understand, you have 100k examples. You can use RNN, LSTM, or attention-based deep learning methods, because those models can capture the patterns in the codes. Classical machine learning models can also solve this problem. In the end, your problem involves a specific type of pattern for each class, so the classes can be separated.
1) Start by finding an embedding to represent your codes. I suggest using the ASCII codes of the digits and letters. To make all vectors the same length, use padding. Then normalise them into the range 0 to 1.
2) My advice is to start with an SVM using a one-vs-all strategy for multi-class classification. After that you can try XGBoost, which is a powerful model, or you can start with more basic models. The idea is to start simple and move on to more complicated models.
3) If classical ML models are not enough for this task, move on to basic RNN models.
I don't know your data distribution among the classes, or the number of classes, but if the classes are balanced and each class has enough data, I think you can easily automate this task.
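A minimal sketch of steps 1 and 2, assuming the codes and labels sit in two plain Python lists; the padding and scaling choices here are mine, not anything from the question.

import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

codes = ["4S2BDANC5L3247151", "97806773062273208", "01008130715194136"]
labels = ["book", "pen", "chair"]

max_len = max(len(c) for c in codes)

def embed(code):
    # ASCII value of each character, zero-padded to a fixed length,
    # then scaled into the 0-1 range
    vec = [ord(ch) for ch in code] + [0] * (max_len - len(code))
    return np.array(vec, dtype=float) / 127.0

X = np.vstack([embed(c) for c in codes])

# one-vs-all (one-vs-rest) SVM as the first model to try
clf = OneVsRestClassifier(SVC(kernel="rbf"))
clf.fit(X, labels)
print(clf.predict([embed("1N4AL3AP1JC236284")]))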
I have a file called train.dat which has three fields - userID, movieID and rating.
I need to predict the rating in the test.dat file based on this.
I want to know how I can use scikit-learn's KMeans to group similar users, given that I have only one feature: the rating.
Does this even make sense to do? After the clustering step, I could do a regression step to get the ratings for each user-movie pair in test.dat
Edit: I have some extra files which contain the actors in each movie, the directors and also the genres that the movie falls into. I'm unsure how to use these to start with and I'm asking this question because I was wondering whether it's possible to get a simple model working with just rating and then enhance it with the other data. I read that this is called content based recommendation. I'm sorry, I should've written about the other data files as well.
scikit-learn is not a library for recommender systems, nor is k-means a typical tool for clustering such data. What you are trying to do deals with graphs, and such data is usually analysed either at the graph level or with various matrix factorisation techniques.
In particular, k-means only works in Euclidean spaces, and you do not have such a space here. What you can do is use DBSCAN (or any other clustering technique that accepts an arbitrary similarity, but this one is actually in scikit-learn) and define the similarity between two users by some kind of agreement in their taste, for example:
sim(user1, user2) = # movies both users like / # movies at least one of them likes
which is known as the Jaccard coefficient for similarity between binary vectors. You have ratings, not just "liking", but I am giving the simplest possible example here; you can come up with dozens of other things to try out. The point is: for the simplest approach, all you have to do is define a notion of per-user similarity and apply a clustering method that accepts such a setting (like the mentioned DBSCAN).
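A minimal sketch of that idea, assuming the ratings have already been binarised into a "user likes movie" matrix; the toy matrix below is made up for illustration.

import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.cluster import DBSCAN

# rows = users, columns = movies, 1 = the user liked the movie
likes = np.array([
    [1, 1, 0, 0, 1],
    [1, 1, 0, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 1, 1, 1],
], dtype=bool)

# Jaccard distance = 1 - the Jaccard similarity defined above
distances = squareform(pdist(likes, metric="jaccard"))

clustering = DBSCAN(eps=0.5, min_samples=2, metric="precomputed")
print(clustering.fit_predict(distances))  # cluster label per user, -1 = noise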
Clustering users makes sense. But if your only feature is the rating, I don't think it can produce a useful model for prediction. These are the assumptions behind that judgement:
The quality of a movie should follow a Gaussian distribution.
If we look at the rating distribution of a typical user, it should also be roughly Gaussian.
I don't exclude the possibility that a few users only give ratings when they see a bad movie (and thus give only low ratings), and vice versa. But across a large number of users, this should be unusual behaviour.
Thus I imagine that after clustering, you will get small groups of users in the two extreme cases, and most users in the middle (because they share the Gaussian-like rating behaviour). Using this model, you will probably get good results for the users in the two small (extreme) groups; but for the majority of users, you cannot expect good predictions.
I want to classify the tweets into predefined categories (like: sports, health, and 10 more). If I had labeled data, I would be able to do the classification by training Naive Bayes or SVM. As described in http://cucis.ece.northwestern.edu/publications/pdf/LeePal11.pdf
But I cannot figure out a way to do this with unlabeled data. One possibility could be using Expectation-Maximization to generate clusters and then label those clusters. But as said earlier, I have a predefined set of classes, so clustering won't be as good.
Can anyone guide me on what techniques I should follow? I appreciate any help.
Alright, from what I can understand, I think there are multiple ways to approach this case.
There will be trade-offs and the accuracy may vary, because of the well-known fact and observation that
Each single tweet is distinct!
(unless you are extracting data from the Twitter streaming API based on tags and other keywords). Please define the source of your data and how you are extracting it; I am assuming you're just getting general tweets, which can be about anything.
What you can do is generate a dictionary for each class you have
(e.g. Music => pop, jazz, rap, instruments, ...)
which will contain words relevant to that class. You can use NLTK for Python or Stanford NLP for other languages.
You can start with extracting
Synonyms
Hyponyms
Hypernyms
Meronyms
Holonyms
Go and see some NLP lexical semantics slides; they will surely clear up some of the concepts.
Once you have a dictionary for each class, cross-compare them with the tweets you have. Label each tweet with the class whose dictionary it matches best (you can rank the classes by the number of occurrences of words from these dictionaries in the tweet). This gives you labeled tweets, just as in the supervised case.
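A minimal sketch of the dictionary idea, assuming WordNet is available through NLTK (you may need nltk.download('wordnet') first); the seed words and the example tweet are made up for illustration.

from nltk.corpus import wordnet as wn

def build_dictionary(seed_words):
    # expand the seed words with synonyms, hyponyms and hypernyms
    words = set(seed_words)
    for seed in seed_words:
        for syn in wn.synsets(seed):
            words.update(lemma.lower() for lemma in syn.lemma_names())
            for related in syn.hyponyms() + syn.hypernyms():
                words.update(lemma.lower() for lemma in related.lemma_names())
    return words

dictionaries = {
    "music": build_dictionary(["music", "jazz", "rap", "guitar"]),
    "sports": build_dictionary(["football", "match", "goal", "team"]),
}

def label_tweet(tweet):
    tokens = set(tweet.lower().split())
    # rank the classes by how many dictionary words the tweet contains
    scores = {cls: len(tokens & words) for cls, words in dictionaries.items()}
    return max(scores, key=scores.get)

print(label_tweet("what a goal in the last minute of the match"))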
Now the question is accuracy. It depends on the data and the versatility of your classes. This may be overkill, but it may come close to what you want.
Furthermore, you can label a set of tweets this way and then use cosine similarity to cross-identify the other tweets. This will help with the optimization part. But then again it's up to you, as you know what trade-offs you can bear.
The real struggle will be the machine learning part and how you manage that.
Actually this looks like a typical use case for semi-supervised learning. There are plenty of methods to use here, including clustering with constraints (where you force the model to cluster samples from the same class together) and transductive learning (where you try to extrapolate a model from the labeled samples onto the distribution of the unlabeled ones).
You could also simply cluster the data as @Shoaib suggested, but then you will have to come up with a heuristic for how to deal with clusters with mixed labeling. Furthermore, solving an optimization problem that is not related to the task (labeling) will obviously not be as good as actually using this knowledge.
You can use clustering for that task. For that, you first have to label some examples for each class. Then, using these labeled examples, you can easily identify the class of each cluster.
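A minimal sketch of that workflow, assuming the tweets have been vectorised with TF-IDF and a handful of them hand-labelled; everything below is made up for illustration.

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

tweets = ["great goal tonight", "love this new song", "what a match", "the album is out"]
X = TfidfVectorizer().fit_transform(tweets)
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# a few hand-labelled examples: tweet index -> class
seed_labels = {0: "sports", 1: "music"}

# each cluster takes the majority label of its labelled members
votes = {}
for idx, cls in seed_labels.items():
    votes.setdefault(clusters[idx], []).append(cls)
cluster_to_class = {c: max(set(v), key=v.count) for c, v in votes.items()}

print([cluster_to_class.get(c, "unlabelled") for c in clusters])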
I am collecting a lot of really interesting data points as users come to my Python web service. For example, I have their current city, state, country, user-agent, etc. What I'd like to be able to do is run these through some type of machine learning system / algorithm (maybe a Bayesian classifier?), with the eventual goal of getting e-mail notifications when something out-of-the-ordinary occurs (anomaly detection). For example, Jane Doe has only ever logged in from USA on Chrome. So if she suddenly logs into my web service from the Ukraine on Firefox, I want to see that as a highly 'unusual' event and fire off a notification.
I am using CouchDB (specifically with Cloudant) already, and I see people often saying here and there online that Cloudant / CouchDB is perfect for this sort of thing (big data analysis). However I am at a complete loss for where to start. I have not found much in terms of documentation regarding relatively simple tracking of outlying events for a web service, let alone storing previously 'learned' data using CouchDB. I see several dedicated systems for doing this type of data crunching (PredictionIO comes to mind), but I can't help but feel that they are overkill given the nature of CouchDB in the first place.
Any insight would be much appreciated. Thanks!
You're correct in assuming that this is a problem ideally suited to machine learning, and scikit-learn (scikit-learn.org) is my preferred library for these types of problems. Don't worry about the specifics (CouchDB, Cloudant) for now; let's get your problem into a state where it can be solved.
If we can assume that variations in log-in details (time, location, user-agent, etc.) for a given user are low, then any large variation from this would trigger your alert. This is where the 'outlier' detection that @Robert McGibbon suggested comes into play.
For example, squeeze each log-in detail into one dimension, and then create a log-in detail vector for each user (there is significant room for improving this digest of log-in information):
log-in time (modulo 24 hrs)
location (maybe an array of integer locations, each integer representing a different country)
user-agent (a similar array of integer user-agents)
and so on. Every time a user logs in, create this detail array and store it. Once you have accumulated a large set of test data you can try running some ML routines.
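A minimal sketch of building such a detail vector; the integer lookup tables for countries and user-agents are hypothetical and would simply grow as new values appear.

countries = {"USA": 0, "Ukraine": 1}
user_agents = {"Chrome": 0, "Firefox": 1}

def login_vector(hour, country, user_agent):
    # hour of day (modulo 24), plus integer ids for the categorical fields
    return [
        hour % 24,
        countries.setdefault(country, len(countries)),
        user_agents.setdefault(user_agent, len(user_agents)),
    ]

print(login_vector(11.0, "USA", "Chrome"))       # e.g. [11.0, 0, 0]
print(login_vector(27.5, "Ukraine", "Firefox"))  # e.g. [3.5, 1, 1]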
So, we have a user and a set of log-in data corresponding to successful log-ins (a training set). We can now train a Support Vector Machine to recognise this user's log-in pattern:
from sklearn import svm
# training data [[11.0, 2, 2], [11.3, 2, 2] ... etc]
train_data = my_training_data()
# create and fit the model
clf = svm.OneClassSVM()
clf.fit(train_data)
and then, every time a new log-in event occurs, create a single log-in detail array and pass it to the SVM:
if clf.predict([log_in_data])[0] < 0:
    fire_alert_event()
else:
    # log-in is not dissimilar to previous attempts
    print('log in ok')
If the SVM finds the new data point to be significantly different from its training set, it will fire the alarm.
My two pence: once you've got hold of a good training set, there are many more ML techniques that may be better suited to your task (they may be faster, more accurate, etc.), but creating your training sets and then training the routines will be the most significant challenge.
There are many exciting things to try! If you know you have bad log-in attempts, you can add these to the training set and use a more complex SVM which you train with both good and bad log-ins. Instead of using an array of disparate 'location' values, you could find the Euclidean distance between different log-ins and use that! This sounds like great fun, good luck!
I also thought the approach using svm.OneClassSVM from sklearn was going to produce a good outlier detector. However, I put together some representative data based upon the example in the question and it simply could not detect an outlier. I swept the nu and gamma parameters from .01 to .99 and found no satisfactory SVM predictor.
My theory is that because the samples have categorical data (cities, states, countries, web browsers) the SVM algorithm is not the right approach. (I did, BTW, first convert the data into binary feature vectors with the DictVectorizer.fit_transform method).
I believe @sullivanmatt is on the right track when he suggests using a Bayesian classifier. Bayesian classifiers are used for supervised learning but, at least on the surface, this problem was cast as an unsupervised learning problem, i.e. we don't know a priori which observations are normal and which are outliers.
Because the outliers you want to detect are very rare in the stream of web site visits, I believe you could train the Bayesian classifier by labeling every observation in your training set as a positive/normal observation. The classifier should predict that true normal observations have higher probability simply because the majority of the observations really are normal. A true outlier should stand out as receiving a low predicted probability.
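A minimal sketch of that idea, using DictVectorizer to binarise the categorical fields and then scoring new log-ins by their (smoothed) likelihood under the "all observations are normal" model; the records below are made up.

import numpy as np
from sklearn.feature_extraction import DictVectorizer

normal_logins = [
    {"country": "USA", "browser": "Chrome"},
    {"country": "USA", "browser": "Chrome"},
    {"country": "USA", "browser": "Safari"},
]

vectorizer = DictVectorizer(sparse=False)
X = vectorizer.fit_transform(normal_logins)

# smoothed probability of each binary feature being "on" in normal traffic
p = (X.sum(axis=0) + 1.0) / (len(X) + 2.0)

def log_likelihood(login):
    x = vectorizer.transform([login])[0]
    return np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))

print(log_likelihood({"country": "USA", "browser": "Chrome"}))       # higher score
print(log_likelihood({"country": "Ukraine", "browser": "Firefox"}))  # lower score, flag it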
If you're trying to investigate anomalies in user behaviour over time, I'd recommend looking at time-series anomaly detectors. With this approach you'll be able to statistically/automatically figure out new, potentially suspicious, emerging patterns and abnormal events.
http://www.autonlab.org/tutorials/biosurv.html and http://web.engr.oregonstate.edu/~wong/workshops/icml2006/slides/agarwal.ppt
explain some techniques based on machine learning. In this case you can use scikit-learn (scikit-learn.org), a very powerful Python library that contains tons of ML algorithms.
My Question is as follows:
I know a little bit about ML in Python (using NLTK), and it works OK so far: I can get predictions given certain features. But I want to know, is there a way to display the best features to achieve a label? I mean the direct opposite of what I've been doing so far (put in all the circumstances and get a label for them).
I try to make my question clear via an example:
Let's say I have a database with Soccer games.
The Labels are e.g. 'Win', 'Loss', 'Draw'.
The Features are e.g. 'Windspeed', 'Rain or not', 'Daytime', 'Fouls committed' etc.
Now I want to know: Under which circumstances will a Team achieve a Win, Loss or Draw? Basically I want to get back something like this:
Best conditions for Win: Windspeed=0, No Rain, Afternoon, Fouls=0 etc
Best conditions for Loss: ...
Is there a way to achieve this?
My paint skills aren't the best!
All I know is the theory, so you'll have to look up the code yourself.
If you have only one case (the best case for each situation), the diagram becomes something like this (it won't be 2-D, but something like it):
Green (Win), Orange (Draw), Red (Lose)
Now if you want to predict whether the team wins, loses or draws, you have (at least) two models to classify with:
Linear regression: the separator is the perpendicular bisector of the line joining the two points.
K-nearest neighbours: simply calculate the distance from all the points, and classify the new point the same as the closest one.
So, for example, if you have new data and have to classify it, here's how:
We have a new point with certain attributes.
We classify it by seeing/calculating which side of the line the point falls on (or how far it is from our benchmark situations).
Note: you will have to give some weight to each factor for more accuracy.
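A minimal sketch of the distance idea, with one hypothetical "best case" point per outcome and a weight per factor; all numbers are made up.

import numpy as np

# features: [windspeed, rain (0/1), daytime (0=morning, 1=afternoon, 2=evening), fouls]
benchmarks = {
    "Win": np.array([0.0, 0.0, 1.0, 0.0]),
    "Draw": np.array([10.0, 0.0, 1.0, 5.0]),
    "Loss": np.array([25.0, 1.0, 2.0, 12.0]),
}
weights = np.array([1.0, 5.0, 1.0, 0.5])  # importance given to each factor

def classify(match):
    # pick the benchmark situation the new match is closest to
    distances = {
        label: np.linalg.norm(weights * (match - point))
        for label, point in benchmarks.items()
    }
    return min(distances, key=distances.get)

print(classify(np.array([3.0, 0.0, 1.0, 2.0])))  # closest to the "Win" benchmark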
You could compute how representative each feature is for separating the classes via feature weighting. The most common method for feature selection (and therefore feature weighting) in text classification is chi^2. This measure will tell you which features separate the classes best. Based on this information you can analyse which specific values are best for each case. I hope this helps.
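A minimal sketch of chi^2 feature scoring on text, with made-up match descriptions; higher scores mean the feature separates the labels better.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import chi2

docs = ["rainy windy many_fouls", "sunny calm no_fouls",
        "windy rainy many_fouls", "calm sunny no_fouls"]
labels = ["Loss", "Win", "Loss", "Win"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

scores, p_values = chi2(X, labels)
for feature, score in sorted(zip(vectorizer.get_feature_names_out(), scores),
                             key=lambda pair: -pair[1]):
    print(f"{feature}: {score:.2f}")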
Not sure if you have to do this in Python, but if not, I would suggest Weka. If you're unfamiliar with it, here's a link to a set of tutorials: https://www.youtube.com/watch?v=gd5HwYYOz2U
Basically, you'd just need to write a program to extract your features and labels and then output a .arff file. Once you've generated a .arff file, you can feed it to Weka and run a myriad of different classifiers on it to figure out which model best fits your data. If necessary, you can then program this model to operate on your data. Weka has plenty of ways to analyse your results and to display them graphically. It's truly amazing.
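A minimal sketch of writing extracted features and labels to a .arff file that Weka can load; the attributes and rows are made up.

rows = [
    (0.0, "no", "afternoon", 0, "Win"),
    (12.5, "yes", "evening", 7, "Loss"),
]

with open("matches.arff", "w") as f:
    f.write("@relation matches\n\n")
    f.write("@attribute windspeed numeric\n")
    f.write("@attribute rain {yes,no}\n")
    f.write("@attribute daytime {morning,afternoon,evening}\n")
    f.write("@attribute fouls numeric\n")
    f.write("@attribute outcome {Win,Loss,Draw}\n\n")
    f.write("@data\n")
    for windspeed, rain, daytime, fouls, outcome in rows:
        f.write(f"{windspeed},{rain},{daytime},{fouls},{outcome}\n")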