Detecting 'unusual behavior' using machine learning with CouchDB and Python?

Detecting 'unusual behavior' using machine learning with CouchDB and Python? - python

I am collecting a lot of really interesting data points as users come to my Python web service. For example, I have their current city, state, country, user-agent, etc. What I'd like to be able to do is run these through some type of machine learning system / algorithm (maybe a Bayesian classifier?), with the eventual goal of getting e-mail notifications when something out-of-the-ordinary occurs (anomaly detection). For example, Jane Doe has only ever logged in from USA on Chrome. So if she suddenly logs into my web service from the Ukraine on Firefox, I want to see that as a highly 'unusual' event and fire off a notification.
I am using CouchDB (specifically with Cloudant) already, and I see people often saying here and there online that Cloudant / CouchDB is perfect for this sort of thing (big data analysis). However I am at a complete loss for where to start. I have not found much in terms of documentation regarding relatively simple tracking of outlying events for a web service, let alone storing previously 'learned' data using CouchDB. I see several dedicated systems for doing this type of data crunching (PredictionIO comes to mind), but I can't help but feel that they are overkill given the nature of CouchDB in the first place.
Any insight would be much appreciated. Thanks!

You're correct in assuming that this is a problem ideally suited to Machine Learning, and scikit-learn.org is my preferred library for these types of problems. Don't worry about specifics - (couchdb cloudant) for now, lets get your problem into a state where it can be solved.
If we can assume that variations in log-in details (time, location, user-agent etc.) for a given user are low, then any large variation from this would trigger your alert. This is where the 'outlier' detection that #Robert McGibbon suggested comes into play.
For example, squeeze each log-in detail into one dimension, and the create a log-in detail vector for each user (there is significant room for improving this digest of log-in information);
log-in time (modulo 24 hrs)
location (maybe an array of integer locations, each integer representing a different country)
user-agent (a similar array of integer user-agents)
and so on. Every time a user logs in, create this detail array and store it. Once you have accumulated a large set of test data you can try running some ML routines.
So, we have a user and a set of log-in data corresponding to successful log-ins (a training set). We can now train a Support Vector Machine to recognise this users log-in pattern:
from sklearn import svm
# training data [[11.0, 2, 2], [11.3, 2, 2] ... etc]
train_data = my_training_data()
# create and fit the model
clf = svm.OneClassSVM()
clf.fit(train_data)
and then, every time a new log-in even occurs, create a single log-in detail array and pass that past the SVM
if clf.predict(log_in_data) < 0:
fire_alert_event()
else:
# log-in is not dissimilar to previous attempts
print('log in ok')
if the SVM finds the new data point to be significantly different from it's training set then it will fire the alarm.
My Two Pence. Once you've got hold of a good training set, there are many more ML techniques that may be better suited to your task (they may be faster, more accurate etc) but creating your training sets and then training the routines would be the most significant challenge.
There are many exciting things to try! If you know you have bad log-in attempts, you can add these to the training sets by using a more complex SVM which you train with good and bad log-ins. Instead of using an array of disparate 'location' values, you could find the Euclidean different log-ins and use that! This sounds like great fun, good luck!

I also thought the approach using svm.OneClassSVM from sklearn was going to produce a good outlier detector. However, I put together some representative data based upon the example in the question and it simply could not detect an outlier. I swept the nu and gamma parameters from .01 to .99 and found no satisfactory SVM predictor.
My theory is that because the samples have categorical data (cities, states, countries, web browsers) the SVM algorithm is not the right approach. (I did, BTW, first convert the data into binary feature vectors with the DictVectorizer.fit_transform method).
I believe #sullivanmatt is on the right track when he suggests using a Bayesian classifier. Bayesian classifiers are used for supervised learning but, at least on the surface, this problem was cast as an unsupervised learning problem, ie we don't know a priori which observations are normal and which are outliers.
Because the outliers you want to detect are very rare in the stream of web site visits, I believe you could train the Bayesian classifier by labeling every observation in your training set as a positive/normal observation. The classifier should predict that true normal observations have higher probability simply because the majority of the observations really are normal. A true outlier should stand out as receiving a low predicted probability.

If you're trying to investigate on anomalies of user behaviours during the time, I'd recommend you to look at time-series anomaly detectors. With this approach you'll be able to statistically/automatically figure out new, potentially suspicious, emerging patters and abnormal events.
http://www.autonlab.org/tutorials/biosurv.html and http://web.engr.oregonstate.edu/~wong/workshops/icml2006/slides/agarwal.ppt
explain some techniques based on machine learning. In this case you can use scikit-learn.org, a very powerful Python library that contains tons of ML algos.

Related

Recommendation for ML algorithm to differentiate between ban or allowance

I am new to Machine Learning and wanted to see if any of you could recommend an algorithm I could apply onto a project I'm doing. Basically I want to scrape popular housing websites and look at their descriptions to see if they allow/disallow something, for example pets. The problem is a simple search for pets leads contradictory results: 'pets allowed' and 'no additional cost for pets' or 'no pets' or 'I don't accept at this time'. As seen from these examples, often negative keywords 'no' are used to indicate pets are allowed, whereas positive keywords like 'accept' are used to indicate a ban. As such, I was wondering if there was any algorithm I could use (preferably in python) differentiate between the two. (Note: I can't run training data to generate an algorithm myself as the thing I am actually looking for is quite niche).
Thank you very much for your help!!

The keyword you're looking for is "document classification". This is a document classification problem. You start with documents (i.e. webpages) and you want to classify them as "allows pets" or "doesn't allow pets" (or whatever). There are a lot of good tutorials out there for performing document classification but a full explanation is beyond the scope of a StackOverflow answer.
You won't be able to do this for your particular niche case without providing at least some training data but you could gather, say 30 example websites, extract their text, manually add labels ("does fit my niche" vs "doesn't fit my niche"), and then run through a standard document classification system and see if that gets you the accuracy you want. Also in order for this to work with a small amount of training data (like your 30 documents), you'll need to start from a pretrained model.
Good luck!

Linear Regression Model that improves as the user selects and trains data

I'm developing a script that detects peaks on a signal data from a biological source. I want to create a semi-automated model that helps predict which peaks are the correct ones. This script improves as the user manually selects a few of these peaks to help teach the model which ones are correct.
The workflow I'm trying to attain is this:
1. User manually selects data
2. Script obtains the correct data and fits it into the model
3. Use the model to predict the likelihood of a given peak to be correct.
4. Hopefully with enough data and training, it could be automated to run through the rest.
I also don't know the name of the general topic and I'm struggling to find what to google.
I've tried to fit it on linear regression model in scikit learn but I don't have enough datasets (as it learns from the user's first intervention). Is what I'm doing possible?

Sorry for the general-ness of this answer but the OP asked for general topics.
It sounds like semi-supervised learning and here for scikit-learn and here for more details may work.
There is no labeled data to start. A manual process is started to gain some labeled data. Soon, semi-supervised can kick in and take over - with a process measuring its accuracy. A match to your situation and a good place to start.
Eventually you may have "enough" correctly labeled data that you can investigate fitting a classic algorithm to predict the remainder. "Enough" being relative to how hard the problem is. Could be tens, hundreds, thousands, ...
Depending on other details of your situation, Reinforcement learning may work. As you described the situation, this may not work but there may be other details in your environment to leverage this family.
Word of warning - machine learning and semi-supervised in particular may not always work great to every problem. Measure, measure, measure.

Thank you everyone for all your help. I was talking to a colleague and he referred me to Online Machine Learning. I think this was the one I was looking for. Although I would not be handling time-series data nor streaming data from online, the method i think is sufficient for my needs. This method allows that data is trained one by one and not as a batch. I think SciKit Learn currently does not have the ability of out-of-the-box online machine learning.
This i think gives a great rundown on the strengths of online machine learning (also showcasing of the creme python library).
Thanks again!

How does real world machine learning production systems run?

Dear Machine Learning/AI Community,
I am just a budding and aspiring Machine Learner who has worked on open online data sets and some POC's built locally for my project. I have built some models and converted into pickle objects in order to avoid re-training.
And this question always puzzles me. How does a real production system work for ML algorithms?
Say, I have trained my ML algorithm with some millions of data and I want to move it to production system or host it on a server. In real world, do they convert into pickle objects? If so, it would be huge pickled file, isn't. The ones I trained locally and converted for 50000 rows data itself took 300 Mb space on disk for that pickled object. I don't think so this is right approach.
So how does it work in order to avoid my ML algorithm to re-train and start predicting on incoming data? And how do we actually make ML algorithm as a continuous online learner. For example, I built a image classifier, and start predicting the incoming images. But I want to again train algorithm by adding the incoming online images to my previously trained data sets. May be not for every data, but daily once I want to combine all received data for that day and re-train with newly 100 images which my previously trained classifier predicted with actual value. And this approach shouldn't effect my previously trained algorithm to stop predicting incoming data as this re-training may take time based on computational resources and data.
I have Googled and read many articles, but couldn't find or understand to my above question. And this is puzzling me every day. Do manual intervention is needed for production systems as well? or any automated approach is there for it?
Any leads or answers to above questions would be highly helpful and appreciated. Please let me know if my questions doesn't make sense or not understandable.
This is not a project centric I am looking for. Just a generic case of real world production ML systems example.
Thank you in advance!

Note that this is is very broadly formulated, and your question should be put on hold probably, but I try to give a brief summary of what you are trying to ask:
"How does a real production system work?"
Well, it always depends on the scale of your product, and in what way you are using ML/AI in your system. For the most parts, you would deploy a model on your server or app. Note that deployment does NOT lineraly scale with the amount of training data you have. Rather, the size of your network is solely determined by the number of activations in your network. Note that, after training, you might not even need as much storage space, since for example CNNs have a very limited number of connections, while retaining a much larger number during training. I can highly recommend Roger Grosse's slides on the size of a network. This also directly relates to the second point.
"How to avoid re-training?"
From what I am aware of, most systems will not be retrained on a regular basis, at least for the smaller scale. This means that a network will mostly run in inference mode only, which has the aforementioned benefit of what I mentioned about the size of the network (and the time it takes to compute a result). Then again, this also highly depends on the specific task for which you are deploying a ML model. Image classification on "standard categories" have the benefit of already delivering quite substantial models (AlexNet, Inception, ResNet,...), whereas a model for machine translation mostly depends on your specific domain and vocabulary.
"How would I go about re-training?"
This is actually the tricky part, which has a significant field called "bandit learning" behind it. The problem is that most of your incoming "new" data will be unlabeled, i.e. cannot be used for the direct integration into a new training phase. Instead, you rely on feedback from users to give you a sense of what was wrong or right. Then again, not every user has the same ratings for the same machine translation (or same recommendations on Amazon etc.), for example, so judging whether your system is "right" or "wrong" becomes very hard.
There are obviously quite a few methods to automate labeling (i.e. nearest neighbor for images, or other similarity-based searches). Online learning therefore only also works if you have this continuous loop of feedback/retraining.
For larger scale systems, it also becomes important to scale your models, to perform the desired amount of predictions/classifications per second. This is also mentioned in the link to the TensorFlow deployment page I provided, and mainly builds on top of cloud/distributed architectures, such as Hadoop or (more recently) Kubernetes. Then again, for smaller products this is mostly overkill, but serves the purpose of delivering enough resources at any arbitrary scale (and possibly on demand).
As for the integration cycle of machine learning models, there is a nice overview in this article. I want to conclue by stressing that this is a heavily opinionated question, so every answer might be different!

Using logistic regression for a multiple touch response model (python/pandas)?

I have a bunch of contact data listing what members were contacted by what offer, which summarizes something like this:
To make sense of it (and to make it more scalable) I was considering creating dummy variables for each offer and then using a logistic model to see how different offers impact performance:
Before I embark too far on this journey I wanted to get some input if this is a sensible way to approach this (I have started playing around but and got a model output, but haven't dug into it yet). Someone suggested I use linear regression instead, but I'm not really sure about the approach for that in this case.
What I'm hoping to get are coefficients that are interpretable - so I can see that Mailing the 50% off offer in the 3d mailing is not as impactful as the $25 giftcard etc, and then do this at scale (lots of mailings with lots of different offers) to draw some conclusions about the impact of timing of different offers.
My concern is that I will end up with a fairly sparse matrix where only some combinations of the many possible are respresented, and what problems may arise from this. I've taken some online courses in ML but am new to it, and this is one of my first chances to work directly with it so I'm hoping I could create something useful out of this. I have access to lots and lots of data, it's just a matter of getting something basic out that can show some value. Maybe there's already some work on this or even some kind of library I can use?
Thanks for any help!

If your target variable is binary (1 or 0) as in the second chart, then a classification model is appropriate. Logistic Regression is a good first option, you could also a tree-based model like a decision tree classifier or a random forest.
Creating dummy variables is a good move; you could also convert the discounts to numerical values if you want to keep them in a single column, however this may not work so well for a linear model like logistic regression as the correlation will probably not be linear.
If you wanted to model the first chart directly you could use a linear regressions for predicting the conversion rate, I'm not sure about the difference is in doing this, it's actually something I've been wondering about for a while, you've motivated me to post a question on stats.stackexchange.com

How to forecast in python using machine learning , from a given set of geographical data?

I was analyzing some geographical data and attempting to predict/forecast next occurrence of event with respect to time and it geographical position. The data was in following order (with sample data)
Timestamp Latitude Longitude Event
13307266 102.86400972 70.64039541 "Event A"
13311695 102.8082912 70.47394645 "Event A"
13314940 102.82240522 70.6308513 "Event A"
13318949 102.83402128 70.64103035 "Event A"
13334397 102.84726242 70.66790352 "Event A"
First step was classifying it into 100 zones, so that reduces dimensions and complexity.
Timestamp Zone
13307266 47
13311695 65
13314940 51
13318949 46
13334397 26
Next step was to do time series analysis then I got stuck here for 2 months, read around a lot of literature and figured these were my options
* ARIMA (auto-regression method)
* Machine Learning
I wanted to utilize Machine learning to forecast using python but couldn't really figure out how.Specifically are there any python libraries/open-source-code specific for use case, which I can build upon.
EDIT 1:
To clarify, data is loosely dependent on past data but over a period of time is uniformly distributed.
The best way to visualize the data would be, to imagine N number of agents controlled by a algorithm which allots them task of picking resource from grids. Resources are function of socioeconomic structure of society and also strongly dependent on geography. Its in interest of " algorithm " to be able to predict demand zone and time wise.
p.s:
For Auto-regressive models like ARIMA Python already has a library http://pypi.python.org/pypi/statsmodels .

Without example data or existing code I can't offer you anything concrete.
However, often it's helpful to re-phrase your problem in the nomenclature of the field you want to explore. In ML terms:
Your problem's features: How your inputs are specified. Timestamp is continuous, geographic zone is discrete.
Your problems's target label: an event, precisely whether or not a given event has occurred.
Your problem is supervised: target labels for previous data are available. You have previous instances of (timestamp, geographic zone) to event mappings.
The target label is discrete, so this is a classification problem (as opposed to a regression problem, where the output is continuous).
So I'd say you have a supervised classification problem. As an aside you may want to do some sort of time regularisation first; I'm guessing there are going to be patterns of the events depending on what time of the day, day of the month, or month of the year it is, and you may want to represent this as an additional feature.
Taking a look at one of the popular Python ML libraries available, scikit-learn, here:
http://scikit-learn.org/stable/supervised_learning.html
and consulting a recent posting on a cheatsheet for scikit-learn by one of the contributors:
http://peekaboo-vision.blogspot.de/2013/01/machine-learning-cheat-sheet-for-scikit.html
Your first good bet would be to try Support Vector Machines (SVM), and if that fails maybe give k Nearest Neighbours (kNN) a shot as well. Note that using an ensemble classifier is usually superior than using just one instance of a given SVM/kNN.
How, exactly, to apply SVM/kNN with time as a feature may require more research, since AFAIK (and others will probably correct me) SVM/kNN require bounded inputs with a mean of zero (or normalised to have a mean of zero). Just doing some random Googling you may be able to find certain SVM kernels, for example a Fourier kernel, that can transform a time-series feature for you:
SVM Kernels for Time Series Analysis
http://www.stefan-rueping.de/publications/rueping-2001-a.pdf
scikit-learn handily allows you to specify a custom kernel for an SVM. See:
http://scikit-learn.org/stable/auto_examples/svm/plot_custom_kernel.html#example-svm-plot-custom-kernel-py
With your knowledge of ML nomenclature, and example data in hand, you may want to consider posting the question to Cross Validated, the statistics Stack Exchange.
EDIT 1: Thinking about this problem more you need to really understand if your features and corresponding labels are independent and identically distributed (IID) or not. For example what if you were modelling how forest fires spread over time. It's clear that the likelihood of a given zone catches fire is contingent on its neighbours being on fire or not. AFAIK SVM and kNN assume the data is IID. At this point I'm starting to get out of my depth, but I think you should at least give several ML methods a shot and see what happens! Remember to cross-validate! (scikit-learn does this for you).

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.