Data testing framework for data streaming (deequ vs Great Expectations)

I want to introduce data quality testing (empty fields, min/max values, regex checks, etc.) into my pipeline, which will essentially consume Kafka topics and test the data before it is written to the DB.
I am having a hard time choosing between the Deequ and Great Expectations frameworks. Deequ lacks clear documentation but has "anomaly detection", which can compare previous scans to current ones. Great Expectations has very nice and clear documentation and thus less overhead. I think neither of these frameworks is made for data streaming specifically.
Can anyone offer some advice/other framework suggestions?

As Philipp observed, in most cases batches of some sort are a good way to apply tests to streaming data (even Spark Streaming is effectively using a "mini-batch" system).
That said: if you need to use a streaming algorithm to compute a metric required for your validation (e.g. to maintain running counts over observed data), it is possible to decompose your target metric into a "state" and "update" portion, which can be properties of the "last" and "current" batches (even if those are only one record each). Improved support for that kind of cross-batch metric is actually the area we're most actively working on in Great Expectations now!
In that way, I think of the concept of the Batch as both baked deeply into the core concepts of what gets validated, but also sufficiently flexible to work in a streaming system.
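The "state" and "update" decomposition described above can be sketched in plain Python. This is only an illustration under assumptions: `RunningStats`, `update` and `mean_within` are hypothetical names, not part of Great Expectations, and the numbers are made up.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class RunningStats:
    """State carried over from the "last" batch to the "current" one."""
    count: int = 0
    total: float = 0.0
    minimum: float = float("inf")
    maximum: float = float("-inf")

    @property
    def mean(self) -> float:
        return self.total / self.count if self.count else float("nan")


def update(state: RunningStats, batch: Sequence[float]) -> RunningStats:
    """Fold the current batch into the previous state and return the new state."""
    if not batch:
        return state
    return RunningStats(
        count=state.count + len(batch),
        total=state.total + sum(batch),
        minimum=min(state.minimum, min(batch)),
        maximum=max(state.maximum, max(batch)),
    )


def mean_within(state: RunningStats, low: float, high: float) -> bool:
    """Check evaluated on the running metric rather than on a single batch."""
    return low <= state.mean <= high


state = RunningStats()
for batch in ([10.0, 12.0], [11.0], [50.0]):  # a batch may hold a single record
    state = update(state, batch)
    print(state.count, state.mean, mean_within(state, 5.0, 20.0))
```

The last record pushes the running mean outside the allowed range, so the check fails only once the accumulated state says so, without rescanning earlier records.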
Disclaimer: I am one of the authors of Great Expectations. (Stack Overflow alerts! :))

You can mini-batch your data and apply data quality verification to each of these batches individually. Moreover, deequ allows for stateful computation of data quality metrics where, like James already pointed out, metrics are computed on two partitions of data and are then merged. You can find deequ examples of this here.
Is there a specific example that was not covered in deequ's documentation? You can find a basic example of running deequ against a Spark Dataframe here. Also, there are more examples in the same folder, for example for anomaly detection use-cases.
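As a rough illustration of per-batch verification with mergeable state, here is a minimal sketch in plain Python. It is not deequ's API: `CheckState`, `scan`, `merge` and `completeness` are hypothetical names, and the e-mail regex and records are made up.

```python
import re
from typing import Dict, List, NamedTuple, Optional

Record = Dict[str, Optional[str]]
EMAIL = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")  # deliberately simple pattern


class CheckState(NamedTuple):
    """Mergeable counts for one partition (hypothetical, not deequ's API)."""
    rows: int
    non_empty: int
    pattern_ok: int


def scan(batch: List[Record], field: str) -> CheckState:
    """Compute the counts for a single mini-batch."""
    values = [record.get(field) for record in batch]
    present = [v for v in values if v]  # drops None and ""
    matching = [v for v in present if EMAIL.match(v)]
    return CheckState(len(values), len(present), len(matching))


def merge(a: CheckState, b: CheckState) -> CheckState:
    """Counts add up, so two partition states merge without a rescan."""
    return CheckState(a.rows + b.rows, a.non_empty + b.non_empty, a.pattern_ok + b.pattern_ok)


def completeness(state: CheckState) -> float:
    return state.non_empty / state.rows if state.rows else 1.0


batch_1 = [{"email": "a@example.com"}, {"email": ""}]
batch_2 = [{"email": "b@example.com"}, {"email": "not-an-email"}, {"email": None}]

merged = merge(scan(batch_1, "email"), scan(batch_2, "email"))
print(merged, completeness(merged))
```

Because the state is just a set of counts, merging two partitions gives the same result as scanning their union, which is what makes this style of metric usable on mini-batches.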
Disclaimer: I am one of the authors of deequ.

Linear Regression Model that improves as the user selects and trains data

I'm developing a script that detects peaks in signal data from a biological source. I want to create a semi-automated model that helps predict which peaks are the correct ones. The script should improve as the user manually selects a few of these peaks to help teach the model which ones are correct.
The workflow I'm trying to attain is this:
1. User manually selects data
2. Script obtains the correct data and fits it into the model
3. Use the model to predict the likelihood of a given peak to be correct.
4. Hopefully with enough data and training, it could be automated to run through the rest.
I also don't know the name of the general topic and I'm struggling to find what to google.
I've tried fitting a linear regression model in scikit-learn, but I don't have enough data (as the model learns from the user's first intervention). Is what I'm doing possible?

Sorry for the general-ness of this answer but the OP asked for general topics.
It sounds like semi-supervised learning may work (see here for scikit-learn and here for more details).
There is no labeled data to start. A manual process is started to gain some labeled data. Soon, semi-supervised can kick in and take over - with a process measuring its accuracy. A match to your situation and a good place to start.
Eventually you may have "enough" correctly labeled data that you can investigate fitting a classic algorithm to predict the remainder. "Enough" being relative to how hard the problem is. Could be tens, hundreds, thousands, ...
Depending on other details of your situation, reinforcement learning may work. As you described the situation, it may not, but there may be other details in your environment that let you leverage this family of methods.
Word of warning: machine learning, and semi-supervised learning in particular, may not work well for every problem. Measure, measure, measure.

Thank you everyone for all your help. I was talking to a colleague and he referred me to Online Machine Learning. I think this was the one I was looking for. Although I would not be handling time-series data or data streamed from online sources, the method, I think, is sufficient for my needs. It allows the model to be trained on one sample at a time rather than on a batch. I think scikit-learn currently does not have out-of-the-box online machine learning.
This, I think, gives a great rundown of the strengths of online machine learning (it also showcases the creme Python library).
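To make "trained one by one" concrete, here is a minimal sketch of an online logistic regression written in plain Python. It is not the creme API; the class name, the two peak features (height and width, scaled to 0..1) and the labels are all assumptions for illustration.

```python
import math
from typing import List, Sequence


class OnlineLogisticRegression:
    """Logistic regression updated one labelled sample at a time (plain SGD)."""

    def __init__(self, n_features: int, learning_rate: float = 0.5) -> None:
        self.weights: List[float] = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict_proba(self, x: Sequence[float]) -> float:
        """Estimated probability that the peak described by x is a correct one."""
        z = self.bias + sum(w * xi for w, xi in zip(self.weights, x))
        return 1.0 / (1.0 + math.exp(-z))

    def learn_one(self, x: Sequence[float], y: int) -> None:
        """Single gradient step on the log loss for one (features, label) pair."""
        error = self.predict_proba(x) - y
        self.weights = [w - self.learning_rate * error * xi for w, xi in zip(self.weights, x)]
        self.bias -= self.learning_rate * error


# Hypothetical user selections: (height, width) -> 1 if the peak is correct.
selections = [([0.9, 0.2], 1), ([0.8, 0.3], 1), ([0.2, 0.7], 0), ([0.1, 0.9], 0)]

model = OnlineLogisticRegression(n_features=2)
for x, y in selections * 50:  # simulated stream of manual selections
    model.learn_one(x, y)

p_tall_narrow = model.predict_proba([0.85, 0.25])
p_short_wide = model.predict_proba([0.15, 0.80])
print(p_tall_narrow, p_short_wide)
```

Each call to `learn_one` uses only the newest labelled peak, so the model can be queried for a likelihood at any point and keeps improving as more selections arrive.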
Thanks again!

The algorithm behind tsfresh select_features method

I recently started to use the tsfresh library to extract features from time-series data.
It's very cool that I can get a bag of features in a few lines of code, but I have doubts about the logic behind the select_features method. I looked into the official documentation and googled it, but I couldn't find which algorithm is used. I want to know how it works, so that I can decide what to do in the feature selection phase after data processing in tsfresh.

According to that page in their documentation, what they do is:
1. They extract a whole set of features.
2. They individually test the different features for significance (in a supervised setting, so the test is something like "is this feature useful to predict that output?") and keep the most significant ones using a procedure called the Benjamini-Yekutieli procedure.
The references they provide should be of interest:
[1] Christ, M., Kempa-Liehr, A.W. and Feindt, M. (2016). Distributed and parallel time series feature extraction for industrial big data applications. ArXiv e-prints: 1610.07717 URL: http://adsabs.harvard.edu/abs/2016arXiv161007717C
[2] Benjamini, Y. and Yekutieli, D. (2001). The control of the false discovery rate in multiple testing under dependency. Annals of statistics, 1165–1188
where [1] is the paper describing tsfresh and [2] is the reference for the multiple testing procedure (called Benjamini-Yekutieli procedure above).
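For intuition, the Benjamini-Yekutieli step can be sketched as follows. This is a standalone illustration of the procedure from [2], not tsfresh's code: the function name, the `fdr_level` argument and the p-values are made up, and in practice the p-values would come from the per-feature significance tests.

```python
from typing import Dict, List


def benjamini_yekutieli(p_values: Dict[str, float], fdr_level: float = 0.05) -> List[str]:
    """Return the feature names kept by the Benjamini-Yekutieli procedure."""
    m = len(p_values)
    if m == 0:
        return []
    # Correction factor that makes the procedure valid under dependency.
    c_m = sum(1.0 / i for i in range(1, m + 1))
    ranked = sorted(p_values.items(), key=lambda item: item[1])
    # Find the largest rank k with p_(k) <= k / (m * c(m)) * fdr_level.
    cutoff = 0
    for k, (_, p) in enumerate(ranked, start=1):
        if p <= k / (m * c_m) * fdr_level:
            cutoff = k
    # Keep every feature up to and including that rank.
    return [name for name, _ in ranked[:cutoff]]


# Made-up p-values, one per extracted feature.
p_values = {"mean": 0.001, "variance": 0.008, "skewness": 0.039, "kurtosis": 0.041, "length": 0.6}
selected = benjamini_yekutieli(p_values, fdr_level=0.05)
print(selected)  # ['mean', 'variance']
```

Note that "skewness" and "kurtosis" have p-values below 0.05 but are still dropped: the threshold grows with the rank and is scaled down by the number of tests, which is how the procedure controls the false discovery rate.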

How do real-world machine learning production systems run?

Dear Machine Learning/AI Community,
I am a budding and aspiring machine learning practitioner who has worked on open online data sets and built some POCs locally for my project. I have built some models and converted them into pickle objects in order to avoid re-training.
One question always puzzles me: how does a real production system work for ML algorithms?
Say I have trained my ML algorithm on millions of rows and I want to move it to a production system or host it on a server. In the real world, are models converted into pickle objects? If so, the pickled file would be huge, wouldn't it? The models I trained locally on just 50,000 rows already took 300 MB on disk when pickled. I don't think this is the right approach.
So how does it work, so that my ML algorithm does not have to re-train and can start predicting on incoming data? And how do we actually make an ML algorithm a continuous online learner? For example, I build an image classifier and start predicting on incoming images. But I want to train the algorithm again by adding the incoming images to my previously trained data sets. Maybe not for every sample, but once a day I want to combine all the data received that day and re-train with, say, the 100 new images that my previously trained classifier predicted, together with their actual values. And this approach should not stop my previously trained algorithm from predicting on incoming data, as the re-training may take time depending on computational resources and data.
I have Googled and read many articles, but couldn't find or understand an answer to the above questions, and this puzzles me every day. Is manual intervention needed for production systems as well, or is there an automated approach?
Any leads or answers to the above questions would be highly helpful and appreciated. Please let me know if my questions don't make sense or are not understandable.
I am not looking for something project-centric, just a generic example of real-world production ML systems.
Thank you in advance!

Note that this is very broadly formulated, and your question should probably be put on hold, but I will try to give a brief summary of what you are trying to ask:
"How does a real production system work?"
Well, it always depends on the scale of your product and on how you are using ML/AI in your system. For the most part, you would deploy a model on your server or app. Note that deployment does NOT linearly scale with the amount of training data you have. Rather, the size of your network is determined by its architecture, that is, the number of parameters and activations in your network. Note that, after training, you might not even need as much storage space, since for example CNNs have a very limited number of connections, while retaining a much larger number during training. I can highly recommend Roger Grosse's slides on the size of a network. This also directly relates to the second point.
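A back-of-the-envelope calculation shows why the stored size follows the network rather than the training set. This is a minimal sketch assuming a fully connected network and 32-bit floats; the layer sizes and function names are made up for illustration.

```python
from typing import Sequence


def dense_parameter_count(layer_sizes: Sequence[int]) -> int:
    """Weights plus biases of a fully connected network."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


def model_size_bytes(layer_sizes: Sequence[int], bytes_per_parameter: int = 4) -> int:
    """Storage for the parameters alone, assuming 32-bit floats by default."""
    return dense_parameter_count(layer_sizes) * bytes_per_parameter


layers = [784, 128, 10]  # hypothetical small image classifier
print(dense_parameter_count(layers))  # 101770
print(model_size_bytes(layers))       # 407080 bytes, roughly 0.4 MB
```

The number of training rows does not appear anywhere in this calculation. Models that keep the training data itself, such as k-nearest neighbours, are a different story and do grow with the data, which may explain a large pickle from a modest data set.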
"How to avoid re-training?"
From what I am aware of, most systems will not be retrained on a regular basis, at least at the smaller scale. This means that a network will mostly run in inference mode only, which brings the aforementioned benefits for the size of the network (and the time it takes to compute a result). Then again, this also highly depends on the specific task for which you are deploying an ML model. Image classification on "standard categories" has the benefit that quite substantial models are already available (AlexNet, Inception, ResNet, ...), whereas a model for machine translation mostly depends on your specific domain and vocabulary.
"How would I go about re-training?"
This is actually the tricky part, which has a significant field called "bandit learning" behind it. The problem is that most of your incoming "new" data will be unlabeled, i.e. cannot be used for the direct integration into a new training phase. Instead, you rely on feedback from users to give you a sense of what was wrong or right. Then again, not every user has the same ratings for the same machine translation (or same recommendations on Amazon etc.), for example, so judging whether your system is "right" or "wrong" becomes very hard.
There are obviously quite a few methods to automate labeling (e.g. nearest neighbor for images, or other similarity-based searches). Online learning therefore only works if you have this continuous loop of feedback/retraining.
For larger scale systems, it also becomes important to scale your models, to perform the desired amount of predictions/classifications per second. This is also mentioned in the link to the TensorFlow deployment page I provided, and mainly builds on top of cloud/distributed architectures, such as Hadoop or (more recently) Kubernetes. Then again, for smaller products this is mostly overkill, but serves the purpose of delivering enough resources at any arbitrary scale (and possibly on demand).
As for the integration cycle of machine learning models, there is a nice overview in this article. I want to conclude by stressing that this is a heavily opinionated question, so every answer might be different!
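One common way to keep predicting while a daily re-training job runs is to train the replacement on the side and swap it in only once it is ready. The sketch below illustrates that idea under assumptions: `ModelServer`, `retrain_daily` and the toy threshold "model" are hypothetical stand-ins, not any framework's API.

```python
import threading
from typing import Callable, List, Sequence, Tuple

Model = Callable[[float], int]
Example = Tuple[float, int]


def train_threshold_model(examples: Sequence[Example]) -> Model:
    """Toy stand-in for real training: threshold halfway between class means."""
    positives = [x for x, y in examples if y == 1]
    negatives = [x for x, y in examples if y == 0]
    threshold = (sum(positives) / len(positives) + sum(negatives) / len(negatives)) / 2
    return lambda x: int(x >= threshold)


class ModelServer:
    """Serves the current model while a replacement is trained elsewhere."""

    def __init__(self, model: Model) -> None:
        self._model = model
        self._lock = threading.Lock()

    def predict(self, x: float) -> int:
        with self._lock:  # guards only the reference, never the training
            model = self._model
        return model(x)

    def swap(self, new_model: Model) -> None:
        with self._lock:
            self._model = new_model


def retrain_daily(server: ModelServer, history: List[Example], new_examples: Sequence[Example]) -> None:
    """Train on old plus newly labelled data, then swap in the result."""
    history.extend(new_examples)
    server.swap(train_threshold_model(history))


history = [(1.0, 0), (3.0, 1)]
server = ModelServer(train_threshold_model(history))
before = server.predict(2.5)  # served by the old model (threshold 2.0)

worker = threading.Thread(target=retrain_daily, args=(server, history, [(2.0, 0), (5.0, 1)]))
worker.start()
worker.join()  # a real service would keep answering requests meanwhile

after = server.predict(2.5)  # served by the new model (threshold 2.75)
print(before, after)
```

The old model stays in service for the whole training run, and requests only ever see a fully trained model, because the swap is a single reference assignment.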

Detecting 'unusual behavior' using machine learning with CouchDB and Python?

I am collecting a lot of really interesting data points as users come to my Python web service. For example, I have their current city, state, country, user-agent, etc. What I'd like to be able to do is run these through some type of machine learning system / algorithm (maybe a Bayesian classifier?), with the eventual goal of getting e-mail notifications when something out-of-the-ordinary occurs (anomaly detection). For example, Jane Doe has only ever logged in from USA on Chrome. So if she suddenly logs into my web service from the Ukraine on Firefox, I want to see that as a highly 'unusual' event and fire off a notification.
I am using CouchDB (specifically with Cloudant) already, and I see people often saying here and there online that Cloudant / CouchDB is perfect for this sort of thing (big data analysis). However I am at a complete loss for where to start. I have not found much in terms of documentation regarding relatively simple tracking of outlying events for a web service, let alone storing previously 'learned' data using CouchDB. I see several dedicated systems for doing this type of data crunching (PredictionIO comes to mind), but I can't help but feel that they are overkill given the nature of CouchDB in the first place.
Any insight would be much appreciated. Thanks!

You're correct in assuming that this is a problem ideally suited to machine learning, and scikit-learn (scikit-learn.org) is my preferred library for these types of problems. Don't worry about specifics (CouchDB, Cloudant) for now; let's get your problem into a state where it can be solved.
If we can assume that variations in log-in details (time, location, user-agent etc.) for a given user are low, then any large variation from this would trigger your alert. This is where the 'outlier' detection that @Robert McGibbon suggested comes into play.
For example, squeeze each log-in detail into one dimension, and then create a log-in detail vector for each user (there is significant room for improving this digest of log-in information):
log-in time (modulo 24 hrs)
location (maybe an array of integer locations, each integer representing a different country)
user-agent (a similar array of integer user-agents)
and so on. Every time a user logs in, create this detail array and store it. Once you have accumulated a large set of test data you can try running some ML routines.
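A minimal sketch of building such a detail array might look like this. The helper names (`CategoryIndex`, `login_vector`) and the sample log-ins are assumptions for illustration only.

```python
from datetime import datetime
from typing import Dict, List


class CategoryIndex:
    """Assigns a stable integer to each distinct category value."""

    def __init__(self) -> None:
        self._ids: Dict[str, int] = {}

    def encode(self, value: str) -> int:
        # A new value gets the next free integer; known values keep theirs.
        return self._ids.setdefault(value, len(self._ids))


countries = CategoryIndex()
user_agents = CategoryIndex()


def login_vector(timestamp: datetime, country: str, user_agent: str) -> List[float]:
    """[hour of day, country id, user-agent id] for one log-in."""
    hour = timestamp.hour + timestamp.minute / 60.0  # log-in time modulo 24 h
    return [hour, float(countries.encode(country)), float(user_agents.encode(user_agent))]


first = login_vector(datetime(2024, 1, 5, 11, 0), "USA", "Chrome")
second = login_vector(datetime(2024, 1, 6, 11, 30), "USA", "Chrome")
third = login_vector(datetime(2024, 1, 7, 3, 0), "Ukraine", "Firefox")
print(first, second, third)
```

Integer ids impose an arbitrary ordering on countries and user-agents, so a one-hot encoding may be a better fit for distance-based models; the integer form is kept here only because it mirrors the description above.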
So, we have a user and a set of log-in data corresponding to successful log-ins (a training set). We can now train a Support Vector Machine to recognise this user's log-in pattern:
from sklearn import svm
# training data [[11.0, 2, 2], [11.3, 2, 2] ... etc]
train_data = my_training_data()
# create and fit the model
clf = svm.OneClassSVM()
clf.fit(train_data)
and then, every time a new log-in event occurs, create a single log-in detail array and pass it to the SVM:
# predict() expects a 2-D array and returns +1 (inlier) or -1 (outlier) per row
if clf.predict([log_in_data])[0] < 0:
    fire_alert_event()
else:
    # log-in is not dissimilar to previous attempts
    print('log in ok')
If the SVM finds the new data point to be significantly different from its training set, it will fire the alarm.
My Two Pence. Once you've got hold of a good training set, there are many more ML techniques that may be better suited to your task (they may be faster, more accurate etc) but creating your training sets and then training the routines would be the most significant challenge.
There are many exciting things to try! If you know you have bad log-in attempts, you can add these to the training sets by using a more complex SVM which you train with good and bad log-ins. Instead of using an array of disparate 'location' values, you could find the Euclidean distance between log-ins and use that! This sounds like great fun, good luck!

I also thought the approach using svm.OneClassSVM from sklearn was going to produce a good outlier detector. However, I put together some representative data based upon the example in the question and it simply could not detect an outlier. I swept the nu and gamma parameters from .01 to .99 and found no satisfactory SVM predictor.
My theory is that because the samples have categorical data (cities, states, countries, web browsers) the SVM algorithm is not the right approach. (I did, BTW, first convert the data into binary feature vectors with the DictVectorizer.fit_transform method).
I believe @sullivanmatt is on the right track when he suggests using a Bayesian classifier. Bayesian classifiers are used for supervised learning but, at least on the surface, this problem was cast as an unsupervised learning problem, i.e. we don't know a priori which observations are normal and which are outliers.
Because the outliers you want to detect are very rare in the stream of web site visits, I believe you could train the Bayesian classifier by labeling every observation in your training set as a positive/normal observation. The classifier should predict that true normal observations have higher probability simply because the majority of the observations really are normal. A true outlier should stand out as receiving a low predicted probability.
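The idea of treating every stored log-in as normal and flagging low-probability ones can be sketched with simple per-user frequency counts. This is a naive-Bayes-style illustration, not a scikit-learn classifier; `LoginProfile`, the field names and the 0.05 threshold are assumptions.

```python
from collections import Counter
from typing import Dict, Sequence

Login = Dict[str, str]


class LoginProfile:
    """Per-user value frequencies with add-one (Laplace) smoothing."""

    def __init__(self, fields: Sequence[str]) -> None:
        self.fields = list(fields)
        self.counts: Dict[str, Counter] = {f: Counter() for f in self.fields}
        self.total = 0

    def observe(self, login: Login) -> None:
        """Record one log-in, implicitly labelled as normal."""
        for f in self.fields:
            self.counts[f][login[f]] += 1
        self.total += 1

    def likelihood(self, login: Login) -> float:
        """Product of smoothed per-field probabilities (naive independence)."""
        score = 1.0
        for f in self.fields:
            seen = self.counts[f]
            # One extra slot is reserved for "any value not seen so far".
            score *= (seen[login[f]] + 1) / (self.total + len(seen) + 1)
        return score


jane = LoginProfile(["country", "browser"])
for _ in range(20):
    jane.observe({"country": "USA", "browser": "Chrome"})

usual = jane.likelihood({"country": "USA", "browser": "Chrome"})
unusual = jane.likelihood({"country": "Ukraine", "browser": "Firefox"})
THRESHOLD = 0.05  # arbitrary cut-off; would need tuning on real data
print(usual, unusual, unusual < THRESHOLD)
```

With 20 identical log-ins on record, the familiar combination scores about 0.91 and the unseen one about 0.002, so a threshold between the two would fire the notification.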

If you're trying to investigate anomalies in user behaviour over time, I'd recommend looking at time-series anomaly detectors. With this approach you'll be able to statistically and automatically identify new, potentially suspicious, emerging patterns and abnormal events.
http://www.autonlab.org/tutorials/biosurv.html and http://web.engr.oregonstate.edu/~wong/workshops/icml2006/slides/agarwal.ppt
explain some techniques based on machine learning. In this case you can use scikit-learn.org, a very powerful Python library that contains tons of ML algos.

Neural Network based ranking of documents

I'm planning to implement a document ranker which uses neural networks. How can one rate a document by taking into consideration the ratings of similar articles? Are there any good Python libraries for doing this? Can anyone recommend a good book on AI with Python code?
EDIT
I'm planning to make a recommendation engine which would make recommendations from similar users as well as using the data clustered using tags. User would be given chance to vote for articles. There will be about hundred thousand articles. Documents would be clustered based on their tags. Given a keyword articles would be fetched based on their tags and passed through a neural network for ranking.

The problem you are trying to solve is called "collaborative filtering".
Neural Networks
One state-of-the-art neural network method is Deep Belief Networks and Restricted Boltzmann Machines. For a fast Python implementation for a GPU (CUDA) see here. Another option is PyBrain.
Academic papers on your specific problem:
This is probably the state-of-the-art of neural networks and collaborative filtering (of movies):
Salakhutdinov, R., Mnih, A. and Hinton, G. Restricted Boltzmann Machines for Collaborative Filtering. To appear in Proceedings of the 24th International Conference on Machine Learning, 2007. PDF
A Hopfield network implemented in Python:
Huang, Z. and Chen, H. and Zeng, D. Applying associative retrieval techniques to alleviate the sparsity problem in collaborative filtering.
ACM Transactions on Information Systems (TOIS), 22, 1,116--142, 2004, ACM. PDF
A thesis on collaborative filtering with Restricted Boltzmann Machines (they say Python is not practical for the job):
G. Louppe. Collaborative filtering: Scalable approaches using restricted Boltzmann machines. Master's thesis, Université de Liège, 2010. PDF
Neural networks are not currently the state-of-the-art in collaborative filtering, and they are not the simplest or most widespread solutions. Regarding your comment that the reason for using NNs is having too little data: neural networks don't have an inherent advantage or disadvantage in that case. Therefore, you might want to consider simpler machine learning approaches.
Other Machine Learning Techniques
The best methods today mix k-Nearest Neighbors and Matrix Factorization.
If you are locked on Python, take a look at pysuggest (a Python wrapper for the SUGGEST recommendation engine) and PyRSVD (primarily aimed at applications in collaborative filtering, in particular the Netflix competition).
If you are open to try other open source technologies look at: Open Source collaborative filtering frameworks and http://www.infoanarchy.org/en/Collaborative_Filtering.

Packages
If you're not committed to neural networks, I've had good luck with SVM, and k-means clustering might also be helpful. Both of these are provided by Milk. It also does Stepwise Discriminant Analysis for feature selection, which will definitely be useful to you if you're trying to find similar documents by topic.
God help you if you choose this route, but the ROOT framework has a powerful machine learning package called TMVA that provides a large number of classification methods, including SVM, NN, and Boosted Decision Trees (also possibly a good option). I haven't used it, but pyROOT provides python bindings to ROOT functionality. To be fair, when I first used ROOT I had no C++ knowledge and was in over my head conceptually too, so this might actually be amazing for you. ROOT has a HUGE number of data processing tools.
(NB: I've also written a fairly accurate document language identifier using chi-squared feature selection and cosine matching. Obviously your problem is harder, but consider that you might not need very hefty tools for it.)
Storage vs Processing
You mention in your question that:
...articles would be fetched based on their tags and passed through a neural network for ranking.
Just as another NB, one thing you should know about machine learning is that processes like training and evaluating tend to take a while. You should probably consider ranking all documents for each tag only once (assuming you know all the tags) and storing the results. For machine learning generally, it's much better to use more storage than more processing.
Now to your specific case. You don't say how many tags you have, so let's assume you have 1000, for roundness. If you store the results of your ranking for each doc on each tag, that gives you 100 million floats to store. That's a lot of data, and calculating them all will take a while, but retrieving them is very fast. If instead you recalculate the ranking for each document on demand, you have to do 1000 passes of it, one for each tag. Depending on the kind of operations you're doing and the size of your docs, that could take a few seconds to a few minutes. If the process is simple enough that you can wait for your code to do several of these evaluations on demand without getting bored, then go for it, but you should time this process before making any design decisions / writing code you won't want to use.
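The arithmetic above, together with the "rank once and store" idea, can be sketched as follows. The 32-bit float size, the function name and the toy scoring function are assumptions; the scoring lambda merely stands in for whatever ranker is used.

```python
from typing import Callable, Dict, List, Tuple

ARTICLES = 100_000   # from the question
TAGS = 1_000         # round number assumed in the answer
BYTES_PER_FLOAT = 4  # assumption: 32-bit floats

stored_scores = ARTICLES * TAGS
storage_mb = stored_scores * BYTES_PER_FLOAT / 1_000_000


def precompute_rankings(
    docs_by_tag: Dict[str, List[str]],
    score: Callable[[str, str], float],
) -> Dict[str, List[Tuple[str, float]]]:
    """Score every document once per tag and keep the sorted result."""
    return {
        tag: sorted(((doc, score(doc, tag)) for doc in docs), key=lambda pair: pair[1], reverse=True)
        for tag, docs in docs_by_tag.items()
    }


votes = {"a": 3.0, "b": 10.0, "c": 7.0}  # made-up stand-in for the ranker's output
rankings = precompute_rankings({"python": ["a", "b"], "ml": ["b", "c"]}, lambda doc, tag: votes[doc])

print(stored_scores, storage_mb)  # 100000000 400.0
print(rankings["python"])         # [('b', 10.0), ('a', 3.0)]
```

At query time a lookup such as `rankings["python"]` is a dictionary access, and if each article carries only a handful of tags, storing scores just for the (tag, article) pairs that exist would need far less than the 400 MB worst case.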
Good luck!

If I understand correctly, your task is related to collaborative filtering. There are many possible approaches to this problem; I suggest you read the Wikipedia page for an overview of the main approaches you can choose from.
For your project work I can suggest looking at Python based intro to Neural Networks with a simple BackProp NN implementation and a classification example. This is not "the" solution, but perhaps you can build your system out of that example without the need for a bigger framework.

You might want to check out PyBrain.

The FANN library also looks promising.

I am not really sure if neural networks are the best way to solve this. I think Euclidean Distance Score or Pearson Correlation Score combined with item- or user-based filtering would be a good start.
An excellent book on the topic is Programming Collective Intelligence by Toby Segaran.
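As a starting point, user-based filtering with a Pearson correlation score might look like the sketch below. It is a minimal illustration, not code from the book; the function names and the ratings are made up.

```python
from math import sqrt
from typing import Dict, Optional

Ratings = Dict[str, Dict[str, float]]


def pearson(ratings: Ratings, a: str, b: str) -> float:
    """Pearson correlation over the articles both users have rated."""
    shared = [item for item in ratings[a] if item in ratings[b]]
    n = len(shared)
    if n < 2:
        return 0.0
    xs = [ratings[a][i] for i in shared]
    ys = [ratings[b][i] for i in shared]
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def predict_rating(ratings: Ratings, user: str, item: str) -> Optional[float]:
    """Similarity-weighted average of other users' ratings for the item."""
    weighted, total = 0.0, 0.0
    for other, their in ratings.items():
        if other == user or item not in their:
            continue
        sim = pearson(ratings, user, other)
        if sim > 0:  # ignore users with no or opposite taste
            weighted += sim * their[item]
            total += sim
    return weighted / total if total else None


ratings: Ratings = {
    "ann": {"a1": 5, "a2": 3, "a3": 4},
    "bob": {"a1": 4, "a2": 2, "a3": 3, "a4": 5},
    "cat": {"a1": 1, "a2": 5, "a3": 2, "a4": 1},
}

print(pearson(ratings, "ann", "bob"))       # 1.0: same taste, shifted by one point
print(predict_rating(ratings, "ann", "a4"))  # 5.0: only bob's vote counts
```

Pearson correlation ignores the fact that bob rates everything one point lower than ann, which is one reason it is often preferred over a raw Euclidean distance for votes.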
