I simply want to receive notifications from Dropbox that a change has been made. I am currently following this tutorial:
https://www.dropbox.com/developers/reference/webhooks#tutorial
The GET method is done, verification is good.
However, when trying to mimic their implementation of POST, I am struggling because of a few things:
I have no idea what redis_url refers to in the process function of the tutorial.
I can't actually verify if anything is really being sent from dropbox.
Also, any advice on how I can debug? I can't print anything from my program since it has to be run on a site rather than in an IDE.
Redis is a key-value store; it's just a way to cache your data throughout your application.
For example, the access token received after the OAuth callback is stored:
redis_client.hset('tokens', uid, access_token)
only to be used later in process_user:
token = redis_client.hget('tokens', uid)
(code from https://github.com/dropbox/mdwebhook/blob/master/app.py as suggested by their documentation: https://www.dropbox.com/developers/reference/webhooks#webhooks)
The same goes for the per-user delta cursors, which are stored in the same way.
There are plenty of resources on how to install Redis, for example:
https://www.digitalocean.com/community/tutorials/how-to-install-and-use-redis
In this case your redis_url would be something like:
"redis://localhost:6379/"
There are also hosted solutions, e.g. http://redistogo.com/
A possible workaround would be to use a database for this purpose instead.
As for debugging, you could use the logging facility for Python: it's thread-safe and capable of writing output to a file stream, and it should provide you with plenty of information if used properly.
More info here:
https://docs.python.org/2/howto/logging.html
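For instance, a minimal setup that writes debug output to a file (the file path and message are just examples) could look like:

import logging

# write to a file, since stdout is not visible when running behind a web server
logging.basicConfig(
    filename='/tmp/webhook.log',  # example path, adjust for your host
    level=logging.DEBUG,
    format='%(asctime)s %(levelname)s %(message)s',
)

logging.debug('webhook payload received: %r', {'delta': {'users': [12345]}})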
The final outcome of my work should be a Python function that takes a JSON object as its only input and returns another JSON object as output. To be more specific: I am a data scientist, and the function I am speaking about is derived from data and delivers predictions (in other words, it is a machine learning model).
So, my question is how to deliver this function to the "tech team" that is going to incorporate it into a web-service.
At the moment I face a few problems. First, the tech team does not necessarily work in a Python environment, so they cannot just "copy and paste" my function into their code. Second, I want to make sure that my function runs in the same environment as mine. For example, I can imagine that I use some library that the tech team does not have, or that they have a version that differs from the one I use.
ADDED
As a possible solution I am considering the following: I start a Python process that listens on a socket, accepts incoming strings, transforms them into JSON, passes the JSON to the "published" function, and returns the output JSON as a string. Does this solution have disadvantages? In other words, is it a good idea to "publish" a Python function as a background process listening on a socket?
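Roughly, I imagine something like this sketch (assuming newline-delimited JSON messages; published_function is just a stand-in for my model):

import json
import socketserver  # SocketServer on Python 2

def published_function(payload):
    # stand-in for the actual prediction function
    return {'prediction': payload}

class JSONHandler(socketserver.StreamRequestHandler):
    def handle(self):
        # read one newline-terminated JSON string from the socket
        line = self.rfile.readline().decode('utf-8')
        result = published_function(json.loads(line))
        # write the output JSON back as a string
        self.wfile.write((json.dumps(result) + '\n').encode('utf-8'))

if __name__ == '__main__':
    socketserver.TCPServer(('localhost', 9000), JSONHandler).serve_forever()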
You have the right idea with using a socket, but there are tons of frameworks doing exactly what you want. Like hleggs, I suggest you check out Flask to build a microservice. This will let the other team post JSON objects in an HTTP request to your Flask application and receive JSON objects back. No knowledge of the underlying system or additional requirements needed!
Here's a template for a Flask app that receives and responds with JSON:
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route('/', methods=['POST'])
def index():
    # parse the incoming JSON request body
    json = request.json
    # your_function is the model function being published
    return jsonify(your_function(json))

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)
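For completeness, the other team could then call the service with any HTTP client; for example, from Python (the URL and payload are invented for illustration):

import requests

# example payload; your_function decides which fields are meaningful
payload = {'feature_1': 3.2, 'feature_2': 'abc'}

# assumes the Flask app above is running locally on port 5000
response = requests.post('http://localhost:5000/', json=payload)
print(response.json())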
Edit: embedded my code directly as per Peter Britain's advice
My understanding of your question boils down to:
How can I share a Python library with the rest of my team, which may not be using Python otherwise?
And how can I make sure my code and its dependencies are what the receiving team will run?
And that the receiving team can install things easily mostly anywhere?
This is a simple question with no straightforward answer... you mention that this may be integrated into some web service, but you do not know the actual platform for this service.
You also ask:
As a possible solution I am considering the following: I start a Python process that listens on a socket, accepts incoming strings, transforms them into JSON, passes the JSON to the "published" function, and returns the output JSON as a string. Does this solution have disadvantages? In other words, is it a good idea to "publish" a Python function as a background process listening on a socket?
In the simplest case, and for starting out, I would say no in general. Starting network servers such as an HTTP server (which is built into Python) is super easy. But a service (even if qualified as "micro") means infrastructure, security, and so on:
What if the port you expect is not available on the deployment machine?
What happens when you restart that machine?
How will your server start or restart when there is a failure?
Would you eventually also need to provide an upstart or systemd service (on Linux)?
Will your simple socket or web server support multiple concurrent requests?
Is there a security risk in exposing a socket?
Etc, etc. When deployed, my experience with "simple" socket servers is that they end up being not so simple after all.
In most cases, it will be simpler to avoid redistributing a socket service at first. And the proposed approach here could be used to package a whole service at a later stage in a simpler way if you want.
What I suggest instead is a simple command line interface nicely packaged for installation.
The minimal set of things to consider would be:
provide a portable mechanism to call your function on many OSes
ensure that you package your function such that it can be installed with all the correct dependencies
make it easy to install and of course provide some doc!
Step 1. The simplest common denominator would be to provide a command line interface that accepts the path to a JSON file and spits JSON out on stdout.
This would run on Linux, Mac and Windows.
The instructions here should work on Linux or Mac and would need a slight adjustment for Windows (only for the configure.sh script further down)
A minimal Python script could be:
#!/usr/bin/env python
"""
Simple wrapper for calling a function accepting JSON and returning JSON.

Save to predictor.py and use this way::

    python predictor.py sample.json
    [
      "a",
      "b",
      4
    ]
"""

from __future__ import absolute_import, print_function
import json
import sys


def predict(json_input):
    """
    Return predictions as a JSON string based on the provided `json_input` JSON
    string data.
    """
    # this will error out immediately if the JSON is not valid
    validated = json.loads(json_input)
    # <....> your code there
    with_predictions = validated
    # return a pretty-printed JSON string
    return json.dumps(with_predictions, indent=2)


def main():
    """
    Print the JSON string results of a prediction, loading an input JSON file from a
    file path provided as a command line argument.
    """
    args = sys.argv[1:]
    json_input = args[0]
    with open(json_input) as inp:
        print(predict(inp.read()))


if __name__ == '__main__':
    main()
You can process possibly large inputs by passing the path to a JSON file.
Step 2. Package your function. In Python this is achieved by creating a setup.py script, which also takes care of installing any dependent code from PyPI. This will ensure that the versions of the libraries you depend on are the ones you expect. Here I added nltk as an example dependency; add yours: this could be scikit-learn, pandas, numpy, etc. This setup.py also automatically creates a bin/predict script, which will be your main command line interface:
#!/usr/bin/env python
# -*- encoding: utf-8 -*-

from __future__ import absolute_import, print_function
from setuptools import setup
from setuptools import find_packages

setup(
    name='predictor',
    version='1.0.0',
    license='public domain',
    description='Predict your life with JSON.',
    packages=find_packages(),
    # add all your direct requirements here
    install_requires=['nltk >= 3.2, < 4.0'],
    # add all your command line entry points here
    entry_points={'console_scripts': ['predict = prediction.predictor:main']}
)
In addition, as is common for Python and to keep the setup code simpler, I created a "Python package" directory and moved the predictor inside this directory.
Step 3. You now want to package things such that they are easy to install. A simple configure.sh script does the job: it installs virtualenv, pip and setuptools, then creates a virtualenv in the same directory as your project, and then installs your prediction tool in there (pip install . is essentially the same as python setup.py install). With this script you ensure that the code that will be run is the code you want to be run, with the correct dependencies. Furthermore, you ensure that this is an isolated installation with minimal dependencies and minimal impact on the target system. This is tested with Python 2 but should quite likely work on Python 3 too.
#!/bin/bash
#
# configure and installs predictor
#
ARCHIVE=15.0.3.tar.gz
mkdir -p tmp/
wget -O tmp/venv.tgz https://github.com/pypa/virtualenv/archive/$ARCHIVE
tar --strip-components=1 -xf tmp/venv.tgz -C tmp
/usr/bin/python tmp/virtualenv.py .
. bin/activate
pip install .
echo ""
echo "Predictor is now configured: run it with:"
echo " bin/predict <path to JSON file>"
At the end you have a fully configured, isolated and easy to install piece of code with a simple highly portable command line interface.
You can see it all in this small repo: https://github.com/pombredanne/predictor
You just clone or fetch a zip or tarball of the repo, then go through the README and you are in business.
Note that for a more involved approach to more complex applications, including vendoring the dependencies for easy installation without depending on the network, you can check https://github.com/nexB/scancode-toolkit, which I also maintain.
And if you really want to expose a web service, you could reuse this approach and package it with a simple web server (like the one built into the Python standard library, or bottle, or flask, or gunicorn) and provide configure.sh to install it all and generate the command line to launch it.
Your task is (in generality) about productionizing a machine learning model, where the consumer of the model may not be working in the same environment as the one used to develop the model. I've been trying to tackle this problem for the past few years. The problem is faced by many companies, and it is aggravated by mismatches in skill sets, objectives and environments (languages, runtimes) between data scientists and developers. From my experience, the following solutions/options are available, each with its unique advantages and downsides.
Option 1: Build the prediction part of your model as a standalone web service using any lightweight tool in Python (for example, Flask). You should try to decouple model development/training from the prediction part as much as possible. The model that you have developed must be serialized to some form so that the web server can use it.
How frequently is your machine learning model updated? If it is not updated very frequently, the serialized model file (for example, a Python pickle file) can be saved to a common location accessible to the web server (say, S3) and loaded into memory. The standalone web server should offer APIs for prediction.
Please note that exposing a single model prediction using Flask would be simple. But scaling this web server if needed, configuring it with the right set of libraries, and authenticating incoming requests are all non-trivial tasks. You should choose this route only if you have dev teams ready to help with these.
If the model gets updated frequently, versioning your model file would be a good option. So you can in fact piggyback on top of any version control system by checking in the whole model file, provided it is not too large. The web server can deserialize (pickle.load) this file at startup/update and convert it to a Python object on which you can call prediction methods.
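As a minimal sketch of this option (the file name model.pkl and the /predict route are illustrative, and the scikit-learn-style predict() call is an assumption about your model):

import pickle
from flask import Flask, jsonify, request

app = Flask(__name__)

# deserialize the model once at startup; re-load when the file is updated
with open('model.pkl', 'rb') as f:
    model = pickle.load(f)

@app.route('/predict', methods=['POST'])
def predict_api():
    data = request.json
    # assumes a scikit-learn-style model exposing .predict()
    prediction = model.predict([data['features']])
    return jsonify({'prediction': prediction.tolist()})

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)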
Option 2: Use the Predictive Model Markup Language (PMML). PMML was developed specifically for this purpose: a data interchange format for predictive models, independent of environment. So a data scientist can develop a model and export it to a PMML file. The web server used for prediction can then consume the PMML file for making predictions. You should definitely check out the Openscoring project, which allows you to expose machine learning models via REST APIs for deploying models and making predictions.
Pros: PMML is a standardized format, and Openscoring is a mature project with a good development history.
Cons: PMML may not support all models. Openscoring is primarily useful if your tech team's development platform of choice is the JVM. Exporting machine learning models from Python is not straightforward, but R has good support for exporting models as PMML files.
Option 3: There are some vendors offering dedicated solutions for this problem. You will have to evaluate the cost of licensing, the cost of hardware, as well as the stability of the offerings when taking this route.
Whichever option you choose, please consider the long-term cost of supporting that option. If your work is at a proof-of-concept stage, a Python Flask-based web server with pickled model files will be the best route. Hope this answer helps you!
As already suggested in other answers, the best option would be to create a simple web service. Besides Flask, you may want to try bottle, which is a very thin one-file web framework. Your service may look as simple as:
from bottle import route, run, request

# accept POSTed JSON at the root URL; request.json parses the request body
@route('/', method='POST')
def index():
    return my_function(request.json)

run(host='0.0.0.0', port=8080)
In order to keep the environments the same, check out virtualenv to create an isolated environment that avoids conflicts with already installed packages, and pip to install exact versions of packages into the virtual environment.
I guess you have 3 possibilities:
Convert the Python function to a JavaScript function:
Assuming the "tech team" uses JavaScript for the web service, you may try to convert your Python function directly to a JavaScript function (which will be really easy to integrate in a web page) using empythoned (based on emscripten).
The bad point of this method is that each time you need to update/upgrade your Python function, you also need to convert it to JavaScript again, then check and validate that the function continues to work.
Simple API server + jQuery:
If the conversion method is impossible, I agree with @justin-bell: you may use Flask,
getting JSON as input > passing the JSON to your function's parameters > running the Python function > converting the function result to JSON > serving the JSON result.
Assuming you choose the Flask solution, the "tech team" will only need to send an asynchronous GET/POST request containing all the arguments as a JSON object whenever they need to get some result from your Python function.
WebSocket server + socket.io:
You can also take a look at WebSockets for dispatching to the web service (look at Flask + WebSocket for your side and socket.io for the web service side).
WebSockets are really useful when you need to push/receive data with low cost and latency to (or from) a lot of users (I'm not sure that WebSockets will be the best fit for your need). A rough sketch follows below.
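Here is that sketch, using the flask-socketio extension (the event names and my_function are placeholders):

from flask import Flask
from flask_socketio import SocketIO, emit

app = Flask(__name__)
socketio = SocketIO(app)

def my_function(payload):
    # placeholder for the real Python function
    return {'result': payload}

@socketio.on('predict')
def handle_predict(json_payload):
    # push the result back to the client over the same socket
    emit('prediction', my_function(json_payload))

if __name__ == '__main__':
    socketio.run(app, host='0.0.0.0', port=5000)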
Regards
For testing I use pytest, so it would be great if you suggest something pytest-specific.
I have some code which uses the requests library. What it does is basically simple POST/GET requests for logging in, parsing data, etc.
Naturally, I want to test that code locally without making any actual HTTP requests.
A monkeypatch funcarg could be the solution, but I think that mocking requests.get(...) calls, or Python's urllib directly, isn't good, because, for example, there are functions which make more than one HTTP request inside, so I can't just mock requests.get("anyURL") with a simple lambda *args, **kwargs: """<html>response</html>""".
There are different URLs which should return different content, sometimes based on POST/GET data. Also, I have no idea how requests.session will behave in case of direct mocking. Besides that, how do I emulate session termination? How do I emulate a connection failure?
So, in the end, in my opinion it's quite hard to use monkeypatching here. At least I am not able to write a good mocking function which takes everything into account. Also, if I choose to mock urllib directly and someday the requests library starts using something different, all my tests will fail.
So I think the best way is to use an actual HTTP server which starts up for a test run and, if possible, takes pytest's scopes etc. into account (so it's a funcarg). While googling I found only two solutions:
https://pypi.python.org/pypi/pytest-localserver
https://github.com/kevin1024/pytest-httpbin
The first one sets up an HTTP server and serves predefined content over a specific URL. That definitely does not work for me, because, as I mentioned, some of the functions I intend to test make several requests, so all the inner requests.get() calls would get the same answer. Bad.
The second one, as far as I can see, has the same problem. Or at least I do not understand how to use it.
A third option could be writing a small Flask-based service, but I guess I'd run into the problem that the things I use in tests should themselves be tested, which is bad practice.
You could instead unmock get after the first call:
class Requester():
    def get(*args):
        ...

def mock_get(requester, response):
    orig_get = requester.get
    def return_text_and_unmock(self, *args, **kwargs):
        # restore the original get after the first mocked call
        self.get = orig_get
        return response
    # bind the replacement function as a method on this instance
    requester.get = return_text_and_unmock.__get__(requester, Requester)
    return requester
I believe using a local server for unit testing is not a good idea, as this is not really a unit test. If you're using requests, one good way of being able to mock the requests is to use the responses module that is developed and maintained by Dropbox. With responses you will be able to mock each request you make by specifying that you want a certain content to be returned when a request is issued to a given URL. The README gives a quick overview of the module's abilities.
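A small sketch of what a pytest test with responses can look like (the URLs and payloads are invented; note that responses can also simulate a connection failure by passing an exception instance as the body):

import pytest
import requests
import responses

@responses.activate
def test_login_then_fetch():
    # different URLs can return different content
    responses.add(responses.POST, 'http://example.com/login',
                  json={'token': 'abc'}, status=200)
    responses.add(responses.GET, 'http://example.com/data',
                  body='<html>response</html>', status=200)
    # an exception as the body simulates a connection failure
    responses.add(responses.GET, 'http://example.com/broken',
                  body=requests.exceptions.ConnectionError())

    assert requests.post('http://example.com/login').json() == {'token': 'abc'}
    assert requests.get('http://example.com/data').text == '<html>response</html>'
    with pytest.raises(requests.exceptions.ConnectionError):
        requests.get('http://example.com/broken')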
I'm developing an OpenID consumer library [1] that does its job in two separate Web requests:
When the user requests authentication, the library discovers some information from a URL provided by the user.
This information is used during another Web request, when the library actually finishes the authentication.
I want advice on how best to design the library API for persisting the "discovery information" between the two requests:
I could ask the caller to provide their own session object for both calls, and the library would store and read its own object from it:
result = openid.request(session, url) # stores discovery in `session`
# ...
openid.authenticate(session, auth_data) # reads discovery from `session`
I don't really like it aesthetically, as it looks like an inversion of control: bring your own buffer, and we'll use it for you. Also, a "session" is not a well-defined interface and may impose some limitations. For example, sessions in Django by default only want to store JSON-serializable objects, and the object representing the discovery information is of a custom class that json doesn't recognize.
I could return the discovery info to the caller and let them deal with it as they see fit:
result, discovery = openid.request(url)
session['openid.discovery'] = discovery
# ...
openid.authenticate(session['openid.discovery'], auth_data)
This makes the library cleaner (no need to invent session key names or session cleanup policies) but moves this job to the caller. That is probably okay if the caller is other library code implementing OpenID behaviour for a Web framework, but not okay if the caller is an application developer who shouldn't be bothered with this sort of bookkeeping.
Which approach is best (for any subjective definition of "best")? Is there another one?
The second approach is definitely cleaner, because:
normal return values are always better than "out" arguments
it doesn't restrict the format of your session object (in fact, there may be no explicit session object at all)
I was writing debugging methods for my CherryPy application. The code in question was (very) basically equivalent to this:
import cherrypy

class Page:
    def index(self):
        try:
            self.body += 'okay'
        except AttributeError:
            self.body = 'okay'
        return self.body
    index.exposed = True

cherrypy.quickstart(Page(), config='root.conf')
I was surprised to notice that from request to request, the output of self.body grew. When I visited the page from one client, and then from another concurrently-open client, and then refreshed the browsers for both, the output was an ever-increasing string of "okay"s. In my debugging method, I was also recording user-specific information (i.e. session data) and that, too, showed up in both users' output.
I'm assuming that's because the Python module is loaded into working memory instead of being re-run for every request.
My question is this: How does that work? How is it that self.debug is preserved from request to request, but cherrypy.session and cherrypy.response aren't?
And is there any way to set an object attribute that will only be used for the current request? I know I can overwrite self.body per every request, but it seems a little ad-hoc. Is there a standard or built-in way of doing it in CherryPy?
(second question moved to How does CherryPy caching work?)
synthesizerpatel's analysis is correct, but if you really want to store some data per request, then store it as an attribute on cherrypy.request, not in the session. The cherrypy.request and cherrypy.response objects are recreated for each request, so there's no fear that any of their attributes will persist across requests. That is the canonical way to do it. Just make sure you're not overwriting any of CherryPy's internal attributes! cherrypy.request.body, for example, is already reserved for handing you, say, a POSTed JSON request body.
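A minimal sketch of that pattern, based on the question's example (the attribute name page_body is arbitrary; just avoid CherryPy's own attribute names):

import cherrypy

class Page:
    def index(self):
        # cherrypy.request is a fresh object for every request, so this
        # attribute cannot leak between requests or between users
        cherrypy.request.page_body = 'okay'
        return cherrypy.request.page_body
    index.exposed = True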
For all the details of exactly how the scoping works, the best source is the source code.
You hit the nail on the head with the observation that you're getting the same data from self.body because it lives in the memory of the Python process running CherryPy.
self.debug maintains 'state' for this reason: it's an attribute of the running server.
To set data for the current session, use cherrypy.session['fieldname'] = 'fieldvalue', to get data use cherrypy.session.get('fieldname').
You (the programmer) do not need to know the session ID; cherrypy.session handles that for you. The session ID is automatically generated on the fly by CherryPy and is persisted by exchanging a cookie between the browser and the server on subsequent request/response interactions.
If you don't specify a storage_type for cherrypy.session in your config, it'll be stored in memory (accessible to the server and you), but you can also store the session files on disk, which might be a handy way for you to debug without having to write a bunch of code to dig out session IDs or key/value pairs from the running server.
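For example, a config fragment along these lines (the path and timeout are illustrative, for the CherryPy versions of this era) switches sessions to on-disk storage:

config = {
    '/': {
        'tools.sessions.on': True,
        'tools.sessions.storage_type': 'file',
        'tools.sessions.storage_path': '/tmp/cherrypy_sessions',
        'tools.sessions.timeout': 60,  # minutes
    }
}
# e.g. cherrypy.quickstart(Page(), config=config)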
For more info check out http://www.cherrypy.org/wiki/CherryPySessions