How (in what form) to share (deliver) a Python function? - python

The final outcome of my work should be a Python function that takes a JSON object as the only input and return another JSON object as output. To keep it more specific, I am a data scientist, and the function that I am speaking about, is derived from data and it delivers predictions (in other words, it is a machine learning model).
So, my question is how to deliver this function to the "tech team" that is going to incorporate it into a web-service.
At the moment I face few problems. First, the tech team does not necessarily work in Python environment. So, they cannot just "copy and paste" my function into their code. Second, I want to make sure that my function runs in the same environment as mine. For example, I can imagine that I use some library that the tech team does not have or they have a version that differ from the version that I use.
ADDED
As a possible solution I consider the following. I start a Python process that listen to a socket, accept incoming strings, transforms them into JSON, gives the JSON to the "published" function and returns the output JSON as a string. Does this solution have disadvantages? In other words, is it a good idea to "publish" a Python function as a background process listening to a socket?

You have the right idea with using a socket but there are tons of frameworks doing exactly what you want. Like hleggs, I suggest you checkout Flask to build a microservice. This will let the other team post JSON objects in an HTTP request to your flask application and receive JSON objects back. No knowledge of the underlying system or additional requirements required!
Here's a template for a flask app that replies and responds with JSON
from flask import Flask, request, jsonify
app = Flask(__name__)
#app.route('/', methods=['POST'])
def index():
json = request.json
return jsonify(your_function(json))
if __name__=='__main__':
app.run(host='0.0.0.0', port=5000)
Edit: embeded my code directly as per Peter Britain's advice

My understanding of your question boils down to:
How can I share a Python library with the rest of my team, that may not be using Python otherwise?
And how can I make sure my code and its dependencies are what the receiving team will run?
And that the receiving team can install things easily mostly anywhere?
This is a simple question with no straightforward answer... as you just mentioned that this may be integrated in some webservice, but you do not know the actual platform for this service.
You also ask:
As a possible solution I consider the following. I start a Python process that listen to a socket, accept incoming strings, transforms them into JSON, gives the JSON to the "published" function and returns the output JSON as a string. Does this solution have disadvantages? In other words, is it a good idea to "publish" a Python function as a background process listening to a socket?
In the most simple case and for starting I would say no in general. Starting network servers such as an HTTP server (which is built-in Python) is super easy. But a service (even if qualified as "micro") means infrastructure, means security, etc.
What if the port you expect is not available on the deployment machine? - What happens when you restart that machine?
How will your server start or restart when there is a failure?
Would you need also to eventually provide an upstart or systemd service (on Linux)?
Will your simple socket or web server support multiple concurrent requests?
is there a security risk to expose a socket?
Etc, etc. When deployed, my experience with "simple" socket servers is that they end up being not so simple after all.
In most cases, it will be simpler to avoid redistributing a socket service at first. And the proposed approach here could be used to package a whole service at a later stage in a simpler way if you want.
What I suggest instead is a simple command line interface nicely packaged for installation.
The minimal set of things to consider would be:
provide a portable mechanism to call your function on many OSes
ensure that you package your function such that it can be installed with all the correct dependencies
make it easy to install and of course provide some doc!
Step 1. The simplest common denominator would be to provide a command line interface that accepts the path to a JSON file and spits JSON on the stdout.
This would run on Linux, Mac and Windows.
The instructions here should work on Linux or Mac and would need a slight adjustment for Windows (only for the configure.sh script further down)
A minimal Python script could be:
#!/usr/bin/env python
"""
Simple wrapper for calling a function accepting JSON and returning JSON.
Save to predictor.py and use this way::
python predictor.py sample.json
[
"a",
"b",
4
]
"""
from __future__ import absolute_import, print_function
import json
import sys
def predict(json_input):
"""
Return predictions as a JSON string based on the provided `json_input` JSON
string data.
"""
# this will error out immediately if the JSON is not valid
validated = json.loads(json_input)
# <....> your code there
with_predictions = validated
# return a pretty-printed JSON string
return json.dumps(with_predictions, indent=2)
def main():
"""
Print the JSON string results of a prediction, loading an input JSON file from a
file path provided as a command line argument.
"""
args = sys.argv[1:]
json_input = args[0]
with open(json_input) as inp:
print(predict(inp.read()))
if __name__ == '__main__':
main()
You can process eventually large inputs by passing the path to a JSON file.
Step 2. Package your function. In Python this is achieved by creating a setup.py script. This takes care of installing any dependent code from Pypi too. This will ensure that the version of libraries you depend on are the ones you expect. Here I added nltk as an example for a dependency. Add yours: this could be scikit-learn, pandas, numpy, etc. This setup.py also creates automatically a bin/predict script which will be your main command line interface:
#!/usr/bin/env python
# -*- encoding: utf-8 -*-
from __future__ import absolute_import, print_function
from setuptools import setup
from setuptools import find_packages
setup(
name='predictor',
version='1.0.0',
license='public domain',
description='Predict your life with JSON.',
packages=find_packages(),
# add all your direct requirements here
install_requires=['nltk >= 3.2, < 4.0'],
# add all your command line entry points here
entry_points={'console_scripts': ['predict = prediction.predictor:main']}
)
In addition as is common for Python and to make the setup code simpler I created a "Python package" directory moving the predictor inside this directory.
Step 3. You now want to package things such that they are easy to install. A simple configure.sh script does the job. It installs virtualenv, pip and setuptools, then creates a virtualenv in the same directory as your project and then installs your prediction tool in there (pip install . is essentially the same as python setup.py install). With this script you ensure that the code that will be run is the code you want to be run with the correct dependencies. Furthermore, you ensure that this is an isolated installation with minimal dependencies and impact on the target system. This is tested with Python 2 but should work quite likely on Python 3 too.
#!/bin/bash
#
# configure and installs predictor
#
ARCHIVE=15.0.3.tar.gz
mkdir -p tmp/
wget -O tmp/venv.tgz https://github.com/pypa/virtualenv/archive/$ARCHIVE
tar --strip-components=1 -xf tmp/venv.tgz -C tmp
/usr/bin/python tmp/virtualenv.py .
. bin/activate
pip install .
echo ""
echo "Predictor is now configured: run it with:"
echo " bin/predict <path to JSON file>"
At the end you have a fully configured, isolated and easy to install piece of code with a simple highly portable command line interface.
You can see it all in this small repo: https://github.com/pombredanne/predictor
You just clone or fetch a zip or tarball of the repo, then go through the README and you are in business.
Note that for a more engaged way for more complex applications including vendoring the dependencies for easy install and not depend on the network you can check this https://github.com/nexB/scancode-toolkit I maintain too.
And if you really want to expose a web service, you could reuse this approach and package that with a simple web server (like the one built-in in the Python standard lib or bottle or flask or gunicorn) and provide configure.sh to install it all and generate the command line to launch it.

Your task is (in generality) about productionizing a machine learning model, where the consumer of the model may not be working in the same environment as the one which was used to develop the model. I've been trying to tackle this problem since past few years. The problem is faced by many companies and it is aggravated due to skill set, objectives as well as environment (languages, run time) mismatch between data scientists and developers. From my experience, following solutions/options are available, each with its unique advantages and downsides.
Option 1 : Build the prediction part of your model as a standalone web service using any lightweight tool in Python (for example, Flask). You should try to decouple the model development/training and prediction part as much as possible. The model that you have developed, must be serialized to some form so that the web server can use it.
How frequently is your machine learning model updated? If it is not done very frequently, the serialized model file (example: Python pickle file) can be saved to a common location accessible to the web server (say s3), loaded in memory. The standalone web server should offer APIs for prediction.
Please note that exposing a single model prediction using Flask would be simple. But scaling this web server if needed, configuring it with right set of libraries, authentication of incoming requests are all non-trivial tasks. You should choose this route only if you have dev teams ready to help with these.
If the model gets updated frequently, versioning your model file would be a good option. So in fact, you can piggyback on top of any version control system by checking in the whole model file if it is not too large. The web server can de-serialize (pickle.load) this file at startup/update and convert to a Python object on which you can call prediction methods.
Option 2 : use predictive modeling markup language. PMML was developed specifically for this purpose: predictive modeling data interchange format independent of environment. So data scientist can develop model, export it to a PMML file. The web server used for prediction can then consume the PMML file for doing predictions. You should definitely check the open scoring project which allows you to expose machine learning models via REST APIs for deploying models and making predictions.
Pros: PMML is standardized format, open scoring is a mature project with good development history.
Cons: PMML may not support all models. Open scoring is primarily useful if your tech team's choice of development platform is JVM. Exporting machine learning models from Python is not straightforward. But R has good support for exporting models as PMML files.
Option 3 : There are some vendors offering dedicated solutions for this problem. You will have to evaluate cost of licensing, cost of hardware as well as stability of the offerings for taking this route.
Whichever option you choose, please consider the long term costs of supporting that option. If your work is in a proof of concept stage, Python flask based web server + pickled model files will be the best route. Hope this answer helps you!

As already suggested in other answers the best option would be creating a simple web service. Besides Flask you may want to try bottle which is very thin one-file web framework. Your service may looks as simple as:
from bottle import route, run, request
#route('/')
def index():
return my_function(request.json)
run(host='0.0.0.0', port=8080)
In order to keep environments the same check virtualenv to make isolated environment for avoiding conflicts with already installed packages and pip to install exact version of packages into virtual environment.

I guess you have 3 possibilities :
convert python function to javascript function:
Assuming the "tech-team" use Javascript for web-service, you may try to convert your python function directly to a Javascript function (which will be really easy to integrate on web page) using empythoned (based on emscripten)
The bad point of this method is that each time you need update/upgrade your python function, you need also to convert to Javascript again, then check & validate that the function continue to work.
simple API server + JQuery
If the conversion method is impossible, I am agree with #justin-bell, you may use FLASK
getting JSON as input > JSON to your function parameter > run python function > convert function result to JSON > serve the JSON result
Assuming you choose the FLASK solution, "tech-team" will only need to send an async. GET/POST request containing all the arguments as JSON obj, when they need to get some result from your python function.
websocket server + socket.io
You can also use take a look on Websocket to dispatch to the webservice (look at flask + websocket for your side & socket.io for webservice side.)
=> websocket is really usefull when you need to push/receive data with low cost and latency to (or from) a lot of users (Not sure that websocket will be the best fit to your need)
Regards

Related

How to integrate BIRT with Python Django Project by using Py4j

Hi is there anyone who is help me to Integrate BIRT report with Django Projects? or any suggestion for connect third party reporting tools with Django like Crystal or Crystal Clear Report.
Some of the 3rd-party Crystal Reports viewers listed here provide a full command line API, so your python code can preview/export/print reports via subprocess.call()
The resulting process can span anything between an interactive Crystal Report viewer session (user can login, set/change parameters, print, export) and an automated (no user interaction) report printing/exporting.
While this would simplify your code, it would restrict deployment to Windows.
For prototyping, or if you don't mind performance, you can call from BIRT from the command line.
For example, download the POJO runtime and use the script genReport.bat (IIRC) to generate a report to a file (eg. PDF format). You can specify the output options and the report parameters on the command line.
However, the BIRT startup is heavy overhead (several seconds).
For achieving reasonable performance, it is much better to perform this only once.
To achieve this goal, there are at least two possible ways:
You can use the BIRT viewer servlet (which is included as a WAR file with the POJO runtime). So you start the servlet with a web server, then you use HTTP requests to generate reports.
This looks technically old-fashioned (eg. no JSON Requests), but it should work. However, I never used this approach.
The other option is to write your own BIRT server.
In our product, we followed this approach.
You can take the viewer servlet as a template for seeing how this could work.
The basic idea is:
You start one (or possibly more than one) Java process.
The Java process initializes the BIRT runtime (this is what takes some seconds).
After that, the Java process listens for requests somehow (we used a plain socket listener, but of course you could use HTTP or some REST server framework as well).
A request would contain the following information:
which module to run
which output format
report parameters (specific to the module)
possibly other data/metadata, e.g. for authentication
This would create a RunAndRenderTask or separate RunTask and RenderTasks.
Depending on your reports, you might consider returning the resulting output (e.g. PDF) directly as a response, or using an asynchronous approach.
Note that BIRT will happily create several reports at the same time - multi-threading is no problem (except for the initialization), given enough RAM.
Be warned, however, that you will need at least a few days to build a POC for this "create your own server" approach, and probably some weeks for prodction quality.
So if you just want to build something fast to see if the right tool for you, you should start with the command line approach, then the servlet approach and only then, and only if you find that the servlet approach is not quite good enough, you should go the "create your own server" way.
It's a pity that currently there doesn't seem to exist an open-source, production-quality, modern BIRT REST service.
That would make a really good contribution to the BIRT open-source project... (https://github.com/eclipse/birt)

How we can automate python client application which is used an an interface for user to call APIs

We have made an python client which is used as an interface for user. some function is defined in the client which internally calls the APIs and give output to users.
My requirement is to automate the python client - functions and validate the output.
Please suggest tools to use.
There are several ways to do that:
You can write multiple tests for your application as the test cases which are responsible to call your functions and get the result and validate them. It calls the "feature test". To do that, you can use the python "unittest" library and call the tests periodically.
If you have a web application you can use "selenium" to make automatic test flows. (Also you can run it in a docker container virtually)
The other solution is to write another python application to call your functions or send requests everywhere you want to get the specific data and validate them. (It's the same with the two other solutions with a different implementation)
The most straightforward way is using Python for this, the simplest solution would be a library like pytest. More comprehensive option would be something like Robot framework
Given you have jmeter in your tags I assume that at some point you will want to make a performance test, however it might be easier to use Locust for this as it's pure Python load testing framework.
If you still want to use JMeter it's possible to call Python programs using OS Process Sampler

How to run a python script on Azure for CSV file analysis

I have a python script on my local machine that reads a CSV file and outputs some metrics. The end goal is to create a web interface where the user uploads the CSV file and the metrics are displayed, while all being hosted on Azure.
I want to use a VM on Azure to run this python script.
The script takes the CSV file and outputs metrics which are stored in CosmosDB.
A web interface reads from this DB and displays graphs from the data generated by the script.
Can someone elaborate on the steps I need to follow to achieve this? Detailed steps are not essentially required, but a brief overview with links to relevant learning sources would be helpful.
There's an article that lists the primary options for hosting sites in Azure: https://learn.microsoft.com/en-us/azure/developer/python/quickstarts-app-hosting
As Sadiq mentioned, Functions is probably your best choice as it will probably be less expensive, less maintenance, and can handle both the script and the web interface. Here is a python tutorial for that method: https://learn.microsoft.com/en-us/azure/developer/python/tutorial-vs-code-serverless-python-01
Option 2 would be to run a traditional website on an App Service plan, with background tasks handled either by Functions or a Webjob- they both use the webjobs SDK, so the code is very similar: https://learn.microsoft.com/en-us/learn/paths/deploy-a-website-with-azure-app-service/
VMs are an option if either of those two don't work, but it comes with significantly more administration. This learning path has info on how to do this. The website is built on the MEAN stack, but is applicable to Python as well: https://learn.microsoft.com/en-us/learn/paths/deploy-a-website-with-azure-virtual-machines/

How can a CGI server based on CGIHTTPRequestHandler require that a script start its response with headers that include a `content-type`?

Later note: the issues in the original posting below have been largely resolved.
Here's the background: For an introductory comp sci course, students develop html and server-side Python 2.7 scripts using a server provided by the instructors. That server is based on CGIHTTPRequestHandler, like the one at pointlessprogramming. When the students' html and scripts seem correct, they port those files to a remote, slow Apache server. Why support two servers? Well, the initial development using a local server has the benefit of reducing network issues and dependency on the remote, weak machine that is running Apache. Eventually porting to the Apache-running machine has the benefit of publishing their results for others to see.
For the local development to be most useful, the local server should closely resemble the Apache server. Currently there is an important difference: Apache requires that a script start its response with headers that include a content-type; if the script fails to provide such a header, Apache sends the client a 500 error ("Internal Server Error"), which too generic to help the students, who cannot use the server logs. CGIHTTPRequestHandler imposes no similar requirement. So it is common for a student to write header-free scripts that work with the local server, but get the baffling 500 error after copying files to the Apache server. It would be helpful to have a version of the local server that checks for a content-type header and gives a good error if there is none.
I seek advice about creating such a server. I am new to Python and to writing servers. Here are the issues that occur to me, but any helpful advice would be appreciated.
Is a content-type header required by the CGI standard? If so, other people might benefit from an answer to the main question here. Also, if so, I despair of finding a way to disable Apache's requirement. Maybe the relevant part of the CGI RFC is section 6.3.1 (CGI Response, Content-Type): "If an entity body is returned, the script MUST supply a Content-Type field in the response."
To make a local server that checks for the content-type header, perhaps I should sub-class CGIHTTPServer.CGIHTTPRequestHandler, to override run_cgi() with a version that issues an error for a missing header. I am looking at CGIHTTPServer.py __version__ = "0.4", which was installed with Python 2.7.3. But run_cgi() does a lot of processing, so it is a little unappealing to copy all its code, just to add a couple calls to a header-checking routine. Is there a better way?
If the answer to (2) is something like "No, overriding run_cgi() is recommended," I anticipate writing a version that invokes the desired script, then checks the script's output for headers before that output is sent to the client. There are apparently two places in the existing run_cgi() where the script is invoked:
3a. When run_cgi() is executed on a non-Unix system, the script is executed using Python's subprocess module. As a result, the standard output from the script will be available as an in-memory string, which I can presumably check for headers before the call to self.wfile.write. Does this sound right?
3b. But when run_cgi() is executed on a *nix system, the script is executed by a forked process. I think the child's stdout will write directly to self.wfile (I'm a little hazy on this), so I see no opportunity for the code in run_cgi() to check the output. Ugh. Any suggestions?
If analyzing the script's output is recommended, is email.parser the standard way to recognize whether there is a content-type header? Is another standard module recommended instead?
Is there a more appropriate forum for asking the main question ("How can a CGI server based on CGIHTTPRequestHandler require...")? It seems odd to ask if there is a better forum for asking programming questions than Stack Overflow, but I guess anything is possible.
Thanks for any help.

Scriptable HTTP benchmark (preferable in Python)

I'm searching for a good way to stress test a web application. Basically I'm searching für something like ab with a scriptable interface. Ideally I want to define some tasks, that simulate different action on the webapp (register a account, login, search, etc.) and the tool runs a hole bunch of processes that executes these tasks*. As result I would like something like "average request time", "slowest request (per uri)", etc.
*: To be independed from the client bandwith I will run theses test from some EC2 instances so in a perfect world the tool will already support this - otherwise I will script is using boto.
If you're familiar with the python requests package, locust is very easy to write load tests in.
http://locust.io/
I've used it to write all of our perf tests in it.
You can maybe look onto these tools:
palb (Python Apache-Like Benchmark Tool) - HTTP benchmark tool with command line interface resembles ab.
It lacks the advanced features of ab, but it supports multiple URLs (from arguments, files, stdin, and Python code).
Multi-Mechanize - Performance Test Framework in Python
Multi-Mechanize is an open source framework for performance and load testing.
Runs concurrent Python scripts to generate load (synthetic transactions) against a remote site or service.
Can be used to generate workload against any remote API accessible from Python.
Test output reports are saved as HTML or JMeter-compatible XML.
Pylot (Python Load Tester) - Web Performance Tool
Pylot is a free open source tool for testing performance and scalability of web services.
It runs HTTP load tests, which are useful for capacity planning, benchmarking, analysis, and system tuning.
Pylot generates concurrent load (HTTP Requests), verifies server responses, and produces reports with metrics.
Tests suites are executed and monitored from a GUI or shell/console.
( Pylot on GoogleCode )
The Grinder
Default script language is Jython.
Pretty compact how-to guide.
Tsung
Maybe a bit unusual for the first use but really good for stress-testing.
Step-by-step guide.
+1 for locust.io in answer above.
I would recommend JMeter.
See: http://jmeter.apache.org/
You can setup JMeter as proxy of your browser to record actions like login and then stress test your web application. You can also write scripts to it.
Don't forget FunkLoad, it's very easy to use

Categories