Converting and Unifying API data in Python

Converting and Unifying API data in Python - python

I'm trying to pull similar data in from several third party APIs, all of which have slightly varying schemas, and convert them all into a unified schema to store in a DB and expose through a unified API. This is actually a re-write of a system that already does this, minus storing in the DB, but which is hard to test and not very elegant. I figured I'd turn to the community for some wisdom.
Here are some thoughts/what I'd like to achieve.
An easy way to specify schema mappings from the external APIs schema to the internal schema. I realize that some nuances in the data might be lost by converting to a unified schema, but that's life. This schema mapping might not be easy to do and perhaps overkill from the academic papers I've found on the matter.
An alternative solution would be to allow third parties to develop the interfaces to the external APIs. The code quality of these third parties may or may not be known, but could be established via thorough tests.
Therefore the system should be easy to test, I'm thinking by mocking the external API calls to have reproducible data and ensure that parsing and conversion is being done correctly.
One of the external API interfaces crashing should not bring down the rest of them.
Some sort of schema validation/way to detect if the external API schemas have changed without warning
This will end up being integrated into a Django project, so it could be written as a Django app, which would likely make unit and integration testing easier. On the other hand, I would like to keep it as decoupled as possible from Django. Although the API interfaces would have to know what format to convert to, could this be specified at runtime?
Am I missing anything in the wishlist? Unrealistic? Headed down the wrong path? Would love to get some feedback.
I'm not sure if there are libraries/OS project which already do some of this. The less wheels I have to reinvent the better. Would any part of this be valuable as an OS project?
In the previous version I spawned a bunch of threads that would handle individual requests. Although I've never used it, I've been told I should look at gevent as a way to handle this.

For your second bullet point you should check out Temboo. Temboo normalizes access to over 100 APIs, meaning that you can talk to them all using a common syntax in the language of your choice. In this case you would use the Temboo Python SDK - available here.
(Full disclosure: I work at Temboo)

Related

Is there a Python package to realize lazy data loading for online resources?

Consider the following use case. I provide a Python package working as a standalone application. The application can load different more or less lass data sets (assume maybe 5 to 500mb) and do some processing / analysis of the data. The application code is hosted publicly, but I cannot provide and host the data there, since I am not the owner of the data. The data is available in many different public repositories and can be gathered from there. Also, this just helps to limit the application size and not clutter it with potentially unnecessary data (since different users might want to use very different data sets).
To make this work, I would have to provide user instructions like "If you want to use this data set, go to https://foo.bar/a.wav, download the file and place it into ./data". I would like to take that hideous procedure off the user. Hence, I was looking for a package that can help with exactly that. The internal workflow would look like something like this:
Developer defines a project resource
User picks a specific resource that should be loaded on execution (this is also just done via the relative path where the resource is supposed to be, e.g. "a.wav")
Package verifies if resource is available on the user local system
If not available, the package downloads the resource from an online source (specified by the developer)
I expected this to be a very common problem to many people. Hence I expected to easily find packages that help me realize such functionality. But I was not really not successful. Am I lacking just the appropriate terms to search for (lazy loading seems to be usually used for local data access or dynamic package imports)?!
What I found was lazydata which realizes this kind of functionality, but AFAIK in a pretty unsuitable way for me (providing the yourself on an hosted AWS instance).
I also found data-retriever which does kind of exactly what I need, but AFAIK only from predefined data repository (provided by them) and not arbitrary URLs.

How do I efficiently understand a framework with sparse documentation?

I have the problem that for a project I need to work with a framework (Python), that has a poor documentation. I know what it does since it is the back end of a running application. I also know that no framework is good if the documentation is bad and that I should prob. code it myself. But, I have a time constraint. Therefore my question is: Is there a cooking recipe on how to understand a poorly documented framework?
What I tried until now is checking some functions and identify the organizational units in the framework but I am lacking a system to do it more effectively.

If I were you, with time constaraints, and bound to use a specific framework. I'll go in the following manner:
List down the use cases I desire to implement using the framework
Identify the APIs provided by the framework that helps me implement the use cases
Prototype the usecases based on the available documentation and reading
The prototyping is not implementing the entire use case, but to identify the building blocks around the case and implementing them. e.g., If my usecase is to fetch the Students, along with their courses, and if I were using Hibernate to implement, I would prototype the database accesss, validating how easily am I able to access the database using Hibernate, or how easily I am able to get the relational data by means of joining/aggregation etc.
The prototyping will help me figure out the possible limitations/bugs in the framework. If the limitations are more of show-stoppers, I will implement the supporting APIs myself; or I can take a call to scrap out the entire framework and write one for myself; whichever makes more sense.

You may also use python debugging library: pdb. After importing it with import pdb you may set traces in the body of functions and classes pdb.set_trace(). Then it will stop the execution of the program in the line and you may look at existing variables and processes.

Best Python Web Framework for my API Server Needs

I am working on developing two systems:
A system that will constantly retrieve economic data from a 3rd party data feed and push it into a MySQL DB (using sqlalchemy)
A server that will allow anyone to query the data in the db over a JSON AJAX API (similar to Yelp or Yahoo API for example)
I have two main questions:
Which Python framework should I use in 2)? Pyramid is my first choice, but if you strongly suggest against it or in favor of something else like Django or Pylons I am definitely wiling to consider it.
Should I develop the two system separately? Or should 1) be a part of 2), running within the framework (using crontab or celery for example)?

Depends on what stage you are at, I would suggest to develop 2 systems because the load to pull data from 3rd party and the load to handle the API would be different. You can scale them into a different types of nodes if you want.
Django-Tastypie (https://github.com/toastdriven/django-tastypie) is not bad, it supports all JSON, XML and YAML. Also you can add OAuth easily. Though, Django itself maybe a bit heavy for your needs at this time.

You might want to check out web2py's new functionality for easily generating RESTful API's, particularly its parse_as_rest and smart_query functions. You might also consider using web2py's database abstraction layer to handle #1.
If you need any help, ask on the mailing list.

I agree with Anthony, you should look at Web2Py. It is very easy to get started, very low learning cure and easy to deploy on many systems including Linux, Windows and Amazon.
So far I have found nothing that Web2Py can not do. But more importantly it does things how you would think they should be done, so if you are not sure, very often a guess is good enough and it just works. If you do get stuck, it has by far the best and most up to date documentation for any Python Web Framework.
Even with all it's great features, easy use and up to date documentation, you will also find that the web2py user group on Google, is like having a paid for help desk staffed 24 hours a day. Most questions are answered with a couple minutes and Massimo (The original creator of Web2Py) goes out of his way not only to help, but to implement new ideas, suggestions and bug fixes within days of them being raised in the group.

XML-RPC for an object broker

is there any good reason not to use XML-RPC for an object-broker server/client architecture? Maybe something like "no it's already outfashioned, there is X for that now".
To give you more details: I want to build a framework which allows for standardized interaction and the exchange of results between many little tools (e. g. command-line tools). In case someone wants to integrate another tool she writes a wrapper for that purpose. The wrapper could, e. g., convert the STDOUT of a tool into objects usable by the architecture.
Currently I'm thinking of writing the proof-of-concept server in Python. Later it could be rewritten in C/C++. Just to make sure clients can be written in as many languages as possible I thought of using XML-RPC. CORBA seems to be too bloated for that purpose, since the server shouldn't be too complex.
Thanks for your advice and opinions,
Rainer

XML-RPC has a lot going for it. It's simple to create and to consume, easy to understand and easy to code for.
I'd say avoid SOAP and CORBA like the plague. They are way too complex, and with SOAP you have endless problems because only implementations from single vendors tend to interact nicely - probably because the complexity of the standard leads to varying interpretations.
You may want to consider a RESTful architecture. REST and XML-RPC cannot be directly compared. XML-RPC is a specific implementation of RPC, and REST is an architectural style. REST does not mandate anything much - it's more a style of approach with a bunch of conventions and suggestions. REST can look a lot like XML-RPC, but it doesn't have to.
Have a look at http://en.wikipedia.org/wiki/Representational_State_Transfer and some of the externally linked articles.
One of the goals of REST is that by creating a stateless interface over HTTP, you allow the use of standard caching mechanisms and load balancing mechanisms without having to invent new ways of doing what has already been well solved by HTTP.
Having read about REST, which hopefully is an interesting read, you may decide that for your project XML-RPC is still the best solution, which would be a perfectly reasonable conclusion depending on what exactly you are trying to achieve.

Understanding Zope internals, from Django eyes

I am a newbie to zope and I previously worked on Django for about 2.5 years. So when I first jumped into Zope(v2) (only because my new company is using it since 7 years), I faced these questions. Please help me in understanding them.
What is the "real" purpose of zodb as such? I know what it does, but tell me one great thing that zodb does and a framework like Django (which doesn't have zodb) misses.
Update: Based on the answers, Zodb replaces the need for ORM. You can directly store the object inside the db(zodb itself).
It is said one of the zope's killer feature is the TTW(Through the Web or Developing using ZMI) philosophy. But I(and any developer) prefers File-System based development(using Version control, using Eclipse, using any favorite tool outside Zope). Then where is this TTW actually used?
This is the big one. What "EXTRA Stuff" does Zope's Acquistion gain when compared to Python/Django Inheritance.
Is it really a good move to come to Zope, from Django ?
Any site like djangosnippets.org for Zope(v2)?

First things first: current zope2 versions include all of zope3, too. And if you look at modern zope2 applications like Plone, you'll see that it uses a lot of "zope 3" (now called the "zope tool kit", ZTK) under the hood.
The real purpose of the ZODB: it is one of the few object databases (as opposed to relational SQL databases) that sees real widespread use. You can "just" store all your python objects in there without needing to use an object-relational mapper. No "select * from xyz" under the hood. And adding a new attribute on a zodb object "just" persists that change. Luxurious! Especially handy when your data cannot be handily mapped to a strict relational database. If you can map it easily: just use such a database, I've used sqlalchemy a few times in zope projects.
TTW: we've come back from that. At least, the zope2 way of TTW indeed has all the drawbacks that you fear. No version control, no outside tools, etc. Plone is experimenting (google for "dexterity") with nice explicit zope 3 ways of doing TTW development that can still be mapped back to the filesystem.
TTW: the zodb makes it easy and cheap to store all sorts of config settings in the database, so you can typically adjust a lot of things through the browser. This doesn't really count as typical TTW development, though.
Acquisition: handy trick, though it leads to a huge namespace polution. Double edged sword. To improve debuggability and maintenance we try to do without in most of the cases. The acquisition happens inside the "object graph", so think "folder structure inside the zope site". A call to "contact_form" three folders down can still find the "contact_form" on the root of the site if it isn't found somewhere in between. Double edged sword!
(And regular python object oriented inheritance happens all over the place of course).
Moving from django to zope: a really good idea for certain problems and nonsensical for other problems :-) Quite a lot of zope2/plone companies have actually done some django projects for specific projects, typically those that have 99% percent of their content in a relatively straightforward SQL database. If you're more into content management, zope (and plone) is probably better.
Additional tip: don't focus only on zope2. Zope3's "component architecture" has lots of functionality for creating bigger applications (also non-web). Look at grok (http://grok.zope.org) for a friendly packaged zope, for instance. The pure component architecture is also usable inside django projects.

On the ZODB:
Another way to ask "What is the real purpose of the ZODB?" is to ask, "Why was the ZODB originally created?"
The answer to that is the project was started very early on, around 1996. This was before the existance of MySQL or PostgreSQL, when miniSQL (a free-to-use but not free software) database was still in common use, or big money databases such as Oracle. Python provided the pickle module to serialize Python objects to disk - but serialization is lower level, it doesn't allow for features such as transactions, concurrent writes, and replication. This is what the ZODB provides.
It's still in use today in Zope because it works well. If you have no existing skillset in realational databases, it's easier to learn to use the ZODB than a relational database. It's also usable simpler use-cases, for example if you have a command-line script that needs to store some configuration information, using a relational database means having to run a database server just to store a little bit of configuraiton. You could use a config file, but the ZODB also works quite nicely because it's an embedable database. That means that the database is running in the same process as the rest of your Python code.
It's also worth noting that the API used to store objects inside containers is different between Zope 2 and Zope 3. In Zope 2, containers are stored as attributes:
root.mycontainer.myattr
In Zope 3, they use the same interface as Python standard dictionary type:
root['mycontainer']myattr
This is another reason why it can be easier to learn to use the ZODB than the Django ORM, since Django has it's own interface for it's ORM which is distinct from Python's existing interfaces.
Through-the-web (TTW):
Again, understanding the reason for TTW goes back when Zope was developed. While it seems silly to break with well known developer tools such Subversion or Mercurial, Zope was developed in the late 90s when the only free version control system was CVS. Zope 2 had it's own simple version control capabilites, and they were as good as CVS (which is to say, "they were limited and sucky."). UNIX workstations cost a lot more money back then, and had far fewer resources, so System Administrators were much more guarded and careful about how servers were managed. TTW allowed people who might not normally be able to upload code to the server with sysadmin intervation a way to do that.
With text editors, emacs and vi have had ftp-modes, and Zope 2 can listen on an FTP port. This would allow you to develop so that code was stored in the ZODB (editable TTW), but it was common to edit this code using a emacs or vi.
Today in Zope, TTW is more rarely used or promoted since it no longer makes sense to do this. Disk space is cheap, servers are (relatively) cheap, and there are lots of developer tools which expect to interact with the standard filesystem.
Acquisition:
It was a mistake. It was a very confusing feature that caused lots of unexpected things to happen. In theory there are some interesting ideas to acquisition, but in practice it's best tossed in the bin and has little practical use.
Moving from Django to Zope:
Work started on Zope 3 in 2001. This fixed a lot of the problems with Zope 2. It's a testament to the Zope community that Zope 2 is still actively and well maintained, but it's hardly state-of-the-art. Zope 2 is really only interesting to learn from a historical perspective.
Zope 3 ended up getting evolved in a few different directions, and so modern incarnations of Zope are best expressed in the form of Grok, BFG or Bobo.
Grok is closest to Zope 3, and as such is a pretty large framework - it can be rather overwhelming at times when delving through it's code base. However, just like Django, or any other full-stack framework you don't need to use every part of Grok, it can be quite easy to learn the basic and create web applications with it. It's convention-over-configuration is second to none, and it's class-based Views give it a much tighter, arguably cleaner code base than a Django web application. It's URL routing system is extremely flexible, but also arguably over-engineered.
BFG is a "pay for only what you eat" framework written by long time Zope developer Chris McDonough. As such, it's closer to Pylons in spirit, where only the parts deemed core or essential to a framework are included. It also plays very well with WSGI. It only uses a few core Zope packages.
Bobo is a "micro-framework". It's just a way to route URLs and serve up an app. It doesn't use any Zope packages, so isn't strictly in the Zope family of web frameworks. But it was written by Zope's creator, Jim Fulton, who originally called the publishing part of Zope, "Bobo". The original Bobo, written in the early 90's, mapped URLs to packages and modules, so if your source code was layed out as:
mypackage.mymodule.MyClass
You could have a URL such as:
/mypackage/mymodule/MyClass
Which was very inflexible, and was replaced with URL Traversel in Zope 2, which is fairly complex. Bobo uses Routes, so it's a middle ground between dead-simple URL resolution and complex URL resolution - about the same in complexity as Django's URL resolution machinery.

I answer without much experience on both, but I had the chance to manipulate both, so I can tell you my opinion on some of your questions.
1)What is the "real" purpose of zodb
as such? Meaning I know what it does,
but tell me one great thing that zodb
does and a framework like django(which
doesn't have zodb) misses
Load distribution via ZEO and search via ZCatalog. Django is very low level on this point of view. To achieve the same, you would have to reimplement a lot of wheels, triangular.
Something I learned quite soon is: don't mess with low level database issues. You will screw them up. It's a can of worms, Dune sized.
So why choose django ORM ? You should also consider if YAGNI. django is easy and self contained, documentation is premium, and when (if) your site will grow that much, you will do the switch to a better ORM (or to a pure OODB, in case of ZODB) later on.
2)It is said one of the zope's killer
feature is the TTW(Through the Web or
Developing using ZMI) philosophy. But
I(and any developer) prefers
File-System based development(using
Version control, using Eclipse, using
any favorite tool outside Zope). Then
where is this TTW actually used?
I cannot answer properly to this question, but I would not say that it's fundamentally bad to develop with such approach. Of course it's a change of mindset, and I tend to prefer filesystem based development as well.
4)Is it really a good move to work on
Zope, from Django ?
Zope 3 is very modular, so you are free to use many of its components from django. I would advise against it though. You can, of course, but what I found most problematic is the lack of help. There are not many people using zope components and django at the same time. Sooner or later, you will have a problem and google won't help. At that point, you will realize that if your life was a videogame, you are definitely playing it at level difficult (maybe extreme, if you will have to put your nose into the zope code).

A very good reference on ZODB is ZODB/ZEO programmer's guide. ZODB is not an ORM. Its a true object database. Python objects are persisted inside the database transparently without any worries about how to transform them into a representation suitable for database. Any pickleable Python object can be saved inside the ZODB. Relational databases are suitable for large amount of flat data (like employee records) while ZODB is best for hierarchical data (typically found in web applications). I personally use Zope 3 for my applications. I never did TTW type of work. Best part of using ZODB was the fact that I never had to worry at all about how I am going to save data and how things would change when I upgrade my software from one version to next one. For example, if I add a new attribute to a Python class, all I have to do is provide a default value as a class attribute. It then becomes automatically available to all objects created with the previous version of the same class. Removing an attribute is a simple del operation on existing objects. BTW, ZODB can be used independently in any kind of Python application and isn't coupled with just ZOPE platform. I love the fact that I don't have to worry about the nitty gritties of SQL while working on Python applications thanx to ZODB. And off course if you need a database server so that you can run multiple copies of your application backed by the same server ZEO comes to your rescue on top of ZODB.
Zope started with the idea of being an Object Publishing Environment. From that perspective mapping the URL directly to the object hierarchy in ZODB was great. The URLs simply reflect the hierarchy of objects. Now so far as figuring out the URL is considered, there is always the Rotterdam debugging interface for help. For development work, I keep the development flags on in the zope configuration and look at the contents of ZODB through the Rotterdam interface. Rotterdam skin provide a great way of introspecting the Python objects stored inside the ZODB and figuring out the URLs is much more interactive. Moreover, for major containers inside my ZODB, I register them as persistent utilities inside the site manager (Zope 3 sites and site managers). Anywhere in my code, whenever I need access to such containers, all I do is getUtility(IMyContainerType). I don't even have to remember the detailed locations of those containers inside the code base. They are once registered with the site manager and going forward available anywhere inside the code base through getUtility() calls.
And the URLs also support namespaces. For example using the ++skin++ namespace, you can anytime change the skin of your web application. Using the ++language++ namespace, you can any time change the preferred language of your user interface. Using the ++attributes++ namespace you can access individual attributes of an object. URLs are simply much more powerful and much more customizable. And you can write traversal adapters, define your own namespaces, to enhance the capabilities of your URLs. To give an example, all pages which are directly accessible from the web interface, are part of my default skin. While all pages which are invoked through background AJAX calls, are under a different skin. This way, one can implement different ways of authentication mechanisms in different skins. In main skin, one is redirected to a different login page in case of authentication failure. For AJAX pages, one could simply receive an HTTP error. This could be centrally done. Zope 3 objects have interfaces and one view can be defined for multiple interfaces. Wherever you have an object which supports the given interface, all associated views become automatically available and all such URLs are automatically valid. If you think about it, its a much more powerful than a single python file or XML file where the URLs are hard-coded. I don't really know much about DJango and J2EE so cannot say if they have equivalent capability.

ZODB is a OO-style database that doesn't need a schema definition. You can simply create (nearly) all kinds of objects, and persist them.
The TTW is sometimes annoying, but you can mount the ZOPE-object-tree using webdav. Then you can edit the templates and scripts using your favorite editor.
ZOPE is especially powerful for creating CMS-like systems, IMHO there it is still unmatched - you'd have to go through a lot to make it work equally well in Django.
And through the TTW, actually non-developers like designers have a good chance of developing e.g. templates and CSS without need for developer interaction.

+1 on Wheat's answer, above: "Zope 2 is really only interesting to learn from a historical perspective". I did Zope dev for a large site for a couple of years, 50% zope 2, 50% zope 3. Even then (this was 2 years ago) we were working to migrate everything off of zope 2. Unless you already have a lot invested in an existing Zope 2 project, there's no reason to use it; there's just not much of future there. And if you do have a big existing zope 2 project, I'd suggest taking a look at a product caled Five (a joke: 2 + 3 = 5) that aims to
allow you to integrate Zope 3
technologies into Zope 2. Among
others, it allows you to use Zope 3
interfaces, ZCML-based configuration,
adapters, browser pages (including
skins, layers, and resources),
automated add and edit forms based on
schemas, object events, as well as
Zope 3-style i18n message catalogs.
When all is said and done, Zope 3 is a very different framework from 2, and IMHO, a much better (albeit more complicated) one. TTW is optional, and not recommended for most cases. Implicit acquisition is gone.
Looks like people here have covered why you might want to use the ZODB, so I thought I'd mention one other thing about Zope 3 (or Zope 2 using Five) that's good. Zope has a very powerful system for wiring together different application components called the Zope Component Architecture (ZCA). It allows you to write components that are more or less autonomous and reusable, and which can be plugged together in a standardized way. I mostly do Django development now and I sometimes find myself missing the ZCA. In Django, the ability to write reusable components is limited and kind of ad-hoc. But, like Reinout says zope.component (like most zope packages, including the ZODB) works outside of the zope framework and could be used in a Django project.
That said, the ZCA has its drawbacks, one of which is the tedious process of registering your components in XML files; it always felt a little Java-esqe to me. One reason I really like Grok http://grok.zope.org/ is that it sits on top of zope.component and does much of that grunt work for you.
So bottom line: Zope 2 is mostly a dead end. If your employer is amenable to it, start looking at Zope 3, or at least Five. I think you'll find Zope 3 has a steep learning curve compared to Django, so it might be a good idea to come at it via Grok, which smooths out a lot of Zope 3's rougher edges. But, I think for a really large or complex web application with lots of moving parts, I'd go for Zope over Django (and I say this as someone who really likes Django a lot). For smaller projects, Django would probably be faster. Quantifying "large" and "small" in this context is hard though, and would probably require a couple of thousand more words. If you really are interested in Zope 3, the book by Philipp von Weitershausen is definitely the place to start.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.