Is it reasonable to save data as python modules? - python

This is what I've done for a project. I have a few data structures that are basically dictionaries with some methods that operate on the data. When I save them to disk, I write them out to .py files as code that, when imported as a module, will load the same data into such a data structure.
Is this reasonable? Are there any big disadvantages? The advantage I see is that when I want to operate on the saved data, I can quickly import the modules I need. Also, the modules can be used separately from the rest of the application, because you don't need a separate parser or loader.

By operating this way, you may gain some modicum of convenience, but you pay a price in several ways. The space it takes to save your data, and the time it takes to both save and reload it, go up substantially; and your security exposure is unbounded -- you must ferociously guard the paths from which you reload modules, as they would provide an easy avenue for any attacker to inject code of their choice to be executed under your userid (pickle itself is not rock-solid, security-wise, but, compared to this arrangement, it shines;-).
All in all, I prefer a simpler and more traditional arrangement: executable code lives in one module (on a typical code-loading path, that does not need to be R/W once the module's compiled) -- it gets loaded just once and from an already-compiled form. Data live in their own files (or portions of DB, etc) in any of the many suitable formats, mostly standard ones (possibly including multi-language ones such as JSON, CSV, XML, ... &c, if I want to keep the option open to easily load those data from other languages in the future).
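As a rough illustration of that arrangement, here is a minimal sketch (file name and keys are invented for the example, not taken from the question) of keeping the data in a JSON file and loading it on demand, so nothing on disk is executable:

```python
import json

def save_data(data, path="data.json"):
    # Write a plain dict out as JSON -- no executable code on disk.
    with open(path, "w") as f:
        json.dump(data, f, indent=2)

def load_data(path="data.json"):
    # Read the dict back; nothing gets executed, only parsed.
    with open(path) as f:
        return json.load(f)

if __name__ == "__main__":
    save_data({"pages": ["/index", "/about"], "tags": ["a", "b"]})
    print(load_data())
```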

It's reasonable, and I do it all the time. Obviously it's not a format you use to exchange data, so it's not a good format for anything like a save file.
But for example, when I do migrations of websites to Plone, I often get data about the site (such as a list of which pages should be migrated, a list of how old URLs should be mapped to new ones, or lists of tags). These you typically get in Word and Excel format. The data also often needs a bit of massaging, and I end up with what is, for all intents and purposes, a dictionary mapping one URL to some other information.
Sure, I could save that as CSV and parse it into a dictionary. But instead I typically save it as a Python file containing a dictionary. Saves code.
So, yes, it's reasonable; no, it's not a format you should use for any sort of save file. It is, however, often used for data that straddles the border with configuration, as above.
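For the record, the pattern being described is roughly this (file and variable names are made up for illustration): one module that is nothing but a literal, imported wherever it's needed.

```python
# url_map.py -- a generated or hand-edited "data module" (hypothetical example)
URL_MAP = {
    "/old/about.html": "/about",
    "/old/contact.html": "/contact",
}
```

```python
# migrate.py -- uses the data simply by importing it
from url_map import URL_MAP

for old, new in URL_MAP.items():
    print("redirect %s -> %s" % (old, new))
```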

The biggest drawback is that it's a potential security problem, since it's hard to guarantee that the files won't contain arbitrary code, which could be really bad. So don't use this approach if anyone other than you has write access to the files.

A reasonable option might be to use the Pickle module, which is specifically designed to save and restore python structures to disk.
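A hedged sketch of what that could look like (the file name and data are purely illustrative):

```python
import pickle

data = {"pages": ["/index", "/about"], "count": 2}

# Save the structure to disk...
with open("data.pkl", "wb") as f:
    pickle.dump(data, f, pickle.HIGHEST_PROTOCOL)

# ...and restore it later.
with open("data.pkl", "rb") as f:
    restored = pickle.load(f)

assert restored == data
```

The same caveat as above applies: only unpickle files you trust, since unpickling can execute code.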

Alex Martelli's answer is absolutely insightful and I agree with him. However, I'll go one step further and make a specific recommendation: use JSON.
JSON is simple, and Python's data structures map well onto it; and there are several standard libraries and tools for working with JSON. The json module in the standard library (added in Python 2.6) is based on simplejson, so I would use simplejson on older Pythons and the built-in json everywhere else.
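The usual compatibility shim from that era looks something like this (only needed on Pythons older than 2.6, where the stdlib json module does not exist):

```python
try:
    import json  # standard library on Python 2.6+ and 3.x
except ImportError:
    import simplejson as json  # third-party fallback on older Pythons

text = json.dumps({"name": "example", "values": [1, 2, 3]})
data = json.loads(text)
```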
Second choice is XML. XML is more complicated, and harder to just look at (or just edit with a text editor) but there is a vast wealth of tools to validate it, filter it, edit it, etc.
Also, if your data storage and retrieval needs become at all nontrivial, consider using an actual database. SQLite is terrific: it's small, and for small databases runs very fast, but it is a real actual SQL database. I would definitely use a Python ORM instead of learning SQL to interact with the database; my favorite ORM for SQLite would be Autumn (small and simple), or the ORM from Django (you don't even need to learn how to create tables in SQL!) Then if you ever outgrow SQLite, you can move up to a real database such as PostgreSQL. If you find yourself writing lots of loops that search through your saved data, and especially if you need to enforce dependencies (such as if foo is deleted, bar must be deleted too) consider going to a database.
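To illustrate that last point, here is a small sketch (table and column names are invented) of SQLite enforcing exactly that kind of dependency with a foreign key, so you don't have to write the cleanup loops yourself:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # must be enabled per connection in SQLite

conn.execute("CREATE TABLE foo (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("""CREATE TABLE bar (
    id INTEGER PRIMARY KEY,
    foo_id INTEGER REFERENCES foo(id) ON DELETE CASCADE,
    value TEXT)""")

conn.execute("INSERT INTO foo VALUES (1, 'parent')")
conn.execute("INSERT INTO bar VALUES (1, 1, 'child')")

conn.execute("DELETE FROM foo WHERE id = 1")   # the bar row goes away automatically
print(conn.execute("SELECT COUNT(*) FROM bar").fetchone()[0])  # -> 0
```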

Related

Searchable database-like structure for python objects?

I am looking for a database-like structure which can hold Python objects, each of which has various fields which can be searched. Some searching yields ZODB, but I'm not sure that's what I want.
To explain, I have objects which can be written/read from disk in a given file format. I would like a way to organize and search many sets of those objects. Currently, I am storing them in nested dictionaries, which are populated based on the structure of the file-system and file names.
I would like a more database-like approach, but I don't think I need a database. I would like to be able to save the structure to disk, but I don't want to interface with a server or anything like that.
Perhaps I just want to use NumPy's structured arrays? http://docs.scipy.org/doc/numpy/user/basics.rec.html
Python comes with a database... http://docs.python.org/library/sqlite3.html
I think what you are looking for is python shelve if you are just looking to save the data to disk. It is also almost 100% compliant with the dict interface so you will have to change very little code.
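A quick sketch of how little changes when switching a dict to shelve (the file name is arbitrary, and keys must be strings):

```python
import shelve

db = shelve.open("objects.db")      # behaves much like a dict, backed by a file
db["obj1"] = {"field_a": 1, "field_b": "x"}
db["obj2"] = {"field_a": 2, "field_b": "y"}

# "searching" is still just a plain Python loop over the items
matches = [key for key, obj in db.items() if obj["field_a"] == 2]
print(matches)

db.close()
```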
ZODB will do what you are looking for. You can have your data structured there exactly as you have it already. Instead of using built-in dicts, you'll have to switch to persistent mappings, and if they tend to be large, you should probably start with BTrees right from the beginning. Here is a good starting point for further reading, with simple examples and tutorials. As you'll see, ZODB doesn't provide an SQL-like access language, but as it looks like you have already structured your data to fit into dictionaries, you probably won't miss that anyway. ZODB, compared to plain shelve, also offers to compress your data on the fly before writing it to disk. In general, it provides a lot more options than shelve and shows no significant downside compared to it, so I can warmly recommend it. I switched to it a year ago, because typical relational DBs together with an ORM performed horribly slowly, especially for insert-or-update operations. With ZODB I'm about an order of magnitude faster here, as there is no communication overhead between the application and the database -- the database is part of your application already.
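A minimal ZODB sketch along those lines (file name and structure invented; treat it as a starting point rather than a tested recipe):

```python
from ZODB.FileStorage import FileStorage
from ZODB.DB import DB
from BTrees.OOBTree import OOBTree
import transaction

db = DB(FileStorage("objects.fs"))
conn = db.open()
root = conn.root()

if "objects" not in root:
    root["objects"] = OOBTree()       # scales better than one big persistent dict

root["objects"]["some-key"] = {"field_a": 1, "field_b": "x"}
transaction.commit()                   # nothing reaches the disk until you commit

print(dict(root["objects"]))
db.close()
```

Note that if the values are themselves mutable mappings you plan to modify in place, you'd want persistent.mapping.PersistentMapping instead of plain dicts so ZODB notices the changes.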

MySql bulk import without writing a file to disk

I have a django site running with a mysql database backend. I accept fairly large uploads from one of the admin users to bulk import some data. The data comes in a format that is slightly different than the form it needs to be in the database so I need to do a little parsing.
I'd like to be able to pivot this data into CSV, write it into a cStringIO object, and then simply use MySQL's bulk import command to load that file. I'd prefer to skip writing the file to disk first, but I can't seem to find a way around it. I've done basically this exact thing with PostgreSQL in the past, but unfortunately this project is on MySQL.
The short version: can I take an in-memory file-like object and somehow use MySQL's bulk import operation on it?
There is an excellent tutorial called Generator Tricks for Systems Programmers that addresses processing large log files, which is similar, but not identical, to your situation. As long as you can perform the needed transform with access to only the current (and possibly previous) data in the stream, this may work for you.
I have mentioned this gem in a number of answers because I think that it introduces a different way of thinking that can be quite valuable. There is a companion piece, A Curious Course on Coroutines and Concurrency, that can seriously twist your head around.
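In the spirit of that tutorial, the massaging stage can be a chain of generators over the uploaded stream, something like the sketch below (the three-column layout is invented for illustration):

```python
import csv

def parse_rows(fileobj):
    # Yield rows from the uploaded file one at a time, never all at once.
    for row in csv.reader(fileobj):
        yield row

def massage(rows):
    # Apply whatever per-row fix-ups the target schema needs.
    for name, raw_date, value in rows:
        yield name.strip(), raw_date.replace("/", "-"), int(value)

def pipeline(fileobj):
    return massage(parse_rows(fileobj))

# usage: for row in pipeline(uploaded_file): writer.writerow(row)
```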
If by "bulk import" you mean LOAD DATA [LOCAL] INFILE then, no, there's no way around first writing the data to some file, damn it all. You (and I) would really like to write the table directly from an array.
But some OSs, like Linux, allow a RAM-resident filesystem that eases some of the hurt. I'm not enough of a sysadmin to know how to set up one of these guys; I had to get my ISP's tech support to do it for me. I found an article that might have useful info.
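On many Linux systems /dev/shm is already a tmpfs mount, so you may not even need to set anything up. A hedged sketch (connection details are made up, and MySQLdb's local_infile flag may or may not be required depending on your server configuration):

```python
import csv
import MySQLdb

TMP_PATH = "/dev/shm/bulk_import.csv"   # tmpfs: the "file" lives in RAM

def bulk_import(rows, table):
    # Write the massaged rows to the RAM-backed file...
    with open(TMP_PATH, "w") as f:
        csv.writer(f).writerows(rows)

    # ...then let MySQL slurp it with its bulk loader.
    conn = MySQLdb.connect(host="localhost", user="user", passwd="secret",
                           db="mydb", local_infile=1)
    cur = conn.cursor()
    cur.execute("LOAD DATA LOCAL INFILE '%s' INTO TABLE %s "
                "FIELDS TERMINATED BY ',' LINES TERMINATED BY '\\n'"
                % (TMP_PATH, table))
    conn.commit()
    conn.close()
```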
HTH

Using Python to generate JSON Data

I'm working on a little project, the aim being to generate a report from a database for a server. The database is SQLite and contains tables like 'connections', 'downloads', etc.
The report I produce will ultimately contain a number of graphs displaying things like 'connections per day', 'top downloads this month', etc.
I plan to use flot for the graphs because the graphs it makes look very nice.
This is my current plan for how my reports will work:
Static .HTML file which is the report. This will contain headings, embedded flot graphs, etc.
JSON Data file. These will be generated by my report generation python script, they will basically contain a JSON variable for each graph representing the dataset that the graph should map. ([100,2009-2-2],[192,2009-2-3]...)
Report generation python script, this will load the SQLite database, run a list of set SQL queries and spit out the JSON Data files.
Does this sound like a sensible setup? I can't help but feel it could be improved, but I don't see how. I want the reports to be static. The server they run on cannot take heavy loads, so a dynamically generated report is out of the question and also unnecessary for this application.
My concerns are:
I feel that the Python script is largely pointless, all of the processing performed is done by SQLite, my script is basically going to be used to store SQL queries and package up the output. With a bit more work SQLite could probably do this for me.
It seems I'm solving a problem that must have been solved many times before 'take sql queries, spit out pretty graphs in a daily report' must have been done hundreds of times. I'm just having trouble tracking down any broad implementations.
It sounds sensible to me.
You need some programming language to talk to SQLite. You could do it in C, but if you can write the necessary glue code easily in Python, why not? You'll almost certainly save more time writing it than you'll lose from not having the most efficient possible program.
There are definitely programs to analyse logs for you - I've heard of Piwik, for instance. That's for dynamic reports, but no doubt there are projects to do static reports too. But they may not fit the data and output you're after. Writing it yourself means you know precisely what you're getting, so if it's not too much work, carry on.
I feel that the Python script is largely pointless, all of the processing performed is done by SQLite, my script is basically going to be used to store SQL queries and package up the output. With a bit more work SQLite could probably do this for me.
Maybe so, but even then, Python is a great glue language. Also, if you need to do some processing SQLite isn't good at, Python is already there.
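To put some numbers on "glue": the whole report-generation step can be on the order of a dozen lines. A sketch (the queries, column names and file names are invented to match the description, not taken from your schema):

```python
import json
import sqlite3

QUERIES = {
    "connections_per_day":
        "SELECT date(ts), COUNT(*) FROM connections GROUP BY date(ts)",
    "top_downloads_month":
        "SELECT path, COUNT(*) AS n FROM downloads GROUP BY path "
        "ORDER BY n DESC LIMIT 10",
}

conn = sqlite3.connect("server.db")
for name, sql in QUERIES.items():
    rows = conn.execute(sql).fetchall()
    with open(name + ".json", "w") as f:
        json.dump(rows, f)          # a list of [x, y] pairs, which flot can plot
```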
It seems I'm solving a problem that must have been solved many times before 'take sql queries, spit out pretty graphs in a daily report' must have been done hundreds of times. I'm just having trouble tracking down any broad implementations.
I think you're leaning towards the general class of HTTP-served reporting. One thing out there that overlaps your problem set is Django, which provides a Python interface between database (SQLite is supported) and web server, along with a templating system for your outputs.
If you just want one or two pieces of a solution, then I recommend looking at SQLAlchemy for interfacing with the database, Jinja2 for templating, and/or Werkzeug for HTTP server interface.

Writing a key-value store

I am looking to write a key/value store (probably in Python), mostly just for experience, and because it's something I think is a very useful product. I have a couple of questions. How, in general, are key/value pairs normally stored in memory and on disk? How would one go about loading the things stored on disk back into memory? Do key/value stores keep all the key/value pairs in memory at once, or are they read from disk?
I tried to find some literature on the subject, but didn't get very far and was hoping someone here could help me out.
It all depends on the level of complexity you want to dive into. Starting with a simple Python dict serialized to a file in a myriad of possible ways (of which pickle is probably the simplest), you can go as far as implementing a complete database system.
Look up redis - it's a key/value store written in C and operating as a server "DB". It has some good documentation and easy to read code, so you can borrow ideas for your Python implementation.
To go even farther, you can read about B-trees.
For your specific questions: above some DB size, you can never keep it all in memory, so you need some robust way of loading data from disk. Also consider whether the store is single-client or multi-client. This has serious consequences for its implementation.
Have a look at Python's shelve module, which provides a persistent dictionary. It basically stores pickles in a database, usually dbm or Berkeley DB. Looking at how shelve works will give you some insights, and the source code comes with your Python distribution.
Another product to look at is Durus. This is an object database that uses its own B-tree implementation for persistence to disk.
If you're doing a key/value store in Python for learning purposes, it might be easiest to start with the pickle module. It's a fast and convenient way to write an arbitrary Python data stream to a persistent store and read it back again.
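For learning purposes, a toy store really can start out as just a dict plus pickle, something like the sketch below (deliberately naive: everything lives in memory and the whole file is rewritten on every change):

```python
import os
import pickle

class TinyKV(object):
    """A toy key/value store: all pairs in memory, pickled to one file."""

    def __init__(self, path):
        self.path = path
        self.data = {}
        if os.path.exists(path):
            with open(path, "rb") as f:
                self.data = pickle.load(f)

    def get(self, key, default=None):
        return self.data.get(key, default)

    def put(self, key, value):
        self.data[key] = value
        self._flush()

    def delete(self, key):
        self.data.pop(key, None)
        self._flush()

    def _flush(self):
        # Naive persistence: rewrite the whole store after every change.
        with open(self.path, "wb") as f:
            pickle.dump(self.data, f, pickle.HIGHEST_PROTOCOL)

store = TinyKV("tiny.db")
store.put("answer", 42)
print(store.get("answer"))
```

Once this feels too slow or too fragile, you have a concrete reason to look at append-only logs, B-trees, and the other techniques mentioned in the answers above.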
You can have a look at Berkeley DB to see how it works; it is a key/value DB, so you can use it directly, or, as it is open source, see how it handles persistence, transactions and paging of the most frequently referenced pages.
There are Python bindings to it: http://www.jcea.es/programacion/pybsddb.htm
Amazon released a paper about Dynamo -- a highly available key-value storage system. It mostly deals with the scaling issues (how to create a key/value store that runs on a large number of machines), but it also covers some basics and is generally worth reading.
First up, I know this question is quite old.
I'm the creator of aodbm ( http://sf.net/projects/aodbm/ ), which is a key-value store library. aodbm uses immutable B+Trees to store your data. So whenever a modification is made a new tree is appended to the end of the file. This probably sounds like a horrendous waste of space, but seeing as the vast majority of nodes from the previous tree are referenced, the overhead is actually quite low. Very little of an entire tree is kept in memory at any given time (at most O(log n)).
I recommend watching the talk Write Optimization in External-Memory Data Structures (slides), which gives a nice overview of modern approaches to building external-memory databases (e.g. key-value stores) and explains log-structured merge trees.
If your key-value store targets use cases where all the data fits in main memory, the data store architecture can be a lot simpler: map a file to a big chunk of memory and work with that memory, without bothering about disk-to-memory communication and synchronization at all, because that becomes the operating system's concern.
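The Python-level version of "map a file and let the OS worry about it" is the mmap module. A rough sketch (fixed file size and byte offsets are invented; a real store would lay out records on top of this region):

```python
import mmap
import os

PATH = "store.bin"
SIZE = 1024 * 1024           # pre-size the file; mmap will not grow it for you

if not os.path.exists(PATH):
    with open(PATH, "wb") as f:
        f.write(b"\x00" * SIZE)

with open(PATH, "r+b") as f:
    mem = mmap.mmap(f.fileno(), SIZE)
    mem[0:5] = b"hello"      # writes go to memory; the OS flushes pages to disk
    mem.flush()              # or force a sync explicitly
    print(mem[0:5])
    mem.close()
```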

Fast, searchable dict storage for Python

Currently I use SQLite (with SQLAlchemy) to store about 5000 dict objects. Each dict object corresponds to an entry in PyPI, with keys such as name, version and summary (sometimes 'description' can be as big as the project documentation).
Writing these entries (from JSON) back to the disk (SQLite format) takes several seconds, and it feels slow.
Writing is done as frequently as once a day, but reading/searching for a particular entry based on a key (usually name or description) is done very often.
Just like apt-get.
Is there a storage library for use with Python that will suit my needs better than SQLite?
Did you put indices on name and description? Searching 5000 indexed entries should be essentially instantaneous (of course ORMs will make your life much harder, as they usually do, even relatively good ones such as SQLAlchemy) -- try "raw sqlite" and it absolutely should fly.
Writing just the updated entries (again with raw SQL) should also be basically instantaneous -- ideally a single UPDATE statement should do it, but even a thousand should be no real problem; just make sure to turn off autocommit at the start of the loop (and, if you want, turn it back on again later).
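A sketch of that "raw sqlite" route for this workload (the schema is invented to match the fields mentioned in the question):

```python
import sqlite3

conn = sqlite3.connect("pypi.db")
conn.execute("""CREATE TABLE IF NOT EXISTS packages (
    name TEXT PRIMARY KEY, version TEXT, summary TEXT, description TEXT)""")
conn.execute("CREATE INDEX IF NOT EXISTS idx_description ON packages(description)")

entries = [("foo", "1.0", "A foo package", "..."),
           ("bar", "0.2", "A bar package", "...")]

with conn:                                  # one transaction for the whole batch
    conn.executemany(
        "INSERT OR REPLACE INTO packages VALUES (?, ?, ?, ?)", entries)

rows = conn.execute(
    "SELECT name, version FROM packages WHERE name = ?", ("foo",)).fetchall()
print(rows)
```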
It might be overkill for your application, but you ought to check out schema-free/document-oriented databases. Personally I'm a fan of CouchDB. Basically, rather than storing records as rows in a table, something like CouchDB stores key-value pairs, and then (in the case of CouchDB) you write views in JavaScript to cull the data you need. These databases are usually easier to scale than relational databases, and in your case may be much faster, since you don't have to hammer your data into a shape that will fit into a relational database. On the other hand, it means there is another service running.
Given the approximate number of objects stated (around 5,000), SQLite is probably not the speed problem. It's the intermediate steps -- for example the JSON conversion, or possibly non-optimal use of SQLAlchemy.
Try this out (fairly fast even for a million objects):
y_serial.py module :: warehouse Python objects with SQLite
"Serialization + persistance :: in a few lines of code, compress and annotate Python objects into SQLite; then later retrieve them chronologically by keywords without any SQL. Most useful "standard" module for a database to store schema-less data."
http://yserial.sourceforge.net
The yserial search on your keys is done using the regular expression ("regex") code on the SQLite side, not Python, so there's another substantial speed improvement.
Let us know how it works out.
I'm solving a very similar problem for myself right now, using Nucular, which might suit your needs. It's a file-system-based store and seems very fast indeed. (It comes with an example app that indexes the whole Python source tree.) It's concurrent-safe, requires no external libraries and is pure Python. It searches rapidly and has powerful full-text search, indexing and so on -- kind of a specialised, in-process, native Python-dict store in the manner of the trendy CouchDB and MongoDB, but much lighter.
It does have limitations, though - it can't store or query on nested dictionaries, so not every JSON type can be stored in it. Moreover, although its text searching is powerful, its numerical queries are weak and unindexed. Nonetheless, it may be precisely what you are after.
