Is it feasible to store data for a web application inside the program itself, e.g. as a large dictionary? The data would mostly be just a few hundred short-ish text blocks (roughly blog post size), and it will not be altered/added to at all by the users (although I would want to be able to update it myself every so often).
Up until now I've been looking at standard database storage solutions, but since (for this set of objects at least) they will not be modified, is it possible simply to store them as a dictionary? Or are there serious downsides I have not considered?
Thanks.
It is certainly possible. The amount of data you can store will be limited mostly by memory available.
If you are planning to perform database-like operations on the data, then you are better off with an in-memory database like SQLite. If the data is picked up from a database and cached by key, then you might want to use Memcached. If you are limiting yourself to plain text and the data won't grow, then you can stick to memory.
The approach will start losing its charm if the data grows in the future; in that case you are better off with other solutions.
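For a few hundred static text blocks, the simplest version of this is just a module-level dictionary that the rest of the application imports. A minimal sketch (the module name, keys and handler are placeholders, not from your app):

    # content.py -- the text blocks live in the program itself
    POSTS = {
        "about-us": "A few paragraphs of blog-post-sized text...",
        "faq": "Another short-ish block of static text...",
    }

    # elsewhere in the web app (hypothetical handler)
    from content import POSTS

    def show(slug):
        return POSTS.get(slug, "Not found")

Updating the data then just means editing the module and redeploying, which matches the "I update it myself every so often" workflow.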
Suppose I have a tabular dataset of fixed dimension (N x M). I receive a stream of updates from Kafka updating entries in this table. Ultimately, I'd like to have a pandas dataframe with a recent version of the table, and I'm considering a few options for doing that:
Maintain it in memory as a table / dataframe. My concern here is that I don't know if I can avoid multithreading, since one process will perpetually be in a loop receiving messages.
Maintain it in an external structure, and have a separate process independently read from it. Choices of external data stores:
a) SQLite - Might have concurrency issues, and updates for arbitrary rows are probably a bit messy.
b) Redis - Easy to maintain, but hard to query / read the whole table at once (which is how I would generally be accessing the data).
I'm a bit of a Kafka beginner, so any advice here would be appreciated. How would you approach this problem? Thanks!
EDIT: I guess I could also just maintain it in memory and then just push the whole thing to SQLite?
My initial approach would be to ask: can I create a "good enough" solution to start with, and optimize it later if needed?
Unless you need to worry about very sensitive information (like healthcare or finance data), or data that is definitely going to scale up very quickly, I would suggest trying a simple solution first and seeing if you hit any problems. You may not!
Ultimately, I would probably go with the SQLite solution to start with, as it's relatively simple to set up and it's a good fit for the use case (i.e. "transactional" situations).
Here are some considerations I would think about:
Pros/cons of a single process
Unless your data is high-velocity / high-volume, your suggestion of consuming and processing the data in the same process is probably fine. Processing data locally is much faster than receiving it over the network (assuming your Kafka feed isn't on your local computer), so your data ingest from Kafka would probably be the bottleneck.
That said, it could be expensive to have a Python process spinning indefinitely, and you would need to make sure to persist your data to a file or database so it isn't lost if your process shuts down.
Relational database (e.g. SQLite)
Using a relational database like SQLite is probably your best bet, once again depending on the velocity of the data you're receiving. Relational databases are used all the time for transactional purposes (in fact, that's one of their primary intended purposes), meaning a high volume and velocity of writes, so it would definitely make sense to persist your data in SQLite and make your updates there as well. You could break your data into separate tables if that made sense (e.g. third normal form), or keep it all in one table if that was a better fit.
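As a rough illustration of how un-messy per-row updates can be in SQLite, here is a minimal sketch; the table name, columns and key are assumptions, not taken from your actual data:

    import sqlite3

    conn = sqlite3.connect("snapshot.db")
    conn.execute("""CREATE TABLE IF NOT EXISTS snapshot (
                        row_id INTEGER PRIMARY KEY,
                        col_a  REAL,
                        col_b  REAL)""")

    def apply_update(row_id, col_a, col_b):
        # One statement per Kafka message; the PRIMARY KEY keeps one row per id.
        with conn:  # commits (or rolls back) the transaction automatically
            conn.execute(
                "INSERT OR REPLACE INTO snapshot VALUES (?, ?, ?)",
                (row_id, col_a, col_b),
            )

    apply_update(7, 1.5, 2.5)

    # Reading the whole table back into pandas when you need it:
    import pandas as pd
    df = pd.read_sql_query("SELECT * FROM snapshot", conn)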
Maintain the table in memory
You could also keep the table in memory, like you suggested, as long as you're persisting it to disk in some fashion (CSV, SQLite, etc.) after updates. For example, you could:
Have your copy in memory.
When you get an update, make the update to your in-memory table.
Write the table to disk.
If your process stops or restarts, read the table back from disk to start.
Pandas can be slower for accessing and updating individual values in rows, though, so it might actually make more sense to keep your table in memory as a dictionary or something and write it to disk without using pandas. But if you can get away with doing it all in pandas (re: velocity and volume), that could be a fine way to start too.
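A rough sketch of that dictionary-plus-persistence loop, assuming messages arrive as JSON dicts via kafka-python (the topic name and message shape are placeholders):

    import json
    import sqlite3
    from kafka import KafkaConsumer  # kafka-python

    conn = sqlite3.connect("snapshot.db")
    conn.execute("CREATE TABLE IF NOT EXISTS snapshot (row_id INTEGER PRIMARY KEY, payload TEXT)")

    # On (re)start, rebuild the in-memory copy from the last persisted state.
    table = {row_id: json.loads(payload)
             for row_id, payload in conn.execute("SELECT row_id, payload FROM snapshot")}

    consumer = KafkaConsumer("table-updates",
                             value_deserializer=lambda v: json.loads(v))
    for message in consumer:
        update = message.value                    # e.g. {"row_id": 7, "col_a": 1.5}
        table[update["row_id"]] = update          # 1. update the in-memory table
        with conn:                                # 2. persist so a crash loses nothing
            conn.execute("INSERT OR REPLACE INTO snapshot VALUES (?, ?)",
                         (update["row_id"], json.dumps(update)))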
I'm migrating a GAE/Java app to Python (non-GAE) due to the new pricing, so I'm getting a small server and I would like to find a database that fits the following requirements:
Low memory usage (or tunable or predictable memory usage)
Fast querying capability for simple document/tree-like data identified by key (I don't care about write performance, and I assume it will have indexes)
Bindings with PyPy 1.6 compatibility (or Python 2.7 at least)
My data goes something like this:
Id: short key string
Title
Creators: an array of another data structure which has an id (used as a key), a name, a site address, etc.
Tags: an array of tags. Each of them can have multiple parent tags, a name, an id too, etc.
License: a data structure which describes its license (CC, GPL, ... you say it) with name, associated URL, etc.
Addition time: when it was added to our site.
Translations: pointers to other entries that are translations of one creation.
My queries are very simple. Usual cases are:
Filter by tag ordered by addition time.
Select a few (pagination) ordered by addition time.
(Maybe, not done yet) filter by creator.
(Not done but planned) some autocomplete features in forms, so I'm going to need to search whether some fields contain a substring ('LIKE' queries).
The data volume is not big. Right now I have about 50MB of data but I'm planning to have a huge dataset around 10GB.
Also, I want to rebuild this from scratch, so I'm open to any option. What database do you think can meet my requirements?
Edit: I want to do some benchmarks around different options and share the results. I have selected, so far, MongoDB, PostgreSQL, MySQL, Drizzle, Riak and Kyoto Cabinet.
The path of least resistance for migrating an app engine application will probably be using AppScale, which implements a major portion of the app engine API. In particular, you might want to use the HyperTable data-store, which closely mirrors the Google App Engine datastore.
Edit: ok, so you're going for a redesign. I'd like to go over some of the points you make in your question.
Low memory usage
That's pretty much the opposite of what you want in a database: you want as much of your dataset in core memory as possible. This could mean tuning the dataset itself so it fits efficiently, or adding memcached nodes so that you can spread the dataset across several hosts, each holding a small enough fraction of the dataset that it fits in core.
To drive this point home, consider that reading a value from RAM is about 1000 times faster than reading it from disk. A database that can satisfy every query from core can handle roughly 10 times the workload of a database that has to visit the disk for just 1% of its queries: with the 1000x figure, the average query costs about 0.99 × 1 + 0.01 × 1000 ≈ 11 time units instead of 1.
I'm planning to have a huge dataset around 10GB.
I don't think that you could call 10GB a 'huge dataset'. In fact, that's something that could probably fit in the RAM of a reasonably large database server; you wouldn't need more than one memcached node, much less additional persistence nodes (typical disk sizes are in the terabytes, 100 times larger than this expected dataset).
Based on this information, I would definitely advise using a mature database product like PostgreSQL, which would give you plenty of performance for the data you're describing and easily provides all of the features you're talking about. If the time comes that you need to scale past what PostgreSQL can provide, you'll have a real workload to analyse to know where the bottlenecks really are.
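To make that concrete, the queries you listed are bread-and-butter SQL. A hedged sketch using psycopg2, with made-up table and column names (creations, creation_tags, added_at) and a hypothetical connection string:

    import psycopg2

    conn = psycopg2.connect("dbname=catalog user=app")  # hypothetical DSN
    cur = conn.cursor()

    # Filter by tag, ordered by addition time, with pagination.
    cur.execute("""
        SELECT c.id, c.title, c.added_at
        FROM creations c
        JOIN creation_tags ct ON ct.creation_id = c.id
        WHERE ct.tag_id = %s
        ORDER BY c.added_at DESC
        LIMIT %s OFFSET %s
    """, ("music", 20, 0))
    page = cur.fetchall()

    # Substring matching for autocomplete (ILIKE is case-insensitive LIKE).
    cur.execute("SELECT id, title FROM creations WHERE title ILIKE %s LIMIT 10",
                ("%guit%",))
    suggestions = cur.fetchall()

Indexes on added_at and on (tag_id, creation_id) keep the first two patterns fast, and the pg_trgm extension can index the substring searches if they ever become a bottleneck.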
I would recommend PostgreSQL, only because it does what you want, can scale, is fast, is rather easy to work with and is stable.
It is exceptionally fast at the example queries given, and could be even faster with document querying.
I'm creating and processing a very large data set, with about 34 million data points, and I'm currently storing them in Python dictionaries in memory (about 22,500 dictionaries, with 15 dictionaries in each of 1588 class instances). While I'm able to manage this all in memory, I'm using up all of my RAM and most of my swap.
I need to be able to first generate all of this data, and then do analysis on select portions of it at a time. Would it be beneficial from an efficiency standpoint to write some of this data to file, or store it in a database? Or am I better off just taking the hit to efficiency that comes with using my swap space? If I should be writing to a file or a database, are there any Python tools that you would recommend?
Get a relational database, fast! Or a whole lot more RAM.
If you're using Python, then start with Python Database Programming. SQLite would be a choice, but I'd suggest MySQL based upon the amount of data you're dealing with. If you want an object-oriented approach to storing your data, you might want to look at SQLAlchemy, but you'll probably get more efficiency if you end up mapping each of your object classes to a table yourself and just coping with rows and columns.
Because you will be looking at "select portions", your application will be able to make better use of core than Virtual Memory will. VM is convenient, but - by definition - kinda stupid about locality of reference.
Use a database.
I'd probably start with the sqlite3 module on the basis of simplicity, unless or until I find that it is a bottleneck.
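For reference, a minimal sketch of what that could look like with sqlite3; the table layout (one row per key/value per class instance) is an assumption about your data, not a given:

    import sqlite3

    conn = sqlite3.connect("points.db")
    conn.execute("""CREATE TABLE IF NOT EXISTS points (
                        instance_id INTEGER,
                        key TEXT,
                        value REAL)""")
    conn.execute("CREATE INDEX IF NOT EXISTS idx_points_instance ON points(instance_id)")

    def store(instance_id, data):
        # data is one of the per-instance dicts; write it in a single transaction.
        with conn:
            conn.executemany(
                "INSERT INTO points VALUES (?, ?, ?)",
                ((instance_id, k, v) for k, v in data.items()),
            )

    def load(instance_id):
        # Pull back just the portion being analysed.
        return dict(conn.execute(
            "SELECT key, value FROM points WHERE instance_id = ?", (instance_id,)))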
If you have this data in Python data structures already, assuming you're not doing a lot of in-memory indexing (more than the obvious dictionary keys index), you really don't want to use a relational database - you'll pay a considerable performance penalty for no particular benefit.
You just need to get your already key-value-pair data out of memory, not change its format. You should look into key-value stores like BDB, Voldemort, MongoDB, or Scalaris (just to name a few; some are more involved and functional than others, but all should easily handle your dataset), or for a dataset that you think might grow even larger or more complex you can look into systems like Cassandra, Riak, or CouchDB (among others). All of these systems will offer you vastly superior performance to a relational database and map more directly to an in-memory data model.
All that being said, of course, if your dataset really could be more performant by leveraging the benefits of a relational database (complex relationships, multiple views, etc.), then go for it, but you shouldn't use a relational database if all you're trying to do is get your data structures out of memory.
(It's also possible that just marshaling/pickling your data in segments and managing it yourself would offer better performance than a relational database, assuming your access pattern made paging in/out a relatively infrequent event. It's a long shot, but if you're just holding old data around and no one really looks at it, you might as well just throw that to disk yourself.)
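If you do go that do-it-yourself route, one hedged sketch is to shard the dictionary into segments by a stable key hash, pickle each segment to its own file, and load only the segment you are currently analysing (the file layout and segment count here are arbitrary choices):

    import pickle
    import zlib
    from pathlib import Path

    SEGMENTS = 64
    OUT = Path("segments")
    OUT.mkdir(exist_ok=True)

    def segment_of(key):
        # Stable across runs (the built-in hash() is randomized for strings).
        return zlib.crc32(repr(key).encode()) % SEGMENTS

    def dump_segments(big_dict):
        buckets = [dict() for _ in range(SEGMENTS)]
        for k, v in big_dict.items():
            buckets[segment_of(k)][k] = v
        for i, bucket in enumerate(buckets):
            with open(OUT / f"seg_{i}.pkl", "wb") as f:
                pickle.dump(bucket, f, protocol=pickle.HIGHEST_PROTOCOL)

    def lookup(key):
        # Pages in exactly one segment instead of the whole dataset.
        with open(OUT / f"seg_{segment_of(key)}.pkl", "rb") as f:
            return pickle.load(f)[key]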
I am looking to write a key/value store (probably in Python), mostly just for experience, and because I think it's a very useful kind of product. I have a couple of questions. How, in general, are key/value pairs normally stored in memory and on disk? How would one go about loading the things stored on disk back into memory? Do key/value stores keep all the key/value pairs in memory at once, or are they read from disk?
I tried to find some literature on the subject, but didn't get very far and was hoping someone here could help me out.
It all depends on the level of complexity you want to dive into. Starting with a simple Python dict serialized to a file in a myriad of possible ways (of which pickle is probably the simplest), you can go as far as implementing a complete database system.
Look up redis - it's a key/value store written in C and operating as a server "DB". It has some good documentation and easy to read code, so you can borrow ideas for your Python implementation.
To go even farther, you can read about B-trees.
For your specific questions: above some DB size, you can never keep it all in memory, so you need some robust way of loading data from disk. Also consider whether the store is single-client or multi-client. This has serious consequences for its implementation.
Have a look at Python's shelve module, which provides a persistent dictionary. It basically stores pickles in a database, usually dbm or BSDDB. Looking at how shelve works will give you some insights, and the source code comes with your Python distribution.
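A tiny example of shelve in action (the filename is arbitrary):

    import shelve

    # A dict-like object whose values are pickled into a dbm file on disk.
    with shelve.open("store") as db:
        db["alpha"] = {"count": 3, "tags": ["x", "y"]}   # any picklable value
        db["beta"] = "plain text"

    with shelve.open("store") as db:
        print(db["alpha"]["tags"])   # ['x', 'y'], read back from disk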
Another product to look at is Durus. This is an object database that uses its own B-tree implementation for persistence to disk.
If you're doing a key/value store in Python for learning purposes, it might be easiest to start with the pickle module. It's a fast and convenient way to write an arbitrary Python data stream to a persistent store and read it back again.
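For example, here is a toy store that loads the whole dict with pickle and rewrites the file on every put. This is only a learning sketch; the "rewrite everything" behaviour is exactly what real stores (B-trees, append-only logs) exist to avoid:

    import os
    import pickle

    class PickleKV:
        def __init__(self, path):
            self.path = path
            if os.path.exists(path):
                with open(path, "rb") as f:
                    self.data = pickle.load(f)
            else:
                self.data = {}

        def get(self, key, default=None):
            return self.data.get(key, default)

        def put(self, key, value):
            self.data[key] = value
            self._save()

        def _save(self):
            tmp = self.path + ".tmp"
            with open(tmp, "wb") as f:
                pickle.dump(self.data, f, protocol=pickle.HIGHEST_PROTOCOL)
            os.replace(tmp, self.path)  # atomic swap so a crash can't truncate the file

    kv = PickleKV("toy.db")
    kv.put("hello", [1, 2, 3])
    print(kv.get("hello"))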
You can have a look at Berkeley DB to see how it works. It is a key/value DB, so you can use it directly, or, since it is open source, see how it handles persistence, transactions and paging of the most frequently referenced pages.
Here are the Python bindings for it: http://www.jcea.es/programacion/pybsddb.htm
Amazon released a paper about Dynamo, a highly available key-value storage system. It mostly deals with the scaling issues (how to create a key/value store that runs on a large number of machines), but it also covers some basics, and it is generally worth reading.
First up, I know this question is quite old.
I'm the creator of aodbm ( http://sf.net/projects/aodbm/ ), which is a key-value store library. aodbm uses immutable B+Trees to store your data. So whenever a modification is made a new tree is appended to the end of the file. This probably sounds like a horrendous waste of space, but seeing as the vast majority of nodes from the previous tree are referenced, the overhead is actually quite low. Very little of an entire tree is kept in memory at any given time (at most O(log n)).
I recommend watching the talk "Write optimization in external-memory data structures" (slides), which gives a nice overview of modern approaches to building external-memory databases (e.g. key-value stores) and explains log-structured merge trees.
If your key-value store targets use cases where all the data fits in main memory, the data store architecture can be a lot simpler: map a file to a big chunk of memory and work with that memory, without bothering about disk-to-memory communication and synchronization at all, because that becomes a concern of the operating system.
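Python exposes this directly through the mmap module. A minimal sketch, with a made-up file name and a fixed-size region created up front:

    import mmap
    import os

    PATH = "kv.dat"
    SIZE = 1 << 20  # 1 MiB region

    if not os.path.exists(PATH):
        with open(PATH, "wb") as f:
            f.write(b"\x00" * SIZE)

    with open(PATH, "r+b") as f:
        mm = mmap.mmap(f.fileno(), SIZE)
        mm[0:5] = b"hello"   # writes go straight into the mapped pages
        print(mm[0:5])       # b'hello'
        mm.flush()           # ask the OS to sync dirty pages to disk now
        mm.close()

The operating system decides when the dirty pages actually reach the disk, which is exactly the "it becomes a concern of the OS" point above.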
Currently I use SQLite (with SQLAlchemy) to store about 5000 dict objects. Each dict object corresponds to an entry in PyPI, with keys like name, version and summary (sometimes 'description' can be as big as the project documentation).
Writing these entries (from JSON) back to the disk (SQLite format) takes several seconds, and it feels slow.
Writing is done as frequently as once a day, but reading/searching for a particular entry based on a key (usually name or description) is done very often.
Just like apt-get.
Is there a storage library for use with Python that will suit my needs better than SQLite?
Did you put indices on name and description? Searching 5000 indexed entries should be essentially instantaneous (of course ORMs will make your life much harder, as they usually do, even relatively good ones such as SQLAlchemy, but try "raw sqlite" and it absolutely should fly).
Writing just the updated entries (again with real SQL) should also be basically instantaneous; ideally a single update statement should do it, but even a thousand should be no real problem. Just make sure to turn off autocommit at the start of the loop (and turn it back on later if you want).
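A hedged sketch of the "raw sqlite" approach, with made-up table and column names; the index and the single explicit transaction are the two details that matter:

    import sqlite3

    conn = sqlite3.connect("pypi.db")
    conn.execute("""CREATE TABLE IF NOT EXISTS packages (
                        name TEXT PRIMARY KEY,
                        version TEXT,
                        summary TEXT,
                        description TEXT)""")
    # name is already indexed via the PRIMARY KEY; add one for the other search key.
    conn.execute("CREATE INDEX IF NOT EXISTS idx_packages_description "
                 "ON packages(description)")

    entries = [("foo", "1.0", "a demo package", "..."),
               ("bar", "2.1", "another demo", "...")]   # hypothetical data

    # One explicit transaction instead of autocommitting every statement.
    with conn:
        conn.executemany(
            "INSERT OR REPLACE INTO packages VALUES (?, ?, ?, ?)", entries)

    rows = conn.execute(
        "SELECT name, summary FROM packages WHERE name = ?", ("foo",)).fetchall()
    print(rows)
    conn.close()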
It might be overkill for your application, but you ought to check out schema-free/document-oriented databases. Personally I'm a fan of CouchDB. Basically, rather than storing records as rows in a table, something like CouchDB stores key-value pairs, and then (in the case of CouchDB) you write views in JavaScript to cull the data you need. These databases are usually easier to scale than relational databases, and in your case may be much faster, since you don't have to hammer your data into a shape that will fit into a relational database. On the other hand, it means that there is another service running.
Given the approximate number of objects stated (around 5,000), SQLite is probably not the reason for the slowness. It's the intermediate steps, for example the JSON handling or possibly non-optimal use of SQLAlchemy.
Try this out (fairly fast even for million objects):
y_serial.py module :: warehouse Python objects with SQLite
"Serialization + persistance :: in a few lines of code, compress and annotate Python objects into SQLite; then later retrieve them chronologically by keywords without any SQL. Most useful "standard" module for a database to store schema-less data."
http://yserial.sourceforge.net
The yserial search on your keys is done using the regular expression ("regex") code on the SQLite side, not Python, so there's another substantial speed improvement.
Let us know how it works out.
I'm solving a very similar problem for myself right now, using Nucular, which might suit your needs. It's file-system based storage and seems very fast indeed. (It comes with an example app that indexes the whole Python source tree.) It's concurrency-safe, requires no external libraries and is pure Python. It searches rapidly and has powerful full-text search, indexing and so on: kind of a specialised, in-process, native Python-dict store after the manner of the trendy CouchDB and MongoDB, but much lighter.
It does have limitations, though - it can't store or query on nested dictionaries, so not every JSON type can be stored in it. Moreover, although its text searching is powerful, its numerical queries are weak and unindexed. Nonetheless, it may be precisely what you are after.