This is a bit of a general question, but I wanted to know about different approaches to sharing data between machines.
Basically, I have a process that generates a large reference table (often a Python dict of over 10 GB), and then other machines run independent processes that reference that table. The dict does not change once it's created; all the other machines simply refer to it to do their work. I'm leaning towards storing all of this in a database and having the other servers query it to get the data. I just suspect that having multiple 10 GB+ queries running at the same time may not be the best way to do it. I have also thought about using a flat file, or passing the table around with a distribution tool.
Are there any other ways to share this Python dict among several machines? (General approaches are fine, but since I'm using Python, library suggestions would work too.)
Well, yes, it would make sense to store it in some kind of shared datastore. Depending on your exact needs, you may find it preferable to store the data in some kind of NoSQL-type storage. For example, Redis ( http://redis.io ) is pretty reasonable and supports various data structures, including hash tables.
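A minimal sketch of that idea using the redis-py client (host name, key names and the source dict are all made up here):

    import redis

    # one machine loads the big dict into a Redis hash once...
    r = redis.Redis(host="table-server", port=6379)      # hypothetical host
    big_reference_dict = {"some_key": "some_value"}      # stands in for your 10 GB dict
    for key, value in big_reference_dict.items():
        # Redis stores strings/bytes; pickle or JSON-encode anything more complex
        r.hset("reference_table", key, value)

    # ...and every worker machine just reads individual entries from the same hash:
    r = redis.Redis(host="table-server", port=6379)
    value = r.hget("reference_table", "some_key")        # returns bytes

That way each worker only pulls the entries it needs instead of shipping the whole 10 GB table over the network.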
I am currently starting a somewhat larger project in Python and I am unsure how best to structure it. Or, to put it in different terms, how to build it in the most "pythonic" way. Let me try to explain the main functionality:
It is supposed to be a tool or toolset for extracting data from different sources, mainly SQL databases at the moment, and in the future maybe also data from files stored in some network locations. It will probably consist of three main parts:
A data model which will hold all the data extracted from files / SQL. This will be some combination of classes and instances thereof. No big deal here
One or more scripts which will control everything (Should the data be displayed? Output to another file? Which data exactly needs to be fetched? etc.) Also pretty straightforward
And some module/class (or multiple modules) which will handle the data extraction. This is where I mainly struggle
So for the actual questions:
Should I place the classes of the data model and the "extractor" into one folder/package and access them from outside the package via my "control script"? Or should I place everything together?
How should I build the "extractor"? I have already tried three different approaches for a SqlReader module/class: I tried making it just a simple module, not a class, but I didn't really find a clean way to decide how and where to initialize it (the SQL connection needs to be set up). I tried making it a class and creating one instance, but then I need to pass this instance around into the different classes of the data model, because each needs to be able to extract data. And I tried making it a static class (defining everything as a @classmethod), but again, I didn't like setting it up and it also felt kind of wrong.
Should the main script "know" about the extractor module? Or should it just interact with the data model itself? If not, again the question: where, when and how to initialize the SqlReader?
And last but not least, how do I make sure I close the SQL connection whenever my script ends, even if it ends through an error? I am using cx_Oracle, by the way.
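For reference, the class-based attempt looks roughly like this (just a rough sketch, all names and credentials are invented):

    import cx_Oracle

    class SqlReader(object):
        def __init__(self, user, password, dsn):
            self.connection = cx_Oracle.connect(user, password, dsn)

        def fetch_all(self, query):
            cursor = self.connection.cursor()
            try:
                cursor.execute(query)
                return cursor.fetchall()
            finally:
                cursor.close()

        def close(self):
            self.connection.close()

    # the instance currently gets passed into every data model class,
    # and close() only runs if the script reaches its normal end
    reader = SqlReader("user", "password", "localhost/orcl")
    rows = reader.fetch_all("SELECT 1 FROM dual")
    reader.close()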
I am happy about any hints / suggestions / answers etc. :)
For this project you will need the basic data science toolkit: Pandas, Matplotlib, and maybe NumPy. You will also need SQLite3 (built-in) or another SQL module to work with the databases.
Pandas: Used to extract, manipulate, and analyze data.
Matplotlib: Visualize data, make human-readable graphs for further data analysis.
NumPy: Build fast, stable arrays of data that work much faster than Python's lists.
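For example, reading a table from a SQLite database into a pandas DataFrame could look roughly like this (file, table and column names are all made up):

    import sqlite3
    import pandas as pd
    import matplotlib.pyplot as plt

    # open the (hypothetical) database and read one table into a DataFrame
    conn = sqlite3.connect("measurements.db")
    df = pd.read_sql_query("SELECT name, value FROM measurements", conn)
    conn.close()

    print(df.describe())                      # quick numeric summary
    df.plot(x="name", y="value", kind="bar")  # matplotlib plot via pandas
    plt.show()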
Now, this is just a guideline; you will need to dig deeper into their documentation, then use what you need in your project.
Hope that this is what you were looking for!
Cheers
I am looking for a database-like structure which can hold Python objects, each of which has various fields which can be searched. Some searching yields ZODB, but I'm not sure that's what I want.
To explain, I have objects which can be written/read from disk in a given file format. I would like a way to organize and search many sets of those objects. Currently, I am storing them in nested dictionaries, which are populated based on the structure of the file-system and file names.
I would like a more database-like approach, but I don't think I need a full database. I would like to be able to save the structure to disk, but I don't want to interface with a server or anything like that.
Perhaps I just want to use NumPy's structured arrays? http://docs.scipy.org/doc/numpy/user/basics.rec.html
Python comes with a database... http://docs.python.org/library/sqlite3.html
I think what you are looking for is Python's shelve module, if you are just looking to save the data to disk. It is almost 100% compatible with the dict interface, so you will have to change very little code.
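A minimal sketch of what that could look like (file name and keys are made up):

    import shelve

    # a shelf behaves like a dict that lives on disk
    db = shelve.open("objects.db")
    db["run_001"] = {"temperature": 21.5, "samples": [1, 2, 3]}
    db.close()

    # later, possibly in another process:
    db = shelve.open("objects.db")
    print(db["run_001"]["temperature"])
    db.close()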
ZODB will do what you are looking for. You can have your data structured there exactly as you have it already. Instead of using built-in dicts, you'll have to switch to persistent mappings, and if they tend to be large, you should maybe start with BTrees right from the beginning. Here is a good starting point for further reading, with simple examples and tutorials. As you'll see, ZODB doesn't provide any SQL-like access language. But since it looks like you have already structured your data to fit into dictionaries, you probably won't miss that anyway. Compared to plain shelve, ZODB also offers to compress your data on the fly before writing it to disk. In general, it provides a lot more options than shelve, and shows no significant downside compared to it, so I can warmly recommend it. I switched to it a year ago, because typical relational DBs together with an ORM performed horribly slowly, especially for insert-or-update operations. With ZODB I'm about an order of magnitude faster here, as there is no communication overhead between the application and the database; the database is part of your application already.
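A rough sketch of the basic setup (untested; file and key names are invented):

    import transaction
    import ZODB, ZODB.FileStorage
    from BTrees.OOBTree import OOBTree

    # open (or create) the database file and get the root object
    storage = ZODB.FileStorage.FileStorage("data.fs")
    db = ZODB.DB(storage)
    connection = db.open()
    root = connection.root()

    # use a BTree instead of a plain dict for large mappings
    if "objects" not in root:
        root["objects"] = OOBTree()
    root["objects"]["run_001"] = {"temperature": 21.5}

    transaction.commit()
    connection.close()
    db.close()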
The Python native capabilities for lists, sets & dictionaries totally rock. Is there a way to continue using the native capability when the data becomes really big? The problem I'm working on involves matching (intersection) of very large lists. I haven't pushed the limits yet -- actually I don't really know what the limits are -- and don't want to be surprised with a big reimplementation after the data grows as expected.
Is it reasonable to deploy on something like Google App Engine that advertises no practical scale limit and continue using the native capability as-is forever and not really think about this?
Is there some Python magic that can hide whether the list, set or dictionary is in Python-managed memory vs. in a DB -- so physical deployment of data can be kept distinct from what I do in code?
How do you, Mr. or Ms. Python Super Expert, deal with lists, sets & dicts as data volume grows?
I'm not quite sure what you mean by native capabilities for lists, sets & dictionaries. However, you can create classes that emulate container types and sequence types by defining some methods with special names. That means that you could create a class that behaves like a list, but stores its data in a SQL database or in the GAE datastore. Simply speaking, this is what an ORM does. However, mapping objects to a database is very complicated, and it is probably not a good idea to invent your own ORM; use an existing one instead.
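To illustrate the idea (not production code, just a sketch of the special-method approach with SQLite as the backing store):

    import sqlite3
    from collections.abc import MutableMapping

    class SqliteDict(MutableMapping):
        """A dict-like object whose items live in a SQLite table."""

        def __init__(self, path):
            self.conn = sqlite3.connect(path)
            self.conn.execute(
                "CREATE TABLE IF NOT EXISTS kv (key TEXT PRIMARY KEY, value TEXT)")

        def __getitem__(self, key):
            row = self.conn.execute(
                "SELECT value FROM kv WHERE key = ?", (key,)).fetchone()
            if row is None:
                raise KeyError(key)
            return row[0]

        def __setitem__(self, key, value):
            self.conn.execute(
                "INSERT OR REPLACE INTO kv (key, value) VALUES (?, ?)", (key, value))
            self.conn.commit()

        def __delitem__(self, key):
            if key not in self:
                raise KeyError(key)
            self.conn.execute("DELETE FROM kv WHERE key = ?", (key,))
            self.conn.commit()

        def __iter__(self):
            for (key,) in self.conn.execute("SELECT key FROM kv"):
                yield key

        def __len__(self):
            return self.conn.execute("SELECT COUNT(*) FROM kv").fetchone()[0]

    d = SqliteDict("store.db")
    d["answer"] = "42"
    print(d["answer"], len(d))

Code using this object looks like code using a plain dict, even though the data never has to fit in memory, which is exactly the separation of code from physical deployment you are asking about.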
I'm afraid there is no one-size-fits-all solution. GAE in particular is not some kind of Magic Fairy Dust you can sprinkle on your code to make it scale. There are several limitations you have to keep in mind to create an application that can scale. Some of them are general, like computational complexity; others are specific to the environment your code runs in. E.g. on GAE the maximum response time is limited to 30 seconds, and querying the datastore works differently than on other databases.
It's hard to give any concrete advice without knowing your specific problem, but I doubt that GAE is the right solution.
In general, if you want to work with large datasets, you either have to keep that in mind from the start or you will have to rework your code, algorithms and data structures as the datasets grow.
You are describing my dreams! However, I think you cannot do it. I always wanted something just like LINQ for Python, but the language does not allow using Python syntax for native database operations, AFAIK. If it were possible, you could just write code using lists and then use the same code for retrieving data from a database.
I would not recommend writing a lot of code based only on lists and sets, because it will not be easy to migrate to a scalable platform. I recommend using something like an ORM. GAE even has its own ORM-like system, and you can use others such as SQLAlchemy and SQLObject with, e.g., SQLite.
Unfortunately, you cannot use awesome stuff such as list comprehensions to filter data from the database. Surely, you can filter data after it has been fetched from the DB, but you'll still need to build a query in some SQL-like language for querying objects, or else return a lot of objects from the database.
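The contrast looks roughly like this with SQLAlchemy (model, table and field names are invented, just a sketch):

    from sqlalchemy import create_engine, Column, Integer, String
    from sqlalchemy.ext.declarative import declarative_base
    from sqlalchemy.orm import sessionmaker

    Base = declarative_base()

    class User(Base):
        __tablename__ = "users"
        id = Column(Integer, primary_key=True)
        name = Column(String)
        age = Column(Integer)

    engine = create_engine("sqlite:///app.db")
    Base.metadata.create_all(engine)
    session = sessionmaker(bind=engine)()

    # what you would like to write over a plain list:
    #     adults = [u for u in users if u.age >= 18]
    # what you write with the ORM instead (the filter runs in SQL):
    adults = session.query(User).filter(User.age >= 18).all()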
OTOH, there is Buzhug, a curious non-relational database system written in Python which allows the use of natural Python syntax. I have never used it and I do not know if it is scalable so I would not put my money on it. However, you can test it and see if it can help you.
You can use an ORM (Object Relational Mapping): a class gets a table, an object gets a row. I like the Django ORM. You can use it for non-web apps, too. I never used it on GAE, but I think it is possible.
Originally, I had to deal with just 1.5[TB] of data. Since I just needed fast writes/reads (without any SQL), I designed my own flat binary file format (implemented using Python) and easily (and happily) saved my data and manipulated it on one machine. Of course, for backup purposes, I added 2 machines to be used as exact mirrors (using rsync).
Presently, my needs are growing, and there's a need to build a solution that would successfully scale up to 20[TB] (and even more) of data. I would be happy to continue using my flat file format for storage. It is fast, reliable and gives me everything I need.
The thing I am concerned about is replication, data consistency, etc. across the network (as obviously, the data will have to be distributed -- not all of it can be stored on one machine).
Are there any ready-made solutions (Linux / Python based) that would allow me to keep using my file format for storage, yet would handle the other components that NoSQL solutions normally provide (data consistency / availability / easy replication)?
Basically, all I want to make sure is that my binary files are consistent throughout my network. I am using a network of 60 core-duo machines (each with 1GB of RAM and a 1.5TB disk).
Approach: distributed MapReduce in Python with the Disco Project
This seems like a good way of approaching your problem. I have used the Disco Project for similar problems.
You can distribute your files among any number of machines (processes), and implement the map and reduce functions that fit your logic.
The Disco Project tutorial describes exactly how to implement a solution for problems like yours. You'll be impressed by how little code you need to write, and you can definitely keep the format of your binary files.
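To give a flavour, the word-count example along the lines of the Disco tutorial looks roughly like this (you would replace the map/reduce bodies with logic for your own record format; treat this as a sketch rather than a verified listing):

    from disco.core import Job, result_iterator

    def word_map(line, params):
        # emit (word, 1) for every word in an input line
        for word in line.split():
            yield word, 1

    def word_reduce(iter, params):
        # sum the counts for each word
        from disco.util import kvgroup
        for word, counts in kvgroup(sorted(iter)):
            yield word, sum(counts)

    if __name__ == '__main__':
        job = Job().run(input=["http://discoproject.org/media/text/chekhov.txt"],
                        map=word_map, reduce=word_reduce)
        for word, count in result_iterator(job.wait(show=True)):
            print(word, count)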
Another similar option is to use Amazon's Elastic MapReduce
Perhaps some of the commentary on the Kivaloo system developed for Tarsnap will help you decide what's most appropriate: http://www.daemonology.net/blog/2011-03-28-kivaloo-data-store.html
Without knowing more about your application (size/type of records, frequency of reading/writing) or custom format it's hard to say more.
I am looking to write a key/value store (probably in Python), mostly just for the experience and because I think it's a very useful kind of product. I have a couple of questions. How, in general, are key/value pairs normally stored in memory and on disk? How would one go about loading the things stored on disk back into memory? Do key/value stores keep all the key/value pairs in memory at once, or are they read from disk?
I tried to find some literature on the subject, but didn't get very far and was hoping someone here could help me out.
It all depends on the level of complexity you want to dive into. Starting with a simple Python dict serialized to a file in a myriad of possible ways (of which pickle is probably the simplest), you can go as far as implementing a complete database system.
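The very simplest version really is just a dict pickled to a file, something like this (a toy sketch with no crash safety or concurrency, file name invented):

    import os
    import pickle

    class SimpleStore(object):
        """Toy key/value store: everything lives in a dict in memory,
        and the whole dict is pickled to disk on every put."""

        def __init__(self, path):
            self.path = path
            if os.path.exists(path):
                with open(path, "rb") as f:
                    self.data = pickle.load(f)
            else:
                self.data = {}

        def get(self, key):
            return self.data[key]

        def put(self, key, value):
            self.data[key] = value
            with open(self.path, "wb") as f:
                pickle.dump(self.data, f)

    store = SimpleStore("toy.db")
    store.put("hello", "world")
    print(store.get("hello"))

Everything beyond that (partial loading, indexes, write-ahead logs, concurrent clients) is what separates this toy from a real database.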
Look up redis - it's a key/value store written in C and operating as a server "DB". It has some good documentation and easy to read code, so you can borrow ideas for your Python implementation.
To go even farther, you can read about B-trees.
For your specific questions: above some DB size, you can never keep it all in memory, so you need some robust way of loading data from disk. Also consider whether the store is single-client or multi-client. This has serious consequences for its implementation.
Have a look at Python's shelve module, which provides a persistent dictionary. It basically stores pickles in a database, usually dbm or BSDDB. Looking at how shelve works will give you some insights, and the source code comes with your Python distribution.
Another product to look at is Durus. This is an object database that uses its own B-tree implementation for persistence to disk.
If you're doing a key/value store in Python for learning purposes, it might be easiest to start with the pickle module. It's a fast and convenient way to write an arbitrary Python data stream to a persistent store and read it back again.
You can have a look at Berkeley DB to see how it works. It is a key/value DB, so you can use it directly, or, since it is open source, see how it handles persistence, transactions and paging of the most frequently referenced pages.
Here are the Python bindings to it: http://www.jcea.es/programacion/pybsddb.htm
Amazon released a document about Dynamo - a highly available key-value storage system. It mostly deals with scaling issues (how to create a key/value store that runs on a large number of machines), but it also covers some basics and is generally worth reading.
First up, I know this question is quite old.
I'm the creator of aodbm ( http://sf.net/projects/aodbm/ ), which is a key-value store library. aodbm uses immutable B+Trees to store your data, so whenever a modification is made, a new tree is appended to the end of the file. This probably sounds like a horrendous waste of space, but since the vast majority of nodes from the previous tree are shared with the new one, the overhead is actually quite low. Very little of an entire tree is kept in memory at any given time (at most O(log n)).
I recommend watching the talk "Write optimization in external-memory data structures" (slides), which gives a nice overview of modern approaches to building external-memory databases (e.g. key-value stores) and explains log-structured merge trees.
If your key-value store targets use cases where all the data fits in main memory, the data store architecture can be a lot simpler: map a file to a big chunk of memory and work with that memory, without bothering about disk-to-memory communication and synchronization at all, because that becomes the operating system's concern.
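A minimal illustration of the memory-mapping part in Python (fixed-size file, invented name):

    import mmap

    # create a small file to back the mapping (normally this would be your data file)
    with open("store.bin", "wb") as f:
        f.write(b"\0" * 4096)

    with open("store.bin", "r+b") as f:
        mm = mmap.mmap(f.fileno(), 0)   # map the whole file into memory
        mm[0:5] = b"hello"              # write through the mapping
        print(mm[0:5])                  # read it back
        mm.flush()                      # force changes to disk; otherwise the OS decides when
        mm.close()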