Speed of standard Python data persistence methods [closed] - python

Do you know what the speed differences are between:
pickle
shelve
sqlite
some MySQL connector
MongoDB
...
If I wanted to store many dicts, which would be the preferred way, and what are the differences?

I don't want to drag out the comments, so I'll answer.
Given what you have said:
No server, no existing code. I only want to write a program that locally stores string to string dicts in a file or whatever. No more fancy than that.
I would say your best bet for something so simple is probably something like JSON.
However, if you need it to be super-fast, it may not be the best solution (or it may be - I honestly don't know how it performs in comparison). It's simple, and there are implementations of it for most platforms, which covers a lot of the ground you want. If you want the best speed possible, my advice would be to test it; that's the only way you'll know for sure. Of course, simple is usually a good sign for speed.
You haven't given enough information to know how important performance is here. Remember, unless you provably need the performance, don't bother optimising until you do. Go for something easy to read and maintain code-side, and easy to work with file-side. This is why I recommend JSON.
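To make that concrete, here is a minimal sketch of the JSON route for a string-to-string dict (the file name is just an example):

import json

# Store and reload a string-to-string dict as JSON.
data = {"alpha": "1", "beta": "2"}

with open("mapping.json", "w") as f:
    json.dump(data, f)

with open("mapping.json") as f:
    restored = json.load(f)

assert restored == data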

For persistent string-to-string dicts, anydbm is pretty reasonable. bsddb can be used from anydbm, and is fast but a bit sensitive to being interrupted. gdbm can be used from anydbm, and is slower but not likely to yield a corrupted database.
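For illustration, a minimal sketch of the anydbm route (Python 2 module name; it is called dbm in Python 3, and the file name is just an example):

import anydbm  # the module is named dbm in Python 3

# A persistent string-to-string mapping backed by whichever dbm
# implementation anydbm selects (bsddb, gdbm, or dumbdbm as a fallback).
db = anydbm.open("mapping.db", "c")  # "c" creates the file if needed
db["hello"] = "world"
db.close()

db = anydbm.open("mapping.db")
print(db["hello"])
db.close()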
Also, if you want to read an entire dict into memory, make a lot of changes, and write the resulting dict back out, there's: http://stromberg.dnsalias.org/svn/dohdbm/trunk/ I'm using this one in a backup software project. It'll compress your dictionaries if you want, which can be a performance win if your I/O is particularly slow, or you have a lot of modifications to make.

Related

sorting and selecting data [closed]

I have huge tables of data that I need to manipulate (sort, calculate new quantities, select specific rows according to some conditions, and so on...). So far I have been using spreadsheet software to do the job, but this is really time-consuming and I am trying to find a more efficient way to do it.
I use python but I could not figure out how to use it for such things. I am wondering if anybody can suggest something to use. SQL?!
This is a very general question, but there are multiple things that you can do to possibly make your life easier.
1. CSV: These are very useful if you are storing data that is ordered in columns, and if you are looking for easy-to-read text files.
2. SQLite: sqlite3 is a database system that does not require a server (it uses a file instead), and is interacted with just like any other database system. However, for very large-scale projects handling massive amounts of data, it is not recommended.
3. MySQL: MySQL is a database system that requires a server to interact with, but can be tuned for very large-scale projects as well as small-scale ones.
There are many other types of systems, though, so I suggest you search around and find the perfect fit. However, if you want to mess around with SQLite or CSV, both the sqlite3 and csv modules are supplied in the standard library with Python 2.7 and 3.x, I believe.
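As a rough sketch of the CSV route (the file name, column name, and threshold are made up for illustration):

import csv

# Read column-ordered data, select rows matching a condition, and sort them.
with open("measurements.csv") as f:
    reader = csv.DictReader(f)
    selected = [row for row in reader if float(row["value"]) > 10.0]

selected.sort(key=lambda row: float(row["value"]))
print(selected)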
You will probably appreciate the sqlite3 module in Python standard library:
http://docs.python.org/library/sqlite3.html
You get a SQL database that's stored in a file on disk, with no need to configure a separate database server. It's not appropriate for multiple clients accessing at once, but for a single-threaded analysis application like yours, it's a good fit.
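A minimal sketch of what that looks like (table and column names are just examples):

import sqlite3

# A file-backed SQL database: no server to configure.
conn = sqlite3.connect("analysis.db")
conn.execute("CREATE TABLE IF NOT EXISTS samples (name TEXT, value REAL)")
conn.executemany("INSERT INTO samples VALUES (?, ?)",
                 [("a", 3.5), ("b", 12.0), ("c", 7.25)])
conn.commit()

# Select and sort without a spreadsheet.
query = "SELECT name, value FROM samples WHERE value > ? ORDER BY value"
for name, value in conn.execute(query, (5.0,)):
    print(name, value)

conn.close()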

twitter/social data mining - Ruby or Django? [closed]

I am a decent C/C++ programmer, but don't know much about web dev. I am interested in Twitter/social data mining. So which is the better tool - RoR or Django? I am at level zero in both Ruby and Python, but Python's syntax seemed easier to understand and learn. The main question is: which tool has better mining-related APIs?
Thanks!!
They both have everything you need, but I think Python does better here. Python has a very interesting library for text mining called NLTK, and NumPy/SciPy for analytical computations, which let you achieve performance almost comparable to C. For pure data mining I'd suggest Python + pandas (pandas is really well written and fast, and there is no Ruby equivalent as far as I know), or Python + some R code called through rpy. If your data mining code needs to compute some symbolic math, you can use SymPy (slower because it's written in Python, but very complete) or Theano (way faster but with fewer features; it can even make your code run on the GPU through CUDA).
If you are merely collecting data from Twitter, you don't need an MVC framework like Django or RoR. You can use C++ libraries to collect data from Twitter, store it in a database, build the indexing and so on, and then use C or C++ to perform data mining tasks against your data. Or you can perform the analysis on the fly.
If you want to build your own web interface to present your work, or the like, Django and RoR are both very good and easy-to-pick-up frameworks.
This is not a real question; please read the FAQ.

Programming a scalable database [closed]

I want to ask you what programming language I should use to develop a horizontally scalable database. I don't care too much about performance.
Currently, I only know PHP and Python, but I wonder if Python is good for scalability.
Or is this even possible in Python?
The reason I don't use an existing system is that I need deep insight into the system, and there is no database out there that can store indexes the way I want. (It's a mix of non-relational, sparse free multidimensional, and graph design.)
EDIT:
I already have most of the core code written in Python, and I have investigated ways to improve adding data for that type of database design, which limits the use of other databases even more.
EDIT 2:
Forgot to note, the database tables are several hundred gigabytes.
The development of a scalable database is language-independent. I cannot say much about PHP, but I can tell you good things about Python: it's easy to read, easy to learn, etc. In my opinion it makes the code much cleaner than other languages.
Between PHP and Python, definitely Python. Where I work, the entire system is written in Python and it scales quite well.
P.S.: Do take a look at MongoDB though.
You're looking for MongoDB.
Mongodb has some excellent python drivers. It is a joy to work with.
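For what it's worth, a minimal sketch with a reasonably recent pymongo, assuming a MongoDB server is running locally on the default port (the database and collection names are made up):

from pymongo import MongoClient

client = MongoClient("localhost", 27017)
collection = client.scaling_test.documents  # example names

# Insert a document and read it back.
collection.insert_one({"key": "value", "tags": ["graph", "sparse"]})
print(collection.find_one({"key": "value"}))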
Since this is clearly a request for "opinion", I thought I'd offer my $.02
We looked at MongoDB 12 months ago, and started to really like it... but for one issue. MongoDB limits the largest database to the amount of physical RAM installed on the MongoDB server. For our tests, this meant we were limited to 4 GB databases. This didn't fit our needs, so we walked away (too bad really, because Mongo looked great).
We moved back to home turf, and went with PostgreSQL for our project. It is an exceptional system, with lots to like.
But we've kept an eye on the NoSQL crowd ever since, and it looks like Riak is doing some really interesting work.
(fyi -- it's also possible the MongoDB project has resolved the DB size issue -- we haven't kept up with that project).

ZODB In Real Life [closed]

Writing an app in Python, and been playing with various ORM setups and straight SQL. All of which are ugly as sin.
I have been looking at ZODB as an object store, and it looks a promising alternative... would you recommend it? What are your experiences, problems, and criticism, particularly regarding developer's perspectives, scalability, integrity, long-term maintenance and alternatives? Anyone start a project with it and ditch it? Why?
Whilst the ideas behind ZODB, Pypersyst and others are interesting, there seems to be a lack of enthusiasm around for them :(
I've used ZODB for more than ten years now, in Zope and outside. It's great if your data is hierarchical. The largest data store a customer operates has maybe, I don't know, 100 GB in it? Something on that order of magnitude, anyway.
Here is a performance comparison against Postgres.
If you're writing a WSGI web app, these packages may be useful:
repoze.tm2 (docs)
repoze.zodbconn (docs)
Compared to "any key-value store", the key features for ZODB would be automatic integration of attribute changes with real ACID transactions, and clean, "arbitrary" references to other persistent objects.
The ZODB is bigger than just the FileStorage used by default in Zope:
The RelStorage backend lets you put your data in an RDBMS which can be backed up, replicated, etc. using standard tools.
ZEO allows easy scaling of appservers and off-line jobs.
The two-phase commit support allows coordinating transactions among multiple databases, including RDBMSes (assuming that they provide a TPC-aware layer).
Easy hierarchy based on object attributes or containment: you don't need to write recursive self-joins to emulate it.
Filesystem-based BLOB support makes serving large files trivial to implement.
Overall, I'm very happy using ZODB for nearly any problem where the shape of the data is not obviously "square".
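For readers who haven't seen it, a minimal sketch of the classic ZODB pattern with the default FileStorage (the file and key names are just examples):

import transaction
from ZODB import DB
from ZODB.FileStorage import FileStorage

# Open a file-backed object database and work with its root mapping.
storage = FileStorage("Data.fs")
db = DB(storage)
connection = db.open()
root = connection.root()

# Plain dicts work, but mutating them in place later won't be detected;
# persistent classes (persistent.Persistent, BTrees) track changes for you.
root["greetings"] = {"hello": "world"}
transaction.commit()

db.close()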
I would recommend it.
I really don't have any criticisms. If it's an object store you're looking for, this is the one to use. I've stored 2.5 million objects in it before and didn't feel a pinch.
ZODB has been used for plenty of large databases
Most ZODB usage is/was probably from Zope users, who tend to move away from ZODB if they migrate away from Zope.
Performance is not as good as a relational database + ORM, especially if you have lots of writes.
Long term maintenance is not so bad, you want to pack the database from time to time, but that can be done live.
You have to use ZEO if you are going to use more than one process with your ZODB, which is quite a lot slower than using ZODB directly.
I have no idea how ZODB performs on flash disks.
With pickling you should be able to use any key value database in a similar fashion.
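For example, shelve in the standard library does exactly that on top of a dbm file (the file and key names are examples):

import shelve

# Keys are strings; values can be anything picklable.
store = shelve.open("objects.shelf")
store["config"] = {"retries": 3, "hosts": ["a", "b"]}
store.close()

store = shelve.open("objects.shelf")
print(store["config"])
store.close()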

What features of Python 3.0 will change your everyday coding? [closed]

Py3k just came out and has gobs of neat new stuff! I'm curious, what are SO pythonistas most excited about? What features are going to affect the way you write code on a daily basis, or have you been looking forward to?
There are a few things I'm quite interested in:
Text and data instead of unicode and 8 bit
Extended Iterable Unpacking
Function annotations
Binary literals
New exception catching syntax
A number of Python 2.6 features, eg: the with statement
I hope that exception chaining catches on. Losing exception stack traces due to the antipattern presented below had been my pet peeve for a long time:
try:
    doSomething(someObject)
except:
    someCleanup()
    # Thanks for passing the error-causing object,
    # but the original stack trace is lost :-(
    raise MyError("Bad, bad object!", someObject)
I know, I know, adding some context info to the original exception and preserving the original stack trace was possible, but it required a really ugly hack. Now you can (and should!) just:
raise MyError("Bad, bad object!", someObject) from original_exception
and easily get both of the above. So, as a part of my holy mission against lost stack traces:
Folks, don't forget the from clause when reraising exceptions! Thank you.
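A small self-contained illustration (the class and function names are hypothetical):

class MyError(Exception):
    pass

def doSomething(obj):
    return int(obj)  # raises ValueError for a bad value

def process(obj):
    try:
        doSomething(obj)
    except ValueError as original_exception:
        # The chained traceback shows MyError *and* the original ValueError.
        raise MyError("Bad, bad object!", obj) from original_exception

process("not a number")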
Quite frankly, none of it. While I'll probably find myself using some of the new syntax, I mainly use Python for quick and simple scripts and regular expressions.
I think the new features will make a lot of little things a little easier for a lot of people and a few big things easy for a few people. However, I am skeptical of any claims that a lot of people will end up finding massive gains in productivity.
In short, I think these changes will make things a little better overall, but don't expect any miracles.
Not so much a feature, but I think the library cleanup will be of great help, esp. to new python programmers. On more than one occasion have I wanted to do something in python only to find two included libraries that offer that functionality, with no obvious reason why I should chose one over the other.
Despite their efforts to make the migration path as smooth as possible, I see the release of Python 3 as the start of ten years of painful migration. Therefore I don't find it particularly attractive.
The improvements they made are all good and important. Having two different types for strings has been a real source of annoyance everywhere, so it's good they got rid of the unicode object and introduced the bytes object alongside the now-Unicode str.
The bignum vs. num change was made for convenience, and I think that too was a good choice. Overall they cleaned the language of the harmful components it had accumulated over the last ten years.
The second-worst thing they did was a 10% slower implementation, as if speed weren't already Python's problem.
I believe the release of Python 3 pushes Python's reputation down rather than improving it. Right now they are back at square one when it comes to library support.
Not having to do as much..
Not having to worry about using unicode() or u"".
Not having to search through the docs of urllib, urllib2, and httplib to find the functions I need to encode a file and upload it via a POST request.
Not having to worry about whether except TypeError, something: will catch both TypeError and something, or catch a TypeError into something.
And conversely, having to look at the docs again! I know python well enough now I can do most stuff without referring to pydoc, but every time that I do, I discover some other useful module or function.
The print statement. <sniff> I'm starting to miss it already.
Actually, before even going to Python 2.6, we're purging print in favor of logging.debug. This is just to get out of the habit of using print casually for debugging, support and development.
What remains are some programs that actually produce stuff on stdout. For those, we may introduce a 2.6/3.0 compatible "print" function in one of our libraries.
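One way to do that kind of thing, sketched here with a hypothetical wrapper name (this is not the poster's actual library code):

from __future__ import print_function  # gives Python 2.6 the print() function

import sys

def report(*args, **kwargs):
    # Behaves the same on 2.6 and 3.0; writes to stdout unless told otherwise.
    kwargs.setdefault("file", sys.stdout)
    print(*args, **kwargs)

report("processed", 42, "records")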
Dictionary comprehensions aren't necessarily earth-shattering but they're very nice.
While {k: v for k, v in list} is longer than dict(list), it's more flexible and self-explanatory.
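A quick illustration of that flexibility:

pairs = [("a", 1), ("b", 2), ("c", 3)]

plain = dict(pairs)                                  # {'a': 1, 'b': 2, 'c': 3}
transformed = {k.upper(): v * 10 for k, v in pairs}  # keys and values reshaped inline

print(plain, transformed)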
One of the most underestimated features of Python 3 is the introduction of Abstract Base Classes. This is something that won't revolutionize Python programming straight away, but represents an interesting shift from a loose duck typing approach into the direction of better defined interfaces.
More information can be found in PEP 3119.
Just about all of them as I am taking the release of Python 3 as motivation to learn the language.
Unicode (UTF-8) is really important for people living in non-English-speaking countries.
I didn't like having to specify the encoding at the beginning of the file, because I always forget. Usually my text is compatible with ASCII because I'm using UTF-8, so it works without the encoding declaration. But if I write my name (with an accent) or a € sign, it breaks... I ended up writing Unicode characters with their \uxxxx representation, but it is kinda cryptic!
