most efficient lookup for url caching system for REST API - python

What's the best way to store, index, and lookup text strings (URLs in this case)?
I'm creating a caching system for one of my sites. It's actually a bit more complex than that, thus the reason I'm rolling my own. I'm looking for the quickest, most-efficient way to resolve lookups on URLs, which obviously are text strings.
I'm currently using MySQL for a lot of my backend, and obviously I could just throw this in a table as a text field for the URL and its contents and turn on full text indexing, but that just feels slow and fundamentally wrong. Is there something else I should be looking at, whether it's MySQL or some other tool? Should I MD5 the URL, does that give me anything?
I've heard interesting things about mongodb too, but not sure if that buys me anything.

Memcached - easy, quick, found everywhere. I use it a lot.

MongoDB is a database, not a caching system. The speed difference between it and MySQL is not likely to be huge.
As D Mac mention memcached is an excellent choice for this. You do need to be aware that memcached is a true caching system and can throw your data away at any moment. You must be able to handle that.
A good compromise is redis, which is an in-memory database so it won't throw your data away like memcached will but also it will be an order of magnitude quicker than MySQL or MongoDB. The only downside to redis is that your whole dataset must fit into memory.
Your question contains lots of subquestions but not a lot of detail on what you're actually doing so it's hard to give a good answer.

Related

Setup for high volume of database writing

I am researching a project that would require hundreds of database writes per a minute. I have never dealt with this level of data writes before and I am looking for good scalable techniques and technologies.
I am a comfortable python developer with experience in django and sql alchemy. I am thinking I will build the data interface on django, but I don't think that it is a good idea to go through the orm to do the amount of data writes I will require. I am definitely open to learning new technologies.
The solution will live on Amazon web services, so I have access to all their tools. Ultimately I am looking for advice on database selection, data writing techniques, and any other needs I may have that I do not realize.
Any advice on where to start?
Thanks,
CG
Follow the trends, in other words, enter the world of NOSQL. Some technologies that are worthy include mongodb and redis. They are really fast, scalable, and with decent python drivers. For example, mongodb plays really nice with django, and has many common things with traditional SQL, like MySQL. On the other hand, redis has more "primitive" data structures but is superior in terms of speed (which of course depends somehow on the drivers). Using any of them ( or both, it's a clever idea for something glorious ) you are free ( and sometimes enforced ) to write your own "low-level" logic to accomplish your needs.
You should actually be okay with low hundreds of writes per minute through SQLAlchemy (thats only a couple a second); if you're talking more like a thousand a minute, yeah that might be problematic.
What kind of data do you have? If it's fairly flat (few tables, few relations), you might want to investigate a non-relational database such as CouchDB or Mongo. If you want to use SQL, I strongly reccommend PostgreSQL, it seems to deal with large databases and frequent writes a lot better than MySQL.
It also depends how complex the data is that you're inserting.
I think unfortunately, you're going to just have to try a couple things and run benchmarks, as each situation is different and query optimizers are basically magic.
If it's just a few hundred writes you still can do with a relational DB. I'd pick PostgreSQL (8.0+),
which has a separate background writer process. It also has tuneable serialization levels so you
can enable some tradeoffs between speed and strict ACID compliance, some even at transaction level.
Postgres is well documented, but it assumes some deeper understanding of SQL and relational DB theory to fully understand and make the most of it.
The alternative would be new fangled "NO-SQL" system, which can probably scale even better, but at the cost of buying into a very different technology system.
Any way, if you are using python and it is not 100% critical to lose writes on shutdown or power loss, and you need a low latency, use a threadsafe Queue.Queue and worker threads to decouple the writes from your main application thread(s).

Looking for a webapp framework that does away with the database

I've been developing with Django during the last year or so and really
enjoy it. But sometimes I find that the ORM is a bit of a
straitjacket. All the data that I shuffle back and forth to the
database would easily fit into 1GB of RAM. Even if the project grew a
couple of orders of magnitude it would still fit into 1GB.
I'd like a solution where my application only needs to read from disk
at startup but writes to disk implicitly as I update my objects. I
don't care to much about any speed increase this might give me. What
I'm really after is the added flexibility. If I have a problem that
would fit nicely with a linked list or a tree or some other
data structure, I shouldn't have to graft that onto a relational
database.
Python would be nice but other languages are fine. I'm in the exploratory
phase on this one. I want to get a feel for what solutions are
out there. When googling this question I got a lot of hits related to
different Nosql projects. But Nosql, as I understand it, is all about
what you do when you outgrow relational databases because you've got too
much data. I'm really at the other end of the spectrum. I've got so
little data that a relational database is actually overkill.
Object databases is an
other thing that came up when googling this question, which reminded
me of Zope and ZODB. I dabbled a bit with Zope an eon ago and really
disliked it. But reading up a bit on object databases made me think that it
might what I'm looking for. Then again, their general failure to
attract users makes me suspicious. Object databases have been around
for a very long time and still haven't caught on. I guess
that means there's something wrong with them?
If you are looking for "storing data-structures in-memory" and "backing up them to disk", you are really looking for a persistent cache system and Redis fits the bill perfectly.
If you want to use django, there is an awesome cache system built-in and that is pluggable to redis backend via Redis-Cache project. More over the data-structures accomodated by redis have one-to-one mapping to python data-structures, and so it is pretty seamless.
I am not so sure if skipping the concept of database itself, is a good idea. Database provides so much power in terms of aggregating, annotating, relations etc, all within acceptable performance levels until you hit real large scale.
Perhaps another idea would be to to use SQLite in-memory database. SQLite is so ubiquitous these days, it has disappeared into the infrastructure. It is built in, Android apps, iphone ones and has support from all standard libraries. It is also an awesome software developed and tested so well, it is very hard to make any case against using it.
The company I work for (Starcounter) creates a database that works exactly in the way that you describe. We have been running the database for a few years with our partner customers and are about to make the product publicly available. The main reason we created it is for ease of use and performance. I'll be happy to send you a copy if you send me a message on our corporate forum (I’m Starconter Jack).
On the subject on OO databases; the reason that OO databases failed is mainly because they were more experiments than real products. They were poorly implemented, supported only the OO paradigm and ignored standards such as SQL and ODBC. They also lacked stability, performance and maturity. Their story is analogues to the early tablets, eBooks and smartphones a decade or two before the iPhone, iPad and Kindle.
But just as with any technology, there are two waves (look up "the hype cycle"). While the first wave will disappoint, the second wave will be good. The first one will be driven by the concepts and ideas and will lack commercial success and real life usability. The second wave will want nothing to do with the musty smell of the failures of the first one and will therefore use new and fresh acronyms and buzzword.
The future database will spring out of the NoSQL movement. It will have added SQL support and many will think this is novel. It will have added good language integration (and most languages are oo) and many will think this is also novel. It will support documents and many will think this will be novel. Many will rediscover the need for transactions etc. etc.
Some grumpy old men will try to tell us that all we have done is to reinvent existing ideas. In some way they will be right; in some ways they won’t. This time around, the concepts are matured. New ideas will be added and pragmatism will be allowed.
But then again, an iPad is still, in a way, a PDA.

Can a Python list, set or dictionary be implemented invisibly using a database?

The Python native capabilities for lists, sets & dictionaries totally rock. Is there a way to continue using the native capability when the data becomes really big? The problem I'm working on involved matching (intersection) of very large lists. I haven't pushed the limits yet -- actually I don't really know what the limits are -- and don't want to be surprised with a big reimplementation after the data grows as expected.
Is it reasonable to deploy on something like Google App Engine that advertises no practical scale limit and continue using the native capability as-is forever and not really think about this?
Is there some Python magic that can hide whether the list, set or dictionary is in Python-managed memory vs. in a DB -- so physical deployment of data can be kept distinct from what I do in code?
How do you, Mr. or Ms. Python Super Expert, deal with lists, sets & dicts as data volume grows?
I'm not quite sure what you mean by native capabilities for lists, sets & dictionaries. However, you can create classes that emulate container types and sequence types by defining some methods with special names. That means that you could create a class that behaves like a list, but stores its data in a SQL database or on GAE datastore. Simply speaking, this is what an ORM does. However, mapping objects to a database is very complicated and it is probably not a good idea to invent your own ORM, but to use an existing one.
I'm afraid there is no one-size-fits-all solution. Especially GAE is not some kind of of Magic Fairy Dust you can sprinkle on your code to make it scale. There are several limitations you have to keep in mind to create an application that can scale. Some of them are general, like computational complexity, others are specific to the environment your code runs in. E.g. on GAE the maximum response time is limited to 30 seconds and querying the datastore works different that on other databases.
It's hard to give any concrete advice without knowing your specific problem, but I doubt that GAE is the right solution.
In general, if you want to work with large datasets, you either have to keep that in mind from the start or you will have to rework your code, algorithms and data structures as the datasets grow.
You are describing my dreams! However, I think you cannot do it. I always wanted something just like LINQ for Python but the language does not permit to use Python syntax for native database operations AFAIK. If it would be possible, you could just write code using lists and then use the same code for retrieving data from a database.
I would not recommend you to write a lot of code based only in lists and sets because it will not be easy to migrate it to a scalable platform. I recommend you to use something like an ORM. GAE even has its own ORM-like system and you can use other ones such as SQLAlchemy and SQLObject with e.g. SQLite.
Unfortunately, you cannot use awesome stuff such as list comprehensions to filter data from the database. Surely, you can filter data after it was gotten from the DB but you'll still need to build a query with some SQL-like language for querying objects or return a lot of objects from a database.
OTOH, there is Buzhug, a curious non-relational database system written in Python which allows the use of natural Python syntax. I have never used it and I do not know if it is scalable so I would not put my money on it. However, you can test it and see if it can help you.
You can use ORM: Object Relational Mapping: A class gets a table, an objects gets a row. I like the Django ORM. You can use it for non-web apps, too. I never used it on GAE, but I think it is possible.

Best DataMining Database

I am an occasional Python programer who only have worked so far with MYSQL or SQLITE databases. I am the computer person for everything in a small company and I have been started a new project where I think it is about time to try new databases.
Sales department makes a CSV dump every week and I need to make a small scripting application that allow people form other departments mixing the information, mostly linking the records. I have all this solved, my problem is the speed, I am using just plain text files for all this and unsurprisingly it is very slow.
I thought about using mysql, but then I need installing mysql in every desktop, sqlite is easier, but it is very slow. I do not need a full relational database, just some way of play with big amounts of data in a decent time.
Update: I think I was not being very detailed about my database usage thus explaining my problem badly. I am working reading all the data ~900 Megas or more from a csv into a Python dictionary then working with it. My problem is storing and mostly reading the data quickly.
Many thanks!
Quick Summary
You need enough memory(RAM) to solve your problem efficiently. I think you should upgrade memory?? When reading the excellent High Scalability Blog you will notice that for big sites to solve there problem efficiently they store the complete problem set in memory.
You do need a central database solution. I don't think hand doing this with python dictionary's only will get the job done.
How to solve "your problem" depends on your "query's". What I would try to do first is put your data in elastic-search(see below) and query the database(see how it performs). I think this is the easiest way to tackle your problem. But as you can read below there are a lot of ways to tackle your problem.
We know:
You used python as your program language.
Your database is ~900MB (I think that's pretty large, but absolute manageable).
You have loaded all the data in a python dictionary. Here I am assume the problem lays. Python tries to store the dictionary(also python dictionary's aren't the most memory friendly) in your memory, but you don't have enough memory(How much memory do you have????). When that happens you are going to have a lot of Virtual Memory. When you attempt to read the dictionary you are constantly swapping data from you disc into memory. This swapping causes "Trashing". I am assuming that your computer does not have enough Ram. If true then I would first upgrade your memory with at least 2 Gigabytes extra RAM. When your problem set is able to fit in memory solving the problem is going to be a lot faster. I opened my computer architecture book where it(The memory hierarchy) says that main memory access time is about 40-80ns while disc memory access time is 5 ms. That is a BIG difference.
Missing information
Do you have a central server. You should use/have a server.
What kind of architecture does your server have? Linux/Unix/Windows/Mac OSX? In my opinion your server should have linux/Unix/Mac OSX architecture.
How much memory does your server have?
Could you specify your data set(CSV) a little better.
What kind of data mining are you doing? Do you need full-text-search capabilities? I am not assuming you are doing any complicated (SQL) query's. Performing that task with only python dictionary's will be a complicated problem. Could you formalize the query's that you would like to perform? For example:
"get all users who work for departement x"
"get all sales from user x"
Database needed
I am the computer person for
everything in a small company and I
have been started a new project where
I think it is about time to try new
databases.
You are sure right that you need a database to solve your problem. Doing that yourself only using python dictionary's is difficult. Especially when your problem set can't fit in memory.
MySQL
I thought about using mysql, but then
I need installing mysql in every
desktop, sqlite is easier, but it is
very slow. I do not need a full
relational database, just some way of
play with big amounts of data in a
decent time.
A centralized(Client-server architecture) database is exactly what you need to solve your problem. Let all the users access the database from 1 PC which you manage. You can use MySQL to solve your problem.
Tokyo Tyrant
You could also use Tokyo Tyrant to store all your data. Tokyo Tyrant is pretty fast and it does not have to be stored in RAM. It handles getting data a more efficient(instead of using python dictionary's). However if your problem can completely fit in Memory I think you should have look at Redis(below).
Redis:
You could for example use Redis(quick start in 5 minutes)(Redis is extremely fast) to store all sales in memory. Redis is extremely powerful and can do this kind of queries insanely fast. The only problem with Redis is that it has to fit completely in RAM, but I believe he is working on that(nightly build already supports it). Also like I already said previously solving your problem set completely from memory is how big sites solve there problem in a timely manner.
Document stores
This article tries to evaluate kv-stores with document stores like couchdb/riak/mongodb. These stores are better capable of searching(a little slower then KV stores), but aren't good at full-text-search.
Full-text-search
If you want to do full-text-search queries you could like at:
elasticsearch(videos): When I saw the video demonstration of elasticsearch it looked pretty cool. You could try put(post simple json) your data in elasticsearch and see how fast it is. I am following elastissearch on github and the author is commiting a lot of new code to it.
solr(tutorial): A lot of big companies are using solr(github, digg) to power there search. They got a big boost going from MySQL full-text search to solr.
You probably do need a full relational DBMS, if not right now, very soon. If you start now while your problems and data are simple and straightforward then when they become complex and difficult you will have plenty of experience with at least one DBMS to help you. You probably don't need MySQL on all desktops, you might install it on a server for example and feed data out over your network, but you perhaps need to provide more information about your requirements, toolset and equipment to get better suggestions.
And, while the other DBMSes have their strengths and weaknesses too, there's nothing wrong with MySQL for large and complex databases. I don't know enough about SQLite to comment knowledgeably about it.
EDIT: #Eric from your comments to my answer and the other answers I form even more strongly the view that it is time you moved to a database. I'm not surprised that trying to do database operations on a 900MB Python dictionary is slow. I think you have to first convince yourself, then your management, that you have reached the limits of what your current toolset can cope with, and that future developments are threatened unless you rethink matters.
If your network really can't support a server-based database than (a) you really need to make your network robust, reliable and performant enough for such a purpose, but (b) if that is not an option, or not an early option, you should be thinking along the lines of a central database server passing out digests/extracts/reports to other users, rather than simultaneous, full RDBMS working in a client-server configuration.
The problems you are currently experiencing are problems of not having the right tools for the job. They are only going to get worse. I wish I could suggest a magic way in which this is not the case, but I can't and I don't think anyone else will.
Have you done any bench marking to confirm that it is the text files that are slowing you down? If you haven't, there's a good chance that tweaking some other part of the code will speed things up so that it's fast enough.
It sounds like each department has their own feudal database, and this implies a lot of unnecessary redundancy and inefficiency.
Instead of transferring hundreds of megabytes to everyone across your network, why not keep your data in MySQL and have the departments upload their data to the database, where it can be normalized and accessible by everyone?
As your organization grows, having completely different departmental databases that are unaware of each other, and contain potentially redundant or conflicting data, is going to become very painful.
Does the machine this process runs on have sufficient memory and bandwidth to handle this efficiently? Putting MySQL on a slow machine and recoding the tool to use MySQL rather than text files could potentially be far more costly than simply adding memory or upgrading the machine.
Here is a performance benchmark of different database suits ->
Database Speed Comparison
I'm not sure how objective the above comparison is though, seeing as it's hosted on sqlite.org. Sqlite only seems to be a bit slower when dropping tables, otherwise you shouldn't have any problems using it. Both sqlite and mysql seem to have their own strengths and weaknesses, in some tests the one is faster then the other, in other tests, the reverse is true.
If you've been experiencing lower then expected performance, perhaps it is not sqlite that is the causing this, have you done any profiling or otherwise to make sure nothing else is causing your program to misbehave?
EDIT: Updated with a link to a slightly more recent speed comparison.
It has been a couple of months since I posted this question and I wanted to let you all know how I solved this problem. I am using Berkeley DB with the module bsddb instead loading all the data in a Python dictionary. I am not fully happy, but my users are.
My next step is trying to get a shared server with redis, but unless users starts complaining about speed, I doubt I will get it.
Many thanks everybody who helped here, and I hope this question and answers are useful to somebody else.
If you have that problem with a CSV file, maybe you can just pickle the dictionary and generate a pickle "binary" file with pickle.HIGHEST_PROTOCOL option. It can be faster to read and you get a smaller file. You can load the CSV file once and then generate the pickled file, allowing faster load in next accesses.
Anyway, with 900 Mb of information, you're going to deal with some time loading it in memory. Another approach is not loading it on one step on memory, but load only the information when needed, maybe making different files by date, or any other category (company, type, etc..)
Take a look at mongodb.

Fast, searchable dict storage for Python

Current I use SQLite (w/ SQLAlchemy) to store about 5000 dict objects. Each dict object corresponds to an entry in PyPI with keys - (name, version, summary .. sometimes 'description' can be as big as the project documentation).
Writing these entries (from JSON) back to the disk (SQLite format) takes several seconds, and it feels slow.
Writing is done as frequent as once a day, but reading/searching for a particular entry based on a key (usually name or description) is done very often.
Just like apt-get.
Is there a storage library for use with Python that will suit my needs better than SQLite?
Did you put indices on name and description? Searching on 5000 indexed entries should be essentially instantaneous (of course ORMs will make your life much harder, as they usually do [even relatively good ones such as SQLAlchemy, but try "raw sqlite" and it absolutely should fly).
Writing just the updated entries (again with real SQL) should also be basically instantaneous -- ideally a single update statement should do it, but even a thousand should be no real problem, just make sure to turn off autocommit at the start of the loop (and if you want turn it back again later).
It might be overkill for your application, but you ought to check out schema-free/document-oriented databases. Personally I'm a fan of couchdb. Basically, rather than store records as rows in a table, something like couchdb stores key-value pairs, and then (in the case of couchdb) you write views in javascript to cull the data you need. These databases are usually easier to scale than relational databases, and in your case may be much faster, since you dont have to hammer your data into a shape that will fit into a relational database. On the other hand, it means that there is another service running.
Given the approximate number of objects stated (around 5,000), SQLite is probably not the problem behind speed. It's the intermediary measures; for example JSON or possibly non-optimal use of SQLAlChemy.
Try this out (fairly fast even for million objects):
y_serial.py module :: warehouse Python objects with SQLite
"Serialization + persistance :: in a few lines of code, compress and annotate Python objects into SQLite; then later retrieve them chronologically by keywords without any SQL. Most useful "standard" module for a database to store schema-less data."
http://yserial.sourceforge.net
The yserial search on your keys is done using the regular expression ("regex") code on the SQLite side, not Python, so there's another substantial speed improvement.
Let us know how it works out.
I'm solving a very similar problem for myself right now, using Nucular, which might suit your needs. It's a file-system based storage and seems very fast indeed. (It comes with an example app that indexes the whole python source tree) It's concurrent-safe, requires no external libraries and is pure python. It searches rapidly and has powerful fulltext search, indexing and so on - kind of a specialised, in-process, native python-dict store after the manner of the trendy Couchdb and mongodb, but much lighter.
It does have limitations, though - it can't store or query on nested dictionaries, so not every JSON type can be stored in it. Moreover, although its text searching is powerful, its numerical queries are weak and unindexed. Nonetheless, it may be precisely what you are after.

Categories