General queries vs detailed queries to database - python

I'm writing a web application in Python and PostgreSQL. Users access a lot of information during a session, and almost all of it is indexed in the database. My question is: should I litter the code with specific queries, or is it better practice to query larger chunks of information, cache them, and let Python process the chunks for the finer pieces?
For example: a user asks for entries in a payment log. Either one writes a query asking for the specific entries requested, or one collects the user's payment history and then uses Python to select the specific entries.
Of course caching is preferred when working with heavy queries, but since nearly all my data is indexed, direct database access is fast and the caching approach would not yield much, if any, extra speed. Are there other factors that may still make the caching approach preferable?

Database designers spend a lot of time on caching and optimization. Unless you hit a specific problem, it's probably better to let the database do the database stuff and have your code do the rest, rather than having your code take over some of the database's functionality.
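For concreteness, here is a minimal sketch of the "specific query" approach, assuming psycopg2 and a hypothetical payment_log table indexed on (user_id, created_at). The alternative would be selecting the user's whole payment history once and filtering the rows in Python, which only pays off if the same chunk is reused across requests.

    import psycopg2

    conn = psycopg2.connect("dbname=app user=app")  # hypothetical connection string

    def payment_entries(user_id, since):
        # Let the index on (user_id, created_at) do the filtering work.
        with conn.cursor() as cur:
            cur.execute(
                "SELECT id, amount, created_at FROM payment_log "
                "WHERE user_id = %s AND created_at >= %s "
                "ORDER BY created_at",
                (user_id, since),
            )
            return cur.fetchall()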

Best practice to update a column of all documents in Elasticsearch

I'm developing a log analysis system. The input is a set of log files. I have an external Python program that reads the log files and decides whether a record (a line of a log file) is "normal" or "malicious". I want to use the Elasticsearch Update API to append my Python program's result ("normal" or "malicious") to Elasticsearch by adding a new field called result, so I can see my program's result clearly via the Kibana UI.
Simply put, my Python code and Elasticsearch both use the log files as input. Now I want to push the results from my Python code into Elasticsearch. What's the best way to do it?
I can think of several ways:
Elasticsearch automatically assigns an ID (_id) to a document. If I can find out how Elasticsearch calculates _id, then my Python code can calculate it by itself and update the corresponding Elasticsearch document via _id. The problem is that the official Elasticsearch documentation doesn't say what algorithm it uses to generate _id.
Add an ID (like a line number) to the log files myself. Both my program and Elasticsearch will know this ID, and my program can use it to update the document. However, the downside is that my program has to search for this ID every time, because it's only a normal field instead of the built-in _id, so performance will be very bad.
My Python code gets the logs from Elasticsearch instead of reading the log files directly. But this makes the system fragile, as Elasticsearch becomes a critical point. I only want Elasticsearch to be a log viewer currently.
So the first solution looks ideal from where I stand now, but I'm not sure whether there are better ways to do it?
If possible, re-structure your application so that instead of dumping plain-text to a log file you're directly writing structured log information to something like Elasticsearch. Thank me later.
That isn't always feasible (e.g. if you don't control the log source). I have a few opinions on your solutions.
This feels super brittle. Elasticsearch does not base _id on the properties of a particular document. It's generated based on the _ids it has already stored (and, I think, on a random seed). Even if it could work, relying on an undocumented property is a good way to shoot yourself in the foot, especially with a team that makes breaking changes as often as Elasticsearch does, even to its documented behavior.
This one actually isn't so bad. Elasticsearch supports manually choosing the id of a document. Even if it didn't, it performs quite well for bulk terms queries and wouldn't be as much of a bottleneck as you might think. If you really have so much data that this could break your application then Elasticsearch might not be the best tool.
This solution is great. It's super extensible and doesn't rely on a complicated dependence on how the log file is constructed, how you've chosen to index that log in Elasticsearch, and how you're choosing to read it with Python. Rather you just get a document, and if you need to update it then you do that updating.
Elasticsearch isn't really a worse point of failure here than before (if ES goes down, your app goes down in any of these solutions) -- you're just doing twice as many queries (read and write). If a factor of 2 kills your application, you either need a better solution to the problem (i.e. avoid Elasticsearch), or you need to throw more hardware at it. ES supports all kinds of sharding configurations, and you can make a robust server on the cheap.
One question though, why do you have logs in Elasticsearch that need to be updated with this particular normal/malicious property? If you're the one putting them into ES then just tag them appropriately before you ever store them to prevent the extra read that's bothering you. If that's not an option then you'll still probably be wanting to read ES directly to pull the logs into Python anyway to avoid the enormous overhead of parsing the original log file again.
If this is a one-time hotfix to existing ES data while you're rolling out normal/malicious, then don't worry about a 2x speed improvement. Just throttle the query if you're concerned about bringing down the cluster. The hotfix will execute eventually, and probably faster than if we keep deliberating about the best option.
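To make options 2 and 3 concrete, here is a hedged sketch using the official elasticsearch-py client; the index name, id, and host are hypothetical, and the keyword arguments differ slightly between client versions (body= in older clients vs document=/doc= in newer ones).

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")  # hypothetical local cluster

    # Option 2: index each log line under an id you control (here, its line number),
    # so the classifier can address the same document later without searching.
    es.index(index="logs", id=42, body={"message": "GET /admin HTTP/1.1"})

    # Later, the Python classifier appends its verdict via the Update API.
    es.update(index="logs", id=42, body={"doc": {"result": "malicious"}})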

Does it help concurrency if an SQL table is split into multiple tables within the same database file?

I have a reasonably large SQL table in a single database file. This table is accessed by multiple independent processes. I am using SQLAlchemy and Python to access this table. One of these processes runs a fairly lengthy task on a subset of the table and only writes to a few specific fields. Occasionally I run into concurrency issues with this setup.
So far I have been unable to reproduce the issue with minimal code. Which tells me that I am not understanding something here.
Example:
for x in session.query(MyModel):  # MyModel stands in for the real mapped class
    do_something(x)
session.commit()
The entire loop can take minutes to complete before the commit is issued.
It must have something to do with both processes trying to write to the same table at the same time.
I am considering splitting this table up into two tables.
Of course I could use a different database that has better concurrency support, but my code is not yet at a place where this is easily done.
Q: Does anybody here have experience with that approach and/or does it seem a worthwhile approach to reduce my concurrency issues?
Database engines handle concurrency issues internally, which covers most applications' use cases.
For instance, rows of a table can be locked from other processes until the current transaction is done; this happens quickly enough not to impact applications, even with a fairly high number of transactions. It's always a good strategy to break the task down into smaller (perhaps single-statement) transactions, thereby locking the minimum number of rows possible.
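A sketch of that idea against the loop in the question, assuming the same session and the hypothetical MyModel mapped class: commit in batches so each transaction holds its locks for seconds rather than minutes.

    BATCH_SIZE = 100

    rows = session.query(MyModel).all()      # or a filtered subset
    for i, x in enumerate(rows, start=1):
        do_something(x)                      # touch only the few fields this task owns
        if i % BATCH_SIZE == 0:
            session.commit()                 # ends the transaction, releasing its locks
    session.commit()                         # commit whatever is left over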
It's quite difficult to tell without the code, but the first question that came to my mind is: why aren't you using locks? I imagine the big task looks a bit like this:
read something from table
do something with stuff read
write something to table
If so, you should only need to lock the table before writing and release the lock immediately after.
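A hedged SQLAlchemy sketch in that spirit: with_for_update() emits SELECT ... FOR UPDATE, locking only the rows being changed, and the lock is released as soon as the transaction commits. (This only helps on backends with row-level locking; SQLite, for one, ignores it. The names here are hypothetical.)

    row = (session.query(MyModel)
                  .filter(MyModel.id == some_id)   # some_id is hypothetical
                  .with_for_update()
                  .one())
    row.result_field = computed_value              # hypothetical field and value
    session.commit()                               # the commit releases the row lock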
Are you using InnoDB, or MyISAM? The former provides row locking functionality, while the latter only provides table locking.
That aside, read http://docs.sqlalchemy.org/en/rel_0_9/orm/session.html#contextual-thread-local-sessions, as that should cover all basic use cases.
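A short sketch of the contextual (thread-local) session pattern those docs describe, so each process or thread works with its own Session; the engine URL is hypothetical.

    from sqlalchemy import create_engine
    from sqlalchemy.orm import scoped_session, sessionmaker

    engine = create_engine("sqlite:///data.db")
    Session = scoped_session(sessionmaker(bind=engine))

    session = Session()   # same session within a thread, a distinct one per thread
    # ... query, modify, session.commit() as usual ...
    Session.remove()      # discard the thread-local session when the work is done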
Sounds like you also might want to look into SQLAlchemy's dogpile.cache integration.

SQL query or Programmatic Filter for Big Data?

I am working with Python, fetching huge amounts of data from an MS SQL Server database and processing it to make graphs.
The real issue is that I want to know whether it would be a good idea to repeatedly query the database to filter the data (using pyodbc for the SQL queries), with clauses like WHERE and SELECT DISTINCT,
OR
To fetch all the data once and use list comprehensions and Python's map and filter functionality to filter the data in my code itself.
If I choose the former, around 1k queries would be performed, taking a significant load off my Python code; if I choose the latter, I would query once and then run a bunch of functions over all the records I have fetched, more or less the same number of times (1k).
The thing is, Python is not purely functional (if it were, I wouldn't be asking and would have finished and tested my work hundreds of times by now).
Which one would you people recommend?
For reference, I am using Python 2.7. It would be highly appreciated if you could provide sources of information too. Also, space is not an issue for fetching the whole dataset.
Thanks
If you have bandwidth to burn, and prefer Python to SQL, go ahead and do one big query and filter in Python.
Otherwise, you're probably better off with multiple queries.
Sorry, no references here. ^_^
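For concreteness, a hedged sketch of the two options with pyodbc; the DSN, table, and column names are all hypothetical.

    import pyodbc

    conn = pyodbc.connect("DSN=mydsn;UID=user;PWD=secret")
    cur = conn.cursor()

    # Option 1: let SQL Server do the filtering, one parameterised query per graph.
    cur.execute("SELECT DISTINCT sensor, value FROM readings WHERE batch_id = ?", 42)
    filtered_sql = cur.fetchall()

    # Option 2: pull everything once, then filter in Python with a comprehension.
    cur.execute("SELECT sensor, value, batch_id FROM readings")
    all_rows = cur.fetchall()
    filtered_py = [(r.sensor, r.value) for r in all_rows if r.batch_id == 42]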

Setup for high volume of database writing

I am researching a project that would require hundreds of database writes per minute. I have never dealt with this level of data writes before and I am looking for good scalable techniques and technologies.
I am a comfortable Python developer with experience in Django and SQLAlchemy. I am thinking I will build the data interface on Django, but I don't think it is a good idea to go through the ORM for the amount of data writes I will require. I am definitely open to learning new technologies.
The solution will live on Amazon web services, so I have access to all their tools. Ultimately I am looking for advice on database selection, data writing techniques, and any other needs I may have that I do not realize.
Any advice on where to start?
Thanks,
CG
Follow the trends; in other words, enter the world of NoSQL. Technologies worth considering include MongoDB and Redis. They are really fast and scalable, and have decent Python drivers. For example, MongoDB plays really nicely with Django and has a lot in common with traditional SQL databases like MySQL. Redis, on the other hand, has more "primitive" data structures but is superior in terms of speed (which of course depends somewhat on the drivers). Using either of them (or both, which is a clever idea for something glorious), you are free (and sometimes forced) to write your own "low-level" logic to meet your needs.
You should actually be okay with low hundreds of writes per minute through SQLAlchemy (that's only a couple per second); if you're talking more like a thousand a minute, yeah, that might be problematic.
What kind of data do you have? If it's fairly flat (few tables, few relations), you might want to investigate a non-relational database such as CouchDB or Mongo. If you want to use SQL, I strongly recommend PostgreSQL; it seems to deal with large databases and frequent writes a lot better than MySQL.
It also depends how complex the data is that you're inserting.
I think unfortunately, you're going to just have to try a couple things and run benchmarks, as each situation is different and query optimizers are basically magic.
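If the ORM does become the bottleneck, a common escape hatch is to drop down to SQLAlchemy Core for the hot write path and batch the inserts; here is a hedged sketch (the connection URL and table are hypothetical).

    from sqlalchemy import create_engine, MetaData, Table, Column, Integer, String

    engine = create_engine("postgresql://user:secret@localhost/mydb")
    metadata = MetaData()
    events = Table("events", metadata,
                   Column("id", Integer, primary_key=True),
                   Column("payload", String))
    metadata.create_all(engine)

    with engine.begin() as conn:  # one transaction for the whole batch
        conn.execute(events.insert(), [{"payload": "a"}, {"payload": "b"}])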
If it's just a few hundred writes you can still manage with a relational DB. I'd pick PostgreSQL (8.0+), which has a separate background writer process. It also has tunable serialization levels, so you can trade off between speed and strict ACID compliance, some of it even at the transaction level.
Postgres is well documented, but it assumes some deeper understanding of SQL and relational DB theory to fully understand and make the most of it.
The alternative would be a newfangled "NoSQL" system, which can probably scale even better, but at the cost of buying into a very different technology stack.
Anyway, if you are using Python, losing a few writes on shutdown or power loss is not 100% critical, and you need low latency, then use a thread-safe Queue.Queue and worker threads to decouple the writes from your main application thread(s).
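A sketch of that queue/worker decoupling; write_to_db is a placeholder for the real INSERT (psycopg2, SQLAlchemy Core, or whatever backend you end up with).

    import queue       # the module is named Queue in Python 2
    import threading

    write_queue = queue.Queue()
    STOP = object()    # sentinel telling the worker to shut down

    def write_to_db(item):
        print("writing", item)   # stand-in for the actual database write

    def writer():
        while True:
            item = write_queue.get()
            if item is STOP:
                break
            write_to_db(item)
            write_queue.task_done()

    threading.Thread(target=writer, daemon=True).start()

    # Main application threads just enqueue and return immediately.
    write_queue.put({"sensor": "a1", "value": 42})
    write_queue.put(STOP)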

Python Psych Experiment needs (simple) database: please advise

I am coding a psychology experiment in Python. I need to store user information and scores somewhere, and I need it to work as a web application (and be secure).
I don't know much about this. I'm considering XML databases, Berkeley DB, SQLite, an OpenOffice spreadsheet, and I'm very interested in the Python "shelve" library.
(Most of my info comes from this thread: http://developers.slashdot.org/story/08/05/20/2150246/FOSS-Flat-File-Database)
DATA: I figure that I'm going to have maximally 1000 users. For each user I've got to store...
Username / Pass
User detail fields (for a simple profile)
User scores on the exercise (2 datapoints per trial: each trial gets a score (correct/incorrect/timeout) and has an associated number from 0.1 to 1.0 that I need to record)
Metadata about the trials (when, who, etc.)
Results of data analysis for user
VERY rough estimate: each user generates 100 trials / day, so with 1000 users that's a maximum of 100k datapoints / day. It needs to run that way for about 3 months, so roughly 9m datapoints. A safety multiplier of 2x gives me a target of a database that can handle about 20m datapoints.
((Note: I could either store trial response data as individual datapoints, or group trials into Python list objects of varying length (user "sessions"). The latter would dramatically bring down the number of database entries, though not the amount of data. Does it matter? How?))
I want a solution that will work (at least) until I get to this 1000 users level. If my program is popular beyond that level, I'm alright with doing some work modding in a beefier DB. Also reiterating that it must be easily deployable as a web application.
Beyond those basic requirements, I just want the easiest thing that will make this work. I'm pretty green.
Thanks for reading
Tr3y
SQLite can certainly handle that amount of data. It has a very large userbase, with a few very well-known users on all the major platforms; it's fast and light, and there are awesome GUI clients that let you browse and extract/filter data with a few clicks.
SQLite won't scale indefinitely, of course, but severe performance problems begin only when simultaneous inserts are needed, which I would guess becomes a problem several orders of magnitude beyond your projected load.
I've been using it for a few years now and have never had a problem with it (although for larger sites I use MySQL). Personally, I find its tagline, "Small. Fast. Reliable. Choose any three.", quite accurate.
As for ease of use... the sqlite3 bindings (site temporarily down) are part of the Python standard library. Here you can find a small tutorial. Interestingly enough, simplicity is a design criterion for SQLite. From here:
Many people like SQLite because it is small and fast. But those qualities are just happy accidents. Users also find that SQLite is very reliable. Reliability is a consequence of simplicity. With less complication, there is less to go wrong. So, yes, SQLite is small, fast, and reliable, but first and foremost, SQLite strives to be simple.
There's a pretty spot-on discussion of when to use SQLite here. My favorite line is this:
Another way to look at SQLite is this: SQLite is not designed to replace Oracle. It is designed to replace fopen().
It seems to me that for your needs, SQLite is perfect. Indeed, it seems to me very possible that you will never need anything else:
With the default page size of 1024 bytes, an SQLite database is limited in size to 2 terabytes (2^41 bytes).
It doesn't sound like you'll have that much data at any point.
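For the data described above, a minimal sketch with the standard-library sqlite3 module (the file name, table, columns, and sample row are made up to match the question):

    import sqlite3

    conn = sqlite3.connect("experiment.db")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS trials (
            username TEXT,
            trial_ts TEXT,
            outcome  TEXT,   -- 'correct' / 'incorrect' / 'timeout'
            score    REAL    -- the associated 0.1 to 1.0 value
        )
    """)
    conn.execute(
        "INSERT INTO trials (username, trial_ts, outcome, score) VALUES (?, ?, ?, ?)",
        ("tr3y", "2010-06-01T12:00:00", "correct", 0.8),
    )
    conn.commit()
    conn.close()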
I would consider MongoDB. It's very easy to get started, and is built for multi-user setups (unlike SQLite).
It also has a much simpler model. Instead of futzing around with tables and fields, you simply take all the data in your form and stuff it in the database. Even if your form changes (oops, forgot a field) you won't need to change MongoDB.
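A hedged sketch with pymongo (the database and collection names are hypothetical); the collection is created lazily on first insert, with no schema to declare up front.

    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")
    trials = client.experiment.trials

    # Stuff the whole form in as-is; adding a field later needs no migration.
    trials.insert_one({"username": "tr3y", "outcome": "correct", "score": 0.8})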
