Efficient way to combine results of two database queries - Python

I have two tables on different servers, and I'd like some help finding an efficient way to combine and match the datasets. Here's an example:
From server 1, which holds our stories, I perform a query like:
query = """SELECT author_id, title, text
FROM stories
ORDER BY timestamp_created DESC
LIMIT 10
"""
results = DB.getAll(query)
# Build a string of author_ids, e.g. '1314,4134,2624,2342'
# (assuming each row is a sequence with author_id first)
author_ids = [row[0] for row in results]
uid_list = ",".join(str(aid) for aid in author_ids)
But I'd like to fetch some info about each author_id from server 2:
query = """SELECT id, avatar_url
FROM members
WHERE id IN (%s)
"""
values = (uid_list,)
results = DB.getAll(query, values)
Now I need some way to combine these two result sets so that I have a dict containing each story along with its avatar_url and member id.
If this data were on one server, it would be a simple join that would look like:
SELECT *
FROM members, stories
WHERE members.id = stories.author_id
But since we store the data on multiple servers, this is not possible.
What is the most efficient way to do this? I understand the merging probably has to happen in my application code ... any efficient sample code that minimizes the number of dict loops would be greatly appreciated!
Thanks.

If memory isn't a problem, you could use a dictionary.
results1_dict = dict((row[0], list(row[1:])) for row in results1)
results2_dict = dict((row[0], list(row[1:])) for row in results2)
for key, value in results2_dict.items():
    if key in results1_dict:
        results1_dict[key].extend(value)
    else:
        results1_dict[key] = value
Thanks to the constant-time dictionary lookups, this is roughly linear in the number of rows rather than quadratic, it is relatively simple, and you can tweak it to do precisely what you need.
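Applied to the stories/avatars case in the question, a sketch along those lines (assuming DB.getAll returns rows as sequences in the selected column order, as the question's code suggests):
stories = DB.getAll("""SELECT author_id, title, text
                       FROM stories
                       ORDER BY timestamp_created DESC
                       LIMIT 10""")

# One placeholder per id keeps the query properly parameterized.
author_ids = [row[0] for row in stories]
placeholders = ",".join(["%s"] * len(author_ids))
members = DB.getAll("SELECT id, avatar_url FROM members WHERE id IN (%s)" % placeholders,
                    author_ids)

# Index avatars by member id, then make a single O(n) pass over the stories.
avatars = {member_id: avatar_url for member_id, avatar_url in members}
combined = [
    {"author_id": aid, "title": title, "text": text, "avatar_url": avatars.get(aid)}
    for aid, title, text in stories
]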

The only option looks to be a database link, but that is unfortunately unavailable in MySQL.
You'll have to do the merging in your application code. It would be better to keep the data in the same database.

You will have to bring the data together somehow.
There are things like server links (though that is probably not the correct term in a MySQL context) that might allow querying across different DBs. This opens up another set of problems (security!).
The easier solution is to bring the data together in one DB.
The last (least desirable) solution is to join in code as Padmarag suggests.

Is it possible to setup replication of the needed tables from one server to a database on the other?
That way you could have all your data on one server.
Also, see the FEDERATED storage engine, available since MySQL 5.0.3.

Related

django-tables2 flooding database with queries

I'm using django-tables2 to show values from a database query, and everything works fine. I'm now using django-debug-toolbar and was looking through my pages with it, more out of curiosity than performance needs. When I looked at the page with the table, I saw that the debug toolbar registered over 300 queries for a table with a little over 300 entries. I don't think flooding the DB with so many queries is a good idea, even if there is no performance impact (at least not yet). All the data should be coming from only one query.
Why is this happening, and how can I reduce the number of queries?
I'm posting this as a future reference for myself and others who might have the same problem.
After searching for a bit, I found out that django-tables2 was sending a single query for each row. The query was something like SELECT * FROM "table" LIMIT 1 OFFSET 1, with an increasing offset.
I reduced the number of SQL calls by calling query = list(query) before creating the table and passing the query in. By evaluating the queryset in the Python view code, the table now works with the evaluated data, and there is only one database call instead of hundreds.
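For illustration, a minimal sketch of the workaround in the view (MyModel, MyTable, and the template name are placeholders):
from django.shortcuts import render

def my_view(request):
    # Evaluating the queryset up front forces a single SELECT;
    # django-tables2 then paginates the in-memory list instead of
    # issuing one LIMIT/OFFSET query per row.
    rows = list(MyModel.objects.all())
    table = MyTable(rows)
    return render(request, "my_template.html", {"table": table})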
This was a bug that has since been fixed: https://github.com/bradleyayers/django-tables2/issues/427

Using Python to Query multiple SQL databases on different servers

I have been doing a fair amount of manual data analysis, reporting and dashboarding recently via SQL, and I wonder whether Python could automate a lot of this. I am not familiar with Python at all, so I hope my question makes sense. For security/performance reasons, we store databases on a number of servers (more than 5) which contain data that would be pertinent to a query. Unfortunately, these servers are set up so they cannot talk to each other, so I can't pull data from two servers in the same query. I believe this is a limitation of using Windows credentials/security.
For my data analysis and reporting needs, I need to grab pertinent data from two or more of these, so the way I currently do it is: run a query, grab the results, run another query with those results, do some formula work in Excel, then run another query, and so on until I get what I need.
Unfortunately this is both time consuming and forces me to pull massive datasets (multiple millions of rows), which I then have to continually narrow down based on criteria that live in said databases.
I know Python has the ability to query SQL Server; however, I figured I would ask the experts:
Can I manipulate the data in the background with Python, similar to what I can do with Excel (lookups, statistical functions, etc.), perhaps even XML/web APIs?
Can Python handle connections to multiple different database servers at the same time?
Does Python handle Windows credentials well?
If Python is not the tool for this, can you name one that would work better?
Please let me know if I can provide additional pertinent details.
Ideally, I would like to end up creating our own separate database and creating automated processes to pull everything from other databases but currently that is not possible due to project constraints.
Thanks!
I didn't use Windows credentials, but I have used Python to work with multiple MS SQL databases at the same time, and it worked very well. You can use the pymssql library, or better, SQLAlchemy.
But I think you should start with a basic tutorial about Python first. Because you want to work with millions of rows, it's very important to understand list, set, tuple and dict in Python; for good performance, you should use the right type.
A basic example with pymssql:
import pymssql

conn1 = pymssql.connect("Host1", "user1", "password1", "db1")
conn2 = pymssql.connect("Host2", "user2", "password2", "db2")

cursor1 = conn1.cursor()
cursor2 = conn2.cursor()

# SQL Server uses TOP instead of LIMIT
cursor1.execute('SELECT TOP 10 * FROM TABLE1')
cursor2.execute('SELECT TOP 10 * FROM TABLE2')

result1 = cursor1.fetchall()
result2 = cursor2.fetchall()

# print each row from server 1
for row in result1:
    print(row)

# print each row from server 2
for row in result2:
    print(row)
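To tie this back to combining data from the two servers, a minimal sketch that joins the two result sets in Python, assuming the first column of each row is the shared key:
# Index server 2's rows by key once, then do O(1) lookups per row from server 1.
lookup = {row[0]: row for row in result2}
joined = [tuple(row) + tuple(lookup[row[0]][1:])
          for row in result1 if row[0] in lookup]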
You can do all of what you asked. Python lets you create multiple connection objects via a library, so, for example, with a MySQL driver you would create two different objects like this:
# Illustrative example using mysql-connector-python; hosts and credentials are placeholders
import mysql.connector

conn1 = mysql.connector.connect(host="server1", user="user", password="secret", database="db1")
conn2 = mysql.connector.connect(host="server2", user="user", password="secret", database="db2")
Like this, conn1 connects to one database and conn2 connects to a different one; you would then usually run your queries through a cursor on each connection:
cursor1 = conn1.cursor()
cursor2 = conn2.cursor()
cursor1.execute(query_to_server_1)
cursor2.execute(query_to_server_2)
This helps maintain two different connections in the same script. If you are looking for multithreading, Python's standard library (for example, threading or concurrent.futures) will help you execute multiple tasks from one master script.
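For example, a sketch using the standard library's concurrent.futures, reusing the conn1/conn2 connections and query placeholders from above:
from concurrent.futures import ThreadPoolExecutor

def run_query(conn, sql):
    # Each thread gets its own cursor; here each connection is only
    # ever used from one thread, which most DB-API drivers allow.
    cur = conn.cursor()
    cur.execute(sql)
    return cur.fetchall()

with ThreadPoolExecutor(max_workers=2) as pool:
    f1 = pool.submit(run_query, conn1, query_to_server_1)
    f2 = pool.submit(run_query, conn2, query_to_server_2)
    result1, result2 = f1.result(), f2.result()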

Which way of fetching is more Efficient in Redis?

Hi, I'm fairly new to Redis and currently face a problem. My problem is: "I don't know which way gives better performance."
Way #1: Cache all the data in Redis and then query it (I don't know whether it is even possible to query Redis; if it is, how?).
For example, in the following table, cache all the data under a single key (this way my table maps to 1 key) and then query for users in the same city.
Way #2: Cache the users of each city under a separate key (this way my table maps to 4 keys) and then fetch each key separately.
Cache all users with the same city under a separate key: that is the Redis way. Fast inserts and fast gets, at the cost of higher memory consumption and some data redundancy.
In general you can't follow your way #1 example. Why not? Redis does not have any out-of-the-box solution for querying data in SQL terms. You can't do something like "select something from somewhere where criteria" against most Redis data structures. You can write a Lua script for a complex map/reduce solution over your data, but nothing like that comes out of the box.
You should remember: every time you want to "join this and this data", you can do it only in application code or in a Redis Lua script. Yes, you get some join-like capacity with ZSETs and SETs, but it is not what you require here.
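For what it's worth, a minimal sketch of way #2 with redis-py (key names and fields are illustrative):
import redis

r = redis.Redis()

# One set of user ids per city acts as the "index".
r.sadd("users:city:London", 1, 2)
r.sadd("users:city:Paris", 3)

# Each user lives in its own hash.
r.hset("user:1", mapping={"name": "Alice", "city": "London"})
r.hset("user:2", mapping={"name": "Bob", "city": "London"})

# Fetching all users of a city: one SMEMBERS, then one HGETALL per user.
user_ids = r.smembers("users:city:London")
users = [r.hgetall(f"user:{uid.decode()}") for uid in user_ids]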

What are some ways to maintain data consistency at the application layer of NoSQL?

My Python web application uses DynamoDB as its datastore, but this probably applies to other NoSQL stores where index consistency is handled at the application layer. I'm de-normalizing data and creating indices in several tables to facilitate lookups.
For example, for my users table:
* Table 1: (user_id) email, employee_id, first name, last name, etc.
* Table 2: (email) user_id
* Table 3: (employee_id) user_id
Table 1 is my "primary table" where user info is stored. If the user_id is known, all info about a user can be retrieved in a single GET query.
Tables 2 and 3 enable lookups by email or employee_id, requiring a query against those tables first to get the user_id, then a second query against Table 1 to retrieve the rest of the information.
My concern is with the de-normalized data: what is the best way to handle deletions from Table 1 to ensure the matching data gets deleted from Tables 2 and 3? And likewise for inserts?
Right now my chain of events is something like:
1. Insert row in table 1
2. Insert row in table 2
3. Insert row in table 3
Does it make sense to add "checks" at the end? Something like:
4. Check that all 3 rows have been inserted.
5. If a row is missing, remove rows from all tables and raise an error.
Any other techniques?
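To make step 5 concrete, here is a sketch of the insert-plus-rollback idea with boto3 (table and attribute names are placeholders, not my actual schema):
import boto3

dynamodb = boto3.resource("dynamodb")
users = dynamodb.Table("users")
by_email = dynamodb.Table("users_by_email")
by_employee = dynamodb.Table("users_by_employee_id")

def create_user(user):
    done = []
    try:
        # Write the primary row first, then the index rows.
        for table, item, key_attr in [
            (users, user, "user_id"),
            (by_email, {"email": user["email"], "user_id": user["user_id"]}, "email"),
            (by_employee, {"employee_id": user["employee_id"], "user_id": user["user_id"]}, "employee_id"),
        ]:
            table.put_item(Item=item)
            done.append((table, item, key_attr))
    except Exception:
        # Best-effort rollback; this can itself fail, which is exactly
        # the consistency gap I'm worried about.
        for table, item, key_attr in done:
            table.delete_item(Key={key_attr: item[key_attr]})
        raise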
The short answer is: there is no way to ensure consistency. This is the price you agreed to pay when moving to NoSQL, in exchange for performance and scalability.
DynamoDB-mapper has a "transaction engine". Transaction objects are plain DynamoDB items and may be persisted. This way, if a logical group of actions (a.k.a. a transaction) has succeeded, we can be sure of it by looking at the persisted status. But we have no means to be sure it has not...
To do a bit of advertisement :), the dynamodb-mapper transaction engine supports:
single/multiple targets
sub transactions
transaction creating objects (not released yet)
If you are rolling your own mapper (which is an enjoyable task), feel free to have a look at our source code: https://bitbucket.org/Ludia/dynamodb-mapper/src/52c75c5df921/dynamodb_mapper/transactions.py
Disclaimer: I am one of the main developers of the dynamodb-mapper project. Feel free to contribute :)
Disclaimer: I haven't actually used DynamoDB, just looked through the data model and API, so take this for what it's worth.
The use case you're giving is one primary table for the data, with other tables for hand-rolled indices. This really sounds like work for an RDBMS (maybe with some sharding for growth). But if that won't cut it, here are a couple of ideas which may or may not work for you.
A. Leave it as it is. If you'll never serve data from your index tables, then maybe you can afford lazy deletion and insertion, as long as you handle the primary table first. Say this happens:
1) Delete JDoe from Main table
xxxxxxxxxx Process running code crashes xxxxxxx
2) Delete from email index // Never gets here
3) Delete from employee_id index // Never gets here
Well, if an "email" query comes in, you'll resolve the corresponding user_id from the (now stale) index, but it won't show up in the main table. You know that something is wrong, so you can return a failure/error and clean up the indexes. In other words, you just live with some stale data and save yourself the trouble, cleaning it up as necessary. You'll have to figure out how much stale data to expect, and maybe write a script that does some daily housekeeping.
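A minimal sketch of that lazy cleanup with boto3 (table and attribute names are placeholders):
import boto3

dynamodb = boto3.resource("dynamodb")
users = dynamodb.Table("users")
by_email = dynamodb.Table("users_by_email")

def get_user_by_email(email):
    ref = by_email.get_item(Key={"email": email}).get("Item")
    if ref is None:
        return None
    user = users.get_item(Key={"user_id": ref["user_id"]}).get("Item")
    if user is None:
        # Stale index entry: the main row is gone, so clean up lazily.
        by_email.delete_item(Key={"email": email})
        return None
    return user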
B. If you really want to simulate locks and transactions, you could consider using something like Apache ZooKeeper, which is a distributed system for managing shared resources like locks. It'd be more work and overhead, but you could probably set it up to do what you want.

Django objects.filter, how "expensive" would this be?

I am trying to build a search view in Django: a search form with free-text input plus some options to select, so that you can filter on years and so on. This is the part of the view code that does the filtering, and I would like some input on how expensive this would be for the database server.
soknad_list = Soknad.objects.all()

if var1:
    soknad_list = soknad_list.filter(pub_date__year=var1)
if var2:
    soknad_list = soknad_list.filter(muncipality__name__exact=var2)
if var3:
    soknad_list = soknad_list.filter(genre__name__exact=var3)

# TEXT SEARCH
stop_word_list = re.compile(STOP_WORDS, re.IGNORECASE)
search_term = '%s' % request.GET['q']
cleaned_search_term = stop_word_list.sub('', search_term)
cleaned_search_term = cleaned_search_term.strip()
if len(cleaned_search_term) != 0:
    soknad_list = soknad_list.filter(
        Q(dream__icontains=cleaned_search_term) |
        Q(tags__icontains=cleaned_search_term) |
        Q(name__icontains=cleaned_search_term) |
        Q(school__name__icontains=cleaned_search_term)
    )
So what I do is: first make a queryset of all objects, then check which variables exist (I fetch these with GET at an earlier point) and filter the results if they do. But this doesn't seem too elegant; it probably runs a lot of queries to achieve the result, so is there a better way to do it?
It does exactly what I want, but I guess there is a better/smarter way to do this. Any ideas?
filter itself doesn't execute a query; no query is executed until you explicitly fetch items from the queryset (e.g. with get()), and list(queryset) also executes it.
You can see the query that will be generated by using:
str(soknad_list.query)
You can then put that into your database shell to see how long the query takes, or use EXPLAIN (if your database backend supports it) to see how expensive it is.
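On newer Django versions (2.1 and later) the queryset can also run EXPLAIN for you; a minimal sketch:
# Runs EXPLAIN on the queryset's SQL and returns the plan as text.
print(soknad_list.explain())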
As Aaron mentioned, you should get hold of the query text that is going to be run against the database and use EXPLAIN (or some other method) to view the query execution plan. Once you have the execution plan, you can see what is going on in the database itself. There are a lot of operations that seem very expensive to run through procedural code but are trivial for any database to run, especially if you provide indexes that the database can use to speed up your query.
If I read your question correctly, you're retrieving a result set of all rows in the Soknad table. Once you have these results back, you use the filter() method to trim down your results to meet your criteria. From looking at the Django documentation, it looks like this will do an in-memory filter rather than re-query the database (of course, this really depends on which data access layer you're using and not on Django itself).
The most optimal solution would be to use a full-text search engine (Lucene, Ferret, etc.) to handle this for you. If that is not available or practical, the next best option would be to construct a query predicate (WHERE clause) before issuing your query to the database and let the database perform the filtering.
However, as with all things that involve the database, the real answer is 'it depends.' The best suggestion is to try out several different approaches using data that is close to production and benchmark them over at least three iterations before settling on a final solution. It may be just as fast, or even faster, to filter in memory rather than in the database.
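For instance, a rough benchmarking sketch with the standard library, where query_in_db and filter_in_memory are hypothetical callables wrapping the two approaches:
import timeit

# Take the best of three runs for each approach, as suggested above.
for approach in (query_in_db, filter_in_memory):
    best = min(timeit.repeat(approach, number=1, repeat=3))
    print(f"{approach.__name__}: {best:.3f}s")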
