First, let me begin by saying that I'm somewhat new to SQL (but have been doing Python for a long while). I've been having trouble finding a good solution to my problem on the net.
The problem:
I have an unknown number (although probably fewer than 100) of identically structured SQLite databases that I need to query and merge results from. The databases themselves are not particularly huge.
I've been looking at the ATTACH command and following things in this tutorial:
http://souptonuts.sourceforge.net/readme_sqlite_tutorial.html
import sqlite3 as lite

Database = "library.db"

con = lite.connect(Database)
con.row_factory = lite.Row
cur = con.cursor()

# Attach each database file under its own schema name.
cur.execute("ATTACH DATABASE 'library.db' AS db1")
cur.execute("ATTACH DATABASE 'library2.db' AS db2")
cur.execute("""
    SELECT 'db1', * FROM db1.MyTable
    UNION
    SELECT 'db2', * FROM db2.MyTable
    """)
but it seems like there should be a better way than to explicitly spell out each database in the execute command. Also, it looks like there's a limit to the number of databases that I can attach to? https://sqlite.org/limits.html
I've also looked at something like merging them together into a large database:
How can I merge many SQLite databases?
but it seems inefficient to merge the databases together each time a query needs to be made or one of the many individual databases gets changed.
Before I continue down some path, I was wondering if there were any better options for this type of thing that I'm not aware of?
Other potentially useful information about the project:
There are two main tables.
The tables from db to db can have duplicates.
I need to be able to grab unique values in columns for the "merged" databases.
Knowing which database the data came from is not essential.
The individual databases are updated frequently.
Queries to the "merged" database are made frequently.
You could avoid spelling out all databases in every query by using views:
CREATE VIEW MyTable_all AS
SELECT 'db1', db1.* FROM db1.MyTable
UNION ALL
SELECT 'db2', db2.* FROM db2.MyTable
...
However, if there are too many databases, you cannot use ATTACH.
In that case, you have to merge all the databases together.
If doing this every time for all databases is too slow, you can synchronize a single database at a time by keeping the source for each record:
DELETE FROM MyTable WHERE SourceDB = 1;
INSERT INTO MyTable SELECT 1, * FROM db1.MyTable;
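Either way, the per-database boilerplate doesn't have to be typed out by hand. As a rough sketch (one possible approach, not the only one), assuming a list of file names, a table called MyTable, and a column called some_column, Python can attach each file and build the UNION view for you:

import sqlite3

# Assumed list of source database files; adjust to your own file names.
db_files = ["library.db", "library2.db", "library3.db"]

con = sqlite3.connect(":memory:")  # or a persistent "merged" database file
selects = []
for i, path in enumerate(db_files, start=1):
    alias = f"db{i}"
    con.execute(f"ATTACH DATABASE ? AS {alias}", (path,))
    selects.append(f"SELECT '{alias}' AS SourceDB, * FROM {alias}.MyTable")

# A TEMP view, so it lives only for this connection and can reference the attached databases.
con.execute("CREATE TEMP VIEW MyTable_all AS " + " UNION ALL ".join(selects))

# Unique values across every database, without naming each one in the query.
for (value,) in con.execute("SELECT DISTINCT some_column FROM MyTable_all"):
    print(value)

Because the view simply re-reads the attached files on every query, updates to the individual databases show up without any re-merging, as long as you stay under the ATTACH limit.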
Related
I'm starting with Impala SQL and Hadoop and have a (probably simple) question.
I have a Hadoop database with hundreds of tables that share the same schema and naming convention (e.g. process_1, process_2, process_3 and so on). How would I query all the tables and concatenate them into one big table or dataframe? Is it possible to do so using just Impala SQL, returning one dataframe in Python?
Something like:
SELECT * FROM 'process_*';
Or do I need to run SHOW TABLES 'process_*', use a loop in Python, and query each table separately?
If you are looking for a purely Impala solution, then one approach would be to create a view on top of all of the tables, something like below:
create view process_view_all_tables as
select * from process1
union all
select * from process2
union all
...
select * from processN;
The disadvantages of this approach are:
You need to union multiple tables together. UNION is an expensive operation in terms of memory utilisation; it works fine with a small number of tables, say in the range of 2-5.
You need to add all the tables manually. If you add a new process table in the future, you will need to ALTER the view to include it. This is a maintenance headache.
The view assumes that all the PROCESS tables are of the same schema.
For the second approach, as you said, you could query the list of tables from Impala using SHOW TABLES LIKE 'process*' and write a small program to iterate over the list of tables and create the files.
Once you have the files generated, you could port them back to HDFS and create a table on top of them.
The only disadvantage of the second approach is that each iteration makes a separate Impala request, which is particularly costly in a multi-tenant database environment.
In my opinion, you should try the second approach.
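A rough sketch of that second approach from Python, assuming the impyla client and pandas (the host, port, and table names are placeholders, not from the question):

from impala.dbapi import connect  # assumes the impyla package is installed
import pandas as pd

conn = connect(host="impala-host", port=21050)  # placeholder host/port
cur = conn.cursor()

# Find every table that follows the naming convention.
cur.execute("SHOW TABLES LIKE 'process*'")
tables = [row[0] for row in cur.fetchall()]

# Query each table and concatenate the results into one DataFrame.
frames = []
for table in tables:
    cur.execute(f"SELECT * FROM {table}")
    columns = [d[0] for d in cur.description]
    frames.append(pd.DataFrame(cur.fetchall(), columns=columns))

merged = pd.concat(frames, ignore_index=True)

This keeps the concatenation in pandas rather than in Impala, at the cost of one query per table.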
Hope this helps :)
I have to run a lot of queries against a big SQL Server database and the process is very slow.
I can do something like
import pyodbc
#database connection
tsql = "SELECT * FROM table1"
with cursor.execute(tsql):
    rows = cursor.fetchall()
But how can I do things like SELECT specific rows after the data is loaded in memory? I have a lot of queries that depend on other queries, and I think it's not very efficient to iterate over thousands of rows to filter specific data, so I assume there is a better approach.
Thanks in advance.
If you truly have a "big SQL server database" then you don't want to load entire tables into memory. Instead, for each task just load the columns of interest from the relevant rows using standard SQL methods, e.g.,
crsr.execute("SELECT firstname, lastname FROM table1 WHERE country = 'Canada'")
rows = crsr.fetchall()
If you need information from several related tables then search the web for basic SQL tutorials that describe how to use the JOIN keyword to accomplish that.
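For instance, a hedged sketch of such a join with pyodbc (the customers/orders tables and their columns are made-up names, not from the question):

# Hypothetical schema: customers(id, firstname, lastname, country), orders(customer_id, total)
crsr.execute("""
    SELECT c.firstname, c.lastname, o.total
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.id
    WHERE c.country = ?
    """, "Canada")
rows = crsr.fetchall()

The database does the filtering and joining, so only the rows you actually need cross the network.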
I have been doing a fair amount of manual data analysis, reporting and dashboarding recently via SQL and wonder if Python would be able to automate a lot of this. I am not familiar with Python at all, so I hope my question makes sense. For security/performance reasons, we store databases on a number of servers (more than 5) which contain data that would be pertinent to a query. Unfortunately, these servers are set up so they cannot talk to each other, so I can't pull data from two servers in the same query. I believe this is a limitation due to using Windows credentials/security.
For my data analysis and reporting needs, I need to be able to grab pertinent data from two or more of these, so the way I currently do this is by running a query, grabbing the results, running another query with the results, doing some formula work in Excel, and then running another query, and so on until I get what I need.
Unfortunately this is both time-consuming and makes me pull massive datasets (multiple millions of rows), which I then have to continually narrow down based on criteria that live in those databases.
I know Python has the ability to query SQL Server, however I figured I would ask the experts:
Can I manipulate the data in the background with Python similar to how I can in Excel (lookups, statistical functions, etc.), perhaps even XML/web APIs?
Can Python handle connections to multiple different database servers at the same time?
Does Python handle Windows credentials well?
If Python is not the tool for this, can you name one that would work better?
Please let me know if I can provide additional pertinent details.
Ideally, I would like to end up creating our own separate database and creating automated processes to pull everything from other databases but currently that is not possible due to project constraints.
Thanks!
I haven't used Windows credentials, but I have used Python to work with multiple MS SQL databases at the same time, and it worked very well. You can use the pymssql library, or better, SQLAlchemy.
But I think you should start with a basic Python tutorial first. Because you want to work with millions of rows, it's very important to understand list, set, tuple, and dict in Python; for good performance, you should use the right type.
A basic example with pymssql
import pymssql

# Two independent connections to two different servers/databases.
conn1 = pymssql.connect("Host1", "user1", "password1", "db1")
conn2 = pymssql.connect("Host2", "user2", "password2", "db2")

cursor1 = conn1.cursor()
cursor2 = conn2.cursor()

# SQL Server uses TOP instead of LIMIT.
cursor1.execute('SELECT TOP 10 * FROM TABLE1')
cursor2.execute('SELECT TOP 10 * FROM TABLE2')

result1 = cursor1.fetchall()
result2 = cursor2.fetchall()

# print each row from the first database
for row in result1:
    print(row)

# print each row from the second database
for row in result2:
    print(row)
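Since SQLAlchemy was mentioned above, here is a minimal sketch of the same idea with it (the connection URLs are placeholders and assume the pymssql driver):

from sqlalchemy import create_engine, text

# Placeholder connection URLs.
engine1 = create_engine("mssql+pymssql://user1:password1@Host1/db1")
engine2 = create_engine("mssql+pymssql://user2:password2@Host2/db2")

with engine1.connect() as conn1, engine2.connect() as conn2:
    rows1 = conn1.execute(text("SELECT TOP 10 * FROM TABLE1")).fetchall()
    rows2 = conn2.execute(text("SELECT TOP 10 * FROM TABLE2")).fetchall()

print(rows1, rows2)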
You can do all of what you asked. Python lets you create multiple connection objects via a library, so, for example, if you use a MySQL Python driver you would create two different objects like this:
NOT ACTUAL CODE, JUST EXAMPLE
conn1 = mysqlConnect(server1, user, pass)
conn2 = mysqlConnect(server2, user, pass)
Like this, conn1 connects to one database and conn2 connects to a different one; usually you would then do:
conn1.execute(query_to_server_1)
conn2.execute(query_to_server_2)
This helps maintain two different connections in the same script. If you are looking for multithreading, Python's standard library (for example the threading module) will help you execute multiple tasks from one master script.
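A minimal concrete version of that pseudocode, assuming the mysql-connector-python package (hosts, credentials and table names are placeholders):

import mysql.connector

# Two separate servers, two separate connections.
conn1 = mysql.connector.connect(host="server1", user="user", password="pass", database="db1")
conn2 = mysql.connector.connect(host="server2", user="user", password="pass", database="db2")

cur1 = conn1.cursor()
cur2 = conn2.cursor()

cur1.execute("SELECT COUNT(*) FROM some_table")   # query_to_server_1
cur2.execute("SELECT COUNT(*) FROM other_table")  # query_to_server_2

print(cur1.fetchone(), cur2.fetchone())

cur1.close(); cur2.close()
conn1.close(); conn2.close()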
I have a MySQL server providing access to both a database for the Django ORM and a separate database called "STATES" that I built. I would like to query tables in my STATES database and return results (typically a couple of rows) to Django for rendering, but I don't know the best way to do this.
One way would be to use Django directly. Maybe I could move the relevant tables into the Django ORM database? I'm nervous about doing this because the STATES database contains large tables (10 million rows x 100 columns), and I worry about deleting that data or messing it up in some other way (I'm not very experienced with Django). I also imagine I should avoid creating a separate connection for each query, so I should use the Django connection to query STATE tables?
Alternatively, I could treat the STATE database as existing on a totally different server. I could import SQLAlchemy, create a connection, query STATE.table, return the result, and close that connection.
Which is better, or is there another path?
The docs describe how to connect to multiple databases by adding another entry ("state_db") to DATABASES in settings.py. I can then do the following:
from django.db import connections

def query(lname):
    c = connections['state_db'].cursor()
    c.execute("SELECT last_name FROM STATE.table WHERE last_name=%s;", [lname])
    rows = c.fetchall()
    ...
This is slower than I expected, but I'm guessing this is close to optimal because it uses the open connection and Django without adding extra complexity.
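For reference, a rough sketch of the settings.py entry that this assumes (the engine, name, and credentials are placeholders):

# settings.py
DATABASES = {
    "default": {
        # existing Django ORM database settings
    },
    "state_db": {
        "ENGINE": "django.db.backends.mysql",
        "NAME": "STATES",
        "USER": "dbuser",
        "PASSWORD": "dbpassword",
        "HOST": "localhost",
        "PORT": "3306",
    },
}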
I'm using python and psycopg2 to remotely query some psql databases, and I'm trying to figure out the best way to select the data I need from the remote table, and insert it into a table on a separate DB (local application server).
Most of the stuff I've read has directed me to avoid executemany and look toward COPY operations, but I'm unsure how to implement this on a specific select statement as opposed to the entire table. Should I be headed this way or am I completely off?
but I'm unsure how to implement this on a specific select statement as opposed to the entire table
COPY isn't limited to tables; you can use a query as the source as well. Check out the examples in the manual, which show how to use COPY to create a text file based on a query:
http://www.postgresql.org/docs/current/static/sql-copy.html#AEN59055
(3rd example)
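In psycopg2 that pattern maps onto copy_expert. A hedged sketch of streaming a query result from the remote database into a local table (the connection strings, query, and table names are placeholders):

import io
import psycopg2

remote = psycopg2.connect("host=remote.example dbname=source user=me")  # placeholder DSN
local = psycopg2.connect("host=localhost dbname=app user=me")           # placeholder DSN

buf = io.StringIO()
with remote.cursor() as rcur:
    # COPY accepts a parenthesized query as the source, not just a table name.
    rcur.copy_expert(
        "COPY (SELECT id, name FROM big_table WHERE updated_at > now() - interval '1 day') "
        "TO STDOUT WITH CSV",
        buf,
    )

buf.seek(0)
with local.cursor() as lcur:
    lcur.copy_expert("COPY local_table (id, name) FROM STDIN WITH CSV", buf)
local.commit()

For very large result sets you could stream through a temporary file instead of an in-memory buffer, but the COPY statements stay the same.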
Take a look at http://ryrobes.com/featured-articles/using-a-simple-python-script-for-end-to-end-data-transformation-and-etl-part-1/
Granted, this is pulling from Oracle and inserting into SQL Server, but the concepts should be the same.