Load SQL Server Database into memory with Python - python

I have to run a lot of queries against a big SQL Server database and the process is very slow.
I can do something like
import pyodbc

# database connection (connection setup omitted here)
conn = pyodbc.connect(conn_str)   # conn_str is a placeholder connection string
cursor = conn.cursor()
tsql = "SELECT * FROM table1"
cursor.execute(tsql)
rows = cursor.fetchall()
But how can I do things like selecting specific rows after the data is loaded in memory? I have a lot of queries that depend on other queries, and iterating over thousands of rows to filter specific data doesn't seem very efficient, so I think there might be a better approach.
Thanks in advance.

If you truly have a "big SQL server database" then you don't want to load entire tables into memory. Instead, for each task just load the columns of interest from the relevant rows using standard SQL methods, e.g.,
crsr.execute("SELECT firstname, lastname FROM table1 WHERE country = 'Canada'")
rows = crsr.fetchall()
If you need information from several related tables then search the web for basic SQL tutorials that describe how to use the JOIN keyword to accomplish that.
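As a hedged illustration of that idea (the orders table, its columns, and the join key are made up here), the filtering and combining can be pushed into SQL Server itself rather than done in Python:
# Hypothetical example: join and filter server-side instead of fetching whole tables
tsql = """
    SELECT p.firstname, p.lastname, o.order_date, o.total
    FROM table1 AS p
    JOIN orders AS o ON o.person_id = p.id
    WHERE p.country = 'Canada'
"""
crsr.execute(tsql)
rows = crsr.fetchall()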

Related

Using Impala to select multiple tables with wildcard pattern and concatenate them

I'm starting with Impala SQL and Hadoop and have a (probably simple) question.
I have a Hadoop database with hundreds of tables that share the same schema and naming convention (e.g. process_1, process_2, process_3 and so on). How would I query all the tables and concatenate them into one big table or dataframe? Is it possible to do so using just Impala SQL, returning one dataframe in Python?
Something like:
SELECT * FROM 'process_*';
Or do I need to run SHOW TABLES 'process_*', use a loop in Python and query each table separately?
If you are looking for a purely Impala solution, then one approach would be to create a view on top of all of the tables, something like below:
create view process_view_all_tables as
select * from process1
union all
select * from process2
union all
...
select * from processN;
The disadvantages of this approach are:
You need to union multiple tables together. UNION is an expensive operation in terms of memory utilisation. It works fine if you have a small number of tables, say in the range of 2-5.
You need to add all the tables manually. If you add a new process table in the future, you would need to ALTER the view to include it. This is a maintenance headache.
The view assumes that all the PROCESS tables have the same schema.
For the second approach, as you said, you could query the list of tables from Impala using SHOW TABLES LIKE 'process*', then write a small program that iterates over the list of tables and creates the files.
Once you have the file generated, you could port the file back to HDFS and create a table on top of it.
The only disadvantage of the second approach is that each iteration issues its own Impala request, which is particularly costly in a multi-tenant database environment.
In my opinion, you should try the second approach.
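A rough sketch of that second approach is below, assuming the impyla package for connectivity and pandas for the concatenation; the host, port, and naming pattern are placeholders:
import pandas as pd
from impala.dbapi import connect

conn = connect(host='impala-host', port=21050)   # placeholder connection details
cur = conn.cursor()

# List every table that matches the naming convention
cur.execute("SHOW TABLES LIKE 'process*'")
tables = [row[0] for row in cur.fetchall()]

# Query each table separately and concatenate the results into one dataframe
frames = []
for t in tables:
    cur.execute("SELECT * FROM {}".format(t))
    cols = [d[0] for d in cur.description]
    frames.append(pd.DataFrame(cur.fetchall(), columns=cols))

df_all = pd.concat(frames, ignore_index=True)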
Hope this helps :)

Fetching Million records from SQL server and saving to pandas dataframe

I am trying to fetch data from a SQL Server database (just a simple SELECT * query).
The table contains around 3-5 million records. Performing a SELECT * on the SQL Server directly using SSMS takes around 11-15 minutes.
However, when I connect via Python and try to save the data into a pandas dataframe, it takes forever: more than 1 hour.
Here is the code I am using:
import pymssql
import pandas as pd
from datetime import datetime

startTime = datetime.now()
## instance a python db connection object - same form as psycopg2/python-mysql drivers also
conn = pymssql.connect(server=r"xyz", database="abc", user="user", password="pwd")
print('Connecting to DB: ', datetime.now() - startTime)

stmt = "SELECT * FROM BIG_TABLE;"
# Execute the query and load the full result set into a dataframe
df_big_table = pd.read_sql(stmt, conn)
There must be a better way to do this, perhaps parallel processing or something similar to fetch the data more quickly?
My end goal is to migrate this table from SQL Server to Postgres.
This is the way I am doing it:
Fetch data from SQL server using python
Save it to a pandas dataframe
Save this data in CSV to disk.
Copy the CSV from disk to Postgres.
Probably I can combine steps 3 and 4 so that the transfer happens in memory rather than going through disk IO.
There are many complexities like table constraints and definitions, etc., which I will take care of later on. I cannot use a third-party tool.
I am stuck at steps 1 and 2, so help with the Python script (or some other open-source language) would be really appreciated.
If there is any other way to reach my end goal, I welcome suggestions!
Have you tried the 'chunksize' option of pandas.read_sql? You can still gather all of it into a single dataframe and produce the CSV.
If that takes too much time, you can write each chunk to its own file by using pandas.read_sql as an iterator, and then, after your processing is done, consolidate those files into a single one and submit it to Postgres.
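A minimal sketch of the chunked read, reusing the pymssql connection from the question; the chunk size and file name are arbitrary choices:
import pandas as pd

chunks = pd.read_sql("SELECT * FROM BIG_TABLE;", conn, chunksize=50000)
for i, chunk in enumerate(chunks):
    # Append each chunk to one CSV; write the header only for the first chunk
    chunk.to_csv("big_table.csv", mode="a", index=False, header=(i == 0))
Each chunk could also be streamed straight into Postgres (for example with psycopg2's copy_expert) instead of going through a CSV on disk, which would cover steps 3 and 4 in memory.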

Is there a way to view data on RDS?

I have a few questions about Amazon RDS.
I was able to create an instance, insert data using the code below, and view the results through Workbench, but is there any way to view the data on RDS itself?
I have a huge data set, probably containing 80k tuples, which I read from a CSV and insert into the database. Would changing to a better instance affect the speed of inserting the data? I tried a larger instance with around 7.5 GB RAM, but it didn't help; insertion was still very slow. What could be the problem? Am I not connected properly?
I used the following code to connect
import MySQLdb

db = MySQLdb.connect(host='west........{the endpoint}', port=3306, db='clouddb', user='{my_username}', passwd='{my_password}')
cursor = db.cursor()
cursor.execute('CREATE TABLE consumer_complaint(Complaint VARCHAR(255),Product VARCHAR(255),Subproduct VARCHAR(255),Issue VARCHAR(255),State VARCHAR(255),ZIPcode VARCHAR(255),Company VARCHAR(255),Companyresponse VARCHAR(255),Timelyresponse VARCHAR(255),Consumerdisputed VARCHAR(255));')
I'm getting no errors in connecting.
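As a hedged sketch (the insert code itself is not shown in the question), one common reason for slow loading is inserting and committing row by row; batching the rows with executemany and committing once usually helps. The CSV file name and the assumption that its columns match the table are made up here:
import csv

insert_sql = ("INSERT INTO consumer_complaint (Complaint, Product, Subproduct, Issue, "
              "State, ZIPcode, Company, Companyresponse, Timelyresponse, Consumerdisputed) "
              "VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s)")

with open('complaints.csv') as f:          # hypothetical file name
    reader = csv.reader(f)
    next(reader)                           # skip the header row, if there is one
    cursor.executemany(insert_sql, list(reader))
db.commit()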

Multiple identical databases

First let me begin by saying that I'm somewhat new to SQL (but have been doing Python for a long while). I've been having trouble finding a good solution to my problem on the net.
The problem:
I have an unknown number (although probably fewer than 100) of identically structured SQLite databases that I need to query and merge results from. The databases themselves are not particularly huge.
I've been looking at the ATTACH command and following things in this tutorial:
http://souptonuts.sourceforge.net/readme_sqlite_tutorial.html
import sqlite3 as lite
Database = "library.db"
con = lite.connect(Database)
con.row_factory = lite.Row
cur = con.cursor()
cur.execute("ATTACH DATABASE 'library.db' As 'db1'")
cur.execute("ATTACH DATABASE 'library2.db' As 'db2'")
cur.execute("""
SELECT 'db1',* FROM db1.table
UNION
SELECT 'db2',* FROM db2.table
""")
but it seems like there should be a better way than to explicitly spell out each database in the execute command. Also, it looks like there's a limit to the number of databases that I can attach: https://sqlite.org/limits.html
I've also looked at something like merging them together into a large database:
How can I merge many SQLite databases?
but it seems inefficient to merge the databases together each time a query needs to be made or one of the many individual databases changes.
Before I continue down some path, I was wondering if there were any better options for this type of thing that I'm not aware of?
Other potentially useful information about the project:
There are two main tables.
The tables from db to db can have duplicates.
I need to be able to grab unique values in columns for the "merged" databases.
Knowing which database what data came from is not essential.
The individual databases are updated frequently.
Queries to the "merged" database are made frequently.
You could avoid spelling out all databases in every query by using views:
CREATE VIEW MyTable_all AS
SELECT 'db1', db1.* FROM db1.MyTable
UNION ALL
SELECT 'db2', db2.* FROM db2.MyTable
...
However, if there are too many databases, you cannot use ATTACH.
In that case, you have to merge all the databases together.
If doing this every time for all databases is too slow, you can synchronize a single database at a time by keeping the source for each record:
DELETE FROM MyTable WHERE SourceDB = 1;
INSERT INTO MyTable SELECT 1, * FROM db1.MyTable;
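A rough sketch of building that union view dynamically over a list of database files; the file names, the table name MyTable, and the column SomeColumn are placeholders, and a TEMP view is used because a persistent view generally cannot reference attached databases:
import sqlite3

db_files = ["library1.db", "library2.db", "library3.db"]   # placeholder file names

con = sqlite3.connect(":memory:")
selects = []
for i, path in enumerate(db_files, start=1):
    alias = "db{}".format(i)
    con.execute("ATTACH DATABASE '{}' AS {}".format(path, alias))
    selects.append("SELECT '{}' AS SourceDB, * FROM {}.MyTable".format(alias, alias))

# TEMP view, so it may reference the attached databases
con.execute("CREATE TEMP VIEW MyTable_all AS " + " UNION ALL ".join(selects))

# Unique values across every attached database
rows = con.execute("SELECT DISTINCT SomeColumn FROM MyTable_all").fetchall()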

Efficient way to combine results of two database queries

I have two tables on different servers, and I'd like some help finding an efficient way to combine and match the datasets. Here's an example:
From server 1, which holds our stories, I perform a query like:
query = """SELECT author_id, title, text
FROM stories
ORDER BY timestamp_created DESC
LIMIT 10
"""
results = DB.getAll(query)
for i in range(len(results)):
    # Build a string of author_ids, e.g. '1314,4134,2624,2342'
But, I'd like to fetch some info about each author_id from server 2:
query = """SELECT id, avatar_url
FROM members
WHERE id IN (%s)
"""
values = (uid_list)
results = DB.getAll(query, values)
Now I need some way to combine these two queries so I have a dict that has the story as well as avatar_url and member_id.
If this data were on one server, it would be a simple join that would look like:
SELECT *
FROM members, stories
WHERE members.id = stories.author_id
But since we store the data on multiple servers, this is not possible.
What is the most efficient way to do this? I understand the merging probably has to happen in my application code ... any efficient sample code that minimizes the number of dict loops would be greatly appreciated!
Thanks.
If memory isn't a problem, you could use a dictionary.
results1_dict = dict((row[0], list(row[1:])) for row in results1)
results2_dict = dict((row[0], list(row[1:])) for row in results2)

for key, value in results2_dict.items():
    if key in results1_dict:
        results1_dict[key].extend(value)
    else:
        results1_dict[key] = value
This isn't particularly efficient (O(n²) in the worst case), but it is relatively simple and you can tweak it to do precisely what you need.
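Applied to the stories/members example above, a hedged variant that keys the lookup on the member id and builds one dict per story; the column order is assumed from the two queries shown in the question:
# Build an id -> avatar_url lookup from the members result set
avatars = dict((member_id, avatar_url) for member_id, avatar_url in results2)

combined = []
for author_id, title, text in results1:
    combined.append({
        "author_id": author_id,
        "title": title,
        "text": text,
        "avatar_url": avatars.get(author_id),   # None if no matching member row
    })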
The only option looks to be a database link, but that is unfortunately unavailable in MySQL.
You'll have to do the merging in your application code. It is better to keep the data in the same database.
You will have to bring the data together somehow.
There are things like server links (though that is probably not the correct term in a MySQL context) that might allow querying across different DBs. This opens up another set of problems (security!).
The easier solution is to bring the data together in one DB.
The last (least desirable) solution is to join in code as Padmarag suggests.
Is it possible to set up replication of the needed tables from one server to a database on the other?
That way you could have all your data on one server.
Also, see the FEDERATED storage engine, available since MySQL 5.0.3.
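A hedged illustration of the FEDERATED idea: a local proxy table on the stories server that points at the remote members table. The host, credentials, column definitions, and the generic DBAPI cursor are placeholders, and the FEDERATED engine must be enabled on the server:
create_federated = """
    CREATE TABLE members_remote (
        id INT NOT NULL,
        avatar_url VARCHAR(255),
        PRIMARY KEY (id)
    ) ENGINE=FEDERATED
    CONNECTION='mysql://user:password@server2:3306/memberdb/members';
"""
cursor.execute(create_federated)

# With the proxy table in place, the join can run on a single server
cursor.execute("""
    SELECT s.author_id, s.title, m.avatar_url
    FROM stories AS s
    JOIN members_remote AS m ON m.id = s.author_id
""")
rows = cursor.fetchall()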
