Incorporate Pandas Data Frame in Query to Database - python

I'm trying to figure out how to treat a pandas data frame as a SQL table when querying a database in Python.
I'm coming from a SAS background where work tables can easily be incorporated into direct database queries.
For example:
Select a.first_col,
b.second_col
from database.table1 a
left join work.table1 b on a.id = b.id;
Here work.table1 is not in the database, but is a dataset held in the local SAS server.
In my research I have found ways to write a data frame to a database and then include that in the query. I do not have write access to the database, so that is not an option for me.
I also know that I can use sqlalchemy with pd.to_sql() to put a data frame into a SQL engine, but I can't figure out if there is a way to connect that engine with the pyodbc connection I have with the database.
I also tried this, though I didn't think it would work (names of tables and columns altered).
df = pd.DataFrame(['A342', 'B432', 'W345'], columns=['id'])
query = '''
select a.id, b.id
from df a
left join database.base_table b on a.id = b.id
'''
query_results = pd.read_sql_query(query,connection)
As I expected, it didn't work.
I'm connecting to a Netezza database; I'm not sure if that matters.

I don't think this is possible; the data frame would have to be written to its own table in order to be queried, although I'm not familiar with Netezza.
Why not perform the join (a "merge" in pandas) purely in pandas? Understandably, if the SQL table is massive this isn't feasible, but:
query = """
SELECT id, x
FROM a
"""
a = pd.read_sql(query, conn)
df = # some dataframe in memory
pd.merge(df, a, on='id', how='left')
see https://pandas.pydata.org/docs/getting_started/comparison/comparison_with_sql.html for good docs about sql and pandas similarities
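If the dataframe only contributes a list of keys, another option is to restrict the query itself with an IN clause built from the dataframe and then merge locally. This is only a sketch, reusing the asker's df and connection and assuming the Netezza ODBC driver accepts pyodbc's qmark (?) parameter markers:
ids = list(df['id'])                       # e.g. ['A342', 'B432', 'W345']
placeholders = ','.join('?' * len(ids))    # one ? marker per id
query = f"""
select b.id, b.second_col
from database.base_table b
where b.id in ({placeholders})
"""
subset = pd.read_sql_query(query, connection, params=ids)
result = df.merge(subset, on='id', how='left')
This keeps the database read-only and only transfers the matching rows, at the cost of a practical limit on how many ids can be passed in a single statement.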

Related

How to query a T-SQL temp table with connectorx (pandas slow)

I am using pyodbc to run a query to create a temp table from a bunch of other tables. I then want to pull that whole temp table into pandas, but my pd.read_sql call takes upwards of 15 minutes. I want to try the connectorX library to see if it will speed things up.
For pandas the working way to query the temp table simply looks like:
conn = pyodbc.connect("connection string")
cursor = conn.cursor()
cursor.execute("""Do a bunch of stuff that ultimately creates one #finalTable""")
df = pd.read_sql("SELECT * FROM #finalTable", con=conn)
I've been reading the documentation and it appears I can only pass a connection string to the connectorx.read_sql function, and I haven't been able to find a way to pass it an existing connection that carries the temp table I need.
Am I able to query the temp table with connectorX? If so how?
If not, what would be a faster way to query a large temp table?
Thanks!
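For reference, connectorx opens its own connection from a connection string rather than accepting a live pyodbc connection, so a session-scoped #temp table created on another connection won't be visible to it. A hedged sketch of one possible workaround (server name and credentials hypothetical) is to materialize a global temp table instead and read that:
import connectorx as cx

cursor.execute("""Do a bunch of stuff, but SELECT ... INTO ##finalTable""")
conn.commit()   # make sure the global temp table is committed and visible

df = cx.read_sql(
    "mssql://user:password@server:1433/database",  # connectorx takes a URI, not a connection object
    "SELECT * FROM ##finalTable",
)
A global temp table is dropped once the creating session ends, so keep the pyodbc connection open while connectorx reads.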

How to convert select_from object into a new table in sqlalchemy

I have a database that contains two tables, cdr and mtr. I want a join of the two based on columns ego_id and alter_id, and I want to output this into another table in the same database, complete with the column names, without the use of pandas.
Here's my current code:
mtr_table = Table('mtr', MetaData(), autoload=True, autoload_with=engine)
print(mtr_table.columns.keys())
cdr_table = Table('cdr', MetaData(), autoload=True, autoload_with=engine)
print(cdr_table.columns.keys())
query = db.select([cdr_table])
query = query.select_from(mtr_table.join(cdr_table,
((mtr_table.columns.ego_id == cdr_table.columns.ego_id) &
(mtr_table.columns.alter_id == cdr_table.columns.alter_id))),
)
results = connection.execute(query).fetchmany()
Currently, for my test code, what I do is convert the results into a pandas dataframe and then put it back into the original SQL database:
df = pd.DataFrame(results, columns=results[0].keys())
df.to_sql(...)
but I have two problems:
loading everything into a pandas dataframe would require too much memory when I start working with the full database
the column names are (apparently) not included in results and would need to be accessed by results[0].keys()
I've checked this other stackoverflow question but it uses the ORM framework of sqlalchemy, which I unfortunately don't understand. If there's a simpler way to do this (like pandas' to_sql), I think this would be easier.
What's the easiest way to go about this?
So I found out how to do this via CREATE TABLE AS:
query = """
CREATE TABLE mtr_cdr AS
SELECT
mtr.idx,cdr.*
FROM mtr INNER JOIN cdr
ON (mtr.ego_id = cdr.ego_id AND mtr.alter_id = cdr.alter_id)""".format(new_table)
with engine.connect() as conn:
conn.execute(query)
The query string seems to be highly sensitive to parentheses though. If I put a parentheses enclosing the whole SELECT...FROM... statement, it doesn't work.
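A small caveat if this is run on SQLAlchemy 1.4+/2.0: Connection.execute no longer accepts a plain string, so the raw statement needs to be wrapped in text(). A minimal variant of the same call:
from sqlalchemy import text

# engine.begin() opens a transaction and commits the DDL when the block exits
with engine.begin() as conn:
    conn.execute(text(query))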

SQL query using pyodbc where the selected data is in a dataframe

What is the most efficient way to query my SQL (T-SQL) database when I want to inner join the queried data onto a pandas dataframe afterwards?
I don't know how to pass information into SQL from Python via a PYODBC query so my current best idea is to form the query in a way that I know aligns with my Python dataframe (i.e. I know all the information has STARTDATE > 2016, so it's easy for me to request that, and I know that PRODUCT = Private_Car). However if I use:
SELECT *
FROM rmrClaim
WHERE (PRODUCT = 'Private_Car') AND (YEAR >= 2016)
I am still going to bring in far more data than necessary. What I would rather be able to do is select only data which contains my merge key (ID) from the SQL DB.
Is there a more efficient way to query the DB so that given a pandas dataframe I can only bring the data which I will need for inner joining afterwards?
Can I pass a list from python into a sql query using PYODBC?
Edit - Trying to phrase differently:
I have a dataframe from CSV (dataframe A), and I want to take data from my SQL DB to produce a dataframe (dataframe B). The data in my SQL DB is much, much larger than the data in dataframe A, so I want to be able to send a SQL query which only requests data that is within dataframe A, so that I don't end up with a dataframe B which is 10x larger than dataframe A. My current idea for this is to use knowledge I have of dataframe A (i.e. that all of the data in dataframe A is after 2016); however, if there is a way to pass a list into my SQL query, I can query a subset of the data more efficiently.
Use pyodbc and build the query before passing it to pandas. Here is an example:
import pandas as pd
import pyodbc

connstr = "Driver={SQL Server};Server=MSSQLSERVER;Database=Claims;Trusted_Connection=yes;"
ids = tuple(dfA.Column)  # build an IN (...) list from dataframe A's merge key column
df = pd.read_sql("SELECT * FROM rmrClaim WHERE (PRODUCT = 'Private_Car') AND (YEAR >= 2016) AND ID IN {}".format(ids), pyodbc.connect(connstr))
df
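String-formatting values into SQL is fragile and open to injection; a safer sketch with the same table and column names uses pyodbc's ? parameter markers instead (SQL Server caps a statement at roughly 2100 parameters, so very long ID lists may need chunking):
ids = dfA['ID'].dropna().unique().tolist()
placeholders = ','.join('?' * len(ids))   # one ? per id
query = ("SELECT * FROM rmrClaim "
         "WHERE PRODUCT = 'Private_Car' AND YEAR >= 2016 "
         "AND ID IN ({})".format(placeholders))
df = pd.read_sql(query, pyodbc.connect(connstr), params=ids)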

Build normalised MSSQL dB from CSV files in Python + Pandas + sqlAlchemy

I am learning by doing - Python, Pandas, SQL & Deep Learning. I want to build a database with data for a deep learning experiment (with Keras and Tensorflow). The source data is ~10GB (total) of forex timestamped bid/ask data in 8 CSV files with source information encoded as three 3-4 char strings for categories Contributor, Region and City.
I can connect to my (empty) MSSQL database via pyodbc and sqlAlchemy; I can read my CSV files into dataframes; I can create a simple table in the dB and even create one from a simple dataframe; I can convert the date and time fields into the milliseconds since epoch I want. (And, FWIW, I have already implemented a working toy LSTM model to adapt to the price data, and I also have some analytical functions I wrote and compiled in Mathematica; I'll either call the C from Python or get Mathematica to work directly on the database.)
The issue is putting the CSV data into the database. Since there are only a dozen or so different sources in each category, I believe I should put Contributor etc. into separate tables with e.g. Contributor_ID as ints (?) so that data is stored compactly and queries like SELECT ... WHERE Region = 'SHRUBBERY' are efficient. (AFAICT I definitely shouldn't use enums because I may get more sources & categories later.)
My question is - assuming the aforementioned high level of ignorance! - how can/should I a) create the tables and relationships using python and then b) populate those tables?
Optional extra: to save space, the CSV files omit the Region and City where the row values are the same as those for the row above - reading the CSVs to collect just the source information (which takes about 50s for each category) I know how to deduplicate and dropna, but when I want to populate the dB, how can I most efficiently replace the na's with the values from the previous row? A simple For loop would do it, but is there e.g. some way to "propagate" the last "real" value in a column to replace the na using pandas?
CSV example:
Date        Time     Bid Price  Ask Price  Contributor  Region  City
04/02/2017  00:00.1  1.00266    1.00282    DCFX         ASI     AKL
04/02/2017  00:00.1  1.00263    1.0028     DCFX
04/02/2017  00:00.2  1.00224    1.00285    FXN          NAM     NYC
04/02/2017  00:00.2  1.00223    1.00288    FXN
All input gratefully received :)
Relational databases (RDBMS) aim to store data in related, logical groupings with a system of primary/foreign keys to normalize storage, which among other advantages maintains referential integrity (i.e., no orphaned records) and avoids repetition of stored data. For your situation, consider the following:
DATABASE DESIGN: Understand the workflow or "story" of your data pieces (e.g., which comes first/after in data entry) and construct the necessary schema of tables. The classic Database 101 example is Customers-Products-Orders, where many customers can purchase multiple products to fill many orders (1-to-many and many-to-many relationships) and the primary keys of parent tables become the foreign keys of child tables. Hence, aim for a schema layout like the one in the linked SO answer.
For your needs, your schema may involve Contributors, Regions, Cities, Markets, Company (Ticker), and Prices. This step will make use of DDL commands (CREATE TABLE, CREATE INDEX, CREATE SCHEMA), which can be run in pyodbc cursors or sqlAlchemy engine calls, provided the connected user has such privileges.
Typically, though, database design commands are run in a specialized admin console/IDE or command-line tool rather than in application-layer code like Python: SQL Server's Management Studio or sqlcmd, Oracle's SQL Developer/sqlplus, MySQL's Workbench/cli, or PostgreSQL's PgAdmin/psql. Below is an example of the setup for the Prices table:
# INITIALIZE SQLALCHEMY ENGINE
from sqlalchemy import create_engine

connection_string = 'mssql+pyodbc://{}:{}@{}/{}?driver=ODBC+Driver+17+for+SQL+Server'\
    .format(db_user, db_password, db_server, db_database)   # driver name may differ on your system
engine = create_engine(connection_string)

# SQL Server has no DOUBLE type, so DECIMAL is used for the prices
sql = """
CREATE TABLE Prices (
    ID INT IDENTITY(1,1) PRIMARY KEY,
    DateTime DATETIME,
    BidPrice DECIMAL(10,4),
    AskPrice DECIMAL(10,4),
    ContributorID INT,
    RegionID INT,
    CityID INT,
    CONSTRAINT FK_Contributor FOREIGN KEY (ContributorID) REFERENCES Contributors (ID),
    CONSTRAINT FK_Region FOREIGN KEY (RegionID) REFERENCES Regions (ID),
    CONSTRAINT FK_City FOREIGN KEY (CityID) REFERENCES Cities (ID)
)
"""
# SQL ACTION QUERY VIA TRANSACTION
with engine.begin() as conn:
    conn.execute(sql)
DATA POPULATION: Because a dataset/dataframe, CSV, or spreadsheet is NOT equivalent to a normalized RDBMS table but is effectively a query across multiple tables, migrating these sources requires some SQL wrangling to align them with the schema above. Simply uploading dataframes into SQL Server tables will lead to inefficient, repetitive storage. Therefore, consider the steps below:
Staging Tables (using to_sql)
Use staging (temp) tables that are raw dumps from pandas. For the NA issue, use the DataFrame or Series forward fill, ffill, to populate NAs from the rows above.
# FILL IN NAs IN ALL COLUMNS FROM THE PREVIOUS ROW
df = df.ffill()   # or df.fillna(method='ffill') on older pandas
# OR FILL IN NAs FOR SPECIFIC COLUMNS ONLY
df['Region'] = df['Region'].ffill()
df['City'] = df['City'].ffill()
# DUMP THE DATAFRAME INTO A STAGING TABLE
df.to_sql(name='pandas_prices_dump', con=engine, if_exists='replace', index=False)
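One step left implicit here is filling the lookup tables themselves before the final insert. A sketch of deriving them from the staging dump, assuming Contributors/Regions/Cities each have an identity ID plus the name columns used in the joins below:
# POPULATE LOOKUP TABLES WITH ANY NEW DISTINCT VALUES FROM THE STAGING DUMP
lookups = [('Contributors', 'ContributorName', 'Contributor'),
           ('Regions', 'RegionName', 'Region'),
           ('Cities', 'CityName', 'City')]
with engine.begin() as conn:
    for table, name_col, src_col in lookups:
        conn.execute("""
            INSERT INTO {t} ({n})
            SELECT DISTINCT {s} FROM pandas_prices_dump
            WHERE {s} NOT IN (SELECT {n} FROM {t})
        """.format(t=table, n=name_col, s=src_col))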
Migration to Final Tables (joining lookup tables by string names)
Then, run action queries (i.e., DML commands: INSERT INTO, UPDATE, DELETE) to populate the final tables from the staging (temp) tables.
sql = """
INSERT INTO Prices (Datetime, BidPrice, AskPrice,
ContributorID, RegionID, CityID)
SELECT pd.Datetime, pd.BidPrice, pd.AskPrice, c.ID, r.ID, cy.ID
FROM pandas_prices_dump pd
INNER JOIN Contributors c
ON c.ContributorName = pd.Contributor
INNER JOIN Regions r
ON r.RegionName = pd.Region
INNER JOIN Cities cy
ON cy.CityName = pd.City
"""
# APPEND FINAL DATA
with engine.begin() as conn:
conn.execute(sql)
# DROP STAGING TABLE
with engine.begin() as conn:
conn.execute("DROP TABLE pandas_prices_dump")
Test/Check Final Tables (using read_sql, joining lookup tables by IDs)
# IMPORT INTO PANDAS (EQUIVALENT TO THE ORIGINAL df)
sql = """
SELECT p.Datetime, p.BidPrice, p.AskPrice,
       c.ContributorName AS Contributor, r.RegionName AS Region,
       cy.CityName AS City
FROM Prices p
INNER JOIN Contributors c
    ON c.ID = p.ContributorID
INNER JOIN Regions r
    ON r.ID = p.RegionID
INNER JOIN Cities cy
    ON cy.ID = p.CityID
"""
prices_data = pd.read_sql(sql, engine)

Update MSSQL table through SQLAlchemy using dataframes

I'm trying to replace some old MSSQL stored procedures with python, in an attempt to take some of the heavy calculations off of the sql server. The part of the procedure I'm having issues replacing is as follows
UPDATE mytable
SET calc_value = tmp.calc_value
FROM dbo.mytable mytable INNER JOIN
#my_temp_table tmp ON mytable.a = tmp.a AND mytable.b = tmp.b AND mytable.c = tmp.c
WHERE (mytable.a = some_value)
and (mytable.x = tmp.x)
and (mytable.b = some_other_value)
Up to this point, I've made some queries with SQLAlchemy, stored those data in Dataframes, and done the requisite calculations on them. I don't know now how to put the data back into the server using SQLAlchemy, either with raw SQL or function calls. The dataframe I have on my end would essentially have to work in the place of the temporary table created in MSSQL Server, but I'm not sure how I can do that.
The difficulty, of course, is that I don't know of a way to join a dataframe to an MSSQL table, and I'm guessing this wouldn't work, so I'm looking for a workaround.
As the pandas docs suggest here:
from sqlalchemy import create_engine
engine = create_engine("mssql+pyodbc://user:password@DSN", echo=False)
dataframe.to_sql('tablename', engine, if_exists='replace')
The engine parameter for MSSQL is basically the connection string; check it here.
The if_exists parameter is a bit tricky, since 'replace' actually drops the table first, then recreates it and inserts all the data at once.
Setting the echo attribute to True shows all background logging and the generated SQL.
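That replaces the whole table, though. To reproduce the original UPDATE ... FROM join, one approach (a sketch only, reusing the table and column names from the question) is to push the calculated dataframe to a staging table with to_sql and then run the update against it:
from sqlalchemy import create_engine, text

engine = create_engine("mssql+pyodbc://user:password@DSN", echo=False)

# stage the calculated values; 'replace' recreates the staging table on every run
dataframe.to_sql('my_calc_staging', engine, if_exists='replace', index=False)

update_sql = text("""
    UPDATE mytable
    SET calc_value = tmp.calc_value
    FROM dbo.mytable mytable
    INNER JOIN my_calc_staging tmp
        ON mytable.a = tmp.a AND mytable.b = tmp.b AND mytable.c = tmp.c
    WHERE mytable.a = :a_value
      AND mytable.x = tmp.x
      AND mytable.b = :b_value
""")

with engine.begin() as conn:   # begin() commits the UPDATE when the block exits
    conn.execute(update_sql, {"a_value": some_value, "b_value": some_other_value})
The staging table name my_calc_staging is hypothetical, and some_value / some_other_value stand in for the literals from the original stored procedure.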
