Build a normalised MSSQL DB from CSV files in Python + Pandas + SQLAlchemy - python

I am learning by doing - Python, Pandas, SQL & Deep Learning. I want to build a database with data for a deep learning experiment (with Keras and Tensorflow). The source data is ~10GB (total) of forex timestamped bid/ask data in 8 CSV files with source information encoded as three 3-4 char strings for categories Contributor, Region and City.
I can connect to my (empty) MSSQL database via pyodbc and SQLAlchemy; I can read my CSV files into dataframes; I can create a simple table in the DB and even create one from a simple dataframe; I can convert the date and time fields into the milliseconds-since-epoch I want. (And, FWIW, I have already implemented a working toy LSTM model to adapt to the price data, and I also have some analytical functions I wrote and compiled in Mathematica; I'll either call the C from Python or get Mathematica to work directly on the database.)
The issue is putting the CSV data into the database. Since there are only a dozen or so different sources in each category, I believe I should put Contributor etc. into separate tables with e.g. Contributor_ID as ints (?) so that the data is stored compactly and queries like SELECT ... WHERE Region = 'SHRUBBERY' are efficient. (AFAICT I definitely shouldn't use enums because I may get more sources & categories later.)
My question is - assuming the aforementioned high level of ignorance! - how can/should I a) create the tables and relationships using python and then b) populate those tables?
Optional extra: to save space, the CSV files omit the Region and City where the row values are the same as those of the row above. Reading the CSVs to collect just the source information (which takes about 50 s per category), I know how to deduplicate and dropna, but when I want to populate the DB, how can I most efficiently replace the NAs with the values from the previous row? A simple for loop would do it, but is there e.g. some way to "propagate" the last "real" value in a column to replace the NA using pandas?
CSV example:
Date Time Bid Price Ask Price Contributor Region City
04/02/2017 00:00.1 1.00266 1.00282 DCFX ASI AKL
04/02/2017 00:00.1 1.00263 1.0028 DCFX
04/02/2017 00:00.2 1.00224 1.00285 FXN NAM NYC
04/02/2017 00:00.2 1.00223 1.00288 FXN
All input gratefully received :)

Relational databases (RDBMSs) store data in related, logical groupings using a system of primary and foreign keys to normalize storage, which among other advantages maintains referential integrity (i.e., no orphaned records) and avoids repetition of stored data. For your situation, consider the following:
DATABASE DESIGN: Understand the workflow or "story" of your data pieces (e.g., which comes first/after in data entry) and construct the necessary schema of tables. The classic Database 101 example is Customers-Products-Orders, where many customers can purchase multiple products to fill many orders (one-to-many and many-to-many relationships), and the primary keys of parent tables are the foreign keys of child tables. Hence, aim for a schema layout like the one in this SO answer.
For your needs, your schema may involve Contributors, Regions, Cities, Markets, Company (Ticker), and Prices. This step will make use of DDL commands (CREATE TABLE, CREATE INDEX, CREATE SCHEMA), which can be run in pyodbc cursors or SQLAlchemy engine calls, provided the connected user has such privileges.
Typically, though, database design commands are run in a specialized admin console/IDE or command-line tool rather than in application-layer code like Python: SQL Server's Management Studio or sqlcmd; similarly, Oracle's SQL Developer/sqlplus, MySQL's Workbench/CLI, or PostgreSQL's pgAdmin/psql. Below is an example of the setup for the Prices table:
# INITIALIZE SQLALCHEMY ENGINE
from sqlalchemy import create_engine, text

connection_string = 'mssql+pyodbc://{}:{}@{}/{}'\
                    .format(db_user, db_password, db_server, db_database)
engine = create_engine(connection_string)

sql = """
CREATE TABLE Prices (
    ID INT IDENTITY(1,1) PRIMARY KEY,
    DateTime DATETIME,
    BidPrice DECIMAL(10,5),   -- SQL Server has no DOUBLE type; DECIMAL fits 5-decimal quotes
    AskPrice DECIMAL(10,5),
    ContributorID INT,
    RegionID INT,
    CityID INT,
    CONSTRAINT FK_Contributor FOREIGN KEY (ContributorID) REFERENCES Contributors (ID),
    CONSTRAINT FK_Region FOREIGN KEY (RegionID) REFERENCES Regions (ID),
    CONSTRAINT FK_City FOREIGN KEY (CityID) REFERENCES Cities (ID)
)
"""

# SQL ACTION QUERY VIA TRANSACTION
with engine.begin() as conn:
    conn.execute(text(sql))
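Note that the Prices DDL above references Contributors, Regions, and Cities by foreign key, so those lookup tables must exist first. A minimal sketch of what they might look like, assuming each only needs an identity key plus a unique name column (the VARCHAR length is a guess based on the 3-4 character source codes):
# CREATE LOOKUP TABLES FIRST (Prices REFERENCES THEM)
lookup_ddl = [
    """CREATE TABLE Contributors (
           ID INT IDENTITY(1,1) PRIMARY KEY,
           ContributorName VARCHAR(10) NOT NULL UNIQUE)""",
    """CREATE TABLE Regions (
           ID INT IDENTITY(1,1) PRIMARY KEY,
           RegionName VARCHAR(10) NOT NULL UNIQUE)""",
    """CREATE TABLE Cities (
           ID INT IDENTITY(1,1) PRIMARY KEY,
           CityName VARCHAR(10) NOT NULL UNIQUE)""",
]

with engine.begin() as conn:
    for ddl in lookup_ddl:
        conn.execute(text(ddl))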
DATA POPULATION: Because a dataset/dataframe, CSV, or spreadsheet is NOT equivalent to a normalized RDBMS table but is really the result of a query over multiple tables, migrating these sources will require some SQL wrangling to align with the schema above. A simple upload of dataframes into SQL Server tables would lead to inefficient, repetitive storage. Therefore, consider the steps below:
Staging Tables (using to_sql)
Use staging/temp tables that are raw dumps from pandas. For the NA issue, use the DataFrame or Series forward fill, ffill, to populate NAs from the row above.
# FILL IN NAs IN ALL COLUMNS FROM PREVIOUS ROW
df = df.ffill()  # fillna(method='ffill') also works but is deprecated in newer pandas
# FILL IN NAs FOR SPECIFIC COLUMNS
df['Region'] = df['Region'].ffill()
df['City'] = df['City'].ffill()
# DUMP DATA FRAME INTO STAGING TABLE
df.to_sql(name='pandas_prices_dump', con=engine, if_exists='replace', index=False)
Migration to Final Tables (joining lookup tables by string names)
Then, run action queries (i.e., DML commands: INSERT INTO, UPDATE, DELETE) to populate the final tables from the staging/temp tables.
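Before the insert below can resolve IDs, the lookup tables need a row for every distinct code present in the staging dump. A minimal sketch, assuming the lookup column names used elsewhere in this answer:
# POPULATE LOOKUP TABLES WITH DISTINCT CODES FROM THE STAGING DUMP
lookup_inserts = [
    """INSERT INTO Contributors (ContributorName)
       SELECT DISTINCT Contributor FROM pandas_prices_dump WHERE Contributor IS NOT NULL""",
    """INSERT INTO Regions (RegionName)
       SELECT DISTINCT Region FROM pandas_prices_dump WHERE Region IS NOT NULL""",
    """INSERT INTO Cities (CityName)
       SELECT DISTINCT City FROM pandas_prices_dump WHERE City IS NOT NULL""",
]

with engine.begin() as conn:
    for stmt in lookup_inserts:
        conn.execute(text(stmt))
With the lookup tables filled, the fact-table insert joins the staging dump to them by string name: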
sql = """
INSERT INTO Prices (Datetime, BidPrice, AskPrice,
ContributorID, RegionID, CityID)
SELECT pd.Datetime, pd.BidPrice, pd.AskPrice, c.ID, r.ID, cy.ID
FROM pandas_prices_dump pd
INNER JOIN Contributors c
ON c.ContributorName = pd.Contributor
INNER JOIN Regions r
ON r.RegionName = pd.Region
INNER JOIN Cities cy
ON cy.CityName = pd.City
"""
# APPEND FINAL DATA
with engine.begin() as conn:
    conn.execute(text(sql))

# DROP STAGING TABLE
with engine.begin() as conn:
    conn.execute(text("DROP TABLE pandas_prices_dump"))
Test/Check Final Tables (using read_sql, joining lookup tables by IDs)
# IMPORT INTO PANDAS (EQUIVALENT TO ORIGINAL df)
sql = """
SELECT p.DateTime, p.BidPrice, p.AskPrice,
       c.ContributorName AS Contributor, r.RegionName AS Region,
       cy.CityName AS City
FROM Prices p
INNER JOIN Contributors c
    ON c.ID = p.ContributorID
INNER JOIN Regions r
    ON r.ID = p.RegionID
INNER JOIN Cities cy
    ON cy.ID = p.CityID
"""
prices_data = pd.read_sql(sql, engine)

Related

Incorporate Pandas Data Frame in Query to Database

I'm trying to figure out how to treat a pandas data frame as a SQL table when querying a database in Python.
I'm coming from a SAS background where work tables can easily be incorporated into direct database queries.
For example:
Select a.first_col,
b.second_col
from database.table1 a
left join work.table1 b on a.id = b.id;
Here work.table1 is not in the database, but is a dataset held in the local SAS server.
In my research I have found ways to write a data frame to a database and then include that in the query. I do not have write access to the database, so that is not an option for me.
I also know that I can use sqlalchemy with pd.to_sql() to put a data frame into a SQL engine, but I can't figure out if there is a way to connect that engine with the pyodbc connection I have with the database.
I also tried this though I didn't think it would work (names of tables and columns altered).
df = pd.DataFrame(['A342', 'B432', 'W345'], columns=['id'])
query = '''
select a.id, b.id
from df a
left join database.base_table b on a.id = b.id
'''
query_results = pd.read_sql_query(query, connection)
As I expected it didn't work.
I'm connecting to a Netezza database, I'm not sure if that matters.
I don't think it is possible, it would have to be written to its own table in order to be queried, although I'm not familiar with Netezza.
Why not perform the join (a "merge" in pandas) purely in pandas? Understood, if the SQL table is massive this isn't feasible, but:
query = """
SELECT id, x
FROM a
"""
a = pd.read_sql(query, conn)
df = # some dataframe in memory
pd.merge(df, a, on='id', how='left')
see https://pandas.pydata.org/docs/getting_started/comparison/comparison_with_sql.html for good docs about sql and pandas similarities

Best way to perform bulk insert SQLAlchemy

I have a table called products
which has the following columns:
id, product_id, data, activity_id
What I am essentially trying to do is copy a bulk of existing products, update their activity_id, and create new entries in the products table.
Example:
I already have 70 existing entries in products with activity_id 2
Now I want to create another 70 entries with the same data except for an updated activity_id
I could have thousands of existing entries that I'd like to make a copy of and update the copied entries' activity_id to a new id.
products = self.session.query(model.Products).filter(filter1, filter2).all()
This returns all the existing products for a filter.
Then I iterate through the products, clone each existing product, and just update the activity_id field.
for product in products:
    product.activity_id = new_id

self.uow.skus.bulk_save_objects(simulation_skus)
self.uow.flush()
self.uow.commit()
What is the best/fastest way to do these bulk entries so it doesn't take so much time? As of now the performance is OK, but is there a better solution?
You don't need to load these objects locally, all you really want to do is have the database create these rows.
You essentially want to run a query that creates the rows from the existing rows:
INSERT INTO product (product_id, data, activity_id)
SELECT product_id, data, 2 -- the new activity_id value
FROM product
WHERE activity_id = old_id
The above query would run entirely on the database server; this is far preferable to loading your query results into Python objects and then sending all the Python data back to the server to populate INSERT statements for each new row.
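If you don't need the ORM-driven construction that follows, that raw statement can also be executed directly through the session; a minimal sketch using SQLAlchemy's text() with bound parameters (table and column names taken from the SQL above, and old_id assumed to be defined):
from sqlalchemy import text

insert_stmt = text("""
    INSERT INTO product (product_id, data, activity_id)
    SELECT product_id, data, :new_activity_id
    FROM product
    WHERE activity_id = :old_id
""")

# runs entirely on the database server; nothing is loaded into Python
self.session.execute(insert_stmt, {"new_activity_id": 2, "old_id": old_id})
self.session.commit()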
Queries like that are something you could do with SQLAlchemy core, the half of the API that deals with generating SQL statements. However, you can use a query built from a declarative ORM model as a starting point. You'd need to
Access the Table instance for the model, as that then lets you create an INSERT statement via the Table.insert() method.
You could also get the same object from models.Product query, more on that later.
Access the statement that would normally fetch the data for your Python instances for your filtered models.Product query; you can do so via the Query.statement property.
Update the statement to replace the included activity_id column with your new value, and remove the primary key (I'm assuming that you have an auto-incrementing primary key column).
Apply that updated statement to the Insert object for the table via Insert.from_select().
Execute the generated INSERT INTO ... FROM ... query.
Step 1 can be achieved by using the SQLAlchemy introspection API; the inspect() function, applied to a model class, gives you a Mapper instance, which in turn has a Mapper.local_table attribute.
Steps 2 and 3 require a little juggling with the Select.with_only_columns() method to produce a new SELECT statement where we swap out the column. You can't easily remove a column from a select statement, but we can loop over the existing columns in the query to 'copy' them across to the new SELECT and make our replacement at the same time.
Step 4 is then straightforward, Insert.from_select() needs to have the columns that are inserted and the SELECT query. We have both as the SELECT object we have gives us its columns too.
Here is the code for generating your INSERT; the **replace keyword arguments are the columns you want to replace when inserting:
from sqlalchemy import inspect, literal
from sqlalchemy.sql import ClauseElement

def insert_from_query(model, query, **replace):
    # The SQLAlchemy core definition of the table
    table = inspect(model).local_table
    # and the underlying core select statement to source new rows from
    select = query.statement

    # validate assumptions: make sure the query produces rows from the above table
    assert table in select.froms, f"{query!r} must produce rows from {model!r}"
    assert all(c.name in select.columns for c in table.columns), f"{query!r} must include all {model!r} columns"

    # updated select, replacing the indicated columns
    as_clause = lambda v: literal(v) if not isinstance(v, ClauseElement) else v
    replacements = {name: as_clause(value).label(name) for name, value in replace.items()}
    from_select = select.with_only_columns([
        replacements.get(c.name, c)
        for c in table.columns
        if not c.primary_key
    ])

    return table.insert().from_select(from_select.columns, from_select)
I included a few assertions about the model and query relationship, and the code accepts arbitrary column clauses as replacements, not just literal values. You could use func.max(models.Product.activity_id) + 1 as a replacement value (wrapped as a subselect), for example.
The above function executes steps 1-4, producing the desired INSERT SQL statement when printed (I created a products model and query that I thought might be representative):
>>> print(insert_from_query(models.Product, products, activity_id=2))
INSERT INTO products (product_id, data, activity_id) SELECT products.product_id, products.data, :param_1 AS activity_id
FROM products
WHERE products.activity_id != :activity_id_1
All you have to do is execute it:
insert_stmt = insert_from_query(models.Product, products, activity_id=2)
self.session.execute(insert_stmt)

Python SQL loop variables through multiple queries

I'm having trouble with a Python Teradata (tdodbc) query that loops through the same query with different variables and merges the results. I received good direction in another post and ended up here. My issue now is that the dataframe only ends up with the query results of the final variable in the loop, "state5". Unfortunately we have five states, each in its own database with the same schema. I can run the same query, but I want to loop the variables so I can run it for all five states and return an appended result. This was easy using SAS macro variables and appending, but I need to bring the data into Python for EDA and data science.
import teradata as td
import pandas as pd

udaExec = td.UdaExec(appConfigFile="udaexec.ini")
with udaExec.connect("${dataSourceName}") as session:
    state_dataframes = []
    STATES = ["state1", "state2", "state3", "state4", "state5"]
    for state in STATES:
        query1 = """database my_db_{};"""
        query2 = """
        select top 10
            '{}' as state
            ,a.*
        from table_a a
        """
        session.execute(query1.format(state))
        session.execute(query2.format(state))
    state_dataframes.append(pd.read_sql(query2, session))
    all_states_df = pd.concat(state_dataframes)
I was finally able to get this to work, although it may not be the most elegant way to do it. I did try to do the drop tables as a single variable "query5" but was receiving a DDL error. Once I separated each drop table into its own session.execute, it worked.
udaExec = td.UdaExec(appConfigFile="udaexec.ini")
with udaExec.connect("${dataSourceName}") as session:
    state_dataframes = []
    STATES = ["state1", "state2", "state3", "state4", "state5"]
    for state in STATES:
        query1 = """database my_db_{};"""
        query2 = """
        create set volatile table v_table
        ,no fallback, no before journal, no after journal as
        (
            select top 10
                '{}' as state
                ,t.*
            from table t
        )
        with data
        primary index (dw_key)
        on commit preserve rows;
        """
        query3 = """
        create set volatile table v_table_2
        ,no fallback, no before journal, no after journal as
        (
            select t.*
            from v_table t
        )
        with data
        primary index (dw_key)
        on commit preserve rows;
        """
        query4 = """
        select t.*
        from v_table_2 t
        """
        session.execute(query1.format(state))
        session.execute(query2.format(state))
        session.execute(query3)
        state_dataframes.append(pd.read_sql(query4, session))
        session.execute("DROP TABLE v_table")
        session.execute("DROP TABLE v_table_2")
    all_states_df = pd.concat(state_dataframes)
Edit for clarity: correcting the query in the question only required proper indentation. In my Teradata environment I have limited spool space which requires building many vol tables to break apart queries. Since I spent a good amount of time trying to solve this, I added to the answer to help others who may run into this scenario.
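For comparison, a minimal sketch of what the indentation fix to the question's original loop looks like if the volatile tables aren't needed, i.e. formatting the SELECT and reading/appending inside the for loop so every state is collected:
udaExec = td.UdaExec(appConfigFile="udaexec.ini")
with udaExec.connect("${dataSourceName}") as session:
    state_dataframes = []
    for state in ["state1", "state2", "state3", "state4", "state5"]:
        session.execute("database my_db_{};".format(state))
        query = """
        select top 10
            '{}' as state
            ,a.*
        from table_a a
        """.format(state)
        # read and append INSIDE the loop so every state's rows are kept
        state_dataframes.append(pd.read_sql(query, session))
    all_states_df = pd.concat(state_dataframes)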

Upsert / merge tables in SQLite

I have created a database using sqlite3 in python that has thousands of tables. Each of these tables contains thousands of rows and ten columns. One of the columns is the date and time of an event: it is a string that is formatted as YYYY-mm-dd HH:MM:SS, which I have defined to be the primary key for each table. Every so often, I collect some new data (hundreds of rows) for each of these tables. Each new dataset is pulled from a server and loaded in directly as a pandas data frame or is stored as a CSV file. The new data contains the same ten columns as my original data. I need to update the tables in my database using this new data in the following way:
Given a table in my database, for each row in the new dataset, if the date and time of the row matches the date and time of an existing row in my database, update the remaining columns of that row using the values in the new dataset.
If the date and time does not yet exist, create a new row and insert it to my database.
Below are my questions:
I've done some searching on Google and it looks like I should be using the UPSERT (merge) functionality of sqlite but I can't seem to find any examples showing how to use it. Is there an actual UPSERT command, and if so, could someone please provide an example (preferably with sqlite3 in Python) or point me to a helpful resource?
Also, is there a way to do this in bulk so that I can UPSERT each new dataset into my database without having to go row by row? (I found this link, which suggests that it is possible, but I'm new to using databases and am not sure how to actually run the UPSERT command.)
Can UPSERT also be performed directly using pandas.DataFrame.to_sql?
My backup solution is loading in the table to be UPSERTed using pd.read_sql_query("SELECT * from table", con), performing pandas.DataFrame.merge, deleting the said table from the database, and then adding in the updated table to the database using pd.DataFrame.to_sql (but this would be inefficient).
Instead of going through the UPSERT command, why don't you create your own algorithm that finds values and replaces them if the date & time is found, and otherwise inserts a new row? Check out the code I wrote for you. Let me know if you are still confused. You can even do this for hundreds of tables just by replacing the table name in the algorithm with a variable and changing it over your whole list of table names.
import sqlite3
import pandas as pd

csv_data = pd.read_csv("my_CSV_file.csv")  # Your CSV data path

def manual_upsert():
    con = sqlite3.connect(connection_str)
    cur = con.cursor()
    cur.execute("SELECT * FROM my_CSV_data")  # Viewing existing data from the table
    data = cur.fetchall()

    old_data_list = []  # Collection of all dates already in the database table.
    for line in data:
        old_data_list.append(line[0])  # I suppose your date column is at index 0.

    # iterate over the rows of the CSV (csv_data.values yields rows, not column names)
    for new_data in csv_data.values:
        if new_data[0] in old_data_list:
            # update the remaining columns based on date if the condition is true
            cur.execute("UPDATE my_CSV_data SET column1=?, column2=?, column3=? WHERE my_date_column=?",
                        (new_data[1], new_data[2], new_data[3], new_data[0]))
        else:
            # insert a new row if the date is not found
            cur.execute("INSERT INTO my_CSV_data VALUES(?,?,?,?)",
                        (new_data[0], new_data[1], new_data[2], new_data[3]))

    con.commit()
    con.close()

manual_upsert()
First, even though the questions are related, ask them separately in the future.
SQLite's documentation on UPSERT describes how to use it, but it is a bit abstract. You can check examples and discussion here: SQLite - UPSERT *not* INSERT or REPLACE
Use a transaction and the statements will be executed in bulk.
As the presence of this library suggests, to_sql does not create UPSERT commands (only INSERT).
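For reference, a minimal sketch of SQLite's native upsert syntax (available since SQLite 3.24), executed in bulk inside a single transaction; the table and column names here are placeholders, with the datetime string assumed to be the primary key:
import sqlite3
import pandas as pd

new_data = pd.read_csv("new_rows.csv")  # placeholder columns: dt, col1, col2

con = sqlite3.connect("my_database.db")
with con:  # one transaction for the whole batch
    con.executemany(
        """
        INSERT INTO my_table (dt, col1, col2)
        VALUES (?, ?, ?)
        ON CONFLICT(dt) DO UPDATE SET
            col1 = excluded.col1,
            col2 = excluded.col2
        """,
        new_data[["dt", "col1", "col2"]].itertuples(index=False, name=None),
    )
con.close()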

Less Memory-intense way of copying tables & renaming columns in sqlite/pandas

I have found a very nice way to:
read a table from a sql database
rename the columns with a dict (read from a yaml file)
rewrite the table to another database
The only problem is that as the table becomes bigger (10 columns x several million rows), reading the table into pandas is so memory-intensive that it causes the process to be killed.
There must be an easier way. I looked at ALTER TABLE statements, but they seem very complicated as well and will not do the copying into another DB. Any ideas on how to do the same operation without using this much memory? It feels like pandas is a crutch I use due to my bad SQL.
import pandas as pd
import sqlite3

def translate2generic(sourcedb, targetdb, sourcetable,
                      targettable, toberenamed):
    """Change table's column names to fit generic api keys.

    :param: Path to source db
    :param: Path to target db
    :param: Name of table to be translated in source
    :param: Name of the newly to be created table in targetdb
    :param: dictionary of translations
    :return: New column names in target db
    """
    sourceconn = sqlite3.connect(sourcedb)
    targetconn = sqlite3.connect(targetdb)

    table = pd.read_sql_query('select * from ' + sourcetable, sourceconn)  # this is the line causing the crash

    # read dict in the format {"oldcol1name": "newcol1name", "oldcol2name": "newcol2name"}
    rename = {v: k for k, v in toberenamed.items()}

    # rename columns
    generic_table = table.rename(columns=rename)

    # Write table to new database
    generic_table.to_sql(targettable, targetconn, if_exists="replace")

    targetconn.close()
    sourceconn.close()
I've also looked at solutions such as this one, but they assume you know the type of the columns.
An elegant solution would be very much appreciated.
Edit: I know there is a method in sqlite since the September release 3.25.0, but I am stuck with version 2.6.0
To elaborate on my comments...
If you have a table in foo.db and want to copy that table's data to a new table in bar.db with different column names:
$ sqlite3 foo.db
sqlite> ATTACH 'bar.db' AS bar;
sqlite> CREATE TABLE bar.newtable(newcolumn1, newcolumn2);
sqlite> INSERT INTO bar.newtable SELECT oldcolumn1, oldcolumn2 FROM main.oldtable;
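The same copy can also be driven from Python without loading anything into pandas, using the standard sqlite3 module; a minimal sketch under the same assumptions (placeholder paths, table, and column names):
import sqlite3

conn = sqlite3.connect("foo.db")                 # source database
conn.execute("ATTACH DATABASE 'bar.db' AS bar")  # target database
with conn:                                       # wraps the copy in a transaction
    conn.execute("CREATE TABLE bar.newtable(newcolumn1, newcolumn2)")
    conn.execute("INSERT INTO bar.newtable SELECT oldcolumn1, oldcolumn2 FROM main.oldtable")
conn.close()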
