I'm having trouble with a Python Teradata (tdodbc) query: I want to loop the same query over different variables and merge the results. I received good direction in another post and ended up here. My issue now is that the dataframe only ends up with the query results for the final variable in the loop, "state5". Unfortunately we have 5 states, each in its own database with the same schema. I can run the same query for one state, but I want to loop over the variables so I can run it for all 5 states and return one appended result set. This was easy using SAS macro variables and appending, but I need to bring the data into Python for EDA and data science.
import teradata as td
import pandas as pd

udaExec = td.UdaExec(appConfigFile="udaexec.ini")
with udaExec.connect("${dataSourceName}") as session:
    state_dataframes = []
    STATES = ["state1", "state2", "state3", "state4", "state5"]
    for state in STATES:
        query1 = """database my_db_{};"""
        query2 = """
        select top 10
        '{}' as state
        ,a.*
        from table_a
        """
        session.execute(query1.format(state))
        state_dataframes.append(pd.read_sql(query2.format(state), session))
    all_states_df = pd.concat(state_dataframes)
I was finally able to get this to work, although it may not be the most elegant way to do it. I did try to do the drop tables as a single variable, "query5", but was receiving a DDL error. Once I separated each drop table into its own session.execute, it worked.
import teradata as td
import pandas as pd

udaExec = td.UdaExec(appConfigFile="udaexec.ini")
with udaExec.connect("${dataSourceName}") as session:
    state_dataframes = []
    STATES = ["state1", "state2", "state3", "state4", "state5"]
    for state in STATES:
        query1 = """database my_db_{};"""
        query2 = """
        create set volatile table v_table
        ,no fallback, no before journal, no after journal as
        (
        select top 10
        '{}' as state
        ,t.*
        from table t
        )
        with data
        primary index (dw_key)
        on commit preserve rows;
        """
        query3 = """
        create set volatile table v_table_2
        ,no fallback, no before journal, no after journal as
        (
        select t.*
        from v_table t
        )
        with data
        primary index (dw_key)
        on commit preserve rows;
        """
        query4 = """
        select t.*
        from v_table_2 t
        """
        session.execute(query1.format(state))
        session.execute(query2.format(state))
        session.execute(query3)
        state_dataframes.append(pd.read_sql(query4, session))
        session.execute("DROP TABLE v_table")
        session.execute("DROP TABLE v_table_2")
    all_states_df = pd.concat(state_dataframes)
Edit for clarity: correcting the query in the question only required proper indentation. In my Teradata environment I have limited spool space, which requires building many volatile tables to break queries apart. Since I spent a good amount of time solving this, I added to the answer to help others who may run into this scenario.
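For readers hitting the same symptom, the underlying pattern (format the query per state, read each result into a DataFrame, then concatenate) can be sketched without a Teradata session; here sqlite3 stands in for the connection, and the per-state table names are purely illustrative:

```python
import sqlite3
import pandas as pd

# In-memory stand-in for the per-state databases (schema is illustrative).
conn = sqlite3.connect(":memory:")
for s in ["state1", "state2"]:
    conn.execute("CREATE TABLE table_a_{} (dw_key INTEGER)".format(s))
    conn.execute("INSERT INTO table_a_{} VALUES (1), (2)".format(s))

query = """
select '{0}' as state, a.*
from table_a_{0} a
"""

state_dataframes = []
for state in ["state1", "state2"]:
    # Format BEFORE reading; passing the unformatted template was the original bug.
    state_dataframes.append(pd.read_sql(query.format(state), conn))

all_states_df = pd.concat(state_dataframes, ignore_index=True)
print(all_states_df["state"].tolist())
```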
Using Python, I am looping through a number of SQL Server databases and creating tables using SELECT INTO, but when I run the script nothing happens, i.e. no error messages, and the tables have not been created. Below is an extract example of what I am doing. Can anyone advise?
import pandas as pd

df = pd.DataFrame({'Database': ['db1', 'db2']})  # dataframe of database names, as an example

for i, x in df.iterrows():
    SQL = """
    Drop table if exists {x}..table
    Select
    Name
    Into
    {y}..table
    From
    MainDatabase..Details
    """.format(x=x['Database'], y=x['Database'])
    cursor.execute(SQL)
    conn.commit()
It looks like your DB driver doesn't support multiple statements per execute. Try splitting your query into two single statements, one with the drop and the other with the select:
for i, x in df.iterrows():
    drop_sql = """
    Drop table if exists {x}..table
    """.format(x=x['Database'])
    select_sql = """
    Select
    Name
    Into
    {y}..table
    From
    MainDatabase..Details
    """.format(x=x['Database'], y=x['Database'])
    cursor.execute(drop_sql)
    cursor.execute(select_sql)
    cursor.commit()
And a second tip: your x=x['Database'] and y=x['Database'] are the same; is this correct?
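The split-and-execute pattern can be verified with sqlite3, whose DB-API driver likewise rejects multiple statements per execute; the table names here are illustrative stand-ins for the SQL Server objects above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cursor = conn.cursor()
cursor.execute("CREATE TABLE MainDetails (Name TEXT)")
cursor.execute("INSERT INTO MainDetails VALUES ('a'), ('b')")

# One statement per execute() call, mirroring the drop_sql / select_sql split.
cursor.execute("DROP TABLE IF EXISTS target")
cursor.execute("CREATE TABLE target AS SELECT Name FROM MainDetails")
conn.commit()

rows = cursor.execute("SELECT Name FROM target ORDER BY Name").fetchall()
print(rows)
```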
I am making a script that should create a schema for each customer. I'm fetching all metadata from a database that defines what each customer's schema should look like, and then creating it. Everything is well defined: the types, the names of tables, etc. A customer has many tables (e.g. address, customers, contact, item), and each table has the same metadata.
My procedure now:
get everything I need from the metadataDatabase.
In a for loop, create a table, then ALTER TABLE to add each piece of metadata (this is done for each table).
Right now my script runs for about a minute per customer, which I think is too slow. It has to do with the loop, in which I'm altering each table.
I think that instead of me altering (which might be not so clever approach), I should do something like the following:
Note that this is just a stupid but valid example:
for table in tables:
    con.execute("CREATE TABLE IF NOT EXISTS tester.%s (%s, %s);", (table, "last_seen date", "valid_from timestamp"))
But it gives me this error (it seems to read the table name as a string inside a string):
psycopg2.errors.SyntaxError: syntax error at or near "'billing'"
LINE 1: CREATE TABLE IF NOT EXISTS tester.'billing' ('last_seen da...
Consider creating tables with a serial type (i.e., autonumber) ID field and then using ALTER TABLE for all other fields, with a combination of sql.Identifier for identifiers (schema names, table names, column names, function names, etc.) and regular format for data types, which are not literals in the SQL statement.
from psycopg2 import sql

# CREATE TABLE
query = """CREATE TABLE IF NOT EXISTS {shm}.{tbl} (ID serial)"""
cur.execute(sql.SQL(query).format(shm=sql.Identifier("tester"),
                                  tbl=sql.Identifier("table")))

# ALTER TABLE
items = [("last_seen", "date"), ("valid_from", "timestamp")]
query = """ALTER TABLE {shm}.{tbl} ADD COLUMN {col} {typ}"""
for item in items:
    # KEEP IDENTIFIER PLACEHOLDERS
    final_query = query.format(shm="{shm}", tbl="{tbl}", col="{col}", typ=item[1])
    cur.execute(sql.SQL(final_query).format(shm=sql.Identifier("tester"),
                                            tbl=sql.Identifier("table"),
                                            col=sql.Identifier(item[0])))
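The placeholder-preserving trick in that loop is plain str.format behavior and can be checked without a database connection: passing the literal string "{shm}" as the value for {shm} re-emits the identifier slot so that sql.SQL(...).format() can fill it later.

```python
# Two-pass formatting: the first pass fills only the data type,
# re-emitting the identifier placeholders verbatim for psycopg2's sql.SQL.
query = "ALTER TABLE {shm}.{tbl} ADD COLUMN {col} {typ}"
final_query = query.format(shm="{shm}", tbl="{tbl}", col="{col}", typ="date")
print(final_query)
```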
Alternatively, use str.join with list comprehension for one CREATE TABLE:
query = """CREATE TABLE IF NOT EXISTS {shm}.{tbl} (
"id" serial,
{vals}
)"""
items = [("last_seen", "date"), ("valid_from", "timestamp")]
val = ",\n ".join(["{{}} {typ}".format(typ=i[1]) for i in items])
# KEEP IDENTIFIER PLACEHOLDERS
pre_query = query.format(shm="{shm}", tbl="{tbl}", vals=val)
final_query = sql.SQL(pre_query).format(*[sql.Identifier(i[0]) for i in items],
                                        shm=sql.Identifier("tester"),
                                        tbl=sql.Identifier("table"))
cur.execute(final_query)
SQL (sent to database)
CREATE TABLE IF NOT EXISTS "tester"."table" (
"id" serial,
"last_seen" date,
"valid_from" timestamp
)
However, this becomes heavy as there are too many server roundtrips.
How many tables with how many columns are you creating that this is slow? Could you ssh to a machine closer to your server and run the python there?
I don't get that error; rather, I get an SQL syntax error. A VALUES list is for conveying data, but ALTER TABLE is not about data, it is about metadata, so you can't use a VALUES list there. You need the names of the columns and types in double quotes (or no quotes) rather than single quotes; you can't have a comma between name and type; you can't have parentheses around each pair; and each pair needs to be introduced with ADD, which you can't have just once. You are using the wrong tool for the job. execute_batch is almost the right tool, except it will use single quotes rather than double quotes around the identifiers. Perhaps you could add a flag to tell it to use quote_ident.
Not only is execute_values the wrong tool for the job, but I think python in general might be as well. Why not just load from a .sql file?
I am learning by doing - Python, Pandas, SQL & Deep Learning. I want to build a database with data for a deep learning experiment (with Keras and Tensorflow). The source data is ~10GB (total) of forex timestamped bid/ask data in 8 CSV files with source information encoded as three 3-4 char strings for categories Contributor, Region and City.
I can connect to my (empty) MSSQL database via pyodbc and SQLAlchemy; I can read my CSV files into dataframes; I can create a simple table in the dB and even create one from a simple dataframe; and I can convert the date and time fields into the milliseconds-since-epoch I want. (And, FWIW, I have already implemented a working toy LSTM model to adapt to the price data, and I also have some analytical functions I wrote and compiled in Mathematica; I'll either call the C from Python or get Mathematica to work directly on the database.)
The issue is putting the CSV data into the database. Since there are only a dozen or so different sources in each category I believe I should put Contributor etc. into separate tables with e.g Contributor_ID as ints (?) so that data is stored compactly and e.g. SELECT... WHERE Region = "SHRUBBERY" are efficient. (AFAICT I definitely shouldn't use enums because I may get more sources & categories later).
My question is - assuming the aforementioned high level of ignorance! - how can/should I a) create the tables and relationships using python and then b) populate those tables?
Optional extra: to save space, the CSV files omit the Region and City where the row values are the same as those for the row above - reading the CSVs to collect just the source information (which takes about 50s for each category) I know how to deduplicate and dropna, but when I want to populate the dB, how can I most efficiently replace the na's with the values from the previous row? A simple For loop would do it, but is there e.g. some way to "propagate" the last "real" value in a column to replace the na using pandas?
CSV example:
Date        Time     Bid Price  Ask Price  Contributor  Region  City
04/02/2017  00:00.1  1.00266    1.00282    DCFX         ASI     AKL
04/02/2017  00:00.1  1.00263    1.0028     DCFX
04/02/2017  00:00.2  1.00224    1.00285    FXN          NAM     NYC
04/02/2017  00:00.2  1.00223    1.00288    FXN
All input gratefully received :)
Relational databases (RDBMS) aim to store data into related, logical groupings with a system of primary key/foreign keys to normalize storage which among other advantages maintains referential integrity (i.e., no orphaned records) and avoids repetition of stored data. For your situation, consider the following:
DATABASE DESIGN: Understand the workflow or "story" of your data pieces (e.g., which comes first/after in data entry) and construct the necessary schema of tables. Classic Database 101 example is the Customers-Products-Orders where many customers can purchase multiple products to fill many orders (1-to-many and many-to-many relationships) where primary keys of parent tables are the foreign key of child tables. Hence, aim for a schema layout as below from this SO answer.
For your needs, your schema may involve Contributors, Regions, Cities, Markets, Company (Ticker), and Prices. This step will make use of DDL commands (CREATE TABLE, CREATE INDEX, CREATE SCHEMA) which can be run in pyodbc cursors or sqlAlchemy engine calls, sufficing the connected user has such privileges.
But typically, database design commands are run in a specialized admin console/IDE or command-line tool rather than application-layer code like Python: SQL Server's Management Studio or sqlcmd; similarly, Oracle's SQL Developer/sqlplus, MySQL's Workbench/cli, or PostgreSQL's PgAdmin/psql. Below is an example of the setup for the Prices table:
from sqlalchemy import create_engine

# INITIALIZE SQLALCHEMY ENGINE
connection_string = 'mssql+pyodbc://{}:{}@{}/{}'\
    .format(db_user, db_password, db_server, db_database)
engine = create_engine(connection_string)
sql = """
CREATE TABLE Prices (
ID INT IDENTITY(1,1) PRIMARY KEY,
DateTime DATETIME,
BidPrice DECIMAL(10,4),
AskPrice DECIMAL(10,4),
ContributorID INT,
RegionID INT,
CityID INT,
CONSTRAINT FK_Contributor FOREIGN KEY (ContributorID) REFERENCES Contributors (ID),
CONSTRAINT FK_Region FOREIGN KEY (RegionID) REFERENCES Regions (ID),
CONSTRAINT FK_City FOREIGN KEY (CityID) REFERENCES Cities (ID)
)
"""
# SQL ACTION QUERY VIA TRANSACTION
with engine.begin() as conn:
conn.execute(sql)
DATA POPULATION: Because a dataset/dataframe, CSV, or spreadsheet is NOT equivalent to a normalized RDBMS table but is actually a query of multiple tables, migrating these sources will require some SQL wrangling to align with your schema above. A simple upload of dataframes into SQL Server tables leads to inefficient and repetitive storage. Therefore, consider the steps below:
Staging Tables (using to_sql)
Use staging (temp) tables that are raw dumps from pandas. For the NA issue, use DataFrame or Series forward fill, ffill, to populate NAs from the rows above.
# FILL IN NAs IN ALL COLUMNS FROM PREVIOUS ROW
df = df.ffill() # OR df.fillna(method='ffill')
# FILL IN NAs FOR SPECIFIC COLUMNS
df['Region'] = df['Region'].ffill()
df['City'] = df['City'].ffill()
# DUMP DATA FRAME INTO STAGING TABLE
df.to_sql(name='pandas_prices_dump', con=engine, if_exists='replace', index=False)
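The forward-fill step can be sanity-checked against the sample CSV rows from the question (price and date columns omitted for brevity):

```python
import pandas as pd

# Sample rows from the question: Region/City omitted when equal to the row above.
df = pd.DataFrame({
    "Contributor": ["DCFX", "DCFX", "FXN", "FXN"],
    "Region": ["ASI", None, "NAM", None],
    "City": ["AKL", None, "NYC", None],
})

# Propagate the last non-null value down each column.
df[["Region", "City"]] = df[["Region", "City"]].ffill()
print(df["Region"].tolist(), df["City"].tolist())
```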
Migration to Final Tables (joining lookup tables by string names)
Then, run action queries (i.e., DML commands: INSERT INTO, UPDATE, DELETE) for populating final tables from staging, temp tables.
sql = """
INSERT INTO Prices (Datetime, BidPrice, AskPrice,
ContributorID, RegionID, CityID)
SELECT pd.Datetime, pd.BidPrice, pd.AskPrice, c.ID, r.ID, cy.ID
FROM pandas_prices_dump pd
INNER JOIN Contributors c
ON c.ContributorName = pd.Contributor
INNER JOIN Regions r
ON r.RegionName = pd.Region
INNER JOIN Cities cy
ON cy.CityName = pd.City
"""
# APPEND FINAL DATA
with engine.begin() as conn:
conn.execute(sql)
# DROP STAGING TABLE
with engine.begin() as conn:
conn.execute("DROP TABLE pandas_prices_dump")
Test/Check Final Tables (using read_sql, joining lookup tables by IDs)
# IMPORT INTO PANDAS (EQUIVALENT TO ORIGINAL df)
sql = """
SELECT p.Datetime, p.BidPrice, p.AskPrice,
c.ContributorName As Contributor, r.RegionName As Region,
cy.CityName As City
FROM Prices p
INNER JOIN Contributors c
ON c.ID = p.ContributorID
INNER JOIN Regions r
ON r.ID = p.RegionID
INNER JOIN Cities cy
ON cy.ID = p.CityID
"""
prices_data = pd.read_sql(sql, engine)
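The whole staging-then-migrate flow can be exercised end to end with sqlite3 standing in for SQL Server; the table and column names follow the answer, while the data and the trimmed-down schema (Contributors only) are illustrative:

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")
# Lookup table and final table (columns trimmed to Contributors only for brevity).
conn.execute("CREATE TABLE Contributors (ID INTEGER PRIMARY KEY, ContributorName TEXT)")
conn.executemany("INSERT INTO Contributors VALUES (?, ?)", [(1, "DCFX"), (2, "FXN")])
conn.execute("CREATE TABLE Prices (ID INTEGER PRIMARY KEY, BidPrice REAL, ContributorID INTEGER)")

# Staging dump from pandas, then migration joining the lookup table by string name.
staging = pd.DataFrame({"BidPrice": [1.00266, 1.00224], "Contributor": ["DCFX", "FXN"]})
staging.to_sql("pandas_prices_dump", conn, if_exists="replace", index=False)
conn.execute("""
    INSERT INTO Prices (BidPrice, ContributorID)
    SELECT s.BidPrice, c.ID
    FROM pandas_prices_dump s
    INNER JOIN Contributors c ON c.ContributorName = s.Contributor
""")
conn.execute("DROP TABLE pandas_prices_dump")

# Read back, joining the lookup table by ID.
out = pd.read_sql("""
    SELECT p.BidPrice, c.ContributorName AS Contributor
    FROM Prices p
    INNER JOIN Contributors c ON c.ID = p.ContributorID
    ORDER BY Contributor
""", conn)
print(out["Contributor"].tolist())
```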
I am trying to update many records at a time using SQLAlchemy, but am finding it to be very slow. Is there an optimal way to perform this?
For some reference, I am performing an update on 40,000 records and it took about 1 hour.
Below is the code I am using. The table_name refers to the table which is loaded, the column is the single column which is to be updated, and the pairs refer to the primary key and new value for the column.
def update_records(table_name, column, pairs):
    table = Table(table_name, db.MetaData, autoload=True,
                  autoload_with=db.engine)
    conn = db.engine.connect()
    values = []
    for id, value in pairs:
        values.append({'row_id': id, 'match_value': str(value)})
    stmt = table.update().where(table.c.id == bindparam('row_id')).values({column: bindparam('match_value')})
    conn.execute(stmt, values)
Passing a list of arguments to execute() essentially issues 40k individual UPDATE statements, which is going to have a lot of overhead. The solution for this is to increase the number of rows per query. For MySQL, this means inserting into a temp table and then doing an update:
# assuming temp table already created
conn.execute(temp_table.insert().values(values))
conn.execute(table.update().values({column: temp_table.c.match_value})
.where(table.c.id == temp_table.c.row_id))
Or, alternatively, you can use INSERT ... ON DUPLICATE KEY UPDATE to avoid creating the temp table, but SQLAlchemy does not support that natively, so you'll need to use a custom compiled construct for that (e.g. this gist).
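The temp-table approach can be sketched with sqlite3 (the answer targets MySQL; sqlite's correlated-subquery UPDATE plays the role of the joined UPDATE, and all names here are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, val TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [(1, "old"), (2, "old"), (3, "old")])

# Multi-row insert into a temp table, then one set-based UPDATE,
# instead of issuing one UPDATE statement per (row_id, match_value) pair.
conn.execute("CREATE TEMP TABLE pairs (row_id INTEGER, match_value TEXT)")
conn.executemany("INSERT INTO pairs VALUES (?, ?)", [(1, "a"), (3, "c")])
conn.execute("""
    UPDATE t
    SET val = (SELECT match_value FROM pairs WHERE pairs.row_id = t.id)
    WHERE id IN (SELECT row_id FROM pairs)
""")

vals = [row[1] for row in conn.execute("SELECT id, val FROM t ORDER BY id")]
print(vals)
```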
According to the document on fast-execution-helpers, batched update statements can be issued as one statement. In my experiments, this trick reduced update or deletion time from 30 minutes to 1 minute.
engine = create_engine(
    "postgresql+psycopg2://scott:tiger@host/dbname",
    executemany_mode='values_plus_batch',
    executemany_values_page_size=5000,
    executemany_batch_page_size=5000)
I'm trying to load a file directly from S3 and ultimately trying to just do some SparkSQL on it. Eventually, I plan on bringing in multiple files which will be multiple tables (1:1 map between files and tables).
So I've been following this tutorial that is fairly good at describing each step. I'm a bit stuck on declaring the proper schema and which variable, if any, is referred in the FROM clause in the SQL statement.
Here's my code:
sqlContext.sql("CREATE TABLE IF NOT EXISTS region (region_id INT, name STRING, comment STRING)")
region = sc.textFile("s3n://thisisnotabucketname/region.tbl")
raw_data = sc.textFile("s3n://thisisnotabucketname/region.tbl")
csv_data = raw_data.map(lambda l: l.split("|"))
row_data = csv_data.map(lambda p: Row( region_id=int(p[0]), name=p[1], comment=p[2] ))
interactions_df = sqlContext.createDataFrame(row_data)
interactions_df.registerTempTable("interactions")
tcp_interactions = sqlContext.sql(""" SELECT region_id, name, comment FROM region WHERE region_id > 1 """)
tcp_interactions = sqlContext.sql(""" SELECT * FROM region """)
tcp_interactions.show()
And here's some sample data. There is no header
0|AFRICA|lar deposits. blithely final packages cajole. regular waters are final requests. regular accounts are according to |
1|AMERICA|hs use ironic, even requests. s|
2|ASIA|ges. thinly even pinto beans ca|
tcp_interactions.show() returns nothing, just the header of region_id|name|comment|. What am I doing incorrectly? In the SQL statement, is region pointing to the region variable declared in the first line of code, or is it pointing to something else?