I am trying to use a temp table with SQLAlchemy and join it against an existing table. This is what I have so far:
engine = db.get_engine(db.app, 'MY_DATABASE')
df = pd.DataFrame({"id": [1, 2, 3], "value": [100, 200, 300], "date": [date.today(), date.today(), date.today()]})
temp_table = db.Table('#temp_table',
                      db.Column('id', db.Integer),
                      db.Column('value', db.Integer),
                      db.Column('date', db.DateTime))
temp_table.create(engine)
df.to_sql(name='tempdb.dbo.#temp_table',
          con=engine,
          if_exists='append',
          index=False)
query = db.session.query(ExistingTable.id).join(temp_table, temp_table.c.id == ExistingTable.id)
out_df = pd.read_sql(query.statement, engine)
temp_table.drop(engine)
return out_df.to_dict('records')
This doesn't return any results because the INSERT statements that to_sql issues never actually take effect (I think this is because they are run using sp_prepexec, but I'm not entirely sure about that).
I then tried just writing out the SQL statement (CREATE TABLE #temp_table..., INSERT INTO #temp_table..., SELECT [id] FROM...) and then running pd.read_sql(query, engine). I get the error message
This result object does not return rows. It has been closed automatically.
I guess this is because the statement does more than just SELECT?
How can I fix this issue (either solution would work, although the first would be preferable as it avoids hard-coded SQL)? To be clear, I can't modify the schema in the existing database, since it's a vendor database.
If the number of records to be inserted into the temporary table is small or moderate, one possibility is to use a literal subquery or a values CTE instead of creating a temporary table.
# MODEL
class ExistingTable(Base):
    __tablename__ = 'existing_table'
    id = sa.Column(sa.Integer, primary_key=True)
    name = sa.Column(sa.String)
    # ...
Assume also that the following data is to be inserted into the temp table:
# This data retrieved from another database and used for filtering
rows = [
    (1, 100, datetime.date(2017, 1, 1)),
    (3, 300, datetime.date(2017, 3, 1)),
    (5, 500, datetime.date(2017, 5, 1)),
]
Create a CTE or a sub-query containing that data:
stmts = [
    # NOTE: optimization to reduce the size of the statement:
    # make type cast only for first row, for other rows DB engine will infer
    sa.select([
        sa.cast(sa.literal(i), sa.Integer).label("id"),
        sa.cast(sa.literal(v), sa.Integer).label("value"),
        sa.cast(sa.literal(d), sa.DateTime).label("date"),
    ]) if idx == 0 else
    sa.select([sa.literal(i), sa.literal(v), sa.literal(d)])  # no type cast
    for idx, (i, v, d) in enumerate(rows)
]
subquery = sa.union_all(*stmts)
# Choose one option below.
# I personally prefer B because one could reuse the CTE multiple times in the same query
# subquery = subquery.alias("temp_table") # option A
subquery = subquery.cte(name="temp_table") # option B
Create the final query with the required joins and filters:
query = (
    session
    .query(ExistingTable.id)
    .join(subquery, subquery.c.id == ExistingTable.id)
    # .filter(subquery.c.date >= XXX_DATE)
)

# TEMP: Test result output
for res in query:
    print(res)
Finally, get the pandas data frame:
out_df = pd.read_sql(query.statement, engine)
result = out_df.to_dict('records')
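Since option B produces a CTE, the same temp_table can be referenced more than once within a single statement. A small sketch of what that reuse could look like (not part of the original answer; the max-id filter is purely illustrative and untested):

# The CTE is referenced twice here: once in the join and once in a scalar subquery.
max_id = sa.select([sa.func.max(subquery.c.id)]).as_scalar()
query = (
    session
    .query(ExistingTable.id)
    .join(subquery, subquery.c.id == ExistingTable.id)
    .filter(ExistingTable.id <= max_id)
)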
You can try another solution: a Process-Keyed Table.
A process-keyed table is simply a permanent table that serves as a temp table. To permit processes to use the table simultaneously, the table has an extra column to identify the process. The simplest way to do this is the global variable @@spid (@@spid is the process id in SQL Server).
...
One alternative for the process-key is to use a GUID (data type
uniqueidentifier).
http://www.sommarskog.se/share_data.html#prockeyed
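Not from the linked article: a minimal sketch of how a process-keyed table could be used from SQLAlchemy for the temp-table case above, reusing the rows list and ExistingTable model from the previous answer. Table and column names are illustrative and the code is untested.

import pandas as pd
import sqlalchemy as sa

# Permanent table that acts as a shared temp table; the extra "spid" column
# identifies the process that owns each row. All names here are illustrative.
meta = sa.MetaData()
process_keyed = sa.Table(
    'process_keyed_temp', meta,
    sa.Column('spid', sa.Integer, nullable=False),  # will hold @@SPID
    sa.Column('id', sa.Integer),
    sa.Column('value', sa.Integer),
    sa.Column('date', sa.DateTime),
)
meta.create_all(engine)  # one-time setup, not per request

with engine.begin() as conn:
    # Tag the rows with the current process id...
    spid = conn.execute(sa.text("SELECT @@SPID")).scalar()
    conn.execute(process_keyed.insert(),
                 [{"spid": spid, "id": i, "value": v, "date": d} for i, v, d in rows])
    # ...join against them exactly as you would against a temp table...
    stmt = (sa.select([ExistingTable.id])
            .select_from(ExistingTable.__table__.join(
                process_keyed, process_keyed.c.id == ExistingTable.id))
            .where(process_keyed.c.spid == spid))
    out_df = pd.read_sql(stmt, conn)
    # ...and clean up only this process's rows when done.
    conn.execute(process_keyed.delete().where(process_keyed.c.spid == spid))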
Related
The following Python code successfully appends the rows of a pandas DataFrame to an MS SQL table via a previously configured SQLAlchemy engine.
df.to_sql(schema='stg', name = 'TEST', con=engine, if_exists='append', index=False)
I want to obtain the auto-generated ID numbers for each of the rows inserted into the stg.TEST table. In other words, what is the SQLAlchemy equivalent of the SQL Server OUTPUT clause during an INSERT statement?
Unfortunately, there is no easy solution to your problem, such as an additional parameter in your statement. You have to rely on the behavior that new rows get assigned the highest id + 1. With this knowledge, you can calculate the ids of all your rows.
Option 1: Explained in this answer. You select the current maximum id before the insert statement. Then you assign to all the entries in your DataFrame ids greater than the previous maximum. Lastly, you insert the df, which already includes the ids.
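A hedged sketch of option 1, reusing the students example from option 2 below (untested; assumes an integer id column and no concurrent writers):

def insert_df_with_precomputed_ids(df, engine):
    # Option 1 (sketch): read the current maximum id first, assign the
    # following ids to the DataFrame, then insert the rows including the ids.
    with engine.connect() as conn:
        current_max = conn.execute('SELECT max(id) FROM students').scalar() or 0
        df = df.copy()
        df['id'] = range(current_max + 1, current_max + 1 + len(df))
        df.to_sql('students', conn, if_exists='append', index=False)
    return df['id'].tolist()

Like option 2, this only works reliably when nothing else inserts into the table between the SELECT and the INSERT.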
Option 2: You insert the DataFrame and then acquire the highest id. With the number of entries inserted, you can calculate the ids of all entries. This is how such an insert function could look:
def insert_df_and_return_ids(df, engine):
    # It is important to use the same connection for both statements if
    # something like last_insert_rowid() is used
    conn = engine.connect()

    # Insert the df into the database
    df.to_sql('students', conn, if_exists='append', index=False)

    # Acquire the maximum id
    result = conn.execute('SELECT max(id) FROM students')  # Should work for all SQL variants
    # result = conn.execute('SELECT last_insert_rowid()')  # Specifically for SQLite
    # result = conn.execute('SELECT last_insert_id()')     # Specifically for MySQL
    entries = df.shape[0]
    last_id = -1

    # Iterate over the result to get the last inserted id
    for row in result:
        last_id = int(str(row[0]))
    conn.close()

    # Generate the list of ids
    list_of_ids = list(range(last_id - entries + 1, last_id + 1))
    return list_of_ids
PS: I could not test the function on an MS SQL server, but the behavior should be the same. In order to test whether everything behaves as it should, you can use this:
import numpy as np
import pandas as pd
import sqlalchemy as sa

# Change connection to MS SQL server
engine = sa.create_engine('sqlite:///test.lite', echo=False)

# Create table
meta = sa.MetaData()
students = sa.Table(
    'students', meta,
    sa.Column('id', sa.Integer, primary_key=True),
    sa.Column('name', sa.String),
)
meta.create_all(engine)

# DataFrame to insert with two entries
df = pd.DataFrame({'name': ['Alice', 'Bob']})
ids = insert_df_and_return_ids(df, engine)
print(ids)  # [1, 2]

conn = engine.connect()
# Insert any entry with a high id in order to check if new ids are always the maximum
result = conn.execute("INSERT INTO students (id, name) VALUES (53, 'Charlie')")
conn.close()

# Insert the data frame again
ids = insert_df_and_return_ids(df, engine)
print(ids)  # [54, 55]
EDIT: If multiple threads are utilized, transactions can be used to make this option thread-safe, at least for SQLite:
conn = engine.connect()
transaction = conn.begin()
df.to_sql('students', conn, if_exists='append', index=False)
result = conn.execute('SELECT max(id) FROM students')
transaction.commit()
I am using table merging in order to select items from my db against a list of parameter tuples. The query works fine, but cur.fetchall() does not return the entire table that I want.
For example:
data = (
    (1, '2020-11-19'),
    (1, '2020-11-20'),
    (1, '2020-11-21'),
    (2, '2020-11-19'),
    (2, '2020-11-20'),
    (2, '2020-11-21')
)
query = """
with data(song_id, date) as (
values %s
)
select t.*
from my_table t
join data d
on t.song_id = d.song_id and t.date = d.date::date
"""
execute_values(cursor, query, data)
results = cursor.fetchall()
In practice, my list of tuples to check against is thousands of rows long, and I expect the response to also be thousands of rows long.
But I am only getting 5 rows back if I call cur.fetchall() at the end of this request.
I know that this is because execute_values batches the requests, but there is some strange behavior.
If I pass page_size=10 then I get 2 items back. And if I set fetch=True then I get no results at all (even though the rowcount does not match that).
My thought was to batch these requests, but the page_size for the batch is not matching the number of items that I'm expecting per batch.
How should I change this request so that I can get all the results I'm expecting?
Edit: (years later after much experience with this)
What you really want to do here is use the COPY command to bulk insert your data into a temporary table, then use that temporary table to join on both of your columns as you would with a normal table. With psycopg2 you can use the copy_expert method to perform the COPY. To reiterate (following this example), here's how you would do that...
Also, trust me when I say this: if SPEED is an issue for you, this is by far the fastest method out there, and it's not even close.
The code in this example is not tested.
# Assumes `cur` is an existing psycopg2 cursor and `sql` is psycopg2.sql
df = pd.DataFrame('<whatever your dataframe is>')

# Start by creating the temporary table
string = '''
    create temp table mydata (
        item_id int,
        date date
    );
'''
cur.execute(string)

# Now you need to generate an sql string that will copy
# your data into the db
string = sql.SQL("""
    copy {} ({})
    from stdin (
        format csv,
        null "NaN",
        delimiter ',',
        header
    )
""").format(sql.Identifier('mydata'),
            sql.SQL(',').join([sql.Identifier(i) for i in df.columns]))

# Write your dataframe to the disk as a csv
df.to_csv('./temp_dataframe.csv', index=False, na_rep='NaN')

# Copy into the database
with open('./temp_dataframe.csv') as csv_file:
    cur.copy_expert(string, csv_file)

# Now your data should be in your temporary table, so we can
# perform our select like normal
string = '''
    select t.*
    from my_table t
    join mydata d
      on t.item_id = d.item_id and t.date = d.date
'''
cur.execute(string)
data = cur.fetchall()
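If a DataFrame is more convenient downstream, the fetched rows can be wrapped back up using the cursor's column metadata (a small addition, not part of the original example):

# Optional: rebuild a pandas DataFrame from the fetched rows
columns = [desc[0] for desc in cur.description]
result_df = pd.DataFrame(data, columns=columns)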
I have a number of dataframes with information associated with different systems.
Now I'm trying to write the information to multiple tables (one table per system) using SQLAlchemy.
(I'm pretty new to Python and SQLAlchemy, though.)
So I'm wondering if there's a nicer way to write the values of each column of a dataframe to DIFFERENT tables?
E.g. column 1 of dataframes 3 and 4 to table 1, column 2 of dataframes 3 and 4 to table 2, and so on.
Also, I keep getting the integrity error "Duplicate entry" if any value is written twice to the same column in the table.
x = 0
for index in a_id:
    table_sim = Table(
        f'simulated_for_sys_{np.int_(index)}', meta,
        Column('timestamp', DateTime, primary_key=True),
        Column('system__id', Integer),
        Column('simulated_yield_in_kWh', Float),
        Column('global_irradiance_tilted_in_kWh_per_m2', Float)
    )

    # Checking if table already exists
    if not engine.dialect.has_table(engine, f'simulated_for_sys_{np.int_(index)}'):
        print("Tables created", table_sim)
        # Specified table
        meta.create_all(engine)
    else:
        print("Table already exists...")

    conn = engine.connect()

    # Write timestamps from 01.01.xxxx till now
    for timestamp_utc in timestamp_df['timestamp_utc']:
        print(timestamp_utc)
        ins = table_sim.insert().values(timestamp=timestamp_utc.to_pydatetime())
        result = conn.execute(ins)

    # Write id to table (giving duplicate error..)
    for system_id in sys_ids_df['system_id']:
        ins1 = table_sim.insert().values(system__id=system_id)
        conn = engine.connect()
        result = conn.execute(ins1)

    # Write pr information from column x of dataframe to table
    # (also duplicate error in between if same values appear)
    colname = f'pr_{x}'
    for colname in pr_daily_df[f'{colname}']:
        ins2 = table_sim.insert().values(simulated_yield_in_kWh=colname)
        result = conn.execute(ins2)

    # Write rad information from column x of dataframe to table
    # (also duplicate error in between if same values appear)
    colname = f'rad_{x}'
    for colname in rad_daily_df[f'{colname}']:
        ins3 = table_sim.insert().values(global_irradiance_tilted_in_kWh_per_m2=colname)
        result = conn.execute(ins3)

    x += 1
So I have about 4-5 million rows of data per table, and about 10-15 of these tables. I created a table of 30,000 rows that will be joined to some of these millions of rows based on an ID and a snapshot date.
Is there a way to pass my existing data table into the SQL query so that it filters the results down for me and I do not have to load the entire tables into memory?
At the moment I've been loading each table one at a time and then releasing the memory. However, it still takes up 100% of the memory on my computer.
for table in tablesToJoin:
    if df is not None:
        print("DF LENGTH", len(df))

    query = """SET NOCOUNT ON; SELECT * FROM """ + table + """ (nolock) where snapshotdate = '""" + date + """'"""
    query += """ SET NOCOUNT OFF;"""

    start = time.time()
    loadedDf = pd.read_sql_query(query, conn)
    if df is None:
        df = loadedDf
    else:
        loadedDf.info(verbose=True, null_counts=True)
        df.info(verbose=True, null_counts=True)
        df = df.merge(loadedDf, how='left', on=["MemberID", "SnapshotDate"])
        #df = df.fillna(0)

    print("DATA AFTER ALL MERGING", len(df))
    print("Length of data loaded:", len(loadedDf))
    print("Time to load data from sql", (time.time() - start))
I once faced the same problem you are facing. My solution was to filter as much as possible in the SQL layer. Since I don't have your code and your DB, what I write below is untested code and very possibly contains bugs. You will have to correct them as needed.
The idea is to read as little as possible from the DB; pandas is not designed to analyze frames of millions of rows (at least on a typical computer). To achieve that, pass the filter criteria from df to your DB call:
from sqlalchemy import MetaData, and_, or_

engine = ...  # construct your SQLAlchemy engine. May correspond to your `conn` object
meta = MetaData()
meta.reflect(bind=engine, only=tablesToJoin)

for table in tablesToJoin:
    t = meta.tables[table]

    # Building the WHERE clause. This is equivalent to:
    #     WHERE ((MemberID = <MemberID 1>) AND (SnapshotDate = date))
    #        OR ((MemberID = <MemberID 2>) AND (SnapshotDate = date))
    #        OR ((MemberID = <MemberID 3>) AND (SnapshotDate = date))
    cond = or_(*[and_(t.c['MemberID'] == member_id, t.c['SnapshotDate'] == date)
                 for member_id in df['MemberID']])

    # Be frugal here: only get the columns that you need, or you will blow your memory
    # If you specify None, it's equivalent to a `SELECT *`
    statement = t.select(None).where(cond)

    # Note that it's `read_sql`, not `read_sql_query` here
    loadedDf = pd.read_sql(statement, engine)

    # loadedDf should be much smaller now since you have already filtered it at the DB level
    # Now do your joins...
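    # (Sketch, not from the original answer) The join itself can then be the same
    # pandas merge as in the question, only now loadedDf is much smaller:
    df = df.merge(loadedDf, how='left', on=["MemberID", "SnapshotDate"])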
I have a database with two tables. The ssi_processed_files_prod table contains file information including the created date and a boolean indicating if the data has been deleted. The data table contains the actual data the boolean references.
I want to get a list of IDs over the age of 45 days from the file_info table, delete the associated rows from the data table, then set the boolean from file_info to True to indicate the data has been deleted.
file_log_test = Table('ssi_processed_files_prod', metadata, autoload=True, autoload_with=engine)

stmt = select([file_log_test.columns.id])
stmt = stmt.where(func.datediff(text('day'),
                                file_log_test.columns.processing_end_time, func.getDate()) > 45)

connection = engine.connect()
results = connection.execute(stmt).fetchall()
This query returns the correct results; however, I have not been able to work with the output effectively.
For those who would like to know the answer: this was based on reading the Essential SQLAlchemy book. The initial block of code was correct, but I had to flatten the results into a list. From there I could use the in_() conjunction to work with the list of ids. This allowed me to delete rows from the relevant table and update the data status in another.
file_log_test = Table('ssi_processed_files_prod', metadata, autoload=True,
                      autoload_with=engine)

stmt = select([file_log_test.columns.id])
stmt = stmt.where(func.datediff(text('day'),
                                file_log_test.columns.processing_end_time, func.getDate()) > 45)

connection = engine.connect()
results = connection.execute(stmt).fetchall()

ids_to_delete = [x[0] for x in results]

d = delete(data).where(data.c.filename_id.in_(ids_to_delete))
connection.execute(d)
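The question also mentions setting the boolean in ssi_processed_files_prod to True afterwards. A minimal sketch of that last step, assuming the flag column is called data_deleted (the real column name isn't given in the question):

from sqlalchemy import update

# 'data_deleted' is a hypothetical name for the boolean flag column
u = (update(file_log_test)
     .where(file_log_test.columns.id.in_(ids_to_delete))
     .values(data_deleted=True))
connection.execute(u)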