I have a database with two tables. The ssi_processed_files_prod table contains file information, including the created date and a boolean indicating whether the data has been deleted. The data table contains the actual data that the boolean refers to.
I want to get a list of IDs older than 45 days from the file info table, delete the associated rows from the data table, and then set the boolean in the file info table to True to indicate that the data has been deleted.
file_log_test = Table('ssi_processed_files_prod', metadata, autoload=True, autoload_with=engine)
stmt = select([file_log_test.columns.id])
stmt = stmt.where(func.datediff(text('day'),
                                file_log_test.columns.processing_end_time, func.getDate()) > 45)
connection = engine.connect()
results = connection.execute(stmt).fetchall()
This query returns the correct results; however, I have not been able to work with the output effectively.
For those who would like to know the answer: this was based on reading the Essential SQLAlchemy book. The initial block of code was correct, but I had to flatten the results into a list. From there I could use the in_() method to work with the list of IDs, which allowed me to delete rows from the relevant table and update the data status in another.
file_log_test = Table('ssi_processed_files_prod', metadata, autoload=True,
                      autoload_with=engine)
stmt = select([file_log_test.columns.id])
stmt = stmt.where(func.datediff(text('day'),
                                file_log_test.columns.processing_end_time, func.getDate()) > 45)
connection = engine.connect()
results = connection.execute(stmt).fetchall()
ids_to_delete = [x[0] for x in results]
d = delete(data).where(data.c.filename_id.in_(ids_to_delete))
connection.execute(d)
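The matching status update on ssi_processed_files_prod can reuse the same id list. A minimal sketch, assuming the boolean column is called is_deleted (the actual column name is not shown in the question):
from sqlalchemy import update
u = update(file_log_test).where(file_log_test.columns.id.in_(ids_to_delete)).values(is_deleted=True)  # 'is_deleted' is a placeholder column name
connection.execute(u)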
The following Python code successfully appends the rows of a pandas DataFrame to an MS SQL table via the previously configured SQLAlchemy engine.
df.to_sql(schema='stg', name = 'TEST', con=engine, if_exists='append', index=False)
I want to obtain the auto-generated ID numbers for each of the rows inserted into the stg.TEST table. In other words, what is the SQLAlchemy equivalent of the SQL Server OUTPUT clause on an INSERT statement?
Unfortunately, there is no easy solution to your problem, such as an additional parameter in your statement. You have to rely on the behavior that new rows are assigned the highest id + 1. With this knowledge, you can calculate the ids of all your rows.
Option 1: Explained in this answer. You select the current maximum id before the insert statement, then assign ids greater than that maximum to all the entries in your DataFrame, and finally insert the df, which already includes the ids.
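A rough, untested sketch of Option 1, assuming the same students table used in Option 2 below (and no concurrent writers):
with engine.connect() as conn:
    # current maximum id before the insert (None for an empty table)
    current_max = conn.execute('SELECT max(id) FROM students').scalar() or 0
df = df.copy()
df['id'] = range(current_max + 1, current_max + 1 + len(df))
# insert the DataFrame, which now already includes the ids
df.to_sql('students', engine, if_exists='append', index=False)
list_of_ids = df['id'].tolist()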
Option 2: You insert the DataFrame and then acquire the highest id. From the number of entries inserted, you can calculate the ids of all entries. This is how such an insert function could look:
def insert_df_and_return_ids(df, engine):
    # It is important to use the same connection for both statements if
    # something like last_insert_rowid() is used
    conn = engine.connect()
    # Insert the df into the database
    df.to_sql('students', conn, if_exists='append', index=False)
    # Acquire the maximum id
    result = conn.execute('SELECT max(id) FROM students')  # Should work for all SQL variants
    # result = conn.execute('Select last_insert_rowid()')  # Specifically for SQLite
    # result = conn.execute('Select last_insert_id()')     # Specifically for MySql
    entries = df.shape[0]
    last_id = -1
    # Iterate over result to get last inserted id
    for row in result:
        last_id = int(str(row[0]))
    conn.close()
    # Generate list of ids
    list_of_ids = list(range(last_id - entries + 1, last_id + 1))
    return list_of_ids
PS: I could not test the function on an MS SQL server, but the behavior should be the same. To check that everything behaves as it should, you can use this:
import numpy as np
import pandas as pd
import sqlalchemy as sa
# Change connection to MS SQL server
engine = sa.create_engine('sqlite:///test.lite', echo=False)
# Create table
meta = sa.MetaData()
students = sa.Table(
    'students', meta,
    sa.Column('id', sa.Integer, primary_key=True),
    sa.Column('name', sa.String),
)
meta.create_all(engine)
# DataFrame to insert with two entries
df = pd.DataFrame({'name': ['Alice', 'Bob']})
ids = insert_df_and_return_ids(df, engine)
print(ids) # [1,2]
conn = engine.connect()
# Insert any entry with a high id in order to check if new ids are always the maximum
result = conn.execute("Insert into students (id, name) VALUES (53, 'Charlie')")
conn.close()
# Insert data frame again
ids = insert_df_and_return_ids(df, engine)
print(ids) # [54, 55]
EDIT: If multiple threads are involved, transactions can be used to make this approach thread-safe, at least for SQLite:
conn = engine.connect()
transaction = conn.begin()
df.to_sql('students', conn, if_exists='append', index=False)
result = conn.execute('SELECT max(id) FROM students')
transaction.commit()
I was playing around with SQLAlchemy and Microsoft SQL Server to get the hang of the functions when I came across a strange behavior. I was taught that the rowcount attribute on the result proxy object tells how many rows were affected by executing a statement. However, when I select or insert single or multiple rows in my test database, I always get -1. How can this be, and how can I fix it to reflect reality?
connection = engine.connect()
metadata = MetaData()
# Ex1: select statement for all values
student = Table('student', metadata, autoload=True, autoload_with=engine)
stmt = select([student])
result_proxy = connection.execute(stmt)
results = result_proxy.fetchall()
print(result_proxy.rowcount)
# Ex2: inserting single values
stmt = insert(student).values(firstname='Severus', lastname='Snape')
result_proxy = connection.execute(stmt)
print(result_proxy.rowcount)
# Ex3: inserting multiple values
stmt = insert(student)
values_list = [{'firstname': 'Rubius', 'lastname': 'Hagrid'},
{'firstname': 'Minerva', 'lastname': 'McGonogall'}]
result_proxy = connection.execute(stmt, values_list)
print(result_proxy.rowcount)
The print call in each separately run example block prints -1. Ex1 successfully fetches all rows, and both insert statements successfully write the data to the database.
According to the following issue, the rowcount attribute isn't always to be trusted. Is that true here as well? And if so, how can I compensate with a COUNT statement in a SQLAlchemy transaction?
PDO::rowCount() returning -1
The single-row INSERT … VALUES ( … ) case is trivial: if the statement succeeds, then one row was affected, and if it fails (throws an error), then zero rows were affected.
For a multi-row INSERT, simply perform it inside a transaction and roll back if an error occurs. Then the number of rows affected will be either zero or len(values_list).
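A minimal sketch of that pattern, reusing the student table, connection, and values_list from the question:
transaction = connection.begin()
try:
    connection.execute(insert(student), values_list)
    transaction.commit()
    rows_affected = len(values_list)  # every row in values_list was inserted
except Exception:
    transaction.rollback()
    rows_affected = 0                 # nothing was inserted
print(rows_affected)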
To get the number of rows that a SELECT will return, wrap the select query in a SELECT count(*) query and run that first, for example:
select_stmt = sa.select([Parent])
count_stmt = sa.select([sa.func.count(sa.text("*"))]).select_from(
    select_stmt.alias("s")
)
with engine.connect() as conn:
    conn.execution_options(isolation_level="SERIALIZABLE")
    rows_found = conn.execute(count_stmt).scalar()
    print(f"{rows_found} row(s) found")
    results = conn.execute(select_stmt).fetchall()
    for item in results:
        print(item.id)
So I have about 4-5 million rows of data per table, and about 10-15 of these tables. I created a table that joins 30,000 rows to some of these millions of rows based on an ID and a snapshot date.
Is there a way to push my existing data table into the SQL query so that the database filters the results down for me, and I do not have to load the entire tables into memory?
At the moment I've been loading the tables one at a time and then releasing the memory. However, it still takes up 100% of the memory on my computer.
for table in tablesToJoin:
    if df is not None:
        print("DF LENGTH", len(df))
    query = """SET NOCOUNT ON; SELECT * FROM """ + table + """ (nolock) where snapshotdate = '""" + date + """'"""
    query += """ SET NOCOUNT OFF;"""
    start = time.time()
    loadedDf = pd.read_sql_query(query, conn)
    if df is None:
        df = loadedDf
    else:
        loadedDf.info(verbose=True, null_counts=True)
        df.info(verbose=True, null_counts=True)
        df = df.merge(loadedDf, how='left', on=["MemberID", "SnapshotDate"])
        #df = df.fillna(0)
        print("DATA AFTER ALL MERGING", len(df))
    print("Length of data loaded:", len(loadedDf))
    print("Time to load data from sql", (time.time() - start))
I once faced the same problem as you. My solution was to filter as much as possible in the SQL layer. Since I don't have your code and your DB, what I write below is untested and very possibly contains bugs. You will have to correct it as needed.
The idea is to read as little as possible from the DB. pandas is not designed to analyze frames of millions of rows (at least on a typical computer). To do that, pass the filter criteria from df to your DB call:
from sqlalchemy import MetaData, and_, or_

engine = ...  # construct your SQLAlchemy engine. May correspond to your `conn` object
meta = MetaData()
meta.reflect(bind=engine, only=tablesToJoin)

for table in tablesToJoin:
    t = meta.tables[table]
    # Building the WHERE clause. This is equivalent to:
    #     WHERE ((MemberID = <MemberID 1>) AND (SnapshotDate = date))
    #        OR ((MemberID = <MemberID 2>) AND (SnapshotDate = date))
    #        OR ((MemberID = <MemberID 3>) AND (SnapshotDate = date))
    cond = or_(*[and_(t.c['MemberID'] == member_id, t.c['SnapshotDate'] == date) for member_id in df['MemberID']])
    # Be frugal here: only get the columns that you need, or you will blow your memory
    # If you specify None, it's equivalent to a `SELECT *`
    statement = t.select(None).where(cond)
    # Note that it's `read_sql`, not `read_sql_query` here
    loadedDf = pd.read_sql(statement, engine)
    # loadedDf should be much smaller now since you have already filtered it at the DB level
    # Now do your joins...
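    # A possible join step, mirroring the original loop (untested sketch;
    # assumes the same MemberID/SnapshotDate keys and that df starts as None):
    if df is None:
        df = loadedDf
    else:
        df = df.merge(loadedDf, how='left', on=['MemberID', 'SnapshotDate'])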
I have a massive table (over 100B records) to which I added an empty column. I parse strings from another (string) field when the required string is available, extract an integer from it, and want to write that integer into the new column for all rows containing that string.
At the moment, after the data has been parsed and saved locally in a dataframe, I iterate over it to update the Redshift table with the clean data. This takes approx. 1 sec per iteration, which is way too long.
My current code example:
conn = psycopg2.connect(connection_details)
cur = conn.cursor()
clean_df = raw_data.apply(clean_field_to_parse)
for ind, row in clean_df.iterrows():
    update_query = build_update_query(row.id, row.clean_integer1, row.clean_integer2)
    cur.execute(update_query)
where build_update_query is a function to generate the update query:
def build_update_query(id, int1, int2):
    query = """
        update tab_tab
        set
            clean_int_1 = {}::int,
            clean_int_2 = {}::int,
            updated_date = GETDATE()
        where id = {}
        ;
    """
    return query.format(int1, int2, id)
and where clean_df is structured like:
id    field_to_parse       clean_int_1    clean_int_2
1     {'int_1': '2+1'}     3              np.nan
2     {'int_2': '7-0'}     np.nan         7
Is there a way to update specific table fields in bulk, so that there is no need to execute one query at a time?
I'm parsing the strings and running the update statement from Python. The database is stored on Redshift.
As mentioned, consider pure SQL and avoid iterating through billions of rows by pushing the pandas data frame to Postgres as a staging table, then running one single UPDATE across both tables. With SQLAlchemy you can use DataFrame.to_sql to create a table replica of the data frame, even add an index on the join field, id, and drop the staging table at the end.
from sqlalchemy import create_engine

engine = create_engine("postgresql+psycopg2://myuser:mypwd!#myhost/mydatabase")

# PUSH TO POSTGRES (SAME NAME AS DF)
clean_df.to_sql(name="clean_df", con=engine, if_exists="replace", index=False)

# SQL UPDATE (USING TRANSACTION)
with engine.begin() as conn:
    sql = "CREATE INDEX idx_clean_df_id ON clean_df(id)"
    conn.execute(sql)

    sql = """UPDATE tab_tab t
             SET clean_int_1 = c.clean_int_1,
                 clean_int_2 = c.clean_int_2,
                 updated_date = GETDATE()
             FROM clean_df c
             WHERE c.id = t.id
          """
    conn.execute(sql)

    sql = "DROP TABLE IF EXISTS clean_df"
    conn.execute(sql)

engine.dispose()
I have a database of every major airport's lat/long coords all across the world. I only need a portion of them (specifically in the USA) that are listed in a separate .csv file.
This CSV file has two columns whose data I extracted into two lists: the origin airport code (IATA code) and the destination airport code (also IATA).
My database has a column for IATA, and essentially I'm trying to query this database to get the respective lat/long coords for each airport in the two lists I have.
Here's my code:
import os
import pandas as pd
from sqlalchemy import create_engine
engine = create_engine('sqlite:///airport_coordinates.db')
# The dataframe that contains the IATA codes for the airports I need
airport_relpath = "data/processed/%s_%s_combined.csv" % (file, airline)
script_dir = os.path.dirname(os.getcwd())
temp_file = os.path.join(script_dir, airport_relpath)
fields = ["Origin_Airport_Code", "Destination_Airport_Code"]
df_airports = pd.read_csv(temp_file, usecols=fields)
# the origin/destination IATA codes for the airports I need
origin = df_airports.Origin_Airport_Code.values
dest = df_airports.Destination_Airport_Code.values
# query the database for the lat/long coords of the airports I need
sql = ('SELECT lat, long FROM airportCoords WHERE iata IN %s' %(origin))
indexcols = ['lat', 'long']
df_origin = pd.read_sql(sql, engine)
# testing the origin coordinates
print(df_origin)
This is the error I'm getting:
sqlalchemy.exc.OperationalError: (sqlite3.OperationalError) no such table: 'JFK' 'JFK' 'JFK' ... 'MIA' 'JFK' 'MIA' [SQL: "SELECT lat, long FROM airportCoords WHERE iata IN ['JFK' 'JFK' 'JFK' ... 'MIA' 'JFK' 'MIA']"] (Background on this error at: http://sqlalche.me/e/e3q8)
It's definitely because I'm not querying it correctly (since it thinks my queries are supposed to be tables).
I tried looping through the list to query each element individually, but the list contains 604,885 elements and my computer was not able to come up with any output.
Your error is in using string interpolation:
sql = ('SELECT lat, long FROM airportCoords WHERE iata IN %s' %(origin))
Because origin is a Numpy array, the string interpolation produces [....] SQL identifier syntax in the query; see the SQLite documentation:
If you want to use a keyword as a name, you need to quote it. There are four ways of quoting keywords in SQLite:
[...]
[keyword] A keyword enclosed in square brackets is an identifier. [...]
You asked SQLite to check if iata is in a table named ['JFK' 'JFK' 'JFK' ... 'MIA' 'JFK' 'MIA'] because that's the string representation of a Numpy array.
You are already using SQLAlchemy; it would be easier to let that library generate all the SQL for you, including the IN (....) membership test:
from sqlalchemy import *

filter = literal_column('iata', String).in_(origin)
sql = select([
    literal_column('lat', Float),
    literal_column('long', Float),
]).select_from(table('airportCoords')).where(filter)
then pass sql in as the query.
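For example, the rest of your code can stay the same:
df_origin = pd.read_sql(sql, engine)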
I used literal_column() and table() objects here to shortcut directly to the names of the objects, but you could also ask SQLAlchemy to reflect your database table directly from the engine object you already created, then use the resulting table definition to generate the query:
metadata = MetaData()
airport_coords = Table('airportCoords', metadata, autoload=True, autoload_with=engine)
at which point the query would be defined as:
filter = airport_coords.c.iata.in_(origin)
sql = select([airport_coords.c.lat, airport_coords.c.long]).where(filter)
I'd also include the iata code in the output, otherwise you will have no path back to connecting IATA code to the matching coordinates:
sql = select([airport_coords.c.lat, airport_coords.c.long, airport_coords.c.iata]).where(filter)
Next, since you say you have 604,885 elements in the list, you probably want to load that CSV data into a temporary table to keep the query efficient:
engine = create_engine('sqlite:///airport_coordinates.db')
# code to read CSV file
# ...
df_airports = pd.read_csv(temp_file, usecols=fields)
# SQLAlchemy table wrangling
metadata = MetaData()
airport_coords = Table('airportCoords', metadata, autoload=True, autoload_with=engine)
temp = Table(
    "airports_temp",
    metadata,
    *(Column(field, String) for field in fields),
    prefixes=['TEMPORARY'],
)

with engine.begin() as conn:
    # insert CSV values into a temporary table in SQLite
    temp.create(conn, checkfirst=True)
    df_airports.to_sql(temp.name, conn, if_exists='append')

# Join the airport coords against the temporary table
joined = airport_coords.join(temp, airport_coords.c.iata == temp.c.Origin_Airport_Code)

# select coordinates per airport, include the iata code
sql = select([airport_coords.c.lat, airport_coords.c.long, airport_coords.c.iata]).select_from(joined)
df_origin = pd.read_sql(sql, engine)
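If you also need the destination coordinates, the same temporary table can be joined on its other column; an untested sketch along the same lines:
joined_dest = airport_coords.join(temp, airport_coords.c.iata == temp.c.Destination_Airport_Code)
sql_dest = select([airport_coords.c.lat, airport_coords.c.long, airport_coords.c.iata]).select_from(joined_dest)
df_dest = pd.read_sql(sql_dest, engine)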