I have been working on a dataframe (previously extracted from a table with SQLAlchemy), and now I want to write the changes back by updating that table.
I have done it in this very inefficient way:
engine = sql.create_engine(connect_string)
connection = engine.connect()
metadata = sql.MetaData()
pbp = sql.Table('playbyplay', metadata, autoload=True, autoload_with=engine)
for i in range(1, len(playbyplay_substitutions)):
    query_update = ('update playbyplay set Player_1_Visitor = {0}, Player_2_Visitor = {1}, Player_3_Visitor = {2}, Player_4_Visitor = {3}, Player_5_Visitor = {4} where id_match = {5} and actionNumber = {6}'.format(
        playbyplay_substitutions.loc[i, 'Player_1_Visitor_y'],
        playbyplay_substitutions.loc[i, 'Player_2_Visitor_y'],
        playbyplay_substitutions.loc[i, 'Player_3_Visitor_y'],
        playbyplay_substitutions.loc[i, 'Player_4_Visitor_y'],
        playbyplay_substitutions.loc[i, 'Player_5_Visitor_y'],
        playbyplay_substitutions.loc[i, 'id_match'],
        playbyplay_substitutions.loc[i, 'actionNumber']))
    connection.execute(query_update)
playbyplay_substitutions is my dataframe, playbyplay is my table, and the rest are the fields I want to update or the keys of my table. I am looking for a more efficient solution than my current one, using SQLAlchemy with MySQL.
Consider using proper placeholders instead of manually formatting strings:
query_update = sql.text("""
UPDATE playbyplay
SET Player_1_Visitor = :Player_1_Visitor_y
, Player_2_Visitor = :Player_2_Visitor_y
, Player_3_Visitor = :Player_3_Visitor_y
, Player_4_Visitor = :Player_4_Visitor_y
, Player_5_Visitor = :Player_5_Visitor_y
WHERE id_match = :id_match AND actionNumber = :actionNumber
""")
# .iloc[1:] mimics the original for-loop that started from 1
args = playbyplay_substitutions[[
'Player_1_Visitor_y', 'Player_2_Visitor_y', 'Player_3_Visitor_y',
'Player_4_Visitor_y', 'Player_5_Visitor_y', 'id_match',
    'actionNumber']].iloc[1:].to_dict('records')
connection.execute(query_update, args)
If your driver is sufficiently clever, this allows it to prepare a statement once and reuse it over the data, instead of emitting queries one by one. This also avoids possible accidental SQL injection problems, where your data resembles SQL constructs when formatted as a string manually.
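If you want the whole batch applied atomically, one option is to run it inside a transaction. A minimal sketch, assuming SQLAlchemy 1.4+ and reusing connect_string, query_update and args from above:
import sqlalchemy as sql

engine = sql.create_engine(connect_string)  # connect_string as in the question

# engine.begin() opens a transaction and commits it on successful exit, so the
# whole batch of updates is applied together; passing a list of parameter
# dicts lets the driver execute the prepared statement once per row.
with engine.begin() as conn:
    conn.execute(query_update, args)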
I am trying to better understand how I can bulk update rows in SQLAlchemy using a Python function applied to each row, where the result has to be dumped to JSON, without having to iterate over the rows individually:
def do_something(x):
    return x.id + x.offset

table.update({Table.updated_field: do_something(Table)})
This is a simplification of what I am trying to accomplish, except that I get the error TypeError: Object of type InstrumentedAttribute is not JSON serializable.
Any thoughts on how to fix the issue here?
Why are you casting your Table id to a JSON string? Remove the cast and try.
Edit:
You can't apply a per-object Python function in a bulk update. For a single object you can do, for example:
table.update({Table.updated_field: json.dumps(object_of_my_table._asdict())})
If you want to update the column with the whole object for every row, you must loop and dump it in the update:
for table in dbsession.query(Table):
    table.updated_field = json.dumps(table._asdict())
    dbsession.add(table)
If you need to update millions of rows, the best way in my experience is with Pandas and bulk_update_mappings.
You can load data from the DB in bulk as a DataFrame with read_sql by passing a query statement and your engine object.
import pandas as pd
query = session.query(Table)
table_data = pd.read_sql(query.statement, engine)
Note that read_sql has a chunksize parameter, which makes it return an iterator of DataFrames, so if the table is too large to fit in memory you can process it in a loop with however many rows your PC can handle at once (chunksize should be an int):
for table_chunk in pd.read_sql(query.statement, engine, chunksize=1_000_000):
...
From there you can use apply to alter each column with any custom function you want:
table_data["column_1"] = table_data["column_1"].apply(do_something)
Then, converting the DataFrame to a dict with the records orientation puts it in the appropriate format for bulk_update_mappings:
table_data = table_data.to_dict("records")
session.bulk_update_mappings(Table, table_data)
session.commit()
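Putting those pieces together, here is a minimal, untested sketch of the chunked variant, assuming the same Table model, session, engine, query and do_something as above; it keeps only one chunk in memory at a time:
import pandas as pd

CHUNK_SIZE = 1_000_000  # tune to what your machine can hold in memory

for table_chunk in pd.read_sql(query.statement, engine, chunksize=CHUNK_SIZE):
    # Transform the chunk with the custom function.
    table_chunk["column_1"] = table_chunk["column_1"].apply(do_something)
    # Convert to the records orientation expected by bulk_update_mappings
    # (each dict must include the primary key).
    session.bulk_update_mappings(Table, table_chunk.to_dict("records"))
    session.commit()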
Additionally, if you need to perform a lot of json operations for your updates, I've used orjson for updates like this in the past which also provides a notable speed improvement over the standard library's json.
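As a small illustration of that last point, applied before the to_dict("records") step and using a hypothetical JSON-typed column named json_column:
import orjson

# orjson.dumps returns bytes, so decode to str if the column expects text.
table_data["json_column"] = table_data["json_column"].apply(
    lambda value: orjson.dumps(value).decode()
)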
Without the requirement to serialise to JSON,
session.query(Table).update({'updated_field': Table.id + Table.offset})
would work fine, performing all computations and updates in the database. However
session.query(Table).update({'updated_field': json.dumps(Table.id + Table.offset)})
does not work, because it mixes Python-level operations (json.dumps) with database-level operations (add id and offset for all rows).
Fortunately, many RDBMS provide JSON functions (SQLite, PostgreSQL, MariaDB, MySQL), so we can do the work solely in the database layer. This is considerably more efficient than fetching data into the Python layer, mutating it, and writing it back to the database. Unfortunately, the available functions and their behaviours are not consistent across RDBMS.
The following script should work for SQLite, PostgreSQL and MariaDB (and probably MySQL too). These assumptions are made:
id and offset are both columns in the database table being updated
both are integers, as is their sum
the desired result is that their sum is written to a JSON column as a scalar
import sqlalchemy as sa
from sqlalchemy import orm
from sqlalchemy.dialects.postgresql import JSONB
urls = [
    'sqlite:///so73956014.db',
    'postgresql+psycopg2:///test',
    'mysql+pymysql://root:root@localhost/test',
]

for url in urls:
    engine = sa.create_engine(url, echo=False, future=True)
    print(f'Checking {engine.dialect.name}')

    Base = orm.declarative_base()
    JSON = JSONB if engine.dialect.name == 'postgresql' else sa.JSON

    class Table(Base):
        __tablename__ = 't73956014'
        id = sa.Column(sa.Integer, primary_key=True)
        offset = sa.Column(sa.Integer)
        updated_field = sa.Column(JSON)

    Base.metadata.drop_all(engine, checkfirst=True)
    Base.metadata.create_all(engine)

    Session = orm.sessionmaker(engine, future=True)

    with Session.begin() as s:
        ts = [Table(offset=o * 10) for o in range(1, 4)]
        s.add_all(ts)

    # Use a DB-specific function to serialise to JSON.
    if engine.dialect.name == 'postgresql':
        func = sa.func.to_jsonb
    else:
        func = sa.func.json_quote
    # MariaDB requires that the argument to json_quote be a character type.
    if engine.dialect.name in ['mysql', 'mariadb']:
        expr = sa.cast(Table.id + Table.offset, sa.Text)
    else:
        expr = Table.id + Table.offset

    with Session.begin() as s:
        s.query(Table).update(
            {Table.updated_field: func(expr)}
        )

    with Session() as s:
        ts = s.scalars(sa.select(Table))
        for t in ts:
            print(t.id, t.offset, t.updated_field)

    engine.dispose()
Output:
Checking sqlite
1 10 11
2 20 22
3 30 33
Checking postgresql
1 10 11
2 20 22
3 30 33
Checking mysql
1 10 11
2 20 22
3 30 33
Other functions can be used if the desired result is an object or an array. If you are updating an existing JSON column value in place, the column may need to use the Mutable extension.
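For reference, a minimal sketch of such a column declaration, reusing Base, sa and JSON from the script above; MutableDict only makes sense when the stored JSON value is a dict that you mutate in place rather than replace:
from sqlalchemy.ext.mutable import MutableDict

class Table(Base):
    __tablename__ = 't73956014'
    id = sa.Column(sa.Integer, primary_key=True)
    offset = sa.Column(sa.Integer)
    # MutableDict makes in-place changes (e.g. obj.updated_field['k'] = v)
    # mark the attribute as dirty so the session flushes them.
    updated_field = sa.Column(MutableDict.as_mutable(JSON))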
So I have about 4-5 million rows of data per table. I have about 10-15 of these tables. I created a table that will join 30,000 rows to some of these million rows based on some ID and snapshot date.
Is there a way to write my existing data table to a SQL query where it will filter the results down for me so that I do not have to load the entire tables into memory?
At the moment I've been loading each table one at a time and then releasing the memory. However, the process still takes up 100% of the memory on my computer.
for table in tablesToJoin:
    if df is not None:
        print("DF LENGTH", len(df))
    query = """SET NOCOUNT ON; SELECT * FROM """ + table + """ (nolock) where snapshotdate = '""" + date + """'"""
    query += """ SET NOCOUNT OFF;"""
    start = time.time()
    loadedDf = pd.read_sql_query(query, conn)
    if df is None:
        df = loadedDf
    else:
        loadedDf.info(verbose=True, null_counts=True)
        df.info(verbose=True, null_counts=True)
        df = df.merge(loadedDf, how='left', on=["MemberID", "SnapshotDate"])
        #df = df.fillna(0)
    print("DATA AFTER ALL MERGING", len(df))
    print("Length of data loaded:", len(loadedDf))
    print("Time to load data from sql", (time.time() - start))
I once faced the same problem you are facing. My solution was to filter as much as possible in the SQL layer. Since I don't have your code or your DB, the code below is untested and may well contain bugs; you will have to correct them as needed.
The idea is to read as little as possible from the DB; pandas is not designed to analyze frames of millions of rows (at least on a typical computer). To do that, pass the filter criteria from df down to your DB call:
from sqlalchemy import MetaData, and_, or_

engine = ...  # construct your SQLAlchemy engine; may correspond to your `conn` object

meta = MetaData()
meta.reflect(bind=engine, only=tablesToJoin)

for table in tablesToJoin:
    t = meta.tables[table]

    # Building the WHERE clause. This is equivalent to:
    #     WHERE ((MemberID = <MemberID 1>) AND (SnapshotDate = date))
    #        OR ((MemberID = <MemberID 2>) AND (SnapshotDate = date))
    #        OR ((MemberID = <MemberID 3>) AND (SnapshotDate = date))
    cond = or_(*[
        and_(t.c['MemberID'] == member_id, t.c['SnapshotDate'] == date)
        for member_id in df['MemberID']
    ])

    # Be frugal here: only get the columns that you need, or you will blow your memory.
    # t.select() with no column restriction is equivalent to a `SELECT *`.
    statement = t.select().where(cond)

    # Note that it's `read_sql`, not `read_sql_query` here
    loadedDf = pd.read_sql(statement, engine)

    # loadedDf should be much smaller now since you have already filtered it at the DB level
    # Now do your joins...
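    # A sketch of the join step itself (untested), mirroring the merge logic
    # from the question with MemberID/SnapshotDate as the join keys:
    if df is None:
        df = loadedDf
    else:
        df = df.merge(loadedDf, how='left', on=["MemberID", "SnapshotDate"])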
I have a massive table (over 100B records) to which I added an empty column. I parse strings from another (string) field where the required string is available, extract an integer from it, and want to write that integer into the new column for all rows that contain the string.
At the moment, after data has been parsed and saved locally in a dataframe, I iterate on it to update the Redshift table with clean data. This takes approx 1sec/iteration, which is way too long.
My current code example:
conn = psycopg2.connect(connection_details)
cur = conn.cursor()
clean_df = raw_data.apply(clean_field_to_parse)
for ind, row in clean_df.iterrows():
    update_query = build_update_query(row.id, row.clean_int_1, row.clean_int_2)
    cur.execute(update_query)
where build_update_query is a function that generates the update query:
def build_update_query(id, int1, int2):
    query = """
        update tab_tab
        set
            clean_int_1 = {}::int,
            clean_int_2 = {}::int,
            updated_date = GETDATE()
        where id = {}
        ;
    """
    return query.format(int1, int2, id)
and where clean_df is structured like:
id    field_to_parse      clean_int_1    clean_int_2
1     {'int_1':'2+1'}     3              np.nan
2     {'int_2':'7-0'}     np.nan         7
Is there a way to update specific table fields in bulk, so that there is no need to execute one query at a time?
I'm parsing the strings and running the update statement from Python. The database is stored on Redshift.
As mentioned, consider pure SQL and avoid iterating through billions of rows: push the Pandas data frame to the database as a staging table and then run one single UPDATE across both tables. With SQLAlchemy you can use DataFrame.to_sql to create a table replica of the data frame, add an index on the join field, id, and drop the large staging table at the end.
from sqlalchemy import create_engine, text

engine = create_engine("postgresql+psycopg2://myuser:mypwd!@myhost/mydatabase")

# PUSH TO POSTGRES (SAME NAME AS DF)
clean_df.to_sql(name="clean_df", con=engine, if_exists="replace", index=False)

# SQL UPDATE (USING TRANSACTION)
with engine.begin() as conn:
    sql = "CREATE INDEX idx_clean_df_id ON clean_df(id)"
    conn.execute(text(sql))

    sql = """UPDATE tab_tab t
             SET clean_int_1 = c.clean_int_1,
                 clean_int_2 = c.clean_int_2,
                 updated_date = GETDATE()
             FROM clean_df c
             WHERE c.id = t.id
          """
    conn.execute(text(sql))

    sql = "DROP TABLE IF EXISTS clean_df"
    conn.execute(text(sql))

engine.dispose()
I have a database with two tables. The ssi_processed_files_prod table contains file information including the created date and a boolean indicating if the data has been deleted. The data table contains the actual data the boolean references.
I want to get a list of IDs over the age of 45 days from the file_info table, delete the associated rows from the data table, then set the boolean from file_info to True to indicate the data has been deleted.
file_log_test = Table('ssi_processed_files_prod', metadata, autoload=True, autoload_with=engine)
stmt = select([file_log_test.columns.id])
stmt = stmt.where(func.datediff(text('day'),
file_log_test.columns.processing_end_time, func.getDate()) > 45)
connection = engine.connect()
results = connection.execute(stmt).fetchall()
This query returns the correct results, however, I have not been able to work with the output effectively.
For those who would like to know the answer: this was based on reading the Essential SQLAlchemy book. The initial block of code was correct, but I had to flatten the results into a list. From there I could use the in_() construct to work with the list of ids. This allowed me to delete rows from the data table and update the deletion status in the other.
file_log_test = Table('ssi_processed_files_prod', metadata, autoload=True,
                      autoload_with=engine)
stmt = select([file_log_test.columns.id])
stmt = stmt.where(func.datediff(text('day'),
file_log_test.columns.processing_end_time, func.getDate()) > 45)
connection = engine.connect()
results = connection.execute(stmt).fetchall()
ids_to_delete = [x[0] for x in results]
d = delete(data).where(data.c.filename_id.in_(ids_to_delete))
connection.execute(d)
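The final step of flipping the boolean in ssi_processed_files_prod is not shown above; a minimal sketch of it follows, where the flag column name data_deleted is an assumption, since the question never names it:
from sqlalchemy import update

# Hypothetical flag column name; the question only says "a boolean".
u = (update(file_log_test)
     .where(file_log_test.columns.id.in_(ids_to_delete))
     .values(data_deleted=True))
connection.execute(u)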
I have the following code, which runs a MySQL SELECT command formed by left joining many tables together. I then want to write the result to another table. When I do that with Pandas, it works and the data is added to the table, but it somehow destroys all indices on the table, including the primary key.
Here is the code:
q = "SELECT util.peer_id as peer_id, util.date as ts, weekly_total_page_loads as page_loads FROM %s.%s as util LEFT JOIN \
(SELECT peer_id, date, score FROM %s.%s WHERE date = '%s') as scores \
ON util.peer_id = scores.peer_id AND util.date = scores.date WHERE util.date = '%s';"\
% (config.database_peer_groups, config.table_medians, \
config.database_peer_groups, config.db_score, date, date)
group_export = pd.read_sql(q, con = db)
q = 'USE %s;' % (config.database_export)
cursor.execute(q)
group_export.to_sql(con = db, name = config.table_group_export, if_exists = 'replace', flavor = 'mysql', index = False)
db.commit()
Any ideas?
Edit:
It seems that, by using if_exists='replace', Pandas drops the table and recreates it, and when it recreates it, it doesn't rebuild the indices.
Furthermore, this question: to_sql pandas method changes the scheme of sqlite tables
suggests that using a SQLAlchemy engine might solve the problem.
Edit:
When I use if_exists="append" the problem doesn't appear; it is only with if_exists="replace" that the problem occurs.
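Building on that last observation, here is a minimal, untested sketch of a workaround: keep the table (and therefore its indices and primary key) and only replace its contents, by emptying it and then appending with to_sql. The connection URL is a placeholder; table and variable names follow the question:
import sqlalchemy as sa

# Hypothetical connection URL for the export database.
engine = sa.create_engine("mysql+pymysql://user:password@host/" + config.database_export)

with engine.begin() as conn:
    # Empty the table but keep its definition, indices and primary key.
    conn.execute(sa.text("TRUNCATE TABLE " + config.table_group_export))

# Append into the existing table instead of dropping and recreating it.
group_export.to_sql(name=config.table_group_export, con=engine,
                    if_exists="append", index=False)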