I have a massive table (over 100B records) to which I added an empty column. I parse strings from another (string) field when the required substring is present, extract an integer from it, and want to write that integer into the new column for all rows that contain the substring.
At the moment, after the data has been parsed and saved locally in a dataframe, I iterate over it and update the Redshift table row by row with the clean data. This takes roughly 1 second per iteration, which is far too slow.
My current code example:
import psycopg2

conn = psycopg2.connect(connection_details)
cur = conn.cursor()

clean_df = raw_data.apply(clean_field_to_parse)

for ind, row in clean_df.iterrows():
    update_query = build_update_query(row.id, row.clean_int_1, row.clean_int_2)
    cur.execute(update_query)
where build_update_query is a function that generates the update query:
def build_update_query(id, int1, int2):
    query = """
        update tab_tab
        set
            clean_int_1 = {}::int,
            clean_int_2 = {}::int,
            updated_date = GETDATE()
        where id = {}
        ;
    """
    return query.format(int1, int2, id)
and where clean_df is structured like:
id . field_to_parse . clean_int_1 . clean_int_2
1 . {'int_1':'2+1'}. 3 . np.nan
2 . {'int_2':'7-0'}. np.nan . 7
Is there a way to update specific table fields in bulk, so that there is no need to execute one query at a time?
I'm parsing the strings and running the update statement from Python. The database is stored on Redshift.
As mentioned, consider pure SQL and avoid iterating through billions of rows: push the Pandas data frame to the database as a staging table and then run one single UPDATE across both tables. With SQLAlchemy you can use DataFrame.to_sql to create a table replica of the data frame, add an index on the join field, id (on plain Postgres; Redshift does not support CREATE INDEX), and drop the very large staging table at the end.
from sqlalchemy import create_engine

engine = create_engine("postgresql+psycopg2://myuser:mypwd@myhost/mydatabase")

# PUSH TO POSTGRES (SAME NAME AS DF)
clean_df.to_sql(name="clean_df", con=engine, if_exists="replace", index=False)

# SQL UPDATE (USING TRANSACTION)
with engine.begin() as conn:
    # index on the join field (plain Postgres only; Redshift does not support indexes)
    sql = "CREATE INDEX idx_clean_df_id ON clean_df(id)"
    conn.execute(sql)

    # set-based update: alias the target table, but do not qualify the SET columns
    sql = """UPDATE tab_tab t
             SET clean_int_1 = c.clean_int_1,
                 clean_int_2 = c.clean_int_2,
                 updated_date = GETDATE()
             FROM clean_df c
             WHERE c.id = t.id
          """
    conn.execute(sql)

    sql = "DROP TABLE IF EXISTS clean_df"
    conn.execute(sql)

engine.dispose()
Related
I am trying to write results into an Oracle database using the executemany command. However, I am not getting any error message, and the database table is not getting updated either.
Using the same connection object and cursor, I am able to connect to the database, extract data, and insert it into a new table. However, I am NOT able to update the results in an existing table.
I am using an Oracle 19c database, Python 3.8.8, and cx_Oracle 8.0.0.
I am reading the data from the Oracle database table 'bench_project_d'. To reproduce the error, I have created a CSV file and am reading the data from it.
The data has 7 fields:
ROW_ID          type: NUMBER(19,0)
GROUP_ID        type: NUMBER(19,0)
PLANNED_UNITS   type: NUMBER
IQR_PTU         (calculated on the fly)
Q1_PTU          (calculated on the fly)
Q2_PTU          (calculated on the fly)
ANOMALY         type: NUMBER
All the fields have data except the new column "ANOMALY", which contains only null values. This is the field where we want to store the results.
While importing the data into Python, we take the first 6 fields, calculate the anomaly field, and push the anomaly result to the database.
#connecting to Database
username="**********"
password="********"
tsn="*********"
conn = cx_Oracle.connect(username, password, tsn)
cur = conn.cursor()
global_df = pd.read_csv("project_data.csv")
#Filtering the group having more than 3 projects
grouped = global_df.groupby("GROUP_ID")
filtered_df = grouped.filter(lambda x: x["GROUP_ID"].count()>3)
filtered_df
#Handling zero Interquartile Range
x = filtered_df[filtered_df['IQR_PTU'] == 0]['GROUP_ID'].unique()
for i in x:
    filtered_df.loc[filtered_df['GROUP_ID'] == i, 'IQR_PTU'] = (
        filtered_df[filtered_df['GROUP_ID'] == i]['PLANNED_UNITS'].quantile(0.95)
        - filtered_df[filtered_df['GROUP_ID'] == i]['PLANNED_UNITS'].quantile(0.05)
    )
#Creating the output 'Anomaly' field
filtered_df['ANOMALY'] =0
filtered_df.loc[filtered_df['PLANNED_UNITS'] > (filtered_df['Q2_PTU']+(1.5*filtered_df['IQR_PTU'])),'ANOMALY']=1
filtered_df.loc[filtered_df['PLANNED_UNITS'] < (filtered_df['Q1_PTU']-(1.5*filtered_df['IQR_PTU'])),'ANOMALY']=-1
#Formatting the Dataframe
result_df = filtered_df.loc[:, ['ROW_ID', 'GROUP_ID', 'ANOMALY']]
result_df = result_df.round(2)
result_df=result_df.astype('object')
value_to_stre_db=result_df.values.tolist()
#Pushing the result to Database table bench_project_d
statement = 'update bench_project_d set GROUP_ID = :2, ANOMALY= :3 where ROW_ID = :1'
cur.executemany(statement, value_to_stre_db)
conn.commit()
EDIT 1:
I have tried converting the list of arrays to a list of tuples and executed the same code again, but still no luck.
rows = [tuple(x) for x in value_to_stre_db]
#Pushing the result to Database table bench_project_d
statement = 'update bench_project_d set GROUP_ID = :2, ANOMALY = :3 where ROW_ID = :1'
cur.executemany(statement, rows)
conn.commit()
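As an aside on the executemany question above (this is not from the original post, just a hedged sketch): it may be worth checking how the values line up with the placeholders, since cx_Oracle matches a plain sequence to the bind variables in the order they appear in the statement rather than by the digits in :1/:2/:3, so a list built as [ROW_ID, GROUP_ID, ANOMALY] may not land where intended. Named binds with a list of dictionaries make the mapping explicit; the bind names below (rid, gid, anom) are made up, while the table, columns, and value_to_stre_db come from the question.
# Hedged sketch, not from the original post: named bind variables with a list
# of dicts so each value is mapped explicitly to its placeholder.
named_statement = (
    "update bench_project_d "
    "set GROUP_ID = :gid, ANOMALY = :anom "
    "where ROW_ID = :rid"
)
named_rows = [
    {"rid": rid, "gid": gid, "anom": anom}
    for rid, gid, anom in value_to_stre_db
]
cur.executemany(named_statement, named_rows)
print(cur.rowcount, "rows updated")  # total rows touched across all executions
conn.commit()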
I have a database table with around 10 columns. Sometimes I need to insert a row that has values for only 3 of the columns; the rest are not in the dict.
The data to be inserted is a dictionary named row (this insert is done this way to avoid duplicates):
row = {'keyword':'abc','name':'bds'.....}
df = pd.DataFrame([row]) # df looks good, I see columns and 1 row.
engine = getEngine()
connection = engine.connect()
df.to_sql('temp_insert_data_index', connection, if_exists ='replace',index=False)
result = connection.execute(('''
INSERT INTO {t} SELECT * FROM temp_insert_data_index
ON CONFLICT DO NOTHING''').format(t=table_name))
connection.close()
Problem: when I don't have all columns in the row dict, it inserts the dict fields by position (a 3-key dict is inserted into the first 3 columns of the table) rather than into the matching columns. (I expect the keys in the dict to map to the DB columns.)
Why?
Consider explicitly naming the columns to be inserted in the INSERT INTO and SELECT clauses, which is best practice for SQL append queries. That way, the dynamic query works for all columns or any subset of them. Below, an f-string (available in Python 3.6+) handles all interpolation into the larger SQL query:
# APPEND TO STAGING TEMP TABLE
df.to_sql('temp_insert_data_index', connection, if_exists='replace', index=False)
# STRING OF COMMA SEPARATED COLUMNS
cols = ", ".join(df.columns)
sql = (
f"INSERT INTO {table_name} ({cols}) "
f"SELECT {cols} FROM temp_insert_data_index "
"ON CONFLICT DO NOTHING"
)
result = connection.execute(sql)
connection.close()
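For illustration (not part of the original answer), here is what the generated statement looks like for a hypothetical three-key row; the table name my_table and the third key price are made up:
import pandas as pd

# Hypothetical example: 'price' and 'my_table' are assumptions, not from the answer.
row = {'keyword': 'abc', 'name': 'bds', 'price': 10}
df = pd.DataFrame([row])

cols = ", ".join(df.columns)   # "keyword, name, price"
table_name = "my_table"

sql = (
    f"INSERT INTO {table_name} ({cols}) "
    f"SELECT {cols} FROM temp_insert_data_index "
    "ON CONFLICT DO NOTHING"
)
# INSERT INTO my_table (keyword, name, price)
# SELECT keyword, name, price FROM temp_insert_data_index ON CONFLICT DO NOTHING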
The following Python code successfully appends the rows of a pandas dataframe to an MS SQL table via the previously configured SQLAlchemy engine.
df.to_sql(schema='stg', name = 'TEST', con=engine, if_exists='append', index=False)
I want to obtain the auto-generated ID numbers for each of the rows inserted into the stg.TEST table. In other words, what is the SQLAlchemy equivalent of the SQL Server OUTPUT clause during an INSERT statement?
Unfortunately, there is no easy solution to your problem, such as an additional parameter in your statement. You have to rely on the behavior that new rows get assigned the highest id + 1. With this knowledge, you can calculate the ids of all your rows.
Option 1: Explained in this answer. You select the current maximum id before the insert statement. Then you assign ids greater than the previous maximum to all the entries in your DataFrame. Lastly, you insert the df, which already includes the ids.
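Not from the linked answer, just a minimal sketch of Option 1 against the students table used in the test script further below; it assumes ids are assigned sequentially, that no other writer inserts rows in between, and that the database accepts explicit values in the id column (SQL Server would additionally require SET IDENTITY_INSERT ... ON for an IDENTITY column):
def insert_df_with_precomputed_ids(df, engine):
    conn = engine.connect()
    # Current maximum id before inserting (NULL if the table is empty)
    result = conn.execute('SELECT max(id) FROM students')
    row = result.fetchone()
    max_id = int(row[0]) if row[0] is not None else 0
    # Assign ids greater than the previous maximum to the new entries
    df = df.copy()
    df['id'] = range(max_id + 1, max_id + 1 + len(df))
    # Insert the df which already includes the ids
    df.to_sql('students', conn, if_exists='append', index=False)
    conn.close()
    return df['id'].tolist()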
Option 2: You insert the DataFrame and then acquire the highest id. With the number of entries inserted, you can calculate the ids of all entries. This is what such an insert function could look like:
def insert_df_and_return_ids(df, engine):
    # It is important to use the same connection for both statements if
    # something like last_insert_rowid() is used
    conn = engine.connect()

    # Insert the df into the database
    df.to_sql('students', conn, if_exists='append', index=False)

    # Acquire the maximum id
    result = conn.execute('SELECT max(id) FROM students')  # Should work for all SQL variants
    # result = conn.execute('SELECT last_insert_rowid()')  # Specifically for SQLite
    # result = conn.execute('SELECT last_insert_id()')     # Specifically for MySQL
    entries = df.shape[0]
    last_id = -1

    # Iterate over result to get last inserted id
    for row in result:
        last_id = int(str(row[0]))
    conn.close()

    # Generate list of ids
    list_of_ids = list(range(last_id - entries + 1, last_id + 1))
    return list_of_ids
PS: I could not test the function on an MS SQL server, but the behavior should be the same. To check that everything behaves as it should, you can use this:
import numpy as np
import pandas as pd
import sqlalchemy as sa
# Change connection to MS SQL server
engine = sa.create_engine('sqlite:///test.lite', echo=False)
# Create table
meta = sa.MetaData()
students = sa.Table(
    'students', meta,
    sa.Column('id', sa.Integer, primary_key=True),
    sa.Column('name', sa.String),
)
meta.create_all(engine)
# DataFrame to insert with two entries
df = pd.DataFrame({'name': ['Alice', 'Bob']})
ids = insert_df_and_return_ids(df, engine)
print(ids) # [1,2]
conn = engine.connect()
# Insert any entry with a high id in order to check if new ids are always the maximum
result = conn.execute("Insert into students (id, name) VALUES (53, 'Charlie')")
conn.close()
# Insert data frame again
ids = insert_df_and_return_ids(df, engine)
print(ids) # [54, 55]
EDIT: If multiple threads are utilized, transactions can be used to make Option 2 thread-safe, at least for SQLite:
conn = engine.connect()
transaction = conn.begin()
df.to_sql('students', conn, if_exists='append', index=False)
result = conn.execute('SELECT max(id) FROM students')
transaction.commit()
I have the following code, which runs a MySQL SELECT command formed by left joining many tables together. I then want to write the result to another table. When I do that (with pandas), it works and the data is added to the table, but it somehow destroys all indices on the table, including the primary key.
Here is the code:
q = "SELECT util.peer_id as peer_id, util.date as ts, weekly_total_page_loads as page_loads FROM %s.%s as util LEFT JOIN \
(SELECT peer_id, date, score FROM %s.%s WHERE date = '%s') as scores \
ON util.peer_id = scores.peer_id AND util.date = scores.date WHERE util.date = '%s';"\
% (config.database_peer_groups, config.table_medians, \
config.database_peer_groups, config.db_score, date, date)
group_export = pd.read_sql(q, con = db)
q = 'USE %s;' % (config.database_export)
cursor.execute(q)
group_export.to_sql(con = db, name = config.table_group_export, if_exists = 'replace', flavor = 'mysql', index = False)
db.commit()
Any Ideas?
Edit:
It seems that, by using if_exists='replace', Pandas drops the table and recreates it, and when it recreates it, it doesn't rebuild the indices.
Furthermore, this question: to_sql pandas method changes the scheme of sqlite tables
suggests that using a SQLAlchemy engine might solve the problem.
Edit:
When I use if_exists="append" the problem doesn't appear; it only occurs with if_exists="replace".
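Not part of the original question, just a minimal sketch of the workaround the last edit points at: keep the existing table definition (and therefore its indexes and primary key), empty it, and append with to_sql instead of replacing. It uses a SQLAlchemy engine, as the linked question suggests; the connection URL below is a placeholder, while group_export and config.table_group_export come from the question.
import sqlalchemy as sa

# Engine URL is a placeholder; group_export and config.table_group_export are the
# names used in the question.
engine = sa.create_engine("mysql+mysqldb://user:password@localhost/mydb")

with engine.begin() as conn:
    # Empty the target table but keep its schema, indexes, and primary key
    conn.execute(sa.text("TRUNCATE TABLE %s" % config.table_group_export))

# Append into the existing (still indexed) table instead of replacing it
group_export.to_sql(name=config.table_group_export, con=engine, if_exists="append", index=False)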
I have two tables in my SQL database.
Table 1 contains a lot of data, but Table 2 contains a huge amount of data.
Here's the code I implemented using Python:
import MySQLdb
db = MySQLdb.connect(host = "localhost", user = "root", passwd="", db="fak")
cursor = db.cursor()
#Execute SQL Statement:
cursor.execute("SELECT invention_title FROM auip_wipo_sample WHERE invention_title IN (SELECT invention_title FROM us_pat_2005_to_2012)")
#Get the result set as a tuple:
result = cursor.fetchall()
#Iterate through results and print:
for record in result:
    print record
print "Finish."
#Finish dealing with the database and close it
db.commit()
db.close()
However, it takes very long. I have been running the Python script for an hour, and it still hasn't given me any results.
Please help me.
Do you have an index on invention_title in both tables? If not, then create them:
ALTER TABLE auip_wipo_sample ADD KEY (`invention_title`);
ALTER TABLE us_pat_2005_to_2012 ADD KEY (`invention_title`);
Then combine your query into one that doesn't use a subquery:
SELECT invention_title FROM auip_wipo_sample
INNER JOIN us_pat_2005_to_2012 ON auip_wipo_sample.invention_title = us_pat_2005_to_2012.invention_title
And let me know about your results.
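For completeness (not part of the original answer), a minimal sketch of running the combined query from Python with the same MySQLdb connection settings as in the question:
import MySQLdb

# Same connection parameters as in the question
db = MySQLdb.connect(host="localhost", user="root", passwd="", db="fak")
cursor = db.cursor()

# One INNER JOIN instead of an IN subquery
cursor.execute("""
    SELECT auip_wipo_sample.invention_title
    FROM auip_wipo_sample
    INNER JOIN us_pat_2005_to_2012
        ON auip_wipo_sample.invention_title = us_pat_2005_to_2012.invention_title
""")

for record in cursor.fetchall():
    print(record)

db.close()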