I want to incrementally add data frame rows to a MySQL DB while preventing any duplicate entries from going into MySQL.
I currently do this by looping through every row with df.apply() and calling MySQL INSERT IGNORE to add only the unique rows to the database. But pandas.apply is very slow (45 secs for 10k rows). I want to achieve this with the pandas.to_sql() method, which takes 0.5 secs to push 10k entries into the DB but doesn't support ignoring duplicates in append mode.
Is there an efficient and fast way to achieve this?
Input CSV
Date,Open,High,Low,Close,Volume
1994-01-03,111.7,112.75,111.55,112.65,0
1994-01-04,112.68,113.47,112.2,112.65,0
1994-01-05,112.6,113.63,112.3,113.0,0
1994-01-06,113.02,113.43,112.25,112.62,0
1994-01-07,112.55,112.8,111.5,111.88,0
1994-01-10,111.8,112.43,111.35,112.25,0
1994-01-11,112.18,112.88,112.05,112.4,0
1994-01-12,112.38,112.82,111.95,112.28,0
code
nifty_data.to_sql(name='eod_data', con=engine, if_exists='append', index=False)  # option-1
nifty_data.apply(addToDb, axis=1)  # option-2

def addToDb(row):
    sql = "INSERT IGNORE INTO eod_data (date, open, high, low, close, volume) VALUES (%s,%s,%s,%s,%s,%s)"
    val = (row['Date'], row['Open'], row['High'], row['Low'], row['Close'], row['Volume'])
    mycursor.execute(sql, val)
    mydb.commit()
option-1: doesn't allow insert ignore (~0.5 secs)
option-2: has to loop through and is very slow (~45 secs)
You can create a temporary table:
nifty_data.to_sql(name='temporary_table', con=engine, if_exists = 'append', index=False)
And then run an INSERT IGNORE statement from that:
with engine.begin() as cnx:
    insert_sql = 'INSERT IGNORE INTO eod_data (SELECT * FROM temporary_table)'
    cnx.execute(insert_sql)
Just make sure the column order is the same, or you might have to declare the columns explicitly.
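For example, a minimal sketch of the whole flow that declares the columns explicitly and drops the staging table afterwards. It assumes the same engine and table names as above, uses if_exists='replace' so the staging table starts fresh on each run, and passes plain SQL strings in SQLAlchemy 1.x style (wrap them in sqlalchemy.text() on 2.x):

# Stage the frame, then merge with INSERT IGNORE, naming columns explicitly
# so the staging table's column order doesn't have to match eod_data.
nifty_data.to_sql(name='temporary_table', con=engine, if_exists='replace', index=False)

with engine.begin() as cnx:
    cnx.execute(
        "INSERT IGNORE INTO eod_data (date, open, high, low, close, volume) "
        "SELECT Date, Open, High, Low, Close, Volume FROM temporary_table"
    )
    cnx.execute("DROP TABLE IF EXISTS temporary_table")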
Related
I have a database table with around 10 columns. Sometimes I need to insert a row that has only 3 of the columns; the rest are not in the dict.
The data to be inserted is a dictionary named row:
(this insert is to avoid duplicates)
row = {'keyword':'abc','name':'bds'.....}
df = pd.DataFrame([row]) # df looks good, I see columns and 1 row.
engine = getEngine()
connection = engine.connect()
df.to_sql('temp_insert_data_index', connection, if_exists ='replace',index=False)
result = connection.execute(('''
INSERT INTO {t} SELECT * FROM temp_insert_data_index
ON CONFLICT DO NOTHING''').format(t=table_name))
connection.close()
Problem: when I don't have all columns in the row (dict), it inserts the dict fields by position (a 3-key dict goes into the first 3 columns) and not into the right columns. (I expect the keys in the dict to match the DB columns.)
Why?
Consider explicitly naming the columns to be inserted in the INSERT INTO and SELECT clauses, which is best practice for SQL append queries. That way, the dynamic query works for all columns or any subset of them. Below, an f-string (available in Python 3.6+) handles all interpolation into the larger SQL query:
# APPEND TO STAGING TEMP TABLE
df.to_sql('temp_insert_data_index', connection, if_exists='replace', index=False)
# STRING OF COMMA SEPARATED COLUMNS
cols = ", ".join(df.columns)
sql = (
f"INSERT INTO {table_name} ({cols}) "
f"SELECT {cols} FROM temp_insert_data_index "
"ON CONFLICT DO NOTHING"
)
result = connection.execute(sql)
connection.close()
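For example, if row contained only the keys keyword, name, and a hypothetical city, the generated statement (with my_table standing in for table_name) would be:

INSERT INTO my_table (keyword, name, city) SELECT keyword, name, city FROM temp_insert_data_index ON CONFLICT DO NOTHING

so only those three columns are populated and the remaining columns keep their defaults (or NULL).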
Question: How do I append my dataframe to the database so that it checks whether stock_ticker exists and only appends the rows where stock_ticker does not exist?
This is the process I followed:
Import the CSV file into a pandas dataframe
Assign column names to be the same as in the database
Send the dataframe to the database using the code below, but I get
sqlite3.IntegrityError: UNIQUE constraint failed: stocks.stock_ticker
conn = sqlite3.connect('stockmarket.db')
c = conn.cursor()
df.to_sql(name='stocks', con=conn, if_exists='append', index=False)
conn.commit()
I looked at other IntegrityError cases but can't seem to find one that works with appending dataframes. I found and tried the code below, but all it does is not append anything.
try:
    conn = sqlite3.connect('stockmarket.db')
    c = conn.cursor()
    df.to_sql(name='stocks', con=conn, if_exists='append', index=False)
    conn.commit()
except sqlite3.IntegrityError:
    print("Already in database")
I am not sure I understand the iteration approach correctly:
How to iterate over rows in a DataFrame in Pandas
So I tried this, but it just prints out "Already in database" for each of them, even though there are 4 new stock tickers.
for index, row in df.iterrows():
    try:
        conn = sqlite3.connect('stockmarket.db')
        c = conn.cursor()
        df.to_sql(name='stocks', con=conn, if_exists='append', index=False)
        conn.commit()
    except sqlite3.IntegrityError:
        print("Already in database")
The database looks like this
Any insight much appreciated :)
It looks like this happens because Pandas doesn't let you declare a proper ON CONFLICT policy for the case where you append data that duplicates a (unique) primary key or violates some other UNIQUE constraint. if_exists refers only to the table as a whole, not to each individual row.
I think you already came up with a pretty good answer, and maybe with a small modification it would work for you:
# After connecting
for i in range(len(df)):
    try:
        df[df.index == i].to_sql(name='stocks', con=conn, if_exists='append', index=False)
        conn.commit()
    except sqlite3.IntegrityError:
        pass
Now, this approach won't help if a newer value appears in your Pandas data and you actually want it to replace the old one in the database. In that case, you can write the raw SQL command as a string and pass the Pandas values iteratively. For example:
insert_statement = """
    INSERT INTO stocks (stock_id,
                        stock_ticker,
                        {other columns})
    VALUES (?, ?, {as many ? as columns})
    ON CONFLICT (stock_id) DO UPDATE
    SET {define which values you will update on conflict}"""
And then you could run
for i in range(len(df)):
    values = tuple(df.iloc[i])
    cursor.execute(insert_statement, values)
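For instance, a minimal concrete sketch of that upsert for SQLite (version 3.24+ supports ON CONFLICT ... DO UPDATE), assuming a hypothetical schema with just stock_ticker and close_price columns and a UNIQUE constraint on stock_ticker:

import sqlite3

conn = sqlite3.connect('stockmarket.db')
cursor = conn.cursor()

# Hypothetical columns; replace with the real ones from the stocks table.
insert_statement = """
    INSERT INTO stocks (stock_ticker, close_price)
    VALUES (?, ?)
    ON CONFLICT (stock_ticker) DO UPDATE
    SET close_price = excluded.close_price"""

for row in df[['stock_ticker', 'close_price']].itertuples(index=False):
    # Cast to plain Python types so sqlite3 can bind the parameters.
    cursor.execute(insert_statement, (str(row.stock_ticker), float(row.close_price)))

conn.commit()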
I've been trying to use this piece of code:
# df is the dataframe
if len(df) > 0:
    df_columns = list(df)
    # create (col1, col2, ...)
    columns = ",".join(df_columns)
    # create VALUES(%s, %s, ...) -- one %s per column
    values = "VALUES({})".format(",".join(["%s" for _ in df_columns]))
    # create INSERT INTO table (columns) VALUES(%s, ...)
    insert_stmt = "INSERT INTO {} ({}) {}".format(table, columns, values)
    cur = conn.cursor()
    psycopg2.extras.execute_batch(cur, insert_stmt, df.values)
    conn.commit()
    cur.close()
So I can connect to a Postgres DB and insert values from a df.
I get these two errors for this code:
LINE 1: INSERT INTO mrr.shipments (mainFreight_freight_motherVesselD...
psycopg2.errors.UndefinedColumn: column "mainfreight_freight_mothervesseldepartdatetime" of relation "shipments" does not exist
For some reason, the columns don't receive the values properly.
What can I do to fix it?
You should not do your own string interpolation; let psycopg2 handle it. From the docs:
Warning Never, never, NEVER use Python string concatenation (+) or string parameters interpolation (%) to pass variables to a SQL query string. Not even at gunpoint.
Since you also have dynamic column names, you should use psycopg2.sql to create the statement and then use the standard method of passing query parameters to psycopg2 instead of using format.
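For instance, a rough sketch of that approach (not the poster's exact code), assuming conn is the open psycopg2 connection, the target table is mrr.shipments, and psycopg2 >= 2.8 for the dotted sql.Identifier. Quoting each identifier preserves mixed-case column names instead of letting Postgres fold them to lower case:

import psycopg2.extras
from psycopg2 import sql

# Compose the statement with quoted identifiers so Postgres keeps the
# exact (mixed-case) column names instead of folding them to lower case.
insert_stmt = sql.SQL("INSERT INTO {table} ({cols}) VALUES ({vals})").format(
    table=sql.Identifier("mrr", "shipments"),
    cols=sql.SQL(", ").join(map(sql.Identifier, df.columns)),
    vals=sql.SQL(", ").join(sql.Placeholder() * len(df.columns)),
)

cur = conn.cursor()
# .values.tolist() yields plain Python objects that psycopg2 can adapt.
psycopg2.extras.execute_batch(cur, insert_stmt.as_string(conn), df.values.tolist())
conn.commit()
cur.close()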
I have a massive table (over 100B records) to which I added an empty column. I parse strings from another (string) field if the required string is available, extract an integer from it, and want to write that integer into the new column for all rows that have the string.
At the moment, after the data has been parsed and saved locally in a dataframe, I iterate over it to update the Redshift table with the clean data. This takes approx. 1 sec per iteration, which is way too long.
My current code example:
conn = psycopg2.connect(connection_details)
cur = conn.cursor()

clean_df = raw_data.apply(clean_field_to_parse)

for ind, row in clean_df.iterrows():
    update_query = build_update_query(row.id, row.clean_int_1, row.clean_int_2)
    cur.execute(update_query)
where build_update_query is a function to generate the update query:
def build_update_query(id, int1, int2):
    query = """
        update tab_tab
        set
            clean_int_1 = {}::int,
            clean_int_2 = {}::int,
            updated_date = GETDATE()
        where id = {}
        ;
    """
    return query.format(int1, int2, id)
and where clean_df is structured like:
id    field_to_parse      clean_int_1    clean_int_2
1     {'int_1':'2+1'}     3              np.nan
2     {'int_2':'7-0'}     np.nan         7
Is there a way to update specific table fields in bulk, so that there is no need to execute one query at a time?
I'm parsing the strings and running the update statement from Python. The database is stored on Redshift.
As mentioned, consider pure SQL and avoid iterating through billions of rows: push the Pandas data frame to the database as a staging table and then run one single UPDATE across both tables. With SQLAlchemy you can use DataFrame.to_sql to create a table replica of the data frame, even add an index on the join field, id, and drop the (very large) staging table at the end.
from sqlalchemy import create_engine

engine = create_engine("postgresql+psycopg2://myuser:mypwd!@myhost/mydatabase")

# PUSH TO POSTGRES (SAME NAME AS DF)
clean_df.to_sql(name="clean_df", con=engine, if_exists="replace", index=False)

# SQL UPDATE (USING TRANSACTION)
with engine.begin() as conn:
    sql = "CREATE INDEX idx_clean_df_id ON clean_df(id)"
    conn.execute(sql)

    sql = """UPDATE tab_tab t
             SET clean_int_1 = c.clean_int_1,
                 clean_int_2 = c.clean_int_2,
                 updated_date = GETDATE()
             FROM clean_df c
             WHERE c.id = t.id
          """
    conn.execute(sql)

    sql = "DROP TABLE IF EXISTS clean_df"
    conn.execute(sql)

engine.dispose()
I have the following code, which runs a MySQL SELECT command formed by left joining many tables together. I then want to write the result to another table. When I do that with Pandas, the data is added to the table, but it somehow destroys all indices on the table, including the primary key.
Here is the code:
q = "SELECT util.peer_id as peer_id, util.date as ts, weekly_total_page_loads as page_loads FROM %s.%s as util LEFT JOIN \
(SELECT peer_id, date, score FROM %s.%s WHERE date = '%s') as scores \
ON util.peer_id = scores.peer_id AND util.date = scores.date WHERE util.date = '%s';"\
% (config.database_peer_groups, config.table_medians, \
config.database_peer_groups, config.db_score, date, date)
group_export = pd.read_sql(q, con = db)
q = 'USE %s;' % (config.database_export)
cursor.execute(q)
group_export.to_sql(con = db, name = config.table_group_export, if_exists = 'replace', flavor = 'mysql', index = False)
db.commit()
Any ideas?
Edit:
It seems that, by using if_exists='replace', Pandas drops the table and recreates it, and when it recreates it, it doesn't rebuild the indices.
Furthermore, this question: to_sql pandas method changes the scheme of sqlite tables
suggests that using a SQLAlchemy engine might solve the problem.
Edit:
When I use if_exists="append" the problem doesn't appear; it is only with if_exists="replace" that the problem occurs.
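Based on that observation, one workaround sketch is to keep the existing table (and therefore its indices and primary key), empty it, and append instead of replacing. The connection URL below is hypothetical, and config.* and group_export come from the code above:

from sqlalchemy import create_engine, text

# Hypothetical URL; point it at the same database as config.database_export.
engine = create_engine("mysql+pymysql://user:password@host/%s" % config.database_export)

with engine.begin() as cnx:
    # Empty the table but keep its definition, indices and primary key.
    cnx.execute(text("TRUNCATE TABLE %s" % config.table_group_export))

# Append into the existing schema instead of letting pandas drop and recreate the table.
group_export.to_sql(name=config.table_group_export, con=engine, if_exists='append', index=False)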