Update Firebird table with data from CSV in Python

Let's say we have a CSV file with payment information like this:
somthing1:500.00
somthing2:300.00
somthing3:200.00
I need to update different rows of a table tableB in a Firebird database, which has a column called paid, with the values from the second column of the CSV file. For that reason I created a global temporary table just to pump the data into it and then update the desired table. What I tried is this:
import csv

idd = 0
with open('file.csv', 'r', encoding='utf-8', newline='') as file:
    reader = list(csv.reader(file, delimiter=':'))
    for row in reader:
        idd += 1
        c.execute("""INSERT INTO temp_table (id, code, paid) VALUES (?, ?, ?)""",
                  (idd, str(row[0]), str(row[1])))
This works as expected - the table is populated. After that I tried to update the other table like this:
c.execute("""select paid from temp_table;""")
res = c.fetchall()
for r in res:
c.execute(f"""UPDATE tableB SET paid = {r[0]} WHERE oid = 10;""")
This runs, but not as intended - every result from the SELECT query updates all of the matching rows, so they all end up holding the same (last) value. I tried MERGE in the database itself with the same result - every row is updated with the same value from temp_table:
MERGE INTO tableB b
USING (SELECT paid FROM temp_table) e
ON b.paid = 0 AND oid = 10
WHEN MATCHED THEN
UPDATE SET b.paid = e.paid;
I need some way to update first row from tableB with first row from the CSV and so on. What am I missing?

The problem is that you are not correlating rows from temp_table with rows in tableB.
In your initial attempt, you update all rows in tableB with oid = 10. Similarly, in your second attempt you update all rows with paid = 0 without correlating to temp_table.
You need to add a condition to match rows between the tables - say both have an id column:
MERGE INTO tableB b
USING (SELECT id, paid FROM temp_table) e
ON b.paid = 0 AND b.oid = 10 AND b.id = e.id
WHEN MATCHED THEN
  UPDATE SET b.paid = e.paid;
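For completeness, a minimal Python sketch of the whole round trip, assuming a DB-API cursor c (fdb or firebird-driver), an open connection con, and an id column in both tables as above; this is an illustration, not the original code:

import csv

# Load the CSV into the temporary table, keeping the row number as id
# so it can be correlated with tableB afterwards.
with open('file.csv', 'r', encoding='utf-8', newline='') as file:
    rows = [(idd, code, paid)
            for idd, (code, paid) in enumerate(csv.reader(file, delimiter=':'), start=1)]
c.executemany("INSERT INTO temp_table (id, code, paid) VALUES (?, ?, ?)", rows)

# One correlated statement instead of a Python loop: each tableB row
# picks up the paid value of the temp_table row with the same id.
# Run it in the same transaction as the inserts, since a GTT's rows
# may be transaction-scoped.
c.execute("""
    MERGE INTO tableB b
    USING (SELECT id, paid FROM temp_table) e
    ON b.paid = 0 AND b.oid = 10 AND b.id = e.id
    WHEN MATCHED THEN
      UPDATE SET b.paid = e.paid
""")
con.commit()  # con is assumed to be the open connection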

Related

Selecting rows from sql if they are also in a dataframe

I have an MS SQL Server with a lot of rows (around 4 million) holding all the customers and their information.
I can also get a list of phone numbers of all visitors of my website in a given timeframe as a CSV file, which I then convert to a dataframe in Python. What I want to do is select two columns from my server (one is the phone number and the other one is a property of that person), but I only want to select the records of people who are in both my dataframe and my server.
What I currently do is select all customers from SQL Server and then merge them with my dataframe, but obviously this is not very fast. Is there any way to do this faster?
query2 = """
SELECT encrypt_phone, col2
FROM DatabaseTable
"""
cursor.execute(query2)
tables = cursor.fetchall()
df2 = pd.DataFrame.from_records(tables, columns=[x[0] for x in cursor.description])
df1.merge(df2, how='inner', indicator=True)
If your DataFrame does not have many rows, I would do it the simple way, like this:
V = df["colx"].unique()
Q = 'SELECT encrypt_phone, col2 FROM DatabaseTable WHERE coly IN ({})'.format(','.join(['?']*len(V)))
cursor.execute(Q, tuple(V))
tables = cursor.fetchall()
df2 = pd.DataFrame.from_records(tables, columns=[x[0] for x in cursor.description])
NB: colx and coly are the columns that identify the customers (id, name, ...) in the pandas DataFrame and in the SQL table, respectively.
Otherwise, you may need to store df1 as a table in your DB and then perform a sub-query:
df1.to_sql('DataFrameTable', conn, index=False) #this will store df1 in the DB
Q = "SELECT encrypt_phone, col2 FROM DatabaseTable WHERE coly IN (SELECT colx FROM DataFrameTable)"
df2 = pd.read_sql_query(Q, conn)
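If the list of customers is too long for a single IN clause but creating a table in the DB is not an option, a middle ground (not part of the original answer) is to run the parameterised query in chunks and concatenate the results. A rough sketch reusing the names above, with an arbitrary chunk size kept below typical driver parameter limits:

import pandas as pd

V = df1["colx"].unique()
CHUNK = 1000  # stay well under the driver's limit on bound parameters
frames = []
for i in range(0, len(V), CHUNK):
    part = list(V[i:i + CHUNK])
    q = ('SELECT encrypt_phone, col2 FROM DatabaseTable WHERE coly IN ({})'
         .format(','.join(['?'] * len(part))))
    cursor.execute(q, part)
    frames.append(pd.DataFrame.from_records(
        cursor.fetchall(), columns=[x[0] for x in cursor.description]))
df2 = pd.concat(frames, ignore_index=True)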

How to perform an SQL update on multiple rows using Python

I am retrieving data from a Postgres DB and storing it in a Pandas dataframe for further processing. While doing that I want to update the queried table and set a flag saying that these rows are being processed.
engine = create_engine(connection_string, connect_args=credentials)
query = load_query(filename='queries/get_data.sql')
df = pd.read_sql(query, engine)
ids = df['id']
update_query = "update table1" + \
               "set status = 'processing'," + \
               f"where session_id in ({ids})"
with engine.connect() as con:
    rs = con.execute(update_query)
The dataframe then looks like this:
ID      descr
Cell 1  Cell 2
Cell 3  Cell 4
Now I want to update the column "status". What do I need to do? I know I need a list, separated by commas and each value in quotes... But I wasn't able to build it.
Help appreciated
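One way to build that query safely is to bind the ids as an expanding parameter instead of interpolating the pandas Series into the string. A minimal sketch, assuming SQLAlchemy 1.4+ and the table1/session_id names from the snippet above:

from sqlalchemy import text, bindparam

ids = df['id'].tolist()  # plain Python list, not a pandas Series

update_query = text(
    "UPDATE table1 SET status = 'processing' WHERE session_id IN :ids"
).bindparams(bindparam('ids', expanding=True))

# engine.begin() opens a transaction and commits it on success
with engine.begin() as con:
    con.execute(update_query, {'ids': ids})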

How to create summary statistics for an entire SQLite data base?

Consider some SQLite database.db, with a large number of tables and columns.
Pandas' .describe() produces the summary statistics that I want (see below). However, it requires reading each table in full - a problem for large databases. Is there an (SQL or Python) alternative that is less memory hungry? Specifying column names manually is not feasible here.
import pandas as pd
import sqlite3

con = sqlite3.connect("file:database.db", uri=True)
tables = pd.read_sql("SELECT name FROM sqlite_master WHERE type='table'", con)
columns = []
for _, row in tables.iterrows():
    col = pd.read_sql(f"PRAGMA table_info({row['name']})", con)
    col['table'] = row['name']
    stats = pd.read_sql(f"""SELECT * FROM {row['name']}""", con)
    stats = stats.describe(include='all')
    stats = stats.transpose()
    col = col.merge(stats, left_on='name', right_index=True)
    columns.append(col)
columns = pd.concat(columns)
Perhaps a little over the top, but you could use TRIGGERS to maintain statistics and eliminate the need for full table scans. There will obviously be some overhead for maintaining the statistics, but that overhead is distributed over time.
Consider the following demo (in SQL) where there are two main tables, tablea and tablex (it could be any number of tables), plus another table called statistic which is used to dynamically store the statistics.
For each main table 3 triggers are created: one for when a row is inserted, one for when a row is updated and one for when a row is deleted - so 6 triggers in all for the 2 main tables.
The statistic table has 5 columns:
- tablename - the primary key; holds the name of the table the row stores statistics about
- row_count - the number of rows (in theory); the insert trigger for the respective table increments it and the delete trigger decrements it
- insert_count - the insert trigger increments it
- update_count - the update trigger increments it
- delete_count - the delete trigger increments it
All of the triggers first try to insert the respective row for the table, with all counter values using the default of 0. As tablename is the primary key, the INSERT OR IGNORE ensures that the row is only added once (unless the row is deleted, which effectively resets the stats for the table).
The demo includes some insertions, deletions and updates, and finally extraction of the statistics:
DROP TABLE IF EXISTS tablea;
DROP TABLE IF EXISTS tablex;
DROP TABLE IF EXISTS statistic;
CREATE TABLE IF NOT EXISTS statistic (
tablename TEXT PRIMARY KEY,
row_count INTEGER DEFAULT 0,
insert_count INTEGER DEFAULT 0,
update_count INTEGER DEFAULT 0,
delete_count INTEGER DEFAULT 0
);
CREATE TABLE IF NOT EXISTS tablea (id INTEGER PRIMARY KEY, data1 TEXT);
CREATE TRIGGER IF NOT EXISTS tablea_after_ins AFTER INSERT ON tablea
BEGIN
INSERT OR IGNORE INTO statistic (tablename) VALUES('tablea');
UPDATE statistic SET row_count = row_count +1, insert_count = insert_count + 1 WHERE tablename = 'tablea';
END
;
CREATE TRIGGER IF NOT EXISTS tablea_after_update AFTER UPDATE ON tablea
BEGIN
INSERT OR IGNORE INTO statistic (tablename) VALUES('tablea');
UPDATE statistic SET update_count = update_count + 1 WHERE tablename = 'tablea';
END
;
CREATE TRIGGER IF NOT EXISTS tablea_after_delete AFTER DELETE ON tablea
BEGIN
INSERT OR IGNORE INTO statistic (tablename) VALUES('tablea');
UPDATE statistic SET row_count = row_count -1, delete_count = delete_count + 1 WHERE tablename = 'tablea';
END
;
CREATE TABLE IF NOT EXISTS tablex (id INTEGER PRIMARY KEY, data1 TEXT);
CREATE TRIGGER IF NOT EXISTS tablex_after_ins AFTER INSERT ON tablex
BEGIN
INSERT OR IGNORE INTO statistic (tablename) VALUES('tablex');
UPDATE statistic SET row_count = row_count +1, insert_count = insert_count + 1 WHERE tablename = 'tablex';
END
;
CREATE TRIGGER IF NOT EXISTS tablex_after_update AFTER UPDATE ON tablex
BEGIN
INSERT OR IGNORE INTO statistic (tablename) VALUES('tablex');
UPDATE statistic SET update_count = update_count + 1 WHERE tablename = 'tablex';
END
;
CREATE TRIGGER IF NOT EXISTS tablex_after_delete AFTER DELETE ON tablex
BEGIN
INSERT OR IGNORE INTO statistic (tablename) VALUES('tablex');
UPDATE statistic SET row_count = row_count -1, delete_count = delete_count + 1 WHERE tablename = 'tablex';
END
;
INSERT INTO tablea (data1) VALUES('a');
INSERT INTO tablea (data1) VALUES('b'),('c'),('d'),('z');
DELETE FROM tablea WHERE data1 LIKE 'z';
UPDATE tablea set data1 = 'letter_'||data1;
DELETE FROM tablea WHERE data1 LIKE '%_c';
INSERT OR IGNORE INTO tablex (data1) VALUES
('1a'),('2a'),('3a'),('4a'),('5a')
,('1b'),('2b'),('3b'),('4b'),('5b')
,('1c'),('2c'),('3c'),('4c'),('5c')
,('1d'),('2d'),('3d'),('4d'),('5d')
;
SELECT * FROM statistic;
/* Cleanup the demo environment */
DROP TABLE IF EXISTS tablea;
DROP TABLE IF EXISTS tablex;
DROP TABLE IF EXISTS statistic;
When run, the resulting statistic table has one row per main table. Note that the mass insert into tablex records all 20 rows added (i.e. the trigger fires for every inserted row, and the triggering is part of the same transaction).
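To tie this back to Python, here is a minimal sketch of installing the DDL and reading the maintained statistics with sqlite3 and pandas; the trigger_ddl string stands in for the CREATE TABLE/TRIGGER statements from the demo above (minus the demo inserts and the cleanup DROPs):

import sqlite3
import pandas as pd

con = sqlite3.connect("database.db")

trigger_ddl = """
CREATE TABLE IF NOT EXISTS statistic (
    tablename TEXT PRIMARY KEY,
    row_count INTEGER DEFAULT 0,
    insert_count INTEGER DEFAULT 0,
    update_count INTEGER DEFAULT 0,
    delete_count INTEGER DEFAULT 0
);
-- ... the CREATE TRIGGER statements for each table go here ...
"""
con.executescript(trigger_ddl)  # executescript runs multiple statements at once

# From then on the statistics are maintained by the triggers and can be
# read without scanning the main tables.
stats = pd.read_sql("SELECT * FROM statistic", con)
print(stats)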

psycopg2 Syntax errors at or near "' '"

I have a dataframe named Data2 and I wish to put its values inside a PostgreSQL table. For reasons, I cannot use to_sql, as some of the values in Data2 are numpy arrays.
This is Data2's schema:
cursor.execute(
    """
    DROP TABLE IF EXISTS Data2;
    CREATE TABLE Data2 (
        time timestamp without time zone,
        u bytea,
        v bytea,
        w bytea,
        spd bytea,
        dir bytea,
        temp bytea
    );
    """
)
My code segment:
for col in Data2_mcw.columns:
    for row in Data2_mcw.index:
        value = Data2_mcw[col].loc[row]
        if type(value).__module__ == np.__name__:
            value = pickle.dumps(value)
        cursor.execute(
            """
            INSERT INTO Data2_mcw(%s)
            VALUES (%s)
            """,
            (col.replace('\"', ''), value)
        )
Error generated:
psycopg2.errors.SyntaxError: syntax error at or near "'time'"
LINE 2: INSERT INTO Data2_mcw('time')
How do I rectify this error?
Any help would be much appreciated!
There are two problems I see with this code.
The first problem is that you cannot use bind parameters for column names, only for values. The first of the two %s placeholders in your SQL string is invalid. You will have to use string formatting to set column names, something like the following (assuming you are using Python 3.6+):
cursor.execute(
    f"""
    INSERT INTO Data2_mcw({col})
    VALUES (%s)
    """,
    (value,))
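As an aside, psycopg2 also ships a sql module for composing identifiers safely, which avoids hand-building the column name into the f-string; a short sketch of the same insert:

from psycopg2 import sql

# sql.Identifier quotes the column name; the value is still a bound parameter.
query = sql.SQL("INSERT INTO Data2_mcw ({}) VALUES (%s)").format(
    sql.Identifier(col))
cursor.execute(query, (value,))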
The second problem is that a SQL INSERT statement inserts an entire row. It does not insert a single value into an already-existing row, as you seem to be expecting it to.
Suppose your dataframe Data2_mcw looks like this:
   a  b  c
0  1  2  7
1  3  4  9
Clearly, this dataframe has six values in it. If you were to run your code on this dataframe, then it would insert six rows into your database table, one for each value, and the data in your table would look like the following:
a  b  c
1
3
   2
   4
      7
      9
I'm guessing you don't want this: you'd rather your database table contained the following two rows instead:
a  b  c
1  2  7
3  4  9
Instead of inserting one value at a time, you will have to insert one entire row at a time. This means you have to swap your two loops around, build the SQL string up once beforehand, and collect together all the values for a row before passing it to the database. Something like the following should hopefully work (please note that I don't have a Postgres database to test this against):
column_names = ",".join(Data2_mcw.columns)
placeholders = ",".join(["%s"] * len(Data2_mcw.columns))
sql = f"INSERT INTO Data2_mcw({column_names}) VALUES ({placeholders})"
for row in Data2_mcw.index:
    values = []
    for col in Data2_mcw.columns:
        value = Data2_mcw[col].loc[row]
        if type(value).__module__ == np.__name__:
            value = pickle.dumps(value)
        values.append(value)
    cursor.execute(sql, values)
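If the dataframe is large, the row-by-row execute calls could also be batched. A rough sketch using psycopg2's execute_values, with the same column names and the same pickling rule as above (to_db is just a hypothetical helper wrapping that rule):

from psycopg2.extras import execute_values
import numpy as np
import pickle

def to_db(value):
    # Pickle numpy arrays, pass everything else through unchanged.
    return pickle.dumps(value) if type(value).__module__ == np.__name__ else value

rows = [tuple(to_db(Data2_mcw[col].loc[row]) for col in Data2_mcw.columns)
        for row in Data2_mcw.index]

# execute_values expands the single %s into the full list of value tuples.
execute_values(
    cursor,
    f"INSERT INTO Data2_mcw({','.join(Data2_mcw.columns)}) VALUES %s",
    rows)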

Why are my SQL query parameters not returning proper values?

I'm trying to create SQL queries for a large list of records (>42 million) to insert into a remote database. Right now I'm building queries in the format INSERT INTO tablename (columnnames) VALUES (values).
tablename, columnnames, and values are all of varying length, so I'm generating a number of placeholders equal to the number of values required.
The result is I have a string called sqlcommand that looks like INSERT INTO ColName (?,?,?) VALUES (?,?,?); and a list of parameters that looks like ([Name1, Name2, Name3, Val1, Val2, Val3]).
When I try to execute the query as db.execute(sqlcommand, params) I get errors indicating I'm trying to insert into columns "#P1", "#P2", "#P3" and so on. Why aren't the values from my list translating properly? Where is it getting "#P1" from? I know I don't have a column of that name, and as far as I can tell I'm not referencing a column of that name, yet the execute method is still trying to use it.
UPDATE: As per request, the full code is below, modified to avoid anything that might be private. The end result of this is to move data, row by row, from an sqlite3 db file to an AWS SQL server.
newDB = pyodbc.connect(newDataBase)
oldDB = sqlite3.connect(oldDatabase)
tables = oldDB.execute("SELECT * FROM sqlite_master WHERE type='table';").fetchall()
t0 = datetime.now()
for table in tables:
    print('Parsing:', str(table[1]))
    t1 = datetime.now()
    colInfo = oldDB.execute('PRAGMA table_info('+table[1]+');').fetchall()
    cols = list()
    cph = ""
    i = 0
    for col in colInfo:
        cph += "?,"
        cols.append(str(col[1]))
    rowCount = oldDB.execute("SELECT COUNT(*) FROM "+table[1]+" ;").fetchall()
    count = 0
    while count <= int(rowCount[0][0]):
        params = list()
        params.append(cols)
        count += 1
        row = oldDB.execute("SELECT * FROM "+table[1]+" LIMIT 1;").fetchone()
        ph = ""
        for val in row:
            ph += "?,"
            params = params.append(str(val))
        ph = ph[:-1]
        cph = cph[:-1]
        print(str(table[1]))
        sqlcommand = "INSERT INTO "+str(table[1])+" ("+cph+") VALUES ("+ph+");"
        print(sqlcommand)
        print(params)
        newDB.execute(sqlcommand, params)
        sqlcommand = "DELETE FROM ? WHERE ? = ?;"
        oldDB.execute(sqlcommand, (str(table[1]), cols[0], vals[0],))
    newDB.commit()
Unbeknownst to me, column names can't be passed as parameters. Panagiotis Kanavos answered this in a comment. I guess I'll have to figure out a different way to generate the queries. Thank you all very much, I appreciate it.
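For reference, a rough sketch of what that different way could look like, based only on the conclusion above: table and column names are concatenated into the SQL text, and only the values are bound as parameters (variable names follow the question's code; untested against a real SQL Server):

for table in tables:
    name = str(table[1])
    colInfo = oldDB.execute("PRAGMA table_info(" + name + ");").fetchall()
    cols = [str(col[1]) for col in colInfo]

    # Table and column names go into the SQL text; only values are parameters.
    insert_sql = ("INSERT INTO " + name + " (" + ",".join(cols) + ") "
                  "VALUES (" + ",".join("?" * len(cols)) + ");")

    rows = oldDB.execute("SELECT * FROM " + name + ";").fetchall()
    if rows:
        newDB.cursor().executemany(insert_sql, rows)
        newDB.commit()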
