DuckDB - efficiently insert pandas dataframe to table with sequence - python

CREATE TABLE temp (
    id UINTEGER,
    name VARCHAR,
    age UINTEGER
);
CREATE SEQUENCE serial START 1;
Insertion with the sequence works just fine:
INSERT INTO temp VALUES(nextval('serial'), 'John', 13)
How can I use the sequence with a pandas dataframe?
data = [['Alex',10],['Bob',12],['Clarke',13]]
df = pd.DataFrame(data,columns=['Name','Age'])
print(df)
     Name  Age
0    Alex   10
1     Bob   12
2  Clarke   13
con.execute("INSERT INTO temp SELECT * FROM df")
RuntimeError: Binder Error: table temp has 3 columns but 2 values were supplied
I don't want to iterate item by item. The goal is to efficiently insert thousands of items from Python into the DB. I'm okay with switching from pandas to something else.

Can't you have nextval('serial') as part of your select query when reading the df?
e.g.,
con.execute("INSERT INTO temp SELECT nextval('serial'), Name, Age FROM df")

Related

Why does to_sql require some column names to be uppercase but not others?

I have a dataframe, whose data I want to append to an existing table within a DB2 database.
The dataframe is pieced together from several variables using this code:
df = pd.DataFrame([[V_UK_week_ending, V_UK_unique_visitors_tw, V_UK_country, V_UK_key]],columns=['week_ending','unique_visitors_tw','country', 'key'])
The dataframe itself looks like this:
  week_ending  unique_visitors_tw country              key
0  2023-01-02             6439376      UK  02/01/2023 - UK
This is an overview of the table I'm appending the dataframe to (data comes from SYSCAT.COLUMNS):
TABSCHEMA  TABNAME  COLNAME             COLNO  TYPENAME  LENGTH  SCALE
---------  -------  ------------------  -----  --------  ------  -----
DMN        SJ_TEST  COUNTRY                 4  VARCHAR       10      0
DMN        SJ_TEST  KEY                     5  VARCHAR       20      0
DMN        SJ_TEST  UNIQUE_VISITORS_TW      1  INTEGER        4      0
DMN        SJ_TEST  WEEK_ENDING             0  DATE           4      0
This is the code I am using to insert the data from the dataframe into the table:
df.to_sql(V_table, V_conn, schema = V_schema, if_exists = 'append', index = False)
This results in the following error:
sqlalchemy.exc.ProgrammingError: (ibm_db_dbi.ProgrammingError) ibm_db_dbi::ProgrammingError: Binding Error: [IBM][CLI Driver][DB2/LINUXX8664] SQL0206N "key" is not valid in the context where it is used. SQLSTATE=42703\r SQLCODE=-206
[SQL: INSERT INTO "DMN"."SJ_TEST" (week_ending, unique_visitors_tw, country, "key") VALUES (?, ?, ?, ?)]
[parameters: ('2023-01-02', '6439376', 'UK', '02/01/2023 - UK')]
However, if I change the code setting up the dataframe so that the key column's name is in uppercase, the to_sql statement works fine:
df = pd.DataFrame([[V_UK_week_ending, V_UK_unique_visitors_tw, V_UK_country, V_UK_key]],columns=['week_ending','unique_visitors_tw','country', 'KEY'])
Does anyone know why this is required please? I don't understand why some columns would need to be in uppercase but not others.
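For reference, a minimal sketch of the workaround described above: rename the lowercase reserved-word column to uppercase before calling to_sql (V_table, V_conn and V_schema are the variables from the question; the specific rename is an assumption based on the observation that 'KEY' works):
# Hypothetical: upper-case the reserved-word column so the generated INSERT
# targets KEY instead of the quoted lowercase "key".
df = df.rename(columns={'key': 'KEY'})
df.to_sql(V_table, V_conn, schema=V_schema, if_exists='append', index=False)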

psycopg2 Syntax errors at or near "' '"

I have a dataframe named Data2 and I wish to put its values into a PostgreSQL table. For reasons, I cannot use to_sql as some of the values in Data2 are numpy arrays.
This is Data2's schema:
cursor.execute(
    """
    DROP TABLE IF EXISTS Data2;
    CREATE TABLE Data2 (
        time timestamp without time zone,
        u bytea,
        v bytea,
        w bytea,
        spd bytea,
        dir bytea,
        temp bytea
    );
    """
)
My code segment:
for col in Data2_mcw.columns:
    for row in Data2_mcw.index:
        value = Data2_mcw[col].loc[row]
        if type(value).__module__ == np.__name__:
            value = pickle.dumps(value)
        cursor.execute(
            """
            INSERT INTO Data2_mcw(%s)
            VALUES (%s)
            """,
            (col.replace('\"', ''), value)
        )
Error generated:
psycopg2.errors.SyntaxError: syntax error at or near "'time'"
LINE 2: INSERT INTO Data2_mcw('time')
How do I rectify this error?
Any help would be much appreciated!
There are two problems I see with this code.
The first problem is that you cannot use bind parameters for column names, only for values. The first of the two %s placeholders in your SQL string is invalid. You will have to use string concatenation to set column names, something like the following (assuming you are using Python 3.6+):
cursor.execute(
    f"""
    INSERT INTO Data2_mcw({col})
    VALUES (%s)
    """,
    (value,))
The second problem is that a SQL INSERT statement inserts an entire row. It does not insert a single value into an already-existing row, as you seem to be expecting it to.
Suppose your dataframe Data2_mcw looks like this:
   a  b  c
0  1  2  7
1  3  4  9
Clearly, this dataframe has six values in it. If you were to run your code on this dataframe, then it would insert six rows into your database table, one for each value, and the data in your table would look like the following:
a     b     c
1     NULL  NULL
3     NULL  NULL
NULL  2     NULL
NULL  4     NULL
NULL  NULL  7
NULL  NULL  9
I'm guessing you don't want this: you'd rather your database table contained the following two rows instead:
a  b  c
1  2  7
3  4  9
Instead of inserting one value at a time, you will have to insert one entire row at a time. This means you have to swap your two loops around, build the SQL string up once beforehand, and collect together all the values for a row before passing them to the database. Something like the following should hopefully work (please note that I don't have a Postgres database to test this against):
column_names = ",".join(Data2_mcw.columns)
placeholders = ",".join(["%s"] * len(Data2_mcw.columns))
sql = f"INSERT INTO Data2_mcw({column_names}) VALUES ({placeholders})"

for row in Data2_mcw.index:
    values = []
    for col in Data2_mcw.columns:
        value = Data2_mcw[col].loc[row]
        if type(value).__module__ == np.__name__:
            value = pickle.dumps(value)
        values.append(value)
    cursor.execute(sql, values)
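If the dataframe is large, a possible variant (not part of the original answer, reusing the sql string and pickling logic from the snippet above) is to collect every row first and hand them all to cursor.executemany in a single call:
rows = []
for row in Data2_mcw.index:
    values = []
    for col in Data2_mcw.columns:
        value = Data2_mcw[col].loc[row]
        if type(value).__module__ == np.__name__:
            value = pickle.dumps(value)
        values.append(value)
    rows.append(values)

# One statement, many parameter sets, fewer round trips through cursor.execute.
cursor.executemany(sql, rows)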

Make a dataframe column become part of a SQL query statement

I am using a Jupyter notebook to access a Teradata database.
Assume I have a dataframe
Name Age
Sam 5
Tom 6
Roy 7
I want the whole content of the "Name" column to become the WHERE condition of a SQL query.
query = '''select Age
from xxx
where Name in (Sam, Tom, Roy)'''
age = pd.read_sql(query,conn)
How can I format the column so that the whole column is inserted into the SQL statement automatically instead of pasting the column content manually?
Join the Name column and insert it into the query using an f-string:
query = f'''select Age
from xxx
where Name in ({", ".join(df.Name)})'''
print(query)
select Age
from xxx
where Name in (Sam, Tom, Roy)
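Note that the generated IN list contains bare words; most databases, Teradata included, expect string literals to be quoted. A slightly safer sketch (assuming the names contain no quote characters) wraps each value in single quotes:
names = ", ".join(f"'{n}'" for n in df.Name)
query = f'''select Age
from xxx
where Name in ({names})'''
age = pd.read_sql(query, conn)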

Update SQL Database based on matched ID in Dataframe

I have the dataframe below with the respective values and would like to update my SQL Database Server table where the ID matches my dataframe.
df dataframe

ID   VALUE
123      9
456     11
SQL Database Server, table1

ID   VALUE
456     62
623     41
123      3
563     67
After updating, I want my SQL Database Server to look like this, where you'll notice that IDs 123 & 456 have been given new values based on my dataframe.

ID   VALUE
456     11
623     41
123      9
563     67
Does anyone know how I could utilise this in my query when executing?
query = DELETE/UPDATE table table1 where ID = ID IN DATAFRAME
conn.execute(query)
You can create a parameter list (df_list) along with a DML statement, and arrange the order of the columns to match their appearance within the statement. In this case the two arguments (id and value) should be reversely ordered, such as:
cur=con.cursor()
sql = "UPDATE [table1] SET [value] = ? WHERE [id] = ?"
cols = df.columns.tolist()
df_list = df[cols[-1:] + cols[:-1]].values.tolist()
cur.executemany(sql,df_list)
cur.close()
con.commit()
con.close()
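For the sample dataframe above, the reordering puts VALUE before ID so the parameters line up with the two ? placeholders. A quick check of what df_list contains:
import pandas as pd

df = pd.DataFrame({'ID': [123, 456], 'VALUE': [9, 11]})
cols = df.columns.tolist()
df_list = df[cols[-1:] + cols[:-1]].values.tolist()
print(df_list)  # [[9, 123], [11, 456]] -> VALUE is bound first, then ID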
You can simply write a correlated query as follows:
update table1 t1
set t1.value = (select df.value from df where df.id = t1.id)
where exists (select 1 from df where df.id = t1.id);
Or use an INNER JOIN in the UPDATE as follows:
UPDATE T
SET T.value = d.value -- , another column updates here
FROM table1 as t
INNER JOIN df as d ON t.id = d.id;
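Both UPDATE statements above assume the dataframe is visible to the database as a table named df. A possible way to stage it (the engine object and the df_staging table name are illustrative, not from the question) is to write the dataframe to a staging table first and join against that:
import sqlalchemy as sa

# Hypothetical staging step before running the join UPDATE shown above.
df.to_sql('df_staging', engine, if_exists='replace', index=False)

with engine.begin() as conn:
    conn.execute(sa.text("""
        UPDATE t
        SET t.value = d.value
        FROM table1 AS t
        INNER JOIN df_staging AS d ON t.id = d.id
    """))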

how to collapse/compress/reduce string columns in pandas

Essentially, what I am trying to do is join Table_A to Table_B using a key to do a lookup in Table_B to pull column records for names present in Table_A.
Table_B can be thought of as the master name table that stores various attributes about a name. Table_A represents incoming data with information about a name.
There are two columns that represent a name - a column named 'raw_name' and a column named 'real_name'. The 'raw_name' has the string "code_" before the real_name.
i.e.
raw_name = CE993_VincentHanna
real_name = VincentHanna
Key = real_name, which exists in Table_A and Table_B
Please see the mySQL tables and query here: http://sqlfiddle.com/#!9/65e13/1
For all real_names in Table_A that DO NOT exist in Table_B, I want to store the raw_name/real_name pairs in an object so I can send an alert to the data-entry staff for manual insertion.
For all real_names in Table_A that DO exist in Table_B, we already know the name and can add the new raw_name associated with this real_name to our master Table_B.
In MySQL this is easy to do, as you can see in my SQLFiddle example. I join on real_name and compress/collapse the result by grouping by a.real_name, since I don't care if there are multiple records in Table_B for the same real_name.
All I want is to pull the attributes (stats1, stats2, stats3) so I can assign them to the newly discovered raw_name.
In the mySQL query result I can then separate the NULL records to be sent for manual data-entry and automatically insert the remaining records into Table_B.
Now, I am trying to do the same in pandas but am stuck at the groupby on real_name.
e = {'raw_name': pd.Series(['AW103_Waingro', 'CE993_VincentHanna', 'EES43_NeilMcCauley', 'SME16_ChrisShiherlis',
                            'MEC14_MichaelCheritto', 'OTP23_RogerVanZant', 'MDU232_AlanMarciano']),
     'real_name': pd.Series(['Waingro', 'VincentHanna', 'NeilMcCauley', 'ChrisShiherlis', 'MichaelCheritto',
                             'RogerVanZant', 'AlanMarciano'])}
f = {'raw_name': pd.Series(['SME893_VincentHanna', 'TVA405_VincentHanna', 'MET783_NeilMcCauley',
                            'CE321_NeilMcCauley', 'CIN453_NeilMcCauley', 'NIPS16_ChrisShiherlis',
                            'ALTW12_MichaelCheritto', 'NSP42_MichaelCheritto', 'CONS23_RogerVanZant',
                            'WAUE34_RogerVanZant']),
     'real_name': pd.Series(['VincentHanna', 'VincentHanna', 'NeilMcCauley', 'NeilMcCauley', 'NeilMcCauley',
                             'ChrisShiherlis', 'MichaelCheritto', 'MichaelCheritto', 'RogerVanZant',
                             'RogerVanZant']),
     'stats1': pd.Series(['meh1', 'meh1', 'yo1', 'yo1', 'yo1', 'hello1', 'bye1', 'bye1', 'namaste1',
                          'namaste1']),
     'stats2': pd.Series(['meh2', 'meh2', 'yo2', 'yo2', 'yo2', 'hello2', 'bye2', 'bye2', 'namaste2',
                          'namaste2']),
     'stats3': pd.Series(['meh3', 'meh3', 'yo3', 'yo3', 'yo3', 'hello3', 'bye3', 'bye3', 'namaste3',
                          'namaste3'])}
df_e = pd.DataFrame(e)
df_f = pd.DataFrame(f)
df_new = pd.merge(df_e, df_f, how='left', on='real_name', suffixes=['_left', '_right'])
df_new_grouped = df_new.groupby(df_new['raw_name_left'])
Now how do I compress/collapse the groups in df_new_grouped on real_name like I did in MySQL?
Once I have an object with the collapsed results I can slice the dataframe to report real_names we don't have a record of (NULL values) and those that we already know and can store the newly discovered raw_name.
You can drop duplicates based on the raw_name_left column and also remove the raw_name_right column using drop:
In [99]: df_new.drop_duplicates('raw_name_left').drop(columns='raw_name_right')
Out[99]:
            raw_name_left        real_name    stats1    stats2    stats3
0           AW103_Waingro          Waingro       NaN       NaN       NaN
1      CE993_VincentHanna     VincentHanna      meh1      meh2      meh3
3      EES43_NeilMcCauley     NeilMcCauley       yo1       yo2       yo3
6    SME16_ChrisShiherlis   ChrisShiherlis    hello1    hello2    hello3
7   MEC14_MichaelCheritto  MichaelCheritto      bye1      bye2      bye3
9      OTP23_RogerVanZant     RogerVanZant  namaste1  namaste2  namaste3
11    MDU232_AlanMarciano     AlanMarciano       NaN       NaN       NaN
Just to be thorough, this can also be done using groupby, which I found on Wes McKinney's blog, although drop_duplicates is cleaner and more efficient:
http://wesmckinney.com/blog/filtering-out-duplicate-dataframe-rows/
>index = [gp_keys[0] for gp_keys in df_new_grouped.groups.values()]
>unique_df = df_new.reindex(index)
>unique_df
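To finish the workflow from the question, a short sketch that splits the collapsed frame into unknown and known names (using stats1 as the missing-match indicator is an assumption; any of the right-side columns would do):
unique_df = df_new.drop_duplicates('raw_name_left').drop(columns='raw_name_right')

# real_names with no match in Table_B: send these for manual data entry.
unknown = unique_df[unique_df['stats1'].isna()][['raw_name_left', 'real_name']]

# real_names already in Table_B: these rows carry the stats for the new raw_name.
known = unique_df[unique_df['stats1'].notna()]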
