I have a dataframe named Data2 and I wish to put its values into a PostgreSQL table. I cannot use to_sql because some of the values in Data2 are numpy arrays.
This is Data2's schema:
cursor.execute(
    """
    DROP TABLE IF EXISTS Data2;
    CREATE TABLE Data2 (
        time timestamp without time zone,
        u    bytea,
        v    bytea,
        w    bytea,
        spd  bytea,
        dir  bytea,
        temp bytea
    );
    """
)
My code segment:
for col in Data2_mcw.columns:
    for row in Data2_mcw.index:
        value = Data2_mcw[col].loc[row]
        if type(value).__module__ == np.__name__:
            value = pickle.dumps(value)
        cursor.execute(
            """
            INSERT INTO Data2_mcw(%s)
            VALUES (%s)
            """,
            (col.replace('\"', ''), value)
        )
Error generated:
psycopg2.errors.SyntaxError: syntax error at or near "'time'"
LINE 2: INSERT INTO Data2_mcw('time')
How do I rectify this error?
Any help would be much appreciated!
There are two problems I see with this code.
The first problem is that you cannot use bind parameters for column names, only for values. The first of the two %s placeholders in your SQL string is invalid. You will have to use string concatenation to set column names, something like the following (assuming you are using Python 3.6+):
cursor.execute(
    f"""
    INSERT INTO Data2_mcw({col})
    VALUES (%s)
    """,
    (value,)
)
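If you would rather not splice the column name into the string yourself, psycopg2 also provides the sql module for composing identifiers safely. A minimal sketch of the same insert, reusing col, value and cursor from the loop above:

from psycopg2 import sql

# Build the statement with the column name as a properly quoted identifier
query = sql.SQL("INSERT INTO Data2_mcw ({}) VALUES (%s)").format(sql.Identifier(col))
cursor.execute(query, (value,))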
The second problem is that a SQL INSERT statement inserts an entire row. It does not insert a single value into an already-existing row, as you seem to be expecting it to.
Suppose your dataframe Data2_mcw looks like this:
   a  b  c
0  1  2  7
1  3  4  9
Clearly, this dataframe has six values in it. If you were to run your code on this dataframe, then it would insert six rows into your database table, one for each value, and the data in your table would look like the following:
a     b     c
1     NULL  NULL
3     NULL  NULL
NULL  2     NULL
NULL  4     NULL
NULL  NULL  7
NULL  NULL  9
I'm guessing you don't want this: you'd rather your database table contained the following two rows instead:
a b c
1 2 7
3 4 9
Instead of inserting one value at a time, you will have to insert one entire row at a time. This means you have to swap your two loops around, build the SQL string once beforehand, and collect together all the values for a row before passing them to the database. Something like the following should hopefully work (please note that I don't have a Postgres database to test this against):
column_names = ",".join(Data2_mcw.columns)
placeholders = ",".join(["%s"] * len(Data2_mcw.columns))
sql = f"INSERT INTO Data2_mcw({column_names}) VALUES ({placeholders})"
for row in Data2_mcw.index:
    values = []
    for col in Data2_mcw.columns:
        value = Data2_mcw[col].loc[row]
        if type(value).__module__ == np.__name__:
            value = pickle.dumps(value)
        values.append(value)
    cursor.execute(sql, values)
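If the dataframe is large, executing one INSERT per row can be slow. psycopg2.extras.execute_values can batch the rows in a single call; a rough sketch building on the code above (same column_names, same pickling rule, untested against a live Postgres):

from psycopg2.extras import execute_values

all_rows = []
for row in Data2_mcw.index:
    values = []
    for col in Data2_mcw.columns:
        value = Data2_mcw[col].loc[row]
        if type(value).__module__ == np.__name__:
            value = pickle.dumps(value)
        values.append(value)
    all_rows.append(values)

# execute_values expands the single %s after VALUES into one tuple per row
execute_values(cursor, f"INSERT INTO Data2_mcw({column_names}) VALUES %s", all_rows)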
Context: I am using MSSQL, pandas, and pyodbc.
Steps:
Obtain dataframe from query using pyodbc (no problemo)
Process columns to generate the content of a new (but already existing) column
Fill an auxiliary column with UPDATE statements (i.e. UPDATE t SET t.value = df.value FROM dbo.table t WHERE t.ID = df.ID)
Now how do I execute the SQL code in the auxiliary column, without looping through each row?
sample data
The first two columns are obtained by querying dbo.table; the third column exists but is empty in the database. The fourth column only exists in the dataframe, to prepare the SQL statement that would update dbo.table.
ID  raw                   processed    strSQL
--  --------------------  -----------  ------------------------------------------------------------------------
1   lorum.ipsum@test.com  lorum ipsum  UPDATE t SET t.processed = 'lorum ipsum' FROM dbo.table t WHERE t.ID = 1
2   rumlo.sumip@test.com  rumlo sumip  UPDATE t SET t.processed = 'rumlo sumip' FROM dbo.table t WHERE t.ID = 2
3   ...                   ...          ...
I would like to execute the SQL script in each row in an efficient manner.
After I recommended .executemany() in a comment to the question, a subsequent comment from @Charlieface suggested that a table-valued parameter (TVP) would provide even better performance. I didn't think it would make that much difference, but I was wrong.
For an existing table named MillionRows
ID TextField
-- ---------
1 foo
2 bar
3 baz
…
and example data of the form
num_rows = 1_000_000
rows = [(f"text{x:06}", x + 1) for x in range(num_rows)]
print(rows)
# [('text000000', 1), ('text000001', 2), ('text000002', 3), …]
my test using a standard executemany() call with cnxn.autocommit = False and crsr.fast_executemany = True
crsr.executemany("UPDATE MillionRows SET TextField = ? WHERE ID = ?", rows)
took about 180 seconds (3 minutes).
However, by creating a user-defined table type
CREATE TYPE dbo.TextField_ID AS TABLE
(
    TextField nvarchar(255) NULL,
    ID int NOT NULL,
    PRIMARY KEY (ID)
)
and a stored procedure
CREATE PROCEDURE [dbo].[mr_update]
    @tbl dbo.TextField_ID READONLY
AS
BEGIN
    SET NOCOUNT ON;
    UPDATE MillionRows SET TextField = t.TextField
    FROM MillionRows mr INNER JOIN @tbl t ON mr.ID = t.ID
END
when I used
crsr.execute("{CALL mr_update (?)}", (rows,))
it did the same update in approximately 80 seconds (less than half the time).
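Applied back to the dataframe in the question, the executemany variant does not need the strSQL column at all; the parameter rows can be built straight from the processed and ID columns. A sketch, assuming the dataframe is called df, keeping the question's placeholder table name dbo.table, and reusing the crsr/cnxn names from above:

# Build (processed, ID) parameter tuples directly from the dataframe
rows = list(df[["processed", "ID"]].itertuples(index=False, name=None))

crsr.fast_executemany = True
crsr.executemany("UPDATE dbo.table SET processed = ? WHERE ID = ?", rows)
cnxn.commit()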
I am comparing two datasets to look for duplicate entries on certain columns.
I have done this first in SAS using PROC SQL (what I consider the true outcome), with the following query:
proc sql;
CREATE TABLE t1 AS
SELECT a.*, b.*
FROM
local AS a INNER JOIN neighbor AS b
ON a.surname1 = b.surname2
AND a.yob1 = b.yob2
AND a.cob1 = b.cob2;
quit;
I output this result to csv giving output_sas.csv
I have also done this in Python using SQLite3 using the same query:
conn = sqlite3.connect(file_path + db_name)
cur = conn.cursor()
cur.execute("""
CREATE TABLE t1 AS
SELECT a.*, b.*
FROM
local AS a INNER JOIN neighbor AS b
ON a.surname1 = b.surname2
AND a.yob1 = b.yob2
AND a.cob1 = b.cob2
""")
I output this to csv giving output_python.csv.
The problem:
The outputs should be the same but they are not:
output_sas.csv contains 123 more records than output_python.csv.
Within the SAS output file, there are 123 records that contain blank space "" in the yob1 and yob2 columns. As an example, these 123 records in output_sas.csv look like this sample:
yob1  yob2  cob1  cob2  surname1  surname2
""    ""    1     1     xx        xx
""    ""    2     2     yy        yy
...
# Continues for 123 records
I find that this difference is due to the yob1 and yob2 columns, which in the above 123 records contain blank space. These 123 record pairs are missing from the output_python.csv file.
[Note: In this work, a string of length zero corresponds to a missing value]
In short:
The PROC SQL routine in SAS evaluates blank space as equal, i.e. "" == "" -> TRUE.
The Python SQLite code appears to do the opposite, i.e. "" == "" -> FALSE.
This is happening even though "" == "" -> True in Python.
The question:
Why is this the case and what do I need to change to match up the SQLite output to the PROC SQL output?
Note: Both routines are using the same input datasets. They are entirely equal, and I even manually amend the Python code to ensure that the columns yob1 and yob2 contain "" for missing values.
Update 1:
At the moment my SAS PROC SQL code uses data1.sas7bdat, named local, and data2.sas7bdat, named neighbor.
To use the same datasets in Python, I export them from SAS to csv and read them into Python.
If I do:
import pandas as pd
# read in
dflocal = pd.read_csv(csv_path_local, index_col=False)
dfneighbor = pd.read_csv(csv_path_neighbor, index_col=False)
Pandas converts missing values to nan. We can use isnull() to find the number of nan values in each of the columns:
# find null / nan values in yob1 and yob2 in each dataset
len(dflocal.loc[dflocal.yob1.isnull()])
78
len(dfneighbor.loc[dfneighbor.yob2.isnull()])
184
To solve the null value problem, I then explicitly convert nan to a string of length zero "" by running:
dflocal['yob1'].fillna(value="", axis=0, inplace=True)
dfneighbor['yob2'].fillna(value="", axis=0, inplace=True)
We can test if the values got updated by testing a known nan:
dflocal.iloc[393].yob1
`""`
type(dflocal.iloc[393].yob1)
str
So they are a string of length 0.
Then read these into SQL via:
dflocal.to_sql('local', con=conn, flavor='sqlite', if_exists='replace', index=False)
dfneighbor.to_sql('neighbor', con=conn, flavor='sqlite', if_exists='replace', index=False)
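Note that newer pandas releases have removed the flavor argument from to_sql; with the same sqlite3 conn the equivalent calls are simply (a sketch):

dflocal.to_sql('local', con=conn, if_exists='replace', index=False)
dfneighbor.to_sql('neighbor', con=conn, if_exists='replace', index=False)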
Then execute the same SQLite3 code:
conn = sqlite3.connect(file_path + db_name)
cur = conn.cursor()
cur.execute("""
CREATE TABLE t1 AS
SELECT a.*, b.*
FROM
local AS a INNER JOIN neighbor AS b
ON a.surname1 = b.surname2
AND a.yob1 = b.yob2
AND a.cob1 = b.cob2
""")
Even though I have made this explicit change, I STILL get the same 123 missing records, despite the null values having been changed to a string of length zero "".
Potential Solution:
However, if I instead import the datasets with the na_filter=False argument, the conversion from null to "" is done for me.
dflocal = pd.read_csv(csv_path_local, index_col=False, na_filter=False)
dfneighbor = pd.read_csv(csv_path_neighbor, index_col=False, na_filter=False)
# find null / nan values in yob1 and yob2 in each dataset
len(dflocal.loc[dflocal.yob1.isnull()])
0
len(dfneighbor.loc[dfneighbor.yob2.isnull()])
0
When I import these datasets to my database and run this through the same SQL code:
conn = sqlite3.connect(file_path + db_name)
cur = conn.cursor()
cur.execute("""
CREATE TABLE t1 AS
SELECT a.*, b.*
FROM
local AS a INNER JOIN neighbor AS b
ON a.surname1 = b.surname2
AND a.yob1 = b.yob2
AND a.cob1 = b.cob2
""")
HOORAY I GET THE SAME OUTPUT AS THE SAS CODE!
But why does the first solution not work? I'm doing the same thing in both cases (the first doing it manually with fillna, and the second using na_filter=False).
In SAS, there isn't really a concept of null values for characters. It is more of an empty string.
However, in most SQL implementations (including SQLite, I assume), a null value and an empty string are different.
A blank value in SAS is indeed evaluated as "" = "", which is true.
In your average DBMS, however, what you would call 'blank values' are often null values, not empty strings (""). And null = null is not true: you cannot compare null values with anything, including other null values.
What you could do is change your SQLite query to
CREATE TABLE t1 AS
SELECT a.*, b.*
FROM
local AS a INNER JOIN neighbor AS b
ON a.surname1 = b.surname2
AND coalesce(a.yob1,'') = coalesce(b.yob2,'')
AND a.cob1 = b.cob2
The coalesce function will replace yob with an empty string when yob is null.
Be aware however that, if yob1 is null and yob2 actually is an empty string, adding those coalesce functions will change what would have been a null='' condition, which is not true, to a ''='' which is true.
If that is not what you'd want, you could also just write it like this:
CREATE TABLE t1 AS
SELECT a.*, b.*
FROM
local AS a INNER JOIN neighbor AS b
ON a.surname1 = b.surname2
AND (a.yob1 = b.yob2
OR (a.yob1 is null AND b.yob2 is null)
)
AND a.cob1 = b.cob2
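SQLite also has the IS operator, which behaves like = except that it treats two NULLs as equal, so the same intent can be written more compactly. A sketch using the same connection and tables as above:

cur.execute("""
    CREATE TABLE t1 AS
    SELECT a.*, b.*
    FROM
    local AS a INNER JOIN neighbor AS b
    ON a.surname1 = b.surname2
    AND a.yob1 IS b.yob2   -- matches equal values and also NULL with NULL
    AND a.cob1 = b.cob2
""")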
Sounds like you are being hit by the way that SQLite3 (and most DBMSs) handle null values. In SAS you can compare null values to actual values, but in most DBMS systems you cannot. Thus in SAS complementary logical comparisons like (A=B) and (A ne B) will always yield one as true and the other as false. But in a DBMS, when either A or B or both is NULL, then both (A=B) and (A ne B) will be FALSE. A NULL value is neither less than nor greater than another value.
In SAS, if both values are NULL then they are equal, and if one is NULL and the other is not then they are not equal. NULL numeric values are less than any actual number. NULL character variables do not exist and so are just treated as a blank-filled value. Note that SAS also ignores trailing blanks when comparing character variables.
What this means in practice is that you need to add extra code to handle the NULL values when querying a DBMS.
ON (a.surname1 = b.surname2 or (a.surname1 is null and b.surname2 is null))
AND (a.yob1 = b.yob2 or (a.yob1 is null and b.yob2 is null))
AND (a.cob1 = b.cob2 or (a.cob1 is null and b.cob2 is null))
I have a tuple with a single value that's the result of a database query (it gives me the max ID # currently in the database). I need to add 1 to that value to use in my subsequent query, which creates a new profile associated with the next ID #.
I'm having trouble converting the tuple into an integer so that I can add 1 (I tried the roundabout way here of turning the value into a string and then into an int). Help, please.
sql = """
SELECT id
FROM profiles
ORDER BY id DESC
LIMIT 1
"""
cursor.execute(sql)
results = cursor.fetchall()
maxID = int(','.join(str(results)))
newID = maxID + 1
If you are expecting just the one row, then use cursor.fetchone() instead of fetchall() and simply index into the one row that that method returns:
cursor.execute(sql)
row = cursor.fetchone()
newID = row[0] + 1
Rather than use an ORDER BY, you can ask the database directly for the maximum value:
sql = """SELECT MAX(id) FROM profiles"""
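Putting the two suggestions together, a minimal sketch (same cursor, and assuming an empty table should start the IDs from 1):

cursor.execute("SELECT MAX(id) FROM profiles")
row = cursor.fetchone()
# MAX() returns NULL (None in Python) when the table is empty
maxID = row[0] if row[0] is not None else 0
newID = maxID + 1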
I have an Access table like this (date format is mm/dd/yyyy):
col_1_id    col_2
1           1/1/2003
2
3           1/5/2009
4
5           3/2/2008
Output should be a table where Col_1 is between 2 and 4 (blank cells must stay blank):
2
3 1/5/2009
4
I tried with an SQL query. The output prints 'None' in the blank cells; I need those cells to stay blank.
The other issue is that when I try to insert these values into another table, it only inserts the rows that have a date value. The code stops when it gets a row without a date. I need to insert the rows as they are.
I tried in Python with:
import pyodbc

DBfile = 'Data.mdb'
conn = pyodbc.connect('Driver={Microsoft Access Driver (*.mdb)};DBQ=' + DBfile)
cursor = conn.cursor()

sql_table = "CREATE TABLE Table_new (Col_1 integer, Col_2 Date)"
cursor.execute(sql_table)
conn.commit()

# A holds the rows fetched from the source table
i = 2
while i >= 2 and i <= 4:
    sql = "INSERT INTO New_Table (Col_1, Col_2) VALUES ('%s', '%s')" % (A[i][0], A[i][1])
    cursor.execute(sql)
    conn.commit()
    i = i + 1

cursor.close()
conn.close()
Instead of using A[i][x], which dictates the value for you, why not simply add OR logic to eliminate the possibility of None?
For any cell you wish to keep as "blank" (assuming you mean an empty string), let's say A[i][1], just do
A[i][1] or ""
which will yield the empty string "" if the cell gives you None.
The string representation of None is actually 'None', not the empty string. Try:
"... ('%s', '%s')" % (A[i][0], A[i][1] if A[i][1] else '')
I am using Python and PostgreSQL. I have a table with 6 columns: one id and 5 entries. I want to copy the id and the most repeated entry among the 5 entries to a new table.
I have done this:
import psycopg2
connection=psycopg2.connect("dbname=homedb user=ria")
cursor=connection.cursor()
l_dict= {'licence_id':1}
cursor.execute("SELECT * FROM im_entry.usr_table")
rows=cursor.fetchall()
cursor.execute("INSERT INTO im_entry.pr_table (image_1d) SELECT image_1d FROM im_entry.usr_table")
for row in rows:
    p = findmax(row)  # to get most repeated entry from first table
    .................
    .................
Then how can I enter this p value into the new table?
Please help me
p is a tuple, so you can run another execute with an INSERT statement, passing the tuple (or part of it):
cursor.execute("INSERT INTO new_table (x, ...) VALUES (%s, ...)", p)
where:
(x, ...) contains the column names
(%s, ...) contains one %s placeholder for each column
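For example, if findmax returns the licence id together with the most repeated entry, the loop could finish like this sketch (the column names licence_id and best_entry are placeholders for whatever im_entry.pr_table actually contains):

for row in rows:
    p = findmax(row)   # assumed to return a tuple, e.g. (licence_id, best_entry)
    cursor.execute(
        "INSERT INTO im_entry.pr_table (licence_id, best_entry) VALUES (%s, %s)",
        p,
    )
connection.commit()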