I am comparing two datasets to look for duplicate entries on certain columns.
I have done this first in SAS using PROC SQL (what I consider the true outcome) with the following query:
proc sql;
CREATE TABLE t1 AS
SELECT a.*, b.*
FROM
local AS a INNER JOIN neighbor AS b
ON a.surname1 = b.surname2
AND a.yob1 = b.yob2
AND a.cob1 = b.cob2;
quit;
I output this result to csv giving output_sas.csv
I have also done this in Python using SQLite3 using the same query:
import sqlite3

conn = sqlite3.connect(file_path + db_name)
cur = conn.cursor()
cur.execute("""
CREATE TABLE t1 AS
SELECT a.*, b.*
FROM
local AS a INNER JOIN neighbor AS b
ON a.surname1 = b.surname2
AND a.yob1 = b.yob2
AND a.cob1 = b.cob2
""")
I output this to csv giving output_python.csv.
The problem:
The outputs should be the same but they are not:
output_sas.csv contains 123 more records than output_python.csv.
Within the SAS output file, there are 123 records that contain a blank space "" in the yob1 and yob2 columns. As an example, those 123 records in output_sas.csv look like this sample:
yob1  yob2  cob1  cob2  surname1  surname2
""    ""    1     1     xx        xx
""    ""    2     2     yy        yy
...
# Continues for 123 records
I find that this difference is due to the yob1 and yob2 columns, which in the above 123 records contain blank space. These 123 record pairs are missing from the output_python.csv file.
[Note: In this work, a string of length zero corresponds to a missing value]
In short:
The PROC SQL routine in SAS is evaluating blank space as equal, i.e. "" == "" -> TRUE.
The Python SQLite code appears to be doing the opposite, i.e. "" == "" -> FALSE.
This is happening even though "" == "" -> True in Python.
The question:
Why is this the case and what do I need to change to match up the SQLite output to the PROC SQL output?
Note: Both routines are using the same input datasets. They are entirely equal, and I have even manually amended the Python code to ensure that the columns yob1 and yob2 contain "" for missing values.
Update 1:
At the moment my SAS PROC SQL code uses data1.sas7bdat, named local, and data2.sas7bdat, named neighbor.
To use the same datasets in Python, I export them from SAS to csv and read them into Python.
If I do:
import pandas as pd
# read in
dflocal = pd.read_csv(csv_path_local, index_col=False)
dfneighbor = pd.read_csv(csv_path_neighbor, index_col=False)
Pandas converts missing values to nan. We can use isnull() to find the number of nan values in each of the columns:
# find null / nan values in yob1 and yob2 in each dataset
len(dflocal.loc[dflocal.yob1.isnull()])
78
len(dfneighbor.loc[dfneighbor.yob2.isnull()])
184
To solve the null value problem, I then explicitly convert nan to a string of length zero "" by running:
dflocal['yob1'].fillna(value="", axis=0, inplace=True)
dfneighbor['yob2'].fillna(value="", axis=0, inplace=True)
We can test if the values got updated by testing a known nan:
dflocal.iloc[393].yob1
`""`
type(dflocal.iloc[393].yob1)
str
So they are a string of length 0.
Then read these into SQL via:
dflocal.to_sql('local', con=conn, flavor='sqlite', if_exists='replace', index=False)
dfneighbor.to_sql('neighbor', con=conn, flavor='sqlite', if_exists='replace', index=False)
Then execute the same SQLite3 code:
conn = sqlite3.connect(file_path + db_name)
cur = conn.cursor()
cur.execute("""
CREATE TABLE t1 AS
SELECT a.*, b.*
FROM
local AS a INNER JOIN neighbor AS b
ON a.surname1 = b.surname2
AND a.yob1 = b.yob2
AND a.cob1 = b.cob2
""")
Even though I have made this explicit change, I STILL get the same 123 missing records, even though the null values have been changed to a string of length zero "".
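One diagnostic that can be run at this point (a sketch, reusing the cur from above) is to ask SQLite directly whether the yob values were actually stored as text, as NULLs, or as something else entirely:
# Diagnostic sketch: inspect the storage class of the yob columns in each table.
for table, col in [("local", "yob1"), ("neighbor", "yob2")]:
    cur.execute(f"SELECT typeof({col}), COUNT(*) FROM {table} GROUP BY typeof({col})")
    print(table, cur.fetchall())  # e.g. [('text', 1234), ('null', 78)]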
Potential Solution:
However, if I instead import the dataset with the na_filter=False argument, this does the conversion from null to "" for me.
dflocal = pd.read_csv(csv_path_local, index_col=False, na_filter=False)
dfneighbor = pd.read_csv(csv_path_neighbor, index_col=False, na_filter=False)
# find null / nan values in yob1 and yob2 in each dataset
len(dflocal.loc[dflocal.yob1.isnull()])
0
len(dfneighbor.loc[dfneighbor.yob2.isnull()])
0
When I import these datasets to my database and run this through the same SQL code:
conn = sqlite3.connect(file_path + db_name)
cur = conn.cursor()
cur.execute("""
CREATE TABLE t1 AS
SELECT a.*, b.*
FROM
local AS a INNER JOIN neighbor AS b
ON a.surname1 = b.surname2
AND a.yob1 = b.yob2
AND a.cob1 = b.cob2
""")
HOORAY I GET THE SAME OUTPUT AS THE SAS CODE!
But why does the first solution not work? I'm doing the same thing in both cases (the first manually with fillna, and the second using na_filter=False).
In SAS, there isn't really a concept of null values for characters; it is more of an empty string.
However, in most SQL implementations (including SQLite, I assume), a null value and an empty string are different.
A blank value in SAS is indeed evaluated as "" = "" which is true
In your average DBMS however, what you would call 'blank values' are often null values, not empty strings (""). And null=null is not true. You cannot compare null values with anything, including null values.
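That behaviour is easy to demonstrate from Python's sqlite3 module against an in-memory database (purely illustrative):
import sqlite3

demo = sqlite3.connect(":memory:")
# Empty strings compare equal, just as in Python.
print(demo.execute("SELECT '' = ''").fetchone())       # (1,)
# NULL = NULL is not TRUE; the comparison itself evaluates to NULL.
print(demo.execute("SELECT NULL = NULL").fetchone())    # (None,)
# Two NULLs only "match" via IS NULL (or SQLite's IS operator).
print(demo.execute("SELECT NULL IS NULL").fetchone())   # (1,)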
What you could do is change your SQLite query to:
CREATE TABLE t1 AS
SELECT a.*, b.*
FROM
local AS a INNER JOIN neighbor AS b
ON a.surname1 = b.surname2
AND coalesce(a.yob1,'') = coalesce(b.yob2,'')
AND a.cob1 = b.cob2
The coalesce function will replace yob with an empty string when yob is null.
Be aware however that, if yob1 is null and yob2 actually is an empty string, adding those coalesce functions will change what would have been a null='' condition, which is not true, to a ''='' which is true.
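That edge case can be checked the same way (again illustrative only, with an in-memory sqlite3 connection):
import sqlite3

demo = sqlite3.connect(":memory:")
# Without COALESCE, NULL vs '' is not a match,
print(demo.execute("SELECT NULL = ''").fetchone())                              # (None,)
# but after COALESCE both sides become '', which does match.
print(demo.execute("SELECT coalesce(NULL, '') = coalesce('', '')").fetchone())  # (1,)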
If that is not what you'd want, you could also just write it like this:
CREATE TABLE t1 AS
SELECT a.*, b.*
FROM
local AS a INNER JOIN neighbor AS b
ON a.surname1 = b.surname2
AND (a.yob1 = b.yob2
OR (a.yob1 is null AND b.yob2 is null)
)
AND a.cob1 = b.cob2
Sounds like you are being hit by the way that SQLite3 (and most DBMS) handle null values. In SAS you can compare null values to actual values but in most DBMS systems you cannot. Thus in SAS complementary logical comparisons like (A=B) and (A ne B) will always yield one as true and the other as false. But in a DBMS when either A or B or both is NULL then both (A=B) and (A ne B) will be FALSE. A NULL value is neither less than nor greater than another value.
In SAS, if both values are NULL then they are equal, and if one is NULL and the other is not then they are not equal. NULL numeric values are less than any actual number. NULL character variables do not exist and so are just treated as a blank-filled value. Note that SAS also ignores trailing blanks when comparing character variables.
What this means in practice is that you need to add extra code to handle the NULL values when querying a DBMS.
ON (a.surname1 = b.surname2 or (a.surname1 is null and b.surname2 is null))
AND (a.yob1 = b.yob2 or (a.yob1 is null and b.yob2 is null))
AND (a.cob1 = b.cob2 or (a.cob1 is null and b.cob2 is null))
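As a side note not covered in the answers above: SQLite also has a NULL-safe comparison operator, IS, which behaves like = except that two NULLs compare as equal. A sketch of the same join written with it (reusing the cur object from the question):
# Sketch only: IS treats NULL IS NULL as true, so no explicit IS NULL checks are needed.
cur.execute("""
CREATE TABLE t1 AS
SELECT a.*, b.*
FROM
local AS a INNER JOIN neighbor AS b
ON a.surname1 IS b.surname2
AND a.yob1 IS b.yob2
AND a.cob1 IS b.cob2
""")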
Related
I am not experienced with SQL or SQLite3.
I have a list of ids from another table. I want to use the list as a key in my query and get all records based on the list. I want the SQL query to feed directly into a DataFrame.
import pandas as pd
import sqlite3
cnx = sqlite3.connect('c:/path/to/data.sqlite')
# the below values are ones found in "s_id"
id_list = ['C20','C23','C25','C28', ... ,'C83']
# change list to sql string.
id_sql = ", ".join(str(x) for x in id_list)
df = pd.read_sql_query(f"SELECT * FROM table WHERE s_id in ({id_sql})", cnx)
I am getting a DatabaseError: Execution failed on sql 'SELECT * FROM ... : no such column: C20.
When I saw this error I thought the code just needed a simple switch, so I tried this:
df = pd.read_sql_query(f"SELECT * FROM table WHERE ({id_sql}) in s_id", cnx)
it did not work.
So how can I get this to work?
The table looks like this:
id   s_id  date       assigned_to  date_complete  notes
0    C10   1/6/2020   Jack         1/8/2020       None
1    C20   1/10/2020  Jane         1/12/2020      Call back
2    C23   1/11/2020  Henry        1/12/2020      finished
n    C83   ...more rows of data...
n+1  D85   9/10/2021  Jeni         9/12/2021      Call back
Currently, you are missing the single quotes around your literal values, and consequently the SQLite engine assumes you are attempting to query columns. However, avoid concatenating values altogether; instead bind them to parameters, which pandas.read_sql supports with the params argument:
# the below values are ones found in "s_id"
id_list = ['C20','C23','C25','C28', ... ,'C83']
# build equal length string of ? place holders
prm_list = ", ".join("?" for _ in id_list)
# build prepared SQL statement
sql = f"SELECT * FROM table WHERE s_id IN ({prm_list})"
# run query, passing parameters and values separately
df = pd.read_sql(sql, con=cnx, params=id_list)
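For illustration, with a three-element list the prepared statement ends up looking like this (hypothetical values):
id_list = ['C20', 'C23', 'C25']
prm_list = ", ".join("?" for _ in id_list)
print(f"SELECT * FROM table WHERE s_id IN ({prm_list})")
# SELECT * FROM table WHERE s_id IN (?, ?, ?)
# The literal values are passed separately via params, so no quoting or escaping is needed.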
So the problem is that the single quote marks are missing from the sql string: the IN list needs ' on each side of each s_id value.
import pandas as pd
import sqlite3
cnx = sqlite3.connect('c:/path/to/data.sqlite')
# the below values are ones found in "s_id"
id_list = ['C20','C23','C25','C28', ... ,'C83']
# change list to sql string.
id_sql = "', '".join(str(x) for x in id_list)
df = pd.read_sql_query(f"SELECT * FROM table WHERE s_id in ('{id_sql}')", cnx)
I have a dataframe named Data2 and I wish to put values of it inside a postgresql table. For reasons, I cannot use to_sql as some of the values in Data2 are numpy arrays.
This is Data2's schema:
cursor.execute(
"""
DROP TABLE IF EXISTS Data2;
CREATE TABLE Data2 (
time timestamp without time zone,
u bytea,
v bytea,
w bytea,
spd bytea,
dir bytea,
temp bytea
);
"""
)
My code segment:
for col in Data2_mcw.columns:
    for row in Data2_mcw.index:
        value = Data2_mcw[col].loc[row]
        if type(value).__module__ == np.__name__:
            value = pickle.dumps(value)
        cursor.execute(
            """
            INSERT INTO Data2_mcw(%s)
            VALUES (%s)
            """,
            (col.replace('\"',''), value)
        )
Error generated:
psycopg2.errors.SyntaxError: syntax error at or near "'time'"
LINE 2: INSERT INTO Data2_mcw('time')
How do I rectify this error?
Any help would be much appreciated!
There are two problems I see with this code.
The first problem is that you cannot use bind parameters for column names, only for values. The first of the two %s placeholders in your SQL string is invalid. You will have to use string concatenation to set column names, something like the following (assuming you are using Python 3.6+):
cursor.execute(
f"""
INSERT INTO Data2_mcw({col})
VALUES (%s)
""",
(value,))
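As an aside (not required for the fix): if the column names come from data you don't fully control, psycopg2's sql module can compose the identifier safely instead of relying on an f-string; a sketch using the same col and value variables:
from psycopg2 import sql

# Quote the column name as an identifier; keep the value as an ordinary bind parameter.
query = sql.SQL("INSERT INTO Data2_mcw({}) VALUES (%s)").format(sql.Identifier(col))
cursor.execute(query, (value,))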
The second problem is that a SQL INSERT statement inserts an entire row. It does not insert a single value into an already-existing row, as you seem to be expecting it to.
Suppose your dataframe Data2_mcw looks like this:
a b c
0 1 2 7
1 3 4 9
Clearly, this dataframe has six values in it. If you were to run your code on this dataframe, then it would insert six rows into your database table, one for each value, and the data in your table would look like the following:
a  b  c
1
3
   2
   4
      7
      9
I'm guessing you don't want this: you'd rather your database table contained the following two rows instead:
a b c
1 2 7
3 4 9
Instead of inserting one value at a time, you will have to insert one entire row at a time. This means you have to swap your two loops around, build the SQL string up once beforehand, and collect together all the values for a row before passing them to the database. Something like the following should hopefully work (please note that I don't have a Postgres database to test this against):
column_names = ",".join(Data2_mcw.columns)
placeholders = ",".join(["%s"] * len(Data2_mcw.columns))
sql = f"INSERT INTO Data2_mcw({column_names}) VALUES ({placeholders})"
for row in Data2_mcw.index:
    values = []
    for col in Data2_mcw.columns:
        value = Data2_mcw[col].loc[row]
        if type(value).__module__ == np.__name__:
            value = pickle.dumps(value)
        values.append(value)
    cursor.execute(sql, values)
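If the dataframe is large, the row-by-row execute() calls can also be batched. A hedged sketch using psycopg2.extras.execute_values, reusing the column_names variable and the same pickling rule as above:
from psycopg2.extras import execute_values

# Collect one list of values per dataframe row.
all_rows = []
for row in Data2_mcw.index:
    values = []
    for col in Data2_mcw.columns:
        value = Data2_mcw[col].loc[row]
        if type(value).__module__ == np.__name__:
            value = pickle.dumps(value)
        values.append(value)
    all_rows.append(values)

# execute_values expands the single %s into a multi-row VALUES list.
execute_values(cursor, f"INSERT INTO Data2_mcw({column_names}) VALUES %s", all_rows)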
I was running the code below in Python. It does a merge from one table into another table, but sometimes it gives me errors due to duplicates. How do I know which records have been merged and which have not, so that I can trace the records and fix them? Or at least, how do I make my code log a hinted error message so that I can trace it?
# Exact match client on NAME/DOB (not yet using name_dob_v)
sql = """
merge into nf.es es using (
select id, name_last, name_first, dob
from fd.emp
where name_last is not null and name_first is not null and dob is not null
) es6
on (upper(es.patient_last_name) = upper(es6.name_last) and upper(es.patient_first_name) = upper(es6.name_first)
and es.patient_dob = es6.dob)
when matched then update set
es.client_id = es6.id
, es.client_id_comment = '2 exact name/exact dob match'
where
es.client_id is null -- exclude those already matched
and es.patient_last_name is not null and es.patient_first_name is not null and es.patient_dob is not null
and es.is_lock = 'Locked' and es.is_active = 'Yes' and es.patient_last_name NOT IN ('DOE','UNKNOWN','DELETE', 'CANCEL','CANCELLED','CXL','REFUSED')
"""
log.info(sql)
curs.execute(sql)
msg = "nf.es rows updated with es6 client_id due to exact name/dob match: %d" % curs.rowcount
log.info(msg)
emailer.append(msg)
You can't know, merge won't tell you. You have to actually find them and take appropriate action.
Maybe it'll help if you select distinct values:
merge into nf.es es using (
select DISTINCT --> this
id, name_last, name_first, dob
from fd.emp
...
If it still doesn't work, then join the table to be merged with the one in the USING clause on all the columns you are already joining on, and see which rows are duplicated. Something like this:
SELECT *
FROM (SELECT d.id,
d.name_last,
d.name_first,
d.dob
FROM fd.emp d
JOIN nf.es e
ON UPPER (e.patient_last_name) = UPPER (d.name_last)
AND UPPER (e.patient_first_name) = UPPER (d.name_first)
WHERE d.name_last IS NOT NULL
AND d.name_first IS NOT NULL
AND d.dob IS NOT NULL)
GROUP BY id,
name_last,
name_first,
dob
HAVING COUNT (*) > 1;
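If you want the job to log the offending keys rather than just fail, one option (a sketch, using the curs and log objects from the question and the same column names) is to run a duplicate check on the source table before the MERGE:
dup_sql = """
    select upper(name_last), upper(name_first), dob, count(*)
    from fd.emp
    where name_last is not null and name_first is not null and dob is not null
    group by upper(name_last), upper(name_first), dob
    having count(*) > 1
"""
curs.execute(dup_sql)
for name_last, name_first, dob, cnt in curs.fetchall():
    log.info("duplicate source key in fd.emp: %s / %s / %s (%d rows)", name_last, name_first, dob, cnt)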
Context: I am using MSSQL, pandas, and pyodbc.
Steps:
Obtain dataframe from query using pyodbc (no problemo)
Process columns to generate the context of a new (but already existing) column
Fill an auxiliary column with UPDATE statements (i.e. UPDATE t SET t.value = df.value FROM dbo.table t WHERE t.ID = df.ID)
Now how do I execute the SQL code in the auxiliary column, without looping through each row?
sample data
The first two columns are obtained by querying dbo.table; the third column exists but is empty in the database. The fourth column only exists in the dataframe, to prepare the SQL statement that would correspond to updating dbo.table.
ID  raw                   processed    strSQL
1   lorum.ipsum@test.com  lorum ipsum  UPDATE t SET t.processed = 'lorum ipsum' FROM dbo.table t WHERE t.ID = 1
2   rumlo.sumip@test.com  rumlo sumip  UPDATE t SET t.processed = 'rumlo sumip' FROM dbo.table t WHERE t.ID = 2
3   ...                   ...          ...
I would like to execute the SQL script in each row in an efficient manner.
After I recommended .executemany() in a comment to the question, a subsequent comment from @Charlieface suggested that a table-valued parameter (TVP) would provide even better performance. I didn't think it would make that much difference, but I was wrong.
For an existing table named MillionRows
ID TextField
-- ---------
1 foo
2 bar
3 baz
…
and example data of the form
num_rows = 1_000_000
rows = [(f"text{x:06}", x + 1) for x in range(num_rows)]
print(rows)
# [('text000000', 1), ('text000001', 2), ('text000002', 3), …]
my test using a standard executemany() call with cnxn.autocommit = False and crsr.fast_executemany = True
crsr.executemany("UPDATE MillionRows SET TextField = ? WHERE ID = ?", rows)
took about 180 seconds (3 minutes).
However, by creating a user-defined table type
CREATE TYPE dbo.TextField_ID AS TABLE
(
TextField nvarchar(255) NULL,
ID int NOT NULL,
PRIMARY KEY (ID)
)
and a stored procedure
CREATE PROCEDURE [dbo].[mr_update]
@tbl dbo.TextField_ID READONLY
AS
BEGIN
SET NOCOUNT ON;
UPDATE MillionRows SET TextField = t.TextField
FROM MillionRows mr INNER JOIN @tbl t ON mr.ID = t.ID
END
when I used
crsr.execute("{CALL mr_update (?)}", (rows,))
it did the same update in approximately 80 seconds (less than half the time).
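For completeness, a sketch of what the Python-side setup might look like for both timings (the connection string is a hypothetical placeholder; rows is the list built above):
import pyodbc

# Hypothetical connection string; adjust driver/server/database for your environment.
cnxn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=localhost;DATABASE=mydb;Trusted_Connection=yes;"
)
cnxn.autocommit = False
crsr = cnxn.cursor()

# Plain executemany() path (the ~180 second timing above):
crsr.fast_executemany = True
# crsr.executemany("UPDATE MillionRows SET TextField = ? WHERE ID = ?", rows)

# TVP path (the ~80 second timing above): pyodbc passes the list of row tuples
# as a single table-valued parameter to the stored procedure.
crsr.execute("{CALL mr_update (?)}", (rows,))
cnxn.commit()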
Hello dear stackoverflow community,
here is my problem:
A) I have data in csv with some boolean columns;
unfortunately, the values in these columns are t or f (single letter);
this is an artifact (from Redshift) that I cannot control.
B) I need to create a spark dataframe from this data,
hopefully converting t -> true and f -> false.
For that, I create a Hive DB and a temp Hive table
and then SELECT * from it, like this:
sql_str = """SELECT * FROM {db}.{s}_{t} """.format(
db=hive_db_name, s=schema, t=table)
df = sql_cxt.sql(sql_str)
This works, I can print df, and it gives me all my columns with correct data types.
But:
C) If I create the table like this:
CREATE EXTERNAL TABLE IF NOT EXISTS {db}.{schema}_{table}({cols})
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|t'
STORED AS TEXTFILE
LOCATION ...
, this converts all my t and f to Nulls.
So:
D) I found out about LazySimpleSerDe, which presumably does what I mean (convert t and f to true and false on the fly). From https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties (quote):
"""
hive.lazysimple.extended_boolean_literal
Default Value: false
Added in: Hive 0.14 with HIVE-3635
LazySimpleSerDe uses this property to determine
if it treats 'T', 't', 'F', 'f', '1', and '0' as extended,
legal boolean literals, in addition to 'TRUE' and 'FALSE'.
The default is false, which means only 'TRUE' and 'FALSE'
are treated as legal boolean literals.
"""
According to this (or at least so I think), I now create a table in Hive DB like this:
create_table_sql = """
CREATE EXTERNAL TABLE IF NOT EXISTS {db_name}.{schema}_{table}({cols})
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES ("separatorChar" = "\|")
STORED AS TEXTFILE
LOCATION '{loc}'
TBLPROPERTIES ('hive.lazysimple.extended_boolean_literal'='true')
""".format(db_name=hive_db_name,
schema=schema,
table=table,
cols=",\n".join(cols),
loc=location)
return sql_cxt.sql(create_table_sql)
This does create a table,
I can again see all the columns with proper data types,
the df.count() is correct, but df.head(3) still
gives me all values for my boolean columns == Null.
I tried for hours different variants for my CREATE TABLE...
with or without SERDEPROPERTIES,
with or without TBLPROPERTIES,
with "FIELDS TERMINATED BY..." or without,
etc.
All give me either
Null in place of 't' and 'f', or
an empty df (nothing from df.head(5)), or
a syntax error, or
some 100 pages of Java exceptions.
The real problem is, I would say, that there is no single example of CREATE TABLE with LazySimpleSerDe
that does the job that is described in the docs.
I would really, really appreciate your help or any ideas. I pulled out almost all my hair.
Thank you in advance!
According to the patches in the JIRA issue (HIVE-3635):
SET hive.lazysimple.extended_boolean_literal=true;
So for example, if you have a tab-delimited text file containing header rows and 't'/'f' for true/false:
create table mytable(myfield boolean)
row format delimited
fields terminated by '\t'
lines terminated by '\n'
location '/path'
tblproperties (
'skip.header.line.count' = '1'
);
...
select count(*) from mytable where myfield is null; <-- returns 100% null
...
SET hive.lazysimple.extended_boolean_literal=true;
select count(*) from mytable where myfield is null; <-- with the serde now interpreting the extended boolean literals, this yields a different count
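In the Spark setup from the question, that property can presumably be set through the same sql_cxt before querying; a sketch, assuming the sql_cxt and sql_str variables from earlier:
# Set the session-level property, then re-run the query; the 't'/'f' values should now
# be read as booleans instead of NULL.
sql_cxt.sql("SET hive.lazysimple.extended_boolean_literal=true")
df = sql_cxt.sql(sql_str)
df.head(3)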