Python - Pandasql - IIf Function doesn't work Error

Trying to execute the following code:
import pandas as pd
import pandasql as ps
df = pd.DataFrame({'A':[5,6,7], 'B':[7,8,9]})
print(df)
   A  B
0  5  7
1  6  8
2  7  9
qry = """SELECT df.*, IIf(A Is Null,[B],[A]) AS NEW_A FROM df;"""
df1 = ps.sqldf(qry, globals())
print(df1)
yields this error:
PandaSQLException: (sqlite3.OperationalError) no such function: IIf
[SQL: 'SELECT df.*, IIf(A Is Null,[B],[A]) AS NEW_A FROM df;']
I've tried various combinations of syntax (square brackets, globals/locals, etc.) but couldn't find the issue. Does this function simply not exist?
I copied the SQL from an MS Access query; in other cases this has worked fine.

There is no iif() function in SQLite.
In this case you can use coalesce():
SELECT df.*, coalesce(A, B) AS NEW_A FROM df
The functionality of iif() can be achieved with a CASE statement:
SELECT df.*, CASE WHEN A is null THEN B ELSE A END AS NEW_A FROM df
but in this case coalesce() is simpler.
UPDATE:
Starting with SQLite version 3.32.0 (2020-05-22), the iif() function is supported.
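For reference, a minimal runnable sketch of the coalesce() approach with pandasql, using a None value so the fallback is visible:
import pandas as pd
import pandasql as ps

df = pd.DataFrame({'A': [5, None, 7], 'B': [7, 8, 9]})
qry = """SELECT df.*, coalesce(A, B) AS NEW_A FROM df;"""
df1 = ps.sqldf(qry, globals())
print(df1)
# B fills in where A is NULL (the middle row)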

Related

How to replicate an SQL group by query with stats (e.g. count()) and get an arbitrary number of values as a sample in Python

I would like to replicate this SQL query in Python; the input data is a dataframe, and I would like to do it using pandas.
select field_A, count(*) cnt, arbitrary(field_B)
from table1
group by field_A
Help will be appreciated.
A simple Python solution would be to use pandas and pandasql:
import pandas as pd
import pandasql as ps
df = pd.DataFrame(data=[[0, '10/11/12'], [1, '12/11/10']],
                  columns=['int_column', 'date_column'])
ps.sqldf("""select * from df""")
output
   int_column date_column
0           0    10/11/12
1           1    12/11/10
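The group-by from the question can then be expressed either in pandasql or directly in pandas. A sketch, assuming a table1 dataframe with field_A and field_B columns; SQLite has no arbitrary(), so min(field_B) (or max) stands in for it here:
import pandas as pd
import pandasql as ps

table1 = pd.DataFrame({'field_A': ['x', 'x', 'y'], 'field_B': [1, 2, 3]})

# pandasql: min(field_B) plays the role of arbitrary(field_B)
print(ps.sqldf("""select field_A, count(*) as cnt, min(field_B) as sample_B
                  from table1 group by field_A"""))

# pure pandas equivalent
print(table1.groupby('field_A')
            .agg(cnt=('field_B', 'size'), sample_B=('field_B', 'first'))
            .reset_index())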

Retrieve Value from Pyspark SQL dataframe in Python program

I am using pyspark.sql in a standalone Python program to run a query against version 0 of a table stored on Databricks.
I can return a data frame using the following code, but I can't seem to access the value (an int, 5 in this case):
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
table_diff_df = spark.sql("select count(*) as result from (select * from myTableName version as of 0 limit 5)")
logger.console(type(table_diff_df))
logger.console(table_diff_df[0][0])
logger.console(type(table_diff_df[0]['result']))
logger.console(table_diff_df)
logger.console("tablediff as str :" + str(table_diff_df))
output 1
<class 'pyspark.sql.dataframe.DataFrame'>
Column<b'result[0]'>
<class 'pyspark.sql.column.Column'>
DataFrame[result: bigint]
tablediff as str :DataFrame[result: bigint]
By adjusting my query to the following (appending .collect()) I have been able to get the value of 5 as required (however, I had to remove "version as of 0"):
table_diff_df = spark.sql("select count(*) as result from (select * from myTableName limit 5)").collect()
logger.console(type(table_diff_df))
logger.console(table_diff_df[0][0])
output 2
<class 'list'>
5
In my case I MUST run the query against the version 0 table, but when I add that back into my query as shown below, I get the following error:
table_diff_df = spark.sql("select count(*) as result from (select * from myTableName Version as of 0 limit 5)").collect()
logger.console(table_diff_df[0][0])
output 3
Time travel is only supported for Delta tables
Is there a simple way to access the value using the DataFrame shown in the first code snippet (output 1)? If not, how can I get around the "Time travel is only supported for Delta tables" error? The table I am querying is a Delta table, but I believe calling .collect() converts the result directly to a list (output 2).
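For the first part of the question, a single value can be read from a one-row DataFrame with first() (or collect()) without changing the query; a minimal sketch (the time-travel error is a separate issue, unrelated to how the value is accessed):
# first() returns a single Row (or None); columns can be read by name
row = table_diff_df.first()
result_value = row['result'] if row is not None else None
logger.console(result_value)

# equivalent: collect() the single-row result and index into it
result_value = table_diff_df.collect()[0]['result']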

Using Pandas Dataframe within a SQL Join

I'm trying to perform a SQL join on the contents of a dataframe with an external table I have in a Postgres database.
This is what the Dataframe looks like:
>>> df
  name author  count
0    a      b     10
1    c      d      5
2    e      f      2
I need to join it with a Postgres table that looks like this:
TABLE: blog
title  author  url
a      b       w.com
b      b       x.com
e      g       y.com
This is what I'm attempting to do, but this doesn't appear to be the right syntax for the query:
>>> sql_join = r"""select b.*, frame.* from ({0}) frame
join blog b
on frame.name = b.title
where frame.owner = b.owner
order by frame.count desc
limit 30;""".format(df)
>>> res = pd.read_sql(sql_join, connection)
I'm not sure how I can use the values in the dataframe within the SQL query.
Can someone point me in the right direction? Thanks!
Edit: As per my use case, I'm not able to convert the blog table into a dataframe given memory and performance constraints.
I managed to do this without converting the dataframe to a temp table and without reading the blog table into a dataframe via SQL.
For anyone else facing the same issue, this is achieved using a virtual table of sorts.
This is what my final SQL query looks like:
>>> inner_string = "VALUES ('a','b',10), ('c','d',5), ('e','f',2)"
>>> sql_join = r"""SELECT * FROM blog
JOIN ({0}) AS frame(title, owner, count)
ON blog.title = frame.title
WHERE blog.owner = frame.owner
ORDER BY frame.count DESC
LIMIT 30;""".format(inner_string)
>>> res = pd.read_sql(sql_join, connection)
You can use string manipulation to convert all rows in the dataframe into one large string similar to inner_string.
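A sketch of that string manipulation, building inner_string from the dataframe rows (this assumes simple, trusted string/integer values like the ones shown above; for arbitrary data a parameterised query or a proper temp table is safer):
# build "VALUES ('a','b',10), ('c','d',5), ('e','f',2)" from the dataframe
rows = []
for name, author, count in df[['name', 'author', 'count']].itertuples(index=False):
    rows.append("('{}','{}',{})".format(name, author, int(count)))
inner_string = "VALUES " + ", ".join(rows)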
You should create another dataframe from the Postgres table and then join both dataframes.
You can use read_sql to create a df from the table:
import psycopg2 ## Python connector library to Postgres
import pandas as pd
conn = psycopg2.connect(...) ## Put your DB credentials here
blog_df = pd.read_sql('SELECT * FROM blog', con=conn)
## This will bring the `blog` table's data into blog_df
It should look like this:
In [258]: blog_df
Out[258]:
  title author    url
0     a      b  w.com
1     b      b  x.com
2     e      g  y.com
Now, you can join df and blog_df using merge like below:
In [261]: pd.merge(df, blog_df, left_on='name', right_on='title')
Out[261]:
  name author_x  count title author_y    url
0    a        b     10     a        b  w.com
1    e        f      2     e        g  y.com
You will get result like above. You can clean it further.
Let me know if this helps.
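As a small follow-up to the cleaning remark above, a sketch: the suffixes argument keeps the two author columns distinguishable, and the duplicate join key can simply be dropped.
merged = pd.merge(df, blog_df, left_on='name', right_on='title',
                  suffixes=('_df', '_blog'))
merged = merged.drop(columns=['title'])   # same values as 'name' after the join
print(merged[['name', 'author_df', 'author_blog', 'count', 'url']])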
I've had similar problems. I found a workaround that lets me join across two different servers where I only have read-only rights: use SQLAlchemy to insert the pandas dataframe into a temp table and then join.
import sqlalchemy as sa
import pandas as pd
from sqlalchemy.orm import Session

engine = sa.create_engine(...)   # connection string for the server that holds `blog`
session = Session(engine)

metadata = sa.MetaData()
sql_of_df = sa.Table(
    "##df",                                            # "##" = global temp table (SQL Server naming)
    metadata,
    sa.Column("name", sa.String(x), primary_key=True),  # x = desired column length
    sa.Column("author", sa.String(x), nullable=False),
    sa.Column("count", sa.Integer),
)
metadata.create_all(engine)

dataframe_dict = df.to_dict(orient='records')
insert_statement = sql_of_df.insert().values(
    {
        "name": sa.bindparam("name"),
        "author": sa.bindparam("author"),
        "count": sa.bindparam("count"),
    }
)
session.execute(insert_statement, dataframe_dict)

statement = sa.text("SELECT * FROM blog INNER JOIN ##df ON blog.title = ##df.name")
session.execute(statement)

How to call an Oracle function of refcursor return type using sqlalchemy and python

Function is given below:
CREATE or replace function get_data(data_key integer) return sys_refcursor
is
result1 sys_refcursor;
BEGIN
open result1 for 'Select DISTINCT COL1
FROM REF_TABLE
where DATA_KEY='||data_key;
return result1;
END;
Calling the above function:
variable rc refcursor;
exec :rc :=get_data(30038);
print rc;
This works fine from SQL Developer.
How can I call the same using Python and SQLAlchemy?
I used cx_Oracle instead of SQLAlchemy and it worked; the output is the same as what I was expecting from SQLAlchemy. I converted the function to a procedure.
Procedure
CREATE or replace procedure get_data(data_key in integer, result out sys_refcursor) as
BEGIN
open result for 'Select DISTINCT COL1
FROM ref_table
where DATA_KEY='||data_key;
END;
Code for calling the above procedure using Python and cx_Oracle:
import cx_Oracle
import pandas as pd
conn = cx_Oracle.connect('system/<password>@127.0.0.1/XE')
cur = conn.cursor()
myvar = cur.var(cx_Oracle.CURSOR)
cur.callproc('get_data', (30038, myvar))
data = myvar.getvalue().fetchall()
if len(data) == 0:
    data = {}
df = pd.DataFrame(data, columns=[i[0] for i in myvar.getvalue().description])
print(df)
Output of the above code
COL1
--------------------------
0 219586
1 246751
2 228245
3 244517
4 220765
5 243467
6 246622
7 222784
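For reference, converting the function to a procedure isn't strictly required: cx_Oracle can also call the original function directly and receive the ref cursor through cursor.callfunc. A minimal sketch along the same lines:
import cx_Oracle
import pandas as pd

conn = cx_Oracle.connect('system/<password>@127.0.0.1/XE')
cur = conn.cursor()
# callfunc returns the sys_refcursor as a cx_Oracle cursor object
ref_cursor = cur.callfunc('get_data', cx_Oracle.CURSOR, [30038])
df = pd.DataFrame(ref_cursor.fetchall(),
                  columns=[col[0] for col in ref_cursor.description])
print(df)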

pandas read_sql drops dot in column names

Is this a bug, or am I doing something specifically wrong?
I create a df and put it into a SQL table; the df and the table have a column with a dot in its name.
Now when I read the df back from the SQL table, the column names aren't the same.
I wrote this little piece of code so that people can test it:
import sqlalchemy
import pandas as pd
import numpy as np
engine = sqlalchemy.create_engine('sqlite:///test.sqlite')
dfin = pd.DataFrame(np.random.randn(10,2), columns=['column with a . dot', 'without'])
print(dfin)
dfin.to_sql('testtable', engine, if_exists='fail')
tables = engine.table_names()
for table in tables:
    sql = 'SELECT t.* FROM "' + table + '" t'
    dfout = pd.read_sql(sql, engine)
    print(dfout.columns)
    print(dfout)
The solution is to pass sqlite_raw_colnames=True to your engine:
In [141]: engine = sqlalchemy.create_engine('sqlite:///', execution_options={'sqlite_raw_colnames':True})
In [142]: dfin.to_sql('testtable', engine, if_exists='fail')
In [143]: pd.read_sql("SELECT * FROM testtable", engine).head()
Out[143]:
index column with a . dot without
0 0 0.213645 0.321328
1 1 -0.511033 0.496510
2 2 -1.114511 -0.030571
3 3 -1.370342 0.359123
4 4 0.101111 -1.010498
SQLAlchemy does this stripping of dots deliberately (in some cases SQLite may store column names as "tablename.colname"); see e.g. "sqlalchemy+sqlite stripping column names with dots?" and https://groups.google.com/forum/?hl=en&fromgroups#!topic/sqlalchemy/EqAuTFlMNZk
This seems to be a bug, but not necessarily in the pandas read_sql function, as it relies on the keys method of the SQLAlchemy ResultProxy object to determine the column names, and that seems to truncate them:
In [15]: result = engine.execute("SELECT * FROM testtable")
In [16]: result.keys()
Out[16]: [u'index', u' dot', u'without']
So the question is whether this is a bug in SQLAlchemy, or whether pandas should work around it (e.g. by using result.cursor.description, which gives the correct names).
For now, you can also use the sqlite fallback mode by passing a DBAPI connection instead of a SQLAlchemy engine (as this relies on cursor.description, the correct column names are used):
In [20]: con = sqlite3.connect(':memory:')
In [21]: dfin.to_sql('testtable', con, if_exists='fail')
In [22]: pd.read_sql("SELECT * FROM testtable", con).head()
Out[22]:
index column with a . dot without
0 0 0.213645 0.321328
1 1 -0.511033 0.496510
2 2 -1.114511 -0.030571
3 3 -1.370342 0.359123
4 4 0.101111 -1.010498
