Retrieve value from a PySpark SQL DataFrame in a Python program

I am using pyspark.sql in a standalone Python program to run a query against VERSION 0 of a table stored on Databricks.
I can return a DataFrame using the following code, but I cannot seem to access the value (which is an int, 5 in this case):
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
table_diff_df = spark.sql("select count(*) as result from (select * from myTableName version as of 0 limit 5)")
logger.console(type(table_diff_df))
logger.console(table_diff_df[0][0])
logger.console(type(table_diff_df[0]['result']))
logger.console(table_diff_df)
logger.console("tablediff as str :" + str(table_diff_df))
output 1
<class 'pyspark.sql.dataframe.DataFrame'>
Column<b'result[0]'>
<class 'pyspark.sql.column.Column'>
DataFrame[result: bigint]
tablediff as str :DataFrame[result: bigint]
By adjusting my query as follows (appending .collect()), I have been able to get the value of 5 as required (however, I had to remove VERSION AS OF 0):
table_diff_df = spark.sql("select count(*) as result from (select * from myTableName limit 5)").collect()
logger.console(type(table_diff_df))
logger.console(table_diff_df[0][0])
output 2
<class 'list'>
5
In my case I MUST run the query against version 0 of the table, but when I add that back into my query, as shown below, I get the following error:
table_diff_df = spark.sql("select count(*) as result from (select * from myTableName Version as of 0 limit 5)").collect()
logger.console(table_diff_df[0][0])
output 3
Time travel is only supported for Delta tables
Is there a simple way to access the value using the DataFrame shown in the first code snippet (output 1)? If not, how can I get around the "Time travel is only supported for Delta tables" error? The table I am querying is a Delta table, but I believe calling .collect() is converting it directly to a list (output 2).
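One way to read the scalar out of the result, once the query can execute, is via the returned Row. A minimal sketch (logger.console is the asker's own logging helper):
row = table_diff_df.first()          # materializes the result as a Row (or None if empty)
logger.console(row["result"])        # 5, accessed by column name
logger.console(row[0])               # 5, accessed by position
Equivalently, table_diff_df.collect()[0][0] returns the same value. Note that .collect() does not convert the table itself into a list; it only runs the query and brings the result rows back to the driver, so the "Time travel is only supported for Delta tables" error is raised when the query actually executes rather than being caused by .collect() itself.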

Related

Reading an SQL query into a Dask DataFrame

I'm trying to create a function that takes an SQL SELECT query as a parameter and uses dask to read its results into a dask DataFrame with the dask.dataframe.read_sql_query function. I am new to dask and to SQLAlchemy.
I first tried this:
import dask.dataframe as dd
query = "SELECT name, age, date_of_birth from customer"
df = dd.read_sql_query(sql=query, con=con_string, index_col="name", npartitions=10)
As you probably already know, this won't work because the sql parameter has to be an SQLAlchemy selectable and, more importantly, TextClause isn't supported.
I then wrapped the query in a select like this:
import dask.dataframe as dd
from sqlalchemy import sql
query = "SELECT name, age, date_of_birth from customer"
sa_query = sql.select(sql.text(query))
df = dd.read_sql_query(sql=sa_query, con=con_string, index_col="name")
This fails too, with a very weird error that I have been trying to solve. The problem is that dask needs to infer the types of the columns, and it does so by reading the first head_rows rows in the table (5 rows by default) and inferring the types from them. This line in the dask codebase adds a LIMIT ? to the query, which ends up being
SELECT name, age, date_of_birth from customer LIMIT param_1
The param_1 doesn't get substituted with the right value (5 in this case). It then fails on the next line, https://github.com/dask/dask/blob/main/dask/dataframe/io/sql.py#L119, which evaluates the SQL expression.
sqlalchemy.exc.ProgrammingError: (mariadb.ProgrammingError) You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'SELECT name, age, date_of_birth from customer
LIMIT ?' at line 1
[SQL: SELECT SELECT name, age, date_of_birth from customer
LIMIT ?]
[parameters: (5,)]
(Background on this error at: https://sqlalche.me/e/14/f405)
I can't understand why param_1 wasn't substituted with the value of head_rows. One can see from the error message that it detects there's a parameter that needs to be used for the substitution but for some reason it doesn't actually substitute it.
Perhaps, I didn't correctly create the SQLAlchemy selectable?
I can simply use pandas.read_sql and create a dask dataframe from the resulting pandas dataframe but that defeats the purpose of using dask in the first place.
I have the following constraints:
- I cannot change the function to accept a ready-made sqlalchemy selectable. This feature will be added to a private library used at my company, and various projects using this library do not use sqlalchemy.
- Passing meta to the custom function is not an option because it would require the caller to create it. However, passing a meta attribute to read_sql_query and setting head_rows=0 is completely OK, as long as there's an efficient way to retrieve/create it.
- While dask-sql might work for this case, using it is unfortunately not an option.
How can I go about correctly reading an SQL query into a dask DataFrame?
The crux of the problem is this line:
sa_query = sql.select(sql.text(query))
What is happening is that we are constructing a nested SELECT query, which can cause a problem downstream.
Let's first create a test database:
# create a test database (using https://stackoverflow.com/a/64898284/10693596)
from sqlite3 import connect
from dask.datasets import timeseries

con = "test.sql"
db = connect(con)

# create a pandas df and store it (timestamp is dropped to make sure
# that the index is numeric)
df = (
    timeseries(start="2000-01-01", end="2000-01-02", freq="1h", seed=0)
    .compute()
    .reset_index()
)
df.to_sql("ticks", db, if_exists="replace")
Next, let's try to get things working with pandas without sqlalchemy:
from pandas import read_sql_query
con = "sqlite:///test.sql"
query = "SELECT * FROM ticks LIMIT 3"
meta = read_sql_query(sql=query, con=con).set_index("index")
print(meta)
# id name x y
# index
# 0 998 Ingrid 0.760997 -0.381459
# 1 1056 Ingrid 0.506099 0.816477
# 2 1056 Laura 0.316556 0.046963
Now, let's add sqlalchemy functions:
from pandas import read_sql_query
from sqlalchemy.sql import text, select
con = "sqlite:///test.sql"
query = "SELECT * FROM ticks LIMIT 3"
sa_query = select(text(query))
meta = read_sql_query(sql=sa_query, con=con).set_index("index")
# OperationalError: (sqlite3.OperationalError) near "SELECT": syntax error
# [SQL: SELECT SELECT * FROM ticks LIMIT 3]
# (Background on this error at: https://sqlalche.me/e/14/e3q8)
Note the SELECT SELECT due to running sqlalchemy.select on an existing query. This can cause problems. How to fix this? In general, I don't think there's a safe and robust way of transforming arbitrary SQL queries into their sqlalchemy equivalent, but if this is for an application where you know that users will only run SELECT statements, you can manually sanitize the query before passing it to sqlalchemy.select:
from dask.dataframe import read_sql_query
from sqlalchemy.sql import select, text

con = "sqlite:///test.sql"
query = "SELECT * FROM ticks"

def _remove_leading_select_from_query(query):
    if query.startswith("SELECT "):
        return query.replace("SELECT ", "", 1)
    else:
        return query

sa_query = select(text(_remove_leading_select_from_query(query)))
ddf = read_sql_query(sql=sa_query, con=con, index_col="index")
print(ddf)
print(ddf.head(3))
# Dask DataFrame Structure:
# id name x y
# npartitions=1
# 0 int64 object float64 float64
# 23 ... ... ... ...
# Dask Name: from-delayed, 2 tasks
# id name x y
# index
# 0 998 Ingrid 0.760997 -0.381459
# 1 1056 Ingrid 0.506099 0.816477
# 2 1056 Laura 0.316556 0.046963
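Since the question mentions that passing a meta attribute to read_sql_query with head_rows=0 would be acceptable, here is a sketch of that variant (reusing the ticks test table and connection string from above; npartitions is set explicitly because, without a head sample, dask cannot estimate partition sizes):
from dask.dataframe import read_sql_query
from pandas import read_sql_query as pd_read_sql_query
from sqlalchemy.sql import select, text

con = "sqlite:///test.sql"
query = "SELECT * FROM ticks"

# build meta (an empty frame with the right dtypes) from a few rows via plain pandas
meta = pd_read_sql_query(sql=f"{query} LIMIT 3", con=con).set_index("index").iloc[:0]

# same leading-SELECT stripping as above, to avoid the nested SELECT SELECT
sa_query = select(text(query.replace("SELECT ", "", 1)))

ddf = read_sql_query(
    sql=sa_query,
    con=con,
    index_col="index",
    meta=meta,
    head_rows=0,     # skip dask's own LIMIT query for type inference
    npartitions=2,   # given explicitly since there is no head sample to size partitions
)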

While iterating over a pandas Series, query an SQLite database with each member of the Series

I have a pandas Series made from the following python dictionary, so:
gr8 = {'ERF13' : 'AT2G44840', 'BBX32' : 'AT3G21150', 'NAC061' : 'AT3G44350', 'NAC090' : 'AT5G22380', 'ERF019' : 'AT1G22810'}
gr8obj = pd.Series(gr8)
(where I have previously imported pandas as pd)
I have an SQLite database, AtRegnet.db
I want to iterate over the pandas Series, gr8obj, and query the database, AtRegnet.db, for each member of the series.
This is what I have tried:
for i in gr8obj:
    resdf = pd.read_sql('SELECT * FROM AtRegNet WHERE TargetLocus = ?' (i), con=sqlite3.connect("/home/anno/SQLiteDBs/AtRegnet.db"))
    fresdf = resdf.append(resdf)
fresdf
(The table in AtRegnet.db that I want is AtRegNet, and the column I am searching on is called TargetLocus.)
I know that when I work on the SQLite3 database directly with a SQL command,
select * from AtRegNet where TargetLocus="AT3G23230"
I get back 80 lines from the database. (AT3G23230 is one of the members of gr8obj.)
You can try using an f-string. The value for TargetLocus in your query should also be in quotes:
resdf = pd.read_sql(f'''SELECT * FROM AtRegNet WHERE TargetLocus = \'{i}\'''', con=sqlite3.connect("/home/anno/SQLiteDBs/AtRegnet.db"))
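A variant that side-steps the quoting entirely (a sketch, reusing the database path and table from the question) is to keep the ? placeholder and pass the value through params; collecting the per-locus frames with pd.concat also avoids the resdf.append(resdf) issue in the original loop, which appends each result to itself instead of accumulating across iterations:
import sqlite3
import pandas as pd

con = sqlite3.connect("/home/anno/SQLiteDBs/AtRegnet.db")
frames = []
for locus in gr8obj:
    # the sqlite3 driver fills in the ? placeholder from params
    frames.append(pd.read_sql('SELECT * FROM AtRegNet WHERE TargetLocus = ?', con=con, params=(locus,)))
fresdf = pd.concat(frames)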

How to convert a correlated subquery to pyspark code

I am trying to convert a correlated subquery into PySpark code and I am getting the error below.
Attribute(s) with the same name appear in the operation:
Sample SQL query:
SELECT (few columns)
FROM table0 spw1
JOIN table1 spw2
  ON spw2.date_key = (SELECT MAX(date_key)
                      FROM table2 spw3
                      WHERE spw3.name = spw1.name
                        AND spw3.id = spw1.id
                        AND spw3.v_key = spw1.v_key
                        AND spw3.start_date < spw1.prev_start_date)
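One common way to express this in PySpark (a sketch, assuming spw1, spw2 and spw3 are DataFrames loaded from table0, table1 and table2 with the column names from the query) is to pre-aggregate the correlated subquery and then join on its result; aliasing the inputs also avoids the "Attribute(s) with the same name" ambiguity:
from pyspark.sql import functions as F

# for each (name, id, v_key, prev_start_date) in spw1, compute MAX(date_key) over the
# matching spw3 rows, including the correlated predicate start_date < prev_start_date
max_keys = (
    spw1.alias("a")
    .join(
        spw3.alias("c"),
        (F.col("c.name") == F.col("a.name"))
        & (F.col("c.id") == F.col("a.id"))
        & (F.col("c.v_key") == F.col("a.v_key"))
        & (F.col("c.start_date") < F.col("a.prev_start_date")),
    )
    .groupBy("a.name", "a.id", "a.v_key", "a.prev_start_date")
    .agg(F.max("c.date_key").alias("max_date_key"))
)

# join spw1 to the pre-aggregated keys, then to spw2 on its date_key
# (assumes spw1 itself has no date_key column, as in the original query)
result = (
    spw1.join(max_keys, on=["name", "id", "v_key", "prev_start_date"])
    .join(spw2, F.col("date_key") == F.col("max_date_key"))
    # .select(...)  # pick the few columns needed here
)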

Finding the difference between two timestamps in pyspark sql

Below is my table structure; note the column names.
cal_avg_latency = spark.sql("SELECT UnitType, ROUND(AVG(TIMESTAMP_DIFF(OnSceneDtTmTS, ReceivedDtTmTS, MINUTE)), 2) as latency, count(*) as total_count FROM `SFSC_Incident_Census_view` WHERE EXTRACT(DATE from ReceivedDtTmTS) == EXTRACT(DATE from OnSceneDtTmTS) GROUP BY UnitType ORDER BY latency ASC")
Error:
ParseException: "\nmismatched input 'FROM' expecting <EOF>(line 1, pos 122)\n\n== SQL ==\nSELECT UnitType, ROUND(AVG(TIMESTAMP_DIFF(OnSceneDtTmTS, ReceivedDtTmTS, MINUTE)), 2) as latency, count(*) as total_count FROM SFSC_Incident_Census_view WHERE EXTRACT((DATE FROM ReceivedDtTmTS) == EXTRACT(DATE FROM OnSceneDtTmTS)) GROUP BY UnitType ORDER BY latency ASC\n--------------------------------------------------------------------------------------------------------------------------^^^\n"
The error is in the WHERE condition, but even my TIMESTAMP_DIFF function is not working:
cal_avg_latency = spark.sql("SELECT UnitType, ROUND(AVG(TIMESTAMP_DIFF(OnSceneDtTmTS, ReceivedDtTmTS, MINUTE)), 2) as latency, count(*) as total_count FROM SFSC_Incident_Census_view GROUP BY UnitType ORDER BY latency ASC")
Error:
AnalysisException: "Undefined function: 'TIMESTAMP_DIFF'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.; line 1 pos 27"
The error message seems pretty clear. Hive doesn't have a TIMESTAMP_DIFF function.
If your columns are already appropriately cast as a timestamp type, you can subtract them directly. Otherwise, you can cast them explicitly and take the difference:
SELECT ROUND(AVG(MINUTE(CAST(OnSceneDtTmTS AS timestamp) - CAST(ReceivedDtTmTS AS timestamp))), 2) AS latency
I have solved the problem using a pyspark query.
from pyspark.sql import functions as F
import pyspark.sql.functions as func

timeFmt = "yyyy-MM-dd'T'HH:mm:ss.SSS"
timeDiff = (F.unix_timestamp('OnSceneDtTmTS', format=timeFmt)
            - F.unix_timestamp('ReceivedDtTmTS', format=timeFmt))
FSCDataFrameTsDF = FSCDataFrameTsDF.withColumn("Duration", timeDiff)
# convert seconds to minutes and round for further use
FSCDataFrameTsDF = FSCDataFrameTsDF.withColumn("Duration_minutes", func.round(FSCDataFrameTsDF.Duration / 60.0))

Use psycopg2 to obtain a long value from PostgreSQL

I am facing a problem retrieving a long value from PostgreSQL.
I use the following SQL command:
SELECT *, (extract(epoch FROM start_timestamp) * 1000) FROM lot
WHERE EXTRACT(EPOCH FROM lot.start_timestamp) * 1000 >=1265299200000 AND
EXTRACT(EPOCH FROM lot.start_timestamp) * 1000 <=1265990399999
ORDER BY start_timestamp DESC limit 9 offset 0
I am interested in the Unix timestamp in milliseconds.
I run this command manually through pgAdmin.
From pgAdmin, I saw 1265860762817 for column (extract(epoch FROM start_timestamp) * 1000)
However, when I retrieve it through psycopg2, here is what I use:
cur = conn.cursor()
cur.execute(sql)
rows = cur.fetchall()
for row in rows:
    print row[3]
And here is what I get:
1.26586076282e+12
I pass this value from a Python CGI script to JavaScript as JSON.
If I do a manual conversion in JavaScript:
// summary.date is 1.26586076282e+12
var timestamp = summary.date * 1;
// Get 1265860762820
alert(timestamp);
There is a 3 ms difference:
1265860762820 - 1265860762817 = 3
How can I avoid this error? How can I make sure psycopg2 returns 1265860762817 rather than 1.26586076282e+12?
I get the same results using both psycopg2 and pygres. While this SELECT returns a float value, you can modify your code to format the returned value:
def format_float_fld(v):
    # return str(v)
    return ('%20.0f' % (v)).strip()
If you use str(v), you will get scientific notation.
You can also change the query to return a bigint instead of a float:
SELECT (EXTRACT(EPOCH FROM TIMESTAMP '2010-02-16 20:38:40.123') * 1000)::bigint;
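On the Python side, the ::bigint value comes back from psycopg2 as a plain Python int, so no string formatting is needed. A minimal sketch (the connection string is a placeholder):
import psycopg2

conn = psycopg2.connect("dbname=mydb user=me")  # placeholder connection details
cur = conn.cursor()
cur.execute(
    "SELECT (EXTRACT(EPOCH FROM start_timestamp) * 1000)::bigint FROM lot "
    "ORDER BY start_timestamp DESC LIMIT 1"
)
(ms,) = cur.fetchone()
print(ms)  # e.g. 1265860762817 -- an exact int, not 1.26586076282e+12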
