Finding difference between two time stamp in pyspark sql

Finding difference between two time stamp in pyspark sql - python

Below table structure, you can notice the column name
cal_avg_latency = spark.sql("SELECT UnitType, ROUND(AVG(TIMESTAMP_DIFF(OnSceneDtTmTS, ReceivedDtTmTS, MINUTE)), 2) as latency, count(*) as total_count FROM `SFSC_Incident_Census_view` WHERE EXTRACT(DATE from ReceivedDtTmTS) == EXTRACT(DATE from OnSceneDtTmTS) GROUP BY UnitType ORDER BY latency ASC")
Error:
ParseException: "\nmismatched input 'FROM' expecting <EOF>(line 1, pos 122)\n\n== SQL ==\nSELECT UnitType, ROUND(AVG(TIMESTAMP_DIFF(OnSceneDtTmTS, ReceivedDtTmTS, MINUTE)), 2) as latency, count(*) as total_count FROM SFSC_Incident_Census_view WHERE EXTRACT((DATE FROM ReceivedDtTmTS) == EXTRACT(DATE FROM OnSceneDtTmTS)) GROUP BY UnitType ORDER BY latency ASC\n--------------------------------------------------------------------------------------------------------------------------^^^\n"
Error is in WHERE condition but even my TIMESTAMP_DIFF function not working
cal_avg_latency = spark.sql("SELECT UnitType, ROUND(AVG(TIMESTAMP_DIFF(OnSceneDtTmTS, ReceivedDtTmTS, MINUTE)), 2) as latency, count(*) as total_count FROM SFSC_Incident_Census_view GROUP BY UnitType ORDER BY latency ASC")
Error :
AnalysisException: "Undefined function: 'TIMESTAMP_DIFF'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.; line 1 pos 27"

The error message seems pretty clear. Hive doesn't have a TIMESTAMP_DIFF function.
If your columns are already appropriately cast as a timestamp type, you can subtract them directly. Otherwise, you can cast them explicity, and take the difference:
SELECT ROUND(AVG(MINUTE(CAST(OnSceneDtTmTS AS timestamp) - CAST(ReceivedDtTmTS AS timestamp))), 2) AS latency

I have solve the problem using pyspark query.
from pyspark.sql import functions as F
import pyspark.sql.functions as func
timeFmt = "yyyy-MM-dd'T'HH:mm:ss.SSS"
timeDiff = (F.unix_timestamp('OnSceneDtTmTS', format=timeFmt)
- F.unix_timestamp('ReceivedDtTmTS', format=timeFmt))
FSCDataFrameTsDF = FSCDataFrameTsDF.withColumn("Duration", timeDiff)
#convert seconds to minute and round the seconds for further use.
FSCDataFrameTsDF = FSCDataFrameTsDF.withColumn("Duration_minutes",func.round(FSCDataFrameTsDF.Duration / 60.0))
Output:

Related

Python Airflow bigquery 400 configuration.query.createDisposition cannot be set for scripts

Recently I started getting an error in my BigQueryExecuteQueryOperator (from airflow.providers.google.cloud.operators.bigquery import BigQueryExecuteQueryOperator)
execute_query_job = BigQueryExecuteQueryOperator(
task_id = "execute_query_job_{}".format(destination_table),
use_legacy_sql = False,
sql = sql_query,
destination_dataset_table = destination_table,
create_disposition = "CREATE_IF_NEEDED",
write_disposition = 'WRITE_TRUNCATE',
dag = dag
)
job_id_execute = execute_query_job.execute(context=context)
The above code block works how it suppose to work, so it is working fine. But when I change my sql_query to a new one I am getting the Error 400: configuration.query.createDisposition cannot be set for scripts.
SQL script which works for the code block,..
with data_table as(
select pltfm_name, event_dt as event_date
from `project_id.dataset.data_tabele`
AND event_dt BETWEEN DATE('start_date',"America/Los_Angeles") AND DATE('end_date',"America/Los_Angeles")
),
activity_data as (
select DATE(timestamp, "America/Los_Angeles") as event_date,
COUNT (distinct CASE WHEN eventid = 'mp' AS bp
from `project_id.dataset.data_tabele`
AND DATE(timestamp, "America/Los_Angeles") between DATE("start_date","America/Los_Angeles") AND DATE("end_date","America/Los_Angeles")
group by 1
),
cal as (
select event_date FROM UNNEST(GENERATE_DATE_ARRAY(DATE("start_date","America/Los_Angeles"), DATE("end_date","America/Los_Angeles"))) event_date
)
select a.event_date,
coalesce(c.bp, 0) as bp,
from cal a
left join activity_data c on a.event_date = c.event_date;
But the below SQL script doesn't work and it'll give an error.
DECLARE
temp string DEFAULT 'D';
SET temp = 'M';
WITH
BASE_DATA AS (
SELECT
CASE
WHEN temp = 'M' THEN DATE_TRUNC(EventDate,MONTH)
WHEN temp = 'Q'THEN DATE_TRUNC(EventDate,QUARTER)
END
ed,
SUM(CASE
WHEN temp = 'M' THEN tl
WHEN temp = 'Q' THEN tl
END) AS tl_count
FROM
`project_id.dataset.data_table`
WHERE
CASE
WHEN temp = 'M' THEN (DATE(EventDate) BETWEEN DATE_ADD(DATE_TRUNC(DATE(CURRENT_DATE()), MONTH), INTERVAL -2 MONTH) AND DATE_ADD(DATE_TRUNC(CURRENT_DATE(), MONTH), INTERVAL -1 DAY))
WHEN temp = 'Q' THEN (DATE(EventDate) BETWEEN DATE_ADD(DATE_TRUNC(DATE(CURRENT_DATE()), QUARTER), INTERVAL -2 QUARTER)
AND DATE_ADD(DATE_TRUNC(CURRENT_DATE(), QUARTER), INTERVAL -1 DAY))
END
GROUP BY
1
ORDER BY
1 DESC)
SELECT
ed,
tl_count
FROM
BASE_DATA
ORDER BY
ed DESC;
So the above SQL script throws the error, but runs perfectly in GCP BigQuery. I have looked around and it seems Airflow can't execute the query with a DECLARE statement or something similar. (Kind of similar issue --> https://www.py4u.net/discuss/174607). I have tried what they suggested but it still didn't work, end-up with the same error. So, now I am not sure what causing the issue here and if there is another way to approach this in Airflow.
Does anyone know what might be happening and a solution or a workaround?

As you've surmised, the DECLARE statement means that there's multiple discrete steps in this SQL text, so this is executed as a SCRIPT rather than a single statement: https://cloud.google.com/bigquery/docs/reference/standard-sql/scripting
The easiest thing is probably to remove the job config properties related to destination table/dispositions and instead update the final SELECT ... to be a CREATE OR REPLACE TABLE ... AS SELECT ...: https://cloud.google.com/bigquery/docs/reference/standard-sql/data-definition-language#create_table_statement

Retrieve Value from Pyspark SQL dataframe in Python program

I am using pyspark.sql in a standalone Python program to run a query on a VERSION 0 of a table stored on Databricks.
I can return a data frame using the following code but can not seem to access the value (which is an int 5 in this case)
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
table_diff_df = spark.sql("select count(*) as result from (select * from myTableName version as of 0 limit 5)")
logger.console(type(table_diff_df))
logger.console(table_diff_df[0][0])
logger.console(type(table_diff_df[0]['result']))
logger.console(table_diff_df)
logger.console("tablediff as str :" + str(table_diff_df))
output 1
<class 'pyspark.sql.dataframe.DataFrame'>
Column<b'result[0]'>
<class 'pyspark.sql.column.Column'>
DataFrame[result: bigint]
tablediff as str :DataFrame[result: bigint]
By adjusting my query to the following(appending .collect()) I have been able to get the value of 5 as required (however I had to remove Version as of 0)
table_diff_df = spark.sql("select count(*) as result from (select * from myTableName limit 5)").collect()
logger.console(type(table_diff_df))
logger.console(table_diff_df[0][0])
output 2
<class 'list'>
5
In my case I MUST run the query on the Version 0 table, but when I add that back into my query as shown below I get the following error
table_diff_df = spark.sql("select count(*) as result from (select * from myTableName Version as of 0 limit 5)").collect()
logger.console(table_diff_df[0][0])
output 3
Time travel is only supported for Delta tables
Is there a simple way to access the value using the Dataframe I have shown in the first code snippet(output1)? and if not how can I get around the problem of time travel is only supported delta table. The table that I am querying is a delta table however I believe calling .collect() is converting it directly to a list(output2)?

Adding SQL for loop in python

i am a newbees for programming, i have an db file with some date, open, high, low , close data in it, and name with 0001.HK; 0002.HK; 0003.HK
then i try to build a loop to take out some data in the database.
conn = sqlite3.connect(os.path.join('data', "hkprice.db"))
def read_price(stock_id):
connect = 'select Date, Open, High, Low, Close, Volume from ' + stock_id
df = pd.read_sql(connect, conn,index_col=['Date'], parse_dates=['Date'])
for y in range(1 ,2):
read_price(str(y).zfill(4) + '.HK')
when it output it show: Execution failed on sql 'select Date, Open, High, Low, Close, Volume from 0001.HK': unrecognized token: "0001.HK"
but i should have the 0001.HK table in the database
what should i do?

If you want to use variables with a query, you need to put a placeholder ?. So in your particular case:
connect = 'select Date, Open, High, Low, Close, Volume from ?'
After that in read_sql you can provide a list of your variables to the params kwarg like so:
df = pd.read_sql(connect, conn, params=[stock_id], index_col=['Date'], parse_dates=['Date'])
If you have multiple parameters and, hence, multiple ? placeholders then when you supply the list of variables to params they need to be in exactly the same order as your ?.
EDIT:
For example if I had a query where I wanted to get data between some dates, this is how I would do it:
start = ['list of dates']
end = ['another list of dates']
query = """select *
from table
where start_date >= ? and
end_date < ?
"""
df = pd.read_sql_query(query, conn, params=[start, end])
Here interpreter will see the first ? and grab the first item from the first list, then when it gets to the second ? it will grab the first item from the second list. If there's a mismatch between the number of ? and the number of supplied params then it will throw an error.

Filtering of records between sysdate and sysdate+7 from Oracle Sql is not working correctly

I am firing an SQL query to filter records between sysdate and sysdate+7 but I am getting records outside the range as well. What is wrong my SQL
cursor.execute("""
select
'Shipment' as object_type
, trunc(sc.effective_timestamp) reference_date
, sc.location_name location
from
master.cons_search c
inner orbit.status_cons sc ON (c.tms_cons_id=sc.cons_id)
where
1=1
AND c.global_company IN ('SWEET234')
AND sc.type = '1201'
and (trunc(c.ets) >= trunc(sysdate) and trunc(c.ets) <= (trunc(sysdate) + 7))
""")
data=cursor.fetchall()
I even tried a between function
and trunc(c.ets) between trunc(sysdate) and (trunc(sysdate) + 7)
But all of them giving results outside the range . What is the issue here?

You are filtering on c.ets.
You are selecting sc.effective_timestamp.
I suspect that you are confused about the dates. If you filter on the same column you are selecting, then you should not see out-of-range dates.

How to make an efficient query for extracting enteries of all days in a database in sets?

I have a database that includes 440 days of several series with a sampling time of 5 seconds. There is also missing data.
I want to calculate the daily average. The way I am doing it now is that I make 440 queries and do the averaging afterward. But, this is very time consuming since for every query the whole database is searched for related entries. I imagine there must be a more efficient way of doing this.
I am doing this in python, and I am just learning sql. Here's the query section of my code:
time_cur = date_begin
Data = numpy.zeros(shape=(N, NoFields - 1))
X = []
nN = 0
while time_cur<date_end:
X.append(time_cur)
cur = con.cursor()
cur.execute("SELECT * FROM os_table \
WHERE EXTRACT(year from datetime_)=%s\
AND EXTRACT(month from datetime_)=%s\
AND EXTRACT(day from datetime_)=%s",\
(time_cur.year, time_cur.month, time_cur.day));
Y = numpy.array([0]*(NoFields-1))
n = 0.0
while True:
n = n + 1
row = cur.fetchone()
if row == None:
break
Y = Y + numpy.array(row[1:])
Data[nN][:] = Y/n
nN = nN + 1
time_cur = time_cur + datetime.timedelta(days=1)
And, my data looks like this:
datetime_,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11,c12,c13,c14
2012-11-13-00:07:53,42,30,0,0,1,9594,30,218,1,4556,42,1482,42
2012-11-13-00:07:58,70,55,0,0,2,23252,55,414,2,2358,70,3074,70
2012-11-13-00:08:03,44,32,0,0,0,11038,32,0,0,5307,44,1896,44
2012-11-13-00:08:08,36,26,0,0,0,26678,26,0,0,12842,36,1141,36
2012-11-13-00:08:13,33,26,0,0,0,6590,26,0,0,3521,33,851,33
I appreciate your suggestions.
Thanks
Iman

I don't know the np function so I don't understand what are you averaging. If you show your table and the logic to get the average...
But this is how to get a daily average for a single column
import psycopg2
conn = psycopg2.connect('host=localhost4 port=5432 dbname=cpn')
cursor = conn.cursor()
cursor.execute('''
select
datetime::date as day,
avg(c1) as c1_average,
avg(c2) as c2_average
from os_table
where datetime between %s and %s
group by 1
order by 1
''',
(time_cur, date_end)
);
rs = cursor.fetchall()
conn.close()
for day in rs:
print day[0], day[1], day[2]

This answer uses SQL Server syntax - I am not sure how different PostgreSQL is - it should be fairly similar, you may find things like the DATEADD, DATEDIFF and CONVERT statements are different, (Actually, almost certainly the CONVERT statement - just convert the date to a varchar instead -I am just using it as a reportName, so it not vital) You should be able to follow the theory of this, even if the code doesn't run in PostgreSQL without tweaking.
First Create a Reports Table ( you will use this to link to the actual table you want to report on )
CREATE TABLE Report_Periods (
report_name VARCHAR(30) NOT NULL PRIMARY KEY,
report_start_date DATETIME NOT NULL,
report_end_date DATETIME NOT NULL,
CONSTRAINT date_ordering
CHECK (report_start_date <= report_end_date)
)
Next populate the report table with the dates you need to report on, there are many ways to do this - the method I've chosen here will only use the days you need, but you could create this with all dates you are ever likely to use, so you only have to do it once.
INSERT INTO Report_Periods (report_name, report_start_date, report_end_date)
SELECT CONVERT(VARCHAR, [DatePartOnly], 0) AS DateName,
[DatePartOnly] AS StartDate,
DATEADD(ms, -3, DATEADD(dd,1,[DatePartOnly])) AS EndDate
FROM ( SELECT DISTINCT DATEADD(DD, DATEDIFF(DD, 0, datetime_), 0) AS [DatePartOnly]
FROM os_table ) AS M
Note in SQL server, the smallest time allowed is 3 milliseconds - so the above statement adds 1 day, then subtracts 3 milliseconds to create a start and end datetime for a day. Again PostgrSQL may have different values
This means you can simply link the reports table back to your os_table to get averages, counts etc very simply
SELECT AVG(value) AS AvgValue, COUNT(value) AS NumValues, R.report_name
FROM os_table AS T
JOIN Report_Periods AS R ON T.datetime_>= R.report_start_date AND T.datetime_<= R.report_end_date
GROUP BY R.report_name

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Finding difference between two time stamp in pyspark sql - python

Related

Python Airflow bigquery 400 configuration.query.createDisposition cannot be set for scripts

Retrieve Value from Pyspark SQL dataframe in Python program

Adding SQL for loop in python

Filtering of records between sysdate and sysdate+7 from Oracle Sql is not working correctly

How to make an efficient query for extracting enteries of all days in a database in sets?

Categories

Resources