I'm getting some data from a database using pandas' read_sql_query:
import pandas as pd
date_filter = '2020-01-01'
df = pd.read_sql_query(f"SELECT * FROM table WHERE date_id>= {date_filter }", my_connection)
If I run this code, the filter is not applied and all the data is returned.
However, if I put single quotes around the variable:
df = pd.read_sql_query(f"SELECT * FROM table WHERE date_id>= '{date_filter }'", my_connection)
The filter is applied correctly.
Why? Am I making an error somewhere?
You need to wrap apostrophes around the value, because without them you end up with a query like
SELECT * FROM table WHERE date_id >= 2020-01-01
which does not compare against a date literal. Quoting the value gives the valid syntax:
SELECT * FROM table WHERE date_id >= '2020-01-01'
That is why you need the apostrophes inside your template.
Without quotes, I would guess the SQL parser is treating your variable as integer subtraction of the year / month / day parts, so
SELECT * FROM table WHERE date_id >= 2020-01-01
effectively becomes
SELECT * FROM table WHERE date_id >= 2018
Many databases store datetimes as seconds from the epoch (1970-01-01 00:00:00), and 2018 seconds from the epoch is Thursday, January 1, 1970 at 00:33:38. Hence nearly all of your data is returned, as if you had written:
SELECT * FROM table WHERE date_id >= '1970-01-01 00:33:38'
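You can see this arithmetic interpretation directly; here is a quick sketch using Python's built-in sqlite3 module with an in-memory database (purely for illustration, not your actual database):
import sqlite3

conn = sqlite3.connect(":memory:")

# Unquoted, the "date" is plain integer subtraction: 2020 - 01 - 01
print(conn.execute("SELECT 2020-01-01").fetchone())    # (2018,)

# Quoted, it stays a string literal in the expected date format
print(conn.execute("SELECT '2020-01-01'").fetchone())  # ('2020-01-01',)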
As you learned, quoting literal date values works, but an even better solution is parameterization, which pandas.read_sql supports. To be clear, f-strings are not parameterization but simply a newer form of string interpolation. Also note that the parameter placeholder differs by DB-API driver, and the placeholder should not be quoted or combined with arithmetic or other operations.
# cx_Oracle
df = pd.read_sql_query("SELECT * FROM table WHERE date_id >= :1",
                       my_connection, params=[date_filter])

# psycopg2, pymysql, pymssql
df = pd.read_sql_query("SELECT * FROM table WHERE date_id >= %s",
                       my_connection, params=[date_filter])

# pyodbc, sqlite3, ibm_db, jaydebeapi
df = pd.read_sql_query("SELECT * FROM table WHERE date_id >= ?",
                       my_connection, params=[date_filter])
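As a minimal end-to-end sketch with the built-in sqlite3 driver (the table and column names here are made up for illustration):
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (date_id TEXT, value INTEGER)")
conn.executemany("INSERT INTO my_table VALUES (?, ?)",
                 [("2019-12-31", 1), ("2020-01-15", 2), ("2020-02-01", 3)])

date_filter = "2020-01-01"
df = pd.read_sql_query("SELECT * FROM my_table WHERE date_id >= ?",
                       conn, params=[date_filter])
print(df)  # only the rows on or after 2020-01-01
Because the dates are stored as ISO-8601 strings, the string comparison here also matches chronological order.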
I am using pyspark.sql in a standalone Python program to run a query against VERSION 0 of a table stored on Databricks.
I can return a DataFrame using the following code, but I cannot seem to access the value (which is the int 5 in this case):
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
table_diff_df = spark.sql("select count(*) as result from (select * from myTableName version as of 0 limit 5)")
logger.console(type(table_diff_df))
logger.console(table_diff_df[0][0])
logger.console(type(table_diff_df[0]['result']))
logger.console(table_diff_df)
logger.console("tablediff as str :" + str(table_diff_df))
output 1
<class 'pyspark.sql.dataframe.DataFrame'>
Column<b'result[0]'>
<class 'pyspark.sql.column.Column'>
DataFrame[result: bigint]
tablediff as str :DataFrame[result: bigint]
By adjusting my query to the following (appending .collect()), I have been able to get the value of 5 as required (however, I had to remove VERSION AS OF 0):
table_diff_df = spark.sql("select count(*) as result from (select * from myTableName limit 5)").collect()
logger.console(type(table_diff_df))
logger.console(table_diff_df[0][0])
output 2
<class 'list'>
5
In my case I MUST run the query against version 0 of the table, but when I add that back into my query as shown below, I get the following error:
table_diff_df = spark.sql("select count(*) as result from (select * from myTableName Version as of 0 limit 5)").collect()
logger.console(table_diff_df[0][0])
output 3
Time travel is only supported for Delta tables
Is there a simple way to access the value using the DataFrame shown in the first code snippet (output 1)? And if not, how can I get around the "Time travel is only supported for Delta tables" problem? The table that I am querying is a Delta table, but I believe calling .collect() is converting it directly to a list (output 2).
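For the first part of the question (pulling the scalar out of the DataFrame from output 1), here is a minimal sketch of the usual access patterns; note that Spark evaluates lazily, so errors like the time-travel one only surface when an action such as first() or collect() actually runs the query:
# Sketch: extract the single aggregate value from the one-row DataFrame
row = table_diff_df.first()            # Row(result=5)
count_value = row["result"]            # or row[0]

# Equivalent via collect(), which returns a list of Rows
count_value = table_diff_df.collect()[0]["result"]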
Below is my query; note the column names from the table structure that it uses:
cal_avg_latency = spark.sql("SELECT UnitType, ROUND(AVG(TIMESTAMP_DIFF(OnSceneDtTmTS, ReceivedDtTmTS, MINUTE)), 2) as latency, count(*) as total_count FROM `SFSC_Incident_Census_view` WHERE EXTRACT(DATE from ReceivedDtTmTS) == EXTRACT(DATE from OnSceneDtTmTS) GROUP BY UnitType ORDER BY latency ASC")
Error:
ParseException: "\nmismatched input 'FROM' expecting <EOF>(line 1, pos 122)\n\n== SQL ==\nSELECT UnitType, ROUND(AVG(TIMESTAMP_DIFF(OnSceneDtTmTS, ReceivedDtTmTS, MINUTE)), 2) as latency, count(*) as total_count FROM SFSC_Incident_Census_view WHERE EXTRACT((DATE FROM ReceivedDtTmTS) == EXTRACT(DATE FROM OnSceneDtTmTS)) GROUP BY UnitType ORDER BY latency ASC\n--------------------------------------------------------------------------------------------------------------------------^^^\n"
The error points at the WHERE condition, but even my TIMESTAMP_DIFF function is not working:
cal_avg_latency = spark.sql("SELECT UnitType, ROUND(AVG(TIMESTAMP_DIFF(OnSceneDtTmTS, ReceivedDtTmTS, MINUTE)), 2) as latency, count(*) as total_count FROM SFSC_Incident_Census_view GROUP BY UnitType ORDER BY latency ASC")
Error :
AnalysisException: "Undefined function: 'TIMESTAMP_DIFF'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.; line 1 pos 27"
The error message seems pretty clear. Hive doesn't have a TIMESTAMP_DIFF function.
If your columns are already appropriately cast as a timestamp type, you can subtract them directly. Otherwise, you can cast them explicitly and take the difference:
SELECT ROUND(AVG(MINUTE(CAST(OnSceneDtTmTS AS timestamp) - CAST(ReceivedDtTmTS AS timestamp))), 2) AS latency
I have solved the problem using a PySpark query.
from pyspark.sql import functions as F

timeFmt = "yyyy-MM-dd'T'HH:mm:ss.SSS"

# Difference between the two timestamps, in seconds
timeDiff = (F.unix_timestamp('OnSceneDtTmTS', format=timeFmt)
            - F.unix_timestamp('ReceivedDtTmTS', format=timeFmt))
FSCDataFrameTsDF = FSCDataFrameTsDF.withColumn("Duration", timeDiff)

# Convert seconds to minutes and round for further use
FSCDataFrameTsDF = FSCDataFrameTsDF.withColumn("Duration_minutes",
                                               F.round(FSCDataFrameTsDF.Duration / 60.0))
I have a database that includes 440 days of several series with a sampling time of 5 seconds. There is also missing data.
I want to calculate the daily average. The way I am doing it now is that I make 440 queries and do the averaging afterward. But, this is very time consuming since for every query the whole database is searched for related entries. I imagine there must be a more efficient way of doing this.
I am doing this in Python, and I am just learning SQL. Here's the query section of my code:
import datetime
import numpy

time_cur = date_begin
Data = numpy.zeros(shape=(N, NoFields - 1))
X = []
nN = 0
while time_cur < date_end:
    X.append(time_cur)
    cur = con.cursor()
    cur.execute("SELECT * FROM os_table "
                "WHERE EXTRACT(year FROM datetime_)=%s "
                "AND EXTRACT(month FROM datetime_)=%s "
                "AND EXTRACT(day FROM datetime_)=%s",
                (time_cur.year, time_cur.month, time_cur.day))
    # Accumulate the day's rows and average them
    Y = numpy.array([0] * (NoFields - 1))
    n = 0.0
    while True:
        row = cur.fetchone()
        if row is None:
            break
        n = n + 1
        Y = Y + numpy.array(row[1:])
    if n > 0:
        Data[nN][:] = Y / n
    nN = nN + 1
    time_cur = time_cur + datetime.timedelta(days=1)
And, my data looks like this:
datetime_,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11,c12,c13,c14
2012-11-13-00:07:53,42,30,0,0,1,9594,30,218,1,4556,42,1482,42
2012-11-13-00:07:58,70,55,0,0,2,23252,55,414,2,2358,70,3074,70
2012-11-13-00:08:03,44,32,0,0,0,11038,32,0,0,5307,44,1896,44
2012-11-13-00:08:08,36,26,0,0,0,26678,26,0,0,12842,36,1141,36
2012-11-13-00:08:13,33,26,0,0,0,6590,26,0,0,3521,33,851,33
I appreciate your suggestions.
Thanks
Iman
I don't know the numpy functions, so I don't understand exactly what you are averaging; it would help if you showed your table and the logic to get the average.
But this is how to get a daily average per column:
import psycopg2

conn = psycopg2.connect('host=localhost4 port=5432 dbname=cpn')
cursor = conn.cursor()
cursor.execute('''
    select
        datetime_::date as day,
        avg(c1) as c1_average,
        avg(c2) as c2_average
    from os_table
    where datetime_ between %s and %s
    group by 1
    order by 1
    ''',
    (date_begin, date_end)
)
rs = cursor.fetchall()
conn.close()

for day in rs:
    print(day[0], day[1], day[2])
This answer uses SQL Server syntax; I am not sure how different PostgreSQL is, but it should be fairly similar. You may find that things like the DATEADD, DATEDIFF and CONVERT statements differ (almost certainly the CONVERT statement; just convert the date to a varchar instead, since I am only using it as a report name, so it is not vital). You should be able to follow the theory of this, even if the code doesn't run in PostgreSQL without tweaking.
First Create a Reports Table ( you will use this to link to the actual table you want to report on )
CREATE TABLE Report_Periods (
report_name VARCHAR(30) NOT NULL PRIMARY KEY,
report_start_date DATETIME NOT NULL,
report_end_date DATETIME NOT NULL,
CONSTRAINT date_ordering
CHECK (report_start_date <= report_end_date)
)
Next, populate the reports table with the dates you need to report on. There are many ways to do this; the method I've chosen here only uses the days you need, but you could create it with all the dates you are ever likely to use, so you only have to do it once.
INSERT INTO Report_Periods (report_name, report_start_date, report_end_date)
SELECT CONVERT(VARCHAR, [DatePartOnly], 0) AS DateName,
[DatePartOnly] AS StartDate,
DATEADD(ms, -3, DATEADD(dd,1,[DatePartOnly])) AS EndDate
FROM ( SELECT DISTINCT DATEADD(DD, DATEDIFF(DD, 0, datetime_), 0) AS [DatePartOnly]
FROM os_table ) AS M
Note that in SQL Server the smallest DATETIME increment is about 3 milliseconds, so the statement above adds 1 day and then subtracts 3 milliseconds to create start and end datetimes for each day. Again, PostgreSQL may use different values.
This means you can link the Report_Periods table back to your os_table to get averages, counts etc. very simply:
SELECT AVG(value) AS AvgValue, COUNT(value) AS NumValues, R.report_name
FROM os_table AS T
JOIN Report_Periods AS R ON T.datetime_>= R.report_start_date AND T.datetime_<= R.report_end_date
GROUP BY R.report_name
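For PostgreSQL specifically (the answer above is SQL Server syntax), here is a hedged sketch of the same one-pass idea from Python, using generate_series to build the daily periods; the connection details, date range, and table/column names are illustrative and will likely need adjusting:
import datetime
import psycopg2

date_begin = datetime.date(2012, 11, 13)   # illustrative range
date_end = datetime.date(2014, 1, 27)

conn = psycopg2.connect('host=localhost port=5432 dbname=cpn')
cur = conn.cursor()
cur.execute("""
    SELECT d::date AS report_name,
           AVG(t.c1) AS c1_average,
           COUNT(*)  AS num_values
    FROM generate_series(%s::timestamp, %s::timestamp, interval '1 day') AS d
    JOIN os_table AS t
      ON t.datetime_ >= d
     AND t.datetime_ < d + interval '1 day'
    GROUP BY 1
    ORDER BY 1
""", (date_begin, date_end))
rows = cur.fetchall()
conn.close()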
I'm trying to debug a SQL statement generated with the sqlite3 Python module...
c.execute("SELECT * FROM %s WHERE :column = :value" % Photo.DB_TABLE_NAME, {"column": column, "value": value})
It is returning no rows when I do a fetchall()
When I run this directly on the database
SELECT * FROM photos WHERE album_id = 10
I get the expected results.
Is there a way to see the constructed query to see what the issue is?
To actually answer your question, you can use the set_trace_callback method of the connection object to attach the print function; this will make all queries be printed when they are executed. Here is an example in action:
# Import and connect to database
import sqlite3
conn = sqlite3.connect('example.db')
# This attaches the tracer
conn.set_trace_callback(print)
# Get the cursor, execute some statement as an example
c = conn.cursor()
c.execute("CREATE TABLE stocks (symbol text)")
t = ('RHAT',)
c.execute("INSERT INTO stocks VALUES (?)", t)
c.execute('SELECT * FROM stocks WHERE symbol=?', t)
print(c.fetchone())
This produces the output:
CREATE TABLE stocks (symbol text)
BEGIN
INSERT INTO stocks VALUES ('RHAT')
SELECT * FROM stocks WHERE symbol='RHAT'
('RHAT',)
The problem here is that string values are automatically wrapped in single quotes; you cannot dynamically insert column names that way.
Concerning your question, I'm not sure about sqlite3, but in MySQLdb you can get the final query as something like (I am currently not at a computer to check):
statement % conn.literal(query_params)
You can only use substitution parameters for row values, not column or table names.
Thus, the :column in SELECT * FROM %s WHERE :column = :value is not allowed; it gets bound as the string 'album_id' rather than a column reference, which is why no rows match.
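A hedged sketch of the usual workaround, assuming a small known set of columns (the names here are illustrative): validate the column name in Python, interpolate only that, and keep the value as a bound parameter:
import sqlite3

ALLOWED_COLUMNS = {"album_id", "title"}   # whichever columns you expect to filter on

def fetch_photos(conn, column, value, table="photos"):
    if column not in ALLOWED_COLUMNS:
        raise ValueError(f"unexpected column: {column}")
    # The identifier is interpolated only after validation; the value stays parameterized
    query = f"SELECT * FROM {table} WHERE {column} = ?"
    return conn.execute(query, (value,)).fetchall()

# e.g. fetch_photos(sqlite3.connect('photos.db'), "album_id", 10)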