How to use a SELECT query inside a Python UDF for Redshift? - python

I tried uploading modules to Redshift through S3, but it always says no module found. Please help.
CREATE OR REPLACE FUNCTION olus_layer(subs_no varchar)
RETURNS varchar VOLATILE AS
$$
import plpydbapi

dbconn = plpydbapi.connect()
cursor = dbconn.cursor()
cursor.execute("SELECT count(*) FROM busobj_group.olus_usage_detail")
d = cursor.fetchall()
dbconn.close()
return d
$$
LANGUAGE plpythonu;

You cannot do this in Redshift, so you will need to find another approach.
1) See the UDF constraints here: http://docs.aws.amazon.com/redshift/latest/dg/udf-constraints.html
2) See the Python language support page: http://docs.aws.amazon.com/redshift/latest/dg/udf-python-language-support.html
especially this part:
Important: Amazon Redshift blocks all network access and write access to the file system through UDFs.
This means that even if you try to get around the restriction, it won't work!
If you don't know an alternative way to get what you need, you should ask a new question specifying exactly what your challenge is and what you have tried (leave this question and answer here for future reference by others).

You can't connect to the database from inside a UDF. Python functions in Redshift are scalar, meaning they take one or more values and return only one output value.
However, if you want to execute a function against a set of rows, try using the LISTAGG function to build an array of values or objects (if you need multiple properties) into one large string (beware of the string size limitation), pass it to the UDF as a parameter, and parse/loop over it inside the function.
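For illustration, a hedged sketch of that pattern against the table from the question (the function name and the comma delimiter are invented):
CREATE OR REPLACE FUNCTION f_count_distinct_subs(subs_list varchar(max))
RETURNS int IMMUTABLE AS
$$
    # subs_list arrives as one comma-delimited string built by LISTAGG
    if subs_list is None:
        return 0
    return len(set(subs_list.split(',')))
$$ LANGUAGE plpythonu;

SELECT f_count_distinct_subs(
           LISTAGG(subs_no, ',') WITHIN GROUP (ORDER BY subs_no))
FROM busobj_group.olus_usage_detail;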

Amazon has recently announced support for stored procedures in Redshift. Unlike a user-defined function (UDF), a stored procedure can incorporate data definition language (DDL) and data manipulation language (DML) in addition to SELECT queries. It also supports looping and conditional expressions to control logical flow.
https://docs.aws.amazon.com/redshift/latest/dg/stored-procedure-overview.html
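A minimal sketch of such a procedure against the table from the original question (the procedure name is invented; Redshift procedures are written in PL/pgSQL):
CREATE OR REPLACE PROCEDURE olus_layer_count(INOUT row_count int)
AS $$
BEGIN
  SELECT count(*) INTO row_count FROM busobj_group.olus_usage_detail;
END;
$$ LANGUAGE plpgsql;

-- The INOUT argument comes back as a one-row result set:
CALL olus_layer_count(0);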

Related

Getting error when running a sql select statement in python

I am new to this and trying to learn Python. I wrote a select statement in Python where I used a parameter:
Select """cln.customer_uid = """[(num_cuid_number)])
TypeError: string indices must be integers
Agree with the others, this doesn't really look like Python by itself.
Even without seeing the rest of that code, I'll guess that the [(num_cuid_number)] value being returned is a string, so you'll want to convert it to an integer for the select statement to process.
num_cuid_number is most likely a string in your code, and the "string indices" in the error are the ones in the square brackets: you are indexing into the string literal itself. So first check your variable to see what you actually received there; it should be an integer value.
Let me give you an example of Python code to execute (just for reference: I have used SQLAlchemy with Flask):
@app.route('/get_data/')
def get_data():
    base_sql = """
        SELECT cln.customer_uid = '%s' FROM cln
    """ % (num_cuid_number)
    data = db.session.execute(base_sql).fetchall()
Pretty sure you are trying to create a select statement with a "where" clause here. There are many ways to do this; for example, using raw SQL, the query should look similar to this:
query = "SELECT * FROM cln WHERE customer_uid = %s"
parameters = (num_cuid_number,)
Separating the parameters from the query keeps it safe from SQL injection. You can then take these two variables and execute them with your db engine, like
results = db.execute(query, parameters)
This will work; however, especially in Python, it is more common to use a package like SQLAlchemy to make queries more "flexible" (in other words, without manually constructing the query as an actual string). You can do the same thing using SQLAlchemy Core functionality:
query = cln.select()
query = query.where(cln.customer_uid == num_cuid_number)
results = db.execute(query)
Note: I simplified "db" in both examples, you'd actually use a cursor, session, engine or similar to execute your queries, but that wasn't your question.
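For completeness, a self-contained sketch of that Core approach (assuming SQLAlchemy 1.4+; the table definition and the in-memory SQLite engine are stand-ins for your real schema and engine):
from sqlalchemy import Column, Integer, MetaData, String, Table, create_engine, select

engine = create_engine("sqlite:///:memory:")
metadata = MetaData()
cln = Table(
    "cln",
    metadata,
    Column("customer_uid", Integer),
    Column("name", String),
)
metadata.create_all(engine)

num_cuid_number = 42
query = select(cln).where(cln.c.customer_uid == num_cuid_number)
with engine.connect() as conn:
    results = conn.execute(query).fetchall()
    print(results)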

Add a function with multiple return values to SQLite in Python

I want to add a function to SQLite in Python (as explained here).
My function returns multiple values. In Python, I can access the different return values by indexing (using []).
However, it seems indexing does not work in SQLite; in other words, the following SELECT statement raises an error:
SELECT my_function(table1.column1)[0] FROM table1;
sqlite3.OperationalError: user-defined function raised exception
Is there any way to access to different return values in SQLite?
The only way to return multiple values from a function is with a table-valued function, which requires creating a virtual table module, which is not possible with only the default Python SQLite driver.
There are additional Python modules to allow this, for example, sqlite-vtfunc.
One way to do that is to have the function return a string with the multiple values encoded as a JSON array and then use SQLite JSON extraction functions to access the individual values:
select json_extract(fout, '$[0]') as v0,
       json_extract(fout, '$[1]') as v1
from (select my_func(cols...) as fout
      from ...)
I have used that solution myself and performance is ok. The JSON encoding/decoding doesn't seem to introduce a noticeable performance penalty.
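As a sketch of that JSON approach (assuming a SQLite build with the JSON1 functions; the function body, table, and values are invented):
import json
import sqlite3

def my_function(value):
    # Pack the multiple return values into a single JSON array string.
    return json.dumps([value * 2, value * 3])

conn = sqlite3.connect(":memory:")
conn.create_function("my_function", 1, my_function)
conn.execute("CREATE TABLE table1 (column1 INTEGER)")
conn.execute("INSERT INTO table1 VALUES (1), (2)")

# Unpack the individual values in SQL with json_extract.
rows = conn.execute("""
    SELECT json_extract(fout, '$[0]') AS v0,
           json_extract(fout, '$[1]') AS v1
    FROM (SELECT my_function(column1) AS fout FROM table1)
""").fetchall()
print(rows)  # [(2, 3), (4, 6)]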

Memory efficient way of fetching postgresql uniqueue dates?

I have a database with roughly 30 million entries, which is a lot, and I don't expect anything but trouble working with it.
But using py-postgresql and the .prepare() statement, I would hope I could fetch entries on a "yield" basis and thus avoid filling up my memory with the full result set from the database, which apparently I can't?
This is what I've got so far:
import postgresql
user = 'test'
passwd = 'test'
db = postgresql.open('pq://' + user + ':' + passwd + '@192.168.1.1/mydb')
results = db.prepare("SELECT time FROM mytable")
uniqueue_days = []
with db.xact():
    for row in results():
        if not row['time'] in uniqueue_days:
            uniqueue_days.append(row['time'])
print(uniqueue_days)
Before even getting to if not row['time'] in uniqueue_days: I run out of memory, which isn't so strange considering results() probably fetches all results before looping through them?
Is there a way to get the py-postgresql library to "page" or batch down the results, say 60k per round, or perhaps even rework the query to do more of the work?
Thanks in advance!
Edit: I should mention the dates in the database are Unix timestamps, and I intend to convert them into %Y-%m-%d format prior to adding them to the uniqueue_days list.
If you were using the better-supported psycopg2 extension, you could use a loop over the client cursor, or fetchone, to get just one row at a time, as psycopg2 uses a server-side portal to back its cursor.
If py-postgresql doesn't support something similar, you could always explicitly DECLARE a cursor on the database side and FETCH rows from it progressively. I don't see anything in the documentation that suggests py-postgresql can do this for you automatically at the protocol level like psycopg2 does.
Usually you can switch between database drivers pretty easily, but py-postgresql doesn't seem to follow the Python DB-API, so testing it will take a few more changes. I still recommend the switch.
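A minimal sketch of the psycopg2 route, reusing the connection details from the question (the named cursor is what makes it server-side, so rows stream in batches instead of arriving all at once):
import psycopg2

conn = psycopg2.connect("host=192.168.1.1 dbname=mydb user=test password=test")
cur = conn.cursor(name="unique_days")  # a *named* cursor is server-side
cur.itersize = 60000                   # rows fetched per network round trip
cur.execute("SELECT time FROM mytable")

unique_days = set()
for row in cur:  # streams from the server-side portal
    unique_days.add(row[0])
conn.close()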
You could let the database do all the heavy lifting.
For example: instead of reading all the data into Python and then calculating unique dates, why not try something like this:
SELECT DISTINCT DATE(to_timestamp(time)) AS UNIQUE_DATES FROM mytable;
If you want to strictly enforce sort order on the unique dates returned, then do the following:
SELECT DISTINCT DATE(to_timestamp(time)) AS UNIQUE_DATES
FROM mytable
order by 1;
Useful references for the functions used above:
Date/Time Functions and Operators
Data Type Formatting Functions
If you would like to read the data in chunks, you could use the dates you get from the above query to subset your results further down the line.
Ex:
SELECT * FROM mytable WHERE time BETWEEN %s AND %s;
where the two placeholders are UNIQUE_DATES[i] and UNIQUE_DATES[j], parameters you would pass from Python (a short sketch follows).
I will leave it to you to figure out how to convert the dates back into Unix timestamps.
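A sketch of that chunked read with bound parameters instead of string concatenation (cur is a DB-API cursor such as the psycopg2 one above; unique_dates is assumed to be a sorted list of the DISTINCT dates from the earlier query; process() is a hypothetical row handler):
chunk_query = """
    SELECT *
    FROM mytable
    WHERE DATE(to_timestamp(time)) BETWEEN %s AND %s
"""
for lo, hi in zip(unique_dates, unique_dates[1:]):
    cur.execute(chunk_query, (lo, hi))  # the driver binds the values safely
    for row in cur:
        process(row)  # hypothetical per-row handler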

ODBC return values from iSeries on Linux

This is using pyodbc.
Okay, let's say I create a procedure on the iSeries using something like this:
CREATE PROCEDURE MYLIB.MYSQL(IN WCCOD CHAR ( 6), INOUT WPLIN CHAR (3))
RESULT SETS 1 LANGUAGE CL NOT DETERMINISTIC
CONTAINS SQL EXTERNAL NAME MYLIB.MYCL PARAMETER STYLE GENERAL
Then in Python I do something like this:
cust = 'ABCDEF'
line = '123'
sql = "CALL MYLIB.MYSQL('%(cust)s',?)" % vars()
values = (line,)
c = pyodbc.connect('DSN=' + system + ';CMT=0;NAM=0')
cursor = c.cursor()
cursor.execute(sql, values)
Nothing in the variables shows the return value. The sense I get from seeing comparable code in other languages (i.e. .NET) is that the ODBC "variable" is defined, then updated with the return value, but in this case neither line nor values is changed.
I realize one alternative is to have the CL program write the result to a file and then read the file, but that seems like an extra step that requires maintenance, never mind added complexity.
Has anyone ever made this work?
First, you won't get any results at all if you don't fetch them. When using pyODBC (or practically any other package adhering to Python Database API Specification v2.0), you have a couple of choices to do this. You can explicitly call one of the fetch methods, such as
results = cursor.fetchall()
after which the result set will be in results (where each result is a tuple and results is a list of these tuples). Or you can iterate directly over the cursor (which is a bit like repeatedly calling the .fetchone() method):
for row in cursor:
# Do stuff with row here
# Each time through the loop gets another row
# When there are no more results, the loop ends
Now, whether you explicitly fetch or you use Python's looping mechanism, you receive a brand-new collection of values, accessed by the names you chose to receive them (results in my first example, row in my second). You can't designate Python variables to be updated directly with individual values.
Besides the Python DB API 2.0 spec mentioned above, you'll probably want to read up on pyODBC features, and particularly on the way it handles stored procedures.
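For reference, a hedged sketch of that calling pattern (the DSN and procedure name echo the question, where system is the asker's DSN variable; note that pyodbc does not write INOUT results back into the bound Python variables, so anything you need must come back as a result set):
import pyodbc

conn = pyodbc.connect('DSN=' + system + ';CMT=0;NAM=0')
cursor = conn.cursor()
# Both parameters go in as '?' markers; results come back as rows.
cursor.execute("{CALL MYLIB.MYSQL(?, ?)}", ('ABCDEF', '123'))
for row in cursor.fetchall():
    print(row)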
Holy cow, I got it working, but it takes two procedures: one to call the CL, one to run the SQL statement.
For example (first procedure, this is what calls the CL program that returns a value in WPLIN):
CREATE PROCEDURE MYLIB.MYSQLA(IN WCCOD CHAR ( 6), INOUT WPLIN CHAR (3))
RESULT SETS 1 LANGUAGE CL NOT DETERMINISTIC
CONTAINS SQL EXTERNAL NAME MYLIB.MYCL PARAMETER STYLE GENERAL
Second procedure (will call the first, THIS is the procedure we call from ODBC):
CREATE PROCEDURE MYLIB.MYSQLB(IN WCCOD CHAR ( 6), INOUT WPLIN CHAR (3))
DYNAMIC RESULT SETS 1 LANGUAGE SQL
BEGIN
DECLARE C1 CURSOR WITH RETURN TO CLIENT
FOR
SELECT WPLIN FROM DUMMYLIB.DUMMYFILE;
CALL MYLIB.MYSQLA(WCCOD,WPLIN);
OPEN C1;
END
Then from an ODBC connection, we simply execute this:
customer = 'ABCDEF'
line='ABC'
sql = "{CALL MYLIB.MYSQLB('%(customer)s','%(line)s')}" % vars()
cursor.execute(sql)
print(cursor.fetchone())
Et voila!
A caveat: The "DUMMYLIB/DUMMYFILE" are a single record physical file I created with a single byte column. It's only used for reference (unless there's a better way?) and it doesn't matter what's in it.
Maybe a bit clumsy, but it works! If anyone knows a way to combine these into a single procedure that would nice!

Python + Sqlite 3. How to construct queries?

I'm trying to create a Python script that constructs valid sqlite queries. I want to avoid SQL injection, so I cannot use '%s'. I've found how to execute queries, cursor.execute('sql ?', (param,)), but what I want is to get the final SQL with the parameter filled in. It's not a problem if I have to execute the query first in order to obtain the last query executed.
If you're trying to transmit changes to the database to another computer, why do they have to be expressed as SQL strings? Why not pickle the query string and the parameters as a tuple, and have the other machine also use SQLite parameterization to query its database?
If you're not after just parameter substitution, but full construction of the SQL, you have to do that using string operations on your end. The ? replacement always just stands for a value. Internally, the SQL string is compiled to SQLite's own bytecode (you can find out what it generates with EXPLAIN <your sql>), and ? replacements are done by just storing the value at the correct place in the value stack; varying the query structurally would require different bytecode, so just replacing a value wouldn't be enough.
Yes, this does mean you have to be ultra-careful. If you don't want to allow updates, try opening the DB connection in read-only mode.
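A minimal sketch of that pickling idea (the stocks table is invented; only unpickle data from a source you trust):
import pickle
import sqlite3

# Sending side: never render the values into the SQL string.
payload = pickle.dumps(("SELECT * FROM stocks WHERE symbol = ?", ("hello",)))

# Receiving side: bind the parameters locally with the DB-API.
sql, params = pickle.loads(payload)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stocks (symbol TEXT, price REAL)")
conn.execute("INSERT INTO stocks VALUES ('hello', 1.0)")
print(conn.execute(sql, params).fetchall())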
Use the DB-API’s parameter substitution. Put ? as a placeholder wherever you want to use a value, and then provide a tuple of values as the second argument to the cursor’s execute() method.
# Never do this -- insecure!
symbol = 'hello'
c.execute("SELECT * FROM stocks WHERE symbol = '%s'" % symbol)
# Do this instead
t = (symbol,)
c.execute('SELECT * FROM stocks WHERE symbol=?', t)
print(c.fetchone())
More reference is in the manual.
I want how to get the parsed 'sql param'.
It's all open source, so you have full access to the code doing the parsing/sanitization. Why not just read that code to find out how it works, and whether there's some (possibly undocumented) implementation you can reuse?
