How to pass variables in Spark SQL, using Python?

I am writing Spark code in Python.
How do I pass a variable in a spark.sql query?
q25 = 500
Q1 = spark.sql("SELECT col1 from table where col2>500 limit $q25 , 1")
Currently the above code does not work. How do we pass variables?
I have also tried,
Q1 = spark.sql("SELECT col1 from table where col2>500 limit q25='{}' , 1".format(q25))

You need to remove the single quotes and the q25= from the string formatting, like this:
Q1 = spark.sql("SELECT col1 from table where col2>500 limit {}, 1".format(q25))
Update:
Based on your new queries:
spark.sql("SELECT col1 from table where col2>500 order by col1 desc limit {}, 1".format(q25))
Note that Spark SQL does not support OFFSET, so the LIMIT {}, 1 form of the query cannot work.
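If you do need offset-like behaviour, one possible workaround (a sketch, not part of the original answer, assuming the same table and columns) is a row_number() window in the SQL string, with the variable still injected via str.format:
q25 = 500
Q1 = spark.sql("""
    SELECT col1 FROM (
        SELECT col1, row_number() OVER (ORDER BY col1 DESC) AS rn
        FROM table
        WHERE col2 > 500
    ) ranked
    WHERE rn = {} + 1
""".format(q25))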
If you need to add multiple variables, you can try it this way:
q25 = 500
var2 = 50
Q1 = spark.sql("SELECT col1 from table where col2>{0} limit {1}".format(var2,q25))

Another option, if you're doing this sort of thing often or want to make your code easier to re-use, is to use a dict of configuration variables together with str.format:
configs = {"q25": 10,
           "TABLE_NAME": "my_table",
           "SCHEMA": "my_schema"}
Q1 = spark.sql("""SELECT col1 from {SCHEMA}.{TABLE_NAME}
                  where col2>500
                  limit {q25}
               """.format(**configs))

Using f-Strings approach (PySpark):
table = 'my_schema.my_table'
df = spark.sql(f'select * from {table}')
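The same f-string approach works for non-string values too, e.g. reusing the q25 limit from earlier (a small sketch combining both variables):
q25 = 500
table = 'my_schema.my_table'
df = spark.sql(f"SELECT col1 FROM {table} WHERE col2 > 500 LIMIT {q25}")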

A really easy solution is to store the query as a string (using the usual python formatting), and then pass it to the spark.sql() function:
q25 = 500
query = "SELECT col1 from table where col2>500 limit {}".format(q25)
Q1 = spark.sql(query)

All you need to do (in Scala) is add s (the string interpolator) before the string. This allows you to use a variable directly inside the string.
val q25 = 10
val Q1 = spark.sql(s"SELECT col1 from table where col2>500 limit $q25")

Related

Retrieve data from SQL Server database using Python and PYODBC

I have Python code that connects to a SQL Server database using pyodbc, and Streamlit to create a web app.
The problem is that when I try to perform a select query with multiple conditions, the result is empty, whereas it should return records.
If I run the SQL query directly on the database it returns the expected result:
SELECT TOP (200) ID, first, last
FROM t1
WHERE (first LIKE '%tes%') AND (last LIKE '%tesn%')
whereas the same query from Python returns nothing:
sql="select * from testDB.dbo.t1 where ID = ? and first LIKE '%' + ? + '%' and last LIKE '%' + ? + '%' "
param0 = vals[0]
param1=f'{vals[1]}'
param2=f'{vals[2]}'
rows = cursor.execute(sql, param0,param1,param2).fetchall()
Code:
import pandas as pd
import streamlit as st

vals = []
expander_advanced_search = st.beta_expander('Advanced Search')
with expander_advanced_search:
    for i, col in enumerate(df.columns):
        val = st_input_update("search for {}".format(col))
        expander_advanced_search.markdown(val, unsafe_allow_html=True)
        vals.append(val)
if st.form_submit_button("search"):
    if len(vals) > 0:
        sql = 'select * from testDB.dbo.t1 where ID = ? and first LIKE ? and last LIKE ? '
        param0 = vals[0]
        param1 = f'%{vals[1]}%'
        param2 = f'%{vals[2]}%'
        rows = cursor.execute(sql, param0, param1, param2).fetchall()
        df = pd.DataFrame.from_records(rows, columns=[column[0] for column in cursor.description])
        st.dataframe(df)
Based on the suggestion of Dale K, I used the OR operator in the select query:
sql="select * from testDB.dbo.t1 where ID = ? OR first LIKE ? or last LIKE ? "
param0 = vals[0] # empty
param1=f'%{vals[1]}%' # nabi
param2=f'%{vals[2]}%' # empty
rows = cursor.execute(sql, param0,param1,param2).fetchall()
The displayed result:
all the records in the database
The expected result:
id first last
7 nabil jider
I think this is probably in your parameters - your form is only submitting first/last values, but your query says ID=?
You're not providing an ID from the form so there are no results. Or it's putting the value from the 'first' input into vals[0] and the resulting query is looking for an ID = 'tes'.
Also, look into pd.read_sql() to pipe query results directly into a DataFrame.
An OR statement might be what you're after if you want each clause treated separately:
where ID = ? or first LIKE ? or last LIKE ?
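For the pd.read_sql route, a minimal sketch (assuming an open pyodbc connection named conn and the same vals list as above):
import pandas as pd

sql = "select * from testDB.dbo.t1 where ID = ? or first LIKE ? or last LIKE ?"
params = [vals[0], f'%{vals[1]}%', f'%{vals[2]}%']
df = pd.read_sql(sql, conn, params=params)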

Convert SAS proc sql to Python(pandas)

I am rewriting some code from SAS to Python using the Pandas library.
I've got the following code, and I have no idea what I should do with it.
Can you help me, because it's too complicated for me to do correctly? I've changed the column names (to hide sensitive data).
This is SAS code:
proc sql;
create table &work_lib..opk_do_inf_4 as
select distinct
*,
min(kat_opk) as opk_do_inf,
count(nr_ks) as ilsc_opk_do_kosztu_infr
from &work_lib..opk_do_inf_3
group by kod_ow, kod_sw, nr_ks, nr_ks_pr, nazwa_zabiegu_icd_9, nazwa_zabiegu
having kat_opk = opk_do_inf
;
quit;
This is my attempt in Pandas:
df = self.opk_do_inf_3()  # create DF using another function
df['opk_do_inf'] = df.groupby(by=['kod_ow', 'kod_sw', 'nr_ks', 'nr_ks_pr', 'nazwa_zabiegu_icd_9', 'nazwa_zabiegu'])['kat_opk'].min()
df['ilsc_opk_do_kosztu_infr'] = df.groupby(by=['kod_ow', 'kod_sw', 'nr_ks', 'nr_ks_pr', 'nazwa_zabiegu_icd_9', 'nazwa_zabiegu'])['nr_ks'].count()
df_groupby = df.groupby(by=['kod_ow', 'kod_sw', 'nr_ks', 'nr_ks_pr', 'nazwa_zabiegu_icd_9', 'nazwa_zabiegu']).filter(lambda x: x['kat_opk']==x['opk_do_inf'])
df = df_groupby.reset_index()
df = df.drop_duplicates()
return df
First, calling SELECT * in an aggregate GROUP BY query is not valid SQL. SAS may allow it, but it can yield unexpected results. Usually the SELECT columns should be limited to the columns in the GROUP BY clause.
With that said, aggregate SQL queries can generally be translated to Pandas with groupby.agg() operations, with WHERE (filter before aggregation) or HAVING (filter after aggregation) conditions handled using either .loc or .query().
SQL
SELECT col1, col2, col3,
MIN(col1) AS min_col1,
AVG(col2) AS mean_col2,
MAX(col3) AS max_col3,
COUNT(*) AS count_obs
FROM mydata
GROUP BY col1, col2, col3
HAVING col1 = min(col1)
Pandas
General
agg_data = (mydata.groupby(["col1", "col2", "col3"], as_index=False)
.agg(min_col1 = ("col1", "min"),
mean_col2 = ("col2", "mean"),
max_col3 = ("col3", "max"),
count_obs = ("col1", "count"))
.query("col1 == min_col1")
)
Specific
opk_do_inf_4 = (mydata.groupby(["kat_opk", "kod_ow", "kod_sw", "nr_ks", "nr_ks_pr",
"nazwa_zabiegu_icd_9", "nazwa_zabiegu"],
as_index=False)
.agg(opk_do_inf = ("kat_opk", "min"),
ilsc_opk_do_kosztu_infr = ("nr_ks", "count"))
.query("kat_opk == opk_do_inf")
)
You can use the sqldf function from the pandasql package to run a SQL query on a dataframe. Example below (pandasql runs the query through SQLite, so use LIMIT rather than TOP):
from pandasql import sqldf
query = "select * from df limit 10"
newdf = sqldf(query, locals())

Python pd.read_sql where clause parameters

Use case:
We have nested queries and our tables have 10 to 20 million rows. Our intention here is to reduce the query CPU time with smarter filtering.
I'd like to filter my columns in pd.read_sql by another data frame's column values. Is that possible?
Step 1: In the df1 data frame, age1 and age3 are my future filter columns for pd.read_sql
raw_data1 = {'age1': [23,45,21],'age2': [10,20,50], 'age3':['forty','fortyone','fortyfour']}
df1 = pd.DataFrame(raw_data1, columns = ['age1','age2','age3'])
df1
Step 2: I'd like to take age1 from the df1 dataframe above and use it in the pd.read_sql below to get the item1 dataframe
item1 = pd.read_sql("""
SELECT * from [dbo].[ITEM]
where item_age1 = df1.age1
""", conn)
Step 3: I'd like to take age3 from the df1 dataframe above and use it in the pd.read_sql below to get the item2 dataframe
item2 = pd.read_sql("""
SELECT * from [dbo].[ITEM]
where item_age3 = df1.age3
""", conn)
Use a parameterized query:
item2 = pd.read_sql("""
SELECT * from [dbo].[ITEM]
where item_age3 IN ({})
""".format(','.join('?'*len(df1.age3))), conn,
params=list(df1.age3))
Depending on the database backend, this syntax may use '%s' or '%(name)s' instead of '?'. See the PEP 249 paramstyle for more information.
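For example, if your connection's driver used the '%s' paramstyle (e.g. psycopg2), the same idea would look like this (a hypothetical sketch; only the placeholder string changes):
placeholders = ','.join(['%s'] * len(df1.age3))
item2 = pd.read_sql("""
    SELECT * from [dbo].[ITEM]
    where item_age3 IN ({})
    """.format(placeholders), conn,
    params=list(df1.age3))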

sqlite selecting multiple tables

I have a database in sqlite with c.300 tables. Currently I am iterating through a list and appending the data.
Is there a faster / more Pythonic way of doing this?
df = []
for i in Ave.columns:
    try:
        df2 = get_mcap(i)
        df.append(df2)
        # print(i)
    except:
        pass
df = pd.concat(df, axis=0)
Ave is a dataframe whose columns form the list I want to iterate through.
def get_mcap(Ticker):
    cnx = sqlite3.connect('Market_Cap.db')
    df = pd.read_sql_query("SELECT * FROM '%s'" % (Ticker), cnx)
    df.columns = ['Date', 'Mcap-Ave', 'Mcap-High', 'Mcap-Low']
    df = df.set_index('Date')
    df.index = pd.to_datetime(df.index)
    cnx.close()
    return df
Before I post my solution, I should include a quick warning that you should never use string manipulation to generate SQL queries unless it's absolutely unavoidable, and in such cases you need to be certain that you are in control of the data which is being used to format the strings and it won't contain anything that will cause the query to do something unintended.
With that said, this seems like one of those situations where you do need to use string formatting, since you cannot pass table names as parameters. Just make sure there's no way that users can alter what is contained within your list of tables.
Onto the solution. It looks like you can get your list of tables using:
tables = Ave.columns.tolist()
For my simple example, I'm going to use:
tables = ['table1', 'table2', 'table3']
Then use the following code to generate a single query:
query_template = 'select * from {}'
query_parts = []
for table in tables:
    query = query_template.format(table)
    query_parts.append(query)
full_query = ' union all '.join(query_parts)
Giving:
'select * from table1 union all select * from table2 union all select * from table3'
You can then simply execute this one query to get your results:
cnx = sqlite3.connect('Market_Cap.db')
df = pd.read_sql_query(full_query, cnx)
Then from here you should be able to set the index, convert to datetime etc, but now you only need to do these operations once rather than 300 times. I imagine the overall runtime of this should now be much faster.
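For example, the per-dataframe post-processing from get_mcap could then be applied once to the combined frame (a sketch, assuming the union preserves the same four columns):
df.columns = ['Date', 'Mcap-Ave', 'Mcap-High', 'Mcap-Low']
df = df.set_index('Date')
df.index = pd.to_datetime(df.index)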

How to pass tuple in read_sql 'where in' clause in pandas python

I am passing a tuple converted to a string in a read_sql method as
sql = "select * from table1 where col1 in " + str(tuple1) + " and col2 in " + str(tuple2)
df = pd.read_sql(sql, conn)
This works fine, but when a tuple has only one value the SQL fails with ORA-00936: missing expression, since a single-element tuple has an extra trailing comma.
For example
tuple1 = (4011,)
tuple2 = (23,24)
The SQL formed is:
select * from table1 where col1 in (4011,) and col2 in (23,24)
ORA-00936: missing expression
Is there any better way of doing this, other than removing the comma with string operations?
Is there a better way to parameterise the read_sql function?
The reason you're getting the error is the SQL syntax.
When you have a WHERE col in (...) list, a trailing comma will cause a syntax error.
Either way, putting values into SQL statements using string concatenation is frowned upon, and will ultimately lead you to more problems down the line.
Most Python SQL libraries will allow for parameterised queries. Without knowing which library you're using to connect, I can't link exact documentation, but the principle is the same for psycopg2:
http://initd.org/psycopg/docs/usage.html#passing-parameters-to-sql-queries
This functionality is also exposed in pd.read_sql, so to achieve what you want safely, you would do this:
sql = "select * from table1 where col1 in %s and col2 in %s"
df = pd.read_sql(sql, conn, params = [tuple1, tuple2])
There might be a better way to do it, but I would add an if statement before building the query and would use .format() instead of + to parameterise the query.
Possible if statement:
if len(tuple1) < 2:
    tuple1 = tuple1[0]
This will vary based on what your input is. If you have a list of tuples you can do this:
tuples = [(4011,), (23, 24)]
new_t = []
for t in tuples:
    if len(t) == 2:
        new_t.append(t)
    elif len(t) == 1:
        new_t.append(t[0])
Output:
[4011, (23, 24)]
A better way of parameterising the query using .format():
sql = "select * from table1 where col1 in {} and col2 in {}".format(str(tuple1), str(tuple2))
Hope this helps!
select * from table_name where 1=1 and (column_a, column_b) not in ((28,1),(25,1))
