So far I have copied and pasted simple SQL code into Python using the following format:
sql = ("SELECT column1, column2, column3, column4 "
       "FROM table1 "
       "LEFT OUTER JOIN table2 ON x = y "
       "LEFT OUTER JOIN table3 ON table3.z = table1.y ")
However, now I have started to copy larger and more complicated SQL code into Python, and I find it quite difficult to use the same format as above, as columns start to contain sub-queries. I have seen some Python packages that format SQL code for Python, and I was wondering which one you suggest, or what the best and quickest way to overcome this situation is.
You can use Python multiline strings, which start and end with three quotes:
"""This is a
multi
line
string"""
and not worry about formatting. This is what I generally use for such purposes, but ideally you should go with an ORM.
For reference please check
https://www.w3schools.com/python/python_strings.asp
For readability, you can reformat the query itself. For example:
sql = """ SELECT country, product, SUM(profit) FROM sales left join
x on x.id=sales.k GROUP BY country, product having f > 7 and fk=9
limit 5; """
will result in:
sql = """
SELECT
    country,
    product,
    SUM(profit)
FROM
    sales
LEFT JOIN x ON
    x.id = sales.k
GROUP BY
    country,
    product
HAVING
    f > 7
    AND fk = 9
LIMIT 5; """
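If you are looking for a Python package rather than hand-formatting, sqlparse (`sqlparse.format(sql, reindent=True)`) is a common choice. As a rough illustration of the idea, here is a minimal stdlib-only sketch that just breaks a one-line query before the major keywords; it is a toy, not a real parser:

```python
import re

def rough_format(sql: str) -> str:
    """Insert a newline before each major SQL keyword (rough, regex-based)."""
    keywords = r'\b(SELECT|FROM|LEFT JOIN|JOIN|WHERE|GROUP BY|HAVING|ORDER BY|LIMIT)\b'
    formatted = re.sub(keywords, r'\n\1', sql, flags=re.IGNORECASE)
    # Collapse the whitespace left around the inserted newlines.
    return re.sub(r'\s*\n\s*', '\n', formatted).strip()

sql = "SELECT country, product, SUM(profit) FROM sales LEFT JOIN x ON x.id = sales.k GROUP BY country, product"
print(rough_format(sql))
```

A real formatter also has to handle sub-queries, strings, and comments, which is exactly why a dedicated package is the safer bet.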
I am pulling data using Python from an Oracle table. I can pull the rows I need for one of the lists using
office_list = [aaa, aab, aac, aad]  # the actual list is much longer
"""
SELECT *
FROM (
    SELECT distinct(id) as id,
           office,
           cym,
           type
    FROM oracle1.table1
    WHERE office IN ({0})
)
""".format("'" + "','".join([str(i) for i in office_list]) + "'")
What I can't figure out is how to also include another filter from a different list.
In this case it is a list of types
type_list = [type1, type2, type3, type4]
Any help would be appreciated.
thanks
In cx_Oracle, pass collections to the query. Using bind variables helps prevent SQL injection attacks and lets the SQL engine re-use execution plans for different lists, which gives greater performance. (Conversely, using string concatenation would make your code vulnerable to SQL injection attacks and would not allow re-use of the execution plans.)
In SQL:
CREATE OR REPLACE TYPE string_list AS TABLE OF VARCHAR2(25);
/
In Python:
list_type = connection.gettype("STRING_LIST")

office_list = list_type.newobject()
office_list.extend(["aaa", "aab", "aac", "aad"])

type_list = list_type.newobject()
type_list.extend(["xxx", "xxy", "xxz", "xyz"])

cursor.execute(
    """SELECT DISTINCT
           id,  -- DISTINCT is NOT a function.
           office,
           cym,
           type
       FROM oracle1.table1
       WHERE office IN (SELECT column_value FROM TABLE(:1))
       AND type IN (SELECT column_value FROM TABLE(:2))""",
    [office_list, type_list],
)
for row in cursor:
    print(row)
You may be able to simplify the WHERE clause to:
WHERE office MEMBER OF TABLE(:1)
AND type MEMBER OF TABLE(:2)
"""select * from (
    select
        distinct(id) as id,
        office,
        cym,
        type
    from oracle1.table1
    where office in ({0})
    and type in ({1})
)
""".format("'" + "','".join(office_list) + "'",
           "'" + "','".join(type_list) + "'")
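If you want to keep string-built SQL but still use bind variables (see the note above about SQL injection), one common pattern is to generate one placeholder per list element and pass a matching parameter dict to `cursor.execute(sql, params)`. A sketch, assuming a driver that accepts cx_Oracle-style named binds:

```python
office_list = ["aaa", "aab", "aac", "aad"]
type_list = ["xxx", "xxy", "xxz", "xyz"]

# Generate one named bind placeholder per element (:o0, :o1, ... and :t0, :t1, ...).
office_binds = ", ".join(":o%d" % i for i in range(len(office_list)))
type_binds = ", ".join(":t%d" % i for i in range(len(type_list)))

sql = """SELECT DISTINCT id, office, cym, type
FROM oracle1.table1
WHERE office IN ({0})
AND type IN ({1})""".format(office_binds, type_binds)

# Map each placeholder name to its value, for cursor.execute(sql, params).
params = {"o%d" % i: v for i, v in enumerate(office_list)}
params.update({"t%d" % i: v for i, v in enumerate(type_list)})
print(sql)
```

Only the placeholder names are interpolated into the SQL text; the values travel separately as binds.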
I am using Python to access an Oracle Exadata database, which is HUGE. The documentation for the table is rather poor and I need to understand strange cases. Coming from an R/Python world, I ran the following query:
query = ("""
    SELECT COUNT(counter) as freq, counter
    FROM (
        SELECT COUNT(*) as counter
        FROM schema.table
        WHERE x = 1 AND y = 1
        GROUP BY a,b )
    GROUP BY counter""")
with cx_Oracle.connect(dsn=tsn, encoding="UTF-8") as con:
    df = pd.read_sql(query, con=con)
This essentially counts the frequency of observations for a given (a,b) pair. My prior was that they are all 1 (they are not). So I would like to see the observations that drive this:
query = ("""
    SELECT *
    FROM schema.table
    WHERE x = 1 and y = 1
    AND (for each (a,b) there is more than one record)""")
I am struggling to translate this into proper Oracle SQL.
In R (dplyr) this would be a combination of group_by and mutate (instead of summarise) and in Python pandas this could be done with transform.
I am new to SQL and may use incorrect terminology. I appreciate being corrected.
You can use window functions:
SELECT ab.*
FROM (SELECT t.*, COUNT(*) OVER (PARTITION BY a, b) as cnt
      FROM schema.table t
      WHERE x = 1 AND y = 1
     ) ab
WHERE cnt > 1;
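Since the question mentions pandas' transform: the same "keep groups with more than one row" filter can be sketched on a DataFrame with toy data (illustrative column names, assuming the rows fit in memory):

```python
import pandas as pd

# Toy data: the (a, b) pair (1, 1) appears twice, (2, 3) once.
df = pd.DataFrame({"a": [1, 1, 2], "b": [1, 1, 3], "x": [1, 1, 1], "y": [1, 1, 1]})

# COUNT(*) OVER (PARTITION BY a, b) corresponds to groupby(...).transform("size").
cnt = df.groupby(["a", "b"])["a"].transform("size")
dup = df[cnt > 1]
print(dup)
```

`transform("size")` broadcasts the per-group count back to every row, so the boolean filter keeps whole groups, just like the window function.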
I have as input a string that is a SQL query. I need to get all the tables the query uses (as in FROM table or table1 INNER JOIN table2). But the query does not follow any standard formatting. So my question is whether there is any method to normalize the query so that searching for these table names is easier.
My method right now is to search for the keywords from and join and take whatever line is after the keyword (or before, in the case of join), but there are exceptions in the queries, where the from does not have a newline after it, and I have to treat every exception like this. I don't think a plain regex works, because while the table names look like schema_name.table_name, there are also columns written like that.
for row in text:
    to_append = None
    split_row = row.strip('\r').strip(' ').strip('\r').split(' ')
    if split_row[-1].lower() == "from" and len(split_row) > 1:
        from_indexes.append(text.index(row))
    if ("join" in split_row or "JOIN" in split_row) and (split_row[-1] != "join" and split_row[-1] != "JOIN"):
        for ind in range(len(split_row)):
            if split_row[ind].lower() == "join":
                to_append = split_row[ind + 1:]
                row = split_row[:ind + 1]
                row = ' '.join(row)
    rows.append(row.strip('\r').strip(' ').strip('\t'))
    if to_append is not None:
        rows.append(' '.join(to_append))
So I am looking for some method that can standardize the sql query or for another method to extract the table names from the query.
I think a more straightforward approach would be to use regular expressions:
import re
sql = """select t1.*, t2.y, sq.z, table3.q from table1 t1 join
table2 t2 on t1.x = t2.x left join
(select 5 as x, 9 as z) sq JOIN
table3 on sq.x = table3.x
;"""
matches = re.findall(r'(\s+(from|join)\s+)(\w+)', sql, re.DOTALL|re.IGNORECASE)
for match in matches:
    print(match[2])
Note that it will not consider (select 5 as x, 9 as z) as a table.
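If you also need the schema_name.table_name form mentioned in the question, the same approach can be extended so the pattern optionally captures the qualifier, for example:

```python
import re

sql = """select t1.*, s2.t2.y from s1.table1 t1 join
s2.table2 t2 on t1.x = t2.x"""

# Capture an optional schema qualifier together with the table name.
matches = re.findall(r'\b(?:from|join)\s+(\w+(?:\.\w+)?)', sql, re.IGNORECASE)
print(matches)
```

Because only names that directly follow FROM or JOIN are captured, qualified column references elsewhere in the query are ignored.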
You should use an ORM tool to make cleaner queries (see https://en.wikipedia.org/wiki/Object-relational_mapping), or at least a query-builder module.
I recently found a remake of Laravel's "Eloquent" ORM here: https://pypi.org/project/eloquent/.
Other ORMs, like peewee, are pretty common to use, too.
I am a beginner python user, and would like to run a SQL query, iteratively, on items in a csv file that have the same group.
My input file looks like this:
"num","fruit_id","fruit"
1,1000560,"apple"
1,1102527,"banana"
1,1103314,"orange"
1,1136980,"pineapple"
2,1321636,"cantalope"
2,1506270,"mandarin"
3,1539403,"grape"
3,1549786,"grapefruit"
3,1734104,"tomato"
I would like to group all the "fruit_id" items with the same "num" into a comma separated list and supply this list in the WHERE statement of my SQL query. I have over 40,000 groups, so I need to do this iteratively.
I know how to run the SQL query in python, but I am struggling with how to best create this grouping, reference it properly in my SQL query, and do it iteratively. Any input would be very much appreciated.
My SQL query looks something like this:
SELECT *
FROM db1.table1
JOIN db1.table2 USING (id)
JOIN db1.table3 ON (concept_id=fruit_concept_id)
JOIN db1.table4 USING (detailed_id)
WHERE fruit_id IN ('list_of_fruit_ids_for_group_in_file')
GROUP BY fruit_id, fruit_concept_id;
The 'list_of_fruit_ids_for_group_in_file' would look like :
(1000560, 1102527,1103314,1136980) for group 1
(1321636, 1506270) for group 2
(1539403,1549786,1734104) for group 3
import pandas as pd
import numpy as np

df = pd.read_csv('datatest.csv', delimiter=',')

for group in np.unique(df.num):
    # filter df for just the group
    df_group = df[df.num == group]
    # select the fruit_id column and turn it into a list
    grouplist = np.unique(df_group.fruit_id)
    print("group num : ", group, "fruits :", grouplist)
output :
group num : 1 fruits : [1000560 1102527 1103314 1136980]
group num : 2 fruits : [1321636 1506270]
group num : 3 fruits : [1539403 1549786 1734104]
It's not comma-separated in the Python print, but it's still a list you can iterate over or use with a WHERE ... IN.
Thus you could include your query inside the for loop:
for group in np.unique(df.num):
    # filter df for just the group
    df_group = df[df.num == group]
    # select the fruit_id column and turn it into a list
    grouplist = np.unique(df_group.fruit_id)
    data = SQL QUERY... WHERE .. IN grouplist
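To make the last pseudocode line concrete, here is one way to turn grouplist into the comma-separated IN (...) text; the table names come from the question, and actually executing each query is left to whatever DB connection you already use:

```python
from io import StringIO
import pandas as pd
import numpy as np

# Inline sample standing in for datatest.csv.
csv = """num,fruit_id,fruit
1,1000560,apple
1,1102527,banana
2,1321636,cantalope"""
df = pd.read_csv(StringIO(csv))

queries = {}
for group in np.unique(df.num):
    grouplist = np.unique(df[df.num == group].fruit_id)
    # Join the ids into the comma-separated list the WHERE clause needs.
    in_clause = ", ".join(str(i) for i in grouplist)
    queries[group] = (
        "SELECT * FROM db1.table1 "
        "JOIN db1.table2 USING (id) "
        "WHERE fruit_id IN ({})".format(in_clause)
    )
print(queries[1])
```

With 40,000 groups you would then run each `queries[group]` in turn, or batch several groups per round trip.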
Here's a better idea: use sets. It will make your code simpler and faster.
create table fruits(
    num int not null,
    fruit_id int not null,
    fruit varchar(30) not null );
Insert each line from your CSV file into the fruits table. Likely there are tools for your DBMS to do that; you shouldn't have to write any Python for it.
Now instead of formulating a WHERE clause, use EXISTS:
select count(*) from T
where exists( select 1 from fruits
              where T.fruit_id = fruit_id and num = 1 )
or whatever it is you're after.
It's not obvious you "have to do this iteratively". It seems at least as likely to me that one query will do the job quicker and easier, depending on capacity and use.
The SQL query I have can identify the Max Edit Time from the 3 tables that it is joining together:
Select Identity.SSN, Schedule.First_Class, Students.Last_Name,
       (SELECT Max(v)
        FROM (VALUES (Students.Edit_DtTm), (Schedule.Edit_DtTm),
                     (Identity.Edit_DtTm)) AS value(v)) as [MaxEditDate]
FROM Schedule
LEFT JOIN Students ON Schedule.stdnt_id = Students.Student_Id
LEFT JOIN Identity ON Schedule.std_id = Identity.std_id
I need this to be in SQLAlchemy so I can reference the columns being used elsewhere in my code. Below is the simplest version of what I'm trying to do, but it doesn't work. I've tried changing how I query it, but I either get a SQL error that I'm using VALUES incorrectly, or it doesn't join properly and returns the actual highest value in those columns without matching it to the outer query.
max_edit_subquery = sa.func.values(Students.Edit_DtTm, Schedule.Edit_DtTm, Identity.Edit_DtTm)
base_query = (sa.select([Identity.SSN, Schedule.First_Class, Students.Last_Name,
                         (sa.select([sa.func.max(max_edit_subquery)]))])
              .select_from(Schedule.__table__
                           .join(Students, Schedule.stdnt_id == Students.stdnt_id)
                           .join(Identity, Schedule.std_id == Identity.std_id)))
I am not an expert at SQLAlchemy, but you could replace VALUES with UNION ALL:
Select Identity.SSN, Schedule.First_Class, Students.Last_Name,
       (SELECT Max(v)
        FROM (SELECT Students.Edit_DtTm AS v
              UNION ALL SELECT Schedule.Edit_DtTm
              UNION ALL SELECT Identity.Edit_DtTm) s
       ) as [MaxEditDate]
FROM Schedule
LEFT JOIN Students ON Schedule.stdnt_id=Students.Student_Id
LEFT JOIN Identity ON Schedule.std_id=Identity.std_id;
Another approach is to use the GREATEST function (not available in T-SQL before SQL Server 2022):
Select Identity.SSN, Schedule.First_Class, Students.Last_Name,
       GREATEST(Students.Edit_DtTm, Schedule.Edit_DtTm, Identity.Edit_DtTm)
           as [MaxEditDate]
FROM Schedule
LEFT JOIN Students ON Schedule.stdnt_id = Students.Student_Id
LEFT JOIN Identity ON Schedule.std_id = Identity.std_id;
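On the SQLAlchemy side, `func` renders any attribute name as a SQL function call, so a GREATEST version can be sketched like this (lightweight `table()`/`column()` constructs for illustration only; assumes SQLAlchemy 1.4+ and a backend that supports GREATEST):

```python
from sqlalchemy import column, func, select, table

# Minimal stand-ins for the real mapped tables in the question.
schedule = table("Schedule", column("Edit_DtTm"))
students = table("Students", column("Edit_DtTm"))

# func.greatest(...) renders verbatim as GREATEST(...) in the emitted SQL.
stmt = select(func.greatest(schedule.c.Edit_DtTm, students.c.Edit_DtTm).label("MaxEditDate"))
print(stmt)
```

Because the expression is an ordinary column element, it can be referenced elsewhere in the query like any other labeled column.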
I hope that it will help you to translate it to ORM version.
I had a similar problem and solved it using the approach below. I have added the full code and the resulting query. The code was executed on MS SQL Server. I used different tables and masked them with the tables and columns from your requirement in the snippet below.
from sqlalchemy import *
from sqlalchemy.ext.compiler import compiles
from sqlalchemy.types import String
from sqlalchemy.sql.expression import FromClause

class values(FromClause):
    def __init__(self, *args):
        self.column_names = args

@compiles(values)
def compile_values(element, compiler, asfrom=False, **kwrgs):
    values = "VALUES %s" % ", ".join(
        "(%s)" % compiler.render_literal_value(elem, String())
        for elem in element.column_names)
    if asfrom:
        values = "(%s)" % values
    return values

base_query = self.db_session.query(Schedule.Edit_DtTm.label("Schedule_Edit_DtTm"),
                                   Identity.Edit_DtTm.label("Identity_Edit_DtTm"),
                                   Students.Edit_DtTm.label("Students_Edit_DtTm"),
                                   Identity.SSN
                                   ).outerjoin(Students, Schedule.stdnt_id == Students.Student_Id
                                   ).outerjoin(Identity, Schedule.std_id == Identity.std_id).subquery()

values_at_from_clause = values(("Students_Edit_DtTm"), ("Schedule_Edit_DtTm"), ("Identity_Edit_DtTm")
                               ).alias('values(MaxEditDate)')

get_max_from_values = self.db_session.query(func.max(text('MaxEditDate'))
                                            ).select_from(values_at_from_clause)

output_query = self.db_session.query(get_max_from_values.subquery()).label("MaxEditDate")
Printing output_query gives:
SELECT
    anon_1.Schedule_Edit_DtTm AS anon_1_Schedule_Edit_DtTm,
    anon_1.Students_Edit_DtTm AS anon_1_Students_Edit_DtTm,
    anon_1.Identity_Edit_DtTm AS anon_1_Identity_Edit_DtTm,
    anon_1.SSN AS anon_1_SSN,
    (
        SELECT
            anon_2.max_1
        FROM
            (
                SELECT
                    max( MaxEditDate ) AS max_1
                FROM
                    (
                        VALUES (Students_Edit_DtTm),
                               (Schedule_Edit_DtTm),
                               (Identity_Edit_DtTm)
                    ) AS values(MaxEditDate)
            ) AS anon_2
    ) AS MaxEditDate
FROM
    (
        SELECT
            Schedule.Edit_DtTm AS Schedule_Edit_DtTm,
            Students.Edit_DtTm AS Students_Edit_DtTm,
            Identity.Edit_DtTm AS Identity_Edit_DtTm,
            Identity.SSN AS SSN
        FROM
            Schedule WITH(NOLOCK)
            LEFT JOIN Students WITH(NOLOCK) ON
                Schedule.stdnt_id = Students.Student_Id
            LEFT JOIN Identity WITH(NOLOCK) ON
                Schedule.std_id = Identity.std_id
    ) AS anon_1