passing param to airflow get_pandas_df and comparing to datetime2 - python

Here is the code:
# run Cassandra query to get the date value
result = session.execute('select max(process_date) as process_date_max from keyspace1.job_run')
for row in result:
    date_last_run = row.process_date_max
    date_last_run = str(date_last_run)

sql = """
select * from table where modifieddate > {last_run}
""".format(last_run=date_last_run)
df = mssql.get_pandas_df(sql)
I get the error: cannot compare value of datetime2 with int. Please help here, as I haven't found any solution on the internet so far.

The solution is very simple: you just need to put a ' before and after {last_run} and your code will work.
See the example below:
result = session.execute('select max(process_date) as process_date_max from keyspace1.job_run')
for row in result:
    date_last_run = row.process_date_max
    date_last_run = str(date_last_run)

sql = """
select * from table where modifieddate > '{last_run}'
""".format(last_run=date_last_run)
df = mssql.get_pandas_df(sql)
Method 2, using an f-string:
sql = f"select * from table where modifieddate > '{date_last_run}'"

Related

Looping through dataframe values in columns and using them as a FROM clause using SQL

I am running BigQuery in Jupyter notebook.
query ="""
SELECT
table_catalog,
table_schema,
table_name,
FROM `Project-A.schema_A`.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS
"""
The output leads me to the following table:
# This is the output of the query
data = {'table_catalog': ['project-A', 'project-A', 'project-A', 'project-A', 'Project-A', 'Project-A', 'Project-A'],
        'table_schema': ['schema_A', 'schema_A', 'schema_A', 'schema_A', 'schema_A', 'schema_A', 'schema_A'],
        'table_name': ['Table_A', 'Table_B', 'Table_B', 'Table_C', 'Table_C', 'Table_A', 'Table_A']}

# Create DataFrame
df = pd.DataFrame(data)
I want to use Table_A, Table_B and Table_C in the FROM clause of my next query, so that it looks like:
query = f"""
SELECT
  *
FROM Project-A.Schema_A.{I want to edit this dynamically - either Table_A, Table_B, Table_C}"""
I tried the following but have been failing, please help me with this:
list_of_tables = list(df['table_name'].unique())

def loop_tables(x):
    for tables in list_of_tables:
        if x == tables:
            # x = df['table_name']
            pass

loop_tables()
Try this:
def loop_tables():
    list_of_dataframes = []
    for table in list_of_tables:
        print(table)
        dynamic_sql = "select * from project.dataset."
        dynamic_sql += table
        df = client.query(dynamic_sql).to_dataframe()
        list_of_dataframes.append(df)
    return list_of_dataframes
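You could then call the function and stack the per-table results into a single frame. A minimal usage sketch, assuming the BigQuery client and list_of_tables defined above:
# Usage sketch: run the loop and combine the per-table results.
dfs = loop_tables()
combined = pd.concat(dfs, ignore_index=True)
print(combined.shape)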

Not able to fetch records from DB through python by using Parameterized Queries

The function below takes the parameters endTime, startTime, list1 and column_filter, and I am trying to read a query whose WHERE clause conditions are parameterized.
endT = endTime
startT = startTime
myList = ",".join("'" + str(i) + "'" for i in list1)
queryArgs = {'db': devDB,
             'schema': dbo,
             'table': table_xyz,
             'columns': ','.join(column_filter)}
query = '''
    WITH TIME_SERIES AS
        (SELECT ROW_NUMBER() OVER (PARTITION BY LocId ORDER BY Created_Time DESC) RANK, {columns}
         from {schema}.{table}
         WHERE s_no in ? AND
               StartTime >= ? AND
               EndTime <= ?)
    SELECT {columns} FROM TIME_SERIES WHERE RANK = 1
    '''.format(**queryArgs)
args = (myList, startT, endT)
return self.read(query, args)
Below is my read method, which connects to the DB to fetch records; a condition is also added to check whether the query is parameterized or not.
def read(self, query, parameterValues=None):
    cursor = self.connect(cursor=True)
    if parameterValues is not None:
        rows = cursor.execute(query, parameterValues)
    else:
        rows = cursor.execute(query)
    df = pd.DataFrame.from_records(rows.fetchall())
    if len(df.columns) > 0:
        df.columns = [x[0] for x in cursor.description]
    cursor.close()
    return df
The query args are getting picked up, but not the parameterized values. In my case, the read method is called with the parameter values (myList, startT, endT) as a tuple, yet the ? placeholders in the WHERE clause remain unchanged (the parameters are not replacing them), and as a result I am not able to fetch any records. Can you point out where I might be going wrong?
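One thing worth checking, sketched below: with pyodbc each ? binds exactly one value, so passing the comma-joined myList string to s_no in ? compares s_no against one long string rather than a list of values. A common workaround is to generate one placeholder per element (a sketch, reusing list1, queryArgs, startT and endT from above):
# Sketch: build one "?" per element of list1 so every value is bound separately.
placeholders = ",".join("?" for _ in list1)
query = '''
    WITH TIME_SERIES AS
        (SELECT ROW_NUMBER() OVER (PARTITION BY LocId ORDER BY Created_Time DESC) RANK, {columns}
         from {schema}.{table}
         WHERE s_no in ({placeholders}) AND
               StartTime >= ? AND
               EndTime <= ?)
    SELECT {columns} FROM TIME_SERIES WHERE RANK = 1
    '''.format(placeholders=placeholders, **queryArgs)
args = (*list1, startT, endT)
return self.read(query, args)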

Create Dataframe with Cx_Oracle based on different query date

Below is a sample of the DB table:
date       id  name
01.02.11    4  aaaa
21.05.19    5  aaaa
31.12.12    5  aaaa
01.05.15    6  aaaa
In order to query the data correctly (avoiding duplicates), I have to set a 'reporting date' while querying, which is the first day of the month.
The code below gives me the requested results, but only for one month:
sql = "select * from db where date = '01.03.20'"
def oracle(user, pwd, dsn, sql, columns):
    # Connect to the database
    con = cx_Oracle.connect(user=user, password=pwd, dsn=dsn, encoding="UTF-8")
    con.outputtypehandler = OutputHandler
    # Cursor allows Python code to execute SQL commands in a database session
    cur = con.cursor()
    # Check connection
    print('Connected')
    # Create DF
    df = pd.DataFrame(cur.execute(sql).fetchall(), columns=columns, dtype='object')[:]
    print('Shape:', df.shape)
    return df
Question: How can I query data using cx_Oracle with a different reporting date each time, without doing it manually?
There are multiple ways to solve this directly in SQL; however, the expected solution should use a for loop.
I was thinking about changing the reporting date with:
for i in [str(i).zfill(2) for i in range(1, 13)]:
    for j in [str(j).zfill(2) for j in range(0, 21)]:
        sql = f"select * from db where date = '01.{i}.{j}'"
For example: date = 01.01.19.
The idea is to query the data for this date --> store it in a DF.
Go to the next month, 01.02.19 --> store it in the DF.
And so on until range 21 is reached or the last current month (latest date) is reached.
If someone has any idea how to query data using a loop with cx_Oracle and Pandas for different dates, thanks for helping!
How about something like this:
from datetime import date, datetime, timedelta
import calendar

# Choose start month
start_month = date(2019, 1, 1)
# Get current month
current_month = date(datetime.today().year, datetime.today().month, 1)
# Create list to collect all successfully run queries
executed_sql_queries = []
# Create list for failed queries
failed_queries = []
# Create list to collect dfs
dfs = []

while start_month <= current_month:
    query_date = start_month.strftime('%d.%m.%y')
    sql = f"""select * from db where date = '{query_date}' """
    try:
        df = oracle(user, pwd, dsn, sql=sql, columns=columns)
    except cx_Oracle.DatabaseError as e:
        print(e)
        failed_queries.append(sql)
        pass  # move on to the next query, or you can try re-running the query
    else:
        executed_sql_queries.append(sql)
        dfs.append(df)
    finally:
        # Add one month to the date for each run
        days_in_month = calendar.monthrange(start_month.year, start_month.month)[1]
        start_month = start_month + timedelta(days=days_in_month)

all_dfs = pd.concat(dfs)
executed_sql_queries:
["select * from db where date = '01.01.19' ",
"select * from db where date = '01.02.19' ",
"select * from db where date = '01.03.19' ",
"select * from db where date = '01.04.19' ",
"select * from db where date = '01.05.19' ",
"select * from db where date = '01.06.19' ",
"select * from db where date = '01.07.19' ",
"select * from db where date = '01.08.19' ",
"select * from db where date = '01.09.19' ",
"select * from db where date = '01.10.19' ",
"select * from db where date = '01.11.19' ",
"select * from db where date = '01.12.19' ",
"select * from db where date = '01.01.20' ",
"select * from db where date = '01.02.20' ",
"select * from db where date = '01.03.20' ",
"select * from db where date = '01.04.20' "]

Conditionally add WHERE clause in Python for cx_Oracle

I have the following Python code:
params = {}
query = 'SELECT * FROM LOGS '
if(date_from and date_to):
    query += ' WHERE LOG_DATE BETWEEN TO_DATE(:date_start, "MM-DD-YYYY") AND LOG_DATE <= TO_DATE(:date_end, "MM-DD-YYYY")'
    params['date_start'] = date_from
    params['date_end'] = date_to
if(structure):
    query += ' AND STRUCTURE=:structure_val'
    params['structure_val'] = structure
if(status):
    query += ' AND STATUS =:status'
    params['status'] = status
cursor.execute(query, params)
Here I am conditionally adding the WHERE clause to the query. But there is an issue when I don't have values for the dates: the query skips the WHERE and adds AND without it. If I always include the WHERE in the base query and no filter is given, the query is also wrong. Is there a better way to do this? I have been using Laravel for some time, and its query builder has a when method that helps add conditional where clauses. Is there anything like this in Python for cx_Oracle?
params = {}
query = 'SELECT * FROM LOGS'
query_conditions = []
if(date_from and date_to):
    query_conditions.append("LOG_DATE BETWEEN TO_DATE(:date_start, 'MM-DD-YYYY') AND TO_DATE(:date_end, 'MM-DD-YYYY')")
    params['date_start'] = date_from
    params['date_end'] = date_to
if(structure):
    query_conditions.append('STRUCTURE = :structure_val')
    params['structure_val'] = structure
if(status):
    query_conditions.append('STATUS = :status')
    params['status'] = status
if query_conditions:
    query += ' WHERE ' + ' AND '.join(query_conditions)
cursor.execute(query, params)
Add the conditions to a list and join them with AND, prefixing WHERE only when the list is non-empty. (The BETWEEN condition is also rewritten above: BETWEEN already takes its two bounds joined by AND, and Oracle format masks belong in single quotes.)
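For illustration, with hypothetical filter values the builder produces something like this (only structure and status set, so the date condition is skipped):
# Hypothetical inputs, just to show the resulting statement:
date_from, date_to = None, None
structure, status = 'ARCHIVE', 'OPEN'
# After running the code above:
# query  -> 'SELECT * FROM LOGS WHERE STRUCTURE = :structure_val AND STATUS = :status'
# params -> {'structure_val': 'ARCHIVE', 'status': 'OPEN'}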

How to change the cursor to the next row using pyodbc in Python

I am trying to fetch records at a regular interval from a database table that keeps growing. I am using Python and its pyodbc package to fetch the records. While fetching, how can I point the cursor to the row after the one that was read/fetched last, so that every fetch returns only the newly inserted records?
To explain more:
my table has 100 records and they are fetched;
after an interval the table has 200 records and I want to fetch rows 101 to 200, and so on.
Is there a way to do this with a pyodbc cursor?
Any other suggestion would also be very helpful.
Below is the code I am trying:
#!/usr/bin/python
import pyodbc
import csv
import time

conn_str = (
    "DRIVER={PostgreSQL Unicode};"
    "DATABASE=postgres;"
    "UID=userid;"
    "PWD=database;"
    "SERVER=localhost;"
    "PORT=5432;"
)
conn = pyodbc.connect(conn_str)
cursor = conn.cursor()

def fetch_table(**kwargs):
    qry = kwargs['qrystr']
    try:
        #cursor = conn.cursor()
        cursor.execute(qry)
        all_rows = cursor.fetchall()
        rowcnt = cursor.rowcount
        rownum = cursor.description
        #return (rowcnt, rownum)
        return all_rows
    except pyodbc.ProgrammingError as e:
        print("Exception occurred as:", type(e), e)

def poll_db():
    for i in [1, 2]:
        stmt = "select * from my_database_table"
        rows = fetch_table(qrystr=stmt)
        print("***** For i = ", i, "******")
        for r in rows:
            print("ROW-> ", r)
        time.sleep(10)

poll_db()
conn.close()
I don't think you can use pyodbc, or any other ODBC package, to find "new" rows directly. But if there is a timestamp column in your table, or if you can add such a column (some databases allow it to be populated automatically with the insertion time, so you don't have to change the insert queries), then you can change your query to select only the rows whose timestamp is greater than the previous timestamp, and keep updating the prev_timestamp variable on each iteration.
import datetime

def poll_db():
    prev_timestamp = ""
    for i in [1, 2]:
        if prev_timestamp == "":
            stmt = "select * from my_database_table"
        else:
            # convert your timestamp str to match the database's format
            stmt = "select * from my_database_table where timestamp > '" + str(prev_timestamp) + "'"
        rows = fetch_table(qrystr=stmt)
        prev_timestamp = datetime.datetime.now()
        print("***** For i = ", i, "******")
        for r in rows:
            print("ROW-> ", r)
        time.sleep(10)
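A parameterized variant avoids formatting the timestamp into the SQL at all. A sketch, assuming the tracking column is called created_at (a hypothetical name) and that the ODBC driver accepts a Python datetime as a bind value:
# Sketch: bind prev_timestamp instead of concatenating it into the statement.
# The column name `created_at` is hypothetical; use whatever your table has.
stmt = "select * from my_database_table where created_at > ?"
cursor.execute(stmt, prev_timestamp)
rows = cursor.fetchall()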
