How to convert a correlated subquery to PySpark code - python

I am trying to convert a correlated subquery into PySpark code, and I am getting the error below:
Attribute(s) with the same name appear in the operation:
Sample SQL query:
SELECT (few columns)
FROM table0 spw1
JOIN table1 spw2
  ON spw2.date_key = (SELECT MAX(date_key)
                      FROM table2 spw3
                      WHERE spw3.name = spw1.name
                        AND spw3.id = spw1.id
                        AND spw3.v_key = spw1.v_key
                        AND spw3.start_date < spw1.prev_start_date)


use string as columns definition for DataFrame(cursor.fetchall(),columns

I would like to use a string as the column names for a pandas DataFrame.
The problem that arises is that pandas interprets the string variable as a single column instead of multiple ones, and thus the error:
ValueError: 1 columns passed, passed data had 11 columns
The first part of my code is intended to get the column names from the Mysql database I am about to query:
cursor1.execute ("SELECT GROUP_CONCAT(COLUMN_NAME) AS cols FROM INFORMATION_SCHEMA.COLUMNS WHERE TABLE_SCHEMA = 'or_red' AND TABLE_NAME = 'nomen_prefix'")
for colsTableMysql in cursor1.fetchall():
    colsTable = colsTableMysql[0]
    colsTable = "'" + colsTable.replace(",", "','") + "'"
The second part uses the created variable "colsTable" :
cursor = connection.cursor()
cursor.execute("SELECT * FROM or_red.nomen_prefix WHERE C_emp IN ("+emplazamientos+")")
tabla = pd.DataFrame(cursor.fetchall(),columns=[colsTable])
#tabla = exec("pd.DataFrame(cursor.fetchall(),columns=["+colsTable+"])")
#tabla = pd.DataFrame(cursor.fetchall())
I have tried other approaches, like the use of exec(). In that case there is no error, but there is no information in the response either, and the result of print(tabla) is None.
Is there any direct way of passing the columns dynamically as a string to a pandas DataFrame?
Thanks in advance
I am going to answer my own question, since I've already found the way.
The first part of my code is intended to get the column names from the Mysql database table I am about to query:
cursor1.execute ("SELECT GROUP_CONCAT(COLUMN_NAME) AS cols FROM INFORMATION_SCHEMA.COLUMNS WHERE TABLE_SCHEMA = 'or_red' AND TABLE_NAME = 'nomen_prefix'")
for colsTableMysql in cursor1.fetchall():
    colsTable = colsTableMysql[0]
    colsTable = "'" + colsTable.replace(",", "','") + "'"
The second part uses the created variable "colsTable" as input in the statement to define the columns.
cursor = connection.cursor()
cursor.execute("SELECT * FROM or_red.nomen_prefix WHERE C_emp IN ("+emplazamientos+")")
tabla = eval("pd.DataFrame(cursor.fetchall(),columns=["+colsTable+"])")
Using eval, the string is parsed and evaluated as a Python expression.
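As a side note, the eval() can be avoided entirely: GROUP_CONCAT returns a plain comma-separated string, so splitting it produces the list that the columns argument expects. A minimal sketch with made-up column names and rows:

```python
import pandas as pd

# Made-up result of the GROUP_CONCAT query; in practice this is colsTableMysql[0].
cols_string = "C_emp,col_a,col_b"

# split() turns the string into a real Python list, so no quoting or eval() is needed.
cols = cols_string.split(",")

rows = [(1, "x", "y"), (2, "z", "w")]  # stand-in for cursor.fetchall()
tabla = pd.DataFrame(rows, columns=cols)
print(tabla.columns.tolist())  # ['C_emp', 'col_a', 'col_b']
```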

while iterating over a pandas Series, query an SQLite database with each member of the Series

I have a pandas Series made from the following python dictionary, so:
gr8 = {'ERF13' : 'AT2G44840', 'BBX32' : 'AT3G21150', 'NAC061' : 'AT3G44350', 'NAC090' : 'AT5G22380', 'ERF019' : 'AT1G22810'}
gr8obj = pd.Series(gr8)
( where I have previously imported pandas as pd )
I have an SQLite database, AtRegnet.db
I want to iterate over the pandas Series, gr8obj, and query the database, AtRegnet.db, with each member of the series.
This is what I have tried:
for i in gr8obj:
    resdf = pd.read_sql('SELECT * FROM AtRegNet WHERE TargetLocus = ?' (i), con=sqlite3.connect("/home/anno/SQLiteDBs/AtRegnet.db"))
    fresdf = resdf.append(resdf)
fresdf
( the table in the AtRegnet.db that I want is AtRegNet and the column I am searching on is called TargetLocus. )
I know that when I work on the SQLite3 database directly with a SQL command,
select * from AtRegNet where TargetLocus="AT3G23230"
that I get back 80 lines from the database. (AT3G23230 is one of the members of gr8obj.)
You can try using an f-string, and the value for TargetLocus in your query should also be in quotes:
resdf = pd.read_sql(f"SELECT * FROM AtRegNet WHERE TargetLocus = '{i}'", con=sqlite3.connect("/home/anno/SQLiteDBs/AtRegnet.db"))
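A safer variant of the same idea, sketched against an in-memory SQLite database (table AtRegNet as in the question, but with a made-up Regulator column and rows): passing params= lets the driver do the quoting, and pd.concat collects the per-locus results (DataFrame.append has since been removed from pandas):

```python
import sqlite3
import pandas as pd

gr8 = {'ERF13': 'AT2G44840', 'BBX32': 'AT3G21150'}
gr8obj = pd.Series(gr8)

# In-memory stand-in for AtRegnet.db.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE AtRegNet (TargetLocus TEXT, Regulator TEXT)")
con.executemany("INSERT INTO AtRegNet VALUES (?, ?)",
                [("AT2G44840", "r1"), ("AT2G44840", "r2"), ("AT3G21150", "r3")])

# The ? placeholder is filled in by the driver, so no manual quoting is needed.
frames = [
    pd.read_sql("SELECT * FROM AtRegNet WHERE TargetLocus = ?", con, params=(locus,))
    for locus in gr8obj
]
fresdf = pd.concat(frames, ignore_index=True)
print(len(fresdf))  # 3
```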

How to make this SQL query work? I wish to insert a tuple inside the query

This is the original SQL query which worked for me
sql_text = """select * from (
SELECT pr.CLOSING_DATE,
'M' + CAST(year(pr.FIRST_DAY_DEL)*12+month(pr.FIRST_DAY_DEL) - year(pr.CLOSING_DATE)*12-month(pr.CLOSING_DATE) as varchar(2)) as relative_product,
pr.SETTLEMENT_PRICE as value
FROM COMMON.dbo.MDC_FUTURES_V pr, COMMON.dbo.MDC_CAT_V d
WHERE pr.MDC_ID = d.MDC_ID AND pr.FIRST_DAY_DEL<=pr.CLOSING_DATE + 1000
AND pr.MDC_ID IN ('10006968')
AND pr.PERIOD='Monthly'
) as data
PIVOT (AVG(VALUE) FOR relative_product IN (
[M1],[M2],[M3],[M4],[M5],[M6],[M7],[M8]
)) AS pvtL
ORDER BY CLOSING_DATE DESC"""
data = pd.read_sql(sql_text, con)
As I need many more months in the future, I tried to replace the ([M1],[M2],[M3],[M4],[M5],[M6],[M7],[M8]) with a tuple. I wrote the code below:
lst_m36 = []
for i in range(1, 9):
    lst_m36.append(f"[M{i}]")
tple36 = tuple(lst_m36)
However, when I try to insert the tuple tple36 into the SQL query, it never works, no matter where I place the quotes and parentheses.
sql_text = """select * from (
SELECT pr.CLOSING_DATE,
'M' + CAST(year(pr.FIRST_DAY_DEL)*12+month(pr.FIRST_DAY_DEL) - year(pr.CLOSING_DATE)*12-month(pr.CLOSING_DATE) as varchar(2)) as relative_product,
pr.SETTLEMENT_PRICE as value
FROM COMMON.dbo.MDC_FUTURES_V pr, COMMON.dbo.MDC_CAT_V d
WHERE pr.MDC_ID = d.MDC_ID AND pr.FIRST_DAY_DEL<=pr.CLOSING_DATE + 1000
AND pr.MDC_ID IN ('10006968')
AND pr.PERIOD='Monthly'
) as data
PIVOT (AVG(VALUE) FOR relative_product IN (tple36)) AS pvtL
ORDER BY CLOSING_DATE DESC"""
Here is the error message matching the last query:
DatabaseError: Execution failed on sql: select * from (
SELECT pr.CLOSING_DATE,
'M' + CAST(year(pr.FIRST_DAY_DEL)*12+month(pr.FIRST_DAY_DEL) - year(pr.CLOSING_DATE)*12-month(pr.CLOSING_DATE) as varchar(2)) as relative_product,
pr.SETTLEMENT_PRICE as value
FROM COMMON.dbo.MDC_FUTURES_V pr, COMMON.dbo.MDC_CAT_V d
WHERE pr.MDC_ID = d.MDC_ID AND pr.FIRST_DAY_DEL<=pr.CLOSING_DATE + 1000
AND pr.MDC_ID IN ('10006968')
AND pr.PERIOD='Monthly'
) as data
PIVOT (AVG(VALUE) FOR relative_product IN (tple36)) AS pvtL
ORDER BY CLOSING_DATE DESC
('08S01', '[08S01] [Microsoft][ODBC SQL Server Driver]Communication link failure (0) (SQLExecDirectW)')
unable to rollback
Can anyone help with a SQL query that works?
Don't pivot in the query itself. Fetch the raw results and use the pandas package to do the pivot; I think it is going to be easier.
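For completeness: the IN (...) list of a PIVOT clause is part of the SQL text, not a bindable parameter, so the tuple has to be spliced in as a string. A sketch of just that step (the inner query is elided):

```python
# Build the [M1]..[M36] column list and join it into one string.
lst_m36 = [f"[M{i}]" for i in range(1, 37)]
in_list = ",".join(lst_m36)

sql_text = f"""select * from (
    -- inner query as in the question --
) as data
PIVOT (AVG(VALUE) FOR relative_product IN ({in_list})) AS pvtL
ORDER BY CLOSING_DATE DESC"""
print(in_list[:14])
```

Alternatively, as suggested above, fetch the unpivoted rows and build the pivot with pandas.pivot_table.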

Retrieve Value from Pyspark SQL dataframe in Python program

I am using pyspark.sql in a standalone Python program to run a query on VERSION 0 of a table stored on Databricks.
I can return a data frame using the following code, but I cannot seem to access the value (which is an int, 5 in this case):
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
table_diff_df = spark.sql("select count(*) as result from (select * from myTableName version as of 0 limit 5)")
logger.console(type(table_diff_df))
logger.console(table_diff_df[0][0])
logger.console(type(table_diff_df[0]['result']))
logger.console(table_diff_df)
logger.console("tablediff as str :" + str(table_diff_df))
output 1
<class 'pyspark.sql.dataframe.DataFrame'>
Column<b'result[0]'>
<class 'pyspark.sql.column.Column'>
DataFrame[result: bigint]
tablediff as str :DataFrame[result: bigint]
By adjusting my query as follows (appending .collect()), I have been able to get the value of 5 as required (however, I had to remove VERSION AS OF 0):
table_diff_df = spark.sql("select count(*) as result from (select * from myTableName limit 5)").collect()
logger.console(type(table_diff_df))
logger.console(table_diff_df[0][0])
output 2
<class 'list'>
5
In my case I MUST run the query on the VERSION 0 table, but when I add that back into my query as shown below, I get the following error:
table_diff_df = spark.sql("select count(*) as result from (select * from myTableName Version as of 0 limit 5)").collect()
logger.console(table_diff_df[0][0])
output 3
Time travel is only supported for Delta tables
Is there a simple way to access the value using the DataFrame shown in the first code snippet (output 1)? If not, how can I get around the "Time travel is only supported for Delta tables" problem? The table that I am querying is a Delta table; however, I believe calling .collect() is converting it directly to a list (output 2)?

how to use a list as a parameter for SQL connection in python using pyodbc

I am learning python, and I am trying to pass a parameter as part of the WHERE clause. I am using pandas and pyodbc. Here is what I've tried so far. I first get the data for column c from a pandas dataframe, and convert it to a list called df_col, which has about 100 numeric values
df_col=df['data_for_colc'].tolist()
then, I execute the SQL statement:
execu = mycursor.execute(
"""
Select
columnA
,columnb
,columnc
where
columnc in (?)
""",df_col)
rows = mycursor.fetchall()
print(rows)
I am able to connect and download data from SQL Server, but I am not able to pass parameters. I just need to be able to download those 100 rows based on the list I created with 100 values, but I get an error: ('42000', "[42000] [Microsoft][ODBC SQL Server Driver][SQL Server]Incorrect syntax near ','. (102) (SQLExecDirectW)")
Any help would be appreciated. Thanks
The syntax error is because you left out the FROM clause in your query.
Once you fix that, you need to have as many ? in the IN() list as there are elements in df_col.
placeholders = ", ".join(["?"] * len(df_col))
sql = """
SELECT columnA, columnB, columnC
FROM yourTable
WHERE columnc IN (""" + placeholders + ")"
execu = mycursor.execute(sql, df_col)
rows = mycursor.fetchall()
print(rows)
You have to generate all those question marks:
execu = mycursor.execute(
"""
Select
columnA
,columnb
,columnc
where
columnc in ({})
""".format(','.join("?"*len(df_col))), df_col)
rows = mycursor.fetchall()
print(rows)
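The placeholder-expansion idea from both answers can be sketched end-to-end against an in-memory SQLite database (table and values are made up; pyodbc uses the same ? placeholder style):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE yourTable (columnA TEXT, columnB TEXT, columnC INTEGER)")
con.executemany("INSERT INTO yourTable VALUES (?, ?, ?)",
                [("a", "b", 1), ("c", "d", 2), ("e", "f", 3)])

df_col = [1, 3]  # the values for the IN () list

# One ? per element, joined into the SQL text; the values stay parameterized.
placeholders = ", ".join(["?"] * len(df_col))
sql = f"SELECT columnA, columnB, columnC FROM yourTable WHERE columnC IN ({placeholders})"

cur = con.cursor()
cur.execute(sql, df_col)
rows = cur.fetchall()
print(rows)
```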
