Fetch queries from a Snowflake table and execute each query using Python

I have queries (like select * from table) stored in a Snowflake table and want to execute each query row by row, generating a CSV file for each one. Below is the Python code where I am able to print the queries, but I don't know how to execute each query and create a CSV file for it.
I believe I am close to what I want to achieve. I would really appreciate it if someone could help here.
import pyodbc
import pandas as pd
import snowflake.connector
import os

conn = snowflake.connector.connect(
    user='User',
    password='Pass',
    account='Account',
    autocommit=True
)

try:
    cursor = conn.cursor()
    query = 'Select Column from Table;'  # This will return two select statements
    output = cursor.execute(query)
    for i in cursor:
        print(i)
    cursor.close()
    del cursor
    conn.close()
except Exception as e:
    print(e)

You're pretty close. You just need to execute each query instead of printing it, and write the results to a file.
I haven't used pandas much myself, but this is based on the code the Snowflake documentation provides for running a query and putting the results into a pandas dataframe.
cursor = conn.cursor()
query = 'Select Column, row_number() over(order by Column) as Rownum from Table;'
cursor.execute(query)
resultset = cursor.fetchall()
for result in resultset:
    cursor.execute(result[0])       # run the stored query
    df = cursor.fetch_pandas_all()  # pull its results into a dataframe
    df.to_csv(r'C:\Users\...<your filename here>' + str(result[1]) + '.csv', index=False)
It may take some fiddling, but here are a couple of links for reference:
Snowflake docs on creating a pandas dataframe
Exporting a pandas dataframe to csv
Update: added an example of a way to create separate files for each record. It just adds a distinct number to each row of your SQL output so you can use that number as part of the filename. Ultimately, you need some logic in your loop to create a filename, whether that's adding a random number, a timestamp, or something else. That can come from the SQL or from the Python side; it's up to you. I'd probably add a filename column to your table, but I don't know if that makes sense for you. A timestamp-based variation is sketched below.
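If a timestamp suits you better than the row number, a minimal sketch could look like the following. It assumes the stored queries live in a column called Column, as in your original query, and that writing the files to the current working directory is acceptable; adjust both to taste.
from datetime import datetime

# Hedged sketch: one CSV per stored query, filename built from a timestamp.
# Assumes `conn` is the snowflake.connector connection from the question.
cursor = conn.cursor()
cursor.execute('Select Column from Table;')
for (stored_query,) in cursor.fetchall():
    cursor.execute(stored_query)
    df = cursor.fetch_pandas_all()
    filename = 'query_output_{}.csv'.format(datetime.now().strftime('%Y%m%d_%H%M%S_%f'))
    df.to_csv(filename, index=False)
cursor.close()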

Pandas to_sql avoid duplicate rows

I am using pandas' to_sql method to insert data into a MySQL table. The MySQL table already exists and I'd like to avoid inserting duplicate rows.
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_sql.html
Is there a way to do this in python?
# mysql connection
import pandas as pd
import pymysql
from sqlalchemy import create_engine

user = 'user1'
pwd = 'xxxx'
host = 'aa1.us-west-1.rds.amazonaws.com'
port = 3306
database = 'main'

engine = create_engine("mysql+pymysql://{}:{}@{}:{}/{}".format(user, pwd, host, port, database))
con = engine.connect()
df.to_sql(name="dfx", con=con, if_exists='append')
con.close()
Are there any work-arounds, if there isn't a straight forward way to do this?
It sounds like you want to do an "upsert" (insert or update). Pangres is a useful package that lets you do an upsert from a pandas dataframe. If you don't want to update a row when it already exists, that is also an option: set if_row_exists to 'ignore'. A rough sketch follows.
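This is only a sketch of what that could look like, assuming a recent pangres version with the upsert(con=..., df=..., table_name=..., if_row_exists=...) signature, and assuming empID is the table's primary key (that column name is purely illustrative):
# Hedged sketch: upsert via pangres; 'empID' as the key column is an assumption.
from pangres import upsert
from sqlalchemy import create_engine

engine = create_engine("mysql+pymysql://user1:xxxx@aa1.us-west-1.rds.amazonaws.com:3306/main")

# `df` is the dataframe from the question; pangres uses the index as the key.
df_upsert = df.set_index('empID')
upsert(
    con=engine,
    df=df_upsert,
    table_name='dfx',
    if_row_exists='ignore',  # or 'update' to overwrite rows that already exist
)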
I have never heard of 'upsert' before today, but it sounds interesting. You could certainly delete dupes after the data is loaded into your table.
WITH a AS
(
    SELECT Firstname,
           ROW_NUMBER() OVER (PARTITION BY Firstname, empID ORDER BY Firstname) AS duplicateRecCount
    FROM dbo.tblEmployee
)
-- Now delete duplicate records
DELETE FROM a
WHERE duplicateRecCount > 1
That will work fine, unless you have billions of rows.

How to query a T-SQL temp table with connectorx (pandas slow)

I am using pyodbc to run a query to create a temp table from a bunch of other tables. I then want to pull that whole temp table into pandas, but my pd.read_sql call takes upwards of 15 minutes. I want to try the connectorX library to see if it will speed things up.
For pandas, the working way to query the temp table looks like this:
import pyodbc
import pandas as pd

conn = pyodbc.connect("connection string")
cursor = conn.cursor()
cursor.execute("""Do a bunch of stuff that ultimately creates one #finalTable""")
df = pd.read_sql("SELECT * FROM #finalTable", con=conn)
I've been reading the documentation and it appears I can only pass a connection string to the connectorx.read_sql function, and I haven't been able to find a way to pass it an existing connection that carries the temp table I need.
Am I able to query the temp table with connectorX? If so how?
If not, what would be a faster way to query a large temp table?
Thanks!
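(For what it's worth, the connectorx call described in the docs takes only a connection string, roughly as sketched below with placeholder values; since it opens its own session, a session-scoped #temp table created through a separate pyodbc connection presumably wouldn't be visible to it.)
# Minimal sketch of the connectorx usage mentioned above; the connection string
# and query are placeholders, not values from the question.
import connectorx as cx

df = cx.read_sql(
    "mssql://user:password@server:1433/database",
    "SELECT * FROM dbo.someTable",
)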

Python SQL Server database loop not working

Using Python, I am looping through a number of SQL Server databases and creating tables using SELECT ... INTO, but when I run the script nothing happens, i.e. there are no error messages and the tables are not created. Below is an example extract of what I am doing. Can anyone advise?
df = []  # dataframe of database names as example
for i, x in df.iterrows():
    SQL = """
    Drop table if exists {x}..table
    Select
        Name
    Into
        {y}..table
    From
        MainDatabase..Details
    """.format(x=x['Database'], y=x['Database'])
    cursor.execute(SQL)
    conn.commit()
It looks like your DB driver doesn't support multiple statements in a single batch. Try splitting your query into two separate statements, one with the drop and the other with the select:
for i, x in df.iterrows():
    drop_sql = """
    Drop table if exists {x}..table
    """.format(x=x['Database'])
    select_sql = """
    Select
        Name
    Into
        {y}..table
    From
        MainDatabase..Details
    """.format(x=x['Database'], y=x['Database'])
    cursor.execute(drop_sql)
    cursor.execute(select_sql)
    cursor.commit()
And a second tip: your x=x['Database'] and y=x['Database'] are the same value. Is that intentional?

Syntax Error while inserting bulk data from pandas df to postgresql

I am trying to bulk insert pandas dataframe data into PostgreSQL. The pandas dataframe has 35 columns and the PostgreSQL table has 45 columns. I am choosing 12 matching columns from the dataframe and inserting them into the PostgreSQL table. For this I am using the following code snippets:
df = pd.read_excel(raw_file_path, sheet_name='Sheet1', usecols=col_names)  # col_names = list of desired columns (12 columns)
cols = ','.join(list(df.columns))
tuples = [tuple(x) for x in df.to_numpy()]
query = "INSERT INTO {0}.{1} ({2}) VALUES (%%s,%%s,%%s,%%s,%%s,%%s,%%s,%%s,%%s,%%s,%%s,%%s);".format(schema_name, table_name, cols)
curr = conn.cursor()
try:
    curr.executemany(query, tuples)
    conn.commit()
    curr.close()
except (Exception, psycopg2.DatabaseError) as error:
    print("Error: %s" % error)
    conn.rollback()
    curr.close()
    return 1
finally:
    if conn is not None:
        conn.close()
        print('Database connection closed.')
When running I am getting this error:
SyntaxError: syntax error at or near "%"
LINE 1: ...it,purchase_group,indenter_name,wbs_code) VALUES (%s,%s,%s,%...
Even if I use ? in place of %%s, I still get the same error.
Can anybody throw some light on this?
P.S. I am using PostgreSQL version 10.
What you're doing now actually inserts the pandas dataframe one row at a time. Even if this worked, it would be an extremely slow operation. At the same time, if the data might contain strings, just placing them into a query string like this leaves you open to SQL injection.
I wouldn't reinvent the wheel. Pandas has a to_sql function that takes a dataframe and converts it into a query for you. You can specify what to do on conflict (when a row already exists).
It works with SQLAlchemy, which has excellent support for PostgreSQL. And even though it might be a new package to explore and install, you're not required to use it anywhere else to make this work.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('postgresql://localhost:5432/mydatabase')

pd.read_excel(
    raw_file_path,
    sheet_name='Sheet1',
    usecols=col_names  # col_names = list of desired columns (12 columns)
).to_sql(
    schema=schema_name,
    name=table_name,
    con=engine,
    if_exists='append',  # the target table already exists
    method='multi'       # this makes it do all the inserts in one go
)

Python - Can I insert rows into one database using a cursor (from select) from another database?

I am trying to select data from our main database (Postgres) and insert it into a temporary SQLite database for some comparison, analytics, and reporting. Is there an easy way to do this in Python? I am trying to do something like this:
Get data from the main Postgres db:
import psycopg2
postgres_conn = psycopg2.connect(connection_string)
from_cursor = postgres_conn.cursor()
from_cursor.execute("SELECT email, firstname, lastname FROM schemaname.tablename")
Insert into SQLite table:
import sqlite3
sqlite_conn = sqlite3.connect(db_file)
to_cursor = sqlite_conn.cursor()
insert_query = "INSERT INTO sqlite_tablename (email, firstname, lastname) values %s"
to_cursor.some_insert_function(insert_query, from_cursor)
So the question is: is there a some_insert_function that would work for this scenario (either using pyodbc or using sqlite3)?
If yes, how to use it? Would the insert_query above work? or should it be modified?
Any other suggestions/approaches would also be appreciated in case a function like this doesn't exist in Python. Thanks in advance!
You should pass the result of your select query to executemany.
insert_query = "INSERT INTO smallUsers values (?,?,?)"
to_cursor.executemany(insert_query, from_cursor.fetchall())
You should also use a parameterized query (? marks), as explained here: https://docs.python.org/3/library/sqlite3.html#sqlite3.Cursor.execute
If you want to avoid loading the entire source database into memory, you can use the following code to process 100 rows at a time:
while True:
    current_data = from_cursor.fetchmany(100)
    if not current_data:
        break
    to_cursor.executemany(insert_query, current_data)
    sqlite_conn.commit()

sqlite_conn.commit()
You can look at executemany from pyodbc or sqlite. If you can build a list of parameters from your select, you can pass the list to executemany.
Depending on the number of records you plan to insert, performance can be a problem as referenced in this open issue. https://github.com/mkleehammer/pyodbc/issues/120
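As a hedged aside, if pyodbc ends up being the driver on the insert side, newer pyodbc versions expose a fast_executemany flag on the cursor that is commonly suggested for the slowness discussed in that issue (driver support permitting). A placeholder sketch:
# Hedged sketch: pyodbc fast_executemany; connection string and data are placeholders.
import pyodbc

conn = pyodbc.connect("DSN=placeholder")
cur = conn.cursor()
cur.fast_executemany = True  # send parameter sets to the driver in batches
cur.executemany("INSERT INTO smallUsers VALUES (?,?,?)", [("a@b.c", "Ann", "Lee")])
conn.commit()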
