Read SQL query into pandas dataframe and replace string in query - python

I'm querying my SSMS database from pandas, and the query I have is pretty huge, so I've saved it locally. I want to read that query into a pandas dataframe. The query also contains a date string that I want to replace with a date I've already assigned in pandas. For reference's sake I'll shorten the query.
This is what I'm currently doing:
query = """SELECT * FROM table where date > 'date_string' """
query_result = pd.read_sql(query, conn)
Instead of writing SELECT * ... in pandas, I've saved my query locally. I want pandas to read that query and also replace date_string with startDate_sql.
My date_string keeps changing as I loop through a list of dates.
The pandas code would look like:
query = 'C:\Users\Admin\Desktop\Python\Open Source\query.sql'
query.replace(date_string, startDate_sql)
query_result = pd.read_sql(query, conn)
This way I'm not writing the query out in pandas, since it's a huge query and takes up a lot of space.
Can someone please tell me how to solve this and what is the correct syntax?
Thank you very much!

Reading a file in Python
Here's how to read in a text file in Python.
# note the r'' raw-string prefix so the backslashes in the Windows path aren't treated as escape sequences
query_filename = r'C:\Users\Admin\Desktop\Python\Open Source\query.sql'

# 'rt' means open for reading, in text mode
with open(query_filename, 'rt') as f:
    # read the query_filename file into a variable named query
    query = f.read()

# replace the literal string 'date_string' with the contents of the variable startDate_sql
query = query.replace('date_string', startDate_sql)

# get dataframe from database
query_result = pd.read_sql(query, conn)
Using parameterized queries
You should probably avoid string replacement to construct queries, because it leaves you open to SQL injection. Parameterized queries avoid this problem. Here's an example of how to use query parameterization with pandas.
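A minimal sketch, assuming conn is a pyodbc-style SQL Server connection where ? is the parameter placeholder (other drivers use %s or %(name)s placeholders instead):
import pandas as pd

# the saved query file would contain a placeholder instead of the literal 'date_string'
query = "SELECT * FROM table WHERE date > ?"

# pandas passes params through to the underlying database driver
query_result = pd.read_sql(query, conn, params=[startDate_sql])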

Related

How to query a T-SQL temp table with connectorx (pandas slow)

I am using pyodbc to run a query to create a temp table from a bunch of other tables. I then want to pull that whole temp table into pandas, but my pd.read_sql call takes upwards of 15 minutes. I want to try the connectorX library to see if it will speed things up.
For pandas, the working way to query the temp table looks like this:
conn = connection("connection string")
cursor = conn.cursor()
cursor.execute("""Do a bunch of stuff that ultimately creates one #finalTable""")
df = pd.read_sql("SELECT * FROM #finalTable", con=conn)
I've been reading the documentation and it appears I can only pass a connection string to the connectorx.read_sql function, and I haven't been able to find a way to pass it an existing connection that carries the temp table I need.
Am I able to query the temp table with connectorX? If so how?
If not, what would be a faster way to query a large temp table?
Thanks!
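For reference, the basic connectorx call the question describes looks roughly like the sketch below; the connection string and table name are placeholders. Because connectorx opens its own connection(s) from the string, a session-scoped #temp table created on a separate pyodbc connection won't be visible to it.
import connectorx as cx

# placeholder connection string; adjust driver, credentials, host, and database
conn_string = "mssql://username:password@server:1433/database"

# connectorx connects on its own, so it can only see objects visible to a brand-new session
df = cx.read_sql(conn_string, "SELECT * FROM dbo.SomePermanentTable", return_type="pandas")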

Fetch Queries from Snowflake table and execute each query using Python

I have stored queries (like select * from table) in a Snowflake table and want to execute each query row-by-row and generate a CSV file for each query. Below is the python code where I am able to print the queries but don't know how to execute each query and create a CSV file:
I believe I am close to what I want to achieve. I would really appreciate if someone can help over here.
import pyodbc
import pandas as pd
import snowflake.connector
import os

conn = snowflake.connector.connect(
    user = 'User',
    password = 'Pass',
    account = 'Account',
    autocommit = True
)

try:
    cursor = conn.cursor()
    query = 'Select Column from Table;'  # this will return two select statements
    output = cursor.execute(query)
    for i in cursor:
        print(i)
    cursor.close()
    del cursor
    conn.close()
except Exception as e:
    print(e)
You're pretty close. You just need to execute each query instead of printing it, and put the data into a file.
I haven't used pandas much myself, but this is the code that Snowflake documentation provides for running a query and putting it into a pandas dataframe.
cursor = conn.cursor()
query = 'Select Column, row_number() over(order by Column) as Rownum from Table;'
cursor.execute(query)
resultset = cursor.fetchall()
for result in resultset:
    # result[0] is the stored query text, result[1] is its row number
    cursor.execute(result[0])
    df = cursor.fetch_pandas_all()
    df.to_csv(r'C:\Users\...<your filename here>' + str(result[1]), index=False)
It may take some fiddling, but here are a couple of links for reference:
Snowflake docs on creating a pandas dataframe
Exporting a pandas dataframe to csv
Update: added an example of a way to create separate files for each record. This just adds a distinct number to each row of your SQL output so you can use that number as part of the filename. Ultimately, you need some logic in your loop to create a filename, whether that's a random number, a timestamp, or something else; that can come from the SQL or from the Python side, up to you. I'd probably add a filename column to your table, but I don't know if that makes sense for you.

Handling UUID values in Arrow with Parquet files

I'm new to Python and Pandas - please be gentle!
I'm using SqlAlchemy with pymssql to execute a SQL query against a SQL Server database and then convert the result set into a dataframe. I'm then attempting to write this dataframe as a Parquet file:
engine = sal.create_engine(connectionString)
conn = engine.connect()
df = pd.read_sql(query, con=conn)
df.to_parquet(outputFile)
The data I'm retrieving in the SQL query includes a uniqueidentifier column (i.e. a UUID) named rowguid. Because of this, I'm getting the following error on the last line above:
pyarrow.lib.ArrowInvalid: ("Could not convert UUID('92c4279f-1207-48a3-8448-4636514eb7e2') with type UUID: did not recognize Python value type when inferring an Arrow data type", 'Conversion failed for column rowguid with type object')
Is there any way I can force all UUIDs to strings at any point in the above chain of events?
A few extra notes:
The goal for this portion of code was to receive the SQL query text as a parameter and act as a generic SQL-to-Parquet function.
I realise I can do something like df['rowguid'] = df['rowguid'].astype(str), but it relies on me knowing which columns have uniqueidentifier types. By the time it's a dataframe, everything is an object and each query will be different.
I also know I can convert it to a char(36) in the SQL query itself, however, I was hoping to do something more "automatic" so the person writing the query doesn't trip over this problem accidentally all the time / doesn't have to remember to always convert the datatype.
Any ideas?
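For what it's worth, the "automatic" version of the astype(str) idea mentioned above could be sketched as a small helper that scans object-typed columns and casts any that hold uuid.UUID values; the function name is hypothetical, and the scan costs a pass over each object column:
import uuid
import pandas as pd

def uuids_to_str(df):
    # hypothetical helper: cast every object column holding uuid.UUID values to plain strings
    for col in df.select_dtypes(include='object').columns:
        if df[col].map(lambda v: isinstance(v, uuid.UUID)).any():
            df[col] = df[col].astype(str)
    return df

df = uuids_to_str(pd.read_sql(query, con=conn))
df.to_parquet(outputFile)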
Try DuckDB
import duckdb

engine = sal.create_engine(connectionString)
conn = engine.connect()
df = pd.read_sql(query, con=conn)

# Close the database connection
conn.close()

# Create an in-memory DuckDB connection
duck_conn = duckdb.connect(':memory:')

# Write the DataFrame's contents to a snappy-compressed Parquet file;
# DuckDB can query the pandas dataframe df directly by name
duck_conn.execute("COPY (SELECT * FROM df) TO 'df-snappy.parquet' (FORMAT 'parquet')")
Ref:
https://duckdb.org/docs/guides/python/sql_on_pandas
https://duckdb.org/docs/sql/data_types/overview
https://duckdb.org/docs/data/parquet

Using python to change and run SQL queries

I have the following code that creates a dataframe based on user input:
import pandas as pd
from pandas import DataFrame

publications = pd.read_csv("C:/Users/nkambhal/data/pubmed_search_results_180730.csv", sep="|")
publications['title'] = publications['title'].fillna('')

search_term = input('Enter the term you are looking for: ')
publications[['title','publication_id']][publications['title'].str.contains(search_term)]

title_mask = publications.title.str.lower().str.contains(search_term.lower())
new = publications.loc[title_mask, ['title', 'publication_ID']]
Now I want to use the publication IDs in the new dataframe to run this SQL query:
SELECT
author_profile
pub_lst.*
FROM
pub_lst
JOIN
author_profile
ON pub_lst.author_id = author_profile.author_id
WHERE
pub_lst.publication_id IN (67855,65559);
In the WHERE clause, I want the IDs from the new dataframe to be used. So if the dataframe contains the publication_ids (5, 6, 4), I want those to be added to the query.
How can I add the appropriate publication_ids to the SQL query and run it through python and save it to a csv file?
To put data into a string, you can use Python's str.format function. You can read about it a little more in the Python documentation.
For your query string, it should work out like so:
query_string = """
SELECT
author_profile
pub_lst.*
FROM
pub_lst
JOIN
author_profile
ON pub_lst.author_id = author_profile.author_id
WHERE
pub_lst.publication_id IN {};
"""
print(query_string.format(str(tuple(new.publication_ID.values))))
As for running the query, you will need to use a Python module for whichever database you want to connect to, such as PyMySQL for a MySQL database: https://pypi.org/project/PyMySQL/
Alternatively, you could use an ORM such as peewee or SQLAlchemy to make your life a little easier when dealing with SQL databases. Pandas and SQLAlchemy mix really well, but peewee is easier to get started with.
For creating a CSV, you could use the built-in Python csv module, pandas, peewee, or SQLAlchemy, in ascending order of difficulty.
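Putting those pieces together, a minimal sketch might look like the following, assuming a MySQL database reachable through SQLAlchemy/PyMySQL (the connection URL and output filename are placeholders) and reusing query_string and new from above:
import pandas as pd
from sqlalchemy import create_engine

# placeholder connection URL; substitute your real credentials and database name
engine = create_engine("mysql+pymysql://user:password@host/dbname")

# note: a single-element tuple renders as "(5,)", whose trailing comma SQL will reject
ids = tuple(new.publication_ID.values)

results = pd.read_sql(query_string.format(ids), engine)
results.to_csv("publication_results.csv", index=False)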

How to convert sql table into a pyspark/python data structure and return back to sql in databricks notebook

I am running a sql notebook on databricks. I would like to analyze a table with half a billion records in it. I can run simple sql queries on the data. However, I need to change the date column type from str to date.
Unfortunately, update/alter statements do not seem to be supported by sparkSQL so it seems I cannot modify the data in the table.
What would be the one-line of code that would allow me to convert the SQL table to a python data structure (in pyspark) in the next cell?
Then I could modify the file and return it to SQL.
df = sqlContext.sql("select * from myTable")
To convert the dataframe back to a SQL view:
df.createOrReplaceTempView("myview")
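For the column-type change itself, a minimal sketch (assuming the string column is named date_col and is formatted as yyyy-MM-dd; both the name and the format are assumptions):
from pyspark.sql.functions import to_date

df = sqlContext.sql("select * from myTable")

# cast the assumed date_col string column to a proper date type
df = df.withColumn("date_col", to_date(df["date_col"], "yyyy-MM-dd"))

df.createOrReplaceTempView("myview")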
