Handling UUID values in Arrow with Parquet files - python

I'm new to Python and Pandas - please be gentle!
I'm using SqlAlchemy with pymssql to execute a SQL query against a SQL Server database and then convert the result set into a dataframe. I'm then attempting to write this dataframe as a Parquet file:
engine = sal.create_engine(connectionString)
conn = engine.connect()
df = pd.read_sql(query, con=conn)
df.to_parquet(outputFile)
The data I'm retrieving in the SQL query includes a uniqueidentifier column (i.e. a UUID) named rowguid. Because of this, I'm getting the following error on the last line above:
pyarrow.lib.ArrowInvalid: ("Could not convert UUID('92c4279f-1207-48a3-8448-4636514eb7e2') with type UUID: did not recognize Python value type when inferring an Arrow data type", 'Conversion failed for column rowguid with type object')
Is there any way I can force all UUIDs to strings at any point in the above chain of events?
A few extra notes:
The goal for this portion of code was to receive the SQL query text as a parameter and act as a generic SQL-to-Parquet function.
I realise I can do something like df['rowguid'] = df['rowguid'].astype(str), but it relies on me knowing which columns have uniqueidentifier types. By the time it's a dataframe, everything is an object and each query will be different.
I also know I can convert it to a char(36) in the SQL query itself, however, I was hoping to do something more "automatic" so the person writing the query doesn't trip over this problem accidentally all the time / doesn't have to remember to always convert the datatype.
Any ideas?
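For illustration, a generic version of the astype workaround mentioned above (just a sketch; the helper name is made up, and it simply scans object columns for uuid.UUID values before writing) might look like this:
import uuid
import pandas as pd

def stringify_uuid_columns(df: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical helper: replace uuid.UUID objects with their string form,
    # leaving nulls and non-UUID values untouched.
    for col in df.select_dtypes(include='object').columns:
        df[col] = df[col].map(lambda v: str(v) if isinstance(v, uuid.UUID) else v)
    return df

df = stringify_uuid_columns(df)
df.to_parquet(outputFile)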

Try DuckDB
import duckdb

engine = sal.create_engine(connectionString)
conn = engine.connect()
df = pd.read_sql(query, con=conn)

# Close the database connection
conn.close()

# Create an in-memory DuckDB connection
duck_conn = duckdb.connect(':memory:')

# Write the DataFrame content to a snappy-compressed Parquet file;
# DuckDB can query the pandas DataFrame `df` directly by name
duck_conn.execute("COPY (SELECT * FROM df) TO 'df-snappy.parquet' (FORMAT 'parquet')")
Ref:
https://duckdb.org/docs/guides/python/sql_on_pandas
https://duckdb.org/docs/sql/data_types/overview
https://duckdb.org/docs/data/parquet

Related

Getting bytes written to SQL database for integer column, want integer to be written

I am writing data from a Dash app to a SQL database set up by Django and then reading the table back in a callback. I have a column whose value should be either 1 or 2, but in the SQL database it appears as below:
[Screenshot: SQL database view of the column that should contain 1 or 2]
When this is read back into a pandas dataframe it appears as b'\00x\01x... or something along those lines, which then gets read incorrectly when it needs to be used.
The django code for the column is:
selected = models.IntegerField(default=1, null=True)
I am writing and reading the data using SQLAlchemy. The number appeared correctly in the pandas dataframe before SQL was involved. Read and write code:
select = pd.read_sql_table('temp_sel', con=engine)
select.to_sql('temp_sel', con=engine, if_exists='append', index=False)
Any help would be appreciated.
Solved by specifying the variable as an integer before sending to SQL as follows:
j = int(j)
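A column-level version of the same fix (a sketch; it assumes the column is named selected, as in the Django model above) would be to cast the dtype before calling to_sql:
# Cast the column to an integer dtype before writing,
# so it is not serialized to the database as bytes
select['selected'] = select['selected'].astype(int)
select.to_sql('temp_sel', con=engine, if_exists='append', index=False)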

Read SQL query into pandas dataframe and replace string in query

I'm querying my SQL Server (SSMS) database from pandas. The query I have is pretty large, so I've saved it locally and want to read it from that file into a pandas dataframe. There is also a date string in the query that I want to replace with a date I've already assigned in pandas. For reference's sake I'll shorten the query.
I'm currently following below:
query = """SELECT * FROM table where date > 'date_string' """
query_result = pd.read_sql(query, conn)
Instead of writing SELECT * ... inline in pandas, I've saved my query locally and want pandas to read it from that file, and also replace date_string with startDate_sql.
My date_string keeps changing as I'm looping through a list of dates.
The pandas code would look like
query = 'C:\Users\Admin\Desktop\Python\Open Source\query.sql'
query.replace(date_string, startDate_sql)
query_result = pd.read_sql(query, conn)
This way I'm not writing the query out in pandas, as it is a huge query and takes up a lot of space.
Can someone please tell me how to solve this and what is the correct syntax?
Thank you very much!
Reading a file in Python
Here's how to read in a text file in Python.
query_filename = r'C:\Users\Admin\Desktop\Python\Open Source\query.sql'
# 'rt' means open for reading, in text mode
with open(query_filename, 'rt') as f:
    # read the contents of query_filename into a variable named query
    query = f.read()
# replace the literal string 'date_string' with the contents of the variable startDate_sql
query = query.replace('date_string', startDate_sql)
# get dataframe from database
query_result = pd.read_sql(query, conn)
Using parameterized queries
You should probably avoid string replacement when constructing queries, because it is vulnerable to SQL injection. Parameterized queries avoid this problem. Here's an example of how to use query parameterization with pandas.
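A minimal sketch (assuming conn is a SQLAlchemy connection and startDate_sql holds the cutoff date; the table and parameter names are illustrative):
import pandas as pd
from sqlalchemy import text

# :start_date is a named bind parameter; pandas passes params through to the driver
query = text("SELECT * FROM some_table WHERE date > :start_date")
query_result = pd.read_sql(query, conn, params={"start_date": startDate_sql})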

Pandas to_sql changing datatype in database table

Has anyone experienced this before?
I have a table with "int" and "varchar" columns - a report schedule table.
I am trying to import an Excel file with an ".xls" extension into this table using a Python program. I am using pandas to_sql to write 1 row of data.
Data imported is 1 row 11 columns.
Import works successfully but after the import I noticed that the datatypes in the original table have now been altered from:
int --> bigint
char(1) --> varchar(max)
varchar(30) --> varchar(max)
Any idea how I can prevent this? The switch in datatypes is causing issues in downstream routines.
df = pd.read_excel(schedule_file,sheet_name='Schedule')
params = urllib.parse.quote_plus(r'DRIVER={SQL Server};SERVER=<<IP>>;DATABASE=<<DB>>;UID=<<UDI>>;PWD=<<PWD>>')
conn_str = 'mssql+pyodbc:///?odbc_connect={}'.format(params)
engine = create_engine(conn_str)
table_name='REPORT_SCHEDULE'
df.to_sql(name=table_name,con=engine, if_exists='replace',index=False)
TIA
Consider using the dtype argument of pandas.DataFrame.to_sql where you pass a dictionary of SQLAlchemy types to named columns:
import sqlalchemy
...
data.to_sql(name=table_name, con=engine, if_exists='replace', index=False,
            dtype={'name_of_datefld': sqlalchemy.types.DateTime(),
                   'name_of_intfld': sqlalchemy.types.INTEGER(),
                   'name_of_strfld': sqlalchemy.types.VARCHAR(length=30),
                   'name_of_floatfld': sqlalchemy.types.Float(precision=3, asdecimal=True),
                   'name_of_booleanfld': sqlalchemy.types.Boolean()})
I think this has more to do with how pandas handles the table if it exists. The "replace" value to the if_exists argument tells pandas to drop your table and recreate it. But when re-creating your table, it will do it based on its own terms (and the data stored in that particular DataFrame).
While providing column datatypes will work, doing it for every such case might be cumbersome. So I would rather truncate the table in a separate statement and then just append data to it, like so:
Instead of:
df.to_sql(name=table_name, con=engine, if_exists='replace',index=False)
I'd do:
with engine.connect() as con:
    con.execute("TRUNCATE TABLE %s" % table_name)

df.to_sql(name=table_name, con=engine, if_exists='append', index=False)
The truncate statement basically drops and recreates your table too, but it's done internally by the database, and the table gets recreated with the same definition.

Turn .sql database dump into pandas dataframe

I have a .sql file that contains a database dump. I would prefer to get this file into a pandas dataframe so that I can view and manipulate the data. I'm willing to accept any solution, but I need explicit instructions; I've never worked with a .sql file before.
The file's structure is as follows:
-- MySQL dump 10.13 Distrib 8.0.11, for Win64 (x86_64)
--
-- Host: localhost Database: somedatabase
-- ------------------------------------------------------
-- Server version 8.0.11
DROP TABLE IF EXISTS `selected`;
CREATE TABLE `selected` (
`date` date DEFAULT NULL,
`weekday` int(1) DEFAULT NULL,
`monthday` int(4) DEFAULT NULL,
... [more variables]) ENGINE=somengine DEFAULT CHARSET=something COLLATE=something;
LOCK TABLES `selected` WRITE;
INSERT INTO `selected` VALUES (dateval, weekdayval, monthdayval), (dateval, weekdayval, monthdayval), ... (dateval, weekdayval, monthdayval);
INSERT INTO `selected` VALUES (...), (...), ..., (...);
... (more insert statements) ...
-- Dump completed on timestamp
You should use the sqlalchemy library for this:
https://docs.sqlalchemy.org/en/13/dialects/mysql.html
Or alternatively you could use this:
https://pynative.com/python-mysql-database-connection/
The second option may be easier for loading your data into MySQL, as you can just take the text of your .sql file as the query and pass it to the connection.
Something like this:
import mysql.connector

connection = mysql.connector.connect(host='localhost',
                                     database='database',
                                     user='user',
                                     password='pw')

query = yourSQLfile  # the text of your .sql dump, read in as a single string
cursor = connection.cursor()
# note: a dump containing multiple statements may need cursor.execute(query, multi=True)
result = cursor.execute(query)
Once you've loaded your table you create the engine with sqlalchemy to connect pandas to your database and simply use the pandas read_sql() command to load your table to a dataframe object.
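A sketch of that step (assuming the dump was loaded into a local MySQL database named somedatabase and the table of interest is selected; adjust the connection string to your setup):
import pandas as pd
from sqlalchemy import create_engine

# connection string is illustrative; adjust user, password, host and database
engine = create_engine('mysql+mysqlconnector://user:pw@localhost/somedatabase')
df = pd.read_sql('SELECT * FROM selected', con=engine)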
Another note: if you just want to manipulate the data, you could take the VALUES clauses from the .sql file and use them to populate a dataframe manually. For example, convert the VALUES (...), (...), (...) list into a Python list of tuples and load that into a dataframe. Alternatively, dump the VALUES text into Excel, delete the parentheses, run text-to-columns, add headers, save it, and load that spreadsheet into a dataframe; or just manipulate it in Excel (you could even use a CONCAT formula to recreate the SQL VALUES syntax and replace the data in the .sql file). It really depends on exactly what your end goal is here.
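A rough sketch of that "parse the VALUES clauses directly" idea (the helper is hypothetical and assumes simple literals with no embedded commas or quotes; the column names are taken from the dump above):
import re
import pandas as pd

def dump_to_dataframe(dump_path, table='selected', columns=('date', 'weekday', 'monthday')):
    # Read the whole dump and pull out every INSERT INTO `<table>` VALUES ...; statement
    with open(dump_path, 'rt') as f:
        dump = f.read()
    rows = []
    for stmt in re.findall(r"INSERT INTO `%s` VALUES (.+?);" % table, dump, flags=re.S):
        # Each statement holds many (v1, v2, v3) tuples; split them into rows
        for tup in re.findall(r"\(([^)]*)\)", stmt):
            rows.append([v.strip().strip("'") for v in tup.split(',')])
    return pd.DataFrame(rows, columns=list(columns))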
Sorry you did not receive a timely answer here.

How to convert sql table into a pyspark/python data structure and return back to sql in databricks notebook

I am running a SQL notebook on Databricks. I would like to analyze a table with half a billion records in it. I can run simple SQL queries on the data, but I need to change the date column's type from string to date.
Unfortunately, UPDATE/ALTER statements do not seem to be supported by Spark SQL, so it seems I cannot modify the data in the table.
What would be the one-line of code that would allow me to convert the SQL table to a python data structure (in pyspark) in the next cell?
Then I could modify the file and return it to SQL.
df = sqlContext.sql("select * from myTable")
To convert the dataframe back to a SQL view:
df.createOrReplaceTempView("myview")
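Putting the round trip together (a sketch; it assumes the string column is named date in the yyyy-MM-dd format, and uses the spark entry point, which is equivalent to sqlContext on recent runtimes):
from pyspark.sql import functions as F

# read the table, cast the string column to a proper date, and expose it back to SQL
df = spark.sql("SELECT * FROM myTable")
df = df.withColumn("date", F.to_date(F.col("date"), "yyyy-MM-dd"))
df.createOrReplaceTempView("myview")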
