How to make df.to_sql() create varchar2 object - python

I have a DataFrame which consists of a column of strings. If I do df.to_sql() to save it as a table into an Oracle database, the column is of CLOB type and I need to convert it. I wonder if I can specify the type (say varchar2) when I create the table?

You can specify the SQLAlchemy type explicitly:
import cx_Oracle
from sqlalchemy import types, create_engine
engine = create_engine('oracle://user:password@host_or_scan_address:1521/ORACLE_SERVICE_NAME')
df.to_sql('table_name', engine, if_exists='replace',
          dtype={'str_column': types.VARCHAR(df.str_column.str.len().max())})
df.str_column.str.len().max() calculates the maximum string length, so the column is sized to fit the longest value.
NOTE: types.VARCHAR will be mapped to VARCHAR2 by the Oracle dialect.

You have two options. The first is to create the table manually and then use the if_exists parameter to tell pandas to append to the table rather than drop and recreate it.
Option two is to use the dtype parameter to pass a dictionary mapping column names to types so that the table is created appropriately. These are SQLAlchemy types, so you should
from sqlalchemy.dialects.oracle import VARCHAR2
and pass that in the dictionary as
{'mycolumn': VARCHAR2(256) }
or whatever length suits your data.
Ref: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_sql.html
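A minimal sketch of option two, assuming a DataFrame df with a string column mycolumn and an already created Oracle engine:
from sqlalchemy.dialects.oracle import VARCHAR2

# the dtype mapping makes pandas create the column as VARCHAR2 instead of CLOB
df.to_sql('my_table', engine, if_exists='replace', index=False,
          dtype={'mycolumn': VARCHAR2(256)})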

Related

Snowflake table created with SQLAlchemy requires quotes ("") to query

I am ingesting data into Snowflake tables using Python and SQLAlchemy. These tables that I have created all require quotes to query both the table name and the column names. For example, select * from "database"."schema"."table" where "column" = 2; will run, while select * from database.schema.table where column = 2; will not run. The only difference is the quotes.
I understand that if a table is created in Snowflake with quotes then quotes will be required to query it. However, I only loaded an Excel file into a pandas DataFrame and then used SQLAlchemy and pd.to_sql to create the table. An example of my code:
engine = create_engine(URL(
    account = 'my_account',
    user = 'my_username',
    password = 'my_password',
    database = 'My_Database',
    schema = 'My_Schema',
    warehouse = 'My_Wh',
    role = 'My Role',
))
connection = engine.connect()
df.to_sql('My_Table', con=engine, if_exists='replace', index=False, index_label=None, chunksize=16384)
Does SQLAlchemy automatically create the tables with quotes? Is this a problem with the schema? I did not set that up. Is there a way around this?
From the Snowflake SQLAlchemy GitHub documentation:
Object Name Case Handling
Snowflake stores all case-insensitive object
names in uppercase text. In contrast, SQLAlchemy considers all
lowercase object names to be case-insensitive. Snowflake SQLAlchemy
converts the object name case during schema-level communication, i.e.
during table and index reflection. If you use uppercase object names,
SQLAlchemy assumes they are case-sensitive and encloses the names with
quotes. This behavior will cause mismatches against data dictionary
data received from Snowflake, so unless identifier names have been
truly created as case sensitive using quotes, e.g., "TestDb", all
lowercase names should be used on the SQLAlchemy side.
What I think this is trying to say is that SQLAlchemy treats any names containing capital letters as case-sensitive and automatically encloses them in quotes; conversely, any names in lower case are not quoted. It doesn't look like this behaviour is configurable.
You probably don't have any control over the database and possibly the schema names, but when creating your table, if you want consistent behaviour whether quoted or unquoted, you should stick to lower-case naming. What you should find is that the table name will then work whether you use "my_table" or my_table.
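For example, a rough sketch using the df and engine from the question and a hypothetical lowercase table name:
from sqlalchemy import text

# a lowercase name is treated as case-insensitive by snowflake-sqlalchemy,
# so it is not wrapped in quotes when the table is created
df.to_sql('my_table', con=engine, if_exists='replace', index=False)

with engine.connect() as connection:
    # the unquoted identifier now resolves without any quoting
    rows = connection.execute(text('select * from my_table')).fetchall()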

Is it possible to get sql column metadata from pandas DataFrame in python?

I read a database table into a pandas DataFrame with Python and I am able to retrieve the types of the columns:
import sqlite3
import pandas
with sqlite3.connect(r'd:\database.sqlite') as connection:
    dataFrame = pandas.read_sql_query('SELECT * FROM my_table', connection)
dataFrame.dtypes
I am wondering if the DataFrame also includes further metadata about the table, e.g. if a column is nullable, the default value of the column and if the column is a primary key?
I would expect some methods like
dataFrame.isNullable('columnName')
but could not find such methods.
If the meta data is not included in the DataFrame I would have to use an extra query to retrieve that data, e.g.
PRAGMA table_info('my_table')
That would give the columns
cid, name, type, notnull, dflt_value, pk
However, I would like to avoid that extra query. Especially if my original query does not contain all columns or if it defines some extra columns it could become complicated to find the corresponding metadata.
=>If the DataFrame already contains the wanted metadata, please let me know how to access it.
(In Java I would access the metadata with ResultSetMetaData metaData = resultSet.getMetaData(); )
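If the DataFrame turns out not to carry this metadata, a minimal sketch of the PRAGMA fallback described above (the column names follow the PRAGMA output):
import sqlite3
import pandas

with sqlite3.connect(r'd:\database.sqlite') as connection:
    # one row per column: cid, name, type, notnull, dflt_value, pk
    meta = pandas.read_sql_query("PRAGMA table_info('my_table')", connection)
    # e.g. build a column-name -> nullable lookup from the notnull flag
    nullable = dict(zip(meta['name'], meta['notnull'] == 0))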

Pandas to_sql changing datatype in database table

Has anyone experienced this before?
I have a table with "int" and "varchar" columns - a report schedule table.
I am trying to import an excel file with ".xls" extension to this table using a python program. I am using pandas to_sql to read in 1 row of data.
Data imported is 1 row 11 columns.
Import works successfully but after the import I noticed that the datatypes in the original table have now been altered from:
int --> bigint
char(1) --> varchar(max)
varchar(30) --> varchar(max)
Any idea how I can prevent this? The switch in datatypes is causing issues in downstream routines.
df = pd.read_excel(schedule_file,sheet_name='Schedule')
params = urllib.parse.quote_plus(r'DRIVER={SQL Server};SERVER=<<IP>>;DATABASE=<<DB>>;UID=<<UDI>>;PWD=<<PWD>>')
conn_str = 'mssql+pyodbc:///?odbc_connect={}'.format(params)
engine = create_engine(conn_str)
table_name='REPORT_SCHEDULE'
df.to_sql(name=table_name,con=engine, if_exists='replace',index=False)
TIA
Consider using the dtype argument of pandas.DataFrame.to_sql where you pass a dictionary of SQLAlchemy types to named columns:
import sqlalchemy
...
data.to_sql(name=table_name, con=engine, if_exists='replace', index=False,
            dtype={'name_of_datefld': sqlalchemy.types.DateTime(),
                   'name_of_intfld': sqlalchemy.types.INTEGER(),
                   'name_of_strfld': sqlalchemy.types.VARCHAR(length=30),
                   'name_of_floatfld': sqlalchemy.types.Float(precision=3, asdecimal=True),
                   'name_of_booleanfld': sqlalchemy.types.Boolean})
I think this has more to do with how pandas handles the table if it exists. The "replace" value to the if_exists argument tells pandas to drop your table and recreate it. But when re-creating your table, it will do it based on its own terms (and the data stored in that particular DataFrame).
While providing column datatypes will work, doing it for every such case might be cumbersome. So I would rather truncate the table in a separate statement and then just append data to it, like so:
Instead of:
df.to_sql(name=table_name, con=engine, if_exists='replace',index=False)
I'd do:
with engine.connect() as con:
    con.execute("TRUNCATE TABLE %s" % table_name)

df.to_sql(name=table_name, con=engine, if_exists='append', index=False)
The TRUNCATE statement empties the table but leaves its definition untouched, and it's handled internally by the database, so the columns keep their original datatypes.
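On newer SQLAlchemy versions raw SQL strings have to be wrapped in text() and executed inside a transaction; a hedged sketch of the same truncate-then-append approach:
from sqlalchemy import text

# begin() commits the transaction when the block exits
with engine.begin() as con:
    con.execute(text("TRUNCATE TABLE %s" % table_name))

df.to_sql(name=table_name, con=engine, if_exists='append', index=False)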

sqlalchemy + postgresql hstore to string

How do I convert an sqlalchemy hstore value to a string?
from sqlalchemy.dialects.postgresql import array, hstore
hs = hstore(array(['key1', 'key2', 'key3']), array(['value1', 'value2', 'value3']))
# this triggers sqlalchemy.exc.UnsupportedCompilationError
str(hs)
I expect something like "key1"=>"value1", "key2"=>"value2", "key3"=>"value3"
I would like to use an sqlalchemy api rather than write a custom string formatting function that approximates what I want. I'm working with a legacy code base that uses sqlalchemy: I need to preserve any internal quirks and escaping logic that formatting does.
However, the existing code base uses sqlalchemy via an ORM table insert, whereas I want to directly convert an sqlalchemy hstore value to a string.
UPDATE: I am trying to do something like this:
I have an existing table with schema
create table my_table
(
    id bigint default nextval('my_table_id_seq'::regclass),
    ts timestamp default now(),
    text_col_a text,
    text_col_b text
);
I want to get the following Python sqlalchemy code working:
str_value = some_function()
# Existing code is building an sqlalchemy hstore and inserting
# into a column of type `text`, not an `hstore` column.
# I want it to work with hstore text formatting
hstore_value = legacy_build_my_hstore()
# as is this triggers error:
# ProgrammingError: (psycopg2.ProgrammingError) can't adapt type 'hstore'
return db_connection.execute(
    """
    insert into my_table(text_col_a, text_col_b) values (%s, %s)
    returning id, ts
    """,
    (str_value, hstore_value)).first()
Let PostgreSQL do the cast for you instead of trying to manually convert the hstore construct to a string, and let SQLAlchemy handle the conversion to a suitable text representation:
return db_connection.execute(
    my_table.insert().
    values(text_col_a=str_value,
           text_col_b=cast(hstore_value, Text)).
    returning(my_table.c.id, my_table.c.ts)).first()
As soon as you can, alter your schema to use hstore type instead of text, if that is what the column contains.
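Note that my_table in the snippet above is a SQLAlchemy Table object rather than just the table name; a minimal sketch of obtaining it (along with the cast and Text used above) by reflecting the existing table, assuming engine is the Engine behind db_connection:
from sqlalchemy import MetaData, Table, Text, cast

metadata = MetaData()
# reflect the existing table definition from the database
# (on older SQLAlchemy versions use autoload=True, autoload_with=engine)
my_table = Table('my_table', metadata, autoload_with=engine)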

How to insert a pandas dataframe to an already existing table in a database?

I'm using sqlalchemy in pandas to query a Postgres database and then insert the results of a transformation into another table in the same database. But when I do
df.to_sql('db_table2', engine) I get this error message:
ValueError: Table 'db_table2' already exists. I noticed it wants to create a new table. How do I insert a pandas DataFrame into an already existing table?
df = pd.read_sql_query('select * from "db_table1"',con=engine)
#do transformation then save df to db_table2
df.to_sql('db_table2', engine)
ValueError: Table 'db_table2' already exists
Make use of the if_exists parameter:
df.to_sql('db_table2', engine, if_exists='replace')
or
df.to_sql('db_table2', engine, if_exists='append')
From the docstring:
"""
if_exists : {'fail', 'replace', 'append'}, default 'fail'
- fail: If table exists, do nothing.
- replace: If table exists, drop it, recreate it, and insert data.
- append: If table exists, insert data. Create if does not exist.
"""
Zen of Python:
Explicit is better than implicit.
df.to_sql(
    name,             # Name of SQL table.
    con,              # sqlalchemy.engine.Engine or sqlite3.Connection
    schema=None,      # Specify the schema (if the database flavor supports this); usually leave as None.
    if_exists='fail', # How to behave if the table already exists. You can use 'replace' or 'append' instead.
    index=True,       # Write the DataFrame index as a column. Set False to ignore the index.
    index_label=None, # Column label for the index column(s); depends on index.
    chunksize=None,   # Rows are written in batches of this size; useful when the DataFrame is big.
    dtype=None,       # Set the column types of the SQL table.
    method=None,      # Controls the SQL insertion clause; usually leave as None.
)
So I would normally recommend an example like this:
from sqlalchemy.types import String, FLOAT, INT, DATETIME

df.to_sql(con=engine, name='table_name', if_exists='append', index=False,
          dtype={'Column1': String(255),
                 'Column2': FLOAT,
                 'Column3': INT,
                 'createTime': DATETIME})
Set the SQL table's primary key manually (e.g. Id) and enable auto-increment in Navicat or MySQL Workbench.
The Id will then increment automatically.
The Docstring of df.to_sql:
Parameters
----------
name : string
Name of SQL table.
con : sqlalchemy.engine.Engine or sqlite3.Connection
Using SQLAlchemy makes it possible to use any DB supported by that
library. Legacy support is provided for sqlite3.Connection objects.
schema : string, optional
Specify the schema (if database flavor supports this). If None, use
default schema.
if_exists : {'fail', 'replace', 'append'}, default 'fail'
How to behave if the table already exists.
* fail: Raise a ValueError.
* replace: Drop the table before inserting new values.
* append: Insert new values to the existing table.
index : bool, default True
Write DataFrame index as a column. Uses `index_label` as the column
name in the table.
index_label : string or sequence, default None
Column label for index column(s). If None is given (default) and
`index` is True, then the index names are used.
A sequence should be given if the DataFrame uses MultiIndex.
chunksize : int, optional
Rows will be written in batches of this size at a time. By default,
all rows will be written at once.
dtype : dict, optional
Specifying the datatype for columns. The keys should be the column
names and the values should be the SQLAlchemy types or strings for
the sqlite3 legacy mode.
method : {None, 'multi', callable}, default None
Controls the SQL insertion clause used:
* None : Uses standard SQL ``INSERT`` clause (one per row).
* 'multi': Pass multiple values in a single ``INSERT`` clause.
* callable with signature ``(pd_table, conn, keys, data_iter)``.
Details and a sample callable implementation can be found in the
section :ref:`insert method <io.sql.method>`.
.. versionadded:: 0.24.0
That's all.
