change schema for sql file (drop quotes in names) - python

This question is similar to drastega's question
I have similar problem, however I want to get rid of any quoting characters from names. Here is an example:
CREATE TABLE Resolved (
[Name] TEXT,
[Count] INTEGER,
[Obs_Date] TEXT,
[Bessel_year] REAL,
[Filter] TEXT,
[Comments] TEXT
);
changes to:
CREATE TABLE Resolved (
Name TEXT,
Count INTEGER,
Obs_Date TEXT,
Bessel_year REAL,
Filter TEXT,
Comments TEXT
);
Following the steps from the link above, I have managed to change "[" to double quotes. However, I don't want to use any quoting characters. I tried to read the documentation about sqlalchemy's metadata. I know that I need to use the quote=False parameter, but I don't know where to set it. Thank you in advance for your answers.
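If the goal is only to strip the bracket quoting from the schema text itself, one possibility is a plain regular-expression pass over the .sql text before it is executed. This is a minimal sketch, not something taken from the linked answer: the helper name strip_bracket_quotes is hypothetical, and it assumes every bracketed name consists only of letters, digits, and underscores. Names that contain spaces or that are reserved words would still need quoting, so the pattern deliberately leaves anything else untouched.

```python
import re

# Hypothetical helper (not part of pandas or SQLAlchemy): removes [ ] around
# names made only of letters, digits and underscores.
BRACKETED_NAME = re.compile(r"\[(\w+)\]")


def strip_bracket_quotes(ddl: str) -> str:
    """Return the DDL text with [Name] rewritten as Name."""
    return BRACKETED_NAME.sub(r"\1", ddl)


schema = """CREATE TABLE Resolved (
[Name] TEXT,
[Count] INTEGER,
[Obs_Date] TEXT
);"""

print(strip_bracket_quotes(schema))
```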

The code from Joris worked well in my case by just changing the line c.quote = False to c.name.quote = False
with a pandas version 0.23.4, sqlalchemy=1.2.13 and python 3.6 for a postgres database.

It is a bit strange that an sqlite application errors on the quotes (as sqlite should be case insensitive, quotes or not), and I would also be very cautious with some of the special characters you mention in column names. But if you need to insert the data into sqlite with a schema without quotes, you can do the following.
Starting from this:
import pandas as pd
from pandas.io import sql
import sqlalchemy
from sqlalchemy import create_engine
engine = create_engine('sqlite:///:memory:')
df = pd.DataFrame({'Col1': [1, 2], 'Col2': [0.1, 0.2]})
df.to_sql('test', engine, if_exists='replace', index=False)
So by default sqlalchemy uses quotes (because you have capitals, otherwise no quotes would be used):
In [8]: res = engine.execute("SELECT * FROM sqlite_master;").fetchall()
In [9]: print res[0][4]
CREATE TABLE test (
"Col1" BIGINT,
"Col2" FLOAT
)
SQLAlchemy has a quote parameter on each Column that you can set to False. To combine this with the pandas function, we have to use a workaround (as pandas already creates the Columns with the default values):
db = sql.SQLDatabase(engine)
t = sql.SQLTable('test', db, frame=df, if_exists='replace', index=False)
for c in t.table.columns:
    c.quote = False
t.create()
t.insert()
The above is equivalent to the to_sql call, but it lets you modify the created table object before writing to the database. Now you have no quotes:
In [15]: res = engine.execute("SELECT * FROM sqlite_master;").fetchall()
In [16]: print res[0][4]
CREATE TABLE test (
Col1 BIGINT,
Col2 FLOAT
)
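The same kind of check can be tried with nothing but the standard library. This is a small sketch using sqlite3 directly rather than pandas or SQLAlchemy; the table names quoted_test and plain_test are made up for illustration. It shows that SQLite keeps the CREATE statement text in sqlite_master, so the stored schema reveals whether the names were written with quotes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Two illustrative tables: one declared with quoted names, one without.
conn.execute('CREATE TABLE quoted_test ("Col1" BIGINT, "Col2" FLOAT)')
conn.execute("CREATE TABLE plain_test (Col1 BIGINT, Col2 FLOAT)")

# sqlite_master keeps the CREATE statement text for every table.
ddl = dict(conn.execute("SELECT name, sql FROM sqlite_master WHERE type = 'table'"))

print(ddl["quoted_test"])
print(ddl["plain_test"])
```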

You can try to use lower case for both table name and column names. Then SQLAlchemy won't quote the table name and column names.

Related

Column does not exist (SQLAlchemy / PostgreSQL): Trouble with quotation marks

I'm having trouble with a PostgreSQL query using SQLAlchemy.
I created some large tables using this line of code:
frame.to_sql('Table1', con=engine, method='multi', if_exists='append')
It worked fine. Now, when I want to query data out of it, my first problem is that I have to use quotation marks for each table and column name, and I don't really know why. Maybe somebody can help me out there.
That is not my main problem though. My main problem is that when querying the data, all numerical WHERE conditions work fine, but not the ones that compare against strings. I get an error that the column does not exist. I'm using:
df = pd.read_sql_query('SELECT "variable1", "variable2" FROM "Table1" WHERE "variable1" = 123 AND "variable2" = "abc" ', engine)
I think the problem might be that I use "abc" instead of 'abc', but I can't change it because of the ' signs that delimit the query argument. If I change those ' to ", then the column names and table names are not detected correctly (because of the earlier problem that they have to be in quotation marks).
This is the error message:
ProgrammingError: (psycopg2.errors.UndefinedColumn) ERROR: COLUMN »abc« does not exist
LINE 1: ...er" FROM "Table1" WHERE "variable2" = "abc"
And there is an arrow pointing to the first quotation mark of the "abc".
I'm new to SQL and I would really appreciate it if someone could point me in the right direction.
"Most" SQL dialects (notable exceptions being MS SQL Server and MS Access) strictly differentiate between
single quotes: for string literals, e.g., WHERE thing = 'foo'
double quotes: for object (table, column) names, e.g., WHERE "some col" = 123
PostgreSQL throws in the added wrinkle that table/column names are forced to lower case if they are not (double-)quoted and then uses case-sensitive matching, so if your table is named Table1 then
SELECT * FROM Table1 will fail because PostgreSQL will look for table1, but
SELECT * FROM "Table1" will succeed.
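The difference between the two kinds of quotes can be seen in a tiny experiment. The sketch below uses the standard-library sqlite3 module instead of PostgreSQL, so it only illustrates the quoting rule, not PostgreSQL's case folding; the table t and its contents are made up. The column is deliberately given a row whose value is the text 'variable2' so the two queries return different counts.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (variable2 TEXT)")
conn.executemany("INSERT INTO t VALUES (?)", [("abc",), ("variable2",)])

# Single quotes: compare the column with the string literal 'variable2'.
literal_matches = conn.execute(
    "SELECT COUNT(*) FROM t WHERE variable2 = 'variable2'"
).fetchone()[0]

# Double quotes: "variable2" is the column itself, so every row matches.
identifier_matches = conn.execute(
    'SELECT COUNT(*) FROM t WHERE variable2 = "variable2"'
).fetchone()[0]

print(literal_matches, identifier_matches)
```

The first query finds only the row holding that text, while the second compares the column with itself and therefore matches every row.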
The way to avoid confusion in your query is to use query parameters instead of string literals:
# set up test environment
with engine.begin() as conn:
    conn.exec_driver_sql('DROP TABLE IF EXISTS "Table1"')
    conn.exec_driver_sql('CREATE TABLE "Table1" (variable1 int, variable2 varchar(50))')
df1 = pd.DataFrame([(123, "abc"), (456, "def")], columns=["variable1", "variable2"])
df1.to_sql("Table1", engine, index=False, if_exists="append")
# test .read_sql_query() with parameters
import sqlalchemy as sa
sql = sa.text('SELECT * FROM "Table1" WHERE variable1 = :v1 AND variable2 = :v2')
param_dict = {"v1": 123, "v2": "abc"}
df2 = pd.read_sql_query(sql, engine, params=param_dict)
print(df2)
"""
variable1 variable2
0 123 abc
"""
It should be: AND "variable2" = 'abc'.
You cannot quote strings/literals with ", as PostgreSQL will interpret the quoted text as a database object. By the way, you do not need to wrap table and column names in double quotes unless it is strictly necessary, e.g. for case-sensitive object names or names containing spaces. In my opinion it is bad practice and in the long run only leads to confusion. So your query could be perfectly well written as follows:
SELECT variable1, variable2
FROM table1
WHERE variable1 = 123 AND variable2 = 'abc';
Keep in mind that it also applies for other objects, like tables or indexes.
CREATE TABLE Table1 (id int) - nice.
CREATE TABLE "Table1" (id int) - not nice.
CREATE TABLE "Table1" ("id" int) - definitely not nice ;)
In case you want to remove the unnecessary double quotes from your table name:
ALTER TABLE "Table1" RENAME TO table1;
Demo: db<>fiddle

Avoid pandas.to_sql writes into table with double quotes (PostgreSQL database)

I am trying to export my dataframe to sql database (Postgres).
I created the table as following:
CREATE TABLE dataops.OUTPUT
(
ID_TAIL CHAR(30) NOT NULL,
ID_MODEL CHAR(30) NOT NULL,
ID_FIN CHAR(30) NOT NULL,
ID_GROUP_FIN CHAR(30) NOT NULL,
ID_COMPONENT CHAR(30) NOT NULL,
DT_OPERATION TIMESTAMP NOT NULL,
DT_EXECUTION TIMESTAMP NOT NULL,
FT_VALUE_SENSOR FLOAT NOT NULL,
DT_LOAD TIMESTAMP NOT NULL
);
And I want to write this dataframe into that sql table:
conn = sqlalchemy.create_engine("postgres://root:1234@localhost:5432/postgres")
data = [['ID_1', 'A4_DOOUE_ADM001', '1201MJ52', 'PATH_1', 'LATCHED1AFT',
'2016-06-22 19:10:25', '2020-11-12 17:20:33.616016', 2.9, '2020-11-12 17:54:06.340735']]
output_df=pd.DataFrame(data,columns=["id_tail", "id_model", "id_fin", "id_group_fin", "id_component", "dt_operation",
"dt_execution", "ft_value_sensor", "dt_load"])
But when I run output_df.to_sql to write into the database, I see that a new table "OUTPUT", with double quotes, has been created with the data inserted.
output_df.to_sql(cfg.table_names["output_rep27"], conn, cfg.db_parameters["schema"], if_exists='append',index=False)
This is what I see in my DDBB:
But the same table without quotes is empty:
When you purposely make the insert fail (by changing a column name, for example), the error shows that pandas is inserting with double quotes:
How can I stop pandas from double-quoting the table name when it inserts?
Short version: pandas double-quotes identifiers, which is fairly standard. When that happens with an upper-case identifier, you have to double-quote it from then on whenever you use it. Using it unquoted folds the name to lower case, and you won't find the table. For more information on this, see Identifier Syntax. You have three choices: do as I suggested in the comment and force the name to lower case, always double-quote identifiers when using them, or modify the pandas source code to not double-quote.
I found the same question and here is the accepted answer for it
We need to set the dataframe column into lower case before we send it to PostgreSQL, and set a lower cased table name for the table, so we don't need to add double quotes when we select the table or columns
*EDIT: I found out that whitespace also forces pandas' to_sql function to write the table or column name with double quotes in PostgreSQL. So if you want the table or column name to be free of double quotes, replace the whitespace with non-whitespace characters or simply delete it from the table name or column name.
this is the example from my own case:
import pandas as pd
import re
from sqlalchemy import create_engine
df = pd.read_excel('data.xlsx')
ws = re.compile(r"\s+")
# lower the case, strip leading and trailing white space,
# and substitute the whitespace between words with underscore
df.columns = [ws.sub("_", i.lower().strip()) for i in df.columns]
my_db_name = 'postgresql://postgres:my_password@localhost:5432/db_name'
engine = create_engine(my_db_name)
df.to_sql('lowercase_table_name', engine) #use lower cased table name
this line of code worked for me
appended_data.columns = map(str.lower, df2.columns)
appended_data.to_sql('table_name', con=engine,
schema='public', index=False, if_exists='append',method='multi')
You need to use upper-case letters in pandas in order to get names without quotes in the SQL table.
Use this code on your df (the result has to be assigned back, otherwise nothing changes):
df.columns = df.columns.str.upper()
I didn't find a "good" solution, so I wrote my own function to insert the values:
import sqlalchemy
import pandas as pd
conn = sqlalchemy.create_engine("postgres://root:1234@localhost:5432/postgres")
data = [['ID_1', 'A4_DOOUE_ADM001', '1201MJ52', 'PATH_1', 'LATCHED1AFT',
         '2016-06-22 19:10:25', '2020-11-12 17:20:33.616016', 2.9, '2020-11-12 17:54:06.340735']]
output_df = pd.DataFrame(data, columns=["id_tail", "id_model", "id_fin", "id_group_fin", "id_component", "dt_operation",
                                        "dt_execution", "ft_value_sensor", "dt_load"])
def to_sql(output_df, table_name, conn, schema):
    # Build an INSERT with unquoted identifiers and one %s placeholder per column
    columns = ", ".join(output_df.columns)
    placeholders = ", ".join(["%s"] * output_df.shape[1])
    my_query = 'INSERT INTO ' + schema + '.' + table_name + ' (' + columns + ') VALUES (' + placeholders + ');'
    record_to_insert = output_df.applymap(str).values.tolist()
    conn.execute(my_query, record_to_insert)
# table_name and schema are the target table and schema names as plain strings
to_sql(output_df, table_name, conn, schema)
I hope it is useful for somebody
For those who are still looking for the answer:
Instead of writing
output_df.to_sql(name='some_schema.some_table', con=conn)
you should pass the schema through the corresponding to_sql() parameter:
output_df.to_sql(name='some_table', schema='some_schema', con=conn)
Otherwise 'some_schema.some_table' will be treated as a single table name and quoted.

passing string arguments to filter database rows in python

I have written the function below to filter a column in a SQL query. The function takes a string argument that should be used in the WHERE clause:
def summaryTable(machineid):
    df = pd.read_sql(""" SELECT fld_ATM FROM [003_tbl_ATM_Tables]
    WHERE (LINK <> 1) AND (fld_ATM =('machineid')) ;
    """, connection)
    connection.close()
    return df
The function returns an empty DataFrame. I know the query itself is correct because I get the expected data when I hard-code the machine id.
Use params to pass a tuple of parameters including machineid to read_sql. pyodbc replaces the ? character in your query with parameters from the tuple, in order. Their values will be safely substituted at runtime. This avoids dangerous string formatting issues which may result in SQL injection.
df = pd.read_sql(""" SELECT fld_ATM FROM [003_tbl_ATM_Tables]
WHERE (LINK <> 1) AND (fld_ATM = ?) ;
""", connection, params=(machineid,))
You need to add machineid to query using params.
# ? is the placeholder style used by pyodbc. Some use %s, for example.
query = """ SELECT fld_ATM FROM [003_tbl_ATM_Tables]
WHERE (LINK <> 1) AND (fld_ATM = ?) ;
"""
data_df = pd.read_sql_query(query, engine, params=(machineid, ))
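To see why the original function returned nothing, here is a self-contained sketch with the standard-library sqlite3 module, which also uses the ? placeholder style. The table atm and its rows are invented for the illustration, and summary_rows is a hypothetical stand-in for the function in the question. The literal 'machineid' is compared as text, so it matches no row, whereas the placeholder receives the actual value of the Python variable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE atm (fld_ATM TEXT, LINK INTEGER)")
conn.executemany(
    "INSERT INTO atm VALUES (?, ?)",
    [("A100", 0), ("B200", 0), ("A100", 1)],
)


def summary_rows(connection, machineid):
    """Filter on fld_ATM using a bound parameter instead of a literal."""
    query = "SELECT fld_ATM FROM atm WHERE (LINK <> 1) AND (fld_ATM = ?)"
    return connection.execute(query, (machineid,)).fetchall()


# The literal text 'machineid' is not the variable, so nothing matches.
hardcoded_name = conn.execute(
    "SELECT fld_ATM FROM atm WHERE (LINK <> 1) AND (fld_ATM = 'machineid')"
).fetchall()

print(summary_rows(conn, "A100"), hardcoded_name)
```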

DBAPI syntax with pd.read_sql_query() call

I want to read all of the tables contained in a database into pandas data frames. This answer does what I want to accomplish, but I'd like to use the DBAPI syntax with the ? instead of the %s, per the documentation. However, I ran into an error. I thought this answer may address the problem, but I'm now posting my own question because I can't figure it out.
Minimal example
import pandas as pd
import sqlite3
pd.__version__ # 0.19.1
sqlite3.version # 2.6.0
excon = sqlite3.connect('example.db')
c = excon.cursor()
c.execute('''CREATE TABLE stocks
(date text, trans text, symbol text, qty real, price real)''')
c.execute("INSERT INTO stocks VALUES ('2006-01-05', 'BUY', 'RHAT', 100, 35.14)")
c.execute('''CREATE TABLE bonds
(date text, trans text, symbol text, qty real, price real)''')
c.execute("INSERT INTO bonds VALUES ('2015-01-01', 'BUY', 'RSOCK', 90, 23.11)")
data = pd.read_sql_query('SELECT * FROM stocks', excon)
# >>> data
# date trans symbol qty price
# 0 2006-01-05 BUY RHAT 100.0 35.14
But when I include a ? or a (?) as below, I get the error message pandas.io.sql.DatabaseError: Execution failed on sql 'SELECT * FROM (?)': near "?": syntax error.
Problem code
c.execute("SELECT name FROM sqlite_master WHERE type='table';")
tables = c.fetchall()
# >>> tables
# [('stocks',), ('bonds',)]
table = tables[0]
data = pd.read_sql_query("SELECT * FROM ?", excon, params=table)
It's probably something trivial that I'm missing, but I'm not seeing it!
The problem is that you're trying to use parameter substitution for a table name, which is not possible. There's an issue on GitHub that discusses this. The relevant part is at the very end of the thread, in a comment by @jorisvandenbossche:
Parameter substitution is not possible for the table name AFAIK.
The thing is, in sql there is often a difference between string
quoting, and variable quoting (see eg
https://sqlite.org/lang_keywords.html the difference in quoting
between string and identifier). So you are filling in a string, which
is for sql something else as a variable name (in this case a table
name).
Parameter substitution is essential to prevent SQL Injection from unsafe user-entered values.
In this particular example you are sourcing table names directly from the database's own metadata, which is already safe, so it's OK to just use normal string formatting to construct the query, but still good to wrap the table names in quotes.
If you are sourcing user-entered table names, you can also parameterize them first before using them in your normal python string formatting.
e.g.
# assume this is user-entered:
table = '; select * from members; DROP members --'
c.execute("SELECT name FROM sqlite_master WHERE type='table' and name = ?;", (table,))
tables = c.fetchall()
In this case the user has entered some malicious input intended to cause havoc, and the parameterized query will cleanse it and the query will return no rows.
If the user entered a clean table e.g. table = 'stocks' then the above query would return that same name back to you, through the wash, and it is now safe.
Then it is fine to continue with normal python string formatting, in this case using f-string style:
table = tables[0][0]  # fetchall() returns one-element tuples such as ('stocks',)
data = pd.read_sql_query(f"""SELECT * FROM "{table}" ;""", excon)
Referring back to your original example, my first step above is entirely unnecessary. I just provided it for context. It is unnecessary, because there is no user input so you could just do something like this to get a dictionary of dataframes for every table.
c.execute("SELECT name FROM sqlite_master WHERE type='table';")
tables = c.fetchall()
# >>> tables
# [('stocks',), ('bonds',)]
dfs = dict()
for (t,) in tables:  # unpack each one-element tuple to get the bare table name
    dfs[t] = pd.read_sql_query(f"""SELECT * FROM "{t}" ;""", excon)
Then you can fetch the dataframe from the dictionary using the tablename as the key.
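The "look the name up first, then format it into the query" idea can be wrapped in one function. This is a sketch under stated assumptions rather than an established API: fetch_table is a hypothetical helper, it uses plain sqlite3 cursors instead of pandas, and it doubles any embedded double quote as an extra precaution when building the quoted identifier.

```python
import sqlite3


def fetch_table(conn, requested_name):
    """Return all rows of a table, but only if the name really exists."""
    # The requested name is passed as a bound parameter, never formatted in.
    row = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' AND name = ?",
        (requested_name,),
    ).fetchone()
    if row is None:
        raise ValueError(f"unknown table: {requested_name!r}")
    # The name now comes from the database's own metadata; escape quotes anyway.
    safe_name = row[0].replace('"', '""')
    return conn.execute(f'SELECT * FROM "{safe_name}"').fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stocks (date text, trans text, symbol text, qty real, price real)")
conn.execute("INSERT INTO stocks VALUES ('2006-01-05', 'BUY', 'RHAT', 100, 35.14)")

rows = fetch_table(conn, "stocks")
print(rows)
```

A malicious value such as '; select * from members; DROP members --' matches no entry in sqlite_master, so the function raises ValueError before any formatted query is built.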

Pandas to_sql fails on duplicate primary key

I'd like to append to an existing table, using pandas df.to_sql() function.
I set if_exists='append', but my table has primary keys.
I'd like to do the equivalent of insert ignore when trying to append to the existing table, so I would avoid a duplicate entry error.
Is this possible with pandas, or do I need to write an explicit query?
There is unfortunately no option to specify "INSERT IGNORE". This is how I got around that limitation to insert rows into that database that were not duplicates (dataframe name is df)
from sqlalchemy.exc import IntegrityError
for i in range(len(df)):
    try:
        df.iloc[i:i+1].to_sql(name="Table_Name", if_exists='append', con=Engine)
    except IntegrityError:
        pass  # or any other action
You can do this with the method parameter of to_sql:
from sqlalchemy.dialects.mysql import insert
def insert_on_duplicate(table, conn, keys, data_iter):
    insert_stmt = insert(table.table).values(list(data_iter))
    on_duplicate_key_stmt = insert_stmt.on_duplicate_key_update(insert_stmt.inserted)
    conn.execute(on_duplicate_key_stmt)
df.to_sql('trades', dbConnection, if_exists='append', chunksize=4096, method=insert_on_duplicate)
For older versions of sqlalchemy, you need to pass a dict to on_duplicate_key_update, i.e. on_duplicate_key_stmt = insert_stmt.on_duplicate_key_update(dict(insert_stmt.inserted))
Please note that if_exists='append' concerns the existence of the table and what to do if the table does not exist.
if_exists is not related to the content of the table.
see the doc here: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_sql.html
if_exists : {‘fail’, ‘replace’, ‘append’}, default ‘fail’
fail: If table exists, do nothing.
replace: If table exists, drop it, recreate it, and insert data.
append: If table exists, insert data. Create if does not exist.
Pandas has no option for it currently, but here is the Github issue. If you need this feature too, just upvote for it.
The for loop method above slows things down significantly. There's a method parameter you can pass to pandas' to_sql to customize the SQL that is executed:
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_sql.html#pandas.DataFrame.to_sql
The code below should work for Postgres and does nothing if there's a conflict with the primary key "unique_code". Change the insert dialect to match your database.
def insert_do_nothing_on_conflicts(sqltable, conn, keys, data_iter):
    """
    Execute SQL statement inserting data
    Parameters
    ----------
    sqltable : pandas.io.sql.SQLTable
    conn : sqlalchemy.engine.Engine or sqlalchemy.engine.Connection
    keys : list of str
        Column names
    data_iter : Iterable that iterates the values to be inserted
    """
    from sqlalchemy.dialects.postgresql import insert
    from sqlalchemy import table, column
    columns = []
    for c in keys:
        columns.append(column(c))
    if sqltable.schema:
        table_name = '{}.{}'.format(sqltable.schema, sqltable.name)
    else:
        table_name = sqltable.name
    mytable = table(table_name, *columns)
    insert_stmt = insert(mytable).values(list(data_iter))
    do_nothing_stmt = insert_stmt.on_conflict_do_nothing(index_elements=['unique_code'])
    conn.execute(do_nothing_stmt)
# to_sql is a DataFrame method, so call it on the frame you want to write
df.to_sql('mytable', con=sql_engine, method=insert_do_nothing_on_conflicts)
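The "skip rows that collide with the primary key" behaviour is easy to try out without a database server. The sketch below is an analogue in SQLite using the standard-library sqlite3 module and SQLite's own INSERT OR IGNORE syntax, not the PostgreSQL or MySQL statements from the answers above; the trades table and its rows are made up for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trades (unique_code TEXT PRIMARY KEY, price REAL)")
conn.execute("INSERT INTO trades VALUES ('A1', 10.0)")

# 'A1' already exists; a plain INSERT would raise sqlite3.IntegrityError.
new_rows = [("A1", 99.0), ("B2", 20.0), ("C3", 30.0)]
conn.executemany("INSERT OR IGNORE INTO trades VALUES (?, ?)", new_rows)
conn.commit()

stored = conn.execute(
    "SELECT unique_code, price FROM trades ORDER BY unique_code"
).fetchall()
print(stored)
```

The existing row for 'A1' keeps its original price, and only the two new codes are added.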
Pandas doesn't support editing the actual SQL syntax of the .to_sql method, so you might be out of luck. There's some experimental programmatic workarounds (say, read the Dataframe to a SQLAlchemy object with CALCHIPAN and use SQLAlchemy for the transaction), but you may be better served by writing your DataFrame to a CSV and loading it with an explicit MySQL function.
CALCHIPAN repo: https://bitbucket.org/zzzeek/calchipan/
I had trouble where I was still getting the IntegrityError
...strange but I just took the above and worked it backwards:
for i, row in df.iterrows():
    sql = "SELECT * FROM `Table_Name` WHERE `key` = '{}'".format(row.Key)
    found = pd.read_sql(sql, con=Engine)
    if len(found) == 0:
        df.iloc[i:i+1].to_sql(name="Table_Name", if_exists='append', con=Engine)
In my case, I was trying to insert new data into an empty table, but some of the rows were duplicated, which is almost the same issue as here. I thought about fetching the existing data, merging it with the new data, and continuing from there, but this is not optimal and may only work for small data, not for huge tables.
As pandas does not provide any handling for this situation right now, I looked for a suitable workaround and made my own. I am not sure whether it will work for you, but I decided to control my data first instead of hoping the insert would succeed: I remove duplicates before calling .to_sql, so if any error happens, I know more about my data and what is going on:
import pandas as pd
def write_to_table(table_name, data):
    # Sort by price first, so that dropping duplicates keeps only the lowest price
    data.sort(key=lambda row: row['price'])
    df = pd.DataFrame(data)
    df.drop_duplicates(subset=['id_key'], keep='first', inplace=True)
    # engine is assumed to be an existing SQLAlchemy engine
    df.to_sql(table_name, engine, index=False, if_exists='append', schema='public')
So in my case, I wanted to keep the lowest price of rows (btw I was passing an array of dict for data), and for that, I did sorting first, not necessary but this is an example of what I mean with control the data that I want to keep.
I hope this will help someone who got almost the same as my situation.
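The "keep only the cheapest row per key" step does not depend on pandas, so it can be checked in isolation. Below is a minimal sketch in plain Python; keep_lowest_price is a hypothetical helper, and the field names id_key and price are taken from the answer above. It assumes the input is a list of dicts, as described there.

```python
def keep_lowest_price(rows, key="id_key", price="price"):
    """Return one row per key value, choosing the row with the lowest price."""
    best = {}
    for row in rows:
        current = best.get(row[key])
        # Keep the first row seen for a key, or replace it with a cheaper one.
        if current is None or row[price] < current[price]:
            best[row[key]] = row
    return list(best.values())


data = [
    {"id_key": 1, "price": 5},
    {"id_key": 1, "price": 3},
    {"id_key": 2, "price": 7},
]
deduplicated = keep_lowest_price(data)
print(deduplicated)
```

The deduplicated list can then be turned into a DataFrame and written with to_sql as in the answer.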
When you use SQL Server you'll get a SQL error when you enter a duplicate value into a table that has a primary key constraint. You can fix it by altering your table:
CREATE TABLE [dbo].[DeleteMe](
[id] [uniqueidentifier] NOT NULL,
[Value] [varchar](max) NULL,
CONSTRAINT [PK_DeleteMe]
PRIMARY KEY ([id] ASC)
WITH (IGNORE_DUP_KEY = ON)); -- add this option
Taken from https://dba.stackexchange.com/a/111771.
Now your df.to_sql() should work again.
The solutions by Jayen and Huy Tran helped me a lot, but they didn't work straight out of the box. The problem I faced with Jayen's code is that it requires the DataFrame columns to match those of the database table exactly. This was not true in my case, as there were some DataFrame columns that I did not want to write to the database.
I modified the solution so that it considers the column names.
from sqlalchemy.dialects.mysql import insert
import itertools
def insertWithConflicts(sqltable, conn, keys, data_iter):
    """
    Execute SQL statement inserting data, whilst taking care of conflicts
    Used to handle duplicate key errors during database population
    This is my modification of the code snippet
    from https://stackoverflow.com/questions/30337394/pandas-to-sql-fails-on-duplicate-primary-key
    The help page from https://docs.sqlalchemy.org/en/14/core/dml.html#sqlalchemy.sql.expression.Insert.values
    proved useful.
    Parameters
    ----------
    sqltable : pandas.io.sql.SQLTable
    conn : sqlalchemy.engine.Engine or sqlalchemy.engine.Connection
    keys : list of str
        Column names
    data_iter : Iterable that iterates the values to be inserted. It is a zip object.
        The length of it is equal to the chunk size passed in df.to_sql()
    """
    # Pair every row with the column names so each row becomes a {column: value} dict
    vals = [dict(zip(z[0], z[1])) for z in zip(itertools.cycle([keys]), data_iter)]
    insertStmt = insert(sqltable.table).values(vals)
    doNothingStmt = insertStmt.on_duplicate_key_update(dict(insertStmt.inserted))
    conn.execute(doNothingStmt)
I faced the same issue and I adopted the solution provided by @Huy Tran for a while, until my tables started to have schemas.
I had to improve his answer a bit and this is the final result:
from sqlalchemy import table, column
from sqlalchemy.dialects.postgresql import insert
def do_nothing_on_conflicts(sql_table, conn, keys, data_iter):
    """
    Execute SQL statement inserting data
    Parameters
    ----------
    sql_table : pandas.io.sql.SQLTable
    conn : sqlalchemy.engine.Engine or sqlalchemy.engine.Connection
    keys : list of str
        Column names
    data_iter : Iterable that iterates the values to be inserted
    """
    columns = []
    for c in keys:
        columns.append(column(c))
    if sql_table.schema:
        # Pass the schema separately so it is not quoted as part of the table name
        my_table = table(sql_table.name, *columns, schema=sql_table.schema)
        # table_name = '{}.{}'.format(sql_table.schema, sql_table.name)
    else:
        my_table = table(sql_table.name, *columns)
        # table_name = sql_table.name
    # my_table = table(table_name, *columns)
    insert_stmt = insert(my_table).values(list(data_iter))
    do_nothing_stmt = insert_stmt.on_conflict_do_nothing()
    conn.execute(do_nothing_stmt)
How to use it:
history.to_sql('history', schema=schema, con=engine, method=do_nothing_on_conflicts)
The idea is the same as @Nfern's, but it uses a recursive function that splits the df in half on each failure in order to skip the row or rows causing the integrity violation.
def insert(df):
    try:
        # inserting into backup table
        df.to_sql("table", con=engine, if_exists='append', index=False, schema='schema')
    except:
        rows = df.shape[0]
        if rows > 1:
            df1 = df.iloc[:int(rows/2), :]
            df2 = df.iloc[int(rows/2):, :]
            insert(df1)
            insert(df2)
        else:
            print(f"{df} not inserted. Integrity violation, duplicate primary key/s")
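The halving strategy can be exercised end to end with the standard library. This is a sketch of the same idea under different assumptions from the answer: it uses sqlite3 and lists of tuples instead of pandas and to_sql, catches only sqlite3.IntegrityError, and relies on the connection's context manager to roll back a failed batch before retrying its halves. The function name insert_skipping_duplicates and the items table are hypothetical.

```python
import sqlite3


def insert_skipping_duplicates(conn, rows):
    """Insert rows in one batch; on a key violation, split in half and retry."""
    if not rows:
        return 0
    try:
        with conn:  # commits on success, rolls back the whole batch on error
            conn.executemany("INSERT INTO items (id, value) VALUES (?, ?)", rows)
        return len(rows)
    except sqlite3.IntegrityError:
        if len(rows) == 1:
            return 0  # this single row is the duplicate, so skip it
        mid = len(rows) // 2
        return (insert_skipping_duplicates(conn, rows[:mid])
                + insert_skipping_duplicates(conn, rows[mid:]))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, value TEXT)")
conn.execute("INSERT INTO items VALUES (2, 'existing')")
conn.commit()

inserted = insert_skipping_duplicates(conn, [(1, "a"), (2, "b"), (3, "c"), (4, "d")])
ids = [r[0] for r in conn.execute("SELECT id FROM items ORDER BY id")]
print(inserted, ids)
```

Batches without conflicts go in with a single statement, and only the batches that contain a duplicate are split further, which is why this tends to be cheaper than inserting row by row.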
