Python Pandas to_sql, how to create a table with a primary key? - python

I would like to create a MySQL table with Pandas' to_sql function which has a primary key (it is usually good to have a primary key in a MySQL table), like so:
group_export.to_sql(con = db, name = config.table_group_export, if_exists = 'replace', flavor = 'mysql', index = False)
but this creates a table without any primary key (or even without any index).
The documentation mentions the parameter 'index_label' which combined with the 'index' parameter could be used to create an index but doesn't mention any option for primary keys.
Documentation

Simply add the primary key after uploading the table with pandas.
group_export.to_sql(con=engine, name=example_table, if_exists='replace',
                    flavor='mysql', index=False)

with engine.connect() as con:
    con.execute('ALTER TABLE `example_table` ADD PRIMARY KEY (`ID_column`);')

Disclaimer: this answer is more experimental than practical, but maybe worth mentioning.
I found that the class pandas.io.sql.SQLTable has a named argument keys, and if you assign it the name of a field then that field becomes the primary key:
Unfortunately you can't just pass this argument from the DataFrame.to_sql() function. To use it you should:
create a pandas.io.sql.SQLDatabase instance
engine = sa.create_engine('postgresql:///somedb')
pandas_sql = pd.io.sql.pandasSQL_builder(engine, schema=None, flavor=None)
define a function analogous to pandas.io.sql.SQLDatabase.to_sql() but with an additional **kwargs argument which is passed to the pandas.io.sql.SQLTable object created inside it (I've just copied the original to_sql() method and added **kwargs):
def to_sql_k(self, frame, name, if_exists='fail', index=True,
             index_label=None, schema=None, chunksize=None, dtype=None, **kwargs):
    if dtype is not None:
        from sqlalchemy.types import to_instance, TypeEngine
        for col, my_type in dtype.items():
            if not isinstance(to_instance(my_type), TypeEngine):
                raise ValueError('The type of %s is not a SQLAlchemy '
                                 'type ' % col)

    table = pd.io.sql.SQLTable(name, self, frame=frame, index=index,
                               if_exists=if_exists, index_label=index_label,
                               schema=schema, dtype=dtype, **kwargs)
    table.create()
    table.insert(chunksize)
call this function with your SQLDatabase instance and the dataframe you want to save
to_sql_k(pandas_sql, df2save, 'tmp',
         index=True, index_label='id', keys='id', if_exists='replace')
And we get something like
CREATE TABLE public.tmp
(
    id bigint NOT NULL DEFAULT nextval('tmp_id_seq'::regclass),
    ...
)
in the database.
P.S. You can of course monkey-patch DataFrame, io.SQLDatabase and io.to_sql() to use this workaround more conveniently.
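For example, a minimal monkey-patching sketch (the wrapper name to_sql_with_key is hypothetical; it assumes the to_sql_k function defined above and a pandas version where pandas.io.sql.pandasSQL_builder still exposes this interface):

import pandas as pd
import sqlalchemy as sa

# Hypothetical convenience wrapper: builds the SQLDatabase for the caller and
# forwards everything (including keys=...) to the to_sql_k defined above.
def to_sql_with_key(frame, name, engine, **kwargs):
    pandas_sql = pd.io.sql.pandasSQL_builder(engine)
    to_sql_k(pandas_sql, frame, name, **kwargs)

# Attach it to DataFrame so it can be called like the built-in to_sql.
pd.DataFrame.to_sql_with_key = to_sql_with_key

# Usage (assumes df2save exists):
# engine = sa.create_engine('postgresql:///somedb')
# df2save.to_sql_with_key('tmp', engine, index=True, index_label='id',
#                         keys='id', if_exists='replace')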

As of pandas 0.15, at least for some flavors, you can use argument dtype to define a primary key column. You can even activate AUTOINCREMENT this way. For sqlite3, this would look like so:
import sqlite3
import pandas as pd
df = pd.DataFrame({'MyID': [1, 2, 3], 'Data': [3, 2, 6]})
with sqlite3.connect('foo.db') as con:
    df.to_sql('df', con=con, dtype={'MyID': 'INTEGER PRIMARY KEY AUTOINCREMENT'})

with engine.connect() as con:
    con.execute('ALTER TABLE for_import_ml ADD PRIMARY KEY ("ID");')
for_import_ml is a table name in the database.
Adding a slight variation to tomp's answer (I would comment but don't have enough reputation points).
I am using PGAdmin with Postgres (on Heroku) to check and it works.

I use automap_base from sqlalchemy.ext.automap (tableNamesDict is a dict with only the Pandas tables):
metadata = MetaData()
metadata.reflect(db.engine, only=tableNamesDict.values())
Base = automap_base(metadata=metadata)
Base.prepare()
Which would have worked perfectly, except for one problem: automap requires the tables to have a primary key. Ok, no problem, I'm sure Pandas to_sql has a way to indicate the primary key... nope. This is where it gets a little hacky:
for df in dfs.keys():
    cols = dfs[df].columns
    cols = [str(col) for col in cols if 'id' in col.lower()]
    schema = pd.io.sql.get_schema(dfs[df], df, con=db.engine, keys=cols)
    db.engine.execute('DROP TABLE ' + df + ';')
    db.engine.execute(schema)
    dfs[df].to_sql(df, con=db.engine, index=False, if_exists='append')
I iterate through the dict of DataFrames, get a list of the columns to use for the primary key (i.e. those containing id), use get_schema to create the empty tables, then append the DataFrame to the table.
Now that you have the models, you can explicitly name and use them (i.e. User = Base.classes.user) with session.query or create a dict of all the classes with something like this:
alchemyClassDict = {}
for t in Base.classes.keys():
    alchemyClassDict[t] = Base.classes[t]
And query with:
res = db.session.query(alchemyClassDict['user']).first()

Related

Bulk Upsert with SQLAlchemy Postgres

I'm following the SQLAlchemy documentation here to write a bulk upsert statement with Postgres. For demonstration purposes, I have a simple table MyTable:
class MyTable(base):
    __tablename__ = 'mytable'
    id = Column(types.Integer, primary_key=True)
    test_value = Column(types.Text)
Creating a generic insert statement is simple enough:
from sqlalchemy.dialects import postgresql
values = [{'id': 0, 'test_value': 'a'}, {'id': 1, 'test_value': 'b'}]
insert_stmt = postgresql.insert(MyTable.__table__).values(values)
The problem I run into is when I try to add the "on conflict" part of the upsert.
update_stmt = insert_stmt.on_conflict_do_update(
    index_elements=[MyTable.id],
    set_=dict(data=values)
)
Trying to execute this statement yields a ProgrammingError:
from sqlalchemy import create_engine
engine = create_engine('postgres://localhost/db_name')
engine.execute(update_stmt)
>>> ProgrammingError: (psycopg2.ProgrammingError) can't adapt type 'dict'
I think my misunderstanding is in constructing the statement with the on_conflict_do_update method. Does anyone know how to construct this statement? I have looked at other questions on StackOverflow (eg. here) but I can't seem to find a way to address the above error.
update_stmt = insert_stmt.on_conflict_do_update(
    index_elements=[MyTable.id],
    set_=dict(data=values)
)
index_elements should either be a list of strings or a list of column objects. So either [MyTable.id] or ['id'] (This is correct)
set_ should be a dictionary with column names as keys and valid SQL update objects as values. You can reference values from the insert block using the excluded attribute. So to get the result you are hoping for here you would want set_={'test_value': insert_stmt.excluded.test_value} (the error you made is that data= in the example isn't a magic argument; it was the name of the column in their example table).
So, the whole thing would be
update_stmt = insert_stmt.on_conflict_do_update(
    index_elements=[MyTable.id],
    set_={'test_value': insert_stmt.excluded.test_value}
)
Of course, in a real world example I usually want to change more than one column. In that case I would do something like...
update_columns = {col.name: col for col in insert_stmt.excluded if col.name not in ('id', 'datetime_created')}
update_statement = insert_stmt.on_conflict_do_update(index_elements=['id'], set_=update_columns)
(This example would overwrite every column except for the id and datetime_created columns)

How can I get column name and type from an existing table in SQLAlchemy?

Suppose I have the table users and I want to know what the column names are and what the types are for each column.
I connect like this;
connectstring = ('mssql+pyodbc:///?odbc_connect=DRIVER%3D%7BSQL'
                 '+Server%7D%3B+server%3D.....')
engine = sqlalchemy.create_engine(connectstring).connect()
md = sqlalchemy.MetaData()
table = sqlalchemy.Table('users', md, autoload=True, autoload_with=engine)
columns = table.c
If I call
for c in columns:
    print type(columns)
I get the output
<class 'sqlalchemy.sql.base.ImmutableColumnCollection'>
printed once for each column in the table.
Furthermore,
print columns
prints
['users.column_name_1', 'users.column_name_2', 'users.column_name_3'....]
Is it possible to get the column names without the table name being included?
columns have name and type attributes
for c in columns:
    print c.name, c.type
It is better to use inspect to obtain only the column information; that way you do not have to reflect the table.
import sqlalchemy
from sqlalchemy import inspect

engine = sqlalchemy.create_engine(<url>)
insp = inspect(engine)
columns_table = insp.get_columns(<table_name>, <schema>)  # schema is optional

for c in columns_table:
    print(c['name'], c['type'])
You could query the column names from the Information Schema:
SELECT *
FROM information_schema.columns
WHERE TABLE_NAME = 'users'
Then sort the result as an array, and loop through them.
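A minimal sketch of that approach (reusing the connectstring from the question; names and sorting are just illustrative):

import sqlalchemy
from sqlalchemy import text

engine = sqlalchemy.create_engine(connectstring)  # connectstring as defined in the question

query = text(
    "SELECT COLUMN_NAME, DATA_TYPE "
    "FROM information_schema.columns "
    "WHERE TABLE_NAME = 'users'"
)

with engine.connect() as conn:
    rows = conn.execute(query).fetchall()

# Sort the (name, type) pairs and loop through them.
for column_name, data_type in sorted((row[0], row[1]) for row in rows):
    print(column_name, data_type)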

Python Pandas to_sql removes all table indices when writing to table

I have the following code which reads a MySQL SELECT command formed from left joining many tables together. I then want to write the result to another table. When I do that (with Pandas), it works properly and the data is added to the table, but it somehow destroys all indices from the table, including the primary key.
Here is the code:
q = "SELECT util.peer_id as peer_id, util.date as ts, weekly_total_page_loads as page_loads FROM %s.%s as util LEFT JOIN \
     (SELECT peer_id, date, score FROM %s.%s WHERE date = '%s') as scores \
     ON util.peer_id = scores.peer_id AND util.date = scores.date WHERE util.date = '%s';" \
     % (config.database_peer_groups, config.table_medians, \
        config.database_peer_groups, config.db_score, date, date)

group_export = pd.read_sql(q, con=db)

q = 'USE %s;' % (config.database_export)
cursor.execute(q)

group_export.to_sql(con=db, name=config.table_group_export, if_exists='replace', flavor='mysql', index=False)

db.commit()
Any Ideas?
Edit:
It seems that, by using if_exists='replace', Pandas drops the table and recreates it, and when it recreates it, it doesn't rebuild the indices.
Furthermore, this question: to_sql pandas method changes the scheme of sqlite tables
suggests that using a SQLAlchemy engine might solve the problem.
Edit:
When I use if_exists="append" the problem doesn't appear; it occurs only with if_exists="replace". A sketch of the append-based workaround follows below.
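A rough sketch of that workaround, assuming a SQLAlchemy engine and the names used in the code above (the connection URL is a placeholder): empty the existing table and append to it, so the table definition with its primary key and indices is never dropped.

from sqlalchemy import create_engine, text

# Placeholder URL; use your own MySQL credentials and driver.
engine = create_engine('mysql+pymysql://user:password@localhost/%s' % config.database_export)

with engine.begin() as conn:
    # Keep the table (and therefore its primary key and indices) intact:
    # clear the rows instead of letting pandas drop and recreate the table.
    conn.execute(text('TRUNCATE TABLE %s;' % config.table_group_export))

group_export.to_sql(con=engine, name=config.table_group_export,
                    if_exists='append', index=False)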

Pandas to_sql fails on duplicate primary key

I'd like to append to an existing table, using pandas df.to_sql() function.
I set if_exists='append', but my table has primary keys.
I'd like to do the equivalent of insert ignore when trying to append to the existing table, so I would avoid a duplicate entry error.
Is this possible with pandas, or do I need to write an explicit query?
There is unfortunately no option to specify "INSERT IGNORE". This is how I got around that limitation to insert only the rows that were not duplicates into the database (the dataframe is named df):
from sqlalchemy.exc import IntegrityError  # raised on duplicate-key inserts

for i in range(len(df)):
    try:
        df.iloc[i:i+1].to_sql(name="Table_Name", if_exists='append', con=Engine)
    except IntegrityError:
        pass  # or any other action
You can do this with the method parameter of to_sql:
from sqlalchemy.dialects.mysql import insert

def insert_on_duplicate(table, conn, keys, data_iter):
    insert_stmt = insert(table.table).values(list(data_iter))
    on_duplicate_key_stmt = insert_stmt.on_duplicate_key_update(insert_stmt.inserted)
    conn.execute(on_duplicate_key_stmt)
df.to_sql('trades', dbConnection, if_exists='append', chunksize=4096, method=insert_on_duplicate)
For older versions of SQLAlchemy, you need to pass a dict to on_duplicate_key_update, i.e. on_duplicate_key_stmt = insert_stmt.on_duplicate_key_update(dict(insert_stmt.inserted)).
Please note that if_exists='append' relates to whether the table exists and what to do if it doesn't; if_exists says nothing about the contents of the table.
see the doc here: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_sql.html
if_exists : {‘fail’, ‘replace’, ‘append’}, default ‘fail’
fail: If table exists, do nothing.
replace: If table exists, drop it, recreate it, and insert data.
append: If table exists, insert data. Create if does not exist.
Pandas has no option for it currently, but here is the Github issue. If you need this feature too, just upvote for it.
The for-loop method above slows things down significantly. There's a method parameter you can pass to pandas.DataFrame.to_sql to customize the insert statement:
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_sql.html#pandas.DataFrame.to_sql
The code below should work for Postgres and do nothing if there's a conflict with the primary key "unique_code". Change the insert dialect for your db.
def insert_do_nothing_on_conflicts(sqltable, conn, keys, data_iter):
    """
    Execute SQL statement inserting data

    Parameters
    ----------
    sqltable : pandas.io.sql.SQLTable
    conn : sqlalchemy.engine.Engine or sqlalchemy.engine.Connection
    keys : list of str
        Column names
    data_iter : Iterable that iterates the values to be inserted
    """
    from sqlalchemy.dialects.postgresql import insert
    from sqlalchemy import table, column

    columns = []
    for c in keys:
        columns.append(column(c))

    if sqltable.schema:
        table_name = '{}.{}'.format(sqltable.schema, sqltable.name)
    else:
        table_name = sqltable.name

    mytable = table(table_name, *columns)

    insert_stmt = insert(mytable).values(list(data_iter))
    do_nothing_stmt = insert_stmt.on_conflict_do_nothing(index_elements=['unique_code'])

    conn.execute(do_nothing_stmt)

df.to_sql('mytable', con=sql_engine, method=insert_do_nothing_on_conflicts)
Pandas doesn't support editing the actual SQL syntax of the .to_sql method, so you might be out of luck. There are some experimental programmatic workarounds (say, read the DataFrame into a SQLAlchemy object with CALCHIPAN and use SQLAlchemy for the transaction), but you may be better served by writing your DataFrame to a CSV and loading it with an explicit MySQL function.
CALCHIPAN repo: https://bitbucket.org/zzzeek/calchipan/
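For example, a rough sketch of the CSV route for MySQL (the table name, file path and connection URL are placeholders; LOAD DATA ... IGNORE skips rows that would violate the primary key, and LOCAL INFILE must be enabled on both client and server):

from sqlalchemy import create_engine, text

engine = create_engine('mysql+pymysql://user:password@localhost/db_name')  # placeholder URL

# Dump the DataFrame without index or header, matching the table's column order.
df.to_csv('/tmp/my_table.csv', index=False, header=False)

load_stmt = text(
    "LOAD DATA LOCAL INFILE '/tmp/my_table.csv' "
    "IGNORE INTO TABLE my_table "
    "FIELDS TERMINATED BY ',' "
    "LINES TERMINATED BY '\\n'"
)

with engine.begin() as conn:
    conn.execute(load_stmt)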
I had trouble where I was still getting the IntegrityError. Strange, but I just took the above and worked it backwards:
for i, row in df.iterrows():
    sql = "SELECT * FROM `Table_Name` WHERE `key` = '{}'".format(row.Key)
    found = pd.read_sql(sql, con=Engine)
    if len(found) == 0:
        df.iloc[i:i+1].to_sql(name="Table_Name", if_exists='append', con=Engine)
In my case, I was trying to insert new data into an empty table, but some of the rows were duplicated, which is almost the same issue as here. I thought about fetching the existing data, merging it with the new data and continuing from there, but that is not optimal and may only work for small data, not for huge tables.
As pandas does not provide any handling for this situation right now, I looked for a suitable workaround and made my own. I'm not sure whether it will work for you, but I decided to control my data first instead of relying on luck: I remove duplicates before calling .to_sql, so if any error happens I know more about my data and what is going on:
import pandas as pd

def write_to_table(table_name, data):
    # Sort by price first, so drop_duplicates below keeps the lowest price only
    data.sort(key=lambda row: row['price'])
    df = pd.DataFrame(data)
    df.drop_duplicates(subset=['id_key'], keep='first', inplace=True)
    df.to_sql(table_name, engine, index=False, if_exists='append', schema='public')
So in my case, I wanted to keep the lowest-priced row (by the way, I was passing an array of dicts as data), and for that I sorted first. That is not strictly necessary, but it is an example of what I mean by controlling the data I want to keep.
I hope this helps someone in almost the same situation as mine.
When you use SQL Server you'll get a SQL error when you insert a duplicate value into a table that has a primary key constraint. You can fix it by creating (or altering) your table with IGNORE_DUP_KEY:
CREATE TABLE [dbo].[DeleteMe](
    [id] [uniqueidentifier] NOT NULL,
    [Value] [varchar](max) NULL,
    CONSTRAINT [PK_DeleteMe]
        PRIMARY KEY ([id] ASC)
        WITH (IGNORE_DUP_KEY = ON));  -- <-- add
Taken from https://dba.stackexchange.com/a/111771.
Now your df.to_sql() should work again.
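If the table already exists, a sketch of the same idea applied after the fact (assuming the table and constraint names from the example above and an existing SQLAlchemy engine for the SQL Server database) could be:

from sqlalchemy import text

# `engine` is assumed to be a SQLAlchemy engine pointing at the SQL Server database.
with engine.begin() as conn:
    # Drop the existing primary key constraint and re-create it with
    # IGNORE_DUP_KEY, so duplicate inserts are silently skipped.
    conn.execute(text("ALTER TABLE dbo.DeleteMe DROP CONSTRAINT PK_DeleteMe;"))
    conn.execute(text(
        "ALTER TABLE dbo.DeleteMe ADD CONSTRAINT PK_DeleteMe "
        "PRIMARY KEY (id ASC) WITH (IGNORE_DUP_KEY = ON);"
    ))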
The solutions by Jayen and Huy Tran helped me a lot, but they didn't work straight out of the box. The problem I faced with Jayen's code is that it requires the DataFrame columns to be exactly those of the database. This was not true in my case, as there were some DataFrame columns that I don't write to the database.
I modified the solution so that it considers the column names.
from sqlalchemy.dialects.mysql import insert
import itertools

def insertWithConflicts(sqltable, conn, keys, data_iter):
    """
    Execute SQL statement inserting data, whilst taking care of conflicts.
    Used to handle duplicate key errors during database population.

    This is my modification of the code snippet from
    https://stackoverflow.com/questions/30337394/pandas-to-sql-fails-on-duplicate-primary-key
    The help page at
    https://docs.sqlalchemy.org/en/14/core/dml.html#sqlalchemy.sql.expression.Insert.values
    proved useful.

    Parameters
    ----------
    sqltable : pandas.io.sql.SQLTable
    conn : sqlalchemy.engine.Engine or sqlalchemy.engine.Connection
    keys : list of str
        Column names
    data_iter : Iterable that iterates the values to be inserted. It is a zip object.
        Its length is equal to the chunksize passed in df.to_sql()
    """
    vals = [dict(zip(z[0], z[1])) for z in zip(itertools.cycle([keys]), data_iter)]
    insertStmt = insert(sqltable.table).values(vals)
    doNothingStmt = insertStmt.on_duplicate_key_update(dict(insertStmt.inserted))
    conn.execute(doNothingStmt)
I faced the same issue and adopted the solution provided by @Huy Tran for a while, until my tables started to have schemas.
I had to improve his answer a bit and this is the final result:
from sqlalchemy import table, column
from sqlalchemy.dialects.postgresql import insert

def do_nothing_on_conflicts(sql_table, conn, keys, data_iter):
    """
    Execute SQL statement inserting data

    Parameters
    ----------
    sql_table : pandas.io.sql.SQLTable
    conn : sqlalchemy.engine.Engine or sqlalchemy.engine.Connection
    keys : list of str
        Column names
    data_iter : Iterable that iterates the values to be inserted
    """
    columns = []
    for c in keys:
        columns.append(column(c))

    if sql_table.schema:
        my_table = table(sql_table.name, *columns, schema=sql_table.schema)
        # table_name = '{}.{}'.format(sql_table.schema, sql_table.name)
    else:
        my_table = table(sql_table.name, *columns)
        # table_name = sql_table.name

    # my_table = table(table_name, *columns)

    insert_stmt = insert(my_table).values(list(data_iter))
    do_nothing_stmt = insert_stmt.on_conflict_do_nothing()

    conn.execute(do_nothing_stmt)
How to use it:
history.to_sql('history', schema=schema, con=engine, method=do_nothing_on_conflicts)
The idea is the same as @Nfern's, but uses a recursive function to divide the df in half in each iteration in order to skip the row/rows causing the integrity violation.
def insert(df):
    try:
        # inserting into backup table
        df.to_sql("table", con=engine, if_exists='append', index=False, schema='schema')
    except:
        rows = df.shape[0]
        if rows > 1:
            df1 = df.iloc[:int(rows/2), :]
            df2 = df.iloc[int(rows/2):, :]
            insert(df1)
            insert(df2)
        else:
            print(f"{df} not inserted. Integrity violation, duplicate primary key/s")

change schema for sql file (drop quotes in names)

This question is similar to drastega's question
I have a similar problem; however, I want to get rid of any quoting characters in the names. Here is an example:
CREATE TABLE Resolved (
    [Name] TEXT,
    [Count] INTEGER,
    [Obs_Date] TEXT,
    [Bessel_year] REAL,
    [Filter] TEXT,
    [Comments] TEXT
);
changes to:
CREATE TABLE Resolved (
    Name TEXT,
    Count INTEGER,
    Obs_Date TEXT,
    Bessel_year REAL,
    Filter TEXT,
    Comments TEXT
);
Following the steps from the link above, I have managed to change "[" to quotes. However, I don't want to use any quoting characters. I tried to read the documentation about sqlalchemy's metadata. I know that I need to use the quote=False parameter, but I don't know where to call it. Thank you in advance for your answers.
The code from Joris (in the answer below) worked well in my case by just changing the line c.quote = False to c.name.quote = False,
with pandas 0.23.4, sqlalchemy 1.2.13 and Python 3.6 for a Postgres database.
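In other words, a sketch of this variation (the workaround itself is Joris's, shown in the next answer; only the marked line differs, as reported for those versions):

db = sql.SQLDatabase(engine)
t = sql.SQLTable('test', db, frame=df, if_exists='replace', index=False)
for c in t.table.columns:
    c.name.quote = False   # instead of c.quote = False
t.create()
t.insert()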
It is a bit strange that an sqlite application errors on the quotes (as sqlite should be case insensitive, quotes or not), and I would also be very cautious with some of the special characters you mention in column names. But if you need to insert the data into sqlite with a schema without quotes, you can do the following.
Starting from this:
import pandas as pd
from pandas.io import sql
import sqlalchemy
from sqlalchemy import create_engine
engine = create_engine('sqlite:///:memory:')
df = pd.DataFrame({'Col1': [1, 2], 'Col2': [0.1, 0.2]})
df.to_sql('test', engine, if_exists='replace', index=False)
So by default sqlalchemy uses quotes (because you have capitals, otherwise no quotes would be used):
In [8]: res = engine.execute("SELECT * FROM sqlite_master;").fetchall()

In [9]: print res[0][4]
CREATE TABLE test (
    "Col1" BIGINT,
    "Col2" FLOAT
)
SQLAlchemy has a quote parameter on each Column that you can set to False. To do this combined with the pandas function, we have to use a workaround (as pandas already creates the Columns with the default values):
db = sql.SQLDatabase(engine)
t = sql.SQLTable('test', db, frame=df, if_exists='replace', index=False)
for c in t.table.columns:
    c.quote = False

t.create()
t.insert()
The above is equivalent to the to_sql call, but interacts with the created table object before writing to the database. Now you have no quotes:
In [15]: res = engine.execute("SELECT * FROM sqlite_master;").fetchall()

In [16]: print res[0][4]
CREATE TABLE test (
    Col1 BIGINT,
    Col2 FLOAT
)
You can try to use lower case for both table name and column names. Then SQLAlchemy won't quote the table name and column names.
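For example, a minimal sketch of that approach (reusing the setup from the answer above):

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('sqlite:///:memory:')
df = pd.DataFrame({'Col1': [1, 2], 'Col2': [0.1, 0.2]})

# Lower-case the column names (the table name 'test' is already lower case),
# so SQLAlchemy has no reason to quote the identifiers.
df.columns = [c.lower() for c in df.columns]
df.to_sql('test', engine, if_exists='replace', index=False)

print(engine.execute("SELECT * FROM sqlite_master;").fetchall()[0][4])
# The emitted schema should now read: CREATE TABLE test (col1 BIGINT, col2 FLOAT)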
