I'm creating a SQLAlchemy engine (I have pyhdb and sqlalchemy-hana installed) for a HANA DB connection and passing it into pandas' to_sql function for DataFrames:
hanaeng = create_engine('hana://username:password@host_address:port')
my_df.to_sql('table_name', con=hanaeng, index=False, if_exists='append')
However, I keep getting this error:
sqlalchemy.exc.DatabaseError: (pyhdb.exceptions.DatabaseError) invalid column name
I created a table in my HANA schema whose column names and types match what I'm trying to pass in from the DataFrame.
Has anyone come across this error, or tried connecting to HANA using a SQLAlchemy engine? I also tried creating a connection object with the pyhdb connector and passing that into to_sql, but I believe pandas is shifting toward accepting only SQLAlchemy engine objects in to_sql rather than plain DBAPI connections. Regardless, any help would be great! Thank you
Yes, it does work for sure.
Your problem is that my_df contains a column name that does not match any column in the HANA table you are trying to insert the data into.
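A quick way to pin down the mismatch (a sketch, assuming the sqlalchemy-hana dialect supports reflection; the schema and table names below are placeholders) is to compare the DataFrame's columns with the columns SQLAlchemy sees in the target table. Keep in mind that HANA stores unquoted identifiers in uppercase, so a case mismatch is a frequent cause of this error.
from sqlalchemy import create_engine, inspect
# Placeholder credentials, table and schema names -- substitute your own
hanaeng = create_engine('hana://username:password@host_address:port')
insp = inspect(hanaeng)
table_cols = {c['name'] for c in insp.get_columns('table_name', schema='MY_SCHEMA')}
df_cols = set(my_df.columns)
print('In DataFrame but not in table:', df_cols - table_cols)
print('In table but not in DataFrame:', table_cols - df_cols)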
Related
I'm new to Python and Pandas - please be gentle!
I'm using SQLAlchemy with pymssql to execute a SQL query against a SQL Server database and then convert the result set into a dataframe. I'm then attempting to write this dataframe as a Parquet file:
import sqlalchemy as sal
import pandas as pd

engine = sal.create_engine(connectionString)
conn = engine.connect()
df = pd.read_sql(query, con=conn)
df.to_parquet(outputFile)
The data I'm retrieving in the SQL query includes a uniqueidentifier column (i.e. a UUID) named rowguid. Because of this, I'm getting the following error on the last line above:
pyarrow.lib.ArrowInvalid: ("Could not convert UUID('92c4279f-1207-48a3-8448-4636514eb7e2') with type UUID: did not recognize Python value type when inferring an Arrow data type", 'Conversion failed for column rowguid with type object')
Is there any way I can force all UUIDs to strings at any point in the above chain of events?
A few extra notes:
The goal for this portion of code was to receive the SQL query text as a parameter and act as a generic SQL-to-Parquet function.
I realise I can do something like df['rowguid'] = df['rowguid'].astype(str), but it relies on me knowing which columns have uniqueidentifier types. By the time it's a dataframe, everything is an object and each query will be different.
I also know I can convert it to a char(36) in the SQL query itself, however, I was hoping to do something more "automatic" so the person writing the query doesn't trip over this problem accidentally all the time / doesn't have to remember to always convert the datatype.
Any ideas?
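One generic workaround (just a sketch, not from the answer below; it assumes the driver hands back the uniqueidentifier values as Python uuid.UUID objects) is to stringify UUID values found in object-dtype columns before writing the Parquet file:
import uuid
import pandas as pd

def stringify_uuids(df: pd.DataFrame) -> pd.DataFrame:
    # Replace uuid.UUID values in object columns with plain strings,
    # leaving other values (including NULLs) untouched
    for col in df.select_dtypes(include='object').columns:
        df[col] = df[col].map(lambda v: str(v) if isinstance(v, uuid.UUID) else v)
    return df

df = stringify_uuids(pd.read_sql(query, con=conn))
df.to_parquet(outputFile)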
Try DuckDB
import duckdb
import sqlalchemy as sal
import pandas as pd

engine = sal.create_engine(connectionString)
conn = engine.connect()
df = pd.read_sql(query, con=conn)
# Close the database connection
conn.close()
# Create an in-memory DuckDB connection
duck_conn = duckdb.connect(':memory:')
# DuckDB can query the in-scope pandas DataFrame by name; this replaces the
# failing df.to_parquet() call and writes a snappy-compressed Parquet file
duck_conn.execute("COPY (SELECT * FROM df) TO 'df-snappy.parquet' (FORMAT 'parquet')")
Ref:
https://duckdb.org/docs/guides/python/sql_on_pandas
https://duckdb.org/docs/sql/data_types/overview
https://duckdb.org/docs/data/parquet
I have a 1,000,000 x 50 Pandas DataFrame that I am currently writing to a SQL table using:
df.to_sql('my_table', con, index=False)
It takes an incredibly long time. I've seen various explanations about how to speed up this process online, but none of them seem to work for MSSQL.
If I try the method in:
Bulk Insert A Pandas DataFrame Using SQLAlchemy
then I get a "no attribute copy_from" error.
If I try the multithreading method from:
http://techyoubaji.blogspot.com/2015/10/speed-up-pandas-tosql-with.html
then I get a "QueuePool limit of size 5 overflow 10 reached, connection timed out" error.
Is there any easy way to speed up to_sql() to an MSSQL table? Either via BULK COPY or some other method, but entirely from within Python code?
I've used ctds to do a bulk insert that's a lot faster with SQL Server. In the example below, df is the pandas DataFrame. The column sequence in the DataFrame matches the schema of the target table in mydb.
import ctds
conn = ctds.connect('server', user='user', password='password', database='mydb')
conn.bulk_insert('table', (df.to_records(index=False).tolist()))
In pandas 0.24 you can use method='multi' with a chunksize of 1000, which is the SQL Server limit on rows per multi-row INSERT:
chunksize=1000, method='multi'
https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io-sql-method
New in version 0.24.0.
The parameter method controls the SQL insertion clause used. Possible values are:
None: Uses standard SQL INSERT clause (one per row).
'multi': Pass multiple values in a single INSERT clause. It uses a special SQL syntax not supported by all backends. This usually provides better performance for analytic databases like Presto and Redshift, but has worse performance for traditional SQL backends if the table contains many columns. For more information check the SQLAlchemy documentation.
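Putting that together for the question's MSSQL case (a sketch; the connection string and table name are placeholders):
import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string -- adjust server, database and ODBC driver
engine = create_engine(
    'mssql+pyodbc://user:password@server/mydb?driver=ODBC+Driver+17+for+SQL+Server'
)
# SQL Server caps a multi-row VALUES clause at 1000 rows and a single statement
# at 2100 parameters, so for wide tables use a smaller chunksize,
# roughly 2100 // len(df.columns)
df.to_sql('my_table', engine, index=False, if_exists='append',
          method='multi', chunksize=1000)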
I had the same issue, so I used SQLAlchemy with pyodbc's fast_executemany.
from sqlalchemy import event, create_engine

engine = create_engine('connection_string_with_database')

# Switch pyodbc to fast_executemany for bulk inserts
@event.listens_for(engine, 'before_cursor_execute')
def plugin_bef_cursor_execute(conn, cursor, statement, params, context, executemany):
    if executemany:
        cursor.fast_executemany = True

Make sure the listener is registered after the engine is created and before anything executes a cursor.
df.to_sql('table', con=engine, if_exists='append', index=False)  # see the pandas to_sql documentation
I've got a weird problem here and I'm stuck. I'm rewriting a Python script that generates some CSV files, and I need to write the same info to a MySQL server.
I've managed to get it working... somehow.
Here is the part that creates CSV:
final_table.get_tunid_town_pivot().to_csv('result_pivot_tunid_town_' + ConsoleLog.get_curr_date_underline() + '.csv', sep=';')
And here is the part that loads data into MySQL table:
conn = pymysql.connect(host='localhost', port=3306, user='test', passwd='test', db='test')
final_table.get_tunid_town_pivot().to_sql(con=conn, name='TunID', if_exists='replace', flavor='mysql', index=False, chunksize=10000)
conn.close()
The problem is that there are 4 columns in the DataFrame, but in MySQL I only get the last column. I have no idea why that is happening, and I haven't found any similar problems. Any help, please?
Your DataFrame has (probably due to the pivoting) a MultiIndex of 3 levels and only 1 column. By default, to_sql will also write the index to the SQL table, but you did specify index=False, so only the one column will be written to SQL.
So either do include the index (use index=True), or reset the index first and then write the frame (df.reset_index().to_sql(..., index=False)).
Also note that passing a raw pymysql connection to to_sql is deprecated (it should give you a warning); you have to use it through an SQLAlchemy engine.
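A minimal sketch combining both suggestions (the credentials mirror the pymysql.connect() call above and are placeholders):
from sqlalchemy import create_engine

# Same placeholder credentials as the pymysql.connect() call above
engine = create_engine('mysql+pymysql://test:test@localhost:3306/test')

pivot = final_table.get_tunid_town_pivot()
# Flatten the 3-level MultiIndex into regular columns so all four columns
# reach MySQL, then skip writing the now-trivial RangeIndex
pivot.reset_index().to_sql(name='TunID', con=engine,
                           if_exists='replace', index=False, chunksize=10000)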
Using MSSQL (version 2012), I am using SQLAlchemy and pandas (on Python 2.7) to insert rows into a SQL Server table.
After trying pymssql and pyodbc with a specific server string, I am trying an ODBC data source name:
import sqlalchemy, pyodbc, pandas as pd
engine = sqlalchemy.create_engine("mssql+pyodbc://mssqlodbc")
sqlstring = "EXEC getfoo"
dbdataframe = pd.read_sql(sqlstring, engine)
This part works great and worked with the other methods (pymssql, etc). However, the pandas to_sql method doesn't work.
finaloutput.to_sql("MyDB.dbo.Loader_foo",engine,if_exists="append",chunksize="10000")
With this statement, I get a consistent error that pandas is trying to do a CREATE TABLE in the SQL Server master db, which it is not permissioned for.
How do I get pandas/SQLAlchemy/pyodbc to point to the correct MSSQL database? The to_sql method seems to ignore whatever I put in the engine connection string (although the read_sql method seems to pick it up just fine).
To have this question marked as answered: the problem is that you specify the schema in the table name itself. If you provide "MyDB.dbo.Loader_foo" as the table name, pandas will interpret this full string as the table name, instead of just "Loader_foo".
Solution is to only provide "Loader_foo" as table name. If you need to specify a specific schema to write this table into, you can use the schema kwarg (see docs):
finaloutput.to_sql("Loader_foo", engine, if_exists="append")
finaloutput.to_sql("Loader_foo", engine, if_exists="append", schema="something_else_as_dbo")
The object df is of type pandas.core.frame.DataFrame.
In [1]: type(df)
Out[1]: pandas.core.frame.DataFrame
The index of df is a DatetimeIndex
In [2]: type(df.index)
Out[2]: pandas.tseries.index.DatetimeIndex
And con gives a working MySQLdb connection
In [3]: type(con)
Out[3]: MySQLdb.connections.Connection
I've not been able to get this dataframe entered into a MySQL database correctly; specifically, the date field comes through as NULL when using the following (as well as some variations on it).
df.to_sql( name='existing_table',con=con, if_exists='append', index=True, index_label='date', flavor='mysql', dtype={'date': datetime.date})
What are the steps required to have this dataframe entered correctly into a local MySQL database, with 'date' as a date field in the db?
To correctly write datetime data to SQL, you need at least pandas 0.15.0.
Starting from pandas 0.14, the sql functions are implemented using SQLAlchemy to deal with the database flavor specific differences. So to use to_sql, you need to provide it an SQLAlchemy engine instead of a plain MySQLdb connection:
from sqlalchemy import create_engine
engine = create_engine('mysql+mysqldb://....')
df.to_sql('existing_table', engine, if_exists='append', index=True, index_label='date')
Note: you don't need to provide the flavor keyword anymore.
Plain DBAPI connections are no longer supported for writing data to SQL, except for sqlite.
See http://pandas.pydata.org/pandas-docs/stable/io.html#io-sql for more details.