pandas dataframe with Date index -> insert into MySQL - python

The object df is of type pandas.core.frame.DataFrame.
In [1]: type(df)
Out[1]: pandas.core.frame.DataFrame
The index of df is a DatetimeIndex
In [2]: type(df.index)
Out[2]: pandas.tseries.index.DatetimeIndex
And con gives a working MySQLdb connection
In [3]: type(con)
Out[3]: MySQLdb.connections.Connection
I've not been able to get this dataframe inserted into a MySQL database correctly; specifically, the date field comes through as NULL when using the following (as well as some variations on it).
df.to_sql(name='existing_table', con=con, if_exists='append', index=True, index_label='date', flavor='mysql', dtype={'date': datetime.date})
What are the steps required to have this dataframe entered correctly into a local MySQL database, with 'date' as a date field in the db?

To correctly write datetime data to SQL, you need at least pandas 0.15.0.
Starting from pandas 0.14, the sql functions are implemented using SQLAlchemy to deal with the database flavor specific differences. So to use to_sql, you need to provide it an SQLAlchemy engine instead of a plain MySQLdb connection:
from sqlalchemy import create_engine
engine = create_engine('mysql+mysqldb://....')
df.to_sql('existing_table', engine, if_exists='append', index=True, index_label='date')
Note: you don't need to provide the flavor keyword anymore.
Plain DBAPI connections are no longer supported for writing data to SQL, except for sqlite.
See http://pandas.pydata.org/pandas-docs/stable/io.html#io-sql for more details.
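The dtype argument of to_sql also accepts SQLAlchemy types, so as a rough sketch (the connection details and database name below are placeholders, and the dtype keyword requires a reasonably recent pandas) you can force the index to be stored as a DATE column rather than a DATETIME:
from sqlalchemy import create_engine
from sqlalchemy.types import Date

# Placeholder credentials and database name; adjust for your local MySQL setup
engine = create_engine('mysql+mysqldb://user:password@localhost/mydb')

# Store the DatetimeIndex as a DATE column named 'date' instead of a DATETIME
df.to_sql('existing_table', engine, if_exists='append',
          index=True, index_label='date', dtype={'date': Date()})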

Handling UUID values in Arrow with Parquet files

I'm new to Python and Pandas - please be gentle!
I'm using SQLAlchemy with pymssql to execute a SQL query against a SQL Server database and then convert the result set into a dataframe. I'm then attempting to write this dataframe as a Parquet file:
engine = sal.create_engine(connectionString)
conn = engine.connect()
df = pd.read_sql(query, con=conn)
df.to_parquet(outputFile)
The data I'm retrieving in the SQL query includes a uniqueidentifier column (i.e. a UUID) named rowguid. Because of this, I'm getting the following error on the last line above:
pyarrow.lib.ArrowInvalid: ("Could not convert UUID('92c4279f-1207-48a3-8448-4636514eb7e2') with type UUID: did not recognize Python value type when inferring an Arrow data type", 'Conversion failed for column rowguid with type object')
Is there any way I can force all UUIDs to strings at any point in the above chain of events?
A few extra notes:
The goal for this portion of code was to receive the SQL query text as a parameter and act as a generic SQL-to-Parquet function.
I realise I can do something like df['rowguid'] = df['rowguid'].astype(str), but it relies on me knowing which columns have uniqueidentifier types. By the time it's a dataframe, everything is an object and each query will be different.
I also know I can convert it to a char(36) in the SQL query itself, however, I was hoping to do something more "automatic" so the person writing the query doesn't trip over this problem accidentally all the time / doesn't have to remember to always convert the datatype.
Any ideas?
Try DuckDB:
import duckdb

engine = sal.create_engine(connectionString)
conn = engine.connect()
df = pd.read_sql(query, con=conn)

# Close the database connection
conn.close()

# Create an in-memory DuckDB connection
duck_conn = duckdb.connect(':memory:')

# Write the DataFrame content to a snappy-compressed Parquet file;
# DuckDB can query the pandas DataFrame df in scope by name
duck_conn.execute("COPY (SELECT * FROM df) TO 'df-snappy.parquet' (FORMAT 'parquet')")
Ref:
https://duckdb.org/docs/guides/python/sql_on_pandas
https://duckdb.org/docs/sql/data_types/overview
https://duckdb.org/docs/data/parquet
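Alternatively, to stay with to_parquet and generalise the astype(str) idea from the question, a rough sketch (the helper name is purely illustrative) could scan the object columns for UUID values and stringify them before writing:
import uuid
import pandas as pd

def stringify_uuid_columns(df: pd.DataFrame) -> pd.DataFrame:
    # Illustrative helper: turn uuid.UUID values in object columns into strings,
    # leaving nulls and non-UUID values untouched
    for col in df.select_dtypes(include='object').columns:
        if df[col].map(lambda v: isinstance(v, uuid.UUID)).any():
            df[col] = df[col].map(lambda v: str(v) if isinstance(v, uuid.UUID) else v)
    return df

df = stringify_uuid_columns(df)
df.to_parquet(outputFile)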

Why can't I retrieve Date time value from my database in Django orm?

I have the following scenario:
in models.py
class Lab(models.Model):
    test_id = models.IntegerField()
    test_name = models.CharField(max_length=10)
    test_date = models.DateField()
I migrated the DB to sqlite3 and filled it from an external Excel sheet. Now I am trying to do the following:
Lab.objects.values_list('test_date',flat=True)
This call raises the following error
*** ValueError: invalid literal for int() with base 10: b'09 00:00:00.000000'
I can query the other fields without any problem, but not the test_date value. What could be the mistake here?
Update
As pointed out, I manually filled the table using the following code snippet:
df['test_date'] = df['test_date'].astype(str)
df['test_date'] = df['test_date'].replace('0', np.nan)
str_date = df['test_date'].str.zfill(8)
df['test_date'] = pd.to_datetime(str_date, yearfirst=True, format='%Y%m%d')
then I filled it to the DB as
engine = create_engine('sqlite:///db.sqlite3', echo=False)
df.to_sql('Lab', con=engine, index=False, if_exists='append')
The issue is that pandas has inserted your test_date column as a datetime, not a plain date, so there is a time component on the end of the value that Django's DateField doesn't know what to do with.
I'm far from an expert on Pandas but I believe it would work if you explicitly specified the test_date field as a Date:
from sqlalchemy.types import Date
df.to_sql('Lab', con=engine, dtype={"test_date": Date()}, index=False, if_exists='append')
update your query if you want to apply some filters:
Lab.objects.filter(test_id=1).values_list('test_date',flat=True)
or if you want all data:
Lab.objects.all().values_list('test_date',flat=True)

Pandas to_sql changing datatype in database table

Has anyone experienced this before?
I have a table with "int" and "varchar" columns - a report schedule table.
I am trying to import an Excel file with a ".xls" extension into this table using a Python program, and I am using pandas to_sql to load in 1 row of data.
The imported data is 1 row by 11 columns.
Import works successfully but after the import I noticed that the datatypes in the original table have now been altered from:
int --> bigint
char(1) --> varchar(max)
varchar(30) --> varchar(max)
Any idea how I can prevent this? The switch in datatypes is causing issues in downstream routines.
df = pd.read_excel(schedule_file,sheet_name='Schedule')
params = urllib.parse.quote_plus(r'DRIVER={SQL Server};SERVER=<<IP>>;DATABASE=<<DB>>;UID=<<UDI>>;PWD=<<PWD>>')
conn_str = 'mssql+pyodbc:///?odbc_connect={}'.format(params)
engine = create_engine(conn_str)
table_name='REPORT_SCHEDULE'
df.to_sql(name=table_name,con=engine, if_exists='replace',index=False)
TIA
Consider using the dtype argument of pandas.DataFrame.to_sql where you pass a dictionary of SQLAlchemy types to named columns:
import sqlalchemy
...
data.to_sql(name=table_name, con=engine, if_exists='replace', index=False,
            dtype={'name_of_datefld': sqlalchemy.types.DateTime(),
                   'name_of_intfld': sqlalchemy.types.INTEGER(),
                   'name_of_strfld': sqlalchemy.types.VARCHAR(length=30),
                   'name_of_floatfld': sqlalchemy.types.Float(precision=3, asdecimal=True),
                   'name_of_booleanfld': sqlalchemy.types.Boolean})
I think this has more to do with how pandas handles the table if it already exists. The "replace" value of the if_exists argument tells pandas to drop your table and recreate it, but when re-creating the table it does so on its own terms (based on the data stored in that particular DataFrame).
While providing column datatypes will work, doing it for every such case might be cumbersome. So I would rather truncate the table in a separate statement and then just append data to it, like so:
Instead of:
df.to_sql(name=table_name, con=engine, if_exists='replace',index=False)
I'd do:
with engine.connect() as con:
    con.execute("TRUNCATE TABLE %s" % table_name)

df.to_sql(name=table_name, con=engine, if_exists='append', index=False)
The TRUNCATE statement empties the table inside the database itself, so the table keeps its original definition and the column types are left untouched.

Pandas to_sql method work with sqlalchemy hana connector?

I'm creating a SQLAlchemy engine (with pyhdb and sqlalchemy-hana installed) for a HANA DB connection and passing it into pandas' to_sql function for dataframes:
hanaeng = create_engine('hana://username:password@host_address:port')
my_df.to_sql('table_name', con=hanaeng, index=False, if_exists='append')
However, I keep getting this error:
sqlalchemy.exc.DatabaseError: (pyhdb.exceptions.DatabaseError) invalid column name
I created a table in my Hana schema that matches the column names and type of what I'm trying to pass into it from the dataframe.
Has anyone ever come across this error, or tried connecting to HANA using a SQLAlchemy engine? I tried using a pyhdb connector to make a connection object and passing that into to_sql, but I believe pandas is moving towards accepting only SQLAlchemy engine objects in to_sql rather than plain DBAPI connections. Regardless, any help will be great! Thank you
Yes, it does work for sure.
Your problem is that my_df contains a column name that does not match any column in the HANA table you are trying to insert the data into.
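To track down the offending column, a small sketch along these lines (reusing the hanaeng engine, my_df and the table name from the question) should show any mismatch; note that HANA typically stores unquoted identifiers in upper case, so a case difference alone can trigger this error:
from sqlalchemy import inspect

# Compare the DataFrame's columns against the target table's columns
inspector = inspect(hanaeng)
db_cols = {c['name'] for c in inspector.get_columns('table_name')}
df_cols = set(my_df.columns)

print("In DataFrame but not in table:", df_cols - db_cols)
print("In table but not in DataFrame:", db_cols - df_cols)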

SQLAlchemy/pandas to_sql for SQLServer -- CREATE TABLE in master db

Using MSSQL (version 2012), I am using SQLAlchemy and pandas (on Python 2.7) to insert rows into a SQL Server table.
After trying pymssql and pyodbc with a specific server string, I am trying an odbc name:
import sqlalchemy, pyodbc, pandas as pd
engine = sqlalchemy.create_engine("mssql+pyodbc://mssqlodbc")
sqlstring = "EXEC getfoo"
dbdataframe = pd.read_sql(sqlstring, engine)
This part works great and worked with the other methods (pymssql, etc). However, the pandas to_sql method doesn't work.
finaloutput.to_sql("MyDB.dbo.Loader_foo",engine,if_exists="append",chunksize="10000")
With this statement, I get a consistent error that pandas is trying to do a CREATE TABLE in the SQL Server master db, which it does not have permission for.
How do I get pandas/SQLAlchemy/pyodbc to point to the correct MSSQL database? The to_sql method seems to ignore whatever I put in the engine connection string (although the read_sql method seems to pick it up just fine).
To have this question marked as answered: the problem is that you specify the schema in the table name itself. If you provide "MyDB.dbo.Loader_foo" as the table name, pandas will interpret this full string as the table name, instead of just "Loader_foo".
Solution is to only provide "Loader_foo" as table name. If you need to specify a specific schema to write this table into, you can use the schema kwarg (see docs):
finaloutput.to_sql("Loader_foo", engine, if_exists="append")
finaloutput.to_sql("Loader_foo", engine, if_exists="append", schema="something_else_as_dbo")
