Pandas to_sql int column turning to float64 because of null values - python

I am running python 3.10, pandas 1.5.2, sqlalchemy 1.4.46, and pyodbc 4.0.35 on an Ubuntu subsystem on Windows (WSL2).
I think this was a bug in earlier pandas versions, but it is widely reported as fixed, so I am not sure why I am still seeing it.
I have a dataframe (df) like this:
int_col_1  int_col_2  string_col
        1         10      'val1'
        2         20        None
        3       None      'val3'
Then I push this to an MSSQL database using this code:
connection_string = 'DRIVER={ODBC Driver 18 for SQL Server};SERVER=' + server + ';DATABASE=' + database + ';UID=' + username + ';PWD=' + password + ';Encrypt=no'
connection_url = URL.create("mssql+pyodbc", query={"odbc_connect": connection_string})
engine = create_engine(connection_url)
df.to_sql('table', engine, if_exists='replace', index=False)
Or with dtypes like this:
from sqlalchemy.types import Integer, String
df.to_sql('table', engine, if_exists='replace', index=False, dtype={'int_col_1': Integer(), 'int_col_2': Integer(), 'string_col': String()})
Either way this is what is returned if I pull the table from SQL:
int_col_1  int_col_2  string_col
        1       10.0      'val1'
        2       20.0        None
        3        NaN      'val3'
You'll notice that int_col_1 is fine (int64), string_col is fine (object with NoneType), but int_col_2 is turned into float64.
I know there are ways to correct the data after pulling it from SQL, but I am looking for a way to get the data to be correct in SQL when it is pushed from pandas, and I have no idea how to do that. Any help would be appreciated.

You can cast the column to the nullable Int64 dtype:
import numpy as np
import pandas as pd
df = pd.DataFrame({'a': [1, 2, np.nan]})
df['a'] = df['a'].astype('Int64')
df
      a
0     1
1     2
2  <NA>
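Applied to the original example, a minimal sketch (assuming the same df and engine from the question) would cast the nullable column before calling to_sql; the missing value then stays pd.NA instead of becoming NaN and should reach SQL Server as a NULL integer rather than a float:
import pandas as pd
from sqlalchemy.types import Integer, String

# Nullable integer dtype: missing entries become pd.NA and are written as NULL.
df['int_col_2'] = df['int_col_2'].astype('Int64')

df.to_sql(
    'table',
    engine,  # engine created as in the question
    if_exists='replace',
    index=False,
    dtype={'int_col_1': Integer(), 'int_col_2': Integer(), 'string_col': String()},
)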

Related

Is there a way to store both a pandas dataframe and separate string var in the same SQLite table?

Disclaimer: This is my first time posting a question here, so I apologize if I didn't post this properly.
I recently started learning how to use SQLite in Python. As the title suggests, I have a Python object with a string attribute and a pandas dataframe attribute, and I want to know if/how I can add both of these to the same SQLite table. Below is the code I have so far. The mydb.db file gets created successfully, but on insert I get the following error message:
sqlite3.InterfaceError: Error binding parameter :df- probably unsupported type.
I know you can use df.to_sql('mydbs', conn) to store a pandas dataframe in an SQL table, but this wouldn't seem to allow for an additional string to be added to the same table and then retrieved separately from the dataframe. Any solutions or alternative suggestions are appreciated.
Python Code:
# Python 3.7
import sqlite3
import pandas as pd
import myclass
conn = sqlite3.connect("mydb.db")
c = conn.cursor()
c.execute("""CREATE TABLE mydbs (
name text,
df blob
)""")
conn.commit()
c.execute("INSERT INTO mydbs VALUES (:name, :df)", {'name': myclass.name, 'df': myclass.df})
conn.commit()
conn.close()
It looks like you are trying to store a dataframe in a single SQL table 'cell'. That is a bit unusual, since SQL is designed for storing tables of data, and a dataframe is arguably something that should be stored as a table of its own (hence the built-in pandas function). To accomplish what you want specifically, you could pickle the dataframe and store the pickled string:
import codecs
import pickle
import pandas as pd
import sqlite3
df = pd.DataFrame({"foo": range(5), "bar": range(5, 10)})
pickled = codecs.encode(pickle.dumps(df), "base64").decode()
df
foo bar
0 0 5
1 1 6
2 2 7
3 3 8
4 4 9
Store & Retrieve:
conn = sqlite3.connect("mydb.db")
c = conn.cursor()
c.execute("""CREATE TABLE mydbs (
name text,
df text
)""")
c.execute("INSERT INTO mydbs VALUES (:name, :df)", {'name': 'name', 'df': pickled})
conn.commit()
c.execute('SELECT * FROM mydbs')
result = c.fetchall()
unpickled = pickle.loads(codecs.decode(result[0][1].encode(), "base64"))
conn.close()
unpickled
foo bar
0 0 5
1 1 6
2 2 7
3 3 8
4 4 9
If you wanted to store the dataframe as an SQL table (which imo makes more sense and is simpler) and you need a name to go with it, you can just add a 'name' column to the df:
import pandas as pd
from sqlalchemy import create_engine
df = pd.DataFrame({"foo": range(5), "bar": range(5, 10)})
df
foo bar
0 0 5
1 1 6
2 2 7
3 3 8
4 4 9
Add name column, then save to db and retrieve:
df['name'] = 'the df name'
engine = create_engine('sqlite://', echo=False)
df.to_sql('users', con=engine)
r = engine.execute("SELECT * FROM users").fetchall()  # raw rows via SQL
r = pd.read_sql('users', con=engine)                  # or straight into a DataFrame
r
index foo bar name
0 0 0 5 the df name
1 1 1 6 the df name
2 2 2 7 the df name
3 3 3 8 the df name
4 4 4 9 the df name
But even that method may not be ideal, since you are effectively adding an extra column of data for each df, which could get costly on a large project where database size (and maybe even speed, although SQL is quite fast) is a factor. In this case it may be best to use relational tables; for that I refer you here, since there is no point re-writing the code in this answer. Using a relational model would be the most 'proper' solution imo, since it fully embodies the purpose of SQL.
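As a rough illustration of that relational idea (only a sketch; the table names frames and frame_rows and the frame_id column are invented here), you could keep the names in one table and the dataframe rows in another, linked by an id:
import sqlite3
import pandas as pd

conn = sqlite3.connect("mydb.db")
c = conn.cursor()

# One row per stored dataframe: its id and name.
c.execute("CREATE TABLE IF NOT EXISTS frames (frame_id INTEGER PRIMARY KEY, name TEXT)")
c.execute("INSERT INTO frames (name) VALUES (?)", ("the df name",))
frame_id = c.lastrowid
conn.commit()

# The dataframe rows live in their own table, tagged with that id.
df = pd.DataFrame({"foo": range(5), "bar": range(5, 10)})
df["frame_id"] = frame_id
df.to_sql("frame_rows", conn, if_exists="append", index=False)

# Retrieve a dataframe together with its name by joining on the id.
out = pd.read_sql(
    "SELECT fr.foo, fr.bar, f.name "
    "FROM frame_rows fr JOIN frames f ON fr.frame_id = f.frame_id "
    "WHERE f.frame_id = ?",
    conn,
    params=(frame_id,),
)
conn.close()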

How to fix programming error when using pg8000 module with read_sql

Hi, I'm trying to query a table in AWS Redshift into a pandas dataframe using a Glue job. I am using pg8000 to connect (as sqlalchemy is not supported in AWS Glue).
When I use the read_sql or read_sql_query function of pandas to query the table, I get extra characters in the dataframe, which I guess is a problem with the pg8000 DBAPI:
conn = pg8000.connect(user='postgres', password='*****', host='127.0.0.1', port=5439, database='lifungdb')
cursor = conn.cursor()
df = pd.read_sql("select * from Customer", conn)
print(df)
print(df) returns the columns with an extra b char. How do I strip that extra char?
b'id'  b'Name'  b'Address'  b'Contact'
1      Sam      Texas       na
Using list-comprehension to decode the utf-8 strings:
import pandas as pd
a = [['1', 'sam', 'Texas', 'na']]
df = pd.DataFrame(a, columns=[b'id', b'Name', b'Address', b'Contact'])
df.columns = [x.decode('utf-8') for x in df.columns]
print(df)
OUTPUT:
id Name Address Contact
0 1 sam Texas na
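Applied to the actual query result (a sketch, assuming conn is the pg8000 connection from the question), decoding only the names that are really bytes keeps it safe if some columns come back as plain strings:
import pandas as pd

df = pd.read_sql("select * from Customer", conn)
# Decode byte column names returned through the pg8000 cursor; leave str names alone.
df.columns = [c.decode('utf-8') if isinstance(c, bytes) else c for c in df.columns]
print(df)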

Using Pandas Dataframe within a SQL Join

I'm trying to perform a SQL join on the contents of a dataframe with an external table I have in a Postgres database.
This is what the Dataframe looks like:
>>> df
name author count
0 a b 10
1 c d 5
2 e f 2
I need to join it with a Postgres table that looks like this:
TABLE: blog
title author url
a b w.com
b b x.com
e g y.com
This is what I'm attempting to do, but this doesn't appear to be the right syntax for the query:
>>> sql_join = r"""select b.*, frame.* from ({0}) frame
join blog b
on frame.name = b.title
where frame.owner = b.owner
order by frame.count desc
limit 30;""".format(df)
>>> res = pd.read_sql(sql_join, connection)
I'm not sure how I can use the values in the dataframes within the sql query.
Can someone point me in the right direction? Thanks!
Edit: As per my use case, I'm not able to convert the blog table into a dataframe given memory and performance constraints.
I managed to do this without having to convert the dataframe to a temp table and without reading the blog table into a dataframe.
For anyone else facing the same issue, this is achieved using a virtual table of sorts.
This is what my final SQL query looks like:
>>> inner_string = "VALUES ('a','b',10), ('c','d',5), ('e','f',2)"
>>> sql_join = r"""SELECT * FROM blog
JOIN ({0}) AS frame(title, owner, count)
ON blog.title = frame.title
WHERE blog.owner = frame.owner
ORDER BY frame.count DESC
LIMIT 30;""".format(inner_string)
>>> res = pd.read_sql(sql_join, connection)
You can use string manipulation to convert all rows in the dataframe into one large string similar to inner_string.
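As an illustration of that string manipulation (a sketch only: it relies on repr() for quoting and does no real SQL escaping, so it is not safe for untrusted data):
# Build the VALUES (...) list from the dataframe rows.
rows = ", ".join(
    "({0})".format(", ".join(repr(v) for v in row))
    for row in df.itertuples(index=False)
)
inner_string = "VALUES " + rows
# -> "VALUES ('a', 'b', 10), ('c', 'd', 5), ('e', 'f', 2)"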
You should create another dataframe from the Postgres table and then join both dataframes.
You can use read_sql to create a df from the table:
import psycopg2 ## Python connector library to Postgres
import pandas as pd
conn = psycopg2.connect(...) ## Put your DB credentials here
blog_df = pd.read_sql('SELECT * FROM blog', con=conn)
## This reads the `blog` table's data into blog_df
It should look like this:
In [258]: blog_df
Out[258]:
title author url
0 a b w.com
1 b b x.com
2 e g y.com
Now, you can join df and blog_df using merge like below:
In [261]: pd.merge(df, blog_df, left_on='name', right_on='title')
Out[261]:
name author_x count title author_y url
0 a b 10 a b w.com
1 e f 2 e g y.com
You will get a result like the above, which you can clean up further.
Let me know if this helps.
I've had similar problems. I found a work-around that lets me join across two different servers where I only have read-only rights: use sqlalchemy to insert the pandas dataframe into a temp table and then join against it.
import sqlalchemy as sa
import pandas as pd

# engine and session are assumed to already exist for the target server;
# x is a placeholder for the maximum string length of each column.
metadata = sa.MetaData()
sql_of_df = sa.Table(
    "##df",  # global temp table (SQL Server syntax)
    metadata,
    sa.Column("name", sa.String(x), primary_key=True),
    sa.Column("author", sa.String(x), nullable=False),
    sa.Column("count", sa.Integer),
)
metadata.create_all(engine)

# Insert the dataframe rows into the temp table via bound parameters.
dataframe_dict = df.to_dict(orient='records')
insert_statement = sql_of_df.insert().values(
    {
        "name": sa.bindparam("name"),
        "author": sa.bindparam("author"),
        "count": sa.bindparam("count"),
    }
)
session.execute(insert_statement, dataframe_dict)

statement = sa.text("SELECT * FROM blog INNER JOIN ##df ON blog.Title = ##df.name")
session.execute(statement)
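To pull the join result back into pandas (a sketch, reusing the session assumed above), something like this should work:
# Read the joined result straight into a dataframe instead of via session.execute.
joined = pd.read_sql(statement, session.connection())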

Save pandas (string/object) column as VARCHAR in Oracle DB instead of CLOB (default behaviour)

I am trying to transfer a dataframe to an Oracle database, but the transfer is taking too long because the datatype of the column shows up as CLOB in Oracle. I believe that if I convert the datatype from CLOB to a string of 9 digits with padded 0's, it will not take that much time. The data is:
product
000012320
000234234
Is there a way to change the datatype of this column to a string of 9 digits, so that Oracle does not treat it as a CLOB object? I have tried the following:
df['product'] = df['product'].astype(str)
Or is there something else that might be slowing the transfer from Python to Oracle?
Here is a demo:
import cx_Oracle
from sqlalchemy import types, create_engine
engine = create_engine('oracle://user:password@host_or_scan_address:1521:ORACLE_SID')
#engine = create_engine('oracle://user:password@host_or_scan_address:1521/ORACLE_SERVICE_NAME')
In [32]: df
Out[32]:
c_str c_int c_float
0 aaaaaaa 4 0.046531
1 bbb 6 0.987804
2 ccccccccccccc 7 0.931600
In [33]: df.to_sql('test', engine, index_label='id', if_exists='replace')
In Oracle DB:
SQL> desc test
Name Null? Type
------------------- -------- -------------
ID NUMBER(19)
C_STR CLOB
C_INT NUMBER(38)
C_FLOAT FLOAT(126)
Now let's specify an SQLAlchemy dtype of VARCHAR(max_length_of_C_STR_column):
In [41]: df.c_str.str.len().max()
Out[41]: 13
In [42]: df.to_sql('test', engine, index_label='id', if_exists='replace',
....: dtype={'c_str': types.VARCHAR(df.c_str.str.len().max())})
In Oracle DB:
SQL> desc test
Name Null? Type
--------------- -------- -------------------
ID NUMBER(19)
C_STR VARCHAR2(13 CHAR)
C_INT NUMBER(38)
C_FLOAT FLOAT(126)
PS: for padding your string with 0's, please check piRSquared's answer below.
Use str.zfill:
df['product'].astype(str).str.zfill(9)
0    000012320
1    000234234
Name: product, dtype: object
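Putting the two answers together for the original transfer (a sketch, reusing the engine from the demo above; the table name products is just an example):
from sqlalchemy import types

# Pad to 9 digits and keep the values as strings.
df['product'] = df['product'].astype(str).str.zfill(9)

# Ask for a VARCHAR2(9) column instead of the default CLOB.
df.to_sql('products', engine, index=False, if_exists='replace',
          dtype={'product': types.VARCHAR(9)})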

pandas read_sql drops dot in column names

Is this a bug, or am I doing something specifically wrong?
I create a df and put it in a SQL table; both the df and the table have a column with a dot in its name.
Now when I read the df back from the SQL table, the column names aren't the same.
I wrote this little piece of code so that people can test it:
import sqlalchemy
import pandas as pd
import numpy as np

engine = sqlalchemy.create_engine('sqlite:///test.sqlite')
dfin = pd.DataFrame(np.random.randn(10, 2), columns=['column with a . dot', 'without'])
print(dfin)
dfin.to_sql('testtable', engine, if_exists='fail')

tables = engine.table_names()
for table in tables:
    sql = 'SELECT t.* FROM "' + table + '" t'
    dfout = pd.read_sql(sql, engine)
    print(dfout.columns)
    print(dfout)
The solution is to pass sqlite_raw_colnames=True to your engine via execution_options:
In [141]: engine = sqlalchemy.create_engine('sqlite:///', execution_options={'sqlite_raw_colnames':True})
In [142]: dfin.to_sql('testtable', engine, if_exists='fail')
In [143]: pd.read_sql("SELECT * FROM testtable", engine).head()
Out[143]:
index column with a . dot without
0 0 0.213645 0.321328
1 1 -0.511033 0.496510
2 2 -1.114511 -0.030571
3 3 -1.370342 0.359123
4 4 0.101111 -1.010498
SQLAlchemy does this stripping of dots deliberately (in some cases SQLite may store col names as "tablename.colname"), see eg sqlalchemy+sqlite stripping column names with dots? and https://groups.google.com/forum/?hl=en&fromgroups#!topic/sqlalchemy/EqAuTFlMNZk
This seems to be a bug, but not necessarily in the pandas read_sql function, as it relies on the keys method of the SQLAlchemy ResultProxy object to determine the column names, and that seems to truncate them:
In [15]: result = engine.execute("SELECT * FROM testtable")
In [16]: result.keys()
Out[16]: [u'index', u' dot', u'without']
So the question is whether this is a bug in SQLAlchemy, or whether pandas should work around it (e.g. by using result.cursor.description, which gives the correct names).
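For illustration, a small sketch based on the snippet above showing that the underlying cursor still has the full names:
result = engine.execute("SELECT * FROM testtable")
# cursor.description keeps the untruncated names, dot included.
print([col[0] for col in result.cursor.description])
# -> ['index', 'column with a . dot', 'without']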
For now, you can also use the sqlite fallback mode by passing a DBAPI connection instead of a SQLAlchemy engine; as this relies on cursor.description, the correct column names are used:
In [20]: con = sqlite3.connect(':memory:')
In [21]: dfin.to_sql('testtable', con, if_exists='fail')
In [22]: pd.read_sql("SELECT * FROM testtable", con).head()
Out[22]:
index column with a . dot without
0 0 0.213645 0.321328
1 1 -0.511033 0.496510
2 2 -1.114511 -0.030571
3 3 -1.370342 0.359123
4 4 0.101111 -1.010498
