I'm trying to use pandas to_sql to insert data from .csv files into an MSSQL database. No matter how I do it, I run into this error:
pyodbc.DataError: ('String data, right truncation: length 8 buffer 4294967294', '22001')
The code I'm running looks like this:
import pandas as pd
from sqlalchemy import create_engine
df = pd.read_csv('foo.csv')
engine = create_engine("mssql+pyodbc://:#Test")
with engine.connect() as conn, conn.begin():
    df.to_sql(name='test', con=conn, schema='foo', if_exists='append', index=False)
Any help would be appreciated!
P.S. I'm still fairly new to Python and MSSQL.
Okay so I didn't have my DSN configured correctly. The driver I was using was SQL Server and I needed to change it to ODBC Driver 13 for SQL Server. That fixed all my problems.
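For a setup without a DSN, the same driver choice can be made explicit in the connection string. The sketch below only builds the URL and does not connect; the server, database, and credential values are placeholders that do not come from the question, and mssql_url is a hypothetical helper. Passing a raw ODBC string through odbc_connect is one of the forms SQLAlchemy's pyodbc dialect accepts.

```python
from urllib.parse import quote_plus


def mssql_url(driver, server, database, uid, pwd):
    """Build a SQLAlchemy URL that carries a raw ODBC connection string."""
    odbc = (
        f"DRIVER={{{driver}}};SERVER={server};DATABASE={database};"
        f"UID={uid};PWD={pwd}"
    )
    # The ODBC string must be URL-encoded before it goes into the query part.
    return "mssql+pyodbc:///?odbc_connect=" + quote_plus(odbc)


# Placeholder values; replace with your own server details.
url = mssql_url("ODBC Driver 13 for SQL Server", "myserver", "mydb", "user", "secret")
print(url)
```

The resulting string could then be handed to create_engine in place of the DSN-based URL.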
In the following Python code I can successfully connect to an MS Azure SQL DB over an ODBC connection and load data into an Azure SQL table using the pandas DataFrame method to_sql(...). But when I use pyspark.pandas instead, to_sql(...) fails, stating that no such method is supported. I know the pandas API on Spark has reached about 97% coverage, but I was wondering whether there is an alternative way of achieving the same while still using ODBC.
Question: In the following code sample, how can we use ODBC connection for pyspark.pandas for connecting to Azure SQL db and load a dataframe into a SQL table?
import sqlalchemy as sq
#import pandas as pd
import pyspark.pandas as ps
import datetime
data_df = ps.read_csv('/dbfs/FileStore/tables/myDataFile.csv', low_memory=False, quotechar='"', header='infer')
.......
data_df.to_sql(name='CustomerOrderTable', con=engine, if_exists='append', index=False,
               dtype={'OrderID': sq.VARCHAR(10),
                      'Name': sq.VARCHAR(50),
                      'OrderDate': sq.DATETIME()})
Ref: Pandas API on Spark and this
UPDATE: The data file is about 6.5 GB with 150 columns and 15 million records, so plain pandas cannot handle it and, as expected, gives an OOM (out of memory) error.
I noticed you were appending the data to the table, so this workaround came to mind.
Break the pyspark.pandas DataFrame into chunks, convert each chunk to pandas, and append it from there.
import math
n_chunks = 20  # break it into 20 chunks
chunk_size = max(1, math.ceil(len(data_df) / n_chunks))  # rows per chunk
for i in range(0, len(data_df), chunk_size):
    chunk = data_df[i:i + chunk_size].to_pandas()  # only this chunk is pulled into driver memory
    chunk.to_sql(name='CustomerOrderTable', con=engine, if_exists='append', index=False)
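The chunking idea comes down to computing row offsets, and it is easy to mix up the number of chunks with the number of rows per chunk. Here is a minimal sketch of the arithmetic on its own, with no Spark involved; chunk_bounds is a hypothetical helper, and the 15 million row count is taken from the question's update.

```python
def chunk_bounds(n_rows, n_chunks):
    """Yield (start, stop) row offsets covering n_rows in at most n_chunks pieces."""
    if n_rows <= 0:
        return
    size = -(-n_rows // n_chunks)  # ceiling division: rows per chunk
    for start in range(0, n_rows, size):
        yield start, min(start + size, n_rows)


bounds = list(chunk_bounds(15_000_000, 20))
print(len(bounds), bounds[0], bounds[-1])
```

Each (start, stop) pair could then be used to slice the DataFrame before converting that slice to pandas.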
According to the official Apache Spark documentation for pyspark.pandas, the module provides no method for loading a DataFrame into a SQL table.
Please see all provided methods here.
As an alternative approach, similar questions have been asked in these SO threads, which might be helpful.
How to write to a Spark SQL table from a Panda data frame using PySpark?
How can I convert a pyspark.sql.dataframe.DataFrame back to a sql table in databricks notebook
I need to upload a table I modified to my Oracle database. I exported the table as a pandas DataFrame, modified it, and now want to upload it to the DB.
I am trying to do this using the df.to_sql function as follows:
import sqlalchemy as sa
import pandas as pd
engine = sa.create_engine('oracle://"IP_address_of_server"/"serviceDB"')
df.to_sql("table_name",engine, if_exists='replace', chunksize = None)
I always get this error: DatabaseError: (cx_Oracle.DatabaseError) ORA-12505: TNS:listener does not currently know of SID given in connect descriptor (Background on this error at: http://sqlalche.me/e/4xp6).
I am not an expert in this, so I could not understand what the problem is, especially since the IP address I am giving is the right one.
Could anyone help? Thanks a lot!
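One possible cause of ORA-12505, offered as an assumption rather than a diagnosis: in a URL of the form oracle://host/name, SQLAlchemy's cx_Oracle dialect treats name as a SID. If the database is identified by a service name, it goes in the service_name query parameter instead. The sketch below only builds such a URL; oracle_url is a hypothetical helper, and the user, password, host, and port are placeholders.

```python
from urllib.parse import quote_plus


def oracle_url(user, password, host, port, service_name):
    """Build a SQLAlchemy URL that connects by service name rather than SID."""
    return (
        f"oracle+cx_oracle://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/?service_name={quote_plus(service_name)}"
    )


# Placeholder credentials and address, for illustration only.
example_url = oracle_url("scott", "tiger", "192.0.2.10", 1521, "serviceDB")
print(example_url)
```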
I have been attempting to use Python to upload a table into Microsoft SQL Server. I have had great success with smaller tables, but start to get errors when there is a large number of columns or rows. I don't believe it is the filesize that is the issue, but I may be mistaken.
The same error comes up whether the data is from an Excel file, csv file, or query.
When I run the code, it does create a table in SQL Server, but only has the column headers (the rest being blank).
This is the code that I am using, which works for smaller files but gives me the below error for the larger ones:
import pyodbc
#import cx_Oracle
import pandas as pd
from sqlalchemy import create_engine
connstr_Dev = ('DSN='+ODBC_Dev+';UID='+SQLSN+';PWD='+SQLpass)
conn_Dev = pyodbc.connect(connstr_Dev)
cursor_Dev=conn_Dev.cursor()
engine_Dev = create_engine('mssql+pyodbc://'+ODBC_Dev)
upload_file= "M:/.../abc123.xls"
sql_table_name='abc_123_sql'
pd.read_excel(upload_file).to_sql(sql_table_name, engine_Dev, schema='dbo', if_exists='replace', index=False, index_label=None, chunksize=None, dtype=None)
conn_Dev.commit()
conn_Dev.close()
This gives me the following error:
ProgrammingError: (pyodbc.ProgrammingError) ('The SQL contains -13854
parameter markers, but 248290 parameters were supplied', 'HY000') .......
(Background on this error at: http://sqlalche.me/e/f405)
The error log in the provided link doesn't give me any ideas on troubleshooting.
Anything I can tweak in the code to make this work?
Thanks!
Upgrading to pandas 0.23.4 solved it for me. Which version are you using?
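The numbers in the error message hint at what is going on: 248290 does not fit in a signed 16-bit counter, and wrapping it around gives exactly -13854. SQL Server also documents a limit of 2100 parameters per request. The sketch below reproduces the wraparound and, assuming a pandas version that packs several rows into one INSERT statement, computes a row count per chunk that stays under that limit. Both helpers are hypothetical and not part of pandas or pyodbc.

```python
SQL_SERVER_MAX_PARAMS = 2100  # documented SQL Server limit per request


def wrap_int16(n):
    """Interpret n as a signed 16-bit integer (two's complement wraparound)."""
    return (n + 2**15) % 2**16 - 2**15


def rows_per_chunk(n_columns, max_params=SQL_SERVER_MAX_PARAMS):
    """Largest row count whose total parameter count stays below max_params."""
    return max(1, (max_params - 1) // n_columns)


print(wrap_int16(248290))   # the marker count reported in the error
print(rows_per_chunk(10))   # rows per INSERT for a 10-column table
```

The second value could be passed as chunksize to to_sql if upgrading pandas is not an option.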
I have a pandas dataframe of approx 300,000 rows (20mb), and want to write to a SQL server database.
I have the following code but it is very very slow to execute. Wondering if there is a better way?
import pandas
import sqlalchemy
engine = sqlalchemy.create_engine('mssql+pyodbc://rea-eqx-dwpb/BIWorkArea?driver=SQL+Server')
df.to_sql(name='LeadGen Imps&Clicks', con=engine, schema='BIWorkArea',
          if_exists='replace', index=False)
If you want to speed up writing to the SQL database, you can preset the dtypes of the table in your database based on the data types of your pandas DataFrame:
from sqlalchemy import types, create_engine
d = {}
for k, v in zip(df.dtypes.index, df.dtypes):
    if v == 'object':
        # size the VARCHAR to the longest string in the column
        d[k] = types.VARCHAR(int(df[k].str.len().max()))
    elif v == 'float64':
        d[k] = types.FLOAT(126)
    elif v == 'int64':
        d[k] = types.INTEGER()
Then
df.to_sql(name='LeadGen Imps&Clicks', con=engine, schema='BIWorkArea', if_exists='replace', index=False,dtype=d)
I have a .sql file stored locally on my PC. I want to open and read it using the pandas library. Here is what I have tried:
import pandas as pd
import sqlite3
my_file = r'C:\Users\me\Downloads\database.sql'
#I am creating an empty database
conn = sqlite3.connect(r'C:\Users\test\Downloads\test.db')
#I am reading my file
df = pd.read_sql(my_file, conn)
However, I am receiving the following error:
DatabaseError: Execution failed on sql 'C:\Users\me\Downloads\database.sql': near "C": syntax error
Try moving the file to D://
Sometimes Python is not granted access to read/write in C.
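So that may be the issue.
You can also try an alternative method using cursors.
cur = conn.cursor()
cur.execute("SELECT name FROM sqlite_master")  # run a query first; this one lists the tables
r = cur.fetchall()
r would then contain a list of tuples, one per row of the result.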
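Note that pd.read_sql expects a SQL query or a table name, not a file path, which is why the path itself appears in the syntax error. If the .sql file holds SQLite-compatible statements, one way to load it is sqlite3's executescript, followed by an ordinary query against the tables it creates. The sketch below writes a tiny stand-in script to a temporary file so that it runs on its own; the table name and rows are made up for illustration.

```python
import os
import sqlite3
import tempfile

# Stand-in for database.sql: a hypothetical script with one small table.
script = """
CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT);
INSERT INTO people (name) VALUES ('Ada'), ('Grace');
"""
with tempfile.NamedTemporaryFile("w", suffix=".sql", delete=False) as f:
    f.write(script)
    sql_path = f.name

conn = sqlite3.connect(":memory:")
with open(sql_path, encoding="utf-8") as f:
    conn.executescript(f.read())  # run every statement in the file

rows = conn.execute("SELECT id, name FROM people ORDER BY id").fetchall()
print(rows)

conn.close()
os.remove(sql_path)
```

With pandas, calling pd.read_sql("SELECT * FROM people", conn) while the connection is still open would return the same rows as a DataFrame.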