Error: _get_column_info() When uploading Pandas DataFrame to Redshift - python

I'm trying to upload a pandas DataFrame directly to Redshift using the to_sql function.
connstr = 'redshift+psycopg2://%s:%s@%s.redshift.amazonaws.com:%s/%s' % (
    username, password, cluster, port, db_name)

def send_data(df, block_size=10000):
    engine = create_engine(connstr)
    with engine.connect() as conn, conn.begin():
        df.to_sql(name='my_table_clean', schema='my_schema', con=conn, index=False,
                  if_exists='replace', chunksize=block_size)
    del engine
The table my_schema.my_table_clean exists (but is empty), and the connection built using connstr is also valid (verified by a corresponding retrieve_data method). The retrieve function pulls data from my_table, and my script cleans it up using pandas to output to my_table_clean.
The problem is, I keep getting the following error:
TypeError: _get_column_info() takes exactly 9 arguments (8 given)
during the to_sql function.
I can't seem to figure out what is causing this error. Is anyone familiar with it?
Using
python 2.7.13
pandas 0.20.2
sqlalchemy 1.2.0.
Note: I'm trying to circumvent S3 -> Redshift for this script since I don't want to create a folder in my bucket just for one file, and this single script doesn't conform to my overall ETL structure. I'm hoping to just run this one script after the ETL that creates the original my_table.
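For context, the redshift+psycopg2:// URL is handled by the sqlalchemy-redshift dialect, so its version matters alongside the SQLAlchemy version listed above. A quick diagnostic sketch (assuming the dialect was installed under its usual PyPI name, sqlalchemy-redshift):

import pkg_resources

# print the installed versions of the packages involved in table reflection
for pkg in ("sqlalchemy", "sqlalchemy-redshift", "pandas"):
    print("%s %s" % (pkg, pkg_resources.get_distribution(pkg).version))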

Related

Trying to read sqlite database to Dask dataframe

I am trying to read a table from a SQLite database on Kaggle using Dask.
Link to the DB: https://www.kaggle.com/datasets/marcilonsilvacunha/amostracnpj?select=amostraCNPJ.sqlite
Some of the tables in this database are really large and I want to test how Dask can handle them.
I wrote the following code for one of the tables in the smaller SQLite database:
import dask.dataframe as ddf
import sqlite3
# Read sqlite query results into a pandas DataFrame
con = sqlite3.connect("/kaggle/input/amostraCNPJ.sqlite")
df = ddf.read_sql_table('cnpj_dados_cadastrais_pj', con, index_col='cnpj')
# Verify that result of SQL query is stored in the dataframe
print(df.head())
this gives an error:
AttributeError: 'sqlite3.Connection' object has no attribute '_instantiate_plugins'
Any help would be appreciated, as this is the first time I am using Dask to read SQLite.
As the docstring states, you should not pass a connection object to Dask; you need to pass a SQLAlchemy-compatible connection string:
df = ddf.read_sql_table('cnpj_dados_cadastrais_pj',
                        'sqlite:////kaggle/input/amostraCNPJ.sqlite', index_col='cnpj')
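If the larger tables need to be split up, read_sql_table can also be given a partition count; a minimal sketch under the assumption that 'cnpj' is stored as a numeric column (npartitions=10 is an arbitrary choice):

# if 'cnpj' is text rather than numeric, pass explicit divisions instead of npartitions
df = ddf.read_sql_table('cnpj_dados_cadastrais_pj',
                        'sqlite:////kaggle/input/amostraCNPJ.sqlite',
                        index_col='cnpj', npartitions=10)
print(df.head())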

MySQL table definition has changed error when reading from a table that has been written to by PySpark

I am currently working on a data pipeline with pyspark. As part of the pipeline, I write a spark dataframe to mysql using the following function:
def jdbc_insert_overwrite_table(df, mysql_user, mysql_pass, mysql_host, mysql_port, mysql_db, num_executors, table_name,
                                logger):
    mysql_url = "jdbc:mysql://{}:{}/{}?characterEncoding=utf8".format(mysql_host, mysql_port, mysql_db)
    logger.warn("JDBC Writing to table " + table_name)
    df.write.format('jdbc')\
        .options(
            url=mysql_url,
            driver='com.mysql.cj.jdbc.Driver',
            dbtable=table_name,
            user=mysql_user,
            password=mysql_pass,
            truncate=True,
            numpartitions=num_executors,
            batchsize=100000
        ).mode('Overwrite').save()
This works with no issue. However, later on in the pipeline (within the same PySpark app / Spark session), this table is a dependency for another transformation, and I try reading from it using the following function:
def read_mysql_table_in_session_df(spark, mysql_conn, query_str, query_schema):
    cursor = mysql_conn.cursor()
    cursor.execute(query_str)
    records = cursor.fetchall()
    df = spark.createDataFrame(records, schema=query_schema)
    return df
And I get this MySQL error: Error 1412: Table definition has changed, please retry transaction.
I've been able to resolve this by closing the connection and calling ping(reconnect=True) on it, but I don't like this solution as it feels like a band-aid.
Any ideas why I'm getting this error? I've confirmed writing to the table does not change the table definition (schema wise, at least).
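For reference, a minimal sketch of the band-aid described above, assuming mysql_conn comes from mysql-connector-python (whose connections expose ping(reconnect=True)); the only addition is refreshing the connection after the Spark JDBC overwrite and before re-reading:

# discard the stale connection state so MySQL re-reads the new table definition
mysql_conn.ping(reconnect=True)
df = read_mysql_table_in_session_df(spark, mysql_conn, query_str, query_schema)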

Write a DataFrame to an SQL database (Oracle)

I need to upload a table I modified to my Oracle database. I exported the table as a pandas DataFrame, modified it, and now want to upload it to the DB.
I am trying to do this using the df.to_sql function as follows:
import sqlalchemy as sa
import pandas as pd
engine = sa.create_engine('oracle://"IP_address_of_server"/"serviceDB"')
df.to_sql("table_name",engine, if_exists='replace', chunksize = None)
I always get this error: DatabaseError: (cx_Oracle.DatabaseError) ORA-12505: TNS:listener does not currently know of SID given in connect descriptor (Background on this error at: http://sqlalche.me/e/4xp6).
I am not an expert at this, so I could not understand what the problem is, especially since the IP address I am giving is the right one.
Could anyone help? Thanks a lot!
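Since ORA-12505 means the listener does not know the SID it was handed, one thing worth trying is connecting by service name instead; a hedged sketch using cx_Oracle's makedsn, where the username, password and port 1521 are placeholders rather than values from the question:

import cx_Oracle
import sqlalchemy as sa

# describe the target by service name rather than by SID
dsn = cx_Oracle.makedsn("IP_address_of_server", 1521, service_name="serviceDB")
engine = sa.create_engine("oracle+cx_oracle://username:password@%s" % dsn)
df.to_sql("table_name", engine, if_exists='replace', chunksize=None)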

can't connect to MySQL database when using df.to_sql (pandas, pyodbc)

I am trying to move information that is on several Excel spreadsheets into a single table on a SQL database. First I read the sheets using pandas and convert them into a single dataframe (this part works).
def condense(sheet):
    # read all excel files and convert to single pandas dataframe
    dfs = []
    for school in SCHOOLS:
        path = filePaths[school]
        try:
            opened = open(path, "rb")
        except:
            print("There was an error with the Excel file path.")
        dataframe = pd.read_excel(opened, sheetname=sheet)
        dfs.append(dataframe)
    return pd.concat(dfs)
Then I want to upload the dataframe to the database. I have read a lot of documentation but still don't really know where to begin. This is what I have currently.
connection = pyodbc.connect('''Driver={SQL Server};
Server=serverName;
Database=dbName;
Trusted_Connection=True''')
df.to_sql(tableName, connection)
I have also tried using an engine, but am confused about how to format the connection string, especially as I do not want to use a password. (What I tried below does use a password, and doesn't work.)
connection_string = 'mysql://username:password@localhost/dbName'
engine = create_engine(connection_string)
engine.connect()
df.to_sql(tableName, engine)
Any suggestions on how to create the engine without using a password, or where my code is wrong would be very appreciated.
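Since the pyodbc block above uses Driver={SQL Server} with Trusted_Connection=True, the SQLAlchemy engine can use Windows (trusted) authentication too, which avoids putting a password in the URL; a minimal sketch, reusing the server, database and driver names from the question (the driver string must match one that is actually installed):

from sqlalchemy import create_engine

# extra query-string arguments are passed through to pyodbc,
# so trusted_connection=yes replaces the username/password pair
engine = create_engine("mssql+pyodbc://serverName/dbName?driver=SQL+Server&trusted_connection=yes")
df.to_sql(tableName, engine, index=False)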

Error in reading an sql file using pandas

I have an SQL file locally stored on my PC. I want to open and read it using the pandas library. Here is what I have tried:
import pandas as pd
import sqlite3
my_file = 'C:\Users\me\Downloads\\database.sql'
#I am creating an empty database
conn = sqlite3.connect(r'C:\Users\test\Downloads\test.db')
#I am reading my file
df = pd.read_sql(my_file, conn)
However, I am receiving the following error:
DatabaseError: Execution failed on sql 'C:\Users\me\Downloads\database.sql': near "C": syntax error
Try moving the file to D:\.
Sometimes Python is not granted access to read/write in C:\, so that may be the issue.
You can also try an alternative method using cursors:
cur = conn.cursor()
cur.execute(query)  # 'query' is a placeholder for the SQL statement you want to run
r = cur.fetchall()
r will then contain your dataset as a list of tuples.
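If the .sql file holds SQL statements (schema and data) rather than a database, a hedged alternative is to run the script into the SQLite connection first and then query it with pandas; table_name below is a placeholder for a table created by the script:

import sqlite3
import pandas as pd

conn = sqlite3.connect(r'C:\Users\test\Downloads\test.db')

# execute every statement in the .sql file against the new database
with open(r'C:\Users\me\Downloads\database.sql') as f:
    conn.executescript(f.read())

# read_sql expects an SQL query, not a file path
df = pd.read_sql("SELECT * FROM table_name", conn)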
