pandas.to_sql write to redshift fails with NotSupportedError - python

I am using SQLAlchemy (sqlalchemy-redshift) as the engine with pandas. While writing to Redshift with to_sql I get the following error:
sqlalchemy.exc.NotSupportedError: (psycopg2.NotSupportedError) SQL command "CREATE INDEX ix_western_union_answer_pivot_index ON western_union_answer_pivot (index)" not supported on Redshift tables.
[SQL: 'CREATE INDEX ix_western_union_answer_pivot_index ON western_union_answer_pivot (index)'] (Background on this error at: http://sqlalche.me/e/tw8g)
While I understand the problem (see How to create an Index in Amazon Redshift), I have two questions:
1. Shouldn't sqlalchemy-redshift translate the CREATE INDEX into a Redshift-supported SORTKEY statement? That's the point of using an ORM, right?
2. As a workaround, can I stop to_sql from creating the index?
UPDATE:
After setting index=False in to_sql, the above issue is solved, but I end up with
sqlalchemy.exc.DataError: (psycopg2.DataError) value too long for type character varying(256)
Is 256 the maximum size in Redshift? Is there any solution to this apart from slicing the data to 256 characters and losing information?
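For reference, here is a minimal sketch of the workaround described above: index=False to suppress the index, plus the dtype argument of to_sql to widen the offending column (Redshift allows VARCHAR up to 65535 bytes). The connection string, DataFrame, and column name are hypothetical placeholders:

import pandas as pd
import sqlalchemy
from sqlalchemy.types import VARCHAR

engine = sqlalchemy.create_engine("redshift+psycopg2://user:pass@host:5439/dbname")  # hypothetical DSN
df = pd.DataFrame({"answer_text": ["some long free-text answer"], "score": [1]})     # hypothetical data

df.to_sql(
    "western_union_answer_pivot",
    engine,
    if_exists="replace",
    index=False,                            # stop to_sql from emitting the unsupported CREATE INDEX
    dtype={"answer_text": VARCHAR(65535)},  # widen the column instead of the varchar(256) it was getting
)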

Related

Redshift from Boto3 - ERROR: syntax error at or near "'s3://xx-xx-xx/xx-xx-x/x-x-xx/xxx.csv'"

I'm having issues updating a Redshift table, and honestly I am not sure what I'm doing wrong. Is it even possible to update like this from S3? I'm doing a JOIN on the primary key, which is a unique key, so that I can update only existing rows. I appreciate your help:
update "red"."shift"."table"
set bid=s.bid, user_id=s.user_id, username=s.username, total_sum=s.total_sum,amount_currency=s.amount_currency,allowance_name=s.allowance_name,ledgerentrytype=s.ledgerentrytype,transaction_timestamp_group=cast(s.transaction_timestamp_group as timestamp),employer=s.employer,taxreportingstatus=s.taxreportingstatus
from 's3://xx-xx-xx/xx-xx-x/x-x-xx/xxx.csv' as s
join "red"."shift"."table" as t on s.bid=t.bid
iam_role 'arn:aws:iam::xxxxx:role/service-role/xxxx-xx-xx-xx```
The FROM clause of the UPDATE statement needs to point to a table in Redshift, not an S3 object. See https://docs.aws.amazon.com/redshift/latest/dg/r_UPDATE.html
You can define a table in Redshift and COPY the data from this S3 object into it, then perform the UPDATE. Or you can define an external table in Redshift that references this object and use that table in the UPDATE statement. Either way, Redshift needs a defined table for the UPDATE statement.
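As an illustration of the first option, here is a minimal sketch that COPYs the S3 object into a staging table and then UPDATEs the target from it, run through psycopg2. The staging table name and connection details are hypothetical, only a couple of the SET columns are shown, and the bucket path and IAM role are the asker's placeholders:

import psycopg2

conn = psycopg2.connect(host="cluster-host", port=5439, dbname="shift", user="user", password="pass")  # hypothetical
cur = conn.cursor()
# 1) stage the S3 object in a temp table with the same layout as the target
cur.execute('create temp table stage (like "red"."shift"."table");')
cur.execute("""
    copy stage
    from 's3://xx-xx-xx/xx-xx-x/x-x-xx/xxx.csv'
    iam_role 'arn:aws:iam::xxxxx:role/service-role/xxxx-xx-xx-xx'
    csv;
""")
# 2) update only the existing rows by joining on the key
cur.execute("""
    update "red"."shift"."table" as t
    set username = s.username,
        total_sum = s.total_sum
    from stage as s
    where s.bid = t.bid;
""")
conn.commit()
cur.close()
conn.close()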

InternalError_: Spectrum Scan Error. S3 to Redshift copy command

I am trying to copy some data from an S3 bucket to a Redshift table using the COPY command. The format of the file is PARQUET. When I execute the COPY command, I get InternalError_: Spectrum Scan Error.
This is the first time I tried copying from a parquet file.
Please help me if there is a solution for this. I am using boto3 in python.
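For context, a COPY from Parquet generally looks like the sketch below; the table, bucket path, IAM role, and connection details are placeholders, and it is shown through psycopg2 rather than the asker's exact boto3 setup. Note that for Parquet, COPY maps columns by position, which is why the ordering issues described in the answers below can surface as this error:

import psycopg2

conn = psycopg2.connect(host="cluster-host", port=5439, dbname="dev", user="user", password="pass")  # hypothetical
cur = conn.cursor()
cur.execute("""
    copy my_schema.my_table
    from 's3://my-bucket/path/to/parquet/'
    iam_role 'arn:aws:iam::123456789012:role/my-redshift-role'
    format as parquet;
""")
conn.commit()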
This generally happens for the reasons below:
If there is a mismatch in the number of columns between the table and the file.
If a column type in your file schema is incompatible with the target table's column type.
Try going into the error logs. You might find a partial log in CloudWatch. From the screenshot you have uploaded, you can also find the query number you ran.
Go to the AWS Redshift query editor and run the query below to get the full log:
select message
from svl_s3log
where query = '<<your query number>>'
order by query,segment,slice;
Hope this helps!
This error usually indicates a compatibility problem between the data in your file and the Redshift table. You can get more insight into the error from the table 'SVL_S3LOG'. In my case it was because the file had some invalid UTF-8 characters.
Spectrum scan errors are usually caused by two things:
a) a column mismatch between source and destination,
e.g. if you are copying data from S3 to Redshift and the columns of the Parquet file are not in the same order as those in the Redshift table.
b) a datatype mismatch between source and destination,
e.g. in an S3 to Redshift copy, col1 in the Parquet file has datatype integer while the same col1 in Redshift has datatype float.
Verify the schema and the datatypes;
matching the column order and the datatypes between source and destination will resolve the Spectrum Scan Error.
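A minimal sketch of that verification step, assuming pyarrow is available to inspect the Parquet file and a psycopg2 connection for the Redshift side; the file path, schema, table, and connection details are placeholders:

import pyarrow.parquet as pq
import psycopg2

# columns and types as they appear in (one of) the Parquet files being loaded
parquet_schema = pq.read_schema("part-00000.parquet")  # hypothetical local copy of the file
print("parquet columns:", [(f.name, str(f.type)) for f in parquet_schema])

# columns and types of the target table, in table order, for a side-by-side comparison
conn = psycopg2.connect(host="cluster-host", port=5439, dbname="dev", user="user", password="pass")  # hypothetical
cur = conn.cursor()
cur.execute("""
    select column_name, data_type
    from information_schema.columns
    where table_schema = 'my_schema' and table_name = 'my_table'
    order by ordinal_position;
""")
print("redshift columns:", cur.fetchall())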

Creating Create Table statements for Redshift by reading Oracle DDL statement in python

I have 5 tables in an Oracle database. I need to create similar structures for them in AWS Redshift. I am using cx_Oracle to connect to Oracle and dump the DDL to a CSV file, but changing that DDL for each datatype in Python to make it run in Redshift is turning out to be a very tedious process.
Is there any easy way to do this in Python? Is there any library or function to do this seamlessly?
PS: I tried to use the AWS Schema Conversion Tool for this. The tables got created in Redshift, but with a glitch: every datatype size got doubled in Redshift.
For example: varchar(100) in Oracle became varchar(200) in Redshift
Has anyone faced a similar issue before with SCT?
The cx_OracleTools project and specifically the DescribeObject tool within that project have the ability to extract the DDL from an Oracle database. You may be able to use that.
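If you do end up translating the DDL yourself, a rough sketch of the kind of mapping involved is below: read the column definitions through cx_Oracle and emit a Redshift CREATE TABLE. The type map is deliberately minimal, and the connection string and table list are placeholders:

import cx_Oracle

conn = cx_Oracle.connect("user/password@host:1521/service")  # hypothetical Oracle DSN
cur = conn.cursor()

def redshift_type(data_type, length, precision, scale):
    # very rough Oracle -> Redshift mapping; extend for your own datatypes
    if data_type == "VARCHAR2":
        return f"varchar({length})"
    if data_type == "NUMBER":
        if precision and scale:
            return f"decimal({precision},{scale})"
        return "bigint"
    if data_type == "DATE":
        return "timestamp"
    return "varchar(256)"  # fallback for anything unmapped

for table in ["TABLE1", "TABLE2"]:  # hypothetical table names
    cur.execute("""
        select column_name, data_type, data_length, data_precision, data_scale
        from all_tab_columns
        where table_name = :t
        order by column_id
    """, t=table)
    cols = [f"{name.lower()} {redshift_type(dt, ln, pr, sc)}" for name, dt, ln, pr, sc in cur]
    print(f"create table {table.lower()} (\n  " + ",\n  ".join(cols) + "\n);")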

Python pandas dataframe transaction

Please suggest a way to execute a SQL statement and a pandas DataFrame .to_sql() call in one transaction.
I have the DataFrame and want to delete some rows on the database side before the insertion.
So basically I need to delete and then insert in one transaction, using the DataFrame's .to_sql.
I use a SQLAlchemy engine with pandas' df.to_sql().
After further investigation I realized that this is only possible with sqlite3, because to_sql accepts both a SQLAlchemy engine and a plain connection object as its con parameter, but a plain connection is supported only for a sqlite3 database.
In other words, you have no influence over the connection that will be created by the DataFrame's to_sql function.
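To make the sqlite3 case concrete, here is a minimal sketch of the delete-then-insert pattern on a plain sqlite3 connection; the database file, table, and filter column are hypothetical. The DELETE opens an implicit transaction on the connection, the inserts issued by to_sql join it, and pandas commits at the end, so if to_sql fails the DELETE is rolled back along with it:

import sqlite3
import pandas as pd

df = pd.DataFrame({"batch_id": ["2021-01"] * 3, "value": [1, 2, 3]})  # hypothetical data

conn = sqlite3.connect("example.db")  # hypothetical SQLite database
try:
    conn.execute("delete from my_table where batch_id = ?", ("2021-01",))  # clear the rows being replaced
    df.to_sql("my_table", conn, if_exists="append", index=False)           # insert and commit on success
except Exception:
    conn.rollback()  # undo the DELETE (and any partial inserts) if anything failed
    raise
finally:
    conn.close()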

How to save an SVG image in mySQL (from Python 3.6)

How can one save a large SVG image in a MySQL table?
My problem is that my SVGs are up to 200K characters, which appears to be too much to save in my table.
When trying to save it as TEXT (using Python 3.6 with Anaconda), Python/SQLAlchemy tells me the following:
sqlalchemy.exc.DataError: (pymysql.err.DataError) (1406, "Data too long for column 'cantons_svg' at row 27") [SQL: 'INSERT INTO ...]
I encountered this problem today when I tried to store videos in TiDB. I am using Flask as the backend framework and SQLAlchemy as the ORM, connecting to my database with the MySQL Python connector.
The log is as follows:
sqlalchemy.exc.DataError: (pymysql.err.DataError) (1406, "Data too long for column 'video' at row 1")
[SQL: INSERT INTO videos (user_id, token_id, video) VALUES (%(user_id)s, %(token_id)s, %(video)s)]
I found that there is little advice about this situation; among it, one suggestion was to see whether there are any self-defined storage types in SQLAlchemy. That seems quite complicated. (If anyone finds a doc or something that gives clear guidance about this, please tell me.)
As for me, I just used SQLAlchemy's BLOB type to initialize the database, and then ran
alter table videos modify column video LongBlob DEFAULT NULL ;
manually to change the column type. This works fine for me.
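For what it's worth, here is a minimal sketch of declaring the wide column up front with SQLAlchemy's MySQL dialect types (LONGTEXT for SVG markup, LONGBLOB for binary data such as video), so no manual ALTER TABLE is needed afterwards; the table, columns, and DSN are hypothetical:

from sqlalchemy import create_engine, MetaData, Table, Column, Integer
from sqlalchemy.dialects.mysql import LONGTEXT, LONGBLOB

metadata = MetaData()
cantons = Table(
    "cantons", metadata,                 # hypothetical table
    Column("id", Integer, primary_key=True),
    Column("cantons_svg", LONGTEXT),     # up to 4 GB of text, instead of TEXT's 64 KB limit
    Column("video", LONGBLOB),           # binary payloads such as video data
)

engine = create_engine("mysql+pymysql://user:pass@host/dbname")  # hypothetical DSN
metadata.create_all(engine)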
