Insert data into Redshift by using Python - python

Can data be inserted into Redshift from a local computer without copying the data to S3 first?
Basically, as a direct record-by-record insert into Redshift?
If yes - what library / connection string can be used?
(I am not concerned about performance)
Thanks.

Can data be inserted into Redshift from a local computer without copying the data to S3 first? Basically, as a direct record-by-record insert into Redshift?
Yes, it can be done, although it is not the preferred method; as you have already noted, performance is not a concern for you.
You could use the psycopg2 library. You can run this from any machine (local, EC2, or any other cloud platform) that has a network connection allowed to and from your Redshift cluster.
Here is a Python code snippet.
import psycopg2

def redshift():
    # Host and credentials below are placeholders
    conn = psycopg2.connect(dbname='your_database', host='a********8.****s.redshift.amazonaws.com',
                            port='5439', user='user', password='Pass')
    cur = conn.cursor()
    cur.execute("insert into test values ('1', '2', '3', '4')")
    conn.commit()  # commit so the insert is persisted
    print('success')

redshift()

It depends on whether you are talking about Redshift or Redshift Spectrum.
With Redshift Spectrum you have to put the data on S3, but with plain Redshift you can do an insert, for example with SQLAlchemy (see the sketch below).
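As a rough illustration, here is a minimal sketch of such an insert through SQLAlchemy with the psycopg2 driver; the host, credentials, and my_table name are placeholders, not values from the question.
from sqlalchemy import create_engine, text

# Redshift speaks the PostgreSQL wire protocol, so the psycopg2 driver works.
# Host, credentials, and table name are placeholders.
engine = create_engine('postgresql+psycopg2://user:password@your-cluster.redshift.amazonaws.com:5439/your_database')
with engine.begin() as conn:  # begin() commits automatically on success
    conn.execute(text("insert into my_table values (1, 'a')"))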

The easiest way to query AWS Redshift from Python is through this Jupyter extension - Jupyter Redshift.
Not only can you query and save your results, but you can also write them back to the database from within the notebook environment.

Related

AWS Aurora: bulk upsert of records using pre-formed SQL Statements

Is there a way of doing a batch insert/update of records into AWS Aurora using "pre-formed" Postgresql statements, using Python?
My scenario: I have an AWS lambda that receives data changes (insert/modify/remove) from DynamoDB via Kinesis, which then needs to apply them to an instance of Postgres in AWS Aurora.
All I've managed to find doing an Internet search is the use of Boto3 via the "batch_execute_statement" command in the RDS Data Service client, where one needs to populate a list of parameters for each individual record.
If possible, I would like a mechanism where I can supply many "pre-formed" INSERT/UPDATE/DELETE Postgresql statements to the database in a batch operation.
Many thanks in advance for any assistance.
I used psycopg2 via an SQLAlchemy engine's raw connection (instead of Boto3) and looped through my list of SQL statements, executing each one in turn. A rough sketch of that approach is below.
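This is only a sketch of that loop; the connection URL, table, and statements are made-up placeholders.
from sqlalchemy import create_engine

# Placeholder connection URL; Aurora PostgreSQL accepts the psycopg2 driver.
engine = create_engine('postgresql+psycopg2://user:password@aurora-host:5432/db_name')

# Pre-formed INSERT/UPDATE/DELETE statements built elsewhere.
statements = [
    "INSERT INTO my_table (id, name) VALUES (1, 'a')",
    "UPDATE my_table SET name = 'b' WHERE id = 1",
]

raw_conn = engine.raw_connection()   # the underlying psycopg2 connection
try:
    cur = raw_conn.cursor()
    for stmt in statements:          # execute each pre-formed statement in turn
        cur.execute(stmt)
    raw_conn.commit()                # commit the whole batch at once
finally:
    raw_conn.close()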

Attempting to establish a connection to Amazon Redshift from Python Script

I am trying to connect to an Amazon Redshift table. I created the table using SQL and now I am writing a Python script to append a data frame to the database. I am unable to connect to the database and feel that I have something wrong with my syntax or something else. My code is below.
from sqlalchemy import create_engine
conn = create_engine('jdbc:redshift://username:password#localhost:port/db_name')
Here is the error I am getting.
sqlalchemy.exc.ArgumentError: Could not parse rfc1738 URL from string
Thanks!
There are basically two options for connecting to Amazon Redshift using Python.
Option 1: JDBC Connection
This is a traditional connection to a database. The popular choice tends to be using psycopg2 to establish the connection, since Amazon Redshift resembles a PostgreSQL database. You can download specific JDBC drivers for Redshift.
This connection would require the Redshift database to be accessible to the computer making the query, and the Security Group would need to permit access on port 5439. If you are trying to connect from a computer on the Internet, the database would need to be in a Public Subnet and set to Publicly Accessible = Yes.
See: Establish a Python Redshift Connection: A Comprehensive Guide - Learn | Hevo
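As a rough sketch of Option 1, here is a SQLAlchemy URL that psycopg2 can parse (the jdbc:redshift:// prefix in the question is what triggers the rfc1738 error); the host, credentials, and table name are placeholders. Note that '@', not '#', separates the credentials from the host.
import pandas as pd
from sqlalchemy import create_engine

# Placeholder host and credentials.
engine = create_engine('postgresql+psycopg2://username:password@your-cluster.redshift.amazonaws.com:5439/db_name')

# Append a data frame to an existing table, as the question describes.
df = pd.DataFrame({'id': [1, 2], 'name': ['a', 'b']})
df.to_sql('my_table', engine, index=False, if_exists='append')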
Option 2: Redshift Data API
You can directly query an Amazon Redshift database by using the Boto3 library for Python, including an execute_statement() call to query data and a get_statement_result() call to retrieve the results. This also works with IAM authentication rather than having to create additional 'database users'.
There is no need to configure Security Groups for this method, since the request is made to AWS (on the Internet). It also works with Redshift databases that are in private subnets.
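A minimal sketch of Option 2 using boto3's redshift-data client; the cluster identifier, database, user, and SQL are placeholders.
import time
import boto3

client = boto3.client('redshift-data')

# Placeholder cluster and database names; the call is signed with the caller's IAM credentials.
resp = client.execute_statement(
    ClusterIdentifier='my-cluster',
    Database='db_name',
    DbUser='db_user',
    Sql='select * from my_table limit 10',
)

# The Data API is asynchronous: poll until the statement finishes, then fetch the rows.
while client.describe_statement(Id=resp['Id'])['Status'] not in ('FINISHED', 'FAILED', 'ABORTED'):
    time.sleep(1)
result = client.get_statement_result(Id=resp['Id'])
print(result['Records'])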

Best way for bulk load of data into SQL Server with Python without pyodbc

I want to load data from our cloud environment (Pivotal Cloud Foundry) into SQL Server. The data is fetched from an API and held in memory, and we use pytds to insert the data into SQL Server, but the only way the documentation shows to bulk load is from a file. I cannot use pyodbc because we don't have an ODBC connection in the cloud environment.
How can I do bulk insert directly from dictionary?
pytds does not offer bulk load directly, only from file
The first thing that comes to mind is to convert the data into a bulk INSERT SQL statement, similar to how you would migrate MySQL data; a sketch is shown below.
Alternatively, if you can export the data to CSV, you could import it using SSMS (SQL Server Management Studio).
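Roughly, a sketch that builds one parameterized multi-row INSERT from a list of dictionaries and runs it over a DB-API cursor such as the one pytds provides; the table and column names are placeholders, and it assumes the driver accepts %s-style placeholders.
# Rows held in memory, e.g. fetched from the API.
rows = [
    {'id': 1, 'name': 'a'},
    {'id': 2, 'name': 'b'},
]

def insert_rows(cursor, table, rows):
    # Build a single multi-row INSERT; values are passed separately so the
    # driver escapes them instead of concatenating them into the SQL string.
    columns = list(rows[0].keys())
    row_placeholder = '(' + ', '.join(['%s'] * len(columns)) + ')'
    sql = 'INSERT INTO {} ({}) VALUES {}'.format(
        table, ', '.join(columns), ', '.join([row_placeholder] * len(rows)))
    params = [row[col] for row in rows for col in columns]
    cursor.execute(sql, params)

# cursor would come from a DB-API connection, e.g. pytds.connect(...).cursor():
# insert_rows(cursor, 'my_table', rows)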

AWS : nothing inserted when use copy command from s3 to redshift

I have big data in S3 that I have to move into Redshift, and I have one table in Redshift. Since I use Python, I wrote a Python script and used psycopg2 to connect to Redshift. I succeeded in connecting to Redshift, but I failed to insert the data from S3 into Redshift.
I checked the dashboard on the AWS website and found that Redshift received the query and loads something, but it does not insert anything, and the time consumed by this process is too long, over 3 minutes. There is no error log, so I can't find the reason.
Is there any possible cause for this?
EDIT
added copy command I used.
copy table FROM 's3://example/2017/02/03/' access_key_id '' secret_access_key '' ignoreblanklines timeformat 'epochsecs' delimiter '\t';
Try querying the stl_load_errors table; it has the info on data load errors. A quick way to run the check from Python is sketched after the query below.
http://docs.aws.amazon.com/redshift/latest/dg/r_STL_LOAD_ERRORS.html
select * from stl_load_errors order by starttime desc limit 1
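Roughly, running that check from the same kind of psycopg2 connection the question already uses; the connection details are placeholders.
import psycopg2

# Placeholder connection details; reuse the connection from the COPY script.
conn = psycopg2.connect(dbname='your_database', host='your-cluster.redshift.amazonaws.com',
                        port='5439', user='user', password='pass')
cur = conn.cursor()
cur.execute("select * from stl_load_errors order by starttime desc limit 1")
for row in cur.fetchall():
    print(row)  # the err_reason and raw_line columns usually pinpoint the problem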

Why does writing from Spark to Vertica DB take longer than writing from Spark to MySQL?

Ultimately, I want to grab data from a Vertica DB into Spark, train a machine learning model, make predictions, and store these predictions into another Vertica DB.
The current issue is identifying the bottleneck in the last part of the flow: storing values in a Vertica DB from Spark. It takes about 38 minutes to store 63k rows of data in a Vertica DB. In comparison, when I transfer that same data from Spark to a MySQL database, it takes 10 seconds.
I don't know why the difference is so huge.
I have classes called VerticaContext and MySQLContext for Vertica and MySQL connections respectively. Both classes use the SQLContext to read entries using the jdbc format.
df = self._sqlContext.read.format('jdbc').options(url=self._jdbcURL, dbtable=subquery).load()
And write using jdbc.
df.write.jdbc(self._jdbcURL, table_name, save_mode)
There's no difference between the two classes aside from writing to a different target database. I'm confused as to why there's a huge difference in the time it takes to save tables. Is it because of the inherent difference in hardware between the two different databases?
I figured out an alternative solution. Before I dive in, I'll explain what I found and why I think saves to Vertica DB are slow.
The Vertica log (search for the file "vertica.log" on your Vertica machine) contains all the recent logs related to reads from and writes to Vertica DBs. After running the write command, I found out that it essentially issues INSERT statements against the Vertica DB.
INSERT statements (without the "DIRECT" directive) are slow because they are written into the WOS (RAM) instead of the ROS (disk). I don't know the exact details as to why that's the case. The writes were issuing individual INSERT statements.
Slow inserts are a known issue. I had trouble finding this information, but I finally found a few links that support it. I'm placing them here for posterity: http://www.vertica-forums.com/viewtopic.php?t=267, http://vertica-forums.com/viewtopic.php?t=124
My solution:
There is documentation that says the COPY command (with a "DIRECT" keyword) is the most efficient way of loading large amounts of data to the database. Since I was looking for a python solution, I used Uber's vertica-python package which allowed me to establish a connection with the Vertica DB and send Vertica commands to execute.
I want to exploit the efficiency of the COPY command, but the data lives somewhere outside of the Vertica cluster. I need to send the data from my Spark cluster to Vertica DB. Fortunately, there's a way to do that from HDFS (see here). I settled on converting the dataframe to a csv file and saving that on HDFS. Then I sent the COPY command to the Vertica DB to grab the file from HDFS.
My code is below (assuming I already have a variable storing the PySpark dataframe; let's call it 'df'):
import vertica_python as VertPy

def create_copy_command(table_name, table_filepath):
    # COPY straight from WebHDFS with the DIRECT keyword so the load bypasses the WOS
    return ("COPY " + table_name +
            " SOURCE Hdfs(url='http://hadoop:50070/webhdfs/v1" + table_filepath +
            "', username='root') DELIMITER ',' DIRECT ABORT ON ERROR")

# Dump the dataframe as a headerless CSV onto HDFS
df.toPandas().to_csv(hdfs_table_absolute_filepath, header=False, index=False)

conn_info = {
    'host': 'your-host-here',
    'port': 5433,          # placeholder; 5433 is Vertica's default port
    'user': 'username',
    'password': 'password',
    'database': 'database'
}

conn = VertPy.connect(**conn_info)
cur = conn.cursor()
copy_command = create_copy_command(table_name, hdfs_table_relative_filepath)
cur.execute(copy_command)
conn.commit()  # commit the COPY transaction
