I have a large amount of data in S3 that I need to move into Redshift, and I have one table in Redshift. Since I use Python, I wrote a Python script and use psycopg2 to connect to Redshift. I can connect to Redshift successfully, but I fail to insert the data from S3 into Redshift.
I checked the dashboard on the AWS website and found that Redshift received the query and is loading something, but it does not insert anything, and the time consumed by this process is too long, over 3 minutes. There is no error log, so I can't find the reason.
Is there any possible cause for this?
EDIT
Added the COPY command I used:
copy table FROM 's3://example/2017/02/03/' access_key_id '' secret_access_key '' ignoreblanklines timeformat 'epochsecs' delimiter '\t';
Try querying the stl_load_errors table; it has the info on data load errors:
http://docs.aws.amazon.com/redshift/latest/dg/r_STL_LOAD_ERRORS.html
select * from stl_load_errors order by starttime desc limit 1
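Since you are already using psycopg2, you can also check the most recent load error from your Python script. This is just a minimal sketch; the connection details are placeholders, and the columns shown are the ones I usually look at in stl_load_errors:
import psycopg2

# Placeholder cluster endpoint and credentials -- replace with your own.
conn = psycopg2.connect(dbname='dev', host='your-cluster.redshift.amazonaws.com',
                        port='5439', user='user', password='password')
cur = conn.cursor()

# Most recent load error: which file, line and column failed, and why.
cur.execute("""
    select query, filename, line_number, colname, err_reason
    from stl_load_errors
    order by starttime desc
    limit 1
""")
print(cur.fetchone())

cur.close()
conn.close()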
Related
I am querying an API, storing the result in a pandas dataframe, transforming it, and then writing it to an AWS Redshift database. On my local machine there are no issues and everything works fine. When I placed the code in AWS Lambda, with all the required packages, I get this error:
Calling the invoke API action failed with this message: Network Error
I've read that it may be due to limits on how many rows can be written to the database, so I tried writing only 1 row, but I still got the same error.
My code where I write looks like this:
conn = create_engine('postgresql://user:password@redshifteu-west-1.redshift.amazonaws.com:5439/dev')
result.to_sql('table_1', conn, index=False, if_exists='replace', schema='schema')
I am using the pandas to_sql method and SQLAlchemy. How can I write my dataframe to a Redshift database from AWS Lambda?
Note that you may need psycopg2 in order to connect to Redshift through SQLAlchemy:
pip install psycopg2
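For reference, a minimal sketch of the engine setup I would try (the endpoint and credentials are placeholders, and result is your dataframe from the question); note the '@' between the password and the host, and the explicit psycopg2 driver in the URL:
from sqlalchemy import create_engine

# Placeholder endpoint and credentials.
conn = create_engine(
    'postgresql+psycopg2://user:password@your-cluster.eu-west-1.redshift.amazonaws.com:5439/dev'
)
result.to_sql('table_1', conn, index=False, if_exists='replace', schema='schema')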
Now another possibility could be that you are essentially exceeding the rate limit of S3.
If the above doesn't work and you are confident that the rate limit is not exceeded, you can give pandas_redshift a go:
# pip install pandas-redshift
import pandas_redshift as pr
pr.connect_to_redshift(
    dbname='dev', host='redshifteu-west-1.redshift.amazonaws.com',
    port=5439, user='user', password='password'
)

pr.connect_to_s3(
    aws_access_key_id='aws_access_key_id',
    aws_secret_access_key='aws_secret_access_key',
    bucket='bucket_name',
    subdirectory='subdirectory'
)

pr.pandas_to_redshift(data_frame=result, redshift_table_name='table_1')
We want to export data from DynamoDB to a file. We have around 150,000 records, each record being about 430 bytes. It would be a periodic activity, once a week. Can we do that with Lambda? Is it possible, given that Lambda has a maximum execution time of 15 minutes?
Is there a better option using Python or via the UI? I'm unable to export more than 100 records from the UI.
One really simple option is to use the Command Line Interface tools
aws dynamodb scan --table-name YOURTABLE --output text > outputfile.txt
This would give you tab-delimited output. You can run it as a cron job for regular output.
The scan wouldn't take anything like 15 minutes (probably just a few seconds), so you wouldn't need to worry about your Lambda timing out if you did it that way.
You can export your data from DynamoDB in a number of ways.
The simplest way would be a full table scan:
import boto3

dynamodb = boto3.client('dynamodb')

response = dynamodb.scan(
    TableName=your_table,
    Select='ALL_ATTRIBUTES')

data = response['Items']

# Keep paginating until the scan has covered the whole table.
while 'LastEvaluatedKey' in response:
    response = dynamodb.scan(
        TableName=your_table,
        Select='ALL_ATTRIBUTES',
        ExclusiveStartKey=response['LastEvaluatedKey'])
    data.extend(response['Items'])

# save your data as csv here
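For the "save as CSV" step, here is a rough sketch using the standard csv module; it assumes every item has the same attributes, and remember that the low-level client returns values wrapped in type descriptors such as {'S': 'value'}:
import csv

# 'dynamodb_dump.csv' is just an example output path.
# Each scanned item looks like {'attr': {'S': 'value'}, ...}, so unwrap the
# type descriptor before writing the row.
rows = [{k: list(v.values())[0] for k, v in item.items()} for item in data]

with open('dynamodb_dump.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=sorted(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)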
But if you want to do it every x days, what I would recommend is:
Create your first dump from your table with the code above.
Then, you can create a DynamoDB trigger to a Lambda function that will receive all your table changes (insert, update, delete), and then you can append the data to your CSV file. The code would be something like:
def lambda_handler(event, context):
    for record in event['Records']:
        # get the changes here and save them
Since you will receive only your table updates, you don't need to worry about the 15-minute execution limit of Lambda.
You can read more about dynamodb streams and lambda here: DynamoDB Streams and AWS Lambda Triggers
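For illustration, a rough sketch of what that handler could look like if you append the changes to a CSV kept in S3; the bucket and key names are made up, and it assumes the stream is configured to send NEW_IMAGE so inserts and updates carry the new values:
import csv
import io

import boto3

s3 = boto3.client('s3')
BUCKET, KEY = 'my-export-bucket', 'dynamodb_dump.csv'  # hypothetical names


def lambda_handler(event, context):
    # Download the current CSV, append one line per stream record, upload it back.
    body = s3.get_object(Bucket=BUCKET, Key=KEY)['Body'].read().decode('utf-8')
    buf = io.StringIO(body)
    buf.seek(0, io.SEEK_END)
    writer = csv.writer(buf)

    for record in event['Records']:
        if record['eventName'] in ('INSERT', 'MODIFY'):
            image = record['dynamodb']['NewImage']
            # Values arrive wrapped in type descriptors, e.g. {'S': 'value'}.
            writer.writerow([list(v.values())[0] for _, v in sorted(image.items())])

    s3.put_object(Bucket=BUCKET, Key=KEY, Body=buf.getvalue())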
And if you want to work on your data, you can always create an AWS Glue job or an EMR cluster.
We resolved it using AWS Lambda: 150,000 records (each record about 430 bytes) are processed to a CSV file in 2.2 minutes using the maximum available memory (3008 MB). We created an event rule to run it on a periodic basis. The time and size are noted so that anyone can estimate how much they can do with Lambda.
You can refer to an existing question on Stack Overflow. That question is about exporting a DynamoDB table as a CSV.
I have been working with Glue since January and have built multiple POCs and production data lakes using AWS Glue / Databricks / EMR, etc. I have used AWS Glue to read data from S3 and perform ETL before loading to Redshift, Aurora, etc.
I now need to read data from a source table which is on SQL Server, fetch the data, and write it to an S3 bucket as a custom (user defined) CSV file, say employee.csv.
I am looking for some pointers to do this, please.
Thanks
You can connect using JDBC by specifying connection_type="sqlserver" to get a dynamic frame connecting to SQL Server. See here for the GlueContext docs:
dynF = glueContext.getSource(connection_type="sqlserver", url=..., dbtable=..., user=..., password=...)
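Once you have the dynamic frame, writing it out to S3 as CSV is a separate call. A rough sketch, continuing from the snippet above (the bucket path is a placeholder, and if getSource hands you back a DataSource object rather than a frame, call its getFrame() first):
# Assuming dynF is the DynamicFrame read from SQL Server above.
glueContext.write_dynamic_frame.from_options(
    frame=dynF,
    connection_type="s3",
    connection_options={"path": "s3://your-bucket/employee/"},  # placeholder path
    format="csv",
    format_options={"writeHeader": True, "separator": ","}
)
Glue writes part files by default, so to end up with a single employee.csv you would typically repartition the frame to one partition first and rename the resulting object afterwards.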
This task fits the AWS DMS (Database Migration Service) use case. DMS is designed to either migrate data from one data store to another or keep them in sync. It can certainly keep things in sync as well as transform your source (i.e., MSSQL) to your target (i.e., S3).
There is one non-negligible constraint in your case, though. Ongoing sync with an MSSQL source only works if your license is the Enterprise or Developer Edition, and only for versions 2016-2019.
Can data be inserted into Redshift from a local computer without copying the data to S3 first?
Basically, as a direct insert of records, record by record, into Redshift?
If yes - what library / connection string can be used?
(I am not concerned about performance)
Thanks.
Can data be inserted into a RedShift from a local computer without copying data to S3 first? Basically as a direct insert of record by record into RedShift?
Yes, it can be done, though it is not the preferred method; but as you have already noted, performance is not a concern for you.
You could use the psycopg2 library. You can run this from any machine (local, EC2, or any other cloud platform) that has network connectivity allowed to your Redshift cluster.
Here is a Python code snippet.
import psycopg2

def redshift():
    conn = psycopg2.connect(dbname='your_database', host='a********8.****s.redshift.amazonaws.com', port='5439', user='user', password='Pass')
    cur = conn.cursor()
    cur.execute("insert into test values ('1', '2', '3', '4')")
    conn.commit()
    print('success')

redshift()
It depends on whether you are talking about Redshift or Redshift Spectrum.
With Redshift Spectrum you have to put the data on S3, but with plain Redshift you can do an insert, with SQLAlchemy for example.
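For instance, a minimal sketch with SQLAlchemy (the cluster endpoint, credentials, and table name are placeholders), which uses psycopg2 under the hood:
from sqlalchemy import create_engine, text

# Placeholder cluster endpoint and credentials.
engine = create_engine(
    'postgresql+psycopg2://user:password@your-cluster.redshift.amazonaws.com:5439/dev'
)

# engine.begin() commits the transaction when the block exits.
with engine.begin() as conn:
    conn.execute(
        text("insert into test values (:a, :b, :c, :d)"),
        {"a": "1", "b": "2", "c": "3", "d": "4"},
    )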
The easiest way to query AWS Redshift from Python is through this Jupyter extension - Jupyter Redshift.
Not only can you query and save your results but also write them back to the database from within the notebook environment.
I need to write a result set from MySQL in CSV format to a bucket in Google Cloud Storage.
Following the instructions here, I created the following example code:
import cloudstorage
from google.appengine.api import app_identity

import db  # My own MySQL wrapper

dump = db.get_table_dump(schema)  # Here I do a simple SQL SELECT and fetchall()
bucket_name = app_identity.get_default_gcs_bucket_name()
file_name = "/" + bucket_name + "/finalfiles/" + schema + "/" + "myfile.csv"

with cloudstorage.open(file_name, "w") as gcsFile:
    gcsFile.write(dump)
It did not work because write expects a string parameter and dump is a tuple of tuples returned by fetchall().
I can't use this approach (or anything similar) since I can't write files in the GAE environment, and I also can't build a CSV string from the tuple like here, due to the size of my result set (I actually tried it; it takes too long and timed out before finishing).
So, my question is: what is the best way to get a result set from MySQL and save it as CSV in a Google Cloud Storage bucket?
I just went through the same problem with PHP. I ended up using the Cloud SQL Admin API (https://cloud.google.com/sql/docs/mysql/admin-api/v1beta4/) with the following workflow:
Create an export bucket (i.e. test-exports)
Give the SQL Instance Read/Write permissions to the bucket created in step 1
Within the application, make a call to instances/export (https://cloud.google.com/sql/docs/mysql/admin-api/v1beta4/instances/export). This endpoint accepts the SQL to run, as well as a path to the output bucket created in step (1).
Step (3) will return an operation with a 'name' property. You can use this 'name' to poll the operations/get endpoint (https://cloud.google.com/sql/docs/mysql/admin-api/v1beta4/operations/get) until the status comes back as DONE.
We have a job which performs these steps nightly (as well as an import using the /import command) on 6 tables and have yet to see any issues. The only thing to keep in mind is that only one operation can run on a single database instance at a time. To deal with this, you should check the top item from the operations list endpoint (https://cloud.google.com/sql/docs/mysql/admin-api/v1beta4/operations/list) to confirm the database is ready before issuing any commands.
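Since the question is about Python, here is a rough sketch of step (3) and the polling from step (4) using the google-api-python-client; the project, instance, bucket, and query values are placeholders:
import time

import google.auth
from googleapiclient import discovery

# Placeholder project/instance/bucket/query values.
credentials, _ = google.auth.default()
sqladmin = discovery.build('sqladmin', 'v1beta4', credentials=credentials)

body = {
    'exportContext': {
        'kind': 'sql#exportContext',
        'fileType': 'CSV',
        'uri': 'gs://test-exports/myfile.csv',
        'csvExportOptions': {
            'selectQuery': 'SELECT * FROM mytable'
        }
    }
}

# Step (3): trigger the export; the response carries the operation 'name' to poll.
operation = sqladmin.instances().export(
    project='my-project', instance='my-instance', body=body).execute()

# Step (4): poll operations/get until the export is DONE.
while True:
    status = sqladmin.operations().get(
        project='my-project', operation=operation['name']).execute()
    if status.get('status') == 'DONE':
        break
    time.sleep(5)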