How to backup DynamoDB using boto3.resource - python

How can I modify the following code to use boto3.resource() instead?
import boto3

client = boto3.client('dynamodb', region_name='us-west-2')
response = client.create_backup(
    TableName='test',
    BackupName='backup_test'
)
I don't want to use boto3.client here.
Similarly, I also want to restore my table using boto3.resource().

There is currently no way to back up a table using the DynamoDB service resource in boto3.
https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/dynamodb.html#service-resource
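If you already hold a service resource, one workaround (a minimal sketch, not a resource-level API) is to reach the low-level client the resource is built on via resource.meta.client:
import boto3

dynamodb = boto3.resource('dynamodb', region_name='us-west-2')

# The resource exposes the underlying low-level client, so backup and restore
# can still be called without constructing a separate boto3.client.
response = dynamodb.meta.client.create_backup(
    TableName='test',
    BackupName='backup_test'
)

# Restore works the same way; the BackupArn comes from the create_backup response.
dynamodb.meta.client.restore_table_from_backup(
    TargetTableName='test_restored',
    BackupArn=response['BackupDetails']['BackupArn']
)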


Execute COPY command from AWS Glue to connect to Redshift

I'm trying to execute a COPY command in Redshift via Glue:
redshiftsql = "copy table from 's3://bucket/test' credentials 'aws-iam-role' format as json 'auto';"
I'm connecting using the syntax below:
from_jdbc_conf(frame, catalog_connection, connection_options={}, redshift_tmp_dir = "", transformation_ctx="")
What is the value I need to pass for frame? Any thoughts? I appreciate your response.
Apparently you can pass "extracopyoptions" in the connection_options object for Redshift, with anything from here: https://docs.aws.amazon.com/redshift/latest/dg/copy-parameters-data-format.html
See this archived question from AWS premium support.
https://aws.amazon.com/premiumsupport/knowledge-center/sql-commands-redshift-glue-job/
By the way, this is very poorly documented in my opinion.
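For reference, a rough sketch of how this might look when writing to Redshift with from_jdbc_conf (the connection name, database, table and temp bucket are placeholders, and the frame argument is the DynamicFrame you want to write):
# 'dyf' is the DynamicFrame produced earlier in the Glue job; it is what goes into 'frame'.
glueContext.write_dynamic_frame.from_jdbc_conf(
    frame=dyf,
    catalog_connection="my-redshift-connection",   # hypothetical Glue connection name
    connection_options={
        "dbtable": "test",
        "database": "mydb",
        # appended to the COPY command Glue issues behind the scenes
        "extracopyoptions": "TRUNCATECOLUMNS MAXERROR 10",
    },
    redshift_tmp_dir="s3://bucket/temp/",
    transformation_ctx="write_redshift",
)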

Using fsspec at pandas.DataFrame.to_csv command

I want to write a CSV file from a pandas DataFrame to a remote machine, connecting via SFTP/SSH.
Does anybody know how to add the "storage_options" parameter correctly?
The pandas documentation says that I have to pass some dict as the parameter's value, but I don't understand exactly which one.
hits_df.to_csv('hits20.tsv', compression='gzip', index=False, chunksize=1000000, storage_options={???})
Every time I get ValueError: storage_options passed with file object or non-fsspec file path.
What am I doing wrong?
You will find the set of values to use by experimenting directly with the implementation backend, SFTPFileSystem. Whatever kwargs you use there are the same ones that would go into storage_options. Short story: paramiko is not the same as command-line SSH, so some trial and error will be required.
If you have things working via the file system class, you can use the alternative route:
import fsspec
import pandas as pd

fs = fsspec.filesystem("ssh", ...)
# equivalent to: from fsspec.implementations.sftp import SFTPFileSystem; fs = SFTPFileSystem(...)
with fs.open("my/file/path", "rb") as f:
    pd.read_csv(f, other_kwargs)
Pandas supports fsspec, which lets you work easily with remote filesystems and abstracts over s3fs for Amazon S3 and gcsfs for Google Cloud Storage (and other backends such as (S)FTP, SSH or HDFS). In particular, s3fs is very handy for doing simple file operations in S3, because boto can be subtly complex to use.
The storage_options argument lets you pass s3fs arguments through to pandas.
You can specify AWS credentials manually via storage_options, which takes a dict. An example is below:
import os
AWS_S3_BUCKET = os.getenv("AWS_S3_BUCKET")
AWS_ACCESS_KEY_ID = os.getenv("AWS_ACCESS_KEY_ID")
AWS_SECRET_ACCESS_KEY = os.getenv("AWS_SECRET_ACCESS_KEY")
AWS_SESSION_TOKEN = os.getenv("AWS_SESSION_TOKEN")
df.to_csv(
    f"s3://{AWS_S3_BUCKET}/{key}",
    storage_options={
        "key": AWS_ACCESS_KEY_ID,
        "secret": AWS_SECRET_ACCESS_KEY,
        "token": AWS_SESSION_TOKEN,
    },
)
If you do not have cloud storage access, you can access public data by specifying an anonymous connection like this:
pd.read_csv('name', <other fields>, storage_options={"anon": True})
Otherwise, you should pass storage_options as a dict; you will get the account name and key from your cloud storage provider (Amazon S3, Google Cloud, Azure, etc.):
pd.read_csv('name', <other fields>,
            storage_options={'account_name': ACCOUNT_NAME, 'account_key': ACCOUNT_KEY})
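For the original SFTP use case, a minimal sketch could look like the following (the host, credentials and remote path are placeholders; the storage_options keys are handed to fsspec's SFTPFileSystem and then on to paramiko):
import pandas as pd

# hits_df is the DataFrame from the question; the host and credentials below are made up.
hits_df.to_csv(
    "sftp://example-host/remote/dir/hits20.tsv.gz",
    sep="\t",
    index=False,
    compression="gzip",
    chunksize=1000000,
    storage_options={
        "username": "user",
        "password": "secret",
        "port": 22,
    },
)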

AWS Glue and update duplicating data

I'm using AWS Glue to move multiple files to an RDS instance from S3. Each day I get a new file into S3 which may contain new data, but can also contain a record I have already saved, with some updated values. If I run the job multiple times I will of course get duplicate records in the database. Instead of multiple records being inserted, I want Glue to try to update that record if it notices a field has changed; each record has a unique id. Is this possible?
I followed a similar approach to the one suggested as the 2nd option by Yuriy: get the existing data as well as the new data, then do some processing to merge the two of them and write with overwrite mode. The following code should help you get an idea of how to solve this problem.
from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext()
glueContext = GlueContext(sc)

# Get your source data
src_data = glueContext.create_dynamic_frame.from_catalog(database=src_db, table_name=src_tbl)
src_df = src_data.toDF()

# Get your destination data
dst_data = glueContext.create_dynamic_frame.from_catalog(database=dst_db, table_name=dst_tbl)
dst_df = dst_data.toDF()

# Now merge the two data frames and drop duplicates on the unique id column
# (replace 'id' with the name of your key column)
merged_df = dst_df.union(src_df).dropDuplicates(['id'])

# Finally save data to the destination with OVERWRITE mode
merged_df.write.format('jdbc').options(
    url=dest_jdbc_url,
    user=dest_user_name,
    password=dest_password,
    dbtable=dest_tbl
).mode("overwrite").save()
Unfortunately there is no elegant way to do it with Glue. If you were writing to Redshift you could use postactions to implement a Redshift merge operation. However, it's not possible for other JDBC sinks (AFAIK).
Alternatively in your ETL script you can load existing data from a database to filter out existing records before saving. However if your DB table is big then the job may take a while to process it.
Another approach is to write into a staging table with mode 'overwrite' first (replace existing staging data) and then make a call to a DB via API to copy new records only into a final table.
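A rough sketch of that staging-table route for the Redshift case (the connection name, database, table names and merge SQL are all placeholders for illustration):
# SQL run by Redshift after the staging table has been loaded (postactions).
merge_sql = """
    begin;
    delete from final_table using staging_table where final_table.id = staging_table.id;
    insert into final_table select * from staging_table;
    truncate table staging_table;
    end;
"""

# Write the new batch into the staging table, then let Redshift merge it into the final table.
glueContext.write_dynamic_frame.from_jdbc_conf(
    frame=src_data,                                 # DynamicFrame holding the new records
    catalog_connection="my-redshift-connection",    # hypothetical Glue connection name
    connection_options={
        "dbtable": "staging_table",
        "database": "mydb",
        "preactions": "truncate table staging_table;",  # assumes the staging table already exists
        "postactions": merge_sql,
    },
    redshift_tmp_dir="s3://my-temp-bucket/glue/",
)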
I have used INSERT INTO table ... ON DUPLICATE KEY UPDATE for upserts into an Aurora RDS instance running the MySQL engine. Maybe this would be a reference for your use case. We cannot do this through the Glue JDBC writer since only the APPEND, OVERWRITE and ERROR modes are currently supported.
I am not sure of the RDS database engine you are using; the following is an example of MySQL upserts.
Please see this reference, where I have posted a solution using INSERT INTO TABLE .. ON DUPLICATE KEY for MySQL:
Error while using INSERT INTO table ON DUPLICATE KEY, using a for loop array
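A minimal sketch of that kind of MySQL upsert (the table, columns, endpoint and credentials are hypothetical, and pymysql is just one possible driver):
import pymysql

# (id, name, price) rows to upsert; 'id' is assumed to be the table's unique key.
rows = [(1, 'foo', 10.0), (2, 'bar', 20.0)]

sql = (
    "INSERT INTO my_table (id, name, price) "
    "VALUES (%s, %s, %s) "
    "ON DUPLICATE KEY UPDATE name = VALUES(name), price = VALUES(price)"
)

conn = pymysql.connect(host='my-rds-endpoint', user='user',
                       password='secret', database='mydb')
try:
    with conn.cursor() as cur:
        cur.executemany(sql, rows)
    conn.commit()
finally:
    conn.close()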

Retrieve gs:// path for a storage Bucket and Object at GCS using the Python API client

I'm working on a script to create Google Cloud Functions using the API, in which I need to get the gs:// form of the path to the storage bucket or object.
Here's what I have tried:
svc = discovery.build('storage', 'v1', http=views.getauth(), cache_discovery=False)
svc_req = svc.objects().get(bucket=func_obj.bucket, object=func_obj.fname, projection='full')
svc_response = svc_req.execute()
print(' Metadata is coming below:')
print(svc_response)
It returns metadata which doesn't include any link in the form of gs://. How can I get the path to a bucket or object in the form of gs://?
Help me, please!
Thanks in advance!
If you go to the API Explorer and simulate your request, you will see that this library returns exactly the same output, just as Python data structures. In there, the id field looks the most like a gs:// link:
"id": "test-bucket/text.txt/0000000000000000"
or in Python it looks like:
u'id': u'test-bucket/text.txt/00000000000000000'
Simply use this to transform it into a gs:// link:
import os
u'gs://' + os.path.dirname(svc_response['id'])
which would return:
u'gs://test-bucket/text.txt'
If you want to use google-cloud-python, you will face the same issue.
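Since the gs:// form is simply gs://<bucket>/<object>, you can also build it directly from the values you already pass to the request; a small sketch using the func_obj fields from the question:
# The gs:// URI is just the bucket name followed by the object name.
gs_path = 'gs://{}/{}'.format(func_obj.bucket, func_obj.fname)
print(gs_path)  # e.g. gs://test-bucket/text.txt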

Python loads data from Redshift to S3

I am trying to load some Redshift query results to S3. So far I am using pandas_redshift but I got stuck:
import pandas_redshift as pr

pr.connect_to_redshift(dbname='dbname',
                       host='xxx.us-east-1.redshift.amazonaws.com',
                       port=5439,
                       user='xxx',
                       password='xxx')

pr.connect_to_s3(aws_access_key_id='xxx',
                 aws_secret_access_key='xxx',
                 bucket='dxxx',
                 subdirectory='dir')
And here is the data that I want to dump to S3:
sql_statement = '''
select
provider,
provider_code
from db1.table1
group by provider, provider_code;
'''
df = pr.redshift_to_pandas(sql_statement)
The df was created successfully but how to do the next step, which is to put this dataframe to S3?
The method you are looking at is very inefficient.
To do this the right way you will need a way to run SQL on Redshift, e.g. via Python.
The following SQL should be run:
unload ('select provider, provider_code
         from db1.table1
         group by provider, provider_code')
to 's3://mybucket/myfolder/unload/'
access_key_id '<access-key-id>'
secret_access_key '<secret-access-key>';
See here for the documentation.
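One way to run that UNLOAD from Python is sketched below (psycopg2 is just one common driver; the cluster endpoint, credentials and bucket are placeholders):
import psycopg2

unload_sql = """
    unload ('select provider, provider_code
             from db1.table1
             group by provider, provider_code')
    to 's3://mybucket/myfolder/unload/'
    access_key_id '<access-key-id>'
    secret_access_key '<secret-access-key>';
"""

conn = psycopg2.connect(host='xxx.us-east-1.redshift.amazonaws.com',
                        port=5439,
                        dbname='dbname',
                        user='xxx',
                        password='xxx')
# Using the connection as a context manager commits the transaction on success.
with conn, conn.cursor() as cur:
    cur.execute(unload_sql)
conn.close()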
As Jon Scott mentions, if your goal is to move data from Redshift to S3 then the pandas_redshift package is not the right method. The package is meant to let you easily move data from Redshift to a pandas DataFrame on your local machine, or move data from a pandas DataFrame on your local machine to Redshift. It is worth noting that running the command you already have:
df = pr.redshift_to_pandas(sql_statement)
pulls the data directly from Redshift to your computer without involving S3 at all. However, this command:
pr.pandas_to_redshift(df, 'schema.your_new_table_name')
copies the DataFrame to a CSV in S3, then runs a query to copy that CSV into Redshift (this step requires that you ran pr.connect_to_s3 successfully). It does not perform any cleanup of the S3 bucket, so a side effect is that the data will end up in the bucket you specify.
