I'm trying to execute a COPY command in Redshift via Glue:
redshiftsql = "copy table from 's3://bucket/test' credentials 'aws-iam-role' format as json 'auto';"
I'm connecting using the syntax below:
from_jdbc_conf(frame, catalog_connection, connection_options={}, redshift_tmp_dir = "", transformation_ctx="")
What is the value I need to pass for frame? Any thoughts? I appreciate your response.
Apparently you can pass "extracopyoptions": "" in the connection_options object for Redshift; anything from here can go in it: https://docs.aws.amazon.com/redshift/latest/dg/copy-parameters-data-format.html
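For instance, a rough sketch of how that might look in a Glue job. Everything below is a placeholder (the catalog database/table, the Glue connection name, the target table, the temp dir), glueContext is assumed to already exist, and "TRUNCATECOLUMNS" is just an example option string; whether a particular COPY option is compatible with how Glue stages the data is not something I've verified:

# read a DynamicFrame from the Glue catalog; this is also the value for frame:
# it is the DynamicFrame you want to load into Redshift
dyf = glueContext.create_dynamic_frame.from_catalog(
    database="my_glue_db",            # placeholder catalog database
    table_name="my_source_table"      # placeholder catalog table
)

glueContext.write_dynamic_frame.from_jdbc_conf(
    frame=dyf,                                    # the DynamicFrame to write
    catalog_connection="my-redshift-connection",  # placeholder Glue connection
    connection_options={
        "dbtable": "public.target_table",         # placeholder target table
        "database": "dev",
        # appended to the COPY statement Glue generates
        "extracopyoptions": "TRUNCATECOLUMNS",
    },
    redshift_tmp_dir="s3://bucket/temp/",
    transformation_ctx="write_redshift",
)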
See this archived question from AWS premium support.
https://aws.amazon.com/premiumsupport/knowledge-center/sql-commands-redshift-glue-job/
By the way, this is very poorly documented in my opinion.
Related
I want to write a CSV file from a pandas DataFrame to a remote machine, connecting via SFTP/SSH.
Does anybody know how to add the "storage_options" parameter correctly?
The pandas documentation says that I have to pass some dict as the parameter's value, but I don't understand exactly what it should contain.
hits_df.to_csv('hits20.tsv', compression='gzip', index=False, chunksize=1000000, storage_options={???})
Every time I get: ValueError: storage_options passed with file object or non-fsspec file path
What am I doing wrong?
You will find the set of values to use by experimenting directly with the implementation backend, SFTPFileSystem. Whatever kwargs you use there are the same ones that would go into storage_options. Short story: paramiko is not the same as command-line SSH, so some trial and error will be required.
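As an illustration only, a minimal sketch of what that could look like (host, credentials and remote path below are placeholders; the kwargs are simply forwarded to SFTPFileSystem, i.e. paramiko's SSHClient.connect). Note that the ValueError in the question appears because 'hits20.tsv' is a plain local path, so there is no fsspec URL for storage_options to apply to:

hits_df.to_csv(
    "sftp://remote.example.com/home/user/hits20.tsv",  # must be an fsspec URL
    compression="gzip",
    index=False,
    chunksize=1000000,
    storage_options={
        # forwarded to fsspec's SFTPFileSystem / paramiko SSHClient.connect
        "username": "myuser",
        "password": "mypassword",
        "port": 22,
    },
)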
If you have things working via the file system class, you can use the alternative route
import fsspec.implementations.sftp

fs = fsspec.implementations.sftp.SFTPFileSystem(...)
# same as fs = fsspec.filesystem("ssh", ...)
with fs.open("my/file/path", "rb") as f:
    pd.read_csv(f, other_kwargs)
Pandas supports fsspec, which lets you work easily with remote filesystems and abstracts over s3fs for Amazon S3, gcsfs for Google Cloud Storage, and other backends such as (S)FTP, SSH or HDFS. In particular, s3fs is very handy for doing simple file operations in S3, because boto can be subtly complex to use.
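For example, a quick sketch of those simple file operations with s3fs on its own (bucket and key names are placeholders):

import s3fs

fs = s3fs.S3FileSystem(anon=False)     # uses your normal AWS credential chain
print(fs.ls("my-bucket/some/prefix"))  # list keys like a directory listing

with fs.open("my-bucket/some/prefix/data.csv", "rb") as f:
    first_line = f.readline()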
The argument storage_options will allow you to expose s3fs arguments to pandas.
You can specify AWS credentials manually using storage_options, which takes a dict. An example below:
import os

AWS_S3_BUCKET = os.getenv("AWS_S3_BUCKET")
AWS_ACCESS_KEY_ID = os.getenv("AWS_ACCESS_KEY_ID")
AWS_SECRET_ACCESS_KEY = os.getenv("AWS_SECRET_ACCESS_KEY")
AWS_SESSION_TOKEN = os.getenv("AWS_SESSION_TOKEN")

df.to_csv(
    f"s3://{AWS_S3_BUCKET}/{key}",
    storage_options={
        "key": AWS_ACCESS_KEY_ID,
        "secret": AWS_SECRET_ACCESS_KEY,
        "token": AWS_SESSION_TOKEN,
    },
)
If you do not have cloud storage access, you can access public data by specifying an anonymous connection like this:
pd.read_csv('name', <other fields>, storage_options={"anon": True})
Otherwise, pass storage_options as a dict; you get the account name and key from your cloud storage provider (Amazon S3, Google Cloud, Azure, etc.):
pd.read_csv('name', <other fields>,
            storage_options={'account_name': ACCOUNT_NAME, 'account_key': ACCOUNT_KEY})
The ultimate goal is to be able to read the data in my Azure container into a PySpark DataFrame.
Steps until now
The steps I have followed so far:
Written this code:
from pyspark import SparkContext
from pyspark.sql import SparkSession

spark = SparkSession(SparkContext())

spark.conf.set(
    "fs.azure.account.key.%s.blob.core.windows.net" % AZURE_ACCOUNT_NAME,
    AZURE_ACCOUNT_KEY
)
spark.conf.set(
    "fs.wasbs.impl",
    "org.apache.hadoop.fs.azure.NativeAzureFileSystem"
)

container_path = "wasbs://%s@%s.blob.core.windows.net" % (
    AZURE_CONTAINER_NAME, AZURE_ACCOUNT_NAME
)
blob_folder = "%s/%s" % (container_path, AZURE_BLOB_NAME)
df = spark.read.format("text").load(blob_folder)
print(df.count())
Set public access and anonymous access to my Azure container.
Added two jars hadoop-azure-2.7.3.jar and azure-storage-2.2.0.jar to the path.
Problem
But now I am stuck with this error: Caused by: com.microsoft.azure.storage.StorageException: Incorrect Blob type, please use the correct Blob type to access a blob on the server. Expected BLOCK_BLOB, actual UNSPECIFIED..
I have not been able to find anything which talks about / resolves this issue. The closest I have found is this which does not work / is outdated.
EDIT
I found that the azure-storage-2.2.0.jar did not support APPEND_BLOB. I upgraded to azure-storage-4.0.0.jar and it changed the error from Expected BLOCK_BLOB, actual UNSPECIFIED. to Expected BLOCK_BLOB, actual APPEND_BLOB.. Does anyone know how to pass the correct type to expect?
Can someone please help me resolve this?
I have minimal experience working with Azure, but I don't think it should be this difficult to read data from a container into a Spark DataFrame. What am I doing wrong?
How can I modify the following code to use boto3.resource()?
import boto3

client = boto3.client('dynamodb', region_name='us-west-2')
response = client.create_backup(
    TableName='test',
    BackupName='backup_test'
)
I don't want to use boto3.client here.
Similarly, I also want to restore my table using boto3.resource().
There is currently no way to back up a table using the DynamoDB resource in boto3:
https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/dynamodb.html#service-resource
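As a workaround sketch (table and backup names below are the placeholders from the question, and the restored table name is hypothetical), the resource object exposes the underlying low-level client via .meta.client, so you can keep boto3.resource() everywhere else and still reach the backup APIs:

import boto3

dynamodb = boto3.resource('dynamodb', region_name='us-west-2')

# the backup goes through the low-level client hanging off the resource
response = dynamodb.meta.client.create_backup(
    TableName='test',
    BackupName='backup_test'
)

# restoring works the same way
restore = dynamodb.meta.client.restore_table_from_backup(
    TargetTableName='test_restored',   # placeholder name for the restored table
    BackupArn=response['BackupDetails']['BackupArn']
)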
I am trying to load some Redshift query results to S3. So far I am using pandas_redshift but I got stuck:
import pandas_redshift as pr

pr.connect_to_redshift(dbname='dbname',
                       host='xxx.us-east-1.redshift.amazonaws.com',
                       port=5439,
                       user='xxx',
                       password='xxx')

pr.connect_to_s3(aws_access_key_id='xxx',
                 aws_secret_access_key='xxx',
                 bucket='dxxx',
                 subdirectory='dir')
And here is the data that I want to dump to S3:
sql_statement = '''
select
provider,
provider_code
from db1.table1
group by provider, provider_code;
'''
df = pr.redshift_to_pandas(sql_statement)
The df was created successfully, but how do I do the next step, i.e. put this DataFrame into S3?
The method you are looking at is very inefficient.
To do this the right way, you will need a way to run SQL on Redshift, e.g. via Python.
The following SQL should be run:
unload ('select provider,provider_code
from db1.table1
group by provider, provider_code;')
to 's3://mybucket/myfolder/unload/'
access_key_id '<access-key-id>'
secret_access_key '<secret-access-key>';
See here for documentation.
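A minimal sketch of running that UNLOAD from Python with psycopg2 (host, credentials and bucket are the same placeholders used elsewhere in this thread):

import psycopg2

unload_sql = """
unload ('select provider, provider_code
         from db1.table1
         group by provider, provider_code')
to 's3://mybucket/myfolder/unload/'
access_key_id '<access-key-id>'
secret_access_key '<secret-access-key>';
"""

conn = psycopg2.connect(
    host='xxx.us-east-1.redshift.amazonaws.com',
    port=5439,
    dbname='dbname',
    user='xxx',
    password='xxx'
)
conn.autocommit = True      # let the UNLOAD run outside an explicit transaction
with conn.cursor() as cur:
    cur.execute(unload_sql)
conn.close()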
As Jon Scott mentions, if your goal is to move data from Redshift to S3, then the pandas_redshift package is not the right method. The package is meant to let you easily move data from Redshift to a pandas DataFrame on your local machine, or from a pandas DataFrame on your local machine to Redshift. It is worth noting that running the command you already have:
df = pr.redshift_to_pandas(sql_statement)
pulls the data directly from Redshift to your computer without involving S3 at all. However, this command:
pr.pandas_to_redshift(df, 'schema.your_new_table_name')
copies the DataFrame to a CSV in S3, then runs a query to COPY that CSV into Redshift (this step requires that pr.connect_to_s3 ran successfully). It does not perform any cleanup of the S3 bucket, so a side effect is that the data will end up in the bucket you specify.
Is it possible to use LOAD CSV to load data into Neo4j running in a Docker container when the CSV file is in a secured S3 bucket? It works fine if I copy the file locally onto the Docker container.
I keep getting a 'Neo.ClientError.Statement.ExternalResourceFailed' error.
The Neo4j config shows: dbms.security.allow_csv_import_from_file_urls=true
My code is Python (3.6) using py2neo (3.1.2).
USING PERIODIC COMMIT 5000
LOAD CSV WITH HEADERS FROM 'https://s3-my-region.amazonaws.com/some-secured-bucket/somefile.csv' AS row FIELDTERMINATOR ','
MERGE (person:Person {id: row.id})
ON CREATE SET person.first_name = row.first_name
, person.last_name = row.last_name
, person.email = row.email
, person.mobile_phone = row.mobile_phone
, person.business_phone = row.business_phone
, person.business_address = row.business_address
ON MATCH SET person.first_name = row.first_name
, person.last_name = row.last_name
, person.email = row.email
, person.mobile_phone = row.mobile_phone
, person.business_phone = row.business_phone
, person.business_address = row.business_address
Any help or examples would be greatly appreciated.
Many thanks.
You can generate a time-limited signed URL on S3, so you don't need to make the file public.
See here for an example:
https://advancedweb.hu/2018/10/30/s3_signed_urls/
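For reference, a short sketch of generating such a signed URL with boto3 (bucket and key are the placeholders from the question); the resulting URL can then be used as the FROM url in the LOAD CSV statement:

import boto3

s3 = boto3.client('s3')
signed_url = s3.generate_presigned_url(
    'get_object',
    Params={'Bucket': 'some-secured-bucket', 'Key': 'somefile.csv'},
    ExpiresIn=3600   # seconds the link remains valid
)
print(signed_url)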
Neo4j LOAD CSV can work with http/https URLs, such as:
LOAD CSV WITH HEADERS FROM "https://s3-ap-southeast-2.amazonaws.com/myfile/loadcsv/file.csv" AS line
WITH line LIMIT 3
RETURN line
The following configuration needs to be changed:
the S3 folder needs to be open to public
in neo4j.conf, set
dbms.security.allow_csv_import_from_file_urls=true
in neo4j.conf, comment out or delete dbms.directories.import=import
make sure the firewall is not blocking the Neo4j ports [7474, 7473, 7687]
Also, you may use tools like s3fs to map your bucket as a local file system, so you can read the files directly. Only IAM access is needed.
Regarding the comment from dz902: this is not an option in a Docker container, because if you try to map /var/lib/neo4j/import onto your S3 bucket using s3fs, it will fail with the error
chown: changing ownership of '/var/lib/neo4j/import': Input/output error
That is because Neo4j inside the container operates and creates its folders as a different user (neo4j, uid/gid 7474).
There is an option to run Neo4j as another user, but you still cannot use root for this purpose in Neo4j's case.
More details on that here
If someone has solutions or ideas for how to make this possible (I mean mapping the /var/lib/neo4j/import folder to an S3 bucket), please share your thoughts.