Load CSV from secured S3 bucket into Neo4j running in docker - python

Is it possible to use load csv to load data into Neo4j running on a docker container where the csv file is in a secured S3 bucket? It works fine if I copy the file locally onto the docker container.
I keep getting a 'Neo.ClientError.Statement.ExternalResourceFailed' error.
The Neo4j config has: dbms.security.allow_csv_import_from_file_urls=true
My code is Python (3.6) using py2neo (3.1.2).
USING PERIODIC COMMIT 5000
LOAD CSV WITH HEADERS FROM 'https://s3-my-region.amazonaws.com/some-secured-bucket/somefile.csv' AS row FIELDTERMINATOR ','
MERGE (person:Person {id: row.id})
ON CREATE SET person.first_name = row.first_name
, person.last_name = row.last_name
, person.email = row.email
, person.mobile_phone = row.mobile_phone
, person.business_phone = row.business_phone
, person.business_address = row.business_address
ON MATCH SET person.first_name = row.first_name
, person.last_name = row.last_name
, person.email = row.email
, person.mobile_phone = row.mobile_phone
, person.business_phone = row.business_phone
, person.business_address = row.business_address
Any help or examples would be greatly appreciated.
Many thanks.

You can generate a time-limited signed URL for the S3 object, so you don't need to make the file public.
See here for an example: https://advancedweb.hu/2018/10/30/s3_signed_urls/
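For instance, a pre-signed GET URL could be generated with boto3 and then substituted into the LOAD CSV statement above. This is only a sketch: the bucket and key are the ones from the question and the expiry is arbitrary.
import boto3

s3 = boto3.client("s3")
# Pre-signed GET URL, valid for one hour; the signature travels in query
# parameters, so Neo4j can fetch the file over plain https.
signed_url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "some-secured-bucket", "Key": "somefile.csv"},
    ExpiresIn=3600,
)
print(signed_url)  # substitute this into LOAD CSV ... FROM '<url>'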

Neo4j "LOAD CSV" can work with http/https URLs, such as:
LOAD CSV WITH HEADERS FROM "https://s3-ap-southeast-2.amazonaws.com/myfile/loadcsv/file.csv" AS line
WITH line LIMIT 3
RETURN line
The following configuration changes are needed:
the S3 folder needs to be open to the public
in neo4j.conf, set
dbms.security.allow_csv_import_from_file_urls=true
in neo4j.conf, comment out or delete dbms.directories.import=import
make sure the firewall is not blocking the Neo4j ports [7474, 7473, 7687]

Also, you may use tools like s3fs to mount your bucket as a local file system, so you can read the files directly. Only IAM access is needed.

Regarding #dz902's comment: this is not an option in a Docker container, because if you try to mount /var/lib/neo4j/import onto your S3 bucket using s3fs, it will fail with the error
chown: changing ownership of '/var/lib/neo4j/import': Input/output error
That is because Neo4j inside the container runs and creates its folders as a different user (neo4j, uid/gid=7474).
There is an option to run Neo4j as another user, but you still cannot use root for this purpose in Neo4j's case.
More details on that here.
If someone has a solution or ideas on how to make this possible (I mean mapping the /var/lib/neo4j/import folder to an S3 bucket), please share your thoughts.

Related

execute copy command from aws glue to connect to redshift

I'm trying to execute a COPY command in Redshift via Glue:
redshiftsql = "copy table from 's3://bucket/test' credentials 'aws-iam-role' format as json 'auto';"
I'm connecting using the syntax below:
from_jdbc_conf(frame, catalog_connection, connection_options={}, redshift_tmp_dir = "", transformation_ctx="")
What is the value I need to pass for frame? Any thoughts? I appreciate your response.
Apparently you can pass "extracopyoptions" in the connection_options object for Redshift, with anything from here: https://docs.aws.amazon.com/redshift/latest/dg/copy-parameters-data-format.html
See this archived question from AWS premium support: https://aws.amazon.com/premiumsupport/knowledge-center/sql-commands-redshift-glue-job/
By the way this is very poorly documented in my opinion.
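For what it's worth, a minimal sketch of how those pieces could fit together in a Glue job (the connection, database, table and S3 names below are placeholders, and frame is simply the DynamicFrame holding the data you want to COPY):
from pyspark.context import SparkContext
from awsglue.context import GlueContext

glue_context = GlueContext(SparkContext.getOrCreate())

# The frame argument is a DynamicFrame, e.g. one read from the Glue catalog.
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="my_database", table_name="my_table"
)

glue_context.write_dynamic_frame.from_jdbc_conf(
    frame=dyf,
    catalog_connection="my-redshift-connection",  # Glue connection to the cluster
    connection_options={
        "database": "dev",
        "dbtable": "public.test",
        # appended to the COPY command Glue issues; any option from the
        # Redshift COPY parameter reference linked above
        "extracopyoptions": "TRUNCATECOLUMNS MAXERROR 10",
    },
    redshift_tmp_dir="s3://bucket/temp-dir/",
    transformation_ctx="write_redshift",
)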

File metadata such as time in Azure Storage from Databricks

I'm trying to get file creation metadata.
The file is in: Azure Storage
Accessing the data through: Databricks
Right now I'm using:
file_path = my_storage_path
dbutils.fs.ls(file_path)
but it returns
[FileInfo(path='path_myFile.csv', name='fileName.csv', size=437940)]
I do not have any information about the creation time. Is there a way to get that information?
Other solutions on Stack Overflow refer to files that are already in Databricks:
Does databricks dbfs support file metadata such as file/folder create date or modified date
In my case we access the data from Databricks, but the data is in Azure Storage.
It really depends on the version of Databricks Runtime (DBR) that you're using. For example, the modification timestamp is available if you use DBR 10.2 (I didn't test with 10.0/10.1, but it's definitely not available on 9.1):
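A minimal sketch of what that looks like on a recent runtime; the modificationTime field name is an assumption to verify on your DBR version, and the path is hypothetical.
files = dbutils.fs.ls("/mnt/my_storage_path/")  # hypothetical mount/path
for f in files:
    # on DBR 10.2+ each FileInfo is expected to also expose modificationTime
    # (epoch milliseconds) next to path, name and size
    print(f.name, f.size, f.modificationTime)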
If you need to get that information on an older runtime, you can use the Hadoop FileSystem API via the Py4j gateway, like this:
URI = sc._gateway.jvm.java.net.URI
Path = sc._gateway.jvm.org.apache.hadoop.fs.Path
FileSystem = sc._gateway.jvm.org.apache.hadoop.fs.FileSystem
Configuration = sc._gateway.jvm.org.apache.hadoop.conf.Configuration
fs = FileSystem.get(URI("/tmp"), Configuration())
status = fs.listStatus(Path('/tmp/'))
for fileStatus in status:
    print(f"path={fileStatus.getPath()}, size={fileStatus.getLen()}, mod_time={fileStatus.getModificationTime()}")

Using fsspec at pandas.DataFrame.to_csv command

I want to write a CSV file from a pandas dataframe to a remote machine, connecting via SFTP over SSH.
Does anybody know how to add the "storage_options" parameter correctly?
The pandas documentation says that I have to pass some dict as the parameter's value, but I don't understand exactly which keys and values it expects.
hits_df.to_csv('hits20.tsv', compression='gzip', index=False, chunksize=1000000, storage_options={???})
Every time I get: ValueError: storage_options passed with file object or non-fsspec file path
What am I doing wrong?
You will find the set of values to use by experimenting directly with the implementation backend, SFTPFileSystem. Whatever kwargs you use there are the same ones that would go into storage_options. Short story: paramiko is not the same as command-line SSH, so some trial and error will be required.
If you have things working via the file system class, you can use the alternative route:
import fsspec
import pandas as pd

fs = fsspec.implementations.sftp.SFTPFileSystem(...)
# same as fs = fsspec.filesystem("ssh", ...)
with fs.open("my/file/path", "rb") as f:
    pd.read_csv(f, other_kwargs)
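For the to_csv call in the question, those same kwargs can go through storage_options once the target path is an sftp:// URL. A minimal sketch, assuming placeholder host, credentials and remote path; the extra kwargs end up in paramiko's connect call.
hits_df.to_csv(
    "sftp://my.remote.host/home/me/hits20.tsv",  # placeholder host and remote path
    compression="gzip",
    index=False,
    chunksize=1000000,
    storage_options={"username": "me", "password": "secret"},  # forwarded to paramiko
)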
Pandas supports fsspec, which lets you work easily with remote filesystems and abstracts over s3fs for Amazon S3 and gcsfs for Google Cloud Storage (and other backends such as (S)FTP, SSH or HDFS). In particular, s3fs is very handy for doing simple file operations in S3, because boto is often quite subtly complex to use.
The storage_options argument will allow you to expose s3fs arguments to pandas.
You can specify an AWS profile manually using storage_options, which takes a dict. An example below:
import os

AWS_S3_BUCKET = os.getenv("AWS_S3_BUCKET")
AWS_ACCESS_KEY_ID = os.getenv("AWS_ACCESS_KEY_ID")
AWS_SECRET_ACCESS_KEY = os.getenv("AWS_SECRET_ACCESS_KEY")
AWS_SESSION_TOKEN = os.getenv("AWS_SESSION_TOKEN")

df.to_csv(
    f"s3://{AWS_S3_BUCKET}/{key}",
    storage_options={
        "key": AWS_ACCESS_KEY_ID,
        "secret": AWS_SECRET_ACCESS_KEY,
        "token": AWS_SESSION_TOKEN,
    },
)
If you do not have cloud storage access, you can access public data by specifying an anonymous connection like this:
pd.read_csv('name', <other fields>, storage_options={"anon": True})
Otherwise, you should pass storage_options as a dict; you get the account name and key from your cloud storage provider (Amazon S3, Google Cloud, Azure, etc.):
pd.read_csv('name', <other fields>,
            storage_options={'account_name': ACCOUNT_NAME, 'account_key': ACCOUNT_KEY})

Provide blob type to read an Azure append blob from PySpark

The ultimate goal is to be able to read the data in my Azure container into a PySpark dataframe.
Steps until now
The steps I have followed so far:
Written this code:
spark = SparkSession(SparkContext())
spark.conf.set(
    "fs.azure.account.key.%s.blob.core.windows.net" % AZURE_ACCOUNT_NAME,
    AZURE_ACCOUNT_KEY
)
spark.conf.set(
    "fs.wasbs.impl",
    "org.apache.hadoop.fs.azure.NativeAzureFileSystem"
)
container_path = "wasbs://%s@%s.blob.core.windows.net" % (
    AZURE_CONTAINER_NAME, AZURE_ACCOUNT_NAME
)
blob_folder = "%s/%s" % (container_path, AZURE_BLOB_NAME)
df = spark.read.format("text").load(blob_folder)
print(df.count())
Set public access and anonymous access to my Azure container.
Added two jars hadoop-azure-2.7.3.jar and azure-storage-2.2.0.jar to the path.
Problem
But now I am stuck with this error: Caused by: com.microsoft.azure.storage.StorageException: Incorrect Blob type, please use the correct Blob type to access a blob on the server. Expected BLOCK_BLOB, actual UNSPECIFIED..
I have not been able to find anything which talks about / resolves this issue. The closest I have found is this which does not work / is outdated.
EDIT
I found that the azure-storage-2.2.0.jar did not support APPEND_BLOB. I upgraded to azure-storage-4.0.0.jar and it changed the error from Expected BLOCK_BLOB, actual UNSPECIFIED. to Expected BLOCK_BLOB, actual APPEND_BLOB. Does anyone know how to tell it to expect the correct blob type?
Can someone please help me resolve this?
I have minimal expertise in working with Azure, but I don't think it should be this difficult to read the data and create a Spark dataframe from it. What am I doing wrong?

Save pandas dataframe to s3

I want to save BIG pandas dataframes to s3 using boto3.
Here is what I am doing now:
import io
import boto3

s3 = boto3.client("s3")
csv_buffer = io.StringIO()
df.to_csv(csv_buffer, index=False)
s3.put_object(Bucket="bucket-name", Key="file-name", Body=csv_buffer.getvalue())
This generates a file with the following permissions:
---------- 1 root root file-name
How can I change that so that the file is owned by the user that executes the script, i.e. user "ubuntu" on an AWS instance?
This is what I want:
-rw-rw-r-- 1 ubuntu ubuntu file-name
Another thing: has anyone tried this method with big dataframes (millions of rows), and does it perform well?
How does it compare to just saving the file locally and using the boto3 copy file method?
Thanks a lot.
AWS S3 requires multipart upload for files above 5 GB.
You can implement it with boto3 using a reusable configuration object from the TransferConfig class:
import boto3
from boto3.s3.transfer import TransferConfig

# multipart_threshold is the size in bytes above which multipart upload is used;
# the default is 8 MB
GB = 1024 ** 3
config = TransferConfig(multipart_threshold=5 * GB)

# Perform the transfer
s3 = boto3.client('s3')
s3.upload_file('FILE_NAME', 'BUCKET_NAME', 'OBJECT_NAME', Config=config)
Boto3 documentation is here: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/customizations/s3.html
There are also some limits documented here: https://docs.aws.amazon.com/AmazonS3/latest/dev/qfacts.html
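Putting that together with the original question, one option is to write the dataframe to a local file first and let upload_file handle the multipart mechanics. A sketch: the local path is a placeholder, while the bucket and key are the ones from the question.
import boto3
from boto3.s3.transfer import TransferConfig

GB = 1024 ** 3
config = TransferConfig(multipart_threshold=5 * GB)

# Stream the CSV to disk instead of building it in memory, then upload;
# boto3 splits the upload into parts once the file crosses the threshold.
df.to_csv("/tmp/file-name.csv", index=False)
boto3.client("s3").upload_file(
    "/tmp/file-name.csv", "bucket-name", "file-name", Config=config
)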
