I'm trying to get file creation metadata.
The file is in: Azure Storage
Accessing the data through: Databricks
Right now I'm using:
file_path = my_storage_path
dbutils.fs.ls(file_path)
but it returns
[FileInfo(path='path_myFile.csv', name='fileName.csv', size=437940)]
I do not have any information about the creation time. Is there a way to get that information?
Other solutions on Stack Overflow refer to files that are already in Databricks, e.g.:
Does databricks dbfs support file metadata such as file/folder create date or modified date
In my case we access the data from Databricks, but the data lives in Azure Storage.
It really depends on the version of Databricks Runtime (DBR) that you're using. For example, the modification timestamp is available if you use DBR 10.2 (I didn't test with 10.0/10.1, but it's definitely not available on 9.1):
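On such a runtime, the FileInfo objects returned by dbutils.fs.ls expose a modificationTime field in epoch milliseconds. A minimal sketch, assuming DBR 10.2+ and the file_path from the question:
for f in dbutils.fs.ls(file_path):
    # modificationTime is reported in epoch milliseconds on newer runtimes
    print(f.name, f.size, f.modificationTime)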
If you need to get that information on an older runtime, you can use the Hadoop FileSystem API via the Py4J gateway, like this:
# Get handles to the Hadoop classes through the Py4J gateway
URI = sc._gateway.jvm.java.net.URI
Path = sc._gateway.jvm.org.apache.hadoop.fs.Path
FileSystem = sc._gateway.jvm.org.apache.hadoop.fs.FileSystem
Configuration = sc._gateway.jvm.org.apache.hadoop.conf.Configuration

# Get a FileSystem for the path and list its contents
fs = FileSystem.get(URI("/tmp"), Configuration())
status = fs.listStatus(Path('/tmp/'))
for fileStatus in status:
    # getModificationTime() returns the modification time in epoch milliseconds
    print(f"path={fileStatus.getPath()}, size={fileStatus.getLen()}, mod_time={fileStatus.getModificationTime()}")
I am running the task in Cloud Composer (Composer version: 2.0.18, Airflow version: 2.2.5).
I am sending data from Google GCS to AWS S3, for which I am using GCSToS3Operator with the following parameters (example values shown). I have stored the AWS credentials in Airflow Connections with connection id "S3-action-outbound":
gcs_to_s3 = GCSToS3Operator(
    task_id="gcs_to_s3",
    bucket="gcs_outbound",
    prefix="legacy/action/20220629",
    delimiter=".csv",
    dest_aws_conn_id="S3-action-outbound",
    dest_s3_key="s3a://action/daily/",
    replace=False,
    keep_directory_structure=True,
)
But in the end result it copies the prefix as well: it writes the data to s3a://action/daily/legacy/action/20220629/test1.csv.
I just want to add the data to the location I specified, s3a://action/daily/test1.csv.
According to the documentation, it is only supposed to copy the directory path when keep_directory_structure=False. I tried setting it to False and it copied the path twice, like this: s3a://action/daily/legacy/action/20220629/legacy/action/20220629/test1.csv
EDIT:
I just realized that there is an issue with Airflow not taking the variables from the template. Find the attached screenshot of the rendered template.
It did not take the variables replace and keep_directory_structure.
There is a discussion about it in the PR where keep_directory_structure was added:
https://github.com/apache/airflow/pull/22071/files
It was not implemented the same way as in gcs_to_sftp.py.
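Until that is fixed, a hedged workaround sketch (not the operator's own behaviour): copy the objects yourself with the provider hooks so that the GCS prefix is stripped before the S3 key is built. The bucket names, prefix and the "S3-action-outbound" connection id come from the question; everything else is illustrative, and the function could be wired into a PythonOperator or a @task-decorated callable:
import os
import tempfile

from airflow.providers.amazon.aws.hooks.s3 import S3Hook
from airflow.providers.google.cloud.hooks.gcs import GCSHook

def copy_gcs_to_s3_flat():
    gcs = GCSHook()  # default google_cloud_default connection
    s3 = S3Hook(aws_conn_id="S3-action-outbound")
    for object_name in gcs.list("gcs_outbound", prefix="legacy/action/20220629", delimiter=".csv"):
        with tempfile.NamedTemporaryFile() as tmp:
            # download the GCS object to a temp file, then upload it under a flat key
            gcs.download(bucket_name="gcs_outbound", object_name=object_name, filename=tmp.name)
            s3.load_file(
                filename=tmp.name,
                key="daily/" + os.path.basename(object_name),  # drop the GCS prefix
                bucket_name="action",
                replace=True,
            )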
I want to write a CSV file from a pandas dataframe to a remote machine, connecting via SFTP over SSH.
Does anybody know how to add the "storage_options" parameter correctly?
The pandas documentation says that I have to pass some dict as the parameter's value, but I don't understand exactly which keys it expects.
hits_df.to_csv('hits20.tsv', compression='gzip', index=False, chunksize=1000000, storage_options={???})
Every time I get ValueError: storage_options passed with file object or non-fsspec file path.
What am I doing wrong?
You will find the set of values to use by experimenting directly with the implementation backend, SFTPFileSystem. Whatever kwargs you use there are the same ones that would go into storage_options. Short story: paramiko is not the same as command-line SSH, so some trial and error will be required. (The ValueError appears because storage_options is only honoured for fsspec-style URLs such as sftp://host/path, not for a plain local filename like 'hits20.tsv'.)
If you have things working via the file system class, you can use the alternative route:
import fsspec
import pandas as pd

fs = fsspec.implementations.sftp.SFTPFileSystem(...)  # the kwargs are the same ones that would go into storage_options
# same as fs = fsspec.filesystem("ssh", ...)
with fs.open("my/file/path", "rb") as f:
    pd.read_csv(f, other_kwargs)
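Once you know which connection kwargs SFTPFileSystem accepts (they are passed through to paramiko), the direct to_csv route would look roughly like this; an untested sketch in which the host, remote path, username and password are placeholders:
import pandas as pd

# hits_df is the dataframe from the question; "sftp://" makes pandas hand the
# path to fsspec's SFTPFileSystem, and storage_options carries the paramiko
# connection arguments.
hits_df.to_csv(
    "sftp://my.remote.host/remote/path/hits20.tsv",
    compression="gzip",
    index=False,
    chunksize=1000000,
    storage_options={"username": "user", "password": "secret"},
)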
Pandas supports fsspec, which lets you work easily with remote filesystems and abstracts over s3fs for Amazon S3, gcsfs for Google Cloud Storage, and other backends such as (S)FTP, SSH or HDFS. In particular, s3fs is very handy for doing simple file operations in S3, because boto is often quite subtly complex to use.
The argument storage_options will allow you to expose s3fs arguments to pandas.
You can specify AWS credentials manually using storage_options, which takes a dict. An example below:
import os
AWS_S3_BUCKET = os.getenv("AWS_S3_BUCKET")
AWS_ACCESS_KEY_ID = os.getenv("AWS_ACCESS_KEY_ID")
AWS_SECRET_ACCESS_KEY = os.getenv("AWS_SECRET_ACCESS_KEY")
AWS_SESSION_TOKEN = os.getenv("AWS_SESSION_TOKEN")
df.to_csv(
    f"s3://{AWS_S3_BUCKET}/{key}",
    storage_options={
        "key": AWS_ACCESS_KEY_ID,
        "secret": AWS_SECRET_ACCESS_KEY,
        "token": AWS_SESSION_TOKEN,
    },
)
If you do not have cloud storage access, you can still read public data by specifying an anonymous connection, like this:
pd.read_csv('name', <other fields>, storage_options={"anon": True})
Otherwise, pass storage_options as a dict; you get the account name and key from your cloud provider (Amazon S3, Google Cloud, Azure, etc.):
pd.read_csv('name', <other fields>,
            storage_options={'account_name': ACCOUNT_NAME, 'account_key': ACCOUNT_KEY})
I need to read and write parquet files from an Azure blob store within the context of a Jupyter notebook running Python 3 kernel.
I see code for working strictly with parquet files and Python, and other code for grabbing/writing to an Azure blob store, but nothing yet that puts it all together.
Here is some sample code I'm playing with:
from azure.storage.blob import BlockBlobService
block_blob_service = BlockBlobService(account_name='testdata', account_key='key-here')
block_blob_service.get_blob_to_text(container_name='mycontainer', blob_name='testdata.parquet')
This last line will throw an encoding-related error.
I've played with storefact but am coming up short there.
Thanks for any help
To access the file, you first need to connect to Azure Blob Storage.
storage_account_name = "your storage account name"
storage_account_access_key = "your storage account access key"
Read the path of the Parquet file into a variable:
commonLOB_mst_source = "Parquet file path"
file_type = "parquet"
Connect to blob storage:
spark.conf.set(
    "fs.azure.account.key." + storage_account_name + ".blob.core.windows.net",
    storage_account_access_key)
Read the Parquet file into a dataframe:
df = spark.read.format(file_type).option("inferSchema", "true").load(commonLOB_mst_source)
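The answer above goes through Spark. If the notebook only has a plain Python 3 kernel, a hedged alternative is to let pandas read the blob through fsspec/adlfs; a minimal sketch, assuming a recent pandas plus the adlfs and pyarrow packages, with the account, key, container and blob names mirroring the question's sample code:
import pandas as pd

# "abfs://" routes the path through adlfs; the account credentials go in storage_options
df = pd.read_parquet(
    "abfs://mycontainer/testdata.parquet",
    storage_options={
        "account_name": "testdata",
        "account_key": "key-here",
    },
)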
I'm working on a script to create Google Cloud Functions using the API, in which I need to get the gs:// form of the path to a storage bucket or object.
Here's what I have tried:
svc = discovery.build('storage', 'v1', http=views.getauth(), cache_discovery=False)
svc_req = svc.objects().get(bucket=func_obj.bucket, object=func_obj.fname, projection='full')
svc_response = svc_req.execute()
print(' Metadata is coming below:')
print(svc_response)
It returns metadata which doesn't include any link in the gs:// form. How can I get the path to a bucket or object in the form of "gs://"?
Help me, please!
Thanks in advance!
If you go to the API Explorer and simulate your request, you will see that this library returns exactly the same output, just as Python data structures. In there, the id field looks the most like a gs:// link:
"id": "test-bucket/text.txt/0000000000000000"
or, in Python, it looks like:
u'id': u'test-bucket/text.txt/00000000000000000'
Simply use this to transform it into a gs:// link:
import os
u'gs://' + os.path.dirname(svc_response['id'])
which would return:
u'gs://test-bucket/text.txt'
If you want to use the google-cloud-python library, you will face the same issue.
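An alternative that avoids stripping the generation suffix from id: the objects.get response also contains separate bucket and name fields, so a small sketch based on those fields would be:
gs_uri = "gs://{}/{}".format(svc_response["bucket"], svc_response["name"])
print(gs_uri)  # e.g. gs://test-bucket/text.txt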
Is it possible to use LOAD CSV to load data into Neo4j running in a Docker container when the CSV file is in a secured S3 bucket? It works fine if I copy the file locally onto the Docker container.
I keep getting a 'Neo.ClientError.Statement.ExternalResourceFailed' error.
The Neo4j config shows: dbms.security.allow_csv_import_from_file_urls=true
My code is Python (3.6) using py2neo (3.1.2).
USING PERIODIC COMMIT 5000
LOAD CSV WITH HEADERS FROM 'https://s3-my-region.amazonaws.com/some-secured-bucket/somefile.csv' AS row FIELDTERMINATOR ','
MERGE (person:Person {id: row.id})
ON CREATE SET person.first_name = row.first_name
, person.last_name = row.last_name
, person.email = row.email
, person.mobile_phone = row.mobile_phone
, person.business_phone = row.business_phone
, person.business_address = row.business_address
ON MATCH SET person.first_name = row.first_name
, person.last_name = row.last_name
, person.email = row.email
, person.mobile_phone = row.mobile_phone
, person.business_phone = row.business_phone
, person.business_address = row.business_address
Any help or examples would be greatly appreciated.
Many thanks.
You can generate a time-limited signed URL on S3, so you don't need to make the file public.
See here for an example:
https://advancedweb.hu/2018/10/30/s3_signed_urls/
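A minimal sketch of generating such a signed URL with boto3 (the bucket and key mirror the question's example; the expiry is illustrative):
import boto3

s3 = boto3.client("s3")
signed_url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "some-secured-bucket", "Key": "somefile.csv"},
    ExpiresIn=3600,  # link validity in seconds
)
# use signed_url in the LOAD CSV WITH HEADERS FROM ... statement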
Neo4j "LOAD CSV" can work with http/https URLs, such as:
LOAD CSV WITH HEADERS FROM "https://s3-ap-southeast-2.amazonaws.com/myfile/loadcsv/file.csv" AS line
WITH line LIMIT 3
RETURN line
The following configuration needs to be changed:
the S3 folder needs to be open to the public
in neo4j.conf, set dbms.security.allow_csv_import_from_file_urls=true
in neo4j.conf, comment out or delete dbms.directories.import=import
make sure the firewall is not blocking the Neo4j ports [7474, 7473, 7687]
You may also use tools like s3fs to map your bucket as a local file system, so you can read the files directly; only IAM access is needed.
Regarding @dz902's comment: this is not an option in a Docker container, because if you try to map /var/lib/neo4j/import onto your S3 bucket using s3fs, it will fail with the error
chown: changing ownership of '/var/lib/neo4j/import': Input/output error
That is because Neo4j inside the container operates and creates its folders as a different user (neo4j, uid/gid=7474).
There is an option to run Neo4j as another user, but you still cannot use root for this purpose in Neo4j's case.
More details on that here.
If someone has a solution or ideas on how to make this possible (i.e. mapping the /var/lib/neo4j/import folder to an S3 bucket), please share your thoughts.