Using fsspec with pandas.DataFrame.to_csv - python

I want to write a CSV file from a pandas DataFrame to a remote machine, connecting over SFTP/SSH.
Does anybody know how to add the "storage_options" parameter correctly?
The pandas documentation says I have to pass a dict as the parameter's value, but I don't understand which keys it should contain.
hits_df.to_csv('hits20.tsv', compression='gzip', index=False, chunksize=1000000, storage_options={???})
Every time I get ValueError: storage_options passed with file object or non-fsspec file path
What am I doing wrong?

You will find the set of values to use by experimenting directly with the implementation backend, SFTPFileSystem. Whatever kwargs you use there are the same ones that would go into storage_options. Short story: paramiko is not the same as command-line SSH, so some trial and error will be required.
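For example, a minimal sketch of the direct route (the host name and credentials below are placeholders; any paramiko connect() kwarg such as key_filename or port can go in storage_options). Note that storage_options only takes effect when the path itself carries an fsspec protocol such as sftp:// - that is what the ValueError above is complaining about:
import pandas as pd

hits_df.to_csv(
    "sftp://my.remote.host/home/user/hits20.tsv",  # placeholder host and remote path
    compression="gzip",
    index=False,
    chunksize=1000000,
    storage_options={"username": "myuser", "password": "mypassword"},  # placeholder credentials
)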
If you have things working via the file system class, you can use the alternative route:
import fsspec
import pandas as pd

fs = fsspec.implementations.sftp.SFTPFileSystem(...)
# same as fs = fsspec.filesystem("ssh", ...)
with fs.open("my/file/path", "rb") as f:
    pd.read_csv(f, other_kwargs)
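A sketch of the same route for writing (the original goal): open the remote path for writing and hand the file object to to_csv. The remote path below is a placeholder:
with fs.open("remote/path/hits20.tsv", "w") as f:  # placeholder remote path
    hits_df.to_csv(f, sep="\t", index=False, chunksize=1000000)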

Pandas supports fsspec, which lets you work easily with remote filesystems and abstracts over s3fs for Amazon S3, gcsfs for Google Cloud Storage, and other backends such as (S)FTP, SSH, or HDFS. In particular, s3fs is very handy for doing simple file operations in S3, because boto is often quite subtly complex to use.
The storage_options argument lets you expose s3fs arguments through pandas.
You can specify AWS credentials (or a named profile) manually using storage_options, which takes a dict. An example below:
import os
import pandas as pd

AWS_S3_BUCKET = os.getenv("AWS_S3_BUCKET")
AWS_ACCESS_KEY_ID = os.getenv("AWS_ACCESS_KEY_ID")
AWS_SECRET_ACCESS_KEY = os.getenv("AWS_SECRET_ACCESS_KEY")
AWS_SESSION_TOKEN = os.getenv("AWS_SESSION_TOKEN")

# "key" is the S3 object key, i.e. the path within the bucket
df.to_csv(
    f"s3://{AWS_S3_BUCKET}/{key}",
    storage_options={
        "key": AWS_ACCESS_KEY_ID,
        "secret": AWS_SECRET_ACCESS_KEY,
        "token": AWS_SESSION_TOKEN,
    },
)

If you do not have cloud storage access, you can still read public data by specifying an anonymous connection like this:
pd.read_csv('name', <other fields>, storage_options={"anon": True})
Otherwise, pass storage_options as a dict of credentials; you get the account name and key from your cloud provider (Amazon S3, Google Cloud, Azure, etc.):
pd.read_csv('name', <other fields>,
            storage_options={'account_name': ACCOUNT_NAME, 'account_key': ACCOUNT_KEY})
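For Azure Blob Storage specifically, a minimal sketch, assuming the adlfs backend is installed (it registers the abfs:// protocol); the container and path below are placeholders:
import pandas as pd

df = pd.read_csv(
    "abfs://my-container/path/to/file.csv",  # placeholder container and path
    storage_options={"account_name": ACCOUNT_NAME, "account_key": ACCOUNT_KEY},
)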

Related

File metadata such as time in Azure Storage from Databricks

I'm trying to get file creation metadata.
The file is in: Azure Storage
Accessing data through: Databricks
Right now I'm using:
file_path = my_storage_path
dbutils.fs.ls(file_path)
but it returns
[FileInfo(path='path_myFile.csv', name='fileName.csv', size=437940)]
I do not have any information about creation time. Is there a way to get that information?
Other solutions on Stack Overflow refer to files that are already in Databricks, e.g.
Does databricks dbfs support file metadata such as file/folder create date or modified date
In my case we access the data from Databricks, but the data is in Azure Storage.
It really depends on the version of Databricks Runtime (DBR) that you're using. For example, the modification timestamp is available if you use DBR 10.2 (not tested with 10.0/10.1, but it is definitely not available on 9.1).
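A minimal sketch, assuming a runtime whose FileInfo entries expose a modificationTime field (epoch milliseconds); my_storage_path is the path from the question:
for fi in dbutils.fs.ls(my_storage_path):
    # modificationTime is assumed to be present (newer DBR only), in milliseconds since the epoch
    print(fi.name, fi.size, fi.modificationTime)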
If you need to get that information on an older runtime, you can use the Hadoop FileSystem API via the Py4j gateway, like this:
URI = sc._gateway.jvm.java.net.URI
Path = sc._gateway.jvm.org.apache.hadoop.fs.Path
FileSystem = sc._gateway.jvm.org.apache.hadoop.fs.FileSystem
Configuration = sc._gateway.jvm.org.apache.hadoop.conf.Configuration

fs = FileSystem.get(URI("/tmp"), Configuration())
status = fs.listStatus(Path('/tmp/'))
for fileStatus in status:
    # getModificationTime() returns milliseconds since the epoch
    print(f"path={fileStatus.getPath()}, size={fileStatus.getLen()}, mod_time={fileStatus.getModificationTime()}")

How to load pickled dataframes from GCS into App Engine

I'm trying to load a pickled pandas dataframe from Google Cloud Storage into App Engine.
I have been using blob.download_to_file() to read the bytestream into pandas, however I encounter the following error:
UnpicklingError: invalid load key, m
I have tried seeking to the beginning to no avail and am pretty sure something fundamental is missing from my understanding.
When attempting to pass an open file object and read from there, I get an UnsupportedOperation: write error.
from io import BytesIO

import pandas as pd
from google.cloud import storage
from google.oauth2 import service_account

def get_byte_fileobj(project, bucket, path) -> BytesIO:
    blob = _get_blob(bucket, path, project)
    byte_stream = BytesIO()
    blob.download_to_file(byte_stream)
    byte_stream.seek(0)
    return byte_stream

def _get_blob(bucket_name, path, project):
    credentials = service_account.Credentials.from_service_account_file(
        service_account_credentials_path) if service_account_credentials_path else None
    storage_client = storage.Client(project=project, credentials=credentials)
    bucket = storage_client.get_bucket(bucket_name)
    blob = bucket.blob(path)
    return blob

fileobj = get_byte_fileobj(projectid, 'backups', 'Matches/Matches.pickle')
pd.read_pickle(fileobj)
Ideally pandas would just read from pickle since all of my GCS backups are in that format, but I'm open to suggestions.
The pandas.read_pickle() method takes a file path string as its argument, not a file handle/object:
pandas.read_pickle(path, compression='infer')
Load pickled pandas object (or any object) from file.
path : str
File path where the pickled object will be loaded.
If you're in the 2nd generation standard or the flexible environment, you could try to use a real /tmp file instead of BytesIO.
Otherwise you'd have to figure out another method of loading the data into pandas that supports a file object/descriptor. In general the approach is described in How to restore Tensorflow model from Google bucket without writing to filesystem? (the context is different, but the general idea is the same).
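A minimal sketch of the /tmp approach, assuming the 2nd generation standard or flexible environment (the bucket and object names are the ones from the question):
import pandas as pd
from google.cloud import storage

client = storage.Client(project=projectid)
blob = client.get_bucket('backups').blob('Matches/Matches.pickle')
blob.download_to_filename('/tmp/Matches.pickle')  # /tmp is a real, writable filesystem here
df = pd.read_pickle('/tmp/Matches.pickle')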

Loading data from S3 to dask dataframe

I can load the data only if I change the "anon" parameter to True after making the file public.
df = dd.read_csv('s3://mybucket/some-big.csv', storage_options = {'anon':False})
This is not recommended for obvious reasons. How do I load the data from S3 securely?
The backend which loads the data from s3 is s3fs, and it has a section on credentials here, which mostly points you to boto3's documentation.
The short answer is, there are a number of ways of providing S3 credentials, some of which are automatic (a credentials file in the right place, environment variables - which must be accessible to all workers - or the cluster metadata service).
Alternatively, you can provide your key/secret directly in the call, but that of course means you must trust your execution platform and the communication between workers:
df = dd.read_csv('s3://mybucket/some-big.csv', storage_options = {'key': mykey, 'secret': mysecret})
The set of parameters you can pass in storage_options when using s3fs can be found in the API docs.
General reference http://docs.dask.org/en/latest/remote-data-services.html
If you're within your virtual private cloud (VPC), S3 will likely already be credentialed and you can read the file in without a key:
import dask.dataframe as dd
df = dd.read_csv('s3://<bucket>/<path to file>.csv')
If you aren't credentialed, you can use the storage_options parameter and pass a key pair (key and secret):
import dask.dataframe as dd
storage_options = {'key': <s3 key>, 'secret': <s3 secret>}
df = dd.read_csv('s3://<bucket>/<path to file>.csv', storage_options=storage_options)
Full documentation from dask can be found here
Dask under the hood uses s3fs, which understands the same credential configuration as boto3, so you can set up your keys in all the ways boto supports, e.g. a role-based profile via export AWS_PROFILE=xxxx, or explicitly exporting the access key and secret via environment variables. I would advise against hard-coding your keys, lest you expose your code to the public by mistake.
$ export AWS_PROFILE=your_aws_cli_profile_name
or
https://docs.aws.amazon.com/sdk-for-java/v1/developer-guide/setup-credentials.html
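Equivalently (a sketch, assuming your s3fs version accepts a profile argument), the profile name can be passed straight through storage_options instead of exporting AWS_PROFILE:
import dask.dataframe as dd

df = dd.read_csv(
    's3://mybucket/some-big.csv',
    storage_options={'profile': 'your_aws_cli_profile_name'},  # same profile as the export above
)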
For s3 you can use wildcard match to fetch multiple chunked files
import dask.dataframe as dd
# Given N csv files located in S3, read them and compute the total record count
s3_url = 's3://<bucket_name>/dask-tutorial/data/accounts.*.csv'
df = dd.read_csv(s3_url)
print(df.head())
print(len(df))

Load CSV from secured S3 bucket into Neo4j running in docker

Is it possible to use LOAD CSV to load data into Neo4j running in a Docker container when the CSV file is in a secured S3 bucket? It works fine if I copy the file locally onto the Docker container.
I keep getting a 'Neo.ClientError.Statement.ExternalResourceFailed' error.
The neo config shows: dbms.security.allow_csv_import_from_file_urls=true
My code is Python (3.6) using py2neo (3.1.2).
USING PERIODIC COMMIT 5000
LOAD CSV WITH HEADERS FROM 'https://s3-my-region.amazonaws.com/some-secured-bucket/somefile.csv' AS row FIELDTERMINATOR ','
MERGE (person:Person {id: row.id})
ON CREATE SET person.first_name = row.first_name
, person.last_name = row.last_name
, person.email = row.email
, person.mobile_phone = row.mobile_phone
, person.business_phone = row.business_phone
, person.business_address = row.business_address
ON MATCH SET person.first_name = row.first_name
, person.last_name = row.last_name
, person.email = row.email
, person.mobile_phone = row.mobile_phone
, person.business_phone = row.business_phone
, person.business_address = row.business_address
Any help or examples would be greatly appreciated.
Many thanks.
You can generate a time-limited signed URL on S3, so you don't need to make the file public.
See here for an example: https://advancedweb.hu/2018/10/30/s3_signed_urls/
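A minimal boto3 sketch of generating such a URL (the bucket and key are placeholders); the resulting https URL can be dropped straight into LOAD CSV WITH HEADERS FROM ...:
import boto3

s3 = boto3.client('s3')
url = s3.generate_presigned_url(
    'get_object',
    Params={'Bucket': 'some-secured-bucket', 'Key': 'somefile.csv'},  # placeholder bucket/key
    ExpiresIn=3600,  # URL validity in seconds
)
print(url)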
Neo4j's LOAD CSV can work with http/https URLs, such as:
LOAD CSV WITH HEADERS FROM "https://s3-ap-southeast-2.amazonaws.com/myfile/loadcsv/file.csv" AS line
WITH line LIMIT 3
RETURN line
The following configuration needs to be changed:
the S3 folder needs to be open to the public
in neo4j.conf, set dbms.security.allow_csv_import_from_file_urls=true
in neo4j.conf, comment out or delete dbms.directories.import=import
make sure the firewall is not blocking the neo4j ports [7474, 7473, 7687]
Also, you may use tools like s3fs (the FUSE filesystem, not the Python library) to mount your bucket as a local file system, so you can read the files directly. Only IAM access is needed.
Regarding @dz902's comment - this is not an option in a Docker container, because if you try to map /var/lib/neo4j/import to your S3 bucket using s3fs it will fail with the error
chown: changing ownership of '/var/lib/neo4j/import': Input/output error
That is because neo4j inside the container operates and creates its folders as a different user (neo4j, uid/gid=7474).
There is an option to run neo4j as another user, but you still cannot use root for this purpose in Neo4j's case.
More details on that here.
If someone has a solution or an idea for how to make this possible (i.e. mapping the /var/lib/neo4j/import folder to an S3 bucket), please share your thoughts.

Save pandas dataframe to s3

I want to save BIG pandas dataframes to s3 using boto3.
Here is what I am doing now:
import io
import boto3

s3 = boto3.client("s3")
csv_buffer = io.StringIO()
df.to_csv(csv_buffer, index=False)
s3.put_object(Bucket="bucket-name", Key="file-name", Body=csv_buffer.getvalue())
This generates a file with the following permissions:
---------- 1 root root file-name
How can I change that so the file is owned by the user that executes the script, i.e. user "ubuntu" on an AWS instance?
This is what I want:
-rw-rw-r-- 1 ubuntu ubuntu file-name
Another thing: has anyone tried this method with big dataframes (millions of rows), and does it perform well?
How does it compare to just saving the file locally and using the boto3 copy-file method?
Thanks a lot.
AWS S3 requires multipart upload for files above 5 GB.
You can implement it with boto3 using a reusable configuration object from the TransferConfig class:
import boto3
from boto3.s3.transfer import TransferConfig

# multipart_threshold is the size (in bytes) above which a multipart upload is used; the default is 8 MB
config = TransferConfig(multipart_threshold=5 * 1024 ** 3)  # 5 GB

# Perform the transfer
s3 = boto3.client('s3')
s3.upload_file('FILE_NAME', 'BUCKET_NAME', 'OBJECT_NAME', Config=config)
Boto3 documentation is here: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/customizations/s3.html
The relevant S3 limits are documented here: https://docs.aws.amazon.com/AmazonS3/latest/dev/qfacts.html
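Tying this back to the question, a sketch of the save-locally-then-upload route (the local file, bucket, and key names are placeholders):
import boto3
from boto3.s3.transfer import TransferConfig

df.to_csv('/tmp/file-name.csv', index=False)  # write the frame to a local file first
config = TransferConfig()  # defaults: multipart upload kicks in above 8 MB
boto3.client('s3').upload_file('/tmp/file-name.csv', 'bucket-name', 'file-name', Config=config)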
