I can load the data only if I change the "anon" parameter to True after making the file public.
df = dd.read_csv('s3://mybucket/some-big.csv', storage_options = {'anon':False})
This is not recommended for obvious reasons. How do I load the data from S3 securely?
The backend which loads the data from S3 is s3fs, and it has a section on credentials here, which mostly points you to boto3's documentation.
The short answer is that there are a number of ways of providing S3 credentials, some of which are automatic: a file in the right place, environment variables (which must be accessible to all workers), or the cluster metadata service.
Alternatively, you can provide your key/secret directly in the call, but that of course means that you must trust your execution platform and the communication between workers:
df = dd.read_csv('s3://mybucket/some-big.csv', storage_options = {'key': mykey, 'secret': mysecret})
The set of parameters you can pass in storage_options when using s3fs can be found in the API docs.
General reference http://docs.dask.org/en/latest/remote-data-services.html
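Beyond anon/key/secret, s3fs accepts further keyword arguments that can be forwarded the same way. A minimal sketch (the region name below is only an illustrative placeholder, not a value from the question):
import dask.dataframe as dd
# 'client_kwargs' is forwarded by s3fs to the underlying boto client;
# the region shown here is a placeholder
df = dd.read_csv(
    's3://mybucket/some-big.csv',
    storage_options={
        'anon': False,
        'client_kwargs': {'region_name': 'us-east-1'},
    },
)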
If you're within your virtual private cloud (VPC), S3 will likely already be credentialed and you can read the file in without a key:
import dask.dataframe as dd
df = dd.read_csv('s3://<bucket>/<path to file>.csv')
If you aren't credentialed, you can use the storage_options parameter and pass a key pair (key and secret):
import dask.dataframe as dd
storage_options = {'key': <s3 key>, 'secret': <s3 secret>}
df = dd.read_csv('s3://<bucket>/<path to file>.csv', storage_options=storage_options)
Full documentation from dask can be found here
Under the hood, Dask uses boto3, so you can set up your keys in pretty much all the ways boto3 supports, e.g. a role-based profile via export AWS_PROFILE=xxxx, or by explicitly exporting the access key and secret via your environment variables. I would advise against hard-coding your keys lest you expose your code to the public by mistake.
$ export AWS_PROFILE=your_aws_cli_profile_name
or
https://docs.aws.amazon.com/sdk-for-java/v1/developer-guide/setup-credentials.html
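If you would rather select the profile per call instead of via the environment, recent versions of s3fs also accept a profile name, which Dask forwards through storage_options. A sketch, assuming the profile name exists in your AWS credentials file:
import dask.dataframe as dd
# 'profile' selects a named profile from ~/.aws/credentials (recent s3fs versions)
df = dd.read_csv(
    's3://mybucket/some-big.csv',
    storage_options={'profile': 'your_aws_cli_profile_name'},
)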
For S3, you can use wildcard matching to fetch multiple chunked files:
import dask.dataframe as dd
# Given N CSV files in S3, read them all and compute the total record count
s3_url = 's3://<bucket_name>/dask-tutorial/data/accounts.*.csv'
df = dd.read_csv(s3_url)
print(df.head())
print(len(df))
I'm trying to get file creation metadata.
The file is in: Azure Storage
Accessing the data through: Databricks
Right now I'm using:
file_path = my_storage_path
dbutils.fs.ls(file_path)
but it returns
[FileInfo(path='path_myFile.csv', name='fileName.csv', size=437940)]
I do not have any information about the creation time; is there a way to get that information?
Other solutions on Stack Overflow refer to files that are already in Databricks:
Does databricks dbfs support file metadata such as file/folder create date or modified date
In my case we access the data from Databricks, but the data is in Azure Storage.
It really depends on the version of Databricks Runtime (DBR) that you're using. For example, the modification timestamp is available if you use DBR 10.2 (didn't test with 10.0/10.1, but it's definitely not available on 9.1).
If you need to get that information, you can use the Hadoop FileSystem API via the Py4j gateway, like this:
# Access the Hadoop FileSystem classes through Spark's Py4J gateway
URI = sc._gateway.jvm.java.net.URI
Path = sc._gateway.jvm.org.apache.hadoop.fs.Path
FileSystem = sc._gateway.jvm.org.apache.hadoop.fs.FileSystem
Configuration = sc._gateway.jvm.org.apache.hadoop.conf.Configuration

# List the directory and print each file's path, size and modification time
fs = FileSystem.get(URI("/tmp"), Configuration())
status = fs.listStatus(Path('/tmp/'))
for fileStatus in status:
    print(f"path={fileStatus.getPath()}, size={fileStatus.getLen()}, mod_time={fileStatus.getModificationTime()}")
I've been researching this topic for a few days now and have yet to come up with a working solution. Apologies if this question is repetitive (although I have checked for similar questions and have not quite found the right one).
I have an s3 bucket with about 150 parquet files in it. I have been searching for a dynamic way to bring all of these files into one dataframe (it can be multiple dataframes, if that is more computationally efficient). If all of these parquets were appended to one dataframe, it would be a very large amount of data, so if the solution is simply that I need more computing power, please do let me know. I ultimately stumbled across awswrangler, and am using the code below, which has been running as expected:
df = wr.s3.read_parquet(path="s3://my-s3-data/folder1/subfolder1/subfolder2/", dataset=True, columns = df_cols, chunked=True)
This code returns a generator object, which I am not sure how to get into a dataframe. I have tried solutions from the linked pages (below) and have gotten various errors such as invalid filepath and length mismatch.
https://newbedev.com/create-a-pandas-dataframe-from-generator
https://aws-data-wrangler.readthedocs.io/en/stable/stubs/awswrangler.s3.read_parquet.html
Create a pandas DataFrame from generator?
Another solution I tried was from https://www.py4u.net/discuss/140245 :
import s3fs
import pyarrow.parquet as pq

fs = s3fs.S3FileSystem()
bucket = "cortex-grm-pdm-auto-design-data"
path = "s3://my-bucket/folder1/subfolder1/subfolder2/"

# Python 3.6 or later
p_dataset = pq.ParquetDataset(
    "s3://my-bucket/folder1/subfolder1/subfolder2/",
    filesystem=fs
)
df = p_dataset.read().to_pandas()
which resulted in an error "'AioClientCreator' object has no attribute '_register_lazy_block_unknown_fips_pseudo_regions'"
Lastly, I also tried the many-parquet solution from https://newbedev.com/how-to-read-a-list-of-parquet-files-from-s3-as-a-pandas-dataframe-using-pyarrow:
# Read multiple parquets from a folder on S3 generated by spark
# (relies on the pd_read_s3_parquet helper defined in the same linked post)
import boto3
import pandas as pd

def pd_read_s3_multiple_parquets(filepath, bucket, s3=None,
                                 s3_client=None, verbose=False, **args):
    if not filepath.endswith('/'):
        filepath = filepath + '/'  # Add '/' to the end
    if s3_client is None:
        s3_client = boto3.client('s3')
    if s3 is None:
        s3 = boto3.resource('s3')
    s3_keys = [item.key for item in s3.Bucket(bucket).objects.filter(Prefix=filepath)
               if item.key.endswith('.parquet')]
    if not s3_keys:
        print('No parquet found in', bucket, filepath)
    elif verbose:
        print('Load parquets:')
        for p in s3_keys:
            print(p)
    dfs = [pd_read_s3_parquet(key, bucket=bucket, s3_client=s3_client, **args)
           for key in s3_keys]
    return pd.concat(dfs, ignore_index=True)

df = pd_read_s3_multiple_parquets('path/to/folder', 'my_bucket')
This one returned "No parquet found" for the path (which I am certain is wrong; the parquets are all there when I visit the actual S3 bucket), as well as the error "No objects to concatenate".
Any guidance you can provide is greatly appreciated! Again, apologies for any repetitiveness in my question. Thank you in advance.
AWS Data Wrangler works seamlessly; I have used it.
Install via pip or conda.
Reading multiple parquet files is a one-liner: see example below.
Creds are automatically read from your environment variables.
# this is running on my laptop
import numpy as np
import pandas as pd
import awswrangler as wr

# assume multiple parquet files in 's3://mybucket/etc/etc/'
s3_bucket_uri = 's3://mybucket/etc/etc/'
df = wr.s3.read_parquet(path=s3_bucket_uri)

# df is a pandas DataFrame
The AWS docs with examples that include your use case are here:
https://aws-data-wrangler.readthedocs.io/en/stable/stubs/awswrangler.s3.read_parquet.html
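Regarding the chunked=True call in the question: with that flag, read_parquet returns an iterator of DataFrames rather than a single DataFrame. A minimal sketch of collecting it, assuming the chunks share a schema and the concatenated result fits in memory (the path is a placeholder):
import pandas as pd
import awswrangler as wr

# chunked=True yields pandas DataFrames one batch at a time
chunks = wr.s3.read_parquet(path='s3://mybucket/etc/etc/', dataset=True, chunked=True)
df = pd.concat(chunks, ignore_index=True)  # or process each chunk in a loop if memory is tight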
I want to write a CSV file from a pandas DataFrame to a remote machine, connecting via SFTP over SSH.
Does anybody know how to add the "storage_options" parameter correctly?
The pandas documentation says that I have to pass some dict as the parameter's value, but I don't understand exactly which one.
hits_df.to_csv('hits20.tsv', compression='gzip', index=False, chunksize=1000000, storage_options={???})
Every time I got ValueError: storage_options passed with file object or non-fsspec file path
What am I doing wrong?
You will find the set of values to use by experimenting directly with the implementation backend, SFTPFileSystem. Whatever kwargs you use there are the same ones that would go into storage_options. Short story: paramiko is not the same as command-line SSH, so some trial and error will be required.
If you have things working via the file system class, you can use the alternative route:
import fsspec

fs = fsspec.implementations.sftp.SFTPFileSystem(...)
# same as fs = fsspec.filesystem("ssh", ...)
with fs.open("my/file/path", "rb") as f:
    pd.read_csv(f, other_kwargs)
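Once you know which kwargs work for SFTPFileSystem, the direct to_csv route the question was aiming for should also work. A sketch, where the host, username, password and remote path are placeholders and the storage_options keys are whichever ones you validated above:
# The target must be an fsspec URL (sftp://...), not a bare local filename,
# otherwise pandas raises "storage_options passed with ... non-fsspec file path"
hits_df.to_csv(
    'sftp://remote-host/home/user/hits20.tsv.gz',
    compression='gzip',
    index=False,
    chunksize=1000000,
    storage_options={'host': 'remote-host', 'username': 'user', 'password': '***'},
)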
Pandas supports fsspec, which lets you work easily with remote filesystems and abstracts over s3fs for Amazon S3, gcsfs for Google Cloud Storage (and other backends such as (S)FTP, SSH or HDFS). In particular, s3fs is very handy for doing simple file operations in S3, because boto is often quite subtly complex to use.
The storage_options argument allows you to expose these s3fs arguments to pandas.
You can pass AWS credentials manually using storage_options, which takes a dict. An example below:
import os

# Credentials read from environment variables
AWS_S3_BUCKET = os.getenv("AWS_S3_BUCKET")
AWS_ACCESS_KEY_ID = os.getenv("AWS_ACCESS_KEY_ID")
AWS_SECRET_ACCESS_KEY = os.getenv("AWS_SECRET_ACCESS_KEY")
AWS_SESSION_TOKEN = os.getenv("AWS_SESSION_TOKEN")

# key is the object key (path within the bucket) you want to write to
df.to_csv(
    f"s3://{AWS_S3_BUCKET}/{key}",
    storage_options={
        "key": AWS_ACCESS_KEY_ID,
        "secret": AWS_SECRET_ACCESS_KEY,
        "token": AWS_SESSION_TOKEN,
    },
)
If you do not have cloud storage access, you can access public data by specifying an anonymous connection like this
pd.read_csv('name',<other fields>, storage_options={"anon": True})
Otherwise, you should pass storage_options in dict format; you will get the name and key from your cloud storage provider (Amazon S3, Google Cloud, Azure, etc.):
pd.read_csv('name',<other fields>, \
storage_options={'account_name': ACCOUNT_NAME, 'account_key': ACCOUNT_KEY})
I am trying to load some Redshift query results to S3. So far I am using pandas_redshift but I got stuck:
import pandas_redshift as pr

pr.connect_to_redshift(dbname='dbname',
                       host='xxx.us-east-1.redshift.amazonaws.com',
                       port=5439,
                       user='xxx',
                       password='xxx')

pr.connect_to_s3(aws_access_key_id='xxx',
                 aws_secret_access_key='xxx',
                 bucket='dxxx',
                 subdirectory='dir')
And here is the data that I want to dump to S3:
sql_statement = '''
select
provider,
provider_code
from db1.table1
group by provider, provider_code;
'''
df = pr.redshift_to_pandas(sql_statement)
The df was created successfully, but how do I do the next step, i.e. put this dataframe into S3?
The method you are looking at is very inefficient.
To do this the right way, you will need a way to run SQL on Redshift, e.g. via Python.
The following SQL should be run:
unload ('select provider,provider_code
from db1.table1
group by provider, provider_code;')
to 's3://mybucket/myfolder/unload/'
access_key_id '<access-key-id>'
secret_access_key '<secret-access-key>';
See here for documentation.
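One way to run that SQL from Python is sketched below, using psycopg2; the connection details and bucket path are the same placeholders that appear in the question.
import psycopg2

# Placeholder connection details - same cluster as in the question
conn = psycopg2.connect(dbname='dbname',
                        host='xxx.us-east-1.redshift.amazonaws.com',
                        port=5439,
                        user='xxx',
                        password='xxx')

unload_sql = """
    unload ('select provider, provider_code
             from db1.table1
             group by provider, provider_code')
    to 's3://mybucket/myfolder/unload/'
    access_key_id '<access-key-id>'
    secret_access_key '<secret-access-key>';
"""

with conn, conn.cursor() as cur:
    cur.execute(unload_sql)
conn.close()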
As Jon Scott mentions, if your goal is to move data from Redshift to S3, then the pandas_redshift package is not the right method. The package is meant to let you easily move data from Redshift to a pandas DataFrame on your local machine, or from a pandas DataFrame on your local machine to Redshift. It is worth noting that running the command you already have:
df = pr.redshift_to_pandas(sql_statement)
pulls the data directly from Redshift to your computer without involving S3 at all. However, this command:
pr.pandas_to_redshift(df, 'schema.your_new_table_name')
copies the DataFrame to a CSV in S3, then runs a query to copy that CSV into Redshift (this step requires that you ran pr.connect_to_s3 successfully). It does not perform any cleanup of the S3 bucket, so a side effect is that the data will end up in the bucket you specify.
I want to save BIG pandas dataframes to s3 using boto3.
Here is what I am doing now:
import io
import boto3
s3 = boto3.client('s3')
csv_buffer = io.StringIO()
df.to_csv(csv_buffer, index=False)
s3.put_object(Bucket="bucket-name", Key="file-name", Body=csv_buffer.getvalue())
This generates a file with the following permissions:
---------- 1 root root file-name
How can I change that so that the file is owned by the user that executes the script, i.e. user "ubuntu" on an AWS instance?
This is what I want:
-rw-rw-r-- 1 ubuntu ubuntu file-name
Another thing: did anyone try this method with big dataframes (millions of rows), and does it perform well?
How does it compare to just saving the file locally and using the boto3 copy file method?
Thanks a lot.
AWS S3 requires multipart upload for files above 5GB.
You can implement it with boto3 using a reusable configuration object from the TransferConfig class:
import boto3
from boto3.s3.transfer import TransferConfig

# multipart_threshold is a number of bytes per part; the default is 8MB
GB = 1024 ** 3
config = TransferConfig(multipart_threshold=5 * GB)

# Perform the transfer
s3 = boto3.client('s3')
s3.upload_file('FILE_NAME', 'BUCKET_NAME', 'OBJECT_NAME', Config=config)
Boto3 documentation here: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/customizations/s3.html
There are also some limits documented here: https://docs.aws.amazon.com/AmazonS3/latest/dev/qfacts.html
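To stay close to the in-memory approach in the question, the same TransferConfig can be combined with upload_fileobj, which (unlike put_object) streams the buffer and switches to multipart upload above the threshold. A sketch, reusing the df, bucket and key names from the question:
import io
import boto3
from boto3.s3.transfer import TransferConfig

GB = 1024 ** 3
config = TransferConfig(multipart_threshold=5 * GB)

# Encode the DataFrame to bytes in memory, then stream it to S3
csv_buffer = io.BytesIO(df.to_csv(index=False).encode('utf-8'))
s3 = boto3.client('s3')
s3.upload_fileobj(csv_buffer, 'bucket-name', 'file-name', Config=config)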