Amazon SageMaker: open JSON from S3 bucket - Python

I created an S3 bucket and placed both a data.csv and a data.json file inside it. I then created a SageMaker notebook and specified this S3 bucket in the IAM role.
This now works from inside the notebook:
import pandas as pd
from sagemaker import get_execution_role
bucket='my-sagemaker-bucket'
data_key = 'data.csv'
data_location = 's3://{}/{}'.format(bucket, data_key)
data = pd.read_csv(data_location)
But this errors saying file doesn't exist:
import json
from sagemaker import get_execution_role
bucket='my-sagemaker-bucket'
data_key = 'data.json'
data_location = 's3://{}/{}'.format(bucket, data_key)
data = json.load(open(data_location))
Anyone know why I can read the csv but not the json? I also can't shutil.copy the csv to the notebook's current working directory (also says file doesn't exist). I'm not very well versed with S3 buckets or Sagemaker, so not sure if this is a permissions/policy issue or something else.

Your SageMaker execution role might have insufficient rights to access your S3 bucket. The default SageMaker execution role has the "AmazonSageMakerFullAccess" policy, which uses the S3 request condition "s3:ExistingObjectTag/SageMaker = true".
So you could try simply tagging your S3 bucket (Tag: SageMaker: true), and check your IAM settings.
import pandas as pd
bucket='my-sagemaker-bucket'
data_key = 'data.json'
data_location = 's3://{}/{}'.format(bucket, data_key)
pd.read_json(data_location) # , orient='columns', typ='series'

Pandas can handle S3 URLs using your AWS credentials, so you could use pd.read_csv or pd.read_json instead of json.load. The suggestion from @Michael_S should work.
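If you prefer to keep json.load, a minimal sketch (using the same illustrative bucket and key names as above) is to fetch the object with boto3 and parse its body, since the built-in open() cannot resolve s3:// URLs:
import json
import boto3

s3 = boto3.client('s3')
# same illustrative bucket/key names as in the question
obj = s3.get_object(Bucket='my-sagemaker-bucket', Key='data.json')
# the response Body is a file-like streaming object, so json.load can read it directly
data = json.load(obj['Body'])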

Related

Writing a simple text file with no key value pair to cloud storage

My requirement is to export the data from BQ to GCS in a particular sorted order, which I am not able to get using the automatic export, so I am trying to write a manual export for this.
File format is like below:
HDR001||5378473972abc||20101|182082||
DTL001||436282798101|
DTL002||QS
DTL005||3733|8
DTL002||QA
DTL005||3733|8
DTL002||QP
DTL005||3733|8
DTL001||436282798111|
DTL002||QS
DTL005||3133|2
DTL002||QA
DTL005||3133|8
DTL002||QP
DTL005||3133|0
I am very new to this. I am able to write the file to local disk, but I am not sure how I can write this file to GCS. I tried to use write_to_file but I seem to be missing something.
import pandas as pd
import pickle as pkl
import tempfile
from google.colab import auth
from google.cloud import bigquery, storage
#import cloudstorage as gcs
auth.authenticate_user()
df = pd.DataFrame(data=job)
sc = storage.Client(project='temp-project')
with tempfile.NamedTemporaryFile(mode='w+b', buffering=-1, prefix='test', suffix='temp') as fh:
    with open(fh.name, 'w+', newline='') as f:
        dfAsString = df.to_string(header=" ", index=False)
        fh.name = fh.write(dfAsString)
        fh.close()
    bucket = sc.get_bucket('my-bucket')
    target_fn = 'test.csv'
    source_fn = fh.name
    destination_blob_name = bucket.blob('test.csv')
    bucket.blob(destination_blob_name).upload_from_file(source_fn)
Can someone please help?
Thank You.
I would suggest uploading the object to the Cloud Storage bucket with upload_from_filename instead of upload_from_file. Your code should look like this:
bucket.blob(destination_blob_name).upload_from_filename(source_fn)
Here are links to the documentation on how to upload an object to a Cloud Storage bucket and to the client library docs.
EDIT:
The reason you're getting that error is that somewhere in your code you're passing a Blob object rather than a string. Currently your destination variable is a Blob object; change it to a string instead:
destination_blob_name = bucket.blob('test.csv')
to
destination_blob_name = 'test.csv'
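Putting both fixes together, a minimal sketch (reusing the question's hypothetical project, bucket, and file names, and assuming 'job' is the BigQuery query result from the question) that skips the temporary file and uploads the formatted string with upload_from_string:
from google.cloud import storage
import pandas as pd

# 'job' stands for the BigQuery query result from the question
df = pd.DataFrame(data=job)
dfAsString = df.to_string(header=" ", index=False)

sc = storage.Client(project='temp-project')   # project name taken from the question
bucket = sc.get_bucket('my-bucket')

# bucket.blob() takes a plain string name; upload the formatted text in one call,
# no temporary file needed
bucket.blob('test.csv').upload_from_string(dfAsString)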

Loading S3File in AWS

I'm trying to download my CIFAR-10 data that is in S3 to train on it in AWS SageMaker.
I'm using this code to load the data:
import pickle
import s3fs

fs = s3fs.S3FileSystem()

def unpickle(file):
    dict = pickle.load(file, encoding='bytes')
    return dict

with fs.open(f's3://bucket_name/data_batch_1') as f:
    data = unpickle(f)
I'm getting the error "EOFError: Ran out of input" on the unpickle function. I assume the "file" is empty, but I tried different ways to get the data from my bucket, and can't seem to get it right.
Unless you have granted the appropriate permissions in IAM for the user to access the S3 bucket, the easiest fix is to grant public access, i.e. make sure all of the "Block public access" settings on the bucket are unchecked.
Then, using boto3 is an option for importing the dataset from S3 into SageMaker. Here is an example:
import boto3
import botocore
import pandas as pd
from sagemaker import get_execution_role
role = get_execution_role()
bucket = 'databucketname'
data_key = 'datasetname.csv'
data_location = 's3://{}/{}'.format(bucket, data_key)
train_df = pd.read_csv(data_location)
Hope this helps.
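If you want to keep the pickle-based CIFAR-10 batch format rather than converting to CSV, here is a minimal sketch (the bucket and key names are placeholders from the question) that fetches the object with boto3 and unpickles it in memory:
import pickle
import boto3

s3 = boto3.client('s3')
# placeholder bucket/key names from the question
obj = s3.get_object(Bucket='bucket_name', Key='data_batch_1')
# CIFAR-10 batches are pickled dicts; encoding='bytes' keeps the original byte-string keys
batch = pickle.loads(obj['Body'].read(), encoding='bytes')
print(batch[b'data'].shape)   # the image array is stored under the b'data' key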

Load file from Azure Files to Azure Databricks

Looking for a way to use the Azure Files SDK to upload files to my Azure Databricks blob storage.
I tried many things using functions from this page, but nothing worked. I don't understand why.
example:
file_service = FileService(account_name='MYSECRETNAME', account_key='mySECRETkey')
generator = file_service.list_directories_and_files('MYSECRETNAME/test') #listing file in folder /test, working well
for file_or_dir in generator:
    print(file_or_dir.name)
file_service.get_file_to_path('MYSECRETNAME','test/tables/input/referentials/','test.xlsx','/dbfs/FileStore/test6.xlsx')
where test.xlsx = the name of the file in my Azure file share
and /dbfs/FileStore/test6.xlsx = the path where the file should be uploaded in my DBFS system
I have the error message:
Exception=The specified resource name contains invalid characters
I tried to change the name but it doesn't seem to work.
Edit: I'm not even sure the function is doing what I want. What is the best way to load a file from Azure Files?
In my experience, the best way to load a file from Azure Files is to read it directly via its URL with a SAS token.
For example, as the figures below show, there is a file named test.xlsx in my test file share; I viewed it using Azure Storage Explorer and then generated its URL with a SAS token.
Fig 1. Right-click the file, then click Get Shared Access Signature...
Fig 2. You must select the Read permission to directly read the file content.
Fig 3. Copy the URL with the SAS token
Here is my sample code; you can run it in your Azure Databricks notebook with the SAS token URL of your file.
import pandas as pd
url_sas_token = 'https://<my account name>.file.core.windows.net/test/test.xlsx?st=2020-01-27T10%3A16%3A12Z&se=2020-01-28T10%3A16%3A12Z&sp=rl&sv=2018-03-28&sr=f&sig=XXXXXXXXXXXXXXXXX'
# Directly read the file content from its url with sas token to get a pandas dataframe
pdf = pd.read_excel(url_sas_token)
# Then, to convert the pandas dataframe to a PySpark dataframe in Azure Databricks
df = spark.createDataFrame(pdf)
Alternatively, to use the Azure File Storage SDK to generate the URL with a SAS token for your file, or to get the bytes of your file for reading, please refer to the official document Develop for Azure Files with Python and my sample code below.
# Create a client of Azure File Service as same as yours
from azure.storage.file import FileService
account_name = '<your account name>'
account_key = '<your account key>'
share_name = 'test'
directory_name = None
file_name = 'test.xlsx'
file_service = FileService(account_name=account_name, account_key=account_key)
To generate the SAS token URL of a file:
from azure.storage.file import FilePermissions
from datetime import datetime, timedelta
sas_token = file_service.generate_file_shared_access_signature(share_name, directory_name, file_name, permission=FilePermissions.READ, expiry=datetime.utcnow() + timedelta(hours=1))
url_sas_token = f"https://{account_name}.file.core.windows.net/{share_name}/{file_name}?{sas_token}"
import pandas as pd
pdf = pd.read_excel(url_sas_token)
df = spark.createDataFrame(pdf)
Or use the get_file_to_stream function to read the file content:
from io import BytesIO
import pandas as pd
stream = BytesIO()
file_service.get_file_to_stream(share_name, directory_name, file_name, stream)
pdf = pd.read_excel(stream)
df = spark.createDataFrame(pdf)
Just as an addition to @Peter Pan's answer, here is an alternative approach without pandas, using the Python azure-storage-file-share library.
Very detailed documentation: https://pypi.org/project/azure-storage-file-share/#downloading-a-file
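A minimal sketch of that route (the connection string, share name, file path, and DBFS destination below are placeholders; the client and method names follow the linked azure-storage-file-share docs):
from azure.storage.fileshare import ShareFileClient

# placeholder connection string, share name and file path
file_client = ShareFileClient.from_connection_string(
    conn_str='<my connection string>',
    share_name='test',
    file_path='test.xlsx')

# stream the file from Azure Files into DBFS so Databricks can read it locally
with open('/dbfs/FileStore/test6.xlsx', 'wb') as file_handle:
    data = file_client.download_file()
    data.readinto(file_handle)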

Load JSON from S3 inside an AWS Glue PySpark job

I am trying to retrieve a JSON file from an S3 bucket inside a Glue PySpark script.
I am running this function in the job inside AWS Glue:
def run(spark):
    s3_bucket_path = 's3://bucket/data/file.gz'
    df = spark.read.json(s3_bucket_path)
    df.show()
After this I am getting:
AnalysisException: u'Path does not exist: s3://bucket/data/file.gz;'
I searched for this issue and did not find anything similar enough to infer where the issue is. I think there might be permission issues accessing the bucket, but then the error message should be different.
Here you can try this:
import json
import boto3

s3 = boto3.client("s3", region_name="us-west-2",
                  aws_access_key_id="", aws_secret_access_key="")
jsonFile = s3.get_object(Bucket=bucket, Key=key)
jsonObject = json.load(jsonFile["Body"])
where Key = the full path to your file in the bucket,
and then use this jsonObject in spark.read.json(jsonObject)
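As a sketch of how to hand that data to Spark (note that spark.read.json expects a path or an RDD of JSON strings rather than a Python dict; this assumes the gzipped file is in JSON Lines format, and the bucket/key are the question's placeholders):
import gzip
import boto3
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
s3 = boto3.client('s3')

# placeholder bucket/key from the question
obj = s3.get_object(Bucket='bucket', Key='data/file.gz')
# the object is gzip-compressed, so decompress before handing the text to Spark
json_text = gzip.decompress(obj['Body'].read()).decode('utf-8')

# spark.read.json accepts an RDD of JSON strings (one JSON record per element)
df = spark.read.json(spark.sparkContext.parallelize(json_text.splitlines()))
df.show()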

AWS uploading file into wrong bucket

I am using AWS SageMaker and trying to upload a data folder into S3 from SageMaker. What I am trying to do is upload my data into the s3_train_data directory (the directory exists in S3). However, it doesn't upload into that bucket, but into the default bucket that was created, and in turn it creates a new folder path from the s3_train_data variable.
Code used to upload the directory:
import os
import sagemaker
from sagemaker import get_execution_role
sagemaker_session = sagemaker.Session()
role = get_execution_role()
bucket = <bucket name>
prefix = <folders1/folders2>
key = <input>
s3_train_data = 's3://{}/{}/{}/'.format(bucket, prefix, key)
#path 'data' is the folder in the Jupyter Instance, contains all the training data
inputs = sagemaker_session.upload_data(path='data', key_prefix=s3_train_data)
Is the problem in the code or more in how I created the notebook?
You could look at the sample notebooks for how to upload data to an S3 bucket.
There are many ways; I am just giving you hints toward the answer.
You also forgot to create a boto3 session to access the S3 bucket.
Here is one of the ways to do it.
import os
import urllib.request
import boto3

def download(url):
    filename = url.split("/")[-1]
    if not os.path.exists(filename):
        urllib.request.urlretrieve(url, filename)

def upload_to_s3(channel, file):
    s3 = boto3.resource('s3')
    data = open(file, "rb")
    key = channel + '/' + file
    s3.Bucket(bucket).put_object(Key=key, Body=data)

# caltech-256
download('http://data.mxnet.io/data/caltech-256/caltech-256-60-train.rec')
upload_to_s3('train', 'caltech-256-60-train.rec')
download('http://data.mxnet.io/data/caltech-256/caltech-256-60-val.rec')
upload_to_s3('validation', 'caltech-256-60-val.rec')
link : https://buildcustom.notebook.us-east-2.sagemaker.aws/notebooks/sample-notebooks/introduction_to_amazon_algorithms/imageclassification_caltech/Image-classification-fulltraining.ipynb
Another way to do it.
import io
import os
import boto3
import sagemaker.amazon.common as smac

bucket = '<your_s3_bucket_name_here>' # enter your s3 bucket where you will copy data and model artifacts
prefix = 'sagemaker/breast_cancer_prediction' # place to upload training files within the bucket
# do some processing, then prepare to push the data (train_X, train_y and train_file come from that step)
f = io.BytesIO()
smac.write_numpy_to_dense_tensor(f, train_X.astype('float32'), train_y.astype('float32'))
f.seek(0)
boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'train', train_file)).upload_fileobj(f)
Link : https://buildcustom.notebook.us-east-2.sagemaker.aws/notebooks/sample-notebooks/introduction_to_applying_machine_learning/breast_cancer_prediction/Breast%20Cancer%20Prediction.ipynb
YouTube link: https://www.youtube.com/watch?v=-YiHPIGyFGo - how to pull the data into the S3 bucket.
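On the original problem of the upload landing in the default bucket: key_prefix is only a key prefix, so passing a full s3:// URL there does not select the bucket. A minimal sketch, assuming a sagemaker SDK version where Session.upload_data accepts a bucket argument (the bucket/prefix/key names are the placeholders from the question):
import sagemaker

sagemaker_session = sagemaker.Session()

bucket = '<bucket name>'         # placeholders from the question
prefix = '<folders1/folders2>'
key = '<input>'

# pass the bucket explicitly and keep key_prefix a plain prefix, not an s3:// URL
inputs = sagemaker_session.upload_data(
    path='data',
    bucket=bucket,
    key_prefix='{}/{}'.format(prefix, key))
print(inputs)   # e.g. s3://<bucket name>/<folders1/folders2>/<input>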
