I am trying to retrieve a JSON file from an S3 bucket inside a Glue PySpark script.
I am running this function in the AWS Glue job:
def run(spark):
    s3_bucket_path = 's3://bucket/data/file.gz'
    df = spark.read.json(s3_bucket_path)
    df.show()
After this I am getting:
AnalysisException: u'Path does not exist: s3://bucket/data/file.gz;'
I searched for this issue and did not find anything similar enough to infer where the issue is. I think there might be permission issues accessing the bucket, but then the error message should be different.
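One way to narrow this down is to check from within the job whether the object is visible to the role at all; a minimal sketch with boto3, using the bucket and key from the path above:
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
try:
    # A 404 here usually means the key/path is wrong; a 403 means the Glue role lacks permission
    s3.head_object(Bucket="bucket", Key="data/file.gz")
    print("object is reachable")
except ClientError as e:
    print("cannot reach object:", e.response["Error"]["Code"])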
You can try this:
s3 = boto3.client("s3", region_name="us-west-2", aws_access_key_id="
", aws_secret_access_key="")
jsonFile = s3.get_object(Bucket=bucket, Key=key)
jsonObject = json.load(jsonFile["Body"])
where Key is the full path to your file inside the bucket. Note that spark.read.json expects a path or an RDD/Dataset of JSON strings rather than a parsed Python dict, so pass the JSON text on to Spark instead of passing jsonObject directly.
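A minimal sketch of that hand-off, assuming spark, bucket and key are already defined and the object is plain (uncompressed) JSON text:
import boto3

s3 = boto3.client("s3")

# Read the raw JSON text from S3
obj = s3.get_object(Bucket=bucket, Key=key)
json_text = obj["Body"].read().decode("utf-8")

# spark.read.json accepts an RDD (or Dataset) of JSON strings
df = spark.read.json(spark.sparkContext.parallelize([json_text]))
df.show()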
I am new to the AWS world and started exploring it recently.
After running an Athena query, I am trying to copy the generated query result file to another S3 location.
The problem I am getting is this:
I'm trying to build file_name dynamically, using the query ID that Athena generated and appending the .csv file extension.
This generates the exception:
botocore.errorfactory.NoSuchKey: An error occurred (NoSuchKey) when calling the CopyObject operation: The specified key does not exist.
If I hardcode the file name, e.g. file_name = '30795514-8b0b-4b17-8764-495b25d74100.csv', my code works fine and the copy succeeds.
Please help me understand how I can build the source and destination file names dynamically.
import boto3

s3 = session.client('s3')

athena_client = boto3.client(
    "athena",
    aws_access_key_id=AWS_ACCESS_KEY,
    aws_secret_access_key=AWS_SECRET_KEY,
    region_name=AWS_REGION,
)

def main():
    query = "select * from test_table"
    response = athena_client.start_query_execution(
        QueryString=query,
        ResultConfiguration={"OutputLocation": RESULT_OUTPUT_LOCATION}
    )
    queryId = response['QueryExecutionId']

    src_bucket = 'smg-datalake-prod-athena-query-results'
    dst_bucket = 'smg-datalake-prod-athena-query-results'
    file_name = str(queryId + ".csv")
    copy_object(src_bucket, dst_bucket, file_name)

def copy_object(src_bucket, dst_bucket, file_name):
    src_key = f'python-athena/{file_name}'
    dst_key = 'python-athena/cosmo/rss/v2/newsletter/kloka_latest.csv'
    # copy object to destination bucket
    s3.copy_object(Bucket=dst_bucket, CopySource={'Bucket': src_bucket, 'Key': src_key}, Key=dst_key)
After executing the Athena query, I just added some sleep before trying to move the file to the other location, and it started to work.
The code was running so fast that it tried to copy the file before it was present in the query results bucket.
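Rather than a fixed sleep, one option is to poll the query status until it reaches a terminal state before copying; a minimal sketch, reusing athena_client and queryId from the question:
import time

def wait_for_query(athena_client, query_id, poll_seconds=1):
    """Block until the Athena query reaches a terminal state."""
    while True:
        response = athena_client.get_query_execution(QueryExecutionId=query_id)
        state = response['QueryExecution']['Status']['State']
        if state in ('SUCCEEDED', 'FAILED', 'CANCELLED'):
            return state
        time.sleep(poll_seconds)

# Only copy the result file once the query has actually succeeded
if wait_for_query(athena_client, queryId) == 'SUCCEEDED':
    copy_object(src_bucket, dst_bucket, file_name)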
I am using the code below and have referred to many SO answers for listing files under a folder using boto3 and Python, but was unable to do so:
s3 = boto3.client('s3')
object_listing = s3.list_objects_v2(Bucket='maxValue',
                                    Prefix='madl-temp/')
My S3 path is "s3://madl-temp/maxValue/", where I want to find out whether there are any parquet files under the maxValue bucket, based on which I have to do something like below:
if len(maxValue) > 0:
    maxValue = True
else:
    maxValue = False
I am running it via Glue jobs and I am getting the below error:
botocore.errorfactory.NoSuchBucket: An error occurred (NoSuchBucket) when calling the ListObjectsV2 operation: The specified bucket does not exist
Your bucket name is madl-temp and the prefix is maxValue/, but in your boto3 call you have them the other way around. It should be:
s3 = boto3.client('s3')
object_listing = s3.list_objects_v2(Bucket='madl-temp',
                                    Prefix='maxValue/')
To get the number of files, you can use:
len(object_listing['Contents']) - 1
where the -1 accounts for the prefix key maxValue/ itself.
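If you specifically want the parquet check described in the question, you could filter the listing for .parquet keys; a minimal sketch using the same bucket and prefix:
import boto3

s3 = boto3.client('s3')
object_listing = s3.list_objects_v2(Bucket='madl-temp', Prefix='maxValue/')

# 'Contents' is missing entirely when nothing matches the prefix
parquet_keys = [
    obj['Key']
    for obj in object_listing.get('Contents', [])
    if obj['Key'].endswith('.parquet')
]

has_parquet_files = len(parquet_keys) > 0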
I have two different profiles on AWS. The S3 bucket and SNS topic are in profile A and my Lambda function is in profile B. When a new file is added to the S3 bucket, SNS triggers the Lambda function.
The Lambda function is then supposed to access the new file and process it using pandas. Here is what I'm doing now:
import io
import json
import logging

import boto3
import pandas as pd

sts_connection = boto3.client('sts')
acct_b = sts_connection.assume_role(
    RoleArn="arn:aws:iam::**************:role/AllowS3AccessFromAccountB",
    RoleSessionName="cross_acct_lambda"
)

ACCESS_KEY = acct_b['Credentials']['AccessKeyId']
SECRET_KEY = acct_b['Credentials']['SecretAccessKey']
SESSION_TOKEN = acct_b['Credentials']['SessionToken']

s3 = boto3.client('s3', aws_access_key_id=ACCESS_KEY,
                  aws_secret_access_key=SECRET_KEY, aws_session_token=SESSION_TOKEN)

path = get_file_path(event)
obj = s3.get_object(Bucket='my-bucket-name', Key=path)
csv_string = io.BytesIO(obj['Body'].read())

# Read a csv file and turn it into a DataFrame
df = pd.read_csv(csv_string, delimiter=';', engine='c', encoding='unicode_escape')
def get_file_path(event_body):
    """Function to get the manifest path and check that it is a manifest."""
    try:
        # Get message for first SNS record
        sns_message = json.loads(event_body["Records"][0]["Sns"]["Message"])
        path = sns_message["Records"][0]["s3"]["object"]["key"]
    except TypeError as ex:
        logging.error("Unable to parse event: " + str(event_body))
        raise ex
    return path
Everything works fine until the s3.get_object() part, where I'm getting the following error:
botocore.errorfactory.NoSuchKey: An error occurred (NoSuchKey) when calling the GetObject operation: The specified key does not exist.
Maybe I'm reading the file key in the wrong way?
Edit:
Here is what path looks like when I debugged it.
svv/sensor%3D11219V22151/year%3D2020/month%3D03/day%3D02/test.csv
And the S3 file structure is like this:
sensor-data/sensor=*******/year=2020/month=03/day=02
It seems like I need to use a regex to handle the equal signs, but there should be a more generic solution.
Here's a snippet I have in some Lambda code that is directly triggered by Amazon S3 (not via Amazon SNS):
import urllib
key = urllib.parse.unquote_plus(event['Records'][0]['s3']['object']['key'])
You could try similar parsing to see if it corrects the Key.
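Applied to the SNS-wrapped event from the question, that could look like the following sketch of get_file_path with the decoding added:
import json
import urllib.parse

def get_file_path(event_body):
    """Get the object key from the first SNS record and URL-decode it."""
    sns_message = json.loads(event_body["Records"][0]["Sns"]["Message"])
    raw_key = sns_message["Records"][0]["s3"]["object"]["key"]
    # e.g. turns 'sensor%3D11219V22151' back into 'sensor=11219V22151'
    return urllib.parse.unquote_plus(raw_key)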
I created an S3 bucket and placed both a data.csv and a data.json file inside it. I then created a SageMaker notebook and specified this S3 bucket in the IAM role.
This now works from inside the notebook:
import pandas as pd
from sagemaker import get_execution_role
bucket='my-sagemaker-bucket'
data_key = 'data.csv'
data_location = 's3://{}/{}'.format(bucket, data_key)
data = pd.read_csv(data_location)
But this errors saying file doesn't exist:
import json
from sagemaker import get_execution_role
bucket='my-sagemaker-bucket'
data_key = 'data.json'
data_location = 's3://{}/{}'.format(bucket, data_key)
data = json.load(open(data_location))
Does anyone know why I can read the CSV but not the JSON? I also can't shutil.copy the CSV to the notebook's current working directory (it also says the file doesn't exist). I'm not very well versed in S3 buckets or SageMaker, so I'm not sure if this is a permissions/policy issue or something else.
Your SageMaker execution role might have insufficient rights to access your S3 bucket. The default SageMaker execution role has the AmazonSageMakerFullAccess policy attached, which uses the S3 request condition "s3:ExistingObjectTag/SageMaker = true".
So you could try simply tagging your S3 bucket (tag: SageMaker: true), and double-check your IAM settings.
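If you prefer to apply the tag from code rather than the console, here is a minimal boto3 sketch (note that put_bucket_tagging replaces the bucket's entire existing tag set):
import boto3

s3 = boto3.client('s3')

# Caution: this overwrites any tags already on the bucket
s3.put_bucket_tagging(
    Bucket='my-sagemaker-bucket',
    Tagging={'TagSet': [{'Key': 'SageMaker', 'Value': 'true'}]}
)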
import pandas as pd
bucket='my-sagemaker-bucket'
data_key = 'data.json'
data_location = 's3://{}/{}'.format(bucket, data_key)
pd.read_json(data_location) # , orient='columns', typ='series'
Pandas can handle S3 URLs using your AWS credentials, so you could use pd.read_csv or pd.read_json instead of json.load. The suggestion from #Michael_S should also work.
In AWS, I'm trying to save a file to S3 in Python using a Lambda function. While this works on my local computer, I am unable to get it to work in Lambda. I've been working on this problem for most of the day and would appreciate help. Thank you.
def pdfToTable(PDFfilename, apiKey, fileExt, bucket, key):

    # parsing a PDF using an API
    fileData = (PDFfilename, open(PDFfilename, "rb"))
    files = {"f": fileData}
    postUrl = "https://pdftables.com/api?key={0}&format={1}".format(apiKey, fileExt)
    response = requests.post(postUrl, files=files)
    response.raise_for_status()

    # this code is probably the problem!
    s3 = boto3.resource('s3')
    bucket = s3.Bucket('transportation.manifests.parsed')
    with open('/tmp/output2.csv', 'rb') as data:
        data.write(response.content)
        key = 'csv/' + key
        bucket.upload_fileobj(data, key)

    # FYI, on my own computer, this saves the file
    with open('output.csv', "wb") as f:
        f.write(response.content)
In S3, there is a bucket transportation.manifests.parsed containing the folder csv where the file should be saved.
The type of response.content is bytes.
From AWS, the error from the current set-up above is [Errno 2] No such file or directory: '/tmp/output2.csv': FileNotFoundError. In fact, my goal is to save the file to the csv folder under a unique name, so tmp/output2.csv might not be the best approach. Any guidance?
In addition, I've tried using wb and w instead of rb, also to no avail. The error with wb is Input <_io.BufferedWriter name='/tmp/output2.csv'> of type: <class '_io.BufferedWriter'> is not supported. The documentation suggests that 'rb' is the recommended usage, but I do not understand why that would be the case.
Also, I've tried s3_client.put_object(Key=key, Body=response.content, Bucket=bucket) but receive An error occurred (404) when calling the HeadObject operation: Not Found.
Assuming Python 3.6: the way I usually do this is to wrap the bytes content in a BytesIO wrapper to create a file-like object. Then, per the boto3 docs, you can use the transfer manager for a managed transfer:
from io import BytesIO
import boto3
s3 = boto3.client('s3')
fileobj = BytesIO(response.content)
s3.upload_fileobj(fileobj, 'mybucket', 'mykey')
If that doesn't work, I'd double-check that all IAM permissions are correct.
You have a writable stream that you're asking boto3 to use as a readable stream, which won't work.
Write the file first, and then simply use bucket.upload_file() afterwards, like so:
s3 = boto3.resource('s3')
bucket = s3.Bucket('transportation.manifests.parsed')

# response.content is bytes, so write the temp file in binary mode
with open('/tmp/output2.csv', 'wb') as data:
    data.write(response.content)

key = 'csv/' + key
bucket.upload_file('/tmp/output2.csv', key)