I have been uploading text files to S3 and I came across this interesting error: the file contents aren't always uploaded, just the file name. So sometimes the entire file uploads, and sometimes I end up with a 0-byte file on S3. I have been following this tutorial:
http://stackabuse.com/example-upload-a-file-to-aws-s3/
Here is the code I have been using (minus keys and such):
#NOTE Section 8: Uploading to Amazon
AWS_ACCESS_KEY = ''
AWS_ACCESS_SECRET_KEY = ''
filea = open(date + '.txt', 'r+')
key = filea.name
bucket = ''
import os
import boto
from boto.s3.key import Key
##Beginning of function
def upload_to_s3(aws_access_key_id, aws_secret_access_key, filea, bucket, key, callback=None, md5=None, reduced_redundancy=False, content_type=None):
    """
    Uploads the given file to the AWS S3
    bucket and key specified.

    callback is a function of the form:

    def callback(complete, total)

    The callback should accept two integer parameters,
    the first representing the number of bytes that
    have been successfully transmitted to S3 and the
    second representing the size of the object to be
    transmitted.

    Returns boolean indicating success/failure of upload.
    """
    # try:
    #     size = os.fstat(file.fileno()).st_size
    # except:
    #     # Not all file objects implement fileno(),
    #     # so we fall back on this
    #     file.seek(0, os.SEEK_END)
    #     size = file.tell()
    conn = boto.connect_s3(aws_access_key_id, aws_secret_access_key)
    bucket = conn.get_bucket(bucket, validate=False)
    k = Key(bucket)
    k.key = key
    print k.key
    #if content_type:
    #    k.set_metadata('Content-Type', content_type)
    sent = k.set_contents_from_file(filea, cb=callback, md5=md5, reduced_redundancy=reduced_redundancy, rewind=True)
    print sent
    # Rewind for later use
    filea.seek(0)
    #print size
##End of function

upload_to_s3(AWS_ACCESS_KEY, AWS_ACCESS_SECRET_KEY, filea, bucket, key)
os.remove(date + '.txt')
Now some info about what I feed into this: earlier sections of the code write out a text file with multiple lines and paragraphs, but it is still a single text file, opened in a+ mode. The file is named using (date + '.txt') and is never explicitly closed with .close() in the earlier sections, unless the Python interpreter does something behind the scenes that I'm not aware of (.close() gave me a few issues, so I just left the file open, since the last line of my code here deletes it anyway).
I have tried looping the uploading process, but it seems like the file is just not read properly. What am I doing wrong?
Boto does not rewind the file to position 0 before it starts to upload. If the file pointer you pass to k.set_contents_from_file is not at the beginning of the file, then any data from the beginning of the file to its current position (as reported by fp.tell()) will not be sent. This is by design and I would not consider this a bug in boto.
If you want to be sure the entire file is uploaded to S3, make sure the file pointer is at the beginning of the file before passing it to boto. From the code you show above, you are doing a rewind after the upload but not before.
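For reference, a minimal sketch of the fix, assuming the same boto 2 API as in the question (upload_whole_file is just an illustrative helper name, not from the original code):

import boto
from boto.s3.key import Key

def upload_whole_file(conn, bucket_name, key_name, fileobj):
    # Upload the entire file object, regardless of its current position.
    bucket = conn.get_bucket(bucket_name, validate=False)
    k = Key(bucket)
    k.key = key_name
    fileobj.seek(0)                    # rewind BEFORE the upload so nothing is skipped
    k.set_contents_from_file(fileobj)  # alternatively, pass rewind=True and let boto seek for you
    fileobj.seek(0)                    # optional: rewind again for later use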
I'm looking to process files in an S3 location and re-upload them to a different location depending on the file type, which is specified in the key name. I want to use pagination so I can process in batches according to accumulated file size, but I'm unsure how to do this separately for each file type.
I establish a connection to the boto3 client and relevant S3 locations earlier in the code and check that the locations and objects within them exist.
My code is successfully giving me the number of each filetype in the location, but is not entering any of the for loops.
Should I be putting the paginator code within the if statements for each filetype instead?
if locationExists and filesFound:
    paginator = s3c.get_paginator("list_objects_v2")
    operation_parameters = {'Bucket': bucket,
                            'Prefix': path}
    page_iterator = paginator.paginate(**operation_parameters)

    lastresponse_objects = page_iterator.search("Contents[?contains(Key, `lastresponse`)][]")
    snapshot_objects = page_iterator.search("Contents[?contains(Key, `_snapshot_`)][]")
    snapshotupdate_objects = page_iterator.search("Contents[?contains(Key, `snapshotpriceupdate`)][]")

    if lastresponse_objects:
        total_LR_files = len(list(lastresponse_objects))
        print(f'Lastresponse files: {total_LR_files}')
        for file in lastresponse_objects:
            print(file.key)
            # processing logic for files containing 'lastresponse'

    if snapshot_objects:
        total_snapshot_files = len(list(snapshot_objects))
        print(f'Snapshot files: {total_snapshot_files}')
        for file in snapshot_objects:
            print(file.key)
            # processing logic for files containing '_snapshot_'

    if snapshotupdate_objects:
        total_snapshotupdate_files = len(list(snapshotupdate_objects))
        print(f'Snapshot update files: {total_snapshotupdate_files}')
        for file in snapshotupdate_objects:
            print(file.key)
            # processing logic for files containing 'snapshotpriceupdate'
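A hedged observation on the snippet above (my note, not part of the original post): page_iterator.search(...) returns a generator, so calling list(...) on it to get the count consumes it, and the following for loop then sees nothing; search() also yields plain dicts, so keys are read with obj['Key'] rather than .key. Materialising each search into a list first avoids the problem, for example:

# Sketch only: materialise the JMESPath search into a list so it can be
# counted and iterated more than once.
lastresponse_objects = list(page_iterator.search(
    "Contents[?contains(Key, `lastresponse`)][]"))

if lastresponse_objects:
    print(f'Lastresponse files: {len(lastresponse_objects)}')
    for obj in lastresponse_objects:
        print(obj['Key'])
        # processing logic for files containing 'lastresponse'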
I have a .txt file that I need to upload into a Dropbox folder. On my PC it works great as it is; however, I need to put the code into a Google Cloud Function, and since the GCP file system is read-only, this method is failing.
Can anyone recommend an alternative way of doing this that doesn't require me to save the data locally before pushing it up into Dropbox?
Here is my current working code for my local version:
import pathlib
import dropbox
api_key = 'XXXXXXXXXX'
# Build String And Save Locally To File
string = ["Item_A","Item_B","Item_C","Item_D"]
string = str(string)
with open('Item_List.txt', 'w') as f:
    f.write(string)
# Define Local File Path
localfolder = pathlib.Path(".")
localpath = localfolder / 'Item_List.txt'
# Define Dropbox Target Location
targetfile = '/Data/' + 'Item_List.txt'
# Initialize Dropbox
d = dropbox.Dropbox(api_key)
# Upload File To Dropbox
with localpath.open("rb") as f:
d.files_upload(f.read(), targetfile, mode=dropbox.files.WriteMode("overwrite"))
If you simply need byte data, you can use the built-in bytes function to convert a string to bytes (you also need to specify the encoding):
data = ["Item_A", "Item_B", "Item_C", "Item_D"]
string_data = str(data)
byte_data = bytes(string_data, encoding='utf-8')
And then later just use the byte data as the argument:
d.files_upload(byte_data, targetfile, mode=dropbox.files.WriteMode("overwrite"))
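Putting that together, a minimal sketch of the Cloud-Function-friendly version (no local file at all; the API key and Dropbox path are placeholders):

import dropbox

api_key = 'XXXXXXXXXX'               # placeholder
targetfile = '/Data/Item_List.txt'

items = ["Item_A", "Item_B", "Item_C", "Item_D"]
byte_data = bytes(str(items), encoding='utf-8')   # build the payload entirely in memory

d = dropbox.Dropbox(api_key)
d.files_upload(byte_data, targetfile, mode=dropbox.files.WriteMode("overwrite"))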
I need to download all content (including versions) of an Amazon S3 bucket and upload it to another Amazon S3 bucket. Don't tell me to use the AWS CLI; I just can't use it.
I use tempfile.TemporaryFile for this, and it apparently works: the print shows that the file object has the right content inside, but the uploaded files are empty (zero bytes).
with tempfile.TemporaryFile() as data:
    sourceUser.download_fileobj('source-bucket', key, data)
    # next 2 lines were just to check the content of the file
    data.seek(0)
    print(data.read())
    destinationUser.upload_fileobj(data, 'destination-bucket', key)
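A hedged note on the snippet above (my observation, not part of the original post): the debugging data.read() leaves the file position at the end, and upload_fileobj uploads from the current position, so zero bytes are sent. Seeking back to the start immediately before the upload should fix it:

import tempfile

with tempfile.TemporaryFile() as data:
    sourceUser.download_fileobj('source-bucket', key, data)
    data.seek(0)   # rewind before uploading (and again after any debug read())
    destinationUser.upload_fileobj(data, 'destination-bucket', key)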
I have the same requirement. How do I pass the NamedTemporaryFile to the S3 upload?
I'm not sure how to pass the NamedTemporaryFile's name to output=f'{file_name}.gpg' and to the load_file function --> filename=f_source.name.
with tempFile.NamedTemporaryFile("wb") as f_source:
s3_client.download_fileobj(s3_bucket, s3_key, f_source)
logger.info(f'{s3_key} file downloaded successfully to local {f_source}')
f_source.flush()
file_name = self.s3_key.split('/')[-1]
gpg = gnupg.GPG()
key_data = open(key_path).read()
import_result = gpg.import_keys(key_data)
f_source.seek(0)
with open(f_source.name, 'r+b') as f:
status = gpg.encrypt_file(
file=f,
recipients=[recipient],
output=f'{file_name}.gpg',
)
s3_hook.load_file(
filename=f_source.name,
key=s3_key,
bucket_name=s3_bucket,
replace=True
)
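A hedged sketch of one way to wire the names together (my assumption of the intent: encrypt into a second temporary file and hand its .name to load_file; s3_client, s3_bucket, s3_key, gpg, recipient and s3_hook are assumed to exist as in the snippet above, and the .gpg key suffix is illustrative):

import tempfile

with tempfile.NamedTemporaryFile("wb") as f_source, \
     tempfile.NamedTemporaryFile(suffix=".gpg") as f_encrypted:
    s3_client.download_fileobj(s3_bucket, s3_key, f_source)
    f_source.flush()

    # Encrypt the downloaded file into the second temporary file by path...
    with open(f_source.name, "rb") as f:
        gpg.encrypt_file(f, recipients=[recipient], output=f_encrypted.name)

    # ...then upload the encrypted file by its path.
    s3_hook.load_file(
        filename=f_encrypted.name,
        key=f"{s3_key}.gpg",
        bucket_name=s3_bucket,
        replace=True,
    )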
Is it possible to loop through the files/keys in an Amazon S3 bucket, read the contents, and count the number of lines using Python?
For Example:
1. My bucket: "my-bucket-name"
2. File/Key : "test.txt"
I need to loop through the file "test.txt" and count the number of lines in the raw file.
Sample Code:
for bucket in conn.get_all_buckets():
    if bucket.name == "my-bucket-name":
        for file in bucket.list():
            # need to count the number of lines in each file and print to a log.
Using boto3 you can do the following:
import boto3
# create the s3 resource
s3 = boto3.resource('s3')
# get the file object
obj = s3.Object('bucket_name', 'key')
# read the file contents in memory
file_contents = obj.get()["Body"].read()
# the body comes back as bytes, so count the newline byte to get the number of lines
print(file_contents.count(b'\n'))
If you want to do this for all objects in a bucket, you can use the following code snippet:
bucket = s3.Bucket('bucket_name')
for obj in bucket.objects.all():
    file_contents = obj.get()["Body"].read()
    print(file_contents.count(b'\n'))
Here is the reference to boto3 documentation for more functionality: http://boto3.readthedocs.io/en/latest/reference/services/s3.html#object
Update: (Using boto 2)
import boto

s3 = boto.connect_s3()  # establish connection
bucket = s3.get_bucket('bucket_name')  # get bucket

for key in bucket.list(prefix='key'):  # list objects at a given prefix
    file_contents = key.get_contents_as_string()  # get file contents
    print(file_contents.count('\n'))  # count the newline characters to get the number of lines
Reading large files into memory is sometimes far from ideal. Instead, you may find the following more useful:
import boto3

s3 = boto3.client('s3')
obj = s3.get_object(Bucket='bucketname', Key=fileKey)

nlines = 0
for _ in obj['Body'].iter_lines():
    nlines += 1
print(nlines)
Amazon S3 is only a storage service. You must get the file in order to perform actions on it (e.g. counting the number of lines).
You can loop through a bucket using the boto3 list_objects_v2 call. Because list_objects_v2 lists a maximum of 1,000 keys per call (even if you specify a higher MaxKeys), you must check whether NextContinuationToken exists in the response dictionary, and if so pass it as ContinuationToken to read the next page.
I wrote sample code for this in another answer, but I can't recall which one.
Then you use get_object() to read the file and apply simple line-count code.
(Update)
If you only need keys under a particular prefix, add the Prefix filter.
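As a hedged sketch of that loop (bucket and prefix names are placeholders, and count_lines is just an illustrative helper), combining list_objects_v2, ContinuationToken, and a per-object line count:

import boto3

s3 = boto3.client('s3')

def count_lines(bucket, prefix=''):
    # Count the lines of every object under the given prefix.
    totals = {}
    kwargs = {'Bucket': bucket, 'Prefix': prefix}
    while True:
        response = s3.list_objects_v2(**kwargs)
        for item in response.get('Contents', []):
            body = s3.get_object(Bucket=bucket, Key=item['Key'])['Body']
            totals[item['Key']] = sum(1 for _ in body.iter_lines())
        # list_objects_v2 returns at most 1,000 keys per call; follow the token
        if response.get('IsTruncated'):
            kwargs['ContinuationToken'] = response['NextContinuationToken']
        else:
            break
    return totals

print(count_lines('my-bucket-name', 'some/prefix/'))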
I have zip files uploaded to S3. I'd like to download them for processing. I don't need to permanently store them, but I need to temporarily process them. How would I go about doing this?
Because working software > comprehensive documentation:
Boto2
import zipfile
import boto
import io
# Connect to s3
# This will need your s3 credentials to be set up
# with `aws configure` using the aws CLI.
#
# See: https://aws.amazon.com/cli/
conn = boto.s3.connect_s3()
# get hold of the bucket
bucket = conn.get_bucket("my_bucket_name")
# Get hold of a given file
key = boto.s3.key.Key(bucket)
key.key = "my_s3_object_key"
# Create an in-memory bytes IO buffer
with io.BytesIO() as b:
    # Read the file into it
    key.get_file(b)
    # Reset the file pointer to the beginning
    b.seek(0)
    # Read the file as a zipfile and process the members
    with zipfile.ZipFile(b, mode='r') as zipf:
        for subfile in zipf.namelist():
            do_stuff_with_subfile()
Boto3
import zipfile
import boto3
import io
# this is just to demo. real use should use the config
# environment variables or config file.
#
# See: http://boto3.readthedocs.org/en/latest/guide/configuration.html
session = boto3.session.Session(
    aws_access_key_id="ACCESSKEY",
    aws_secret_access_key="SECRETKEY"
)
s3 = session.resource("s3")
bucket = s3.Bucket('stackoverflow-brice-test')
obj = bucket.Object('smsspamcollection.zip')
with io.BytesIO(obj.get()["Body"].read()) as tf:
# rewind the file
tf.seek(0)
# Read the file as a zipfile and process the members
with zipfile.ZipFile(tf, mode='r') as zipf:
for subfile in zipf.namelist():
print(subfile)
Tested on MacOSX with Python3.
If speed is a concern, a good approach would be to choose an EC2 instance fairly close to your S3 bucket (in the same region) and use that instance to unzip/process your zipped files.
This reduces latency and lets you process the files fairly efficiently. You can remove each extracted file after finishing your work.
Note: This will only work if you are fine using EC2 instances.
Pandas provides a shortcut for this, which removes most of the code from the top answer, and allows you to be agnostic about whether your file path is on s3, gcp, or your local machine.
import io
import zipfile

import pandas as pd

obj = pd.io.parsers.get_filepath_or_buffer(file_path)[0]

with io.BytesIO(obj.read()) as byte_stream:
    # Use your byte stream, to, for example, print file names...
    with zipfile.ZipFile(byte_stream, mode='r') as zipf:
        for subfile in zipf.namelist():
            print(subfile)
I believe you have heard of boto, the Python interface to Amazon Web Services.
You can download the key from S3 to a local file.
import os
import boto
from zipfile import ZipFile

s3 = boto.connect_s3()                     # connect
bucket = s3.get_bucket(bucket_name)        # get bucket
key = bucket.get_key(key_name)             # get key (the file in s3)
key.get_contents_to_filename(local_name)   # download it to a temporary local file

with ZipFile(local_name, 'r') as myzip:
    # do something with myzip
    pass

os.unlink(local_name)                      # delete it
You can also use tempfile. For more detail, see how to create and read from a tempfile.
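A hedged sketch of the tempfile variant (same boto 2 calls, with bucket_name and key_name as in the snippet above and a NamedTemporaryFile standing in for local_name):

import tempfile
import boto
from zipfile import ZipFile

s3 = boto.connect_s3()
bucket = s3.get_bucket(bucket_name)
key = bucket.get_key(key_name)

# The temporary file is deleted automatically when the with-block exits.
with tempfile.NamedTemporaryFile(suffix='.zip') as tmp:
    key.get_contents_to_file(tmp)    # download into the open file object
    tmp.seek(0)
    with ZipFile(tmp, 'r') as myzip:
        print(myzip.namelist())      # do something with myzip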
Reading a certain file from a zip file in an S3 bucket:
import boto3
import os
import zipfile
import io
import json
'''
When you configure awscli, you'll set up a credentials file located at
~/.aws/credentials. By default, this file will be used by Boto3 to authenticate.
'''
os.environ['AWS_PROFILE'] = "<profile_name>"
os.environ['AWS_DEFAULT_REGION'] = "<region_name>"
# Let's use Amazon S3
s3_name = "<bucket_name>"
zip_file_name = "<zip_file_name>"
file_to_open = "<file_to_open>"
s3 = boto3.resource('s3')
obj = s3.Object(s3_name, zip_file_name )
with io.BytesIO(obj.get()["Body"].read()) as tf:
# rewind the file
tf.seek(0)
# Read the file as a zipfile and process the members
with zipfile.ZipFile(tf, mode='r') as zipf:
file_contents= zipf.read(file_to_open).decode("utf-8")
print(file_contents)
Reference: based on @brice's answer.
Adding on to @brice's answer, here is the code if you want to read the data inside the file line by line:
with zipfile.ZipFile(tf, mode='r') as zipf:
    for line in zipf.read("xyz.csv").split(b"\n"):
        print(line)
        break  # to break off after the first line
Hope this helps!