How to upload an empty folder to S3 using Boto3? - python

My program backs up all the files in the directory, except for empty folders. How do you upload an empty folder into S3 using Boto 3, Python?
for dirName, subdirList, fileList in os.walk(path):
    # for each directory, walk through all files
    for fname in fileList:
        current_key = dirName[dir_str_index:] + "\\" + fname
        current_key = current_key.replace("\\", "/")

S3 doesn't really have folders:
Amazon S3 has a flat structure instead of a hierarchy like you would see in a file system. However, for the sake of organizational simplicity, the Amazon S3 console supports the folder concept as a means of grouping objects. Amazon S3 does this by using a shared name prefix for objects (that is, objects that have names that begin with a common string).
Since folders are just part of object names, you can't have empty folders in S3.
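That said, the S3 console simulates an empty folder by storing a zero-byte object whose key ends with a trailing slash. If you want the same effect from your backup script, a minimal sketch along those lines (the bucket and key names are placeholders) would be:

import boto3

s3 = boto3.client('s3')
# A zero-byte object whose key ends in "/" is displayed as an empty folder in the S3 console
s3.put_object(Bucket='my-bucket', Key='path/to/empty-folder/')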

Related

How to Create list of filenames in an S3 directory using pyspark and/or databricks utils

I need to move files from one S3 bucket directory into two others. I have to do this from a Databricks notebook. If the file has a json extension, I will move it into jsonDir. Otherwise, I will move it into otherDir. Presumably I would do this with pyspark and Databricks utils (dbutils).
I do not know the name of the S3 bucket, only the relative path off of it (call it MYPATH). For instance, I can do:
dbutils.fs.ls(MYPATH)
and it lists all the files in the S3 directory. Unfortunately with dbutils, you can move one file at a time or all of them (no wildcards). The bulk of my program is:
for file in fileList:
    if file.endswith("json"):
        dbutils.fs.mv(file, jsonDir)
        continue
    if not file.endswith("json"):
        dbutils.fs.mv(file, otherDir)
        continue
My Problem: I do not know how to retrieve the list of files from MYPATH to put them in array "fileList". I would be grateful for any ideas. Thanks.
I think your code runs if you do these minor changes:
fileList = dbutils.fs.ls(MYPATH)
for file in fileList:
    if file.name.endswith("/"):  # Don't copy dirs
        continue
    if file.name.endswith("json"):
        dbutils.fs.mv(file.path, jsonDir + file.name)
        continue
    if not file.name.endswith("json"):
        dbutils.fs.mv(file.path, otherDir + file.name)
        continue
Here, file.name is appended to keep the name of the file in the new directory. I needed this on Azure dbfs-backed storage; otherwise everything gets moved to the same blob.
It is critical that jsonDir and otherDir end with a / character.

Include only .gz extension files from S3 bucket

I want to process/download .gz files from an S3 bucket. There are more than 10,000 files on S3, so I am using
import boto3
s3 = boto3.resource('s3')
bucket = s3.Bucket('my-bucket')
objects = bucket.objects.all()
for object in objects:
    print(object.key)
This also lists .txt files, which I want to avoid. How can I do that?
The easiest way to filter objects by name or suffix is to do it within Python, such as using .endswith() to include/exclude objects.
You can filter by Prefix on the server side, but not by suffix.
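For example, a minimal sketch that keeps only .gz keys (the bucket name and prefix are placeholders):

import boto3

s3 = boto3.resource('s3')
bucket = s3.Bucket('my-bucket')

# Prefix narrows the listing server-side; the suffix check has to happen in Python
for obj in bucket.objects.filter(Prefix='logs/'):
    if obj.key.endswith('.gz'):
        print(obj.key)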

Unzip files using Python to one folder

I want to unzip files with Python 2.7.8. When I try to extract zip files that contain files with the same names into one folder, some files get lost because of the duplicate names. I tried this:
import zipfile, fnmatch, os

rootPath = r"C:\zip"
pattern = '*.zip'
for root, dirs, files in os.walk(rootPath):
    for filename in fnmatch.filter(files, pattern):
        print(os.path.join(root, filename))
        outpath = r"C:\Project\new"
        zipfile.ZipFile(os.path.join(root, filename)).extractall(r"C:\Project\new")
UPDATE:
I am trying to extract all the files located inside the zip files into one folder only, without creating new subfolders. If there are files with the same name, I need to keep all of them.
The ZipFile.extractall() method simply extracts the files and stores them one by one in the target path. If you want to preserve files with duplicated names, you will have to iterate over the members using ZipFile.namelist() and take appropriate action when you detect duplicates. ZipFile.read() allows you to read a file's contents, so you can write them wherever (and with whatever name) you want.
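A minimal sketch of that approach, using the paths from the question (the renaming scheme for duplicates is just one possible choice):

import os
import zipfile

outpath = r"C:\Project\new"
zip_path = r"C:\zip\archive.zip"  # placeholder for one of the zip files found by os.walk

with zipfile.ZipFile(zip_path) as zf:
    for member in zf.namelist():
        name = os.path.basename(member)
        if not name:  # skip directory entries
            continue
        target = os.path.join(outpath, name)
        # If a file with this name was already extracted, append a counter to keep both
        base, ext = os.path.splitext(target)
        counter = 1
        while os.path.exists(target):
            target = "%s_%d%s" % (base, counter, ext)
            counter += 1
        with open(target, 'wb') as out:
            out.write(zf.read(member))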

Can you list all folders in an S3 bucket?

I have a bucket containing a number of folders, and each folder contains a number of images. Is it possible to list all the folders without iterating through all the keys (folders and images) in the bucket? I'm using Python and boto.
You can use list() with an empty prefix (first parameter) and a folder delimiter (second parameter) to achieve what you're asking for:
s3conn = boto.connect_s3(access_key, secret_key, security_token=token)
bucket = s3conn.get_bucket(bucket_name)
folders = bucket.list('', '/')
for folder in folders:
    print folder.name
Remark:
In S3 there is no such thing as "folders". All you have is buckets and objects.
The objects represent files. When you name a file: name-of-folder/name-of-file it will look as if it's a file: name-of-file that resides inside folder: name-of-folder - but in reality there's no such thing as the "folder".
You can also use the AWS CLI (Command Line Interface): the command aws s3 ls s3://<bucket-name> will list only the "folders" at the first level of the bucket.
Yes! You can list them by using the prefix and delimiter parameters of a key listing. Have a look at the following documentation:
http://docs.aws.amazon.com/AmazonS3/latest/dev/ListingKeysHierarchy.html
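For completeness, the same delimiter-based listing with boto3 would look roughly like this (the bucket name is a placeholder):

import boto3

s3 = boto3.client('s3')
# With Delimiter='/', the top-level "folders" are returned as CommonPrefixes
paginator = s3.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket='my-bucket', Delimiter='/'):
    for prefix in page.get('CommonPrefixes', []):
        print(prefix['Prefix'])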

How should i create the index of files using boto and S3 using python django

I have a folder structure with 2000 files on S3.
I want to run a program every week that gets the list of files and folders from S3 and populates a database.
Then I use that database to show the same folder structure on the site.
I have two problems:
How can I get the list of folders from there and then store them in MySQL? Do I need to grab all the file names and then split on "/"? But it looks difficult to see which files belong to which folders. I have found this https://stackoverflow.com/a/17096755/1958218 but could not find where the listObjects() function is.
Doesn't the get_all_keys() method of the S3 bucket do what you need?
s3 = boto.connect_s3()
b = s3.get_bucket('bucketname')
keys = b.get_all_keys()
then iterate over keys, do os.path.split and unique...
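A minimal sketch of that idea (the resulting set of folder paths is what you would then insert into MySQL):

import os
import boto

s3 = boto.connect_s3()
b = s3.get_bucket('bucketname')

folders = set()
for key in b.get_all_keys():
    # e.g. 'photos/2014/img.jpg' -> folder part 'photos/2014'
    folder, filename = os.path.split(key.name)
    if folder:
        folders.add(folder)

for folder in sorted(folders):
    print(folder)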
