How to get list_blobs to behave like gsutil - python

I would like to only get the first level of a fake folder structure on GCS.
If I run e.g.:
gsutil ls 'gs://gcp-public-data-sentinel-2/tiles/'
I get a list like this:
gs://gcp-public-data-sentinel-2/tiles/01/
gs://gcp-public-data-sentinel-2/tiles/02/
gs://gcp-public-data-sentinel-2/tiles/03/
gs://gcp-public-data-sentinel-2/tiles/04/
gs://gcp-public-data-sentinel-2/tiles/05/
gs://gcp-public-data-sentinel-2/tiles/06/
gs://gcp-public-data-sentinel-2/tiles/07/
gs://gcp-public-data-sentinel-2/tiles/08/
gs://gcp-public-data-sentinel-2/tiles/09/
gs://gcp-public-data-sentinel-2/tiles/10/
gs://gcp-public-data-sentinel-2/tiles/11/
gs://gcp-public-data-sentinel-2/tiles/12/
gs://gcp-public-data-sentinel-2/tiles/13/
gs://gcp-public-data-sentinel-2/tiles/14/
gs://gcp-public-data-sentinel-2/tiles/15/
.
.
.
Running code like the following with the Python API gives me an empty result:
from google.cloud import storage

bucket_name = 'gcp-public-data-sentinel-2'
prefix = 'tiles/'

storage_client = storage.Client()
bucket = storage_client.get_bucket(bucket_name)
for blob in bucket.list_blobs(max_results=10, prefix=prefix, delimiter='/'):
    print(blob.name)
If I don't use the delimiter option, I get every object in the bucket, which is not very useful.

Maybe not the best way, but, inspired by this comment on the official repository:
iterator = bucket.list_blobs(delimiter='/', prefix=prefix)
response = iterator._get_next_page_response()
for prefix in response['prefixes']:
    print('gs://' + bucket_name + '/' + prefix)
Gives:
gs://gcp-public-data-sentinel-2/tiles/01/
gs://gcp-public-data-sentinel-2/tiles/02/
gs://gcp-public-data-sentinel-2/tiles/03/
gs://gcp-public-data-sentinel-2/tiles/04/
gs://gcp-public-data-sentinel-2/tiles/05/
gs://gcp-public-data-sentinel-2/tiles/06/
gs://gcp-public-data-sentinel-2/tiles/07/
gs://gcp-public-data-sentinel-2/tiles/08/
gs://gcp-public-data-sentinel-2/tiles/09/
gs://gcp-public-data-sentinel-2/tiles/10/
...

If, like me, you find this question after a long time: currently (google-cloud-storage 2.1.0) you can list the bucket contents using '//' instead of '/'. However, it lists "recursively" down to the actual blobs (since GCS is not a real file system).
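For completeness, the public API also exposes the prefixes on the iterator itself once it has been consumed, so the private _get_next_page_response() call is not needed; a minimal sketch reusing the bucket and prefix from the question:

from google.cloud import storage

storage_client = storage.Client()
bucket = storage_client.bucket('gcp-public-data-sentinel-2')

# The delimiter makes the listing stop at the first '/' after the prefix.
iterator = bucket.list_blobs(prefix='tiles/', delimiter='/')

# prefixes is only populated once the iterator has been consumed.
list(iterator)
for p in sorted(iterator.prefixes):
    print('gs://' + bucket.name + '/' + p)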

Here is a faster way (found in a GitHub thread, posted by @evanj: https://github.com/GoogleCloudPlatform/google-cloud-python/issues/920):
def list_gcs_directories(bucket, prefix):
    iterator = bucket.list_blobs(prefix=prefix, delimiter='/')
    prefixes = set()
    for page in iterator.pages:
        print(page, page.prefixes)
        prefixes.update(page.prefixes)
    return prefixes
You want to call this function as follows:
client = storage.Client()
bucket_name = 'my_bucket_name'
bucket_obj = client.bucket(bucket_name)
list_folders = list_gcs_directories(bucket_obj, prefix='my/prefix/path/within/bucket/')

# Getting rid of the prefix, keeping only the last path component
list_folders = [indiv_folder.rstrip('/').split('/')[-1]
                for indiv_folder in list_folders]

Related

How to mock list of response objects from boto3?

I'd like to get all the archives from a specific directory in an S3 bucket, like the following:
import boto3

def get_files_from_s3(bucket_name, s3_prefix):
    files = []
    s3_resource = boto3.resource("s3")
    bucket = s3_resource.Bucket(bucket_name)
    response = bucket.objects.filter(Prefix=s3_prefix)
    for obj in response:
        if obj.key.endswith('.zip'):
            # get all archives
            files.append(obj.key)
    return files
My question is about testing it: I'd like to mock the list of objects in the response so that I can iterate over it. Here is what I tried:
from unittest.mock import patch
from dataclasses import dataclass

@dataclass
class MockZip:
    key = 'file.zip'

@patch('module.boto3')
def test_get_files_from_s3(self, mock_boto3):
    bucket = mock_boto3.resource('s3').Bucket(self.bucket_name)
    response = bucket.objects.filter(Prefix=S3_PREFIX)
    response.return_value = [MockZip()]
    files = module.get_files_from_s3(BUCKET_NAME, S3_PREFIX)
    self.assertEqual(['file.zip'], files)
I get an assertion error like this: E AssertionError: ['file.zip'] != []
Does anyone have a better approach? I used a dataclass, but I don't think that is the problem; I guess I get an empty list because the mocked response is not iterable. So how can I mock it to be a list of mock objects instead of just a MagicMock?
Thanks
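One way to make the direct unittest.mock approach above work: a MagicMock returns the same .return_value for any call, so the fake list has to be attached to the filter attribute itself rather than to the object that calling filter returns. A minimal sketch, assuming get_files_from_s3 lives in a hypothetical module.py as in the question:

import unittest
from unittest.mock import patch

import module  # hypothetical module containing get_files_from_s3

class MockZip:
    key = 'file.zip'

class TestGetFilesFromS3(unittest.TestCase):
    @patch('module.boto3')
    def test_get_files_from_s3(self, mock_boto3):
        # Configure the filter(...) call itself to yield the fake objects.
        mock_bucket = mock_boto3.resource.return_value.Bucket.return_value
        mock_bucket.objects.filter.return_value = [MockZip()]

        files = module.get_files_from_s3('my-bucket', 'my/prefix/')
        self.assertEqual(['file.zip'], files)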
You could use moto, an open-source library built specifically to mock boto3 calls. It allows you to work directly with boto3, without having to worry about setting up mocks manually.
The test function you're currently using would look like this:
import os

import boto3
import pytest
from moto import mock_s3

@pytest.fixture(scope='function')
def aws_credentials():
    """Mocked AWS Credentials, to ensure we're not touching AWS directly"""
    os.environ['AWS_ACCESS_KEY_ID'] = 'testing'
    os.environ['AWS_SECRET_ACCESS_KEY'] = 'testing'
    os.environ['AWS_SECURITY_TOKEN'] = 'testing'
    os.environ['AWS_SESSION_TOKEN'] = 'testing'

@mock_s3
def test_get_files_from_s3(self, aws_credentials):
    s3 = boto3.resource('s3')
    bucket = s3.Bucket(self.bucket_name)
    # Create the bucket first, as we're interacting with an empty mocked 'AWS account'
    bucket.create()
    # Create some example files that are representative of what the S3 bucket would look like in production
    client = boto3.client('s3', region_name='us-east-1')
    client.put_object(Bucket=self.bucket_name, Key="file.zip", Body="...")
    client.put_object(Bucket=self.bucket_name, Key="file.nonzip", Body="...")
    # Retrieve the files again using whatever logic
    files = module.get_files_from_s3(BUCKET_NAME, S3_PREFIX)
    self.assertEqual(['file.zip'], files)
Full documentation for Moto can be found here:
http://docs.getmoto.org/en/latest/index.html
Disclaimer: I am a maintainer for Moto.

Downloading only AWS S3 object file names and image URL in CSV Format

I have files hosted in an AWS S3 bucket, and I need all the S3 object URLs in a CSV file.
Please suggest.
You can get all S3 object URLs by using the AWS SDK for S3. First, read all the items in the bucket. The snippets below are Java (AWS SDK for Java v2), but you can port the logic to Python:
ListObjectsRequest listObjects = ListObjectsRequest
        .builder()
        .bucket(bucketName)
        .build();

ListObjectsResponse res = s3.listObjects(listObjects);
List<S3Object> objects = res.contents();

for (ListIterator iterVals = objects.listIterator(); iterVals.hasNext(); ) {
    S3Object myValue = (S3Object) iterVals.next();
    System.out.print("\n The name of the key is " + myValue.key());
}
Then iterate through the list and get the key as shown above. For each object, you can get the URL using Java code similar to this:
GetUrlRequest request = GetUrlRequest.builder()
        .bucket(bucketName)
        .key(keyName)
        .build();

URL url = s3.utilities().getUrl(request);
System.out.println("The URL for " + keyName + " is " + url.toString());
Put each URL value into a collection and then write the collection out to a CSV. That is how you achieve your use case.
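Since the question is about Python, here is a hedged boto3 sketch of the same flow; the bucket name, region, and output path are placeholders, and the URL format assumes the standard virtual-hosted-style endpoint (it only resolves publicly if the object itself is public):

import csv

import boto3

bucket_name = 'my-bucket'   # placeholder
region = 'us-east-1'        # placeholder
s3 = boto3.client('s3', region_name=region)

rows = []
paginator = s3.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket=bucket_name):
    for obj in page.get('Contents', []):
        key = obj['Key']
        # Virtual-hosted-style URL for the object.
        url = f'https://{bucket_name}.s3.{region}.amazonaws.com/{key}'
        rows.append((key, url))

with open('object_urls.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['key', 'url'])
    writer.writerows(rows)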

Get Properties of storage blobs returning empty dict

I've just uploaded 5 GB of data and would like to verify that the MD5 sums match. I've calculated this for my local copy of the files, but am having problems fetching ContentMD5 from Azure. So far, I get an empty dict, but I can see the blob names. I've limited it to the first 10 items at the moment, just for debugging. I'm aware that MD5 is stored differently on Azure than what a typical md5sum call produces, and have allowed for that locally. But, currently, I cannot see any blob properties. The properties are there when I browse via the Azure console (as is the ContentMD5 property).
Where am I going wrong?
Here's my code at the moment:
import os
import sys

from azure.storage.blob import BlobServiceClient

def remote_check(connection_str):
    blob_service_client = BlobServiceClient.from_connection_string(connection_str)
    container_name = "global"
    container = blob_service_client.get_container_client(container=container_name)
    blob_list = container.list_blobs()
    count = 0
    for blob in blob_list:
        if count < 10:
            blob_client = blob_service_client.get_blob_client(container=container_name, blob=blob)
            a = blob_client.get_blob_properties()
            print(a.metadata)
            print("Blob name: " + str(blob_client.blob_name))
            count = count + 1
        else:
            break

def main():
    try:
        CONNECTION_STRING = os.environ['AZURE_STORAGE_CONNECTION_STRING']
        remote_check(CONNECTION_STRING)
    except KeyError:
        print("AZURE_STORAGE_CONNECTION_STRING must be set.")
        sys.exit(1)

if __name__ == '__main__':
    main()
Please make sure you're using the latest version of package azure-storage-blob 12.6.0.
Some properties are in the content_settings, for example, to get content_md5, you should use the following code:
a = blob_client.get_blob_properties()
print(a.content_settings.content_md5)
Maybe you can check the blob properties with a REST call (e.g. with a REST client like Postman), as described here:
https://learn.microsoft.com/en-us/rest/api/storageservices/get-blob-properties
The Content-MD5 value is returned as an HTTP response header.
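If the goal is to compare against a locally computed checksum, note that the SDK returns content_md5 as the raw 16-byte digest (the REST Content-MD5 header carries the same value base64-encoded), so compare digests rather than hex strings. A minimal sketch, reusing a blob_client like the one in the question's loop and assuming a corresponding local file local.bin:

import base64
import hashlib

# Raw MD5 digest of the local copy.
with open('local.bin', 'rb') as f:
    local_digest = hashlib.md5(f.read()).digest()

props = blob_client.get_blob_properties()
# content_md5 holds the raw digest bytes; it can be None if the service
# never received an MD5 for the blob (e.g. some chunked uploads).
remote_md5 = props.content_settings.content_md5

print('match:', remote_md5 is not None and bytes(remote_md5) == local_digest)
print('base64 form (as in the Content-MD5 header):', base64.b64encode(local_digest).decode())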

Can azure data lake files be filtered based on Last Modified time using azure python sdk?

I am trying to perform in-memory operations on files stored in Azure Data Lake. I am unable to find documentation on using a matching pattern without using the ADL Downloader.
For a single file, this is the code I use
filename = '/<folder>/<filename>.json'
with adlsFileSystemClient.open(filename) as f:
    for line in f:
        <file-operations>
But how do we filter based on filename (string matching) or based on last modified date?
When I used U-SQL, I had the option to filter the fileset based on the last modified time.
DECLARE EXTERNAL @TodaysTime = DateTime.UtcNow.AddDays(-1);

@rawInput =
    EXTRACT jsonString string,
            uri = FILE.URI(),
            modified_date = FILE.MODIFIED()
    FROM @in
    USING Extractors.Tsv(quoting : true);

@parsedInput =
    SELECT *
    FROM @rawInput
    WHERE modified_date > @TodaysTime;
Are there any similar options to filter the files modified during a specified period when using adlsFileSystemClient?
Github Issue: https://github.com/Azure/azure-data-lake-store-python/issues/300
Any help is appreciated.
Note:
This question was answered by akharit on GitHub recently. I am providing his answer below, which solves my requirement.
There isn't any built-in functionality in the ADLS SDK itself, as there is no server-side API that will return only files modified within the last 4 hours.
It should be easy to write the code to do that after you get the list of all entries.
The modification time field returns milliseconds since the Unix epoch, which you can convert to a Python datetime object by:
from datetime import datetime, timedelta
datetime.fromtimestamp(file['modificationTime'] / 1000)
And then something like
filtered = [file['name']
            for file in adl.ls('/', detail=True)
            if (datetime.now() - datetime.fromtimestamp(file['modificationTime'] / 1000)) < timedelta(hours=4)]
You can use walk instead of ls for recursive enumeration as well.
Based on the code below, you can find the container-level directories and file names, with file properties including the last_modified date. You can then filter the files based on the last_modified date, as sketched after the code.
from azure.storage.blob import BlockBlobService
from datetime import datetime

block_blob_service = BlockBlobService(account_name='account_name', account_key='account_key')
container_name = 'container_name'
second_container_name = 'container_name_second'
# block_blob_service.create_container(container_name)

generator = block_blob_service.list_blobs(container_name, prefix="Recovery/")
report_time = datetime.now().strftime('%Y-%m-%d %H:%M:%S')

myfile = open('/dbfs/adlsaudit/auditfiles2', 'w')
for blob in generator:
    properties = block_blob_service.get_blob_properties(container_name, blob.name).properties
    last_modified = properties.last_modified
    file_size = properties.content_length
    # print("\t Recovery: " + blob.name + " : " + str(file_size) + " : " + str(last_modified))
    line = container_name + '|' + second_container_name + '|' + blob.name + '|' + str(file_size) + '|' + str(last_modified) + '|' + str(report_time)
    myfile.write(line + '\n')
myfile.close()
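The listing above writes out every blob; to actually restrict it to a time window, one option (a sketch using the same BlockBlobService setup, with a 4-hour window as an arbitrary example) is to compare each blob's last_modified timestamp against a cutoff:

from datetime import datetime, timedelta, timezone

cutoff = datetime.now(timezone.utc) - timedelta(hours=4)

recent_blobs = [
    blob.name
    for blob in block_blob_service.list_blobs(container_name, prefix="Recovery/")
    # last_modified is a timezone-aware UTC datetime, so compare against an aware cutoff.
    if blob.properties.last_modified >= cutoff
]
print(recent_blobs)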

Amazon AWS - S3 to ElasticSearch (Python Lambda)

I'd like to copy data from an S3 directory to the Amazon Elasticsearch Service. I've tried following the guide, but unfortunately the part I'm looking for is missing. I don't know what the lambda function itself should look like (all the info about this in the guide is: "Place your application source code in the eslambda folder."). I'd like ES to auto-index the files.
Currently I'm trying
for record in event['Records']:
    bucket = record['s3']['bucket']['name']
    key = urllib.unquote_plus(record['s3']['object']['key'])
    index_name = event.get('index_name', key.split('/')[0])
    object = s3_client.Object(bucket, key)
    data = object.get()['Body'].read()
    helpers.bulk(es, data, chunk_size=100)
But I get a massive error stating:
elasticsearch.exceptions.RequestError: TransportError(400, u'action_request_validation_exception', u'Validation Failed: 1: index is missing;2: type is missing;3: index is missing;4: type is missing;5: index is missing;6: type is missing;7: ...
Could anyone explain how I can set things up so that my data gets moved from S3 to ES, where it gets auto-mapped and auto-indexed? Apparently it's possible, as mentioned in the references here and here.
While the mapping can be assigned automatically in Elasticsearch, the index name is not chosen for you: you have to specify the index name and type in the request. If that index does not exist yet, Elasticsearch will create it automatically. Based on your error, it looks like you're not passing an index and type through.
For example, here's a simple POST request that adds a record to the index MyIndex and type MyType, creating them first if they don't already exist:
curl -XPOST 'example.com:9200/MyIndex/MyType/' \
-d '{"name":"john", "tags" : ["red", "blue"]}'
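Translated to the Python client used in the question, the same fix means giving every bulk action an explicit index (and, on older Elasticsearch versions, a type). A hedged sketch, assuming es is an Elasticsearch client and that the S3 object contains one JSON document per line; the function name is just for illustration:

import json

from elasticsearch import helpers

def index_lines(es, raw_bytes, index_name):
    # Build one bulk action per JSON line, each carrying its target index/type.
    actions = (
        {
            '_index': index_name,
            '_type': 'doc',  # only needed on older Elasticsearch versions
            '_source': json.loads(line),
        }
        for line in raw_bytes.decode('utf-8').splitlines()
        if line.strip()
    )
    helpers.bulk(es, actions, chunk_size=100)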
I wrote a script to download a CSV file from S3 and then transfer the data to ES.
1. Made an S3 client using boto3 and downloaded the file from S3.
2. Made an ES client to connect to Elasticsearch.
3. Opened the CSV file and used the helpers module from elasticsearch to insert the CSV file contents into Elasticsearch.
main.py
import csv
import os

import boto3
from elasticsearch import helpers, Elasticsearch

from config import *

# S3
Downloaded_Filename = os.path.basename(Prefix)
s3 = boto3.client('s3', aws_access_key_id=awsaccesskey, aws_secret_access_key=awssecretkey, region_name=awsregion)
s3.download_file(Bucket, Prefix, Downloaded_Filename)

# ES
ES_index = Downloaded_Filename.split(".")[0]
ES_client = Elasticsearch([ES_host], http_auth=(ES_user, ES_password), port=ES_port)

# S3 to ES
with open(Downloaded_Filename) as f:
    reader = csv.DictReader(f)
    helpers.bulk(ES_client, reader, index=ES_index, doc_type='my-type')
config.py
awsaccesskey = ""
awssecretkey = ""
awsregion = "us-east-1"
Bucket=""
Prefix=''
ES_host = "localhost"
ES_port = "9200"
ES_user = "elastic"
ES_password = "changeme"
