I want to:
Access all GCP projects linked to my Google account.
Get all buckets that contain the word foobar in their name.
Retrieve some of the metadata fields provided by Google (Creation time, Update time, Storage class, Content-Length, Content-Type, Hash (crc32c), Hash, ETag, Generation, Metageneration, ACL, TOTAL), for example Creation time, Content-Type and TOTAL.
Save the results in a .csv / dataframe format with fields like: foobar, Creation time, Content-Type, TOTAL
I don't want to:
Grab metadata from sub-directories. I think only files have metadata, but in case sub-directories have it too, I don't want it.
Overdo the parsing through folders. Some of the buckets have tons of subdirectories, and I want the cheapest possible way to get to the objects of interest.
What I have so far:
I use gcloud projects list to get all projects linked to my account.
I manually create a .csv file with the fields: project_id, recursive, selected. recursive TRUE is for those projects I know don't have that many folders, so I can afford to look through all sub-directories. selected TRUE just lets me go through some of the projects instead of all of them.
For all the projects where the selected field is TRUE I collect the data and save it in a file with the following command:
gsutil ls -L -p "${project}" gs://*foobar* >> non_recursive.csv
For all the projects where both the selected and the recursive fields are TRUE I collect the data and save it in a file with the following command:
gsutil ls -r -L -p "${project}" gs://*foobar* >> recursive.csv
So my questions:
How can I modify this: gsutil ls -L -p "${project}" gs://*foobar* >> non_recursive.csv to collect only some of the metadata fields and to output it in the dataframe format mentioned above?
Is there a better way to do the above? (Python or Bash solutions only please)
You can generate a list of the files for which you want to fetch metadata, and then generate a gsutil ls command for each, e.g.,
sed 's/\(.*\)/gsutil ls -L \1/' objects_to_list | sh
If there are a large number of such objects you could do the listings in parallel, e.g.,
sed 's/\(.*\)/gsutil ls -L \1/' objects_to_list | split -l 100 - LISTING_PART
for f in LISTING_PART*; do
sh $f > $f.out &
done
wait
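Since the question allows Python, here is a rough equivalent of the parallel listing above, shelling out to the same gsutil ls -L calls from a thread pool (a sketch; objects_to_list is the same one-URL-per-line file as in the shell example):
import subprocess
from concurrent.futures import ThreadPoolExecutor

with open("objects_to_list") as f:
    urls = [line.strip() for line in f if line.strip()]

def list_metadata(url):
    # mirrors `gsutil ls -L <url>` and returns its output
    return subprocess.run(["gsutil", "ls", "-L", url],
                          capture_output=True, text=True).stdout

with ThreadPoolExecutor(max_workers=10) as pool:
    for listing in pool.map(list_metadata, urls):
        print(listing)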
This gets the filename and mimeType:
from google.cloud import storage

storage_client = storage.Client()

blobs = storage_client.list_blobs(BUCKET)  # BUCKET is the bucket name or a Bucket object
for blob in blobs:
    item = {'content': "gs://{}/{}".format(blob.bucket.name, blob.name),
            'mimeType': "{}".format(blob.content_type)}
    print(item)
You can get other metadata the same way.
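To get straight to the CSV the question asks for, here is a hedged sketch (not a drop-in; the project list, the foobar filter and the output filename are placeholder assumptions) that uses the google-cloud-storage client to loop over projects, keep only buckets whose names contain foobar, and write Creation time, Content-Type and the object size for every object. Note that list_blobs is a flat listing of objects, so there is no per-folder recursion cost, and gsutil's TOTAL line is an aggregate over the listing, so the per-object size is written here and can be summed per bucket if needed.
import csv
from google.cloud import storage

project_ids = ["my-project-1", "my-project-2"]  # e.g. parsed from `gcloud projects list`

with open("foobar_metadata.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["object", "creation_time", "content_type", "size_bytes"])
    for project in project_ids:
        client = storage.Client(project=project)
        for bucket in client.list_buckets():
            if "foobar" not in bucket.name:
                continue
            for blob in client.list_blobs(bucket.name):
                writer.writerow([
                    "gs://{}/{}".format(bucket.name, blob.name),
                    blob.time_created,   # Creation time
                    blob.content_type,   # Content-Type
                    blob.size,           # size in bytes; sum per bucket for a TOTAL
                ])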
Related
I have an AWS S3 structure that looks like this:
bucket_1
|
|__folder_1
| |__file_1
| |__file_2
|
|__folder_2
|__file_1
|__file_2
bucket_2
And I am trying to find a "good way" (efficient and cost effective) to achieve the following:
bucket_1
|
|__folder_1
| |__file_1
| |__file_2
|
|__folder_2
|__file_1
|__file_2
bucket_2
|
|__folder_1_file_1
|__folder_2_file_1
|__processed_file_2
Where:
folder_1_file_1 and folder_2_file_1 are the original two file_1 that have been copied/renamed (prepending the folder path to the file_name) into the new bucket
processed_file_2 is a file that depends on the content of the two file_2 (e.g., if the file_2 were text files, processed_file_2 might be a joint text file where the two original files are appended to each other; note that this is just an example).
I do have a Python script that does this for me locally (copy/rename the files, process the other files and move them to a new folder), but I'm not sure what tools I should use to do this on AWS without having to download the data, process it, and re-upload it.
I have done some reading, and I've seen that AWS Lambda might be one way of doing this, but I'm not sure it's the ideal solution. I'm not even sure whether I should keep this as a Python script or look at other approaches (I'm open to other programming languages/tools, as long as they are a good fit for my problem).
As a plus, it would be useful to have this process triggered either every N days or when a certain threshold of files has been reached, but a semi-automated solution (where I manually run the script/use the tool) would also be acceptable.
[Move and Rename objects within s3 bucket using boto3]
import boto3

s3_resource = boto3.resource('s3')

# Copy object A as object B (CopySource must include the source bucket)
s3_resource.Object("bucket_name", "newpath/to/object_B.txt").copy_from(
    CopySource="bucket_name/path/to/your/object_A.txt")

# Delete the former object A
s3_resource.Object("bucket_name", "path/to/your/object_A.txt").delete()
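To apply the same server-side copy across a whole bucket, as in the folder_1_file_1 / folder_2_file_1 renaming above, here is a hedged sketch (the bucket names and the underscore separator are assumptions taken from the question's example):
import boto3

s3 = boto3.resource("s3")
source = s3.Bucket("bucket_1")   # assumed source bucket name

for obj in source.objects.all():
    # "folder_1/file_1" -> "folder_1_file_1"
    new_key = obj.key.replace("/", "_")
    # copy_from is a server-side copy, so nothing is downloaded locally
    s3.Object("bucket_2", new_key).copy_from(
        CopySource={"Bucket": "bucket_1", "Key": obj.key})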
You could move the files within the s3 bucket using the s3fs module.
import s3fs
path1 = 's3://bucket_name/folder1/sample_file.pkl'
path2 = 's3://bucket_name2/folder2/sample_file.pkl'
s3 = s3fs.S3FileSystem()
s3.move(path1, path2)
If you need to pass credentials explicitly, you can pass them within the client_kwargs of S3FileSystem as shown below:
import s3fs
path1 = 's3://bucket_name/folder1/sample_file.pkl'
path2 = 's3://bucket_name/folder2/sample_file.pkl'
credentials = {}
credentials.setdefault("region_name", r_name)              # the region
credentials.setdefault("aws_access_key_id", a_key)         # the access key id
credentials.setdefault("aws_secret_access_key", s_a_key)   # the secret access key
s3 = s3fs.S3FileSystem(client_kwargs=credentials)
s3.move(path1, path2)
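To move everything under a prefix rather than a single file, a sketch along the same lines (the bucket and prefix names are hypothetical): it lists the keys with s3fs first and then moves them one by one, folding the folder name into the file name as in the question:
import s3fs

s3 = s3fs.S3FileSystem()
# find() returns every key under the prefix, e.g. 'bucket_1/folder_1/file_1'
for key in s3.find('bucket_1/folder_1'):
    filename = key.split('/', 1)[1].replace('/', '_')   # -> 'folder_1_file_1'
    s3.move(key, 'bucket_2/' + filename)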
I tried GitPython, but I am only getting the hash of the current (HEAD) commit.
import git
repo = git.Repo(search_parent_directories=True)
repo.head.commit.hexsha
But, for traceability, I want to store the git commit hash of a specific file, i.e. the equivalent of this command (using git):
git log -n 1 --pretty=format:%h -- experiments/test.yaml
Is it possible to achieve this with GitPython?
An issue like "How do I get the SHA key for any repository file?" points to the Tree object, which provides recursive traversal of git trees, with access to all metadata, including the SHA-1 hash.
self.assertEqual(tree['smmap'], tree / 'smmap')  # access by index and by sub-path
for entry in tree:  # intuitive iteration of tree members
    print(entry)
blob = tree.trees[0].blobs[0]  # let's get a blob in a sub-tree
assert blob.name
blob.hexsha would be the SHA1 of the blob.
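Note that blob.hexsha identifies the file content, not the commit that last changed it. For the exact equivalent of git log -n 1 -- <path>, here is a sketch using repo.iter_commits with its paths argument (the file path is the one from the question):
import git

repo = git.Repo(search_parent_directories=True)
# last commit that touched this file, i.e. `git log -n 1 -- experiments/test.yaml`
last_commit = next(repo.iter_commits(paths='experiments/test.yaml', max_count=1))
print(last_commit.hexsha)       # full hash
print(last_commit.hexsha[:7])   # short hash, like --pretty=format:%h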
I am using HTCondor to generate some data (txt, png). When my program runs, it creates a directory next to the .sub file, named Datasets, where the datasets are stored. Unfortunately, Condor does not give me back this created data when it finishes. In other words, my goal is to get the created data in a "Datasets" subfolder next to the .sub file.
I tried:
1) Not putting the data under the Datasets subfolder, and I got the files back as expected. However, this is not a smooth solution, since I generate around 100 files which then end up mixed in with the .sub file and everything else.
2) Setting this up in the .sub file, leading to this:
notification = Always
should_transfer_files = YES
RunAsOwner = True
When_To_Transfer_Output = ON_EXIT_OR_EVICT
getenv = True
transfer_input_files = main.py
transfer_output_files = Datasets
universe = vanilla
log = log/test-$(Cluster).log
error = log/test-$(Cluster)-$(Process).err
output = log/test-$(Cluster)-$(Process).log
executable = Simulation.bat
queue
This time I get the error that Datasets was not found. I already checked the spelling.
3) Another option would be to pack everything into a zip, but since I have to run hundreds of jobs, I do not want to unpack all these files afterwards.
I hope somebody comes up with a good idea on how to solve this.
Just for the record here: HTCondor does not transfer created directories, or their contents, at the end of the run. The best way to get the content back is to write a wrapper script that runs your executable and then compresses the created directory at the root of the working directory. This file will be transferred with all other files. For example, create run.exe:
./Simulation.bat
tar zcf Datasets.tar.gz Datasets
and in your condor submission script put:
executable = run.exe
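If you would rather write the wrapper in Python (a sketch, assuming Python is available on the execute node; run.py is a hypothetical name you would point executable at instead), the same idea is:
#!/usr/bin/env python3
import subprocess
import tarfile

# run the actual simulation
subprocess.run(["./Simulation.bat"], check=True)

# pack the created Datasets directory so HTCondor transfers a single file back
with tarfile.open("Datasets.tar.gz", "w:gz") as tar:
    tar.add("Datasets")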
However, if you do not want to do this, and if HTCondor is using a common shared space like AFS, you can simply copy the whole directory out:
./Simulation.bat
cp -r Datasets <AFS location>
The other alternative is to define an initialdir as described at the end of: https://research.cs.wisc.edu/htcondor/manual/quickstart.html
But one must create the directory structure by hand.
Also, look around page 65 of: https://indico.cern.ch/event/611296/contributions/2604376/attachments/1471164/2276521/TannenbaumT_UserTutorial.pdf
This document is, in general, a very useful one for beginners.
I have a number of file objects which I would like to redeploy to a directory with a new structure, based on requirements stated by the user.
As example could be having these file objects:
1)root\name\date\type\filename
2)root\name\date\type\filename
3)root\name\date\type\filename
...that I want to save (or save a copy of) in a new structure like the one below, after the user has defined a need to split by type->date->name:
1)root\type\date\name\filename
2)root\type\date\name\filename
3)root\type\date\name\filename
... or even losing levels such as:
1)root\type\filename
2)root\type\filename
3)root\type\filename
I can only come up with the option of going the long way round: taking the initial list and, through a process of filtering, simply deploying into the newly calculated folder structure using basic string operations.
I feel, though, that someone has probably done this before in a smart way, and that a library/module potentially already exists to do this. Does anyone have any ideas?
Here is a solution using Python glob:
The current levels are: name, date, type and filename:
curr_levels = "name\\date\\type\\filename"
curr_levels = curr_levels.split("\\")
The user want other levels: type, date, name and filename:
user_levels = "type\\date\\name\\filename"
user_levels = user_levels.split("\\")
We can use glob.iglob to iterate the tree structure on 4 levels.
The glob pattern is something like: <src_dir>\*\*\*\* (but, we use a more generic way here).
The user structure can be defined with a simple string format.
For instance: {type}\{date}\{name}\{filename} on Windows.
We need to create the directory structure first and then copy (or move) the file.
import glob
import os
import shutil

pattern = os.path.join(source_dir, *("*" * len(curr_levels)))
fmt = os.sep.join(['{{{key}}}'.format(key=key) for key in user_levels])

for source_path in glob.iglob(pattern):
    source_relpath = os.path.relpath(source_path, source_dir)
    parts = source_relpath.split(os.sep)
    values = dict(zip(curr_levels, parts))
    target_relpath = fmt.format(**values)
    target_path = os.path.join(target_dir, target_relpath)
    parent_dir = os.path.dirname(target_path)
    if not os.path.exists(parent_dir):
        os.makedirs(parent_dir)
    shutil.copy2(source_path, target_path)
Note: if your source_dir is the same as target_dir (the root in your question), you need to replace glob.iglob with glob.glob in order to store the whole list of files in memory before processing. This is required to avoid glob.iglob browsing the directory tree you are creating…
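The same approach also covers the "losing levels" case from the question: because str.format ignores keys it does not reference, you can simply list fewer levels in user_levels. A small self-contained illustration (the path parts are made up; note that dropping levels can make several source files collide on the same target name):
import os

curr_levels = "name\\date\\type\\filename".split("\\")
user_levels = "type\\filename".split("\\")   # drop name and date

fmt = os.sep.join(['{{{key}}}'.format(key=key) for key in user_levels])
parts = ["alice", "2021-01-01", "png", "img_001.png"]   # hypothetical path parts
values = dict(zip(curr_levels, parts))

# the unused keys (name, date) are simply ignored by str.format
print(fmt.format(**values))   # e.g. png\img_001.png on Windows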
If you are in a UNIX environment, the simplest way to achieve this is a shell script using the cp command.
For example, to copy all files from /root/name/date/type/filename to /root/date/filename, you just need to do:
cp /root/*/date/*/filename /root/date/filename
Or, if you want to move the files, use the mv command:
mv /root/*/date/*/filename /root/date/filename
You may run these commands via Python as well using os.system() as:
import os
os.system("cp /root/*/date/*/filename /root/date/filename")
For details, check: Calling an external command in Python
Edit based on comment. For copying /root/name/date/type/filename into /root/date/name/type/filename, you just need to do:
cp /root/name/date/type/filename /root/date/name/type/filename
But make sure that the directory /root/date/name/type exists before doing it. To make sure it exists (creating it if it does not), use mkdir with the -p option:
mkdir -p /root/date/name/type
The list or groups of hosts are (at least in my scenarios) somewhat dynamic, and decoupled from the code.
In addition, many times I use fabric for "one-liners" - that is, without writing a script.
I'm looking for a simple way to define a list of hosts and/or role definitions that doesn't require modifying or using Python scripts.
A simple host-per-line format is preferred, as it's the current format of our host lists.
From what I saw, the closest thing is the .rc file - but according to the documentation it only supports simple variables.
If I understand you correctly, you need a separate file for the list of hosts. Add these lines to your fabfile:
from fabric.api import env

env.roledefs = {
    # static roles
}

# add a dynamic role from the file "hosts" (strip newlines so host names are clean)
with open("./hosts") as f:
    env.roledefs['tmp'] = [line.strip() for line in f if line.strip()]
Create a hosts file with the list of hosts in the current directory:
example1.com
example2.com
try it:
$ fab -R tmp -- uname -a