I have a number of file objects which I would like to redeploy to a directory with a new structure based on requirements stated by the user.
As an example, suppose I have these file objects:
1)root\name\date\type\filename
2)root\name\date\type\filename
3)root\name\date\type\filename
...that I want to save (or copy) into a new structure like the one below, after the user has asked to split by type->date->name:
1)root\type\date\name\filename
2)root\type\date\name\filename
3)root\type\date\name\filename
... or even dropping levels, such as:
1)root\type\filename
2)root\type\filename
3)root\type\filename
The only option I can come up with is to go the long way round: take the initial list and, through a process of filtering, deploy each file into the newly calculated folder structure using basic string operations.
I feel, though, that someone has probably done this before in a smart way, and that a library/module may already exist for it. Does anyone have any ideas?
Here is a solution using Python glob:
The current levels are: name, date, type and filename:
curr_levels = "name\\date\\type\\filename"
curr_levels = curr_levels.split("\\")
The user wants other levels: type, date, name and filename:
user_levels = "type\\date\\name\\filename"
user_levels = user_levels.split("\\")
We can use glob.iglob to iterate the tree structure on 4 levels.
The glob pattern is something like <src_dir>\*\*\*\* (but we use a more generic way below).
The user structure can be defined with a simple string format.
For instance: {type}\{date}\{name}\{filename} on Windows.
We need to create the directory structure first and then copy (or move) the file:
import glob
import os
import shutil

pattern = os.path.join(source_dir, *("*" * len(curr_levels)))
fmt = os.sep.join(['{{{key}}}'.format(key=key) for key in user_levels])

for source_path in glob.iglob(pattern):
    # map each path component to its level name, e.g. {'name': ..., 'date': ...}
    source_relpath = os.path.relpath(source_path, source_dir)
    parts = source_relpath.split(os.sep)
    values = dict(zip(curr_levels, parts))
    # build the target path from the user-defined order
    target_relpath = fmt.format(**values)
    target_path = os.path.join(target_dir, target_relpath)
    # create the parent directory, then copy the file (preserving metadata)
    parent_dir = os.path.dirname(target_path)
    if not os.path.exists(parent_dir):
        os.makedirs(parent_dir)
    shutil.copy2(source_path, target_path)
Note: if your source_dir is the same as target_dir (the root in your question), you need to replace glob.iglob with glob.glob in order to store the whole list of files in memory before processing. This is required to prevent glob.iglob from browsing the directory tree you are creating…
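For example, only the loop header changes:

for source_path in glob.glob(pattern):  # the full list is built up front, so newly created files are not re-scanned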
If you are in a UNIX environment, the simplest way to achieve this is a shell script with the cp command.
For example, to copy all files from /root/name/date/type/filename to /root/date/filename, you just need to do:
cp /root/*/date/*/filename /root/date/filename
Or, if you want to move the files instead, use the mv command:
mv /root/*/date/*/filename /root/date/filename
You may run these commands from Python as well, using os.system():
import os
os.system("cp /root/*/date/*/filename /root/date/filename")
For details, check: Calling an external command in Python
Edit based on comment: for copying /root/name/date/type/filename into /root/date/name/type/filename, you just need to do:
cp /root/name/date/type/filename /root/date/name/type/filename
But make sure the directory /root/date/name/type exists before doing it. To create it if it does not exist, use mkdir with the -p option:
mkdir -p /root/date/name/type
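If you are scripting this from Python instead, os.makedirs() gives you the same guard as mkdir -p (the exist_ok flag requires Python 3.2+):

import os
os.makedirs('/root/date/name/type', exist_ok=True)  # no error if it already exists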
I am trying to automatically create an HDF5 structure by using the file paths on my local PC. I want to read through the subdirectories and create a matching HDF5 structure that I can then save files to. Thank you
You can do this by combining os.walk() and h5py's create_group(). The only complications are handling Linux vs Windows paths (and the drive letter on Windows). Another consideration is relative vs absolute paths. I used absolute paths, but the example can be modified. (Note: it's a little verbose so you can see what's going on.) Here is the example code (for Windows):
import os
import h5py

with h5py.File('SO_73879694.h5', 'w') as h5f:
    cwd = os.getcwd()
    for root, dirs, _ in os.walk(cwd, topdown=True):
        print(f'ROOT: {root}')
        # for Windows, modify root: remove the drive letter and replace backslashes
        grp_name = root[2:].replace('\\', '/')
        print(f'grp_name: {grp_name}\n')
        h5f.create_group(grp_name)
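The example above uses absolute paths; if you prefer group names relative to the starting directory, a minimal variant of the name-building step might look like this (my own sketch, not the answer's original code):

# relative-path variant: name groups relative to the walk root
rel = os.path.relpath(root, cwd)
if rel != '.':  # skip the walk root itself
    h5f.create_group(rel.replace('\\', '/'))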
This is actually quite easy to do using HDFql in Python. Here is a complete script that does that:
# import HDFql module (make sure it can be found by the Python interpreter)
import HDFql

# create and use (i.e. open) an HDF5 file named 'my_file.h5'
HDFql.execute("CREATE AND USE FILE my_file.h5")

# get all directories recursively starting from the root of the file system (/)
HDFql.execute("SHOW DIRECTORY / LIKE **")

# iterate the result set and create a group in each iteration
while HDFql.cursor_next() == HDFql.SUCCESS:
    HDFql.execute("CREATE GROUP \"%s\"" % HDFql.cursor_get_char())
I have a few files with this format:
17-07-39_03-05-2022_Testing.txt
16-07-34_03-05-2022_Testing.png
"Testing" is the name of the system.
I am trying to use subprocess to delete all files that have "Testing" at the end of their names (before the extension).
pc_user = subprocess.run("whoami", capture_output=True, shell=False).stdout.decode().strip("\n")
name, user = pc_user.split("\\")
path = f"C:/Users/Testing/PycharmProjects/pythonProject/Project/"
subprocess.run(f"rm {path}*{name}.*")
I would use a slightly different approach to delete these specific files.
First, you will want to make a list of the paths to all files. It should look like this:
file_list = ['17-07-39_03-05-2022_Testing.txt',
             '16-07-34_03-05-2022_Testing.png',
             etc.]
Then, you want to search this list for paths that include the string 'Testing' and put all these paths in a new list:
testing_list = []
for path in file_list:
    if 'Testing' in path:
        testing_list.append(path)
You can now delete all files in the testing_list list. I would use one of the following methods:
os.remove() removes a file.
os.rmdir() removes an empty directory.
shutil.rmtree() deletes a directory and all its contents.
You can find more information on deleting files here
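For instance, since these are plain files, os.remove() in a loop is enough. A minimal sketch, assuming the entries in testing_list are full paths (or that you run from the directory that contains them):

import os

for path in testing_list:
    os.remove(path)  # raises OSError if the path is missing or is a directory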
(OP has edited the question since this was posted and removed the powershell tags, making this answer irrelevant.)
There is a much simpler way to do this using Get-ChildItem and a -like comparison, using only PowerShell.
foreach ($file in (gci -Path "C:/Users/Testing/PycharmProjects/pythonProject/Project/")) {
    if ($file.name -like "*testing.*") {
        Remove-Item $file.fullname -Confirm:$false
    }
}
I have a data path to a couple of data files, say data01.txt, data02.txt, and so on. During processing, the user will provide mask files for the data (potentially via an external tool). Mask files will contain the string 'mask', e.g. data01-mask.txt.
from pathlib import Path

p = Path(r'C:\Windows\test\data01.txt')
dircontent = list(p.parent.glob('*'))
This gives me a list of all the file paths as Path objects, including potential masks. Now I want a list of the directory content excluding any file containing 'mask'. I have tried a fancy glob pattern, *![mask]*, but I cannot get it to work.
Using,
dircontentstr = [str(elem) for elem in dircontent]
filtereddir = [elem for elem in dircontentstr if elem.find('mask') == -1]
I can get the desired result, but it seems silly to then convert back to Path objects. Is there a straightforward way to exclude these files from the directory list?
There is no need to convert anything to strings here, as Path objects have helpful attributes that you can use to filter on. Take a look at the .name and .stem attributes; these let you filter path objects on the base filename (where .stem is the base name without extension):
dircontent = [path for path in p.parent.glob('*') if 'mask' not in path.stem]
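To see the difference between the two attributes, compare them on one of the question's filenames:

from pathlib import Path

p = Path('data01-mask.txt')
print(p.name)  # 'data01-mask.txt'
print(p.stem)  # 'data01-mask'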
I am using HTCondor to generate some data (txt, png). Running my program creates a directory next to the .sub file, named Datasets, where the datasets are stored. Unfortunately, Condor does not give this created data back to me when finished. In other words, my goal is to get the created data in a "Datasets" subfolder next to the .sub file.
I tried:
1) Not putting the data under the Datasets subfolder, and I got the files back as expected. However, this is not a clean solution, since I generate around 100 files, which end up mixed in with the .sub file and everything else.
2) Setting this up in the .sub file, leading to this:
notification = Always
should_transfer_files = YES
RunAsOwner = True
When_To_Transfer_Output = ON_EXIT_OR_EVICT
getenv = True
transfer_input_files = main.py
transfer_output_files = Datasets
universe = vanilla
log = log/test-$(Cluster).log
error = log/test-$(Cluster)-$(Process).err
output = log/test-$(Cluster)-$(Process).log
executable = Simulation.bat
queue
This time I get the error that Datasets was not found. Spelling was already checked.
3) Another option would be to pack everything into a zip, but since I have to run hundreds of jobs, I do not want to unpack all these files afterwards.
I hope somebody comes up with a good idea on how to solve this.
Just for the record here: HTCondor does not transfer created directories, or their contents, at the end of the run. The best way to get the content back is to write a wrapper script that runs your executable and then compresses the created directory at the root of the working directory. This file will be transferred with all the other files. For example, create run.exe:
#!/bin/bash
./Simulation.bat
tar zcf Datasets.tar.gz Datasets
and in your condor submission script put:
executable = run.exe
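For completeness, a sketch of how the rest of the submit file might change under this scheme (my own assumption of the relevant lines, not tested against your setup):

executable              = run.exe
transfer_input_files    = main.py, Simulation.bat
transfer_output_files   = Datasets.tar.gz
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
queue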
However, if you do not want to do this, and if HTCondor is using a common shared filesystem like AFS, you can simply copy the whole directory out:
./Simulation.bat
cp -r Datasets <AFS location>
The other alternative is to define an initialdir as described at the end of: https://research.cs.wisc.edu/htcondor/manual/quickstart.html
But one must create the directory structure by hand.
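For example, something like this in the submit file (assuming you create job0, job1, ... by hand before submission):

# per-job working directories, created by hand beforehand
initialdir = job$(Process)
executable = Simulation.bat
queue 100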
Also, look around p. 65 of: https://indico.cern.ch/event/611296/contributions/2604376/attachments/1471164/2276521/TannenbaumT_UserTutorial.pdf
This document is, in general, a very useful one for beginners.
Using Python 2.7.
I have a list of *.tar.gz files on a Linux box. Using Python, I want to loop through the files and extract each of them in a different location, under its respective folder.
For example: if my file name is ~/TargetData/zip/1440198002317590001.tar.gz
then I want to untar and ungzip this file in a different location under its
respective folder name i.e. ~/TargetData/unzip/1440198002317590001.
I have written some code but I am not able to loop through the files. On the command line I am able to extract using the $ tar -xzf 1440198002317590001.tar.gz command. But I want to be able to loop through the .tar.gz files in Python. The code is mentioned below; here, I am not able to loop over just the files, or even print only the files. Can you please help?
import os

inF = []
inF = str(os.system('ls ~/TargetData/zip/*.tar.gz'))
#print(inF)
if inF is not None:
    for files in inF[:-1]:
        print files
        """
        os.system('tar -czf files /unzip/files[:-7]')
        # This is what I am expecting: here files = "1440198002317590001.tar.gz" and files[:-7] = "1440198002317590001"
        """
Have you ever worked on this type of use case? Your help is greatly appreciated!! Thank you!
I think you misunderstood the meaning of os.system(): it will run the command, but its return value is not what you expected. It returns 0 when the command succeeds; you cannot directly assign the command's output to a variable. You may consider the subprocess module instead (see the doc here). However, I DO NOT recommend that way of listing files (it returns a string rather than a list; see the docs for the details).
The best way, I think, would be the glob module (see the doc here). With glob.glob(pattern) you can put all files matching the pattern into a list, which you can then loop over easily.
Of course, if you are familiar with the os module, you can also use os.listdir(), os.path.join(), or even os.path.expanduser() to do this. (Unlike glob, os.listdir() only puts bare filenames, without the full path, into a list, so you need to reconstruct each file path.)
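For example, a minimal sketch of that os-module route, using the same ~/TargetData/zip layout as in the question:

import os

zip_dir = os.path.expanduser('~/TargetData/zip')
inF = [os.path.join(zip_dir, name)
       for name in os.listdir(zip_dir)
       if name.endswith('.tar.gz')]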
By the way, for your purpose here, there is no need to declare an empty list first (i.e. inF = []).
For the extraction part, you could use os.system, but again I recommend the subprocess module instead of os.system; you will find the reasons in the subprocess docs.
DO NOT look at the following code; ONLY look at it after you have really tried, and failed, to solve this by yourself.
import os
import glob
import subprocess

# glob does not expand '~', so expand it explicitly
inF = glob.glob(os.path.expanduser('~/TargetData/zip/*.tar.gz'))
if inF:
    for files in inF:
        # .../zip/1440198002317590001.tar.gz -> .../unzip/1440198002317590001
        unzip_name = files.replace('zip', 'unzip')[:-7]
        # make sure the target directory exists, otherwise create it
        if not os.path.exists(unzip_name):
            os.makedirs(unzip_name)
        # each argument must be its own list element
        subprocess.call(['tar', '-xzf', files, '-C', unzip_name])