I'm trying to get the names of subdirectories with a Python 3 script on Windows 10.
To that end, I wrote the following code:
from pathlib2 import Path
p = "./path/to/target/dir"
[str(item) for item in Path(p).rglob(".")]
# obtains only the path names of subdirectories, including the target directory itself.
This result is exactly what I want, but I don't understand why this rglob pattern returns it.
Can someone explain this?
Thanks.
Every directory in a posix-style filesystem contains two entries from the get-go: .., which refers to the parent directory, and ., which refers to the current directory:
$ mkdir tmp; cd tmp
tmp$ ls -a
. ..
tmp$ cd .
tmp$ # <-- still in the same directory
- with the notable exception of /.., which refers to the root itself, since the root has no parent.
A Path object from python's pathlib is, when it is created, just a wrapper around a string that is assumed to point somewhere into the filesystem. It will only refer to something tangible when it is resolved:
>>> Path('.')
PosixPath('.') # just a fancy string
>>> Path('.').resolve()
PosixPath('/current/working/dir') # an actual point in your filesystem
The bottom line is that
the paths /current/working/dir and /current/working/dir/. are, from the filesystem's point of view, completely equivalent, and
a pathlib.Path will also reflect that as soon as it is resolved.
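A quick way to convince yourself of that equivalence (a minimal sketch, assuming a subdirectory named sub exists under the current working directory):
from pathlib import Path

# Both spellings resolve to the same point in the filesystem
print(Path('sub').resolve() == Path('sub/.').resolve())  # True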
By matching the glob pattern to ., you found the . entry of every directory at or below the initial one. The results from glob are normalized on return, so the . no longer appears in them.
As a source for this behavior, see this section of PEP 428 (which serves as the specification for pathlib), where path equivalence is briefly mentioned.
Related
I'm using a basic python script to create an archive with the contents of a directory "directoryX":
shutil.make_archive('NameOfArchive', format='gztar', root_dir=getcwd()+'/directoryX/')
The generated archive, rather than just storing the contents of directoryX, creates a . folder in the archive (and the contents of folder directoryX are stored in this . folder).
Interestingly, this only happens with .tar and .tar.gz, but not with .zip.
Used python version -> 3.8.10
It seems that when using the .tar or .tar.gz formats, the default base_dir of "." gets accepted literally, and a folder titled "." ends up in the archive.
I tried using base_dir=os.curdir but got the same results...
I also tried Python 2 but got the same results.
Is this a bug with shutil.make_archive or am I doing something incorrectly?
It's a documented behavior, sort of, just a little odd. The base_dir argument to make_archive is documented to:
Be the directory we start archiving from (after chdir-ing to root_dir)
Default to the current directory (specifically, os.curdir)
os.curdir is actually a constant string, '.', and, matching the tar command-line utility, shutil.make_archive (and the TarFile.add it's implemented in terms of) stores the complete path given; in this case, './' plus the rest of the relative path to each file. If you run tar -c -z -C directoryX -f NameOfArchive.tar.gz ., you'll end up with a tarball full of ./-prefixed files too (-C directoryX does the same thing as root_dir, and the . argument is the same as the default base_dir='.').
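You can see the prefix directly by listing the archive's entries; a minimal sketch (the archive name comes from the question, the printed names are illustrative):
import tarfile

with tarfile.open('NameOfArchive.tar.gz') as tar:
    print(tar.getnames())  # e.g. ['.', './foo', './bar', './subdir', ...]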
I don't see an easy workaround that retains the simplicity of shutil.make_archive; if you try to pass base_dir='' it dies when it tries to stat '', so that's out.
To be clear, this behavior should be fine; a tar entry named ./foo and one named foo are equivalent for most purposes. If it really bothers you, you can switch to using the tarfile module directly, e.g.:
# Imports at top of file
import os
import tarfile

# Actual code
with tarfile.open('NameOfArchive.tar.gz', 'w:gz') as tar:
    for entry in os.scandir('directoryX'):
        # Operates recursively on any directories, using the arcname as the base,
        # so you add the whole tree just by adding all the entries in the top
        # level directory. Using arcname of entry.name means it's equivalent to
        # adding os.path.basename(entry.path), omitting all directory components
        tar.add(entry.path, arcname=entry.name)
    # The whole loop *could* be replaced with just:
    #     tar.add('directoryX', arcname='')
    # which would add all contents recursively, but it would also put an entry
    # for '/' in, which is undesirable
For a directory structure like:
directoryX/
|
\- foo
\- bar
\- subdir/
   |
   \- spam
   \- eggs
the resulting tar's contents would be:
foo
bar
subdir/
subdir/eggs
subdir/spam
vs. the:
./foo
./bar
./subdir/
./subdir/eggs
./subdir/spam
your current code produces.
Slightly more work to code, but not that much worse: two imports and three lines of code, with greater control over what gets added. For example, you could trivially exclude symlinks by wrapping the tar.add call in an if not entry.is_symlink(): block, omit recursive adding of specific directories by passing recursive=False to the tar.add call for directories whose contents you don't want, or even provide a filter function to the tar.add call to conditionally exclude specific entries when deep recursion gets involved.
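As a minimal sketch of that last option (the .log suffix is just an illustrative choice, not something from the original code):
def drop_logs(tarinfo):
    # Returning None excludes the entry; returning it unchanged keeps it
    return None if tarinfo.name.endswith('.log') else tarinfo

tar.add(entry.path, arcname=entry.name, filter=drop_logs)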
During my simulation, Python creates folders named __pycache__. Not just one, but many. The __pycache__-folders are almost always created next to the modules that are executed.
But these modules are scattered in my directory. The main folder is called LPG and has a lot of subfolders, which in turn have further subfolders. The __pycache__-folders can occur at all possible places.
At the end of my simulation I would like to clean up and delete all folders named __pycache__ within the LPG-tree.
What is the best way to do this?
Currently, I am calling the function below on simulation end (and also on simulation start). However, that is a bit annoying, since I specifically have to write down every path where a __pycache__ folder might occur.
def clearCache():
    """
    Removes generic `__pycache__` folders.
    The `__pycache__` folders are automatically created by python during the simulation.
    This function removes the generic folders on simulation start and simulation end.
    """
    try:
        shutil.rmtree(Path(f"{PATH_to_folder_X}/__pycache__"))
    except:
        pass
    try:
        shutil.rmtree(Path(f"{PATH_to_folder_Y}/__pycache__"))
    except:
        pass
This will remove all *.pyc files and __pycache__ directories recursively in the current directory.
With Python:
import os
os.popen('find . | grep -E "(__pycache__|\.pyc$|\.pyo$)" | xargs rm -rf')
or manually from a terminal:
find . | grep -E "(__pycache__|\.pyc$|\.pyo$)" | xargs rm -rf
Bit of a frame challenge here: If you don't want the bytecode caches, the best solution is to not generate them in the first place. If you always delete them after every run, they're worse than useless. Either:
Invoke python/python3 with the -B option (affects that single launch), or...
Set the PYTHONDONTWRITEBYTECODE environment variable to affect all Python launches until it's unset, e.g. in bash, export PYTHONDONTWRITEBYTECODE=1
This does need to be set before the Python script is launched, so perhaps wrap your script with a simple bash script or the like that invokes the real Python script with the appropriate switch/environment set up.
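If wrapping the launch is awkward, the same switch can also be flipped from inside Python itself, as long as it runs before the relevant modules are imported:
import sys

# Equivalent to the -B option for anything imported after this line
sys.dont_write_bytecode = True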
Another simple solution is available if you have access to a Command-Line Interface and the find utility:
find . -type d -name __pycache__
Reading it as plain language: it finds, starting in the current folder (.), directories (-type d) whose name exactly matches the pattern -name __pycache__. You can use this to identify where these folders are, and then to delete them:
find . -type d -name __pycache__ -exec rm -fr {} \;
The huge advantage of this solution is that it transfers easily to other tasks (finding *.pyc files?) and has become an everyday tool for me.
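If you'd rather stay inside Python, a pathlib equivalent of the same find idea might look like this (a sketch, run from the directory you want to clean):
import shutil
from pathlib import Path

# rglob('__pycache__') finds the folders at any depth, like find does
for cache_dir in Path('.').rglob('__pycache__'):
    shutil.rmtree(cache_dir)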
Here is a simple solution if you already know where the __pycache__ folders are; just try the following:
import shutil
import os

def clearCache():
    """
    Removes generic `__pycache__` folders.
    The `__pycache__` folders are automatically created by python during the simulation.
    This function removes the generic folders on simulation start and simulation end.
    """
    path = 'C:/Users/Yours/Desktop/LPG'
    try:
        for entry in os.listdir(path):
            # os.path.join inserts the separator that plain string
            # concatenation ('path + entry') would drop
            full_path = os.path.join(path, entry)
            if os.path.isdir(full_path) and entry == '__pycache__':
                shutil.rmtree(full_path, ignore_errors=False)
    except OSError:
        pass

clearCache()
You can still modify path to point to wherever your folder actually is.
And if you want the script to descend into subdirectories to remove the __pycache__ folders, check the following:
Example
import shutil
import os

path = 'C:/Users/Yours/Desktop/LPG'

for directory, subfolders, files in os.walk(path):
    # os.walk yields each directory path; compare its final component
    if os.path.basename(directory) == '__pycache__':
        shutil.rmtree(directory)
If you want to delete any folders from any directory, use this function.
By default, it starts deleting from the current directory and recursively goes into every sub-directory:
import os
import shutil

def remove_dirs(curr_dir='./', del_dirs=['temp_folder', '__pycache__']):
    for del_dir in del_dirs:
        if del_dir in os.listdir(curr_dir):
            shutil.rmtree(os.path.join(curr_dir, del_dir))
    for name in os.listdir(curr_dir):
        sub_dir = os.path.join(curr_dir, name)
        if os.path.isdir(sub_dir):
            remove_dirs(sub_dir, del_dirs)
You can use glob with shutil like this:
import glob
import os
import shutil

in_dir = "/path/to/your/folder"
patterns = ['__pycache__']

for p in patterns:
    # '**' with recursive=True matches at any depth; shutil.rmtree removes
    # each matched directory (os.remove would fail on a directory)
    for match in glob.iglob(os.path.join(in_dir, "**", p), recursive=True):
        shutil.rmtree(match)
I have this folder structure; within edi_standards.py I want to open csv/transaction_groups.csv.
But the code only works when I access it like this: os.path.join('standards', 'csv', 'transaction_groups.csv')
What I think it should be is os.path.join('csv', 'transaction_groups.csv'), since both edi_standards.py and csv/ are on the same level in the same folder standards/.
This is the output of printing __file__ in case you doubt what I say:
>>> print(__file__)
~/edi_parser/standards/edi_standards.py
When you run a python file, the python interpreter does not change the current directory to the directory of the file you're running.
In your case, you're probably running (from ~/edi_parser):
standards/edi_standards.py
For this you have to hack something together using __file__, taking its dirname and building the relative path to your resource file:
os.path.join(os.path.dirname(__file__),"csv","transaction_groups.csv")
Anyway, it's good practice not to rely on the current directory to open resource files. This method works whatever the current directory is.
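A pathlib spelling of the same idea, for what it's worth (file names taken from the question):
from pathlib import Path

# Anchor on this file's location instead of the working directory
csv_path = Path(__file__).resolve().parent / "csv" / "transaction_groups.csv"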
I agree with Jean-Francois's answer above, but I would like to mention that os.path.join does not implicitly prepend the absolute path of your current working directory as a first argument.
For example, consider the code below:
>>> os.path.join('Functions','hello')
'Functions/hello'
See another example
>>> os.path.join('Functions','hello','/home/naseer/Python','hai')
'/home/naseer/Python/hai'
The official documentation states that whenever an absolute path is given as an argument to os.path.join, all previous path arguments are discarded and joining continues from the absolute path argument.
The point I would like to highlight is that you shouldn't expect os.path.join to resolve relative paths against your working directory for you; you have to supply an absolute path to reliably locate your file.
What is the difference between "./file_name", "../file_name" and "file_name" when used as file paths in Python?
For example, if you want to save to a file path, is it correct that "../file_name" will save file_name inside the current directory, and "./file_name" will save it to the desktop? It's really confusing.
./file_name and file_name mean the same thing - a file called file_name in the current working directory.
../file_name means a file called file_name in the parent directory of the current working directory.
Summary
. represents the current directory, whereas .. represents the parent directory.
Explanation by example
If the current working directory is this/that/folder, then:
. results in this/that/folder
.. results in this/that
../.. results in this
.././../other results in this/other
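You can check these resolutions from Python itself (a small sketch; the printed values assume the working directory from the example above):
import os

print(os.path.abspath('.'))              # .../this/that/folder
print(os.path.abspath('..'))             # .../this/that
print(os.path.abspath('../..'))          # .../this
print(os.path.abspath('.././../other'))  # .../this/other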
Basically, ./ is the current directory, while ../ is the parent of the current directory. Both actually exist as hard links in the filesystem; they are what make relative paths work.
Let's consider the following:
/root/
    directory_a
        directory_a_a
            file_name
        directory_a_b
            file_name
    directory_b
        directory_b_a
        directory_b_b
and let's consider your current working directory is /root/directory_a/directory_a_a. Then, from this directory if you refer to ./file_name you are referring to /root/directory_a/directory_a_a/file_name. On the other hand, if you refer to ../file_name you are referring to /root/directory_a/file_name.
In the end, ./ and ../ depend upon your current working directory. If you want to be very specific you should use an absolute path.
I'm trying to extract user-submitted zip and tar files to a directory. The documentation for zipfile's extractall method (similarly with tarfile's extractall) states that it's possible for paths to be absolute or contain .. paths that go outside the destination path. Instead, I could use extract myself, like this:
import zipfile

some_path = '/destination/path'
some_zip = '/some/file.zip'

zipf = zipfile.ZipFile(some_zip, mode='r')
for subfile in zipf.namelist():
    zipf.extract(subfile, some_path)
Is this safe? Is it possible for a file in the archive to wind up outside of some_path in this case? If so, what way can I ensure that files will never wind up outside the destination directory?
Note: Starting with python 2.7.4, this is a non-issue for ZIP archives. Details at the bottom of the answer. This answer focuses on tar archives.
To figure out where a path really points to, use os.path.abspath() (but note the caveat about symlinks as path components). If you normalize a path from your zipfile with abspath and it does not contain the current directory as a prefix, it's pointing outside it.
But you also need to check the value of any symlink extracted from your archive (both tarfiles and unix zipfiles can store symlinks). This is important if you are worried about a proverbial "malicious user" that would intentionally bypass your security, rather than an application that simply installs itself in system libraries.
That's the aforementioned caveat: abspath will be misled if your sandbox already contains a symlink that points to a directory. Even a symlink that points within the sandbox can be dangerous: The symlink sandbox/subdir/foo -> .. points to sandbox, so the path sandbox/subdir/foo/../.bashrc should be disallowed. The easiest way to do so is to wait until the previous files have been extracted and use os.path.realpath(). Fortunately extractall() accepts a generator, so this is easy to do.
Since you ask for code, here's a bit that explicates the algorithm. It prohibits not only the extraction of files to locations outside the sandbox (which is what was requested), but also the creation of links inside the sandbox that point to locations outside the sandbox. I'm curious to hear if anyone can sneak any stray files or links past it.
import tarfile
from os.path import abspath, realpath, dirname, join as joinpath
from sys import stderr

resolved = lambda x: realpath(abspath(x))

def badpath(path, base):
    # joinpath will ignore base if path is absolute
    return not resolved(joinpath(base, path)).startswith(base)

def badlink(info, base):
    # Links are interpreted relative to the directory containing the link
    tip = resolved(joinpath(base, dirname(info.name)))
    return badpath(info.linkname, base=tip)

def safemembers(members):
    base = resolved(".")
    for finfo in members:
        if badpath(finfo.name, base):
            print >>stderr, finfo.name, "is blocked (illegal path)"
        elif finfo.issym() and badlink(finfo, base):
            print >>stderr, finfo.name, "is blocked: Symlink to", finfo.linkname
        elif finfo.islnk() and badlink(finfo, base):
            print >>stderr, finfo.name, "is blocked: Hard link to", finfo.linkname
        else:
            yield finfo

ar = tarfile.open("testtar.tar")
ar.extractall(path="./sandbox", members=safemembers(ar))
ar.close()
Edit: Starting with python 2.7.4, this is a non-issue for ZIP archives: The method zipfile.extract() prohibits the creation of files outside the sandbox:
Note: If a member filename is an absolute path, a drive/UNC sharepoint and leading (back)slashes will be stripped, e.g.: ///foo/bar becomes foo/bar on Unix, and C:\foo\bar becomes foo\bar on Windows. And all ".." components in a member filename will be removed, e.g.: ../../foo../../ba..r becomes foo../ba..r. On Windows, illegal characters (:, <, >, |, ", ?, and *) [are] replaced by underscore (_).
The tarfile module has not been similarly sanitized, so the above answer still applies.
Contrary to the popular answer, unzipping files safely is not completely solved as of Python 2.7.4. The extractall method is still dangerous and can lead to path traversal, either directly or through the unzipping of symbolic links. Here was my final solution which should prevent both attacks in all versions of Python, even versions prior to Python 2.7.4 where the extract method was vulnerable:
import zipfile, os

def safe_unzip(zip_file, extract_path='.'):
    with zipfile.ZipFile(zip_file, 'r') as zf:
        for member in zf.infolist():
            file_path = os.path.realpath(os.path.join(extract_path, member.filename))
            if file_path.startswith(os.path.realpath(extract_path)):
                zf.extract(member, extract_path)
Edit 1: Fixed variable name clash. Thanks Juuso Ohtonen.
Edit 2: s/abspath/realpath/g. Thanks TheLizzard
Use ZipFile.infolist() / TarFile.next() / TarFile.getmembers() to get the information about each entry in the archive, normalize the path yourself, open the output file yourself, and use ZipFile.open() / TarFile.extractfile() to get a file-like object for the entry so you can copy the entry data over.
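A minimal sketch of that approach for ZIP archives (the normalization here is deliberately simple; treat it as an illustration, not a hardened implementation):
import os
import shutil
import zipfile

def careful_extract(zip_path, dest):
    dest = os.path.abspath(dest)
    with zipfile.ZipFile(zip_path) as zf:
        for info in zf.infolist():
            # Normalize the joined path and refuse anything escaping dest
            target = os.path.normpath(os.path.join(dest, info.filename))
            if not target.startswith(dest + os.sep):
                continue
            if info.filename.endswith('/'):  # directory entry
                os.makedirs(target, exist_ok=True)
                continue
            os.makedirs(os.path.dirname(target), exist_ok=True)
            with zf.open(info) as src, open(target, 'wb') as dst:
                shutil.copyfileobj(src, dst)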
Copy the zipfile to an empty directory. Then use os.chroot to make that directory the root directory. Then unzip there.
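Roughly, the chroot idea could look like this; a sketch only (Unix-specific, requires root privileges, and forks so the parent process keeps its original root; the paths are hypothetical):
import os
import zipfile

pid = os.fork()
if pid == 0:  # child: jail itself in the empty directory, then extract
    os.chroot('/path/to/empty/dir')  # hypothetical directory holding the copied zip
    os.chdir('/')
    zipfile.ZipFile('/file.zip').extractall()
    os._exit(0)
os.waitpid(pid, 0)  # parent: wait for the jailed child to finish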
Alternatively, you can call unzip itself with the -j flag, which ignores the directories:
import subprocess
filename = '/some/file.zip'
rv = subprocess.call(['unzip', '-j', filename])