Since it looks like pathlib is the future, I'm trying to refactor some of my code from os to pathlib. I'm stuck on the following problem. Since I work on a Mac, folders sometimes contain hidden files whose names start with a period (.DS_Store, or names of deleted files preceded by ._). That causes a lot of problems when I loop through the files in a directory that have a certain extension. To avoid this problem using os.walk, I do the following:
for root, dirs, files in os.walk(DIR_NAME):
    # iterate all files
    for file in files:
        if file.endswith(ext):
            if file.startswith("."):
                continue
            # do something with the file
I know we have the .stem and .suffix attributes to manipulate file names with pathlib, but I don't see how they can help with this problem. Something like .startswith seems more intuitive, but alas it does not seem to be available on pathlib objects. So, the question is, how would one go about doing this in pathlib?
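One possible approach, offered as a minimal sketch rather than the definitive answer: the .name attribute of a Path is a plain str, so the usual string methods (including startswith) work on it. The function name iter_visible_files and its parameters are hypothetical, chosen only for illustration.

```python
from pathlib import Path


def iter_visible_files(dir_name, ext):
    """Yield files under dir_name whose name ends with ext, skipping dot-files.

    The function name and parameters are illustrative, not part of pathlib.
    """
    for path in Path(dir_name).rglob("*"):
        # path.name is the final path component as a plain str,
        # so endswith/startswith behave exactly as in the os.walk version
        if not path.is_file():
            continue
        if path.name.endswith(ext) and not path.name.startswith("."):
            yield path
```

Like the os.walk loop above, this only filters on the file name itself; files inside a hidden directory are still returned.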
I'm trying to get the path and filename of every file in a directory, including those inside subdirectories. The problem is that some subfolders have one or more dots in their names.
So when I execute this code
listaFile = glob.glob(r'c:\test\ID_1' + '/**/*.*', recursive=True)
I get
c:\test\ID_1\fil.e1.txt
c:\test\ID_1\fil.e2.doc
c:\test\ID_1\subfolder1\file1.txt
c:\test\ID_1\sub.folder2 (instead of c:\test\ID_1\sub.folder2\file1.txt)
thank you all in advance!
You need to filter the results by checking whether each one is a file or a folder. An easy way is to use pathlib instead of glob directly. Example below.
listaFile = [str(path) for path in pathlib.Path(r"c:\test\ID_1").rglob("*.*") if path.is_file()]
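A small variation, as a sketch: the "*.*" pattern only matches names that contain a dot, so a file such as README would be skipped. If every file is wanted, matching "*" and relying on is_file() alone should be enough. The helper name list_all_files is hypothetical.

```python
from pathlib import Path


def list_all_files(root):
    """Return every file under root as a string, whatever its name looks like."""
    # "*" instead of "*.*" so that files without a dot (e.g. "README") are kept;
    # is_file() drops directories such as "sub.folder2"
    return sorted(str(path) for path in Path(root).rglob("*") if path.is_file())
```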
I am using Python 3.5 to analyze data contained in csv files. These files are contained in a "figs" directory, which is contained in a case directory, which is contained in an overall data directory, e.g.:
/strm1/serino/DATA/06052009/figs
Or more generally:
/strm1/serino/DATA/case_date_in_MMDDYYYY/figs
The directory I am starting in is '/strm1/serino/DATA/,' and each subdirectory is the month, day, and year of a case I am working with. Each subdirectory contains another subdirectory named 'figs,' and that is the location of each case's csv file. To be exact:
/strm1/serino/DATA/case_date_in_MMDDYYYY/figs/case_date_in_MMDDYYYY.csv
So, I would like to start in my DATA directory and go through its subdirectories to find those that have the MMDDYYYY naming. However, some of the case directories may be named with a state abbreviation at the end, like: '06052009_TX.' Therefore, instead of matching the MMDDYYYY naming exactly, it could be something as simple as verifying that the directory name contains any number 1 through 9.
Once I am in the first subdirectory (the case directory) I would like to move into the 'figs' subdirectory. Once there, I want to access the csv file with the same naming convention as the first subdirectory (the case directory). I will fill existing arrays with the data contained in each csv file.
Basically, my question concerns navigating through multiple subdirectories that match a certain naming convention and ultimately accessing the data file at the "end." I was naively playing around with glob, fnmatch, os.listdir, and os.walk, but I could not get anything close enough to working that I feel would be helpful to include. I am not very familiar with those modules. What I can include is what I am going for:
for dirs in data_dir that contain a number:
    go into this directory
    go into 'figs' directory
    read data from the csv file whose name matches its case directory name (or whose name format matches the case directory name format)
I have come across related questions, but I have not been able to apply their answers in the way that I would like, especially with nested directories. I really appreciate the help, and let me know if I need to clarify anything.
The following should get you going. It uses the datetime.strptime() function to attempt to convert the first eight characters of each folder name into a valid datetime object. If the conversion fails, then you know that the folder name is not in the correct format and can be skipped. It then attempts to parse any CSV file found in the corresponding figs folder:
from datetime import datetime
import glob
import csv
import os
dirpath, dirnames, filenames = next(os.walk('/strm1/serino/DATA'))
for dirname in dirnames:
    if len(dirname) >= 8:
        try:
            dt = datetime.strptime(dirname[:8], '%m%d%Y')
            print(dt, dirname)
            csv_folder = os.path.join(dirpath, dirname)
            for csv_file in glob.glob(os.path.join(csv_folder, 'figs', '*.csv')):
                with open(csv_file, newline='') as f_input:
                    csv_input = csv.reader(f_input)
                    for row in csv_input:
                        print(row)
        except ValueError as e:
            pass
You listed several problems above. Which one are you stuck on? It seems like you already know how to navigate the file storage system using os.path. You may not know of the function os.path.join() which allows you to manually specify a file path relative to a file as such:
os.path.abspath(os.path.join(os.path.dirname(__file__), '../..', 'Data/TrailShelters/'))
To break down the above:
os.path.dirname(__file__) returns the directory that contains the current file. '../..' means: go up two levels in the folder hierarchy. And Data/TrailShelters/ is the directory I wish to navigate to.
How does this apply to your particular case? You will need to make some adaptations, but you can store the path of the parent directory in a variable and then loop over its subdirectories. For every subdirectory, examine its path and extract the part you are interested in (os.path.basename gives the last component). Then you can use something like if 'TN' in subdirectory_name to decide whether it is a subdirectory you are interested in. If so, update the saved path by appending the subdirectory name with os.path.join. Does that make sense?
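The loop described above could look roughly like this. It is a sketch under assumptions: the function name find_case_csvs is hypothetical, and a directory is treated as a case directory when its name contains any digit, which is the loose test suggested in the question rather than a strict MMDDYYYY check.

```python
import glob
import os


def find_case_csvs(data_dir):
    """Return CSV paths found in <data_dir>/<case>/figs for each case directory.

    A directory counts as a case directory when its name contains a digit
    (an assumption taken from the question, not a strict date check).
    """
    csv_paths = []
    for name in sorted(os.listdir(data_dir)):
        case_dir = os.path.join(data_dir, name)
        if not os.path.isdir(case_dir):
            continue  # skip plain files sitting next to the case directories
        if not any(ch.isdigit() for ch in name):
            continue  # no digit in the name, so not a case directory
        pattern = os.path.join(case_dir, "figs", "*.csv")
        csv_paths.extend(sorted(glob.glob(pattern)))
    return csv_paths
```

Each returned path can then be opened with the csv module, as in the previous answer.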
I have some directories that contain other directories which, at the lowest level, contain a bunch of csv files, such as (folder) a -> b -> c -> (csv files). There is usually only one folder at each level. When I process a directory, how can I follow this structure to the end to get the csv files? I was thinking of a recursive solution, but there may be better ways to do this. I am using Python. I hope that was clear.
The os module has a walk function that will do exactly what you need:
from os import walk
for current_path, directories, files in walk("/some/path"):
    # current_path is the path of the directory we are currently in
    # directories is a list of the subdirectory names in current_path
    # files is a list of file names in this directory
    print(current_path, directories, files)
You can use os.path.join(current_path, file_name) to derive the full path to each file (if you need it).
Alternately, you might find the glob module to be of more use to you:
from glob import glob
for csv_file in glob("/some/path/*/*.csv"):
    # csv_file is the full path to the csv file.
    print(csv_file)
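The pattern above descends exactly one directory level, so a deeper layout such as a -> b -> c would need a longer pattern. A sketch of one alternative, assuming Python 3.5 or later: glob accepts "**" together with recursive=True to match any number of nested directories. The function name find_csv_files is hypothetical.

```python
import glob
import os


def find_csv_files(root):
    """Return the paths of all .csv files under root, at any depth."""
    # "**" with recursive=True matches zero or more nested directories,
    # so csv files directly in root are included as well
    pattern = os.path.join(root, "**", "*.csv")
    return sorted(glob.glob(pattern, recursive=True))
```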
How would you delete in Python all files in directory /tmp/dir and all its subdirectories that have extension .txt or .mp3?
You just need to use os.walk to traverse a directory recursively and os.remove when you find a file whose name matches your requirements.
Note that os.walk yields, for each directory it visits, the path of that directory (the root) along with the bare file names inside it. Hence, for os.remove to work you'll need to build the full filename with os.path.join.
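Put together, that could look like the sketch below. The function name delete_by_extension and its default extensions are illustrative; since it deletes files permanently, it is worth trying on a scratch directory first.

```python
import os


def delete_by_extension(top, extensions=(".txt", ".mp3")):
    """Remove files under top whose names end with one of extensions.

    Returns the list of removed paths. The function name is illustrative.
    """
    removed = []
    for root, dirs, files in os.walk(top):
        for name in files:
            # str.endswith accepts a tuple of suffixes
            if name.endswith(tuple(extensions)):
                # os.walk gives bare file names, so join them with root
                full_path = os.path.join(root, name)
                os.remove(full_path)
                removed.append(full_path)
    return removed
```

For the case in the question, the call would be delete_by_extension("/tmp/dir").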