I am using Python 3.5 to analyze data contained in csv files. These files are contained in a "figs" directory, which is contained in a case directory, which is contained in an overall data directory, e.g.:
/strm1/serino/DATA/06052009/figs
Or more generally:
/strm1/serino/DATA/case_date_in_MMDDYYYY/figs
The directory I am starting in is '/strm1/serino/DATA/,' and each subdirectory is the month, day, and year of a case I am working with. Each subdirectory contains another subdirectory named 'figs,' and that is the location of each case's csv file. To be exact:
/strm1/serino/DATA/case_date_in_MMDDYYYY/figs/case_date_in_MMDDYYYY.csv
So, I would like to start in my DATA directory and go through its subdirectories to find those that have the MMDDYYYY naming. However, some of the case directories may be named with a state abbreviation at the end, like: '06052009_TX.' Therefore, instead of matching the MMDDYYYY naming exactly, it could be something as simple as verifying that the directory name contains any number 1 through 9.
Once I am in the first subdirectory (the case directory) I would like to move into the 'figs' subdirectory. Once there, I want to access the csv file with the same naming convention as the first subdirectory (the case directory). I will fill existing arrays with the data contained in each csv file.
Basically, my question concerns navigating through multiple subdirectories that match a certain naming convention and ultimately accessing the data file at the "end." I was naively playing around with glob, fnmatch, os.listdir, and os.walk, but I could not get anything close enough to working that I feel would be helpful to include. I am not very familiar with those modules. What I can include is what I am going for:
for dirs in data_dir that contain a number:
    go into this directory
    go into 'figs' directory
    read data from the csv file whose name matches its case directory name (or whose name format matches the case directory name format)
I have come across related questions, but I have not been able to apply their answers in the way that I would like, especially with nested directories. I really appreciate the help, and let me know if I need to clarify anything.
The following should get you going. It uses the datetime.strptime() function to attempt to convert the first eight characters of each folder name into a valid datetime object. If the conversion fails, the folder name is not in the expected format and the folder is skipped. It then parses every CSV file found in the corresponding figs folder:
from datetime import datetime
import glob
import csv
import os

dirpath, dirnames, filenames = next(os.walk('/strm1/serino/DATA'))

for dirname in dirnames:
    if len(dirname) >= 8:
        try:
            dt = datetime.strptime(dirname[:8], '%m%d%Y')
            print(dt, dirname)
            csv_folder = os.path.join(dirpath, dirname)

            for csv_file in glob.glob(os.path.join(csv_folder, 'figs', '*.csv')):
                with open(csv_file, newline='') as f_input:
                    csv_input = csv.reader(f_input)
                    for row in csv_input:
                        print(row)
        except ValueError as e:
            pass
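If only the CSV named after its case directory should be read, rather than every CSV in figs, the expected path can be built directly. The following is a minimal sketch, not a definitive implementation: find_case_csvs is a hypothetical helper name, and it assumes the CSV carries exactly the same name as its case directory (including any suffix such as _TX).

```python
import os
from datetime import datetime


def find_case_csvs(data_dir):
    """Yield (case_name, csv_path) for each case directory with a matching CSV."""
    for case_name in sorted(os.listdir(data_dir)):
        case_dir = os.path.join(data_dir, case_name)
        if len(case_name) < 8 or not os.path.isdir(case_dir):
            continue
        try:
            # The first eight characters must form a valid MMDDYYYY date.
            datetime.strptime(case_name[:8], '%m%d%Y')
        except ValueError:
            continue
        # Assumed layout: <data_dir>/<case>/figs/<case>.csv
        csv_path = os.path.join(case_dir, 'figs', case_name + '.csv')
        if os.path.isfile(csv_path):
            yield case_name, csv_path
```

Each yielded csv_path can then be opened with csv.reader as in the loop above.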
You listed several problems above. Which one are you stuck on? It seems like you already know how to navigate the file system using os.path. You may not know of the function os.path.join(), which lets you build a path relative to the current file, like so:
os.path.abspath(os.path.join(os.path.dirname(__file__), '../..', 'Data/TrailShelters/'))
To break down the above:
os.path.dirname(__file__) returns the directory that contains the current file. '../..' means: go up two levels in the folder hierarchy. And Data/TrailShelters/ is the directory I wish to navigate to.
How does this apply to your particular case? You will need to make some adaptations, but you can store the path of the parent directory in a variable and then loop over its subdirectories. For each subdirectory, examine its path and extract the part you are interested in (here, the directory name). You can then use a test such as: if 'TN' in subdirectory_name to decide whether it is a subdirectory you care about. If so, update the saved path by appending the subdirectory name with os.path.join(). Does that make sense?
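The filtering step described above can be sketched as follows. This is only an illustration under assumptions: matching_subdirs and its marker parameter are hypothetical names, and a plain substring test stands in for whatever naming rule applies.

```python
import os


def matching_subdirs(parent_dir, marker):
    """Return full paths of subdirectories of parent_dir whose name contains marker."""
    matches = []
    for name in sorted(os.listdir(parent_dir)):
        full_path = os.path.join(parent_dir, name)
        # os.listdir returns bare names, so join them before testing the path.
        if os.path.isdir(full_path) and marker in name:
            matches.append(full_path)
    return matches
```

Calling matching_subdirs(data_dir, 'TX') would return only the case directories whose names contain TX.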
Since it looks like Pathlib is the future, I'm trying to refactor some of my code to move from my previous use of os to Pathlib. I'm stuck with the following problem. Since I work on a Mac, the folders sometimes contain hidden files whose names start with a period (.DS_Store, or names from deleted files prefixed with ._). That causes a lot of problems when I loop through the files in a directory that have a certain extension. To avoid this problem using os.walk, I do the following:
for root, dirs, files in os.walk(DIR_NAME):
    # iterate all files
    for file in files:
        if file.endswith(ext):
            if file.startswith("."):
                continue
            # do something with the file
I know we have the .stem and .suffix method to manipulate file names with Pathlib, but I don't see how they can help with this problem. The .startswith seems more intuitive but alas it does not seem to be available in Pathlib. So, the question is, how would one go about doing this in Pathlib?
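One possible approach, offered as a sketch rather than the canonical answer: Path.name is an ordinary str, so str.startswith can be applied to it directly. The function name visible_files is hypothetical; ext is assumed to include the leading dot (for example ".csv"), as in the os.walk version above.

```python
from pathlib import Path


def visible_files(dir_name, ext):
    """Return files under dir_name that end with ext and are not dot-files."""
    return sorted(
        path
        for path in Path(dir_name).rglob('*' + ext)  # rglob searches recursively
        if path.is_file() and not path.name.startswith('.')
    )
```

The explicit startswith test is kept because pathlib patterns are not guaranteed to skip names that begin with a period.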
I have a question. I am reading a FITS file and then using information from its header to determine the names of other files that are related to the original FITS file. But for some of the FITS files, the other files (blaze_file, bis_file, ccf_table) are not available, and because of that my code fails with the obvious error: No such file or directory.
import pandas as pd
import sys, os
import numpy as np
from glob import glob
from astropy.io import fits

PATH = os.path.join("home", "Desktop", "2d_spectra")
for filename in os.listdir(PATH):
    if filename.endswith("_e2ds_A.fits"):
        e2ds_hdu = fits.open(filename)
        e2ds_header = e2ds_hdu[0].header
        date = e2ds_header['DATE-OBS']
        date2 = date = date[0:19]
        blaze_file = e2ds_header['HIERARCH ESO DRS BLAZE FILE']
        bis_file = glob('HARPS.' + date2 + '*_bis_G2_A.fits')
        ccf_table = glob('HARPS.' + date2 + '*_ccf_G2_A.tbl')
        if not all(file in os.listdir(PATH) for file in [blaze_file,bis_file,ccf_table]):
            continue
What I want is for my code to run only if all the files are available, and to skip the file otherwise. The problem is that I define the other files as variables inside the for loop, because I am using the header information. So how can I define them before the for loop and then use something like the check at the end of the code above?
Can anyone help me out with this?
The filenames returned by os.listdir() are always relative to the path given there.
In order to be used, they have to be joined with this path.
Example:
PATH = os.path.join("home", "Desktop", "2d_spectra")
for filename in os.listdir(PATH):
    if filename.endswith("_e2ds_A.fits"):
        filepath = os.path.join(PATH, filename)
        e2ds_hdu = fits.open(filepath)
        …
Let the filenames be ['a', 'b', 'a_e2ds_A.fits', 'b_e2ds_A.fits']. The code now excludes the first two names and then prepends the directory path to the remaining two.
a_e2ds_A.fits becomes home/Desktop/2d_spectra/a_e2ds_A.fits and
b_e2ds_A.fits becomes home/Desktop/2d_spectra/b_e2ds_A.fits.
Now they can be opened from the directory the script was started in, not just from inside the given directory. (PATH as written is a relative path; make it absolute if the files must be reachable from anywhere.)
I should become accustomed to reading a question in full before trying to answer it.
The problem I mentioned only matters if you start the script from a path outside the said directory. Nevertheless, fixing it will make your code much more consistent.
Your real problem, however, lies somewhere else: you examine a file and then, after checking its contents, want to read files whose names depend on information from that first file.
There are several ways to accomplish your goal:
1. Just extend your loop with the proper tests.
Pseudo code:
for file in files:
    if file.endswith("fits"):
        open file
        read date from header
        create file names depending on date
        if all files exist:
            proceed
or
for file in files:
    if file.endswith("fits"):
        open file
        read date from header
        create file names depending on date
        if not all files exist:
            continue # actual keyword, no pseudo code!
        proceed
2. Put some functionality into functions (variation of 1.)
3. Create a loop in a generator function which yields the "interesting information" of one fits file (or alternatively nothing) and have another loop run over them to actually work with the data.
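The generator variant could look roughly like the sketch below. To keep it self-contained, it does not use astropy: related_names is a hypothetical callable that you would write yourself to open the primary file, read its header, and return the names of the dependent files. The function name iter_complete_sets and the suffix default are likewise assumptions for illustration.

```python
import os


def iter_complete_sets(directory, related_names, suffix="_e2ds_A.fits"):
    """Yield (primary_path, related_paths) only when every related file exists."""
    for filename in sorted(os.listdir(directory)):
        if not filename.endswith(suffix):
            continue
        primary = os.path.join(directory, filename)
        # related_names (hypothetical) derives the dependent file names
        # from the primary file, e.g. by reading its header.
        related = [os.path.join(directory, name) for name in related_names(primary)]
        if all(os.path.isfile(path) for path in related):
            yield primary, related
```

A second loop, for primary, related in iter_complete_sets(PATH, my_reader), then only ever sees complete sets of files.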
If I am still missing some points or am not detailed enough, please let me know.
Since you have to read the fits file to know the names of the dependent files, there's no way you can avoid reading the fits file first. The only thing you can do is test for the dependent files' existence before trying to read them, and skip the rest of the loop (using continue) if any of them is missing.
Edit this line
e2ds_hdu = fits.open(filename)
And replace with
e2ds_hdu = fits.open(os.path.join(PATH, filename))
I've got a pretty simple task but I haven't done too many functions with excel within python and I'm not sure how to go about doing this.
What I need to do:
Look at many excel files within subfolders, rename them according to information within the file, and store them all in one folder somewhere else.
The data is structured like this:
Main Folder
    Subfolder1
        File1
        File2
        File3
        ...
For about a hundred subfolders and several files within each subfolder.
From here, I want to pull the company name, part number, and date from within the file and use those to rename the excel file. Not sure how to rename the file.
Then save it somewhere else. I'm having trouble finding all these functions, any advice?
Check the os and os.path module for listing folder contents (walk, listdir) and working with path names (abspath, basename etc.)
Also, shutil has some interesting functions for copying stuff. Check out copyfile and specify the dst parameter based on the data you read from the excel file.
This page can help you get at the Excel data: http://www.python-excel.org/
You probably want to have some high-level code like this:
for subfolder_name in os.listdir(MAIN_FOLDER):
    # exercise left to reader: filter out non-folders
    subfolder_path = os.path.join(MAIN_FOLDER, subfolder_name)
    for excel_file_name in os.listdir(os.path.join(MAIN_FOLDER, subfolder_name)):
        # exercise left to reader: filter out non-excel-files
        excel_file_path = os.path.join(subfolder_path, excel_file_name)
        new_excel_file_name = extract_filename_from_excel_file(excel_file_path)
        new_excel_file_path = os.path.join(NEW_MAIN_FOLDER, subfolder_name,
                                           new_excel_file_name)
        shutil.copyfile(excel_file_path, new_excel_file_path)
You'll have to provide extract_filename_from_excel_file yourself using the xlrd module from the site I mentioned.
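For the renaming and saving step, a rough sketch is shown here. It assumes the company name, part number, and date have already been read from the workbook as strings; copy_with_new_name and the underscore-separated naming scheme are hypothetical choices, not something the question prescribes. Copying under a new name leaves the original file untouched.

```python
import os
import shutil


def copy_with_new_name(src_path, dest_folder, company, part_number, date):
    """Copy src_path into dest_folder under a name built from the given fields."""
    # shutil.copyfile does not create missing folders, so create it first.
    os.makedirs(dest_folder, exist_ok=True)
    extension = os.path.splitext(src_path)[1]  # keep the original extension
    new_name = "{}_{}_{}{}".format(company, part_number, date, extension)
    dest_path = os.path.join(dest_folder, new_name)
    shutil.copyfile(src_path, dest_path)
    return dest_path
```

If two workbooks produce the same name, the later copy overwrites the earlier one, so a real script may want to check os.path.exists(dest_path) first.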
I have some homework that I am trying to complete. I don't want the answer. I'm just having trouble in starting. The work I have tried is not working at all... Can someone please just provide a push in the right direction. I am trying to learn but after trying and trying I need some help.
I know I can use os.path.basename() to get the basename and then add it to the file name, but I can't get it together.
Here is the assignment
In this project, write a function that takes a directory path and creates an archive of the directory only. For example, if the same path were used as in the example ("c:\\xxxx\\Archives\\archive_me"), the zipfile would contain archive_me\\groucho, archive_me\\harpo and archive_me\\chico.
The base directory (archive_me in the example above) is the final element of the input, and all paths recorded in the zipfile should start with the base directory.
If the directory contains sub-directories, the sub-directory names and any files in the sub-directories should not be included. (Hint: You can use isfile() to determine if a filename represents a regular file and not a directory.)
Thanks again any direction would be great.
It would help to know what you tried yourself, so I'm only giving a few pointers to methods in the standard libraries:
os.listdir to get a list of files and folders under a given directory (beware, it returns only the file/folder name, not the full path!)
os.path.isfile as mentioned in the assignment to check if a given path represents a file or a folder
os.path.isdir, the opposite of os.path.isfile (thanks inspectorG4adget)
os.path.join to join a filename with the basedir without having to worry about slashes and delimiters
ZipFile for handling, well, zip files
ZipFile.write to write the files found to the zip
I'm not sure you'll need all of those, but it doesn't hurt knowing they exist.
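As one more pointer rather than a solution to the assignment: ZipFile.write accepts an arcname argument that sets the path recorded inside the archive, independent of where the file lives on disk. The small demonstration below uses a temporary directory and made-up file names; demo_arcname is a hypothetical name.

```python
import os
import tempfile
import zipfile


def demo_arcname():
    """Write one file into a zip under a chosen internal path and list the archive."""
    with tempfile.TemporaryDirectory() as tmp:
        source = os.path.join(tmp, "groucho.txt")
        with open(source, "w") as handle:
            handle.write("hello")
        archive = os.path.join(tmp, "demo.zip")
        with zipfile.ZipFile(archive, "w") as zf:
            # arcname is the path stored in the zip, not the path on disk.
            zf.write(source, arcname="archive_me/groucho.txt")
        with zipfile.ZipFile(archive) as zf:
            return zf.namelist()
```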
I have some directories that contain other directories which, at the lowest level, contain a bunch of csv files, such as (folder) a -> b -> c -> (csv files). There is usually only one folder at each level. When I process a directory, how can I follow this structure to the end to get the csv files? I was thinking of a recursive solution, but there may be better ways to do this. I am using Python. I hope that is clear.
The os package has a walk function that will do exactly what you need:
import os

for current_path, directories, files in os.walk("/some/path"):
    # current_path is the path of the directory we are currently in
    # directories is a list of the subdirectory names in current_path
    # files is a list of file names in this directory
    for file_name in files:
        print(os.path.join(current_path, file_name))
You can use os.path.join(current_path, file_name) to derive the full path to each file (if you need it).
Alternately, you might find the glob module to be of more use to you:
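from glob import glob

# this pattern matches csv files exactly one directory level below /some/path
for csv_file in glob("/some/path/*/*.csv"):
    # csv_file is the full path to the csv file.
    print(csv_file)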
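Since the csv files in the question sit several levels down (a -> b -> c), a pattern with a fixed number of * components will miss them unless the depth is known in advance. From Python 3.5 on, glob supports a recursive ** pattern. A minimal sketch, with find_csv_files as a hypothetical helper name:

```python
import glob
import os


def find_csv_files(top_dir):
    """Return every .csv file at any depth below top_dir."""
    pattern = os.path.join(top_dir, '**', '*.csv')
    # recursive=True makes '**' match zero or more directory levels.
    return sorted(glob.glob(pattern, recursive=True))
```

Note that the glob module skips files and folders whose names start with a period unless the pattern names them explicitly.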