Loop through a list of files - Python

I'm in the process of developing a data column check, but I'm having a tough time figuring out how to properly loop through a list of files. I have a folder containing CSV files, and I need to check whether each file maintains a certain structure. I'm not worried about the structure check itself; I'm more concerned with how to properly pull each individual file from the directory, load it into a DataFrame, and then move on to the next file. Any help would be much appreciated.
def files(path):
    files = os.listdir(path)
    len_files = len(files)
    cnt = 0
    while cnt < len_files:
        print(files)
        for file in os.listdir(path):
            if os.path.isfile(os.path.join(path, file)):
                with open(path + file, 'r') as f:
                    return data_validate(f)

def data_validate(file):
    # Validation check code will eventually go here...
    print(pd.read_csv(file))

def run():
    files("folder/subfolder/")

Which version of Python do you use?
I use pathlib and Python 3.6+ to do a lot of file processing with pandas. I find pathlib easy to use, though you still have to dip back into os for a couple of functions it hasn't implemented yet. A plus is that Path objects can be passed into the os functions without modification, so I like the flexibility.
This is a function I use to recursively go through an arbitrary directory structure; I have modified it to look more like what you're trying to achieve above, returning a list of DataFrames.
If your directory is always going to be flat, you can simplify this even more.
from pathlib import Path

def files(directory):
    top_dir = Path(directory)
    validated_files = list()
    for item in top_dir.iterdir():
        if item.is_file():
            validated_files.append(data_validate(item))
        elif item.is_dir():
            # recurse into subdirectories; the nested results come back as a sub-list
            validated_files.append(files(item))
    return validated_files
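For the flat case, a minimal sketch could be as simple as the following (files_flat is just an illustrative name; data_validate is the function from the question):

from pathlib import Path

def files_flat(directory):
    # validate every regular file in the directory, skipping subdirectories
    return [data_validate(item) for item in Path(directory).iterdir() if item.is_file()]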

Related

Multiple instances of the same function asynchronously in python

I have a little script that does a few simple tasks. Running Python 3.7.
One of the tasks has to merge some files together, which can be a little time-consuming.
It loops through multiple directories, then each directory gets passed to the function. The function just loops through the files and merges them.
Instead of waiting for it to finish one directory before moving on to the next, and so on, I'd like to use the available cores/threads to have the script merge the PDFs in multiple directories at once, which should save time.
I've got something like this:
if multi_directories:
    if os.path.isdir('merged'):
        pass
    else:
        os.makedirs('merged')
    for directory in multi_directories:
        merge_pdfs(directory)
My merge PDF function looks like this:
def merge_pdfs(directory):
    root_dir = os.path.dirname(os.path.abspath(__file__))
    merged_dir_location = os.path.join(root_dir, 'merged')
    dir_title = directory.rsplit('/', 1)[-1]
    file_list = [file for file in os.listdir(directory)]
    merger = PdfFileMerger()
    for pdf in file_list:
        file_to_open = os.path.join(directory, pdf)
        merger.append(open(file_to_open, 'rb'))
    file_to_save = os.path.join(
        merged_dir_location,
        dir_title + "-merged.pdf"
    )
    with open(file_to_save, "wb") as fout:
        merger.write(fout)
    return True
This works great, but merge_pdfs runs slowly in instances where there is a high number of PDFs in the directory.
Essentially, I want to be able to loop through multi_directories and create a new thread or process for each directory and merge the PDFs at the same time.
I've looked at asyncio, multithreading and a wealth of little snippets here and there but can't seem to get it to work.
You can do something like:
from multiprocessing import Pool

n_processes = 2

...

if multi_directories:
    if os.path.isdir('merged'):
        pass
    else:
        os.makedirs('merged')
    pool = Pool(n_processes)
    pool.map(merge_pdfs, multi_directories)
It should help if the bottleneck is CPU usage. But it may make things even worse if the bottleneck is the HDD, because reading several files in parallel from one physical HDD is usually slower than reading them consecutively. Try it with different values of n_processes.
BTW, to make a list from an iterable, use list(): file_list = list(os.listdir(directory)). And since listdir() already returns a list, you can just write file_list = os.listdir(directory).
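As a minimal sketch of how the pieces might fit together, with the if __name__ == '__main__' guard that multiprocessing needs on platforms that start workers by re-importing the module (merge_pdfs and multi_directories are assumed to be defined as in the question):

import os
from multiprocessing import Pool

n_processes = 2

if __name__ == '__main__':
    if not os.path.isdir('merged'):
        os.makedirs('merged')
    if multi_directories:
        # each directory is merged in its own worker process
        with Pool(n_processes) as pool:
            pool.map(merge_pdfs, multi_directories)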

Run only if "if" statement is true

So I have a question. I'm reading a fits file and then using the information from its header to define the other files which are related to the original fits file. But for some of the fits files the other files (blaze_file, bis_file, ccf_table) are not available, and because of that my code gives the fairly obvious error "No such file or directory".
import pandas as pd
import sys, os
import numpy as np
from glob import glob
from astropy.io import fits

PATH = os.path.join("home", "Desktop", "2d_spectra")

for filename in os.listdir(PATH):
    if filename.endswith("_e2ds_A.fits"):
        e2ds_hdu = fits.open(filename)
        e2ds_header = e2ds_hdu[0].header
        date = e2ds_header['DATE-OBS']
        date2 = date = date[0:19]
        blaze_file = e2ds_header['HIERARCH ESO DRS BLAZE FILE']
        bis_file = glob('HARPS.' + date2 + '*_bis_G2_A.fits')
        ccf_table = glob('HARPS.' + date2 + '*_ccf_G2_A.tbl')
        if not all(file in os.listdir(PATH) for file in [blaze_file, bis_file, ccf_table]):
            continue
So what I want to do is make my code run only if all the files are available, and otherwise skip that fits file. The problem is that I'm defining the other filenames as variables inside the for loop, since I'm using the header information, so how can I define them before the for loop and then use something like the check above?
So can anyone help me out with this?
The filenames returned by os.listdir() are always relative to the path given there.
In order to be used, they have to be joined with this path.
Example:
PATH = os.path.join("home", "Desktop", "2d_spectra")

for filename in os.listdir(PATH):
    if filename.endswith("_e2ds_A.fits"):
        filepath = os.path.join(PATH, filename)
        e2ds_hdu = fits.open(filepath)
        …
Let the filenames be ['a', 'b', 'a_ed2ds_A.fits', 'b_ed2ds_A.fits']. The code now excludes the first two names and then prepends the file path to the remaining two.
a_ed2ds_A.fits becomes /home/Desktop/2d_spectra/a_ed2ds_A.fits and
b_ed2ds_A.fits becomes /home/Desktop/2d_spectra/b_ed2ds_A.fits.
Now they can be accessed from everywhere, not just from the given file path.
I should become accustomed to reading a question in full before trying to answer it.
The problem I mentioned only arises if you don't run the script from within the said directory. Nevertheless, applying the fix will make your code much more consistent.
Your real problem, however, lies somewhere else: you examine a file and then, after checking its contents, want to read files whose names depend on information from that first file.
There are several ways to accomplish your goal:
Just extend your loop with the proper tests.
Pseudo code:
for file in files:
    if file.endswith("fits"):
        open file
        read date from header
        create file names depending on date
        if all files exist:
            proceed
or
for file in files:
    if file.endswith("fits"):
        open file
        read date from header
        create file names depending on date
        if not all files exist:
            continue  # actual keyword, no pseudo code!
        proceed
Put some functionality into functions (variation of 1.)
Create a loop in a generator function which yields the "interesting information" of one fits file (or alternatively nothing) and have another loop run over them to actually work with the data.
If I am still missing some points or am not detailed enough, please let me know.
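A minimal sketch of the generator idea from option 3, under the assumption that the "interesting information" is the path plus the companion filenames (interesting_fits is just an illustrative name; the header keys and glob patterns come from the question):

import os
from glob import glob
from astropy.io import fits

def interesting_fits(path):
    # yields one tuple per fits file whose companion files are all present
    for filename in os.listdir(path):
        if not filename.endswith("_e2ds_A.fits"):
            continue
        filepath = os.path.join(path, filename)
        with fits.open(filepath) as e2ds_hdu:
            header = e2ds_hdu[0].header
            date2 = header['DATE-OBS'][0:19]
            blaze_file = header['HIERARCH ESO DRS BLAZE FILE']
        bis_files = glob(os.path.join(path, 'HARPS.' + date2 + '*_bis_G2_A.fits'))
        ccf_tables = glob(os.path.join(path, 'HARPS.' + date2 + '*_ccf_G2_A.tbl'))
        if os.path.exists(os.path.join(path, blaze_file)) and bis_files and ccf_tables:
            yield filepath, blaze_file, bis_files, ccf_tables

# the second loop then only ever sees complete sets of files:
# for filepath, blaze_file, bis_files, ccf_tables in interesting_fits(PATH):
#     ...  (actual processing goes here)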
Since you have to read the fits file to know the names of the other dependent files, there's no way to avoid reading the fits file first. The only thing you can do is test for the dependent files' existence before trying to read them, and skip the rest of the loop (using continue) if they are missing.
Edit this line
e2ds_hdu = fits.open(filename)
And replace with
e2ds_hdu = fits.open(os.path.join(PATH, filename))

Taking data from files which are in a folder

How do I get the data from multiple txt files placed in a specific folder? I started with this but could not fix it. It gives an error like 'No such file or directory: '.idea''.
(Let's say I have a folder A, and in it there are x.txt, y.txt, z.txt and so on. I am trying to get and print the information from all the files x, y, z.)
def find_get(folder):
    for file in os.listdir(folder):
        f = open(file, 'r')
        for data in open(file, 'r'):
            print data

find_get('filex')
Thanks.
If you just want to print each line:
import glob
import os

def find_get(path):
    for f in glob.glob(os.path.join(path, "*.txt")):
        # glob already returns the path joined with the filename
        with open(f) as data:
            for line in data:
                print(line)
glob will find only your .txt files in the specified path.
Your error comes from not joining the path to the filename; unless the file was in the same directory you were running the code from, Python would not be able to find it without the full path. Another issue is that you seem to have a directory .idea, which would also give you an error when you try to open it as a file. This also presumes you actually have permission to read the files in the directory.
If your files were larger I would avoid reading them all into memory and/or storing the full content.
First of all, make sure you add the folder name to the file name, so the file can be found relative to where the script is executed.
To do so you want to use os.path.join, which, as its name suggests, joins paths. So, using a generator:
def find_get(folder):
    for filename in os.listdir(folder):
        relative_file_path = os.path.join(folder, filename)
        with open(relative_file_path) as f:
            # read() gives the entire data from the file
            yield f.read()

# this consumes the generator into a list
files_data = list(find_get('filex'))
See what we got in the list that consumed the generator:
print files_data
It may be more convenient to produce tuples which can be used to construct a dict:
def find_get(folder):
    for filename in os.listdir(folder):
        relative_file_path = os.path.join(folder, filename)
        with open(relative_file_path) as f:
            # read() gives the entire data from the file
            yield (relative_file_path, f.read(), )

# this consumes the generator into a dict
files_data = dict(find_get('filex'))
You will now have a mapping from the file's path to its content.
Also, take a look at the answer by @Padraic Cunningham. He brought up the glob module, which is suitable in this case.
The error you're facing is simple: listdir returns filenames, not full pathnames. To turn them into pathnames you can access from your current working directory, you have to join them to the directory path:
for filename in os.listdir(directory):
    pathname = os.path.join(directory, filename)
    with open(pathname) as f:
        # do stuff
So, in your case, there's a file named .idea in the folder directory, but you're trying to open a file named .idea in the current working directory, and there is no such file.
There are at least four other potential problems with your code that you also need to think about and possibly fix after this one:
You don't handle errors. There are many very common reasons you may not be able to open and read a file: it may be a directory, you may not have read access, it may be exclusively locked, it may have been moved since your listdir, etc. And those aren't logic errors in your code or user errors in specifying the wrong directory; they're part of the normal flow of events, so your code should handle them rather than just die. Which means you need a try statement (see the sketch after this list).
You don't do anything with the files but print out every line. Basically, this is like running cat folder/* from the shell. Is that what you want? If not, you have to figure out what you want and write the corresponding code.
You open the same file twice in a row, without closing in between. At best this is wasteful, at worst it will mean your code doesn't run on any system where opens are exclusive by default. (Are there such systems? Unless you know the answer to that is "no", you should assume there are.)
You don't close your files. Sure, the garbage collector will get to them eventually--and if you're using CPython and know how it works, you can even prove the maximum number of open file handles that your code can accumulate is fixed and pretty small. But why rely on that? Just use a with statement, or call close.
However, none of those problems are related to your current error. So, while you have to fix them too, don't expect fixing one of them to make the first problem go away.
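A minimal sketch combining points 1, 3 and 4 above (a try around the file access, a single open per file, and a with statement so the file is always closed); what you do with each line is still up to you:

import os

def find_get(folder):
    for filename in os.listdir(folder):
        pathname = os.path.join(folder, filename)
        try:
            with open(pathname) as f:  # opened once, closed automatically
                for line in f:
                    print(line)  # placeholder for whatever processing you need
        except (IOError, OSError) as e:
            # directories, unreadable files, files that vanished, etc. end up here
            print("skipping {}: {}".format(pathname, e))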
Full variant:
import os

def find_get(path):
    files = {}
    for file in os.listdir(path):
        if os.path.isfile(os.path.join(path, file)):
            with open(os.path.join(path, file), "r") as data:
                files[file] = data.read()
    return files

print(find_get("filex"))
Output:
{'1.txt': 'dsad', '2.txt': 'fsdfs'}
After that you could generate one file from that content, etc.
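For example, a quick sketch of stitching all the collected content into a single file ("combined.txt" is just an arbitrary name for this example):

files_data = find_get("filex")

# write every file's content into one combined file
with open("combined.txt", "w") as out:
    for name, content in sorted(files_data.items()):
        out.write(content)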
Key things:
os.listdir returns a list of names without the full path, so you need to join the initial path with each found item before using it.
A dict is ideal for collecting the results.
os.listdir returns both files and folders, so you need to check whether each list item is really a file.
You should check that the item is actually a file and not a folder, since you can't open folders for reading. Also, you can't just open a relative path, since the file is under a folder, so you should build the correct path with os.path.join. Check below:
import os

def find_get(folder):
    for file in os.listdir(folder):
        if not os.path.isfile(os.path.join(folder, file)):
            continue  # skip directories
        f = open(os.path.join(folder, file), 'r')
        for line in f:
            print line
        f.close()

Changing name of file until it is unique

I have a script that downloads files (pdfs, docs, etc) from a predetermined list of web pages. I want to edit my script to alter the names of files with a trailing _x if the file name already exists, since it's possible files from different pages will share the same filename but contain different contents, and urlretrieve() appears to automatically overwrite existing files.
So far, I have:
urlfile = 'https://www.foo.com/foo/foo/foo.pdf'
filename = urlfile.split('/')[-1]
# filename is now 'foo.pdf'
if os.path.exists(filename):
    filename = filename.split('.')[0] + '_' + '1'
That works fine for one occurrence, but it looks like after one foo_1.pdf it will start saving as foo_1_1.pdf, and so on. I would like to save the files as foo_1.pdf, foo_2.pdf, and so on.
Can anybody point me in the right direction on how to I can ensure that file names are stored in the correct fashion as the script runs?
Thanks.
So what you want is something like this:
curName = "foo_0.pdf"
while os.path.exists(curName):
    num = int(curName.split('.')[0].split('_')[1])
    curName = "foo_{}.pdf".format(str(num + 1))
Here's the general scheme:
Assume you start from the first file name (foo_0.pdf)
Check if that name is taken
If it is, increment the number by 1
Continue looping until you find a name that isn't taken
One alternative: Generate a list of file numbers that are in use, and update it as needed. If it's sorted, you can say name = "foo_{}.pdf".format(flist[-1]+1). This has the advantage that you don't have to run through all the files every time (as the above solution does). However, you need to keep the list of numbers in memory. Additionally, this will not fill any gaps in the numbers.
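A rough sketch of that alternative, assuming filenames of the form foo_N.pdf in the current directory (used_numbers and next_name are just illustrative names):

import os
import re

# collect the numbers already in use, e.g. foo_3.pdf -> 3
used_numbers = sorted(
    int(m.group(1))
    for m in (re.match(r"foo_(\d+)\.pdf$", f) for f in os.listdir("."))
    if m
)

def next_name():
    # take one past the highest number seen so far, remembering it for next time
    num = used_numbers[-1] + 1 if used_numbers else 0
    used_numbers.append(num)
    return "foo_{}.pdf".format(num)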
Why not just use the tempfile module?
import tempfile
fileobj = tempfile.NamedTemporaryFile(suffix='.pdf', prefix='', delete=False)
Now your filename will be available in fileobj.name and you can manipulate it to your heart's content. As an added benefit, this is cross-platform.
Since you're dealing with multiple pages, this seems more like a "global archive" than a per-page archive. For a per-page archive, I would go with the answer from @wnnmaw.
For a global archive, I would take a different approach...
Create a directory for each filename.
Store the file in that directory as "1" + extension.
Write the current "number" to the directory as "_files.txt".
Additional files are written as 2, 3, 4, etc., and the value in _files.txt is incremented.
The benefits of this:
The directory is the original filename. If you keep turning "Example-1.pdf" into "Example-2.pdf", you run into the possibility that you download a real "Example-2.pdf" and can't associate it with the original filename.
You can grab the number of like-named files either by reading _files.txt or by counting the number of files in the directory.
Personally, I'd also suggest storing the files in a tiered bucketing system, so that you don't have too many files/directories in any one directory (hundreds of files makes things annoying for a user; thousands of files can affect OS performance). A bucketing system might turn a filename into a hexdigest, then drop the file into "/%s/%s/%s" % (hex[0:3], hex[3:6], filename). The hexdigest is used to give you a more even distribution of characters.
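A small sketch of that bucketing idea, assuming an MD5 hexdigest of the filename is an acceptable bucket key (bucketed_path is a made-up helper name):

import hashlib
import os

def bucketed_path(root, filename):
    # hash the filename so the leading characters are evenly distributed
    hexdigest = hashlib.md5(filename.encode("utf-8")).hexdigest()
    return os.path.join(root, hexdigest[0:3], hexdigest[3:6], filename)

path = bucketed_path("archive", "Example-1.pdf")
if not os.path.isdir(os.path.dirname(path)):
    os.makedirs(os.path.dirname(path))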
import os

def uniquify(path, sep=''):
    path = os.path.normpath(path)
    num = 0
    newpath = path
    dirname, basename = os.path.split(path)
    filename, ext = os.path.splitext(basename)
    while os.path.exists(newpath):
        newpath = os.path.join(dirname, '{f}{s}{n:d}{e}'
                               .format(f=filename, s=sep, n=num, e=ext))
        num += 1
    return newpath

filename = uniquify('foo.pdf', sep='_')
Possible problems with this include:
If you call uniquify many, many thousands of times with the same path, each subsequent call may get a bit slower, since the while-loop starts checking from num=0 each time.
uniquify is vulnerable to race conditions whereby a file may not exist at the time os.path.exists is called, but may exist at the time you use the value returned by uniquify. Use tempfile.NamedTemporaryFile to avoid this problem. You won't get incremental numbering, but you will get files with unique names, guaranteed not to already exist. You could use the prefix parameter to specify the original name of the file. For example,
import tempfile
import os

def uniquify(path, sep='_', mode='w'):
    path = os.path.normpath(path)
    if os.path.exists(path):
        dirname, basename = os.path.split(path)
        filename, ext = os.path.splitext(basename)
        return tempfile.NamedTemporaryFile(prefix=filename+sep, suffix=ext, delete=False,
                                           dir=dirname, mode=mode)
    else:
        return open(path, mode)
Which could be used like this:
In [141]: f = uniquify('/tmp/foo.pdf')
In [142]: f.name
Out[142]: '/tmp/foo_34cvy1.pdf'
Note that to prevent a race-condition, the opened filehandle -- not merely the name of the file -- is returned.

Python - Efficiently building a dictionary

I am trying to build a dict(dict(dict())) out of multiple files, which are stored in different numbered directories, i.e.
/data/server01/datafile01.dat
/data/server01/datafile02.dat
...
/data/server02/datafile01.dat
/data/server02/datafile02.dat
...
/data/server86/datafile01.dat
...
/data/server86/datafile99.dat
I have a couple problems at the moment:
Switching between directories
I know that I have 86 servers, but the number of files per server may vary. I am using:
for i in range(1, 86):
    basedir = '/data/server%02d' % i
    for file in glob.glob(basedir + '*.dat'):
        # Do reading and sorting here
but I can't seem to switch between the directories properly. It seems to just sit in the first one and get stuck when there are no files in the directory.
Checking if key already exists
I would like to have a function that checks whether a key is already present and, if it isn't, creates that key and certain subkeys, since one can't just assign dict[Key1][Subkey1][Subsubkey1] = value.
BTW, I am using Python 2.6.6.
Björn helped with the defaultdict half of your question. His suggestion should get you very close to where you want to be in terms of the default value for keys that do not yet exist.
The best tool for walking a directory and looking at files is os.walk. You can combine the directory and filename names that you get from it with os.path.join to find the files you are interested in. Something like this:
import os

data_path = '/data'

# option 1 using nested list comprehensions**
data_files = (os.path.join(root, f)
              for (root, dirs, files) in os.walk(data_path)
              for f in files)  # can use [] instead of ()

# option 2 using nested for loops
data_files = []
for root, dirs, files in os.walk(data_path):
    for f in files:
        data_files.append(os.path.join(root, f))

for data_file in data_files:
    # ... process data_file ...
**Docs for list comprehensions.
I can't help you with your first problem, but the second one can be solved by using a defaultdict. This is a dictionary that has a function that is called to generate a value when a requested key did not exist. Using lambda you can nest them:
>>> from collections import defaultdict
>>> your_dict = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
>>> your_dict[1][2][3]
0
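For the structure in the question, that could be used along these lines (the server, file and field names are just placeholders):

from collections import defaultdict

results = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))

# no need to check whether the intermediate keys exist first
results['server01']['datafile01.dat']['some_field'] += 1
results['server01']['datafile02.dat']['other_field'] += 1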
I'm assuming these 'directories' are remotely mounted shares?
Couple of things:
I'd use os.path.join instead of 'basedir' + '*.dat'
For FS-related stuff I've had very good results parallelizing the computation using multiprocessing.Pool, to get around those times when a remote fs might be extremely slow and hold up the whole process.
import os
import glob
import multiprocessing as mp

def processDir(path):
    results = {}
    for file in glob.iglob(os.path.join(path, '*.dat')):
        # ... read the file and add to results here ...
        pass
    return results

dirpaths = ['/data/server%02d' % i for i in range(1, 87)]
_results = mp.Pool(8).map(processDir, dirpaths)
# ... combine the per-directory dicts in _results here ...
For your dict-related problems, use defaultdict, as mentioned in the other answers, or even your own dict subclass or a helper function:
def addresult(results, key, subkey, subsubkey, value):
    if key not in results:
        results[key] = {}
    if subkey not in results[key]:
        results[key][subkey] = {}
    if subsubkey not in results[key][subkey]:
        results[key][subkey][subsubkey] = value
There are almost certainly more efficient ways to accomplish this, but that's a start.
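For instance, a sketch of the same idea using dict.setdefault, which, like addresult above, only fills in the innermost value when it is not already present:

def addresult(results, key, subkey, subsubkey, value):
    # setdefault returns the existing entry, or inserts and returns the default
    results.setdefault(key, {}).setdefault(subkey, {}).setdefault(subsubkey, value)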
