Python while loop with assignment

I'm trying to use a while loop to loop through .xml files in a folder. However, files are being added to the folder whilst the while loop is running. This is a shortened version of the code I currently use:
import os

my_folder = "d:\\xml\\"
while True:
    files = [f for f in os.listdir(my_folder) if f.endswith(".xml")]
    while files:
        for file in files:
            # do whatever with the xml file
            os.remove(my_folder + file)
        files = [f for f in os.listdir(my_folder) if f.endswith(".xml")]
What I would like to do is tidy up the code by only having one line filling the files list. I'd like to have something like:
while files = [f for f in os.listdir(my_folder) if f.endswith(".xml")]:
But, I know this won't work. I would imagine that Python is capable of this, but I don't know the correct syntax.
Added note:
I'm using Windows 10 with Python 3.7.6

You could simplify your code by removing the inner while loop and the second assignment to files. This will loop indefinitely, see if there are xml files in the directory, and if so process and delete them, before continuing to loop. (You might also add a short sleep in case of no new files.)
while True:
    files = [f for f in os.listdir(my_folder) if f.endswith(".xml")]
    for file in files:
        # do whatever with the xml file
        os.remove(my_folder + file)
As shown in the other answer, you could also use the := operator and something like the following...
while True:
    while (files := [...]):
        ...
... but this would behave exactly the same as without the inner while. It only makes a difference if you want code in the outer loop that is not in the inner loop, e.g. to do something when there are temporarily no files left.
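A runnable sketch of the := version, reworked as a function so it terminates once the folder is empty (drain_xml and the temporary-folder demo are illustrative, not from the question; := requires Python 3.8+):

```python
import os
import tempfile

def drain_xml(folder):
    """Process and delete .xml files until none remain; returns count processed."""
    processed = 0
    # := assigns and tests in one expression, so the list is built in one place
    while files := [f for f in os.listdir(folder) if f.endswith(".xml")]:
        for file in files:
            # do whatever with the xml file, then delete it
            os.remove(os.path.join(folder, file))
            processed += 1
    return processed

# demo in a temporary folder
tmp = tempfile.mkdtemp()
for name in ("a.xml", "b.xml", "c.txt"):
    open(os.path.join(tmp, name), "w").close()
print(drain_xml(tmp))  # 2 (the .txt file is ignored)
```

Files added to the folder while the inner loop runs are picked up the next time the while condition re-lists the directory.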

Related

get List of recently added .csv Files into the directory using python

I have an output files folder where all the files get dumped. I need to check the folder every five minutes and pick up the list of recently added files using Python.
One way of doing this is using sets and getting the non-intersected files; is there any better approach?
I would much appreciate a code snippet.
Thanks
To solve this, you can make use of listdir() from the os module and sleep() from the time module.
import os
from time import sleep

path = "/path/to/folder/with/csv/files"

with open("log.txt", "a+") as log_file:
    while True:
        log_file.seek(0)
        existing = [f.strip() for f in log_file]
        csvs = [f for f in os.listdir(path) if f.endswith(".csv") and f not in existing]
        if len(csvs) > 0:
            print(f"Found {len(csvs)} new file(s):")
            for f in csvs:
                print(f)
            print("\n")
        else:
            print("Found 0 new files.")
        log_file.writelines([f"{f}\n" for f in csvs])
        sleep(300)
We will be storing the existing file names in a .txt file. You could use a .json file or any other file type you like. Firstly, we open the file using with/open (in append/read mode) and get a list of the file names that have previously been stored in the text file. We then get a list of all of the .csv files in that directory that are not in the file:
csvs = [f for f in os.listdir(path) if f.endswith(".csv") and f not in existing]
os.listdir(path) lists all of the files and folders in the given directory.
The following if/else statement is simply for output purposes and is not required. It is only saying: if new csv files were found, print how many and the names of each. If none were found, print that zero were found.
All that's left to do is write the newly discovered file names into the .txt file so that on the next iteration, they will be marked as existing and not new:
log_file.writelines([f"{f}\n" for f in csvs])
The final line, sleep(300), makes the program wait 300 seconds, or 5 minutes, to iterate again.
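The set-based idea from the question can also be sketched entirely in memory (no log file, so the "seen" list does not survive a restart; watch_new_csvs is a hypothetical name):

```python
import os
import tempfile

def watch_new_csvs(path, seen):
    """Return .csv files in path not yet in seen, updating seen in place."""
    current = {f for f in os.listdir(path) if f.endswith(".csv")}
    new = current - seen  # set difference: files added since the last check
    seen |= new
    return sorted(new)

# demo in a temporary folder
tmp = tempfile.mkdtemp()
seen = set()
open(os.path.join(tmp, "a.csv"), "w").close()
print(watch_new_csvs(tmp, seen))  # ['a.csv']
open(os.path.join(tmp, "b.csv"), "w").close()
print(watch_new_csvs(tmp, seen))  # ['b.csv'], a.csv is no longer "new"
```

The log-file version above trades this simplicity for persistence across program restarts.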

Loop through list of files

I'm in the process of developing a data column check, but I'm having a tough time figuring out how to properly loop through a list of files. I have a folder with a list of csv files. I need to check if each file maintains a certain structure. I'm not worried about checking the structure of each file, I'm more worried about how to properly pull each individual file from the dir, dataframe it, and then move on to the next file. Any help would be much appreciated.
def files(path):
    files = os.listdir(path)
    len_files = len(files)
    cnt = 0
    while cnt < len_files:
        print(files)
        for file in os.listdir(path):
            if os.path.isfile(os.path.join(path, file)):
                with open(path + file, 'r') as f:
                    return data_validate(f)

def data_validate(file):
    # Validation check code will eventually go here...
    print(pd.read_csv(file))

def run():
    files("folder/subfolder/")
Which version of python do you use?
I use Pathlib and python3.6+ to do a lot of file processing with pandas. I find Pathlib easy to use, though you still have to dip back into os for a couple of functions they haven't implemented yet. A plus is that Path objects can be passed into the os functions without modification - so I like the flexibility.
This is a function I used to recursively go through an arbitrary directory structure that I have modified to look more like what you're trying to achieve above, returning a list of DataFrames.
If your directory is always going to be flat, you can simplify this even more.
from pathlib import Path

def files(directory):
    top_dir = Path(directory)
    validated_files = list()
    for item in top_dir.iterdir():
        if item.is_file():
            validated_files.append(data_validate(item))
        elif item.is_dir():
            validated_files.append(files(item))
    return validated_files
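If the directory is always flat, the pathlib version can shrink to a single glob. A sketch with a stand-in data_validate so it is self-contained (the real one would call pd.read_csv as in the question):

```python
import os
import tempfile
from pathlib import Path

def data_validate(path):
    # stand-in for the question's validator, which would call pd.read_csv(path)
    return path.name

def files_flat(directory):
    # glob("*.csv") matches only the top level; rglob("*.csv") would recurse
    return [data_validate(p) for p in sorted(Path(directory).glob("*.csv"))]

# demo in a temporary folder
tmp = tempfile.mkdtemp()
for name in ("b.csv", "a.csv", "notes.txt"):
    open(os.path.join(tmp, name), "w").close()
print(files_flat(tmp))  # ['a.csv', 'b.csv']
```

Path objects can be passed straight to pd.read_csv, so no os.path.join bookkeeping is needed.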

Multiple instances of the same function asynchronously in python

I have a little script that does a few simple tasks. Running Python 3.7.
One of the tasks has to merge some files together which can be a little time consuming.
It loops through multiple directories, then each directory gets passed to the function. The function just loops through the files and merges them.
Instead of waiting for it to finish one directory, then onto the next one, then wait, then onto the next one, etc...
I'd like to utilize the horsepower/cores/threads to have the script merge the PDFs in multiple directories at once, which should save time.
I've got something like this:
if multi_directories:
    if os.path.isdir('merged'):
        pass
    else:
        os.makedirs('merged')
    for directory in multi_directories:
        merge_pdfs(directory)
My merge PDF function looks like this:
def merge_pdfs(directory):
    root_dir = os.path.dirname(os.path.abspath(__file__))
    merged_dir_location = os.path.join(root_dir, 'merged')
    dir_title = directory.rsplit('/', 1)[-1]
    file_list = [file for file in os.listdir(directory)]
    merger = PdfFileMerger()
    for pdf in file_list:
        file_to_open = os.path.join(directory, pdf)
        merger.append(open(file_to_open, 'rb'))
    file_to_save = os.path.join(
        merged_dir_location,
        dir_title + "-merged.pdf"
    )
    with open(file_to_save, "wb") as fout:
        merger.write(fout)
    return True
This works great - but merge_pdfs runs slow in some instances where there are a high number of PDF's in the directory.
Essentially - I want to be a be able to loop through multi_directories and create a new thread or process for each directory and merge the PDF's at the same time.
I've looked at asyncio, multithreading and a wealth of little snippets here and there but can't seem to get it to work.
You can do something like:
from multiprocessing import Pool

n_processes = 2

...

if multi_directories:
    if os.path.isdir('merged'):
        pass
    else:
        os.makedirs('merged')
    pool = Pool(n_processes)
    pool.map(merge_pdfs, multi_directories)
It should help if the bottleneck is CPU usage. But it may make things even worse if the bottleneck is the HDD, because reading several files in parallel from one physical HDD is usually slower than reading them consecutively. Try it with different values of n_processes.
BTW, to make a list from an iterable, use list(): file_list = list(os.listdir(directory)). And since listdir() already returns a list, you can just write file_list = os.listdir(directory).
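The same fan-out can also be written with concurrent.futures, which tears the pool down via a context manager. A self-contained sketch with a stand-in worker in place of merge_pdfs (swap in ProcessPoolExecutor if the merging turns out to be CPU-bound):

```python
from concurrent.futures import ThreadPoolExecutor

def merge_one(directory):
    # stand-in for merge_pdfs(directory); the real worker would merge PDFs here
    return directory + "-merged.pdf"

def merge_all(directories, n_workers=2):
    # the executor shuts down cleanly when the with-block exits,
    # and map() preserves the input order of the results
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(merge_one, directories))

print(merge_all(["dir_a", "dir_b"]))  # ['dir_a-merged.pdf', 'dir_b-merged.pdf']
```

Threads avoid the pickling restrictions of multiprocessing and are a reasonable first try when the work is mostly disk I/O.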

taking data from files which are in folder

How do I get the data from multiple .txt files placed in a specific folder? I started with this but could not fix it. It gives an error like 'No such file or directory: '.idea'' (??)
(Let's say I have an A folder and in that, there are x.txt, y.txt, z.txt and so on. I am trying to get and print the information from all the files x,y,z)
def find_get(folder):
    for file in os.listdir(folder):
        f = open(file, 'r')
        for data in open(file, 'r'):
            print data

find_get('filex')
Thanks.
If you just want to print each line:
import glob
import os

def find_get(path):
    for f in glob.glob(os.path.join(path, "*.txt")):
        with open(f) as data:
            for line in data:
                print(line)
glob will find only your .txt files in the specified path.
Your error comes from not joining the path to the filename; unless the file is in the same directory you run the code from, Python will not be able to find it without the full path. Another issue is that you seem to have a directory .idea, which would also give you an error when trying to open it as a file. This also presumes you actually have permission to read the files in the directory.
If your files were larger I would avoid reading all into memory and/or storing the full content.
First of all, make sure you add the folder name to the file name, so you can find the file relative to where the script is executed.
To do so you want to use os.path.join, which, as its name suggests, joins paths. So, using a generator:
def find_get(folder):
    for filename in os.listdir(folder):
        relative_file_path = os.path.join(folder, filename)
        with open(relative_file_path) as f:
            # read() gives the entire data from the file
            yield f.read()

# this consumes the generator to a list
files_data = list(find_get('filex'))
See what we got in the list that consumed the generator:
print(files_data)
It may be more convenient to produce tuples which can be used to construct a dict:
def find_get(folder):
    for filename in os.listdir(folder):
        relative_file_path = os.path.join(folder, filename)
        with open(relative_file_path) as f:
            # read() gives the entire data from the file
            yield (relative_file_path, f.read())

# this consumes the generator into a dict
files_data = dict(find_get('filex'))
You will now have a mapping from the file's path to its content.
Also, take a look at the answer by @Padraic Cunningham. He brought up the glob module, which is suitable in this case.
The error you're facing is simple: listdir returns filenames, not full pathnames. To turn them into pathnames you can access from your current working directory, you have to join them to the directory path:
for filename in os.listdir(directory):
    pathname = os.path.join(directory, filename)
    with open(pathname) as f:
        # do stuff
So, in your case, there's a file named .idea in the folder directory, but you're trying to open a file named .idea in the current working directory, and there is no such file.
There are at least four other potential problems with your code that you also need to think about and possibly fix after this one:
You don't handle errors. There are many very common reasons you may not be able to open and read a file--it may be a directory, you may not have read access, it may be exclusively locked, it may have been moved since your listdir, etc. And those aren't logic errors in your code or user errors in specifying the wrong directory, they're part of the normal flow of events, so your code should handle them, not just die. Which means you need a try statement.
You don't do anything with the files but print out every line. Basically, this is like running cat folder/* from the shell. Is that what you want? If not, you have to figure out what you want and write the corresponding code.
You open the same file twice in a row, without closing in between. At best this is wasteful, at worst it will mean your code doesn't run on any system where opens are exclusive by default. (Are there such systems? Unless you know the answer to that is "no", you should assume there are.)
You don't close your files. Sure, the garbage collector will get to them eventually--and if you're using CPython and know how it works, you can even prove the maximum number of open file handles that your code can accumulate is fixed and pretty small. But why rely on that? Just use a with statement, or call close.
However, none of those problems are related to your current error. So, while you have to fix them too, don't expect fixing one of them to make the first problem go away.
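Putting those points together, a sketch that joins paths, uses with, and skips unreadable entries (read_folder is an illustrative name; what to do with each file, point 2, is left as a plain read):

```python
import os
import tempfile

def read_folder(directory):
    """Map readable file names in directory to their contents, skipping the rest."""
    results = {}
    for filename in os.listdir(directory):
        pathname = os.path.join(directory, filename)
        try:
            with open(pathname) as f:  # with closes the file even on error
                results[filename] = f.read()
        except OSError:
            # a subdirectory, a permission problem, a file moved since
            # listdir, etc.; skip it instead of dying
            continue
    return results

# demo: one readable file and one subdirectory
tmp = tempfile.mkdtemp()
with open(os.path.join(tmp, "a.txt"), "w") as f:
    f.write("hello")
os.mkdir(os.path.join(tmp, "sub"))
print(read_folder(tmp))  # {'a.txt': 'hello'}
```

Catching OSError covers IsADirectoryError and PermissionError in one clause, which matches the "part of the normal flow of events" point above.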
Full variant:
import os

def find_get(path):
    files = {}
    for file in os.listdir(path):
        if os.path.isfile(os.path.join(path, file)):
            with open(os.path.join(path, file), "r") as data:
                files[file] = data.read()
    return files

print(find_get("filex"))
Output:
{'1.txt': 'dsad', '2.txt': 'fsdfs'}
After that you could generate one file from the content, etc.
Key things:
os.listdir returns a list of file names without the full path, so you need to join the initial path with each found item before operating on it.
a dict is ideal for collecting the results :)
os.listdir returns both files and folders, so you need to check whether each list item is really a file
You should check whether the item is actually a file and not a folder, since you can't open folders for reading. Also, you can't just open the bare file name, since it is under a folder, so you should build the correct path with os.path.join. Check below:
import os

def find_get(folder):
    for file in os.listdir(folder):
        if not os.path.isfile(os.path.join(folder, file)):
            continue  # skip directories
        with open(os.path.join(folder, file), 'r') as f:
            for line in f:
                print(line)

Running a python script on all the files in a directory

I have a Python script that reads through a text csv file and creates a playlist file. However I can only do one at a time, like:
python playlist.py foo.csv foolist.txt
However, I have a directory of files that need to be made into a playlist, with different names, and sometimes a different number of files.
So far I have looked at creating a txt file with a list of all the names of the file in the directory, then loop through each line of that, however I know there must be an easier way to do it.
for f in *.csv; do
    python playlist.py "$f" "${f%.csv}list.txt"
done
Will that do the trick? This will put foo.csv in foolist.txt and abc.csv in abclist.txt.
Or do you want them all in the same file?
Just use a for loop with the asterisk glob, making sure you quote things appropriately for spaces in filenames
for file in *.csv; do
    python playlist.py "$file" >> outputfile.txt;
done
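The same foo.csv to foolist.txt mapping can be built from Python itself. A sketch that only constructs the commands (actually running them with subprocess is left out, since playlist.py's interface comes from the question):

```python
import glob
import os
import tempfile

def playlist_commands(directory="."):
    # build the argv for each .csv, mirroring the shell ${f%.csv}list.txt rename
    commands = []
    for csv in sorted(glob.glob(os.path.join(directory, "*.csv"))):
        out = csv[:-len(".csv")] + "list.txt"
        commands.append(["python", "playlist.py", csv, out])
    return commands

# demo in a temporary folder
tmp = tempfile.mkdtemp()
open(os.path.join(tmp, "foo.csv"), "w").close()
for cmd in playlist_commands(tmp):
    print(cmd[2:])  # a pair like [.../foo.csv, .../foolist.txt]
```

Each argv list could then be passed to subprocess.run(cmd) to invoke the existing script unchanged.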
Is it a single directory, or nested?
Ex.
topfile.csv
topdir/
    dir1/
        file1.csv
        file2.txt
    dir2/
        file3.csv
        file4.csv
For nested, you can use os.walk(topdir) to get all the files and dirs recursively within a directory.
You could set up your script to accept dirs or files:
python playlist.py topfile.csv topdir
import sys
import os

def main():
    files_toprocess = set()
    paths = sys.argv[1:]
    for p in paths:
        if os.path.isfile(p) and p.endswith('.csv'):
            files_toprocess.add(p)
        elif os.path.isdir(p):
            for root, dirs, files in os.walk(p):
                files_toprocess.update([os.path.join(root, f)
                                        for f in files if f.endswith('.csv')])
If you have a directory name, you can use os.listdir:
os.listdir(dirname)
If you want to select only a certain type of file, e.g. only .csv files, you could use the glob module.
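A minimal sketch of that glob suggestion (csv_files is an illustrative name):

```python
import glob
import os
import tempfile

def csv_files(dirname):
    # matches only names ending in .csv directly inside dirname
    return sorted(glob.glob(os.path.join(dirname, "*.csv")))

# demo in a temporary folder
tmp = tempfile.mkdtemp()
for name in ("a.csv", "b.csv", "c.txt"):
    open(os.path.join(tmp, name), "w").close()
print([os.path.basename(p) for p in csv_files(tmp)])  # ['a.csv', 'b.csv']
```

Unlike os.listdir, glob returns paths that already include the directory prefix, so they can be opened directly.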
