Comparison of file list with files in folder - python

I have a list of filenames, but in the directory they are named a little differently. I want to print the filenames that are not in the directory. Example of files:
FOO_BAR_524B_023D9B01_2021-157T05-34-31__00001_2021-08-30T124702.130.tgz
import os
missing = ['FOO_BAR_524B_023D9B01_2021-157T05-34-31__00001', 'dfiknvbdjfhnv']
for fileName in missing:
    for fileNames in next(os.walk('C:\\Users\\foo\\bar'))[2]:
        if fileName not in fileNames:
            print(fileName)
I cannot get what I'm doing wrong...

The problem is that you iterate over every file in the directory (for fileNames in next(os.walk(...))[2]) and check if fileName is in each of those file names. For every file in the folder where fileName not in fileNames, fileName is printed, resulting in it being printed many times.
This can be fixed by doing a single check to see if all files in the folder do not contain the target file name.
import os
missing = ['FOO_BAR_524B_023D9B01_2021-157T05-34-31__00001', 'dfiknvbdjfhnv']
fileNames = next(os.walk('C:\\Users\\foo\\bar'))[2]
for missingfileName in missing:
    if all(missingfileName not in fileName for fileName in fileNames):
        print(missingfileName)
If you want it to be more efficient and you are only looking for file names that are prefixes of other names, then you can use a data structure called a trie. For example if missing equals ['bcd'], and there is a file called abcde and these are not considered a match, then a trie is appropriate here.
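A minimal sketch of the prefix-only idea, assuming a listed name counts as present only when some file name starts with it. The Trie class and the find_missing helper are hypothetical names made up for this illustration, and 'abcde' is the example file from the paragraph above.

```python
class Trie:
    """Minimal character trie: does any stored word start with a given prefix?"""

    def __init__(self):
        self.root = {}

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})

    def has_prefix(self, prefix):
        node = self.root
        for ch in prefix:
            if ch not in node:
                return False
            node = node[ch]
        return True


def find_missing(wanted, file_names):
    """Return the entries of `wanted` that are not a prefix of any file name."""
    trie = Trie()
    for name in file_names:
        trie.insert(name)
    return [name for name in wanted if not trie.has_prefix(name)]


files = ['FOO_BAR_524B_023D9B01_2021-157T05-34-31__00001_2021-08-30T124702.130.tgz', 'abcde']
wanted = ['FOO_BAR_524B_023D9B01_2021-157T05-34-31__00001', 'bcd', 'dfiknvbdjfhnv']
print(find_missing(wanted, files))  # ['bcd', 'dfiknvbdjfhnv']
```

Each lookup costs time proportional to the length of the name being checked, regardless of how many files are in the folder. For a few hundred files the all(...) check is likely simpler and fast enough.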

Related

Renaming files that contain a pattern, looping through subfolders

I have a main folder, that contains multiple subfolders, that contain multiple files. I am trying to loop through subfolders and rename files that match a certain pattern. Here is what I have:
import os
from fnmatch import fnmatch
pattern = "*z_2*"
pattern2 ='b_2.txt'
path = r'C:\Users\Desktop\123'
list1= []
for (dirpath, dirnames, filenames) in os.walk(path):
    list1+= [os.path.join(dirpath, file) for file in filenames]
for i in list1:
    if fnmatch(i,pattern):
        a=os.path.join(path,i)
        b = os.path.dirname(i)
        os.rename(a, os.path.join(b,pattern2))
What I don't understand is why, when I use os.rename, it is instead creating a text file in the specified subfolder, resulting in:
FileExistsError: [WinError 183] Cannot create a file when that file already exists: 'C:\\Users\\Desktop\\ABC\\_100\\az_207.txt' -> 'C:\\Users\\Desktop\\ABC\\_100\\b_2.txt'
The problem is that when you rename a file, the destination filepath depends on b, which in turn depends only on the dirname part of i, not on i itself. So when your loop over list1 finds more than one file in the same directory, they all get the same value for os.path.join(b,pattern2). So your code is creating more than one file with the same name.
You probably want to reuse some part of a when building the destination filename, so as to ensure uniqueness.
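One way to do that, as a rough sketch: keep the stem of each original file name and append a suffix, so two files in the same folder cannot collide. The rename_matching function and the '_b_2' suffix are hypothetical choices for illustration, and the pattern is matched against the file name only rather than the full path.

```python
import os
from fnmatch import fnmatch


def rename_matching(top, pattern, suffix):
    """Rename files under `top` whose name matches `pattern`, keeping each
    original stem so two files in one folder never get the same new name."""
    renamed = []
    for dirpath, _dirnames, filenames in os.walk(top):
        for name in filenames:
            if not fnmatch(name, pattern):
                continue
            stem, ext = os.path.splitext(name)
            src = os.path.join(dirpath, name)
            dst = os.path.join(dirpath, stem + suffix + ext)
            if os.path.exists(dst):
                # Skip instead of overwriting an existing file.
                continue
            os.rename(src, dst)
            renamed.append((src, dst))
    return renamed
```

With this scheme, az_207.txt would become az_207_b_2.txt. Calling rename_matching(path, "*z_2*", "_b_2") returns the list of (old, new) pairs, which is handy for checking the result before trusting it on real data.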

How to do computations through directory and subfolders

I have one main directory which has 9 subfolders. Inside each of them, there are 1000 files. I need a for loop that reads the main directory and its subfolders, but the subfolder names are not similar and are not numbered, so I got stuck. I have seen "Iterate through folders, then subfolders and print filenames with path to text file", but I could not work out how to get started.
My effort is below:
import os
for root, dirs, files in os.walk(r'\Desktop\output\new our scenario\test'):
    for file in files:
        with open(os.path.join(root, file), "r") as auto:
            ##Doing Whatever I want
But it's not correct and does not work.
Do you know glob? That might be a solution to your problem.
You can get a list of all files in subdirectories by using wildcard path names. Here is an example that loops through .txt files, but you do not have to restrict it to one file type. Note that if you do not put *.* at the end of the pattern (for example, if you use known_dir/*/*), the result will also list directories.
import glob
file_list = glob.glob('known_dir/*/*.txt')
for file in file_list:
    with open(file, "r") as auto:
        # Do whatever you want with the open file here
        pass
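If the files have no common extension, one option is to glob for everything one level down and drop the directories afterwards. This is only a sketch: files_one_level_down is a hypothetical helper name, and it assumes the files sit exactly one subfolder below the main directory.

```python
import glob
import os


def files_one_level_down(top):
    """Return regular files located exactly one subfolder below `top`."""
    pattern = os.path.join(top, '*', '*')
    # glob returns files and directories alike, so keep only regular files.
    return sorted(p for p in glob.glob(pattern) if os.path.isfile(p))
```

Keep in mind that glob skips names starting with a dot by default, so hidden files will not show up in the result.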

RegEx to find specific file path

I am trying to find the existence of a file testing.txt
The first file exists in: sub/hbc_cube/college/
The second file exists in: sub/hbc/college
However, when searching for where the file exists, I CANNOT assume the string 'hbc' because the name may be different depending on the user. So I am trying to find a way to
PASS if the path is
sub/_cube/college/
FAIL if the path is
sub/*/college
But I cannot use a glob character (*) because the (*) will count *_cube as failing. I am trying to figure out a regular expression that will only detect a string and not a string with an underscore (hbc_cube for example).
I have tried using the python regex dictionary but I have not been able to figure out the correct regex to use
file_list = lookupfiles(['testing.txt'], dirlist = ['sub/'])
for file in file_list:
    if str(file).find('_cube/college/'): #hbc_cube/college
        print("pass")
    if str(file).find('*/college/'): #hbc/college
        print("fail")
If the file exists in both locations I want only "fail" to print. The problem is the * character is counting hbc_cube.
The glob module is your friend. You don't even need to match against multiple directories, glob will do it for you:
from glob import glob
# The question places testing.txt under sub/<name>/college/
testfiles = glob("sub/*/college/testing.txt")
if len(testfiles) > 0 and all("_cube/" in path for path in testfiles):
    print("Pass")
else:
    print("Fail")
In case it is not obvious, the test all("_cube/" in path for path in testfiles) will take care of this requirement:
If the file exists in both locations I want only "fail" to print. The problem is the * character is counting hbc_cube.
If some of the paths that matched do not contain _cube, the test fails. Since you want to know about files that cause the test to fail, you cannot search solely for files in a path containing *_cube -- you must retrieve both good and bad paths, and inspect them as shown.
Of course you can shorten the above code, or generalize it to construct the globbed path by combining options from a list of folders and a list of files, etc., depending on the particulars of your case.
Note that there are "full regular expressions", provided by the re module, and the simpler "globs" used by the glob module. If you go check the documentation, don't confuse them.
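To make the difference concrete, here is a small sketch that applies a glob-style pattern and a regular expression to the two example paths from the question. It assumes that "a folder name without an underscore" is the right way to describe the failing case, which is broader than the question's _cube suffix; the variable names are made up for the example.

```python
import re
from fnmatch import fnmatch

paths = ['sub/hbc_cube/college/testing.txt', 'sub/hbc/college/testing.txt']

# Glob-style: '*' matches any run of characters, including underscores.
glob_hits = [p for p in paths if fnmatch(p, 'sub/*/college/testing.txt')]

# Regex: [^/_]+ matches a folder name that contains no underscore.
no_underscore = re.compile(r'sub/[^/_]+/college/testing\.txt')
regex_hits = [p for p in paths if no_underscore.fullmatch(p)]

print(glob_hits)   # both paths
print(regex_hits)  # ['sub/hbc/college/testing.txt']
```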
Use pathlib to parse your path. From the path object, get the parent (this discards the /college part), then check whether the resulting path string ends with _cube:
from pathlib import Path
file_list = lookupfiles(['testing.txt'], dirlist = ['sub/'])
for file in file_list:
    path = Path(file)
    if str(path.parent).endswith('_cube'):
        print('pass')
    else:
        print('Fail')
Edit:
If the file variable in the for loop contains the file name (sub/_cube/college/testing.txt), just call parent twice on the path: path.parent.parent
Another approach would be to filter the files inside lookupfiles(), if you have access to that function and can edit it.
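A small sketch of the path.parent.parent variant, assuming each entry is a full path to testing.txt written with forward slashes. The in_cube_folder helper is a hypothetical name, and PurePosixPath is used so the example needs no real files on disk.

```python
from pathlib import PurePosixPath


def in_cube_folder(file_path):
    """True if the folder two levels above the file name ends with '_cube'."""
    path = PurePosixPath(file_path)
    # path.parent is .../college, path.parent.parent is .../hbc_cube
    return path.parent.parent.name.endswith('_cube')


print(in_cube_folder('sub/hbc_cube/college/testing.txt'))  # True
print(in_cube_folder('sub/hbc/college/testing.txt'))       # False
```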
The os module is well suited for this:
import os
# This assumes your current working directory has sub in it
for root, dirs, files in os.walk('sub'):
    for file in files:
        if file=='testing.txt':
            # print the file and the directory it's in
            print(os.path.join(root, file))
os.walk will return a three-element tuple as it iterates: a root dir, directories in that current folder, and files in that current folder. To print the directory, you combine the root (cwd) and the file name.
For example, on my machine:
for root, dirs, files in os.walk(os.getcwd()):
    for file in files:
        if file.endswith('ipynb'):
            print(os.path.join(root, file))
# prints
/Users/mm92400/Salesforce_Repos/DataExplorationClustersAndTime.ipynb
/Users/mm92400/Salesforce_Repos/DataExplorationUntitled1.ipynb
/Users/mm92400/Salesforce_Repos/DataExplorationExploratory.ipynb
/Users/mm92400/Salesforce_Repos/DataExplorationUntitled3.ipynb
/Users/mm92400/Salesforce_Repos/DataExplorationUntitled.ipynb
/Users/mm92400/Salesforce_Repos/DataExplorationUntitled4.ipynb
/Users/mm92400/Salesforce_Repos/DataExplorationUntitled2.ipynb
/Users/mm92400/Salesforce_Repos/DataExplorationClusterAnalysis.ipynb

Search for multiple files by name and copy to a new folder

I have been trying to write some python code in order to get each line from a .txt file and search for a file with that name in a folder and its subfolders. After this I want to copy that file in a preset destination folder.
The thing is, when I test this code I can read all the files in the .txt and I can display all files in a directory and its subdirectories. The problem arises when I have to compare the filename I read from the .txt (line by line, as I said) with all the filenames within the directory folder and then copy the file there.
Any ideas what am I doing wrong?
import os, shutil
def main():
    dst = '/Users/jorjis/Desktop/new'
    f = open('/Users/jorjis/Desktop/articles.txt', 'rb')
    lines = [line[:-1] for line in f]
    for files in os.walk("/Users/jorjis/Desktop/folder/"):
        for line in lines:
            if line == files:
                shutil.copy('/dir/file.ext', '/new/dir')
You are comparing the file names from the text file with a tuple with three elements: the root path of the currently visited folder, a list of all subdirectory names in that path, and a list of all file names in that path. Comparing a string with a tuple will never be true. You have to compare each file name with the set of file names to copy. The data type set comes in handy here.
Opening a file together with the with statement ensures that it is closed when the control flow leaves the with block.
The code might look like this:
import os
import shutil
def main():
    destination = '/Users/jorjis/Desktop/new'
    with open('/Users/jorjis/Desktop/articles.txt', 'r') as lines:
        filenames_to_copy = set(line.rstrip() for line in lines)
    for root, _, filenames in os.walk('/Users/jorjis/Desktop/folder/'):
        for filename in filenames:
            if filename in filenames_to_copy:
                shutil.copy(os.path.join(root, filename), destination)
If I had to guess, I would say that the files in the .txt contain the entire path. You'd need to add a little more to os.walk to match up completely.
for root, _, files in os.walk("/Users/jorjis/Desktop/folder/"):
    for f in files:
        # Build the full path from the folder and the file name
        new_path = os.path.join(root, f)
        if new_path in lines:
            shutil.copy(new_path, '/some_new_dir')
Then again, I'm not sure what the .txt file looks like so it might be that your original way works. If that's the case, take a closer look at the lines = ... line.

Concatenating fasta files from different folders

I have a large number of fasta files (these are just text files) in different subfolders. What I need is a way to search through the directories for files that have the same name and concatenate these into a file with the name of the input files. I can't do this manually as I have 10000+ genes that I need to do this for.
So far I have the following Python code that looks through one of the directories and then uses those file names to search through the other directories. This returns a list that has the full path for each file.
import os
from os.path import join, abspath
path = '/directoryforfilelist/' #Directory for source list
listing = os.listdir(path)
for x in listing:
    for root, dirs, files in os.walk('/rootdirectorytosearch/'):
        if x in files:
            pathlist = abspath(join(root,x))
Where I am stuck is how to concatenate the files it returns that have the same name. The results from this script look like this.
/directory1/file1.fasta
/directory2/file1.fasta
/directory3/file1.fasta
/directory1/file2.fasta
/directory2/file2.fasta
/directory3/file2.fasta
In this case I would need the end result to be two files named file1.fasta and file2.fasta that contain the text from each of the same named files.
Any leads on where to go from here would be appreciated. While I did this part in Python, any way that gets the job done is fine with me. This is being run on a Mac, if that matters.
Not tested, but here's roughly what I'd do:
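from itertools import groupby
import os
def conc_by_name(names):
    # groupby only merges adjacent items, so sort by base name first
    names = sorted(names, key=os.path.basename)
    # os.path.basename gives the file name alone (os.path.split returns a tuple)
    for tail, group in groupby(names, key=os.path.basename):
        with open(tail, 'w') as out:
            for name in group:
                with open(name) as f:
                    out.writelines(f)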
This will create the files (file1.fasta and file2.fasta in your example) in the current folder.
For each file in your list, open the target file in append mode, read each line of the source file and write it to the target file.
This assumes that the target folder is empty to start with and is not inside /rootdirectorytosearch.
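A rough sketch of that append-mode approach, under the same assumptions (an initially empty target folder that lies outside the searched tree). The function name concatenate_by_name and its parameters are hypothetical; paths is expected to be the list of full paths produced by the search in the question.

```python
import os


def concatenate_by_name(paths, target_dir):
    """Append each source file to target_dir/<base name of the source>."""
    os.makedirs(target_dir, exist_ok=True)
    for src in paths:
        target = os.path.join(target_dir, os.path.basename(src))
        # Append mode creates the target on first use and extends it afterwards.
        with open(src) as infile, open(target, 'a') as outfile:
            for line in infile:
                outfile.write(line)
```

Because the target is opened in append mode, the input list does not need to be sorted or grouped first. Running the function twice on the same inputs would duplicate the content, which is why the target folder has to start out empty.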
