Python glob - differentiate between similar filenames in different directories

I'm using glob recursively to find all Excel files in a directory, including files located in sub directories. I then reduce each path to its stem and put only the Excel file name on a list for later use. The problem arises when two different Excel files in different sub directories have the same name.
I want to be able to differentiate between them in my stemmed list.
For example, this is what happens now with 3 files in the directory, 2 of which bear the same name:
mainDirectory/subDirectory1/file1.xlsx
mainDirectory/subDirectory2/file1.xlsx
mainDirectory/subDirectory2/file2.xlsx
The list would be:
file1
file2
file1
This causes a lot of problems for me later on, so I need a way to add something to each of the two duplicates to make them unique, and to remove it again later.
I'm thinking of prefixing the parent sub directory, something like this:
subDirectory1/file1
file2
subDirectory2/file1
Is there an option in glob to deal with this issue? Or anything else?
Here is my code:
import glob
import pathlib

excel_list = []
file_list = glob.glob(path + "/**" + "/*.xlsx", recursive=True)
for x in file_list:
    # keep only the file name without directory and extension
    x = pathlib.Path(x).stem
    excel_list.append(x)
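glob itself has no option for this, as far as I know, but the idea of prefixing the sub directory can be sketched with pathlib. The following is a minimal sketch, not a definitive implementation: the function name `unique_labels` and the choice to qualify only ambiguous stems with their path relative to the root are assumptions made for illustration.

```python
from collections import Counter
from pathlib import Path


def unique_labels(root):
    """Map a unique label to every .xlsx file found below root (hypothetical helper)."""
    root = Path(root)
    files = sorted(root.rglob("*.xlsx"))
    # Count how often each bare stem occurs across all sub directories.
    stem_counts = Counter(f.stem for f in files)
    labels = {}
    for f in files:
        if stem_counts[f.stem] > 1:
            # Ambiguous stem: qualify it with the path relative to root.
            label = f.relative_to(root).with_suffix("").as_posix()
        else:
            label = f.stem
        labels[label] = f
    return labels
```

Because the dictionary keeps the full path for every label, there is no need to strip the prefix again later: `labels["subDirectory1/file1"]` returns the original file path.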

Related

Python list certain files in different folders

I've got 2 folders, each with a different CSV file inside (both have the same format).
I've written some Python code to search within the "C:/Users/Documents" directory for CSV files which begin with the word "File":
import glob, os
inputfile = []
for root, dirs, files in os.walk("C:/Users/Documents/"):
    for datafile in files:
        if datafile.startswith("File") and datafile.endswith(".csv"):
            inputfile.append([os.path.join(root, datafile)])
print(inputfile)
That almost worked as it returns:
[['C:/Users/Documents/Test A\\File 1.csv'], ['C:/Users/Documents/Test B\\File 2.csv']]
Is there any way I can get it to return this instead (no sub list and shows / instead of \):
['C:/Users/Documents/Test A/File 1.csv', 'C:/Users/Documents/Test B/File 2.csv']
The idea is so I can then read both CSV files at once later, but I believe I need to get the list in the format above first.
Okay, here is one option.
I used os.path.abspath on the joined path and then replaced the platform separator with "/", because abspath on its own normalizes to backslashes on Windows.
Have a look and see if it works.
import os
filelist = []
for folder, subfolders, files in os.walk("C:/Users/Documents/"):
    for datafile in files:
        if datafile.startswith("File") and datafile.endswith(".csv"):
            filePath = os.path.abspath(os.path.join(folder, datafile))
            # abspath normalizes to the platform separator; force forward slashes
            filelist.append(filePath.replace(os.sep, "/"))
print(filelist)
Result:
['C:/Users/Documents/Test A/File 1.csv', 'C:/Users/Documents/Test B/File 2.csv']
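A pathlib-based variant of the same walk, as a minimal sketch: `Path.as_posix()` renders a path with forward slashes on every platform, which gives the flat list of strings asked for. The function name `find_csv_files` and its `prefix` parameter are assumptions for illustration.

```python
import os
from pathlib import Path


def find_csv_files(root, prefix="File"):
    """Return a flat list of matching CSV paths written with forward slashes."""
    matches = []
    for folder, _subfolders, files in os.walk(root):
        for name in files:
            if name.startswith(prefix) and name.endswith(".csv"):
                # as_posix() renders the path with "/" regardless of platform.
                matches.append(Path(folder, name).as_posix())
    return matches
```

The returned list can be passed straight to whatever reads the CSV files later, one path at a time.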

How to find and copy almost identical filenames from one folder to another using python?

I have a folder with a large number of files (mask_folder). The filenames in this folder are built as follows:
asdgaw-1454_mask.tif
lkafmns-8972_mask.tif
sdnfksdfk-1880_mask.tif
etc.
In another folder (test_folder), I have a smaller number of files with filenames written almost the same, but without the addition of _mask. Like:
asdgaw-1454.tif
lkafmns-8972.tif
etc.
What I need is a code to find the files in mask_folder that have an identical start of the filenames as compared to the files in test_folder and then these files should be copied from the mask_folder to the test_folder.
In that way the test_folder contains paired files as follows:
asdgaw-1454_mask.tif
asdgaw-1454.tif
lkafmns-8972_mask.tif
lkafmns-8972.tif
etc.
This is what I tried. It runs without any errors, but nothing happens:
import shutil
import os
mask_folder = "//Mask/"
test_folder = "//Test/"
n = 8
list_of_files_mask = []
list_of_files_test = []
for file in os.listdir(mask_folder):
    if not file.startswith('.'):
        list_of_files_mask.append(file)
        start_mask = file[0:n]
        print(start_mask)
for file in os.listdir(test_folder):
    if not file.startswith('.'):
        list_of_files_test.append(file)
        start_test = file[0:n]
        print(start_test)
for file in start_test:
    if start_mask == start_test:
        shutil.copy2(file, test_folder)
I have searched for a while but have not found a solution to the problem described above, so any help is really appreciated.
First, you want to get only the files, not the folders as well, so you should probably use os.walk() instead of listdir() to make the solution more robust. Read more about it in this question.
Then, I suggest loading the filenames of the test folder into memory (since it is the smaller set) and not loading the mask filenames into a list as well, but instead copying each matching mask file right away.
import os
import shutil
test_dir_path = ''
mask_dir_path = ''
# load file names from test folder into a list
test_file_list = []
for _, _, file_names in os.walk(test_dir_path):
    # 'file_names' is a list of strings
    test_file_list.extend(file_names)
    # exit after this directory, do not check child directories
    break
# check mask folder for matches
for _, _, file_names in os.walk(mask_dir_path):
    for name_1 in file_names:
        # we just remove a part of the filename to get exact matches
        name_2 = name_1.replace('_mask', '')
        # we check if 'name_2' is in the file name list of the test folder
        if name_2 in test_file_list:
            print('we copy {} because {} was found'.format(name_1, name_2))
            shutil.copy2(
                os.path.join(mask_dir_path, name_1),
                test_dir_path)
    # exit after this directory, do not check child directories
    break
Does this solve your problem?
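The same matching can also be sketched with pathlib, using a set for the test file names so each lookup is cheap. This is only a sketch under assumptions: the function name `copy_matching_masks` and the `suffix` parameter are made up for illustration, and only the top level of both folders is inspected.

```python
import shutil
from pathlib import Path


def copy_matching_masks(mask_dir, test_dir, suffix="_mask"):
    """Copy every '<name>_mask.<ext>' whose partner '<name>.<ext>' exists in test_dir."""
    mask_dir, test_dir = Path(mask_dir), Path(test_dir)
    # Names of regular files in the test folder, for fast membership checks.
    test_names = {p.name for p in test_dir.iterdir() if p.is_file()}
    copied = []
    for mask in mask_dir.iterdir():
        if not mask.is_file() or not mask.stem.endswith(suffix):
            continue
        # Strip the suffix from the stem to get the expected partner name.
        partner = mask.stem[: -len(suffix)] + mask.suffix
        if partner in test_names:
            shutil.copy2(mask, test_dir / mask.name)
            copied.append(mask.name)
    return sorted(copied)
```

The function returns the names it copied, which makes it easy to check that the expected pairs were found.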

Concatenating fasta files from different folders

I have a large number of fasta files (these are just text files) in different subfolders. What I need is a way to search through the directories for files that have the same name and concatenate these into a file with the name of the input files. I can't do this manually as I have 10000+ genes that I need to do this for.
So far I have the following Python code that looks through one of the directories and then uses those file names to search through the other directories. This returns a list that has the full path for each file.
import os
from os.path import join, abspath
path = '/directoryforfilelist/' #Directory for source list
listing = os.listdir(path)
for x in listing:
    for root, dirs, files in os.walk('/rootdirectorytosearch/'):
        if x in files:
            pathlist = abspath(join(root,x))
Where I am stuck is how to concatenate the files it returns that have the same name. The results from this script look like this.
/directory1/file1.fasta
/directory2/file1.fasta
/directory3/file1.fasta
/directory1/file2.fasta
/directory2/file2.fasta
/directory3/file2.fasta
In this case I would need the end result to be two files named file1.fasta and file2.fasta that contain the text from each of the same named files.
Any leads on where to go from here would be appreciated. While I did this part in Python, any way that gets the job done is fine with me. This is being run on a Mac, if that matters.
Not tested, but here's roughly what I'd do:
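from itertools import groupby
import os

def conc_by_name(names):
    # groupby only merges adjacent items, so sort by file name first
    names = sorted(names, key=os.path.basename)
    # os.path.basename gives the bare file name used as the output name
    for tail, group in groupby(names, key=os.path.basename):
        with open(tail, 'w') as out:
            for name in group:
                with open(name) as f:
                    out.writelines(f)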
This will create the files (file1.fasta and file2.fasta in your example) in the current folder.
For each file in your list, open the target file in append mode, read each line of the source file and write it to the target file.
This assumes that the target folder is empty to start with and is not inside /rootdirectorytosearch.
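The append-mode approach described above can be sketched as follows. It is a minimal sketch, assuming the output folder starts empty and lies outside the searched tree; the function name `concatenate_same_named` and the `ext` parameter are hypothetical.

```python
import os


def concatenate_same_named(root, out_dir, ext=".fasta"):
    """Append every file below root to a file of the same name in out_dir."""
    os.makedirs(out_dir, exist_ok=True)
    for folder, dirs, files in os.walk(root):
        dirs.sort()  # visit sub directories in a stable order
        for name in sorted(files):
            if not name.endswith(ext):
                continue
            source = os.path.join(folder, name)
            target = os.path.join(out_dir, name)
            # Append mode: the first occurrence creates the file, later ones extend it.
            with open(source) as src, open(target, "a") as dst:
                dst.writelines(src)
```

A single walk over the tree is enough here, so there is no need to search the whole tree once per file name. If a source file does not end with a newline, its last line will run into the first line of the next file, so that may need handling for real data.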

How can I take a list of file names in python and assign each file name as a number for later use?

Here is what I have so far in Windows:
import os
os.chdir("C:/Users/any/Desktop/test")
for files in os.listdir("."):
    print(files)
Now it prints this:
test picture.jpg
test script.bat
test text.txt
But where I am stuck is that the output will be different for each folder, so my idea of a solution is to take the list and label each entry individually as
filename1
filename2
filename3
So now filename1 = test picture.jpg
Edit
Well, what I am trying to do is use each filename later in my code. For example, say I was trying to rename my files so that in any file name containing the letter 'e', it would be changed to an 'a' character:
import os
os.chdir("C:/Users/any/Desktop/test")
for files in os.listdir("."):
    print(files)
    files = files.replace('e', 'a')
    print(files)
But I need to be able to have it do each filename individually so the code could look something like this:
import os
os.chdir("C:/Users/any/Desktop/test")
for files in os.listdir("."):
    print(filename1)
    filename1 = filename1.replace('e', 'a')
    print(filename1)
While I am not 100% sure I understand what you are trying to do, you could just try this:
dirlist = os.listdir('.')
then each index, starting with zero, of dirlist would yield an entry from your directory. The index values would in effect take the place of the numbers for your filenames.
So rather than having
filename1
filename2
filename3
you'd have
dirlist[0]
dirlist[1]
dirlist[2]
with minimal effort, and could easily refer to individual entries in any order with the index.
And as an added bonus, you could easily iterate through this list of names with a for-loop if need be:
for names in dirlist:
    ...
which would be a bit more tricky with the individual filenames you mentioned in your original post.
Update:
Given your edit to your post, you would have been able to achieve your goal with pretty much your original code:
import os
os.chdir("C:/Users/any/Desktop/test")
for fname in os.listdir("."):
    fname = fname.replace('e', 'a')
    print(fname)
If you add all those names into a list, you'll have a pre-built index and numbering system:
>>> lst = ['a','b','c']
>>> lst.index('b')
1
>>> lst[1]
'b'
In your case:
files = os.listdir('.')
files[0] # first file
files[3] # fourth file (first one is 0)
If you're just going to name them incrementally like that why not just store them in a list?
filenames = []
filenames.extend(os.listdir("."))
filenames[0] # filename1 equivalent
filenames[1] # filename2 equivalent
The reason I've used extend here is it sounds like you want to construct a list of file names from various folders.
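If the numbering should start at 1 and stay the same from run to run, the list idea from the answers above can be combined with `sorted` and `enumerate`. This is only a sketch: the helper name `number_names` is an assumption, and sorting is added because `os.listdir` returns entries in arbitrary order.

```python
def number_names(names, start=1):
    """Return a dict mapping 1, 2, 3, ... to the sorted file names."""
    return {index: name for index, name in enumerate(sorted(names), start=start)}


# Example with the three names from the question.
numbered = number_names(["test text.txt", "test picture.jpg", "test script.bat"])
print(numbered[1])  # the 'filename1' equivalent
```

In real use the argument would be `os.listdir(".")`, and `numbered[1]` would play the role of `filename1`.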

WxPython - building a directory tree based on file availability

I do atomistic modelling, and use Python to analyze simulation results. To simplify work with a whole bunch of Python scripts used for different tasks, I decided to write a simple GUI to run the scripts from.
I have a (rather complex) directory structure beginning from some root (say ~/calc), and I want to populate a wx.TreeCtrl control with the directories containing calculation results, preserving their structure. A folder contains results if it contains a file with the .EXT extension. What I try to do is walk through the dirs from the root and check in each dir whether it contains an .EXT file. When such a dir is reached, add it and its ancestors to the tree:
def buildTree(self, rootdir):
    root = rootdir
    r = len(rootdir.split('/'))
    ids = {root : self.CalcTree.AddRoot(root)}
    for (dirpath, dirnames, filenames) in os.walk(root):
        for dirname in dirnames:
            fullpath = os.path.join(dirpath, dirname)
            if sum([s.find('.EXT') for s in filenames]) > -1 * len(filenames):
                ancdirs = fullpath.split('/')[r:]
                ad = rootdir
                for ancdir in ancdirs:
                    d = os.path.join(ad, ancdir)
                    ids[d] = self.CalcTree.AppendItem(ids[ad], ancdir)
                    ad = d
But this code ends up with many second-level nodes with the same name, and that's definitely not what I want. So I somehow need to check whether a node has already been added to the tree and, if so, add the new node to the existing one, but I do not understand how this could be done. Could you please give me a hint?
Besides, the code contains 2 dirty hacks I'd like to get rid of:
I get the list of ancestor dirs by splitting the full path at '/' positions, which is Linux-specific;
I check whether an .EXT file is in the directory by searching for the extension in the strings from the filenames list, taking into account that s.find returns -1 if the substring is not found.
Is there a way to make these chunks of code more readable?
First of all the hacks:
To get the path separator for whatever OS you're using, you can use os.sep.
Use str.endswith() and the fact that in Python the empty list [] evaluates to False:
if [ file for file in filenames if file.endswith('.EXT') ]:
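Both hacks can be replaced by small helpers, sketched below. The names `has_result_file` and `ancestor_names` are hypothetical; the sketch relies only on `any`, `str.endswith`, `os.path.relpath` and `os.sep`.

```python
import os


def has_result_file(filenames, ext=".EXT"):
    """True if at least one file name ends with the given extension."""
    return any(name.endswith(ext) for name in filenames)


def ancestor_names(fullpath, rootdir):
    """Directory names leading from rootdir down to fullpath, split portably."""
    rel = os.path.relpath(fullpath, rootdir)
    # relpath returns os.curdir when both paths are the same directory.
    return [] if rel == os.curdir else rel.split(os.sep)
```

`any` stops at the first match, so it avoids building a throwaway list, and `os.path.relpath` removes the need to count the components of the root path by hand.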
To get them all nicely nested, you're best off doing it recursively. A sketch would look something like the following. Please note that it is untested and only meant to give you an idea of how to do it: it assumes os is imported and it adds every entry, so filter for the directories you want as needed.
def buildTree(self, rootdir):
    rootId = self.CalcTree.AddRoot(rootdir)
    self.buildTreeRecursion(rootdir, rootId)

def buildTreeRecursion(self, dir, parentId):
    # Iterate over the entries in dir
    for name in sorted(os.listdir(dir)):
        path = os.path.join(dir, name)
        id = self.CalcTree.AppendItem(parentId, name)
        # Descend into sub directories, attaching their entries below this node
        if os.path.isdir(path):
            self.buildTreeRecursion(path, id)
Hope this helps!
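To avoid duplicate nodes, it may help to first compute which directories belong in the tree, independently of the GUI. The sketch below is an assumption-laden illustration rather than a tested wxPython solution: `result_dirs` is a hypothetical helper that returns every directory containing an .EXT file together with all of its ancestors up to the root, each exactly once.

```python
import os


def result_dirs(rootdir, ext=".EXT"):
    """Directories holding an ext file, plus their ancestors up to rootdir."""
    rootdir = os.path.abspath(rootdir)
    keep = set()
    for dirpath, _dirnames, filenames in os.walk(rootdir):
        if any(name.endswith(ext) for name in filenames):
            current = dirpath
            # Walk upwards until an already recorded directory or rootdir is reached.
            while current not in keep:
                keep.add(current)
                if current == rootdir:
                    break
                current = os.path.dirname(current)
    return keep
```

Iterating over `sorted(result_dirs(root))` visits every parent before its children, so each directory could be appended once under the tree item stored for `os.path.dirname(d)`, for example in a dictionary of ids like the one in the question.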
