Python: Search for Substring Inside List Using Unique Values

I have two lists, both containing file paths to PDFs. The first list contains PDFs with unique file names. The second list contains PDFs whose names begin with one of those unique names, and several PDFs in the second list may match a single PDF in the first, so the relationship from ListA to ListB is one-to-many. Below is an example.
List A: C:\FolderA\A.pdf, C:\FolderA\B.pdf, C:\FolderA\C.pdf
List B: C:\FolderB\A_1.pdf, C:\FolderB\B_1.pdf, C:\FolderB\C_1.pdf, C:\FolderB\C_2.pdf
I need to find a way to iterate through both lists and combine the PDFs by matching the unique filename. If I can find a way to iterate and match the files, then I think I can combine the PDFs on my own. Below is the code I have so far.
import os

folderA = r'C:\FolderA'
ListA = []
for root, dirs, filenames in os.walk(folderA):
    for filename in filenames:
        ListA.append(str(filename))
        filepath = os.path.join(root, filename)
        ListA.append(str(filepath))
folderB = r'C:\FolderB'
ListB = []
for root, dirs, filenames in os.walk(folderB):
    for filename in filenames:
        filepath = os.path.join(root, filename)
        ListB.append(str(filepath))
# Split ListB to file name only without the "_#" so it can be matched to the PDFs in ListA.
for pdfValue in ListB:
    pdfsplit = pdfValue.split(".")[0]
    pdfsplit1 = pdfsplit.split("\\")[-1]
    pdfsplit2 = pdfsplit1.rsplit("_", 1)[0]
    for pdfValue2 in ListA:
        if pdfsplit2 in ListA:
            #combine PDF code
I have verified that everything works up to the last if statement. From there on I am not sure how to proceed. I know how to search for a substring within a string, but I cannot get it to work correctly with a list. No matter how I code it, I either end up in an endless loop or it does not successfully match.
Any ideas on how to make this work, if it is possible?

It would be better to gather all the information together in one data structure, rather than in separate lists. That should allow you to reduce your code to a single function.
Completely untested, but something like this should work.
import os
from collections import defaultdict
pdfs = defaultdict(lambda: defaultdict(list))
def find_pdfs(pdfs, folder, split=False):
    for root, dirs, filenames in os.walk(folder):
        for filename in filenames:
            basename, ext = os.path.splitext(filename)
            if ext == '.pdf':
                if split:
                    basename = basename.partition('_')[0]
                pdfs[basename][root].append(filename)
find_pdfs(pdfs, folderA)
find_pdfs(pdfs, folderB, True)
This should produce a data structure like this:
pdfs = {
    'A': {'C:\\FolderA': ['A.pdf'],
          'C:\\FolderB': ['A_1.pdf']},
    'B': {'C:\\FolderA': ['B.pdf'],
          'C:\\FolderB': ['B_1.pdf']},
    'C': {'C:\\FolderA': ['C.pdf'],
          'C:\\FolderB': ['C_1.pdf', 'C_2.pdf']},
}
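One way to consume that nested structure, as a rough sketch: walk each base name and attach every FolderB file to the FolderA file that shares it. The helper name pair_up and the hard-coded sample dictionary are assumptions made for illustration, and ntpath is used only so that the joined paths look the same on any platform.

```python
import ntpath  # Windows-style path joins, regardless of the host OS


def pair_up(pdfs, folder_a, folder_b):
    """Map each full path under folder_a to the matching paths under folder_b."""
    pairs = {}
    for basename, by_folder in pdfs.items():
        # .get() avoids creating empty entries when pdfs is a defaultdict
        targets = [ntpath.join(folder_b, name)
                   for name in by_folder.get(folder_b, [])]
        for name in by_folder.get(folder_a, []):
            pairs[ntpath.join(folder_a, name)] = targets
    return pairs


# Hypothetical sample data in the shape shown above.
sample = {
    'A': {r'C:\FolderA': ['A.pdf'], r'C:\FolderB': ['A_1.pdf']},
    'C': {r'C:\FolderA': ['C.pdf'], r'C:\FolderB': ['C_1.pdf', 'C_2.pdf']},
}
pairs = pair_up(sample, r'C:\FolderA', r'C:\FolderB')
for source, matches in pairs.items():
    print(source, '->', matches)
```

Each key of pairs is then a FolderA PDF and each value is the list of FolderB PDFs to combine with it.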

I think what you want to do is create a collections.defaultdict and set it up to hold lists of matching names.
import collections
matching_files = collections.defaultdict(list)
You can then strip the filenames in folder B down to base names, and put the paths into the dict:
matching_files[pdfsplit2].append(pdfValue)
Now you have a list of pdf files from folder B, grouped by base name. Go back to folder A and do the same thing (split off the path and extension, use that for the key, add the full path to the list). You'll have lists, which have files sharing a common base name.
for key,file_list in matching_files.items(): #use .iteritems() for py-2.x
    print("Files with base name '%s':"%key)
    print(' ', '\n '.join(file_list))
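Put together, the steps described in this answer might look like the sketch below. It is not the answerer's exact code: the helper base_name and the in-memory lists are assumptions for illustration (the lists simply repeat the example paths from the question), and ntpath is used so the Windows-style paths split the same way on any platform.

```python
import ntpath
from collections import defaultdict


def base_name(path, strip_suffix=False):
    """Return the file name without folder or extension.

    With strip_suffix=True, a trailing "_<something>" is removed as well,
    so "C_2.pdf" becomes "C".
    """
    stem = ntpath.splitext(ntpath.basename(path))[0]
    if strip_suffix:
        stem = stem.rsplit('_', 1)[0]
    return stem


list_a = [r'C:\FolderA\A.pdf', r'C:\FolderA\B.pdf', r'C:\FolderA\C.pdf']
list_b = [r'C:\FolderB\A_1.pdf', r'C:\FolderB\B_1.pdf',
          r'C:\FolderB\C_1.pdf', r'C:\FolderB\C_2.pdf']

matching_files = defaultdict(list)
for path in list_a:
    matching_files[base_name(path)].append(path)
for path in list_b:
    matching_files[base_name(path, strip_suffix=True)].append(path)

for key, file_list in matching_files.items():
    print(key, file_list)
```

Because the lookup is a dictionary key rather than a substring test, a file named AB_1.pdf would not be grouped with A.pdf by accident.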

To compare the two file names, rather than splitting on the '_', you could try the str.startswith() method:
A.startswith(B) returns True if string A begins with string B.
In your case, your code would be:
match = {}  # the dictionary where you will store the matching names
for pdfValue in ListA:
    match[pdfValue] = []  # create an entry in the dictionary for this file
    # keep just the file name, without the folder or the ".pdf" extension
    A = pdfValue.split("\\")[-1].rsplit(".", 1)[0]
    for pdfValue2 in ListB:
        B = pdfValue2.split("\\")[-1]
        if B.startswith(A):  # then B begins with the unique file name of A
            match[pdfValue].append(pdfValue2)  # so associate it with A in the dictionary
I hope it works for you.

One more solution
lista = [r'C:\FolderA\A.pdf', r'C:\FolderA\B.pdf', r'C:\FolderA\C.pdf']
listb = [r'C:\FolderB\A_1.pdf', r'C:\FolderB\B_1.pdf', r'C:\FolderB\C_1.pdf', r'C:\FolderB\C_2.pdf']
# get the filenames for folder a and folder b
lista_filenames = [l.split('\\')[-1].split('.')[0] for l in lista]
listb_filenames = [l.split('\\')[-1].split('.')[0] for l in listb]
# create a dictionary to store lists of mappings
from collections import defaultdict
data_structure = defaultdict(list)
for i in lista_filenames:
    for j in listb_filenames:
        if i in j:
            data_structure['C:\\FolderA\\' + i +'.pdf'].append('C:\\FolderB\\' + j +'.pdf')
# this is what the mapping dictionary looks like
print(data_structure)
results in (Python 3):
defaultdict(<class 'list'>, {'C:\\FolderA\\A.pdf': ['C:\\FolderB\\A_1.pdf'], 'C:\\FolderA\\B.pdf': ['C:\\FolderB\\B_1.pdf'], 'C:\\FolderA\\C.pdf': ['C:\\FolderB\\C_1.pdf', 'C:\\FolderB\\C_2.pdf']})

Related

How do I use a list or set as keys in file renaming

Is something like this possible? I'd like to use a list or set of keywords as the keys for my file renamer. I have a lot of keywords that I'd like to filter out of the file names, but the only way I've found so far is to search for one string at a time, such as key720 = "720". This works correctly but creates bloat: I need a copy of the code at the bottom for each keyword I want to remove.
How do I get the list to work as keys in the search?
I tried to take the list and make it a string with:
str1 = ""
keyres = (str1.join(keys))
This was closer, but I think it makes one string out of all the entries, and it didn't pick up any keywords.
So this is what I have at the moment.
keys = ["720p", "720", "1080p", "1080"]
for filename in os.listdir(dirName):
    if keys in filename:
        filepath = os.path.join(dirName, filename)
        newfilepath = os.path.join(dirName, filename.replace(keys,""))
        os.rename(filepath, newfilepath)
Is there a way to go through the list by index, one entry at a time? Would that allow the strings in the list to be used as strings?
What I'm trying to do is take a file name and rename it by removing all occurrences of the keywords.

How about using Regular Expressions, specifically the sub function?
from re import sub
KEYS = ["720p", "720", "1080p", "1080"]
old_filename = "filename1080p.jpg"
new_filename = sub('|'.join(KEYS),'',old_filename)
print(new_filename)
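The alternation above works because "720p" is listed before "720"; with the order reversed, "720" would match first and leave a stray "p" behind. A slightly more defensive sketch sorts the keywords longest-first and escapes them, so that characters such as "." are treated literally. The function name strip_keywords is an assumption for illustration.

```python
import re


def strip_keywords(filename, keys):
    """Remove every occurrence of each keyword from filename."""
    # Longest keywords first, so "1080p" wins over its prefix "1080";
    # re.escape keeps characters such as "." or "+" from acting as regex syntax.
    ordered = sorted(keys, key=len, reverse=True)
    pattern = re.compile('|'.join(re.escape(k) for k in ordered))
    return pattern.sub('', filename)


KEYS = ["720", "720p", "1080", "1080p"]  # deliberately in the "wrong" order
print(strip_keywords("filename1080p.jpg", KEYS))  # filename.jpg
```

Inside the os.listdir loop from the question, the returned name could be passed to os.rename in place of the filename.replace call.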

building a dictionary of my directories and file paths to select all files whose name contains a specific string

I have a root directory called 'IC'. 'IC' contains a bunch of subdirectories, which contain sub-subdirectories, which contain sub-sub-subdirectories, and so on. Is there an easy way to move all the files from the nested sub...directories into their parent subdirectory and then delete the empty sub...directories?
So far I've made this monstrosity of nested loops to build a dictionary of file paths, with subdirectories stored as nested dictionaries containing file paths, and so on. I was then going to write something that goes through the dictionary and picks out all files containing 'IC' along with the subdirectory they are in. I need to know which directories contain an 'IC' file and which do not. I also need to move all the files containing 'IC' to the top-level subdirectories (see the comment in the code).
import os, shutil
rootdir = 'data/ICs'
def dir_tree(rootdir):
    IC_details = {}
    # This first loop is what I'm calling the top level subdirectories. They are the three
    # subdirectories inside the directory 'data/ICs'
    for i in os.scandir(rootdir):
        if os.path.isdir(i):
            IC_details[i.path] = {}
    for i in IC_details:
        for j in os.scandir(i):
            if os.path.isdir(j.path):
                IC_details[i][j.name] = {}
            elif os.path.isfile(j.path):
                IC_details[i][j.name] = [j.path]
        for j in IC_details[i]:
            if os.path.isdir(os.path.join(i,j)):
                for k in os.scandir(os.path.join(i,j)):
                    if os.path.isdir(k.path):
                        IC_details[i][j][k.name] = {}
                    elif os.path.isfile(k.path):
                        IC_details[i][j][k.name] = [k.path]
                for k in IC_details[i][j]:
                    if os.path.isdir(os.path.join(i,j,k)):
                        for l in os.scandir(os.path.join(i,j,k)):
                            if os.path.isdir(l.path):
                                IC_details[i][j][k][l.name] = {}
                            elif os.path.isfile(l.path):
                                IC_details[i][j][k][l.name] = [l.path]
                        for l in IC_details[i][j][k]:
                            if os.path.isdir(os.path.join(i,j,k,l)):
                                for m in os.scandir(os.path.join(i,j,k,l)):
                                    if os.path.isfile(m.path):
                                        IC_details[i][j][k][l][m.name] = [m.path]
    return IC_details
IC_tree = dir_tree(rootdir)

You should have a look at the glob module (documented as "glob: Unix style pathname pattern expansion").
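As a rough illustration of that suggestion, the sketch below uses pathlib's recursive globbing (Path.rglob, the object-oriented counterpart of glob) to replace the nested loops. The function name find_matching_files and the choice to group results by top-level subdirectory are assumptions based on the question; the sketch only locates files and does not move or delete anything.

```python
from collections import defaultdict
from pathlib import Path


def find_matching_files(rootdir, needle='IC'):
    """Group files whose name contains needle by top-level subdirectory."""
    root = Path(rootdir)
    found = defaultdict(list)
    # rglob descends to any depth, so no per-level loop is needed
    for path in root.rglob('*' + needle + '*'):
        relative = path.relative_to(root)
        # Skip directories and files that sit directly in rootdir.
        if path.is_file() and len(relative.parts) > 1:
            found[relative.parts[0]].append(path)
    return dict(found)


# Example call, assuming the directory from the question exists:
# ic_files = find_matching_files('data/ICs')
```

Top-level subdirectories that do not appear as keys in the result contain no matching file. Note that whether the pattern is case-sensitive depends on the platform.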

Python regular expression for a string and match them into a dictionary

I have four files in a directory and I want to match them against a list of strings and collect them into a dictionary.
The files in the directory look like the following:
DB_ABC_2_T_bR_r1_v1_0_S1_R1_001_MM_1.faq.gz
DB_ABC_2_T_bR_r1_v1_0_S1_R2_001_MM_1.faq.gz
DB_DEF_S1_001_MM_R1.faq.gz
DB_DEF_S1_001_MM_R2.faq.gz
The list has part of the filename as,
ABC
DEF
So here is what I tried,
import os
import re
dir='/user/home/files'
list='/user/home/list'
samp1 = {}
samp2 = {}
FH_sample = open(list, 'r')
for line in FH_sample:
    samp1[line.strip().split('\n')[0]] =[]
    samp2[line.strip().split('\n')[0]] =[]
FH_sample.close()
for file in os.listdir(dir):
    m1 =re.search('(.*)_R1', file)
    m2 = re.search('(.*)_R2', file)
    if m1 and m1.group(1) in samp1:
        samp1[m1.group(1)].append(file)
    if m2 and m2.group(1) in samp2:
        samp2[m2.group(1)].append(file)
I wanted the above script to find the matches from m1 and m2 and collect them in the dictionaries samp1 and samp2. But the script is not finding any matches inside the if statements, so the lists in samp1 and samp2 stay empty.
This is what the output should look like for samp1 and samp2:
{'ABC': [DB_ABC_2_T_bR_r1_v1_0_S1_R1_001_MM_1.faq.gz, DB_ABC_2_T_bR_r1_v1_0_S1_R2_001_MM_1.faq.gz], 'DEF': [DB_DEF_S1_001_MM_R1.faq.gz, DB_DEF_S1_001_MM_R2.faq.gz]}
Any help would be greatly appreciated

A lot of this code you probably don't need. You could just see if the substring that you have from list is in dir.
The code below reads in the data as lists. You seem to have already done this, so it will simply be a matter of replacing files with the file names you read in from dir and replacing st with the substrings from list (which you shouldn't use as a variable name since it is actually used for something else in Python).
files = ["BSSE_QGF_1987_HJUS_1_MOHUA_2_T_bR_r1_v1_0_S1_R1_001_MM_1.faq.gz",
         "BSSE_QGF_1967_HJUS_1_MOHUA_2_T_bR_r1_v1_0_S1_R2_001_MM_1.faq.gz",
         "BSSE_QGF_18565_H33HLAFXY_1_MSJLF_T_bulk_RNA_S1_R1_001_MM_1.faq.gz",
         "BSSE_QGF_18565_H33HLAFXY_1_MSJLF_T_bulk_RNA_S1_R2_001_MM_1.faq.gz"]
my_strings = ["MOHUA", "MSJLF"]
res = {s: [] for s in my_strings}
for k in my_strings:
    for file in files:
        if k in file:
            res[k].append(file)
print(res)

You can pass the Python script a directory and provide id_list, then use the entries of id_list as dict keys and append each fastq file whose name contains the key:
import os
import sys
dir_path = sys.argv[1]
fastqs=[]
for x in os.listdir(dir_path):
    if x.endswith(".faq.gz"):
        fastqs.append(x)
id_list = ['MOHUA', 'MSJLF']
sample_dict = dict((sample,[]) for sample in id_list)
print(sample_dict)
for k in sample_dict:
    for z in fastqs:
        if k in z:
            sample_dict[k].append(z)
print(sample_dict)
to run:
python3.6 fq_finder.py /path/to/fastqs
output from above to show what is going on:
{'MOHUA': [], 'MSJLF': []} # first print creates dict with empty list as vals for keys
{'MOHUA': ['BSSE_QGF_1987_HJUS_1_MOHUA_2_T_bR_r1_v1_0_S1_R1_001_MM_1.faq.gz', 'BSSE_QGF_1967_HJUS_1_MOHUA_2_T_bR_r1_v1_0_S1_R2_001_MM_1.faq.gz'], 'MSJLF': ['BSSE_QGF_18565_H33HLAFXY_1_MSJLF_T_bulk_RNA_S1_R2_001_MM_1.faq.gz', 'BSSE_QGF_18565_H33HLAFXY_1_MSJLF_T_bulk_RNA_S1_R1_001_MM_1.faq.gz']}
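If the R1 and R2 reads need to stay in separate dictionaries, as the question's samp1 and samp2 suggest, the substring test can be combined with a small regular expression. This is only a sketch under assumptions: the function name group_reads is made up, and it assumes the read direction always appears as "_R1" or "_R2" followed by "_" or ".", which holds for the file names listed in the question.

```python
import re

# "_R1" or "_R2" followed by "_" or "." marks the read direction.
READ_PATTERN = re.compile(r'_R([12])(?=[_.])')


def group_reads(files, sample_ids):
    """Return (samp1, samp2): files per sample ID for R1 and R2 respectively."""
    samp1 = {sid: [] for sid in sample_ids}
    samp2 = {sid: [] for sid in sample_ids}
    for name in files:
        match = READ_PATTERN.search(name)
        if not match:
            continue  # no read-direction marker, ignore the file
        target = samp1 if match.group(1) == '1' else samp2
        for sid in sample_ids:
            if sid in name:
                target[sid].append(name)
    return samp1, samp2


files = [
    "DB_ABC_2_T_bR_r1_v1_0_S1_R1_001_MM_1.faq.gz",
    "DB_ABC_2_T_bR_r1_v1_0_S1_R2_001_MM_1.faq.gz",
    "DB_DEF_S1_001_MM_R1.faq.gz",
    "DB_DEF_S1_001_MM_R2.faq.gz",
]
samp1, samp2 = group_reads(files, ["ABC", "DEF"])
print(samp1)
print(samp2)
```

The original attempt failed because '(.*)_R1' captures everything before "_R1", which is never equal to a bare sample ID such as "ABC".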

How to generate dictionary from two lists, keys with multiple values

I have a series of JPGs that are photos of sites. For some sites I have more than one photo. Each site has a unique ID, and each photo has a name like 'IDPhoto1', 'IDPhoto2', etc. I would like to isolate the unique IDs in one list and the full file paths in another, then generate a dictionary from the two lists in which each unique ID is a key and the full paths whose file names contain that ID are its values. I can generate the lists, but am unsure how to create the dictionary.
I'm updating this to include the code I have so far, which is not much. I tried someone's suggestion below to create the dictionary, but the values are not associated with the right keys, and some keys should have multiple values. How can I generate a dictionary that matches values and keys with the same ID?
UPDATE: I think I am getting closer; I've edited the code below. Now the right keys are associated with the right values, but I am having trouble associating multiple values with each key:
import os, PyPDF2, re
jpgs_dir = r'P:\Records\GIS\Projects\D04_OHS\OverheadSignStructurePics'
jpg_paths = []
jpg_ID = []
pdf_dict = {}
pattern = re.compile(r'Photo.*')
for dirpath, dirnames, filenames in os.walk(jpgs_dir):
    for filename in filenames:
        fullPath = dirpath+'\\'+filename
        if re.search(pattern,filename):
            jpg_paths.append(fullPath)
            file_ID = re.sub(pattern,'',filename)
            jpg_ID.append(file_ID)
jpg_ID_unique = set(jpg_ID)
for j in jpg_ID:
    for jp in jpg_paths:
        if j in jp:
            if j not in pdf_dict.keys():
                pdf_dict[j] = jp
            else:
                pdf_dict.update({j:jp})
print(pdf_dict)
FINAL EDIT:
To better focus this question, I've edited the original post to ask one question more clearly. I was also able to write some code that works for me; it is posted below:
import os, re
jpgs_dir = r'D:\Records\GIS\Projects\D04_OHS\OverheadSignStructurePics\jpgs_reduced'
jpg_paths = []
jpg_ID = []
pattern = re.compile(r'Photo.*')
pdf_dict = {}
pdf_file_path = r'D:\Records\GIS\Projects\D04_OHS\OverheadSignStructurePics\TIFS'
convert_txt_file = r'D:\Records\GIS\Projects\D04_OHS\OverheadSignStructurePics\Convert_JPGS_Batch.txt'
for dirpath, dirnames, filenames in os.walk(jpgs_dir):
    for filename in filenames:
        fullPath = dirpath+'\\'+filename
        if re.search(pattern,filename):
            jpg_paths.append(fullPath)
            file_ID = re.sub(pattern,'',filename)
            jpg_ID.append(file_ID)
jpg_ID_unique = set(jpg_ID)
print('total number of unique IDs =', len(jpg_ID_unique))
for j in jpg_ID_unique:
    for jp in jpg_paths:
        if j in jp:
            pdf_dict.setdefault(j, []).append(jp)
print(pdf_dict)
Also, here is an example of one of the resulting dictionary keys and its values:
{'35101023189S1': ['D:\Records\GIS\Projects\D04_OHS\OverheadSignStructurePics\jpgs_reduced\35101023189S1Photo1.jpg', 'D:\Records\GIS\Projects\D04_OHS\OverheadSignStructurePics\jpgs_reduced\35101023189S1Photo2.jpg', 'D:\Records\GIS\Projects\D04_OHS\OverheadSignStructurePics\jpgs_reduced\35101023189S1Photo3.jpg', 'D:\Records\GIS\Projects\D04_OHS\OverheadSignStructurePics\jpgs_reduced\35101023189S1Photo4.jpg']

>>> list_of_unique_ids = ['IDPhoto1', 'IDPhoto2', 'IDPhoto3']
>>> list_of_corresponding_paths = [r'Path/To/Photo1', r'Path/To/Photo2', r'Path/To/Photo3']
>>> dictionary = dict(zip(list_of_unique_ids, list_of_corresponding_paths))
>>> print dictionary
{'IDPhoto3': 'Path/To/Photo3', 'IDPhoto2': 'Path/To/Photo2', 'IDPhoto1': 'Path/To/Photo1'}
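dict(zip(...)) pairs the two lists position by position and keeps a single value per key, so on its own it cannot collect several photos under one site ID. A single-pass grouping is sketched below; it reuses the 'Photo.*' pattern from the question, while the function name group_by_id and the sample paths are made up for illustration.

```python
import re
from collections import defaultdict

PHOTO_SUFFIX = re.compile(r'Photo.*')  # same pattern as in the question


def group_by_id(paths):
    """Map each site ID to every path whose file name starts with that ID."""
    grouped = defaultdict(list)
    for path in paths:
        # Take the last path component, accepting either kind of slash.
        filename = path.replace('\\', '/').rsplit('/', 1)[-1]
        if PHOTO_SUFFIX.search(filename):
            grouped[PHOTO_SUFFIX.sub('', filename)].append(path)
    return dict(grouped)


# Hypothetical paths, for illustration only.
paths = [
    r'D:\jpgs\35101023189S1Photo1.jpg',
    r'D:\jpgs\35101023189S1Photo2.jpg',
    r'D:\jpgs\40000000000S2Photo1.jpg',
]
photo_dict = group_by_id(paths)
print(photo_dict)
```

Each path is visited once, instead of comparing every ID against every path.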

Why are these strings escaping from my regular expression in python?

In my code, I load up an entire folder into a list and then try to get rid of every file in the list except the .mp3 files.
import os
import re
path = '/home/user/mp3/'
dirList = os.listdir(path)
dirList.sort()
i = 0
for names in dirList:
    match = re.search(r'\.mp3', names)
    if match:
        i = i+1
    else:
        dirList.remove(names)
print(dirList)
print(i)
After I run the file, the code does get rid of some files in the list but keeps these two specifically:
['00. Various Artists - Indie Rock Playlist October 2008.m3u', '00. Various Artists - Indie Rock Playlist October 2008.pls']
I can't understand what's going on. Why are those two specifically escaping my search?

You are modifying your list inside a loop. That can cause issues. You should loop over a copy of the list instead (for name in dirList[:]:), or create a new list.
i = 0
modifiedDirList = []
for name in dirList:
    match = re.search(r'\.mp3', name)
    if match:
        i += 1
        modifiedDirList.append(name)
print(modifiedDirList)
Or even better, use a list comprehension:
dirList = [name for name in sorted(os.listdir(path))
           if re.search(r'\.mp3', name)]
The same thing, without a regular expression:
dirList = [name for name in sorted(os.listdir(path))
           if name.endswith('.mp3')]

Maybe you should use the glob module. Here is your entire script:
>>> import glob
>>> mp3s = sorted(glob.glob('*.mp3'))
>>> print(mp3s)
>>> print(len(mp3s))

As soon as you call dirList.remove(names), the original iterator doesn't do what you want. If you iterate over a copy of the list, it will work as expected:
for names in dirList[:]:
    ....
Alternatively, you can use list comprehensions to construct the right list:
dirList = [name for name in dirList if re.search(r'\.mp3', name)]
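To see why some entries slip through, a small self-contained demonstration may help. The file names are made up; the point is that removing an element shifts the next one into the slot the loop has just visited, so that element is never examined.

```python
names = ['a.mp3', 'b.m3u', 'c.pls', 'd.mp3']  # made-up file names

# Buggy: removing 'b.m3u' shifts 'c.pls' into the slot the loop just visited,
# so the iterator moves on to 'd.mp3' and never inspects 'c.pls'.
buggy = list(names)
for name in buggy:
    if not name.endswith('.mp3'):
        buggy.remove(name)

# Safe: build a new list and leave the one being iterated untouched.
fixed = [name for name in names if name.endswith('.mp3')]

print(buggy)  # ['a.mp3', 'c.pls', 'd.mp3']
print(fixed)  # ['a.mp3', 'd.mp3']
```

This is the same pattern as in the question: whenever two non-matching names sit next to each other in the sorted list, the second one survives.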
