Why are these strings escaping from my regular expression in python?


In my code, I load up an entire folder into a list and then try to get rid of every file in the list except the .mp3 files.
import os
import re
path = '/home/user/mp3/'
dirList = os.listdir(path)
dirList.sort()
i = 0
for names in dirList:
    match = re.search(r'\.mp3', names)
    if match:
        i = i+1
    else:
        dirList.remove(names)
print dirList
print i
After I run the file, the code does get rid of some files in the list but keeps these two specifically:
['00. Various Artists - Indie Rock Playlist October 2008.m3u', '00. Various Artists - Indie Rock Playlist October 2008.pls']
I can't understand what's going on. Why are those two specifically escaping my search?

You are modifying your list inside a loop that iterates over it. Each removal shifts the remaining items one position to the left, so the loop skips the item that follows every removed one. You should loop over a copy of the list instead (for name in dirList[:]:), or create a new list.
modifiedDirList = []
for name in dirList:
    match = re.search(r'\.mp3', name)
    if match:
        i += 1
        modifiedDirList.append(name)
print modifiedDirList
Or even better, use a list comprehension:
dirList = [name for name in sorted(os.listdir(path))
           if re.search(r'\.mp3', name)]
The same thing, without a regular expression:
dirList = [name for name in sorted(os.listdir(path))
           if name.endswith('.mp3')]
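A small sketch of why removing items from a list while looping over it leaves some behind. The file names are invented for illustration and are not the asker's directory listing.

```python
# Hypothetical, already sorted directory listing (invented names).
names = ['a.m3u', 'a.pls', 'b.mp3', 'c.txt', 'd.nfo']

buggy = list(names)
for name in buggy:
    if not name.endswith('.mp3'):
        # Removing shifts the later items left by one, so the loop's
        # internal index now points past the item that moved into this slot.
        buggy.remove(name)

# Building a new list never mutates the sequence being iterated.
fixed = [name for name in names if name.endswith('.mp3')]

print(buggy)  # ['a.pls', 'b.mp3', 'd.nfo']
print(fixed)  # ['b.mp3']
```

Every item that directly follows a removed one is never examined, which is how two adjacent non-mp3 names can leave one survivor.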

Maybe you should use the glob module. Here is your entire script:
>>> import glob
>>> mp3s = sorted(glob.glob('*.mp3'))
>>> print mp3s
>>> print len(mp3s)

As soon as you call dirList.remove(names), the original iterator doesn't do what you want. If you iterate over a copy of the list, it will work as expected:
for names in dirList[:]:
    ....
Alternatively, you can use list comprehensions to construct the right list:
dirList = [name for name in dirList if re.search(r'\.mp3', name)]
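One detail worth checking with the pattern used throughout this thread: r'\.mp3' matches ".mp3" anywhere in the name, not only at the end. The sketch below, with invented file names, compares it against a pattern anchored with $.

```python
import re

# Invented names for illustration only.
names = ['track.mp3', 'track.mp3.bak', 'list.m3u']

# Unanchored: ".mp3" may appear anywhere in the name.
loose = [n for n in names if re.search(r'\.mp3', n)]

# Anchored: ".mp3" must be the very end of the name.
strict = [n for n in names if re.search(r'\.mp3$', n)]

print(loose)   # ['track.mp3', 'track.mp3.bak']
print(strict)  # ['track.mp3']
```

name.endswith('.mp3') gives the same result as the anchored pattern without needing a regular expression.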

Related

Python regular expression for a string and match them into a dictionary

I have some files in a directory and I want to match them against a list of strings and collect the results in a dictionary.
The files in dir look like the following:
DB_ABC_2_T_bR_r1_v1_0_S1_R1_001_MM_1.faq.gz
DB_ABC_2_T_bR_r1_v1_0_S1_R2_001_MM_1.faq.gz
DB_DEF_S1_001_MM_R1.faq.gz
DB_DEF_S1_001_MM_R2.faq.gz
The list has part of the filename as,
ABC
DEF
So here is what I tried,
import os
import re
dir='/user/home/files'
list='/user/home/list'
samp1 = {}
samp2 = {}
FH_sample = open(list, 'r')
for line in FH_sample:
    samp1[line.strip().split('\n')[0]] =[]
    samp2[line.strip().split('\n')[0]] =[]
FH_sample.close()
for file in os.listdir(dir):
    m1 =re.search('(.*)_R1', file)
    m2 = re.search('(.*)_R2', file)
    if m1 and m1.group(1) in samp1:
        samp1[m1.group(1)].append(file)
    if m2 and m2.group(1) in samp2:
        samp2[m2.group(1)].append(file)
I wanted the above script to find the matches from m1 and m2 and collect them in the dictionaries samp1 and samp2. But the script does not find any matches inside the if statements, so samp1 and samp2 stay empty.
This is what the output should look like for samp1 and samp2:
{'ABC': [DB_ABC_2_T_bR_r1_v1_0_S1_R1_001_MM_1.faq.gz, DB_ABC_2_T_bR_r1_v1_0_S1_R2_001_MM_1.faq.gz], 'DEF': [DB_DEF_S1_001_MM_R1.faq.gz, DB_DEF_S1_001_MM_R2.faq.gz]}
Any help would be greatly appreciated

A lot of this code you probably don't need. You could just see if the substring that you have from list is in dir.
The code below reads in the data as lists. You seem to have already done this, so it will simply be a matter of replacing files with the file names you read in from dir and replacing st with the substrings from list (which you shouldn't use as a variable name since it is actually used for something else in Python).
files = ["BSSE_QGF_1987_HJUS_1_MOHUA_2_T_bR_r1_v1_0_S1_R1_001_MM_1.faq.gz",
         "BSSE_QGF_1967_HJUS_1_MOHUA_2_T_bR_r1_v1_0_S1_R2_001_MM_1.faq.gz",
         "BSSE_QGF_18565_H33HLAFXY_1_MSJLF_T_bulk_RNA_S1_R1_001_MM_1.faq.gz",
         "BSSE_QGF_18565_H33HLAFXY_1_MSJLF_T_bulk_RNA_S1_R2_001_MM_1.faq.gz"]
my_strings = ["MOHUA", "MSJLF"]
res = {s: [] for s in my_strings}
for k in my_strings:
    for file in files:
        if k in file:
            res[k].append(file)
print(res)
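It may help to see what the question's pattern actually captures. This sketch runs '(.*)_R1' from the question against one of the question's file names; the variable names captured, samples and matches are only illustrative.

```python
import re

filename = 'DB_ABC_2_T_bR_r1_v1_0_S1_R1_001_MM_1.faq.gz'
samples = ['ABC', 'DEF']

# The greedy group takes everything before "_R1", not just the sample id.
m1 = re.search('(.*)_R1', filename)
captured = m1.group(1)
print(captured)             # DB_ABC_2_T_bR_r1_v1_0_S1
print(captured in samples)  # False, so nothing is ever appended

# A plain substring test finds the sample id wherever it sits in the name.
matches = [s for s in samples if s in filename]
print(matches)              # ['ABC']
```

Because the captured text is the whole prefix of the file name, it never equals a dictionary key such as 'ABC', which is why the substring check works where the regular expression did not.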

You can pass the script the directory path on the command line, define id_list, use its entries as dict keys, and append each fastq file whose name contains the key:
import os
import sys
dir_path = sys.argv[1]
fastqs=[]
for x in os.listdir(dir_path):
    if x.endswith(".faq.gz"):
        fastqs.append(x)
id_list = ['MOHUA', 'MSJLF']
sample_dict = dict((sample,[]) for sample in id_list)
print(sample_dict)
for k in sample_dict:
    for z in fastqs:
        if k in z:
            sample_dict[k].append(z)
print(sample_dict)
to run:
python3.6 fq_finder.py /path/to/fastqs
output from above to show what is going on:
{'MOHUA': [], 'MSJLF': []} # first print creates dict with empty list as vals for keys
{'MOHUA': ['BSSE_QGF_1987_HJUS_1_MOHUA_2_T_bR_r1_v1_0_S1_R1_001_MM_1.faq.gz', 'BSSE_QGF_1967_HJUS_1_MOHUA_2_T_bR_r1_v1_0_S1_R2_001_MM_1.faq.gz'], 'MSJLF': ['BSSE_QGF_18565_H33HLAFXY_1_MSJLF_T_bulk_RNA_S1_R2_001_MM_1.faq.gz', 'BSSE_QGF_18565_H33HLAFXY_1_MSJLF_T_bulk_RNA_S1_R1_001_MM_1.faq.gz']}

Python: Search for Substring Inside List Using Unique Values

I have two lists, both containing file paths to PDFs. The first list contains PDFs that have unique file names. The second list contains the file names with the same unique file names that need to be matched up to the first list, although it is possible that there could be multiple PDFs in the second list that could be matched to the first. It is a one to many relationship from ListA to ListB. Below is an example.
List A: C:\FolderA\A.pdf, C:\FolderA\B.pdf, C:\FolderA\C.pdf
List B: C:\FolderB\A_1.pdf, C:\FolderB\B_1.pdf, C:\FolderB\C_1.pdf, C:\FolderB\C_2.pdf
I need to find a way to iterate through both lists and combine the PDFs by matching the unique filename. If I can find a way to iterate and match the files, then I think I can combine the PDFs on my own. Below is the code I have so far.
import os
folderA = r"C:\FolderA"
ListA = []
for root, dirs, filenames in os.walk(folderA):
    for filename in filenames:
        ListA.append(str(filename))
        filepath = os.path.join(root, filename)
        ListA.append(str(filepath))
folderB = r"C:\FolderB"
ListB = []
for root, dirs, filenames in os.walk(folderB):
    for filename in filenames:
        filepath = os.path.join(root, filename)
        ListB.append(str(filepath))
#Split ListB to file name only without the "_#" so it can be matched to the PDFs in ListA.
for pdfValue in ListB:
    pdfsplit = pdfValue.split(".")[0]
    pdfsplit1 = pdfsplit.split("\\")[-1]
    pdfsplit2 = pdfsplit1.rsplit("_", 1)[0]
    for pdfValue2 in ListA:
        if pdfsplit2 in ListA:
            #combine PDF code
I have verified everything works up to the last if statement. From here is when I am not sure how to go about it. I know how to search for a substring within a string, but I cannot get it to work correctly with a list. No matter how I code it, I either end up in an endless loop or it does not successfully match.
Any ideas on how to make this work, if it is possible?

It would be better to gather all the information together in one data structure, rather than in separate lists. That should allow you to reduce your code to a single function.
Completely untested, but something like this should work.
import os
from collections import defaultdict
pdfs = defaultdict(lambda: defaultdict(list))
def find_pdfs(pdfs, folder, split=False):
    for root, dirs, filenames in os.walk(folder):
        for filename in filenames:
            basename, ext = os.path.splitext(filename)
            if ext == '.pdf':
                if split:
                    # keep only the part before the first "_"
                    basename = basename.partition('_')[0]
                pdfs[basename][root].append(filename)
find_pdfs(pdfs, folderA)
find_pdfs(pdfs, folderB, True)
This should produce a data structure like this:
pdfs = {
    'A':
        {'C:\FolderA': ['A.pdf'],
         'C:\FolderB': ['A_1.pdf']},
    'B':
        {'C:\FolderA': ['B.pdf'],
         'C:\FolderB': ['B_1.pdf']},
    'C':
        {'C:\FolderA': ['C.pdf'],
         'C:\FolderB': ['C_1.pdf', 'C_2.pdf']},
}

I think what you want to do is create a collections.defaultdict and set it up to hold lists of matching names.
import collections
matching_files = collections.defaultdict(list)
You can then strip the filenames in folder B down to base names, and put the paths into the dict:
matching_files[pdfsplit2].append(pdfValue)
Now you have a list of pdf files from folder B, grouped by base name. Go back to folder A and do the same thing (split off the path and extension, use that for the key, add the full path to the list). You'll have lists, which have files sharing a common base name.
for key,file_list in matching_files.items(): #use .iteritems() for py-2.x
    print("Files with base name '%s':"%key)
    print(' ', '\n '.join(file_list))
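Put together, the defaultdict approach might look like the sketch below. It uses the example paths from the question. The helper base_name and the choice of ntpath (the standard-library module that applies Windows path rules on any platform) are assumptions made for the sake of a self-contained example, not part of the answer above.

```python
import ntpath  # Windows path handling that also works on non-Windows systems
from collections import defaultdict

list_a = [r'C:\FolderA\A.pdf', r'C:\FolderA\B.pdf', r'C:\FolderA\C.pdf']
list_b = [r'C:\FolderB\A_1.pdf', r'C:\FolderB\B_1.pdf',
          r'C:\FolderB\C_1.pdf', r'C:\FolderB\C_2.pdf']

def base_name(path, strip_suffix=False):
    """Hypothetical helper: file name without folder and extension."""
    stem = ntpath.splitext(ntpath.basename(path))[0]
    if strip_suffix:
        # Drop the trailing "_<n>" used by the files in folder B.
        stem = stem.rsplit('_', 1)[0]
    return stem

matching_files = defaultdict(list)
for path in list_a:
    matching_files[base_name(path)].append(path)
for path in list_b:
    matching_files[base_name(path, strip_suffix=True)].append(path)

for key, file_list in sorted(matching_files.items()):
    print(key, file_list)
```

Each key then holds the folder A path first, followed by every folder B path that shares its base name, which is the one-to-many grouping the question asks for.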

To compare the two file names, rather than splitting on the '_', you could try the str.startswith() method:
A.startswith(B) returns True if string A begins with string B.
In your case, your code would be:
match={} #the dictionary where you will store the matching names
for pdfValue in ListA:
    match[pdfValue]=[] # To create an entry in the dictionary with the wanted keyword
    A=pdfValue.split("\\")[-1].rsplit(".", 1)[0] #You want just the filename part, without the ".pdf" extension
    for pdfValue2 in ListB:
        B=pdfValue2.split("\\")[-1]
        if B.startswith(A): # Then B has the same unique file name as A
            match[pdfValue].append(pdfValue2) #so you associate it with A in the dictionary
I hope it works for you

One more solution
lista = ['C:\FolderA\A.pdf', 'C:\FolderA\B.pdf', 'C:\FolderA\C.pdf']
listb = ['C:\FolderB\A_1.pdf', 'C:\FolderB\B_1.pdf', 'C:\FolderB\C_1.pdf', 'C:\FolderB\C_2.pdf']
# get the filenames for folder a and folder b
lista_filenames = [l.split('\\')[-1].split('.')[0] for l in lista]
listb_filenames = [l.split('\\')[-1].split('.')[0] for l in listb]
# create a dictionary to store lists of mappings
from collections import defaultdict
data_structure = defaultdict(list)
for i in lista_filenames:
    for j in listb_filenames:
        if i in j:
            data_structure['C:\\FolderA\\' + i +'.pdf'].append('C:\\FolderB\\' + j +'.pdf')
# this is how the mapping dictionary looks like
print data_structure
results in :
defaultdict(<type 'list'>, {'C:\\FolderA\\C.pdf': ['C:\\FolderB\\C_1.pdf', 'C:\\FolderB\\C_2.pdf'], 'C:\\FolderA\\A.pdf': ['C:\\FolderB\\A_1.pdf'], 'C:\\FolderA\\B.pdf': ['C:\\FolderB\\B_1.pdf']})

Regex on list element in for loop

I have a script that searches through config files and finds all matches of strings from another list as follows:
dstn_dir = "C:/xxxxxx/foobar"
dst_list =[]
files = [fn for fn in os.listdir(dstn_dir)if fn.endswith('txt')]
dst_list = []
for file in files:
    parse = CiscoConfParse(dstn_dir+'/'+file)
    for sfarm in search_str:
        int_objs = parse.find_all_children(sfarm)
        if len(int_objs) > 0:
            dst_list.append(["\n","#" *40,file + " " + sfarm,"#" *40])
            dst_list.append(int_objs)
I need to change this part of the code:
for sfarm in search_str:
    int_objs = parse.find_all_children(sfarm)
search_str is a list containing strings similar to ['xrout:55','old:23'] and many others.
I want it to find only entries that end with the string from the list I am iterating through in sfarm. My understanding is that this would require me to use re and match on something like sfarm$, but I'm not sure how to do this as part of the loop.
Am I correct in saying that sfarm is an iterable? If so I need to know how to regex on an iterable object in this context.

Strings in python are iterable, so sfarm is an iterable, but that has little meaning in this case. From reading what CiscoConfParse.find_all_children() does, it is apparent that your sfarm is the linespec, which is a regular expression string. You do not need to explicitly use the re module here; just pass sfarm concatenated with '$':
search_str = ['xrout:55','old:23']
...
for sfarm in search_str:
    int_objs = parse.find_all_children(sfarm + '$') # one of many ways to concat
...
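To see what the trailing '$' changes, here is a sketch that uses only the standard re module on a few invented configuration lines. It does not call CiscoConfParse, so it only illustrates the regular-expression side of the suggestion.

```python
import re

# Invented lines for illustration; not taken from a real config file.
lines = ['serverfarm xrout:55', 'serverfarm xrout:555', 'probe old:23 extra']
search_str = ['xrout:55', 'old:23']

results = {}
for sfarm in search_str:
    # Without an anchor the pattern may match anywhere in the line.
    loose = [ln for ln in lines if re.search(sfarm, ln)]
    # With '$' appended the line has to end with the pattern.
    anchored = [ln for ln in lines if re.search(sfarm + '$', ln)]
    results[sfarm] = (loose, anchored)

for sfarm, (loose, anchored) in results.items():
    print(sfarm, loose, anchored)
```

Here 'xrout:55' matches two lines unanchored but only 'serverfarm xrout:55' once anchored, and 'old:23' stops matching 'probe old:23 extra' altogether. If the search strings could contain regex metacharacters, wrapping them in re.escape() before appending '$' would be a sensible precaution.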

Please check this code. It uses the glob module to get all "*.txt" files in the folder.
See the glob module documentation for more information.
import glob
import re
dst_list = []
search_str = ['xrout:55','old:23']
for file_name in glob.glob(r'C:/Users/dinesh_pundkar\Desktop/*.txt'):
    with open(file_name,'r') as f:
        text = f.read()
    for sfarm in search_str:
        # re.MULTILINE lets $ match at the end of every line, not only at the end of the file
        regex = re.compile('%s$'%sfarm, re.MULTILINE)
        int_objs = regex.findall(text)
        if len(int_objs) > 0:
            dst_list.append(["\n","#" *40,file_name + " " + sfarm,"#" *40])
            dst_list.append(int_objs)
print dst_list
Output:
C:\Users\dinesh_pundkar\Desktop>python a.py
[['\n', '########################################', 'C:/Users/dinesh_pundkar\\Desktop\\out.txt old:23', '########################################'], ['old:23']]
C:\Users\dinesh_pundkar\Desktop>

finding a file name from a substring

I have a list of my filenames that I've saved as follows:
filelist = os.listdir(mypath)
Now, suppose one of my files is something like "KRAS_P01446_3GFT_SOMETHING_SOMETHING.txt".
However, all I know ahead of time is that I have a file called "KRAS_P01446_3GFT_*". How can I get the full file name from file list using just "KRAS_P01446_3GFT_*"?
As a simpler example, I've made the following:
mylist = ["hi_there", "bye_there","hello_there"]
Suppose I had the string "hi". How would I make it return mylist[0] = "hi_there".
Thanks!

In the first example, you could just use the glob module:
import glob
import os
print '\n'.join(glob.iglob(os.path.join(mypath, "KRAS_P01446_3GFT_*")))
Do this instead of os.listdir.
The second example seems tenuously related to the first (X-Y problem?), but here's an implementation:
mylist = ["hi_there", "bye_there","hello_there"]
print '\n'.join(s for s in mylist if s.startswith("hi"))

If you mean "give me all filenames starting with some prefix", then this is simple:
[fname for fname in mylist if fname.startswith('hi')]
If you mean something more complex, for example patterns like "some_*_file" matching "some_good_file" and "some_bad_file", then look at the re (regular expression) module.
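For shell-style wildcards such as "some_*_file", the standard-library fnmatch module is another option besides regular expressions. A minimal sketch, with an invented list of names:

```python
import fnmatch

# Invented names for illustration.
mylist = ['some_good_file', 'some_bad_file', 'other_file', 'hi_there']

# fnmatch.filter keeps the names that match a shell-style pattern.
matches = fnmatch.filter(mylist, 'some_*_file')
print(matches)  # ['some_good_file', 'some_bad_file']

# The same mechanism covers the simple prefix case from the question.
print(fnmatch.filter(mylist, 'hi*'))  # ['hi_there']
```

This is the same pattern syntax that glob uses, applied to a list of strings rather than to the file system.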

mylist = ["hi_there", "bye_there","hello_there"]
partial = "hi"
[fullname for fullname in mylist if fullname.startswith(partial)]

If the list is not very big, you can do a per item check like this.
def findMatchingFile (fileList, stringToMatch) :
    listOfMatchingFiles = []
    for file in fileList:
        if file.startswith(stringToMatch):
            listOfMatchingFiles.append(file)
    return listOfMatchingFiles
There are more "pythonic" ways of doing this, but I prefer this as it is more readable.

batch renaming 100K files with python

I have a folder with over 100,000 files, all numbered with the same stub, but without leading zeros, and the numbers aren't always contiguous (usually they are, but there are gaps) e.g:
file-21.png,
file-22.png,
file-640.png,
file-641.png,
file-642.png,
file-645.png,
file-2130.png,
file-2131.png,
file-3012.png,
etc.
I would like to batch process this to create padded, contiguous files. e.g:
file-000000.png,
file-000001.png,
file-000002.png,
file-000003.png,
When I parse the folder with for filename in os.listdir('.'): the files don't come up in the order I'd like them to. Understandably they come up
file-1,
file-1x,
file-1xx,
file-1xxx,
etc. then
file-2,
file-2x,
file-2xx,
etc. How can I get it to go through in the order of the numeric value? I am a complete Python noob, but looking at the docs I'm guessing I could use map to create a new list containing only the numerical part, sort that list, and then iterate over it? With over 100K files this could be heavy. Any tips welcome!

import os
import re
thenum = re.compile(r'^file-(\d+)\.png$')
def bynumber(fn):
    mo = thenum.match(fn)
    if mo: return int(mo.group(1))
    return -1 # non-matching names sort first (on Python 3, None cannot be compared with int)
allnames = os.listdir('.')
allnames.sort(key=bynumber)
Now you have the files in the order you want them and can loop
for i, fn in enumerate(allnames):
    ...
using the progressive number i (which will be 0, 1, 2, ...) padded as you wish in the destination-name.
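As a rough sketch of how the numeric sort and the progressive index fit together, the function below only computes (old name, new name) pairs and touches no files. The name plan_renames and the six-digit padding are assumptions based on the example in the question.

```python
import re

THE_NUM = re.compile(r'^file-(\d+)\.png$')

def plan_renames(names):
    """Return (old, new) pairs in numeric order, with contiguous padded numbers."""
    numbered = []
    for name in names:
        mo = THE_NUM.match(name)
        if mo:  # names that do not fit the pattern are ignored
            numbered.append((int(mo.group(1)), name))
    numbered.sort()  # sorts by the integer value, not by the text
    return [(old, 'file-%06d.png' % i) for i, (_, old) in enumerate(numbered)]

plan = plan_renames(['file-640.png', 'file-22.png', 'notes.txt',
                     'file-21.png', 'file-2130.png'])
for old, new in plan:
    print(old, '->', new)
```

Sorting 100K integers this way is cheap. Checking the printed plan before passing each pair to os.rename is an easy way to catch surprises.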

There are three steps. The first is getting all the filenames. The second is converting the filenames. The third is renaming them.
If all the files are in the same folder, then glob should work.
import glob
filenames = glob.glob("/path/to/folder/*.txt")
Next, you want to change the name of the file. You can print with padding to do this.
>>> filename = "file-338.txt"
>>> import os
>>> fnpart = os.path.splitext(filename)[0]
>>> fnpart
'file-338'
>>> _, num = fnpart.split("-")
>>> num.rjust(5, "0")
'00338'
>>> newname = "file-%s.txt" % num.rjust(5, "0")
>>> newname
'file-00338.txt'
Now, you need to rename them all. os.rename does just that.
os.rename(filename, newname)
To put it together:
for filename in glob.glob("/path/to/folder/*.txt"): # loop through each file
    newname = make_new_filename(filename) # create a function that does step 2, above
    os.rename(filename, newname)
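The answer leaves make_new_filename to the reader. One possible version, assuming names of the form "<stub>-<number>.<ext>" and a padding width of 5 as in the interactive session above, could be:

```python
import os

def make_new_filename(filename, width=5):
    """Hypothetical helper for step 2: zero-pad the number after the last '-'."""
    folder, base = os.path.split(filename)   # keep the folder so the file stays put
    stem, ext = os.path.splitext(base)       # 'file-338', '.txt'
    prefix, num = stem.rsplit('-', 1)        # 'file', '338'
    return os.path.join(folder, '%s-%s%s' % (prefix, num.zfill(width), ext))

print(make_new_filename('file-338.txt'))  # file-00338.txt
```

Splitting off the folder first matters because glob.glob returns full paths here, and splitting the whole path on '-' would break for folders whose names contain a hyphen.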

Thank you all for your suggestions, I will try them all to learn the different approaches. The solution I went for is based on using a natural sort on my filelist, and then iterating that to rename. This was one of the suggested answers but for some reason it has disappeared now so I cannot mark it as accepted!
import os
files = os.listdir('.')
natsort(files)
index = 0
for filename in files:
    os.rename(filename, str(index).zfill(7)+'.png')
    index += 1
where natsort is defined in http://code.activestate.com/recipes/285264-natural-string-sorting/
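If the linked recipe is not at hand, a natural sort can be approximated with a key function. This is a generic sketch, not the code from that recipe; natural_key is a name chosen here for illustration.

```python
import re

def natural_key(name):
    """Split a name into text and number chunks so numbers compare by value."""
    return [int(part) if part.isdigit() else part.lower()
            for part in re.split(r'(\d+)', name)]

files = ['file-10.png', 'file-2.png', 'file-1.png', 'file-21.png']
files.sort(key=natural_key)
print(files)  # ['file-1.png', 'file-2.png', 'file-10.png', 'file-21.png']
```

Because re.split with a capturing group always alternates text and digit chunks, the keys compare text with text and integers with integers.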

Why don't you do it in a two-step process? Parse all the files and rename them with padded numbers, then run another script that takes those files, which now sort correctly, and renames them so they're contiguous.

1) Take the number in the filename.
2) Left-pad it with zeros
3) Save name.

import os
def renamer():
    for iname in os.listdir('.'):
        first, second = iname.replace(" ", "").split("-")
        number, ext = second.split('.')
        first, number, ext = first.strip(), number.strip(), ext.strip()
        number = '0'*(6-len(number)) + number # pad the number to be 6 digits long
        oname = first + "-" + number + '.' + ext
        os.rename(iname, oname)
    print "Done"
Hope this helps

The simplest method is given below. You can also modify this script for a recursive search.
Use the os module.
Get the filenames.
Rename them with os.rename.
import os
class Renamer:
    def __init__(self, pattern, extension):
        self.ext = extension
        self.pat = pattern
    def rename(self):
        p, e = (self.pat, self.ext)
        number = 0
        for x in os.listdir(os.getcwd()):
            if str(x).endswith(f".{e}"):
                os.rename(x, f'{p}_{number}.{e}')
                number+=1
if __name__ == "__main__":
    pattern = "myfile"
    extension = "txt"
    r = Renamer(pattern=pattern, extension=extension)
    r.rename()
