finding a file name from a substring - python

I have a list of my filenames that I've saved as follows:
filelist = os.listdir(mypath)
Now, suppose one of my files is something like "KRAS_P01446_3GFT_SOMETHING_SOMETHING.txt".
However, all I know ahead of time is that I have a file called "KRAS_P01446_3GFT_*". How can I get the full file name from file list using just "KRAS_P01446_3GFT_*"?
As a simpler example, I've made the following:
mylist = ["hi_there", "bye_there","hello_there"]
Suppose I had the string "hi". How would I make it return mylist[0] = "hi_there"?
Thanks!

In the first example, you could just use the glob module:
import glob
import os
print('\n'.join(glob.iglob(os.path.join(mypath, "KRAS_P01446_3GFT_*"))))
Do this instead of os.listdir.
The second example seems tenuously related to the first (X-Y problem?), but here's an implementation:
mylist = ["hi_there", "bye_there","hello_there"]
print('\n'.join(s for s in mylist if s.startswith("hi")))

If you mean "give me all filenames starting with some prefix", then this is simple:
[fname for fname in mylist if fname.startswith('hi')]
If you mean something more complex, for example patterns like "some_*_file" matching "some_good_file" and "some_bad_file", then look at the re (regular expression) module.
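For shell-style wildcards like the one in the question, a regular expression is not strictly needed. Here is a minimal sketch using the standard-library fnmatch module on a list of names; the sample names are made up for illustration.

```python
import fnmatch

# Hypothetical sample names, in the spirit of the examples above.
names = [
    "some_good_file",
    "some_bad_file",
    "another_file",
    "KRAS_P01446_3GFT_A_B.txt",
]

# fnmatch.filter keeps only the names that match a shell-style pattern.
wildcard_matches = fnmatch.filter(names, "some_*_file")
kras_matches = fnmatch.filter(names, "KRAS_P01446_3GFT_*")

print(wildcard_matches)
print(kras_matches)
```

This works on any list of strings, so it can be applied to the result of os.listdir without touching the file system again.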

mylist = ["hi_there", "bye_there","hello_there"]
partial = "hi"
[fullname for fullname in mylist if fullname.startswith(partial)]
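The comprehension above returns every match. If, as in the question, a single full name is wanted, one possible sketch takes the first match and falls back to None when nothing matches. The helper name first_match is made up for this example.

```python
def first_match(names, prefix):
    """Return the first name that starts with prefix, or None if there is none."""
    return next((name for name in names if name.startswith(prefix)), None)


mylist = ["hi_there", "bye_there", "hello_there"]

print(first_match(mylist, "hi"))    # first name starting with "hi"
print(first_match(mylist, "nope"))  # no match, so the default None is returned
```

Returning None instead of raising keeps the caller in charge of what a missing file means.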

If the list is not very big, you can do a per item check like this.
def findMatchingFile(fileList, stringToMatch):
    listOfMatchingFiles = []
    for file in fileList:
        if file.startswith(stringToMatch):
            listOfMatchingFiles.append(file)
    return listOfMatchingFiles
There are more "pythonic" ways of doing this, but I prefer this one as it is more readable.

Related

Python regular expression for a string and match them into a dictionary

I have several files in a directory and I want to match them against a list of strings and collect the matches in a dictionary.
The files in the directory look like the following:
DB_ABC_2_T_bR_r1_v1_0_S1_R1_001_MM_1.faq.gz
DB_ABC_2_T_bR_r1_v1_0_S1_R2_001_MM_1.faq.gz
DB_DEF_S1_001_MM_R1.faq.gz
DB_DEF_S1_001_MM_R2.faq.gz
The list contains part of each filename:
ABC
DEF
So here is what I tried,
import os
import re
dir='/user/home/files'
list='/user/home/list'
samp1 = {}
samp2 = {}
FH_sample = open(list, 'r')
for line in FH_sample:
    samp1[line.strip().split('\n')[0]] =[]
    samp2[line.strip().split('\n')[0]] =[]
FH_sample.close()
for file in os.listdir(dir):
    m1 =re.search('(.*)_R1', file)
    m2 = re.search('(.*)_R2', file)
    if m1 and m1.group(1) in samp1:
        samp1[m1.group(1)].append(file)
    if m2 and m2.group(1) in samp2:
        samp2[m2.group(1)].append(file)
I wanted the above script to find the matches from m1 and m2 and collect them in the dictionaries samp1 and samp2. But the script never finds a match inside the if blocks, so the lists in samp1 and samp2 stay empty.
This is what the output should look like for samp1 and samp2:
{'ABC': [DB_ABC_2_T_bR_r1_v1_0_S1_R1_001_MM_1.faq.gz, DB_ABC_2_T_bR_r1_v1_0_S1_R2_001_MM_1.faq.gz], 'DEF': [DB_DEF_S1_001_MM_R1.faq.gz, DB_DEF_S1_001_MM_R2.faq.gz]}
Any help would be greatly appreciated
A lot of this code you probably don't need. You could just check whether each substring from list occurs in a file name from dir.
The code below uses hard-coded lists. You seem to have already read your data in, so it will simply be a matter of replacing files with the file names you read in from dir and replacing my_strings with the substrings from list (which you shouldn't use as a variable name, since it shadows Python's built-in list type).
files = ["BSSE_QGF_1987_HJUS_1_MOHUA_2_T_bR_r1_v1_0_S1_R1_001_MM_1.faq.gz",
"BSSE_QGF_1967_HJUS_1_MOHUA_2_T_bR_r1_v1_0_S1_R2_001_MM_1.faq.gz",
"BSSE_QGF_18565_H33HLAFXY_1_MSJLF_T_bulk_RNA_S1_R1_001_MM_1.faq.gz",
"BSSE_QGF_18565_H33HLAFXY_1_MSJLF_T_bulk_RNA_S1_R2_001_MM_1.faq.gz"]
my_strings = ["MOHUA", "MSJLF"]
res = {s: [] for s in my_strings}
for k in my_strings:
    for file in files:
        if k in file:
            res[k].append(file)
print(res)
You can pass the Python script a directory path, define id_list, use its entries as dict keys, and append each fastq file name to a key's list if that key occurs in the file name:
import os
import sys
dir_path = sys.argv[1]
fastqs=[]
for x in os.listdir(dir_path):
    if x.endswith(".faq.gz"):
        fastqs.append(x)
id_list = ['MOHUA', 'MSJLF']
sample_dict = dict((sample,[]) for sample in id_list)
print(sample_dict)
for k in sample_dict:
    for z in fastqs:
        if k in z:
            sample_dict[k].append(z)
print(sample_dict)
to run:
python3.6 fq_finder.py /path/to/fastqs
output from above to show what is going on:
{'MOHUA': [], 'MSJLF': []} # first print creates dict with empty list as vals for keys
{'MOHUA': ['BSSE_QGF_1987_HJUS_1_MOHUA_2_T_bR_r1_v1_0_S1_R1_001_MM_1.faq.gz', 'BSSE_QGF_1967_HJUS_1_MOHUA_2_T_bR_r1_v1_0_S1_R2_001_MM_1.faq.gz'], 'MSJLF': ['BSSE_QGF_18565_H33HLAFXY_1_MSJLF_T_bulk_RNA_S1_R2_001_MM_1.faq.gz', 'BSSE_QGF_18565_H33HLAFXY_1_MSJLF_T_bulk_RNA_S1_R1_001_MM_1.faq.gz']}
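The question originally kept the R1 and R2 files in two separate dictionaries. Here is a possible sketch of that. It assumes the read tag always appears as _R1 or _R2 followed by an underscore or a dot. The helper name group_by_read is hypothetical, and the file names are the ones from the question.

```python
import re


def group_by_read(files, sample_ids):
    """Return (samp1, samp2): dicts mapping each sample id to its R1 / R2 files."""
    samp1 = {s: [] for s in sample_ids}
    samp2 = {s: [] for s in sample_ids}
    for fname in files:
        # Assumption: the read tag looks like "_R1_" or "_R1." (same for R2).
        m = re.search(r'_(R[12])(?=[_.])', fname)
        if m is None:
            continue  # no read tag found, skip this file
        target = samp1 if m.group(1) == 'R1' else samp2
        for s in sample_ids:
            if s in fname:
                target[s].append(fname)
    return samp1, samp2


files = [
    "DB_ABC_2_T_bR_r1_v1_0_S1_R1_001_MM_1.faq.gz",
    "DB_ABC_2_T_bR_r1_v1_0_S1_R2_001_MM_1.faq.gz",
    "DB_DEF_S1_001_MM_R1.faq.gz",
    "DB_DEF_S1_001_MM_R2.faq.gz",
]
samp1, samp2 = group_by_read(files, ["ABC", "DEF"])
print(samp1)
print(samp2)
```

The pattern is case-sensitive on purpose, so the lowercase r1 inside bR_r1 is not mistaken for a read tag.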

Python: Group a list of file names according to common name identifier

In a directory I have some files:
temperature_Resu05_les_spec_r0.0300.0
temperature_Resu05_les_spec_r0.0350.0
temperature_Resu05_les_spec_r0.0400.0
temperature_Resu05_les_spec_r0.0450.0
temperature_Resu06_les_spec_r0.0300.0
temperature_Resu06_les_spec_r0.0350.0
temperature_Resu06_les_spec_r0.0400.0
temperature_Resu06_les_spec_r0.0450.0
temperature_Resu07_les_spec_r0.0300.0
temperature_Resu07_les_spec_r0.0350.0
temperature_Resu07_les_spec_r0.0400.0
temperature_Resu07_les_spec_r0.0450.0
temperature_Resu08_les_spec_r0.0300.0
temperature_Resu08_les_spec_r0.0350.0
temperature_Resu08_les_spec_r0.0400.0
temperature_Resu08_les_spec_r0.0450.0
temperature_Resu09_les_spec_r0.0300.0
temperature_Resu09_les_spec_r0.0350.0
temperature_Resu09_les_spec_r0.0400.0
temperature_Resu09_les_spec_r0.0450.0
I need a list of all the files that have the same identifier XXXX as in _rXXXX. For example one such list would be composed of
temperature_Resu05_les_spec_r0.0300.0
temperature_Resu06_les_spec_r0.0300.0
temperature_Resu07_les_spec_r0.0300.0
temperature_Resu08_les_spec_r0.0300.0
temperature_Resu09_les_spec_r0.0300.0
I don't know a priori what the XXXX values are going to be, so I can't iterate through them and match like that. I'm thinking this might best be handled with a regular expression. Any ideas?
Yes, regular expressions are a fun way to do it! It could look something like this:
import re

results = {}
for fname in fnames:
    file_id = re.search('.*_r(.*)', fname).group(1)  # grabs whatever is after the final "_r" as an identifier
    if file_id in results:
        results[file_id].append(fname)  # append the whole name; += would add its characters one by one
    else:
        results[file_id] = [fname]
The results will be stored in a dictionary, results, indexed by the id.
I should add that this will work as long as all file names reliably have the _rXXXX structure. If there's any chance that a file name will not match that pattern, you will have to check for it and act accordingly.
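As a sketch of the "check for it and act accordingly" part, the variant below collects names without an _r part in a separate list instead of failing on them. The function name group_by_identifier and the stray README.txt entry are made up for illustration.

```python
import re
from collections import defaultdict


def group_by_identifier(fnames):
    """Group names by the text after the final "_r"; return (groups, unmatched)."""
    groups = defaultdict(list)
    unmatched = []
    for fname in fnames:
        m = re.search(r'.*_r(.*)', fname)
        if m is None:
            unmatched.append(fname)  # name has no "_r" part
        else:
            groups[m.group(1)].append(fname)
    return dict(groups), unmatched


fnames = [
    "temperature_Resu05_les_spec_r0.0300.0",
    "temperature_Resu06_les_spec_r0.0300.0",
    "temperature_Resu05_les_spec_r0.0350.0",
    "README.txt",  # hypothetical stray file that does not follow the pattern
]
groups, unmatched = group_by_identifier(fnames)
print(groups)
print(unmatched)
```

Whether unmatched names should be skipped, logged, or treated as an error depends on the directory, so the sketch only reports them.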
No, a regex is not the best way here. Your pattern is very straightforward: just str.rsplit on the _r and use the right-hand element of the split as the key to group the data with. A defaultdict will do the grouping efficiently:
from collections import defaultdict
from pprint import pprint as pp

with open("yourfile") as f:
    groups = defaultdict(list)
    for line in f:
        name = line.rstrip()  # drop the newline first so it does not end up in the key
        groups[name.rsplit("_r", 1)[1]].append(name)
pp(list(groups.values()))
Which for your sample will give you (groups in first-seen order on Python 3.7+):
[['temperature_Resu05_les_spec_r0.0300.0',
  'temperature_Resu06_les_spec_r0.0300.0',
  'temperature_Resu07_les_spec_r0.0300.0',
  'temperature_Resu08_les_spec_r0.0300.0',
  'temperature_Resu09_les_spec_r0.0300.0'],
 ['temperature_Resu05_les_spec_r0.0350.0',
  'temperature_Resu06_les_spec_r0.0350.0',
  'temperature_Resu07_les_spec_r0.0350.0',
  'temperature_Resu08_les_spec_r0.0350.0',
  'temperature_Resu09_les_spec_r0.0350.0'],
 ['temperature_Resu05_les_spec_r0.0400.0',
  'temperature_Resu06_les_spec_r0.0400.0',
  'temperature_Resu07_les_spec_r0.0400.0',
  'temperature_Resu08_les_spec_r0.0400.0',
  'temperature_Resu09_les_spec_r0.0400.0'],
 ['temperature_Resu05_les_spec_r0.0450.0',
  'temperature_Resu06_les_spec_r0.0450.0',
  'temperature_Resu07_les_spec_r0.0450.0',
  'temperature_Resu08_les_spec_r0.0450.0',
  'temperature_Resu09_les_spec_r0.0450.0']]

Get all characters after a certain character?

Let's say I have a list of strings like this:
list1 = [
"filename1.txt",
"file2.py",
"fileexample.tiff"
]
How would I be able to grab all characters after the '.', if it's not too much to ask, by using "for i in" and have them come back in a list, like this: ['.txt','.py','.tiff']
If you are dealing with file paths, you should use the os.path module:
import os.path
list1 = ["filename1.txt","file2.py","fileexample.tiff"]
print([os.path.splitext(f)[1] for f in list1])
prints
['.txt', '.py', '.tiff']
import os
for i in list1:
    fileName, fileExtension = os.path.splitext(i)
    print(fileExtension)
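A second option (note that it returns the extensions without the leading dot, and picks the wrong piece if a name contains more than one dot):
[i.split('.')[1] for i in list1]
map(lambda s: s.rsplit(".", 1)[-1], list1)
is probably how I would do it.
It splits each item from the right exactly once on a period and keeps whatever is on the right-hand side. In Python 3, wrap the call in list(...) to get a list back.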
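The approaches above differ once a name has several dots or none at all. Here is a small sketch comparing os.path.splitext with pathlib's suffix and suffixes attributes; the names archive.tar.gz and README are made up for illustration.

```python
import os.path
from pathlib import PurePath

# The first two names come from the question; the last two are hypothetical edge cases.
names = ["filename1.txt", "file2.py", "archive.tar.gz", "README"]

# Both calls return only the last extension, including the dot,
# and an empty string when there is no extension.
splitext_result = [os.path.splitext(n)[1] for n in names]
suffix_result = [PurePath(n).suffix for n in names]

# suffixes returns every extension of a multi-dot name.
all_suffixes = PurePath("archive.tar.gz").suffixes

print(splitext_result)
print(suffix_result)
print(all_suffixes)
```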

Keep latest file and delete all other

In my folder there are many PDF files whose names follow a date-timestamp format, as shown below.
I would like to keep the latest file for each day and delete the rest for that day. How can I do this in Python?
2012-07-13-15-13-27_1342167207.pdf
2012-07-13-15-18-22_1342167502.pdf
2012-07-13-15-18-33_1342167513.pdf
2012-07-23-14-45-12_1343029512.pdf
2012-07-23-14-56-48_1343030208.pdf
2012-07-23-16-03-45_1343034225.pdf
2012-07-23-16-04-23_1343034263.pdf
2012-07-26-07-27-19_1343262439.pdf
2012-07-26-07-33-27_1343262807.pdf
2012-07-26-07-51-59_1343263919.pdf
2012-07-26-22-38-30_1343317110.pdf
2012-07-26-22-38-54_1343317134.pdf
2012-07-27-10-43-27_1343360607.pdf
2012-07-27-10-58-40_1343361520.pdf
2012-07-27-11-03-19_1343361799.pdf
2012-07-27-11-04-14_1343361854.pdf
Should I fill a list and sort it? The desired output is:
2012-07-13-15-18-33_1342167513.pdf
2012-07-23-16-04-23_1343034263.pdf
2012-07-26-22-38-54_1343317134.pdf
2012-07-27-11-04-14_1343361854.pdf
Thanks
Your desired list can also be achieved using itertools.groupby.
from itertools import groupby
import os
filtered_list = list()
names = sorted(os.listdir('.'))  # groupby only groups adjacent items, so sort first
for key, group in groupby(names, lambda x: x[:10]):  # groups based on the first 10 characters of the file name
    filtered_list.append([item for item in group][-1])  # picks the last file from the group
print(filtered_list)
Sort the list and delete a file if the next file in the list is from the same day:
import glob
import os
files = glob.glob("*.pdf")
files.sort()
for ifl, fl in enumerate(files[:-1]):
    if files[ifl+1].startswith(fl[:10]):  # Check if next file is same day
        os.unlink(fl)  # It is - delete current file
Edit:
As the OP's question became clearer, it became evident that what is required is not just the last file in the list but the latest file of each day. To achieve this I added the "same day" condition before unlinking.
You could do it that way. The following code is untested, but may work (as written it keeps only the single newest file overall, not one per day):
import os
names = os.listdir()
names.sort()
for f in names[:-1]:
    os.unlink(f)
Fortunately your file names use ISO8601 date format so the textual sort achieves the desired result with no need to parse the dates.
The following snippet works with the test case given.
import os
files = os.listdir(".")
days = set(fname[:10] for fname in files)  # full YYYY-MM-DD prefix, so equal day numbers in different months stay separate
for d in days:
    f = [i for i in files if i[:10] == d]
    for x in sorted(f)[:-1]:
        os.remove(x)
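Since deleting files is hard to undo, it can help to separate the decision from the deletion. Below is a minimal sketch that only computes which names would be removed, assuming every name starts with a YYYY-MM-DD prefix. The function name files_to_delete is made up, and the sample names are a subset of those in the question.

```python
def files_to_delete(names):
    """Return, sorted, every name that is not the latest file of its day."""
    latest = {}
    for name in names:
        day = name[:10]  # assumed YYYY-MM-DD prefix
        # Plain string comparison is enough because the timestamp sorts textually.
        if day not in latest or name > latest[day]:
            latest[day] = name
    keep = set(latest.values())
    return sorted(n for n in names if n not in keep)


names = [
    "2012-07-13-15-13-27_1342167207.pdf",
    "2012-07-13-15-18-22_1342167502.pdf",
    "2012-07-13-15-18-33_1342167513.pdf",
    "2012-07-23-14-45-12_1343029512.pdf",
    "2012-07-23-16-04-23_1343034263.pdf",
]
doomed = files_to_delete(names)
print(doomed)
```

Once the printed list looks right, each name in it could be passed to os.remove.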
Using a dictionary, you can keep one value per day. This is a quick and dirty solution, maybe not the best.
#!/usr/bin/env python
import os
import shutil
from shutil import copyfile

lst = []
dc = {}
for files in os.listdir("."):
    if files.endswith(".pdf"):
        lst.append(files)
# Sorted order means the latest file of each day overwrites the earlier ones.
for x in sorted(lst):
    print(x[0:10].replace("-", ""))
    dc[int(x[0:10].replace("-", ""))] = x
flist = list(dc.values())
dir = "tmpdir"
if not os.path.exists(dir):
    os.makedirs(dir)
# Copy the files to keep into a temporary directory.
for x in flist:
    print(x)
    copyfile(x, dir + "/" + x)
# Delete every PDF in the current directory.
for files in os.listdir("."):
    if files.endswith(".pdf"):
        os.unlink(files)
# Copy the kept files back, then remove the temporary directory.
os.chdir("./tmpdir")
for files in os.listdir("."):
    if files.endswith(".pdf"):
        copyfile(files, "../" + files)
os.chdir("../")
shutil.rmtree(os.path.abspath(".") + "/tmpdir")

Why are these strings escaping from my regular expression in python?

In my code, I load up an entire folder into a list and then try to get rid of every file in the list except the .mp3 files.
import os
import re
path = '/home/user/mp3/'
dirList = os.listdir(path)
dirList.sort()
i = 0
for names in dirList:
    match = re.search(r'\.mp3', names)
    if match:
        i = i+1
    else:
        dirList.remove(names)
print(dirList)
print(i)
After I run the file, the code does get rid of some files in the list but keeps these two specifically:
['00. Various Artists - Indie Rock Playlist October 2008.m3u', '00. Various Artists - Indie Rock Playlist October 2008.pls']
I can't understand what's going on. Why are those two specifically escaping my search?
You are modifying your list while iterating over it. Each removal shifts the remaining items one position to the left, so the loop skips the item that follows every removed one. You should loop over a copy of the list instead (for name in dirList[:]:), or create a new list:
i = 0
modifiedDirList = []
for name in dirList:
    match = re.search(r'\.mp3', name)
    if match:
        i += 1
        modifiedDirList.append(name)
print(modifiedDirList)
Or even better, use a list comprehension:
dirList = [name for name in sorted(os.listdir(path))
if re.search(r'\.mp3', name)]
The same thing, without a regular expression:
dirList = [name for name in sorted(os.listdir(path))
if name.endswith('.mp3')]
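One difference between the two comprehensions above is worth noting: the unanchored pattern r'\.mp3' matches anywhere in the name, while endswith only looks at the end. Here is a small sketch of the difference, using made-up file names.

```python
import re

# Hypothetical names chosen to show the difference.
names = ["song.mp3", "song.mp3.bak", "playlist.m3u"]

unanchored = [n for n in names if re.search(r'\.mp3', n)]   # matches ".mp3" anywhere
anchored = [n for n in names if re.search(r'\.mp3$', n)]    # matches ".mp3" only at the end
by_suffix = [n for n in names if n.endswith('.mp3')]        # same idea without a regex

print(unanchored)
print(anchored)
print(by_suffix)
```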
Maybe you should use the glob module. Here is your entire script:
>>> import glob
>>> mp3s = sorted(glob.glob('*.mp3'))
>>> print(mp3s)
>>> print(len(mp3s))
As soon as you call dirList.remove(names), the list shrinks under the running iterator, so the item after each removed one is skipped. If you iterate over a copy of the list, it will work as expected:
for names in dirList[:]:
    ....
Alternatively, you can use list comprehensions to construct the right list:
dirList = [name for name in dirList if re.search(r'\.mp3', name)]
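To see the skipping in isolation, here is a small self-contained sketch with made-up names. It runs the same removal loop once over the list itself and once over a copy.

```python
# Hypothetical names; two non-mp3 entries sit next to each other.
names = ["a.mp3", "b.m3u", "c.pls", "d.mp3"]

# Removing while iterating over the same list.
broken = list(names)
for n in broken:
    if not n.endswith(".mp3"):
        broken.remove(n)  # later items shift left, so the next one is never visited

# Iterating over a copy while mutating the original.
fixed = list(names)
for n in fixed[:]:
    if not n.endswith(".mp3"):
        fixed.remove(n)

print(broken)  # ['a.mp3', 'c.pls', 'd.mp3']
print(fixed)   # ['a.mp3', 'd.mp3']
```

After b.m3u is removed, c.pls slides into its position and the loop moves straight on to d.mp3, which is why adjacent non-matching entries survive.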
