Remove multiple list elements (substrings) - python

I have a sorted list of directories:
lst :=
./C01
./C01/BOOST
./C01/BOOST/src
./C01/BOOST/src/template
./C01/EmployeeAdmin
./Spheres
./db01/src/scottlib # added later
./db03
./db03/src
./db03/src/examples
./db03/src/exercises
./txt2bmp
./txt2bmp/data
./txt2bmp/docs
./txt2bmp/tests
./txt2bmp/txt2bmp
./txt2bmp_COPYED
./txt2bmp_COPYED/data
./txt2bmp_COPYED/docs
./txt2bmp_COPYED/tests
./txt2bmp_COPYED/txt2bmp
./txt2bmp_cpp
./txt2bmp_cpp/doc
I tried to remove every subfolder whose parent folder is also in the list, but I could not do it in a nice, pythonic way. I did it with loops, but that was a long, ugly and inscrutable solution...
After the deletion the list should look like this:
lst2 :=
./C01
./Spheres
./db01/src/scottlib
./db03
./txt2bmp
./txt2bmp_COPYED
./txt2bmp_cpp
The next line is the command I've tried to modify in many ways, but without success...
[ i for i in lst if not i.startswith(lst[0])]
Perhaps you have an idea to solve this in an elegant way?

I think this does the trick:
lst2 = [a for a in lst if '/'.join(a.split('/')[:-1]) not in lst]
for line in lst2:
    print(line)
Output:
./C01
./Spheres
./db03
./txt2bmp
./txt2bmp_COPYED
./txt2bmp_cpp
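The parent check above looks only one level up, so a deeper folder whose direct parent is absent from the list, but whose grandparent is present, would slip through. Here is a minimal sketch that tests every ancestor instead, assuming '/'-separated paths as in the question. The function name remove_subfolders is made up for illustration.

```python
def remove_subfolders(paths):
    """Return the paths that have no ancestor folder in the same list."""
    present = set(paths)  # set membership tests are O(1) on average
    kept = []
    for path in paths:
        parts = path.split("/")
        # Proper ancestors of "./a/b/c" are ".", "./a" and "./a/b".
        ancestors = ("/".join(parts[:i]) for i in range(1, len(parts)))
        if not any(ancestor in present for ancestor in ancestors):
            kept.append(path)
    return kept


lst = [
    "./C01", "./C01/BOOST", "./C01/BOOST/src", "./C01/BOOST/src/template",
    "./C01/EmployeeAdmin", "./Spheres", "./db01/src/scottlib",
    "./db03", "./db03/src", "./db03/src/examples", "./db03/src/exercises",
    "./txt2bmp", "./txt2bmp/data", "./txt2bmp_COPYED",
    "./txt2bmp_COPYED/data", "./txt2bmp_cpp", "./txt2bmp_cpp/doc",
]
lst2 = remove_subfolders(lst)
for line in lst2:
    print(line)
```

Because the lookup uses a set, the input does not need to be sorted, and the result keeps the original order.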

In your case, you can do a simple count of the path separator and use that for filtering in a list comprehension, keeping only items with a count of 1:
lst2 = [i for i in lst if i.count('/')==1]
Considering @Jean-FrançoisFabre's comment, you could replace '/' with os.sep to make the snippet more portable.

I would first normalize the paths (with os.path.relpath), then isolate the first component, then filter to unique paths (with set):
from os.path import relpath
def strip_tail(path):
    try:
        return path[:path.index("/")]
    except ValueError:
        return path
lst2 = list(set(strip_tail(relpath(d)) for d in lst))

Related

Extracting sublists of specific elements from Python lists of strings

I have a large list of elements
a = [['qc1l1.1',
'qc1r2.1',
'qc1r3.1',
'qc2r1.1',
'qc2r2.1',
'qt1.1',
'qc3.1',
'qc4.1',
'qc5.1',
'qc6.1',.................]
From this list I want to extract several sublists of the elements that start with the prefixes "qfg1", "qdg2", "qf3", "qd1" and so on,
such that:
list1 = ['qfg1.1', 'qfg1.2',....]
list2 = ['qfg2.1', 'qfg2.2',,,]
I tried:
list1 = []
for i in all_quads_names:
    if i in ['qfg']:
        list1.append(i)
but it gives an empty list. How can I do this without writing explicit loops, given that it is a very large list?
Using in (as others have suggested) is not quite right, because you want to check for a prefix, not merely whether the substring occurs somewhere in the name.
What you want is startswith(), for example:
list1 = []
for name in all_quads_names:
    if name.startswith('qc1r'):
        list1.append(name)
The full solution would be something like:
prefixes = ['qfg', 'qc1r']
lists = []
for pref in prefixes:
    matches = []  # named "matches" to avoid shadowing the built-in list
    for name in all_quads_names:
        if name.startswith(pref):
            matches.append(name)
    lists.append(matches)
Now lists will contain all the lists you want.
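If it helps to look the sublists up by prefix afterwards, the same idea can be written as a dictionary comprehension. This is only a sketch: the sample names below are made up in the style of the question, and by_prefix is an illustrative name.

```python
# Hypothetical sample data in the style of the question.
all_quads_names = ["qc1l1.1", "qc1r2.1", "qc1r3.1", "qc2r1.1", "qfg1.1", "qfg1.2"]
prefixes = ["qfg", "qc1r"]

# One pass over the names per prefix; str.startswith does the prefix test.
by_prefix = {
    pref: [name for name in all_quads_names if name.startswith(pref)]
    for pref in prefixes
}

print(by_prefix["qfg"])   # the names starting with "qfg"
print(by_prefix["qc1r"])  # the names starting with "qc1r"
```

This still loops under the hood, but the comprehension keeps the looping in one expression.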
You are looking for 'qfg' in the ith element of the all_quads_names. So, try the following:
list1 = []
for i in all_quads_names:
    if 'qfg' in i:
        list1.append(i)
Try reversing your if statement so that you look for the string inside each list element:
if "qfg" in i:
You can refer to this question:
Searching a sequence of characters from a string in python
Python has several ways to do this, described here:
https://stackabuse.com/python-check-if-string-contains-substring/
Edit after AminM's comment:
From your example it doesn't seem necessary to use the startswith() method, but if your substring ("qfg") can also occur later in the string and you only want the strings that start with "qfg", then you should use something like:
if i.startswith("qfg"):
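To make the difference between the two tests concrete, here is a small sketch with made-up names (the entry "xqfg2.1" is hypothetical and exists only to show a match in the middle of a string):

```python
names = ["qfg1.1", "xqfg2.1", "qc1r2.1"]  # hypothetical sample names

# "in" matches the substring anywhere in the name.
contains = [n for n in names if "qfg" in n]

# startswith matches only at the beginning of the name.
starts = [n for n in names if n.startswith("qfg")]

print(contains)
print(starts)
```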
Try something like this:
prefixes = ['qfg1', … ]
top = [ […], […], … ]
extracted = []
for sublist in top:
    if sublist and any(prefix in sublist[0] for prefix in prefixes):
        extracted.append(sublist)

How to remove items in a list of strings based on duplicate substrings among the elements?

I have a list of files from different paths, but some of those paths contain the same file (with the same file name).
I would like to remove these duplicate files, but since they're from different paths, I just can't do set(thelist)
Minimal Example
Say that my list looks like this
thelist = ['/path1/path2/file13332', '/path11/path21/file21', 'path1232/path1112/file13332', '/path1/path2/file13339']
What is the most pythonic way to get this
deduplicatedList = ['/path1/path2/file13332', '/path11/path21/file21', '/path1/path2/file13339']
File file13332 was in the list twice. I am not concerned about which of the two elements is removed.
One way is to use a dictionary keyed by file name (the last path seen for each name wins):
thelist = ['/path1/path2/file13332', '/path11/path21/file21', 'path1232/path1112/file13332', '/path1/path2/file13339']
deduplicatedList = list({f.split('/')[-1]: f for f in thelist}.values())
print(deduplicatedList)
['path1232/path1112/file13332', '/path11/path21/file21', '/path1/path2/file13339']
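import os
s = set()
deduped = [s.add(os.path.basename(i)) or i for i in thelist if os.path.basename(i) not in s]
s contains the unique basenames, which guards against adding non-unique basenames to deduped. Since s.add() returns None, the expression s.add(...) or i evaluates to the path i.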
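The same keep-the-first-occurrence logic can be spelled out as an explicit loop, which some readers find easier to follow than the side effect inside the comprehension. This is a sketch, and dedupe_by_basename is an illustrative name.

```python
import os


def dedupe_by_basename(paths):
    """Keep the first path seen for each file name, preserving order."""
    seen = set()
    result = []
    for path in paths:
        name = os.path.basename(path)
        if name not in seen:
            seen.add(name)
            result.append(path)
    return result


thelist = ['/path1/path2/file13332', '/path11/path21/file21',
           'path1232/path1112/file13332', '/path1/path2/file13339']
deduplicatedList = dedupe_by_basename(thelist)
print(deduplicatedList)
```

Unlike the dictionary version, this keeps the first path for a duplicated file name rather than the last.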

remove certain files in a list matching a pattern

I have a list with files (the path to them).
I wrote a function like this to remove certain files matching a pattern but it just removes 2 files at most and I don't understand why.
remove_list = ('*.txt',)  # Example for removing all .txt files in the list
def removal(list):
    for f in list:
        if any(fnmatch(basename(f.lower()), pattern) for pattern in remove_list):
            list.remove(f)
    return list
Edit: OK, naming my list "list" in the code was a bad idea. In my actual code it is called something else. I just wanted to give an abstract idea of what I'm dealing with and should have mentioned that.
Modifying a list while you're iterating over it is a bad idea: each removal shifts the remaining items one position to the left, so the loop skips the element that follows every removed one.
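A small sketch of what goes wrong, using made-up file names: the loop below is meant to remove every .txt entry, but one of them survives.

```python
files = ["a.txt", "b.txt", "c.txt", "d.csv"]  # hypothetical sample data

for f in files:
    if f.endswith(".txt"):
        files.remove(f)  # mutates the list that the loop is walking

# "b.txt" is never examined: after "a.txt" is removed it moves to index 0,
# while the loop has already advanced to index 1.
print(files)
```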
The best way to do what you want is to build a new list without the items you don't want:
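from fnmatch import fnmatch
from os.path import basename
remove_list = (r'*.txt',)  # Example for removing all .txt files in the list
def removal(l, rm_list):
    for f in l:
        for pattern in rm_list:
            if fnmatch(basename(f.lower()), pattern):
                break  # f matches a removal pattern, so drop it
        else:
            yield f  # no pattern matched, so keep f
print(list(removal(list_with_files, remove_list)))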
Here, I'm unrolling your any one-liner that might make your code look smart, but is hard to read, and might give you headaches in six months. It's better (because more readable) to do a simple for and an if instead!
The yield keyword turns the function into what Python calls a generator: each time the caller asks for the next item, the function runs until the next yield, hands that value back to the calling context, and pauses there until the following item is requested.
This is why in the print statement, I use list() around the function call, whereas if you iterate over it, you don't need to put it in a list:
for elt in removal(list_with_files, remove_list):
    print(elt)
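To see the lazy behaviour directly, one can pull items from a generator one at a time with next(). The sketch below uses an illustrative function name (keep_unmatched) and made-up file names, and it condenses the pattern test with any() for brevity.

```python
from fnmatch import fnmatch
from os.path import basename


def keep_unmatched(paths, patterns):
    """Yield the paths whose lower-cased file name matches none of the patterns."""
    for path in paths:
        if not any(fnmatch(basename(path.lower()), pat) for pat in patterns):
            yield path


files = ["a.txt", "b.csv", "c.TXT", "d.py"]  # hypothetical sample data
gen = keep_unmatched(files, ("*.txt",))

first = next(gen)  # runs the function body only up to the first yield
rest = list(gen)   # resumes after that yield and drains the remaining items
print(first, rest)
```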
If you don't like using a generator (and the yield statement), then you have to build the list manually, before returning it:
remove_list = (r'*.txt',)  # Example for removing all .txt files in the list
def removal(l, rm_list):
    ret_list = []
    for f in l:
        for pattern in rm_list:
            if fnmatch(basename(f.lower()), pattern):
                break  # f matches a removal pattern, so drop it
        else:
            ret_list.append(f)  # no pattern matched, so keep f
    return ret_list
HTH
You can use str.endswith if you are removing based on extension, you just need to pass a tuple of extensions:
remove_tup = (".txt", ".py")  # Example for removing all .txt and .py files in the list
def removal(lst):
    return [f for f in lst if not f.endswith(remove_tup)]
The code you provided is vague.
1. Don't name a variable list; it shadows the built-in list.
2. Don't modify a list while you iterate over it; iterate over a copy instead.
My suggestion:
You can filter your original list against the remove_list as below:
test.py
list1 = ["file1.txt", "file2.txt", "other.csv"]
list2 = ["file1.txt", "file2.txt"]  # simulates your remove_list
listX = [x for x in list1 if x not in list2]  # creates a new list
print(listX)
$python test.py
['other.csv']
As was said in the comments, don't modify a list as you iterate over it. Can also use a list comprehension like so:
patterns = ('*.txt', '*.csv')
good = [f for f in all_files if not any(fnmatch(basename(f.lower()), pattern) for pattern in patterns)]

How can I sort list of strings in specific order?

Let's say I have such a list:
['word_4_0_w_7',
'word_4_0_w_6',
'word_3_0_w_10',
'word_3_0_w_2']
and I want to sort them by the number that comes after "word" and then by the number after "w".
The result should look like this:
['word_3_0_w_2',
'word_3_0_w_10',
'word_4_0_w_6',
'word_4_0_w_7']
What comes to mind is to create a bunch of lists, fill each one according to the index after "word" with the strings sorted by the number after "w", and then merge them.
Is there a cleverer way to do this in Python?
Use Python's key functionality, in conjunction with other answers:
def mykey(value):
    ls = value.split("_")
    return int(ls[1]), int(ls[-1])
newlist = sorted(firstlist, key=mykey)
# or, if you want it in place:
firstlist.sort(key=mykey)
Python will be more efficient with key vs cmp.
You can provide a function to the sort() method of list objects:
l = ['word_4_0_w_7',
'word_4_0_w_6',
'word_3_0_w_10',
'word_3_0_w_2']
def my_key_func(x):
    xx = x.split("_")
    return (int(xx[1]), int(xx[-1]))
l.sort(key=my_key_func)
Output:
print(l)
['word_3_0_w_2', 'word_3_0_w_10', 'word_4_0_w_6', 'word_4_0_w_7']
Edit: Changed code according to comment by @dwanderson; more info on this can be found here.
You can use a function to extract the relevant parts of your string and then use those parts to sort:
a = ['word_4_0_w_7', 'word_4_0_w_6', 'word_3_0_w_10', 'word_3_0_w_2']
def sort_func(x):
    parts = x.split('_')
    sort_key = parts[1] + parts[2] + "%02d" % int(parts[4])
    return sort_key
a_sorted = sorted(a, key=sort_func)
The expression "%02d" % int(x.split('_')[4]) adds a leading zero in front of the second number; otherwise 10 would sort before 2. You may have to do the same with the number extracted by x.split('_')[2].
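Zero-padding works only while every number fits the chosen width. A sketch of an alternative that returns a tuple of integers, so no padding is needed; the sample names with 10 and 100 are made up to show the multi-digit case, and numeric_key is an illustrative name.

```python
def numeric_key(name):
    """Sort key built from the three numeric fields of 'word_A_B_w_C'."""
    parts = name.split("_")
    return (int(parts[1]), int(parts[2]), int(parts[4]))


# Hypothetical names with numbers of different lengths.
names = ["word_4_0_w_7", "word_10_0_w_6", "word_3_0_w_100", "word_3_0_w_2"]
names_sorted = sorted(names, key=numeric_key)
print(names_sorted)
```

Tuples compare element by element, so the first number decides and the later ones only break ties.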

How do you sort files numerically?

I'm processing some files in a directory and need the files to be sorted numerically. I found some examples on sorting—specifically with using the lambda pattern—at wiki.python.org, and I put this together:
import re
file_names = """ayurveda_1.tif
ayurveda_11.tif
ayurveda_13.tif
ayurveda_2.tif
ayurveda_20.tif
ayurveda_22.tif""".split('\n')
num_re = re.compile(r'_(\d{1,2})\.')
file_names.sort(
    key=lambda fname: int(num_re.search(fname).group(1))
)
Is there a better way to do this?
This is called "natural sorting" or "human sorting" (as opposed to lexicographical sorting, which is the default). Ned B wrote up a quick version of one.
import re
def tryint(s):
    try:
        return int(s)
    except ValueError:
        return s
def alphanum_key(s):
    """ Turn a string into a list of string and number chunks.
        "z23a" -> ["z", 23, "a"]
    """
    return [ tryint(c) for c in re.split('([0-9]+)', s) ]
def sort_nicely(l):
    """ Sort the given list in the way that humans expect.
    """
    l.sort(key=alphanum_key)
It's similar to what you're doing, but perhaps a bit more generalized.
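As a usage sketch, here is the same chunking idea applied to the file names from the question. The function natural_key is an illustrative variant, not Ned's code: it relies on re.split with one capturing group returning the digit runs at the odd positions of the result.

```python
import re


def natural_key(s):
    """Split s into text and number chunks, e.g. 'z23a' -> ['z', 23, 'a']."""
    parts = re.split(r"([0-9]+)", s)
    # With one capturing group, the captured digit runs sit at odd indices.
    return [int(p) if i % 2 else p for i, p in enumerate(parts)]


file_names = ["ayurveda_1.tif", "ayurveda_11.tif", "ayurveda_13.tif",
              "ayurveda_2.tif", "ayurveda_20.tif", "ayurveda_22.tif"]
file_names.sort(key=natural_key)
print(file_names)
```

Because text and numbers always alternate in the key, two keys never compare a string with an integer at the same position.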
Just use:
tiffFiles.sort(key=lambda var:[int(x) if x.isdigit() else x for x in re.findall(r'[^0-9]|[0-9]+', var)])
This is faster than using try/except.
If you are using key= in your sort method, you shouldn't use cmp, which has been removed from the latest versions of Python. key should be set to a function which takes a record as input and returns any object which will compare in the order you want your list sorted. It doesn't need to be a lambda function and might be clearer as a standalone function. Also, regular expressions can be slow to evaluate.
You could try something like the following to isolate and return the integer part of the file name:
def getint(name):
    basename, _, _ = name.partition('.')  # partition returns a (head, sep, tail) tuple
    alpha, num = basename.split('_')
    return int(num)
tiffFiles.sort(key=getint)
@April provided a good solution in "How is Pythons glob.glob ordered?" that you could try:
# First, get the files:
import glob
import re
files = glob.glob1(img_folder, '*' + output_image_format)  # note: glob1 is an undocumented helper of the glob module
# Sort files according to the digits included in the filename
files = sorted(files, key=lambda x: float(re.findall(r"(\d+)", x)[0]))
str.partition returns a tuple, so unpack it:
def getint(name):
    (basename, part, ext) = name.partition('.')
    (alpha, num) = basename.split('_')
    return int(num)
This is a modified version of @Don O'Donnell's answer, because I couldn't get it working as-is, but I think it's the best answer here as it's well-explained.
def getint(name):
    _, num = name.split('_')
    num, _ = num.split('.')
    return int(num)
print(sorted(tiffFiles, key=getint))
Changes:
1) The alpha string doesn't get stored, as it's not needed (hence _, num)
2) Use num.split('.') to separate the number from .tiff
3) Use sorted instead of list.sort, per https://docs.python.org/2/howto/sorting.html
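The two-way unpacking above assumes exactly one underscore and one dot in each name. Here is a sketch that tolerates extra underscores in the prefix and a directory part, assuming the number is always the last underscore-separated field before the extension. The name trailing_int and the sample file names are made up.

```python
import os


def trailing_int(name):
    """Return the integer after the last underscore of the file's stem."""
    stem, _ext = os.path.splitext(os.path.basename(name))
    _prefix, num = stem.rsplit("_", 1)  # split only at the last underscore
    return int(num)


# Hypothetical file names with more than one underscore.
tiffFiles = ["scan_page_10.tif", "scan_page_2.tif", "scan_page_1.tif"]
ordered = sorted(tiffFiles, key=trailing_int)
print(ordered)
```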
