remove certain files in a list matching a pattern - python

I have a list with files (the path to them).
I wrote a function like this to remove certain files matching a pattern but it just removes 2 files at most and I don't understand why.
remove_list = ('*.txt',)  # Example for removing all .txt files in the list
def removal(list):
    for f in list:
        if any(fnmatch(basename(f.lower()), pattern) for pattern in remove_list):
            list.remove(f)
    return list
Edit: OK, naming my list "list" in the code was a bad idea. In my real code it's called something else. I just wanted to give an abstract idea of what I'm dealing with, and I should have mentioned that.

Modifying a list while you're iterating over it is a bad idea: each removal shifts the remaining items down by one, so the loop skips the element that follows every removed one.
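To make the skipping concrete, here is a minimal sketch. The file names and the endswith test are made up for illustration and are not the asker's data:

```python
# Hypothetical file names, used only to show the skipping behaviour.
files = ["a.txt", "b.txt", "c.txt", "d.csv"]

# Buggy: each remove() shifts the later items left, so the loop's
# internal index jumps over the item right after every removed one.
buggy = list(files)
for f in buggy:
    if f.endswith(".txt"):
        buggy.remove(f)

# Safe: loop over a copy (safe[:]) while removing from the real list.
safe = list(files)
for f in safe[:]:
    if f.endswith(".txt"):
        safe.remove(f)

print(buggy)  # ['b.txt', 'd.csv']  ('b.txt' was skipped)
print(safe)   # ['d.csv']
```

Iterating over a copy works, but it is usually simpler to build a new list, as described next.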
The best way to do what you want is to build a new list without the items you don't want:
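from fnmatch import fnmatch
from os.path import basename
remove_list = (r'*.txt',)  # Example for removing all .txt files in the list
def removal(l, rm_list):
    for f in l:
        for pattern in rm_list:
            if fnmatch(basename(f.lower()), pattern):
                break  # f matches a removal pattern, so skip it
        else:
            yield f  # no pattern matched, so keep f
print(list(removal(list_with_files, remove_list)))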
Here, I'm unrolling your any one-liner, which might make your code look smart but is hard to read and might give you headaches in six months. A simple for and an if are better because they're more readable!
The yield keyword makes the function return what's called a generator in Python: when you iterate over the result of the function, it hands each value to the calling context and then resumes inside the function to produce the next item.
This is why in the print statement, I use list() around the function call, whereas if you iterate over it, you don't need to put it in a list:
for elt in removal(list_with_files, remove_list):
    print(elt)
If you don't like using a generator (and the yield statement), then you have to build the list manually, before returning it:
remove_list = (r'*.txt',)  # Example for removing all .txt files in the list
def removal(l, rm_list):
    ret_list = []
    for f in l:
        for pattern in rm_list:
            if fnmatch(basename(f.lower()), pattern):
                break  # f matches a removal pattern, so skip it
        else:
            ret_list.append(f)  # no pattern matched, so keep f
    return ret_list
HTH

You can use str.endswith if you are removing based on extension; you just need to pass a tuple of extensions:
remove_tup = (".txt", ".py")  # Example for removing all .txt and .py files in the list
def removal(lst):
    return [f for f in lst if not f.endswith(remove_tup)]

The code you provided is vague.
1. Don't name a variable list; it shadows the built-in list.
2. Don't modify the list while you iterate over it; iterate over a copy or build a new list instead.
My suggestion is:
You can iterate over your original list and check each item against the remove_list, as below:
test.py
list1=["file1.txt", "file2.txt", "other.csv"]
list2=["file1.txt", "file2.txt"] # simulates your remove_list
listX = [x for x in list1 if x not in list2] # creates a new list
print(listX)
$python test.py
['other.csv']

As was said in the comments, don't modify a list as you iterate over it. Can also use a list comprehension like so:
patterns = ('*.txt', '*.csv')
good = [f for f in all_files if not any(fnmatch(basename(f.lower()), pattern) for pattern in patterns)]
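For completeness, a self-contained sketch of the same comprehension with its imports and some sample data. The function name keep_unmatched and the paths are hypothetical, chosen only for illustration:

```python
from fnmatch import fnmatch
from os.path import basename

def keep_unmatched(paths, patterns):
    """Return the paths whose lower-cased file name matches none of the patterns."""
    return [p for p in paths
            if not any(fnmatch(basename(p).lower(), pat) for pat in patterns)]

# Hypothetical sample data.
all_files = ["/tmp/notes.TXT", "/tmp/data.csv", "/home/me/script.py"]
print(keep_unmatched(all_files, ("*.txt", "*.csv")))  # ['/home/me/script.py']
```

Because the name is lower-cased before matching, the patterns should be written in lower case too.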

Related

Python Import a List from Another File

(CHECK UPDATE BELOW)
I'm having trouble with importing a list from another file.
I'm still learning how to call/pass variables in functions.
I have something very similar to this example,
#file1
list_one = [] #global
list_two = [] #global
def function_list_one():
    # code
    return list_one
def function_list_two(list_one, list_two):
    # code
    return list_two
def both_lists(list_one, list_two):
    # code
    return new_list
    # return new_list outputs a JSON list
both_lists(list_one, list_two) # this is used to call it from main
So far, if I print this new list from within this file, it works and prints what I want. However, if I try the next few things, it just prints an empty list. The last line of file1 is there for the main file, where I call that function with those parameters.
#file2
from file1 import both_lists, list_one, list_two
new_list = both_lists(list_one, list_two)
new_list2 = list_two
print(new_list)
# outputs = []
print(new_list2)
#outputs = []
# Expected output should be the results from new_list from file1, which
# would be a JSON type list.
This is where nothing happens. If I try both methods, they return an empty list. I understand I have a global variable for both list_one and list_two. What I don't understand is why it returns an empty list when I have used this same method in another program I made. Basically, I want to use the "return new_list" in file2 in order to proceed with the rest of the program.
Update:
I found the solution. Somehow file2 didn't like it when I imported the lists. So instead I just call both_lists(function_list_one(), function_list_two()).
Thank you all for your time!
It looks as though you're importing the lists as defined, not as they are built by your function_list_* functions.
When you import list_one and list_two, you are importing them as empty lists, as file 1 has them:
list_one = [] #global
list_two = [] #global
If both_lists() is just appending them, then what you're getting for new_list in file 2 is just both of those empty lists appended together, or just another empty list.
I think what you're after is the result of appending function_list_one(list_one) and function_list_two(list_two). There's a number of ways you could accomplish that, but perhaps the simplest is to redefine both_lists() to do something like this:
def both_lists(list_one, list_two):
    # Code
    return function_list_one(list_one) + function_list_two(list_two)
Then, when you import both_lists, you also import the functionality defined for your list-building functions.
To sum up:
As written, your imports are importing these items:
list_one = []
list_two = []
both_lists, which I'm assuming just appends list_one and list_two
The code in function_list_one() and function_list_two() won't execute unless you import and call those functions or include their functionality in something that you do import - and then call that.
My suggestion above is just one way around this, and that's assuming I read your code correctly! Hope this helps, though. :^)
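A single-file sketch of the same effect, standing in for file1 and file2. The list contents and the zero-argument builder functions are assumptions made for illustration, since the question elides the real code:

```python
# Stand-in for file1: the module-level lists start out empty.
list_one = []
list_two = []

def function_list_one():
    list_one.extend(["a", "b"])   # hypothetical data
    return list_one

def function_list_two():
    list_two.extend(["c"])        # hypothetical data
    return list_two

def both_lists(first, second):
    return first + second

# Stand-in for file2: importing the names does not run the builders,
# so both lists are still empty at this point.
before = both_lists(list_one, list_two)

# Calling the builders first is what actually fills the lists.
after = both_lists(function_list_one(), function_list_two())

print(before)  # []
print(after)   # ['a', 'b', 'c']
```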

How to remove items in a list of strings based on duplicate substrings among the elements?

I have a list of files from different paths, but some of those paths contain the same file (and file name).
I would like to remove these duplicate files, but since they're from different paths, I can't just do set(thelist).
Minimal Example
Say that my list looks like this
thelist = ['/path1/path2/file13332', '/path11/path21/file21', 'path1232/path1112/file13332', '/path1/path2/file13339']
What is the most pythonic way to get this
deduplicatedList = ['/path1/path2/file13332', '/path11/path21/file21', '/path1/path2/file13339']
File file13332 was in the list twice. I am not concerned about which element was removed
One way is to use dictionary:
thelist = ['/path1/path2/file13332', '/path11/path21/file21', 'path1232/path1112/file13332', '/path1/path2/file13339']
deduplicatedList = list({f.split('/')[-1]: f for f in thelist}.values())
print(deduplicatedList)
['path1232/path1112/file13332', '/path11/path21/file21', '/path1/path2/file13339']
import os
s = set()
deduped = [s.add(os.path.basename(i)) or i for i in thelist if os.path.basename(i) not in s]
s collects the basenames seen so far, which guards against adding a path with a repeated basename to deduped.
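The same idea can be written as an explicit loop, which some readers find easier to follow than the s.add(...) or i trick. This is a sketch; the function name dedupe_by_basename is made up here:

```python
import os

def dedupe_by_basename(paths):
    """Keep the first path seen for each file name, preserving order."""
    seen = set()
    kept = []
    for path in paths:
        name = os.path.basename(path)
        if name not in seen:
            seen.add(name)
            kept.append(path)
    return kept

thelist = ['/path1/path2/file13332', '/path11/path21/file21',
           'path1232/path1112/file13332', '/path1/path2/file13339']
print(dedupe_by_basename(thelist))
# ['/path1/path2/file13332', '/path11/path21/file21', '/path1/path2/file13339']
```

Unlike the dictionary version, this keeps the first occurrence of each file name rather than the last.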

How do I individually print items of a list that are in another list in python

I want to individually print (and then write to a file) the items of a list that are in another list. If there are no matching items, I want 'NONE' to be printed. I have a time limit on my program, so I would like a quick and easy solution to this, preferably one that runs in under 0.1 seconds.
I have a list called joinedCombs, and I want to individually print all items in joinedCombs that are in another list called dictionary.
I have tried
for i in joinedCombs:
    if i in dictionary:
        endResult.append(i)
        fout.write(i+'\n')
if endResult == []:
    fout.write('NONE\n')
I would like it to print something like this:
GREG
GEKA
GENO
or
NONE
endResult = [i for i in joinedCombs if i in dictionary]
fout.write(('\n'.join(endResult) if endResult else 'NONE') + '\n')
If you prefer, it is possible to do it without an explicit loop by taking the intersection of two sets. Note that a set does not keep the original order, and if dictionary is already a dict or a set, don't expect this to shorten the execution time.
endResult = set(joinedCombs).intersection(dictionary)
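If dictionary really is a list, as the question says, each "in" test scans the whole list. A sketch that converts it to a set once and keeps the original order of the matches is below. The function name write_matches and the sample words are hypothetical, and io.StringIO stands in for the asker's output file:

```python
import io

def write_matches(candidates, words, out):
    """Write each candidate found in words on its own line, or NONE if there are none."""
    lookup = set(words)                      # average O(1) membership tests
    matches = [c for c in candidates if c in lookup]
    out.write(("\n".join(matches) if matches else "NONE") + "\n")
    return matches

# Hypothetical data; io.StringIO replaces the real file handle fout.
fout = io.StringIO()
write_matches(["GREG", "XXXX", "GEKA"], ["GEKA", "GENO", "GREG"], fout)
print(fout.getvalue(), end="")
```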

Python substitute elements inside a list

I have the following code that is filtering and printing a list. The final output is json that is in the form of name.example.com. I want to substitute that with name.sub.example.com but I'm having a hard time actually doing that. filterIP is a working bit of code that removes elements entirely and I have been trying to re-use that bit to also modify elements, it doesn't have to be handled this way.
def filterIP(fullList):
    regexIP = re.compile(r'\d{1,3}.\d{1,3}.\d{1,3}.\d{1,3}$')
    return filter(lambda i: not regexIP.search(i), fullList)
def filterSub(fullList2):
    regexSub = re.compile(r'example\.com, sub.example.com')
    return filter(lambda i: regexSub.search(i), fullList2)
groups = {key : filterSub(filterIP(list(set(items)))) for (key, items) in groups.iteritems() }
print(self.json_format_dict(groups, pretty=True))
This is what I get without filterSub
"type_1": [
"server1.example.com",
"server2.example.com"
],
This is what I get with filterSub
"type_1": [],
This is what I'm trying to get
"type_1": [
"server1.sub.example.com",
"server2.sub.example.com"
],
The statement:
regexSub = re.compile(r'example\.com, sub.example.com')
doesn't do what you think it does. It creates a compiled regular expression that matches the string "example.com" followed by a comma, a space, the string "sub", an arbitrary character, the string "example", an arbitrary character, and the string "com". It does not create any sort of substitution.
Instead, you want to write something like this, using the re.sub function to perform the substitution and using map to apply it:
def filterSub(fullList2):
    regexSub = re.compile(r'example\.com')
    return map(lambda i: re.sub(regexSub, "sub.example.com", i),
               filter(lambda i: re.search(regexSub, i), fullList2))
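On Python 3, map and filter return lazy iterators rather than lists, so a list comprehension may be more convenient when the result has to be serialized. A sketch under that assumption follows; the name add_subdomain, the trailing $ anchor, and the sample hosts are choices made for this example:

```python
import re

# Anchored at the end so only names that end in example.com are rewritten.
DOMAIN = re.compile(r'example\.com$')

def add_subdomain(hosts):
    """Keep hosts ending in example.com and rewrite them to sub.example.com."""
    return [DOMAIN.sub('sub.example.com', h) for h in hosts if DOMAIN.search(h)]

# Hypothetical input.
hosts = ["server1.example.com", "server2.example.com", "192.168.0.1"]
print(add_subdomain(hosts))
# ['server1.sub.example.com', 'server2.sub.example.com']
```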
If the examples are all truly as simple as those you listed, a regex is probably overkill. A simple solution would be to use string .split and .join. This would likely give better performance.
First split the url at the first period:
url = 'server1.example.com'
split_url = url.split('.', 1)
# ['server1', 'example.com']
Then you can use the sub to rejoin the url:
subbed_url = '.sub.'.join(split_url)
# 'server1.sub.example.com'
Of course you can do the split and the join at the same time
'.sub.'.join(url.split('.', 1))
Or create a simple function:
def sub_url(url):
    return '.sub.'.join(url.split('.', 1))
To apply this to the list you can take several approaches.
A list comprehension:
subbed_list = [sub_url(url)
               for url in url_list]
Map it:
subbed_list = map(sub_url, url_list)
Or my favorite, a generator:
gen_subbed = (sub_url(url)
              for url in url_list)
The last looks like a list comprehension but gives the added benefit that you don't rebuild the entire list. It processes the elements one item at a time as the generator is iterated through. If you decide you do need the list later you can simply convert it to a list as follows:
subbed_list = list(gen_subbed)
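A short sketch of how the generator behaves when consumed, using made-up host names. It also shows that a name without any dot passes through unchanged, because split returns a single piece and join has nothing to insert between:

```python
def sub_url(url):
    # Split once at the first dot, then rejoin with '.sub.' in between.
    return '.sub.'.join(url.split('.', 1))

# Hypothetical input; 'localhost' has no dot.
url_list = ['server1.example.com', 'server2.example.com', 'localhost']

gen_subbed = (sub_url(url) for url in url_list)  # nothing is computed yet
first = next(gen_subbed)   # processes only the first item
rest = list(gen_subbed)    # continues from where the generator stopped

print(first)  # server1.sub.example.com
print(rest)   # ['server2.sub.example.com', 'localhost']
```

A generator can be consumed only once, so convert it to a list first if the values are needed more than once.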

Remove multiple list elements (substrings)

I have a sorted list of directories:
lst :=
./C01
./C01/BOOST
./C01/BOOST/src
./C01/BOOST/src/template
./C01/EmployeeAdmin
./Spheres
./db01/src/scottlib # added later
./db03
./db03/src
./db03/src/examples
./db03/src/exercises
./txt2bmp
./txt2bmp/data
./txt2bmp/docs
./txt2bmp/tests
./txt2bmp/txt2bmp
./txt2bmp_COPYED
./txt2bmp_COPYED/data
./txt2bmp_COPYED/docs
./txt2bmp_COPYED/tests
./txt2bmp_COPYED/txt2bmp
./txt2bmp_cpp
./txt2bmp_cpp/doc
I've tried to remove all subfolders, where they exist, but I could not do it in a nice, Pythonic way. I did it with loops, but that was a long, ugly, and inscrutable solution...
After the deletion the list should be shown like this:
lst2 :=
./C01
./Spheres
./db01/src/scottlib
./db03
./txt2bmp
./txt2bmp_COPYED
./txt2bmp_cpp
The next line is the command I've tried to modify in many ways, but without success...
[ i for i in lst if not i.startswith(lst[0])]
Perhaps you have an idea to solve this in an elegant way?
I think this does the trick
lst2 = [a for a in lst if '/'.join(a.split('/')[:-1]) not in lst]
for line in lst2:
    print(line)
Output:
./C01
./Spheres
./db01/src/scottlib
./db03
./txt2bmp
./txt2bmp_COPYED
./txt2bmp_cpp
In your case, you can do a simple count of the path separator and use that for filtering in a list comprehension, keeping only items with a count of 1:
lst2 = [i for i in lst if i.count('/')==1]
Considering @Jean-FrançoisFabre's comment, you could replace / with os.sep to add some portability to the snippet. Note that the count test also drops ./db01/src/scottlib, whose parent folders are not in the list.
I would first normalize the paths (with os.path.relpath), then isolate the first component, then filter to unique paths (with set):
from os.path import relpath
def strip_tail(path):
    try:
        return path[:path.index("/")]
    except ValueError:
        return path
lst2 = list(set(strip_tail(relpath(d)) for d in lst))
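A sketch that also keeps entries such as ./db01/src/scottlib, whose parents are not in the list, is to drop a path only when one of its ancestors is itself listed. The helper names has_listed_ancestor and drop_subfolders are made up for this example, and it assumes / as the separator:

```python
def has_listed_ancestor(path, present):
    """True if any proper ancestor of path (e.g. './C01' for './C01/BOOST') is in present."""
    parts = path.split('/')
    return any('/'.join(parts[:i]) in present for i in range(1, len(parts)))

def drop_subfolders(paths):
    """Keep only the paths that have no ancestor in the same list, preserving order."""
    present = set(paths)
    return [p for p in paths if not has_listed_ancestor(p, present)]

lst = ['./C01', './C01/BOOST', './C01/BOOST/src', './C01/BOOST/src/template',
       './C01/EmployeeAdmin', './Spheres', './db01/src/scottlib', './db03',
       './db03/src', './db03/src/examples', './db03/src/exercises',
       './txt2bmp', './txt2bmp/data', './txt2bmp/docs', './txt2bmp/tests',
       './txt2bmp/txt2bmp', './txt2bmp_COPYED', './txt2bmp_COPYED/data',
       './txt2bmp_COPYED/docs', './txt2bmp_COPYED/tests',
       './txt2bmp_COPYED/txt2bmp', './txt2bmp_cpp', './txt2bmp_cpp/doc']

for line in drop_subfolders(lst):
    print(line)
```

Because whole path components are compared, ./txt2bmp_COPYED is not mistaken for a subfolder of ./txt2bmp, which a plain startswith test would do.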
