How do you sort files numerically? - python

I'm processing some files in a directory and need the files to be sorted numerically. I found some examples on sorting—specifically with using the lambda pattern—at wiki.python.org, and I put this together:
import re
file_names = """ayurveda_1.tif
ayurveda_11.tif
ayurveda_13.tif
ayurveda_2.tif
ayurveda_20.tif
ayurveda_22.tif""".split('\n')
num_re = re.compile(r'_(\d{1,2})\.')
file_names.sort(
    key=lambda fname: int(num_re.search(fname).group(1))
)
Is there a better way to do this?

This is called "natural sorting" or "human sorting" (as opposed to lexicographical sorting, which is the default). Ned B wrote up a quick version of one.
import re
def tryint(s):
    try:
        return int(s)
    except ValueError:
        return s
def alphanum_key(s):
    """ Turn a string into a list of string and number chunks.
        "z23a" -> ["z", 23, "a"]
    """
    return [ tryint(c) for c in re.split('([0-9]+)', s) ]
def sort_nicely(l):
    """ Sort the given list in the way that humans expect.
    """
    l.sort(key=alphanum_key)
It's similar to what you're doing, but perhaps a bit more generalized.
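To make the difference concrete, here is a minimal sketch that compares the default lexicographical order with a natural-sort key on a few of the file names from the question. The key function is a compact variant of alphanum_key above (it uses isdigit instead of try/except), so treat it as an illustration rather than the exact code from the linked write-up.

```python
import re

def alphanum_key(s):
    # Split into text and digit chunks; digit chunks are compared as integers.
    return [int(c) if c.isdigit() else c for c in re.split(r'([0-9]+)', s)]

file_names = ["ayurveda_11.tif", "ayurveda_2.tif", "ayurveda_1.tif", "ayurveda_20.tif"]

lexicographic = sorted(file_names)
natural = sorted(file_names, key=alphanum_key)

print(lexicographic)
# ['ayurveda_1.tif', 'ayurveda_11.tif', 'ayurveda_2.tif', 'ayurveda_20.tif']
print(natural)
# ['ayurveda_1.tif', 'ayurveda_2.tif', 'ayurveda_11.tif', 'ayurveda_20.tif']
```

Because re.split with a capturing group always alternates text and digit chunks, the keys compare strings with strings and integers with integers, which also keeps this working on Python 3.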

Just use:
tiffFiles.sort(key=lambda var:[int(x) if x.isdigit() else x for x in re.findall(r'[^0-9]|[0-9]+', var)])
This is faster than using try/except.

If you are using key= in your sort method you shouldn't use cmp which has been removed from the latest versions of Python. key should be equated to a function which takes a record as input and returns any object which will compare in the order you want your list sorted. It doesn't need to be a lambda function and might be clearer as a stand alone function. Also regular expressions can be slow to evaluate.
You could try something like the following to isolate and return the integer part of the file name:
def getint(name):
    basename = name.partition('.')
    alpha, num = basename.split('_')
    return int(num)
tiffiles.sort(key=getint)

@April provided a good solution in "How is Pythons glob.glob ordered?" that you could try.
First, get the files:
import glob
import re
files = glob.glob1(img_folder,'*'+output_image_format)
# Sort files according to the digits included in the filename
files = sorted(files, key=lambda x:float(re.findall(r"(\d+)",x)[0]))

partition returns a tuple, so unpack it:
def getint(name):
    (basename, part, ext) = name.partition('.')
    (alpha, num) = basename.split('_')
    return int(num)

This is a modified version of @Don O'Donnell's answer, because I couldn't get it working as-is, but I think it's the best answer here as it's well-explained.
def getint(name):
    _, num = name.split('_')
    num, _ = num.split('.')
    return int(num)
print(sorted(tiffFiles, key=getint))
Changes:
1) The alpha string doesn't get stored, as it's not needed (hence _, num)
2) Use num.split('.') to separate the number from .tiff
3) Use sorted instead of list.sort, per https://docs.python.org/2/howto/sorting.html


Sort list of objects by string-attribute

I've generated many image files (PNG) within a folder. Each has a name akin to "img0.png", "img1.png", ..., "img123164971.png" etc. The order of these images matters to me, and the numerical part represents the order in which I need to retrieve them before I add them to an HTML form.
This question closely gives me a solution: Does Python have a built in function for string natural sort?
But I'm not entirely sure how to incorporate it into my specific code:
imagedata = list()
files_and_dirs = Path(imagefolder).glob('**/*')
images = [x for x in files_and_dirs if x.is_file() and x.suffix == '.png']
for image in images:
    imagedata.append("<img src='{0}/{1}' width='200'>".format(imagefolder, image.name))
These files are naturally read alphanumerically, but that is not what I want.
I have a feeling that I can simply do images = sort_function(images), but I'm unsure how exactly. I realize I can do this:
imagedata = list()
files_and_dirs = Path(barcodeimagefolder).glob('**/*')
images = [x.name for x in files_and_dirs if x.is_file() and x.suffix == '.png']
images = natural_sort(images)
for image in images:
    imagedata.append("<img src='{0}/{1}' width='200'>".format(imagefolder, image))
def natural_sort(l):
    convert = lambda text: int(text) if text.isdigit() else text.lower()
    alphanum_key = lambda key: [ convert(c) for c in re.split('([0-9]+)', key) ]
    return sorted(l, key = alphanum_key)
This uses Mark Byers' solution from the link. But later on I need the list of the actual image files themselves, and it seems redundant to have two lists when one of them contains all the data in the other. Instead, I would very much like to sort the list of image files based on their file names in that way. Or better yet, read them from the folder in that order, if possible. Any advice?
Edit: I changed the title, making it a bit more concise and hopefully still accurate.
Assuming you really want things "naturally sorted" strictly by the name of the individual file, as opposed to the full path (e.g., so "zzz/image01.png" comes before "aaa/image99.png"), (EDIT: I see now from the comments that this isn't the case) one way to do this is to create an ordered dictionary where the keys are the filenames and the values are the "<img>" tags you're ultimately creating. Then do a natural sort of the dictionary keys, and return a list of the corresponding values.
So using a simple list of 3 made-up files and adding a twist to the original natural_sort function:
import collections
import re
def files_with_natural_sort(l):
    convert = lambda text: int(text) if text.isdigit() else text.lower()
    alphanum_key = lambda key: [ convert(c) for c in re.split('([0-9]+)', key) ]
    return [ l[newkey] for newkey in sorted(l, key = alphanum_key) ]
original_files = ["folder_c/file9.png", "folder_a/file11.png", "folder_b/file10.png"]
image_dict = collections.OrderedDict()
for file in original_files:
    [folder, filename] = file.split('/')
    image_dict[filename] = '<img src="%s" width="200">' % file
sorted_keys = files_with_natural_sort(image_dict)
print(sorted_keys)
This outputs:
['<img src="folder_c/file9.png" width="200">', '<img src="folder_b/file10.png" width="200">', '<img src="folder_a/file11.png" width="200">']
It's possible to get around this using a regular dictionary and playing with the .keys() list of that dictionary. But this still works. As far as trying to create a list of the files of the desired order as you read them, I suppose you could do some fancy bubble sorting for that, but I really wouldn't sweat it. Unless you have millions of files, I don't see the harm in using multiple lists.
You mean you just want to sort imagedata? Not pretty, but try:
imagedata.sort(key=lambda x: int(re.search(r'(\d+)', re.search(r"(src='.+/)", x)[0])[0]))
The inner regex gets src='<something>/, while the outer gets the digits within <something>, assuming <something> has a non-digit prefix and a non-digit suffix.
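If the goal is to keep a single list of path objects, as the question asks, one option is to sort the paths themselves and key the sort on each path's name. The following is a minimal sketch under that assumption; natural_key is a hypothetical helper name, and the hard-coded PurePath values stand in for whatever Path(imagefolder).glob('**/*') would return.

```python
import re
from pathlib import PurePath

def natural_key(text):
    # Digit runs compare as integers, everything else as lowercase text.
    return [int(c) if c.isdigit() else c.lower() for c in re.split(r'([0-9]+)', text)]

# Hypothetical paths standing in for the filtered result of Path(imagefolder).glob('**/*').
images = [PurePath("out/img10.png"), PurePath("out/img2.png"), PurePath("out/img1.png")]

# Sort the path objects themselves, keyed on the file name only.
images = sorted(images, key=lambda p: natural_key(p.name))

# Build the tags from the same, already sorted list.
imagedata = ["<img src='{0}' width='200'>".format(p.as_posix()) for p in images]

print([p.name for p in images])
# ['img1.png', 'img2.png', 'img10.png']
```

This keeps one list: the sorted path objects remain available for later use, and the HTML strings are derived from them.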

Python elegant way to map string structure

Let's say I know beforehand that the string
"key1:key2[]:key3[]:key4" should map to "newKey1[]:newKey2[]:newKey3"
then given "key1:key2[2]:key3[3]:key4",
my method should return "newKey1[2]:newKey2[3]:newKey3"
(the order of numbers within the square brackets should stay, like in the above example)
My solution looks like this:
import re
predefined_mapping = {"key1:key2[]:key3[]:key4": "newKey1[]:newKey2[]:newKey3"}
def transform(parent_key, parent_key_with_index):
    indexes_in_parent_key = re.findall(r'\[(.*?)\]', parent_key_with_index)
    target_list = predefined_mapping[parent_key].split(":")
    t = []
    i = 0
    for elem in target_list:
        try:
            sub_result = re.subn(r'\[(.*?)\]', '[{}]'.format(indexes_in_parent_key[i]), elem)
            if sub_result[1] > 0:
                i += 1
            new_elem = sub_result[0]
        except IndexError as e:
            new_elem = elem
        t.append(new_elem)
    print(":".join(t))
transform("key1:key2[]:key3[]:key4", "key1:key2[2]:key3[3]:key4")
prints newKey1[2]:newKey2[3]:newKey3 as the result.
Can someone suggest a better and elegant solution (around the usage of regex especially)?
Thanks!
You can do it a bit more elegantly by simply splitting the mapped structure on [], then interspersing the indexes from the actual data and, finally, joining everything together:
import itertools
import re
# split the map immediately on [] so that you don't have to split each time on transform
predefined_mapping = {"key1:key2[]:key3[]:key4": "newKey1[]:newKey2[]:newKey3".split("[]")}
def transform(key, source):
    mapping = predefined_mapping.get(key, None)
    if not mapping: # no mapping for this key found, return unaltered
        return source
    indexes = re.findall(r'\[.*?\]', source) # get individual indexes
    return "".join(i for e in itertools.izip_longest(mapping, indexes) for i in e if i)
print(transform("key1:key2[]:key3[]:key4", "key1:key2[2]:key3[3]:key4"))
# newKey1[2]:newKey2[3]:newKey3
NOTE: On Python 3 use itertools.zip_longest() instead.
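For reference, a Python 3 version of the same idea might look like the sketch below. It is not the original answer's code: it swaps in itertools.zip_longest with fillvalue="" so that no filtering of None values is needed, and the result variable is only there for illustration.

```python
import re
from itertools import zip_longest

# Pre-split the target structure on "[]" once, as in the answer above.
predefined_mapping = {"key1:key2[]:key3[]:key4": "newKey1[]:newKey2[]:newKey3".split("[]")}

def transform(key, source):
    mapping = predefined_mapping.get(key)
    if not mapping:  # no mapping for this key, return the source unaltered
        return source
    indexes = re.findall(r'\[.*?\]', source)  # e.g. ['[2]', '[3]']
    # Interleave the fixed pieces with the indexes; fillvalue pads the shorter list.
    pairs = zip_longest(mapping, indexes, fillvalue="")
    return "".join(piece + index for piece, index in pairs)

result = transform("key1:key2[]:key3[]:key4", "key1:key2[2]:key3[3]:key4")
print(result)
# newKey1[2]:newKey2[3]:newKey3
```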
I still think you're over-engineering this and that there is probably a much more elegant and far less error-prone approach to the whole problem. I'd advise stepping back and looking at the bigger picture instead of hammering out this particular solution just because it seems to be addressing the immediate need.

Python substitute elements inside a list

I have the following code that filters and prints a list. The final output is JSON with entries of the form name.example.com. I want to replace each entry with name.sub.example.com, but I'm having a hard time doing that. filterIP is a working bit of code that removes elements entirely, and I have been trying to reuse it to also modify elements, though it doesn't have to be handled this way.
def filterIP(fullList):
    regexIP = re.compile(r'\d{1,3}.\d{1,3}.\d{1,3}.\d{1,3}$')
    return filter(lambda i: not regexIP.search(i), fullList)
def filterSub(fullList2):
    regexSub = re.compile(r'example\.com, sub.example.com')
    return filter(lambda i: regexSub.search(i), fullList2)
groups = {key : filterSub(filterIP(list(set(items)))) for (key, items) in groups.iteritems() }
print(self.json_format_dict(groups, pretty=True))
This is what I get without filterSub:
"type_1": [
    "server1.example.com",
    "server2.example.com"
],
This is what I get with filterSub:
"type_1": [],
This is what I'm trying to get:
"type_1": [
    "server1.sub.example.com",
    "server2.sub.example.com"
],
The statement:
regexSub = re.compile(r'example\.com, sub.example.com')
doesn't do what you think it does. It creates a compiled regular expression that matches the string "example.com" followed by a comma, a space, the string "sub", an arbitrary character, the string "example", an arbitrary character, and the string "com". It does not create any sort of substitution.
Instead, you want to write something like this, using the re.sub function to perform the substitution and using map to apply it:
def filterSub(fullList2):
    regexSub = re.compile(r'example\.com')
    return map(lambda i: re.sub(regexSub, "sub.example.com", i),
               filter(lambda i: re.search(regexSub, i), fullList2))
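As a small, self-contained check of both points, the sketch below first shows that the original pattern finds nothing in a plain host name, then applies a substitution with a list comprehension. The names filter_sub and hosts are made up for the example, and anchoring the pattern with $ is an added assumption so that only a trailing example.com is rewritten.

```python
import re

# The pattern from the question matches the literal text "example.com, sub?example?com",
# so it never matches an ordinary host name.
original_pattern = re.compile(r'example\.com, sub.example.com')
no_match = original_pattern.search("server1.example.com")

# Assumption: only names that end in example.com should be rewritten.
regex_sub = re.compile(r'example\.com$')

def filter_sub(hosts):
    # Keep matching names and insert the "sub" label in front of the domain.
    return [regex_sub.sub("sub.example.com", h) for h in hosts if regex_sub.search(h)]

hosts = ["server1.example.com", "server2.example.com", "10.0.0.1"]
result = filter_sub(hosts)

print(no_match)
# None
print(result)
# ['server1.sub.example.com', 'server2.sub.example.com']
```

A list comprehension returns a real list on both Python 2 and Python 3, whereas map and filter return lazy iterators on Python 3.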
If the examples are all truly as simple as those you listed, a regex is probably overkill. A simple solution would be to use string .split and .join. This would likely give better performance.
First split the url at the first period:
url = 'server1.example.com'
split_url = url.split('.', 1)
# ['server1', 'example.com']
Then you can use the sub to rejoin the url:
subbed_url = '.sub.'.join(split_url)
# 'server1.sub.example.com'
Of course you can do the split and the join at the same time
'.sub.'.join(url.split('.', 1))
Or create a simple function:
def sub_url(url):
    return '.sub.'.join(url.split('.', 1))
To apply this to the list you can take several approaches.
A list comprehension:
subbed_list = [sub_url(url)
               for url in url_list]
Map it:
subbed_list = map(sub_url, url_list)
Or my favorite, a generator:
gen_subbed = (sub_url(url)
              for url in url_list)
The last looks like a list comprehension but gives the added benefit that you don't rebuild the entire list. It processes the elements one item at a time as the generator is iterated through. If you decide you do need the list later you can simply convert it to a list as follows:
subbed_list = list(gen_subbed)
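One thing to keep in mind with the generator approach is that a generator can only be consumed once. A minimal sketch, reusing sub_url from above with a made-up url_list:

```python
def sub_url(url):
    return '.sub.'.join(url.split('.', 1))

# Made-up input for the example.
url_list = ['server1.example.com', 'server2.example.com']
gen_subbed = (sub_url(url) for url in url_list)

first = next(gen_subbed)   # computes only the first item
rest = list(gen_subbed)    # consumes the remaining items
again = list(gen_subbed)   # the generator is exhausted by now

print(first)
# server1.sub.example.com
print(rest)
# ['server2.sub.example.com']
print(again)
# []
```

So if the results are needed more than once, converting to a list up front is the safer choice.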

remove certain files in a list matching a pattern

I have a list with files (the path to them).
I wrote a function like this to remove certain files matching a pattern but it just removes 2 files at most and I don't understand why.
remove_list = ('*.txt',) # Example for removing all .txt files in the list
def removal(list):
    for f in list:
        if any(fnmatch(basename(f.lower()), pattern) for pattern in remove_list):
            list.remove(f)
    return list
Edit: OK, naming my list "list" in the code was a bad idea. In my actual code it has a different name; I just wanted to give an abstract idea of what I'm dealing with and should have mentioned that.
Modifying a list while you're iterating over it is a bad idea: each removal shifts the remaining items one position to the left, so the loop skips the element that follows every removed one.
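A minimal sketch of the effect, using made-up file names and a plain endswith check instead of fnmatch to keep it short:

```python
files = ["a.txt", "b.txt", "c.txt", "d.csv"]

# Removing while iterating: after "a.txt" is removed, "b.txt" slides into
# index 0 and the loop moves on to index 1, so "b.txt" is never examined.
for f in files:
    if f.endswith(".txt"):
        files.remove(f)

print(files)
# ['b.txt', 'd.csv']

# Building a new list avoids the problem.
original = ["a.txt", "b.txt", "c.txt", "d.csv"]
kept = [f for f in original if not f.endswith(".txt")]

print(kept)
# ['d.csv']
```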
The best way to do what you want is to build a new list without the items you don't want:
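from fnmatch import fnmatch
from os.path import basename
remove_list = (r'*.txt',) # Example for removing all .txt files in the list
def removal(l, rm_list):
    for f in l:
        for pattern in rm_list:
            if fnmatch(basename(f.lower()), pattern):
                break # f matches a removal pattern, so drop it
        else:
            yield f # no pattern matched, so keep f
print(list(removal(list_with_files, remove_list)))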
Here, I'm unrolling your any one-liner, which might make your code look smart but is hard to read and might give you headaches in six months. A simple for and an if are better because they are more readable.
The yield keyword makes the function return what's called a generator in Python: when you iterate over the result of the function, it hands one value to the calling context, then resumes inside the function to produce the next item.
This is why I wrap the function call in list() for the print, whereas if you iterate over it, you don't need to put it in a list:
for elt in removal(list_with_files, remove_list):
    print(elt)
If you don't like using a generator (and the yield statement), then you have to build the list manually, before returning it:
remove_list = (r'*.txt',) # Example for removing all .txt files in the list
def removal(l, rm_list):
    ret_list = []
    for f in l:
        for pattern in rm_list:
            if fnmatch(basename(f.lower()), pattern):
                break # f matches a removal pattern, so drop it
        else:
            ret_list.append(f) # no pattern matched, so keep f
    return ret_list
HTH
You can use str.endswith if you are removing based on extension; you just need to pass a tuple of extensions:
remove_tup = (".txt",".py") # Example for removing all .txt and .py files in the list
def removal(lst):
    return [f for f in lst if not f.endswith(remove_tup)]
The code you provided is vague.
1. Don't use list as a name; it shadows the built-in list.
2. Don't modify the list while you iterate over it; you can make a copy of it instead.
My suggestion is:
You can iterate over your original list and the remove_list as below:
test.py
list1=["file1.txt", "file2.txt", "other.csv"]
list2=["file1.txt", "file2.txt"] # simulates your remove_list
listX = [x for x in list1 if x not in list2] # creates a new list
print(listX)
$python test.py
['other.csv']
As was said in the comments, don't modify a list as you iterate over it. You can also use a list comprehension like so:
patterns = ('*.txt', '*.csv')
good = [f for f in all_files if not any(fnmatch(basename(f.lower()), pattern) for pattern in patterns)]

How can I sort list of strings in specific order?

Let's say I have such a list:
['word_4_0_w_7',
'word_4_0_w_6',
'word_3_0_w_10',
'word_3_0_w_2']
and I want to sort them according to the number that comes after "word" and then according to the number after "w".
It will look like this:
['word_3_0_w_2',
'word_3_0_w_10',
'word_4_0_w_6',
'word_4_0_w_7']
What comes to mind is to create a bunch of lists, fill each one (chosen by the index after "word") with the strings sorted by the number after "w", and then merge them.
Is there a more clever way to do it in Python?
Use Python's key functionality, in conjunction with other answers:
def mykey(value):
    ls = value.split("_")
    return int(ls[1]), int(ls[-1])
newlist = sorted(firstlist, key=mykey)
## or, if you want it in place:
firstlist.sort(key=mykey)
Python will be more efficient with key vs cmp.
You can provide a function to the sort() method of list objects:
l = ['word_4_0_w_7',
'word_4_0_w_6',
'word_3_0_w_10',
'word_3_0_w_2']
def my_key_func(x):
    xx = x.split("_")
    return (int(xx[1]), int(xx[-1]))
l.sort(key=my_key_func)
Output:
print(l)
['word_3_0_w_2', 'word_3_0_w_10', 'word_4_0_w_6', 'word_4_0_w_7']
edit: Changed code according to comment by @dwanderson; more info on this can be found here.
You can use a function to extract the relevant parts of your string and then use those parts to sort:
a = ['word_4_0_w_7', 'word_4_0_w_6', 'word_3_0_w_10', 'word_3_0_w_2']
def sort_func(x):
    parts = x.split('_')
    sort_key = parts[1]+parts[2]+"%02d"%int(parts[4])
    return sort_key
a_sorted = sorted(a,key=sort_func)
The expression "%02d" % int(x.split('_')[4]) adds a leading zero in front of the second number; otherwise 10 would sort before 2. You may have to do the same with the number extracted by x.split('_')[2].
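The zero-padded string key only works while every number fits the chosen width. The sketch below, with made-up input that includes a three-digit and a two-digit number, compares it with a tuple-of-integers key like the one used in the other answers; padded_key and tuple_key are names chosen for the example.

```python
def padded_key(x):
    # String key with the last number padded to two digits.
    parts = x.split('_')
    return parts[1] + parts[2] + "%02d" % int(parts[4])

def tuple_key(x):
    # Tuple of integers, compared element by element as numbers.
    parts = x.split('_')
    return (int(parts[1]), int(parts[2]), int(parts[4]))

items = ['word_3_0_w_100', 'word_3_0_w_20', 'word_10_0_w_1', 'word_3_0_w_2']

by_padded = sorted(items, key=padded_key)
by_tuple = sorted(items, key=tuple_key)

print(by_padded)
# ['word_10_0_w_1', 'word_3_0_w_2', 'word_3_0_w_100', 'word_3_0_w_20']
print(by_tuple)
# ['word_3_0_w_2', 'word_3_0_w_20', 'word_3_0_w_100', 'word_10_0_w_1']
```

With the string key, 100 sorts before 20 and word_10 sorts before word_3, while the tuple key keeps both in numeric order without needing to know the widths in advance.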
