How to Naturally Sort Pathlib objects in Python?


I am trying to create a sorted list of files in the ./pages directory. This is what I have so far:
import numpy as np
from PIL import Image
import glob
from pathlib import Path
# sorted( l, key=lambda a: int(a.split("-")[1]) )
image_list = []
for filename in Path('./pages').glob('*.jpg'):
    # sorted( i, key=lambda a: int(a.split("_")[1]) )
    # im=Image.open(filename)
    image_list.append(filename)
print(*image_list, sep = "\n")
Current output:
pages/page_1.jpg
pages/page_10.jpg
pages/page_11.jpg
pages/page_12.jpg
pages/page_2.jpg
pages/page_3.jpg
pages/page_4.jpg
pages/page_5.jpg
pages/page_6.jpg
pages/page_7.jpg
pages/page_8.jpg
pages/page_9.jpg
Expected Output:
pages/page_1.jpg
pages/page_2.jpg
pages/page_3.jpg
pages/page_4.jpg
pages/page_5.jpg
pages/page_6.jpg
pages/page_7.jpg
pages/page_8.jpg
pages/page_9.jpg
pages/page_10.jpg
pages/page_11.jpg
pages/page_12.jpg
I've tried the solutions found in the duplicate, but they don't work because the pathlib files are class objects, and not strings. They only appear as filenames when I print them.
For example:
print(filename) # pages/page_1.jpg
print(type(filename)) # <class 'pathlib.PosixPath'>
Finally, this is working code. Thanks to all.
from pathlib import Path
import numpy as np
from PIL import Image
import natsort
def merge_to_single_image():
    image_list1 = []
    image_list2 = []
    image_list3 = []
    image_list4 = []
    for filename in Path('./pages').glob('*.jpg'):
        image_list1.append(filename)
    for i in image_list1:
        image_list2.append(i.stem)
        # print(type(i.stem))
    image_list3 = natsort.natsorted(image_list2, reverse=False)
    for i in image_list3:
        i = str(i)+ ".jpg"
        image_list4.append(Path('./pages', i))
    images = [Image.open(i) for i in image_list4]
    # for a vertical stacking it is simple: use vstack
    images_combined = np.vstack(images)
    images_combined = Image.fromarray(images_combined)
    images_combined.save('Single_image.jpg')

One can use the natsort library (pip install natsort). The code stays simple, too.
[! This works, at least tested for versions 5.5 and 7.1 (current)]
from pathlib import Path
from natsort import natsorted
image_list = Path('./pages').glob('*.jpg')
image_list = natsorted(image_list, key=str)
# Or convert the list of paths to a list of strings, sort it naturally, then convert back to a list of paths
image_list = [Path(p) for p in natsorted([str(p) for p in image_list ])]

Just for posterity, maybe this is more succinct?
natsorted(list_of_pathlib_objects, key=str)
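To see what key=str does in that call, here is a minimal sketch using only the standard library (the sample paths are made up for illustration, nothing is read from disk). The key function is only used for comparisons, so the result still contains Path objects. Plain sorted with key=str still compares text character by character, which is why natsorted is needed to get the numeric ordering.

```python
from pathlib import Path

# Hypothetical sample paths; nothing is read from disk.
paths = [Path("pages/page_2.jpg"), Path("pages/page_10.jpg"), Path("pages/page_1.jpg")]

# key=str converts each Path to text only for the purpose of comparison.
by_text = sorted(paths, key=str)

# The elements of the result are still Path objects, not strings.
still_paths = all(isinstance(p, Path) for p in by_text)
print([p.name for p in by_text], still_paths)
```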

Note that sorted doesn't sort your data in place, but returns a new list, so you have to iterate on its output.
In order to get your sorting key, which is the integer value at the end of your filename:
You can first take the stem of your path, which is its final component without extension (so, for example, 'page_13').
Then, it is better to split it once from the right, in order to be safe in case your filename contains other underscores in the first part, like 'some_page_33.jpg'.
Once converted to int, you have the key you want for sorting.
So, your code could look like:
for filename in sorted(Path('./pages').glob('*.jpg'),
                       key=lambda path: int(path.stem.rsplit("_", 1)[1])):
    print(filename)
Sample output:
pages/ma_page_2.jpg
pages/ma_page_11.jpg
pages/ma_page_13.jpg
pages/ma_page_20.jpg
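The lambda above raises an error as soon as one file has no integer after its last underscore. A possible variation, as a sketch only: the helper name trailing_number and the choice to put such files first (key -1) are my own assumptions, not part of the answer.

```python
from pathlib import Path

def trailing_number(path):
    """Return the integer after the last underscore of the stem, or -1 if there is none."""
    # rpartition splits once from the right, like rsplit("_", 1)
    _, _, tail = path.stem.rpartition("_")
    return int(tail) if tail.isdecimal() else -1

# Hypothetical sample paths; nothing is read from disk.
paths = [Path("pages/ma_page_13.jpg"), Path("pages/ma_page_2.jpg"),
         Path("pages/cover.jpg"), Path("pages/ma_page_11.jpg")]
ordered = sorted(paths, key=trailing_number)
print(*ordered, sep="\n")
```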

The problem is not as easy as it sounds. "Natural" sorting can be quite challenging, especially with potentially arbitrary input strings. For example, what if you have "69_helloKitty.jpg" in your data?
I used https://github.com/SethMMorton/natsort a while ago for a similar problem, maybe it helps you.
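If installing natsort is not an option, a rough standard-library approximation is to split each file name into text and digit chunks. This is only a sketch under simple assumptions (ASCII digits, case-insensitive text, no signs or decimals); natural_key is a hypothetical helper name and it does not reproduce everything natsort handles.

```python
import re
from pathlib import Path

def natural_key(path):
    """Split the file name into alternating text and integer chunks."""
    parts = re.split(r"([0-9]+)", path.name)
    # With a capturing group, re.split alternates text, digits, text, ...
    # so odd positions are always digit runs and even positions are text.
    return [int(p) if i % 2 else p.lower() for i, p in enumerate(parts)]

# Hypothetical sample paths; nothing is read from disk.
paths = [Path("pages/page_10.jpg"), Path("pages/page_2.jpg"),
         Path("pages/69_helloKitty.jpg"), Path("pages/page_1.jpg")]
ordered = sorted(paths, key=natural_key)
print(*ordered, sep="\n")
```

Because every key starts with a (possibly empty) text chunk, integers are only ever compared with integers and strings with strings, so a name like "69_helloKitty.jpg" does not cause a TypeError.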

Just use it like this:
from pathlib import Path
- sorted by name:
sorted(Path('anywhere/you/want').glob('*.jpg'))
- sorted by modification time:
import os
sorted(Path('anywhere/you/want').glob('*.jpg'), key=os.path.getmtime)
- sorted by size:
import os
sorted(Path('anywhere/you/want').glob('*.jpg'), key=os.path.getsize)
etc.
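The os.path functions accept Path objects because they take any path-like object. A pathlib-only alternative is to read the same information from Path.stat(). A small sketch, using a temporary directory and made-up files so it can be run anywhere:

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    # Hypothetical files of different sizes, created only for this demo.
    for name, size in (("b.jpg", 30), ("a.jpg", 10), ("c.jpg", 20)):
        (folder / name).write_bytes(b"x" * size)

    # st_size is the size in bytes; st_mtime would sort by modification time instead.
    by_size = sorted(folder.glob("*.jpg"), key=lambda p: p.stat().st_size)
    size_order = [p.name for p in by_size]

print(size_order)
```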
Hint: since you are also the one creating the files, write the file names with zero-padded numbers, like:
for i in range(100):
    with open('filename'+f'_{i:03d}','wb'): # py3.6+ f-string
        # py3.3+: 'filename'+'_{:03d}'.format(i) with str.format()
        pass # write your file stuff...
...
'filename_007',
'filename_008',
'filename_009',
'filename_010',
'filename_011',
'filename_012',
'filename_013',
'filename_014',
...
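A quick way to convince yourself that the padding is what makes plain sorted() work, as a sketch with made-up numbers:

```python
numbers = [12, 7, 100, 8]

# Without padding, names are compared character by character, so "100" sorts before "12".
unpadded = sorted(f"filename_{i}" for i in numbers)

# With zero padding to a fixed width, text order and numeric order agree.
padded = sorted(f"filename_{i:03d}" for i in numbers)

print(unpadded)
print(padded)
```

The padding width has to cover the largest number you expect; with :03d the ordering breaks again once you reach 1000.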

Related

How to extract jpg EXIF metadata from a folder in chronological order

I am currently writing a script to extract EXIF GPS data from a folder of jpg images. I am using os.scandir to extract the entries from the folder but from my understanding os.scandir opens the files in an arbitrary way. I need the images to be opened in chronological order by filename. Below is my current code which works as intended however it does not open the images in the correct order. The files within my image folder are named chronologically like so: "IMG_0097, IMG_0098" etc.
#!/usr/bin/python
import os, exif, folium
def convert_lat(coordinates, ref):
    latCoords = coordinates[0] + coordinates[1] / 60 + coordinates[2] / 3600
    if ref == 'W' or ref == 'S':
        latCoords = -latCoords
    return latCoords
coordList=[]
map = folium.Map(location=[51.50197125069916, -0.14000860301423912], zoom_start = 16)
from exif import Image
with os.scandir('gps/') as entries:
    try:
        for entry in entries:
            img_path = 'gps/'+entry.name
            with open (img_path, 'rb') as src:
                img = Image(src)
                if img.has_exif:
                    latCoords = (convert_lat(img.gps_latitude, img.gps_latitude_ref))
                    longCoords = (convert_lat(img.gps_longitude, img.gps_longitude_ref))
                    coord = [latCoords, longCoords]
                    coordList.append(coord)
                    folium.Marker(coord, popup=str(coord)).add_to(map)
                    folium.PolyLine(coordList, color =" red", weight=2.5, opacity=1).add_to(map)
                    print(img_path)
                    print(coord)
                else:
                    print (src.name,'has no EXIF information')
    except:
        print(img_path)
        print("error occured")
map.save(outfile='/home/jamesdean/Desktop/Python scripts/map.html')
print ("Map generated successfully")

I would say ditch os.scandir and take advantage of more modern features the standard library has to offer:
from pathlib import Path
from operator import attrgetter
# assuming there is a folder named "gps" in the current working directory...
for path in sorted(Path("gps").glob("*.jpg"), key=attrgetter("stem")):
    print(path) # do something with the current path
The from operator import attrgetter and key=attrgetter("stem") are a bit redundant, but I'm just being explicit about what attribute I would like to use for determining the sorted order. In this case, the "stem" attribute of a path refers to just the name of the file as a string. For example, if the current path has a filename (including extension) of "IMG_0097.jpg", then path.stem would be "IMG_0097". Like I said, the stem is a string, so your paths will be sorted in lexicographical order. You don't need to do any conversion to integers, because your filenames already include leading zeroes, so lexicographical ordering should work just fine.

You can sort a list using the built-in sorted function. As Paul pointed out, simply sorting without any extra arguments works just as well here:
a = ["IMG_0097.jpg", "IMG_0085.jpg", "IMG_0043.jpg", "IMG_0098.jpg", "IMG_0099.jpg", "IMG_0100.jpg"]
sorted_list = sorted(a)
print(sorted_list)
Output:
['IMG_0043.jpg', 'IMG_0085.jpg', 'IMG_0097.jpg', 'IMG_0098.jpg', 'IMG_0099.jpg', 'IMG_0100.jpg']
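In your case, os.scandir yields os.DirEntry objects, which cannot be compared with each other directly, so sort them by name:
for entry in sorted(entries, key=lambda e: e.name):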
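One thing to watch out for with os.scandir: it yields os.DirEntry objects rather than strings, and those do not define an ordering, so sorted() needs a key such as the entry name. A self-contained sketch, using a temporary directory and empty made-up files in place of the real gps/ folder:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    # Hypothetical empty files standing in for the real images.
    for name in ("IMG_0100.jpg", "IMG_0097.jpg", "IMG_0098.jpg"):
        open(os.path.join(folder, name), "wb").close()

    with os.scandir(folder) as entries:
        # Sort the DirEntry objects by their name attribute.
        ordered_names = [entry.name for entry in sorted(entries, key=lambda e: e.name)]

print(ordered_names)
```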

Is there a way in Python to find a file with the smallest number in its name?

I have a bunch of documents created by one script that are all called like this:
name_*score*
*score* is a float and I need in another script to identify the file with the smallest number in the folder. Example:
name_123.12
name_145.45
This should return string "name_123.12"

min takes a key function. You can use that to define the way min is calculated:
files = [
    "name_123.12",
    "name_145.45",
    "name_121.45",
    "name_121.457"
]
min(files, key=lambda x: float(x.split('_')[1]))
# name_121.45

You can try to extract the number part first, then convert it to float and take the minimum.
For example:
new_list = [float(name[5:]) for name in YOURLIST] # trim off the 'name_' prefix and convert to float
result = 'name_' + str(min(new_list)) # this is your result (note: str() may drop trailing zeros, e.g. 123.10 becomes 123.1)

Just wanted to say Mark Meyer is completely right on this one, but you also mentioned that you were reading these file names from a directory. In that case, there is a bit of code you could add to Mark's answer:
import glob, os
os.chdir("/path/to/directory")
files = glob.glob("*")
print(min(files, key=lambda x: float((x.split('_')[1]))))

A way to get the lowest value by providing a directory.
import os
import re
def get_lowest(directory):
    lowest = float('inf') # sys.maxint no longer exists in Python 3
    for filename in os.listdir(directory):
        match = re.match(r'name_\d+(?:\.\d+)', filename)
        if match:
            number = re.search(r'(?<=_)\d+(?:\.\d+)', match.group(0))
            if number:
                value = float(number.group(0))
                if value < lowest:
                    lowest = value
    return lowest
print(get_lowest('./'))
Expanded on Tim Biegeleisen's answer, thank you Tim!
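Since the question asks for the file name rather than the number, here is a hedged variation that combines the regex filter with min and a key. The pattern assumes names look exactly like name_<float>; SCORE_RE and lowest_scored are hypothetical names chosen for this sketch.

```python
import re

# Assumed naming scheme from the question: "name_" followed by a number.
SCORE_RE = re.compile(r"name_(\d+(?:\.\d+)?)$")

def lowest_scored(names):
    """Return the name with the smallest score, or None if nothing matches."""
    candidates = [n for n in names if SCORE_RE.match(n)]
    return min(candidates,
               key=lambda n: float(SCORE_RE.match(n).group(1)),
               default=None)

names = ["name_123.12", "notes.txt", "name_145.45", "name_121.45", "name_121.457"]
best = lowest_scored(names)
print(best)
```

In a real script, names could come from os.listdir(directory); names that do not match the pattern are simply ignored.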

sort images based on a cluster correspondances list

I have the following working code to sort images according to a cluster list which is a list of tuples: (image_id, cluster_id).
One image can only be in one and only one cluster (there is never the same image in two clusters for example).
I wonder if there is a way to shorten the "for+for+if+if" loops at the end of the code. As it stands, for each file name I must check every pair in the cluster list, which is rather redundant.
import os
import re
import shutil
srcdir = '/home/username/pictures/' #
if not os.path.isdir(srcdir):
    print("Error, %s is not a valid directory!" % srcdir)
    return None
pts_cls # is the list of pairs (image_id, cluster_id)
filelist = [(srcdir+fn) for fn in os.listdir(srcdir) if
            re.search(r'\.jpg$', fn, re.IGNORECASE)]
filelist.sort(key=lambda var:[int(x) if x.isdigit() else
                              x for x in re.findall(r'[^0-9]|[0-9]+', var)])
for f in filelist:
    fbname = os.path.splitext(os.path.basename(f))[0]
    for e,cls in enumerate(pts_cls): # for each (img_id, clst_id) pair
        if str(cls[0])==fbname: # check if image_id corresponds to file basename on disk)
            if cls[1]==-1: # if cluster_id is -1 (->noise)
                outdir = srcdir+'cluster_'+'Noise'+'/'
            else:
                outdir = srcdir+'cluster_'+str(cls[1])+'/'
    if not os.path.isdir(outdir):
        os.makedirs(outdir)
    dstf = outdir+os.path.basename(f)
    if os.path.isfile(dstf)==False:
        shutil.copy2(f,dstf)
Of course, as I am pretty new to Python, any other well explained improvements are welcome!

I think you're complicating this far more than needed. Since your image names are unique (there can only be one image_id), you can safely convert pts_cls into a dict and have fast lookups on the spot instead of looping through the list of pairs each and every time. You are also using regex where it's not needed, and you're packing your paths only to unpack them later.
Also, your code would break if it happens that an image from your source directory is not in the pts_cls as its outdir would never be set (or worse, its outdir would be the one from the previous loop).
I'd streamline it like:
import os
import shutil
src_dir = "/home/username/pictures/"
if not os.path.isdir(src_dir):
    print("Error, %s is not a valid directory!" % src_dir)
    exit(1) # return is expected only from functions
pts_cls = [] # is the list of pairs (image_id, cluster_id), load from wherever...
# convert your pts_cls into a dict - since there cannot be any images in multiple clusters
# base image name is perfectly ok to use as a key for blazingly fast lookups later
cluster_map = dict(pts_cls)
# get only `.jpg` files; store base name and file name, no need for a full path at this time
files = [(fn[:-4], fn) for fn in os.listdir(src_dir) if fn.lower()[-4:] == ".jpg"]
# no need for sorting based on your code
for name, file_name in files: # loop through all files
    if name in cluster_map: # proceed with the file only if in pts_cls
        cls = cluster_map[name] # get our cluster value
        # get our `cluster_<cluster_id>` or `cluster_Noise` (if cluster == -1) target path
        target_dir = os.path.join(src_dir, "cluster_" + str(cls if cls != -1 else "Noise"))
        target_file = os.path.join(target_dir, file_name) # get the final target path
        if not os.path.exists(target_file): # if the target file doesn't exist
            if not os.path.isdir(target_dir): # make sure our target path exists
                os.makedirs(target_dir, exist_ok=True) # create a full path if it doesn't
            shutil.copy(os.path.join(src_dir, file_name), target_file) # copy
UPDATE - If you have multiple 'special' folders for certain cluster IDs (like Noise is for -1) you can create a map like cluster_targets = {-1: "Noise"} where the keys are your cluster IDs and their values are, obviously, the special names. Then you can replace the target_dir generation with: target_dir = os.path.join(src_dir, "cluster_" + str(cluster_targets.get(cls,cls)))
UPDATE #2 - Since your image_id values appear to be integers while filenames are strings, I'd suggest you just build your cluster_map dict by converting your image_id parts to strings. That way you'd be comparing like with like, without the danger of a type mismatch:
cluster_map = {str(k): v for k, v in pts_cls}
If you're sure that none of the *.jpg files in your src_dir will have a non-integer in their name you can instead convert the filename into an integer to begin with in the files list generation - just replace fn[:-4] with int(fn[:-4]). But I wouldn't advise that as, again, you never know how your files might be named.
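Putting the dict lookup, the string keys and the special-names map together, the path decision can be isolated in a small function that touches no files. This is only a sketch: target_dir_for, special_names and the sample pts_cls values are my own illustrative assumptions.

```python
import os

def target_dir_for(file_name, cluster_map, src_dir, special_names=None):
    """Return the cluster folder for file_name, or None if it should be skipped."""
    if special_names is None:
        special_names = {-1: "Noise"}  # cluster id -> special folder suffix
    stem, ext = os.path.splitext(file_name)
    if ext.lower() != ".jpg" or stem not in cluster_map:
        return None  # not a jpg, or no cluster known for this image
    cls = cluster_map[stem]
    return os.path.join(src_dir, "cluster_" + str(special_names.get(cls, cls)))

pts_cls = [(1, 0), (2, -1)]  # hypothetical (image_id, cluster_id) pairs
cluster_map = {str(image_id): cluster_id for image_id, cluster_id in pts_cls}

for fn in ("1.jpg", "2.JPG", "3.jpg", "notes.txt"):
    print(fn, "->", target_dir_for(fn, cluster_map, "pics"))
```

Keeping this part free of file operations makes it easy to check the mapping before any copying happens.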

How can I sort given a specific order that I provide

I am trying to sort files in a directory given their extension, but provided an order that I give first. Let's say I want the extension order to be
ext_list = ['.bb', '.cc', '.dd', '.aa']
The only way that I can think of would be to go through every single file
and place them in a list every time a specific extension is encountered.
for subdir, dirs, files in os.walk(directory):
    if file.endswith( '.bb') --> append file
    then go to the end of the directory
then loop again
    if file.endswith( '.cc') --> append file
    and so on...
return sorted_extension_list
and then finally
for file in sorted_extension_list :
    print file

Here is another way of doing it:
import os
ext_list = ['.bb', '.cc', '.dd', '.aa']
files = []
for _, _, f in os.walk('directory'):
    files.extend(f) # f is the list of file names in the current directory
sorted(files, key=lambda x: ext_list.index(os.path.splitext(x)[1]))
['buzz.bb', 'foo.cc', 'fizz.aa']
Edit: My output doesn't have dd since I didn't make a file for it in my local test dir. It will work regardless.

You can use sorted() with a custom key:
import os
my_custom_keymap = {".aa":2, ".bb":3, ".cc":0, ".dd":1}
def mycompare(f):
    return my_custom_keymap[os.path.splitext(f)[1]]
files = ["alfred.bb", "butters.dd", "charlie.cc", "derkins.aa"]
print(sorted(files, key=mycompare))
The above uses the mycompare function as a custom sort key. In this case, it extracts the extension and then looks it up in the my_custom_keymap dictionary.
A very similar way (but closer to your example) could use a list as the map:
import os
my_custom_keymap = [".cc", ".dd", ".aa", ".bb"]
def mycompare(f):
    return my_custom_keymap.index(os.path.splitext(f)[1])
files = ["alfred.bb", "butters.dd", "charlie.cc", "derkins.aa"]
print(sorted(files, key=mycompare))
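Both versions raise an error (KeyError or ValueError) when a file has an extension that is not in the map. One possible way around that, sketched here with the order from the question, is to give unknown extensions a rank after all known ones. The names ext_order, rank and ext_key, and the choice to put unknown extensions last, are assumptions for illustration.

```python
import os

ext_order = [".bb", ".cc", ".dd", ".aa"]  # desired extension order
rank = {ext: i for i, ext in enumerate(ext_order)}

def ext_key(name):
    """Sort by extension rank first, then by file name within the same extension."""
    ext = os.path.splitext(name)[1]
    # Unknown extensions get a rank after all known ones.
    return (rank.get(ext, len(ext_order)), name)

files = ["derkins.aa", "readme.txt", "alfred.bb", "charlie.cc", "butters.dd"]
ordered = sorted(files, key=ext_key)
print(ordered)
```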

import os
# List you should get once:
file_list_name =['mickey.aa','mickey.dd','mickey_B.cc','mickey.bb']
ext_list = [ '.bb', '.cc' , '.dd' , '.aa' ]
order_list =[]
for ext in ext_list:
    for file in file_list_name:
        extension = os.path.splitext(file)[1]
        if extension == ext:
            order_list.append(file)
order_list is what you are looking for. Otherwise you can use the sorted() function with its key argument. Just look for it on SO!

Using sorted with a custom key is probably best, but here's another method where you store the filenames in lists based on their extension. Then put those lists together based on your custom order.
import collections
import os
import pprint
def get_sorted_list_of_files(dirname, extensions):
    extension_map = collections.defaultdict(list)
    for _, _, files in os.walk(dirname):
        for filename in files:
            extension = os.path.splitext(filename)[1]
            extension_map[extension].append(filename)
    pprint.pprint(extension_map)
    sorted_list = []
    for extension in extensions:
        sorted_list.extend(extension_map[extension])
    pprint.pprint(extensions)
    pprint.pprint(sorted_list)
    return sorted_list

Why does calling natsorted not modify my list?

from natsort import natsorted
filelist=[]
filelist=os.listdir("C:\\Users\\Amit\\Downloads\\Compressed\\trainigVid")
natsorted(filelist)
print filelist
I am getting the following output:
['1.avi', '10.avi', '11.avi', '12.avi', '13.avi', '14.avi', '15.avi', '16.avi', '17.avi', '18.avi', '19.avi', '2.avi', '20.avi', '21.avi', '22.avi', '23.avi', '24.avi', '3.avi', '4.avi', '5.avi', '6.avi', '7.avi', '8.avi', '9.avi']
I want this list to be naturally sorted like
[1.avi, 2.avi, 3.avi.....]
I am stuck, please help.

natsorted returns the newly sorted list, it does not modify the original list in place. This means you should use:
filelist = natsorted(filelist)
to get that return value.
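The same distinction exists in the standard library: sorted() returns a new list, like natsorted does, while list.sort() changes the list in place and returns None. A small sketch without natsort (the numeric_stem helper is a made-up key that assumes names like '12.avi'):

```python
names = ["10.avi", "2.avi", "1.avi"]

def numeric_stem(name):
    """Sort key: the integer before the extension (assumes names like '12.avi')."""
    return int(name.split(".")[0])

new_list = sorted(names, key=numeric_stem)  # returns a new, sorted list
unchanged = list(names)                     # names is still in its original order here

returned = names.sort(key=numeric_stem)     # sorts names in place and returns None
print(new_list, unchanged, returned, names)
```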

Try this:
import os
filelist = os.listdir("C:\\Users\\Amit\\Downloads\\Compressed\\trainigVid")
out = []
for s in sorted(filelist, key=lambda x:int(os.path.splitext(x)[0])):
    out.append(s)
print(out)
Demo:
['1.avi', '2.avi', '3.avi', '4.avi', '5.avi', '6.avi', '7.avi', '8.avi', '9.avi', '10.avi', '11.avi', '12.avi', '13.avi', '14.avi', '15.avi', '16.avi', '17.avi', '18.avi', '19.avi', '20.avi', '21.avi', '22.avi', '23.avi', '24.avi']
