How to get the sequence number of a file in a folder? - python

I have a Windows PC. My script should identify the sequence number, within its folder, of the file passed on the command line, i.e.
myscript.py \\network-drive\files\Long-long.file.name.with.numbers.txt
Folder content is the following:
\\network-drive\files\
folder1
folder2
file1
file2
Long.long.file.name.with.numbers.txt
file3
file4
My script should identify the sequence number of the file given on the command line, i.e. it should return 5 (folders are also counted; the assumption is that entries are sorted by name).
Upd. I've stopped with the following:
import sys
import os.path

if sys.argv[1]: # regardless of this verification, an exception happens if no argument is passed
    head, tail = os.path.split(sys.argv[1])
    print head
    print os.listdir(head)
The list returned by listdir doesn't let me tell which entries are folders and which are files, so I can't sort them properly.

There are a couple of problems you are trying to solve, and a couple of options for the solutions.
First - are you looking for something that is naturally sorted, i.e.:
/path/to/folder/
subfolder01/
test1.png
test2.png
test3.png
test10.png
test11.png
If so... you'll need to create a natural sort method. If you are happy with alphanumeric sorting:
/path/to/folder/
subfolder01/
test1.png
test10.png
test11.png
test2.png
test3.png
Then the standard sort will work. Depending on how you sort your files, the index of your result will vary.
To get the directories and files from the system, you can do it one of two ways - I'm not 100% sure which is faster, so test them both. I'm going to break the answer into chunks so you can piece it together however seems best:
Part 01: Initialization
import os
import sys

try:
    searchpath = sys.argv[1]
except IndexError:
    print 'No searchpath supplied'
    sys.exit(0)

basepath, searchname = os.path.split(searchpath)
Part 02: Collecting folders and files
Option #1: os.listdir + os.path.isfile
files = []
folders = []
for name in os.listdir(basepath):
    # listdir returns bare names, so join with basepath before testing
    if os.path.isfile(os.path.join(basepath, name)):
        files.append(name)
    else:
        folders.append(name)
Option #2: os.walk
# we only want the top-level list of folders and files,
# so break out of the loop after the first result
for basepath, folders, files in os.walk(basepath):
    break
Part 03: Calculating the Index
Option #1: no sorting - what you get from the system is what you get
# no sorting
try:
    index = len(folders) + files.index(searchname)
except ValueError:  # list.index raises ValueError, not IndexError
    index = -1
Option #2: alphanumeric sorting
# sort alpha-numerically (only the files need sorting)
try:
    index = len(folders) + sorted(files).index(searchname)
except ValueError:
    index = -1
Option #3: natural sorting
# natural sort using the projex.sorting.natural method
import projex.sorting

sorted_files = sorted(files, projex.sorting.natural)  # Python 2 cmp-style sort
try:
    index = len(folders) + sorted_files.index(searchname)
except ValueError:
    index = -1
Part 04: Logging the result
# if you want a 1-based answer
index += 1
print index
I'm not going to go into detail about natural sorting since that wasn't a part of the question - I think there are other forums on here you can find with advice on that. The projex.sorting module is one that I've written and is available here: http://dev.projexsoftware.com/projects/projex if you want to see the exact implementation of it.
Suffice to say this would be the difference in results:
>>> import pprint, projex.sorting
>>> files = ['test2.png', 'test1.png', 'test10.png', 'test5.png', 'test11.png']
>>> print files.index('test10.png')
2
>>> print sorted(files).index('test10.png')
1
>>> print sorted(files, projex.sorting.natural).index('test10.png')
3
>>> print files
['test2.png', 'test1.png', 'test10.png', 'test5.png', 'test11.png']
>>> print sorted(files)
['test1.png', 'test10.png', 'test11.png', 'test2.png', 'test5.png']
>>> print sorted(files, projex.sorting.natural)
['test1.png', 'test2.png', 'test5.png', 'test10.png', 'test11.png']
So just keep that in mind when you're working with it.
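If projex.sorting isn't available, a stdlib-only natural sort key gives the same ordering. This is a generic sketch, not the projex implementation (note that on Python 3, sorted() takes a key function rather than a comparison function):

```python
import re

def natural_key(s):
    # split into digit and non-digit runs; digit runs compare numerically
    return [int(part) if part.isdigit() else part.lower()
            for part in re.split(r'(\d+)', s)]

files = ['test2.png', 'test1.png', 'test10.png', 'test5.png', 'test11.png']
print(sorted(files, key=natural_key))
# ['test1.png', 'test2.png', 'test5.png', 'test10.png', 'test11.png']
```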
Cheers!

It looks like something like this should work:
import os
import sys
import os.path as path

try:
    directory, file = path.split(sys.argv[1])

    def sort_func(fname):
        """
        Russian directories, English directories, Russian files, then English files -
        although, honestly, I don't know how Russian files will actually be sorted...
        """
        fullname = path.join(directory, fname)
        isRussian = any(ord(x) > 127 for x in fullname)
        isDirectory = path.isdir(fullname)
        return (not isDirectory, not isRussian, fullname)

    files = sorted(os.listdir(directory), key=sort_func)
    print (files.index(file) + 1)
except IndexError:
    print "oops, no commandline arguments"

from os import listdir
from sys import argv
from os.path import dirname, basename

print listdir(dirname(argv[1])).index(basename(argv[1]))
but it really means nothing - I can't even imagine a use case where you'd need it. See os.path for details.

Related

How to get the latest folder in a directory using Python

I need to retrieve the directory of the most recently created folder. I am using a program that outputs a new run## folder each time it is executed (i.e. run01, run02, run03 and so on). Within any one run## folder resides a data file that I want to analyze (file-i-want.txt).
folder_numb = 'run01'
dir = os.path.dirname(__file__)
filepath = os.path.join(dir, '..\\data\\directory', folder_numb, 'file-i-want.txt')
In short I want to skip having to hardcode in run## and just get the directory of a file within the most recently created run## folder.
You can get the creation date with os.stat (st_birthtime is available on macOS/BSD; on Windows, st_ctime is the creation time):
path = '/a/b/c'

# newest
newest = max(os.listdir(path), key=lambda x: os.stat(os.path.join(path, x)).st_birthtime)

# all files, newest first
sorted_files = sorted(os.listdir(path), key=lambda x: os.stat(os.path.join(path, x)).st_birthtime, reverse=True)
pathlib is recommended over os for filesystem-related tasks.
reference
You can try:
from pathlib import Path

filepath = Path(__file__).parent / 'data/directory'
fnames = sorted(Path(filepath).rglob('file-i-want.txt'), key=lambda x: x.stat().st_mtime, reverse=True)
filepath = str(fnames[0])
glob.glob('run*') will return the list of files/directories that match the pattern, ordered by name.
So if you want the latest run your code will be:
import glob
print(glob.glob('run*')[-1]) # raises IndexError if there are no runs
IMPORTANT: the entries are ordered by name, so, for example, 'run21' will come AFTER 'run100'. You will need to use enough digits (zero padding) to avoid this, or count the number of matched files and recreate the folder name from that number.
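Alternatively, you can sidestep the padding problem by comparing the numeric part of the name directly. A hypothetical sketch (latest_run is an illustrative helper, and every name is assumed to contain a number):

```python
import re

def latest_run(names):
    # pick the name with the largest embedded number, so 'run100' beats 'run21'
    return max(names, key=lambda name: int(re.search(r'\d+', name).group()))

print(latest_run(['run21', 'run100', 'run3']))  # run100
```

In practice you would feed it names = glob.glob('run*').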
You can use glob to count the files matching the name pattern:
import glob
n = len(glob.glob('run*')) # number of files whose name starts with 'run'
new_run_name = 'run' + str(n)
Note: with this code the numbering starts from 0; if you want to start from 1, just add 1 to n.
If you always want a two-digit run number (00, 01, 02), use 'str(n).zfill(2)' instead of 'str(n)'.
example:
import glob
n = len(glob.glob('run*')) # number of files whose name starts with 'run'
new_run_name = 'run' + str(n + 1).zfill(2)

Extracting "unsigned files" from a directory

I have a directory with xml files associated with encrypted P7M files, meaning that for every name.xml there is a name.P7M. But there are some exceptions (the P7M file is absent) and my goal is to detect them using Python.
I'm thinking of this code. Can you help me make it more elegant?
import re
import glob

# functions to eliminate the extension from a name
def is_xml(x):
    a = re.search(r"(\s)(.xml)", x)
    if a:
        return a.group(0)
    else:
        return False

def is_P7M(x):
    a = re.search(r"(\s)(.P7M)", x)
    if a:
        return a.group(0)
    else:
        return False

# putting xml files and P7M files in two sets
setA = set(glob.glob('directory/*.xml'))
setB = set(glob.glob('directory/*.P7M'))

# eliminating extension names
for elt in setA:
    elt = is_xml(elt)
for elt in setB:
    elt = is_P7M(elt)

# difference between the two sets. setB is always the larger set
print "unsigned files are:", setB.difference(setA)
A simpler way is to glob for the .xml files, then check using os.path.exists for a .P7M file:
import os, glob

for xmlfile in glob.glob('*.xml'):
    if not os.path.exists(xmlfile.rsplit(".", 1)[0] + ".P7M"):
        print xmlfile, "is unsigned"
This code:
Uses glob.glob to get all the xml files.
Uses str.rsplit to split the filename up into name and extension (e.g. "name.xml" to ("name", ".xml")). The second argument stops str.rsplit splitting more than once.
Takes the name of the file and adds the .P7M extension.
Uses os.path.exists to see if the key file is there. If it isn't, the xmlfile is unsigned, so print it out.
If you need them in a list, you can do:
unsigned = [xmlfile for xmlfile in glob.glob('*.xml') if not os.path.exists(xmlfile.rsplit(".", 1)[0] + ".P7M")]
Or a set:
unsigned = {xmlfile for xmlfile in glob.glob('*.xml') if not os.path.exists(xmlfile.rsplit(".", 1)[0] + ".P7M")}
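On Python 3 the same check can also be written with pathlib. A sketch under the same assumptions (with_suffix swaps the extension while keeping the stem):

```python
from pathlib import Path

def unsigned_xmls(folder):
    # an .xml is unsigned when no sibling .P7M with the same stem exists
    return [p for p in Path(folder).glob('*.xml')
            if not p.with_suffix('.P7M').exists()]
```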
My solution would be:
import glob
import os

get_name = lambda fname: os.path.splitext(fname)[0]
xml_names = {get_name(fname) for fname in glob.glob('directory/*.xml')}
p7m_names = {get_name(fname) for fname in glob.glob('directory/*.P7M')}
unsigned = [xml_name + ".xml" for xml_name in xml_names.difference(p7m_names)]
print unsigned
Get all xmls in a dict, removing the extension and using the name as the key, with the value initially set to False; if we find a matching P7M, set the value to True; finally, print all keys with a False value.
import glob

xmls = glob.glob('directory/*.xml')
p7ms = glob.glob('directory/*.P7M')

# use the xml file names as keys by removing the extension
d = {k.rsplit(".", 1)[0]: False for k in xmls}

# go over every .P7M, again removing the extension,
# setting the value to True for every match
for k in p7ms:
    d[k.rsplit(".", 1)[0]] = True

# any value still False means there is no .P7M match for that xml file
for k, v in d.items():
    if not v:
        print(k)
Or create a set of each and find the difference:
xmls = {x.rsplit(".", 1)[0] for x in glob.glob('directory/*.xml')}
p7ms = {x.rsplit(".", 1)[0] for x in glob.glob('directory/*.P7M')}
print(xmls - p7ms)
Iterate over glob once and populate a dict of filenames by extension. Finally, compute the difference between 'xml' and 'P7M' sets.
import os, glob, collections

fnames = collections.defaultdict(set)
for fname in glob.glob('*'):
    f, e = os.path.splitext(fname)
    fnames[e].add(f)

print fnames['.xml'] - fnames['.P7M']
Note that unlike other suggestions, this makes one single request to the filesystem, which might be important if the FS is slow (e.g. a network mount).

Looping through a directory, importing and processing text files in a chronology order using Python

Hoping I could get help with my Python code. Currently I have to change the working directory manually every time I run my code, which loops through all the .txt files in chronological order, since they are numbered 1_Ix_100.txt, 2_Ix_99.txt, etc., up to 201_Ix_-100.txt. All the text files are in the same directory, i.e. C:/test/Ig=*/340_TXT; what changes is the starred folder, which goes from 340 to 1020 in increments of 40, i.e. C:/test/Ig=340/340_TXT, C:/test/Ig=380/340_TXT, and so on until C:/test/Ig=1020/340_TXT.
I'm looking for a way to automate this process so that the code loops through the different /Ig=*/ folders, processes the text files, and saves the outcome as a csv file in the /Ig=*/ folder.
import matplotlib.pylab as plt
import pandas as pd
import numpy as np
import re
import os
import glob

D = []
E = []
F = []
os.chdir('C:/test/Ig=700/340_TXT') # need to loop through the different folders here: Ig=340 to Ig=1020 in increments of 40

numbers = re.compile(r'(\d+)')
def numericalSort(value):
    parts = numbers.split(value)
    parts[1::2] = map(int, parts[1::2])
    return parts

for infile in sorted(glob.glob('*.txt'), key=numericalSort):
    name = ['1', '2']
    results = pd.read_table(infile, sep='\s+', names=name)
    # process files here with output [D], [E], [F]

ArrayMain = np.column_stack((D, E, F))
np.savetxt("C:/test/Ig=700/Grey_Zone.csv", ArrayMain, delimiter=",", fmt='%.9f') # save the output in this directory, which is one level up from the working directory
I really hope the way I have worded it makes sense and I appreciate any help at all, thank you
Using a simple loop and some string manipulation you can create a list of the paths you want and then iterate over them.
Ig_txts = []
i = 340
while i <= 1020:
    Ig_txts.append('Ig=' + str(i))
    i += 40

for Ig_txt in Ig_txts:
    path = 'C:/test/' + Ig_txt + '/340_TXT'
    out_file = 'C:/test/' + Ig_txt + '/Grey_Zone.csv'
    os.chdir(path)
    ...
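The same list of paths can also be built with range. A sketch (ig_paths is an illustrative helper name; the folder layout is the one described in the question):

```python
def ig_paths(start=340, stop=1020, step=40):
    # build (input folder, output csv) pairs for each Ig increment
    pairs = []
    for i in range(start, stop + 1, step):
        folder = 'C:/test/Ig=%d/340_TXT' % i
        out_csv = 'C:/test/Ig=%d/Grey_Zone.csv' % i
        pairs.append((folder, out_csv))
    return pairs

print(len(ig_paths()))  # 18 folders, Ig=340 through Ig=1020
```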
EDIT: Gabriel brought up that my range is a little off. Check the second code blurb for the modification.
I would first put your script into a function that takes, as one of its arguments, a path. The details are up to you, this code just details how to loop through the file names.
for root, _, files in os.walk('C:/test/'):
    for f in files:
        full_path = os.path.join(root, f) # note: os.chdir on a file path would fail
        # You now have the paths you need to open, process, etc.
Now, if there are other garbage files in 'C:/test/', then you can use a range-based loop:
min_file_num = 340
max_file_num = 1020
for dir_num in range(min_file_num, max_file_num + 1, 40):
    path = 'C:/test/Ig=' + str(dir_num) + '/'
    for root, _, files in os.walk(path):
        for f in files:
            full_path = os.path.join(root, f)
            # You now have the paths you need to open, process, etc.

Keep latest file and delete all other

In my folder there are many pdf files with a date-timestamp name format, as shown below.
I would like to keep the latest file for each day and delete the rest for that day. How can I do this in Python?
2012-07-13-15-13-27_1342167207.pdf
2012-07-13-15-18-22_1342167502.pdf
2012-07-13-15-18-33_1342167513.pdf
2012-07-23-14-45-12_1343029512.pdf
2012-07-23-14-56-48_1343030208.pdf
2012-07-23-16-03-45_1343034225.pdf
2012-07-23-16-04-23_1343034263.pdf
2012-07-26-07-27-19_1343262439.pdf
2012-07-26-07-33-27_1343262807.pdf
2012-07-26-07-51-59_1343263919.pdf
2012-07-26-22-38-30_1343317110.pdf
2012-07-26-22-38-54_1343317134.pdf
2012-07-27-10-43-27_1343360607.pdf
2012-07-27-10-58-40_1343361520.pdf
2012-07-27-11-03-19_1343361799.pdf
2012-07-27-11-04-14_1343361854.pdf
Should I fill a list and sort it, then? The desired output is:
2012-07-13-15-18-33_1342167513.pdf
2012-07-23-16-04-23_1343034263.pdf
2012-07-26-22-38-54_1343317134.pdf
2012-07-27-11-04-14_1343361854.pdf
Thanks
Your desired list can also be achieved using groupby.
from itertools import groupby
import os

filtered_list = []
names = sorted(os.listdir('.'))  # groupby needs sorted input; listdir order is arbitrary
for key, group in groupby(names, lambda x: x[:10]):  # group on the first 10 characters (the date)
    filtered_list.append(list(group)[-1])  # pick the last (latest) file from each group
print filtered_list
Sort the list and delete a file if the next file in the list is from the same day:
import glob
import os

files = glob.glob("*.pdf")
files.sort()
for ifl, fl in enumerate(files[:-1]):
    if files[ifl + 1].startswith(fl[:10]):  # check if the next file is from the same day
        os.unlink(fl)  # it is - delete the current file
Edit:
As the OP's question became clearer, it became evident that not just the last file of the list is required, but the latest file of each day - to achieve this I included a same-day-conditioned unlink.
You could do it that way. The following code is untested, but may work:
import os

names = os.listdir('.')
names.sort()
for f in names[:-1]:
    os.unlink(f)
Fortunately your file names use ISO8601 date format so the textual sort achieves the desired result with no need to parse the dates.
The following snippet works with the test case given. (It groups on fname[8:10], the day of the month only, which would mix up equal day numbers in different months; grouping on fname[:10], the full date, would be safer.)
import os

files = os.listdir(".")
days = set(fname[8:10] for fname in files)
for d in days:
    f = [i for i in files if i[8:10] == d]
    for x in sorted(f)[:-1]:
        os.remove(x)
Using a dictionary you can keep just one value per day. This may be a quick and dirty solution, maybe not the best.
#!/usr/bin/env python
import os
import shutil
from shutil import copyfile

lst = []
dc = {}

os.chdir(".")
for files in os.listdir("."):
    if files.endswith(".pdf"):
        lst.append(files)
lst.sort()  # os.listdir order is arbitrary; sort so the latest file of each day wins below

for x in lst:
    dc[int(x[0:10].replace("-", ""))] = x  # key on the date; later (newer) files overwrite earlier ones

flist = dc.values()

dir = "tmpdir"
if not os.path.exists(dir):
    os.makedirs(dir)

# copy the keepers to a temp dir, delete all pdfs, then restore the keepers
for x in flist:
    copyfile(x, dir + "/" + x)
for files in os.listdir("."):
    if files.endswith(".pdf"):
        os.unlink(files)
os.chdir("./tmpdir")
for files in os.listdir("."):
    if files.endswith(".pdf"):
        copyfile(files, "../" + files)
os.chdir("../")
shutil.rmtree(os.path.abspath(".") + "/tmpdir")

batch renaming 100K files with python

I have a folder with over 100,000 files, all numbered with the same stub, but without leading zeros, and the numbers aren't always contiguous (usually they are, but there are gaps) e.g:
file-21.png,
file-22.png,
file-640.png,
file-641.png,
file-642.png,
file-645.png,
file-2130.png,
file-2131.png,
file-3012.png,
etc.
I would like to batch process this to create padded, contiguous files. e.g:
file-000000.png,
file-000001.png,
file-000002.png,
file-000003.png,
When I parse the folder with for filename in os.listdir('.'): the files don't come up in the order I'd like them to. Understandably they come up
file-1,
file-1x,
file-1xx,
file-1xxx,
etc. then
file-2,
file-2x,
file-2xx,
etc. How can I get it to go through in order of the numeric value? I am a complete python noob, but looking at the docs I'm guessing I could use map to create a new list filtering out only the numerical part, then sort that list, then iterate over it? With over 100K files this could be heavy. Any tips welcome!
import os
import re

thenum = re.compile(r'^file-(\d+)\.png$')

def bynumber(fn):
    mo = thenum.match(fn)
    if mo:
        return int(mo.group(1))

allnames = os.listdir('.')
allnames.sort(key=bynumber)
Now you have the files in the order you want them and can loop
for i, fn in enumerate(allnames):
    ...
using the progressive number i (which will be 0, 1, 2, ...) padded as you wish in the destination name.
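Tying it together, a hypothetical dry-run sketch that only computes the (old name, new name) pairs so you can inspect them before renaming (padded_names and its width parameter are illustrative, not part of the answer above):

```python
import re

thenum = re.compile(r'^file-(\d+)\.png$')

def bynumber(fn):
    mo = thenum.match(fn)
    return int(mo.group(1)) if mo else -1

def padded_names(allnames, width=6):
    # sort numerically, then assign contiguous zero-padded indices
    ordered = sorted(allnames, key=bynumber)
    return [(fn, 'file-%0*d.png' % (width, i)) for i, fn in enumerate(ordered)]

print(padded_names(['file-22.png', 'file-640.png', 'file-21.png']))
```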
There are three steps. The first is getting all the filenames. The second is converting the filenames. The third is renaming them.
If all the files are in the same folder, then glob should work.
import glob
filenames = glob.glob("/path/to/folder/*.txt")
Next, you want to change the name of the file. You can pad the number with str.rjust to do this.
>>> filename = "file-338.txt"
>>> import os
>>> fnpart = os.path.splitext(filename)[0]
>>> fnpart
'file-338'
>>> _, num = fnpart.split("-")
>>> num.rjust(5, "0")
'00338'
>>> newname = "file-%s.txt" % num.rjust(5, "0")
>>> newname
'file-00338.txt'
Now, you need to rename them all. os.rename does just that.
os.rename(filename, newname)
To put it together:
for filename in glob.glob("/path/to/folder/*.txt"): # loop through each file
    newname = make_new_filename(filename) # create a function that does step 2, above
    os.rename(filename, newname)
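make_new_filename is left as a placeholder above; one possible sketch, assuming the file-NNN.txt naming from the example session:

```python
import os

def make_new_filename(filename, width=5):
    # pad the numeric part of a 'stub-number.ext' name with leading zeros
    folder, base = os.path.split(filename)
    stem, ext = os.path.splitext(base)
    prefix, num = stem.split("-")
    return os.path.join(folder, "%s-%s%s" % (prefix, num.rjust(width, "0"), ext))

print(make_new_filename("file-338.txt"))  # file-00338.txt
```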
Thank you all for your suggestions, I will try them all to learn the different approaches. The solution I went for is based on using a natural sort on my filelist, and then iterating that to rename. This was one of the suggested answers but for some reason it has disappeared now so I cannot mark it as accepted!
import os

files = os.listdir('.')
natsort(files)

index = 0
for filename in files:
    os.rename(filename, str(index).zfill(7) + '.png')
    index += 1
where natsort is defined in http://code.activestate.com/recipes/285264-natural-string-sorting/
Why don't you do it in a two-step process? Parse all the files and rename them with padded numbers, then run another script that takes those files (which are now sorted correctly) and renames them so they're contiguous.
1) Take the number in the filename.
2) Left-pad it with zeros.
3) Save the name.
import os

def renamer():
    for iname in os.listdir('.'):
        first, second = iname.replace(" ", "").split("-")
        number, ext = second.split('.')
        first, number, ext = first.strip(), number.strip(), ext.strip()
        number = '0' * (6 - len(number)) + number # pad the number to 6 digits
        oname = first + "-" + number + '.' + ext
        os.rename(iname, oname)
    print "Done"
Hope this helps
The simplest method is given below. You can also modify this script for recursive search.
use the os module
get the filenames
os.rename
import os

class Renamer:
    def __init__(self, pattern, extension):
        self.ext = extension
        self.pat = pattern

    def rename(self):
        p, e = (self.pat, self.ext)
        number = 0
        for x in os.listdir(os.getcwd()):
            if x.endswith(f".{e}"):
                os.rename(x, f'{p}_{number}.{e}')
                number += 1

if __name__ == "__main__":
    pattern = "myfile"
    extension = "txt"
    r = Renamer(pattern=pattern, extension=extension)
    r.rename()
