Get file path of continuously updating file - python

I have found a few approaches to search for the newest file created by a user in a directory, but I need to determine if an easier approach exists. Most posts on this topics work in some instances or have major hurdles, so I am hoping to unmuddy the water.
I am having difficulty looking through a growing file system, as well as bringing more users in with more potential errors.
I get data from a Superlogics Winview CP 32 for a continuously streaming system. On each occasion of use of the system, I have the operator input a unique identifier for the file name containing a few of the initial conditions of the system we need to track. I would like to get that file name with no help from the operator/user.
Eventually, the end goal is to pare down a list of files I want to search, filtered based on keys, so my first instinct was to use only matching file types, trim all folders in a pathway into a list, and sort based on max timestamp. I used some pretty common functions from these pages:
def fileWalkIn(path='.',matches=[],filt='*.csv'): # Useful for walking through a given directory
"""Iterates through all files under the given path using a filter."""
for root, dirnames, filenames in os.walk(path):
for filename in fnmatch.filter(filenames, filt):
matches.append(os.path.join(root, filename))
yield os.path.join(root, filename)
def getRecentFile(path='.',matches=[],filt='*.dat'):
rr = max(fileWalkIn(path=path,matches=matches,filt=filt), key=os.path.getmtime)
return rr
This got me far, but is rather bulky and slow, which means I cannot do this repeatedly if I want to explore the files that match, lest I have to carry around a bulky list of the matching files.
Ideally, I will be able to process the data on the fly, executing and printing live while it writes, so this approach is not usable in that instance.
I borrowed from these pages a new approach by alex-martelli, which does not use a filter, gives the option of giving files, opposed to directories, is much slimmer than fileWalkIn, and works quicker if using the timestamp.
def all_subdirs_of(b='.'): # Useful for walking through a given directory
# Create hashable list of files or directories in the parent directory
results = []
for d in os.listdir(b):
bd = os.path.join(b, d)
if os.path.isfile(bd):
results.append(bd)
elif os.path.isdir(bd):
results.append(bd)
# return both
return results
def newest(path='.'):
rr = max(all_subdirs_of(b=path), key=os.path.getmtime)
return rr
def getActiveFile(newFile ='.'):
while os.path.exists(newFile):
newFile = newest(newFile)
if os.path.isfile(newFile):
return newFile
else:
if newFile:
continue
else:
return newFile
This gets me the active file in a directory much more quickly, but only if no other files have written since launching my data collection. I can see all kinds of problems here and need some help determining if I have gone down a rabbit hole and there is a more simple solution, like testing file sizes, or whether a more cohesive solution with less potential snags exists.
I found other answers for different languages (java, how-to-get-the-path-of-a-running-jar-file), but would need something in Python. I have explored functions like watchdog and win32, but both require steep learning curves, and I feel like I am either very close, or need to change my paradigm entirely.

dircache might speed up the second approach a bit. It's a wrapper around listdir that checks the directory timestamp and only re-reads directory contents if there's been a change.
Beyond that you really need something that listens to file system events. A quick google turned up two pip packages, pyinotify for Linux only and watchdog.
Hope this helps.

Related

Find and remove duplicate files using Python

I have several folders which contain duplicate files that have slightly different names (e.g. file_abc.jpg, file_abc(1).jpg), or a suffix with "(1) on the end. I am trying to develop a relative simple method to search through a folder, identify duplicates, and then delete them. The criteria for a duplicate is "(1)" at the end of file, so long as the original also exists.
I can identify duplicate okay, however I am having trouble creating the text string in the right format to delete them. It needs to be "C:\Data\temp\file_abc(1).jpg", however using the code below I end up with r"C:\Data\temp''file_abc(1).jpg".
I have looked at answers [Finding duplicate files and removing them, however this seems to be far more sophisticated than what I need.
If there are better (+simple) ways to do this then I let me know, however I only have around 10,000 files in total in 50 odd folders, so not a great deal of data to crunch through.
My code so far is:
import os
file_path = r"C:\Data\temp"
file_list = os.listdir(file_path)
print (file_list)
for file in file_list:
if ("(1)" in file):
index_no = file_list.index(file)
print("!! Duplicate file, number in list: "+str(file_list.index(file)))
file_remove = ('r"%s' %file_path+"'\'"+file+'"')
print ("The text string is: " + file_remove)
os.remove(file_remove)
Your code is just a little more complex than necessary, and you didn't apply a proper way to create a file path out of a path and a file name. And I think you should not remove files which have no original (i. e. which aren't duplicates though their name looks like it).
Try this:
for file_name in file_list:
if "(1)" not in file_name:
continue
original_file_name = file_name.replace('(1)', '')
if not os.path.exists(os.path.join(file_path, original_file_name):
continue # do not remove files which have no original
os.remove(os.path.join(file_path, file_name))
Mind though, that this doesn't work properly for files which have multiple occurrences of (1) in them, and files with (2) or higher numbers also aren't handled at all. So my real proposition would be this:
Make a list of all files in the whole directory tree below a given start (use os.walk() to get this), then
sort all files by size, then
walk linearly through this list, identify the doubles (which are neighbours in this list) and
yield each such double-group (i. e. a small list of files (typically just two) which are identical).
Of course you should check the contents of these few files then to be sure that not just two of them are accidentally the same size without being identical. If you are sure you have a group of identical ones, remove all but the one with the simplest names (e. g. without suffixes (1) etc.).
By the way, I would call the file_path something like dir_path or root_dir_path (because it is a directory and a complete path to it).

Data structures suitable for searching

I have common problem. I have some data and I want search in them. My issue is, that I dont know a proper data structures and algorhitm suitable for this situation.
There are two kind of objects - Process and Package. Both have some properties, but they are only data structures (dont have any methods). Next, there are PackageManager and something what can be called ProcessManager, which both have function returning list of files that belongs to some Package or files that is used by some Process.
So semantically, we can imagine these data as
Packages:
Package_1
file_1
_ file_2
file_3
Package_2
file_4
file_5
file_6
Actually file that belongs to Package_k can not belong to Package_l for k != l :-)
Processes:
Process_1
file_2
file_3
Process_2
file_1
Files used by processes corresponds to files owned by packages. Also, there the rule doesn't applies on this as for packages - that means, n processes can use one same file at the same time.
Now what is the task. I need to find some match between processes and packages - for given list of packages, I need to find list of processes which uses any of files owned by packages.
My temporary solution was making list of [package_name, package_files] and list of [process_name, process_files] and for every file from every package I was searching through every file of every process searching for match, but of course it could be only temporary solution vzhledem to horrible time complexity (even when I sort the files and use binary search to it).
What you can recommend me for this kind of searching please?
(I am coding it in python)
Doing the matching with sets should be faster:
watched_packages = [Package_1, Package_3] # Packages to consider
watched_files = { # set comprehension
file_
for package in watched_packages
for file_ in package.list_of_files
}
watched_processes = [
process
for process in all_processes
if any(
file_ in watched_files
for file_ in process.list_of_files
)
]
Based on my understanding of what you are trying to do - given a file name, you want to find a list of all the processes that use that file, this snippet of code should help:
from collections import defaultdict
# First make a dictionary that contains a file, and all processes it is a member of.
file_process_map=defaultdict(list)
[file_process_map[fn].append(p) for p in processes for fn in p.file_list]
Basically, we're converting your existing structure (where a process has one or more files) into a structure where we have a filename, and a list of processes that use it.
Now when you have a file you need to search for (in the processes) just look it up in the "file_process_map" dictionary and you'll have a list of all the processes that use the given file.
It is assumed here that "processes" is a list of objects, and each object has a file_list attribute that contains a list of associated files. Obviously, depending on your data structure you might need to alter the code..

What is the fastest method of finding a file in Linux and Windows using Python?

I am writing a plug-in for RawTherapee in Python. I need to extract the version number from a file called 'AboutThisBuild.txt' that may exist anywhere in the directory tree. Although RawTherapee knows where it is installed this data is baked into the binary file.
My plug-in is being designed to collect basic system data when run without any command line parameters for the purpose of short circuiting troubleshooting. By having the version number, revision number and changeset (AKA Mercurial), I can sort out why the script may not be working as expected. OK that is the context.
I have tried a variety of methods, some suggested elsewhere on this site. The main one is using os.walk and fnmatch.
The problem is speed. Searching the entire directory tree is like watching paint dry!
To reduce load I have tried to predict likely hiding places and only traverse these. This is quicker but has the obvious disadvantage of missing some files.
This is what I have at the moment. Tested on Linux but not Windows as yet as I am still researching where the file might be placed.
import fnmatch
import os
import sys
rootPath = ('/usr/share/doc/rawtherapee',
'~',
'/media/CoreData/opt/',
'/opt')
pattern = 'AboutThisBuild.txt'
# Return the first instance of RT found in the paths searched
for CheckPath in rootPath:
print("\n")
print(">>>>>>>>>>>>> " + CheckPath)
print("\n")
for root, dirs, files in os.walk(CheckPath, True, None, False):
for filename in fnmatch.filter(files, pattern):
print( os.path.join(root, filename))
break
Usually 'AboutThisBuild.txt' is stored in a directory/subdirectory called 'rawtherapee' or has the string somewhere in the directory tree. I had naively though I could get the 5000 odd directory names and search these for 'rawtherapee' then use os.walk to traverse those directories but all modules and functions I have looked at collate all files in the directory (again).
Anyone have a quicker method of searching the entire directory tree or am I stuck with this hybrid option?
I am a beginner in Python, but I think I know the simplest way of finding a file in Windows.
import os
for dirpath, subdirs, filenames in os.walk('The directory you wanna search the file in'):
if 'name of your file with extension' in filenames:
print(dirpath)
This code will print out the directory of the file you are searching for in the console. All you have to do is get to the directory.
The thing about searching is that it doesn't matter too much how you get there (eg cheating). Once you have a result, you can verify it is correct relatively quickly.
You may be able to identify candidate locations fairly efficiently by guessing. For example, on Linux, you could first try looking in these locations (obviously not all are directories, but it doesn't do any harm to os.path.isfile('/;l$/AboutThisBuild.txt'))
$ strings /usr/bin/rawtherapee | grep '^/'
/lib/ld-linux.so.2
/H=!
/;l$
/9T$,
/.ba
/usr/share/rawtherapee
/usr/share/doc/rawtherapee
/themes/
/themes/slim
/options
/usr/share/color/icc
/cache
/languages/default
/languages/
/languages
/themes
/batch/queue
/batch/
/dcpprofiles
/#q=
/N6rtexif16NAISOInterpreterE
If you have it installed, you can try the locate command
If you still don't find it, move on to the brute force method
Here is a rough equivalent of strings using Python
>>> from string import printable, whitespace
>>> from itertools import groupby
>>> pathchars = set(printable) - set(whitespace)
>>> with open("/usr/bin/rawtherapee") as fp:
... data = fp.read()
...
>>> for k, g in groupby(data, pathchars.__contains__):
... if not k: continue
... g = ''.join(g)
... if len(g) > 3 and g.startswith("/"):
... print g
...
/lib64/ld-linux-x86-64.so.2
/^W0Kq[
/pW$<
/3R8
/)wyX
/WUO
/w=H
/t_1
/.badpixH
/d$(
/\$P
/D$Pv
/D$#
/D$(
/l$#
/d$#v?H
/usr/share/rawtherapee
/usr/share/doc/rawtherapee
/themes/
/themes/slim
/options
/usr/share/color/icc
/cache
/languages/default
/languages/
/languages
/themes
/batch/queue.csv
/batch/
/dcpprofiles
/#q=
/N6rtexif16NAISOInterpreterE
It sounds like you need a pure python solution here. If not, other answers will suffice.
In this case, you should traverse the folders using a queue and threads. While some may say Threads are never the solution, Threads are a great way of speeding up when you are I/O bound, which you are in this case. Essentially, you'll os.listdir the current dir. If it contains your file, party like it's 1999. If it doesn't, add each subfolder to the work queue.
If you're clever, you can play with depth first vs breadth first traversal to get the best results.
There is a great example I have used quite successfully at work at http://www.tutorialspoint.com/python/python_multithreading.htm. See the section titled Multithreaded Priority Queue. The example could probably be updated to include threadpools though, but it's not necessary.

Python File Creation Date & Rename - Request for Critique

Scenario: When I photograph an object, I take multiple images, from several angles. Multiplied by the number of objects I "shoot", I can generate a large number of images. Problem: Camera generates images identified as, 'DSCN100001', 'DSCN100002", etc. Cryptic.
I put together a script that will prompt for directory specification (Windows), as well as a "Prefix". The script reads the file's creation date and time, and rename the file accordingly. The prefix will be added to the front of the file name. So, 'DSCN100002.jpg' can become "FatMonkey 20110721 17:51:02". The time detail is important to me for chronology.
The script follows. Please tell me whether it is Pythonic, whether or not it is poorly written and, of course, whether there is a cleaner - more efficient way of doing this. Thank you.
import os
import datetime
target = raw_input('Enter full directory path: ')
prefix = raw_input('Enter prefix: ')
os.chdir(target)
allfiles = os.listdir(target)
for filename in allfiles:
t = os.path.getmtime(filename)
v = datetime.datetime.fromtimestamp(t)
x = v.strftime('%Y%m%d-%H%M%S')
os.rename(filename, prefix + x +".jpg")
The way you're doing it looks Pythonic. A few alternatives (not necessarily suggestions):
You could skip os.chdir(target) and do os.path.join(target, filename) in the loop.
You could do strftime('{0}-%Y-%m-%d-%H:%M:%S.jpg'.format(prefix)) to avoid string concatenation. This is the only one I'd reccomend.
You could reuse a variable name like temp_date instead of t, v, and x. This would be OK.
You could skip storing temporary variables and just do:
for filename in os.listdir(target):
os.rename(filename, datetime.fromtimestamp(
os.path.getmtime(filename)).strftime(
'{0}-%Y-%m-%d-%H:%M:%S.jpeg'.format(prefix)))
You could generalize your function to work for recursive directories by using os.walk().
You could detect the file extension of files so it would be correct not just for .jpegs.
You could make sure you only renamed files of the form DSCN1#####.jpeg
Your code is nice and simple. Few possible improvements I can suggest:
Command line arguments is more preferable for dir names because of autocomplition by TAB
EXIF is more accurate source of date and time of photo creating. If you modify photo in image editor, modify time will be changed while EXIF information will be preserved. Here is discussion about EXIF library for Python: Exif manipulation library for python
My only thought is that if you are going to have the computer do the work for you, let it do more of the work. My assumption is that you are going to shoot one object several times, then either move to another object or move another object into place. If so, you could consider grouping the photos by how close the timestamps are together (maybe any delta over 2 minutes is considered a new object). Then based on these pseudo clusters, you could name the photos by object.
May not be what you are looking for, but thought I'd add in the suggestion.

How to get progress of os.walk in python?

I have a piece of code which I'm using to search for the executables of game files and returning the directories. I would really like to get some sort of progress indicator as to how far along os.walk is. How would I accomplish such a thing?
I tried doing startpt = root.count(os.sep) and gauging off of that but that just gives how deep os.walk is in a directory tree.
def locate(filelist, root=os.curdir): #Find a list of files, return directories.
for path, dirs, files in os.walk(os.path.abspath(root)):
for filename in returnMatches(filelist, [k.lower() for k in files]):
yield path + "\\"
It depends!
If the files and directories are distributed more or less evenly you could show rough process by assuming every toplevel directory is going to take the same amount of time. But if they are not distributed evenly you cannot find out about it cheaply. You either have to know roughly how populated every directory is in advance, or you have to os.walk the entire thing twice (but that is only useful if your actual processing takes much longer than the os.walk itself does).
That is: say you have 4 toplevel directories, and each one contains 4 files. If you assume every toplevel dir takes 25% of progress, and each file takes another 25% of the progress for that dir, you can show a nice progress indicator. But if the last subdir turns out to contain many more files than the first few your progress indicator will have hit 75% before you find out about it. You cannot really fix that if the os.walk itself is the bottleneck (not your processing) and it's an arbitrary directory tree (not one where you know in advance roughly how long every subtree is going to take).
And of course that's assuming the cost here is about the same for every file...
I figured this out.
I used os.listdir to get a list of toplevel directories, and then used the .split function on the path that os.walk returned, returning the first level directory that it was currently in.
That left me with a list of toplevel directories, which I could find the index of the current directory of os.walk, and compare the index returned with the length of the list, giving me a % complete. ;)
This doesn't give me a smooth progress, because the level of work done in each directory can vary but smoothing out the progress indicator is of no concern for me. But it could easily be accomplished by extending the path checking deeper into the directory structure.
Here is the final code from getting my progress:
def locateGameDirs(filelist, root=os.curdir): #Find a list of files, return directories.
toplevel = [folder for folder in os.listdir(root) if os.path.isdir(os.path.join(root, folder))] #List of top-level directories
fileset = set(filelist)
for path, dirs, files in os.walk(os.path.abspath(root)):
curdir = path.split('\\')[1] #The directory os.walk is currently in.
try: #Thrown here because there's a nonexistant(?) first entry.
youarehere = toplevel.index(curdir)
progress = int(((youarehere)/len(toplevel))*100)
except:
pass
for filename in returnMatches(filelist, [k.lower() for k in files]):
yield filename, path + "\\", progress
And right now for debugging purposes I'm doing this further in the code:
for wow in locateGameDirs(["wow.exe", "firefox.exe", "vlc.exe"], "C:\\"):
print wow
Is there a nice little way to get rid of that try/except?; it seems the first iteration of path gives me nothing...
Just show an indeterminate progress bar (i.e. the ones that show a blob bouncing back and forth or the barber pole effect). That way users know that the program is doing something useful but doesn't mislead them as far as time to complete and such.
Do it in two passes: first count how many total files/folders are in the tree, and then during the second pass do actual processing.
You need to know the total number of files to do a meaningful progress indicator.
You can get the number of files like this
len(list(os.walk(os.path.abspath(root))))
but that is going to take some time and you probably need a progress indicator for that...
To find the number of files really quickly you'd need a filesystem which keeps track of the number of files for you.
Perhaps you can save the total from a previous run and use that as an estimate
I suggest you avoid walking the directory. Instead use an indexed-based app for quickly finding files. You can use the app's command-line interface via subprocess and find the files almost instantaneously.
On Windows, see Everything. On UNIX, check out locate. Not sure about Mac, but I'm sure there's an option there too.
as I said in the comment, the performance bottle neck likely lies outside of the locate function. your returnMatches is a fairly expensive function. I think you'd be better off replacing it with the following code:
def locate(filelist, root=os.curdir)
fileset = set(filelist) # if possible, pass the set instead of the list as a first argument
for path, dirs, files in os.walk(os.path.abspath(root)):
if any(file.lower() in fileset for file in files):
yield path + '\\'
This way you reduce the number of wasteful operations, yield once per file in the directory (which I think is what you actually indented to do) and you can forget about the progress at the same time. I don't think that progress would be an expected feature of the interface anyway.
Thinking out of the box here...what if you did it based on size:
Use subprocess to run 'du -sb' and get the total_size of your root directory
As you walk, check the size of each file and decrement from your total_size (giving you remaining_size)
pct_complete = (total_size - remaining_size)/total_size
Thoughts?
-aj
One optimisation you could do - you are converting filelist into a set on every call to returnMatches, even though it never changes. move the conversion to the start of the 'locate' function and pass the set in on every iteration.
Well, this was fun. Here is another silly way of doing it, but as everything else, it only calculates the right progress for uniform paths.
import os, sys, time
def calc_progress(progress, root, dirs):
prog_start, prog_end, prog_slice = 0.0, 1.0, 1.0
current_progress = 0.0
parent_path, current_name = os.path.split(root)
data = progress.get(parent_path)
if data:
prog_start, prog_end, subdirs = data
i = subdirs.index(current_name)
prog_slice = (prog_end - prog_start) / len(subdirs)
current_progress = prog_slice * i + prog_start
if i == (len(subdirs) - 1):
del progress[parent_path]
if dirs:
progress[root] = (current_progress, current_progress+prog_slice, dirs)
return current_progress
def walk(start_root):
progress = {}
print 'Starting with {start_root}'.format(**locals())
for root, dirs, files in os.walk(start_root):
print '{0}: {1:%}'.format(root[len(start_root)+1:], calc_progress(progress, root, dirs))

Categories