Find files that have been changed recursively

Find files that have been changed recursively - python

I am attempting to write a simple script to recursively rip through a directory and check if any of the files have been changed. I only have the traversal so far:
import fnmatch
import os
from optparse import OptionParser
rootPath = os.getcwd()
pattern = '*.js'
for root, dirs, files in os.walk(rootPath):
for filename in files:
print( os.path.join(root, filename))
I have two issues:
1. How do I tell if a file has been modified?
2. How can I check if a directory has been modified? - I need to do this because the folder I wish to traverse is huge. If I can check if the dir has been modified and not recursively rip through an unchanged dir, this would greatly help.
Thanks!

If you are comparing two files between two folders, you can use os.path.getmtime() on both files and compare the results. If they're the same, they haven't been modified. Note that this will work on both files and folders.

The typical fast way to tell if a file has been modified is to use os.path.getmtime(path) (assuming a Linux or similar environment). This will give you the modification timestamp, which you can compare to a stored timestamp to determine if a file has been modified.
getmtime() works on directories too, but it will only tell you whether a file has been added, removed or renamed in the directory; it will not tell you whether a file has been modified inside the directory.

This is my own implementation of what you might be looking for. Mind that, beside timestamps you might want to track files that have been added or deleted too (like i do). If not you can just change the code on line:
if now == before:
here is the code:
# check if any txt file in folder "wd" has been modified (rewritten added or deleted)
def src_dir_modified(wd):
now = []
global before
all_files = glob.glob(os.path.join(wd,'*.txt'))
for infile in all_files:
now.append([infile, os.stat(infile).st_mtime])
if now == before: # compare files and their time stamps
return False
else:
before = now
print 'Source code has been modified.'
return True

If you can admit the use of a command-line tool, you could use rsync instead of re-inventing the wheel. rsync uses file modification time and file size to decide if a file has been changed or not.
rsync --verbose --recursive --dry-run dir1 dir2 should get the differences between files in dir1 and dir2. You can write the output to a log file to act on it.

Related

Lost files while tried to move files using python shutil.move

I had 120 files in my source folder which I need to move to a new directory (destination). The destination is made in the function I wrote, based on the string in the filename. For example, here is the function I used.
path ='/path/to/source'
dropbox='/path/to/dropbox'
files = = [os.path.join(path,i).split('/')[-1] for i in os.listdir(path) if i.startswith("SSE")]
sam_lis =list()
for sam in files:
sam_list =sam.split('_')[5]
sam_lis.append(sam_list)
sam_lis =pd.unique(sam_lis).tolist()
# Using the above list
ID = sam_lis
def filemover(ID,files,dropbox):
"""
Function to move files from the common place to the destination folder
"""
for samples in ID:
for fs in files:
if samples in fs:
desination = dropbox + "/"+ samples + "/raw/"
if not os.path.isdir(desination):
os.makedirs(desination)
for rawfiles in fnmatch.filter(files, pat="*"):
if samples in rawfiles:
shutil.move(os.path.join(path,rawfiles),
os.path.join(desination,rawfiles))
In the function, I am creating the destination folders, based on the ID's derived from the files list. When I tried to run this for the first time it threw me FILE NOT exists error.
However, later when I checked the source all files starting with SSE were missing. In the beginning, the files were there. I want some insights here;
Whether or not os.shutil.move moves the files to somewhere like a temp folder instead of destination folder?
whether or not the os.shutil.move deletes the files from the source in any circumstance?
Is there any way I can test my script to find the potential reasons for missing files?
Any help or suggestions are much appreciated?

It is late but people don't understand the op's question. If you move a file into a non-existing folder, the file seems to become a compressed binary and get lost forever. It has happened to me twice, once in git bash and the other time using shutil.move in Python. I remember the python happens when your shutil.move destination points to a folder instead of to a copy of the full file path.
For example, if you run the code below, a similar situation to what the op described will happen:
src_folder = r'C:/Users/name'
dst_folder = r'C:/Users/name/data_images'
file_names = glob.glob(r'C:/Users/name/*.jpg')
for file in file_names:
file_name = os.path.basename(file)
shutil.move(os.path.join(src_folder, file_name), dst_folder)
Note that dst_folder in the else block is just a folder. It should be dst_folder + file_name. This will cause what the Op described in his question. I find something similar on the link here with a more detailed explanation of what went wrong: File moving mistake with Python

shutil.move does not delete your files, if for any reason your files failed to move to a given location, check the directory where your code is stored, for a '+' folder your files are most likely stored there.

Search a directory, including all subdirectories that may or may not exist, for a file in Python.

I want to search a directory, including all subdirectories that may or may not exist, for a file in Python.
I see lots of examples where the directory we are peeking into is known, such as:
os.path.exists(/dir1/myfile.pdf)
...but what if the file I want is located in some arbitrary subdirectory that I don't already know exists or not? For example, the above snippet could never find a file here:
/dir1/dir2/dir3/.../dir20/myfile.pdf
and could clearly never generalize without explicitly running that line 20 times, once for each directory.
I suppose I'm looking for a recursive search, where I don't know the exact structure of the filesystem (if I said that right).

As suggested by #idjaw, try os.walk() like so:
import os
import os.path
for (dir,subdirs,files) in os.walk('/dir1'):
# Don't go into the CVS subdir!
if 'CVS' in subdirs:
subdirs.remove('CVS')
if 'myfile.pdf' in files:
print("Found:", os.path.join(dir, 'myfile.pdf'))

Here is code do find a file (in my case "wsgi.py") below the pwd
import os
for root, dirs, files in os.walk('.'):
if "wsgi.py" in files:
print root
./jg18/blog/blog
./goat/superlists/superlists
./jcg_blog/jcg_blog
./joelgoldstick.com.16/blog/blog
./blankdj19/blank/blank
./cp/cpblog/cpblog
./baseball/baseball_stats/baseball_stats
./zipcodes/zipcodes/zipcodes
./django.1.6.tut/mysite/mysite
./bits/bits/bits
If the file exists only in one dir, it will list one directory

Recreate input folder tree for output of some analyses

I am new to Python and, although having been reading and enjoying it so far, have ∂ experience, where ∂ → 0.
I have a folder tree and each folder at the bottom of the tree's branches contains many files. For me, this whole tree in the input.
I would to perform several steps of analysis (I believe these are irrelavant to this question), the results of which I would like to have returned in an identical tree to that of the input, called output.
I have two ideas:
Read through each folder recursively using os.walk() and for each file to perform the analysis, and
Use a function such as shutil.copytree() and perform the analysis somewhere along the way. So actually, I do not want to COPY the tree at all, rather replicate it's structure but with new files. I thought this might be a kind of 'hack' as I do actually want to use each input file to create the output file, so instead of a copycommand, I need an analyse command. The rest should remain unchanged as far as my imagination allows me to understand.
I have little experience with option 1 and zero experience with option 2.
For smaller trees up until now I have been hard-coding the paths, which has become too time-consuming at this point.
I have also seen more mundane ways, such as using glob to first find all the files I would like and work on them, but I don't know how this might help find a shortcut in recreating the input tree for my output.
My attempt at option 1 looks like this:
import os
for root, dirs, files in os.walk('/Volumes/Mac OS Drive/Data/input/'):
# I have no actual need to print these, it just helps me see what is happening
print root, "\n"
print dirs, "\n"
# This is my actual work going on
[analysis_function(name) for name in files]
however I fear this is going to be very slow, I would also like to do some kind of filtering on files too - for example the .DS_Store files created in mac trees are included in the results of the above. I would attempt to use the fnmatch module to filter only the files I want.
I have seen in the copytree function that it is possible to ignore files according to a pattern, which would be helpful, however I do not understand from the documentation where I could put my analysis function in on each file.

You can use both options: you could provide your custom copy_function that performs analysis instead of the default shutil.copy2 to shutil.copytree() (it is a more of a hack) or you could use os.walk() to have a finer control over the process.
You don't need to create parent directories manually either way. copytree() creates the parent directories for you and os.makedirs(root) can create parent directories if you use os.walk():
#!/usr/bin/env python2
import fnmatch
import itertools
import os
ignore_dir = lambda d: d in ('.git', '.svn', '.hg')
src_dir = '/Volumes/Mac OS Drive/Data/input/' # source directory
dst_dir = '/path/to/destination/' # destination directory
for root, dirs, files in os.walk(src_dir):
for input_file in fnmatch.filter(files, "*.input"): # for each input file
output_file = os.path.splitext(input_file)[0] + '.output'
output_dir = os.path.join(dst_dir, root[len(src_dir):])
if not os.path.isdir(output_dir):
os.makedirs(output_dir) # create destination directories
analyze(os.path.join(root, input_file), # perform analysis
os.path.join(output_dir, output_file))
# don't visit ignored subtrees
dirs[:] = itertools.ifilterfalse(ignore_dir, dirs)

taking data from files which are in folder

How do I get the data from multiple txt files that placed in a specific folder. I started with this could not fix. It gives an error like 'No such file or directory: '.idea' (??)
(Let's say I have an A folder and in that, there are x.txt, y.txt, z.txt and so on. I am trying to get and print the information from all the files x,y,z)
def find_get(folder):
for file in os.listdir(folder):
f = open(file, 'r')
for data in open(file, 'r'):
print data
find_get('filex')
Thanks.

If you just want to print each line:
import glob
import os
def find_get(path):
for f in glob.glob(os.path.join(path,"*.txt")):
with open(os.path.join(path, f)) as data:
for line in data:
print(line)
glob will find only your .txt files in the specified path.
Your error comes from not joining the path to the filename, unless the file was in the same directory you were running the code from python would not be able to find the file without the full path. Another issue is you seem to have a directory .idea which would also give you an error when trying to open it as a file. This also presumes you actually have permissions to read the files in the directory.
If your files were larger I would avoid reading all into memory and/or storing the full content.

First of all make sure you add the folder name to the file name, so you can find the file relative to where the script is executed.
To do so you want to use os.path.join, which as it's name suggests - joins paths. So, using a generator:
def find_get(folder):
for filename in os.listdir(folder):
relative_file_path = os.path.join(folder, filename)
with open(relative_file_path) as f:
# read() gives the entire data from the file
yield f.read()
# this consumes the generator to a list
files_data = list(find_get('filex'))
See what we got in the list that consumed the generator:
print files_data
It may be more convenient to produce tuples which can be used to construct a dict:
def find_get(folder):
for filename in os.listdir(folder):
relative_file_path = os.path.join(folder, filename)
with open(relative_file_path) as f:
# read() gives the entire data from the file
yield (relative_file_path, f.read(), )
# this consumes the generator to a list
files_data = dict(find_get('filex'))
You will now have a mapping from the file's name to it's content.
Also, take a look at the answer by #Padraic Cunningham . He brought up the glob module which is suitable in this case.

The error you're facing is simple: listdir returns filenames, not full pathnames. To turn them into pathnames you can access from your current working directory, you have to join them to the directory path:
for filename in os.listdir(directory):
pathname = os.path.join(directory, filename)
with open(pathname) as f:
# do stuff
So, in your case, there's a file named .idea in the folder directory, but you're trying to open a file named .idea in the current working directory, and there is no such file.
There are at least four other potential problems with your code that you also need to think about and possibly fix after this one:
You don't handle errors. There are many very common reasons you may not be able to open and read a file--it may be a directory, you may not have read access, it may be exclusively locked, it may have been moved since your listdir, etc. And those aren't logic errors in your code or user errors in specifying the wrong directory, they're part of the normal flow of events, so your code should handle them, not just die. Which means you need a try statement.
You don't do anything with the files but print out every line. Basically, this is like running cat folder/* from the shell. Is that what you want? If not, you have to figure out what you want and write the corresponding code.
You open the same file twice in a row, without closing in between. At best this is wasteful, at worst it will mean your code doesn't run on any system where opens are exclusive by default. (Are there such systems? Unless you know the answer to that is "no", you should assume there are.)
You don't close your files. Sure, the garbage collector will get to them eventually--and if you're using CPython and know how it works, you can even prove the maximum number of open file handles that your code can accumulate is fixed and pretty small. But why rely on that? Just use a with statement, or call close.
However, none of those problems are related to your current error. So, while you have to fix them too, don't expect fixing one of them to make the first problem go away.

Full variant:
import os
def find_get(path):
files = {}
for file in os.listdir(path):
if os.path.isfile(os.path.join(path,file)):
with open(os.path.join(path,file), "r") as data:
files[file] = data.read()
return files
print(find_get("filex"))
Output:
{'1.txt': 'dsad', '2.txt': 'fsdfs'}
After the you could generate one file from that content, etc.
Key-thing:
os.listdir return a list of files without full path, so you need to concatenate initial path with fount item to operate.
there could be ideally used dicts :)
os.listdir return files and folders, so you need to check if list item is really file

You should check if the file is actually file and not a folder, since you can't open folders for reading. Also, you can't just open a relative path file, since it is under a folder, so you should get the correct path with os.path.join. Check below:
import os
def find_get(folder):
for file in os.listdir(folder):
if not os.path.isfile(file):
continue # skip other directories
f = open(os.path.join(folder, file), 'r')
for line in f:
print line

Python script errors out

I have this script, which I have no doubt is flawed:
import fnmatch, os, sys
def findit (rootdir, find, pattern):
for folder, dirs, files in os.walk(rootdir):
print (folder)
for filename in fnmatch.filter(files,pattern):
with open(filename) as f:
s = f.read()
f.close()
if find in s :
print(filename)
findit(sys.argv[1], sys.argv[2], sys.argv[3])
when I run it I get Errno2, no such file or directory. BUT the file exists. For instance if I execute it by going: findit.py c:\python "folder" *.py it will work just fine, listing all the *.py files which contain the word "folder". BUT if I go findit.py c:\php\projects1 "include" *.php
as an example I get [Errno2] no such file or directory: 'About.php' (for example). But About.php exists. I don't understand what it's doing, or what I'm doing wrong.

If you look at any of the examples for os.walk, you'll see that they all do os.path.join(root, name). You need to do that too.
Why? Quoting from the docs:
filenames is a list of the names of the non-directory files in dirpath. Note that the names in the lists contain no path components. To get a full path (which begins with top) to a file or directory in dirpath, do os.path.join(dirpath, name).
If you just use the filename as a path, it's going to look for a file of the same name in the current working directory. If there's no such file, you'll get a FileNotFoundError. If there is such a file, you'll open and read the wrong file. Only if you happen to be looking inside the current working directory will it work.
There's also another major problem in your code: os.walk walks a directory tree recursively, finding all files in the given top directory, or any subdirectory of top, or any subdirectory of… and so on, yielding once for each directory. But you're not doing anything useful with that (except printing out the folders). Instead, you wait until it finishes, and then use the files from whichever directory it happened to reach last.
If you just want to get a flat listing of the files directly in a directory, use os.listdir, not os.walk. (Or maybe use glob.glob instead of explicitly listing everything then filtering with fnmatch.)
On the other hand, if you want to walk the tree, you have to move your second for loop inside the first one.
You've also got a minor problem: You call f.close() inside a with open(…) as f:, which leads to f being closed twice. This is guaranteed to be completely harmless (at least in 2.5+, including 3.x), but it's still a bad idea.
Putting it together, here's a working version of your code:
def findit (rootdir, find, pattern):
for folder, dirs, files in os.walk(rootdir):
print (folder)
for filename in fnmatch.filter(files,pattern):
pathname = os.path.join(folder, filename)
with open(pathname) as f:
s = f.read()
if find in s:
print(pathname)

You are using a relative filename. But your current directory does not contain the file. And you don't want to search there anyway. Use os.path.join(folder, filename) to make an absolute path.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.