Recursive Searching and MySQL Comparison - Python

Good evening. I am looking at developing some code that will collect EXIF data from JPEG images and store it in a MySQL database using Python v2.x. The stumbling block is that the JPEGs are scattered across a number of subdirectories, and further subdirectories, within a root folder: for example, 200 JPEGs may be stored in root > subroot > subsubroot1, with a further 100 in root > subroot > subroot2. Once all images are identified, they will be scanned and their respective EXIF data extracted before being added to a MySQL table.
At the moment I am just at the planning stage, but I am wondering: what would be the most efficient and Pythonic way to carry out the recursive searching? I am looking to scan the root directory, append any newly identified subdirectories to a list, and then scan all the subdirectory paths in that list for further subdirectories, until I have a complete list of all directories. This just seems a clumsy and rather repetitive approach, though, IMHO, so I assume there may be a more OOP manner of carrying out this function.
Similarly, I am only looking to add new info to my MySQL table, so what would be the most efficient way to establish whether an entry already exists? The filename, both in the table and for the JPEG file itself, will be its MD5 hash value. I was considering scanning through the table at the start of the code and placing all the filenames in a set, so that before scanning a new JPEG I could check the set: if an entry already exists there, there would be no need to extract the EXIF, and I could move on to the next picture. Is this an efficient method, though, or would it be better to query the MySQL table each time a new image is encountered? I anticipate the set method may be the most efficient; however, the table may eventually contain tens of millions of entries, so holding all of those filenames in a set (volatile memory) may not be the best idea.
Thanks folks.

I would just write a function that scanned a directory for all files; if it's a jpeg, add the full path name of the jpeg to the list of results. If it's a directory, then immediately call the function with the newly discovered directory as an argument. If it's another type of file, do nothing. This is a classic recursive divide-and-conquer strategy. It will break if there are loops in your directory path, for instance with symlinks -- if this is a danger for you, then you'll have to make sure you don't traverse the same directory twice by finding the "real" non-symlinked path of each directory and recording it.
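As a minimal sketch of that traversal (using os.walk, which performs the same recursion internally and, by default, does not follow symlinked directories), something like the following would do; the extension check is just an assumption about how the JPEGs are named:

import os

def find_jpegs(root):
    # Walk the whole tree under root; os.walk recurses for you and,
    # with the default followlinks=False, won't loop through symlinks.
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.lower().endswith((".jpg", ".jpeg")):
                yield os.path.join(dirpath, name)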
How to avoid duplicate entries is a trickier problem and you have to consider whether you are tolerant of two differently-named files with the exact same contents (and also consider the edge cases of symlinked or multiply-hard-linked files), how new files appear in the directories you are scanning and whether you have any control over that process. One idea to speed it up would be to use os.path.getmtime(). Record the moment you start the directory traversal process. Next time around, have your recursive traversal process ignore any jpeg files with an mtime older than your recorded time. This can't be your only method of keeping track because files modified between the start and end times of your process may or may not be recorded, so you will still have to check the database for those records (for instance using the full path, a hash on the file info or a hash on the data itself, depending on what kind of duplication you're intolerant of) but used as a heuristic it should speed up the process greatly.
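A rough illustration of that mtime heuristic (last_run_time is whatever timestamp you recorded at the start of the previous traversal; the helper name is mine):

import os

def needs_checking(path, last_run_time):
    # Files modified before the last recorded run were (presumably) already
    # handled; only newer ones still need to be checked against the database.
    return os.path.getmtime(path) >= last_run_time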
You could theoretically load all filenames (probably paths and not filenames) from the database into memory to speed up comparison, but if there's any danger of the table becoming very large it would be better to leave that info in the database. For instance, you could create a hash from the filename, and then simply add that to the database with a UNIQUE constraint -- the database will reject any duplicate entries, you can catch the exception and go on your way. This won't be slow if you use the aforementioned heuristic checking file mtime.
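A hedged sketch of that "let the database reject duplicates" idea, assuming a DB-API driver such as MySQLdb and an images table with a UNIQUE index on filename (the table, column and connection details are all placeholders):

import MySQLdb

conn = MySQLdb.connect(host="localhost", user="user", passwd="secret", db="photos")
cur = conn.cursor()

def store_exif(md5_name, exif_blob):
    # Rely on the UNIQUE index to reject a duplicate filename instead of
    # checking for its existence first.
    try:
        cur.execute("INSERT INTO images (filename, exif) VALUES (%s, %s)",
                    (md5_name, exif_blob))
        conn.commit()
    except MySQLdb.IntegrityError:
        conn.rollback()  # already in the table -- skip it and move on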
Be sure you account for the possibility of files that may be only modified and not newly created, if that's important for your application.

Related

File update : multiple versions stored inside the ZIP archive

Let's say we have a test.zip file and we update a file:
import zipfile

zfh = zipfile.ZipFile("test.zip", mode="a")
zfh.write("/home/msala/test.txt")
zfh.close()
Repeating this "update" a few times and then using the built-in method printdir(), I see that the archive stores not only the latest "test.txt" but also all the previous copies of the file.
OK, I understand that the zipfile library doesn't have a delete method.
Questions:
if I call the builtin method extract("/home/msala/test.txt"), which copy of the file is extracted and written to the file system?
inside the zip archive, is there any flag telling that the old copies are just that: old copies, superseded by the last one?
At the moment I list all the stored files and sort them by filename, last modification time...
The tl;dr is no, you can't do this without building a bit of extra info—but that can be done without sorting, and, even if you did have to sort, the performance cost would be irrelevant.
First, let me explain how zipfiles work. (Even if you understand this, later readers with the same problem may not.)
Unfortunately, the specification is a copyrighted and paywalled ISO document, so I can't link to it or quote it. The original PKZip APPNOTE.TXT, which is the de facto standard, is available, however. And numerous sites like Wikipedia have nice summaries.
A zipfile is 0 or more fragments, followed by a central directory.
Fragments are just treated as if they were all concatenated into one big file.
The body of the file can contain zip entries, in arbitrary order, along with anything you want. (This is how DOS/Windows self-extracting archives work—the unzip executable comes at the start of the first fragment.) Anything that looks like a zip entry, but isn't referenced by the central directory, is not treated as a zip entry (except when repairing a corrupted zipfile.)
Each zip entry starts with a header that gives you the filename, compression format, etc. of the following data.
The directory is a list of directory entries that contain most of the same information, plus a pointer to where to find the zip entry.
It's the order of directory entries that determines the order of the files in the archive.
if I call the builtin method extract("/home/msala/test.txt"), which copy of the file is extracted and written to the file system?
The behavior isn't really specified anywhere.
Extracting the whole archive should extract both files, in the order present in the zip directory (the same order given by infolist), with the second one overwriting the first.
But extracting by name doesn't have to give you both—it could give you the last one, or the first, or pick one at random.
Python gives you the last. The way this works is that, when reading the directory, it builds a dict mapping filenames to ZipInfos, just adding them as encountered, so the last one will overwrite the previous ones. (Here's the 3.7 code.) Whenever you try to access something by filename, it just looks up the filename in that dict to get the ZipInfo.
But is that something you want to rely on? I'm not sure. On the one hand, this behavior has been the same from Python 1.6 to 3.7, which is usually a good sign that it's not going to change, even if it's never been documented. On the other hand, there are open issues—including #6818, which is intended to add deletion support to the library one way or another—that could change it.
And it's really not that hard to do the same thing yourself. With the added benefit that you can use a different rule—always keep the first, always keep the one with the latest mod time, etc.
You seem to be worried about the performance cost of sorting the infolist, which is probably not worth worrying about. The time it takes to read and parse the zip directory is going to make the cost of your sort virtually invisible.
But you don't really need to sort here. After all, you don't want to be able to get all of the entries with a given name in some order, you just want to get one particular entry for each name. So, you can just do what ZipFile does internally, which takes only linear time to build, and constant time each time you search it. And you can use any rule you want here.
entries = {}
for entry in zfh.infolist():
    if entry.filename not in entries:
        entries[entry.filename] = entry
This keeps the first entry for any name. If you want to keep the last, just remove the if. If you want to keep the latest by modtime, change the condition to if entry.filename not in entries or entry.date_time > entries[entry.filename].date_time:. And so on.
Now, instead of relying on what happens when you call extract("home/msala/test.txt"), you can call extract(entries["home/msala/test.txt"]) and know that you're getting the first/last/latest/whatever file of that name.
inside the zip archive, is there any flag telling that the old copies are just that: old copies, superseded by the last one?
No, not really.
The way to delete a file is to remove it from the central directory. Which you do just by rewriting the central directory. Since it comes at the end of the zipfile, and is almost always more than small enough to fit on even the smallest floppy, this was generally considered fine even back in the DOS days.
(But notice that if you unplug the computer in the middle of it, you've got a zipfile without a central directory, which has to be rebuilt by scanning all of the file entries. So, many newer tools will instead, at least for smaller files, rewrite the whole file to a tempfile then rename it over the original, to guarantee a safe, atomic write.)
At least some early tools would sometimes, especially for gigantic archives, rewrite the entry's pathname's first byte with a NUL. But this doesn't really mark the entry as deleted, it just renames it to "\0ome/msala/test.txt". And many modern tools will in fact treat it as meaning exactly that and give you weird errors telling you they can't find a directory named 'ome' or '' or something else fun. Plus, this means the filename in the directory entry no longer matches the filename in the file entry header, which will cause many modern tools to flag the zipfile as corrupted.
At any rate, Python's zipfile module doesn't do either of these, so you'd need to subclass ZipFile to add the support yourself.
I solved it this way, similar to how database records are managed.
When adding a file to the archive, I look for previously stored copies (same filename).
For each of them, I set the "comment" field to a specific marker, for example "deleted".
Then I add the new file, with an empty comment.
Whenever we like, we can "vacuum": shrink the zip archive using the usual tools (under the hood a new archive is created, discarding the files whose comment is set to "deleted").
This way we also get simple "versioning": all the previous copies of a file are kept until the vacuum.
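A rough sketch of that comment-marker approach (the "deleted" marker and the helper names are mine; persisting a changed ZipInfo.comment on an existing entry relies on zipfile rewriting the central directory when the archive is closed after the append, so treat this as an illustration of the idea rather than a hardened implementation):

import os
import zipfile

DELETED = b"deleted"   # per-entry comment used as a tombstone marker

def add_superseding(archive, path, arcname):
    # Mark older copies of arcname as "deleted", then append the new copy.
    with zipfile.ZipFile(archive, mode="a") as zfh:
        for info in zfh.infolist():
            if info.filename == arcname:
                info.comment = DELETED
        zfh.write(path, arcname)

def vacuum(archive):
    # Rebuild the archive, dropping every entry marked as "deleted".
    tmp = archive + ".tmp"
    with zipfile.ZipFile(archive) as src, zipfile.ZipFile(tmp, "w") as dst:
        for info in src.infolist():
            if info.comment != DELETED:
                dst.writestr(info, src.open(info).read())
    os.replace(tmp, archive)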

Performance of walking directory structure vs reading file with contents of ls (or similar command) for performing search

Is it better to walk a directory structure when performing multiple searches or is it a good idea to catalog the directory structure (in a file or memory) and then operate on that catalog? Or are there other methods which are better suited which I haven't hit upon?
I have a 3.5TB external HDD with thousands of files.
I have a set of files which list the contents of a directory. These listing files hold a folder name, filenames and file sizes.
I want to search the external HDD for the files in these listing files. If a file is found I then want to check and see if the file size of the actual file matches that in the listing file.
This process will cover about 1000 listing files and probably 10s of thousands of actual files.
A listing file would have contents like
folder: SummerPhotos
name: IMG0096.jpg, length: 6589
name: IMG0097.jpg, length: 6489
name: IMG0098.jpg, length: 6500
name: IMG0099.jpg, length: 6589
name: BeachPhotos/IMG0100.jpg, length, 34892
name: BeachPhotos/IMG0101.jpg, length, 34896
I like the offline processing of the listing files against a file which lists the contents of the external HDD, because then I can perform this operation on a faster computer (the hard drive is attached to an old computer acting as a server), or split the listing files over several computers and divide the work. Plus, I think that continually walking the directory structure is about as inefficient as you can get and puts unnecessary wear on the hardware.
Walk pseudo code:
import os

for listing_file in listing_files:
    base_foldername, filelist = get_listing(listing_file)   # folder name + [(name, size), ...]
    for root, subfolders, files in os.walk("/path/to/3.5TBdrive"):
        if base_foldername in subfolders:
            folder = os.path.join(root, base_foldername)
            for name, size in filelist:
                path = os.path.join(folder, name)
                if os.path.isfile(path):
                    if size == os.path.getsize(path):
                        dosomething(path)
                    else:
                        somethingelse(path)
                else:
                    not_found(path)
For the catalog file method I'm thinking of dumping a recursive 'ls' to a file and then pretty much doing a string search on that file. I'll extract the file size and perform a match there.
My 'ls -RlQ' dump file is 11MB in size with ~150k lines. If there is a better way to get the required data, I'm open to suggestions. I'm thinking of using os.walk() to compile a list and create my own file in a format I like, rather than trying to parse the ls output.
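A small sketch of that "build my own catalog with os.walk()" idea (the function name and the tab-separated format are just illustrative choices):

import os

def dump_catalog(drive_root, catalog_path):
    # One pass over the drive, writing "relative/path<TAB>size" per file;
    # this is much easier to search later than parsed `ls -RlQ` output.
    with open(catalog_path, "w") as out:
        for root, dirs, files in os.walk(drive_root):
            for name in files:
                full = os.path.join(root, name)
                rel = os.path.relpath(full, drive_root)
                out.write("%s\t%d\n" % (rel, os.path.getsize(full)))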
I feel like I should be doing something to make my college professors proud, such as building a hash table or a balanced tree, but I suspect the effort to implement that would take longer than simply brute-forcing the solution with CPU cycles.
OS: Linux
preferred programming language: Python 2/3
Thanks!
Is it better to walk a directory structure when performing multiple searches or is it a good idea to catalog the directory structure (in a file or memory) and then operate on that catalog?
If you just want to check if the file exists or the directory structure is not too complex, I suggest you to just use your filesystem. You're basically duplicating the work that it already does anyway and this will lead to problems in the future, as complexity always does.
I don't see any point in using hashtables or balanced trees as in-program data structures -- that is also what your filesystem already does. What you should do instead to speed up lookups is design a deep directory structure rather than a few single directories containing thousands of files. Some filesystems choke when trying to list directories with tens of thousands of files, so it is a better idea to limit yourself to a few thousand per directory and create a new level of directory depth if you exceed that.
For example, suppose you keep logs of internet-wide scanning research, with a single file per scanned host. You don't want a directory scanning-logs with files such as 1.1.1.1.xml, 1.1.1.2.xml and so on; instead, a naming scheme such as scanning-logs/1/1/1.1.1.1.xml is a better idea.
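For instance, a tiny helper along those lines (purely illustrative, sharding on the first two octets):

import os

def log_path(ip, base="scanning-logs"):
    # 1.1.1.1 -> scanning-logs/1/1/1.1.1.1.xml, so no single directory grows too large
    first, second = ip.split(".")[:2]
    return os.path.join(base, first, second, ip + ".xml")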
Also, watch out for the inode limit! I was once building a large file-based database on an ext4 filesystem. One day I started getting error messages like "no space left on device" even though I clearly had quite a lot of space left. The real reason was that I had run out of inodes -- the limit can be set manually when creating a volume.

How can I find if the contents in a Google Drive folder have changed

I am currently working on an app that syncs one specific folder in a user's Google Drive. I need to find out when any of the files/folders in that specific folder have changed. The actual syncing process is easy, but I don't want to do a full sync every few seconds.
I am considering one of these methods:
1) Monitor the changes feed and look for any file changes
This method is easy but it will cause a sync if ANY file in the drive changes.
2) Frequently request all files in the whole drive, e.g. service.files().list().execute(), and look for changes within the specific tree. This is a brute-force approach. It will be too slow if the user has thousands of files in their drive.
3) Start at the specific folder, and move down the folder tree looking for changes.
This method will be fast if there are only a few directories in the specific tree, but it will still lead to numerous API requests.
Are there any better ways to find out whether a specific folder and its contents have changed?
Are there any optimisations I could apply to methods 1, 2 or 3?
As you have correctly stated, you will need to keep (or work out) the file hierarchy for a changed file to know whether a file has changed within a folder tree.
There is no way of knowing directly from the changes feed whether a deeply nested file within a folder has been changed. Sorry.
There are a couple of tricks that might help.
Firstly, if your app is using drive.file scope, then it will only see its own files. Depending on your specific situation, this may equate to your folder hierarchy.
Secondly, files can have multiple parents. So when creating a file in folder-top/folder-1/folder-1a/folder-1ai, you could declare both folder-1ai and folder-top as parents. Then you simply need to check for folder-top.
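A hedged sketch of that second trick, assuming the Drive v2 Python client the question is already using (the folder IDs, title and helper name here are placeholders):

def insert_with_two_parents(service, title, leaf_folder_id, top_folder_id):
    # The file lives in the deep folder but is also parented to the top-level
    # folder, so changes can be filtered on the top folder alone.
    body = {
        "title": title,
        "parents": [{"id": leaf_folder_id}, {"id": top_folder_id}],
    }
    return service.files().insert(body=body).execute()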

Data structures in python: maintaining filesystem structure within a database

I have a data organization issue. I'm working on a client/server project where the server must maintain a copy of the client's filesystem structure inside of a database that resides on the server. The idea is to display the filesystem contents on the server side in an AJAX-ified web interface. Right now I'm simply uploading a list of files to the database where the files are dumped sequentially. The problem is how to recapture the filesystem structure on the server end once they're in the database. It doesn't seem feasible to reconstruct the parent->child structure on the server end by iterating through a huge list of files. However, when the file objects have no references to each other, that seems to be the only option.
I'm not entirely sure how to handle this. As near as I can tell, I would need to duplicate some type of filesystem data structure on the server side (in a Btree perhaps?) with objects maintaining pointers to their parents and/or children. I'm wondering if anyone has had any similar past experiences they could share, or maybe some helpful resources to point me in the right direction.
I suggest following the Unix way. Each file is considered a stream of bytes, nothing more, nothing less. Each file is technically represented by a single structure called an i-node (index node) that keeps all the information related to the physical stream of data (including attributes, ownership, ...).
The i-node does not contain anything about the readable name. Each i-node is given a unique number (forever) that acts as the file's technical name. You can use a similar number to give each stream of bytes in the database its unique identification. The i-nodes are stored on the disk in a separate contiguous section -- think of an array of i-node structures (in the abstract sense), or of a separate table in the database.
Back to the file. This way it is represented by a unique number. For your database representation, that number will be the unique key. If you need the other i-node information (file attributes), you can add further columns to the table. One column will be of the blob type, and it will represent the content of the file (the stream of bytes). For AJAX, I guess the files will be rather small, so you should not have a problem with the size limits of the blob.
So far, the files are stored as a flat structure (as the physical disk is, and as the relational database is).
The structure of directory and file names is kept separately, in other files (stored in the same structure, together with the other files, and also represented by their i-nodes). Basically, a directory file captures tuples of (bare_name, i-node number). (This is how hard links are implemented in Unix -- two names are paired with the same i-node number.) The root directory file has to have a fixed technical identification -- i.e. a reserved i-node number.
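One possible translation of that model into two tables (table and column names are mine; sqlite3 is used only to keep the sketch self-contained, any SQL backend would do):

import sqlite3

conn = sqlite3.connect("fs.db")
conn.executescript("""
    -- one row per 'i-node': content and attributes, but no readable name
    CREATE TABLE IF NOT EXISTS inodes (
        ino   INTEGER PRIMARY KEY,
        mode  TEXT,
        data  BLOB
    );
    -- one row per directory entry: a readable name pointing at an i-node
    CREATE TABLE IF NOT EXISTS dirents (
        parent_ino INTEGER NOT NULL REFERENCES inodes(ino),
        name       TEXT    NOT NULL,
        ino        INTEGER NOT NULL REFERENCES inodes(ino),
        PRIMARY KEY (parent_ino, name)
    );
""")
conn.commit()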
If by "database" you mean an SQL database, then the magic word you're looking for is "self-referential tables" or, alternatively "modified pre-ordered tree traversal" (MPTT)
Basically, the first approach deals with "nodes" which have id, parent_id and name attributes. So, to select the root-level directories you would do something like
SELECT id, name FROM mytable WHERE parent_id IS NULL AND kind = 'directory';
which let's assume returns you
[(1, "Documents and Settings"), (2, "Program Files"), (3, "Windows")]
then, to get directories inside "Documents and Settings" you issue another query:
SELECT id, name FROM mytable WHERE parent_id = 1 AND kind = 'directory';
and so on. Simple!
MPTT is a little bit trickier, but you'll find a good tutorial, for example, on Wikipedia. This approach is very efficient for queries like "find all children of a given node" or "how many files are in this directory including subdirectories", and is less efficient when the tree changes, as you'll need to re-order all the nodes.
Since you're using Python, you must be using an ORM; you're not going to build those queries manually, right? SQLAlchemy is capable of modelling self-referential relations, including "eagerly loading" the tree up to a certain depth with a single query.
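A minimal sketch of such a self-referential (adjacency list) model in SQLAlchemy; the class, table and column names are illustrative:

from sqlalchemy import Column, Integer, String, ForeignKey, create_engine
from sqlalchemy.orm import relationship, backref, sessionmaker
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

class Node(Base):
    # One row per file or directory; parent_id makes the table self-referential.
    __tablename__ = "nodes"
    id = Column(Integer, primary_key=True)
    parent_id = Column(Integer, ForeignKey("nodes.id"), nullable=True)
    name = Column(String(255), nullable=False)
    kind = Column(String(16), nullable=False)   # "file" or "directory"
    children = relationship("Node", backref=backref("parent", remote_side=[id]))

engine = create_engine("sqlite://")
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()

root = Node(name="Documents and Settings", kind="directory")
session.add(root)
session.add(Node(name="My Pictures", kind="directory", parent=root))
session.commit()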

How does git fetch the commits associated with a file?

I'm writing a simple parser of .git/* files. I have covered almost everything: objects, refs, pack files, etc. But I have a problem. Let's say I have a big 300M repository (in a pack file) and I want to find all the commits which changed /some/deep/inside/file. What I'm doing now is:
fetching last commit
finding a file in it by:
fetching parent tree
finding out a tree inside
recursively repeat until I get into the file
additionally I'm checking the hashes of each subfolder on the way to the file. If one of them is the same as in the commit before, I assume the file was not changed (because its parent dir didn't change)
then I store the hash of a file and fetch parent commit
finding file again and check if hash change occurs
if yes, then the original commit (i.e. the one before the parent) changed the file
And I repeat it over and over until I reach very first commit.
This solution works, but it sucks. In the worst-case scenario, the first search can take up to 3 minutes (for a 300M pack).
Is there any way to speed it up? I tried to avoid loading such large objects into memory, but right now I don't see any other way. And even then, the initial memory load will take forever :(
Greetings, and thanks for any help!
That's the basic algorithm that git uses to track changes to a particular file. That's why "git log -- some/path/to/file.txt" is a comparatively slow operation, compared to many other SCM systems where it would be simple (e.g. in CVS, P4 et al. each repo file is a server file with the file's own history).
It shouldn't take so long to evaluate, though: the amount you ever have to keep in memory is quite small. You already mentioned the main point: remember the tree IDs going down the path, to quickly eliminate commits that didn't even touch that subtree. It's rare for tree objects to be very big, just like directories on a filesystem (unsurprisingly).
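A hedged sketch of that walk, restricted to first-parent history for brevity; read_commit and read_tree are hypothetical hooks into your own parser (sha -> (tree_sha, parent_shas) and sha -> {name: sha} respectively), not real library calls:

def blob_changed(tree_a, tree_b, path_parts, read_tree):
    # Descend both root trees in lockstep and bail out as soon as the
    # subtree shas match -- that is the shortcut described above.
    a, b = tree_a, tree_b
    for part in path_parts:
        if a == b:                     # identical subtree: nothing below changed
            return False
        a = read_tree(a).get(part) if a else None
        b = read_tree(b).get(part) if b else None
    return a != b

def commits_touching(head_sha, path, read_commit, read_tree):
    # Walk first-parent history, yielding commits whose tree changes `path`.
    parts = path.strip("/").split("/")
    sha = head_sha
    while sha is not None:
        tree, parents = read_commit(sha)
        parent = parents[0] if parents else None
        parent_tree = read_commit(parent)[0] if parent else None
        if blob_changed(tree, parent_tree, parts, read_tree):
            yield sha
        sha = parent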
Are you using the pack index? If you're not, then you essentially have to unpack the entire pack to find this out, since trees could be at the end of a long delta chain. If you have an index, you'll still have to apply deltas to get your tree objects, but at least you should be able to find them quickly. Keep a cache of applied deltas, since it's obviously very common for trees to reuse the same or similar bases -- most tree object changes just change 20 bytes from a previous tree object. So if, in order to get tree T1, you have to start with object T8 and apply Td7 to get T7, then T6, etc., it's entirely likely that these other trees T2-T8 will be referenced again.
