We have a script that's generating one Excel file for each feature in a report. However, those files don't have the titles we need, so we have to rename them manually with the right filename (sometimes hundreds of files are generated). Also, inside each of those files, we have to modify one cell to match the new title, inserting that string in a particular cell.
I was thinking about writing a Python script to modify all the files in a particular folder (Python being the language I know "the most", though I'm not proficient in it).
I would like to know how I could accomplish this, which libraries I should use, and whether any of you have done something similar in the past and have recommendations.
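For the .xlsx side, openpyxl is one commonly used library. Here is a rough sketch of the kind of loop involved; the folder name, the renaming rule, and the target cell A1 are placeholders rather than details from the actual report:

import os
import openpyxl  # assumed: one common library for .xlsx files

def desired_title(old_name):
    # Hypothetical stand-in for however you derive the right title.
    return old_name[:-len(".xlsx")] + "_renamed"

folder = "reports"  # placeholder path
for old_name in os.listdir(folder):
    if not old_name.endswith(".xlsx"):
        continue
    new_title = desired_title(old_name)
    path = os.path.join(folder, old_name)
    wb = openpyxl.load_workbook(path)
    wb.active["A1"] = new_title  # the "particular cell" that must match the title
    wb.save(os.path.join(folder, new_title + ".xlsx"))
    os.remove(path)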
Let's say we have a test.zip file and we update a file:
import zipfile

zfh = zipfile.ZipFile("test.zip", mode="a")  # "a" appends to an existing archive
zfh.write("/home/msala/test.txt")
zfh.close()
Repeating this "update" a few times and then using the built-in method printdir(), I see that the archive stores not only the last "test.txt" but also all the previous copies of the file.
Ok, I understand the zipfile library doesn't have a delete method.
Questions:
if I call the built-in method extract("/home/msala/test.txt"), which copy of the file is extracted and written to the file system?
inside the zip archive, is there any flag marking the old copies as superseded by the last one?
At the moment I list all the stored files and sort them by filename, last modification time...
The tl;dr is no, you can't do this without building a bit of extra info—but that can be done without sorting, and, even if you did have to sort, the performance cost would be irrelevant.
First, let me explain how zipfiles work. (Even if you understand this, later readers with the same problem may not.)
Unfortunately, the specification is a copyrighted and paywalled ISO document, so I can't link to it or quote it. However, the original PKZip APPNOTE.TXT, which is the de facto standard everyone follows, is available. And numerous sites like Wikipedia have nice summaries.
A zipfile is 0 or more fragments, followed by a central directory.
Fragments are just treated as if they were all concatenated into one big file.
The body of the file can contain zip entries, in arbitrary order, along with anything you want. (This is how DOS/Windows self-extracting archives work—the unzip executable comes at the start of the first fragment.) Anything that looks like a zip entry, but isn't referenced by the central directory, is not treated as a zip entry (except when repairing a corrupted zipfile.)
Each zip entry starts with a header that gives you the filename, compression format, etc. of the following data.
The directory is a list of directory entries that contain most of the same information, plus a pointer to where to find the zip entry.
It's the order of directory entries that determines the order of the files in the archive.
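You can see that layout from Python: infolist() yields entries in central-directory order, and each ZipInfo carries header_offset, the directory's pointer back to the entry's header in the body. A small sketch using the question's test.zip:

import zipfile

with zipfile.ZipFile("test.zip") as zfh:
    for info in zfh.infolist():
        # directory order, plus the pointer to each zip entry's header
        print(info.header_offset, info.filename, info.date_time)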
if I call the built-in method extract("/home/msala/test.txt"), which copy of the file is extracted and written to the file system?
The behavior isn't really specified anywhere.
Extracting the whole archive should extract both files, in the order present in the zip directory (the same order given by infolist), with the second one overwriting the first.
But extracting by name doesn't have to give you both—it could give you the last one, or the first, or pick one at random.
Python gives you the last. The way this works is that, when reading the directory, it builds a dict mapping filenames to ZipInfos, just adding them as encountered, so the last one will overwrite the previous ones. (You can see this in the CPython 3.7 source.) Whenever you try to access something by filename, it just looks up the filename in that dict to get the ZipInfo.
But is that something you want to rely on? I'm not sure. On the one hand, this behavior has been the same from Python 1.6 to 3.7, which is usually a good sign that it's not going to change, even if it's never been documented. On the other hand, there are open issues—including #6818, which is intended to add deletion support to the library one way or another—that could change it.
And it's really not that hard to do the same thing yourself. With the added benefit that you can use a different rule—always keep the first, always keep the one with the latest mod time, etc.
You seem to be worried about the performance cost of sorting the infolist, which is probably not worth worrying about. The time it takes to read and parse the zip directory is going to make the cost of your sort virtually invisible.
But you don't really need to sort here. After all, you don't want to be able to get all of the entries with a given name in some order, you just want to get one particular entry for each name. So, you can just do what ZipFile does internally, which takes only linear time to build, and constant time each time you search it. And you can use any rule you want here.
entries = {}
for entry in zfh.infolist():
    if entry.filename not in entries:
        entries[entry.filename] = entry
This keeps the first entry for any name. If you want to keep the last, just remove the if. If you want to keep the latest by modification time, change the condition to: if entry.filename not in entries or entry.date_time > entries[entry.filename].date_time:. And so on.
Now, instead of relying on what happens when you call extract("home/msala/test.txt"), you can call extract(entries["home/msala/test.txt"]) and know that you're getting the first/last/latest/whatever file of that name.
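For example, the keep-latest-by-modtime rule spelled out, reusing the zfh handle from above:

entries = {}
for entry in zfh.infolist():
    # keep the entry with the newest stored timestamp for each name
    if (entry.filename not in entries
            or entry.date_time > entries[entry.filename].date_time):
        entries[entry.filename] = entry

# extract() accepts a ZipInfo, pinning down exactly which copy you get
zfh.extract(entries["home/msala/test.txt"])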
inside the zip archive, is there any flag marking the old copies as superseded by the last one?
No, not really.
The way to delete a file is to remove it from the central directory. Which you do just by rewriting the central directory. Since it comes at the end of the zipfile, and is almost always more than small enough to fit on even the smallest floppy, this was generally considered fine even back in the DOS days.
(But notice that if you unplug the computer in the middle of it, you've got a zipfile without a central directory, which has to be rebuilt by scanning all of the file entries. So, many newer tools will instead, at least for smaller files, rewrite the whole file to a tempfile then rename it over the original, to guarantee a safe, atomic write.)
At least some early tools would sometimes, especially for gigantic archives, rewrite the entry's pathname's first byte with a NUL. But this doesn't really mark the entry as deleted, it just renames it to "\0ome/msala/test.txt". And many modern tools will in fact treat it as meaning exactly that and give you weird errors telling you they can't find a directory named 'ome' or '' or something else fun. Plus, this means the filename in the directory entry no longer matches the filename in the file entry header, which will cause many modern tools to flag the zipfile as corrupted.
At any rate, Python's zipfile module doesn't do either of these, so you'd need to subclass ZipFile to add the support yourself.
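If you do want deletion, here's a minimal sketch of the rewrite-to-tempfile approach described above, as a standalone helper rather than a ZipFile subclass (delete_member is my name, not a zipfile API):

import os
import zipfile

def delete_member(archive_path, arcname):
    # Rewrite the archive without the named entry, then atomically
    # rename the tempfile over the original.
    tmp_path = archive_path + ".tmp"
    with zipfile.ZipFile(archive_path) as src, \
         zipfile.ZipFile(tmp_path, mode="w") as dst:
        for info in src.infolist():
            if info.filename != arcname:
                # writestr() with a ZipInfo keeps the stored metadata
                dst.writestr(info, src.read(info))
    os.replace(tmp_path, archive_path)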
I solved it this way, similar to database record management.
When adding a file to the archive, I look for previously stored copies (same filename).
For each of them, I set their "comment" field to a specific marker, for example "deleted".
Then we add the new file, with an empty comment.
Whenever we like, we can "vacuum": shrink the zip archive using the usual tools (under the hood, a new archive is created, discarding the files whose comment is set to "deleted").
This way, we also get simple "versioning": we keep all the previous copies of a file until the vacuum.
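A minimal sketch of that scheme; the function names and the marker value are mine, and it relies on CPython's zipfile rewriting the central directory, comments included, when an archive opened for append is modified and closed:

import os
import zipfile

DELETED = b"deleted"  # marker by convention; entry comments are bytes

def add_with_history(archive_path, file_path, arcname):
    with zipfile.ZipFile(archive_path, mode="a") as zfh:
        # Mark every previous copy of this name as superseded.
        for info in zfh.infolist():
            if info.filename == arcname:
                info.comment = DELETED
        zfh.write(file_path, arcname)  # the new copy, empty comment

def vacuum(archive_path):
    # Shrink the archive by rewriting it without the marked entries.
    tmp_path = archive_path + ".tmp"
    with zipfile.ZipFile(archive_path) as src, \
         zipfile.ZipFile(tmp_path, mode="w") as dst:
        for info in src.infolist():
            if info.comment != DELETED:
                dst.writestr(info, src.read(info))
    os.replace(tmp_path, archive_path)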
I'm currently writing a little program to "reset" hard drives. In this program the user should be able to choose whether everything is deleted completely or just a part of it, e.g. a specific folder.
Since I want to provide anonymity to all previous owners, I want to completely delete the folder or the drive; essentially, I want to format a single folder.
The problem is that with file recovery tools it is very easy to restore deleted files, since they are mostly not erased but just dropped from the file system's index. How can I set all the bytes that were taken up by the folder and the files in it to zero, or at least make them unrecoverable?
I'm using Python 2.7 and Debian.
I found exactly the solution. Perfect!
https://manpages.debian.org/stretch/manpages-de/shred.1.de.html
You can shred files, directories, and whole partitions with it, and it ships with Debian. You can set all bytes to zero if you want, among many other options. A great command for such jobs!
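Since shred operates on individual files, one way to cover a whole folder from Python 2.7 is to walk it and shred each file; the flags (-n overwrite passes, -z final zeroing pass, -u unlink) are from the man page, and the path is a placeholder:

import os
import subprocess

for dirpath, dirnames, filenames in os.walk("/path/to/folder"):
    for name in filenames:
        # three overwrite passes, then zero out, then remove the file
        subprocess.check_call(
            ["shred", "-n", "3", "-z", "-u", os.path.join(dirpath, name)])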
I am currently working on an app that syncs one specific folder in a user's Google Drive. I need to find out when any of the files/folders in that specific folder have changed. The actual syncing process is easy, but I don't want to do a full sync every few seconds.
I am considering one of these methods:
1) Monitor the changes feed and look for any file changes.
This method is easy, but it will cause a sync if ANY file in the drive changes.
2) Frequently request all files in the whole drive, e.g. service.files().list().execute(), and look for changes within the specific tree. This is a brute-force approach, and it will be too slow if the user has thousands of files in their drive.
3) Start at the specific folder and move down the folder tree looking for changes.
This method will be fast if there are only a few directories in the specific tree, but it still leads to numerous API requests.
Are there any better ways to find out whether a specific folder and its contents have changed?
Are there any optimisations I could apply to methods 1, 2, or 3?
As you have correctly stated, you will need to keep (or work out) the file hierarchy for a changed file to know whether a file has changed within a folder tree.
There is no way of knowing directly from the changes feed whether a deeply nested file within a folder has been changed. Sorry.
There are a couple of tricks that might help.
Firstly, if your app is using drive.file scope, then it will only see its own files. Depending on your specific situation, this may equate to your folder hierarchy.
Secondly, files can have multiple parents. So when creating a file in folder-top/folder-1/folder-1a/folder-1ai, you could declare both folder-1ai and folder-top as parents. Then you simply need to check for folder-top.
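A hedged sketch of that multiple-parents trick against the older Drive API v2, where a file could have several parents (newer API versions restrict this); the folder IDs and service client are placeholders:

# service: an authorized Drive v2 client built elsewhere, e.g. with
# googleapiclient's build("drive", "v2", http=http).
folder_1ai_id = "..."  # placeholder folder IDs
folder_top_id = "..."
body = {
    "title": "report.txt",
    "parents": [{"id": folder_1ai_id}, {"id": folder_top_id}],
}
created = service.files().insert(body=body).execute()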
I'm writing a script to make backups of various different files. What I'd like to do is store meta information about the backup. Currently I'm using the file name, so for example:
backups/cool_file_bkp_c20120119_104955_d20120102
Where c represents the file creation datetime, and d represents the "data time", i.e. what cool_file actually contains. The reason I currently use "data time" is that a later backup may be made of the same file, in which case I know I can safely replace a previous backup with the same "data time" without losing any information.
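For concreteness, that name can be built with strftime, using the dates from the example above:

from datetime import datetime

created = datetime(2012, 1, 19, 10, 49, 55)  # file creation datetime
data_time = datetime(2012, 1, 2)             # the "data time"
name = "cool_file_bkp_c%s_d%s" % (created.strftime("%Y%m%d_%H%M%S"),
                                  data_time.strftime("%Y%m%d"))
# -> cool_file_bkp_c20120119_104955_d20120102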
It seems like an awful way to do things, but it does have the benefit of being OS-independent. Is there a better way?
FYI: I am using Python to script my backup creation, and currently need this working on Windows XP, Windows 2003, and Red Hat Linux.
EDIT: Solution:
From the answers below, I've inferred that metadata on files is not widely supported in a standard way. Given my goal was to tightly couple the metadata with the file, it seems that archiving the file alongside a metadata textfile is the way to go.
I'd take one of two approaches there:
Create a standalone file in the backup dir that contains the desired metadata; this could be something in human-readable form, just to make life easier, such as a JSON data structure or an "ini"-like file.
The other is to archive the copied files, possibly using "zip", and bundle along with them a text file with the desired metadata.
The idea of creating zip archives to group related files is used in several places, like Java's .jar files, the Open Document Format (office files created by several office suites), Office Open XML (Microsoft's office files), and even Python's own eggs.
The zipfile module in Python's standard library has all the tools necessary to accomplish this; you can just bundle a dictionary's representation in a file alongside the original one to carry as much metadata as you need.
With either approach you will also need a helper script to let you see and filter the metadata on the files, of course.
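A minimal sketch of the zip-plus-metadata approach; the file names and metadata fields are illustrative:

import json
import zipfile

metadata = {"created": "2012-01-19T10:49:55", "data_time": "2012-01-02"}
with zipfile.ZipFile("backups/cool_file.zip", "w") as zfh:
    zfh.write("cool_file")  # the file being backed up
    zfh.writestr("metadata.json", json.dumps(metadata, indent=2))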
Different file systems (not different operating systems) have different capabilities for storing metadata. NTFS has plenty of possibilities, while FAT is very limited, and the ext* family is somewhere in between. None of the widespread (subjective term, yes) file systems supports the kind of custom tags you could use here, so there is no standard way to work with such tags.
On Windows there was an attempt to introduce Extended Attributes, but these were implemented in such a tricky way that they were almost unusable.
So putting whatever you can into the filename remains the only workable approach. Remember that file systems limit file name and file path length, and with this approach you can exceed those limits, so be careful.