Treating a directory like a file in Python

We have a tool which is designed to allow vendors to deliver files to a company and update their database. The files (generally of predetermined types) travel over our web-based transport system; a new record is created in the db for each one, and the files are moved into a new structure on delivery.
We have a new request from a client to use this tool to pass through entire directories without parsing every record. Imagine the client makes digital cars: this tool delivers the digital nuts and bolts and tracks each part. Now they also want to deliver a directory containing all the assets that went into creating a digital bolt, without adding each asset as a new record.
The issue is that the original code doesn't have a nice way to handle these passthrough folders, and would require a lot of rewriting to make it work. We'd obviously need a new step around the time of the directory walk that pulls out each folder matching the passthrough pattern and handles it separately. The problem is that the tools which do the transport, db entry, and delivery all expect files, not folders.
My thinking: what if we could treat that entire folder as a file? That way the current file-level tools don't need to be modified; we'd just need to add the "conversion" step. After generating the manifest, what if we used a library to turn the folder into a "file", send that, and then turn it back into a "folder" after ingest? The most obvious way to do that is ZIP files - and the current delivery tool does handle ZIPs - but compression is slow, some of these deliveries are very large, and if anything goes wrong in transport the entire ZIP fails.
Is there a method that doesn't necessarily compress the files but can otherwise treat a directory and all its contents like a single file, so the rest of the code doesn't need to be rewritten? Or is there something else I'm missing entirely?
Thanks!

You could use tar files. Python has great support for them, and it is customary in *nix environments to use them as backup files. For compression you could use gzip (also supported by the standard library, and great for streaming).
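A minimal sketch of the round trip, assuming an uncompressed tar is acceptable; pack_dir, unpack_dir and the 'payload' arcname are illustrative names, not part of your existing tool:

import tarfile

def pack_dir(dir_path, tar_path):
    # 'w' writes an uncompressed tar; 'w:gz' would layer gzip on top
    with tarfile.open(tar_path, 'w') as tar:
        tar.add(dir_path, arcname='payload')

def unpack_dir(tar_path, dest_dir):
    # Recreate the directory tree after ingest
    with tarfile.open(tar_path, 'r') as tar:
        tar.extractall(dest_dir)

Because tar is a simple sequential container (unlike ZIP, whose central directory sits at the end of the archive), a partially transferred tar can often be diagnosed or salvaged member by member.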

Related

Naming and storing file information for comparison

I am currently working on a script that automatically syncs files from the Documents and Pictures directories with a USB stick that I use as a sort of "essentials backup". In practice, it should identify filenames and some information about them (like the last time they were edited) in the directories that I choose to sync.
If a file exists in one directory, but not in the other (i.e. it's on my computer but not on my USB drive), it should automatically copy that file to the USB as well. Likewise, if a file exists in both directories, but has different mod-times, it should replace the older with the newer one.
However, I have some issues with storing that information for the purpose of comparing those files. I initially thought about a file class that stores all that information and through which I can compare objects with the same name.
Problem 1 with that approach: if I create an object, how do I name it? Do I name it after the file? I would then have to strip the file extension (like .txt or .py), because otherwise I'd run into trouble with my code. But I might have both a notes.odt and a notes.jpg, which would be problem 2.
I am pretty new to Python, so my imagination is probably limited by my lack of knowledge. Any pointers on how I could make that work?
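One way to sidestep both naming problems is to skip the per-file class entirely and key a plain dict by each file's path relative to the synced root, so notes.odt and notes.jpg get distinct keys. A minimal sketch, assuming a one-way sync by modification time (snapshot and sync are illustrative names):

import os
import shutil

def snapshot(root):
    # Map relative path -> modification time for every file under root
    info = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            full = os.path.join(dirpath, name)
            info[os.path.relpath(full, root)] = os.path.getmtime(full)
    return info

def sync(src, dst):
    # Copy anything that is missing from dst, or newer in src
    src_info, dst_info = snapshot(src), snapshot(dst)
    for rel, mtime in src_info.items():
        if rel not in dst_info or mtime > dst_info[rel]:
            target = os.path.join(dst, rel)
            os.makedirs(os.path.dirname(target), exist_ok=True)
            shutil.copy2(os.path.join(src, rel), target)

Running sync in both directions gives the two-way behaviour described above.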

How do I make files downloadable for a particular role in Plone?

I wish to make the contents of a folder in Plone downloadable only for certain roles. Can this be done easily? At present anybody who clicks the hyperlink for file name in the folder contents can download the file easily. I know about the site-wide option of overriding the at_download code using ZMI.
The codeless way to do this is to make use of Plone's workflow system.
Out-of-the-box, Plone's file and image content types do not have their own workflow. That means that files and images will simply inherit the publication state of their parent folder. This is easy and sensible, but it doesn't meet the need you're describing.
To change the situation, you may use the "types" configuration panel to turn on independent workflow for files and images. Then, their publication status may be set separately from their containing folders. Typically, you'd choose the same workflow that you're using for documents. Then, you may publish a folder and list its contents while having the files within be private -- thus requiring login for viewing.
If you need this to work differently in different places, you may turn on "placeful" workflow (turn it on by adding it in the add-ons panel; it's pre-installed, but not active). This allows different workflows in different parts of a site. It increases complexity, but is often an ideal solution to this kind of puzzle.
This is probably not so simple; you need to add some lines of code in a little Plone product (there's no way to do it TTW). The code snippets below are not tested.
Plone files are built on the Archetypes framework (this will probably change in Plone 5). What you need to change is the read_permission of the file field (see the Archetypes field reference).
from Products.Archetypes.content.file import ATFile
ATFile.schema['file'].read_permission = 'your new permission'
Then you simply need to assign your new permission to a role.
This may not be enough (step 1 probably isn't even needed nowadays). You need to perform the same operation for the plone.app.blob extension:
from plone.app.blob.subtypes import SchemaExtender
SchemaExtender.fields[0].read_permission = 'your new permission'
Last one: you probably need to customize the file_view template, or an "Unauthorized" error will be raised when a user without the permission visits the file view.

How can I find if the contents in a Google Drive folder have changed

I am currently working on an app that syncs one specific folder in a user's Google Drive. I need to find out when any of the files/folders in that specific folder have changed. The actual syncing process is easy, but I don't want to do a full sync every few seconds.
I am considering one of these methods:
1) Monitor the changes feed and look for any file changes
This method is easy but it will cause a sync if ANY file in the drive changes.
2) Frequently request all files in the whole drive, e.g. service.files().list().execute(), and look for changes within the specific tree. This is a brute-force approach, and it will be too slow if the user has thousands of files in their drive.
3) Start at the specific folder, and move down the folder tree looking for changes.
This method will be fast if there are only a few directories in the specific tree, but it will still lead to numerous API requests.
Are there any better ways to find whether a specific folder and its contents have changed?
Are there any optimisations I could apply to methods 1, 2 or 3?
As you have correctly stated, you will need to keep (or work out) the file hierarchy for a changed file to know whether a file has changed within a folder tree.
There is no way of knowing directly from the changes feed whether a deeply nested file within a folder has been changed. Sorry.
There are a couple of tricks that might help.
Firstly, if your app is using drive.file scope, then it will only see its own files. Depending on your specific situation, this may equate to your folder hierarchy.
Secondly, files can have multiple parents. So when creating a file in folder-top/folder-1/folder-1a/folder-1ai, you could declare both folder-1ai and folder-top as parents. Then you simply need to check for folder-top, as in the sketch below.
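A hedged sketch against the Drive v2 Python client the question already uses (service is assumed to be the authorised client from the question; the folder IDs are placeholders):

# Insert the file with both the leaf folder and the top folder as parents
body = {
    'title': 'new-file.txt',
    'parents': [{'id': FOLDER_1AI_ID}, {'id': FOLDER_TOP_ID}],
}
created = service.files().insert(body=body).execute()

# Later, an entry from the changes feed can be tested with a flat
# membership check instead of walking the folder tree:
parent_ids = {p['id'] for p in change['file'].get('parents', [])}
in_watched_tree = FOLDER_TOP_ID in parent_ids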

Python/Django: how to get files fastest (based on path and name)

My website users can upload image files, which then need to be found whenever they are to be displayed on a page (using src = ""). Currently, I put all images into one directory. What if there are many files - is it slow to find the right file? Are they indexed? Should I create subdirectories instead?
I use Python/Django. Everything is on webfaction.
The access time for an individual file is not affected by the quantity of files in the same directory.
Running ls -l on a directory with more files in it will take longer, of course, as will viewing that directory in a file browser. It might be easier to work with these images if you store them in a subdirectory defined by the user's name, but that just depends on what you are going to do with them; there is no technical reason to do so.
Think about it like this. The full path to the image file (/srv/site/images/my_pony.jpg) is the actual address of the file. Your web server process looks there, and returns any data it finds or a 404 if there is nothing. What it doesn't do is list all the files in /srv/site/images and look through that list to see if it contains an item called my_pony.jpg.
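If you do want per-user subdirectories for housekeeping, Django supports that directly: upload_to on a FileField/ImageField accepts a callable. A minimal sketch (the model and path function are illustrative, not from the question):

import os
from django.db import models

def user_directory_path(instance, filename):
    # Files land under MEDIA_ROOT/uploads/<username>/<filename>
    return os.path.join('uploads', instance.owner.username, filename)

class UserImage(models.Model):
    owner = models.ForeignKey('auth.User', on_delete=models.CASCADE)
    image = models.ImageField(upload_to=user_directory_path)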
If only for organizational purposes, and to help with system maintenance, you should create subdirectories. Beyond that, there is very little chance you'll run into the maximum number of files that a directory can hold.
There are negligible performance implications for the web. For other applications, though (file listing, FTP, backup, etc.), there may be consequences, but only if you reach a very large number of files.

Is there a standard way, across operating systems, of adding "tags" to files

I'm writing a script to make backups of various different files. What I'd like to do is store meta information about the backup. Currently I'm using the file name, so for example:
backups/cool_file_bkp_c20120119_104955_d20120102
Where c represents the file creation datetime, and d represents "data time", which describes what the cool_file actually contains. The reason I currently use "data time" is that a later backup may be made of the same file, in which case I know I can safely replace the previous backup of the same "data time" without losing any information.
It seems like an awful way to do things, but it does have the benefit of being OS-independent. Is there a better way?
FYI: I am using Python to script my backup creation, and currently need to have this working on Windows XP, 2003, and Redhat Linux.
EDIT: Solution:
From the answers below, I've inferred that metadata on files is not widely supported in a standard way. Given my goal was to tightly couple the metadata with the file, it seems that archiving the file alongside a metadata textfile is the way to go.
I'd take one of two approaches there:
Create a standalone file in the backup dir that would contain the desired metadata - this could be something in human-readable form, just to make life easier, such as a JSON data structure or an "ini"-like file.
The other is to archive the copied files - possibly using "zip" - and bundle along with them a text file with the desired metadata.
The idea of creating zip archives to group files that belong together is used in several places, like Java .jar files, Open Document Format (office files created by several office suites), Office Open XML (Microsoft-specific office files), and even Python's own eggs.
The zipfile module in Python's standard library has all the tools necessary to accomplish this - you can just bundle a file holding a dictionary's representation alongside the original one to carry as much metadata as you need.
With either approach you will also need a helper script to let you see and filter the metadata on the files, of course - a sketch of the zip variant follows.
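A minimal sketch of the archive approach with zipfile (write_backup, read_metadata and the member names are illustrative):

import json
import os
import zipfile

def write_backup(src_path, archive_path, metadata):
    # Bundle the file and a JSON metadata sidecar into one archive
    with zipfile.ZipFile(archive_path, 'w', zipfile.ZIP_DEFLATED) as zf:
        zf.write(src_path, arcname='payload/' + os.path.basename(src_path))
        zf.writestr('metadata.json', json.dumps(metadata, indent=2))

def read_metadata(archive_path):
    # Read the metadata back without extracting the payload
    with zipfile.ZipFile(archive_path) as zf:
        return json.loads(zf.read('metadata.json'))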
Different file systems (not different operating systems) have different capabilities for storing metadata. NTFS has plenty of possibilities, while FAT is very limited, and the ext* family is somewhere in between. None of the widespread (a subjective term, yes) filesystems supports custom tags that you could use. Consequently there is no standard way to work with such tags.
On Windows there was an attempt to introduce Extended Attributes, but these were implemented in such a tricky way that they were almost unusable.
So putting whatever you can into the filename remains the only working approach. Remember that filesystems have limitations on file name and file path length, and with this approach you can exceed the limit, so be careful.
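If the filename stays the carrier, it helps to centralise the encode/decode so the convention lives in one place. A sketch built around the naming pattern from the question (function names are illustrative):

import re
from datetime import datetime

STAMP = '%Y%m%d_%H%M%S'

def backup_name(stem, created, data_time):
    # e.g. cool_file_bkp_c20120119_104955_d20120102
    return '%s_bkp_c%s_d%s' % (
        stem, created.strftime(STAMP), data_time.strftime('%Y%m%d'))

def parse_backup_name(name):
    # Split a backup filename back into (stem, created, data_time)
    m = re.match(r'(?P<stem>.+)_bkp_c(?P<c>\d{8}_\d{6})_d(?P<d>\d{8})$', name)
    return (m.group('stem'),
            datetime.strptime(m.group('c'), STAMP),
            datetime.strptime(m.group('d'), '%Y%m%d'))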
