Data structures in Python: maintaining filesystem structure within a database

I have a data organization issue. I'm working on a client/server project where the server must maintain a copy of the client's filesystem structure inside of a database that resides on the server. The idea is to display the filesystem contents on the server side in an AJAX-ified web interface. Right now I'm simply uploading a list of files to the database where the files are dumped sequentially. The problem is how to recapture the filesystem structure on the server end once they're in the database. It doesn't seem feasible to reconstruct the parent->child structure on the server end by iterating through a huge list of files. However, when the file objects have no references to each other, that seems to be the only option.
I'm not entirely sure how to handle this. As near as I can tell, I would need to duplicate some type of filesystem data structure on the server side (in a Btree perhaps?) with objects maintaining pointers to their parents and/or children. I'm wondering if anyone has had any similar past experiences they could share, or maybe some helpful resources to point me in the right direction.

I suggest following the Unix way. Each file is considered a stream of bytes, nothing more, nothing less. Each file is technically represented by a single structure called an i-node (index node) that keeps all the information related to the physical stream of data (attributes, ownership, ...).
The i-node contains nothing about the human-readable name. Each i-node is given a unique number (forever) that acts as the file's technical name. You can use a similar number to give each stream of bytes in the database its unique identification. The i-nodes are stored on disk in a separate contiguous section -- think of it as an array of i-node structures (in the abstract sense), or as a separate table in the database.
Back to the file: this way it is represented by a unique number. For your database representation, that number will be the unique key. If you need the other i-node information (file attributes), you can add other columns to the table. One column will be of the blob type and will hold the content of the file (the stream of bytes). For AJAX, I guess the files will be rather small, so you should not run into the size limits of the blob.
So far, the files are stored as a flat structure (just as the physical disk is, and just as a relational database is).
The structure of directory names and file names is kept separately, in other files (stored the same way, together with the other files, each also represented by its own i-node). Basically, a directory file captures tuples of (bare_name, i-node number). (This is how hard links are implemented in Unix -- two names are paired with the same i-node number.) The root directory file has to have a fixed technical identification -- i.e. a reserved i-node number.
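Here is a minimal sketch of how that idea could map onto SQL tables, using Python's sqlite3 just for illustration; the table and column names (inode, dirent, ROOT_INO) are my own assumptions, not part of any standard:

import sqlite3

conn = sqlite3.connect("fs_copy.db")

# One row per "i-node" (the stream of bytes plus its attributes), and one row
# per directory entry: (bare name, i-node number) pairs, grouped by the i-node
# of the directory file that contains them.
conn.executescript("""
CREATE TABLE IF NOT EXISTS inode (
    ino     INTEGER PRIMARY KEY,      -- the permanent technical name
    kind    TEXT NOT NULL,            -- 'file' or 'directory'
    mode    INTEGER,                  -- attributes, ownership, ...
    content BLOB                      -- file bytes; NULL for directories
);
CREATE TABLE IF NOT EXISTS dirent (
    dir_ino INTEGER NOT NULL REFERENCES inode(ino),  -- the directory file
    name    TEXT NOT NULL,                           -- bare name, no path
    ino     INTEGER NOT NULL REFERENCES inode(ino),  -- what the name points at
    PRIMARY KEY (dir_ino, name)
);
""")

ROOT_INO = 1  # reserved i-node number for the root directory
conn.execute("INSERT OR IGNORE INTO inode (ino, kind) VALUES (?, 'directory')", (ROOT_INO,))
conn.commit()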

If by "database" you mean an SQL database, then the magic word you're looking for is "self-referential tables" or, alternatively "modified pre-ordered tree traversal" (MPTT)
Basically, the first approach deals with "nodes" which have id, parent_id and name attributes. So, to select the root-level directories you would do something like
SELECT id, name from mytable WHERE parent_id IS NULL AND kind="directory";
which, let's assume, returns
[(1, "Documents and Settings"), (2, "Program Files"), (3, "Windows")]
then, to get directories inside "Documents and Settings" you issue another query:
SELECT id, name from mytable WHERE parent_id=1 AND kind="directory";
and so on. Simple!
MPTT is a little bit trickier, but you'll find a good tutorial on Wikipedia, for example. This approach is very efficient for queries like "find all children of a given node" or "how many files are in this directory, including subdirectories", and is less efficient when the tree changes, as you'll need to re-order all the nodes.
Since you're using Python, you must be using an ORM; you're not going to build those queries manually, right? SQLAlchemy is capable of modelling self-referential relations, including "eagerly loading" the tree up to a certain depth with a single query.
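For example, a rough sketch of the self-referential (adjacency list) model in SQLAlchemy 1.4+ style; the Node class mirrors the queries above, but the class name and session setup are my own assumptions:

from sqlalchemy import Column, ForeignKey, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base, relationship

Base = declarative_base()

class Node(Base):
    """One filesystem entry; parent_id is NULL for root-level entries."""
    __tablename__ = "mytable"
    id = Column(Integer, primary_key=True)
    parent_id = Column(Integer, ForeignKey("mytable.id"), nullable=True)
    name = Column(String, nullable=False)
    kind = Column(String, nullable=False)          # "directory" or "file"

    children = relationship("Node", back_populates="parent")
    parent = relationship("Node", back_populates="children", remote_side=[id])

engine = create_engine("sqlite:///fs.db")
Base.metadata.create_all(engine)

with Session(engine) as session:
    docs = Node(name="Documents and Settings", kind="directory")
    docs.children.append(Node(name="ntuser.dat", kind="file"))
    session.add(docs)
    session.commit()

    # Equivalent of the first SELECT above: root-level directories.
    roots = session.query(Node).filter_by(parent_id=None, kind="directory").all()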

Related

MySQL table with a lot of text

So I have this database (3.1 GB total), but the size is due to one specific table containing a lot of console output text from some test runs. The table itself is 2.7 GB, and I was wondering whether there could be another solution for this table, so the database would get a lot smaller? It's getting a bit annoying to back up the database, or even make a copy of it to a playground, because this table is so big.
The Table is this one
Would it be better to delete this table and have all the LogTextData (a LONGTEXT column) stored in PDFs instead of the database? (Then I can't back up this data, though...)
Does anyone have an idea of how to make this table smaller, or another solution? I'm open to suggestions for making this table smaller.
The console log data gets imported into the database by Python scripts, so I have full access to other Python solutions, if there are any.
You could try enabling either Storage-Engine Independent Column Compression or InnoDB page compression. Both provide a way to have a smaller on-disk database, which is especially useful for large text fields.
Since there's only one table with one particular field that's taking up space, trying out individual column compression seems like the easiest first step.
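A rough sketch of what that might look like; the exact syntax depends on your server (InnoDB page compression needs MySQL 5.7+ with file-per-table tablespaces, and the COMPRESSED column attribute is a MariaDB 10.3+ feature), and "BuildLog" here stands in for your real table name:

import mysql.connector   # assumes the MySQL Connector/Python driver

conn = mysql.connector.connect(host="localhost", user="me", password="...", database="builds")
cur = conn.cursor()

# InnoDB page compression: newly written pages are compressed; OPTIMIZE rebuilds
# the table so the existing log rows get compressed too.
cur.execute("ALTER TABLE BuildLog COMPRESSION='zlib'")
cur.execute("OPTIMIZE TABLE BuildLog")
cur.fetchall()   # OPTIMIZE returns a status row; consume it before continuing

# Alternatively, on MariaDB 10.3+ only the big text column can be compressed:
# cur.execute("ALTER TABLE BuildLog MODIFY LogTextData LONGTEXT COMPRESSED")

conn.commit()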
In my opinion, you should just store the path of each log file instead of the complete log text in the database. By using those paths you can access the files anytime you want.
It will decrease the size of the database too.
Your new table would look like this:
LogID, BuildID, JenkinsJobName, LogTextData -- with the last column now holding just the path to the log file.
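A minimal sketch of how the Python import script could do that; the LogFilePath column name (standing in for the repurposed text column) and the log directory are hypothetical:

import os
import uuid

LOG_DIR = "/var/log/build-logs"   # hypothetical location for the raw console output

def store_log(cursor, build_id, jenkins_job_name, log_text):
    """Write the console output to its own file and keep only the path in MySQL."""
    if not os.path.isdir(LOG_DIR):
        os.makedirs(LOG_DIR)
    path = os.path.join(LOG_DIR, uuid.uuid4().hex + ".log")
    with open(path, "w") as fh:
        fh.write(log_text)
    cursor.execute(
        "INSERT INTO BuildLog (BuildID, JenkinsJobName, LogFilePath) VALUES (%s, %s, %s)",
        (build_id, jenkins_job_name, path),
    )
    return path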

How to store a dictionary in a file?

I'm rather new to Python and coding in general.
I'm writing my own chat statistics bot for a Russian social network (vk.com).
My question is: can I store a dictionary in a file and work with it?
For example:
Userlist = open('userlist.txt', 'r+')
if lastmessage['uid'] not in Userlist.read():
    Userlist.read()[lastmessage['uid']] = 1
Userlist.close()
Or do I have to use some side modules like JSON?
Thank you
(Amended answer in light of a clarifying comment: "in the while True loop I want to check if a user's id is in the 'userlist' dictionary (as a key) and, if not, add it to this dictionary with value 1. Then I want to rewrite the file with the new dictionary. The file is opened as soon as the program is launched, before the loop"):
For robustly using data on disk as though it were a dictionary you should consider either one of the dbm modules or just using the SQLite3 support.
A dbm file is simply a set of keys and values stored with transparently maintained and used indexing. Once you've opened your dbm file you simply use it exactly like you would any other Python dictionary (with strings as keys). Any changes can simply be flushed and written before closing the file. This is very simple though it offers no special features for locking (or managing consistency in the case where you might have multiple processes writing to the file concurrently) and so on.
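A minimal sketch with Python 3's dbm module; lastmessage here is just a stand-in for the message dict from your question:

import dbm

lastmessage = {"uid": 12345}   # stand-in for the message dict from the question

# The dbm file behaves like a persistent dict; keys and values are stored as bytes.
with dbm.open("userlist.db", "c") as userlist:   # "c" creates the file if it's missing
    uid = str(lastmessage["uid"])
    if uid not in userlist:
        userlist[uid] = "1"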
On the other hand, the incredibly powerful SQLite subsystem, which has been included in the Python standard library for many years, allows you to easily treat a local file as an SQL database management system ... with all of the features you'd expect from a client/server based system (foreign keys, data types and referential integrity constraints, views and triggers, indexes, etc.).
In your case you could simply have a single table containing a single column. Binding to that database (by its filename) would allow you to query for a user's name with SELECT and add the user's name with INSERT. As your application grows and changes you could add other columns to track when the account was created and when it was most recently used or checked (a couple of time/date stamp columns) and you could create other tables with related data (selected using JOINs, for example).
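And a comparable sketch with the standard library's sqlite3 module; the table and column names are just illustrative:

import sqlite3

conn = sqlite3.connect("userlist.db")   # a plain local file acting as the database
conn.execute("CREATE TABLE IF NOT EXISTS users (uid TEXT PRIMARY KEY, count INTEGER)")

def is_known(uid):
    """True if this user id has been seen before."""
    return conn.execute("SELECT 1 FROM users WHERE uid = ?", (str(uid),)).fetchone() is not None

def add_user(uid):
    """Record a new user with an initial count of 1."""
    conn.execute("INSERT OR IGNORE INTO users (uid, count) VALUES (?, 1)", (str(uid),))
    conn.commit()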
(Original answer):
In general, the process of storing any internal data structure as a file, or transmitting it over a network connection, is referred to as "serialization." The complementary process of loading or receiving such data and instantiating its contents into a new data structure is referred to (unsurprisingly) as "deserialization."
That's true of all programming languages.
There are many ways to serialize and deserialize data in Python. In particular we have the native (standard library) pickle module, which produces files (or strings) that are only intended for use with other processes running Python, or we can, as you said, use JSON ... the JavaScript Object Notation, which has become the de facto cross-language data structure serialization standard. (There are others, such as YAML and XML ... but JSON has come to predominate.)
The caveat about using JSON vs. pickle is that JavaScript (and a number of other programming and scripting languages) uses different semantics for some sorts of "dictionary" (associative array) keys than Python. In particular, Python (and Ruby and Lua) treats keys such as "1" (a string containing the digit one) and 1 or 1.0 (numeric values equal to one) as distinct keys. JavaScript, Perl and some others treat such keys as "scalar" values, so a string like "1" and the number 1 will evaluate to the same key.
There are some other nuances which can affect the fidelity of your serialization. But that's the easiest to understand. Dictionaries with strings as keys are fine ... mixtures of numeric and string keys are the most likely cause of any troubles you'll encounter using JSON serialization/deserialization in lieu of pickling.
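A quick illustration of that key-coercion difference (nothing here is specific to your bot):

import json
import pickle

d = {1: "one", 2.0: "two"}

print(pickle.loads(pickle.dumps(d)))   # {1: 'one', 2.0: 'two'}  -- key types survive the round trip
print(json.loads(json.dumps(d)))       # {'1': 'one', '2.0': 'two'}  -- keys come back as strings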

Recursive Searching and MySql Comparison

Good evening. I am looking at developing some code that will collect EXIF data from JPEG images and store it in a MySQL database using Python v2.x. The stumbling block is that the JPEGs are scattered across a number of subdirectories, and further subdirectories, under a root folder; so, for example, 200 JPEGs may be stored in root > subroot > subsubroot1, as well as a further 100 in root > subroot > subroot2. Once all images are identified, they will be scanned and their respective EXIF data extracted before being added to a MySQL table.
At the moment I am just at the planning stage, but I am wondering: what would be the most efficient and Pythonic way to carry out the recursive searching? I am looking to scan the root directory, append any newly identified subdirectories to a list, and then scan all the subdirectory paths in the list for further subdirectories until I have a total list of all directories. This just seems a clumsy and somewhat repetitive way to do it, though, IMHO, so I assume there may be a more OOP manner of carrying out this function.
Similarly, I am only looking to add new info to my MySQL table, so what would be the most efficient way to establish whether an entry already exists? The filename, both in the table and for the JPEG file itself, will be its MD5 hash value. I was considering scanning through the table at the beginning of the code and placing all the filenames in a set, so that before scanning a new JPEG, if an entry already exists in the set, there would be no need to extract the EXIF and I could move on to the next picture. Is this an efficient method, however, or would it be better to scan through the MySQL table when a new image is encountered? I anticipate the set method may be the most efficient; however, the table may eventually contain tens of millions of entries, so holding the filenames for all of them in a set (volatile memory) may not be the best idea.
Thanks folks.
I would just write a function that scanned a directory for all files; if it's a jpeg, add the full path name of the jpeg to the list of results. If it's a directory, then immediately call the function with the newly discovered directory as an argument. If it's another type of file, do nothing. This is a classic recursive divide-and-conquer strategy. It will break if there are loops in your directory path, for instance with symlinks -- if this is a danger for you, then you'll have to make sure you don't traverse the same directory twice by finding the "real" non-symlinked path of each directory and recording it.
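A sketch of that recursive scan (it works on Python 2.x as in your question; the guard via os.path.realpath is the "real" non-symlinked-path bookkeeping mentioned above):

import os

def find_jpegs(directory, seen_dirs=None):
    """Recursively collect full paths of JPEG files under `directory`."""
    if seen_dirs is None:
        seen_dirs = set()
    real = os.path.realpath(directory)       # guard against symlink loops
    if real in seen_dirs:
        return []
    seen_dirs.add(real)

    results = []
    for entry in os.listdir(directory):
        path = os.path.join(directory, entry)
        if os.path.isdir(path):
            results.extend(find_jpegs(path, seen_dirs))   # recurse into subdirectories
        elif entry.lower().endswith((".jpg", ".jpeg")):
            results.append(path)
    return results

Note that os.walk from the standard library does essentially the same traversal for you, if you'd rather not write the recursion by hand.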
How to avoid duplicate entries is a trickier problem and you have to consider whether you are tolerant of two differently-named files with the exact same contents (and also consider the edge cases of symlinked or multiply-hard-linked files), how new files appear in the directories you are scanning and whether you have any control over that process. One idea to speed it up would be to use os.path.getmtime(). Record the moment you start the directory traversal process. Next time around, have your recursive traversal process ignore any jpeg files with an mtime older than your recorded time. This can't be your only method of keeping track because files modified between the start and end times of your process may or may not be recorded, so you will still have to check the database for those records (for instance using the full path, a hash on the file info or a hash on the data itself, depending on what kind of duplication you're intolerant of) but used as a heuristic it should speed up the process greatly.
You could theoretically load all filenames (probably paths and not filenames) from the database into memory to speed up comparison, but if there's any danger of the table becoming very large it would be better to leave that info in the database. For instance, you could create a hash from the filename, and then simply add that to the database with a UNIQUE constraint -- the database will reject any duplicate entries, you can catch the exception and go on your way. This won't be slow if you use the aforementioned heuristic checking file mtime.
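A sketch of that "let the database reject duplicates" idea, assuming the MySQLdb driver and a hypothetical images table with a UNIQUE index on the hash column:

import MySQLdb

def record_image(cursor, md5_hash, exif_data):
    """Try to insert; a UNIQUE index on md5 makes the database reject duplicates."""
    try:
        cursor.execute(
            "INSERT INTO images (md5, exif) VALUES (%s, %s)",
            (md5_hash, exif_data),
        )
        return True        # new image, EXIF stored
    except MySQLdb.IntegrityError:
        return False       # already in the table, move on to the next picture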
Be sure you account for the possibility of files that may be only modified and not newly created, if that's important for your application.

Is there a standard way, across operating systems, of adding "tags" to files

I'm writing a script to make backups of various different files. What I'd like to do is store meta information about the backup. Currently I'm using the file name, so for example:
backups/cool_file_bkp_c20120119_104955_d20120102
Where c represents the file creation datetime, and d represents the "data time", which represents what cool_file actually contains. The reason I currently use "data time" is that a later backup may be made of the same file, in which case I know I can safely replace the previous backup with the same "data time" without losing any information.
It seems like an awful way to do things, but it does seem to have the benefit of being non-os dependent. Is there a better way?
FYI: I am using Python to script my backup creation, and currently need to have this working on Windows XP, 2003, and Redhat Linux.
EDIT: Solution:
From the answers below, I've inferred that metadata on files is not widely supported in a standard way. Given my goal was to tightly couple the metadata with the file, it seems that archiving the file alongside a metadata textfile is the way to go.
I'd take one of two approaches there:
create a stand-alone file, in the backup dir, that would contain the desired metadata - this could be something in human-readable form, just to make life easier, such as a JSON data structure or an "ini"-like file.
The other is to archive the copied files - possibly using "zip" - and bundle along with them a textual file with the desired metadata.
The idea of creating zip archives to group files that belong together is used in several places, like Java .jar files, Open Document Format (office files created by several office suites), Office Open XML (Microsoft-specific office files), and even Python's own eggs.
The zipfile module in Python's standard library has all the tools necessary to accomplish this - you can just bundle a file holding a dictionary's representation along with the original one to have as much metadata as you need (see the sketch below).
With either approach you will also need a helper script to let you see and filter the metadata on the files, of course.
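A minimal sketch of the zip-plus-metadata approach with the standard library's zipfile and json modules; the filenames and metadata fields are just examples based on the question:

import json
import zipfile

def backup_with_metadata(source_path, archive_path, metadata):
    """Bundle the backed-up file and a JSON description of it into one archive."""
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.write(source_path)                                   # the file itself
        zf.writestr("metadata.json", json.dumps(metadata, indent=2))

backup_with_metadata(
    "cool_file",
    "backups/cool_file_bkp.zip",
    {"created": "2012-01-19 10:49:55", "data_time": "2012-01-02"},
)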
Different file systems (not different operating systems) have different capabilities for storing metadata. NTFS has plenty of possibilities, while FAT is very limited, and the ext* filesystems are somewhere in between. None of the widespread (subjective term, yes) filesystems support custom tags which you could use. Consequently there exists no standard way to work with such tags.
On Windows there was an attempt to introduce Extended Attributes, but these were implemented in such a tricky way that were almost unusable.
So putting whatever you can into the filename remains the only working approach. Remember that filesystems have limitations on file name and file path length, and with this approach you can exceed the limit, so be careful.

How does git fetch the commits associated with a file?

I'm writing a simple parser of .git/* files. I covered almost everything, like objects, refs, pack files etc. But I have a problem. Let's say I have a big 300M repository (in a pack file) and I want to find out all the commits which changed /some/deep/inside/file file. What I'm doing now is:
fetching last commit
finding a file in it by:
fetching parent tree
finding out a tree inside
recursively repeat until I get into the file
additionally I'm checking the hashes of each subfolder on my way to the file. If one of them is the same as in the commit before, I assume that the file was not changed (because its parent dir didn't change)
then I store the hash of a file and fetch parent commit
finding file again and check if hash change occurs
if yes then original commit (i.e. one before parent) was changing a file
And I repeat it over and over until I reach very first commit.
This solution works, but it sucks. In the worst-case scenario, the first search can take up to 3 minutes (for a 300M pack).
Is there any way to speed it up? I tried to avoid loading such large objects into memory, but right now I don't see any other way. And even then, the initial memory load will take forever :(
Greets and thanks for any help!
That's the basic algorithm that git uses to track changes to a particular file. That's why "git log -- some/path/to/file.txt" is a comparatively slow operation, compared to many other SCM systems where it would be simple (e.g. in CVS, P4 et al. each repo file is a server file with the file's history).
It shouldn't take so long to evaluate though: the amount you ever have to keep in memory is quite small. You already mentioned the main point: remember the tree IDs going down to the path to quickly eliminate commits that didn't even touch that subtree. It's rare for tree objects to be very big, just like directories on a filesystem (unsurprisingly).
Are you using the pack index? If you're not, then you essentially have to unpack the entire pack to find this out, since trees could be at the end of a long delta chain. If you have an index, you'll still have to apply deltas to get your tree objects, but at least you should be able to find them quickly. Keep a cache of applied deltas, since it's obviously very common for trees to reuse the same or similar bases - most tree object changes just change about 20 bytes relative to a previous tree object. So if, in order to get tree T1, you have to start with object T8 and apply delta Td7 to get T7, then T6, etc., it's entirely likely that these other trees T2-T8 will be referenced again.
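A rough sketch of that delta cache; read_raw and resolve_delta stand for your own (hypothetical) pack-reading and delta-application routines:

object_cache = {}   # sha -> fully un-deltified object bytes

def resolve_object(sha, read_raw, resolve_delta):
    """Return the un-deltified object, reusing bases that were already resolved."""
    if sha in object_cache:
        return object_cache[sha]
    payload, base_sha = read_raw(sha)          # base_sha is None for non-delta objects
    if base_sha is None:
        obj = payload
    else:
        base = resolve_object(base_sha, read_raw, resolve_delta)
        obj = resolve_delta(base, payload)     # apply the delta on top of its base
    object_cache[sha] = obj
    return obj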
