Single-file history format/library for binary files? - python

My application is going to edit a bunch of large files, completely unrelated to each other (belonging to different users), and I need to store checkpoints of the previous state of the files. Delta compression should work extremely well on this file format. I only need a linear history, not branches or merges.
There are low-level libraries that give part of the solution, for example xdelta3 sounds like a good binary diff/patch system.
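For reference, the kind of delta round trip I have in mind looks roughly like this (a sketch driving the xdelta3 command-line tool via subprocess; it assumes xdelta3 is on the PATH, and the file names are just placeholders):

import subprocess

def make_delta(old_path, new_path, delta_path):
    # encode: store only the differences between the old and new version
    subprocess.run(["xdelta3", "-e", "-f", "-s", old_path, new_path, delta_path],
                   check=True)

def apply_delta(old_path, delta_path, restored_path):
    # decode: reconstruct the new version from the old one plus the delta
    subprocess.run(["xdelta3", "-d", "-f", "-s", old_path, delta_path, restored_path],
                   check=True)

But that only covers the diff/patch part, not the history bookkeeping.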
RCS actually seems like a pretty close match to my problem, but doesn't handle binary files well.
git provides a complete solution to my problem, but is an enormous suite of programs, and its storage format is an entire directory.
Is there anything less complicated than git that would:
work on binary files
perform delta compression
let me commit new "newest" versions
let me recall old versions
Bonus points if it would:
have a single-file storage format
be available as a C, C++, or Python library
I can't even find the right combination of words to google for this category of program, so that would also be helpful.

From RCS manual (1. Overview)
[RCS] can handle text as well as binary files, although functionality is reduced for the latter.
RCS seems a good option worth trying.
I work for a Foundation that has been using RCS to keep tens of thousands of completely unrelated files under version control (git or hg are not an option). Mostly text, but also some media files, which are binary in nature.
RCS does work quite well with binary files; just make sure not to use the keyword-substitution options, to avoid inadvertently rewriting binary bytes that happen to look like $Id$.
To see whether this could work for you, try it with, for example, a Photoshop image: put it under version control with RCS, then change a part or add a layer and commit the change. You can then judge how well RCS manages binary files for you.
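If you want to script that experiment, here is a rough sketch of the same workflow driven from Python (it assumes GNU RCS is installed and on the PATH; the file name and messages are placeholders):

import subprocess

def run(*args):
    subprocess.run(args, check=True)

# Create the ,v archive with keyword substitution switched off (-kb),
# so binary content is never rewritten, then deposit revision 1.1.
run("rcs", "-i", "-kb", "-t-Photoshop image under RCS", "image.psd")
run("ci", "-u", "-mInitial version", "image.psd")

# Edit cycle: lock, change the image in Photoshop, commit the change.
run("co", "-l", "image.psd")
# ... add a layer, save ...
run("ci", "-u", "-mAdded a layer", "image.psd")

# Recall an older revision into the working file.
run("co", "-f", "-r1.1", "image.psd")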
RCS has been serving us quite well. It is well maintained, reliable, predictable, and definitely worth a try.

Forgive me for asking, but my experience has taught me to challenge assumptions. I don't know why you need a 'single-file' solution, but my answer depends on that.
Option 1 - If you are simply looking for ease of use, have you considered using a single git repo to track multiple binaries?
With git's per-file history capabilities you can view the history of every file in the repo independently, create patches, and roll back changes without affecting the rest of the repo. For example, if each commit touches only a single file (and, say, follows a commit-naming convention), you can easily roll back changes to individual files using:
git log -- filename
git revert <commit-id>
Option 2 - If you have a system constraint that forces you to store a single file, I would recommend considering git-bundle. Basically, it allows you to pack a git repo into a single file for easier storage/relocation (I guess that's pretty much like zipping your repo and storing the zipped file).
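A rough sketch of that round trip from Python (the repo path, bundle name, and target directory are just placeholders):

import subprocess

# Pack the repo's entire history into one file...
subprocess.run(["git", "bundle", "create", "history.bundle", "--all"],
               cwd="my-repo", check=True)

# ...and later restore a full working repository from that single file.
subprocess.run(["git", "clone", "my-repo/history.bundle", "restored-repo"],
               check=True)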
Option 3 - Consider Fossil. I haven't used it, so I can't comment on its qualities, but it looks like it might answer your requirements.

Is there a way to combine the power of a DB with the transparency and simplicity of a textual file base?

I've just discovered Sir - a database based on text files, but it's far from ready and it's written in JS (i.e. not for me).
My first intuition was to ask whether there's something like this available for Python or C++, but since that's not the kind of question one should ask on Stack Overflow, let me put it more generally:
I like the way e.g. git is made - it stores data as separate, easy-to-handle files and is astonishingly fast at the same time. Moreover, git does not require a server holding data in memory to be fast (the filesystem cache does a good enough job) and - maybe the best part - the way git keeps data in "memory" (the filesystem) is intrinsically language-agnostic.
Of course git is not a database, and databases have different challenges to master, but I still dare to ask: are there generic approaches to making databases as transparent and manually modifiable as git is?
Are there keywords, examples, generally accepted concepts, or working projects (like Sir, but preferably Python- or C++-based) I should look into if I want to enhance my fuzzy, filesystem-polluting project with fast, database-like technology that provides a nice query language, without sacrificing the simplicity of just manually editing/copying/overwriting files on the filesystem?
SQLite is exactly what you are looking for. It is built into Python as well, via the sqlite3 module.
It's just not human-readable, but neither is git's object store. It is, however, purely serverless and file-based, just like git.
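A tiny example of how little ceremony it takes (the table and file names are just illustrative):

import sqlite3

conn = sqlite3.connect("data.db")  # a single ordinary file on disk, no server process
conn.execute("CREATE TABLE IF NOT EXISTS items (key TEXT PRIMARY KEY, value TEXT)")
conn.execute("INSERT OR REPLACE INTO items VALUES (?, ?)", ("greeting", "hello"))
conn.commit()

for key, value in conn.execute("SELECT key, value FROM items"):
    print(key, value)
conn.close()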

HDF5 possible data corruption or loss?

On Wikipedia one can read the following criticism of HDF5:
Criticism of HDF5 follows from its monolithic design and lengthy specification. Though a 150-page open standard, there is only a single C implementation of HDF5, meaning all bindings share its bugs and performance issues. Compounded with the lack of journaling, documented bugs in the current stable release are capable of corrupting entire HDF5 databases. Although 1.10-alpha adds journaling, it is backwards-incompatible with previous versions. HDF5 also does not support UTF-8 well, necessitating ASCII in most places. Furthermore, even in the latest draft, array data can never be deleted.
I am wondering whether this applies only to the C implementation of HDF5 or whether it is a general flaw of HDF5.
I am doing scientific experiments which sometimes generate gigabytes of data, and in all cases at least several hundred megabytes. Obviously data loss, and especially corruption, would be a huge disadvantage for me.
My scripts always have a Python API, hence I am using h5py (version 2.5.0).
So, is this criticism relevant to me and should I be concerned about corrupted data?
Declaration up front: I help maintain h5py, so I probably have a bias.
The Wikipedia page has changed since the question was posted; here's what I see now:
Criticism
Criticism of HDF5 follows from its monolithic design and lengthy specification.
Though a 150-page open standard, the only other C implementation of HDF5 is just an HDF5 reader.
HDF5 does not enforce the use of UTF-8, so client applications may be expecting ASCII in most places.
Dataset data cannot be freed in a file without generating a file copy using an external tool (h5repack).
I'd say that pretty much sums up the problems with HDF5: it's complex (but people need this complexity; see the virtual dataset support), it has a long history with backwards compatibility as its focus, and it's not really designed to allow for massive changes in files. It's also not the best on Windows (due to how it deals with filenames).
I picked HDF5 for my research because, of the available options, it had decent metadata support (HDF5 at least allows UTF-8; formats like FITS don't even have that), support for multidimensional arrays (which formats like Protocol Buffers don't really support), and support for more than just 64-bit floats (which is very rare).
I can't comment about known bugs, but I have seen corruption (this happened when I was writing to a file and linux OOM'd my script). However, this shouldn't be a concern as long as you have proper data hygiene practices (as mentioned in the hackernews link), which in your case would be to not continuously write to the same file, but for each run create a new file. You should also not modify the file, instead any data reduction should produce new files, and you should always backup the originals.
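As an illustration of that one-file-per-run habit with h5py (the timestamped filename and dataset name are just one way of doing it):

import time
import h5py
import numpy as np

# A fresh file for every run; the originals are then archived, never modified.
filename = "run_%s.h5" % time.strftime("%Y%m%d-%H%M%S")
with h5py.File(filename, "w") as f:
    f.attrs["description"] = "raw data, written once during the run"
    f.create_dataset("measurements", data=np.random.random((1000, 3)),
                     compression="gzip")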
Finally, it is worth pointing out that there are alternatives to HDF5, depending on what exactly your requirements are: SQL databases may fit your needs better (and sqlite comes with Python by default, so it's easy to experiment with), as could a simple csv file. I would recommend against custom/non-portable formats (e.g. pickle and similar), as they're no more robust than HDF5 while being more complex than a csv file.

Git: Master-thesis subprojects as submodules or stand-alone repositories

I just started using git to get the code I write for my Master's thesis more organized. I have divided the tasks into 4 sub-folders, each one containing data and the programs that work with that data. The 4 sub-projects do not necessarily need to be connected; none of the programs uses functions from the other sub-projects. However, the output files produced by the programs in one sub-folder are used by programs in another sub-folder.
In addition some programs are written in Bash and some in Python.
I use git in combination with bitbucket. I am really new to the whole concept, so I wonder if I should create one "Master-thesis" repository or rather one repository for each of the (until now) 4 sub-projects. Thank you for your help!
Well, as devnull says, answers would be highly opinion-based, but given that I disagree that that's a bad thing, I'll go ahead and answer if I can type before someone closes the question. :)
I'm always inclined to treat git repositories as separate units of work or projects. If I'm likely to work on various parts of something as a single project or toward a common goal (e.g., Master's thesis), my tendency would be to treat it as a single repository.
And by the way, since the .git repository will be in the root of that single repository, if you need to spin off a piece of your work later and track it separately, you can always create a new repository at that point. In the meantime, it seems "keep it simple" would mean one repo.
I recommend a single master repository for this problem. You mentioned that the output files of certain programs are used as input to the others. These programs may not have run-time dependencies on each other, but they do have dependencies. It sounds like they will not work without each other being present to create the data. Especially if file location (e.g. relative path) is important, then a single repository will help you keep them better organized.

Persistent database state strategies

Due to several edits, this question might have become a bit incoherent. I apologize.
I'm currently writing a Python server. It will never see more than 4 active users, but I'm a computer science student, so I'm planning for it anyway.
Currently, I'm about to implement a function to save a backup of the current state of all relevant variables into CSV files. Of those I currently have 10, and they will never be really big, but... well, computer science student and so on.
So, I am currently thinking about two things:
When to run a backup?
What kind of backup?
When to run:
I can either run a backup every time a variable changes, which has the advantage of always having the current state in the backup, or something like once every minute, which has the advantage of not rewriting the file hundreds of times per minute if the server gets busy, but will create a lot of useless rewrites of the same data if I don't implement detection of which variables have changed since the last backup.
Directly related to that is the question what kind of backup I should do.
I can either do a full backup of all variables (which is pointless if I'm running a backup every time a variable changes, but might be good if I'm running a backup every X minutes), or a full backup of a single variable (which would be better if I'm backing up each time a variable changes, but would involve either multiple backup functions or smart detection of which variable needs backing up), or I can try some sort of delta backup on the files (which would probably involve reading the current file and rewriting it with the changes, so it's probably pretty stupid, unless there is a trick for this in Python I don't know about).
I cannot use shelves because I want the data to be portable between different programming languages (Java, for example, probably cannot open Python shelves), and I cannot use MySQL for different reasons, mainly that the machine that will run the server has no MySQL support and I don't want to use an external MySQL server since I want the server to keep running when the internet connection drops.
I am also aware that there are several ways to do this with pre-built functionality in Python and/or other software (sqlite, for example). I am just a big fan of building this stuff myself; not because I like to reinvent the wheel, but because I like to know how the things I use work. I'm building this server partly just for learning Python, and although knowing how to use SQLite is useful, I also enjoy doing the "dirty work" myself.
In my usage scenario of possibly a few requests per day I am tending towards the "backup on change" idea, but that would quickly fall apart if, for some reason, the server gets really, really busy.
So, my question basically boils down to this: Which backup method would be the most useful in this scenario, and have I possibly missed another backup strategy? How do you decide on which strategy to use in your applications?
Please note that I raise this question mostly out of a general curiosity for backup strategies and the thoughts behind them, and not because of problems in this special case.
Use sqlite. You're asking about building persistent storage using csv files, and about how to update the files as things change. What you're asking for is a lightweight, portable relational (as in, table based) database. Sqlite is perfect for this situation.
Python has had sqlite support in the standard library since version 2.5 with the sqlite3 module. Since a sqlite database is implemented as a single file, it's simple to move them across machines, and Java has a number of different ways to interact with sqlite.
I'm all for doing things for the sake of learning, but if you really want to learn about data persistence, I wouldn't tie yourself to the idea of a "csv database". I would start by looking at the Wikipedia page for Persistence. What you're thinking about is basically a "system image" for your data. The Wikipedia article describes some of the same shortcomings of this approach that you've mentioned:
State changes made to a system after its last image was saved are lost in the case of a system failure or shutdown. Saving an image for every single change would be too time-consuming for most systems.
Rather than trying to update your state wholesale at every change, I think you'd be better off looking at some other form of persistence. For example, some sort of journal could work well. This makes it simple to just append any change to the end of a log-file, or some similar construct.
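A minimal sketch of that journaling idea, assuming each change can be expressed as a name/value pair (JSON lines and the file name are just one possible choice):

import json
import time

def log_change(name, value, path="changes.journal"):
    entry = {"time": time.time(), "name": name, "value": value}
    with open(path, "a") as f:  # append only; old entries are never rewritten
        f.write(json.dumps(entry) + "\n")

def replay(path="changes.journal"):
    state = {}
    with open(path) as f:
        for line in f:  # rebuild the current state by replaying the history
            entry = json.loads(line)
            state[entry["name"]] = entry["value"]
    return state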
However, if you end up with many concurrent users, with processes running on multiple threads, you'll run into concerns about whether your changes are atomic, or whether they conflict with one another. While operating systems generally have some ways of dealing with locking files for edits, you're opening up a can of worms trying to learn how that works and interacts with your system. At this point you're back to needing a database.
So sure, play around with a couple different approaches. But as soon as you're looking to just get it working in a clear and consistent manner, go with sqlite.
If your data is in CSV files, why not use a revision control system on those files? E.g. git would be pretty fast and give excellent history. The repository would be wholly contained in the directory where the files reside, so it's pretty easy to handle. You could also replicate that repository to other machines or directories easily.

How to programmatically merge text files with potential conflicts (ala git or svn, etc)?

As part of a larger project, I want the ability to take two bodies of text and hand them to a merge algorithm which returns either an auto-merged result (in cases where the changes are not conflicting) or throws an error and (potentially) produces a single text document with the conflicting changes highlighted.
Basically, I just want a programmatic way to do what every source control system on the planet does internally, but I'm having a hard time finding it. There are tons of visual GUIs for this sort of thing dominating my search results, but none of them seem to expose the core merging algorithm in an easily accessible way. Does everyone rely on some common and well-understood algorithm/library whose name I just don't know, so that I'm having a hard time searching for it? Is this just some minor tweak on diff, and should I be looking for diff libraries instead of merge libraries?
Python libraries would be most helpful, but I can live with the overhead of interfacing with some other library (or command line solution) if I have to; this operation should be relatively infrequent.
You're probably searching for merge algorithms like 3-way merging, which you can find in many open-source projects, e.g. in the Bazaar VCS (merge3.py source).
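The same code also appears to be published on PyPI as a standalone merge3 package; assuming that package, a three-way merge looks roughly like this:

import merge3

base  = ["one\n", "two\n", "three\n"]
mine  = ["one\n", "TWO\n", "three\n"]   # my edit to line 2
other = ["one\n", "two\n", "3\n"]       # someone else's edit to line 3

m = merge3.Merge3(base, mine, other)
merged = "".join(m.merge_lines(name_a="mine", name_b="other"))
# Non-overlapping edits (as here) are merged automatically; overlapping edits
# are wrapped in git/diff3-style conflict markers instead.
print(merged)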
Did you check out difflib?
http://docs.python.org/library/difflib.html
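Note that difflib computes diffs, not merges, so it's a building block rather than a complete answer; for example:

import difflib

base = "the quick brown fox\njumps over the lazy dog\n".splitlines(keepends=True)
edit = "the quick brown fox\nleaps over the lazy dog\n".splitlines(keepends=True)

# Produce a unified diff between the two versions.
for line in difflib.unified_diff(base, edit, fromfile="base", tofile="edited"):
    print(line, end="")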
