medium datasets under source control - python

This is more of a general question about how feasible it is to store data sets under source control.
I have 20,000 CSV files with numeric data that I update every day. The overall size of the directory is 100 MB or so, stored on a local disk on an ext4 partition.
Each day's changes should amount to diffs of about 1 kB.
I may have to issue corrections to the data, so I am considering versioning the whole directory: 1 top-level dir contains 10 level-1 dirs, each containing 10 level-2 dirs, each containing 200 CSV files.
The data is written to the files by Python processes (pandas DataFrames).
The question is about the performance of writes where the deltas are this small compared to the entire data set.
svn and git come to mind, and both have Python modules to drive them.
What works best?
Other solutions are, I am sure, possible, but I would like to stick to keeping the data in files as is...

If you're asking whether it would be efficient to put your datasets under version control, based on your description of the data, I believe the answer is yes. Both Mercurial and Git are very good at handling thousands of text files. Mercurial might be a better choice for you, since it is written in Python and is easier to learn than Git. (As far as I know, there is no good reason to adopt Subversion for a new project now that better tools are available.)
If you're asking whether there's a way to speed up your application's writes by borrowing code from a version control system, I think it would be a lot easier to make your application modify existing files in place. (Maybe that's what you're doing already? It's not clear from what you wrote.)
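For example, here is a minimal sketch of the daily-commit workflow the question describes, assuming Git is chosen, the data directory has already been initialized as a repository, and the path and function name below are hypothetical:

import subprocess

DATA_DIR = "/path/to/csv_tree"   # hypothetical location of the versioned directory

def commit_daily_update(message):
    # Stage everything that changed under the data directory and record one commit.
    # Note: "git commit" exits non-zero when there is nothing to commit.
    subprocess.run(["git", "add", "-A"], cwd=DATA_DIR, check=True)
    subprocess.run(["git", "commit", "-m", message], cwd=DATA_DIR, check=True)

commit_daily_update("daily data update")

Mercurial can be driven the same way with "hg addremove" and "hg commit", or through the hglib bindings; both tools delta-compress text history, so daily ~1 kB changes on top of 100 MB of CSVs should keep repository growth modest.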

Related

Is file reading faster in nested folders, or does it not matter?

My question concerns purely file-path reading and disk drives... I think.
I have Python code that needs to pull up a specific file whose file path I know exactly. I have a choice: either store that file in a large folder with thousands of other files, or segment them all into sub-folders. Which choice would give more reading speed?
My concern (and lack of knowledge) suggests that when code enters a big folder with thousands of other files, that is much more of a struggle than entering a folder with a few sub-folders. Or am I wrong, and is it all instant if I provide the exact file path?
Again, I don't have to scan the files or folders, since I know the exact file path, but I don't know what happens at the lower level with disk drives.
EDIT: Which of the two would be faster given standard HDD on Windows 7?
C://Folder_with_millions_of_files/myfile.txt
or
C://small_folder/small_folder254/small_folder323/myfile.txt
NOTE: What I need this for is not to scan thousands of files but to pull up just that one file as quickly as possible. Sort of a lookup table, I think.
Doing some reading, it appears that for maximum scalability the best practice is to split the folder into subfolders, although deeply nesting folders is not recommended; it is better to use a few larger folders than thousands of smaller ones.
Rather than shoveling all of those files into a single filesystem, why not spread them out across a series of smaller filesystems? The problems with that approach are that (1) it limits the kernel's ability to optimize head seeks and such, reducing performance, and (2) it forces developers (or administrators) to deal with the hassles involved in actually distributing the files. Inevitably things will get out of balance, forcing things to be redistributed in the future.
From looking at these articles, I would draw the following conclusion:
< 65,534 files (One folder should suffice)
> 65,534 files (Split into folders)
To allow for future scalability it would be advisable to split the data across folders, for example creating a new folder per 65,534 items, or per day, category, etc., depending on the file system and observed performance (one way to spread files into subfolders is sketched after the links below).
Based on,
single folder or many folders for storing 8 million images of hundreds of stores?
https://lwn.net/Articles/400629/
https://superuser.com/questions/446282/max-files-per-directory-on-ntfs-vol-vs-fat32
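As a concrete illustration of the split-into-subfolders conclusion, here is a minimal sketch (the root path and two-level layout are assumptions, not something taken from the linked articles) that hashes a file name into a fixed fan-out of subfolders:

import hashlib
import os

ROOT = r"C:\data_store"   # hypothetical root folder

def nested_path(filename):
    # Hash the name and use the first hex characters as two folder levels,
    # e.g. C:\data_store\ab\cd\myfile.txt, where "ab" and "cd" come from the hash.
    digest = hashlib.md5(filename.encode("utf-8")).hexdigest()
    folder = os.path.join(ROOT, digest[:2], digest[2:4])
    os.makedirs(folder, exist_ok=True)
    return os.path.join(folder, filename)

path = nested_path("myfile.txt")

With two hex characters per level there are 256 folders per level (65,536 leaf folders in total), which keeps every individual directory far below the 65,534-entry figure quoted above for realistic collection sizes.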

Strings, CSV/Excel, eventually DB, but statistics needed - which tool(s)? (Ch, Python, Ceemple?)

I am currently working on aligning text data, mostly hidden in CSV or Excel files from multiple sources. I've done this easily enough with Python (even on a Raspberry Pi) and OpenOffice. The issues are:
transforming disparate names to unique names (easy)
storing the data in CSV or Excel files (because my collaborators use Excel)
Eventually building a real DB (SQL based- MariaDB, Postgres) from the Excel files
Doing statistics on the data; mostly enumeration from different CSV files and comparison between samples - nice to generate graphs
for debugging purposes it would be nice to quickly generate bar charts and such of groups of the data
Nothing super-fancy, except that it gets slow in Python (no doubt generously helped by my "I am not a programmer" 'code'). The data sets will get 'large' (tens of thousands of lines of data times multiple dozens of data sets). I would like a programming tool which facilitates this.
I looked into Ch (& cling, cint) because I still remember a bit of C and it is interpreted, and Ch seems to offer a good set of libs. Python is OK for much of it, but I dislike the syntax. I try to work on Linux as much as I can, but eventually I have to hand it off to Windows users in a country not known for having fast computers. I was looking at Ceemple (ceemple.com) and was wondering if anyone has used it for a project and what their experience has been. Does it help with cross-platform issues (e.g., line termination)? Should I just forget Linux (with that wonderful shell, easy Python, and text editors which can load large files without bogging down) and move to Windows? If so, then compiled is just about the only way to go for me, likely precluding Ch and probably Python.
Please keep in mind that this is my 'side job' - I'm not a professional programmer. Low learning curve and least amount of tools required is important.
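As an illustration of the CSV-to-database-plus-quick-charts workflow described above, here is a minimal pandas sketch; the file names, table name, and "category" column are hypothetical:

import sqlite3
import matplotlib.pyplot as plt
import pandas as pd

# Combine a few hypothetical CSV exports into one frame.
frames = [pd.read_csv(name) for name in ["source_a.csv", "source_b.csv"]]
data = pd.concat(frames, ignore_index=True)

# Push the combined data into a local SQLite database for later SQL queries.
with sqlite3.connect("aligned_data.db") as conn:
    data.to_sql("samples", conn, if_exists="replace", index=False)

# Quick debugging bar chart: counts per (hypothetical) "category" column.
data["category"].value_counts().plot(kind="bar")
plt.show()

pandas can also read the Excel files directly (pd.read_excel), and the same to_sql call can target MariaDB or Postgres through an SQLAlchemy connection when the "real" database is eventually built.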

Python - Fast search for modified files in multiple folders

I am looking for an efficient implementation for finding modified files in multiple directories.
I do know that I can just recursively go through all those directories and check the modified date of the files I want to check. This is quite trivial.
But what if I have a complex folder structure with many files in it? The above approach won't really scale and might take several minutes.
Is there a better approach to probe for modifications? Like is there something like a checksum on folders that I could use to narrow down the number of folders I have to check for modifications or anything like that?
A second step to my problem is also finding newly created files.
I am looking for a Python-based solution which is Windows-compatible.
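For concreteness, here is a minimal sketch of the baseline approach the question describes (a recursive walk plus modification times, which works on Windows; the root path is hypothetical). Comparing two snapshots also reports newly created files:

import os

def snapshot(root):
    # Map every file path under root to its last-modified time.
    state = {}
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            state[path] = os.path.getmtime(path)
    return state

def changed_files(old_state, new_state):
    # New files are paths absent from the old snapshot;
    # modified files are paths whose mtime has moved.
    new = [p for p in new_state if p not in old_state]
    modified = [p for p in new_state if p in old_state and new_state[p] != old_state[p]]
    return new, modified

before = snapshot(r"C:\some\folder")   # hypothetical root
# ... time passes, files change ...
after = snapshot(r"C:\some\folder")
new, modified = changed_files(before, after)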

Optimizing a Mass ID3 Tag Scan [duplicate]

I'm building a small tool that I want to scan over a music collection, read the ID3 info of a track, and store it as long as that particular artist does not have a song that has been accessed more than twice. I'm planning on using Mutagen for reading the tags.
However, the music collections of myself and many others are massive, exceeding 20,000 songs. As far as I know, libraries like Mutagen have to open and close every song to get the ID3 info from it. While MP3s aren't terribly performance-heavy, that's a lot of songs. I'm already planning a minor optimization in the form of keeping a count of each artist and not storing any info if their song count exceeds 2, but as far as I can tell I still need to open every song to check the artist ID3 tag.
I toyed with the idea of using directories as a hint for the artist name and not reading any more info in that directory once the artist song count exceeds 2, but not everyone has their music set up in neat Artist/Album/Songs directories.
Does anyone have any other optimizations in mind that might cut down on the overhead of opening so many MP3s?
Beware of premature optimization. Are you really sure that this will be a performance problem? What are your requirements -- how quickly does the script need to run? How fast does it run with the naïve approach? Profile and evaluate before you optimize. I think there's a serious possibility that you're seeing a performance problem where none actually exists.
You can't avoid visiting each file once if you want a guaranteed correct answer. As you've seen, optimizations that entirely skip files will basically amount to automated guesswork.
Can you keep a record of previous scans you've done, and on a subsequent scan use the last-modified dates of the files to avoid re-scanning files you've already scanned once? This could mean that your first scan might take a little bit of time, but subsequent scans would be faster.
If you need to do a lot of complex queries on a music collection quickly, consider importing the metadata of the entire collection into a database (for instance SQLite or MySQL). Importing will take time -- updating to insert new files will take a little bit of time (checking the last-modified dates as above). Once the data is in your database, however, everything should be fairly snappy assuming that the database is set up sensibly.
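A minimal sketch of the record-of-previous-scans idea, assuming a hypothetical SQLite cache schema and mutagen's easy-tag interface (mutagen.File(path, easy=True)):

import os
import sqlite3
import mutagen

# Cache of (path, mtime, artist) from previous scans; the schema is a made-up example.
conn = sqlite3.connect("scan_cache.db")
conn.execute("CREATE TABLE IF NOT EXISTS tracks (path TEXT PRIMARY KEY, mtime REAL, artist TEXT)")

def artist_of(path):
    mtime = os.path.getmtime(path)
    row = conn.execute("SELECT mtime, artist FROM tracks WHERE path = ?", (path,)).fetchone()
    if row and row[0] == mtime:
        return row[1]                          # file unchanged since last scan: reuse cached tag
    audio = mutagen.File(path, easy=True)      # only new or modified files are actually opened
    artist = (audio.get("artist") or [""])[0] if audio else ""
    conn.execute("INSERT OR REPLACE INTO tracks VALUES (?, ?, ?)", (path, mtime, artist))
    return artist

for root, dirs, files in os.walk(r"C:\Music"):  # hypothetical collection root
    for name in files:
        if name.lower().endswith(".mp3"):
            artist_of(os.path.join(root, name))
conn.commit()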
In general, for this question I would recommend using multiple ways of detecting the artist or track title:
1st way to check: is the filename perhaps in ARTIST-TITLE.mp3 format (or similar)?
(the filename for this example would be "Artist-Track.mp3")
import os
import re

for file in os.listdir(PATH_TO_MP3s):  # PATH_TO_MP3s: folder containing the MP3 files
    # "Artist-Track.mp3" splits into [..., "Artist", "Track", "mp3"]
    parts = re.split(r"[_\-.]", file)
    artist, track, filetype = parts[-3], parts[-2], parts[-1]
Of course, you have to make sure the file is in that format first.
The 2nd step (if the first doesn't fit for that file) would be checking whether the directory names fit (like you said).
3rd and last one would be to check the ID3 tags.
But make sure to check that the values are right before trusting them.
For example, if someone used "Track-Artist.mp3", then with the code I provided, artist and track would be switched.

Is there a standard way, across operating systems, of adding "tags" to files

I'm writing a script to make backups of various different files. What I'd like to do is store meta information about the backup. Currently I'm using the file name, so for example:
backups/cool_file_bkp_c20120119_104955_d20120102
Where c represents the file creation datetime, and d represents the "data time", i.e. what the cool_file actually contains. The reason I currently use "data time" is that a later backup may be made of the same file, in which case I know I can safely replace the previous backup of the same "data time" without losing any information.
It seems like an awful way to do things, but it does seem to have the benefit of being non-OS-dependent. Is there a better way?
FYI: I am using Python to script my backup creation, and currently need to have this working on Windows XP, 2003, and Redhat Linux.
EDIT: Solution:
From the answers below, I've inferred that metadata on files is not widely supported in a standard way. Given my goal was to tightly couple the metadata with the file, it seems that archiving the file alongside a metadata textfile is the way to go.
I'd take one of two approaches there:
Create a stand-alone file, in the backup dir, that would contain the desired metadata - this could be something in human-readable form, just to make life easier, such as a JSON data structure or an "ini"-like file.
The other is to archive the copied files - possibly using "zip" - and bundle along with them a text file with the desired metadata.
The idea of creating zip archives to group files that belong together is used in several places, like Java .jar files, Open Document Format (office files created by several office suites), Office Open XML (Microsoft-specific office files), and even Python's own eggs.
The zipfile module in Python's standard library has all the tools necessary to accomplish this - you can just bundle a file containing a dictionary's representation along with the original one to hold as much metadata as you need.
In either of these approaches you will also need a helper script to let you see and filter the metadata on the files, of course.
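A minimal sketch of the zip-plus-metadata approach; the archive name and metadata keys are made up for illustration, with the timestamps taken from the example filename in the question:

import json
import zipfile

metadata = {
    "created": "2012-01-19T10:49:55",   # file creation datetime (the "c" part)
    "data_time": "2012-01-02",          # "data time" of the contents (the "d" part)
}

# Bundle the backed-up file and a JSON sidecar into one archive.
with zipfile.ZipFile("backups/cool_file_bkp.zip", "w", zipfile.ZIP_DEFLATED) as archive:
    archive.write("cool_file")
    archive.writestr("metadata.json", json.dumps(metadata, indent=2))

# Reading the metadata back later:
with zipfile.ZipFile("backups/cool_file_bkp.zip") as archive:
    info = json.loads(archive.read("metadata.json"))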
Different file systems (not different operating systems) have different capabilities for storing metadata. NTFS has plenty of possibilities, while FAT is very limited, and the ext* family is somewhere in between. None of the widespread (subjective term, yes) filesystems supports custom tags which you could use. Consequently, there exists no standard way to work with such tags.
On Windows there was an attempt to introduce Extended Attributes, but these were implemented in such a tricky way that they were almost unusable.
So putting whatever you can into the filename remains the only working approach. Remember that filesystems have limitations on file name and file path length, and with this approach you can exceed the limit, so be careful.
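For completeness, a small sketch of parsing the filename scheme described in the question; the regex simply mirrors the c<datetime>_d<date> convention shown above, and the helper name is made up:

import re
from datetime import datetime

PATTERN = re.compile(r"(?P<base>.+)_c(?P<created>\d{8}_\d{6})_d(?P<data_time>\d{8})$")

def parse_backup_name(filename):
    # "cool_file_bkp_c20120119_104955_d20120102" ->
    # base name plus the creation and "data time" stamps encoded in it.
    match = PATTERN.match(filename)
    if match is None:
        return None
    return {
        "base": match.group("base"),
        "created": datetime.strptime(match.group("created"), "%Y%m%d_%H%M%S"),
        "data_time": datetime.strptime(match.group("data_time"), "%Y%m%d"),
    }

info = parse_backup_name("cool_file_bkp_c20120119_104955_d20120102")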
