I run several processes in Python (using multiprocessing.Process) on an Ubuntu machine.
Each of the processes writes various temporary files. Each process writes different files, but all files are in the same folder.
Is there any potential risk of error here?
The reason I think there might be a problem is that, AFAIK, a folder in Unix is just a file. So it's just like several processes writing to the same file at the same time, which might cause a loss of information.
Is this really a potential risk here? If so, how to solve it?
This has absolutely nothing to do with Python, as file operations in Python use OS level system calls (unless run as root, your Python program would not have permissions to do raw device writes anyway and doing them as root would be incredibly stupid).
A little bit of file system theory if anyone cares to read:
Yes, if you study file system architecture and how data is actually stored on drives, there are similarities between files and directories - but only on data storage level. The reason being there is no need to separate these two. For example ext4 file system has a method of storing information about a file (metadata), stored in small units called inodes, and the actual file itself. Inode contains a pointer to the actual disk space where file data can be found.
File systems generally are rather agnostic to directories. A file system is basically just this: it contains information about free disk space, information about files with pointers to data, and the actual data. Part of metadata is the directory where the file resides. In modern file systems (ancient FAT is the exception that is still in use) data storage on disk is not related to directories. Directories are used to allow both humans and the computer implementing the file system locate files and folders quickly instead of walking through sequentially the list of inodes until the correct file is found.
You may have read that directories are just files. Yes, they are "files" that contain a list of the files in the directory (or actually a tree, but please do not confuse this with a directory tree: it is just a mechanism for storing information about large directories so that files in that directory do not need to be searched sequentially within the directory entry). The reason this is a file is that files are the mechanism by which file systems store data. There is no need to have a specific data storage mechanism, as a directory only contains a list of files and pointers to their inodes. You could think of it as a database or even simpler, a text file. But in the end it is just a file that contains pointers, not something that is allocated on the disk surface to contain the actual files stored in the directory.
That was the background.
The file system implementation on your computer is just a piece of software that knows how to deal with all this. When you open a file in a certain directory for writing, something like this usually happens:
1. A free inode is located and an entry created there
2. Free clusters / blocks database is queried to find storage space for the file contents
3. File data is stored and blocks/clusters are marked "in use" in that database
4. Inode is updated to contain file metadata and a pointer to this disk space
5. "File" containing the directory data of the target directory is located
6. This file is modified so that one record is added. This record has a pointer to the inode just created, and the file name as well
7. Inode of the file is updated to contain a link to the directory, too.
It is the job of operating system and file system driver within it to ensure all this happens consistently. In practice it means the file system driver queues operations. Writing several files into the same directory simultaneously is a routine operation - for example web browser cache directories get updated this way when you browse the internet. Under the hood the file system driver queues these operations and completes steps 1-7 for each new file before it starts processing the following operation.
To make it a bit more complex there is a journal acting as an intermediate buffer. Your transactions are written to the journal, and when the file system is idle, the file system driver commits the journal transactions to the actual storage space, but theory remains the same. This is a performance and reliability issue.
You do not need to worry about this on application level, as it is the job of the operating system to do all that.
In contrast, if you create a lot of randomly named files in the same directory, in theory there could be a conflict at some point if your random name generator produced two identical file names. There are ways to mitigate this, and this would be the part you need to worry about in your application. But anything deeper than that is the task of the operating system.
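One way to mitigate the name-collision risk is to let the standard library pick the names. The following is a minimal sketch, assuming each worker only needs some unique file in the shared folder; the function name write_temp_file and the "worker_" prefix are made up for illustration.

```python
import os
import tempfile


def write_temp_file(shared_dir, payload, prefix="worker_"):
    """Create a uniquely named file in shared_dir and write payload to it.

    tempfile.mkstemp creates the file itself (not just a name), so two
    processes calling this at the same moment get two different files.
    """
    fd, path = tempfile.mkstemp(prefix=prefix, suffix=".tmp", dir=shared_dir)
    with os.fdopen(fd, "wb") as handle:
        handle.write(payload)
    return path
```

A function like this could be passed as the target of each multiprocessing.Process, with every process pointing at the same shared_dir.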
On Linux, opening a file (with or without the O_CREAT flag set) is an atomic operation (see for example this list). In a nutshell, as long as your processes use different files, you should have no trouble at all.
Just for your information, appending to a file (up to a certain byte limit) is atomic as well. This article is interesting in this regard.
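The two operations mentioned above can be sketched with the low-level os module. This is only an illustration: the helper names are hypothetical, the O_EXCL flag is added here so that "create only if absent" happens in one step, and the byte limit for atomic appends is not checked.

```python
import os


def create_exclusive(path):
    """Create path only if it does not exist yet; return True on success."""
    try:
        # O_CREAT | O_EXCL: the existence check and the creation are one step.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    except FileExistsError:
        return False
    os.close(fd)
    return True


def append_line(path, line):
    """Append one line using a single write() on an O_APPEND descriptor."""
    fd = os.open(path, os.O_CREAT | os.O_APPEND | os.O_WRONLY, 0o644)
    try:
        os.write(fd, (line + "\n").encode("utf-8"))
    finally:
        os.close(fd)
```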
Writing to different files in the same folder won't cause a problem. Sure, a folder is a file in Linux but you open the file for writing not the folder.
On the other hand, writing to the same file with multiple processes can, depending on your log size, cause issues. See this question for more details: Does python logging support multiprocessing?
Related
I am currently working on a script that automatically syncs files from the Documents and Pictures directories with a USB stick that I use as sort of an "essentials backup". In practice, this should identify filenames and some information about them (like last time edited etc.) in the directories that I choose to sync.
If a file exists in one directory, but not in the other (i.e. it's on my computer but not on my USB drive), it should automatically copy that file to the USB as well. Likewise, if a file exists in both directories, but has different mod-times, it should replace the older with the newer one.
However, I have some issues with storing that information for the purpose of comparing those files. I initially thought about a file class, that stores all that information and through which I can compare objects with the same name.
Problem 1 with that approach is that if I create an object, how do I name it? Do I name it like the file? I would then have to remove the file extension like .txt or .py, because I'd run into trouble with my code. But I might have a notes.odt and a notes.jpg, which would be problem 2.
I am pretty new to Python, so my imagination is probably limited by my lack of knowledge. Any pointers on how I could make that work?
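One possible direction, sketched under assumptions: instead of naming one object per file, keep a dictionary keyed by each file's path relative to the synced directory, so notes.odt and notes.jpg are simply two different keys. The function names below are hypothetical and only the "copy if missing or older" rule described above is covered.

```python
import os
import shutil


def snapshot(root):
    """Map each file's path relative to root to its modification time."""
    files = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            files[os.path.relpath(full, root)] = os.path.getmtime(full)
    return files


def sync_newer(src, dst):
    """Copy files from src to dst when they are missing or older in dst."""
    src_files, dst_files = snapshot(src), snapshot(dst)
    copied = []
    for rel, mtime in src_files.items():
        if rel not in dst_files or dst_files[rel] < mtime:
            target = os.path.join(dst, rel)
            os.makedirs(os.path.dirname(target), exist_ok=True)
            shutil.copy2(os.path.join(src, rel), target)  # copy2 also copies the mtime
            copied.append(rel)
    return sorted(copied)
```

Calling sync_newer(computer_dir, usb_dir) and then sync_newer(usb_dir, computer_dir) would cover both directions.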
Situation
I get a ton of json files from a remote data source. I organize these files into an archive, then read them into a database. The archive exists to rebuild the database, if necessary.
The json files are generated remotely and sent to my server periodically, and the reading-in process happens continuously. On more than one occasion, we had a power loss to our servers overnight or over the weekend. This was a huge problem for database loading, since the processes halted and I didn't know what had been loaded and what hadn't, so I had to roll back to some previously known state and rebuild out of the archive.
To fix this problem, my master loader daemon (written in python) now uses the logging package to track what files it has loaded. The basic workflow of the loader daemon is
cp json file to archive
'rm' original
insert archived copy to database (it's MariaDB)
commit to database
log filename of loaded json file
I'm not so much worried about duplicates in the database, but I don't want gaps; that is, things in the archive that are not in the database. This method has prevented gaps so far, and it seems guaranteed to keep preventing them.
For my logging, it basically looks like this. When the daemon starts up on a set of received files' names, it checks for duplicates that have already been loaded to the destination database and then loads all the non-duplicates. It is possible to get duplicates from my remote data source.
import logging

def initialize_logs(filenames, destination):
    # Read the filenames already loaded into this destination database.
    try:
        with open("/data/dblogs/{0}.log".format(destination), 'r') as already_used:
            seen = set([line.rstrip("\n") for line in already_used])
    except FileNotFoundError:
        print("Log file for {0} not found. Repair database".format(destination))
        quit()
    # Set up a logger that appends each newly loaded filename to the same file.
    fnamelog = logging.getLogger('filename.log')
    fnamelog.setLevel(logging.INFO)
    fh = logging.FileHandler("/data/dblogs/{0}.log".format(destination))
    fh.setLevel(logging.INFO)
    fnamelog.addHandler(fh)
Then, as I process the json files, I log each file added using
fnamelog.info(filename)
The database loader is run parallelized, so I originally chose the logging package for its built in concurrency protections. There are a variety of databases; not every database pulls all data from the json files. Some databases with more information are shorter in time, usually one to two months. In this case, it is nice to have a log file with all json files in a given database, so if I want to add some on to it, I don't have to worry about what is already in there, the log file is keeping track.
Problem
A year has passed. I have kept getting json files. I am now getting around a million files per month. The text logging of each filename as it is processed in is clumsy, but it still works...for now. There are multiple databases, but for the largest ones, the log file is over half a GB. I feel like this logging solution will not work well for much longer.
What options are available in python to track which filenames have been inserted into a database, when there are over 10 million filenames per database, and rising?
One approach would be to log the files in a table in the database itself rather than in a text log file. If you added some columns for things like import date or file name, that might provide you a little bit of flexibility with respect to finding information from these logs when you need to do that, but also it would allow you to perform periodic maintenance (like for example deleting log records that are more than a few months old if you know you won't need to look at those ever).
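A minimal sketch of that idea follows. It uses the standard-library sqlite3 module as a stand-in so the example is self-contained; with MariaDB the driver and SQL dialect would differ (for instance INSERT IGNORE instead of INSERT OR IGNORE). The table name loaded_files and the helper names are assumptions made for illustration.

```python
import sqlite3


def open_load_log(db_path):
    """Open (or create) a table that records every loaded filename."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS loaded_files ("
        " filename TEXT PRIMARY KEY,"
        " imported_at TEXT DEFAULT CURRENT_TIMESTAMP)"
    )
    return conn


def mark_loaded(conn, filename):
    """Record filename; return False if it had already been recorded."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO loaded_files (filename) VALUES (?)", (filename,)
    )
    conn.commit()
    return cur.rowcount == 1


def purge_older_than(conn, days):
    """Periodic maintenance: delete records older than the given age."""
    conn.execute(
        "DELETE FROM loaded_files WHERE imported_at < datetime('now', ?)",
        ("-{0} days".format(days),),
    )
    conn.commit()
```

Because the filename is the primary key, the duplicate check becomes an indexed lookup rather than a scan of a half-gigabyte text file.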
If you decide to keep using text based log files, you might consider breaking them up so you don't wind up with a giant monolithic log file. When you install things like Apache that log lots of data, you'll see it automatically sets up log rotation to compress and archive log files periodically...
You don't say what type of database you are using but the general approach to take is
1) make a hash of each json file. SHA256 is widely available. If you are concerned about performance see this post https://softwareengineering.stackexchange.com/questions/49550/which-hashing-algorithm-is-best-for-uniqueness-and-speed
2) make the hash field a unique key on your database and before you do the other operations try and insert it. If you can't, the record already exists and the transaction will abort
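The two steps above can be sketched as follows. sqlite3 is used here only as a self-contained stand-in for the real database, and the table name file_hashes and the function names are hypothetical.

```python
import hashlib
import sqlite3


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files do not have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def open_hash_table(db_path):
    """Create the table whose unique key rejects already-seen hashes."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS file_hashes (sha256 TEXT UNIQUE)")
    return conn


def claim_file(conn, path):
    """Try to insert the file's hash; return False if it is already there."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO file_hashes (sha256) VALUES (?)",
                (sha256_of_file(path),),
            )
    except sqlite3.IntegrityError:
        return False
    return True
```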
Program 1:
Foreach file in input directory
    INSERT IGNORE into database the MD5 of the file
    "mv" the file to the archive directory
Program 2, a "keep-alive" program.
It is run via cron every minute and tries to launch Program 1, but does not start it if it is already running.
Notes:
'mv' and 'cron' assume Unix. If using Windows, do the equivalent.
'mv' is atomic, so the file will be either in one directory or the other; no hassle of knowing whether it is 'processed'. (So, I wonder why you even have a database table??)
Since the INSERT and mv cannot realistically be done "atomically", here is why my plan is safe: IGNORE.
The "is it running" can be handled in a number of ways, either in Program 1 or 2.
You could add a timestamp and/or filename to the table containing the md5; whatever you like.
Since it is not a good idea to have even 10K files in a directory, you should use something other than the pair of flat directories I have envisioned.
You are getting only about 1 file every 3 seconds. This is not a heavy load unless the files are huge. Then it becomes an I/O problem, not a database problem.
I have a feeling that either I am missing a hidden 'requirement', or you are being extra paranoid. I don't really understand what you need to do with the files.
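For the "is it running" check mentioned in the notes, one of the possible ways on Unix is an advisory lock on a lock file. The sketch below is an assumption about how Program 1 might do it, not part of the plan as written; the function name and the lock path are made up.

```python
import fcntl
import os


def acquire_single_instance_lock(lock_path):
    """Return an open lock descriptor, or None if another holder exists.

    The lock is released automatically when the holding process exits,
    so a crash or power loss cannot leave a stale lock behind.
    """
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)
        return None
    return fd
```

Program 1 would call this at startup and exit immediately if it gets None, which lets cron launch it every minute without piling up copies.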
Is it better to walk a directory structure when performing multiple searches or is it a good idea to catalog the directory structure (in a file or memory) and then operate on that catalog? Or are there other methods which are better suited which I haven't hit upon?
I have a 3.5TB external HDD with thousands of files.
I have a set of files which list the contents of a directory. These listing files hold a folder name, filenames and file sizes.
I want to search the external HDD for the files in these listing files. If a file is found I then want to check and see if the file size of the actual file matches that in the listing file.
This process will cover about 1000 listing files and probably 10s of thousands of actual files.
A listing file would have contents like
folder: SummerPhotos
name: IMG0096.jpg, length: 6589
name: IMG0097.jpg, length: 6489
name: IMG0098.jpg, length: 6500
name: IMG0099.jpg, length: 6589
name: BeachPhotos/IMG0100.jpg, length, 34892
name: BeachPhotos/IMG0101.jpg, length, 34896
I like the offline processing of the listing files with a file which lists the contents of the external HDD because then I can perform this operation on a faster computer (as the hard drive is on an old computer acting as a server) or split the listing files over several computers and split up the work. Plus I think that continually walking the directory structure is about as inefficient as you can get and putting unnecessary wear on the hardware.
Walk pseudo code:
for each listing file
    get base_foldername,filelist
    for root,subfolder,files in os.walk(/path/to/3.5TBdrive)
        if base_foldername in subfolder
            for file in filelist
                if file in files
                    if file.size == os.path.getsize(file)
                        dosomething
                    else
                        somethingelse
                else
                    not_found
For the catalog file method I'm thinking of dumping a recursive 'ls' to file and then pretty much doing a string search on that file. I'll extract the filesize and perform a match there.
My 'ls -RlQ' dump file is 11MB in size with ~150k lines. If there is a better way to get the required data I'm open to suggestions. I'm thinking of using the os.walk() to compile a list and create my own file in a format I like vs trying to parse my ls command.
I feel like I should be doing something to make my college professors proud and making a hashtable or balanced tree, but feel like the effort to implement that will take longer than simply brute forcing the solution with CPU cycles.
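For what it's worth, a Python dict is already a hash table, so the catalog approach may need very little code. The sketch below walks the drive once and then checks listing lines against the in-memory catalog. It assumes the listing format shown above (accepting both "length: N" and "length, N") and that the folder name matches a directory's base name somewhere on the drive; all function names are hypothetical.

```python
import os
import re

ENTRY = re.compile(r"name:\s*(.+?),\s*length[:,]\s*(\d+)")


def build_catalog(root):
    """Walk the drive once; return (path -> size, folder name -> paths)."""
    sizes, dirs_by_name = {}, {}
    for dirpath, dirnames, filenames in os.walk(root):
        for d in dirnames:
            dirs_by_name.setdefault(d, []).append(os.path.join(dirpath, d))
        for name in filenames:
            full = os.path.join(dirpath, name)
            sizes[full] = os.path.getsize(full)
    return sizes, dirs_by_name


def check_listing(lines, sizes, dirs_by_name):
    """Return {name: 'match' | 'size_mismatch' | 'not_found'} for one listing."""
    folder, results = None, {}
    for line in lines:
        line = line.strip()
        if line.startswith("folder:"):
            folder = line.split(":", 1)[1].strip()
            continue
        entry = ENTRY.match(line)
        if not entry:
            continue
        name, length = entry.group(1), int(entry.group(2))
        status = "not_found"
        for base in dirs_by_name.get(folder, []):
            actual = sizes.get(os.path.join(base, *name.split("/")))
            if actual is not None:
                status = "match" if actual == length else "size_mismatch"
                break
        results[name] = status
    return results
```

The two dictionaries could also be dumped to a file (for example with json) so the listing files can be checked on a faster machine.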
OS: Linux
preferred programming language: Python 2/3
Thanks!
Is it better to walk a directory structure when performing multiple searches or is it a good idea to catalog the directory structure (in a file or memory) and then operate on that catalog?
If you just want to check if the file exists or the directory structure is not too complex, I suggest you just use your filesystem. You're basically duplicating the work that it already does anyway and this will lead to problems in the future, as complexity always does.
I don't see any point using hashtables or balanced trees for in-program data structures - this is also what your filesystem already does. What you should instead do to speed up lookups is to design a deep directory structure instead of a few single directories that contain thousands of files. There are filesystems that choke while trying to list directories with tens of thousands of files and it is a better idea to limit yourself to a few thousand and create a new level of directory depth should you exceed it.
For example, if you want to keep logs of your internet-wide scanning research, if you use a single file for each host you scanned, you don't want to create a directory scanning-logs with files such as 1.1.1.1.xml, 1.1.1.2.xml and so on. Instead, naming such as scanning-logs/1/1/1.1.1.1.xml is a better idea.
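The naming scheme from that example can be computed with a small helper. This is just a sketch; the function name and the depth parameter are made up for illustration.

```python
import os


def sharded_log_path(base_dir, ip_address, depth=2):
    """Place <ip>.xml under one subdirectory per leading octet."""
    octets = ip_address.split(".")
    return os.path.join(base_dir, *octets[:depth], ip_address + ".xml")
```

With depth=2, the host 1.1.1.1 ends up under scanning-logs/1/1/, matching the layout described above.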
Also, watch out for the inode limit! I was once building a large file-based database on EXT4 filesystem. One day I started getting error messages like "no space left on device" even though I clearly had quite a lot of space left. The real reason was that I created too many inodes - the limit can be manually set while creating a volume.
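On Unix, the inode counters can be read from Python before they become a problem. A minimal sketch (the function name is hypothetical; the numbers are the same ones `df -i` reports):

```python
import os


def inode_usage(path):
    """Return (total, free) inode counts for the filesystem holding path."""
    stats = os.statvfs(path)
    return stats.f_files, stats.f_ffree
```

Some filesystems do not have a fixed inode table and may report zero for both values.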
I am wondering if it is possible to compile a list of deleted files on a windows file system, FAT or NTFS. I do not need to actually recover the files, only have access to their name and any other accessible time (time deleted, created etc).
Even if I can run a cmd line tool to achieve this it would be acceptable.
The application is being developed in Python, however if another language has the capability I could always create a small component implemented in that language.
Thanks.
This is a very complex task. I would look at open-source forensic tools.
You should also analyze the recycling bin (files that are not completely deleted).
For FAT you will not be able to get the first character of a deleted file.
For some deleted files the metadata will be gone.
NTFS is much more complex and time consuming due to the more complex nature of this file system.
I have an application written in Python that's writing large amounts of data to the %TEMP% folder. Oddly, every once in a while, it dies, returning IOError: [Errno 28] No space left on device. The drive has plenty of free space, %TEMP% is not its own partition, I'm an administrator, and the system has no quotas.
Does Windows artificially put some types of limits on the data in %TEMP%? If not, any ideas on what could be causing this issue?
EDIT: Following discussions below, I clarified the question to better explain what's going on.
What is the exact error you encounter?
Are you creating too many temp files?
The GetTempFileName method will raise an IOException if it is used to create more than 65535 files without deleting previous temporary files.
The GetTempFileName method will raise an IOException if no unique temporary file name is available. To resolve this error, delete all unneeded temporary files.
One thing to note is that if you're indirectly using the Win32 API, and you're only using it to get temp file names, note that while (indirectly) calling it:
Creates a uniquely named, zero-byte temporary file on disk and returns the full path of that file.
If you're using that path but also changing the value returned, be aware you might actually be creating a 0byte file and an additional file on top of that (e.g. My_App_tmpXXXX.tmp and tmpXXXX.tmp).
As Nestor suggested below, consider deleting your temp files after you're done using them.
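In Python, one way to make sure temp files are deleted is to let the tempfile module own their lifetime. The sketch below is an assumption about how the application could be structured, with hypothetical function names; the second helper simply counts what is currently sitting in the temp folder.

```python
import os
import tempfile


def process_with_scratch_file(data):
    """Write data to a temp file that is removed when the block exits."""
    with tempfile.NamedTemporaryFile(suffix=".tmp") as scratch:
        scratch.write(data)
        scratch.flush()
        path = scratch.name
        size = os.path.getsize(path)
    # The file no longer exists here; only its former path and size remain.
    return path, size


def count_temp_entries(directory=None):
    """Count the entries in the temp folder (defaults to the system one)."""
    directory = directory or tempfile.gettempdir()
    return sum(1 for _ in os.scandir(directory))
```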
Using a FAT32 filesystem I can imagine this happening when:
Writing a lot of data to one file, and you reach the 4GB file size cap.
Or when you are creating a lot of small files and reaching the 2^16-2 files per directory cap.
Apart from this, I don't know of any limitations the system can impose on the temp folder, apart from the physical partition actually being full.
Another limitation is as Mike Atlas has suggested the GetTempFileName() function which creates files of type tmpXXXX.tmp. Although you might not be using it directly, verify that the %TEMP% folder does not contain too many of them (2^16).
And maybe the obvious, have you tried emptying the %TEMP% folder before running the utility?
There shouldn't be such space limitation in Temp. If you wrote the app, I would recommend creating your files in ProgramData...
There should be no trouble whatsoever with regard to your %TEMP% directory.
What is your disk quota set to for %TEMP%'s hosting volume? Depending in part on what the apps themselves are doing, one of them may be throwing an error due to the disk quota being reached, which is a pain if this quota is set unreasonably low. If the quota is very low, try raising it, which you can do as Administrator.