How to handle new files to process in cron job

How can I check files that I already processed in a script so I don't process those again? and/or
What is wrong with the way I am doing this now?
Hello,
I am running tshark with the ring buffer option so that it rolls over to a new file after 5MB or 1 hour. I wrote a Python script that reads these files as XML and dumps them into a database; this works fine.
My issue is that this is really process-intensive: one of those 5MB files can turn into a 200MB file when converted to XML, so I do not want to do any unnecessary processing.
The script runs every 10 minutes and processes ~5 files per run. Since it scans the folders where the files are created for any new entries, I store a hash of each file in the database; on the next run I check the hash, and if it isn't in the database I process the file.
The problem is that this does not appear to work every time: it ends up processing files that it has already done. When I check the hash of a file that it keeps trying to process, it doesn't show up anywhere in the database, which is why the script keeps trying to process it over and over.
I am printing out the filename + hash in the output of the script:
using file /var/ss01/SS01_00086_20100107100828.cap with hash: 982d664b574b84d6a8a5093889454e59
using file /var/ss02/SS02_00053_20100106125828.cap with hash: 8caceb6af7328c4aed2ea349062b74e9
using file /var/ss02/SS02_00075_20100106184519.cap with hash: 1b664b2e900d56ca9750d27ed1ec28fc
using file /var/ss02/SS02_00098_20100107104437.cap with hash: e0d7f5b004016febe707e9823f339fce
using file /var/ss02/SS02_00095_20100105132356.cap with hash: 41a3938150ec8e2d48ae9498c79a8d0c
using file /var/ss02/SS02_00097_20100107103332.cap with hash: 4e08b6926c87f5967484add22a76f220
using file /var/ss02/SS02_00090_20100105122531.cap with hash: 470b378ee5a2f4a14ca28330c2009f56
using file /var/ss03/SS03_00089_20100107104530.cap with hash: 468a01753a97a6a5dfa60418064574cc
using file /var/ss03/SS03_00086_20100105122537.cap with hash: 1fb8641f10f733384de01e94926e0853
using file /var/ss03/SS03_00090_20100107105832.cap with hash: d6209e65348029c3d211d1715301b9f8
using file /var/ss03/SS03_00088_20100107103248.cap with hash: 56a26b4e84b853e1f2128c831628c65e
using file /var/ss03/SS03_00072_20100105093543.cap with hash: dca18deb04b7c08e206a3b6f62262465
using file /var/ss03/SS03_00050_20100106140218.cap with hash: 36761e3f67017c626563601eaf68a133
using file /var/ss04/SS04_00010_20100105105912.cap with hash: 5188dc70616fa2971d57d4bfe029ec46
using file /var/ss04/SS04_00071_20100107094806.cap with hash: ab72eaddd9f368e01f9a57471ccead1a
using file /var/ss04/SS04_00072_20100107100234.cap with hash: 79dea347b04a05753cb4ff3576883494
using file /var/ss04/SS04_00070_20100107093350.cap with hash: 535920197129176c4d7a9891c71e0243
using file /var/ss04/SS04_00067_20100107084826.cap with hash: 64a88ecc1253e67d49e3cb68febb2e25
using file /var/ss04/SS04_00042_20100106144048.cap with hash: bb9bfa773f3bf94fd3af2514395d8d9e
using file /var/ss04/SS04_00007_20100105101951.cap with hash: d949e673f6138af2d388884f4a6b0f08
It should only be processing one file per folder, so only 4 files. This causes unnecessary processing, and I have to deal with overlapping cron jobs and other services being affected.
What I am hoping to get from this post is a better way to do this, or hopefully someone can tell me why this is happening. I know that the latter might be hard, since there could be a bunch of reasons.
Here is the code (I am not a coder but a sysadmin, so be kind :P); lines 30-32 handle the hash comparisons.
Thanks in advance.

A good way to handle/process files that are created at random times is to use
incron rather than cron. (Note: since incron uses the Linux kernel's
inotify syscalls, this solution only works with Linux.)
Whereas cron runs a job based on dates and times, incron runs a job based on
changes in a monitored directory. For example, you can configure incron to run a
job every time a new file is created or modified.
On Ubuntu, the package is called incron. I'm not sure about RedHat, but I believe this is the right package: http://rpmfind.net//linux/RPM/dag/redhat/el5/i386/incron-0.5.9-1.el5.rf.i386.html.
Once you install the incron package, read
man 5 incrontab
for information on how to set up the incrontab config file. Your incron_config file might look something like this:
/var/ss01/ IN_CLOSE_WRITE /path/to/processing/script.py $#
/var/ss02/ IN_CLOSE_WRITE /path/to/processing/script.py $#
/var/ss03/ IN_CLOSE_WRITE /path/to/processing/script.py $#
/var/ss04/ IN_CLOSE_WRITE /path/to/processing/script.py $#
Then to register this config with the incrond daemon, you'd run
incrontab /path/to/incron_config
That's all there is to it. Now whenever a file in /var/ss01, /var/ss02, /var/ss03 or /var/ss04 is closed after having been opened for writing (the IN_CLOSE_WRITE event), the command
/path/to/processing/script.py $#
is run, with $# replaced by the name of the newly created file.
This will obviate the need to store/compare hashes, and files will only get processed once -- immediately after they are created.
Just make sure your processing script does not write into the top level of the monitored directories.
If it does, then incrond will notice the new file created, and launch script.py again, sending you into an infinite loop.
incrond monitors individual directories, and does not recursively monitor subdirectories. So you could direct tshark to write to /var/ss01/tobeprocessed, use incron to monitor
/var/ss01/tobeprocessed, and have your script.py write to /var/ss01, for example.
PS. There is also a python interface to inotify, called pyinotify. Unlike incron, pyinotify can recursively monitor subdirectories. However, in your case, I don't think the recursive monitoring feature is useful or necessary.

I don't know enough about what is in these files, so this may not work for you, but if you have only one intended consumer, I would recommend using directories and moving the files to reflect their state. Specifically, you could have a dir structure like
/waiting
/progress
/done
and use the relative atomicity of mv to change the "state" of each file. (Whether mv is truly atomic depends on your filesystem, I believe.)
When your processing task wants to work on a file, it moves it from waiting to progress (and makes sure that the move succeeded). That way, no other task can pick it up, since it's no longer waiting. When the file is complete, it gets moved from progress to done, where a cleanup task might delete or archive old files that are no longer needed.
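The move-based state change described above could be sketched in Python roughly as follows. This is only an illustration: the function names claim_next and finish are hypothetical, the three directories are assumed to already exist on the same filesystem, and os.rename is used as the equivalent of mv.

```python
import os


def claim_next(waiting_dir, progress_dir):
    """Move one file from waiting to progress; return its new path, or None."""
    for name in sorted(os.listdir(waiting_dir)):
        src = os.path.join(waiting_dir, name)
        dst = os.path.join(progress_dir, name)
        try:
            os.rename(src, dst)
        except FileNotFoundError:
            # Another task moved this file first, so try the next one.
            continue
        return dst
    return None


def finish(path, done_dir):
    """Move a processed file from progress to done and return its new path."""
    dst = os.path.join(done_dir, os.path.basename(path))
    os.rename(path, dst)
    return dst
```

A task would call claim_next, process the returned path if it is not None, and then call finish. A file left in the progress directory for a long time then points to a task that crashed.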

I see several issues.
If you have overlapping cron jobs, you need a locking mechanism to control access. Only allow one process at a time, to eliminate the overlap problem. You might set up a shell script to do that: create a 'lock' by making a directory (mkdir is atomic), process the data, then delete the lock directory. If the shell script finds that the directory already exists when it tries to make it, then you know another copy is already running and it can just exit.
If you can't change the cron table(s) then just rename the executable and name your shell script the same as the old executable.
Hashes are not guaranteed to be unique identifiers for files. It's likely they are, but it's not absolutely guaranteed.
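The mkdir-based lock from the first point can also be taken inside the Python script itself rather than in a wrapper shell script. A minimal sketch, assuming a lock path of your choosing; directory_lock and run_exclusively are hypothetical names, not part of any library:

```python
import os
from contextlib import contextmanager


@contextmanager
def directory_lock(lock_path):
    """Yield True if this process created the lock directory, else False."""
    try:
        os.mkdir(lock_path)  # atomic: fails if the directory already exists
    except FileExistsError:
        yield False
        return
    try:
        yield True
    finally:
        os.rmdir(lock_path)  # release the lock, even if the job raised


def run_exclusively(lock_path, job):
    """Run job() only if no other copy holds the lock; report whether it ran."""
    with directory_lock(lock_path) as acquired:
        if not acquired:
            return False
        job()
        return True
```

One caveat with this kind of lock: if the process is killed before the finally clause runs, the directory stays behind and has to be removed by hand.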

Why not just move a processed file to a different directory?
You mentioned overlapping cron jobs. Does this mean one conversion process can start before the previous one has finished? If so, you would perform the move at the beginning of the conversion. If you are worried about an interrupted conversion, use an intermediate directory, and move to a final directory after completion.

If I'm reading the code correctly, you're updating the database (by which I mean the log of files processed) at the very end. So when a huge file is still being processed, another cron job can 'legally' start working on it, with both completing successfully and resulting in two entries in the database.
I suggest you move the logging-to-database step up, so that it acts as a lock for subsequent cron jobs, and record a 'success' or 'completed' state at the very end. The latter part is important, because something that is shown as processing but has no completed state can (coupled with the notion of time) be programmatically concluded to be an error. (That is to say, a cron job tried processing it but never completed it, and the log has shown processing for 1 week!)
To summarize
Move up the log-to-database so that it would act as a lock
Add a 'success' or 'completed' state which would give the notion of errored state
PS: Don't take this the wrong way, but the code is a little hard to understand. I am not sure whether I understand it at all.
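The two summary points could be sketched like this. The example uses SQLite from the standard library purely as a stand-in for whatever database the script already writes to; the table name, column names and function names are all made up for illustration.

```python
import sqlite3


def open_log(db_path=":memory:"):
    """Open the processing log, creating the (hypothetical) table if needed."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS processed ("
        " path TEXT PRIMARY KEY,"
        " status TEXT NOT NULL,"
        " started_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP)"
    )
    conn.commit()
    return conn


def claim(conn, path):
    """Log the file as 'processing' first; False means someone else has it."""
    try:
        with conn:
            conn.execute(
                "INSERT INTO processed (path, status) VALUES (?, 'processing')",
                (path,),
            )
    except sqlite3.IntegrityError:
        return False  # the primary key already exists
    return True


def mark_completed(conn, path):
    """Record the 'completed' state at the very end of processing."""
    with conn:
        conn.execute(
            "UPDATE processed SET status = 'completed' WHERE path = ?", (path,)
        )


def stale_claims(conn, older_than_hours):
    """List files stuck in 'processing' for longer than the given time."""
    rows = conn.execute(
        "SELECT path FROM processed WHERE status = 'processing'"
        " AND started_at < datetime('now', ?)",
        ("-%d hours" % older_than_hours,),
    )
    return [row[0] for row in rows]
```

Here the primary key on the path is what makes the early log entry behave like a lock: the second job to insert the same path gets an error and can skip the file.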

Related

How to detect files in a directory if the files have finished copying/adding? [duplicate]

Files are being pushed to my server via FTP. I process them with PHP code in a Drupal module. O/S is Ubuntu and the FTP server is vsftp.
At regular intervals I will check for new files, process them with SimpleXML and move them to a "Done" folder. How do I avoid processing a partially uploaded file?
vsftp has lock_upload_files defaulted to yes. I thought of attempting to move the files first, expecting the move to fail on a currently uploading file. That doesn't seem to happen, at least on the command line. If I start uploading a large file and move, it just keeps growing in the new location. I guess the directory entry is not locked.
Should I try fopen with mode 'a' or 'r+' just to see if it succeeds before attempting to load into SimpleXML or is there a better way to do this? I guess I could just detect SimpleXML load failing but... that seems messy.
I don't have control of the sender. They won't do an upload and rename.
Thanks

Using the lock_upload_files configuration option of vsftpd leads to locking files with the fcntl() function. This places advisory locks on uploaded files that are in progress. Other programs are not required to honor advisory locks, and mv, for example, does not. Advisory locks are, in general, just advice for programs that care about such locks.
You need another command line tool like lockrun which respects advisory locks.
Note: lockrun must be compiled with the WAIT_AND_LOCK(fd) macro set to use the lockf() function and not flock() in order to work with locks that are set by fcntl() under Linux. When lockrun is compiled to use lockf(), it will cooperate with the locks set by vsftpd.
With such features (lockrun, mv, lock_upload_files) you can build a shell script or similar that moves files one by one, checking if the file is locked beforehand and holding an advisory lock on it as long as the file is moved. If the file is locked by vsftpd then lockrun can skip the call to mv so that running uploads are skipped.
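The same check can be done without lockrun from any language that exposes fcntl-style locks. As a rough sketch in Python (used here instead of a shell script, and not tested against vsftpd itself), the function below tries to take a non-blocking exclusive lock and treats a refusal as "still uploading". The function name is hypothetical, and the process needs write permission on the file.

```python
import errno
import fcntl


def is_locked_by_other_process(path):
    """Return True if another process holds an fcntl-style lock on path."""
    with open(path, "r+b") as fh:  # an exclusive lock needs a writable handle
        try:
            fcntl.lockf(fh, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except OSError as exc:
            if exc.errno in (errno.EACCES, errno.EAGAIN):
                return True  # somebody else holds a conflicting lock
            raise
        fcntl.lockf(fh, fcntl.LOCK_UN)
        return False
```

Note that this releases the lock before returning, so unlike the lockrun approach it does not hold the lock while the file is moved.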

If locking doesn't work, I don't know of a solution as clean/simple as you'd like. You could make an educated guess by not processing files whose last modified time (which you can get with filemtime()) is within the past x minutes.
If you want a higher degree of confidence than that, you could check and store each file's size (using filesize()) in a simple database, and every x minutes check new size against its old size. If the size hasn't changed in x minutes, you can assume nothing more is being sent.
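Both heuristics are easy to sketch. The example is in Python rather than PHP (os.stat provides the same values as filemtime() and filesize()), the function names are hypothetical, and a plain dict stands in for the "simple database" of previously seen sizes.

```python
import os
import time


def looks_finished(path, quiet_seconds, now=None):
    """True if the file was last modified at least quiet_seconds ago."""
    if now is None:
        now = time.time()
    return now - os.stat(path).st_mtime >= quiet_seconds


def size_unchanged(path, previous_sizes):
    """Compare the size with the one recorded on the previous poll.

    previous_sizes maps path -> size and is updated on every call, so the
    first call for a given path always returns False.
    """
    size = os.stat(path).st_size
    unchanged = previous_sizes.get(path) == size
    previous_sizes[path] = size
    return unchanged
```

Either check can still be fooled by an upload that stalls for longer than the polling interval, so the interval should be chosen generously.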

The lsof linux command lists opened files on your system. I suggest executing it with shell_exec() from PHP and parsing the output to see what files are still being used by your FTP server.

Picking up on the previous answer, you could copy the file over and then compare the sizes of the copied file and the original file at a fixed interval.
If the sizes match, the upload is done, delete the copy, work with the file.
If the sizes do not match, copy the file again.
repeat.

Here's another idea: create a super (but hopefully not root) FTP user that can access some or all of the upload directories. Instead of your PHP code reading uploaded files right off the disk, make it connect to the local FTP server and download files. This way vsftpd handles the locking for you (assuming you leave lock_upload_files enabled). You'll only be able to download a file once vsftp releases the exclusive/write lock (once writing is complete).
You mentioned trying flock in your comment (and how it fails). It does indeed seem painful to try to match whatever locking vsftpd is doing, but dio_fcntl might be worth a shot.

I guess you've solved your problem years ago but still.
If you use some pattern to find the files you need, you can ask the party uploading the file to use a different name and rename the file once the upload has completed.

You should check the HiddenStores directive in ProFTPD; more info here:
http://www.proftpd.org/docs/directives/linked/config_ref_HiddenStores.html

Multiple processes reading&deleting files in the same directory

I have a directory with thousands of files and each of them has to be processed (by a python script) and subsequently deleted.
I would like to write a bash script that reads a file in the folder, processes it, deletes it and moves onto another file - the order is not important. There will be n running instances of this bash script (e.g. 10), all operating on the same directory. They quit when there are no more files left in the directory.
I think this creates a race condition. Could you give me some advice (or a code snippet) on how to make sure that no two bash scripts operate on the same file?
Or do you think I should rather implement multithreading in Python (instead of running n different bash scripts)?

You can use the fact that file renames (on the same file system) are atomic on Unix systems, i.e. a file was either renamed or not. For the sake of clarity, let us assume that all files you need to process have names beginning with A (you can avoid this by having a separate folder for the files you are processing right now).
Then your bash script iterates over the files and tries to rename each one; if the rename succeeds it calls the Python script (I call it process here), otherwise it just continues. Like this:
#!/bin/bash
for file in A*; do
    pfile=processing.$file
    if mv "$file" "$pfile"; then
        process "$pfile"
        rm "$pfile"
    fi
done
This snippet uses the fact that mv returns a 0 exit code if it was able to move the file and a non-zero exit code otherwise.

The only sure way that no two scripts will act on the same file at the same time is to employ some kind of file locking mechanism. A simple way to do this could be to rename the file before beginning work, by appending some known string to the file name. The work is then done and the file deleted. Each script tests the file name before doing anything, and moves on if it is 'special'.
A more complex approach would be to maintain a temporary file containing the names of files that are 'in process'. This file would obviously need to be removed once everything is finished.

I think the solution to your problem is a consumer producer pattern. I think this solution is the right way to start:
producer/consumer problem with python multiprocessing
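As a rough idea of what that pattern looks like for this problem, here is a minimal sketch. It uses threads and queue.Queue from the standard library instead of the multiprocessing module, to keep the example short; for CPU-bound processing a process pool would probably be the better fit. The names process_directory and handle are hypothetical.

```python
import os
import queue
import threading


def process_directory(directory, handle, workers=4):
    """Process and delete every file in directory with a pool of consumers."""
    tasks = queue.Queue()
    # Producer: each file is put on the queue exactly once.
    for name in os.listdir(directory):
        tasks.put(os.path.join(directory, name))

    def consume():
        # Consumer: take files until the queue is empty. The queue hands out
        # each item to one consumer only, so no file is processed twice.
        while True:
            try:
                path = tasks.get_nowait()
            except queue.Empty:
                return
            handle(path)
            os.remove(path)

    threads = [threading.Thread(target=consume) for _ in range(workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
```

Because a single producer lists the directory, the race between independent scripts does not arise in the first place.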

which inotify event signals the completion of a large file operation?

For large files or slow connections, copying files may take some time.
Using pyinotify, I have been watching for the IN_CREATE event code, but this seems to occur at the start of a file transfer. I need to know when a file is completely copied; it isn't much use if it's only half there.
When a file transfer is finished and completed, what inotify event is fired?

IN_CLOSE probably means the write is complete. This isn't certain, since some applications are bad actors and open and close files constantly while working with them, but if you know the app you're dealing with (file transfer, etc.) and understand its behaviour, you're probably fine. (Note that this doesn't mean the transfer completed successfully; it just means that the process that opened the file handle closed it.)
IN_CLOSE catches both IN_CLOSE_WRITE and IN_CLOSE_NOWRITE, so make your own decisions about whether you want to just catch one of those. (You probably want them both: WRITE/NOWRITE reflect whether the file was opened for writing, not whether any writes were actually made.)
There is more documentation (although annoyingly, not this piece of information) in Documentation/filesystems/inotify.txt.

For my case I wanted to execute a script after a file was fully uploaded. I was using WinSCP which writes large files with a .filepart extension till done.
I first started modifying my script to ignore files that themselves end with .filepart, or for which another file with the same name but a .filepart extension exists in the same directory, since that means the upload is not fully completed yet.
But then I noticed that at the end of the upload, when all the parts have been finished, an IN_MOVED_TO notification gets triggered, which helped me run my script exactly when I wanted.
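For reference, the .filepart check from my first attempt looked roughly like the sketch below. It assumes the partial file is named by appending .filepart to the full target name (e.g. report.xml.filepart); the function name is made up.

```python
import os

PARTIAL_SUFFIX = ".filepart"


def upload_complete(path):
    """False for a partial file, or for a file whose partial twin exists."""
    if path.endswith(PARTIAL_SUFFIX):
        return False
    return not os.path.exists(path + PARTIAL_SUFFIX)
```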
If you want to know how your file uploader behaves, add this to the incrontab:
/your/directory/ IN_ALL_EVENTS echo "$$ $@ $# $% $&"
and then
tail -F /var/log/cron
and monitor all the events getting triggered to find out which one suits you best.
Good luck!

Why don't you add a dummy file at the end of the transfer? You can use the IN_CLOSE or IN_CREATE event code on the dummy. The important thing is that the dummy has to be transferred as the last file in the sequence.
I hope it'll help.

How do I watch a folder for changes and when changes are done using Python?

I need to watch a folder for incoming files. I did that with the following help:
How do I watch a file for changes?
The problem is that the files being moved are pretty big (10GB), and I want to be notified when all files are done moving.
I tried comparing the size of the folder every 20 seconds, but the file shows its correct size even though Windows shows that it is still moving.
I am using Windows with Python.
I found a solution using open and waiting for an IO exception: if the file is still being moved, I get errno 13.

You should take a look at this link:
http://timgolden.me.uk/python/win32_how_do_i/watch_directory_for_changes.html
There you can see a comparison of the method you are speaking about (simple polling) with two other Windows-specific techniques which, in my opinion, offer a much better solution to your problem!
Otherwise, if you are using Linux, there's inotify and its Python wrapper, pyinotify:
"Pyinotify is a pure Python module used for monitoring filesystems events on Linux platforms through inotify"
Here: http://trac.dbzteam.org/pyinotify

If you have control over the process of importing the files, I would create a lock file when starting to copy files in and remove it when you are done. By lock file I mean an empty temporary file that is just there to indicate that you are copying a file. Then your Python script can check for the existence of the lock files.

You may be able to use os.stat() to monitor the mtime of the file. However, be aware that under various network conditions the copy may stall momentarily, so the mtime is not updated for a few seconds; you need to make allowance for this.
Another option is to try opening the file with exclusive read/write access, which should fail under Windows if the file is still opened by the other process.
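The second option might look something like this minimal sketch. It relies on the errno 13 behaviour on Windows that the question describes, where the failure surfaces in Python 3 as PermissionError. The function names, timeout and polling interval are arbitrary choices for illustration.

```python
import time


def is_ready(path):
    """Try to open the file for read/write; errno 13 means it is still busy."""
    try:
        with open(path, "r+b"):
            return True
    except PermissionError:
        return False


def wait_until_ready(path, timeout=600.0, interval=5.0):
    """Poll is_ready() until it succeeds or timeout seconds have passed."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if is_ready(path):
            return True
        time.sleep(interval)
    return False
```

Keep in mind that a read-only file fails the same way, and that on Linux an open by another process does not normally block this call, so the check is only meaningful on Windows.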

The most reliable method would be to write your own program to move the files.

Try checking for a change in the last-modified time instead of the file size during your poll.

Does python have hooks into EXT3

We have several cron jobs that ftp proxy logs to a centralized server. These files can be rather large and take some time to transfer. Part of the requirement of this project is to provide a logging mechanism in which we log the success or failure of these transfers. This is simple enough.
My question is, is there a way to check if a file is currently being written to? My first solution was to just check the file size twice within a given timeframe and compare the results. But a co-worker said that it may be possible to hook into the EXT3 file system via Python and check the attributes to see if the file is currently being appended to. My Google-Fu came up empty.
Is there a module for EXT3 or something else that would allow me to check the state of a file? The server is running Fedora Core 9 with EXT3 file system.

No need for ext3-specific hooks; just check lsof, or more exactly, /proc/<pid>/fd/* and /proc/<pid>/fdinfo/* (that's where lsof gets its info, AFAICT). There you can check if the file is open, if it's writeable, and the 'cursor' position.
That's not the whole picture, but anything more happens in process space, in the standard library of the writing process: most writes are buffered, and the kernel only sees bigger chunks of data, so an 'ext3-aware' monitor wouldn't see that either.
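A rough sketch of the /proc approach in Python, for Linux only. The function name is made up, and without root privileges you can only inspect the file descriptors of your own processes, so the FTP daemon's files may not be visible.

```python
import os


def pids_with_file_open(path):
    """Return the PIDs that have path open, found by scanning /proc/<pid>/fd."""
    target = os.path.realpath(path)
    pids = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue  # not a process directory
        fd_dir = os.path.join("/proc", entry, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # process has exited, or we may not inspect it
        for fd in fds:
            try:
                # Each entry is a symlink to the file the descriptor refers to.
                if os.readlink(os.path.join(fd_dir, fd)) == target:
                    pids.append(int(entry))
                    break
            except OSError:
                continue  # descriptor was closed while we were looking
    return pids
```

An empty result suggests that nobody has the file open any more, with the permission caveat above.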

There's no ext3 hooks to check what you'd want directly.
I suppose you could dig through the source code of the Linux fuser command, replicate the part that finds which process owns a file, and watch that resource. When no one has the file open any longer, it's done transferring.
Another approach:
Your cron jobs should tell that they're finished.
We have our cron jobs that transport files write an empty filename.finished after they have transferred filename. Another approach is to transfer to a temporary name, e.g. filename.part, and then rename it to filename. Renaming is atomic. In both cases you check repeatedly for the presence of filename or filename.finished.
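On the receiving side, the marker-file variant could be checked with something like the sketch below. The function name is hypothetical, and it assumes markers are named exactly filename.finished, as described above.

```python
import os


def completed_transfers(directory):
    """Return the names of files that have a matching .finished marker."""
    names = set(os.listdir(directory))
    ready = []
    for name in sorted(names):
        if name.endswith((".finished", ".part")):
            continue  # markers and partial files are not payload files
        if name + ".finished" in names:
            ready.append(name)
    return ready
```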
