Python: Reading New Information Added To Massive Files

I'm working on a Python script to parse Squid (http://www.squid-cache.org/) log files. While the logs are rotated every day to stop them getting too big, they still reach between 40-90MB by the end of each day.
Essentially what I'm doing is reading the file line by line, parsing out the data I need (IP, requested URL, time) and adding it to an SQLite database. However, this seems to be taking a very long time (it's been running for over 20 minutes now).
So obviously re-reading the whole file can't be done. What I would like to do is read the file and then detect all new lines written. Or even better, at the start of the day the script would simply read the data in real time as it is added, so there would never be any long processing times.
How would I go about doing this?

One way to achieve this is by emulating tail -f. The script would constantly monitor the file and process each new line as it appears.
For a discussion and some recipes, see tail -f in python with no time.sleep
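For illustration, a minimal polling sketch of such a tail -f emulation (the linked post discusses variants that avoid time.sleep; the log path and the parse_and_insert step below are placeholders):
import time

def follow(path):
    # Yield lines as they are appended to the file, like `tail -f`.
    with open(path) as f:
        f.seek(0, 2)             # start at the current end of file
        while True:
            line = f.readline()
            if line:
                yield line
            else:
                time.sleep(0.5)  # nothing new yet; poll again shortly

for line in follow('/var/log/squid/access.log'):
    parse_and_insert(line)       # your parsing/DB step (placeholder)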

One way to do this is to use file system monitoring with py-inotify (http://pyinotify.sourceforge.net/) and set a callback function to be executed whenever the log file changes.
Another way to do it, without requiring external modules, is to record somewhere (possibly in your SQLite database itself) the offset of the end of the last line read in the log file (which you get with file.tell()), and just read the newly added lines from that offset onwards, which is done with a simple call to file.seek(offset) before looping through the lines.
The main difference between keeping track of the offset and the "tail" emulation described in the other post is that this one allows your script to be run multiple times, i.e. there is no need for it to be running continuously, and it can recover after a crash.
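A minimal sketch of the offset-tracking approach, assuming the offset is kept in a small state file (the paths and the process() call are placeholders; resetting to zero when the file shrinks handles the daily rotation):
import os

OFFSET_FILE = '/var/tmp/squidlog.offset'   # hypothetical offset store

def read_new_lines(log_path):
    try:
        with open(OFFSET_FILE) as f:
            offset = int(f.read())
    except (IOError, ValueError):
        offset = 0                 # first run, or no saved state yet
    with open(log_path) as log:
        log.seek(0, 2)             # jump to end to find the current size
        if log.tell() < offset:
            offset = 0             # file was rotated/truncated: start over
        log.seek(offset)
        for line in log:
            process(line)          # your parsing/DB step (placeholder)
        offset = log.tell()
    with open(OFFSET_FILE, 'w') as f:
        f.write(str(offset))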

Related

Edit file being processed by python script

I am trying to use a file (CSV, JSON, TXT; haven't decided on a format) that I can drop a few lines of data into. A Python script will be on cron to run every 5 minutes and check the file for anything new; if there is, it will process it and delete each line as it is processed.
I am trying to prevent a situation where I open the file and make some changes, the process comes by in the meantime and grabs the data and empties the file, and then my save writes the old data back in.
I thought the only way to make this safe is to have it process a folder and just look for new files; all changes would be dropped in a new file, so there is never the risk of this happening.
Is there a better way, or is this the best approach?
Check this answer to see if the file is already open; if it is, just wait for the next cron run 5 minutes later, or alternatively sleep internally and try again every 10 seconds until it works, but for no longer than 4 minutes, for example:
import time

for i in range(attempts):        # e.g. attempts = 24 for roughly 4 minutes
    if not fileInUse():
        processFile()
        break                    # done; stop retrying
    else:
        time.sleep(10)
You can use the following steps:
The Python script which runs in cron will check if the file is opened by any other process. On Linux, this can be done using lsof.
If the file is open when cron runs, the script will not process the file's data.
The same logic can be added to the script which adds data to the file: skip writing if the file is being used by some other script.
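A sketch of such a check, shelling out to lsof (assumes a Linux box with lsof installed; the path is a placeholder):
import os
import subprocess

def file_in_use(path):
    # lsof exits with status 0 when at least one process has the file open.
    with open(os.devnull, 'w') as devnull:
        return subprocess.call(['lsof', path],
                               stdout=devnull, stderr=devnull) == 0

if not file_in_use('/path/to/dropfile.txt'):
    processFile()                # placeholder from the loop above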

parallelize external program in python call?

I have an external program which I can not change.
It reads an input file, does some calculations, and writes out a result file. I need to run this for a million or so combinations of input parameters.
The way I do it at the moment is that I open a template file, change some strings in it (to insert the new parameters), write it out, start the program using os.popen(), read the output file, do a chi-square test on the result, and then restart with a different set of parameters.
The external program only runs on one core, so I tried to split my parameter space up and started multiple instances in different folders. Different folders were necessary because the program overwrites its output file. This works, but it still took about 24 hours to finish.
Is it possible to run these as separate processes without the result file being overwritten? Or do you see anything else I could do to speed this up?
Thx.
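A sketch of one way to do this with the standard library: give each run its own scratch directory so output files never collide, and let a process pool keep every core busy (the program path, the file names, and the make_input/chisquare/all_parameter_sets names are all placeholders):
import os
import shutil
import subprocess
import tempfile
from multiprocessing import Pool

def run_one(params):
    workdir = tempfile.mkdtemp()           # private directory for this run
    try:
        with open(os.path.join(workdir, 'input.txt'), 'w') as f:
            f.write(make_input(params))    # fill in the template (placeholder)
        subprocess.call(['/path/to/program'], cwd=workdir)
        with open(os.path.join(workdir, 'output.txt')) as f:
            return chisquare(f.read())     # your statistic (placeholder)
    finally:
        shutil.rmtree(workdir)             # clean up the scratch directory

if __name__ == '__main__':
    pool = Pool()                          # defaults to one worker per core
    results = pool.map(run_one, all_parameter_sets)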

Python2: How to parse a logfile that is held open in another process reliably?

I'm trying to write a Python script that will parse a logfile produced by another daemon. This is being done on Linux. I want to be able to parse the log file reliably.
In other words, periodically, we run a script that reads the log file, line by line, and does something with each line. The logging script would need to see every line that may end up in the log file. It could run say once per minute via cron.
Here's the problem that I'm not sure exactly how to solve. Since the other process has a write handle on the file, it could write to the file at the same time that I am reading from it.
Also, every so often we would want to clear this logfile so its size does not get out of control. But the process producing the log file has no way to clear the file other than regularly stopping, truncating or deleting the file, and then restarting. (I feel like logrotate has some method of doing this, but I don't know if logrotate depends on the daemon being aware, or if it's actually closing and restarting daemons, etc. Not to mention I don't want other logs rotated, just this one specific log; and I don't want this script to require other possible users to setup logrotate.)
Here are the problems:
Since the logger process could write to the file while I already have an open file handle, I feel like I could easily miss records in the log file.
If the logger process were to decide to stop, clear the log file, and restart, and the log analyzer didn't run at exactly the same time, log entries would be lost. Similarly, if the log analyzer causes the logger to stop logging while it analyzes, information could also be lost that is dropped because the logger daemon isn't listening.
If I were to use a method like "note the size of the file since last time and seek there if the file is larger", then what would happen if, for some reason, between runs, the logger reset the logfile, but then had reason to log even more than it contained last time? E.g. We execute a log analyze loop. We get 50 log entries, so we set a mark that we have read 50 entries. Next time we run, we see 60 entries. But, all 60 are brand new; the file had been cleared and restarted since the last log run. Instead we end up seeking to entry 51 and missing 50 entries! Either way it doesn't solve the problem of needing to periodically clear the log.
I have no control over the logger daemon. (Imagine we're talking about something like syslog here. It's not syslog but same idea - a process that is pretty critical holds a logfile open.) So I have no way to change its logging method. It starts at init time, opens a log file, and writes to it. We want to be able to clear that logfile AND analyze it, making sure we get every log entry through the Python script at some point.
The ideal scenario would be this:
The log daemon runs at system init.
Via cron, the Python log analyzer runs once per minute (or once per 5 minutes or whatever is deemed appropriate)
The log analyzer collects every single line from the current log file and immediately truncates it, causing the log file to be blanked out. Python maintains the original contents in a list.
The logger then continues to go about its business, with the now blank file. In the mean time, Python can continue to parse the entries at its leisure from the Python list in memory.
I've very, very vaguely studied FIFOs, but am not sure if that would be appropriate. In that scenario the log analyzer would run as a daemon itself, while the original logger writes to a FIFO. I have very little knowledge in this area, however, and don't know whether it would be a solution or not.
So I guess the question really is twofold:
How to reliably read EVERY entry written to the log from Python? Including if the log grows, is reset, etc.
How, if possible, to truncate a file that has an open write handle? (Ideally, this would be something I could do from Python; I could do something like logfile.readlines(); logfile.truncate(), so that no entries would get lost. But unless the logger process was well aware of this, it seems like it would end up causing more problems than it solves.)
Thanks!
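On the first question, one common mitigation for the seek-offset pitfall described above is to track the file's inode along with the offset and restart from zero when the inode changes or the file shrinks. A sketch (not from this thread; note it still cannot catch a file that was truncated and regrown past the old offset between runs, which is exactly the race the question describes):
import os

def read_since_last(path, state):
    # state is a dict persisted between runs, e.g. {'ino': ..., 'off': ...}
    st = os.stat(path)
    offset = state.get('off', 0)
    if st.st_ino != state.get('ino') or st.st_size < offset:
        offset = 0                     # recreated or truncated: start over
    with open(path) as f:
        f.seek(offset)
        lines = f.readlines()          # everything appended since last run
        state['ino'], state['off'] = st.st_ino, f.tell()
    return lines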
I don't see any particular reason why you should not be able to read a log file created by syslogd. You are saying that you are using some process similar to syslog, and the process is keeping your log file open? Since you are asking for ideas, I would recommend you use syslog! http://pic.dhe.ibm.com/infocenter/tpfhelp/current/index.jsp?topic=%2Fcom.ibm.ztpf-ztpfdf.doc_put.cur%2Fgtpc1%2Fhsyslog.html
It is working anyway, so use it. An easy way to write to the log is the logger command:
logger "MYAP: hello"
In a Python script you can do it like this:
import os
os.system('logger "MYAP: hello"')
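(For what it's worth, Python's standard library also has a syslog module, which avoids shelling out; a minimal sketch:)
import syslog

syslog.openlog('MYAP')                    # ident prepended to each message
syslog.syslog(syslog.LOG_INFO, 'hello')   # goes to the local syslog daemon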
Also remember you can actually configure syslogd. http://pic.dhe.ibm.com/infocenter/tpfhelp/current/index.jsp?topic=%2Fcom.ibm.ztpf-ztpfdf.doc_put.cur%2Fgtpc1%2Fconstmt.html
Also, about your problem with empty logs: syslog does not clear logs. There are other tools for that; on Debian, for example, logrotate is used. In this scenario, if your log is empty you can check the backup file created by logrotate.
Since it looks like your problem is with the logging tool, my advice would be to use syslog for logging and another tool for rotating the logs. Then you can easily parse the logs. And if by any means (I don't know if it is even possible with syslog) you miss some data, remember you will get it in the next iteration anyway ;)
Another idea would be to copy your logfile and work with the copy...

which inotify event signals the completion of a large file operation?

For large files or slow connections, copying files may take some time.
Using pyinotify, I have been watching for the IN_CREATE event code, but this seems to occur at the start of a file transfer. I need to know when a file is completely copied; it isn't much use if it's only half there.
When a file transfer is finished and complete, what inotify event is fired?
IN_CLOSE probably means the write is complete. This isn't certain, since some applications are bad actors and open and close files constantly while working with them, but if you know the app you're dealing with (file transfer, etc.) and understand its behaviour, you're probably fine. (Note, this doesn't mean the transfer completed successfully, obviously; it just means that the process that opened the file handle closed it.)
IN_CLOSE catches both IN_CLOSE_WRITE and IN_CLOSE_NOWRITE, so make your own decision about whether you want to catch just one of those. (You probably want them both: WRITE/NOWRITE refer to whether the file handle was opened writable, not to whether any writes were actually made.)
There is more documentation (although annoyingly, not this piece of information) in Documentation/filesystems/inotify.txt.
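A minimal pyinotify sketch along those lines, watching a directory for close-after-write events (the directory path is a placeholder):
import pyinotify

class Handler(pyinotify.ProcessEvent):
    def process_IN_CLOSE_WRITE(self, event):
        # React to the completed file here.
        print('finished writing: ' + event.pathname)

wm = pyinotify.WatchManager()
wm.add_watch('/your/directory', pyinotify.IN_CLOSE_WRITE)
pyinotify.Notifier(wm, Handler()).loop()   # blocks, dispatching events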
For my case, I wanted to execute a script after a file was fully uploaded. I was using WinSCP, which writes large files with a .filepart extension until done.
I first started modifying my script to ignore files if they themselves end with .filepart, or if another file exists in the same directory with the same name plus a .filepart extension, since that means the upload is not fully completed yet.
But then I noticed that at the end of the upload, when all the parts have been finished, an IN_MOVED_TO notification gets triggered, which let me run my script exactly when I wanted.
If you want to know how your file uploader behaves, add this to the incrontab:
/your/directory/ IN_ALL_EVENTS echo "$$ $@ $# $% $&"
and then
tail -F /var/log/cron
and monitor all the events getting triggered to find out which one suits you best.
Good luck!
Why don't you add a dummy file at the end of the transfer? You can use the IN_CLOSE or IN_CREATE event code on the dummy. The important thing is that the dummy has to be transferred as the last file in the sequence.
I hope it'll help.

How to handle new files to process in cron job

How can I check which files I have already processed in a script, so I don't process those again? and/or
What is wrong with the way I am doing this now?
Hello,
I am running tshark with the ring buffer option to dump to files after 5MB or 1 hour. I wrote a Python script to read these files as XML and dump the data into a database; this works fine.
My issue is that this is really process intensive; one of those 5MB captures can turn into a 200MB file when converted to XML, so I do not want to do any unnecessary processing.
The script runs every 10 minutes and processes ~5 files per run. Since it scans the folder where the files are created for any new entries, I dump a hash of each file into the database, and on the next run I check the hash; if it isn't in the database, I scan the file.
The problem is that this does not appear to work every time; it ends up processing files that it has already done. When I check the hash of a file that it keeps trying to process, it doesn't show up anywhere in the database, hence why it is trying to process it over and over.
I am printing out the filename + hash in the output of the script:
using file /var/ss01/SS01_00086_20100107100828.cap with hash: 982d664b574b84d6a8a5093889454e59
using file /var/ss02/SS02_00053_20100106125828.cap with hash: 8caceb6af7328c4aed2ea349062b74e9
using file /var/ss02/SS02_00075_20100106184519.cap with hash: 1b664b2e900d56ca9750d27ed1ec28fc
using file /var/ss02/SS02_00098_20100107104437.cap with hash: e0d7f5b004016febe707e9823f339fce
using file /var/ss02/SS02_00095_20100105132356.cap with hash: 41a3938150ec8e2d48ae9498c79a8d0c
using file /var/ss02/SS02_00097_20100107103332.cap with hash: 4e08b6926c87f5967484add22a76f220
using file /var/ss02/SS02_00090_20100105122531.cap with hash: 470b378ee5a2f4a14ca28330c2009f56
using file /var/ss03/SS03_00089_20100107104530.cap with hash: 468a01753a97a6a5dfa60418064574cc
using file /var/ss03/SS03_00086_20100105122537.cap with hash: 1fb8641f10f733384de01e94926e0853
using file /var/ss03/SS03_00090_20100107105832.cap with hash: d6209e65348029c3d211d1715301b9f8
using file /var/ss03/SS03_00088_20100107103248.cap with hash: 56a26b4e84b853e1f2128c831628c65e
using file /var/ss03/SS03_00072_20100105093543.cap with hash: dca18deb04b7c08e206a3b6f62262465
using file /var/ss03/SS03_00050_20100106140218.cap with hash: 36761e3f67017c626563601eaf68a133
using file /var/ss04/SS04_00010_20100105105912.cap with hash: 5188dc70616fa2971d57d4bfe029ec46
using file /var/ss04/SS04_00071_20100107094806.cap with hash: ab72eaddd9f368e01f9a57471ccead1a
using file /var/ss04/SS04_00072_20100107100234.cap with hash: 79dea347b04a05753cb4ff3576883494
using file /var/ss04/SS04_00070_20100107093350.cap with hash: 535920197129176c4d7a9891c71e0243
using file /var/ss04/SS04_00067_20100107084826.cap with hash: 64a88ecc1253e67d49e3cb68febb2e25
using file /var/ss04/SS04_00042_20100106144048.cap with hash: bb9bfa773f3bf94fd3af2514395d8d9e
using file /var/ss04/SS04_00007_20100105101951.cap with hash: d949e673f6138af2d388884f4a6b0f08
The only files it should be doing are one per folder, so only 4 files. This causes unnecessary processing, and I have to deal with overlapping cron jobs plus other services being affected.
What I am hoping to get from this post is a better way to do this, or hopefully someone can tell me why this is happening; I know that the latter might be hard, since it could be for a bunch of reasons.
Here is the code (I am not a coder but a sysadmin, so be kind :P); lines 30-32 handle the hash comparisons.
Thanks in advance.
A good way to handle/process files that are created at random times is to use
incron rather than cron. (Note: since incron uses the Linux kernel's
inotify syscalls, this solution only works with Linux.)
Whereas cron runs a job based on dates and times, incron runs a job based on
changes in a monitored directory. For example, you can configure incron to run a
job every time a new file is created or modified.
On Ubuntu, the package is called incron. I'm not sure about RedHat, but I believe this is the right package: http://rpmfind.net//linux/RPM/dag/redhat/el5/i386/incron-0.5.9-1.el5.rf.i386.html.
Once you install the incron package, read
man 5 incrontab
for information on how to setup the incrontab config file. Your incron_config file might look something like this:
/var/ss01/ IN_CLOSE_WRITE /path/to/processing/script.py $#
/var/ss02/ IN_CLOSE_WRITE /path/to/processing/script.py $#
/var/ss03/ IN_CLOSE_WRITE /path/to/processing/script.py $#
/var/ss04/ IN_CLOSE_WRITE /path/to/processing/script.py $#
Then to register this config with the incrond daemon, you'd run
incrontab /path/to/incron_config
That's all there is to it. Now whenever a file is created in /var/ss01, /var/ss02, /var/ss03 or /var/ss04, the command
/path/to/processing/script.py $#
is run, with $# replaced by the name of the newly created file.
This will obviate the need to store/compare hashes, and files will only get processed once -- immediately after they are created.
Just make sure your processing script does not write into the top level of the monitored directories.
If it does, then incrond will notice the new file created, and launch script.py again, sending you into an infinite loop.
incrond monitors individual directories, and does not recursively monitor subdirectories. So you could direct tshark to write to /var/ss01/tobeprocessed, use incron to monitor
/var/ss01/tobeprocessed, and have your script.py write to /var/ss01, for example.
PS. There is also a Python interface to inotify, called pyinotify. Unlike incron, pyinotify can recursively monitor subdirectories. However, in your case, I don't think the recursive monitoring feature is useful or necessary.
I don't know enough about what is in these files, so this may not work for you, but if you have only one intended consumer, I would recommend using directories and moving the files to reflect their state. Specifically, you could have a dir structure like
/waiting
/progress
/done
and use the relative atomicity of mv to change the "state" of each file. (Whether mv is truly atomic depends on your filesystem, I believe.)
When your processing task wants to work on a file, it moves it from waiting to progress (and makes sure that the move succeeded). That way, no other task can pick it up, since it's no longer waiting. When the file is complete, it gets moved from progress to done, where a cleanup task might delete or archive old files that are no longer needed.
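A sketch of that state machine in Python (the directory paths and the process() step are placeholders; os.rename is atomic as long as source and destination are on the same filesystem):
import os

WAITING, PROGRESS, DONE = '/data/waiting', '/data/progress', '/data/done'

def claim(name):
    # Atomically claim a file by moving it out of the waiting directory.
    try:
        os.rename(os.path.join(WAITING, name), os.path.join(PROGRESS, name))
        return True
    except OSError:
        return False                            # another worker got it first

for name in os.listdir(WAITING):
    if claim(name):
        process(os.path.join(PROGRESS, name))   # your processing step (placeholder)
        os.rename(os.path.join(PROGRESS, name), os.path.join(DONE, name))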
I see several issues.
If you have overlapping cron jobs, you need a locking mechanism to control access. Only allow one process at a time to eliminate the overlap problem. You might set up a shell script to do that. Create a 'lock' by making a directory (mkdir is atomic), process the data, then delete the lock directory. If the shell script finds the directory already exists when it tries to make it, then you know another copy is already running and it can just exit.
If you can't change the cron table(s) then just rename the executable and name your shell script the same as the old executable.
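The same mkdir lock, sketched in Python (the lock path is made up and process_data is a placeholder for the real work):
import os
import sys

LOCK_DIR = '/tmp/capture-import.lock'   # hypothetical lock path

try:
    os.mkdir(LOCK_DIR)                  # atomic: fails if it already exists
except OSError:
    sys.exit(0)                         # another run is in progress; bail out
try:
    process_data()                      # the real work (placeholder)
finally:
    os.rmdir(LOCK_DIR)                  # release the lock even on error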
Hashes are not guaranteed to be unique identifiers for files; a collision is unlikely, but not absolutely impossible.
Why not just move a processed file to a different directory?
You mentioned overlapping cron jobs. Does this mean one conversion process can start before the previous one has finished? That means you would perform the move at the beginning of the conversion. If you are worried about an interrupted conversion, use an intermediate directory, and move to a final directory after completion.
If I'm reading the code correctly, you're updating the database (by which I mean the log of files processed) at the very end. So when a huge file is being processed and not yet complete, another cron job will 'legally' start working on it, with both completing successfully and resulting in two entries in the database.
I suggest you move up the logging-to-database step, which would act as a lock for subsequent cron jobs, and add a 'success' or 'completed' flag at the very end. The latter part is important, as something that is shown as processing but doesn't have a completed state (coupled with the notion of time) can programmatically be concluded as an error. (That is to say, a cron job tried processing it but never completed it, and the log shows it as processing for a week!)
To summarize
Move up the log-to-database step so that it acts as a lock
Add a 'success' or 'completed' state, which gives you the notion of an errored state
PS: Don't take it the wrong way, but the code is a little hard to understand; I am not sure whether I understand it at all.
