I need to process log files from a remote directory and parse their contents. This task can be divided into the following:
1. Run ssh find with criteria to determine the files to process
2. Get the relevant contents of these files with zgrep
3. Process the results from step 2 with a Python algorithm and store them in a local DB
Steps 1 and 3 are very fast, so I am looking to improve Step 2.
The files are stored as gz or plaintext and the order in which they are processed is very important. Newer logs need to be processed first to avoid discrepancies with older logs.
To get and filter the log lines, I have tried the following approaches:
Method 1: Download the logs to a temp folder and process them as they are downloaded, in parallel. A Python process triggers an scp command while a parallel thread inspects the temp folder for completed downloads until scp is finished. Once a file has downloaded, run zgrep on it, process the output, and delete the file.
Method 2: Run ssh remote zgrep "regex" file1 file2 file3, grab the results and process them.
Method 2 is a more readable and elegant solution, however it is also much slower. Using method 1 I can download and parse 280 files in about 1:30 minutes. Using method 2, it will take closer to 5:00 minutes. One of the main problems with the download-process approach is that the directory can be altered while the script is running, leading to several checks being needed in the code.
To run the shell commands from python I am currently using subprocess.check_output and the multiprocessing and threading modules.
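For reference, method 2 as described (a single remote zgrep captured with subprocess.check_output) could be sketched roughly as follows; the host name, pattern and file list are placeholders, not values from the actual setup:

import subprocess

def remote_zgrep(host, pattern, files):
    # Run zgrep on the remote host in one ssh session and return the
    # matching lines. zgrep handles both .gz and plaintext files, and
    # the files should already be ordered newest-first, since order matters.
    remote_cmd = "zgrep -h '{}' {}".format(pattern, " ".join(files))
    output = subprocess.check_output(["ssh", host, remote_cmd])
    return output.decode("utf-8", errors="replace").splitlines()

# Hypothetical usage:
# lines = remote_zgrep("loghost", "ERROR", ["/logs/app.log.2.gz", "/logs/app.log.1"])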
Can you think of a way this algorithm can be improved?
I have a list of files that I need to access and process from S3 buckets through a lambda function and the idea is to loop through each of the files and collect data from all files in parallel. My first thought was to use threading which resulted in an issue that only allowed my max pool size to be 10, whereas I'm processing many files. I want to be able to continuously append processes until all files have been accessed instead of creating a list of processes and then running them in parallel which seems to be the case in multiprocessing's Pool. I'd appreciate any suggestions.
You may not be able to achieve performance gains using multi-threading under a single process in python due to the GIL. However you could use a bash script to start multiple python processes simultaneously.
For example if you wanted to perform some tasks and write results to a common file you could use the following prep.sh file to create an empty results file.
#!/bin/bash
if [ -e results.txt ]
then
rm results.txt
fi
touch results.txt
And the following control.sh file to spawn your multiple python processes.
#!/bin/bash
Arr=( Do Many Things )
for i in "${Arr[@]}"; do
    python process.py "$i" &
done
With the following process.py file, which simply takes a command line argument and writes it to the results file followed by a newline.
#!/usr/bin/env python
import sys

target = sys.argv[1]

def process(argument):
    with open("results.txt", "a") as results:
        results.write(argument + "\n")

process(target)
Obviously you would need to edit the array in control.sh to reflect the files you need to access, and the process in process.py to reflect the retrieval and analysis of those files.
I am creating a python wrapper script and was wondering what'd be a good way to create it.
I want to run code serially. For example:
Step 1.
Run the same program (in parallel; the parallelization is easy because I work with an LSF system, so I just submit three different jobs).
I run the program in parallel, and each run takes one fin.txt and outputs one fout.txt; when all three run, the input files f1in.txt, f2in.txt, f3in.txt produce the output files f1out.txt, f2out.txt, f3out.txt.
In the LSF system, when each run of the program completes successfully it produces a log file: f1log.out, f2log.out, f3log.out.
The log files are of this form; i.e., f1log.out would look something like this if the run completed successfully:
------------------------------------------------------------
# LSBATCH: User input
------------------------------------------------------------
Successfully completed.
Resource usage summary:
CPU time : 86.20 sec.
Max Memory : 103 MB
Max Swap : 881 MB
Max Processes : 4
Max Threads : 5
The output (if any) is above this job summary.
Thus, I'd like my wrapper to check (every 5 min or so) for each run (1,2,3) if the log file has been created, and if it has been created I'd like the wrapper to check if it was successfully completed (aka, if the string Successfully completed appears in the log file).
Also, if one of the runs finishes and produces a log file that does not show successful completion, I'd like my wrapper to end and report that run k (k=1,2,3) was not completed.
After that,
Step 2. If all three runs completed successfully, I would run another program that takes those three output files as input; otherwise I'd print an error.
Basically in my question I am looking for two things:
Does it sound like a good way to write a wrapper?
How, in Python, can I check for the existence of a file and search it for a pattern at regular intervals, in a clean way?
Note. I am aware that LSF has job dependencies but I find this way to be more clear and easy to work with, though may not be optimal.
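As a rough illustration of the polling step described above, a check that waits for each log file and looks for the success marker might be sketched like this; the five-minute interval, file names and success string come from the question, the rest is an assumption:

import os
import time

LOGS = ["f1log.out", "f2log.out", "f3log.out"]
POLL_SECONDS = 5 * 60  # check every 5 minutes or so, as described

def run_status(path):
    # None = log not created yet, True = success string found, False = failed run.
    if not os.path.exists(path):
        return None
    with open(path) as fh:
        return "Successfully completed" in fh.read()

def wait_for_runs(logs):
    pending = set(logs)
    while pending:
        for log in list(pending):
            status = run_status(log)
            if status is None:
                continue  # still running
            if not status:
                raise RuntimeError("run producing {} was not completed".format(log))
            pending.discard(log)
        if pending:
            time.sleep(POLL_SECONDS)

# wait_for_runs(LOGS)  # then move on to Step 2 with the three output files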
I'm a user of an LSF system, and my major gripes are exit handling and cleanup. I think a neat idea would be to send a batch job array that has, for instance: an initialization task, a legwork task, and a cleanup task. LSF could complete all three and send a return code to the waiting head node. A lot of times LSF works great to send one job or command, but it isn't really set up to handle systematic processing.
Other than that I wish you luck :)
I'm writing some code that takes a bunch of text files, runs OpinionFinder on them, then analyses the results. OpinionFinder is a Python program that calls a Java program to manage various other programs.
I have:
some code (pull data off the web, write text files)
import shlex
import subprocess

args = shlex.split('python opinionfinder.py -f doclist')
optout = subprocess.Popen(args)
retcode = optout.wait()
some more code to analyse OpinionFinder's text files.
When I didn't have the optout.wait() call, the subprocess would still be running when the rest of the script finished, i.e. the file analysis ran before OpinionFinder had produced its output. When I added optout.wait(), OpinionFinder didn't run properly; I think because it couldn't find the files from the first part of the script, i.e. the order is wrong again.
What am I doing wrong?
What's the best way to run some script, execute an external process, then run the rest of the script?
Thanks.
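As a generic sketch of the run-the-external-process-then-continue pattern the question asks about (not a diagnosis of the specific failure), the blocking form could look like this; the command is taken from the question, the working directory is a hypothetical placeholder:

import subprocess

# ... first part of the script: pull data off the web, write the text files ...

# Run OpinionFinder and block until it exits; check=True raises
# CalledProcessError on a non-zero exit code instead of continuing silently.
subprocess.run(
    ["python", "opinionfinder.py", "-f", "doclist"],
    cwd="/path/to/opinionfinder",  # placeholder: the directory containing doclist
    check=True,
)

# ... rest of the script: analyse OpinionFinder's output files ...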
How can I check files that I already processed in a script so I don't process those again? and/or
What is wrong with the way I am doing this now?
Hello,
I am running tshark with the ring buffer option to dump to files after 5MB or 1 hour. I wrote a python script to read these files in XML and dump into a database, this works fine.
My issue is that this is really process intensive, one of those 5MB can turn into a 200MB file when converted to XML, so I do not want to do any unnecessary processing.
The script runs every 10 minutes and processes ~5 files per run. Since it scans the folder where the files are created for any new entries, I dump a hash of each file into the database, and on the next run I check the hash: if it isn't in the database, I scan the file.
The problem is that this does not appear to work every time; the script ends up processing files it has already done. When I check the hash of a file that it keeps trying to process, the hash doesn't show up anywhere in the database, hence why it tries to process the file over and over.
I am printing out the filename + hash in the output of the script:
using file /var/ss01/SS01_00086_20100107100828.cap with hash: 982d664b574b84d6a8a5093889454e59
using file /var/ss02/SS02_00053_20100106125828.cap with hash: 8caceb6af7328c4aed2ea349062b74e9
using file /var/ss02/SS02_00075_20100106184519.cap with hash: 1b664b2e900d56ca9750d27ed1ec28fc
using file /var/ss02/SS02_00098_20100107104437.cap with hash: e0d7f5b004016febe707e9823f339fce
using file /var/ss02/SS02_00095_20100105132356.cap with hash: 41a3938150ec8e2d48ae9498c79a8d0c
using file /var/ss02/SS02_00097_20100107103332.cap with hash: 4e08b6926c87f5967484add22a76f220
using file /var/ss02/SS02_00090_20100105122531.cap with hash: 470b378ee5a2f4a14ca28330c2009f56
using file /var/ss03/SS03_00089_20100107104530.cap with hash: 468a01753a97a6a5dfa60418064574cc
using file /var/ss03/SS03_00086_20100105122537.cap with hash: 1fb8641f10f733384de01e94926e0853
using file /var/ss03/SS03_00090_20100107105832.cap with hash: d6209e65348029c3d211d1715301b9f8
using file /var/ss03/SS03_00088_20100107103248.cap with hash: 56a26b4e84b853e1f2128c831628c65e
using file /var/ss03/SS03_00072_20100105093543.cap with hash: dca18deb04b7c08e206a3b6f62262465
using file /var/ss03/SS03_00050_20100106140218.cap with hash: 36761e3f67017c626563601eaf68a133
using file /var/ss04/SS04_00010_20100105105912.cap with hash: 5188dc70616fa2971d57d4bfe029ec46
using file /var/ss04/SS04_00071_20100107094806.cap with hash: ab72eaddd9f368e01f9a57471ccead1a
using file /var/ss04/SS04_00072_20100107100234.cap with hash: 79dea347b04a05753cb4ff3576883494
using file /var/ss04/SS04_00070_20100107093350.cap with hash: 535920197129176c4d7a9891c71e0243
using file /var/ss04/SS04_00067_20100107084826.cap with hash: 64a88ecc1253e67d49e3cb68febb2e25
using file /var/ss04/SS04_00042_20100106144048.cap with hash: bb9bfa773f3bf94fd3af2514395d8d9e
using file /var/ss04/SS04_00007_20100105101951.cap with hash: d949e673f6138af2d388884f4a6b0f08
The only files it should be processing are one per folder, so only 4 files. This causes unnecessary processing and I have to deal with overlapping cron jobs plus other services being affected.
What I am hoping to get from this post is a better way to do this, or hopefully someone can tell me why this is happening; I know the latter might be hard since there can be a bunch of reasons.
Here is the code (I am not a coder but a sys admin so be kind :P) line 30-32 handle the hash comparisons.
Thanks in advance.
A good way to handle/process files that are created at random times is to use incron rather than cron. (Note: since incron uses the Linux kernel's inotify syscalls, this solution only works with Linux.)
Whereas cron runs a job based on dates and times, incron runs a job based on changes in a monitored directory. For example, you can configure incron to run a job every time a new file is created or modified.
On Ubuntu, the package is called incron. I'm not sure about RedHat, but I believe this is the right package: http://rpmfind.net//linux/RPM/dag/redhat/el5/i386/incron-0.5.9-1.el5.rf.i386.html.
Once you install the incron package, read
man 5 incrontab
for information on how to set up the incrontab config file. Your incron_config file might look something like this:
/var/ss01/ IN_CLOSE_WRITE /path/to/processing/script.py $#
/var/ss02/ IN_CLOSE_WRITE /path/to/processing/script.py $#
/var/ss03/ IN_CLOSE_WRITE /path/to/processing/script.py $#
/var/ss04/ IN_CLOSE_WRITE /path/to/processing/script.py $#
Then to register this config with the incrond daemon, you'd run
incrontab /path/to/incron_config
That's all there is to it. Now whenever a file is created in /var/ss01, /var/ss02, /var/ss03 or /var/ss04, the command
/path/to/processing/script.py $#
is run, with $# replaced by the name of the newly created file.
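The processing script then only needs to handle the single file it is handed; a hypothetical skeleton of such a script.py might be:

#!/usr/bin/env python
import sys

def process_capture(path):
    # Placeholder for the existing tshark-XML-to-database logic,
    # applied to exactly one capture file.
    print("processing", path)

if __name__ == "__main__":
    process_capture(sys.argv[1])  # the file name incron passes via $#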
This will obviate the need to store/compare hashes, and files will only get processed once -- immediately after they are created.
Just make sure your processing script does not write into the top level of the monitored directories.
If it does, then incrond will notice the new file created, and launch script.py again, sending you into an infinite loop.
incrond monitors individual directories, and does not recursively monitor subdirectories. So you could direct tshark to write to /var/ss01/tobeprocessed, use incron to monitor
/var/ss01/tobeprocessed, and have your script.py write to /var/ss01, for example.
PS. There is also a python interface to inotify, called pyinotify. Unlike incron, pyinotify can recursively monitor subdirectories. However, in your case, I don't think the recursive monitoring feature is useful or necessary.
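For completeness, a minimal pyinotify equivalent of the incron setup above might look like the sketch below; the watched directories come from the question, the handler body is a placeholder:

import pyinotify

WATCH_DIRS = ["/var/ss01", "/var/ss02", "/var/ss03", "/var/ss04"]

class Handler(pyinotify.ProcessEvent):
    def process_IN_CLOSE_WRITE(self, event):
        # event.pathname is the file that was just closed after writing;
        # hand it to the existing processing logic here.
        print("process", event.pathname)

wm = pyinotify.WatchManager()
notifier = pyinotify.Notifier(wm, Handler())
for d in WATCH_DIRS:
    wm.add_watch(d, pyinotify.IN_CLOSE_WRITE)
notifier.loop()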
I don't know enough about what is in these files, so this may not work for you, but if you have only one intended consumer, I would recommend using directories and moving the files to reflect their state. Specifically, you could have a dir structure like
/waiting
/progress
/done
and use the atomicity of mv to change the "state" of each file. (mv is an atomic rename as long as the source and destination are on the same filesystem; across filesystems it degrades to a copy and delete.)
When your processing task wants to work on a file, it moves it from waiting to progress (and makes sure that the move succeeded). That way, no other task can pick it up, since it's no longer waiting. When the file is complete, it gets moved from progress to done, where a cleanup task might delete or archive old files that are no longer needed.
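A minimal sketch of that waiting/progress/done scheme in Python, assuming all three directories sit on the same filesystem so the rename is atomic; the directory names follow the layout above and the processing function is hypothetical:

import os
import shutil

WAITING, PROGRESS, DONE = "waiting", "progress", "done"

def claim(filename):
    # Try to claim a file by moving it from waiting/ to progress/.
    # Returns the new path, or None if another worker got there first.
    src = os.path.join(WAITING, filename)
    dst = os.path.join(PROGRESS, filename)
    try:
        os.rename(src, dst)  # atomic on the same filesystem
    except OSError:
        return None
    return dst

def finish(filename):
    # Mark a file as processed by moving it from progress/ to done/.
    shutil.move(os.path.join(PROGRESS, filename), os.path.join(DONE, filename))

# for name in os.listdir(WAITING):
#     path = claim(name)
#     if path:
#         process(path)  # hypothetical processing function
#         finish(name)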
I see several issues.
If you have overlapping cron jobs you need a locking mechanism to control access. Only allow one process at a time to eliminate the overlap problem. You might set up a shell script to do that: create a 'lock' by making a directory (mkdir is atomic), process the data, then delete the lock directory. If the shell script finds that the directory already exists when it tries to create it, you know another copy is already running and it can just exit.
If you can't change the cron table(s) then just rename the executable and name your shell script the same as the old executable.
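The same mkdir-based lock is just as easy to express in Python, if that is more convenient than a wrapper shell script; a rough sketch with an arbitrarily chosen lock path:

import os
import sys

LOCK_DIR = "/tmp/capture_processing.lock"  # arbitrary example location

try:
    os.mkdir(LOCK_DIR)  # atomic: fails if another run already holds the lock
except OSError:
    sys.exit("another instance is already running")

try:
    pass  # process the capture files here
finally:
    os.rmdir(LOCK_DIR)  # always release the lock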
Hashes are not guaranteed to be unique identifiers for files; collisions are unlikely, but they are not absolutely impossible.
Why not just move a processed file to a different directory?
You mentioned overlapping cron jobs. Does this mean one conversion process can start before the previous one has finished? If so, you would perform the move at the beginning of the conversion. If you are worried about an interrupted conversion, use an intermediate directory and move the file to a final directory after completion.
If I'm reading the code correctly, you're updating the database (by which I mean the log of files processed) at the very end. So when you have a huge file that is still being processed, another cron job can 'legally' start working on it, and both complete successfully, resulting in two entries in the database.
I suggest you move the logging-to-database step up so that it acts as a lock for subsequent cron jobs, and record a 'success' or 'completed' state at the very end. The latter part is important, as something that is shown as processing but never reaches a completed state (coupled with the notion of time) can be programmatically concluded to be an error (that is, a cron job tried processing it but never completed it, and the log shows it as processing for a week!).
To summarize
Move up the log-to-database so that it would act as a lock
Add a 'success' or 'completed' state which would give the notion of errored state
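A rough sqlite-flavoured sketch of that idea, with a made-up table layout (the real database and schema are not shown in the question):

import sqlite3
import time

conn = sqlite3.connect("processed.db")
conn.execute("""CREATE TABLE IF NOT EXISTS files
                (hash TEXT PRIMARY KEY, name TEXT, state TEXT, started REAL)""")

def try_claim(file_hash, name):
    # Insert a 'processing' row up front; the primary key acts as the lock.
    try:
        with conn:
            conn.execute("INSERT INTO files VALUES (?, ?, 'processing', ?)",
                         (file_hash, name, time.time()))
        return True
    except sqlite3.IntegrityError:
        return False  # another run already claimed (or finished) this file

def mark_completed(file_hash):
    with conn:
        conn.execute("UPDATE files SET state = 'completed' WHERE hash = ?",
                     (file_hash,))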
PS: Don't take this the wrong way, but the code is a little hard to understand; I am not sure whether I follow it at all.
We have several cron jobs that ftp proxy logs to a centralized server. These files can be rather large and take some time to transfer. Part of the requirement of this project is to provide a logging mechanism in which we log the success or failure of these transfers. This is simple enough.
My question is: is there a way to check whether a file is currently being written to? My first thought was to just check the file size twice within a given timeframe and see whether it changed. But a co-worker said there might be a way to hook into the EXT3 file system via Python and check the attributes to see if the file is currently being appended to. My Google-Fu came up empty.
Is there a module for EXT3 or something else that would allow me to check the state of a file? The server is running Fedora Core 9 with EXT3 file system.
No need for ext3-specific hooks; just check lsof, or more exactly /proc/<pid>/fd/* and /proc/<pid>/fdinfo/* (that's where lsof gets its info, AFAICT). There you can check whether the file is open, whether it's writable, and the 'cursor' position.
That's not the whole picture, though: anything beyond that happens in process space, in the writing process's stdlib buffers. Most writes are buffered and the kernel only sees bigger chunks of data, so an 'ext3-aware' monitor wouldn't see that either.
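A quick way to script that check without any ext3-specific code is to shell out to lsof, which exits non-zero when nothing has the file open; a small sketch:

import subprocess

def is_open(path):
    # Returns True if any process currently has path open, according to lsof.
    result = subprocess.run(["lsof", path],
                            stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL)
    return result.returncode == 0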
There are no ext3 hooks to check what you want directly.
I suppose you could dig through the source code of the Linux fuser command, replicate the part that finds which processes have a file open, and watch that resource. When no one has the file open any longer, the transfer is done.
Another approach:
Your cron jobs should signal when they're finished.
Our cron jobs that transport files simply write an empty filename.finished after filename has been transferred. Another approach is to transfer to a temporary name, e.g. filename.part, and then rename it to filename; renaming is atomic. In either case you check repeatedly for the presence of filename or filename.finished.
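A sketch of that sentinel check on the receiving side; the directory, poll interval and processing function are arbitrary placeholders:

import glob
import os
import time

INCOMING = "/var/log/proxy_incoming"  # arbitrary example directory

def finished_transfers():
    # Yield log files whose transfer has signalled completion via a .finished marker.
    for marker in glob.glob(os.path.join(INCOMING, "*.finished")):
        log = marker[:-len(".finished")]
        if os.path.exists(log):
            yield log

# while True:
#     for path in finished_transfers():
#         process(path)  # hypothetical processing + success/failure logging
#         os.remove(path + ".finished")
#     time.sleep(60)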