I want to categorize the errors in my log files. I have many folders (~100), and each of them has a log file. I want to parse all the log files and categorize the different errors with their frequency. The log has the following format:
2014-10-22 07:55:02,997 ERROR log_message [optional_stack_trace]
One approach is to first pull out all the log statements containing ERROR and put them in a single file. Ideally the resulting file would have just the log_messages, without the date & ERROR strings. I guess I can then group similar strings. What do you think? Is there a cleaner, better approach?
You're going to want something like this (using GNU awk for true 2-d arrays):
$ awk '{cnt[$3][$4]++} END{for (err in cnt) for (msg in cnt[err]) print err, msg, cnt[err][msg]}' file1 file2 ...
but since you didn't post any sample input and expected output, it's a guess.
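If you end up doing the whole sweep in Python instead, a minimal sketch could look like the following (the logs/*/*.log glob pattern and the field layout are assumptions based on the sample line above, so adjust both to your actual folder structure):

import glob
import collections

counts = collections.Counter()

# Hypothetical layout: one log file per folder, e.g. logs/<folder>/app.log
for path in glob.glob('logs/*/*.log'):
    with open(path) as fh:
        for line in fh:
            # Expected format: date time LEVEL message...
            parts = line.split(None, 3)
            if len(parts) == 4 and parts[2] == 'ERROR':
                counts[parts[3].strip()] += 1

for msg, freq in counts.most_common():
    print("%d\t%s" % (freq, msg))

Grouping "similar" messages (ones that differ only in an ID or a path, say) would still need an extra normalisation step before counting.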
Hello, I am using pandas to process an Excel file. My code looks as follows:
df = xl.parse("Sheet1")
important_Parameters="bash generate.sh --companyCode"+" "+df[u'Company Code '].astype(str)+" "+"--isaADD"+df[u'TP Interchange Address '].astype(str)+" "+"--gsADD"+" "+df[u'TP Functional Group Address '].astype(str)
print(important_Parameters)
Everything works well; when I print it, the output looks fine. I wish to write a txt file with the contents of my object called:
important_Parameters
I tried with:
important_Parameters.to_pickle("important.txt")
but the result does not look like the printed output. I believe that is due to the way I chose to write it to disk.
I also tried with:
important_Parameters.to_string("importantParameters2.txt")
However, while this gave me a more friendly representation of the data, the result includes the row number, and the rows are also truncated. They look as follows:
bash generate.sh --companyCode 889009d --isaADD...
It is showing this ... instead of the full row. I would appreciate any suggestion for producing a simple .txt file called importante.txt with my result, i.e., the contents of important_Parameters. Thanks for the support.
In order to include more details, my output (i.e., the result of the print) looks like:
0 bash generate.sh --companyCode 323232 --isaADD...
1 bash generate.sh --companyCode 323232 --isaADD...
2 bash generate.sh --companyCode 323232 --isaADD...
Pandas DataFrames have more than a few methods for saving to files. Have you tried important_Parameters.to_csv("important.csv")? I'm not certain what you want the output to look like.
If you want it tab-separated, you can try:
important_Parameters.to_csv("important.csv", sep='\t')
If the file absolutely must end in .txt, just change it to: important_Parameters.to_csv("important.txt"). CSVs are just specifically formatted text files so this shouldn't be a problem.
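If the file really just needs the full command strings, one per line, with no index column and no ... truncation, you can also sidestep the pandas writers and write the Series out yourself (a minimal sketch, assuming important_Parameters is a Series of strings):

# Write each generated command on its own line, without the index
# and without any "..." truncation.
with open("importante.txt", "w") as f:
    for cmd in important_Parameters:
        f.write(cmd + "\n")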
I mainly have 2 questions with respect to this topic:
I'm looking to get the row counts of a few CSV files. In Bash, I know I can do wc -l < filename.csv. How do I do this and subtract 1 from it (because of headers)?
For anyone familiar with CSV files and the possible issues with grabbing a raw line count: how plausible is it that a record is wrapped across multiple lines? I know this is a very possible scenario, but I'd like to be able to say that it never happens. If it is a possibility, would Python's csv package be a better choice? Does it read records based on delimiters and quote characters rather than raw lines?
As Barmar points out, (1) it is quite possible for CSV files to have wrapped lines, and (2) CSV programming libraries can handle this well. As an example, here is a Python one-liner which uses Python's csv module to count the number of rows in file.csv minus 1:
python -c 'import csv; print( sum(1 for line in csv.reader(open("file.csv")))-1 )'
The -c option tells python to treat the argument string as a program to execute. In this case, we make the csv module available with the import statement. Then we print out the number of rows minus one. The construct sum(1 for line in csv.reader(open("file.csv"))) counts the rows one at a time.
If your csv file has a non-typical format, you will need to set options. This might be the delimiter or quoting character. See the documentation for details.
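For instance, counting the rows of a semicolon-delimited file with double-quoted fields might look like this (the delimiter, quote character and file name are placeholders for whatever your file actually uses):

import csv

with open("data.csv") as f:
    reader = csv.reader(f, delimiter=';', quotechar='"')
    print(sum(1 for row in reader) - 1)   # minus the header row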
Example
Consider this test file:
$ cat file.csv
First name,Last name,Address
John,Smith,"P O Box 1234
Somewhere, State"
Jane,Doe,"Unknown"
This file has two rows plus a header. One of the rows is split over two lines. Python's csv module correctly understands this:
$ python -c 'import csv; print( sum(1 for line in csv.reader(open("file.csv")))-1 )'
2
gzipped files
To open gzipped files in Python, we use the gzip module:
$ python -c 'import csv, gzip; print( sum(1 for line in csv.reader(gzip.GzipFile("file.csv.gz")))-1 )'
2
To get the line count, just subtract 1 from the value returned by wc, using an arithmetic expression:
count=$(($(wc -l < filename.csv) - 1))
The CSV format allows fields to contain newlines by surrounding the field with quotes, e.g.
field1,field2,"field3 broken
across lines",field4
Dealing with this in a plain bash script would be difficult (indeed, any CSV processing that needs to handle quoted fields is tricky). If you need to deal with the full generality of CSV, you should probably use a programming language with a CSV library.
But if you know that your CSV files will never be like this, you can ignore it.
As an alternative to subtracting one from the total line count, you can discard the first line of the file before counting:
row_count=$( { read; wc -l; } < filename.csv )
(This is in no way better than simply using $(($(wc -l < filename.csv) - 1)); it's just a useful trick to know.)
I need to filter messages out of a log file which has the following format:
2013-03-22T11:43:21.817078+01:00 INFO log msg 1...
...
2013-03-22T11:44:32.817114+01:00 WARNING log msg 2...
...
2013-03-22T11:45:45.817777+01:00 INFO log msg 3...
...
2013-03-22T11:46:59.547325+01:00 INFO log msg 4...
...
(where ... means "more messages")
The filtering must be done based on a timeframe.
This is part of a bash script, and at this point in the code the timeframe is stored as $start_time and $end_time. For example:
start_time="2013-03-22T11:45:20"
end_time="2013-03-22T11:45:50"
Note that the exact value of $start_time or $end_time may never appear in the log file; yet there will be several messages within the timeframe [$start_time, $end_time], which are the ones I'm looking for.
Now, I'm almost convinced I'll need a Python script to do the filtering, but I'd rather use grep (or awk, or any other tool) since it should run much faster (the log files are big).
Any suggestions?
Based on the log content in your question, I think an awk one-liner may help:
awk -F'.' -vs="$start_time" -ve="$end_time" '$1>s && $1<e' logfile
Note: this filtering excludes lines whose timestamp is exactly equal to the start or end time.
$ start_time="2013-03-22T11:45:20"
$ end_time="2013-03-22T11:45:50"
$ awk -F'.' '$1>s&&$1<e' s=$start_time e=$end_time file
2013-03-22T11:45:45.817777+01:00 INFO log msg 3...
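If you do end up reaching for Python after all, ISO 8601 timestamps compare correctly as plain strings, so the filter stays simple (a sketch; the log file name is a placeholder, and unlike the awk version this one includes the endpoints):

import sys

start_time = "2013-03-22T11:45:20"
end_time = "2013-03-22T11:45:50"

with open("logfile") as fh:
    for line in fh:
        # Timestamp portion before the fractional seconds, e.g. 2013-03-22T11:45:45
        stamp = line.split('.', 1)[0]
        if start_time <= stamp <= end_time:
            sys.stdout.write(line)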
There is a huge log of errors/warnings/infos printed out on stdout. I am only interested in the lines logged after I start a specific action.
Other information: I am using Python to telnet to a shell environment. I execute the commands on the shell and store the time the action is started. I then call a command to view the log, which spits it out on stdout. I expect to read the grepped lines after that timestamp back into Python. I also store the current time, but I'm not sure how to use that (maybe grep on a date range?).
I can redirect to a file and use find but the log is huge and I'd rather not read all of it.
I can grep -n to get the line number and then read everything after it, but I'm not sure how.
The concept regex to egrep on is something like: {a-timestamp}*
Any suggestions would be appreciated!
awk '/the-timestamp-I-have/,0' the-log-file
This will print the lines from the-log-file, starting at the first line that matches the-timestamp-I-have and continuing through the last line.
Ref:
http://www.catonmat.net/blog/awk-one-liners-explained-part-three/
http://www.catonmat.net/blog/ten-awk-tips-tricks-and-pitfalls/#awk_ranges
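Since the surrounding script is already Python, the same range behaviour can also be reproduced without shelling out to awk (a sketch; log_text and timestamp are placeholders for whatever you captured over telnet):

def lines_from(log_text, timestamp):
    """Yield every line from the first one containing `timestamp` onwards,
    mirroring awk '/timestamp/,0'."""
    found = False
    for line in log_text.splitlines():
        if not found and timestamp in line:
            found = True
        if found:
            yield line

If the exact timestamp might not appear in the log, switch the in test to a >= comparison on the line's leading timestamp field instead.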
I am trying to convert my shell scripts into Python code, but I am stuck on this operation.
I have a process that outputs a text file, the file has sections like this:
Running Operation kdasdakdnaskdaksdma
(error if present) error: kdmakmdasmdaksom
This file could have multiple lines for the operation and for the error (if present; otherwise the next line will just contain another operation); there is always a CRLF after each block.
I am trying to scan the file to find the line that contains "error:", and then read the operation that caused the error and the details of the error, so I can extract them from the text file and save them in an error log file.
So far I can find the line(s) that contain "error:" with this simple code, but I have not been able to find any example of how to print the lines that are not necessarily the ones containing the error message, but the ones that come before and after the line where "error:" is located.
Using awk or grep this would be straightforward, but with Python I am not really sure how to do it. This is what I have so far; it prints the line that has the error, but only that, while I would like control over the lines printed before and after it.
import re
import os

fh = open(os.path.expanduser('~/logs_output.txt'))  # '~' must be expanded explicitly
for line in fh:
    if "error:" in line:
        print line
I tried looking at the re module in Python, and also at the string module, but so far I haven't found anything that would let me do what you can do with awk, for example, where you can look for an occurrence of a specific string, turn printing on, and then turn it off once you are done.
Can anyone point me in the right direction to tackle this issue? Thanks!
import re
ss = '''qhvfgbhgozr
yytuuuyuyuuuyuyuuyy
jhfg tryy error jjfkhdjhfjh ttrtr
aaaeeedddeedaeaeeaeeea
jhzdgcoiua zfaozifh cohfgdyg fuo'''
regx = re.compile('^(.*)\r?\n(.*?error.*)\r?\n(.*)', re.MULTILINE)
print regx.search(ss).groups()
result
('yytuuuyuyuuuyuyuuyy', 'jhfg tryy error jjfkhdjhfjh ttrtr', 'aaaeeedddeedaeaeeaeeea')
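If you want behaviour closer to grep -B/-A (a fixed number of context lines around every match, across the whole file rather than just the first hit), a small sliding window does it without regexes (a sketch; the file name and the single line of context are assumptions):

import collections

CONTEXT = 1                                  # lines of context before and after each match
before = collections.deque(maxlen=CONTEXT)   # sliding window of preceding lines
trailing = 0                                 # how many more lines to print after a match

with open('logs_output.txt') as fh:
    for line in fh:
        if 'error:' in line:
            for prev in before:
                print(prev.rstrip())
            print(line.rstrip())
            trailing = CONTEXT
        elif trailing:
            print(line.rstrip())
            trailing -= 1
        before.append(line)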