Summarizing log file to unique entries only - python

I have been using this script for years at work to summarize log files.
#!/usr/bin/perl
$logf = '/var/log/messages.log';
@logf = ( `cat $logf` );
foreach $line ( @logf ) {
    $line =~ s/\d+/#/g;
    $count{$line}++;
}
@alpha = sort @logf;
$prev = 'null';
@uniq = grep($_ ne $prev && ($prev = $_), @alpha);
foreach $line (@uniq) {
    print "$count{$line}: ";
    print "$line";
}
I have wanted to rewrite it in Python but I do not fully understand certain portions of it, such as:
@alpha = sort @logf;
$prev = 'null';
@uniq = grep($_ ne $prev && ($prev = $_), @alpha);
Does anyone know of a Python module that would negate the need to rewrite this? I haven't had any luck finding anything similar. Thanks in advance!

As the name of the var implies,
@alpha = sort @logf;
$prev = 'null';
@uniq = grep($_ ne $prev && ($prev = $_), @alpha);
is finding unique elements (i.e. removing duplicate lines), ignoring numbers in the line since they were previously replaced with #. Those three lines could have been written
@uniq = sort keys(%count);
or maybe even
@uniq = keys(%count);
Another way of writing the program in Perl:
my $log_qfn = '/var/log/messages.log';
open(my $fh, '<', $log_qfn)
    or die("Can't open $log_qfn: $!\n");

my %counts;
while (<$fh>) {
    s/\d+/#/g;
    ++$counts{$_};
}

#for (sort keys(%counts)) {
for (keys(%counts)) {
    print "$counts{$_}: $_";
}
This should be easier to translate into Python.
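In case it helps, a near-literal Python translation of that rewrite could look like this (a sketch of mine using a plain dict, with unsorted output matching the uncommented loop):

import re

log_qfn = '/var/log/messages.log'
counts = {}
with open(log_qfn) as fh:
    for line in fh:
        line = re.sub(r'\d+', '#', line)  # mask digit runs, like s/\d+/#/g
        counts[line] = counts.get(line, 0) + 1
for line in counts:
    # each key still ends with its newline, so suppress print's own
    print("{}: {}".format(counts[line], line), end='')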

@alpha = sort @logf;
$prev = 'null';
@uniq = grep($_ ne $prev && ($prev = $_), @alpha);
would be equivalent to
uniq = sorted(set(logf))
if logf were a list of lines.
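(As an aside of mine, not part of the original answer: logf could be turned into such a list with, e.g.:)

with open('/var/log/messages.log') as f:
    logf = f.readlines()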
However, since you are counting the frequency of lines, you could use a collections.Counter to both count the lines and collect the unique lines (as keys), thus removing the need to compute uniq at all:
count = collections.Counter()
for line in f:
    count[line] += 1
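As a side note of mine (not part of the original answer): Counter can also consume a generator directly, which collapses the loop to a single expression:

count = collections.Counter(re.sub(r'\d+', '#', line) for line in f)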
import sys
import re
import collections

logf = '/var/log/messages.log'
count = collections.Counter()
write = sys.stdout.write
with open(logf, 'r') as f:
    for line in f:
        line = re.sub(r'\d+', '#', line)
        count[line] += 1
for line in sorted(count):
    write("{c}: {l}".format(c=count[line], l=line))

I have to say, I often encounter people trying to do things in Python or Perl that can be done in one line of shell. I don't care about downvotes; people should know there is no reason to write 20 lines of Python for something the shell can do directly:
sort < my_file.txt | uniq > uniq_my_file.txt
(Note that sort | uniq only removes duplicate lines; to reproduce the counting done by the original script you would use uniq -c.)

Related

Python print -> Perl STDIN line skip problem

I'm a newbie in Perl and Python.
I need to do some file handling in Python (a dataframe), and that file then needs to be processed in Perl.
At first I tried Python's subprocess, and it was not working (broken pipe).
I need to output multiple lines from Python, and the Perl code needs to read and process them.
I just used | on the command line and that worked, but Perl skips the odd-numbered lines and only reads the even-numbered ones.
How can I fix it?
My Python code is:
import pandas as pd

data = pd.read_csv('./data.txt', sep='\t', header=None)
datalist = list(data[0] + '_' + data[1])
for line in datalist:
    print(line)
and my Perl code is:
use strict;

my %new_list = ();
while (<STDIN>) {
    my $line = <STDIN>;
    # print STDERR $line;
    # chomp $line;
    my ($name, $title) = split('_', <STDIN>);
    $new_list{$title} = $name;
    print STDERR $name, "\t", $title, "\n";
}
print STDERR scalar(keys %new_list);
My Python code outputs 657 lines, but Perl only prints 329.
How can I fix it?
The expression <STDIN> reads a new line from standard input every time it is evaluated, so your Perl code consumes several lines per iteration of the while loop: one in the loop condition, another in the assignment to $line, and another in the split.
It is sufficient to say
while (<STDIN>) {
    my $line = $_;
    ...
or just
while (my $line = <STDIN>) {
    ...

Convert Perl syntax to Python [duplicate]

This question already has an answer here:
How to create a dict equivalent in Python from Perl hash?
(1 answer)
Closed 5 years ago.
I have a file whose lines contain fields separated by whitespace.
I have written the program below in Perl and it works.
Now I must rewrite it in Python, which is not my language, but I have more or less solved it.
I'm currently struggling with this Perl expression, which I can't convert to Python.
$hash{$prefix}++;
I have found some solutions, but I'm not sufficiently experienced with Python to judge them. All the solutions look complicated to me compared to the Perl one.
These Stack Overflow questions seem to be relevant.
Python variables as keys to dict
Python: How to pass key to a dictionary from the variable of a function?
Perl
#!perl -w
use strict;
use warnings FATAL => 'all';

our $line = "";
our @line = "";
our $prefix = "";
our %hash;
our $key;

while ( $line = <STDIN> ) {
    # NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE
    next if $line =~ /^NAMESPACE/;
    # aleks-event-test redis-1-m06k0 1/1 Running 0 1d 172.26.0.138 The_Server_name
    @line = split ' ', $line;
    $line[1] =~ /(.*)-\d+-\w+$/;
    $prefix = $1;
    #print "$prefix $line[7]\n";
    print "$prefix $line[7]\n";
    $hash{$prefix}++;
}
foreach $key ( keys %hash ) {
    if ( $hash{$key} / 2 ) {
        print "$key : $hash{$key} mod 2 \n"
    }
    else {
        print "$key : $hash{$key} not mod 2 \n"
    }
}
Python
#!python
import sys
import re

myhash = {}
for line in sys.stdin:
    """
    These projects are ignored
    """
    if re.match('^NAMESPACE|logging|default', line):
        continue
    linesplited = line.split()
    prefix = re.split('(.*)(-\d+)?-\w+$', linesplited[1])
    #print linesplited[1]
    print prefix[1]
    myhash[prefix[1]] += 1
Your problem is using this line:
myhash = {}
# ... code ...
myhash[prefix[1]] += 1
You likely are getting a KeyError. This is because you start off with an empty dictionary (or hash), and if you attempt to reference a key that doesn't exist yet, Python will raise an exception.
A simple solution that will let your script work is to use a defaultdict, which will auto-initialize any key-value pair you attempt to access.
#!python
import sys
import re
from collections import defaultdict

# Since you're keeping counts, we'll initialize this so that the values
# of the dictionary are `int` and will default to 0
myhash = defaultdict(int)

for line in sys.stdin:
    """
    These projects are ignored
    """
    if re.match('^NAMESPACE|logging|default', line):
        continue
    linesplited = line.split()
    prefix = re.split('(.*)(-\d+)?-\w+$', linesplited[1])
    #print linesplited[1]
    print prefix[1]
    myhash[prefix[1]] += 1
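If importing defaultdict feels heavier than the Perl one-liner, a plain dict works too; here is a sketch of the equivalent increment using dict.get():

myhash = {}
# inside the loop: get() falls back to 0 when the key is missing,
# so the increment never raises KeyError
myhash[prefix[1]] = myhash.get(prefix[1], 0) + 1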

Sort file by key with awk or perl like a join without presorting

I want to join two tab-separated files, but they are in a different order. I know that it is doable with awk, but I don't know how. Here is the equivalent toy Python code (Python is too memory-inefficient for this task without crazy workarounds):
import pandas as pd
from random import shuffle
a = ['bar','qux','baz','foo','spam']
df = pd.DataFrame({'nam':a,'asc':[1,2,3,4,5],'desc':[5,4,3,2,1]})
shuffle(a)
print(a)
dex = pd.DataFrame({'dex' : a})
df_b = pd.DataFrame({'VAL1' :[0,1,2,3,4,5,6]})
pd.merge(dex, df,left_on='dex',right_on='nam')[['asc','desc','nam']]
I have two files:
For file one, column 2 holds the identifier for each row, there are 5 columns I don't need, and then there are about 3 million columns of data.
For file two, there are 12 columns, with the second column containing the same identifiers in a different order, along with additional ids.
I want to sort file one to have the same identifiers and order as file two, with the other columns appropriately rearranged.
File one is potentially multiple gigabytes.
Is this easier with awk and/or other GNU tools, or should I use perl?
If the size of file1 is in the order of GB, and you have 3 million columns of data, you have a tiny number of lines (no more than 200). While you can't load all of the lines themselves into memory, you could easily load all of their locations.
use feature qw( say );
use Fcntl qw( SEEK_SET );

open(my $fh1, '<', $qfn1) or die("Can't open \"$qfn1\": $!\n");
open(my $fh2, '<', $qfn2) or die("Can't open \"$qfn2\": $!\n");

my %offsets;
while (1) {
    my $offset = tell($fh1);
    my $row1 = <$fh1>;
    last if !defined($row1);
    chomp($row1);
    my @fields1 = split(/\t/, $row1);
    my $key = $fields1[1];
    $offsets{$key} = $offset;
}

while (my $row2 = <$fh2>) {
    chomp($row2);
    my @fields2 = split(/\t/, $row2);
    my $key = $fields2[1];

    my $offset = $offsets{$key};
    if (!defined($offset)) {
        warn("Key $key not found.\n");
        next;
    }

    seek($fh1, $offset, SEEK_SET);
    my $row1 = <$fh1>;
    chomp($row1);
    my @fields1 = split(/\t/, $row1);

    say join "\t", @fields2, @fields1[6..$#fields1];
}
This approach can be taken in Python as well.
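For example, a minimal Python sketch of the same offset-index idea (the file names here are my assumptions):

import sys

qfn1, qfn2 = 'file1.tsv', 'file2.tsv'  # hypothetical file names

# Pass 1: remember where each key's row starts in file one.
offsets = {}
with open(qfn1) as fh1:
    while True:
        offset = fh1.tell()
        row1 = fh1.readline()
        if not row1:
            break
        offsets[row1.rstrip('\n').split('\t')[1]] = offset

# Pass 2: walk file two and seek into file one on demand.
with open(qfn1) as fh1, open(qfn2) as fh2:
    for row2 in fh2:
        fields2 = row2.rstrip('\n').split('\t')
        offset = offsets.get(fields2[1])
        if offset is None:
            print("Key {} not found.".format(fields2[1]), file=sys.stderr)
            continue
        fh1.seek(offset)
        fields1 = fh1.readline().rstrip('\n').split('\t')
        print('\t'.join(fields2 + fields1[6:]))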
Note: There exists a much simpler solution if the order is more flexible (i.e. if you're OK with the output being ordered as the records are ordered in file1). This assumes file2 easily fits in memory.
3 million columns of data, eh? It sounds like you're doing some NLP work.
Assuming this is true, and your matrix is sparse, python can handle it just fine (just not with pandas). Look at scipy.sparse. Example:
from scipy.sparse import dok_matrix
A = dok_matrix((10,10))
A[1,1] = 1
B = dok_matrix((10,10))
B[2,2] = 2
print A+B
DOK stands for "dictionary of keys", which is typically used to build the sparse matrix, then it's usually converted to CSR, etc. depending on use. See available sparse matrix types.
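For instance, continuing the example above, the conversion is a single call (a usage sketch):

C = (A + B).tocsr()  # CSR is better suited to arithmetic and row slicing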
The important thing is not to split any more than necessary. If you have enough memory, putting the smaller file in a hash, and then reading through the second file ought to work.
Consider the following example (note the run time of this script includes the time it takes to create sample data):
#!/usr/bin/env perl

use strict;
use warnings;

# This is a string containing 10 lines corresponding to your "file one"
# Second column has the record ID
# Normally, you'd be reading this from a file
my $big_file = join "\n",
    map join("\t", 'x', $_, ('x') x 3_000_000),
    1 .. 10;

# This is a string containing 10 lines corresponding to your "file two"
# Second column has the record ID
my $small_file = join "\n",
    map join("\t", 'y', $_, ('y') x 10),
    1 .. 10;

# You would normally pass file names as arguments
join_with_big_file(
    \$small_file,
    \$big_file,
);

sub join_with_big_file {
    my $small_records = load_small_file(shift);
    my $big_file = shift;

    open my $fh, '<', $big_file
        or die "Cannot open '$big_file': $!";

    while (my $line = <$fh>) {
        chomp $line;
        my ($first, $id, $rest) = split /\t/, $line, 3;
        print join("\t", $first, $id, $rest, $small_records->{$id}), "\n";
    }

    return;
}

sub load_small_file {
    my $file = shift;
    my %records;

    open my $fh, '<', $file
        or die "Cannot open '$file' for reading: $!";

    while (my $line = <$fh>) {
        # limit the split
        my ($first, $id, $rest) = split /\t/, $line, 3;
        # I drop the id field here so it is not duplicated in the joined
        # file. If that is not a problem, $records{$id} = $line
        # would be better.
        $records{$id} = join("\t", $first, $rest);
    }

    return \%records;
}
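The same load-the-small-file-into-a-dict idea in Python might look like this (a sketch of mine; the file names and the well-formed-TSV assumption are hypothetical):

small_records = {}
with open('file2.tsv') as fh:                # the smaller file
    for line in fh:
        # limit the split, and drop the id so it is not duplicated in the output
        first, key, rest = line.rstrip('\n').split('\t', 2)
        small_records[key] = first + '\t' + rest

with open('file1.tsv') as fh:                # the big file, streamed line by line
    for line in fh:
        line = line.rstrip('\n')
        key = line.split('\t', 2)[1]
        print(line + '\t' + small_records[key])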

Python line by line data processing

I am new to Python and I have searched a few articles but could not find the correct syntax for reading a file and doing awk-style line processing in Python. I need your help in solving this problem.
This is how my bash script for build and deploy looks; I read a configuration file in bash which looks like this:
backup /apps/backup
oracle /opt/qosmon/qostool/oracle oracle-client-12.1.0.1.0
and the bash section that reads it looks like this:
while read line
do
    case "$line" in */package*) continue ;; esac
    host_file_array+=("$line")
done < ${HOST_FILE}

for ((i=0 ; i < ${#host_file_array[*]}; i++))
do
    # echo "${host_file_array[i]}"
    host_file_line="${host_file_array[i]}"
    if [[ "$host_file_line" != "#"* ]];
    then
        COMPONENT_NAME=$(echo $host_file_line | awk '{print $1;}' )
        DIRECTORY=$(echo $host_file_line | awk '{print $2;}' )
        VERSION=$(echo $host_file_line | awk '{print $3;}' )
        if [[ ("${COMPONENT_NAME}" == *"oracle"*) ]];
        then
            print_parameters "Status ${DIRECTORY}/${COMPONENT_NAME}"
            /bin/bash ${DIRECTORY}/${COMPONENT_NAME}/current/script/manage-oracle.sh ${FORMAT_STRING} start
        fi
        etc .........
How can the same be converted to Python? This is what I have prepared so far in Python:
f = open('%s' % host_file, "r")
array = []
line = f.readline()
index = 0
while line:
    line = line.strip("\n ' '")
    line = line.split()
    array.append([])
    for item in line:
        array[index].append(item)
    line = f.readline()
    index += 1
f.close()
I tried split in Python, but since the config file does not have an equal number of columns in all rows, I get an index out of bounds error. What is the best way to process it?
I think dictionaries might be a good fit here, you can generate them as follows:
>>> result = []
>>> keys = ["COMPONENT_NAME", "DIRECTORY", "VERSION"]
>>> with open(hosts_file) as f:
...     for line in f:
...         result.append(dict(zip(keys, line.strip().split())))
...
>>> result
[{'DIRECTORY': '/apps/backup', 'COMPONENT_NAME': 'backup'},
 {'DIRECTORY': '/opt/qosmon/qostool/oracle', 'VERSION': 'oracle-client-12.1.0.1.0', 'COMPONENT_NAME': 'oracle'}]
As you see this creates a list of dictionaries. Now when you're accessing the dictionaries, you know that some of them might not contain a 'VERSION' key. There are multiple ways of handling this. Either you try/except KeyError or get the value using dict.get().
Example:
>>> for r in result:
... print r.get('VERSION', "No version")
...
...
No version
oracle-client-12.1.0.1.0
result = [line.strip().split() for line in open(host_file)]

intersperse the lines of two different files

I have to do a simple task, but I don't know how to do it and I'm stuck. I need to intersperse the lines of two different files every 4 lines:
File 1:
1
2
3
4
5
6
7
8
9
10
11
12
FILE 2:
A
B
C
D
E
F
G
H
I
J
K
L
Desired result:
1
2
3
4
A
B
C
D
5
6
7
8
E
F
G
H
9
10
11
12
I
J
K
L
I'm looking for a sed, awk or python script, or any other bash command.
Thanks for your time!!
I tried to do it using specific Python libraries that recognize the 4-line records of each file, but it doesn't work, and now I'm trying to do it without those libraries, but I don't know how.
import sys
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

def main(forward, reverse):
    for F, R in zip( SeqIO.parse(forward, "fastq"), SeqIO.parse(reverse, "fastq") ):
        fastq_out_F = SeqRecord( F.seq, id = F.id, description = "" )
        fastq_out_F.letter_annotations["phred_quality"] = F.letter_annotations["phred_quality"]
        fastq_out_R = SeqRecord( R.seq, id = R.id, description = "" )
        fastq_out_R.letter_annotations["phred_quality"] = R.letter_annotations["phred_quality"]
        print fastq_out_F.format("fastq"),
        print fastq_out_R.format("fastq"),

if __name__ == '__main__':
    main(sys.argv[1], sys.argv[2])
This might work for you (using GNU sed):
sed -e 'n;n;n;R file2' -e 'R file2' -e 'R file2' -e 'R file2' file1
or using paste/bash:
paste -d' ' <(paste -sd' \n' file1) <(paste -sd' \n' file2) | tr ' ' '\n'
or:
parallel -N4 --xapply 'printf "%s\n%s\n" {1} {2}' :::: file1 :::: file2
It can be done in pure bash:
f1=""; f2=""
while test -z "$f1" -o -z "$f2"; do
{ read LINE && echo "$LINE" && \
read LINE && echo "$LINE" && \
read LINE && echo "$LINE" && \
read LINE && echo "$LINE"; } || f1=end;
{ read -u 3 LINE && echo "$LINE" && \
read -u 3 LINE && echo "$LINE" && \
read -u 3 LINE && echo "$LINE" && \
read -u 3 LINE && echo "$LINE"; } || f2=end;
done < f1 3< f2
The idea is to use a new file descriptor (3 in this case) and read from stdin and this file descriptor at the same time.
A mix of paste and sed can also be used if you do not have GNU sed:
paste -d '\n' f1 f2 | sed -e 'x;N;x;N;x;N;x;N;x;N;x;N;x;N;s/^\n//;H;s/.*//;x'
If you are not familiar with sed, there is a 2nd buffer called the hold space where you can save data. The x command exchanges the current buffer with the hold space, the N command appends one line to the current buffer, and the H command appends the current buffer to the hold space.
So the first x;N saves the current line (from f1, because of paste) in the hold space and reads the next line (from f2, because of paste); then each x;N;x;N reads a new line from f1 and from f2. The script finishes by removing the leading newline from the 4 lines of f2, appending the lines from f2 after the lines of f1, clearing the hold space for the next round, and printing the 8 lines.
Try this, changing the appropriate filename values for f1 and f2.
awk 'BEGIN{
    sectionSize=4; maxSectionCnt=sectionSize; maxSectionCnt++
    notEof1=notEof2=1
    f1="file1" ; f2="file2"
    while (notEof1 && notEof2) {
        if (notEof1) {
            for (i=1;i<maxSectionCnt;i++) {
                if ((getline < f1) > 0) { print "F1:" i ":" $0 } else { notEof1=0 }
            }
        }
        if (notEof2) {
            for (i=1;i<maxSectionCnt;i++) {
                if ((getline < f2) > 0) { print "F2:" i ":" $0 } else { notEof2=0 }
            }
        }
    }
}'
You can also remove the "F1:" i ":" record-header prefix; I added that to help debug the code.
As Pastafarianist rightly points out, you may need to modify this if you have expectations about what will happen if the files are not the same size, etc.
I hope this helps.
The code you posted looks extremely complicated. There is a rule of thumb with programming: there is always a simpler solution. In your case, way simpler.
First thing you should do is determine the limitations of the input. Are you going to process really big files, or are they only going to be a kilobyte or two in size? It matters.
Second thing: look at the tools you have. With Python, you've got file objects, lists, generators and so on. Try to combine these tools to produce the desired result.
In your particular case, there are some unclear points. What should the script do if the input files have different size? Or one of them is empty? Or the number of lines is not a factor of four? You should decide how to handle corner cases like these.
Take a look at the file object, xrange, list slicing and list comprehensions. If you prefer doing it the cool way, you can also take a look at the itertools module.
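For instance, here is a minimal sketch of the 4-line interleave using itertools (my own illustration, not from the answer; it assumes plain text files, and leftover lines in the longer file are silently dropped, one of the corner cases mentioned above):

from itertools import islice

def chunks(f, n=4):
    # Yield successive n-line chunks of the file until it is exhausted.
    while True:
        chunk = list(islice(f, n))
        if not chunk:
            return
        yield chunk

with open('file1') as f1, open('file2') as f2:
    for a, b in zip(chunks(f1), chunks(f2)):
        for line in a + b:
            print(line, end='')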
