Python print -> Perl STDIN line skip problem

I'm new to both Perl and Python.
I need to do some file handling in Python (a dataframe), and the result then needs to be processed in Perl.
At first I tried Python's subprocess module, but it did not work (broken pipe).
I need to print multiple lines from Python, and the Perl code needs to read and process them.
When I just use | on the command line it works, except that Perl skips the odd-numbered lines and reads only the even-numbered ones.
How can I fix it?
My Python code is:
import pandas as pd

data = pd.read_csv('./data.txt', sep='\t', header=None)
datalist = list(data[0] + '_' + data[1])
for line in datalist:
    print(line)
and my Perl code is:
use strict;
my %new_list = ();
while (<STDIN>) {
    my $line = <STDIN>;
    # print STDERR $line;
    # chomp $line;
    my ($name, $title) = split('_', <STDIN>);
    $new_list{$title} = $name;
    print STDERR $name, "\t", $title, "\n";
}
print STDERR scalar(keys %new_list);
My Python code outputs 657 lines, but Perl prints only 329.
How can I fix it?

The expression <STDIN> reads a line from standard input every time it is evaluated, so your Perl code consumes more than one line per iteration of the while loop: one in the loop condition, another in the assignment to $line, and (in the code as posted) a third in the call to split.
It is sufficient to say
while (<STDIN>) {
    my $line = $_;
    ...
or just
while (my $line = <STDIN>) {
    ...
and then pass $line to split instead of reading <STDIN> again.
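For the subprocess attempt mentioned in the question, here is a minimal sketch of piping the lines into the Perl script from Python; the script name process.pl is a stand-in, and writing everything through communicate() closes stdin cleanly, which is the usual way to avoid a broken pipe:
import subprocess

import pandas as pd

data = pd.read_csv('./data.txt', sep='\t', header=None)
datalist = list(data[0] + '_' + data[1])

# communicate() writes all input, closes the child's stdin, and waits,
# so the Perl side sees a normal end-of-file instead of a broken pipe.
proc = subprocess.Popen(['perl', 'process.pl'], stdin=subprocess.PIPE)
proc.communicate(('\n'.join(datalist) + '\n').encode())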


Convert Perl syntax to Python

I have a file with lines whose fields are separated by white space.
I have written the program below in Perl and it works.
Now I must rewrite it in Python, which is not my language, but I have more or less solved it.
I currently struggle with this expression in Perl, which I can't convert to Python:
$hash{$prefix}++;
I have found some solutions, but I'm not experienced enough with Python to adapt them. All the solutions look complicated to me compared to the Perl one.
These Stack Overflow questions seem relevant:
Python variables as keys to dict
Python: How to pass key to a dictionary from the variable of a function?
Perl
#!perl -w
use strict;
use warnings FATAL => 'all';

our $line = "";
our @line = "";
our $prefix = "";
our %hash;
our $key;

while ( $line = <STDIN> ) {
    # NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE
    next if $line =~ /^NAMESPACE/;
    # aleks-event-test redis-1-m06k0 1/1 Running 0 1d 172.26.0.138 The_Server_name
    @line = split ' ', $line;
    $line[1] =~ /(.*)-\d+-\w+$/;
    $prefix = $1;
    #print "$prefix $line[7]\n";
    print "$prefix $line[7]\n";
    $hash{$prefix}++;
}

foreach $key ( keys %hash ) {
    if ( $hash{$key} / 2 ) {
        print "$key : $hash{$key} mod 2 \n"
    }
    else {
        print "$key : $hash{$key} not mod 2 \n"
    }
}
Python
#!python
import sys
import re

myhash = {}
for line in sys.stdin:
    # These projects are ignored
    if re.match('^NAMESPACE|logging|default', line):
        continue
    linesplited = line.split()
    prefix = re.split('(.*)(-\d+)?-\w+$', linesplited[1])
    #print linesplited[1]
    print prefix[1]
    myhash[prefix[1]] += 1
Your problem is using this line:
myhash = {}
# ... code ...
myhash[prefix[1]] += 1
You are likely getting a KeyError. This is because you start off with an empty dictionary (or hash), and if you reference a key that doesn't exist yet, Python raises an exception.
A simple solution that will let your script work is to use a defaultdict, which will auto-initialize any key-value pair you attempt to access.
#!python
import sys
import re
from collections import defaultdict

# Since you're keeping counts, we'll initialize this so that the values
# of the dictionary are `int` and will default to 0
myhash = defaultdict(int)

for line in sys.stdin:
    # These projects are ignored
    if re.match('^NAMESPACE|logging|default', line):
        continue
    linesplited = line.split()
    prefix = re.split('(.*)(-\d+)?-\w+$', linesplited[1])
    #print linesplited[1]
    print prefix[1]
    myhash[prefix[1]] += 1
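For completeness: if you would rather keep a plain dict, the usual counting idiom equivalent to Perl's $hash{$prefix}++ uses dict.get with a default (a sketch with throwaway sample keys):
myhash = {}
for prefix in ['redis', 'redis', 'registry']:
    # get() returns 0 when the key is missing, so no KeyError is raised
    myhash[prefix] = myhash.get(prefix, 0) + 1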

Find part of a string in CSV and replace whole cell with new entry?

I've got a CSV file with a column I want to sift through. I want to use a pattern file to find all entries where the pattern occurs even in part of the column's value, and replace the whole cell value with that "pattern".
I made a list of keywords that I want to use as my "pattern" bank.
So, if a cell in this column (in this case only the second) has a "pattern" as part of its string, I want to replace the whole cell with that "pattern".
For example:
my target file:
id1,Taxidermy Equipment & Supplies,moreinfo1
id2,Taxis & Private Hire,moreinfo2
id3,Tax Services,moreinfo3
id4,Tools & Hardware,moreinfo4
id5,Tool Sharpening,moreinfo5
id6,Tool Shops,moreinfo6
id7,Video Conferencing,moreinfo7
id8,Video & DVD Shops,moreinfo8
id9,Woodworking Equipment & Supplies,moreinfo9
my "pattern" file:
Taxidermy Equipment & Supplies
Taxis
Tax Services
Tool
Video
Wood
output file:
id1,Taxidermy Equipment & Supplies,moreinfo1
id2,Taxis,moreinfo2
id3,Tax Services,moreinfo3
id4,Tool,moreinfo4
id5,Tool,moreinfo5
id6,Tool,moreinfo6
id7,Video,moreinfo7
id8,Video,moreinfo8
id9,Wood,moreinfo9
I came up with the usual "find and replace" sed:
sed -i 's/PATTERN/REPLACE/g' file.csv
but I want it to run on a specific column, so I came up with:
awk 'BEGIN{OFS=FS="|"}$2==PATTERN{$2=REPLACE}{print}' file.csv
but it doesn't work on partial matches ([Video]: "Video & DVD Shops" -> "Video"), and I can't work out how to make awk take a file as input for the "pattern" block.
Is there an awk script for this? Or do I have to write something (in Python with the built-in csv module, for example)?
In awk, using index. It only prints a record if a replacement is made, but it's easy to modify to print even when there is no match (for example, replace print $1,i,$3 } with $0=$1 OFS i OFS $3 } 1):
$ awk -F, -v OFS=, '
NR==FNR { a[$1]; next }        # store "patterns" to a arr
{
    for (i in a)               # go thru whole a for each record
        if (index($2, i))      # if "pattern" matches $2
            print $1, i, $3    # print with replacement
}
' pattern_file target_file
id1,Taxidermy Equipment & Supplies,moreinfo1
id2,Taxis,moreinfo2
id3,Tax Services,moreinfo3
id4,Tool,moreinfo4
id5,Tool,moreinfo5
id6,Tool,moreinfo6
id7,Video,moreinfo7
id8,Video,moreinfo8
id9,Wood,moreinfo9
Perl solution, using Text::CSV_XS:
#!/usr/bin/perl
use warnings;
use strict;

use Text::CSV_XS qw{ csv };

my ($input_file, $pattern_file) = @ARGV;

open my $pfh, '<', $pattern_file or die $!;
chomp( my @patterns = <$pfh> );

my $aoa = csv(in => $input_file);
for my $line (@$aoa) {
    for my $pattern (@patterns) {
        if (-1 != index $line->[1], $pattern) {
            $line->[1] = $pattern;
            last
        }
    }
}
csv(in => $aoa, quote_space => 0, eol => "\n", out => \*STDOUT);
Here's a (mostly) awk solution:
#!/bin/bash
patterns_regex=`cat patterns_file | tr '\n' '|'`
cat target_file | awk -F"," -v patterns="$patterns_regex" '
BEGIN {
    OFS = ",";
    split(patterns, patterns_split, "|");
}
{
    for (pattern_num in patterns_split) {
        pattern = patterns_split[pattern_num];
        if (pattern != "" && $2 ~ pattern) {
            print $1, pattern, $3
        }
    }
}'
When you want to solve this with sed, you will need several steps.
For each pattern you will need a command like
sed 's/^\([^,]*\),\(.*Tool.*\),/\1,Tool,/' inputfile
You will need each pattern twice; you can translate the pattern file with
sed 's/.*/"&" "&"/' patternfile
# Change the / into #, that's easier for the final command
sed 's#.*#"&" "&"#' patternfile
When you instruct sed to read a command file, you do not need to start each line with sed. The command file can be generated with
sed 's#.*#s/^\\([^,]*\\),\\(.*&.*\\),/\\1,&,/#;s#&#\\&#g' patternfile
You can store this in a file and use that file, but with process substitution you can do things like
cat <(echo "Now this line from echo is handled as a file")
Nice. Let's test the solution:
sed -f <(sed 's#.*#s/^\\([^,]*\\),\\(.*&.*\\),/\\1,&,/#' patternfile) inputfile
Almost there! Only the first output line is strange. What's happening?
The first pattern contains a &, which has a special meaning in a sed replacement: it inserts the matched text.
We can patch our command by escaping the & with a backslash:
sed -f <(sed 's#.*#s/^\\([^,]*\\),\\(.*&.*\\),/\\1,&,/#;s#&#\\&#g' patternfile) inputfile
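Since the question also asks about Python's built-in csv module, here is a minimal Python 3 sketch of the same substring-replace idea; the file names target_file, pattern_file, and output_file are stand-ins following the examples above:
import csv

# one pattern per line, as in the "pattern" file above
with open('pattern_file') as pf:
    patterns = [line.rstrip('\n') for line in pf if line.strip()]

with open('target_file', newline='') as inf, \
     open('output_file', 'w', newline='') as outf:
    writer = csv.writer(outf)
    for row in csv.reader(inf):
        for pattern in patterns:
            # substring test on the second column, like awk's index() above
            if pattern in row[1]:
                row[1] = pattern
                break
        writer.writerow(row)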

Convert csv file to txt file

I'm using Perl to convert a comma-separated file to a tab-separated file with this command:
perl -e ' $sep=","; while(<>) { s/\Q$sep\E/\t/g; print $_; } warn "Changed $sep to tab on $. lines\n" ' csvfile.csv > tabfile.tab
However, my file has additional commas, in specific columns, that I do not want treated as separators. Here's an example of my file:
ADNP, "descript1, descript2", 1
PTB, "descriptA, descriptB", 5
I only want to convert the commas outside of the quotation marks to tabs, like so:
ADNP descript1, descript2 1
PTB descriptA, descriptB 5
Is there any way to do this with Perl, Python, or bash?
Trivial in Perl, using Text::CSV:
#!/usr/bin/env perl
use strict;
use warnings;

use Text::CSV;

# configure our read format using the default separator of ","
my $input_csv = Text::CSV->new( { binary => 1 } );
# configure our output format with a tab as separator
my $output_csv = Text::CSV->new( { binary => 1, sep_char => "\t", eol => "\n" } );

# open input file
open my $input_fh, '<', "sample.csv" or die $!;

# iterate input file - reading in 'comma separated',
# printing out (to STDOUT - can use a filehandle) tab separated
while ( my $row = $input_csv->getline($input_fh) ) {
    $output_csv->print( \*STDOUT, $row );
}
In Python:
import csv

with open('input', 'rb') as inf:
    reader = csv.reader(inf)
    with open('output', 'wb') as out:
        writer = csv.writer(out, delimiter='\t')
        writer.writerows(reader)
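Note that this is Python 2 style ('rb'/'wb'); on Python 3 the csv module expects text-mode files opened with newline='', so a rough equivalent would be:
import csv

with open('input', newline='') as inf, \
     open('output', 'w', newline='') as out:
    writer = csv.writer(out, delimiter='\t')
    writer.writerows(csv.reader(inf))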
You need regular expressions to help you. In Python it would simply be:
>>> re.split(r'(?!\B"[^"]*),(?![^"]*"\B)', 'ADNP, "descript1, descript2", 1')
['ADNP', ' "descript1, descript2"', ' 1']
Building off rll's regex answer, you can turn it into a Perl one-liner like you're currently doing:
perl -ne 'BEGIN{$,="\t";}@a=split(/(?!\B"[^"]*),(?![^"]*"\B)/);print @a' csvfile.csv > tabfile.tab
This'll work:
perl -e '$sep=","; while(<STDIN>) { #data = split(/(\Q$sep\E?\s*"[^"]+"\s*\Q$sep\E?)/); foreach(#data){if(/"/){s/^\Q$sep\E\s*"//;s/"\s*\Q$sep\E$//;}else{s/\Q$sep\E/\t/g;}}print(join("\t",#data));} warn "Changed $sep to tab on $. lines\n"' < csvfile.csv > tabfile.tab
Putting parens in the pattern passed to split returns the captured separators along with the split elements, which effectively separates the strings containing quotes into their own list elements that can be treated differently when quotes are detected. You just strip the commas and quotes off the quoted strings and substitute tabs in the other elements, then join all the elements with tabs (so the quoted strings get joined with tabs to the other, already-tabbed strings).
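The same capture-the-separator trick carries over to Python: re.split likewise returns the text captured by groups in the pattern. A rough sketch of the idea (not a full CSV parser, and the pattern here is a simplified stand-in for the Perl one):
import re

line = 'ADNP, "descript1, descript2", 1'
# The capturing group makes re.split keep the quoted separators
parts = re.split(r'(,?\s*"[^"]+"\s*,?)', line)
out = []
for part in parts:
    if '"' in part:
        # strip the surrounding commas and quotes from quoted fields
        out.append(re.sub(r'^,?\s*"|"\s*,?$', '', part))
    elif part:
        out.append(part.replace(',', '\t'))
print('\t'.join(out))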
The Text::CSV module is what you're looking for. There are a lot of considerations when parsing CSV files, and you really don't want to handle all of them yourself.

Python line by line data processing

I am new to Python, and I have searched a few articles but cannot find the correct syntax for reading a file and doing awk-like line processing in Python. I need your help solving this problem.
This is how my bash script for build and deploy looks. I read a configuration file in bash that looks like this:
backup /apps/backup
oracle /opt/qosmon/qostool/oracle oracle-client-12.1.0.1.0
and the bash section that reads it looks like this:
while read line
do
    case "$line" in */package*) continue ;; esac
    host_file_array+=("$line")
done < ${HOST_FILE}

for ((i=0 ; i < ${#host_file_array[*]}; i++))
do
    # echo "${host_file_array[i]}"
    host_file_line="${host_file_array[i]}"
    if [[ "$host_file_line" != "#"* ]];
    then
        COMPONENT_NAME=$(echo $host_file_line | awk '{print $1;}' )
        DIRECTORY=$(echo $host_file_line | awk '{print $2;}' )
        VERSION=$(echo $host_file_line | awk '{print $3;}' )
        if [[ ("${COMPONENT_NAME}" == *"oracle"*) ]];
        then
            print_parameters "Status ${DIRECTORY}/${COMPONENT_NAME}"
            /bin/bash ${DIRECTORY}/${COMPONENT_NAME}/current/script/manage-oracle.sh ${FORMAT_STRING} start
        fi
etc .........
How can the same be converted to Python? This is what I have prepared so far in Python:
f = open('%s' % host_file, "r")
array = []
line = f.readline()
index = 0
while line:
    line = line.strip("\n ' '")
    line = line.split()
    array.append([])
    for item in line:
        array[index].append(item)
    line = f.readline()
    index += 1
f.close()
I tried split in Python, but since the config file does not have the same number of columns in every row, I get an index-out-of-range error. What is the best way to process it?
I think dictionaries might be a good fit here; you can generate them as follows:
>>> result = []
>>> keys = ["COMPONENT_NAME", "DIRECTORY", "VERSION"]
>>> with open(hosts_file) as f:
...     for line in f:
...         result.append(dict(zip(keys, line.strip().split())))
...
>>> result
[{'DIRECTORY': '/apps/backup', 'COMPONENT_NAME': 'backup'},
 {'DIRECTORY': '/opt/qosmon/qostool/oracle', 'VERSION': 'oracle-client-12.1.0.1.0', 'COMPONENT_NAME': 'oracle'}]
As you see, this creates a list of dictionaries. Now, when you access the dictionaries, you know that some of them might not contain a 'VERSION' key. There are multiple ways of handling this: either try/except the KeyError, or get the value using dict.get().
Example:
>>> for r in result:
...     print r.get('VERSION', "No version")
...
No version
oracle-client-12.1.0.1.0
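For completeness, the try/except variant mentioned above would look something like this sketch:
for r in result:
    try:
        version = r['VERSION']
    except KeyError:
        # raised when the row had no third column
        version = "No version"
    print(version)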
A more compact alternative, if you just want a list of the whitespace-split fields per line:
result = [line.strip().split() for line in open(host_file)]
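To round out the conversion, here is a sketch of the dispatch part of the original bash loop on top of the parsed dictionaries; print_parameters and FORMAT_STRING from the question are omitted, and the manage-oracle.sh path is taken from the bash version:
import subprocess

for entry in result:
    name = entry.get("COMPONENT_NAME", "")
    directory = entry.get("DIRECTORY", "")
    if not name or name.startswith("#"):
        continue  # skip blank lines and comments, as the bash version does
    if "oracle" in name:
        # equivalent of the /bin/bash ${DIRECTORY}/${COMPONENT_NAME}/... call
        script = "%s/%s/current/script/manage-oracle.sh" % (directory, name)
        subprocess.call(["/bin/bash", script, "start"])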

Summarizing log file to unique entries only

I have been using this script for years at work to summarize log files.
#!/usr/bin/perl
$logf = '/var/log/messages.log';
@logf = ( `cat $logf` );
foreach $line ( @logf ) {
    $line =~ s/\d+/#/g;
    $count{$line}++;
}
@alpha = sort @logf;
$prev = 'null';
@uniq = grep($_ ne $prev && ($prev = $_), @alpha);
foreach $line (@uniq) {
    print "$count{$line}: ";
    print "$line";
}
I have wanted to rewrite it in Python but I do not fully understand certain portions of it, such as:
@alpha = sort @logf;
$prev = 'null';
@uniq = grep($_ ne $prev && ($prev = $_), @alpha);
Does anyone know of a Python module that would remove the need to rewrite this? I haven't had any luck finding something similar. Thanks in advance!
As the name of the variable implies,
@alpha = sort @logf;
$prev = 'null';
@uniq = grep($_ ne $prev && ($prev = $_), @alpha);
is finding unique elements (i.e. removing duplicate lines), ignoring numbers in the lines since they were previously replaced with #. Those three lines could have been written
@uniq = sort keys(%count);
or maybe even
@uniq = keys(%count);
Another way of writing the program in Perl:
my $log_qfn = '/var/log/messages.log';
open(my $fh, '<', $log_qfn)
    or die("Can't open $log_qfn: $!\n");

my %counts;
while (<$fh>) {
    s/\d+/#/g;
    ++$counts{$_};
}

#for (sort keys(%counts)) {
for (keys(%counts)) {
    print "$counts{$_}: $_";
}
This should be easier to translate into Python.
@alpha = sort @logf;
$prev = 'null';
@uniq = grep($_ ne $prev && ($prev = $_), @alpha);
would be equivalent to
uniq = sorted(set(logf))
if logf were a list of lines.
However, since you are counting the frequency of lines,
you could use a collections.Counter to both count the lines and collect the unique lines (as keys), removing the need to compute uniq at all:
count = collections.Counter()
for line in f:
    count[line] += 1
Putting it all together:
import sys
import re
import collections

logf = '/var/log/messages.log'

count = collections.Counter()
write = sys.stdout.write

with open(logf, 'r') as f:
    for line in f:
        line = re.sub(r'\d+', '#', line)
        count[line] += 1

for line in sorted(count):
    write("{c}: {l}".format(c=count[line], l=line))
I have to say I have often encountered people trying to do things in Python that can be done in one line in the shell or bash.
I don't care about downvotes; people should know there is no reason to write 20 lines of Python when it can be done in one line of shell:
sort < my_file.txt | uniq > uniq_my_file.txt
