I am new to Python and I have searched a few articles, but I could not find the correct syntax for reading a file and doing awk-style line processing in Python. I need your help solving this problem.
This is how my bash script for build and deploy looks. It reads a configuration file that looks like this:
backup /apps/backup
oracle /opt/qosmon/qostool/oracle oracle-client-12.1.0.1.0
and the section of the bash script that reads it looks like this:
while read line
do
    case "$line" in */package*) continue ;; esac
    host_file_array+=("$line")
done < ${HOST_FILE}

for ((i=0 ; i < ${#host_file_array[*]}; i++))
do
    # echo "${host_file_array[i]}"
    host_file_line="${host_file_array[i]}"
    if [[ "$host_file_line" != "#"* ]];
    then
        COMPONENT_NAME=$(echo $host_file_line | awk '{print $1;}' )
        DIRECTORY=$(echo $host_file_line | awk '{print $2;}' )
        VERSION=$(echo $host_file_line | awk '{print $3;}' )
        if [[ ("${COMPONENT_NAME}" == *"oracle"*) ]];
        then
            print_parameters "Status ${DIRECTORY}/${COMPONENT_NAME}"
            /bin/bash ${DIRECTORY}/${COMPONENT_NAME}/current/script/manage-oracle.sh ${FORMAT_STRING} start
        fi
etc .........
How can the same thing be done in Python? This is what I have prepared so far in Python:
f = open('%s' % host_file, "r")
array = []
line = f.readline()
index = 0
while line:
    line = line.strip("\n ' '")
    line = line.split()
    array.append([])
    for item in line:
        array[index].append(item)
    line = f.readline()
    index += 1
f.close()
I tried split() in Python, but since the config file does not have the same number of columns in every row, I get an index-out-of-range error. What is the best way to process it?
I think dictionaries might be a good fit here; you can generate them as follows:
>>> result = []
>>> keys = ["COMPONENT_NAME", "DIRECTORY", "VERSION"]
>>> with open(hosts_file) as f:
...     for line in f:
...         result.append(dict(zip(keys, line.strip().split())))
...
>>> result
[{'DIRECTORY': '/apps/backup', 'COMPONENT_NAME': 'backup'},
{'DIRECTORY': '/opt/qosmon/qostool/oracle', 'VERSION': 'oracle-client-12.1.0.1.0', 'COMPONENT_NAME': 'oracle'}]
As you can see, this creates a list of dictionaries. Now when you're accessing the dictionaries, you know that some of them might not contain a 'VERSION' key. There are multiple ways of handling this: either you try/except KeyError, or you get the value using dict.get().
Example:
>>> for r in result:
...     print r.get('VERSION', "No version")
...
...
No version
oracle-client-12.1.0.1.0
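If you then want to reproduce the rest of your bash loop, you can dispatch on COMPONENT_NAME. Here is a rough sketch, not a drop-in implementation: the manage-oracle.sh path is taken from your script, and format_string is only a placeholder for whatever your FORMAT_STRING variable holds:

import subprocess

format_string = ""  # placeholder for the value of FORMAT_STRING in the bash script

for entry in result:
    component = entry.get("COMPONENT_NAME", "")
    directory = entry.get("DIRECTORY", "")
    if component.startswith("#"):
        continue  # skip commented-out lines, as the bash version does
    if "oracle" in component:
        script = "{0}/{1}/current/script/manage-oracle.sh".format(directory, component)
        subprocess.call(["/bin/bash", script, format_string, "start"])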
Alternatively, if you only need a list of the whitespace-split fields for each line:
result = [line.strip().split() for line in open(host_file)]
I'm a newbie with Perl and Python.
I need to do the file handling in Python (a dataframe), and that file then needs to be processed in Perl.
At first I tried to use a Python subprocess, but it was not working (broken pipe).
I need to print multiple lines from Python, and the Perl code needs to read and process them.
I just used | on the command line and it worked, but Perl skips the odd-numbered lines and only reads the even-numbered lines.
How can I fix it?
My Python code is:
import pandas as pd

data = pd.read_csv('./data.txt', sep='\t', header=None)
datalist = list(data[0] + '_' + data[1])
for line in datalist:
    print(line)
And my Perl code is:
use strict;

my %new_list = ();
while (<STDIN>){
    my $line = <STDIN>;
    # print STDERR $line;
    # chomp $line;
    my ($name, $title) = split('_', <STDIN>);
    $new_list{$title} = $name;
    print STDERR $name, "\t", $title, "\n";
}
print STDERR scalar(keys %new_list);
My Python code outputs 657 lines, but the Perl code only outputs 329.
How can I fix it?
The expression <STDIN> reads a line from standard input each time it is evaluated, so your Perl code consumes more than one line per iteration of the while loop: one in the loop condition, another into $line, and another inside the call to split.
It is sufficient to say
while (<STDIN>) {
my $line = $_;
...
or just
while (my $line = <STDIN>) {
...
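For what it's worth, on the Python side one way to feed the Perl script without a shell pipe is to hand it all of the lines on stdin through subprocess.communicate(); a minimal Python 3 sketch (process.pl is a placeholder name for your Perl script):

import subprocess
import pandas as pd

data = pd.read_csv('./data.txt', sep='\t', header=None)
datalist = list(data[0] + '_' + data[1])

# Start the Perl script and write every line to its stdin in one call.
proc = subprocess.Popen(['perl', 'process.pl'], stdin=subprocess.PIPE)
proc.communicate(input=('\n'.join(datalist) + '\n').encode())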
I would like to remove entries in a FASTA file where all nucleotides are N, but not entries which contain ACGT as well as N nucleotides.
Here is an example of the input file content:
#>seq_1
TGCTAGCTAGCTGATCGTGTCGATCG
CACCACANNNNNCACGTGTCG
#>seq2
NNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNN
#>seq3
catgcatcgacgatgctgacgatc
#>seq4
cacacaccNNNNttgtgca
#...
I am hoping for the output file content to be:
#>seq_1
TGCTAGCTAGCTGATCGTGTCGATCG
CACCACANNNNNCACGTGTCG
#>seq3
catgcatcgacgatgctgacgatc
#>seq4
cacacaccNNNNttgtgca
#...
Any suggestions for doing this with awk, Perl, Python, or something else?
Thank you!
FDS
With GNU awk
awk -v RS='#>seq[[:digit:]]+' '!/^[N\n]+$/{printf "%s",term""$0}; {term=RT}' file
#>seq_1
TGCTAGCTAGCTGATCGTGTCGATCG
CACCACANNNNNCACGTGTCG
#>seq3
catgcatcgacgatgctgacgatc
#>seq4
cacacaccNNNNttgtgca
With Python, using the BioPython module:
from Bio import SeqIO

INPUT = "bio_input.fas"
OUTPUT = "bio_output.fas"

def main():
    records = SeqIO.parse(INPUT, 'fasta')
    filtered = (rec for rec in records if any(ch != 'N' for ch in rec.seq))
    SeqIO.write(filtered, OUTPUT, 'fasta')

if __name__ == "__main__":
    main()
However, note that the FASTA spec says sequence ids are supposed to start with >, not #>!
Run against
>seq_1
TGCTAGCTAGCTGATCGTGTCGATCG
CACCACANNNNNCACGTGTCG
>seq2
NNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNN
>seq3
catgcatcgacgatgctgacgatc
>seq4
cacacaccNNNNttgtgca
this produces
>seq_1
TGCTAGCTAGCTGATCGTGTCGATCGCACCACANNNNNCACGTGTCG
>seq3
catgcatcgacgatgctgacgatc
>seq4
cacacaccNNNNttgtgca
(Note, the default linewrap length is 60 chars).
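If your real input does use the #> headers, one workaround (my own sketch, not part of the answer above; it reads the whole file into memory) is to strip the leading # before handing the data to SeqIO:

import io
from Bio import SeqIO

# Drop the leading '#' from '#>' headers so the file parses as plain FASTA.
with open("bio_input.fas") as fh:
    cleaned = io.StringIO("".join(
        line[1:] if line.startswith("#>") else line for line in fh))

records = SeqIO.parse(cleaned, "fasta")
filtered = (rec for rec in records if any(ch != "N" for ch in rec.seq))
SeqIO.write(filtered, "bio_output.fas", "fasta")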
So essentially blocks start with a #> marker, and you wish to remove blocks where no line contains anything other than N. One way in Python:
#! /usr/bin/env python
import fileinput, sys, re

block = []
nonN = re.compile('[^N\n]')

for line in fileinput.input():
    if line.startswith('#>'):
        if len(block) == 1 or any(map(nonN.search, block[1:])):
            sys.stdout.writelines(block)
        block = [line]
    else:
        block.append(line)

if len(block) == 1 or any(map(nonN.search, block[1:])):
    sys.stdout.writelines(block)
in python using regex:
#!/usr/bin/env python
import re
ff = open('test', 'r')
data = ff.read()
ff.close()
m = re.compile(r'(#>seq\d+[N\n]+)$', re.M)
f = re.sub(m, '', data)
fo = open('out', 'w')
fo.write(f)
fo.close()
and you will get in your out file:
#>seq_1
TGCTAGCTAGCTGATCGTGTCGATCG
CACCACANNNNNCACGTGTCG
#>seq3
catgcatcgacgatgctgacgatc
#>seq4
cacacaccNNNNttgtgca
#...
hope this helps.
With the shell command egrep (grep + regex):
egrep -B 1 "^[^NNNN && ^#seq]" your.fa > convert.fa
# -B 1 means print the matching row and the previous row
# "^[^NNNN && ^#seq]" is a regex pattern meant to match rows that do not begin
# with NNNN or #seq
so it only matches rows that begin with an ordinary A/T/G/C sequence, plus the previous row, which is the FASTA header.
I have been using this script for years at work to summarize log files.
#!/usr/bin/perl
$logf = '/var/log/messages.log';
@logf = ( `cat $logf` );
foreach $line ( @logf ) {
    $line =~ s/\d+/#/g;
    $count{$line}++;
}
@alpha = sort @logf;
$prev = 'null';
@uniq = grep($_ ne $prev && ($prev = $_), @alpha);
foreach $line (@uniq) {
    print "$count{$line}: ";
    print "$line";
}
I have wanted to rewrite it in Python but I do not fully understand certain portions of it, such as:
@alpha = sort @logf;
$prev = 'null';
@uniq = grep($_ ne $prev && ($prev = $_), @alpha);
Does anyone know of a Python module that would negate the need to rewrite this? I haven't had any luck finding something similar. Thanks in advance!
As the name of the var implies,
@alpha = sort @logf;
$prev = 'null';
@uniq = grep($_ ne $prev && ($prev = $_), @alpha);
is finding unique elements (i.e. removing duplicate lines), ignoring numbers in the line since they were previously replaced with #. Those three lines could have been written
@uniq = sort keys(%count);
or maybe even
@uniq = keys(%count);
Another way of writing the program in Perl:
my $log_qfn = '/var/log/messages.log';
open(my $fh, '<', $log_qfn)
    or die("Can't open $log_qfn: $!\n");

my %counts;
while (<$fh>) {
    s/\d+/#/g;
    ++$counts{$_};
}

#for (sort keys(%counts)) {
for (keys(%counts)) {
    print "$counts{$_}: $_";
}
This should be easier to translate into Python.
@alpha = sort @logf;
$prev = 'null';
@uniq = grep($_ ne $prev && ($prev = $_), @alpha);
would be equivalent to
uniq = sorted(set(logf))
if logf were a list of lines.
However, since you are counting the frequency of lines,
you could use a collections.Counter to both count the lines and collect the unique lines (as keys) (thus removing the need to compute uniq at all):
count = collections.Counter()
for line in f:
    count[line] += 1
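Counter can also consume an iterable directly, so the counting (including the digit masking from the original script) can be collapsed into a single expression; a minimal sketch:

import re
import collections

with open('/var/log/messages.log') as f:
    count = collections.Counter(re.sub(r'\d+', '#', line) for line in f)

A complete, loop-based version looks like this: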
import sys
import re
import collections

logf = '/var/log/messages.log'
count = collections.Counter()
write = sys.stdout.write

with open(logf, 'r') as f:
    for line in f:
        line = re.sub(r'\d+', '#', line)
        count[line] += 1

for line in sorted(count):
    write("{c}: {l}".format(c=count[line], l=line))
I have to say I have often run into people trying to do things in Python that can be done in one line in the shell or bash.
I don't care about downvotes; there is no reason to write 20 lines of Python if it can be done in the shell:
sort < my_file.txt | uniq > uniq_my_file.txt
(use uniq -c if you also want the counts, and pipe through sed 's/[0-9]\+/#/g' first to mask the digits the way the Perl script does).
I have two text files in the following format:
The first is this on every line:
Key1:Value1
The second is this:
Key2:Value2
Is there a way I can replace Value1 in file1 by the Value2 obtained from using it as a key in file2?
For example:
file1:
foo:hello
bar:world
file2:
hello:adam
bar:eve
I would like to get:
foo:adam
bar:eve
There isn't necessarily a match between the two files on every line. Can this be done neatly in awk or something, or should I do it naively in Python?
Create two dictionaries, one for each file. For example:
file1 = {}
for line in open('file1', 'r'):
    k, v = line.strip().split(':')
    file1[k] = v
Or if you prefer a one-liner:
file1 = dict(l.strip().split(':') for l in open('file1', 'r'))
Then (after building a file2 dictionary the same way) you could do something like:
result = {}
for key, value in file1.iteritems():
    if value in file2:
        result[key] = file2[value]
Another way is to generate the key-value pairs in reverse for file1 and use sets. For example, if your file1 contains foo:bar, your file1 dict becomes {'bar': 'foo'}.
for key in set(file1) & set(file2):
    result[file1[key]] = file2[key]
Basically, you can quickly find common elements using set intersection, so those elements are guaranteed to be in file2 and you don't waste time checking for their existence.
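A short worked illustration of the set-intersection approach, using the data from the question (my own example, not part of the answer above):

file1 = {'hello': 'foo', 'world': 'bar'}   # file1 with key and value swapped
file2 = {'hello': 'adam', 'bar': 'eve'}

result = {}
for key in set(file1) & set(file2):
    result[file1[key]] = file2[key]

print(result)  # {'foo': 'adam'}; bar:world has no match here and must be carried over separately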
Edit: As pointed out by @pepr, you can use collections.OrderedDict for the first method if order is important to you.
The awk solution:
awk '
BEGIN {FS = OFS = ":"}
NR==FNR {val[$1] = $2; next}
$1 in val {$2 = val[$1]}
{print}
' file2 file1
join -t : -1 2 -2 1 -o 0 2.2 -a 2 <(sort -k 2 -t : file1) <(sort file2)
The input files must be sorted on the field they are joined on.
The options:
-t : - Use a colon as the delimiter
-1 2 - Join on field 2 of file 1
-2 1 - Join on field 1 of file 2
-o 0 2.2 - Output the join field followed by field 2 from file2 (separated by the delimiter character)
-a 2 - Output unjoined lines from file2
Once you have:
file1 = {'foo':'hello', 'bar':'world'}
file2 = {'hello':'adam', 'bar':'eve'}
You can do an ugly one liner:
print dict([(i,file2[i]) if i in file2 else (i,file2[j]) if j in file2 else (i,j) for i,j in file1.items()])
{'foo': 'adam', 'bar': 'eve'}
As in your example you are using both the keys and values of file1 as keys in file2.
This might work for you (probably GNU sed):
sed 's#\([^:]*\):\(.*\)#/\\(^\1:\\|:\1$\\)/s/:.*/:\2/#' file2 | sed -f - file1
If you do not consider using basic Unix/Linux commands cheating, then here is a solution using paste and awk.
paste file1.txt file2.txt | awk -F ":" '{ print $1":"$3 }'
TXR:
#(next "file2")
#(collect)
#key:#value1
# (cases)
# (next "file1")
# (skip)
#value2:#key
# (or)
# (bind value2 key)
# (end)
# (output)
#value2:#value1
# (end)
#(end)
Run:
$ txr subst.txr
foo:adam
bar:eve
I have to do a simple task, but I don't know how to do it and I'm stuck. I need to interleave the lines of two different files in blocks of 4 lines:
File 1:
1
2
3
4
5
6
7
8
9
10
11
12
FILE 2:
A
B
C
D
E
F
G
H
I
J
K
L
Desired result:
1
2
3
4
A
B
C
D
5
6
7
8
E
F
G
H
9
10
11
12
I
J
K
L
I'm looking for a sed, awk or python script, or any other bash command.
Thanks for your time!!
I tried to do it using specific Python libraries that recognize the 4-line blocks of each file, but it doesn't work, and now I'm trying to do it without those libraries, but I don't know how.
import sys
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

def main(forward, reverse):
    for F, R in zip(SeqIO.parse(forward, "fastq"), SeqIO.parse(reverse, "fastq")):
        fastq_out_F = SeqRecord(F.seq, id=F.id, description="")
        fastq_out_F.letter_annotations["phred_quality"] = F.letter_annotations["phred_quality"]
        fastq_out_R = SeqRecord(R.seq, id=R.id, description="")
        fastq_out_R.letter_annotations["phred_quality"] = R.letter_annotations["phred_quality"]
        print fastq_out_F.format("fastq"),
        print fastq_out_R.format("fastq"),

if __name__ == '__main__':
    main(sys.argv[1], sys.argv[2])
This might work for you (using GNU sed):
sed -e 'n;n;n;R file2' -e 'R file2' -e 'R file2' -e 'R file2' file1
or using paste/bash:
paste -d' ' <(paste -sd' \n' file1) <(paste -sd' \n' file2) | tr ' ' '\n'
or:
parallel -N4 --xapply 'printf "%s\n%s\n" {1} {2}' :::: file1 :::: file2
It can be done in pure bash:
f1=""; f2=""
while test -z "$f1" -o -z "$f2"; do
{ read LINE && echo "$LINE" && \
read LINE && echo "$LINE" && \
read LINE && echo "$LINE" && \
read LINE && echo "$LINE"; } || f1=end;
{ read -u 3 LINE && echo "$LINE" && \
read -u 3 LINE && echo "$LINE" && \
read -u 3 LINE && echo "$LINE" && \
read -u 3 LINE && echo "$LINE"; } || f2=end;
done < f1 3< f2
The idea is to use a new file descriptor (3 in this case) and read from stdin and this file descriptor at the same time.
A mix of paste and sed can also be used if you do not have GNU sed:
paste -d '\n' f1 f2 | sed -e 'x;N;x;N;x;N;x;N;x;N;x;N;x;N;s/^\n//;H;s/.*//;x'
If you are not familiar with sed, there is a 2nd buffer called the hold space where you can save data. The x command exchanges the current buffer with the hold space, the N command appends one line to the current buffer, and the H command appends the current buffer to the hold space.
So the first x;N saves the current line (from f1, because of paste) in the hold space and reads the next line (from f2, because of paste); then each x;N;x;N reads one more line from f1 and one from f2, and the script finishes by removing the leading newline from the 4 lines of f2, appending the lines from f2 after the lines of f1, cleaning the hold space for the next run, and printing the 8 lines.
Try this, changing the appropriate filename values for f1 and f2.
awk 'BEGIN{
    sectionSize=4; maxSectionCnt=sectionSize; maxSectionCnt++
    notEof1=notEof2=1
    f1="file1" ; f2="file2"
    while (notEof1 && notEof2) {
        if (notEof1) {
            for (i=1;i<maxSectionCnt;i++) {
                if (getline < f1 >0 ) { print "F1:" i":" $0 } else {notEof1=0}
            }
        }
        if (notEof2) {
            for (i=1;i<maxSectionCnt;i++) {
                if (getline < f2 >0 ) { print "F2:" i":" $0 } else {notEof2=0}
            }
        }
    }
}'
You can also remove the "F1:" i ":" etc. record header; I added that to help debug the code.
As Pastafarianist rightly points out, you may need to modify this if you have expectations about what will happen if the files are not the same size, etc.
I hope this helps.
The code you posted looks extremely complicated. There is a rule of thumb with programming: there is always a simpler solution. In your case, way simpler.
First thing you should do is determine the limitations of the input. Are you going to process really big files? Or are they going to be only of one-or-two-kilobyte size? It matters.
Second thing: look at the tools you have. With Python, you've got file objects, lists, generators and so on. Try to combine these tools to produce the desired result.
In your particular case, there are some unclear points. What should the script do if the input files have different size? Or one of them is empty? Or the number of lines is not a factor of four? You should decide how to handle corner cases like these.
Take a look at the file object, xrange, list slicing and list comprehensions. If you prefer doing it the cool way, you can also take a look at the itertools module.
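As a starting point, here is a minimal itertools-based sketch of my own (assuming both files are well formed and the same length) that interleaves the two files in blocks of four lines:

import sys
from itertools import islice

def blocks(handle, size=4):
    """Yield successive lists of `size` lines from an open file."""
    while True:
        chunk = list(islice(handle, size))
        if not chunk:
            return
        yield chunk

with open(sys.argv[1]) as f1, open(sys.argv[2]) as f2:
    for b1, b2 in zip(blocks(f1), blocks(f2)):
        sys.stdout.writelines(b1)
        sys.stdout.writelines(b2)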