I have real-time packets arriving as below, stored in a file log.log, and I am reading log.log with tail -f and parsing it. The lines have no fixed values: the IPs and the values in the data:: blocks are random, and each data:: field is a column value. For example, in log.log:
Other types of lines...
[TCP]: incomeing data: 91 bytes, data=connect data::10.109.0.200data::10.109.0.86data::wandata::p4data::1400data::800data::end
[TCP]: incomeing data: 91 bytes, data=connect data::10.109.0.201data::10.109.8.86data::landata::p4data::1400data::700data::end
[TCP]: incomeing data: 91 bytes, data=connect data::10.109.0.200data::10.109.58.86data::3gdata::p4data::400data::800data::end
something.. else...
Now, how can I parse the lines so that everything else is ignored, and a line is only parsed when it matches this pattern:
connect data::ANYdata::ANYdata::ANYdata::ANYdata::ANYdata::ANYdata::end
Run:
$ tail -f /var/tmp/log.log | python -u /var/tmp/myparse.py
myparse.py:
import sys, time, os, subprocess
import re

def p(command):
    subprocess.Popen(command, shell=False, stdin=subprocess.PIPE, stdout=subprocess.PIPE)

while True:
    line = sys.stdin.readline()
    if line:
        if "command:start" in line:
            print "OK - working"
            p("/var/tmp/single_thread_process.sh")
        if "connect data::" in line:
            ..
        else:
            # ^(?:\+|00)(\d+)$ Parse the 0032, 32, +32
            #match = re.search(r'^(?:\+|00)(\d+)$', line)
            #if match:
            #    print "OK"
            ### NOT working ###
            match = re.search(r'^connect data::*data::*data::*data::*data::*data::*data::end$', line)
            if match:
                print "OK"
Try using:
match = re.search(r'connect data::[^:]+::[^:]+::[^:]+::[^:]+::[^:]+::[^:]+::end$', line)
The beginning-of-line anchor ^ is the first thing that's preventing the matches.
Also, * is not a wildcard in regex; it's a quantifier meaning "0 or more of the preceding token". You can use [^:]+ to mean "one or more characters that are not colons".
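To pull the individual fields out as well, capture groups can be added around each [^:]+ segment. A minimal sketch against one of the sample lines above (the idea of naming the six fields is purely illustrative; the original question does not say what each column means):

```python
import re

# One of the sample log lines from the question
line = ("[TCP]: incomeing data: 91 bytes, data=connect "
        "data::10.109.0.200data::10.109.0.86data::wan"
        "data::p4data::1400data::800data::end")

# [^:]+ matches one or more non-colon characters; backtracking at each
# literal "data::" separator trims the trailing "data" off every field.
pattern = re.compile(
    r'connect data::([^:]+)data::([^:]+)data::([^:]+)'
    r'data::([^:]+)data::([^:]+)data::([^:]+)data::end$'
)

m = pattern.search(line)
if m:
    fields = m.groups()
    print(fields)  # -> ('10.109.0.200', '10.109.0.86', 'wan', 'p4', '1400', '800')
```

Lines that do not contain the full connect data::...data::end shape simply fail to match and can be skipped.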
I am working on a project and am having issues with the following code that I have written in nano:
from Bio import SeqIO
import sys
import re

fasta_file = (sys.argv[1])

for myfile in SeqIO.parse(fasta_file, "fasta"):
    if len(myfile) > 250:
        gene_id = myfile.id
        list = re.match('H149xcV\_\w+\_\w+\_\w+', gene_id)
        print (">"+list.group(1))
This is the error I receive when I execute my command on command-line:
File "mpo.py", line 7, in <module>
gene_id = myfile.id
NameError: name 'myfile' is not defined
I have a fasta file with the format
>H149xcV_Fge342_r3_h2_d1 len=210 path=[0:0-206]
ACTATACATGAGGAGAACATAGAACAAAAATGGGACCATAGATATATAACAATAGAAGATATAGAGAACACAATAGACAACTTATTAGGAAAGAGGTGTGTCGTCATGGAGCTGATGTTCGAGGATACTTTGCATGGTCATTCTTGGATAATTTTGAGTGGGCTATGGGATACACCAAGAGGTTTGGCATTGTTTATGTTGATTATAAGAACGGGC
>H149xcV_ytR1oP_r3_h2_d1 len=306 path=[0:0-207]
ATTAGAGTCTGAGAGAGTCTTGATTTGTCGTCGTCGAGAAATATAGGAGATCTGATTAGAGGAGAGAGCGGCCTAGGCGATGCGCGATATAGCGCTATATAGGCCTAGAGGAGAGTCTCTCTCTTTTAGAAGAGATAATATATATATATATATGGCTCTCCGGCGGGGCCGCGCGAGAGCTCGATCGATCGATATTAGCTGTACGATGCTAGCTAGCTTATATTCGATCGATTATAGCTTAGATCTCTCTCTAAAGGTCGATATCGCTTATGCGCGCGTATATCG
I would like to reformat my file so that it gives me only the unique gene ids, and outputs only those ids whose sequence length is greater than 250 bp.
I would like my desired output to look like this:
>H149xcV_Fge342_r3_h2
>H149xcV_ytR1oP_r3_h2
>H149xcV_DPN78333_r3_h2
>H149xcV_AgV472_r3_h2
>H149xcV_DNP733_r3_h2
As suggested in the comments following your question, the parameter to match should be a string. The one thing I'll add is that Python has an r"" raw-string literal that is convenient for regular expressions. Your code becomes this:
from Bio import SeqIO
import sys
import re

fasta_file = (sys.argv[1])

for myfile in SeqIO.parse(fasta_file, "fasta"):
    if len(myfile) > 250:
        gene_id = myfile.id
        match = re.match(r"H149xcV_\w+_\w+_\w+", gene_id)  # renamed to avoid shadowing the built-in list
        print(">" + match.group(0))
The underscore _ is not a special regular expression character (as I recall) so it doesn't need to be escaped.
The match() function takes a regex and the string you are searching (so I added gene_id). Lastly, you want to output group(0). group(0) means the whole match. group(1) is from the first capturing paren (of which you have none) so stick with group(0).
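One caveat worth noting: \w also matches underscores, so the greedy \w+ above will swallow the trailing _d1 suffix and group(0) becomes the whole id. If the goal is the truncated ids shown in the desired output, [^_]+ segments work better. A sketch under that assumption, with hypothetical ids standing in for the myfile.id values, and a set to keep only unique ids as the question asks:

```python
import re

# Hypothetical ids standing in for myfile.id values from the fasta file
gene_ids = [
    "H149xcV_Fge342_r3_h2_d1",
    "H149xcV_ytR1oP_r3_h2_d1",
    "H149xcV_Fge342_r3_h2_d2",  # same gene, different suffix
]

seen = set()
unique = []
for gene_id in gene_ids:
    # [^_]+ stops at each underscore, so the trailing _d1/_d2 is dropped
    m = re.match(r"H149xcV_[^_]+_[^_]+_[^_]+", gene_id)
    if m and m.group(0) not in seen:
        seen.add(m.group(0))
        unique.append(">" + m.group(0))

print("\n".join(unique))
```

The third id collapses onto the first, so only two lines are printed.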
Use infoseq utility from the EMBOSS package and pipe the output (a table with sequence ids and lengths) through a one-liner in a scripting language of your choice. Here, I am using Perl for simplicity:
cat input.fasta | \
infoseq -auto -only -name -length stdin | \
perl -lane 'my ($name, $length) = @F; if ( !$seen{$name}++ && $length > 250 ) { print ">$name"; }' > output.fasta
Install EMBOSS, for example, using conda:
conda create --name emboss emboss
I am trying to read the total number of files to be synced using rsync, reading the value with the following Python code, but I get the output below. What should I modify to get the desired output?
Output
b'10'
Desired Output
10
cmd
rsync -nvaz --delete --stats user@host:/www/ . | ./awk.sh
awk.sh
awk '\
BEGIN {count = 0}
/deleting/ {if ( length($1) > 0 ) ++count} \
/Number of regular files transferred: / {count += $6} \
END \
{
printf "%d",count
}'
Python
print(subprocess.check_output(cmd, shell=True))
Your awk script is just looking for a line that includes a particular string and then printing a count from it. Since your Python script needs to read stdout to get that value anyway, you may as well ditch the script and stick with Python. With the Popen object you can read stdout line by line:
import subprocess

# for test...
source_dir = 'test1/'
target_dir = 'test2/'

count = 0
proc = subprocess.Popen(['rsync', '-nvaz', '--delete', '--stats',
                         source_dir, target_dir],
                        stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
for line in proc.stdout:
    if line.startswith(b'Number of regular files transferred:'):
        count = int(line.split(b':')[1])
proc.wait()
print(count)
Decoded the output to UTF-8, then parsed it using a regex:
o = subprocess.check_output(cmd, shell=True)
g = re.search(r'count=(\d+)', o.decode("utf-8"), re.M|re.I)
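Put together, a small self-contained sketch of that approach (the raw bytes below are only a stand-in for what check_output would return from the rsync pipeline):

```python
import re

# Stand-in for the bytes subprocess.check_output() would return
raw = b"10"

text = raw.decode("utf-8")       # bytes -> str, so b'10' becomes '10'
m = re.search(r"(\d+)", text)    # pull out the number
count = int(m.group(1)) if m else 0
print(count)  # -> 10
```

The key step is the decode: the regex and int() conversion then operate on an ordinary string instead of a bytes object.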
I have to change the formatting of some text.
Filesystem Size Used Avail Use% Mounted on
/dev/sda3 20G 15G 4.2G 78% /
/dev/sda6 68G 39G 26G 61% /u01
/dev/sda2 30G 5.8G 22G 21% /opt
/dev/sda1 99M 19M 76M 20% /boot
tmpfs 48G 8.2G 39G 18% /dev/shm
/dev/mapper/vg3-KPGBKUP4
10T 7.6T 2.5T 76% /KPGBKUP4
I want the output as below:
20G 15G 4.2G 78%
68G 39G 26G 61%
30G 5.8G 22G 21%
99M 19M 76M 20%
48G 8.2G 39G 18%
10T 7.6T 2.5T 76%
This is what I have done, but it requires me to put the names of all partitions in my script. I have to run this script on more than 25 servers, each with different partition names that keep changing. Is there a better way?
This is my current script:
import os, sys, fileinput

for line in fileinput.input('report.txt', inplace=True):
    line = line.replace("/dev/sda3 ", "")
    line = line.replace("/dev/sda6 ", "")
    line = line.replace("/dev/sda2 ", "")
    line = line.replace("/dev/sda1 ", "")
    line = line.replace("tmpfs ", "")
    line = line.replace("/dev/mapper/vg3-KPGBKUP4", "")
    line = line.replace("/KPGBKUP4", "")
    line = line.lstrip()
    sys.stdout.write(line)
Did you try something like this?
import os, sys, fileinput

for line in fileinput.input('report.txt'):
    line = line.split()
    print ' '.join(line[1:])
This works for the input you provided:
for line in open('report.txt'):
    if line.startswith('Filesystem'):
        # Skip header line
        continue
    try:
        part, size, used, avail, used_percent, mount = line.split(None)
    except ValueError:
        # Unexpected line format, skip
        continue
    print ' '.join([size,
                    used.rjust(5),
                    avail.rjust(5),
                    used_percent.rjust(5)])
A key point here is to use line.split(None) in order to split on consecutive whitespace - see the docs on split() for details on this behavior.
Also see str.rjust() on how to format a string to be right justified.
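A quick illustration of both calls on one of the df lines from the question:

```python
# One data row from the df output above
line = "/dev/sda3             20G   15G  4.2G  78% /"

# split(None) collapses runs of whitespace into single field breaks
part, size, used, avail, used_percent, mount = line.split(None)

# rjust(5) pads each value on the left to a width of 5 characters
row = ' '.join([size, used.rjust(5), avail.rjust(5), used_percent.rjust(5)])
print(row)  # -> '20G   15G  4.2G   78%'
```

Because the padding is per-field rather than per-device-name, this works regardless of how long the partition names are.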
You can use regex:
import os, sys, fileinput
import re

for line in fileinput.input('report.txt', inplace=True):
    sys.stdout.write(re.sub(r'^.+?\s+', '', line.strip()) + '\n')
If there are occurrences in the file like in your first example, where the line is actually broken in two by a newline character (not just wrapped in the terminal), you should read the whole file into a variable and use flags=re.MULTILINE in the sub() function:

f = open('report.txt', 'r')
data = f.read()
print re.sub(r'^.+?\s+', '', data, flags=re.MULTILINE)
#!/usr/bin/env python
import re, subprocess, shlex

cmd = 'df -Th'
# execute command
output = subprocess.Popen(shlex.split(cmd), stdout=subprocess.PIPE)
# read stdout-output from command
output = output.communicate()[0]
# concat line-continuations
output = re.sub(r'\n\s+', ' ', output)
# split output to list of lines
output = output.splitlines()[1:]
# print fields right-aligned in columns of width 6
for line in output:
    print("{2:>6}{3:>6}{4:>6}{5:>6}".format(*tuple(line.split())))
I use subprocess to execute the df command instead of reading report.txt.
To format the output I used Python's format string syntax.
Here's a quick and dirty solution:
first = True
for line in open('report.txt'):
    # Skip the header row
    if first:
        first = False
        continue
    # For skipping improperly formatted lines, like the last one
    if len(line.split()) == 6:
        filesystem, size, used, avail, use, mount = line.split()
        print size, used, avail, use
        # or, equivalently:
        # print ' '.join(line.split()[1:5])
I would like to remove entries from a fasta file where all nucleotides are N, but keep entries that contain a mix of ACGT and N nucleotides.
from an example of input file content:
#>seq_1
TGCTAGCTAGCTGATCGTGTCGATCG
CACCACANNNNNCACGTGTCG
#>seq2
NNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNN
#>seq3
catgcatcgacgatgctgacgatc
#>seq4
cacacaccNNNNttgtgca
#...
Hoping the output file content to be:
#>seq_1
TGCTAGCTAGCTGATCGTGTCGATCG
CACCACANNNNNCACGTGTCG
#>seq3
catgcatcgacgatgctgacgatc
#>seq4
cacacaccNNNNttgtgca
#...
Any suggestions in doing this with awk, perl, python, other?
Thank you!
FDS
With GNU awk
awk -v RS='#>seq[[:digit:]]+' '!/^[N\n]+$/{printf "%s",term""$0}; {term=RT}' file
#>seq_1
TGCTAGCTAGCTGATCGTGTCGATCG
CACCACANNNNNCACGTGTCG
#>seq3
catgcatcgacgatgctgacgatc
#>seq4
cacacaccNNNNttgtgca
With Python, using the Biopython module:

from Bio import SeqIO

INPUT = "bio_input.fas"
OUTPUT = "bio_output.fas"

def main():
    records = SeqIO.parse(INPUT, 'fasta')
    filtered = (rec for rec in records if any(ch != 'N' for ch in rec.seq))
    SeqIO.write(filtered, OUTPUT, 'fasta')

if __name__ == "__main__":
    main()
However, note that the fasta spec says sequence ids are supposed to start with >, not #>!
Run against
>seq_1
TGCTAGCTAGCTGATCGTGTCGATCG
CACCACANNNNNCACGTGTCG
>seq2
NNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNN
>seq3
catgcatcgacgatgctgacgatc
>seq4
cacacaccNNNNttgtgca
this produces
>seq_1
TGCTAGCTAGCTGATCGTGTCGATCGCACCACANNNNNCACGTGTCG
>seq3
catgcatcgacgatgctgacgatc
>seq4
cacacaccNNNNttgtgca
(Note, the default linewrap length is 60 chars).
So essentially blocks start with a #> marker, and you wish to remove blocks where no line contains anything other than N. One way in Python:
#! /usr/bin/env python
import fileinput, sys, re

block = []
nonN = re.compile(r'[^N\n]')

for line in fileinput.input():
    if line.startswith('#>'):
        if len(block) == 1 or any(map(nonN.search, block[1:])):
            sys.stdout.writelines(block)
        block = [line]
    else:
        block.append(line)

if len(block) == 1 or any(map(nonN.search, block[1:])):
    sys.stdout.writelines(block)
In Python, using a regex:

#!/usr/bin/env python
import re

ff = open('test', 'r')
data = ff.read()
ff.close()

m = re.compile(r'(#>seq\d+[N\n]+)$', re.M)
f = re.sub(m, '', data)

fo = open('out', 'w')
fo.write(f)
fo.close()
and you will get in your out file:
#>seq_1
TGCTAGCTAGCTGATCGTGTCGATCG
CACCACANNNNNCACGTGTCG
#>seq3
catgcatcgacgatgctgacgatc
#>seq4
cacacaccNNNNttgtgca
#...
hope this helps.
With the shell command egrep (grep with extended regex):

egrep -B 1 "^[^NNNN && ^#seq]" your.fa >conert.fa
# -B 1 means print the matching row and the previous row
# ^[^NNNN && ^#seq] is a regex pattern intended to skip lines that
# begin with NNNN or #seq

So it only matches rows that begin with an ordinary A/T/G/C sequence, plus the previous row, which is the fasta header.
I have a bunch of large HTML files and I want to run a Hadoop MapReduce job on them to find the most frequently used words. I wrote both my mapper and reducer in Python and used Hadoop streaming to run them.
Here is my mapper:
#!/usr/bin/env python
import sys
import re
import string

def remove_html_tags(in_text):
    '''
    Remove any HTML tags that are found.
    '''
    global flag
    in_text = in_text.lstrip()
    in_text = in_text.rstrip()
    in_text = in_text + "\n"
    if flag == True:
        in_text = "<" + in_text
        flag = False
    if re.search('^<', in_text) != None and re.search('(>\n+)$', in_text) == None:
        in_text = in_text + ">"
        flag = True
    p = re.compile(r'<[^<]*?>')
    in_text = p.sub('', in_text)
    return in_text

# input comes from STDIN (standard input)
global flag
flag = False

for line in sys.stdin:
    # remove leading and trailing whitespace, set to lowercase and remove HTML tags
    line = line.strip().lower()
    line = remove_html_tags(line)
    # split the line into words
    words = line.split()
    # increase counters
    for word in words:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        if word == '': continue
        for c in string.punctuation:
            word = word.replace(c, '')
        print '%s\t%s' % (word, 1)
Here is my reducer:
#!/usr/bin/env python
from operator import itemgetter
import sys

# maps words to their counts
word2count = {}

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # parse the input we got from mapper.py
    word, count = line.split('\t', 1)
    # convert count (currently a string) to int
    try:
        count = int(count)
        word2count[word] = word2count.get(word, 0) + count
    except ValueError:
        pass

sorted_word2count = sorted(word2count.iteritems(),
                           key=lambda (k, v): (v, k), reverse=True)

# write the results to STDOUT (standard output)
for word, count in sorted_word2count:
    print '%s\t%s' % (word, count)
Whenever I pipe in a small sample string like 'hello world hello hello world ...', I get the proper output of a ranked list. However, when I try a small HTML file and use cat to pipe the HTML into my mapper, I get the following error (input2 contains some HTML code):
rohanbk#hadoop:~$ cat input2 | /home/rohanbk/mapper.py | sort | /home/rohanbk/reducer.py
Traceback (most recent call last):
File "/home/rohanbk/reducer.py", line 15, in <module>
word, count = line.split('\t', 1)
ValueError: need more than 1 value to unpack
Can anyone explain why I'm getting this? Also, what is a good way to debug a MapReduce job program?
You can reproduce the bug even with just:
echo "hello - world" | ./mapper.py | sort | ./reducer.py
The issue is here:
if word =='': continue
for c in string.punctuation:
word= word.replace(c,'')
If word is a single punctuation mark, as would be the case for the above input (after it is split), then it is converted to an empty string. So, just move the check for an empty string to after the replacement.
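A sketch of the reordered mapper loop, reduced to just the tokenizing step so it runs standalone (the '%s\t%s' formatting works under both Python 2 and 3):

```python
import string

def clean_word(word):
    # Remove punctuation characters first...
    for c in string.punctuation:
        word = word.replace(c, '')
    return word

emitted = []
for word in "hello - world".lower().split():
    word = clean_word(word)
    # ...then skip tokens that became empty, like the bare "-"
    if word == '':
        continue
    emitted.append('%s\t%s' % (word, 1))

print('\n'.join(emitted))
```

With the empty-string check after the punctuation stripping, the bare "-" token is dropped instead of being emitted as an empty word, so the reducer's line.split('\t', 1) always gets two fields.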