I would like to remove entries from a FASTA file where all nucleotides are N, but keep entries that contain a mix of ACGT and N nucleotides.
For example, given this input file content:
#>seq_1
TGCTAGCTAGCTGATCGTGTCGATCG
CACCACANNNNNCACGTGTCG
#>seq2
NNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNN
#>seq3
catgcatcgacgatgctgacgatc
#>seq4
cacacaccNNNNttgtgca
#...
I am hoping for the output file content to be:
#>seq_1
TGCTAGCTAGCTGATCGTGTCGATCG
CACCACANNNNNCACGTGTCG
#>seq3
catgcatcgacgatgctgacgatc
#>seq4
cacacaccNNNNttgtgca
#...
Any suggestions for doing this with awk, Perl, Python, or something else?
Thank you!
FDS
With GNU awk, using a regex record separator (RS as a regex is a GNU extension; RT holds the header text that matched RS, which is saved in term and printed back in front of the record that follows it, while records consisting only of N's and newlines are skipped):
awk -v RS='#>seq[[:digit:]]+' '!/^[N\n]+$/{printf "%s",term""$0}; {term=RT}' file
#>seq_1
TGCTAGCTAGCTGATCGTGTCGATCG
CACCACANNNNNCACGTGTCG
#>seq3
catgcatcgacgatgctgacgatc
#>seq4
cacacaccNNNNttgtgca
With Python, using the Biopython module:

from Bio import SeqIO  # a bare "import Bio" does not pull in the SeqIO submodule

INPUT = "bio_input.fas"
OUTPUT = "bio_output.fas"

def main():
    records = SeqIO.parse(INPUT, 'fasta')
    filtered = (rec for rec in records if any(ch != 'N' for ch in rec.seq))
    SeqIO.write(filtered, OUTPUT, 'fasta')

if __name__ == "__main__":
    main()
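To run it, save the script next to bio_input.fas (both filenames here are just placeholders) and invoke it with python; it assumes Biopython is installed, e.g. via pip install biopython.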
However, note that the FASTA spec says sequence IDs are supposed to start with >, not #>!
Run against this input:
>seq_1
TGCTAGCTAGCTGATCGTGTCGATCG
CACCACANNNNNCACGTGTCG
>seq2
NNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNN
>seq3
catgcatcgacgatgctgacgatc
>seq4
cacacaccNNNNttgtgca
this produces:
>seq_1
TGCTAGCTAGCTGATCGTGTCGATCGCACCACANNNNNCACGTGTCG
>seq3
catgcatcgacgatgctgacgatc
>seq4
cacacaccNNNNttgtgca
(Note: the default line-wrap length is 60 characters.)
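One caveat: the any(ch != 'N' ...) filter is case-sensitive, so a hypothetical record consisting entirely of lowercase n's would still be kept. If that can occur in your data, a case-insensitive variant of the same filter (a small sketch, same placeholder filenames) would be:

from Bio import SeqIO

records = SeqIO.parse("bio_input.fas", "fasta")
filtered = (rec for rec in records
            if any(ch.upper() != 'N' for ch in rec.seq))
SeqIO.write(filtered, "bio_output.fas", "fasta")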
So essentially blocks start with a #> marker, and you wish to remove blocks where no line contains anything other than N. One way in Python:
#! /usr/bin/env python
import fileinput, sys, re

block = []
nonN = re.compile('[^N\n]')
for line in fileinput.input():
    if line.startswith('#>'):
        if len(block) == 1 or any(map(nonN.search, block[1:])):
            sys.stdout.writelines(block)
        block = [line]
    else:
        block.append(line)
if len(block) == 1 or any(map(nonN.search, block[1:])):
    sys.stdout.writelines(block)
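Run it as, for example, ./filter_blocks.py input.fas > output.fas (the script name is arbitrary); fileinput reads the files named on the command line, or stdin if none are given.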
In Python, using a regex:

#!/usr/bin/env python
import re

ff = open('test', 'r')
data = ff.read()
ff.close()

# match a header followed only by all-N lines, up to the next header
# (or the end of the file), so whole records are removed cleanly
m = re.compile(r'#>.*\n(?:N+\n?)+(?=#>|\Z)')
f = m.sub('', data)

fo = open('out', 'w')
fo.write(f)
fo.close()
and you will get the following in your out file:
#>seq_1
TGCTAGCTAGCTGATCGTGTCGATCG
CACCACANNNNNCACGTGTCG
#>seq3
catgcatcgacgatgctgacgatc
#>seq4
cacacaccNNNNttgtgca
#...
hope this helps.
With the shell command egrep (grep with extended regex), assuming GNU grep:

egrep -B 1 --no-group-separator '^N*[ACGTacgt]' your.fa > convert.fa
# -B 1 means print each matching row and the previous row (the header)
# ^N*[ACGTacgt] matches sequence rows containing at least one non-N base

so it only matches sequence rows that contain a real A/T/G/C base, plus the preceding row, which is the FASTA header. (Note this assumes a record never mixes all-N lines with normal lines, since an all-N line would be silently dropped from an otherwise kept record.)
Related
I am working on a project and am having issues with the following code that I have written in nano:
from Bio import SeqIO
import sys
import re

fasta_file = (sys.argv[1])

for myfile in SeqIO.parse(fasta_file, "fasta"):
    if len(myfile) > 250:
        gene_id = myfile.id
        list = re.match('H149xcV\_\w+\_\w+\_\w+', gene_id)
        print (">"+list.group(1))
This is the error I receive when I execute my command on the command line:
File "mpo.py", line 7, in <module>
gene_id = myfile.id
NameError: name 'myfile' is not defined
I have a FASTA file with the format:
>H149xcV_Fge342_r3_h2_d1 len=210 path=[0:0-206]
ACTATACATGAGGAGAACATAGAACAAAAATGGGACCATAGATATATAACAATAGAAGATATAGAGAACACAATAGACAACTTATTAGGAAAGAGGTGTGTCGTCATGGAGCTGATGTTCGAGGATACTTTGCATGGTCATTCTTGGATAATTTTGAGTGGGCTATGGGATACACCAAGAGGTTTGGCATTGTTTATGTTGATTATAAGAACGGGC
>H149xcV_ytR1oP_r3_h2_d1 len=306 path=[0:0-207]
ATTAGAGTCTGAGAGAGTCTTGATTTGTCGTCGTCGAGAAATATAGGAGATCTGATTAGAGGAGAGAGCGGCCTAGGCGATGCGCGATATAGCGCTATATAGGCCTAGAGGAGAGTCTCTCTCTTTTAGAAGAGATAATATATATATATATATGGCTCTCCGGCGGGGCCGCGCGAGAGCTCGATCGATCGATATTAGCTGTACGATGCTAGCTAGCTTATATTCGATCGATTATAGCTTAGATCTCTCTCTAAAGGTCGATATCGCTTATGCGCGCGTATATCG
I would like to reformat my file so that it only provides me with the unique gene IDs, and only outputs those gene IDs with a length greater than 250 bp.
I would like my desired output to look like this:
>H149xcV_Fge342_r3_h2
>H149xcV_ytR1oP_r3_h2
>H149xcV_DPN78333_r3_h2
>H149xcV_AgV472_r3_h2
>H149xcV_DNP733_r3_h2
As suggested in the comments following your question, the parameter to match should be a string. The one thing I'll add is that Python has an r"" raw-string delimiter that is handy for regular expressions. Your code becomes this:
from Bio import SeqIO
import sys
import re

fasta_file = (sys.argv[1])

for myfile in SeqIO.parse(fasta_file, "fasta"):
    if len(myfile) > 250:
        gene_id = myfile.id
        # renamed from "list" to avoid shadowing the built-in list type
        m = re.match(r"H149xcV_\w+_\w+_\w+", gene_id)
        print(">" + m.group(0))
The underscore _ is not a special regular expression character (as I recall) so it doesn't need to be escaped.
The match() function takes a regex and the string you are searching (so I added gene_id). Lastly, you want to output group(0). group(0) means the whole match. group(1) is from the first capturing paren (of which you have none) so stick with group(0).
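A quick illustration of the difference, using a shortened pattern of my own:

import re

gene_id = "H149xcV_Fge342_r3_h2_d1"
m = re.match(r"(H149xcV)_\w+", gene_id)
print(m.group(0))   # the whole match: 'H149xcV_Fge342_r3_h2_d1'
print(m.group(1))   # just the first capturing paren: 'H149xcV'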
Use the infoseq utility from the EMBOSS package and pipe the output (a table with sequence IDs and lengths) through a one-liner in a scripting language of your choice. Here, I am using Perl for simplicity:
cat input.fasta | \
infoseq -auto -only -name -length stdin | \
perl -lane 'my ($name, $length) = @F; if ( !$seen{$name}++ && $length > 250 ) { print ">$name"; }' > output.fasta
Install EMBOSS, for example, using conda:
conda create --name emboss emboss
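If you would rather stay in Python, a minimal Biopython sketch of the same filter (first occurrence of each ID, length > 250) might look like this; the script takes the FASTA file as its first argument:

from Bio import SeqIO
import sys

seen = set()
for rec in SeqIO.parse(sys.argv[1], "fasta"):
    # keep only the first record for each id, and only long sequences
    if len(rec) > 250 and rec.id not in seen:
        seen.add(rec.id)
        print(">" + rec.id)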
TLDR: Is there a clean way to make a list of entries from subprocess.check_output(['pcregrep', '-M', '-e', pattern, file])?
I'm using python's subprocess.check_output() to call pcregrep -M. Normally I would separate results by calling splitlines() but since I'm looking for a multiline pattern, that won't work. I'm having trouble finding a clean way to create a list of the matching patterns, where each entry of the list is an individual matching pattern.
Here's a simple example file I'm pcregrep'ing:
module test_module(
input wire in0,
input wire in1,
input wire in2,
input wire cond,
input wire cond2,
output wire out0,
output wire out1
);
assign out0 = (in0 & in1 & in2);
assign out1 = cond1 ? in1 & in2 :
cond2 ? in1 || in2 :
in0;
Here's (some of) my Python code:

#!/usr/bin/env python
import subprocess, re

output_str = subprocess.check_output(['pcregrep', '-M', '-e', "^\s*assign\\s+\\bout0\\b[^;]+;",
                                      "/home/<username>/pcregrep_file.sv"]).split(';')

# Print out the matches
for idx, line in enumerate(output_str):
    print "output_str[%d] = %s" % (idx, line)

# Clear out the whitespace list entries
output_str = [line for line in output_str if re.search(r'\S', line)]
Here is the output
output_str[0] =
assign out0 = in0 & in1 & in2
output_str[1] =
assign out1 = cond1 ? in1 & in2 :
cond2 ? in1 || in2 :
in0
output_str[2] =
It would be nice if I could do something like
output_list = subprocess.check_output('pcregrep', -M, -e, <pattern>, <file>).split(<multiline_delimiter>)
without creating garbage to clean up (whitespace list entries), or even to have a delimiter to split() on that is independent of the pattern.
Is there a clean way to create a list of the matching multiline patterns?
Per Casimir et Hippolyte's comment, and the very helpful post, How do I re.search or re.match on a whole file without reading it all into memory?, I read in the file using re instead of an external call to pcregrep and used re.findall(pattern, file, re.MULTILINE)
Full solution (which only slightly modifies the referenced post)
#!/usr/bin/env python
import re, mmap

filename = "/home/<username>/pcregrep_file.sv"
with open(filename, 'r+') as f:
    data = mmap.mmap(f.fileno(), 0)
    # same multiline search as the pcregrep invocation above
    output_str = re.findall(r'^\s*assign\s+\bout0\b[^;]+;', data, re.MULTILINE)

for i, l in enumerate(output_str):
    print "output_str[%d] = '%s'" % (i, l)
which creates the desired list.
Don't do that. If you can't use the Python regular expression module for some reason, just use the Python bindings for pcre.
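For illustration, here is a minimal sketch assuming the third-party python-pcre package, whose API is advertised as mirroring the re module (treat this as an assumption and check its docs before relying on it):

import pcre  # third-party python-pcre bindings, assumed to mirror re's API

with open('/home/<username>/pcregrep_file.sv') as f:
    data = f.read()

# the same multiline search the pcregrep invocation performs
for m in pcre.findall(r'^\s*assign\s+\bout0\b[^;]+;', data, pcre.MULTILINE):
    print m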
I have to change the formatting of a text file.
Filesystem Size Used Avail Use% Mounted on
/dev/sda3 20G 15G 4.2G 78% /
/dev/sda6 68G 39G 26G 61% /u01
/dev/sda2 30G 5.8G 22G 21% /opt
/dev/sda1 99M 19M 76M 20% /boot
tmpfs 48G 8.2G 39G 18% /dev/shm
/dev/mapper/vg3-KPGBKUP4
10T 7.6T 2.5T 76% /KPGBKUP4
I want the output as below:
20G 15G 4.2G 78%
68G 39G 26G 61%
30G 5.8G 22G 21%
99M 19M 76M 20%
48G 8.2G 39G 18%
10T 7.6T 2.5T 76%
This is what I have done, but it requires me to put the names of all partitions in my script. I have to run this script on more than 25 servers, which have different partition names that keep changing. Is there any better way?
This is my current script:
import os, sys, fileinput

for line in fileinput.input('report.txt', inplace=True):
    line = line.replace("/dev/sda3 ", "")
    line = line.replace("/dev/sda6 ", "")
    line = line.replace("/dev/sda2 ", "")
    line = line.replace("/dev/sda1 ", "")
    line = line.replace("tmpfs ", "")
    line = line.replace("/dev/mapper/vg3-KPGBKUP4", "")
    line = line.replace("/KPGBKUP4", "")
    line = line.lstrip()
    sys.stdout.write(line)
Did you try something like this?
import os, sys, fileinput

for line in fileinput.input('report.txt'):
    line = line.split()
    print ' '.join(line[1:])
This works for the input you provided:
for line in open('report.txt'):
    if line.startswith('Filesystem'):
        # Skip header line
        continue
    try:
        part, size, used, avail, used_percent, mount = line.split(None)
    except ValueError:
        # Unexpected line format, skip
        continue
    print ' '.join([size,
                    used.rjust(5),
                    avail.rjust(5),
                    used_percent.rjust(5)])
A key point here is to use line.split(None) in order to split on runs of consecutive whitespace - see the docs on split() for details on this behavior.
Also see str.rjust() on how to format a string to be right justified.
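For instance, a quick interpreter illustration:

>>> '4.2G'.rjust(6)
'  4.2G'
>>> '78%'.rjust(6)
'   78%'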
You can use regex:
import os, sys, fileinput
import re

for line in fileinput.input('report.txt', inplace=True):
    # strip() removes the trailing newline, so add it back after the sub
    sys.stdout.write(re.sub(r'^.+?\s+', '', line.strip()) + '\n')
If there are occurrences in the file like in your first example, where the line is actually broken into two lines by a newline character (not just wrapped in the terminal), you should read the whole file into a variable and use flags=re.MULTILINE in the sub() function:
f = open('report.txt', 'r')
a = f.read()
print re.sub(r'^.+?\s+', '', a, flags=re.MULTILINE)
#!/usr/bin/env python
import re, subprocess, shlex

cmd = 'df -Th'
# execute command
output = subprocess.Popen(shlex.split(cmd), stdout=subprocess.PIPE)
# read stdout-output from command
output = output.communicate()[0]
# concat line-continuations
output = re.sub(r'\n\s+', ' ', output)
# split output to list of lines
output = output.splitlines()[1:]
# print fields right-aligned in columns of width 6
for line in output:
    print("{2:>6}{3:>6}{4:>6}{5:>6}".format(*tuple(line.split())))
I use subprocess to execute the df command instead of reading report.txt.
To format the output I used Python's Format String Syntax.
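For instance, "{2:>6}" means "take the third whitespace-separated field and right-align it in a column six characters wide"; a quick illustration:

>>> "{2:>6}{3:>6}".format('/dev/sda3', 'ext3', '20G', '15G')
'   20G   15G'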
Here's a quick and dirty solution:
first = True
for line in open('report.txt'):
    # Skip the header row
    if first:
        first = False
        continue
    # For skipping improperly formatted lines, like the last one
    if len(line.split()) == 6:
        filesystem, size, used, avail, use, mount = line.split()
        print size, used, avail, use

Or, printing the same fields more compactly inside that if block:

        print ' '.join(line.split()[1:5])
I have some real-time packets coming in as below, stored in a file log.log, and I am reading log.log with tail -f and parsing it. The lines are random, with no fixed values: random IPs, random values in the data:: blocks (each data:: field is a column value). E.g. in log.log:
Other types of lines...
[TCP]: incomeing data: 91 bytes, data=connect data::10.109.0.200data::10.109.0.86data::wandata::p4data::1400data::800data::end
[TCP]: incomeing data: 91 bytes, data=connect data::10.109.0.201data::10.109.8.86data::landata::p4data::1400data::700data::end
[TCP]: incomeing data: 91 bytes, data=connect data::10.109.0.200data::10.109.58.86data::3gdata::p4data::400data::800data::end
something.. else...
Now, how can I parse the line so that everything else is ignored and a line is only parsed when it matches this:
connect data::ANYdata::ANYdata::ANYdata::ANYdata::ANYdata::ANYdata::end
Run:
$ tail -f /var/tmp/log.log | python -u /var/tmp/myparse.py
myparse.py:
import sys, time, os, subprocess
import re

def p(command):
    subprocess.Popen(command, shell=False, stdin=subprocess.PIPE, stdout=subprocess.PIPE)

while True:
    line = sys.stdin.readline()
    if line:
        if "command:start" in line:
            print "OK - working"
            p("/var/tmp/single_thread_process.sh")
        if "connect data::" in line:
            ..
        else:
            # ^(?:\+|00)(\d+)$ Parse the 0032, 32, +32
            #match = re.search(r'^(?:\+|00)(\d+)$', line)
            #if match:
            #print "OK"

            ### NOT working ###
            match = re.search(r'^connect data::*data::*data::*data::*data::*data::*data::end$', line)
            if match:
                print "OK"
Try using:
match = re.search(r'connect data::[^:]+::[^:]+::[^:]+::[^:]+::[^:]+::[^:]+::end$', line)
The beginning-of-line anchor ^ is the first thing preventing the matches: "connect data::" appears in the middle of the line, not at its start.
Also, * is not a wildcard in regex; it's a quantifier meaning "zero or more of the preceding token". You can use [^:]+ to mean "one or more of any character except colons".
regex101 demo
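If you also need the individual field values, a capturing variant of the same pattern works (a sketch; the variable names are my own guesses at the fields' meanings):

import re

line = "[TCP]: incomeing data: 91 bytes, data=connect data::10.109.0.200data::10.109.0.86data::wandata::p4data::1400data::800data::end"

# capture each field between the data:: separators
pattern = re.compile(r'connect data::([^:]+?)data::([^:]+?)data::([^:]+?)'
                     r'data::([^:]+?)data::([^:]+?)data::([^:]+?)data::end$')
m = pattern.search(line)
if m:
    src_ip, dst_ip, iface, profile, val1, val2 = m.groups()
    print src_ip, dst_ip, iface, profile, val1, val2
    # -> 10.109.0.200 10.109.0.86 wan p4 1400 800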
I have a bunch of large HTML files and I want to run a Hadoop MapReduce job on them to find the most frequently used words. I wrote both my mapper and reducer in Python and used Hadoop streaming to run them.
Here is my mapper:
#!/usr/bin/env python
import sys
import re
import string

def remove_html_tags(in_text):
    '''
    Remove any HTML tags that are found.
    '''
    global flag
    in_text = in_text.lstrip()
    in_text = in_text.rstrip()
    in_text = in_text + "\n"
    if flag == True:
        in_text = "<" + in_text
        flag = False
    if re.search('^<', in_text) != None and re.search('(>\n+)$', in_text) == None:
        in_text = in_text + ">"
        flag = True
    p = re.compile(r'<[^<]*?>')
    in_text = p.sub('', in_text)
    return in_text

# input comes from STDIN (standard input)
global flag
flag = False
for line in sys.stdin:
    # remove leading and trailing whitespace, set to lowercase and remove HTML tags
    line = line.strip().lower()
    line = remove_html_tags(line)
    # split the line into words
    words = line.split()
    # increase counters
    for word in words:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        if word == '': continue
        for c in string.punctuation:
            word = word.replace(c, '')
        print '%s\t%s' % (word, 1)
Here is my reducer:
#!/usr/bin/env python
from operator import itemgetter
import sys

# maps words to their counts
word2count = {}

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # parse the input we got from mapper.py
    word, count = line.split('\t', 1)
    # convert count (currently a string) to int
    try:
        count = int(count)
        word2count[word] = word2count.get(word, 0) + count
    except ValueError:
        pass

sorted_word2count = sorted(word2count.iteritems(),
                           key=lambda (k, v): (v, k), reverse=True)

# write the results to STDOUT (standard output)
for word, count in sorted_word2count:
    print '%s\t%s' % (word, count)
Whenever I pipe in a small sample string like 'hello world hello hello world ...' I get the proper output of a ranked list. However, when I try to use a small HTML file and use cat to pipe the HTML into my mapper, I get the following error (input2 contains some HTML code):
rohanbk#hadoop:~$ cat input2 | /home/rohanbk/mapper.py | sort | /home/rohanbk/reducer.py
Traceback (most recent call last):
File "/home/rohanbk/reducer.py", line 15, in <module>
word, count = line.split('\t', 1)
ValueError: need more than 1 value to unpack
Can anyone explain why I'm getting this? Also, what is a good way to debug a MapReduce job?
You can reproduce the bug even with just:
echo "hello - world" | ./mapper.py | sort | ./reducer.py
The issue is here:
if word == '': continue
for c in string.punctuation:
    word = word.replace(c, '')
If word is a single punctuation mark, as is the case for the '-' in the above input (after it is split), then it is reduced to an empty string by the replacements. The mapper then emits the line "\t1"; the reducer's line.strip() eats that leading tab, so line.split('\t', 1) yields only one value and the unpacking fails. So, just move the check for an empty string to after the replacement.
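A minimal sketch of the corrected inner loop (it drops into the mapper's existing for line loop unchanged apart from the reordering):

for word in words:
    # strip punctuation first ...
    for c in string.punctuation:
        word = word.replace(c, '')
    # ... then skip anything that ended up empty
    if word == '':
        continue
    print '%s\t%s' % (word, 1)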