I need to create a program that can take some text file called a fasta file and transform it to give the sequence_name, Domain_names, Start of Domain, end Of Domain.
So a fasta file is just a text file that looks like this
>MICE_8
ATTCGATCGATCGATTTCGATCGATCGATCGATCGGGATCGATCGATCGATCGATC
>MICE_59
ATTTTTCGGCATCGATAGCTAGCTAGCTAG
My program needs to take one command argument which is the file name of the fasta and give an output like this:
MICE_8 gnl|CDD|256537 819 923 gnl|CDD|260076 111 189 gnl|CDD|260056 4 93
MICE_59
here is a decription of the output for more information:
MICE_8 is the name of the first sequence in the fasta file
gnl|CDD|256537 is the name of the first protein domain
819 this is where the domain stats
923 this is where it ends
gnl|CDD|260076 is the name of the second protein domain for the first sequence and so on it starts at 111 and end at position 189.
Also since the last sequence did not get a hit the program still needs to display the name of the sequence.
OK so here is my code so far and what it outputs so far
import sys
import os
fastaname = sys.argv[1]
rpsblastname = "rpsblast.out"
cmd = "rpsblast+ -db /home/bryan/data/cdd/cdd -query %s -outfmt 6 -evalue 0.05 > %s" % (fastaname,rpsblastname)
os.system(cmd)
handle = open(rpsblastname, "r")
seqname = ""
for line in handle:
linearr = line.split()
# seqname = linearr [0]
domain = linearr[1]
start = linearr[6]
end = linearr[7]
# If sequence name is the same as last time, don't print it
if seqname == linearr[0]:
sys.stdout.write("%s %s %s" % (domain, start, end))
# Otherwise do print the sequence name, and update seqname
else:
seqname = linearr[0]
print
sys.stdout.write("%s %s %s %s" % (seqname,domain,start,end))
here is what my output looks like so far:
mel#roswald:~$ ./Domainfinder.py bioinformation.fasta
MICE_8 gnl|CDD|256537 819 923gnl|CDD|260076 111 189gnl|CDD|260056 4 93
The program i created is almost to the required specification. * only have 3 problems that * need be to address:
there is an extra space between where I run the program and the result
my program does not write out the name of the sequence which has zero hits
my program does not separate the domain names by a space.
the correct output should look like this
mel#roswald:~$ ./Domainfinder.py bioinformation.fasta
MICE_8 gnl|CDD|256537 819 923 gnl|CDD|260076 111 189 gnl|CDD|260056 4 93
MICE_59
solved the issue. Mainly what needed to be done is use a dictionary to hold the sequence names as keys instead of using lists. Than since dictionaries are random we need to be able to create a list from the dictionary to order the sequence names as they are read. Also we extract the sequence names from the rpsblast out. if anyone has any question feel free to pm.
Related
This script is supposed to calculate the total weight of a protein, so I decided to count the occurrences of certain characters in a script. However, only the first equation produces a result which causes the total weight to be the same value (everything under the first one equals zero, which is definitely incorrect). How do I get my script to pay attention to the other lines???
This is a shortened version:
akt3_file = open('AKT3 fasta.txt', 'r') #open up the fasta file
for line in akt3_file:
ala =(sum(line.count('A') for line in akt3_file)*89*1000) #this value is 1780000
arg =(sum(line.count('R') for line in akt3_file)*174*1000)
asn =(sum(line.count('N') for line in akt3_file)*132*1000)
asp =(sum(line.count('D') for line in akt3_file)*133*1000)
asx =(sum(line.count('B') for line in akt3_file)*133*1000)
protein_weight = ala+arg+asn+asp+asx
print(protein_weight) # the problem is that this value is also 1780000
akt3_file.close() #close the fasta file
The issue you have is that you're trying to iterate over your file's lines several times. While that's actually possible (unlike most iterators, file objects can be rewound with seek), you're not doing it properly, so all the iterations except for the first don't see any data.
In this case, you probably don't need to iterate over the lines at all. Just read the full text of the file into a string, and count the characters you want out of that string:
with open('AKT3 fasta.txt', 'r') as akt_3file: # A with is not necessary, but a good idea.
data = akt_3file.read() # Read the whole file into the data string.
ala = data.count('A') * 89 * 1000 # Now we can count all the occurrences in all lines at
arg = data.count('R') * 174 * 1000 # once, and there's no issue iterating the file, since
asn = data.count('N') * 132 * 1000 # we're not dealing with the file any more, just the
asp = data.count('D') * 133 * 1000 # immutable data string.
asx = data.count('B') * 133 * 1000
So I am having a problem extracting text from a larger (>GB) text file. The file is structured as follows:
>header1
hereComesTextWithNewlineAtPosition_80
hereComesTextWithNewlineAtPosition_80
hereComesTextWithNewlineAtPosition_80
andEnds
>header2
hereComesTextWithNewlineAtPosition_80
hereComesTextWithNewlineAtPosAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAlineAtPosition_80
MaybeAnotherTargetBBBBBBBBBBBrestText
andEndsSomewhereHere
Now I have the information that in the entry with header2 I need to extract the text from position X to position Y (the A's in this example), starting with 1 as the first letter in the line below the header.
BUT: the positions do not account for newline characters. So basically when it says from 1 to 95 it really means just the letters from 1 to 80 and the following 15 of the next line.
My first solution was to use file.read(X-1) to skip the unwanted part in front and then file.read(Y-X) to get the part I want, but when that stretches over newline(s) I get to few characters extracted.
Is there a way to solve this with another python-function than read() maybe? I thought about just replacing all newlines with empty strings but the file maybe quite large (millions of lines).
I also tried to account for the newlines by taking extractLength // 80 as added length, but this is problematic in cases like the example when eg. of 95 characters it's 2-80-3 over 3 lines I actually need 2 additional positions but 95 // 80 is 1.
UPDATE:
I modified my code to use Biopython:
for s in SeqIO.parse(sys.argv[2], "fasta"):
#foundClusters stores the information for substrings I want extracted
currentCluster = foundClusters.get(s.id)
if(currentCluster is not None):
for i in range(len(currentCluster)):
outputFile.write(">"+s.id+"|cluster"+str(i)+"\n")
flanking = 25
start = currentCluster[i][0]
end = currentCluster[i][1]
left = currentCluster[i][2]
if(start - flanking < 0):
start = 0
else:
start = start - flanking
if(end + flanking > end + left):
end = end + left
else:
end = end + flanking
#for debugging only
print(currentCluster)
print(start)
print(end)
outputFile.write(s.seq[start, end+1])
But I get the following error:
[[1, 55, 2782]]
0
80
Traceback (most recent call last):
File "findClaClusters.py", line 92, in <module>
outputFile.write(s.seq[start, end+1])
File "/usr/local/lib/python3.4/dist-packages/Bio/Seq.py", line 236, in __getitem__
return Seq(self._data[index], self.alphabet)
TypeError: string indices must be integers
UPDATE2:
Changed outputFile.write(s.seq[start, end+1]) to:
outRecord = SeqRecord(s.seq[start: end+1], id=s.id+"|cluster"+str(i), description="Repeat-Cluster")
SeqIO.write(outRecord, outputFile, "fasta")
and its working :)
With Biopython:
from Bio import SeqIO
X = 66
Y = 130
for s in in SeqIO.parse("test.fst", "fasta"):
if "header2" == s.id:
print s.seq[X: Y+1]
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
Biopython let's you parse fasta file and access its id, description and sequence easily. You have then a Seq object and you can manipulate it conveniently without recoding everything (like reverse complement and so on).
I have a file that come from map reduce output for the format below that needs conversion to CSV using shell script
25-MAY-15
04:20
Client
0000000010
127.0.0.1
PAY
ISO20022
PAIN000
100
1
CUST
API
ABF07
ABC03_LIFE.xml
AFF07/LIFE
100000
Standard Life
================================================
==================================================
AFF07-B000001
2000
ABC Corp
..
BE900000075000027
AFF07-B000002
2000
XYZ corp
..
BE900000075000027
AFF07-B000003
2000
3MM corp
..
BE900000075000027
I need the output like CSV format below where I want to repeat some of the values in the file and add the TRANSACTION ID as below format
25-MAY-15,04:20,Client,0000000010,127.0.0.1,PAY,ISO2002,PAIN000,100,1,CUST,API,ABF07,ABC03_LIFE.xml,AFF07/LIFE,100000,Standard Life, 25-MAY-15,04:20,Client,0000000010,127.0.0.1,PAY,ISO2002,PAIN000,100,1,CUST,API,AFF07-B000001, 2000,ABC Corp,..,BE900000075000027
25-MAY-15,04:20,Client,0000000010,127.0.0.1,PAY,ISO2002,PAIN000,100,1,CUST,API,ABF07,ABC03_LIFE.xml,AFF07/LIFE,100000,Standard Life, 25-MAY-15,04:20,Client,0000000010,127.0.0.1,PAY,ISO2002,PAIN000,100,1,CUST,API,AFF07-B000002,2000,XYZ Corp,..,BE900000075000027
TRANSACTION ID is AFF07-B000001,AFF07-B000002,AFF07-B000003 which have different values and I have put a marked line from where the Transaction ID starts . Before the demarkation , the values should be repeating and the transaction ID column needs to be added along with the repeating values before the line as given in above format
BASH shell script I may need and CentOS is the flavour of linux
I am getting error as below when I execute the code
Traceback (most recent call last):
File "abc.py", line 37, in <module>
main()
File "abc.py", line 36, in main
createTxns(fh)
File "abc.py", line 7, in createTxns
first17.append( fh.readLine().rstrip() )
AttributeError: 'file' object has no attribute 'readLine'
Can someone help me out
Is this a correct description of the input file and output format?
The input file consists of:
17 lines, followed by
groups of 10 lines each - each group holding one transaction id
Each output row consists of:
29 common fields, followed by
5 fields derived from each of the 10-line groups above
So we just translate this into some Python:
def createTxns(fh):
"""fh is the file handle of the input file"""
# 1. Read 17 lines from fh
first17 = []
for i in range(17):
first17.append( fh.readLine().rstrip() )
# 2. Form the common fields.
commonFields = first17 + first17[0:12]
# 3. Process the rest of the file in groups of ten lines.
while True:
# read 10 lines
group = []
for i in range(10):
x = fh.readline()
if x == '':
break
group.append( x.rstrip() )
if len(group) <> 10:
break # we've reached the end of the file
fields = commonFields + [ group[2], group[4], group[6], group[7[, group[9] ]
row = ",".join(fields)
print row
def main():
with open("input-file", "r") as fh:
createTxns(fh)
main()
This code shows how to:
open a file handle
read lines from a file handle
strip off the ending newline
check for end of input when reading from a file
concatenate lists together
join strings together
I would recommend you to read Input and Output if you are going for the python route.
You just have to break the problem down and try it. For the first 17 line use f.readline() and concat into the string. Then the replace method to get the begining of the string that you want in the csv.
str.replace("\n", ",")
Then use the split method and break them down into the list.
str.split("\n")
Then write the file out in the loop. Use a counter to make your life easier. First write out the header string
25-MAY-15,04:20,Client,0000000010,127.0.0.1,PAY,ISO2002,PAIN000,100,1,CUST,API,ABF07,ABC03_LIFE.xml,AFF07/LIFE,100000,Standard Life, 25-MAY-15,04:20,Client,0000000010,127.0.0.1,PAY,ISO2002,PAIN000,100,1,CUST,API
Then write the item in the list with a ",".
,AFF07-B000001, 2000,ABC Corp,..,BE900000075000027
At the count of 5 write the "\n" with the header again and don't forget to reset your counter so it can begin again.
\n25-MAY-15,04:20,Client,0000000010,127.0.0.1,PAY,ISO2002,PAIN000,100,1,CUST,API,ABF07,ABC03_LIFE.xml,AFF07/LIFE,100000,Standard Life, 25-MAY-15,04:20,Client,0000000010,127.0.0.1,PAY,ISO2002,PAIN000,100,1,CUST,API
Give it a try and let us know if you need more assistant. I assumed that you have some scripting background :) Good luck!!
This seems fairly trivial but I can't seem to work it out
I have a text file with the contents:
B>F
I am reading this with the code below, stripping the '>' and trying to convert the strings into their corresponding ASCII value, minus 65 to give me a value that will correspond to another list index
def readRoute():
routeFile = open('route.txt', 'r')
for line in routeFile.readlines():
route = line.strip('\n' '\r')
route = line.split('>')
#startNode, endNode = route
startNode = ord(route[0])-65
endNode = ord(route[1])-65
# Debug (this comment was for my use to explain below the print values)
print 'Route Entered:'
print line
print startNode, ',', endNode, '\n'
return[startNode, endNode]
However I am having slight trouble doing the conversion nicely, because the text file only contains one line at the moment but ideally I need it to be able to support more than one line and run an amount of code for each line.
For example it could contain:
B>F
A>D
C>F
E>D
So I would want to run the same code outside this function 4 times with the different inputs
Anyone able to give me a hand
Edit:
Not sure I made my issue that clear, sorry
What I need it do it parse the text file (possibly containing one line or multiple lines like above. I am able to do it for one line with the lines
startNode = ord(route[0])-65
endNode = ord(route[1])-65
But I get errors when trying to do more than one line because the ord() is expecting different inputs
If I have (below) in the route.txt
B>F
A>D
This is the error it gives me:
line 43, in readRoute endNode = ord(route[1])-65
TypeError: ord() expected a character, but string of length 2 found
My code above should read the route.txt file and see that B>F is the first route, strip the '>' - convert the B & F to ASCII, so 66 & 70 respectively then minus 65 from both to give 1 & 5 (in this example)
The 1 & 5 are corresponding indexes for another "array" (list of lists) to do computations and other things on
Once the other code has completed it can then go to the next line in route.txt which could be A>D and perform the above again
Perhaps this will work for you. I turned the fileread into a generator so you can do as you please with the parsed results in the for-i loop.
def readRoute(file_name):
with open(file_name, 'r') as r:
for line in r:
yield (ord(line[0])-65, ord(line[2])-65)
filename = 'route.txt'
for startnode, endnode in readRoute(filename):
print startnode, endnode
If you can't change readRoute, change the contents of the file before each call. Better yet, make readRoute take the filename as a parameter (default it to 'route.txt' to preserve the current behavior) so you can have it process other files.
What about something like this? It takes the routes defined in your file and turns them into path objects with start and end member variables. As an added bonus PathManager.readFile() allows you to load multiple route files without overwriting the existing paths.
import re
class Path:
def __init__(self, start, end):
self.start = ord(start) - 65 # Scale the values as desired
self.end = ord(end) - 65 # Scale the values as desired
class PathManager:
def __init__(self):
self.expr = re.compile("^([A-Za-z])[>]([A-Za-z])$") # looks for string "C>C"
# where C is a char
self.paths = []
def do_logic_routine(self, start, end):
# Do custom logic here that will execute before the next line is read
# Return True for 'continue reading' or False to stop parsing file
return True
def readFile(self, path):
file = open(path,"r")
for line in file:
item = self.expr.match(line.strip()) # strip whitespaces before parsing
if item:
'''
item.group(0) is *not* used here; it matches the whole expression
item.group(1) matches the first parenthesis in the regular expression
item.group(2) matches the second
'''
self.paths.append(Path(item.group(1), item.group(2)))
if not do_logic_routine(self.paths[-1].start, self.paths[-1].end):
break
# Running the example
MyManager = PathManager()
MyManager.readFile('route.txt')
for path in MyManager.paths:
print "Start: %s End: %s" % (path.start, path.end)
Output is:
Start: 1 End: 5
Start: 0 End: 3
Start: 2 End: 5
Start: 4 End: 3
Hey there, I have a rather large file that I want to process using Python and I'm kind of stuck as to how to do it.
The format of my file is like this:
0 xxx xxxx xxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
1 xxx xxxx xxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
So I basically want to read in the chunk up from 0-1, do my processing on it, then move on to the chunk between 1 and 2.
So far I've tried using a regex to match the number and then keep iterating, but I'm sure there has to be a better way of going about this. Any suggestion/info would be greatly appreciated.
If they are all within the same line, that is there are no line breaks between "1." and "2." then you can iterate over the lines of the file like this:
for line in open("myfile.txt"):
#do stuff
The line will be disposed of and overwritten at each iteration meaning you can handle large file sizes with ease. If they're not on the same line:
for line in open("myfile.txt"):
if #regex to match start of new string
parsed_line = line
else:
parsed_line += line
and the rest of your code.
Why don't you just read the file char by char using file.read(1)?
Then, you could - in each iteration - check whether you arrived at the char 1. Then you have to make sure that storing the string is fast.
If the "N " can only start a line, then why not use use the "simple" solution? (It sounds like this already being done, I am trying to reinforce/support it ;-))
That is, just reading a line at a time, and build up the data representing the current N object. After say N=0, and N=1 are loaded, process them together, then move onto the next pair (N=2, N=3). The only thing that is even remotely tricky is making sure not to throw out a read line. (The line read that determined the end condition -- e.g. "N " -- also contain the data for the next N).
Unless seeking is required (or IO caching is disabled or there is an absurd amount of data per item), there is really no reason not to use readline AFAIK.
Happy coding.
Here is some off-the-cuff code, which likely contains multiple errors. In any case, it shows the general idea using a minimized side-effect approach.
# given an input and previous item data, return either
# [item_number, data, next_overflow] if another item is read
# or None if there are no more items
def read_item (inp, overflow):
data = overflow or ""
# this can be replaced with any method to "read the header"
# the regex is just "the easiest". the contract is just:
# given "N ....", return N. given anything else, return None
def get_num(d):
m = re.match(r"(\d+) ", d)
return int(m.groups(1)) if m else None
for line in inp:
if data and get_num(line) ne None:
# already in an item (have data); current line "overflows".
# item number is still at start of current data
return [get_num(data), data, line]
# not in item, or new item not found yet
data += line
# and end of input, with data. only returns above
# if a "new" item was encountered; this covers case of
# no more items (or no items at all)
if data:
return [get_num(data), data, None]
else
return None
And usage might be akin to the following, where f represents an open file:
# check for error conditions (e.g. None returned)
# note feed-through of "overflow"
num1, data1, overflow = read_item(f, None)
num2, data2, overflow = read_item(f, overflow)
If the format is fixed, why not just read 3 lines at a time with readline()
If the file is small, you could read the whole file in and split() on number digits (might want to use strip() to get rid of whitespace and newlines), then fold over the list to process each string in the list. You'll probably have to check that the resultant string you are processing on is not initially empty in case two digits were next to each other.
If the file's content can be loaded in memory, and that's what you answered, then the following code (needs to have filename defined) may be a solution.
import re
regx = re.compile('^((\d+).*?)(?=^\d|\Z)',re.DOTALL|re.MULTILINE)
with open(filename) as f:
text = f.read()
def treat(inp,regx=regx):
m1 = regx.search(inp)
numb,chunk = m1.group(2,1)
li = [chunk]
for mat in regx.finditer(inp,m1.end()):
n,ch = mat.group(2,1)
if int(n) == int(numb) + 1:
yield ''.join(li)
numb = n
li = []
li.append(ch)
chunk = ch
yield ''.join(li)
for y in treat(text):
print repr(y)
This code, run on a file containing :
1 mountain
orange 2
apple
produce
2 gas
solemn
enlightment
protectorate
3 grimace
song
4 snow
wheat
51 guludururu
kelemekinonoto
52asabi dabada
5 yellow
6 pink
music
air
7 guitar
blank 8
8 Canada
9 Rimini
produces:
'1 mountain\norange 2\napple\nproduce\n'
'2 gas\nsolemn\nenlightment\nprotectorate\n'
'3 grimace\nsong\n'
'4 snow\nwheat\n51 guludururu\nkelemekinonoto\n52asabi dabada\n'
'5 yellow\n'
'6 pink \nmusic\nair\n'
'7 guitar\nblank 8\n'
'8 Canada\n'
'9 Rimini'