Parsing blast output in .xml format


I have BLAST output in .xml format. I won't post an example here because it is huge, but I can if you really need it. On to my question: the script below works OK. The only problem is that I want to print the identifier at the start of hit_def, which varies in length across the file and is always followed by a space. How can I modify the code to print it? As you can see, with [:8] it prints the first 8 characters, but the identifier might be 10, 15, etc. characters long. How can I improve this?
import re
import sys
fasta_out = True  # not defined in the original snippet; assumed to be a flag set elsewhere
base = sys.argv[1]
base = base.rstrip('xml')
if fasta_out == True:
    seq_out = open(base+'fasta', 'w')
read_def = set()
with open(sys.argv[1],'rb') as xml:
    for line in xml:
        if re.search('<Iteration_query-def>', line) != None:
            line = line.strip()
            line = line.rstrip()
            line = re.sub('<Iteration_query-def>', '', line)
            line = re.sub('</Iteration_query-def>', '', line)
            query_def = line
        if re.search('<Hit_def>', line) != None:
            line = line.strip()
            line = line.rstrip()
            line = re.sub('<Hit_def>', '', line)
            line = re.sub('</Hit_def>', '', line)
            hit_def = line[:8]
            if fasta_out == True:
                print >> seq_out, query_def+'\t'+hit_def+'\n'
if fasta_out == True:
    seq_out.close()
This is an example of what my hit_def lines look like:
>MLOC_36586.11 pep:known chromosome:030312v2:1:9883453:9888834:-1 gene:MLOC_36586 transcript:MLOC_36586.11 description:"Uncharacterized protein "
>MLOC_36586.2 pep:known chromosome:030312v2:1:9883444:9888847:-1 gene:MLOC_36586 transcript:MLOC_36586.2 description:"Uncharacterized protein "
>MLOC_51.2 pep:known chromosome:030312v2:1:322147737:322148802:1 gene:MLOC_51 transcript:MLOC_51.2 description:"Predicted protein\x3b Uncharacterized protein "
>MLOC_217.1 pep:known chromosome:030312v2:4:519918111:519919326:1 gene:MLOC_217 transcript:MLOC_217.1 description:"Uncharacterized protein "
Desired hit_defs:
MLOC_36586.11
MLOC_36586.2
MLOC_51.2
MLOC_217.1

If you know it's always the first item in the string, you can do something like this:
hit_def = line[:line.index(' ')]
If it isn't necessarily first, you might go for a regex like this:
hit_def = re.findall(r'(MLOC_\d+\.\d+) ',line)[0]
I'm assuming that your hit_defs are all of the form MLOC_XXX.X, but you get the idea.
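Both one-liners above raise an exception when the line does not have the expected shape (line.index raises ValueError if there is no space, and indexing an empty findall result raises IndexError). A minimal sketch of more forgiving variants follows; the helper names first_token and mloc_id are made up for illustration, and stripping a leading '>' is an assumption based on the sample lines in the question.

```python
import re

def first_token(hit_def):
    """Return the text before the first whitespace, without a leading '>'."""
    parts = hit_def.split(None, 1)  # split on the first run of whitespace
    return parts[0].lstrip('>') if parts else ''

def mloc_id(hit_def):
    """Return the first MLOC_<digits>.<digits> identifier, or None if absent."""
    match = re.search(r'MLOC_\d+\.\d+', hit_def)
    return match.group(0) if match else None

sample = '>MLOC_36586.11 pep:known chromosome:030312v2:1:9883453:9888834:-1 gene:MLOC_36586'
print(first_token(sample))
print(mloc_id(sample))
```

Either helper could replace the line[:8] slice in the question's loop, depending on whether the identifier is always the first item on the line.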

Related

What is the error in my code? I want to fill in the blanks in a word using a word list from a text file

I want to fill in the blanked-out characters of a word using a word list stored in a text file. For example:
I give the line The Language Is _th_n !
It should return python, filling in each _ using text from a file such as text.txt.
I wrote this code, please check it:
import re
with open('data/text','r', encoding='utf8') as file:
    word_list = file.read()
def solve(message):
    hint = []
    for i in range(15,len(message) - 1):
        if message[i] != '\\':
            hint.append(message[i])
    hint_string = ''
    for i in hint:
        hint_string += i
    hint_replaced = hint_string.replace('_', '!')
    solution = re.findall('^'+hint_replaced+'$', word_list, re.MULTILINE)
    return solution
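One likely culprit in the code above is the replacement of '_' with '!'. In a regular expression '!' only matches a literal exclamation mark, whereas '.' matches any single character. The sketch below shows that idea in isolation. The function name solve_hint and the in-memory word list are made up for illustration, and it assumes the hint has already been cut out of the message.

```python
import re

def solve_hint(hint, words):
    """Return the words that fit a hint in which '_' stands for one unknown character."""
    # '.' is the regex wildcard; every other character is escaped so it is matched literally.
    pattern = ''.join('.' if ch == '_' else re.escape(ch) for ch in hint)
    regex = re.compile(pattern, re.IGNORECASE)
    # fullmatch requires the whole word to fit, like '^...$' in the original code.
    return [word for word in words if regex.fullmatch(word)]

print(solve_hint('p_th_n', ['python', 'pathan', 'path']))
```

With a word list read from a file, words could be built with file.read().split() instead of the literal list used here.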

Python: read line and modify it (if needed)

Let's say I have a file Example.txt like this:
alpha_1 = 10
%alpha_2 = 20
Now, I'd like to have a Python script which performs these tasks:
1. If the line containing the alpha_1 parameter is not commented out (with the % symbol), rewrite the line with a leading %, as is done for alpha_2.
2. Perform the task in 1. independently of the line order.
3. Leave the rest of the file Example.txt untouched.
The script I wrote is:
with open('Example.txt', 'r+') as config:
    while 1:
        line = config.readline()
        if not line:
            break
        # remove line returns
        line = line.strip('\r\n')
        # make sure it has useful data
        if (not "=" in line) or (line[0] == '%'):
            continue
        # split across equal sign
        line = line.split("=",1)
        this_param = line[0].strip()
        this_value = line[1].strip()
        for case in switch(this_param):
            if case("alpha1"):
                string = ('% alpha1 =', this_value )
                s = str(string)
                config.write(s)
So far, the output is the same Example.txt with an extra line (%alpha1 =, 10) added below the original line alpha1 = 10, instead of the original line being rewritten.
Thanks everybody

I found the solution after a while. Everything can be done easily by writing the output to another file and swapping it in for the original at the end.
import os
configfile2 = open('Example.txt' + '_temp', "w")
with open('Example.txt', 'r') as configfile:
    while 1:
        line = configfile.readline()
        string = line
        if not line:
            break
        # remove line returns
        line = line.strip('\r\n')
        # make sure it has useful data
        if (not "=" in line) or (line[0] == '%'):
            configfile2.write(string)
        else:
            # split across equal sign
            line = line.split("=", 1)
            this_param = line[0].strip()
            this_value = line[1].strip()
            # comment out the parameter of interest
            if this_param == "alpha1":
                # '\n' is enough: text mode converts it to the platform line ending
                stringalt = '% alpha1 = ' + this_value + '\n'
                configfile2.write(stringalt)
            else:
                configfile2.write(string)
# the with block has already closed configfile
configfile2.close()
# the file is now replaced
os.remove('Example.txt')
os.rename('Example.txt' + '_temp', 'Example.txt')
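The same write-to-a-temporary-file-then-swap idea can be sketched more compactly. This is only an illustration under a few assumptions: the function name comment_out and its parameters are invented here, the parameter name is passed in rather than hard-coded, and os.replace is used for the final swap because it overwrites the target in one step.

```python
import os
import tempfile

def comment_out(path, param, marker='%'):
    """Prefix the line that assigns `param` with `marker`; copy every other line unchanged."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file lives next to the original so the final swap stays on one filesystem.
    with open(path) as src, tempfile.NamedTemporaryFile(
            'w', dir=directory, delete=False) as dst:
        for line in src:
            name = line.split('=', 1)[0].strip()
            if '=' in line and name == param:
                dst.write(marker + line)  # an already commented line has name '%...' and is skipped
            else:
                dst.write(line)
    os.replace(dst.name, path)  # swap the rewritten copy in for the original

# Example call (would modify Example.txt in place):
# comment_out('Example.txt', 'alpha_1')
```

Because each line is checked on its own, the result does not depend on the order of the lines in the file.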

When I open a text file, it only reads the last line

Say customPassFile.txt has two lines in it. First line is "123testing" and the second line is "testing321". If passwordCracking = "123testing", then the output would be that "123testing" was not found in the file (or something similar). If passwordCracking = "testing321", then the output would be that "testing321" was found in the file. I think that the for loop I have is only reading the last line of the text file. Any solutions to fix this?
import time
import linecache
def solution_one(passwordCracking):
    print("Running Solution #1 # " + time.strftime("%Y-%m-%d %H:%M:%S",time.localtime()))
    startingTimeSeconds = time.time()
    currentLine = 1
    attempt = 1
    passwordFound = False
    wordListFile = open("customPassFile.txt", encoding="utf8")
    num_lines = sum(1 for line in open('customPassFile.txt'))
    while(passwordFound == False):
        for i, line in enumerate(wordListFile):
            if(i == currentLine):
                line = line
        passwordChecking = line
        if(passwordChecking == passwordCracking):
            passwordFound = True
            endingTimeSeconds = time.time()
            overallTimeSeconds = endingTimeSeconds - startingTimeSeconds
            print("~~~~~~~~~~~~~~~~~")
            print("Password Found: {}".format(passwordChecking))
            print("ATTEMPTS: {}".format(attempt))
            print("TIME TO FIND: {} seconds".format(overallTimeSeconds))
            wordListFile.close()
            break
        elif(currentLine == num_lines):
            print("~~~~~~~~~~~~~~~~~")
            print("Stopping Solution #1 # " + time.strftime("%Y-%m-%d %H:%M:%S", time.localtime()))
            print("REASON: Password could not be cracked")
            print("ATTEMPTS: {}".format(attempt))
            break
        else:
            attempt = attempt + 1
            currentLine = currentLine + 1
            continue

The main problem with your code is that you open the file once but iterate over it multiple times. The first pass moves the file object's position to the end, and it stays there. The next time you iterate over the file nothing is read, since you are already at the end of the file.
Example
Sometimes an example is worth more than lots of words.
Take the file test_file.txt with the following lines:
line1
line2
Now open the file and read it twice:
>>> f = open('./test_file.txt')
>>> f.tell()
0
>>> for l in f:
...     print(l, end='')
... else:
...     print('nothing')
...
line1
line2
nothing
>>> f.tell()
12
>>> for l in f:
...     print(l, end='')
... else:
...     print('nothing')
...
nothing
>>> f.close()
The second time nothing happens, as the file object is already at the end.
Solution
Here you have two options:
1. Read the file only once, save all the lines in a list, and then use the list in your code. It should be enough to replace
wordListFile = open("customPassFile.txt", encoding="utf8")
num_lines = sum(1 for line in open('customPassFile.txt'))
with
with open("customPassFile.txt", encoding="utf8") as f:
    wordListFile = f.readlines()
num_lines = len(wordListFile)
2. Reset the file object position with seek after you have read the file. It would be something along these lines:
for i, line in enumerate(wordListFile):
    if(i == currentLine):
        line = line
wordListFile.seek(0)
I would go with option 1, unless you have memory constraints (e.g. the file is bigger than memory).
Notes
I have a few extra notes:
Python starts counting at 0 (like C/C++) and not at 1 (like Fortran), and enumerate follows the same convention by default. So you probably want to set:
currentLine = 0
When you read a file, the newline character \n is not stripped, so you have to strip it yourself (with strip) or account for it when comparing strings (e.g. using startswith). For example:
passwordChecking == passwordCracking
will likely always return False, as passwordChecking contains \n and passwordCracking very likely doesn't.
Disclaimer
I haven't tried the code, nor my suggestions, so there might be other bugs lurking around.
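Putting option 1 and the notes together, a minimal sketch of the search could look like this. The function name find_password and its path parameter are invented for illustration; it iterates over the file once and strips the trailing newline before comparing, and it omits the timing and printing from the question.

```python
def find_password(target, path="customPassFile.txt"):
    """Return the 1-based attempt number at which `target` is found, or None."""
    with open(path, encoding="utf8") as word_list:
        # A single pass over the file: each line is compared exactly once.
        for attempt, line in enumerate(word_list, start=1):
            if line.rstrip("\n") == target:  # drop the newline before comparing
                return attempt
    return None

# Example call (assumes customPassFile.txt exists in the working directory):
# print(find_password("123testing"))
```

Returning None for a miss lets the caller decide what to print, which keeps the search itself free of the loop bookkeeping (currentLine, num_lines) used in the question.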

I will delete this answer once the OP understands the indentation problem, or once I understand the intention of the code.
for i, line in enumerate(wordListFile):
    if(i == currentLine):
        line = line
passwordChecking = line
#rest of the code.
Here your code is outside of the for loop, so only the last line is kept. Indent it so that it runs inside the loop:
for i, line in enumerate(wordListFile):
    if(i == currentLine):
        line = line
        passwordChecking = line
        #rest of the code.

Python3 add colour to specific outputted words from lists in a sentence

My code below checks a text file to see whether a line contains a word from my lexicon file. If it does, it then searches the same line for a word from a secondary list. If both conditions are met, the line is printed.
What I am trying to do is colour the lexicon word, for example red, and colour the words found in the secondary list (called CategoryGA) blue. The purpose is to easily identify in the printout where each of the found words came from.
import re
import collections
from collections import defaultdict
from collections import Counter
import sys
from Categories.GainingAccess import GA
Chatpath = "########/Chat1.txt"
Chatfile = Chatpath
lpath = 'Lexicons/######.txt'
lfile = lpath
CategoryGA = GA
Hits = []
"""
text_file = open(path, "r")
lines = text_file.read().split()
c = Counter(lines)
for i, j in c.most_common(50):
    print(i, j)
"""
# class LanguageModelling:
def readfile():
    Word_Hit = None
    with open(Chatfile) as file_read:
        content = file_read.readlines()
        for line_num, line in enumerate(content):
            if any(word in line for word in CategoryGA):
                Word_Hit = False
                for word in CategoryGA:
                    if line.find(word) != -1:
                        Word_Hit = True
                        Hits.append(word)
                        Cleanse = re.sub('<.*?>', '', line)
                        print('%s appeared on Line %d : %s' % (word, line_num, Cleanse))
    file_read.close()
    count = Counter(Hits)
    count.keys()
    for key, value in count.items():
        print(key, ':', value)
def readlex():
    with open(lfile) as l_read:
        l_content = l_read.readlines()
        for line in l_content:
            r = re.compile(r'^\d+\s+\d+\.\d+%\s*')
            l_Cleanse = r.sub('', line)
            print(l_Cleanse)
    l_read.close()
def LanguageDetect():
    with open(Chatfile) as c_read, open(lfile) as l_read:
        c_content = c_read.readlines()
        lex_content = l_read.readlines()
        for line in c_content:
            Cleanse = re.sub('<.*?>', '', line)
            if any(lex_word in line for lex_word in lex_content) \
                    and \
                    any(cat_word in line for cat_word in CategoryGA):
                lex_word = '\033[1;31m{}\033[1;m'.format(lex_word)
                cat_word = '\033[1;44m{}\033[1;m'.format(cat_word)
                print(Cleanse)
                # print(cat_word)
    c_read.close()
    l_read.close()
#readfile()
LanguageDetect()
# readlex()
This is my full code, but the issue occurs in the LanguageDetect function. My current approach of assigning to the lex_word and cat_word variables hasn't worked, and frankly I'm stumped as to what to try next.
Lexicon:
31547 4.7072% i
25109 3.7466% u
20275 3.0253% you
10992 1.6401% me
9490 1.4160% do
7681 1.1461% like
6293 0.9390% want
6225 0.9288% my
5459 0.8145% have
5141 0.7671% your
5103 0.7614% lol
4857 0.7247% can
Then, within the readlex function, I use:
r = re.compile(r'^\d+\s+\d+\.\d+%\s*')
l_Cleanse = r.sub('', line)
to remove everything before the word/character. I believe this may be the main reason why I can't colour the lexicon word, but I'm unsure how to fix it.

I think your problem comes from the way you handle the line data, but maybe I did not understand your question clearly.
This should do the trick:
lex_content = ['aaa', 'xxx']
CategoryGA = ['ccc', 'ddd']
line = 'abc aaa bbb ccc'
for lex_word in lex_content:
    for cat_word in CategoryGA:
        if lex_word in line and cat_word in line:
            print(lex_word, cat_word)
            line = line.replace(lex_word, '\033[1;31m' + lex_word + '\033[1;m')
            line = line.replace(cat_word, '\033[1;44m' + cat_word + '\033[1;m')
print(line)
This prints aaa ccc, followed by the line with aaa in bold red and ccc on a blue background.
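For the lexicon file in the question, each raw line still carries the count and percentage columns and a trailing newline, so a test like lex_word in line can never succeed until those are stripped. The sketch below combines that clean-up with the colouring. The names parse_lexicon and colourise are invented for illustration, the sample words are made up, and matching whole words with \b is an assumption about the desired behaviour.

```python
import re

RED = '\033[1;31m'      # bold red text, for lexicon words
BLUE_BG = '\033[1;44m'  # bold text on a blue background, for category words
RESET = '\033[0m'

def parse_lexicon(lines):
    """Strip the leading 'count percent%' columns and the newline from each lexicon line."""
    prefix = re.compile(r'^\d+\s+\d+\.\d+%\s*')
    return [prefix.sub('', line).strip() for line in lines if line.strip()]

def colourise(line, lex_words, cat_words):
    """Colour whole-word hits in a single pass so inserted escape codes are never re-matched."""
    colours = {word: RED for word in lex_words}
    colours.update({word: BLUE_BG for word in cat_words})
    if not colours:
        return line
    # Longest words first, so a longer word wins over a shorter one that it starts with.
    ordered = sorted(colours, key=len, reverse=True)
    pattern = re.compile(r'\b(' + '|'.join(re.escape(w) for w in ordered) + r')\b')
    return pattern.sub(lambda m: colours[m.group(1)] + m.group(1) + RESET, line)

lexicon = parse_lexicon(['31547 4.7072% i\n', '20275 3.0253% you\n'])
print(colourise('do you want access', lexicon, ['access']))
```

A single regex pass is used instead of repeated str.replace calls so that short lexicon entries such as i or u only match as standalone words and not inside other words.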

How to find a specific line of text in a text file with python?

import re
import linecache
def match_text(raw_data_file, concentration):
    file = open(raw_data_file, 'r')
    lines = ""
    print("Testing")
    for num, line in enumerate(file.readlines(), 0):
        w = ' WITH A CONCENTRATION IN ' + concentration
        if re.search(w, line):
            for i in range(0, 6):
                lines += linecache.getline(raw_data_file, num+1)
            try:
                write(lines, "lines.txt")
                print("Lines Data Created...")
            except:
                print("Could not print Line Data")
        else:
            print("Didn't Work")
I am trying to open a .txt file and search for a specific string.

If you are simply trying to write all of the lines that hold your string to a file, this will do.
def match_text(raw_data_file, concentration):
    look_for = ' WITH A CONCENTRATION IN ' + concentration
    with open(raw_data_file) as fin, open('lines.txt', 'w') as fout:
        fout.writelines(line for line in fin if look_for in line)

Fixed my own issue. The following works to find a specific line and get the lines following the matched line.
def match_text(raw_data_file, match_this_text, however_many_lines_after_matched_text):
    w = match_this_text
    lines = ""
    with open(raw_data_file, 'r') as inF:
        for line in inF:
            if w in line:
                lines += line  # add the matched line to the lines string
                for i in range(0, however_many_lines_after_matched_text):
                    lines += next(inF, "")  # "" if the file ends early
    # 'lines' is the final multiline text
    return lines
This will return the matched line plus the requested number of lines that follow it. I apologize if the question was confusing.
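A variant of the same idea, sketched with itertools.islice from the standard library: the function name lines_after_match is made up, it stops at the first match, and it accepts any iterable of lines (an open file works) so it can be tried without touching the disk.

```python
from itertools import islice

def lines_after_match(lines, needle, count):
    """Return the first line containing `needle` plus up to `count` lines after it."""
    iterator = iter(lines)
    for line in iterator:
        if needle in line:
            # islice keeps consuming the same iterator, so it yields the following lines
            # and simply stops early if the input runs out.
            return [line] + list(islice(iterator, count))
    return []

sample = ['header\n', 'X WITH A CONCENTRATION IN ppm\n', 'row 1\n', 'row 2\n', 'row 3\n']
print(''.join(lines_after_match(sample, ' WITH A CONCENTRATION IN ppm', 2)))
```

With a real file, the call would be made inside a with open(raw_data_file) as inF: block, passing inF as the first argument.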
