Sequence match using Python

Sequence match using Python - python

I am working on RNA sequence matching
seq = 'UCAGCUGUCAGUCAUGAUC'
sub_seq =['UGUCAG', 'CAGUCA', 'UCAGCU','GAUC']
I am matching the sub_seq to the seq, matched sub_seq is under the seq, if there is no matched, use dash line. Output looks like this:
UCAGCUGUCAGUCAUGAUC
UCAGCU--CAGUCA-GAUC
-----UGUCAG--------
I try to use the dictionary to do this
index_dict = {}
for i in xrange(len(sub_seq)):
index_dict[seq.find(sub_seq[i])] = {}
index_dict[seq.find(sub_seq[i])]['sequence'] = sub_seq[i]
index_dict[seq.find(sub_seq[i])]['end_index'] = seq.find(sub_seq[i]) + len(sub_seq[i]) - 1
I cannot figure out the algorithm to do alignment, any help will be appreciated!

seq_l = len(seq)
for ele in sub_seq:
start = seq.find(ele)
ln = len(ele)
if start != -1:
end = start + ln
print("-" * start + ele + "-"*(seq_l- end))
else:
print("-" * seq_l)
-----UGUCAG--------
--------CAGUCA-----
UCAGCU-------------
---------------GAUC
Not sure where UCAGCU--CAGUCA-GAUC comes from as you are only using a single sub sequence at a time in your code

Assuming you'll let me change your index_dict slightly, consider:
seq = 'UCAGCUGUCAGUCAUGAUC'
sub_seq =['UGUCAG', 'CAGUCA', 'UCAGCU','GAUC']
index_dict = {}
for i in xrange(len(sub_seq)):
index_dict[seq.find(sub_seq[i])] = {
'sequence': sub_seq[i],
'end_index': seq.find(sub_seq[i]) + len(sub_seq[i]) # Note this changed
}
sorted_keys = sorted(index_dict)
lines = []
while True:
if not sorted_keys: break
line = []
next_index = 0
for k in sorted_keys:
if k >= next_index:
line.append(k)
next_index = index_dict[k]['end_index']
# Remove keys we used, append line to lines
for k in line: sorted_keys.remove(k)
lines.append(line)
# Build output lines
olines = []
for line in lines:
oline = ''
for k in line:
oline += '-' * (k - len(oline)) # Add dashes before subseq
oline += index_dict[k]['sequence'] # Add subsequence
oline += '-' * (len(seq) - len(oline)) # Add trailing dashes
olines.append(oline)
print seq
print '\n'.join(olines)
Output:
UCAGCUGUCAGUCAUGAUC
UCAGCU--CAGUCA-GAUC
-----UGUCAG--------
Note this is pretty verbose, and could be condensed a bit. The while True and for line in lines loops could probably be merged into one, but it should help explain one possible approach.
Edit: This is one way you might join the last two loops:
seq = 'UCAGCUGUCAGUCAUGAUC'
sub_seq =['UGUCAG', 'CAGUCA', 'UCAGCU','GAUC']
index_dict = {}
for i in xrange(len(sub_seq)):
index_dict[seq.find(sub_seq[i])] = {
'sequence': sub_seq[i],
'end_index': seq.find(sub_seq[i]) + len(sub_seq[i]) # Note this changed
}
sorted_keys = sorted(index_dict)
lines = []
while True:
if not sorted_keys: break
line = ''
next_index = 0
keys_used = []
for k in sorted_keys:
if k >= next_index:
line += '-' * (k - len(line)) # Add dashes before subseq
line += index_dict[k]['sequence'] # Add subsequence
next_index = index_dict[k]['end_index'] # Update next_index
keys_used.append(k) # Mark key as used
for k in keys_used: sorted_keys.remove(k) # Remove used keys
line += '-' * (len(seq) - len(line)) # Add trailing dashes
lines.append(line) # Add line to lines
print seq
print '\n'.join(lines)
Output:
UCAGCUGUCAGUCAUGAUC
UCAGCU--CAGUCA-GAUC
-----UGUCAG--------

Related

diff list of multiline strings with difflib without knowing which were added, deleted or modified

I have two lists of multiline strings and I try to get the the diff lines for these strings. First I tried to just split all lines of each string and handled all these strings as one big "file" and get the diff for it but I had a lot of bugs. I cannot just diff by index since I do not know, which multiline string was added, which was deleted and which one was modified.
Lets say I had the following example:
import difflib
oldList = ["one\ntwo\nthree","four\nfive\nsix","seven\neight\nnine"]
newList = ["four\nfifty\nsix","seven\neight\nnine","ten\neleven\ntwelve"]
oldAllTogether = []
for string in oldList:
oldAllTogether.extend(string.splitlines())
newAllTogether = []
for string in newList:
newAllTogether.extend(string.splitlines())
diff = difflib.unified_diff(oldAllTogether,newAllTogether)
So I somehow have to find out, which strings belong to each other.

I had to implmenent my own code in order to get the desired output. It is basically the same as Differ.compare() with the difference that we have a look at multiline blocks instead of lines. So the code would be:
diffString = ""
oldList = ["one\ntwo\nthree","four\nfive\nsix","seven\neight\nnine"]
newList = ["four\nfifty\nsix","seven\neight\nnine","ten\neleven\ntwelve"]
a = oldList
b = newList
cruncher = difflib.SequenceMatcher(None, a, b)
for tag, alo, ahi, blo, bhi in cruncher.get_opcodes():
if tag == 'replace':
best_ratio, cutoff = 0.74, 0.75
oldstrings = a[alo:ahi]
newstrings = b[blo:bhi]
for j in range(len(newstrings)):
newstring = newstrings[j]
cruncher.set_seq2(newstring)
for i in range(len(oldstrings)):
oldstring = oldstrings[i]
cruncher.set_seq1(oldstring)
if cruncher.real_quick_ratio() > best_ratio and \
cruncher.quick_ratio() > best_ratio and \
cruncher.ratio() > best_ratio:
best_ratio, best_old, best_new = cruncher.ratio(), i, j
if best_ratio < cutoff:
#added string
stringLines = newstring.splitlines()
for line in stringLines: diffString += "+" + line + "\n"
else:
#replaced string
start = False
for diff in difflib.unified_diff(oldstrings[best_old].splitlines(),newstrings[best_new].splitlines()):
if start:
diffString += diff + "\n"
if diff[0:2] == '##':
start = True
del oldstrings[best_old]
#deleted strings
stringLines = []
for string in oldstrings:
stringLines.extend(string.splitlines())
for line in stringLines: diffString += "-" + line + "\n"
elif tag == 'delete':
stringLines = []
for string in a[alo:ahi]:
stringLines.extend(string.splitlines())
for line in stringLines:
diffString += "-" + line + "\n"
elif tag == 'insert':
stringLines = []
for string in b[blo:bhi]:
stringLines.extend(string.splitlines())
for line in stringLines:
diffString += "+" + line + "\n"
elif tag == 'equal':
continue
else:
raise ValueError('unknown tag %r' % (tag,))
which result in the following:
print(diffString)
four
-five
+fifty
six
-one
-two
-three
+ten
+eleven
+twelve

How would I format python code using python?

Let's say I've got this code in python:
total=0for i in range(100):print(i)if i > 50:total=total+i
How would I make an algorithm in python to format this python code into the code below:
total=0
for i in range(100):
print(i)
if i > 50:
total=total+i
Assume that everything is nested under each other, such that another statement would be assumed to be inside the if block.

This was quite a fun exercise! I'm running out of juice so just posting this as is. It works on your example but probably not much for anything more complex.
code_block = "total=0for i in range(100):print(i)if i > 50:total=total+iprint('finished')"
code_block_b = "def okay() {print('ff')while True:print('blbl')break}"
line_break_before = ['for', 'while', 'if', 'print', 'break', '}']
line_break_after = [':', '{']
indent_chars = [':', '{']
unindent_chars = ['}']
# Add line breaks before keywords
for kw in line_break_before:
kw_indexes = [idx for idx in range(len(code_block)) if code_block[idx:idx + len(kw)] == kw]
for kw_idx in kw_indexes[::-1]:
code_block = code_block[:kw_idx] + '\n' + code_block[kw_idx:]
# Add line breaks after other keywords if not present already
for kw in line_break_after:
kw_indexes = [idx for idx in range(len(code_block)) if code_block[idx:idx + len(kw)] == kw]
for kw_idx in kw_indexes[::-1]:
if code_block[kw_idx + 1: kw_idx + 2] != '\n':
code_block = code_block[:kw_idx + 1] + '\n' + code_block[kw_idx + 1:]
# Add indentation
indent = 0
formatted_code_lines = []
for line in code_block.split('\n'):
if line[-1] in unindent_chars:
indent = 0
formatted_code_lines.append(' ' * indent)
if line[-1] in indent_chars:
indent += 4
formatted_code_lines.append(line + '\n')
code_block = ''.join(formatted_code_lines)
print(code_block)
The basic premise for formatting is based around keywords. There are keys that require a line break before, and keys that require a line break after them. After that, the indentation was counted +4 spaces for every line after each : symbol. I tested some formatting with braces too in code_block_b.
Output a
total=0
for i in range(100):
print(i)
if i > 50:
total=total+i
Output b
def okay() {
print('ff')
while True:
print('blbl')
break
}

Python: Count kmers from fasta files

I want to count the kmers from a fasta file. I have the following script:
import operator
seq = open('file', 'r')
kmers = {}
k = 5
for i in range(len(seq) - k + 1):
kmer = seq[i:i+k]
if kmer in kmers:
kmers[kmer] += 1
else:
kmers[kmer] = 1
for kmer, count in kmers.items():
print (kmer + "\t" + str(count))
sortedKmer = sorted(kmers.items(), key=itemgetter(1), reverse=True)
for item in sortedKmer:
print (item[0] + "\t" + str(item[1]))
This works fine for a file with only one sequence, but now I have a fasta file with several contigs.
My fasta file looks like this:
>1
GTCTTCCGGCGAGCGGGCTTTTCACCCGCTTTATCGTTACTTATGTCAGCATTCGCACTT
CTGATACCTCCAGCAACCCTCACAGGCCACCTTCGCAGGCTTACAGAACGCTCCCCTACC
CAACAACGCATAAACGTCGCTGCCGCAGCTTCGGTGCATGGTTTAGCCCCGTTACATCTT
CCGCGCAGGCCGACTCGACCAGTGAGCTATTACGCTTTCTTTAAATGATGGCTGCTTCTA
AGCCAACATCCTGGCTGTCTGG
>2
AAAGAAAGCGTAATAGCTCACTGGTCGAGTCGGCCTGCGCGGAAGATGTAACGGGGCTAA
ACCATGCACCGAAGCTGCGGCAGCGACACTCAGGTGTTGTTGGGTAGGGGAGCGTTCTGT
AAGCCTGTGAAGGTGGCCTGTGAGGGTTGCTGGAGGTATCAGAAGTGCGAATGCTGACAT
AAGTAACGATAAAGCGGGTGAAAAGCCCGCTCGCCGGAAGACCAAGGGTTCCTGTCCAAC
GTTAATCGGGGCAGG
How can I change the script that it take first the sequence after ">1", print that output, go to ">2", print that output etc?

I have never heard about kmer or fasta, but I think I understand what you are trying to do.
You can try to split on a regex involving '>', but I would recommend processing the file line by line and accumulate kmers before printing them appropriately when reaching the '>1'-lines. See below code with comments
import operator
def printSeq(name, seq):
# Extract your code into a function and print header for current kmer
print("%s\n################################" %name)
kmers = {}
k = 5
for i in range(len(seq) - k + 1):
kmer = seq[i:i+k]
if kmer in kmers:
kmers[kmer] += 1
else:
kmers[kmer] = 1
for kmer, count in kmers.items():
print (kmer + "\t" + str(count))
sortedKmer = sorted(kmers.items(), reverse=True)
for item in sortedKmer:
print (item[0] + "\t" + str(item[1]))
with open('file', 'r') as f:
seq = ""
key = ""
for line in f.readlines():
# Loop over lines in file
if line.startswith(">"):
# if we get '>' it is time for a new sequence
if key and seq:
# if it wasn't the first we should print it before overwriting the variables
printSeq(key, seq)
# store name after '>' and reset sequence
key = line[1:].strip()
seq = ""
else:
# accumulate kmer until we hit another '>'
seq += line.strip()
# when we are done with all the lines, print the last sequence
printSeq(key, seq)

I tried the following with your example FASTA file and it should work:
def count_kmers(seq, k, kmers):
for i in range(len(seq) - k + 1):
kmr = seq[i:i + k]
if kmr in kmers:
kmers[kmr] += 1
else:
kmers[kmr] = 1
filename = raw_input('File name/path: ')
k = input('Value for k: ')
kmers = {}
# Put each line of the file into a list (avoid empty lines)
with open(filename) as f:
lines = [l.strip() for l in f.readlines() if l.strip() != '']
# Find the line indices where a new sequence starts
idx = [i for (i, l) in enumerate(lines) if l[0] == '>']
idx += [len(lines)]
for i in xrange(len(idx) - 1):
start = idx[i] + 1
stop = idx[i + 1]
sequence = ''.join(lines[start:stop])
count_kmers(sequence, k, kmers)
print kmers
Hope it helps :)

How to write an INI file with ConfigParser with duplicate options

I want to write an INI file with duplicate options,ie:
[test]
foo = value1
foo = value2
xxx = yyy
With ConfigParser.set only the last value is writed.
config = ConfigParser.ConfigParser()
config.read('example.cfg')
config.add_section('test')
config.set('test', service['foo'], service['value1'])
config.set('test', service['foo'], service['value2'])
config.set('test', service['xxx'], service['yyy'])
The result is:
[test]
foo = value2
xxx = yyy
Is there any way?

It looks like it isn't possible in a simple way. The default way ConfigParser stores is with dict's, i.e. one value per unique key.
In a similar question Python's ConfigParser unique keys per section the suggestions are to go with:
CongfigObj
Patched version of epydoc

i have a simple custom .ini parser in python (built for another project), which uses a list to store values but only if they are not in key=value format. if key=value then last key will be held since these are stored in a dictionary
The parser can also parse nested sections like:
[SECTION1][SECTION2]
key1=value1
; etc..
The code is below, it is easy to modify to store key/value in list instead of dictionary or even detect multiple key and rename to avoid collisions (e.g key, key$1 second key with same key value key and so on). use/modify as needed
##
#
# Simple .ini Parser for Python 2.x, 3.x
#
##
import re
class Ini_Parser():
"""Simple .ini parser for Python"""
NL = None
ACTUAL = {
'\\n' : "\n",
'\\t' : "\t",
'\\v' : "\v",
'\\f' : "\f"
}
def parseStr(s, q):
_self = Ini_Parser
endq = s.find(q, 1)
quoted = s[1:endq]
rem = s[endq+1:].strip()
for c,actual in _self.ACTUAL.items():
quoted = ( actual ).join( quoted.split( c ) )
quoted = ( '\\' ).join( quoted.split( '\\\\' ) )
return quoted, rem
def fromString(s, keysList=True, rootSection='_'):
_self = Ini_Parser
comments = [';', '#']
if rootSection: rootSection = str(rootSection)
else: rootSection = '_'
if not _self.NL:
_self.NL = re.compile(r'\n\r|\r\n|\r|\n')
sections = {}
currentSection = str(rootSection)
if keysList:
sections[currentSection] = { '__list__' : [] }
else:
sections[currentSection] = { }
currentRoot = sections
# parse the lines
lines = re.split(_self.NL, str(s))
# parse it line-by-line
for line in lines:
# strip the line of extra spaces
line = line.strip()
lenline = len(line)
# comment or empty line, skip it
if not lenline or (line[0] in comments): continue
linestartswith = line[0]
# section line
if '['==linestartswith:
SECTION = True
# parse any sub-sections
while '['==linestartswith:
if SECTION:
currentRoot = sections
else:
currentRoot = currentRoot[currentSection]
SECTION = False
endsection = line.find(']', 1)
currentSection = line[1:endsection]
if currentSection not in currentRoot:
if keysList:
currentRoot[currentSection] = { '__list__' : [] }
else:
currentRoot[currentSection] = { }
# has sub-section ??
line = line[endsection+1:].strip()
if not len(line): break
linestartswith = line[0]
# key-value pairs
else:
# quoted string
if '"'==linestartswith or "'"==linestartswith:
key, line = _self.parseStr(line, linestartswith)
# key-value pair
if line.find('=', 0)>-1:
line = line.split('=')
line.pop(0)
value = "=".join(line).strip()
valuestartswith = value[0]
# quoted value
if '"'==valuestartswith or "'"==valuestartswith:
value, rem = _self.parseStr(value, valuestartswith)
currentRoot[currentSection][key] = value
# single value
else:
if keysList:
currentRoot[currentSection]['__list__'].append(key)
else:
currentRoot[currentSection][key] = True
# un-quoted string
else:
line = line.split('=')
key = line.pop(0).strip()
# single value
if 1>len(line):
if keysList:
currentRoot[currentSection]['__list__'].append(key)
else:
currentRoot[currentSection][key] = True
# key-value pair
else:
value = "=".join(line).strip()
valuestartswith = value[0]
# quoted value
if '"'==valuestartswith or "'"==valuestartswith:
value, rem = _self.parseStr(value, valuestartswith)
currentRoot[currentSection][key] = value
return sections
def fromFile(filename, keysList=True, rootSection='_'):
s = ''
with open(filename, 'r') as f: s = f.read()
return Ini_Parser.fromString(s, keysList, rootSection)
def walk(o, key=None, top='', q='', EOL="\n"):
s = ''
if len(o):
o = dict(o)
if key: keys = [key]
else: keys = o.keys()
for section in keys:
keyvals = o[section]
if not len(keyvals): continue
s += str(top) + "[" + str(section) + "]" + EOL
if ('__list__' in keyvals) and len(keyvals['__list__']):
# only values as a list
s += q + (q+EOL+q).join(keyvals['__list__']) + q + EOL
del keyvals['__list__']
if len(keyvals):
for k,v in keyvals.items():
if not len(v): continue
if isinstance(v, dict) or isinstance(v, list):
# sub-section
s += Ini_Parser.walk(keyvals, k, top + "[" + str(section) + "]", q, EOL)
else:
# key-value pair
s += q+k+q+ '=' +q+v+q + EOL
s += EOL
return s
def toString(o, rootSection='_', quote=False, EOL="\n"):
s = ''
if rootSection: root = str(rootSection)
else: root = '_'
if quote: q = '"'
else: q = ''
# dump the root section first, if exists
if root in o:
section = dict(o[root])
llist = None
if '__list__' in section:
llist = section['__list__']
if llist and isinstance(llist, list) and len(llist):
s += q + (q+EOL+q).join(llist) + q + EOL
del section['__list__']
for k,v in section.items():
if not len(v): continue
s += q+k+q+ '=' +q+v+q + EOL
s += EOL
del o[root]
# walk the sections and sub-sections, if any
s += Ini_Parser.walk(o, None, '', q, EOL)
return s
def toFile(filename, o, rootSection='_', quote=False, EOL="\n"):
with open(filename, 'w') as f:
f.write( Ini_Parser.toString(o, rootSection, quote, EOL) )
# for use with 'import *'
__all__ = [ 'Ini_Parser' ]

Multidimensional arrays - PYTHON

i have problems with the array indexes in python.
at function readfile it crashes and prints: "list index out of range"
inputarr = []
def readfile(filename):
lines = readlines(filename)
with open(filename, 'r') as f:
i = 0
j= 0
k = 0
for line in f:
line = line.rstrip("\n")
if not line == '':
inputarr[j][k] = line
k += 1
#print("\tnew entry\tj=%d\tk=%d" % (j, k))
elif line == '':
k = 0
j += 1
#print("new block!\tj=%d\tk=%d" % (j, k))
i += 1
processing(i, lines)

This error is due to you trying to assign to an index of inputarr that is outside the bounds of the list. This causes an error in python (unlike some other languages like javascript which automatically extend an array if you try to access an index that is outside the initial bounds of the array).
You need to either pre-fill inputarr so it has the right shape and size, or you need to dynamically create it as you go. I prefer the latter:
inputarr = [[]]
# ^^ Set up the first row
def readfile(filename):
lines = readlines(filename)
with open(filename, 'r') as f:
i = 0
j= 0
k = 0
for line in f:
line = line.rstrip("\n")
if not line == '':
inputarr[j].append(line)
# ^^^^^^^^ Add a new value to the end of the current row of inputarr
k += 1
#print("\tnew entry\tj=%d\tk=%d" % (j, k))
elif line == '':
k = 0
inputarr.append([])
# ^^^^^^^^^^^^^^^^^^^ Add a new blank row to inputarr
j += 1
#print("new block!\tj=%d\tk=%d" % (j, k))
i += 1
processing(i, lines)

It happens because inputarr is empty. For example:
lst = []
lst[0] = 1 // error
In your case:
inputarr = []
j = 0
...
inputarr[j][k] = line // inputarr= []; j = 0; so inputarr[0] = ...!ERROR

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Sequence match using Python - python

Related

diff list of multiline strings with difflib without knowing which were added, deleted or modified

How would I format python code using python?

Python: Count kmers from fasta files

How to write an INI file with ConfigParser with duplicate options

Multidimensional arrays - PYTHON

Categories

Resources