Python: Count kmers from fasta files

Python: Count kmers from fasta files - python

I want to count the kmers from a fasta file. I have the following script:
import operator
seq = open('file', 'r')
kmers = {}
k = 5
for i in range(len(seq) - k + 1):
kmer = seq[i:i+k]
if kmer in kmers:
kmers[kmer] += 1
else:
kmers[kmer] = 1
for kmer, count in kmers.items():
print (kmer + "\t" + str(count))
sortedKmer = sorted(kmers.items(), key=itemgetter(1), reverse=True)
for item in sortedKmer:
print (item[0] + "\t" + str(item[1]))
This works fine for a file with only one sequence, but now I have a fasta file with several contigs.
My fasta file looks like this:
>1
GTCTTCCGGCGAGCGGGCTTTTCACCCGCTTTATCGTTACTTATGTCAGCATTCGCACTT
CTGATACCTCCAGCAACCCTCACAGGCCACCTTCGCAGGCTTACAGAACGCTCCCCTACC
CAACAACGCATAAACGTCGCTGCCGCAGCTTCGGTGCATGGTTTAGCCCCGTTACATCTT
CCGCGCAGGCCGACTCGACCAGTGAGCTATTACGCTTTCTTTAAATGATGGCTGCTTCTA
AGCCAACATCCTGGCTGTCTGG
>2
AAAGAAAGCGTAATAGCTCACTGGTCGAGTCGGCCTGCGCGGAAGATGTAACGGGGCTAA
ACCATGCACCGAAGCTGCGGCAGCGACACTCAGGTGTTGTTGGGTAGGGGAGCGTTCTGT
AAGCCTGTGAAGGTGGCCTGTGAGGGTTGCTGGAGGTATCAGAAGTGCGAATGCTGACAT
AAGTAACGATAAAGCGGGTGAAAAGCCCGCTCGCCGGAAGACCAAGGGTTCCTGTCCAAC
GTTAATCGGGGCAGG
How can I change the script that it take first the sequence after ">1", print that output, go to ">2", print that output etc?

I have never heard about kmer or fasta, but I think I understand what you are trying to do.
You can try to split on a regex involving '>', but I would recommend processing the file line by line and accumulate kmers before printing them appropriately when reaching the '>1'-lines. See below code with comments
import operator
def printSeq(name, seq):
# Extract your code into a function and print header for current kmer
print("%s\n################################" %name)
kmers = {}
k = 5
for i in range(len(seq) - k + 1):
kmer = seq[i:i+k]
if kmer in kmers:
kmers[kmer] += 1
else:
kmers[kmer] = 1
for kmer, count in kmers.items():
print (kmer + "\t" + str(count))
sortedKmer = sorted(kmers.items(), reverse=True)
for item in sortedKmer:
print (item[0] + "\t" + str(item[1]))
with open('file', 'r') as f:
seq = ""
key = ""
for line in f.readlines():
# Loop over lines in file
if line.startswith(">"):
# if we get '>' it is time for a new sequence
if key and seq:
# if it wasn't the first we should print it before overwriting the variables
printSeq(key, seq)
# store name after '>' and reset sequence
key = line[1:].strip()
seq = ""
else:
# accumulate kmer until we hit another '>'
seq += line.strip()
# when we are done with all the lines, print the last sequence
printSeq(key, seq)

I tried the following with your example FASTA file and it should work:
def count_kmers(seq, k, kmers):
for i in range(len(seq) - k + 1):
kmr = seq[i:i + k]
if kmr in kmers:
kmers[kmr] += 1
else:
kmers[kmr] = 1
filename = raw_input('File name/path: ')
k = input('Value for k: ')
kmers = {}
# Put each line of the file into a list (avoid empty lines)
with open(filename) as f:
lines = [l.strip() for l in f.readlines() if l.strip() != '']
# Find the line indices where a new sequence starts
idx = [i for (i, l) in enumerate(lines) if l[0] == '>']
idx += [len(lines)]
for i in xrange(len(idx) - 1):
start = idx[i] + 1
stop = idx[i + 1]
sequence = ''.join(lines[start:stop])
count_kmers(sequence, k, kmers)
print kmers
Hope it helps :)

Related

Python - Problem in appending lines to output file with specific condition

My Problem is the following:
I have a file with lines that normally start with 'ab', condition is when line not start with ab it should be appended to previous line, but some lines are not get appended to output file
Source File:
grpid;UserGroup;Name;Description;Owner;Visibility;Members -> heading
ab;user1;name1;des1;bhalji;public
sss
ddd
fff
ab;user2;name2;des2;bhalji;private -> not appended in output
ab;user3;name3;des3;bhalji;public -> not appended in output
ab;user4;name4;des4;bhalji;private
rrr
ttt
yyy
uuu
ab;user5;name5;des5;bhalji;private
ttt
ooo
ppp
Here's is what I'm doing using python:
def grouping():
output = []
temp = []
currIdLine = ""
with( open ('usergroups.csv', 'r')) as f:
for lines in f.readlines():
line = lines.strip()
if not line:
print("Skipping empty line")
continue
if line.startswith('grpid'):
output.append(line)
continue
if line.startswith('ab'):
if temp:
output.append(currIdLine + ";" + ','.join(temp))
temp.clear()
currIdLine = line
else:
temp.append(line)
output.append(currIdLine + ";" + ','.join(temp))
#print("\n".join(output))
with open('new.csv', 'w') as f1:
for row in output:
f1.write(row + '\n')
grouping ()
Output of the above code:
grpid;UserGroup;Name;Description;Owner;Visibility;Members
ab;user1;name1;des1;bhalji;public;sss,ddd,fff
ab;user4;name4;des4;bhalji;private;rrr,ttt,yyy,uuu
ab;user5;name5;des5;bhalji;private;ttt,ooo,ppp
I hope this should be quite easy with Python but I'm not getting it right so far.
That's how the file should look at the end:
Expected Output:
grpid;UserGroup;Name;Description;Owner;Visibility;Members
ab;user1;name1;des1;bhalji;public;sss,ddd,fff
ab;user2;name2;des2;bhalji;private
ab;user3;name3;des3;bhalji;public
ab;user4;name4;des4;bhalji;private;rrr,ttt,yyy,uuu
ab;user5;name5;des5;bhalji;private;ttt,ooo,ppp

You are close. Missing an else statement to deal with the case that the line starts with ab, and the previous line started with ab.
def grouping():
output = []
temp = []
currIdLine = ""
with(open('usergroups.csv', 'r')) as f:
header = f.readline()
output.append(line.strip())
temp = []
for line in f.readlines():
if not line:
print("Skipping empty line")
continue
if line.startswith('ab'):
if temp:
output.append(currIdLine + ";" + ','.join(temp))
temp = []
else:
output.append(currIdLine)
currIdLine = line.strip()
else:
temp.append(line.strip())
if temp:
output.append(currIdLine + ";" + ','.join(temp))
else: # <-- this block is needed
output.append(currIdLine)
with open('new.csv', 'w') as f1:
for row in output:
f1.write(row + '\n')

diff list of multiline strings with difflib without knowing which were added, deleted or modified

I have two lists of multiline strings and I try to get the the diff lines for these strings. First I tried to just split all lines of each string and handled all these strings as one big "file" and get the diff for it but I had a lot of bugs. I cannot just diff by index since I do not know, which multiline string was added, which was deleted and which one was modified.
Lets say I had the following example:
import difflib
oldList = ["one\ntwo\nthree","four\nfive\nsix","seven\neight\nnine"]
newList = ["four\nfifty\nsix","seven\neight\nnine","ten\neleven\ntwelve"]
oldAllTogether = []
for string in oldList:
oldAllTogether.extend(string.splitlines())
newAllTogether = []
for string in newList:
newAllTogether.extend(string.splitlines())
diff = difflib.unified_diff(oldAllTogether,newAllTogether)
So I somehow have to find out, which strings belong to each other.

I had to implmenent my own code in order to get the desired output. It is basically the same as Differ.compare() with the difference that we have a look at multiline blocks instead of lines. So the code would be:
diffString = ""
oldList = ["one\ntwo\nthree","four\nfive\nsix","seven\neight\nnine"]
newList = ["four\nfifty\nsix","seven\neight\nnine","ten\neleven\ntwelve"]
a = oldList
b = newList
cruncher = difflib.SequenceMatcher(None, a, b)
for tag, alo, ahi, blo, bhi in cruncher.get_opcodes():
if tag == 'replace':
best_ratio, cutoff = 0.74, 0.75
oldstrings = a[alo:ahi]
newstrings = b[blo:bhi]
for j in range(len(newstrings)):
newstring = newstrings[j]
cruncher.set_seq2(newstring)
for i in range(len(oldstrings)):
oldstring = oldstrings[i]
cruncher.set_seq1(oldstring)
if cruncher.real_quick_ratio() > best_ratio and \
cruncher.quick_ratio() > best_ratio and \
cruncher.ratio() > best_ratio:
best_ratio, best_old, best_new = cruncher.ratio(), i, j
if best_ratio < cutoff:
#added string
stringLines = newstring.splitlines()
for line in stringLines: diffString += "+" + line + "\n"
else:
#replaced string
start = False
for diff in difflib.unified_diff(oldstrings[best_old].splitlines(),newstrings[best_new].splitlines()):
if start:
diffString += diff + "\n"
if diff[0:2] == '##':
start = True
del oldstrings[best_old]
#deleted strings
stringLines = []
for string in oldstrings:
stringLines.extend(string.splitlines())
for line in stringLines: diffString += "-" + line + "\n"
elif tag == 'delete':
stringLines = []
for string in a[alo:ahi]:
stringLines.extend(string.splitlines())
for line in stringLines:
diffString += "-" + line + "\n"
elif tag == 'insert':
stringLines = []
for string in b[blo:bhi]:
stringLines.extend(string.splitlines())
for line in stringLines:
diffString += "+" + line + "\n"
elif tag == 'equal':
continue
else:
raise ValueError('unknown tag %r' % (tag,))
which result in the following:
print(diffString)
four
-five
+fifty
six
-one
-two
-three
+ten
+eleven
+twelve

I am having Problem in output of reguler expression result is( 'IP-Subnet/mask',) but i want( IP-Subnet/mask) in out put

Below is my code i am reading .txt file then making a list and then saving it to excel but in excel i am getting ('ip subnet/mask', ) but i want only (ip subnet/mask) in out put
Below are my code blocks
1.I read routing Table output from Txt file and create a list
2.then from 10.0.0.0/8 address space i remove routing table sybnets
3.I save the available IP,s to Available.txt file
4.Create List from Available.txt file
5.then i create excel file and then i save the list out put to excel in specific 10.x.x.x/16 sheet
import os
import re
import xlsxwriter
from netaddr import *
from openpyxl import load_workbook
def ip_adresses():
lst = []
for line in fstring:
for word in line.split():
result = pattern.search(word)
if result:
lst.append(word)
return lst
def write_excel(aaa, bbb, num):
bbb = sorted(bbb)
work11 = load_workbook(r'C:\Users\irfan\PycharmProjects\pythonProject\irfan4.xlsx')
sheet11 = work11[aaa]
count = sheet11.max_row
max1 = sheet11.max_row
for row1, entry in enumerate(bbb, start=1):
sheet11.cell(row=row1 + max1, column=1, value=entry)
work11.save("irfan4.xlsx")
os.chdir(r'C:\Users\irfan\PycharmProjects\pythonProject')
file = open('RR-ROUTING TABLE.txt')
fstring = file.readlines()
# declaring the regex pattern for IP addresses
pattern = re.compile(r'(10\.\d{1,3}\.\d{1,3}\.\d{1,3}[/])')
# initializing the list object
unique = []
# extracting the IP addresses
IPs = ip_adresses()
unique = list(dict.fromkeys(IPs))
ipv4_addr_space = IPSet(['10.0.0.0/8'])
ip_list = IPSet(list(unique))
print(ip_list)
available = ipv4_addr_space ^ ip_list
print()
f = open("Available.txt", "a")
f.write(str(available))
f.close
print(available)
workbook = xlsxwriter.Workbook('irfan4.xlsx')
worksheet = workbook.add_worksheet()
for row_num, data in enumerate(available):
worksheet.write(row_num, 0, data)
num = 0
while num <= 255:
worksheet = workbook.add_worksheet("10." + str(num) + ".0.0")
num += 1
workbook.close()
# CREATE AUDIT BOOK
##################################################
os.chdir(r'C:\Users\irfan\PycharmProjects\pythonProject')
file_2 = open('Available.txt')
fstring_2 = file_2.readlines()
def ip_adresses1():
lst = []
for line in fstring_2:
for word in line.split():
result = pattern.search(word)
if result:
lst.append(word)
return lst
List_A=ip_adresses1()
print(List_A[1])
get_list = []
num = 0
while num <= 255:
pattern_sheet = re.compile(r'(10\.' + str(num) + '\.\d{1,3}\.\d{1,3}[/])')
for get_ips in fstring_2:
result_ip = pattern_sheet.search(get_ips)
if result_ip:
get_list.append(get_ips)
sheet_name = ("10." + str(num) + ".0.0")
write_excel(sheet_name, get_list, num)
get_list = []
num += 1
enter code here

I have used re.sub function to remove characters from string:
def ip_adresses1():
lst = []
for line in fstring_2:
for word in line.split():
word = re.sub("IPSet", " ", word)
word = re.sub(",", " ", word)
word = re.sub("'", " ", word)
word = re.sub("\(", " ", word)
word = re.sub("\)", " ", word)
word = re.sub("\]", " ", word)
word = re.sub("\[", " ", word)
result = pattern.search(word)
if result:
lst.append(word)
return lst

Sequence match using Python

I am working on RNA sequence matching
seq = 'UCAGCUGUCAGUCAUGAUC'
sub_seq =['UGUCAG', 'CAGUCA', 'UCAGCU','GAUC']
I am matching the sub_seq to the seq, matched sub_seq is under the seq, if there is no matched, use dash line. Output looks like this:
UCAGCUGUCAGUCAUGAUC
UCAGCU--CAGUCA-GAUC
-----UGUCAG--------
I try to use the dictionary to do this
index_dict = {}
for i in xrange(len(sub_seq)):
index_dict[seq.find(sub_seq[i])] = {}
index_dict[seq.find(sub_seq[i])]['sequence'] = sub_seq[i]
index_dict[seq.find(sub_seq[i])]['end_index'] = seq.find(sub_seq[i]) + len(sub_seq[i]) - 1
I cannot figure out the algorithm to do alignment, any help will be appreciated!

seq_l = len(seq)
for ele in sub_seq:
start = seq.find(ele)
ln = len(ele)
if start != -1:
end = start + ln
print("-" * start + ele + "-"*(seq_l- end))
else:
print("-" * seq_l)
-----UGUCAG--------
--------CAGUCA-----
UCAGCU-------------
---------------GAUC
Not sure where UCAGCU--CAGUCA-GAUC comes from as you are only using a single sub sequence at a time in your code

Assuming you'll let me change your index_dict slightly, consider:
seq = 'UCAGCUGUCAGUCAUGAUC'
sub_seq =['UGUCAG', 'CAGUCA', 'UCAGCU','GAUC']
index_dict = {}
for i in xrange(len(sub_seq)):
index_dict[seq.find(sub_seq[i])] = {
'sequence': sub_seq[i],
'end_index': seq.find(sub_seq[i]) + len(sub_seq[i]) # Note this changed
}
sorted_keys = sorted(index_dict)
lines = []
while True:
if not sorted_keys: break
line = []
next_index = 0
for k in sorted_keys:
if k >= next_index:
line.append(k)
next_index = index_dict[k]['end_index']
# Remove keys we used, append line to lines
for k in line: sorted_keys.remove(k)
lines.append(line)
# Build output lines
olines = []
for line in lines:
oline = ''
for k in line:
oline += '-' * (k - len(oline)) # Add dashes before subseq
oline += index_dict[k]['sequence'] # Add subsequence
oline += '-' * (len(seq) - len(oline)) # Add trailing dashes
olines.append(oline)
print seq
print '\n'.join(olines)
Output:
UCAGCUGUCAGUCAUGAUC
UCAGCU--CAGUCA-GAUC
-----UGUCAG--------
Note this is pretty verbose, and could be condensed a bit. The while True and for line in lines loops could probably be merged into one, but it should help explain one possible approach.
Edit: This is one way you might join the last two loops:
seq = 'UCAGCUGUCAGUCAUGAUC'
sub_seq =['UGUCAG', 'CAGUCA', 'UCAGCU','GAUC']
index_dict = {}
for i in xrange(len(sub_seq)):
index_dict[seq.find(sub_seq[i])] = {
'sequence': sub_seq[i],
'end_index': seq.find(sub_seq[i]) + len(sub_seq[i]) # Note this changed
}
sorted_keys = sorted(index_dict)
lines = []
while True:
if not sorted_keys: break
line = ''
next_index = 0
keys_used = []
for k in sorted_keys:
if k >= next_index:
line += '-' * (k - len(line)) # Add dashes before subseq
line += index_dict[k]['sequence'] # Add subsequence
next_index = index_dict[k]['end_index'] # Update next_index
keys_used.append(k) # Mark key as used
for k in keys_used: sorted_keys.remove(k) # Remove used keys
line += '-' * (len(seq) - len(line)) # Add trailing dashes
lines.append(line) # Add line to lines
print seq
print '\n'.join(lines)
Output:
UCAGCUGUCAGUCAUGAUC
UCAGCU--CAGUCA-GAUC
-----UGUCAG--------

Multidimensional arrays - PYTHON

i have problems with the array indexes in python.
at function readfile it crashes and prints: "list index out of range"
inputarr = []
def readfile(filename):
lines = readlines(filename)
with open(filename, 'r') as f:
i = 0
j= 0
k = 0
for line in f:
line = line.rstrip("\n")
if not line == '':
inputarr[j][k] = line
k += 1
#print("\tnew entry\tj=%d\tk=%d" % (j, k))
elif line == '':
k = 0
j += 1
#print("new block!\tj=%d\tk=%d" % (j, k))
i += 1
processing(i, lines)

This error is due to you trying to assign to an index of inputarr that is outside the bounds of the list. This causes an error in python (unlike some other languages like javascript which automatically extend an array if you try to access an index that is outside the initial bounds of the array).
You need to either pre-fill inputarr so it has the right shape and size, or you need to dynamically create it as you go. I prefer the latter:
inputarr = [[]]
# ^^ Set up the first row
def readfile(filename):
lines = readlines(filename)
with open(filename, 'r') as f:
i = 0
j= 0
k = 0
for line in f:
line = line.rstrip("\n")
if not line == '':
inputarr[j].append(line)
# ^^^^^^^^ Add a new value to the end of the current row of inputarr
k += 1
#print("\tnew entry\tj=%d\tk=%d" % (j, k))
elif line == '':
k = 0
inputarr.append([])
# ^^^^^^^^^^^^^^^^^^^ Add a new blank row to inputarr
j += 1
#print("new block!\tj=%d\tk=%d" % (j, k))
i += 1
processing(i, lines)

It happens because inputarr is empty. For example:
lst = []
lst[0] = 1 // error
In your case:
inputarr = []
j = 0
...
inputarr[j][k] = line // inputarr= []; j = 0; so inputarr[0] = ...!ERROR

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Python: Count kmers from fasta files - python

Related

Python - Problem in appending lines to output file with specific condition

diff list of multiline strings with difflib without knowing which were added, deleted or modified

I am having Problem in output of reguler expression result is( 'IP-Subnet/mask',) but i want( IP-Subnet/mask) in out put

Sequence match using Python

Multidimensional arrays - PYTHON

Categories

Resources