Python: losing nucleotides from fasta file to dictionary

Python: losing nucleotides from fasta file to dictionary - python

I am trying to write a code to extract longest ORF in a fasta file. It is from Coursera Genomics data science course.
the file is a practice file: "dna.example.fasta"
Data is here:https://d396qusza40orc.cloudfront.net/genpython/data_sets/dna.example.fasta
Part of my code is below to extract reading frame 2 (start from the second position of a sequence. eg: seq: ATTGGG, to get reading frame 2: TTGGG):
#!/usr/bin/python
import sys
import getopt
o, a = getopt.getopt(sys.argv[1:], 'h')
opts = dict()
for k,v in o:
opts[k] = v
if '-h' in k:
print "--help\n"
if len(a) < 0:
print "missing fasta file\n"
f = open(a[0], "r")
seq = dict()
for line in f:
line = line.strip()
if line.startswith(">"):
name = line.split()[0]
seq[name] = ''
else:
seq[name] = seq[name] + line[1:]
k = seq[">gi|142022655|gb|EQ086233.1|323"]
print len(k)
The length of this particular sequence should be 4804 bp. Therefore by using this sequence alone I could get the correct answer.
However, with the code, here in the dictionary, this particular sequence becomes only 4736 bp.
I am new to python, so I can not wrap my head around as to where did those 100 bp go?
Thank you,
Xio

Take another look at your data file
An example of some of the lines:
>gi|142022655|gb|EQ086233.1|43 marine metagenome JCVI_SCAF_1096627390048 genomic scaffold, whole genome shotgun sequence
TCGGGCGAAGGCGGCAGCAAGTCGTCCACGCGCAGCGCGGCACCGCGGGCCTCTGCCGTGCGCTGCTTGG
CCATGGCCTCCAGCGCACCGATCGGATCAAAGCCGCTGAAGCCTTCGCGCATCAGGCGGCCATAGTTGGC
Notice how the sequences start on the first value of each line.
Your addition line seq[name] = seq[name] + line[1:] is adding everything on that line after the first character, excluding the first (Python 2 indicies are zero based). It turns out your missing number of nucleotides is the number of lines it took to make that genome, because you're losing the first character every time.
The revised way is seq[name] = seq[name] + line which simply adds the line without losing that first character.
The quickest way to find these kind of debugging errors is to either use a formal debugger, or add a bunch of print statements on your code and test with a small portion of the file -- something that you can see the output of and check for yourself if it's coming out right. A short file with maybe 50 nucleotides instead of 5000 is much easier to evaluate by hand and make sure the code is doing what you want. That's what I did to come up with the answer to the problem in about 5 minutes.
Also for future reference, please mention the version of python you are using before hand. There are quite a few differences between python 2 (The one you're using) and python 3.
I did some additional testing with your code, and if you get any extra characters at the end, they might be whitespace. Make sure you use the .strip() method on each line before adding it to your string, which clears whitespace.
Addressing your comment,
To start from the 2nd position on the first line of the sequence only and then use the full lines until the following nucleotide, you can take advantage of the file's linear format and just add one more clause to your if statement, an elif. This will test if we're on the first line of the sequence, and if so, use the characters starting from the second, if we're on any other line, use the whole line.
if line.startswith(">"):
name = line.split()[0]
seq[name] = ''
#If it's the first line in the series, then the dict's value
# will be an empty string, so this elif means "If we're at the
# start of the series..."
elif seq[name] == '':
seq[name] = seq[name] + line[1:]
else:
seq[name] = seq[name]
This adaptation will start from the 2nd nucleotide in the genome without losing the first from every line in the rest of the nucleotide.

Related

Regular Expression to find valid words in file

I need to write a function get_specified_words(filename) to get a list of lowercase words from a text file. All of the following conditions must be applied:
Include all lower-case character sequences including those that
contain a - or ' character and those that end with a '
character.
Exclude words that end with a -.
The function must only process lines between the start and end marker lines
Use this regular expression to extract the words from each relevant line of a file: valid_line_words = re.findall("[a-z]+[-'][a-z]+|[a-z]+[']?|[a-z]+", line)
Ensure that the line string is lower case before using the regular expression.
Use the optional encoding parameter when opening files for reading. That is your open file call should look like open(filename, encoding='utf-8'). This will be especially helpful if your operating system doesn't set Python's default encoding to UTF-8.
The sample text file testing.txt contains this:
That are after the start and should be dumped.
So should that
and that
and yes, that
*** START OF SYNTHETIC TEST CASE ***
Toby's code was rather "interesting", it had the following issues: short,
meaningless identifiers such as n1 and n; deep, complicated nesting;
a doc-string drought; very long, rambling and unfocused functions; not
enough spacing between functions; inconsistent spacing before and
after operators, just like this here. Boy was he going to get a low
style mark.... Let's hope he asks his friend Bob to help him bring his code
up to an acceptable level.
*** END OF SYNTHETIC TEST CASE ***
This is after the end and should be ignored too.
Have a nice day.
Here's my code:
import re
def stripped_lines(lines):
for line in lines:
stripped_line = line.rstrip('\n')
yield stripped_line
def lines_from_file(fname):
with open(fname, 'rt') as flines:
for line in stripped_lines(flines):
yield line
def is_marker_line(line, start='***', end='***'):
min_len = len(start) + len(end)
if len(line) < min_len:
return False
return line.startswith(start) and line.endswith(end)
def advance_past_next_marker(lines):
for line in lines:
if is_marker_line(line):
break
def lines_before_next_marker(lines):
valid_lines = []
for line in lines:
if is_marker_line(line):
break
valid_lines.append(re.findall("[a-z]+[-'][a-z]+|[a-z]+[']?|[a-z]+", line))
for content_line in valid_lines:
yield content_line
def lines_between_markers(lines):
it = iter(lines)
advance_past_next_marker(it)
for line in lines_before_next_marker(it):
yield line
def words(lines):
text = '\n'.join(lines).lower().split()
return text
def get_valid_words(fname):
return words(lines_between_markers(lines_from_file(fname)))
# This must be executed
filename = "valid.txt"
all_words = get_valid_words(filename)
print(filename, "loaded ok.")
print("{} valid words found.".format(len(all_words)))
print("word list:")
print("\n".join(all_words))
Here's my output:
File "C:/Users/jj.py", line 45, in <module>
text = '\n'.join(lines).lower().split()
builtins.TypeError: sequence item 0: expected str instance, list found
Here's the expected output:
valid.txt loaded ok.
73 valid words found.
word list:
toby's
code
was
rather
interesting
it
had
the
following
issues
short
meaningless
identifiers
such
as
n
and
n
deep
complicated
nesting
a
doc-string
drought
very
long
rambling
and
unfocused
functions
not
enough
spacing
between
functions
inconsistent
spacing
before
and
after
operators
just
like
this
here
boy
was
he
going
to
get
a
low
style
mark
let's
hope
he
asks
his
friend
bob
to
help
him
bring
his
code
up
to
an
acceptable
level
I need help with getting my code to work. Any help is appreciated.

lines_between_markers(lines_from_file(fname))
gives you a list of list of valid words.
So you just need to flatten it :
def words(lines):
words_list = [w for line in lines for w in line]
return words_list
Does the trick.
But I think that you should review the design of your program :
lines_between_markers should only yield lines between markers, but it does more. Regexp should be use on the result of this function and not inside the function.
What you didn't do :
Ensure that the line string is lower case before using the regular expression.
Use the optional encoding parameter when opening files for reading.
That is your open file call should look like open(filename,
encoding='utf-8').

Having problems with strings and arrays

I want to read a text file and copy text that is in between '~~~~~~~~~~~~~' into an array. However, I'm new in Python and this is as far as I got:
with open("textfile.txt", "r",encoding='utf8') as f:
searchlines = f.readlines()
a=[0]
b=0
for i,line in enumerate(searchlines):
if '~~~~~~~~~~~~~' in line:
b=b+1
if '~~~~~~~~~~~~~' not in line:
if 's1mb4d' in line:
break
a.insert(b,line)
This is what I envisioned:
First I read all the lines of the text file,
then I declare 'a' as an array in which text should be added,
then I declare 'b' because I need it as an index. The number of lines in between the '~~~~~~~~~~~~~' is not even, that's why I use 'b' so I can put lines of text into one array index until a new '~~~~~~~~~~~~~' was found.
I check for '~~~~~~~~~~~~~', if found I increase 'b' so I can start adding lines of text into a new array index.
The text file ends with 's1mb4d', so once its found, the program ends.
And if '~~~~~~~~~~~~~' is not found in the line, I add text to the array.
But things didn't go well. Only 1 line of the entire text between those '~~~~~~~~~~~~~' is being copied to the each array index.
Here is an example of the text file:
~~~~~~~~~~~~~
Text123asdasd
asdasdjfjfjf
~~~~~~~~~~~~~
123abc
321bca
gjjgfkk
~~~~~~~~~~~~~

You could use regex expression, give a try to this:
import re
input_text = ['Text123asdasd asdasdjfjfjf','~~~~~~~~~~~~~','123abc 321bca gjjgfkk','~~~~~~~~~~~~~']
a = []
for line in input_text:
my_text = re.findall(r'[^\~]+', line)
if len(my_text) != 0:
a.append(my_text)
What it does is it reads line by line looks for all characters but '~' if line consists only of '~' it ignores it, every line with text is appended to your a list afterwards.
And just because we can, oneliner (excluding import and source ofc):
import re
lines = ['Text123asdasd asdasdjfjfjf','~~~~~~~~~~~~~','123abc 321bca gjjgfkk','~~~~~~~~~~~~~']
a = [re.findall(r'[^\~]+', line) for line in lines if len(re.findall(r'[^\~]+', line)) != 0]

In python the solution to a large part of problems is often to find the right function from the standard library that does the job. Here you should try using split instead, it should be way easier.
If I understand correctly your goal, you can do it like that :
joined_lines = ''.join(searchlines)
result = joined_lines.split('~~~~~~~~~~')
The first line joins your list of lines into a sinle string, and then the second one cut that big string every times it encounters the '~~' sequence.

I tried to clean it up to the best of my knowledge, try this and let me know if it works. We can work together on this!:)
with open("textfile.txt", "r",encoding='utf8') as f:
searchlines = f.readlines()
a = []
currentline = ''
for i,line in enumerate(searchlines):
currentline += line
if '~~~~~~~~~~~~~' in line:
a.append(currentline)
elif 's1mb4d' in line:
break
Some notes:
You can use elif for your break function
Append will automatically add the next iteration to the end of the array
currentline will continue to add text on each line as long as it doesn't have 's1mb4d' or the ~~~ which I think is what you want

s = ['']
with open('path\\to\\sample.txt') as f:
for l in f:
a = l.strip().split("\n")
s += a
a = []
for line in s:
my_text = re.findall(r'[^\~]+', line)
if len(my_text) != 0:
a.append(my_text)
print a
>>> [['Text123asdasd asdasdjfjfjf'], ['123abc 321bca gjjgfkk']]

If you're willing to impose/accept the constraint that the separator should be exactly 13 ~ characters (actually '\n%s\n' % ( '~' * 13) to be specific) ...
then you could accomplish this for relatively normal sized files using just
#!/usr/bin/python
## (Should be #!/usr/bin/env python; but StackOverflow's syntax highlighter?)
separator = '\n%s\n' % ('~' * 13)
with open('somefile.txt') as f:
results = f.read().split(separator)
# Use your results, a list of the strings separated by these separators.
Note that '~' * 13 is a way, in Python, of constructing a string by repeating some smaller string thirteen times. 'xx%sxx' % 'YY' is a way to "interpolate" one string into another. Of course you could just paste the thirteen ~ characters into your source code ... but I would consider constructing the string as shown to make it clear that the length is part of the string's specification --- that this is part of your file format requirements ... and that any other number of ~ characters won't be sufficient.
If you really want any line of any number of ~ characters to serve as a separator than you'll want to use the .split() method from the regular expressions module rather than the .split() method provided by the built-in string objects.
Note that this snippet of code will return all of the text between your separator lines, including any newlines they include. There are other snippets of code which can filter those out. For example given our previous results:
# ... refine results by filtering out newlines (replacing them with spaces)
results = [' '.join(each.split('\n')) for each in results]
(You could also use the .replace() string method; but I prefer the join/split combination). In this case we're using a list comprehension (a feature of Python) to iterate over each item in our results, which we're arbitrarily naming each), performing our transformation on it, and the resulting list is being boun back to the name results; I highly recommend learning and getting comfortable with list comprehension if you're going to learn Python. They're commonly used and can be a bit exotic compared to the syntax of many other programming and scripting languages).
This should work on MS Windows as well as Unix (and Unix-like) systems because of how Python handles "universal newlines." To use these examples under Python 3 you might have to work a little on the encodings and string types. (I didn't need to for my Python3.6 installed under MacOS X using Homebrew ... but just be forewarned).

complex regex matches in python

I have a txt file that contains the following data:
chrI
ATGCCTTGGGCAACGGT...(multiple lines)
chrII
AGGTTGGCCAAGGTT...(multiple lines)
I want to first find 'chrI' and then iterate through the multiple lines of ATGC until I find the xth char. Then I want to print the xth char until the yth char. I have been using regex but once I have located the line containing chrI, I don't know how to continue iterating to find the xth char.
Here is my code:
for i, line in enumerate(sacc_gff):
for match in re.finditer(chromo_val, line):
print(line)
for match in re.finditer(r"[ATGC]{%d},{%d}\Z" % (int(amino_start), int(amino_end)), line):
print(match.group())
What the variables mean:
chromo_val = chrI
amino_start = (some start point my program found)
amino_end = (some end point my program found)
Note: amino_start and amino_end need to be in variable form.
Please let me know if I could clarify anything for you, Thank you.

It looks like you are working with fasta data, so I will provide an answer with that in mind, but if it isn't you can use the sub_sequence selection part still.
fasta_data = {} # creates an empty dictionary
with open( fasta_file, 'r' ) as fh:
for line in fh:
if line[0] == '>':
seq_id = line.rstrip()[1:] # strip newline character and remove leading '>' character
fasta_data[seq_id] = ''
else:
fasta_data[seq_id] += line.rstrip()
# return substring from chromosome 'chrI' with a first character at amino_start up to but not including amino_end
sequence_string1 = fasta_data['chrI'][amino_start:amino_end]
# return substring from chromosome 'chrII' with a first character at amino_start up to and including amino_end
sequence_string2 = fasta_data['chrII'][amino_start:amino_end+1]
fasta format:
>chr1
ATTTATATATAT
ATGGCGCGATCG
>chr2
AATCGCTGCTGC

Since you are working with fasta files which are formatted like this:
>Chr1
ATCGACTACAAATTT
>Chr2
ACCTGCCGTAAAAATTTCC
and are a bioinformatics major I am guessing you will be manipulating sequences often I recommend install the perl package called FAST. Once this is installed to get the 2-14 character of every sequence you would do this:
fascut 2..14 fasta_file.fa
Here is the recent publication for FAST and github that contains a whole toolbox for manipulating molecule sequence data on the command line.

Adding non-duplicate strings from one txt to another in Python3.3

I have 2 text files (new.txt and master.txt). Each has different data stored as such:
Cory 12 12:40:12.016221
Suzy 64 12:40:33.404614
Trent 145 12:40:56.640052
(catagorised by the first set of numbers appearing on each line)
I have to scan each line of new.txt for the name (e.g. Suzy), check if there is a duplicate in master.txt and if there isn't, then I add that line to master.txt catagorized by that line's number (e.g. 64 in Suzy 64 12:40:33.404614).
I have written the following script, but it falls into a loop of checking the 1st line of new.txt (I know why, I just don't know how to work around not closing fileinput.input(new.txt) so that I can then open fileinput.input(master.txt) further down the loop). I feel like I've highly over complicated things for myself and any help is appreciated.
import fileinput
import re
end_of_file = False
while end_of_file == False:
for line in fileinput.input('new.txt', inplace=1):
end_of_file = fileinput.isstdin() #ends while loop if on last line of new.txt
user_f_line_list = line.split()
master_f = open('master.txt', 'r')
master_f_read = master_f.read()
master_f.close()
fileinput.close()
if not re.findall(user_f_line_list[0], master_f_read):
for line in fileinput.input('master.txt', inplace=1):
master_line_list = line.split()
if int(user_f_line_list[1]) <= int(master_line_list[1]):
written = False
while written == False:
written = True
print(' '.join(user_f_line_list))
print(line, end='')
fileinput.close()
And for reference, master.txt starts with startline 0 and ends with endline 1000000000000000 so that it is impossible for the categorizing to be out of range.

Some suggestions:
Open master.txt into a list with readlines().
Use an OrderedDict from the collections module - it is the same as a regular dict but preserves the order. Make each key the unique element - a tuple in this case (e.g. ("Cory", 12)). Make the value whatever comes after.
Now you can very rapidly check to see if the entry is present by if key in my_dict:.
If it isn't, you can insert it. If you need to insert in order, it'll take a bit more work, but not too much. I would insert in the end, convert to a list when all is done, and apply a sort function to the list with a custom function to specify how to sort.
Output it back to the file.
I won't say it's necessarily shorter than your solution, but it is a lot cleaner.

Combine strings, extract substrings

(I'm using python)
I'm working with a large file of RNA sequences, and I'm trying to reformat it to use in a clustering program. My file is made up of two types of 'lines.' 1) Accession numbers for bacteria, (period) the nucleotide this sequence starts at, (period) the nucleotide it ends at. 2) lines of the actual sequence itself (across multiple lines, even though it's a continuous sequence):
>A45315.1.1521\n
GACGAACGCUGGCGGCGUGCCUAAUACAUGCAAGUCGAGCGCAGGAAGCCGGCGGAUCCC\n
UUCGGGGUGAANCCGGUGGAAUGAGCGGCGGACGGGUGAGUAACACGUGGGCAACCUACC\n
UUGUAGACUGGGAUAACUCCGGGAAACCGGGGCUAAUACCGGAUGAUCAUUUGGAUCGCAU\n
GAUCCGAAUGUAAAAGUGGGGAUUUAUCCUCACACUGCAAGAUGGGCCCGCGGCGCA…..
>A93610.15.1301\n
CCACUGCUAUGGGGGUCCGACUAAGCCAUGCGAGUCAUGGGGUCCCUCUGGGACACCACC\n
GGCGGACGGCUCAGUAACACGUCGGUAACCUACCCUCGGGAGGGGGAUAACCCCGGGAAA\n
CUGGGGCUAAUCCCCCAUAGGCCUGAGGUACUGGAAGGUCCUCAGGCCGAAAGGGGCUU….
I need to create something that looks at the lines that start with >, and go to the number after the first decimal (so above that would be 1 and 15). Starting a count at that number (so 1 or 15 in the above example), it needs to extract the nucleotides (As,Cs,Gs or Us) that start at 69 and go to 497 (note for this example I took out a bunch of the nucleotides).
So, for my attempt, I thought it would make sense to make the nucleotide sequences into one long string, and then try to extract the nucleotides. But I can't seem to make the lines of RNA sequences into one long string (see below for what I tried). And once I have the large string, I'm not sure how to extract the right nucleotides. I need to write something like s = [x:497], where x is 69-(insert that number before the first decimal).
#!/usr/bin/env python
#Make a program that takes SSURef_NR99 file of sequences, makes a new file of
#Accession numbers and size of 16S.
import re
infilename = 'SSUtestdata.txt'
outfilename = 'SSUtestdata3.txt'
#Here I'm trying to search for one of the nucleotides, an end of line character and another nucleotide, trying to make a long string.
replace = re.compile(r'([A|C|G|U])(\n)([A|C|G|U])')
#remove extra letters and spaces
with open(infilename, 'r') as infile, open(outfilename, 'w') as outfile:
for line in infile:
line = replace.sub(r'\1\3', line)
#Write to OutFile
outfile.write(line)
Thank you for any ideas you might have!

If I understand your problem correctly, this should do it:
with open('path/to/input') as infile:
while 1:
try:
line = infile.readline()
_, start, end = line.strip().split('.')
start, end = int(start), int(end)
beg = infile.read(start-1)
infile.read(beg.count('\n'))
seq = infile.read(end-start)
extra = infile.read(seq.count('\n'))
seq = seq.replace('\n') + extra
print seq # print(seq) in python3
except:
break

Perhaps something like this, although not as elegant as #inspectorG4dget's solution.
with open(infilename) as infile:
nucStart=69
nucStop=497
nucleotides=[]
for line in infile:
if line.startswith(">"):
# process the previous list if populated
if len(nucleotides) > 0:
nucleotides = ''.join(nucleotides) # make a single string
# write out the accession information and the nucleotides we want
outfile.write("%s %s" % (accession_line,
nucleotides[nucStart-start-1:nucStop-start]))
nucleotides=[] # clear it for the next run
# this is the start of the next sequence
accession_line = line
start = int(line.split('.')[1])
else:
# this is a line containing a partial nucleotide sequence, so add it
nucleotides.append(line)

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Python: losing nucleotides from fasta file to dictionary - python

Related

Regular Expression to find valid words in file

Having problems with strings and arrays

complex regex matches in python

Adding non-duplicate strings from one txt to another in Python3.3

Combine strings, extract substrings

Categories

Resources