What does .readline() return in Python?

EDIT of entire post, in order to be more clear of the problem:
s = "GATATATGCATATACTT"
t = "ATAT"
for i in range(len(s)):
if t == s[i:i+len(t)]:
print i+1,
So the purpose of the program above is to scan through the long string of DNA (s) with the short string of DNA (t), in order to find the positions in s at which t matches. The output of the above code is:
2 4 10  # These are the positions in string s where string t matches; as can be seen in the code above, it prints i+1 to give 1-based numbering.
The problem I'm having is that when I try to change the code to receive the values for s and t from a file, the readline() function is not working for me. The motif.txt file contains two strings of DNA, one on each line.
with open('txt/motif.txt', 'r') as f:
    s = f.readline()
    t = f.readline()

for i in range(len(s)):
    if t == s[i:i+len(t)]:
        print i+1,
This code, on the other hand, outputs nothing at all. But when I change t to:
t = f.readline().strip()
Then the program outputs the same result as the first example did.
So I hope this has made things clearer. My question is thus: if readline() returns a string, why isn't my program in example 2 working the same way as the very first example?

Your problem statement is wrong: there's no way s or t has more content (and len(s) > 0 or len(t) > 0) in the first example than in the second.
Basically, with:
s = f.readline()
then s will contain a string like "foobar \n", and thus len(s) will be 8 (the six letters plus the trailing space and newline).
Then with:
s = f.readline().strip()
With the same string, len(s) will be 6 because the stripped string is "foobar".
So if your line is nothing but whitespace, like s = " \n", then s.strip() will be the empty string "", with len(s) == 0.
In that case your loop body never runs and nothing is ever printed.
In almost all the other cases I can think of, you should get an exception raised, not a silent exit.
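To make the difference concrete, here is a minimal sketch (using the motif.txt from the question, whose first line is the 17-base string s):
with open('txt/motif.txt', 'r') as f:
    raw = f.readline()           # keeps the trailing newline: "GATATATGCATATACTT\n"
    clean = raw.strip()          # "GATATATGCATATACTT"
    print len(raw), len(clean)   # 18 17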
But to be honest, your code is bad because nobody can understand what you want to do from reading it (including you in six months).

Find all strings in text file fitting either of two formats

So I know similar questions have been asked before, but every method I have tried is not working...
Here is the ask: I have a text file (which is a log file) that I am parsing for any occurrence of "app.task2". The following are the 2 scenarios that can occur (As they appear in the text file, independent of my code):
Scenario 1:
Mar 23 10:28:24 dasd[116] <Notice>: app.task2.refresh:556A2D:[
{name: ApplicationPolicy, policyWeight: 50.000, response: {Decision: Can Proceed, Score: 0.45}}
] sumScores:68.785000, denominator:96.410000, FinalDecision: Can Proceed FinalScore: 0.713463}
Scenario 2:
Mar 23 10:35:56 dasd[116] <Notice>: 'app.task2.refresh:C6C2FE' CurrentScore: 0.636967, ThresholdScore: 0.410015 DecisionToRun:1
The problem I am facing is that with my current code below I am not getting the entire log entry for the first case: it only pulls the first line of the entry, not the remainder, and it appears to stop at the newline character that occurs after ":[".
My Code:
all = []
with open(path_to_log) as f:
    for line in f:
        if "app.task2" in line:
            all.append(line)
print all
How can I get the entire log entry for the first case? I tried stripping escape characters with no luck. From here I should be able to parse the list of results returned for what I truly need, but this will help! ty!
OF NOTE: I need to be able to locate these types of log entries (which will then give us either scenario 1 or scenario 2) by the string "app.task2". So this needs to be incorporated, like in my example...
Before adding the line to all, check if it ends with [. If it does, keep reading and merge the lines until you get to ].
import re

all = []
with open(path_to_log) as f:
    for line in f:
        if "app.task2" in line:
            if re.search(r'\[\s*$', line):  # start of multiline log message
                for line2 in f:
                    line += line2
                    if re.search(r'^\s*\]', line2):  # end of multiline log message
                        break
            all.append(line)
print(all)
You are iterating over each line individually, which is why you only get the first line in scenario 1.
Either you can add a counter like this:
all = []
count = -1
with open(path_to_log) as f:
    for line in f:
        if count > 0:
            all.append(line)
            if count == 1:
                tmp = all[-count:]
                del all[-count:]
                all.append("\n".join(tmp))
            count -= 1
            continue
        if "app.task2" in line:
            all.append(line)
            if line.endswith('[\n'):
                count = 3
print all
In this case I think Barmar's solution would work just as well.
Or, preferably, when writing the log file you can put some distinct delimiter between log entries and then just split the file on that delimiter, as sketched below.
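A minimal sketch of that idea, assuming you control how the log is written and can emit a distinct delimiter line between entries (the "-----" below is hypothetical):
DELIM = "\n-----\n"  # hypothetical delimiter written between log entries

with open(path_to_log) as f:
    entries = f.read().split(DELIM)

# keep only the complete entries that mention app.task2
all = [entry for entry in entries if "app.task2" in entry]
print all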
I like Barmar's solution with nested loops on the same file object, and may use that technique in the future. But prior to seeing it, I would have done it with a single loop, which may or may not be more readable:
all = []
keep = False
for line in open(path_to_log, "rt"):
    if "app.task2" in line:
        all.append(line)
        keep = line.rstrip().endswith("[")
    elif keep:
        all.append(line)
        keep = not line.lstrip().startswith("]")
print(all)
or, you can print it nicer with:
print(*all,sep='\n')

Python: losing nucleotides from fasta file to dictionary

I am trying to write code to extract the longest ORF in a FASTA file. It is from the Coursera Genomics Data Science course.
The file is a practice file: "dna.example.fasta"
Data is here: https://d396qusza40orc.cloudfront.net/genpython/data_sets/dna.example.fasta
Part of my code is below; it extracts reading frame 2 (starting from the second position of a sequence, e.g. for seq ATTGGG, reading frame 2 is TTGGG):
#!/usr/bin/python
import sys
import getopt

o, a = getopt.getopt(sys.argv[1:], 'h')
opts = dict()
for k, v in o:
    opts[k] = v
    if '-h' in k:
        print "--help\n"
if len(a) < 0:
    print "missing fasta file\n"

f = open(a[0], "r")
seq = dict()
for line in f:
    line = line.strip()
    if line.startswith(">"):
        name = line.split()[0]
        seq[name] = ''
    else:
        seq[name] = seq[name] + line[1:]

k = seq[">gi|142022655|gb|EQ086233.1|323"]
print len(k)
The length of this particular sequence should be 4804 bp. Therefore by using this sequence alone I could get the correct answer.
However, with the code, here in the dictionary, this particular sequence becomes only 4736 bp.
I am new to Python, so I cannot wrap my head around where those missing base pairs (4804 - 4736 = 68 bp) went.
Thank you,
Xio
Take another look at your data file
An example of some of the lines:
>gi|142022655|gb|EQ086233.1|43 marine metagenome JCVI_SCAF_1096627390048 genomic scaffold, whole genome shotgun sequence
TCGGGCGAAGGCGGCAGCAAGTCGTCCACGCGCAGCGCGGCACCGCGGGCCTCTGCCGTGCGCTGCTTGG
CCATGGCCTCCAGCGCACCGATCGGATCAAAGCCGCTGAAGCCTTCGCGCATCAGGCGGCCATAGTTGGC
Notice how the sequence data starts at the first character of each line.
Your addition line seq[name] = seq[name] + line[1:] is adding everything on that line after the first character, excluding the first (Python indices are zero-based). It turns out the number of missing nucleotides is the number of lines it took to store that sequence, because you're losing the first character every time.
The revised way is seq[name] = seq[name] + line which simply adds the line without losing that first character.
The quickest way to find these kinds of debugging errors is to either use a formal debugger, or add a bunch of print statements to your code and test with a small portion of the file -- something whose output you can inspect by hand and check whether it's coming out right. A short file with maybe 50 nucleotides instead of 5000 is much easier to evaluate by hand and make sure the code is doing what you want. That's what I did to come up with the answer to the problem in about 5 minutes.
Also for future reference, please mention the version of Python you are using beforehand. There are quite a few differences between Python 2 (the one you're using) and Python 3.
I did some additional testing with your code, and if you get any extra characters at the end, they might be whitespace. Make sure you use the .strip() method on each line before adding it to your string, which clears whitespace.
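Putting the two fixes together (add the whole line, and strip each line first), a minimal sketch of the parsing loop might look like this, assuming f is the file object already opened in your script:
seq = dict()
for line in f:
    line = line.strip()               # drop the trailing newline / whitespace
    if line.startswith(">"):
        name = line.split()[0]        # header line: start a new entry
        seq[name] = ''
    else:
        seq[name] = seq[name] + line  # keep every character of the sequence line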
Addressing your comment: to start from the 2nd position on the first line of the sequence only, and then use the full lines until the next sequence header, you can take advantage of the file's linear format and just add one more clause to your if statement, an elif. This will test whether we're on the first line of the sequence; if so, use the characters starting from the second, and on any other line use the whole line.
if line.startswith(">"):
    name = line.split()[0]
    seq[name] = ''
# If it's the first line in the series, then the dict's value
# will be an empty string, so this elif means "if we're at the
# start of the series..."
elif seq[name] == '':
    seq[name] = seq[name] + line[1:]
else:
    seq[name] = seq[name] + line
This adaptation will start from the 2nd nucleotide of the sequence without losing the first character from every subsequent line.

Having problems with strings and arrays

I want to read a text file and copy text that is in between '~~~~~~~~~~~~~' into an array. However, I'm new to Python and this is as far as I've got:
with open("textfile.txt", "r",encoding='utf8') as f:
searchlines = f.readlines()
a=[0]
b=0
for i,line in enumerate(searchlines):
if '~~~~~~~~~~~~~' in line:
b=b+1
if '~~~~~~~~~~~~~' not in line:
if 's1mb4d' in line:
break
a.insert(b,line)
This is what I envisioned:
First I read all the lines of the text file,
then I declare 'a' as an array in which text should be added,
then I declare 'b' because I need it as an index. The number of lines in between the '~~~~~~~~~~~~~' separators is not always the same, which is why I use 'b', so I can keep putting lines of text into one array index until a new '~~~~~~~~~~~~~' is found.
I check for '~~~~~~~~~~~~~'; if found, I increase 'b' so I can start adding lines of text into a new array index.
The text file ends with 's1mb4d', so once it's found, the program ends.
And if '~~~~~~~~~~~~~' is not found in the line, I add the text to the array.
But things didn't go well. Only 1 line of the entire text between those '~~~~~~~~~~~~~' separators is being copied to each array index.
Here is an example of the text file:
~~~~~~~~~~~~~
Text123asdasd
asdasdjfjfjf
~~~~~~~~~~~~~
123abc
321bca
gjjgfkk
~~~~~~~~~~~~~
You could use a regular expression; give this a try:
import re

input_text = ['Text123asdasd asdasdjfjfjf', '~~~~~~~~~~~~~', '123abc 321bca gjjgfkk', '~~~~~~~~~~~~~']
a = []
for line in input_text:
    my_text = re.findall(r'[^\~]+', line)
    if len(my_text) != 0:
        a.append(my_text)
What it does: it reads line by line and collects all characters except '~'; if a line consists only of '~' characters it is ignored, and every line with text is appended to your a list.
And just because we can, a one-liner (excluding the import and the source list, of course):
import re
lines = ['Text123asdasd asdasdjfjfjf','~~~~~~~~~~~~~','123abc 321bca gjjgfkk','~~~~~~~~~~~~~']
a = [re.findall(r'[^\~]+', line) for line in lines if len(re.findall(r'[^\~]+', line)) != 0]
In Python the solution to a large class of problems is often to find the right function in the standard library that does the job. Here you should try using split instead; it should be way easier.
If I understand your goal correctly, you can do it like this:
joined_lines = ''.join(searchlines)
result = joined_lines.split('~~~~~~~~~~')
The first line joins your list of lines into a single string, and then the second one cuts that big string every time it encounters the separator sequence.
I tried to clean it up to the best of my knowledge, try this and let me know if it works. We can work together on this!:)
with open("textfile.txt", "r",encoding='utf8') as f:
searchlines = f.readlines()
a = []
currentline = ''
for i,line in enumerate(searchlines):
currentline += line
if '~~~~~~~~~~~~~' in line:
a.append(currentline)
elif 's1mb4d' in line:
break
Some notes:
You can use elif for your break function
Append will automatically add the next iteration to the end of the array
currentline will continue to add text on each line as long as it doesn't have 's1mb4d' or the ~~~ which I think is what you want
import re

s = ['']
with open('path\\to\\sample.txt') as f:
    for l in f:
        a = l.strip().split("\n")
        s += a

a = []
for line in s:
    my_text = re.findall(r'[^\~]+', line)
    if len(my_text) != 0:
        a.append(my_text)
print a
>>> [['Text123asdasd asdasdjfjfjf'], ['123abc 321bca gjjgfkk']]
If you're willing to impose/accept the constraint that the separator should be exactly 13 ~ characters (actually '\n%s\n' % ( '~' * 13) to be specific) ...
then you could accomplish this for relatively normal sized files using just
#!/usr/bin/python
## (Should be #!/usr/bin/env python; but StackOverflow's syntax highlighter?)
separator = '\n%s\n' % ('~' * 13)
with open('somefile.txt') as f:
    results = f.read().split(separator)
# Use your results, a list of the strings separated by these separators.
Note that '~' * 13 is a way, in Python, of constructing a string by repeating some smaller string thirteen times. 'xx%sxx' % 'YY' is a way to "interpolate" one string into another. Of course you could just paste the thirteen ~ characters into your source code ... but I would consider constructing the string as shown to make it clear that the length is part of the string's specification --- that this is part of your file format requirements ... and that any other number of ~ characters won't be sufficient.
If you really want any line of any number of ~ characters to serve as a separator, then you'll want to use the split() function from the regular expressions module (re.split()) rather than the .split() method provided by the built-in string objects.
Note that these snippets will return all of the text between your separator lines, including any newlines those blocks contain. There are other snippets of code which can filter those out. For example, given our previous results:
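For instance, a minimal sketch of that approach (assuming the same somefile.txt, and treating any line made up only of ~ characters as a separator):
import re

with open('somefile.txt') as f:
    # split on any line consisting solely of one or more '~' characters
    results = re.split(r'^~+\n', f.read(), flags=re.MULTILINE)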
# ... refine results by filtering out newlines (replacing them with spaces)
results = [' '.join(each.split('\n')) for each in results]
(You could also use the .replace() string method; but I prefer the join/split combination). In this case we're using a list comprehension (a feature of Python) to iterate over each item in our results (which we're arbitrarily naming each), performing our transformation on it, and binding the resulting list back to the name results. I highly recommend learning and getting comfortable with list comprehensions if you're going to learn Python; they're commonly used and can be a bit exotic compared to the syntax of many other programming and scripting languages.
This should work on MS Windows as well as Unix (and Unix-like) systems because of how Python handles "universal newlines." To use these examples under Python 3 you might have to work a little on the encodings and string types. (I didn't need to for my Python3.6 installed under MacOS X using Homebrew ... but just be forewarned).

Error with .readlines()[n]

I'm a beginner with Python.
I tried to solve the problem: "If we have a file containing <1000 lines, how to print only the odd-numbered lines? ". That's my code:
with open(r'C:\Users\Savina\Desktop\rosalind_ini5.txt') as f:
    n = 1
    num_lines = sum(1 for line in f)
    while n < num_lines:
        if n/2 != 0:
            a = f.readlines()[n]
            print(a)
            break
        n = n + 2
where n is a counter and num_lines calculates how many lines the file contains.
But when I try to execute the code, it says:
"a=f.readlines()[n]
IndexError: list index out of range"
Why doesn't it recognize n as a counter?
You have the call to readlines inside a loop, but this is not its intended use, because readlines ingests the whole file at once, returning a LIST of newline-terminated strings.
You may want to save such a list and operate on it:
list_of_lines = open(filename).readlines()  # no need for closing, python will do it for you
odd = 1
for line in list_of_lines:
    if odd: print(line, end='')
    odd = 1 - odd
Two remarks:
odd alternates between 1 (hence true when used as the condition of an if) and 0 (hence false when used as the condition of an if),
the optional argument end='' to the print function is required because each line in list_of_lines is terminated by a newline character; if you omit the optional argument, the print function will output a SECOND newline character at the end of each line.
Coming back to your code, you can fix its behavior by calling f.seek(0) before the loop, to rewind the file to its beginning, and by using the f.readline() method (note: readline, NOT readlines) inside the loop; but rest assured that proceeding like this is, let's say, a bit unconventional...
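For completeness, a minimal sketch of that unconventional fix (rewinding with f.seek(0) and reading exactly one line per iteration with f.readline()):
with open(r'C:\Users\Savina\Desktop\rosalind_ini5.txt') as f:
    num_lines = sum(1 for line in f)
    f.seek(0)                    # rewind to the beginning of the file
    for n in range(num_lines):
        line = f.readline()      # read ONE line per iteration
        if n % 2 == 0:           # 0-based indices 0, 2, 4, ... = 1st, 3rd, 5th lines
            print(line, end='')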
Finally, it is possible to do everything you want with a one-liner
print(''.join(open(filename).readlines()[::2]))
that uses the slice notation for lists and the string method .join()
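If the [::2] slice is unfamiliar: it takes every second element of a list, starting from index 0. For example:
lines = ['first\n', 'second\n', 'third\n', 'fourth\n', 'fifth\n']
print(lines[::2])   # ['first\n', 'third\n', 'fifth\n']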
Well, I'd personally do it like this:
def print_odd_lines(some_file):
    with open(some_file) as my_file:
        for index, each_line in enumerate(my_file):  # keep track of the index of each line
            if index % 2 == 1:  # check if the index is odd
                print(each_line)  # if it is, print it

if __name__ == '__main__':
    print_odd_lines(r'C:\Users\Savina\Desktop\rosalind_ini5.txt')
Be aware that this will leave a blank line instead of the even-numbered lines. I'm sure you can figure out how to get rid of it.
This code will do exactly as you asked:
with open(r'C:\Users\Savina\Desktop\rosalind_ini5.txt') as f:
    for i, line in enumerate(f.readlines()):  # Iterate over each line and add an index (i) to it.
        if i % 2 == 0:  # i starts at 0 in python, so if i is even, the line is odd
            print(line)
To explain what happens in your code:
A file can only be read through once. After that it has to be closed and reopened again.
You first iterate over the entire file in num_lines=sum(1 for line in f). Now the object f is empty.
If n is odd, however, you call f.readlines(). This would go through all the lines again, but none are left in f, so it returns an empty list, and indexing that empty list is what raises the IndexError. It is faster to go through the file once (as in the solutions offered to your question).
As a fix, you need to type
f.close()
f = open(r'C:\Users\Savina\Desktop\rosalind_ini5.txt')
every time after you read through the file, in order to get back to the start.
As a side note, you should look up the modulus operator % for finding odd numbers.

How to make a dictionary that counts repeated amount of length of any words from a file?

How do I correct this so my dictionary maps the length of a word to how many times words of that length appear? The parameter is a file name.
def wordLengths(fileName):
    d = {}
    f = open(fileName)
    filename.close()
    for line in f:
        for word in line:
            if len(word) not in d:
                d[len(word)] = count.len(word)
    return(d)
You're on the right track, but you've got a few mistakes. Let's look at them line by line.
def wordLengths(fileName):
    d = {}
    f = open(fileName)
So far, so good
filename.close()
You can't close a filename; it's just a string. You can only close a file object, like f. Also, filename and fileName aren't the same thing; capitalization counts. And it's too early to close the file: you want to do that after reading all the lines, otherwise you won't get to read anything. So scrap this line, and add an f.close() right before the return. (A with statement is even better, but you probably haven't learned those yet.)
for line in f:
    for word in line:
When you loop over a string, you loop over each character in the string, not each word. If you want words, you have to call line.split().
if len(word) not in d:
    d[len(word)] = count.len(word)
Close, but not right. What you want here is: if the length isn't already in the dictionary, store 1; otherwise, add 1 to what's already there. What you've written is: if the length isn't already there, store the length (using some object that doesn't exist); otherwise, do nothing. So:
if len(word) not in d:
    d[len(word)] = 1
else:
    d[len(word)] += 1
return(d)
That one's fine (but remember the f.close() above it). However, it's more idiomatic to write return d.
One more comment: You should be consistent with your indentation: always indent 4 spaces, not a random mix of 1, 4, and 7 spaces. It makes your code a lot easier to read—especially in Python, where indenting something wrong can change the meaning of the code, and that can be hard to spot when each indent level isn't consistent.
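Putting all of those fixes together, a minimal sketch of the corrected function (following the changes described above) might look like this:
def wordLengths(fileName):
    d = {}
    f = open(fileName)
    for line in f:
        for word in line.split():      # split each line into words
            if len(word) not in d:
                d[len(word)] = 1       # first word of this length
            else:
                d[len(word)] += 1      # another word of this length
    f.close()                          # close after reading all the lines
    return d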
