Reading line by line from file and processing lines at same time? - python

Is there a way to read lines from a file at the same time as those lines are being processed, so that the reading and the processing are done separately from each other? Whenever data is read, it would be handed over for processing, so that reading continues regardless of how fast the processing is.

It depends on what you mean by "at the same time". Let's assume you don't necessarily want to go down the rabbit hole of multiple threads, green threads or event-based code and that you just want to cleanly separate reading of the lines, filtering/processing those lines and consuming those lines in your actual business logic.
That can easily be achieved with iterators and generators (the latter being a special kind of iterable). The fact that the file object returned from an open() call is itself usable as an iterator makes this a lot easier.
Consider this simple chaining of generator expressions (which, sure enough, are a kind of iterable) that pre-filter the read lines:
f = open('file-with-myriads-of-lines.txt', 'r')
# strip away trailing whitespace (including the newline)
lines_stripped = (line.rstrip() for line in f)
# remove trailing "#" comments (note: ignores potential quoting)
lines_without_comment = (line.partition('#')[0] for line in lines_stripped)
# remove remaining surrounding whitespace
lines_cleaned = (line.strip() for line in lines_without_comment)
# filter out (now) empty lines
lines_with_content = (line for line in lines_cleaned if line)
for line in lines_with_content:
    # your business logic goes here
    print("Line: {}".format(line))
While you could have combined some of that filtering / mangling into one generator expression or put it inside the for loop, this way the tasks are cleanly separated and you could easily mix and match by reordering, removing or adding more generators to the chain.
This also only reads and processes each line on demand, whenever one is consumed in the business logic for loop (which could also be tucked away in a separate function somewhere else). It does not read all the lines upfront and it also does not create intermediate lists with all the intermediate results. This is in contrast to list comprehensions, which are written with square brackets instead of parentheses.
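To make that on-demand behavior concrete, here is a small self-contained sketch (made-up in-memory lines stand in for the file, and the recording list is purely illustrative) that logs when a line is actually read versus when the chain is merely built:

```python
# Hypothetical demo: record when a line is read vs. when it is consumed.
events = []

def noisy_lines():
    # stands in for the file object; logs every time a line is pulled
    for raw in ["first # comment\n", "   \n", "second line\n"]:
        events.append("read")
        yield raw

# the same kind of generator chain as above, condensed into two steps
stripped = (line.partition('#')[0].strip() for line in noisy_lines())
non_empty = (line for line in stripped if line)

events.append("chain built")  # no line has been read yet at this point
for line in non_empty:
    events.append("consumed " + line)
```

Running this shows "chain built" logged before any "read": each line is only pulled through the whole chain at the moment the business-logic loop asks for it, and the filtered-out blank line causes an extra "read" with no matching "consumed".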
Of course you can also give each unit of processing a name in the form of a function, to increase readability, encapsulation and maintainability:
def strip_trailing_whitespace(iterable):
    return (line.rstrip() for line in iterable)

def remove_trailing_comments(iterable):
    return (line.partition('#')[0] for line in iterable)

# ...

def preprocess_lines(iterable):
    iterable = strip_trailing_whitespace(iterable)
    iterable = remove_trailing_comments(iterable)
    # ...
    return iterable

def business_logic(iterable):
    for line in iterable:
        # your business logic here
        print("Line: {}".format(line))

def main():
    with open('file-with-myriads-of-lines.txt', 'r') as f:
        iterable = preprocess_lines(f)
        business_logic(iterable)

if __name__ == '__main__':
    main()
And if your pre-processing of each line gets more complex than what is usable inside a generator expression, you can simply expand this to a custom generator function using the yield statement or expression:
def remove_trailing_comments(iterable):
    """Remove #-comments that are outside of double-quoted parts."""
    for line in iterable:
        pos = -1
        while True:
            pos = line.find('#', pos + 1)
            if pos < 0:
                break  # use whole line
            if line[:pos].count('"') % 2 == 0:
                # strip starting from first "#" that's not inside quotes
                line = line[:pos]
                break
        yield line
Everything else remains the same.

Related

Regular Expression to find valid words in file

I need to write a function get_specified_words(filename) to get a list of lowercase words from a text file. All of the following conditions must be applied:
Include all lower-case character sequences including those that
contain a - or ' character and those that end with a '
character.
Exclude words that end with a -.
The function must only process lines between the start and end marker lines
Use this regular expression to extract the words from each relevant line of a file: valid_line_words = re.findall("[a-z]+[-'][a-z]+|[a-z]+[']?|[a-z]+", line)
Ensure that the line string is lower case before using the regular expression.
Use the optional encoding parameter when opening files for reading. That is your open file call should look like open(filename, encoding='utf-8'). This will be especially helpful if your operating system doesn't set Python's default encoding to UTF-8.
The sample text file testing.txt contains this:
That are after the start and should be dumped.
So should that
and that
and yes, that
*** START OF SYNTHETIC TEST CASE ***
Toby's code was rather "interesting", it had the following issues: short,
meaningless identifiers such as n1 and n; deep, complicated nesting;
a doc-string drought; very long, rambling and unfocused functions; not
enough spacing between functions; inconsistent spacing before and
after operators, just like this here. Boy was he going to get a low
style mark.... Let's hope he asks his friend Bob to help him bring his code
up to an acceptable level.
*** END OF SYNTHETIC TEST CASE ***
This is after the end and should be ignored too.
Have a nice day.
Here's my code:
import re

def stripped_lines(lines):
    for line in lines:
        stripped_line = line.rstrip('\n')
        yield stripped_line

def lines_from_file(fname):
    with open(fname, 'rt') as flines:
        for line in stripped_lines(flines):
            yield line

def is_marker_line(line, start='***', end='***'):
    min_len = len(start) + len(end)
    if len(line) < min_len:
        return False
    return line.startswith(start) and line.endswith(end)

def advance_past_next_marker(lines):
    for line in lines:
        if is_marker_line(line):
            break

def lines_before_next_marker(lines):
    valid_lines = []
    for line in lines:
        if is_marker_line(line):
            break
        valid_lines.append(re.findall("[a-z]+[-'][a-z]+|[a-z]+[']?|[a-z]+", line))
    for content_line in valid_lines:
        yield content_line

def lines_between_markers(lines):
    it = iter(lines)
    advance_past_next_marker(it)
    for line in lines_before_next_marker(it):
        yield line

def words(lines):
    text = '\n'.join(lines).lower().split()
    return text

def get_valid_words(fname):
    return words(lines_between_markers(lines_from_file(fname)))

# This must be executed
filename = "valid.txt"
all_words = get_valid_words(filename)
print(filename, "loaded ok.")
print("{} valid words found.".format(len(all_words)))
print("word list:")
print("\n".join(all_words))
Here's my output:
File "C:/Users/jj.py", line 45, in <module>
text = '\n'.join(lines).lower().split()
builtins.TypeError: sequence item 0: expected str instance, list found
Here's the expected output:
valid.txt loaded ok.
73 valid words found.
word list:
toby's
code
was
rather
interesting
it
had
the
following
issues
short
meaningless
identifiers
such
as
n
and
n
deep
complicated
nesting
a
doc-string
drought
very
long
rambling
and
unfocused
functions
not
enough
spacing
between
functions
inconsistent
spacing
before
and
after
operators
just
like
this
here
boy
was
he
going
to
get
a
low
style
mark
let's
hope
he
asks
his
friend
bob
to
help
him
bring
his
code
up
to
an
acceptable
level
I need help with getting my code to work. Any help is appreciated.
lines_between_markers(lines_from_file(fname))
gives you a sequence of lists of valid words.
So you just need to flatten it:
def words(lines):
    words_list = [w for line in lines for w in line]
    return words_list
Does the trick.
But I think that you should review the design of your program:
lines_between_markers should only yield lines between markers, but it does more. The regexp should be used on the result of this function, not inside the function.
What you didn't do:
Ensure that the line string is lower case before using the regular expression.
Use the optional encoding parameter when opening files for reading. That is, your open file call should look like open(filename, encoding='utf-8').
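A possible refactoring along those lines might look like the sketch below. The helper names and the simple marker test are illustrative choices, not the only way to structure it; the word regexp is the one given in the assignment:

```python
import re

# pattern taken from the assignment statement
WORD_RE = re.compile("[a-z]+[-'][a-z]+|[a-z]+[']?|[a-z]+")

def is_marker_line(line, start='***', end='***'):
    return (len(line) >= len(start) + len(end)
            and line.startswith(start) and line.endswith(end))

def lines_between_markers(lines):
    """Yield only the lines between the two marker lines -- nothing else."""
    it = iter(lines)
    for line in it:            # skip everything before the start marker
        if is_marker_line(line):
            break
    for line in it:            # yield until the end marker
        if is_marker_line(line):
            return
        yield line

def get_specified_words(filename):
    with open(filename, encoding='utf-8') as f:
        words = []
        for line in lines_between_markers(l.rstrip('\n') for l in f):
            # lowercase first, then apply the regexp -- outside the generator
            words.extend(WORD_RE.findall(line.lower()))
        return words
```

Here the generator does exactly one job (selecting lines), the regexp is applied to its output, the line is lowercased first, and the file is opened with encoding='utf-8', covering all three review points.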

Python: losing nucleotides from fasta file to dictionary

I am trying to write a code to extract longest ORF in a fasta file. It is from Coursera Genomics data science course.
the file is a practice file: "dna.example.fasta"
Data is here:https://d396qusza40orc.cloudfront.net/genpython/data_sets/dna.example.fasta
Part of my code is below to extract reading frame 2 (start from the second position of a sequence. eg: seq: ATTGGG, to get reading frame 2: TTGGG):
#!/usr/bin/python
import sys
import getopt

o, a = getopt.getopt(sys.argv[1:], 'h')
opts = dict()
for k, v in o:
    opts[k] = v
    if '-h' in k:
        print "--help\n"
if len(a) < 0:
    print "missing fasta file\n"

f = open(a[0], "r")
seq = dict()
for line in f:
    line = line.strip()
    if line.startswith(">"):
        name = line.split()[0]
        seq[name] = ''
    else:
        seq[name] = seq[name] + line[1:]

k = seq[">gi|142022655|gb|EQ086233.1|323"]
print len(k)
The length of this particular sequence should be 4804 bp. Therefore by using this sequence alone I could get the correct answer.
However, with the code, here in the dictionary, this particular sequence becomes only 4736 bp.
I am new to python, so I cannot wrap my head around where those missing bp went.
Thank you,
Xio
Take another look at your data file
An example of some of the lines:
>gi|142022655|gb|EQ086233.1|43 marine metagenome JCVI_SCAF_1096627390048 genomic scaffold, whole genome shotgun sequence
TCGGGCGAAGGCGGCAGCAAGTCGTCCACGCGCAGCGCGGCACCGCGGGCCTCTGCCGTGCGCTGCTTGG
CCATGGCCTCCAGCGCACCGATCGGATCAAAGCCGCTGAAGCCTTCGCGCATCAGGCGGCCATAGTTGGC
Notice how the sequences start on the first value of each line.
Your addition line seq[name] = seq[name] + line[1:] is adding everything on that line after the first character, excluding the first (Python indices are zero-based). It turns out your missing number of nucleotides equals the number of lines that make up that sequence, because you're losing the first character of every line.
The revised way is seq[name] = seq[name] + line which simply adds the line without losing that first character.
The quickest way to find these kind of debugging errors is to either use a formal debugger, or add a bunch of print statements on your code and test with a small portion of the file -- something that you can see the output of and check for yourself if it's coming out right. A short file with maybe 50 nucleotides instead of 5000 is much easier to evaluate by hand and make sure the code is doing what you want. That's what I did to come up with the answer to the problem in about 5 minutes.
Also for future reference, please mention the version of Python you are using beforehand. There are quite a few differences between Python 2 (the one you're using) and Python 3.
I did some additional testing with your code, and if you get any extra characters at the end, they might be whitespace. Make sure you use the .strip() method on each line before adding it to your string, which clears whitespace.
Addressing your comment,
To start from the 2nd position on the first line of the sequence only and then use the full lines until the following nucleotide, you can take advantage of the file's linear format and just add one more clause to your if statement, an elif. This will test if we're on the first line of the sequence, and if so, use the characters starting from the second, if we're on any other line, use the whole line.
if line.startswith(">"):
    name = line.split()[0]
    seq[name] = ''
# If it's the first line in the series, then the dict's value
# will be an empty string, so this elif means "If we're at the
# start of the series..."
elif seq[name] == '':
    seq[name] = seq[name] + line[1:]
else:
    seq[name] = seq[name] + line
This adaptation will start from the 2nd nucleotide of the sequence without losing the first character from every subsequent line.
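Putting the whole loop together, here is a minimal self-contained sketch of that idea, using a made-up two-record FASTA held in a list and an explicit flag instead of the empty-string test (the flag avoids misfiring if a record's first data line were itself empty; the function name and sample headers are invented for the demo):

```python
def frame2_sequences(fasta_lines):
    """Build {header: sequence}, starting each sequence at its 2nd nucleotide."""
    seq = {}
    name = None
    first_line_of_record = False
    for line in fasta_lines:
        line = line.strip()
        if line.startswith(">"):
            name = line.split()[0]
            seq[name] = ''
            first_line_of_record = True
        elif first_line_of_record:
            seq[name] += line[1:]   # drop the first nucleotide exactly once
            first_line_of_record = False
        else:
            seq[name] += line       # keep every later line whole

    return seq

fasta = [
    ">gi|1|demo first record",
    "ATTGGG",
    "CCCAAA",
    ">gi|2|demo second record",
    "TTTT",
]
result = frame2_sequences(fasta)
# result[">gi|1|demo"] == "TTGGGCCCAAA": only the leading A was dropped
```

A tiny in-memory sample like this is exactly the kind of hand-checkable test case the debugging advice above recommends.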

Error with .readlines()[n]

I'm a beginner with Python.
I tried to solve the problem: "If we have a file containing <1000 lines, how to print only the odd-numbered lines? ". That's my code:
with open(r'C:\Users\Savina\Desktop\rosalind_ini5.txt') as f:
    n = 1
    num_lines = sum(1 for line in f)
    while n < num_lines:
        if n/2 != 0:
            a = f.readlines()[n]
            print(a)
            break
        n = n+2
where n is a counter and num_lines calculates how many lines the file contains.
But when I try to execute the code, it says:
"a=f.readlines()[n]
IndexError: list index out of range"
Why doesn't it recognize n as a counter?
You have the call to readlines inside a loop, but that is not its intended use,
because readlines ingests the whole of the file at once, returning you a LIST
of newline-terminated strings.
You may want to save such a list and operate on it
list_of_lines = open(filename).readlines()  # no need for closing, python will do it for you
odd = 1
for line in list_of_lines:
    if odd:
        print(line, end='')
    odd = 1 - odd
Two remarks:
odd alternates between 1 (hence true as the condition of an if) and 0 (hence false as the condition of an if);
the optional argument end='' to the print function is required because each line in list_of_lines is terminated by a newline character; if you omit the optional argument, the print function will output a SECOND newline character at the end of each line.
Coming back to your code, you can fix its behavior by calling
f.seek(0)
before the loop to rewind the file to its beginning position and using the
f.readline() method (note: readline, NOT readlineS) inside the loop,
but rest assured that proceeding like this is, let's say, a bit unconventional...
Finally, it is possible to do everything you want with a one-liner
print(''.join(open(filename).readlines()[::2]))
that uses the slice notation for lists and the string method .join()
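As a quick illustration of what the [::2] slice does, here is the same idea on a made-up in-memory list standing in for the result of readlines():

```python
# [::2] keeps every second element starting at index 0,
# i.e. indices 0, 2, 4 -> the odd-numbered lines 1, 3, 5
lines = ["line 1\n", "line 2\n", "line 3\n", "line 4\n", "line 5\n"]
odd_numbered = lines[::2]
result = ''.join(odd_numbered)
# result == "line 1\nline 3\nline 5\n"
```

Since each kept element still carries its own newline, joining with the empty string reproduces the original line breaks without doubling them.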
Well, I'd personally do it like this:
def print_odd_lines(some_file):
    with open(some_file) as my_file:
        for index, each_line in enumerate(my_file):  # keep track of the index of each line
            if index % 2 == 0:  # indices 0, 2, 4, ... correspond to odd-numbered lines
                print(each_line)  # if it does, print it

if __name__ == '__main__':
    print_odd_lines(r'C:\Users\Savina\Desktop\rosalind_ini5.txt')
Be aware that this will leave a blank line in place of each even-numbered line. I'm sure you can figure out how to get rid of it.
This code will do exactly as you asked:
with open(r'C:\Users\Savina\Desktop\rosalind_ini5.txt') as f:
    for i, line in enumerate(f.readlines()):  # Iterate over each line and add an index (i) to it.
        if i % 2 == 0:  # i starts at 0 in python, so if i is even, the line is odd
            print(line)
To explain what happens in your code:
A file can only be read through once. After that, it has to be closed and reopened again.
You first iterate over the entire file in num_lines=sum(1 for line in f). Now the object f is exhausted.
Whenever n is odd, you then call f.readlines(). This would go through all the lines again, but none are left in f. So every time n is odd, you traverse the entire (now empty) file. It is faster to go through it once, as in the solutions offered to your question.
As a fix, you need to type
f.close()
f = open(r'C:\Users\Savina\Desktop\rosalind_ini5.txt')
every time after you read through the file, in order to get back to the start.
As a side note, you should look up the modulus operator % for finding odd numbers.
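For the record, a tiny sketch of the modulus idea:

```python
# n % 2 is 1 for odd numbers and 0 for even numbers,
# so a 1-based line number n is odd exactly when n % 2 == 1
odd_numbers = [n for n in range(10) if n % 2 == 1]
# odd_numbers == [1, 3, 5, 7, 9]
```

This is the test the asker's n/2 != 0 was presumably aiming for: integer division by 2 does not distinguish odd from even, but the remainder does.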

Line in file that has the most characters?

I have a file that has an unknown number of lines with unknown length. How would you write a program that tells which line has the most characters, or in other words, which line is the longest?
I was thinking to make a for line in myFile function that uses len(line) and appends the length to a new list, so the length of the first line would go index 0, length of second line would go to index 1 etc... Then when there are no more lines to check use the myList.max() function to tell me the index of the longest line.
My question is this, is there a better/more efficient way to generate such output? Maybe there's even a built in function that I don't know about that is capable of doing so. You help would be much appreciated.
def tuple_compare(tup):
    """
    Input: 2-tuple of the form (anything, line)
    Output: Length of line with trailing newline stripped.
    """
    unused_anything, line = tup
    return len(line.rstrip('\n'))

with open('filename') as fin:
    biggest_line_number, biggest_line = max(enumerate(fin),
                                            key=tuple_compare)
Let's unpack this a little. tuple_compare just takes the tuples that come out of the enumerate function and returns the length of the line it contains (minus any newline that might be hiding at the end). enumerate yields a bunch of 2-tuples (lineno, line), which is why we take the second element in tuple_compare to be the line. max does all the rest of the heavy lifting for us and returns the biggest tuple based on the key comparison function.
At the end of the day, we just unpack the tuple into its 2 parts -- the line number and the line text.
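A stripped-down illustration of the same max()-with-key idea, run on a made-up in-memory list instead of a file:

```python
lines = ["short\n", "the longest line here\n", "mid-sized\n"]

# max() compares the (index, line) pairs by the stripped line length only
idx, longest = max(enumerate(lines),
                   key=lambda tup: len(tup[1].rstrip('\n')))
# idx == 1, longest == "the longest line here\n"
```

The key function sees each enumerate pair, so max returns the whole winning pair, which then unpacks into the line number and the line text.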
You could use the key parameter of the max() function and treat the file object as an iterator over lines:
longest_line = max(myFile, key=len)
It assumes that the last line has a newline. Otherwise:
longest_line = max((line.rstrip("\n") for line in myFile), key=len)
If you want also a line number; you could use enumerate():
number, longest_line = max(enumerate(myFile, 1), key=lambda pair: len(pair[1]))
with open('filename') as fin:
    max_len, line_num = max((len(s), i) for i, s in enumerate(fin))
you may want to use len(s.rstrip('\n')) as in mgilson's answer
If you need the text from the line:
with open('filename') as fin:
    max_len, line_num, line = max((len(s), i, s) for i, s in enumerate(fin))
Here's yet another stylistic variant on the basic answer given by several others. I often like this style because it:
Leverages the idea of a data pipeline: each step receives an input stream and generates an output stream. This idiom crops up all over the place: functional programming; Unix shells; map-reduce; etc.
Often leads to readable code: we can apply a meaningful name to each step in the pipeline, and the resulting code tends to have a flat, almost declarative feeling.
Illustrates data-centric programming: if we transform and organize our data in the right way, the algorithmic aspect of our computation shrinks to trivial proportions, even to the point of practically disappearing -- in this case, we just call max() on the last stage of the pipeline.
For many other (and much more interesting) examples in this vein, search for David Beazley's online writings on iterators, generators, and coroutines.
with open('path/to/file') as fh:
    # Each pipeline step is a generator.
    stripped = (ln.rstrip('\n') for ln in fh)
    lengths = ((len(ln), i, ln) for i, ln in enumerate(stripped))
    # The data directly answers our question.
    # We get max length, line number, and the line.
    print(max(lengths))

Reading lines from Python queues

I'm dipping my toes into Python threading. I've created a supplier thread that returns me character/line data from a *nix (serial) /dev via a Queue.
As an exercise, I would like to consume the data from the queue one line at a time (using '\n' as the line terminator).
My current (simplistic) solution is to put() only 1 character at a time into the queue, so the consumer will only get() one character at a time. (Is this a safe assumption?) This approach currently allows me to do the following:
...
return_buffer = []
while True:
    rcv_data = queue.get(block=True)
    return_buffer.append(rcv_data)
    if rcv_data == "\n":
        return return_buffer
This seems to be working, but I can definitely cause it to fail when I put() 2 characters at a time.
I would like to make the receive logic more generic and able to handle multi-character put()s.
My next approach would be to rcv_data.partition("\n"), putting the "remainder" in yet another buffer/list, but that will require juggling the temporary buffer alongside the queue.
(I guess another approach would be to only put() one line at a time, but where's the fun in that?)
Is there a more elegant way to read from a queue one line at a time?
This may be a good use for a generator. It will pick up exactly where it left off after yield, so it reduces the amount of storage and buffer swapping you need (I cannot speak to its performance).
def getLineGenerator(queue, splitOn):
    return_buffer = []
    while True:
        rcv_data = queue.get(block=True)  # We can pull any number of characters here.
        for c in rcv_data:
            return_buffer.append(c)
            if c == splitOn:
                yield ''.join(return_buffer)  # join so the consumer gets a string, not a list
                return_buffer = []

gen = getLineGenerator(myQueue, "\n")
for line in gen:
    print(line.strip())
Edit:
Once J.F. Sebastian pointed out that the line separator could be multi-character, I had to solve that case as well. I also used StringIO from jdi's answer. Again, I cannot speak to the efficiency, but I believe it is correct in all cases (at least the ones I could think of). This is untested, so it would probably need some tweaks to actually run. Thanks to J.F. Sebastian and jdi for their answers, which ultimately led to this one.
from io import StringIO  # python 2: from cStringIO import StringIO

def getlines(chunks, splitOn="\n"):
    r_buffer = StringIO()
    for chunk in chunks:
        r_buffer.write(chunk)
        pos = r_buffer.getvalue().find(splitOn)  # can't use rfind, see the next comment
        while pos != -1:  # A single chunk may have more than one separator
            line = r_buffer.getvalue()[:pos + len(splitOn)]
            yield line
            rest = r_buffer.getvalue().split(splitOn, 1)[1]
            r_buffer.seek(0)
            r_buffer.truncate()
            r_buffer.write(rest)
            pos = rest.find(splitOn)  # rest and r_buffer are equivalent at this point; use rest to avoid an extra call to getvalue
    line = r_buffer.getvalue()
    r_buffer.close()  # just for completeness
    yield line  # whatever is left over
for line in getlines(iter(queue.get, None)):  # break on queue.put(None)
    process(line)
If your specific use case requires the producer to put to the queue character by character, then I suppose I can't see anything wrong with getting them in a loop in the consumer. But you can probably get better performance by using a StringIO object as the buffer.
from cStringIO import StringIO
# python3: from io import StringIO
buf = StringIO()
The object is file-like, so you can write to it, seek it, and call getvalue() at any time to get the complete string value in the buffer. This will most likely give you much better performance than having to constantly grow a list, join it into a string, and clear it.
return_buffer = StringIO()
while True:
    rcv_data = queue.get(block=True)
    return_buffer.write(rcv_data)
    if rcv_data == "\n":
        ret = return_buffer.getvalue()
        return_buffer.seek(0)
        # truncate, unless you are counting bytes and
        # reading the data directly each time
        return_buffer.truncate()
        return ret
The queue returns exactly what you put in it. If you put fragments you get fragments. If you put lines you get lines.
To consume line by line if partial lines in the input are allowed and could be completed later you need a buffer either explicit or implicit to store partial lines:
def getlines(fragments, linesep='\n'):
    buff = []
    for fragment in fragments:
        pos = fragment.rfind(linesep)
        if pos != -1:  # linesep in fragment
            lines = fragment[:pos].split(linesep)
            if buff:  # start of line from previous fragment
                lines[0] = ''.join(buff) + lines[0]  # prepend
                del buff[:]  # clear buffer
            rest = fragment[pos + len(linesep):]
            if rest:
                buff.append(rest)
            yield from lines
        elif fragment:  # linesep not in fragment, fragment is not empty
            buff.append(fragment)
    if buff:
        yield ''.join(buff)  # flush the rest
It allows fragments and a linesep of arbitrary length, but linesep must not span several fragments.
Usage:
for line in getlines(iter(queue.get, None)):  # break on queue.put(None)
    process(line)
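To see the buffering behavior concretely, here is the same function exercised on a made-up list of fragments instead of a queue; the function body mirrors the one above so the snippet is runnable on its own:

```python
def getlines(fragments, linesep='\n'):
    buff = []  # holds the partial line carried over between fragments
    for fragment in fragments:
        pos = fragment.rfind(linesep)
        if pos != -1:
            lines = fragment[:pos].split(linesep)
            if buff:
                lines[0] = ''.join(buff) + lines[0]  # complete the carried-over line
                del buff[:]
            rest = fragment[pos + len(linesep):]
            if rest:
                buff.append(rest)
            yield from lines
        elif fragment:
            buff.append(fragment)
    if buff:
        yield ''.join(buff)  # flush whatever is left over

# Lines deliberately split across fragment boundaries:
fragments = ["ab", "c\nde", "f\ngh\n", "ij"]
result = list(getlines(fragments))  # ['abc', 'def', 'gh', 'ij']
```

Note how "abc" and "def" are each reassembled from two fragments, and the trailing "ij" with no separator is flushed at the end.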
It's important to note that there could be multiple lines in the queue. This function will return (and optionally print) all the lines from a given queue:
def getQueueContents(queue, printContents=True):
    contents = ''
    # get the full queue contents, not just a single line
    while not queue.empty():
        line = queue.get_nowait()
        contents += line
        if printContents:
            # remove the newline at the end
            print(line[:-1])
    return contents
