I'm dipping my toes into Python threading. I've created a supplier thread that returns me character/line data from a *nix (serial) /dev via a Queue.
As an exercise, I would like to consume the data from the queue one line at a time (using '\n' as the line terminator).
My current (simplistic) solution is to put() only 1 character at a time into the queue, so the consumer will only get() one character at a time. (Is this a safe assumption?) This approach currently allows me to do the following:
...
return_buffer = []
while True:
    rcv_data = queue.get(block=True)
    return_buffer.append(rcv_data)
    if rcv_data == "\n":
        return return_buffer
This seems to be working, but I can definitely cause it to fail when I put() 2 characters at a time.
I would like to make the receive logic more generic and able to handle multi-character put()s.
My next approach would be to rcv_data.partition("\n"), putting the "remainder" in yet another buffer/list, but that will require juggling the temporary buffer alongside the queue.
(I guess another approach would be to only put() one line at a time, but where's the fun in that?)
Is there a more elegant way to read from a queue one line at a time?
This may be a good use for a generator. It will pick up exactly where it left off after yield, so it reduces the amount of storage and buffer swapping you need (I cannot speak to its performance).
def getLineGenerator(queue, splitOn):
    return_buffer = []
    while True:
        rcv_data = queue.get(block=True)  # We can pull any number of characters here.
        for c in rcv_data:
            return_buffer.append(c)
            if c == splitOn:
                yield "".join(return_buffer)  # join so the caller gets a string rather than a list of characters
                return_buffer = []
gen = getLineGenerator(myQueue, "\n")
for line in gen:
    print line.strip()
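A quick check of the idea (the data here is made up for illustration): pre-load a queue with multi-character chunks and pull the two complete lines back out.

from queue import Queue  # Python 3; on Python 2 use: from Queue import Queue

q = Queue()
for chunk in ("ab", "c\nde", "f\n"):
    q.put(chunk)

gen = getLineGenerator(q, "\n")
print(next(gen).strip())  # abc
print(next(gen).strip())  # def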
Edit:
Once J.F. Sebastian pointed out that the line separator could be multi-character I had to solve that case as well. I also used StringIO from jdi's answer. Again I cannot speak to the efficiency, but I believe it is correct in all cases (at least the ones I could think of). This is untested, so would probably need some tweaks to actually run. Thanks to J.F. Sebastian and jdi for their answers which ultimately led to this one.
from io import StringIO  # Python 2: from cStringIO import StringIO

def getlines(chunks, splitOn="\n"):
    r_buffer = StringIO()
    for chunk in chunks:
        r_buffer.write(chunk)
        pos = r_buffer.getvalue().find(splitOn)  # can't use rfind, see the next comment
        while pos != -1:  # A single chunk may have more than one separator
            line = r_buffer.getvalue()[:pos + len(splitOn)]
            yield line
            rest = r_buffer.getvalue().split(splitOn, 1)[1]
            r_buffer.seek(0)
            r_buffer.truncate()
            r_buffer.write(rest)
            pos = rest.find(splitOn)  # rest and r_buffer are equivalent at this point; use rest to avoid an extra call to getvalue()
    line = r_buffer.getvalue()
    r_buffer.close()  # just for completeness
    yield line  # whatever is left over
for line in getlines(iter(queue.get, None)):  # break on queue.put(None)
    process(line)
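A minimal self-contained check of that generator with plain string chunks (the data and the two-character separator are made up; note the yielded lines keep the separator):

chunks = ["first\r", "\nsec", "ond\r\nthird"]
print(list(getlines(chunks, splitOn="\r\n")))
# -> ['first\r\n', 'second\r\n', 'third']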
If your specific use case requires the producer to put() characters onto the queue one at a time, then I can't see anything wrong with getting them in a loop in the consumer. But you can probably get better performance by using a StringIO object as the buffer.
from cStringIO import StringIO
# python3: from io import StringIO
buf = StringIO()
The object is file-like, so you can write to it, seek it, and call getvalue() at any time to get the complete string value in the buffer. This will most likely give you much better performance than having to constantly grow a list, join it to a string, and clear it.
return_buffer = StringIO()
while True:
    rcv_data = queue.get(block=True)
    return_buffer.write(rcv_data)
    if rcv_data == "\n":
        ret = return_buffer.getvalue()
        return_buffer.seek(0)
        # truncate, unless you are counting bytes and
        # reading the data directly each time
        return_buffer.truncate()
        return ret
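For reference, here is that fragment wrapped into a small self-contained helper (a sketch only; the one-character put() assumption from the question still applies, and the names here are mine, not from the answer):

from io import StringIO   # Python 3 equivalent of cStringIO
from queue import Queue

def make_line_reader(queue):
    """Return a function that blocks until a full newline-terminated line arrives."""
    buf = StringIO()
    def read_line():
        while True:
            rcv_data = queue.get(block=True)
            buf.write(rcv_data)
            if rcv_data == "\n":
                line = buf.getvalue()
                buf.seek(0)     # rewind and ...
                buf.truncate()  # ... empty the buffer for the next line
                return line
    return read_line

q = Queue()
for ch in "hi\n":
    q.put(ch)
print(repr(make_line_reader(q)()))  # -> 'hi\n'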
The queue returns exactly what you put in it. If you put fragments you get fragments. If you put lines you get lines.
To consume the input line by line when partial lines are allowed and may be completed later, you need a buffer, either explicit or implicit, to store the partial lines:
def getlines(fragments, linesep='\n'):
    buff = []
    for fragment in fragments:
        pos = fragment.rfind(linesep)
        if pos != -1:  # linesep in fragment
            lines = fragment[:pos].split(linesep)
            if buff:  # start of line from previous fragment
                lines[0] = ''.join(buff) + lines[0]  # prepend
                del buff[:]  # clear buffer
            rest = fragment[pos + len(linesep):]
            if rest:
                buff.append(rest)
            yield from lines
        elif fragment:  # linesep not in fragment, fragment is not empty
            buff.append(fragment)
    if buff:
        yield ''.join(buff)  # flush the rest
It allows fragments and a linesep of arbitrary length; the linesep, however, should not span several fragments.
Usage:
for line in getlines(iter(queue.get, None)):  # break on queue.put(None)
    process(line)
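A quick check with made-up fragments and a two-character separator (note that, unlike the StringIO version above, the yielded lines do not include the separator):

fragments = ["foo\r\nba", "r\r\n", "baz"]
print(list(getlines(fragments, linesep="\r\n")))
# -> ['foo', 'bar', 'baz']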
It's important to note that there could be multiple lines in the queue. This function will return (and optionally print) all the lines from a given queue:
def getQueueContents(queue, printContents=True):
    contents = ''
    # get the full queue contents, not just a single line
    while not queue.empty():
        line = queue.get_nowait()
        contents += line
        if printContents:
            # remove the newline at the end
            print line[:-1]
    return contents
Related
I have a medium-size file (25MB, 1000000 rows), and I want to read every row except every third row.
FIRST QUESTION: Is it faster to load the whole file into memory and then read the rows (method .read()), or load and read one row at the time (method .readline())?
Since I'm not an experienced coder, I tried the second option using the islice method from the itertools module.
import itertools

with open(input_file) as inp:
    inp_atomtype = itertools.islice(inp, 0, 40, 3)
    inp_atomdata = itertools.islice(inp, 1, 40, 3)
    for atomtype, atomdata in itertools.zip_longest(inp_atomtype, inp_atomdata):
        print(atomtype + atomdata)
Although looping through a single generator (inp_atomtype or inp_atomdata) prints correct data, looping through both of them simultaneously (as in this code) prints wrong data.
SECOND QUESTION: How can I reach desired rows using generators?
You don't need to slice the iterator, a simple line counter should be enough:
with open(input_file) as f:
    current_line = 0
    for line in f:
        current_line += 1
        if current_line % 3:  # ignore every third line
            print(line)  # NOTE: print() will add an additional newline by default
As for turning it into a generator, just yield the line instead of printing.
When it comes to speed, given that you'll be reading your lines anyway, the I/O part will probably take about the same time. You might benefit a bit (in total processing time) from fast list slicing instead of counting lines, but only if you have enough working memory to keep the file contents and loading the whole file upfront instead of streaming it is acceptable.
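As a sketch of the generator variant mentioned above (function and file names are just illustrative):

def keep_two_of_three(path):
    with open(path) as f:
        current_line = 0
        for line in f:
            current_line += 1
            if current_line % 3:  # keep lines 1, 2, skip 3, keep 4, 5, skip 6, ...
                yield line

# for line in keep_two_of_three(input_file):
#     print(line, end='')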
yield is perfect for this.
This function yields pairs from an iterable and skips every third item:
def two_thirds(seq):
    _iter = iter(seq)
    while True:
        try:
            yield (next(_iter), next(_iter))
            next(_iter)
        except StopIteration:  # required on Python 3.7+ (PEP 479)
            return
Incomplete pairs are lost: for example, two_thirds(range(1)) stops iterating immediately without yielding anything.
https://repl.it/repls/DullNecessaryCron
You can also use the grouper recipe from itertools doc and ignore the third item in each tuple generated:
for atomtype, atomdata, _ in grouper(lines, 3):
    pass
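For reference, grouper is not a built-in; it is the recipe from the itertools documentation (zip_longest pads the last group with fillvalue, so a file whose length is not a multiple of 3 ends with None entries in the final tuple):

from itertools import zip_longest

def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks."
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)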
FIRST QUESTION: I am pretty sure that .readline() is faster than .read(). Plus, the fastest way based on my tests is to loop like this:
with open(file, 'r') as f:
    for line in f:
        ...
SECOND QUESTION: I am not quite sure about this. You may consider using yield.
Here is a code snippet you may refer to:
def myreadlines(f, newline):
    buf = ""
    while True:
        while newline in buf:
            pos = buf.index(newline)
            yield buf[:pos]
            buf = buf[pos + len(newline):]
        chunk = f.read(4096)
        if not chunk:
            # the end of file
            yield buf
            break
        buf += chunk

with open("input.txt") as f:
    for line in myreadlines(f, "{|}"):
        print(line)
q2: here's my generator:
def yield_from_file(input_file):
    with open(input_file) as file:
        yield from file

def read_two_skip_one(gen):
    while True:
        try:
            val1 = next(gen)
            val2 = next(gen)
            yield val1, val2
            _ = next(gen)
        except StopIteration:
            break

if __name__ == '__main__':
    for atomtype, atomdata in read_two_skip_one(yield_from_file('sample.txt')):
        print(atomtype + atomdata)
sample.txt was generated with a bash shell (it just contains the numbers 001 to 100, one per line):
for i in {001..100}; do echo $i; done > sample.txt
Regarding q1: if you're reading the file multiple times, you'd be better off keeping it in memory. Otherwise you're fine reading it line by line.
Regarding the problem you're having with the wrong results:
both itertools.islice(inp, ...) calls use inp as their underlying generator, and both call next(inp) to provide you with a value.
Each call to next() on an iterator advances its state, so that's where your problems come from.
You can use a generator expression:
with open(input_file, 'r') as f:
    generator = (line for e, line in enumerate(f, start=1) if e % 3)
enumerate adds line numbers to each line, and the if clause ignores line numbers divisible by 3 (default numbering starts at 0, so you have to specify start=1 to get the desired pattern).
Keep in mind that you can only use the generator while the file is still open.
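For example, to rebuild the (atomtype, atomdata) pairs from the question while the file is still open (a sketch; zipping an iterator with itself pairs up consecutive items):

with open(input_file, 'r') as f:
    kept = (line for e, line in enumerate(f, start=1) if e % 3)
    for atomtype, atomdata in zip(kept, kept):  # (line 1, line 2), (line 4, line 5), ...
        print(atomtype + atomdata)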
Is there a built-in method to pick a random line from a text file? If not, how can I do this without too much overhead?
Not built-in, but algorithm R(3.4.2) (Waterman's "Reservoir Algorithm") from Knuth's "The Art of Computer Programming" is good (in a very simplified version):
import random

def random_line(afile):
    line = next(afile)
    for num, aline in enumerate(afile, 2):
        if random.randrange(num):
            continue
        line = aline
    return line
The num, ... in enumerate(..., 2) iterator produces the sequence 2, 3, 4, ... The randrange will therefore be 0 with a probability of 1.0/num -- and that's the probability with which we must replace the currently selected line (the special case of sample size 1 of the referenced algorithm -- see Knuth's book for the proof of correctness; of course we're also in the case of a small enough "reservoir" to fit in memory ;-)) -- and that's exactly the probability with which we do so.
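Usage is just a matter of passing an open file object (the path here is a placeholder):

with open('file.txt') as afile:
    print(random_line(afile))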
import random

lines = open('file.txt').read().splitlines()
myline = random.choice(lines)
print(myline)
For a very long file:
seek to a random place in the file based on its length and find two newline characters after that position (or a newline and the end of file). If we ended up inside the last line, do it again starting 100 characters earlier, or from the beginning of the file if the original seek position was <100.
However, this is overcomplicated, as a file is an iterator. So make it a list and take random.choice (if you need many lines, use random.sample):
import random
print(random.choice(list(open('file.txt'))))
It depends on what you mean by "too much" overhead. If storing the whole file in memory is possible, then something like
import random
random_lines = random.choice(open("file").readlines())
would do the trick.
Although I am four years late, I think I have the fastest solution. Recently I wrote a python package called linereader, which allows you to manipulate the pointers of file handles.
Here is the simple solution to getting a random line with this package:
from random import randint
from linereader import dopen
length = ...    # number of lines in the file
filename = ...  # path to the file
file = dopen(filename)
random_line = file.getline(randint(1, length))
The first time this is done is the worst, as linereader has to compile the output file in a special format. After this is done, linereader can then access any line from the file quickly, whatever size the file is.
If your file is very small (small enough to fit within a megabyte), then you can replace dopen with copen, which makes a cached copy of the file in memory. Not only is this faster, but you also get the number of lines in the file as it is loaded into memory; it is done for you. All you need to do is generate the random line number. Here is some example code for this.
from random import randint
from linereader import copen
file = copen(filename)
lines = file.count('\n')
random_line = file.getline(randint(1, lines))
I just got really happy because I saw someone who could benefit from my package! Sorry for the dead answer, but the package could definitely be applied to many other problems.
If you don't want to load the whole file into RAM with f.read() or f.readlines(), you can get random line this way:
import os
import random

def get_random_line(filepath: str) -> str:
    file_size = os.path.getsize(filepath)
    with open(filepath, 'rb') as f:
        while True:
            pos = random.randint(0, file_size)
            if not pos:  # the first line is chosen
                return f.readline().decode()  # return str
            f.seek(pos)  # seek to random position
            f.readline()  # skip possibly incomplete line
            line = f.readline()  # read next (full) line
            if line:
                return line.decode()
            # else: line is empty -> EOF -> try another position in next iteration
P.S.: yes, that was proposed by Ignacio Vazquez-Abrams in his answer above, but a) there's no code in his answer and b) I've come up with this implementation myself; it can return first or last line. Hope it may be useful for someone.
However, if you care about distribution, this code is not an option for you.
If you don't want to read over the entire file, you can seek into the middle of the file, then seek backwards for the newline, and call readline.
Here is a Python 3 script which does just this.
One disadvantage of this method is that short lines have a lower likelihood of showing up.
def read_random_line(f, chunk_size=16):
    import os
    import random
    with open(f, 'rb') as f_handle:
        f_handle.seek(0, os.SEEK_END)
        size = f_handle.tell()
        i = random.randint(0, size)
        while True:
            i -= chunk_size
            if i < 0:
                chunk_size += i
                i = 0
            f_handle.seek(i, os.SEEK_SET)
            chunk = f_handle.read(chunk_size)
            i_newline = chunk.rfind(b'\n')
            if i_newline != -1:
                i += i_newline + 1
                break
            if i == 0:
                break
        f_handle.seek(i, os.SEEK_SET)
        return f_handle.readline()
A slightly improved version of Alex Martelli's answer, which handles empty files (by returning a default value):
from random import randrange

def random_line(afile, default=None):
    line = default
    for i, aline in enumerate(afile, start=1):
        if randrange(i) == 0:  # random int [0..i)
            line = aline
    return line
This approach can be used to get a random item from any iterator using O(n) time and O(1) space.
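For instance, the same function picks a random element from a generator without materializing it (the data here is purely illustrative):

squares = (n * n for n in range(1, 1000001))
print(random_line(squares))                    # a uniformly chosen square, using O(1) memory
print(random_line(iter([]), default='empty'))  # an empty iterator returns the default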
Seek to a random position, read a line and discard it, then read another line. The distribution of lines won't be uniform, but that doesn't always matter.
This may be bulky, but it works I guess? (at least for txt files)
import random
import random

choicefile = open("yourfile.txt", "r")

linelist = []
for line in choicefile:
    linelist.append(line)

choice = random.choice(linelist)
print(choice)
It reads each line of a file, and appends it to a list. It then chooses a random line from the list.
If you want to remove the line once it's chosen, just do
linelist.remove(choice)
Hope this helps; at least it needs no extra modules or imports (apart from random) and is relatively lightweight.
import random
with open("file.txt", "r") as f:
    lines = f.readlines()

print(random.choice(lines))
You can add the lines into a set(), which scrambles their order (note that the ordering is arbitrary rather than truly random, and duplicate lines are removed).
filename = open("lines.txt", 'r')
f = set(filename.readlines())
filename.close()
To find the 1st line:
print(next(iter(f)))
To find the 3rd line:
print(list(f)[2])
To list all the lines in the set:
for line in f:
    print(line)
Is there a way to read lines from a file at the same time as those lines are being processed? That is, the reading and processing would be done separately from each other: whenever data is read it would be handed over for processing, so that reading always continues regardless of how fast the processing is.
It depends on what you mean by "at the same time". Let's assume you don't necessarily want to go down the rabbit hole of multiple threads, green threads or event-based code and that you just want to cleanly separate reading of the lines, filtering/processing those lines and consuming those lines in your actual business logic.
That can easily be achieved with iterators and generators (the latter being a special kind of iterable). The file object returned from an open() call being usable as iterator in itself makes this a lot easier.
Consider this simple chaining of generator expressions (which, sure enough, are a kind of iterable) that pre-filter the read lines:
f = open('file-with-myriads-of-lines.txt', 'r')
# strip away trailing whitespace (including the newline)
lines_stripped = (line.rstrip() for line in f)
# remove trailing "#" comments (note: ignores potential quoting)
lines_without_comment = (line.partition('#')[0] for line in lines_stripped)
# remove remaining surrounding whitespace
lines_cleaned = (line.strip() for line in lines_without_comment)
# filter out (now) empty lines
lines_with_content = (line for line in lines_cleaned if line)
for line in lines_with_content:
    # your business logic goes here
    print("Line: {}".format(line))
While you could have combined some of that filtering / mangling into one generator expression or put it inside the for loop, this way the tasks are cleanly separated and you could easily mix and match by reordering, removing or adding more generators to the chain.
This also only reads and processes each line on demand, whenever one is consumed in the business logic for loop (which could also be tucked away in a separate function somewhere else). It does not read all the lines upfront and it also does not create intermediate lists with all the intermediate results. This is in contrast to list comprehensions, which are written with square brackets instead of parentheses.
Of course you can also give each unit of processing a name in the form of a function, to increase readability, encapsulation and maintainability:
def strip_trailing_whitespace(iterable):
    return (line.rstrip() for line in iterable)

def remove_trailing_comments(iterable):
    return (line.partition('#')[0] for line in iterable)

# ...

def preprocess_lines(iterable):
    iterable = strip_trailing_whitespace(iterable)
    iterable = remove_trailing_comments(iterable)
    # ...
    return iterable

def business_logic(iterable):
    for line in iterable:
        # your business logic here
        print("Line: {}".format(line))

def main():
    with open('file-with-myriads-of-lines.txt', 'r') as f:
        iterable = preprocess_lines(f)
        business_logic(iterable)

if __name__ == '__main__':
    main()
And if your pre-processing of each line gets more complex than what is usable inside a generator expression, you can simply expand this to a custom generator function using the yield statement or expression:
def remove_trailing_comments(iterable):
    """Remove #-comments that are outside of double-quoted parts."""
    for line in iterable:
        pos = -1
        while True:
            pos = line.find('#', pos + 1)
            if pos < 0:
                break  # use whole line
            if line[:pos].count('"') % 2 == 0:
                # strip starting from first "#" that's not inside quotes
                line = line[:pos]
                break
        yield line
Everything else remains the same.
I am developing a string filter for huge process log files in a distributed system.
These log files are >1GB and contain millions of lines. These logs contain a special type of message block which starts with "SMsg{" and ends with "}". My program reads the whole file line by line and puts the line numbers of the lines containing "SMsg{" into a list. Here is my Python method to do that.
def FindNMsgStart(self, logfile):
    self.logfile = logfile
    lf = LogFilter()
    infile = lf.OpenFile(logfile, 'Input')
    NMsgBlockStart = list()
    for num, line in enumerate(infile.readlines()):
        if re.search('SMsg{', line):
            NMsgBlockStart.append(num)
    return NMsgBlockStart
This is my lookup function, which searches for any given word in the text file.
def Lookup(self, infile, regex, start, end):
    self.infile = infile
    self.regex = regex
    self.start = start
    self.end = end
    result = 0
    for num, line in enumerate(itertools.islice(infile, start, end)):
        if re.search(regex, line):
            result = num + start
            break
    return result
Then I take that list and find the end of each starting block by searching through the whole file. Following is my code to find the end.
def FindNmlMsgEnd(self, logfile, NMsgBlockStart):
    self.logfile = logfile
    self.NMsgBlockStart = NMsgBlockStart
    NMsgBlockEnd = list()
    lf = LogFilter()
    length = len(NMsgBlockStart)
    if length > 0:
        for i in range(0, length):
            start = NMsgBlockStart[i]
            infile = lf.OpenFile(logfile, 'Input')
            lines = lf.LineCount(logfile, 'Input')
            end = lf.Lookup(infile, '}', start, lines + 1)
            NMsgBlockEnd.append(end)
        return NMsgBlockEnd
    else:
        print("There is no Normal Message blocks.")
But those methods are not efficient enough to handle huge files; the program runs for a long time without producing a result.
Is there an efficient way to do this?
If yes, how could I do it?
I am writing other filters too, but first I need to find a solution for this basic problem. I am really new to Python. Please help me.
I see a couple of issues that are slowing your code down.
The first seems to be a pretty basic error. You're calling readlines on your file in the FindNMsgStart method, which is going to read the whole file into memory and return a list of its lines.
You should just iterate over the lines directly by using enumerate(infile). You do this properly in the other functions that read the file, so I suspect this is a typo or just a simple oversight.
The second issue is a bit more complicated. It involves the general architecture of your search.
You're first scanning the file for message start lines, then searching for the end line after each start. Each end-line search requires re-reading much of the file, since you need to skip all the lines that occur before the start line. It would be a lot more efficient if you could combine both searches into a single pass over the data file.
Here's a really crude generator function that does that:
def find_message_bounds(filename):
    with open(filename) as f:
        iterator = enumerate(f)
        for start_line_no, start_line in iterator:
            if 'SMsg{' in start_line:
                for end_line_no, end_line in iterator:
                    if '}' in end_line:
                        yield start_line_no, end_line_no
                        break
This function yields start, end line number tuples, and only makes a single pass over the file.
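Usage would look something like this (the log file name is a placeholder):

for start, end in find_message_bounds('process.log'):
    print("SMsg block from line {} to line {}".format(start, end))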
I think you can actually implement a one-pass search using your Lookup method, if you're careful with the boundary variables you pass in to it.
def FindNmlMsgEnd(self, logfile, NMsgBlockStart):
    self.logfile = logfile
    self.NMsgBlockStart = NMsgBlockStart
    NMsgBlockEnd = list()
    lf = LogFilter()
    infile = lf.OpenFile(logfile, 'Input')
    total_lines = lf.LineCount(logfile, 'Input')
    start = NMsgBlockStart[0]
    prev_end = -1
    for next_start in NMsgBlockStart[1:]:  # skip the first start, already held in `start`
        end = lf.Lookup(infile, '}', start - prev_end - 1, next_start - prev_end - 1) + prev_end + 1
        NMsgBlockEnd.append(end)
        start = next_start
        prev_end = end
    last_end = lf.Lookup(infile, '}', start - prev_end - 1, total_lines - prev_end - 1) + prev_end + 1
    NMsgBlockEnd.append(last_end)
    return NMsgBlockEnd
It's possible I have an off-by-one error in there somewhere; the design of the Lookup function makes it difficult to call repeatedly.
I have XML files that contain invalid character sequences which cause parsing to fail. They look like &#x10;. To solve the problem, I am escaping them by replacing the whole thing with an escape sequence: &#x10; --> !#~10^. Then after I am done parsing I can restore them to what they were.
import re

buffersize = 2**16  # 64 KB buffer

def escape(filename):
    out = file(filename + '_esc', 'w')
    with open(filename, 'r') as f:
        buffer = 'x'  # is there a prettier way to handle the first one?
        while buffer != '':
            buffer = f.read(buffersize)
            out.write(re.sub(r'&#x([a-fA-F0-9]+);', r'!#~\1^', buffer))
    out.close()
The files are very large, so I have to use buffering (mmap gave me a MemoryError). Because the buffer has a fixed size, I am running into problems when a read boundary happens to split a sequence. Imagine the buffer size is 8, and the file is like:
123456789
hello!&#x10;
The first read will only pick up hello!&#, allowing &#x10; to slip through the cracks. How do I solve this? I thought of reading more characters if the last few look like they could belong to a character sequence, but the logic I thought of is very ugly.
First, don't bother to read and write the file: you can create a file-like object that wraps your open file and processes the data before it's handled by the parser. Second, your buffering only needs to take care of the tail end of each read. Here's some working code:
import re

class Wrapped(object):
    def __init__(self, f):
        self.f = f
        self.buffer = ""

    def read(self, size=0):
        buf = self.buffer + self.f.read(size)
        # If there's an ampersand near the end, hold onto that piece until we
        # have more, to be sure we don't miss one. Hold it back *before*
        # escaping so the carried text isn't escaped a second time next call.
        last_amp = buf.rfind("&", -10, -1)
        if last_amp > 0:
            self.buffer = buf[last_amp:]
            buf = buf[:last_amp]
        else:
            self.buffer = ""
        buf = buf.replace("!", "!!")
        buf = re.sub(r"&(#x[0-9a-fA-F]+;)", r"!\1", buf)
        return buf
Then in your code, replace this:
it = ET.iterparse(file(xml, "rb"))
with this:
it = ET.iterparse(Wrapped(file(xml, "rb")))
Third, I used a substitution scheme that replaces the "&" of a matched sequence with "!", and a literal "!" with "!!", so you can fix them after parsing, and you aren't counting on obscure sequences. This is Stack Overflow data after all, so lots of strange random punctuation could occur naturally.
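The answer stops short of the restore step; here is one possible sketch of undoing that mapping on the text you get back from the parser (my own addition, not part of the original answer):

import re

def unescape(text):
    # "!!" came from a literal "!", and a lone "!" before "#x...;" came from "&".
    # The alternation is tried left to right, so doubled "!" is handled first.
    return re.sub(r'!!|!(#x[0-9a-fA-F]+;)',
                  lambda m: '!' if m.group(0) == '!!' else '&' + m.group(1),
                  text)

print(unescape('a!!b!#x10;c'))  # -> a!b&#x10;c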
If your sequence is 6 characters long, you can use buffers that overlap by 5 characters. That way, you are sure no sequence can slip between two buffers.
Here is an example to help you visualize it: each buffer starts 5 characters before the previous one ended, so a 6-character sequence such as &#x10; that is cut off at the end of one buffer appears in full at the start of the next one.
As for the implementation, just prepend the last 5 characters of the previous buffer to the new one:
buffer = buffer[-5:] + f.read(buffersize)
The only problem is that the concatenation may require a copy of the whole buffer. Another solution, if you have random access to the file, is to rewind a little bit with :
f.seek(-5, os.SEEK_CUR)
In both cases, you'll have to modify the script slightly to handle the first iteration.
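As a sketch of what that modification could look like, here is a rewrite of the question's escape() that carries an unprocessed tail between reads. Instead of a fixed 5-character overlap it holds back everything from a trailing '&' onward (the same rfind idea as the earlier answer), and it assumes the short &#x..; form from the question; widen the 5-character window if longer hex references are possible:

import re

buffersize = 2 ** 16  # 64 KB, as in the question
pattern = re.compile(r'&#x([a-fA-F0-9]+);')

def escape(filename):
    with open(filename, 'r') as f, open(filename + '_esc', 'w') as out:
        carry = ''                      # raw tail held over from the previous read
        while True:
            chunk = f.read(buffersize)
            buf = carry + chunk
            if not chunk:               # EOF: flush whatever is left and stop
                out.write(pattern.sub(r'!#~\1^', buf))
                break
            # Hold back anything from an '&' in the last few characters, so a
            # sequence split by the read boundary is completed next time around.
            amp = buf.rfind('&', len(buf) - 5)
            if amp != -1:
                carry = buf[amp:]
                buf = buf[:amp]
            else:
                carry = ''
            out.write(pattern.sub(r'!#~\1^', buf))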