I have XML files that contain invalid character sequences which cause parsing to fail. They look like &#x10;. To solve the problem, I am escaping them by replacing the whole sequence with an escape sequence: &#x10; --> !#~10^. Then, after I am done parsing, I can restore them to what they were.
import re

buffersize = 2**16  # 64 KB buffer

def escape(filename):
    out = file(filename + '_esc', 'w')
    with open(filename, 'r') as f:
        buffer = 'x'  # is there a prettier way to handle the first one?
        while buffer != '':
            buffer = f.read(buffersize)
            out.write(re.sub(r'&#x([a-fA-F0-9]+);', r'!#~\1^', buffer))
    out.close()
The files are very large, so I have to use buffering (mmap gave me a MemoryError). Because the buffer has a fixed size, I am running into problems when the buffer happens to be small enough to split a sequence. Imagine the buffer size is 8 and the file looks like:
123456789
hello!&#x10;
The buffer will only read hello!&#, allowing &#x10; to slip through the cracks unescaped. How do I solve this? I thought of reading more characters if the last few look like they could belong to a character sequence, but the logic I came up with is very ugly.
First, don't bother to read and write the file: you can create a file-like object that wraps your open file and processes the data before it's handed to the parser. Second, your buffering only needs to take care of the ends of the chunks you read. Here's some working code:
import re

class Wrapped(object):
    def __init__(self, f):
        self.f = f
        self.buffer = ""

    def read(self, size=0):
        buf = self.buffer + self.f.read(size)
        buf = buf.replace("!", "!!")
        buf = re.sub(r"&(#x[0-9a-fA-F]+;)", r"!\1", buf)
        # If there's an ampersand near the end, hold onto that piece until we
        # have more, to be sure we don't miss one.
        last_amp = buf.rfind("&", -10, -1)
        if last_amp > 0:
            self.buffer = buf[last_amp:]
            buf = buf[:last_amp]
        else:
            self.buffer = ""
        return buf
Then in your code, replace this:
it = ET.iterparse(file(xml, "rb"))
with this:
it = ET.iterparse(Wrapped(file(xml, "rb")))
Third, I used a substitution replacing "&" with "!", and "!" with "!!", so you can fix them after parsing, and you aren't counting on obscure sequences. This is Stack Overflow data after all, so lots of strange random punctuation could occur naturally.
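For reference, the reverse step after parsing could look something like this; a minimal sketch (the unescape helper is my illustration, not part of the answer above), decoding in a single left-to-right pass so a doubled "!" is never confused with an escaped entity:
import re

def unescape(text):
    # Hypothetical reversal of the substitutions in Wrapped.read():
    # "!!" decodes back to "!", and "!#x...;" back to "&#x...;".
    def repl(m):
        return "!" if m.group(0) == "!!" else "&" + m.group(1)
    return re.sub(r"!!|!(#x[0-9a-fA-F]+;)", repl, text)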
If your sequence is 6 characters long, you can use buffers that overlap by 5 characters. That way, you can be sure no sequence will ever slip between two buffers.
Here is an example to help you visualize it: picture successive buffers that each overlap the previous one by 5 characters. A sequence such as &#x10; that gets cut off at the end of one buffer will then appear in full at the start of the next buffer.
As for the implementation, just prepend the last 5 characters of the previous buffer to the new buffer:
buffer = buffer[-5:] + f.read(buffersize)
The only problem is that the concatenation may require a copy of the whole buffer. Another solution, if you have random access to the file, is to rewind a little bit with:
f.seek(-5, os.SEEK_CUR)
In both cases, you'll have to modify the script slightly to handle the first iteration.
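One way to fold the overlap into the question's escape loop without writing the overlapping characters twice is to hold back the tail of each chunk; a rough sketch (the function name and the hold-back detail are mine), assuming as above that a sequence is at most 6 characters so a 5-character overlap is enough:
import re

def escape_with_overlap(src, dst, buffersize=2**16, overlap=5):
    tail = ''
    with open(src, 'r') as f, open(dst, 'w') as out:
        while True:
            chunk = f.read(buffersize)
            if not chunk:
                out.write(tail)  # flush whatever is left at EOF
                break
            escaped = re.sub(r'&#x([a-fA-F0-9]+);', r'!#~\1^', tail + chunk)
            # hold back the last `overlap` characters; a sequence split at
            # the read boundary is re-scanned together with the next chunk
            out.write(escaped[:-overlap])
            tail = escaped[-overlap:]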
Related
I need to read chunks of 64 KB in a loop and process them, but stop at the end of the file minus 16 bytes: the last 16 bytes are tag metadata.
The file might be super large, so I can't read it all in RAM.
All the solutions I find are a bit clumsy and/or unpythonic.
with open('myfile', 'rb') as f:
    while True:
        block = f.read(65536)
        if not block:
            break
        process_block(block)
If 16 <= len(block) < 65536, it's easy: it's the last block ever. So useful_data = block[:-16] and tag = block[-16:]
If len(block) == 65536, it could mean three things: the full block is useful data; or this 64 KB block is in fact the last block, so useful_data = block[:-16] and tag = block[-16:]; or this 64 KB block is followed by one more block of only a few bytes (say 3 bytes), in which case useful_data = block[:-13] and tag = block[-13:] + last_block[:3].
How to deal with this problem in a nicer way than distinguishing all these cases?
Note:
the solution should work for a file opened with open(...), but also for an io.BytesIO() object, or for a remote file opened over SFTP (with pysftp).
I was thinking about getting the file object size, with
f.seek(0,2)
length = f.tell()
f.seek(0)
Then after each
block = f.read(65536)
we can know how far we are from the end with length - f.tell(), but again the full solution does not look very elegant.
You can just read min(65536, L - f.tell() - 16) bytes in every iteration.
Something like this:
from pathlib import Path

L = Path('myfile').stat().st_size
with open('myfile', 'rb') as f:
    while True:
        to_read_length = min(65536, L - f.tell() - 16)
        block = f.read(to_read_length)
        process_block(block)
        if f.tell() == L - 16:
            break
I haven't run this, but I hope you get the gist of it.
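Since the question also mentions io.BytesIO and SFTP file objects, which have no path to stat, the same idea works with the seek/tell trick from the question to get the length; a small sketch under the assumption that the object is seekable and that process_block is your own function:
import io

def read_all_but_last16_seekable(f, blocksize=65536):
    f.seek(0, io.SEEK_END)   # find the total length...
    length = f.tell()
    f.seek(0)                # ...then rewind
    while f.tell() < length - 16:
        block = f.read(min(blocksize, length - 16 - f.tell()))
        process_block(block)
    tag = f.read(16)         # the last 16 bytes are the tag metadata
    return tag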
The following method relies only on the fact that the f.read() method returns an empty bytes object upon end of stream (EOS). It could thus be adapted for sockets simply by replacing f.read() with s.recv().
import random

def read_all_but_last16(f):
    rand = random.Random()  # just for testing
    buf = b''
    while True:
        bytes_read = f.read(rand.randint(1, 40))  # just for testing
        # bytes_read = f.read(65536)
        buf += bytes_read
        if not bytes_read:
            break
        process_block(buf[:-16])
        buf = buf[-16:]
    verify(buf[-16:])
It works by always leaving 16 bytes at the end of buf until EOS, then finally processing the last 16. Note that if there aren't at least 17 bytes in buf then buf[:-16] returns the empty bytes object.
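A quick way to sanity-check it (this harness and the process_block/verify stubs are mine, not part of the answer) is to feed it an in-memory stream:
import io

def process_block(block):
    print('processed', len(block), 'bytes')

def verify(tag):
    assert tag == b'0123456789abcdef'

read_all_but_last16(io.BytesIO(b'payload bytes go here' + b'0123456789abcdef'))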
I'm trying to figure out how to get the first N strings from a txt file, and store them into an array. Right now, I have code that gets every string from a txt file, separated by a space delimiter, and stores it into an array. However, I want to be able to only grab the first N number of strings from it, not every single string. Here is my code (and I'm doing it from a command prompt):
import sys
f = open(sys.argv[1], "r")
contents = f.read().split(' ')
f.close()
I'm sure that the only line I need to fix is:
contents = f.read().split(' ')
I'm just not sure how to limit it here to N number of strings.
If the file is really big, but not too big--that is, big enough that you don't want to read the whole file (especially in text mode or as a list of lines), but not so big that you can't page it into memory (which means under 2GB on a 32-bit OS, but a lot more on 64-bit), you can do this:
import itertools
import mmap
import re
import sys

n = 5

# Notice that we're opening in binary mode. We're going to do a
# bytes-based regex search. This is only valid if (a) the encoding
# is ASCII-compatible, and (b) the spaces are ASCII whitespace, not
# other Unicode whitespace.
with open(sys.argv[1], 'rb') as f:
    # map the whole file into memory--this won't actually read
    # more than a page or so beyond the last space
    m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # match and decode all space-separated words, but do it lazily...
    matches = re.finditer(br'(.*?)\s', m)
    bytestrings = (match.group(1) for match in matches)
    strings = (b.decode() for b in bytestrings)
    # ... so we can stop after 5 of them ...
    nstrings = itertools.islice(strings, n)
    # ... and turn that into a list of the first 5
    contents = list(nstrings)
Obviously you can combine steps together, even cramming the whole thing into a giant one-liner if you want. (An idiomatic version would be somewhere between that extreme and this one.)
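For example, the generator steps in the middle could be collapsed into something like this (same assumptions as above, reusing the mm and n names from the block already shown):
contents = [m.group(1).decode()
            for m in itertools.islice(re.finditer(br'(.*?)\s', mm), n)]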
If you're fine with reading the whole file (assuming it's not memory prohibitive to do so) you can just do this:
strings_wanted = 5
strings = open('myfile').read().split()[:strings_wanted]
That works like this:
>>> s = 'this is a test string with more than five words.'
>>> s.split()[:5]
['this', 'is', 'a', 'test', 'string']
If you actually want to stop reading exactly as soon as you've reached the nth word, you pretty much have to read a byte at a time. But that's going to be slow, and complicated. Plus, it's still not really going to stop reading after the nth word, unless you're reading in binary mode and decoding manually, and you disable buffering.
As long as the text file has line breaks (as opposed to being one giant 80MB line), and it's acceptable to read a few bytes past the nth word, a very simple solution will still be pretty efficient: just read and split line by line:
import sys

n = 5  # however many words you want

f = open(sys.argv[1], "r")
contents = []
for line in f:
    contents += line.split()
    if len(contents) >= n:
        del contents[n:]
        break
f.close()
What about just:
output = input[:3]
output will contain the first three strings in input.
I am developing a string filter for huge process log files in a distributed system.
These log files are >1GB and contain millions of lines. These logs contain a special type of message block which starts with "SMsg{" and ends with "}". My program reads the whole file line by line and puts the line numbers of the lines containing "SMsg{" into a list. Here is my Python method to do that.
def FindNMsgStart(self, logfile):
    self.logfile = logfile
    lf = LogFilter()
    infile = lf.OpenFile(logfile, 'Input')
    NMsgBlockStart = list()
    for num, line in enumerate(infile.readlines()):
        if re.search('SMsg{', line):
            NMsgBlockStart.append(num)
    return NMsgBlockStart
This is my lookup function to search any kind of word in the text file.
def Lookup(self, infile, regex, start, end):
    self.infile = infile
    self.regex = regex
    self.start = start
    self.end = end
    result = 0
    for num, line in enumerate(itertools.islice(infile, start, end)):
        if re.search(regex, line):
            result = num + start
            break
    return result
Then I take that list and find the end of each starting block by searching through the whole file. Following is my code for finding the end.
def FindNmlMsgEnd(self, logfile, NMsgBlockStart):
    self.logfile = logfile
    self.NMsgBlockStart = NMsgBlockStart
    NMsgBlockEnd = list()
    lf = LogFilter()
    length = len(NMsgBlockStart)
    if length > 0:
        for i in range(0, length):
            start = NMsgBlockStart[i]
            infile = lf.OpenFile(logfile, 'Input')
            lines = lf.LineCount(logfile, 'Input')
            end = lf.Lookup(infile, '}', start, lines + 1)
            NMsgBlockEnd.append(end)
        return NMsgBlockEnd
    else:
        print("There is no Normal Message blocks.")
But those methods are not efficient enough to handle huge files. The program runs for a long time without producing a result.
Is there an efficient way to do this?
If yes, how could I do it?
I am writing other filters too, but first I need to find a solution for this basic problem. I am really new to Python. Please help me.
I see a couple of issues that are slowing your code down.
The first seems to be a pretty basic error. You're calling readlines on your file in the FindNMsgStart method, which is going to read the whole file into memory and return a list of its lines.
You should just iterate over the lines directly by using enumerate(infile). You do this properly in the other functions that read the file, so I suspect this is a typo or just a simple oversight.
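That change alone would look like this (a sketch of just that fix, keeping the rest of the method as posted):
def FindNMsgStart(self, logfile):
    self.logfile = logfile
    lf = LogFilter()
    infile = lf.OpenFile(logfile, 'Input')
    NMsgBlockStart = list()
    for num, line in enumerate(infile):  # iterate lazily instead of readlines()
        if re.search('SMsg{', line):
            NMsgBlockStart.append(num)
    return NMsgBlockStart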
The second issue is a bit more complicated. It involves the general architecture of your search.
You're first scanning the file for message start lines, then searching for the end line after each start. Each end-line search requires re-reading much of the file, since you need to skip all the lines that occur before the start line. It would be a lot more efficient if you could combine both searches into a single pass over the data file.
Here's a really crude generator function that does that:
def find_message_bounds(filename):
    with open(filename) as f:
        iterator = enumerate(f)
        for start_line_no, start_line in iterator:
            if 'SMsg{' in start_line:
                for end_line_no, end_line in iterator:
                    if '}' in end_line:
                        yield start_line_no, end_line_no
                        break
This function yields start, end line number tuples, and only makes a single pass over the file.
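Used along these lines (the log filename is just a placeholder):
for start, end in find_message_bounds('process.log'):
    print("block from line {} to line {}".format(start, end))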
I think you can actually implement a one-pass search using your Lookup method, if you're careful with the boundary variables you pass in to it.
def FindNmlMsgEnd(self, logfile, NMsgBlockStart):
    self.logfile = logfile
    self.NMsgBlockStart = NMsgBlockStart
    NMsgBlockEnd = list()
    lf = LogFilter()
    infile = lf.OpenFile(logfile, 'Input')
    total_lines = lf.LineCount(logfile, 'Input')
    start = NMsgBlockStart[0]
    prev_end = -1
    for next_start in NMsgBlockStart[1:]:
        end = lf.Lookup(infile, '}', start - prev_end - 1, next_start - prev_end - 1) + prev_end + 1
        NMsgBlockEnd.append(end)
        start = next_start
        prev_end = end
    last_end = lf.Lookup(infile, '}', start - prev_end - 1, total_lines - prev_end - 1) + prev_end + 1
    NMsgBlockEnd.append(last_end)
    return NMsgBlockEnd
It's possible I have an off-by-one error in there somewhere, the design of the Lookup function makes it difficult to call repeatedly.
I am working on a script that breaks another Python script down into blocks and uses PyCrypto to encrypt the blocks (all of this I have successfully done so far). Now I am storing the encrypted blocks in a file so that the decrypter can read it and execute each block. The final result of the encryption is a list of binary outputs (something like blocks = [b'\xa1\r\xa594\x92z\xf8\x16\xaa', b'xfbI\xfdqx|\xcd\xdb\x1b\xb3', etc...]).
When writing the output to a file, the blocks all end up on one giant line, so when reading the file back, all the bytes come back as one giant line instead of as the items of the original list. I also tried converting the bytes to a string and adding a '\n' at the end of each one, but the problem there is that I still need the bytes, and I can't figure out how to undo the string to get the original bytes back.
To summarize, I am looking to either write each binary item to a separate line in a file so I can easily read the data and use it in the decryption, or translate the data to a string and, in the decryption, undo the string to get back the original binary data.
Here is the code for writing to the file:
new_file = open('C:/Python34/testfile.txt', 'wb')
for byte_item in byte_list:
    # This, or for the string method I just replaced wb with w and
    # byte_item with ascii(byte_item) + '\n'
    new_file.write(byte_item)
new_file.close()
and for reading the file:
# Or 'r' instead of 'rb' if using string method
byte_list = open('C:/Python34/testfile.txt','rb').readlines()
A file is a stream of bytes without any implied structure. If you want to load a list of binary blobs then you should store some additional metadata to restore the structure e.g., you could use the netstring format:
#!/usr/bin/env python
blocks = [b'\xa1\r\xa594\x92z\xf8\x16\xaa', b'xfbI\xfdqx|\xcd\xdb\x1b\xb3']

# save blocks
with open('blocks.netstring', 'wb') as output_file:
    for blob in blocks:
        # [len]":"[string]","
        output_file.write(str(len(blob)).encode())
        output_file.write(b":")
        output_file.write(blob)
        output_file.write(b",")
Read them back:
#!/usr/bin/env python3
import re
from mmap import ACCESS_READ, mmap

blocks = []
match_size = re.compile(br'(\d+):').match
with open('blocks.netstring', 'rb') as file, \
        mmap(file.fileno(), 0, access=ACCESS_READ) as mm:
    position = 0
    for m in iter(lambda: match_size(mm, position), None):
        i, size = m.end(), int(m.group(1))
        blocks.append(mm[i:i + size])
        position = i + size + 1  # shift to the next netstring
print(blocks)
As an alternative, you could consider the BSON format for your data, or an ASCII armor format.
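If newline-separated lines are really what you want, a lightweight take on the ASCII-armor idea is to base64-encode each block so it can never contain a raw newline; this variant is my suggestion, not part of the answer above:
import base64

# write: one base64-encoded block per line
with open('blocks.b64', 'wb') as out:
    for blob in blocks:
        out.write(base64.b64encode(blob) + b'\n')

# read back: decode each line to recover the original bytes
with open('blocks.b64', 'rb') as f:
    blocks_back = [base64.b64decode(line) for line in f]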
I think what you're looking for is byte_list=open('C:/Python34/testfile.txt','rb').read()
If you know how many bytes each item is, you can use read(number_of_bytes) to process one item at a time.
read() will read the entire file, but then it is up to you to decode that entire list of bytes into their respective items.
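For example, if every encrypted block happened to be the same length, the read loop could look like this (the 16-byte size and the handle_block name are only placeholders):
ITEM_SIZE = 16  # hypothetical fixed block size

with open('C:/Python34/testfile.txt', 'rb') as f:
    while True:
        item = f.read(ITEM_SIZE)
        if not item:
            break
        handle_block(item)  # placeholder for your decryption step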
In general, since you're using Python 3, you will be working with bytes objects (which are immutable) and/or bytearray objects (which are mutable).
Example:
b1 = bytearray('hello', 'utf-8')
print(b1)
b1 += bytearray(' goodbye', 'utf-8')
print(b1)
open('temp.bin', 'wb').write(b1)
#------
b2 = open('temp.bin', 'rb').read()
print(b2)
Output:
bytearray(b'hello')
bytearray(b'hello goodbye')
b'hello goodbye'
I'm dipping my toes into Python threading. I've created a supplier thread that returns me character/line data from a *nix (serial) /dev via a Queue.
As an exercise, I would like to consume the data from the queue one line at a time (using '\n' as the line terminator).
My current (simplistic) solution is to put() only 1 character at a time into the queue, so the consumer will only get() one character at a time. (Is this a safe assumption?) This approach currently allows me to do the following:
...
return_buffer = []
while True:
    rcv_data = queue.get(block=True)
    return_buffer.append(rcv_data)
    if rcv_data == "\n":
        return return_buffer
This seems to be working, but I can definitely cause it to fail when I put() 2 characters at a time.
I would like to make the receive logic more generic and able to handle multi-character put()s.
My next approach would be to rcv_data.partition("\n"), putting the "remainder" in yet another buffer/list, but that will require juggling the temporary buffer alongside the queue.
(I guess another approach would be to only put() one line at a time, but where's the fun in that?)
Is there a more elegant way to read from a queue one line at a time?
This may be a good use for a generator. It will pick up exactly where it left off after yield, so it reduces the amount of storage and buffer swapping you need (I cannot speak to its performance).
def getLineGenerator(queue, splitOn):
    return_buffer = []
    while True:
        rcv_data = queue.get(block=True)  # We can pull any number of characters here.
        for c in rcv_data:
            return_buffer.append(c)
            if c == splitOn:
                yield "".join(return_buffer)  # join so the caller gets a string, not a list
                return_buffer = []
gen = getLineGenerator(myQueue, "\n")
for line in gen:
    print line.strip()
Edit:
Once J.F. Sebastian pointed out that the line separator could be multi-character, I had to solve that case as well. I also used StringIO from jdi's answer. Again, I cannot speak to the efficiency, but I believe it is correct in all cases (at least the ones I could think of). This is untested, so it would probably need some tweaks to actually run. Thanks to J.F. Sebastian and jdi for their answers, which ultimately led to this one.
from cStringIO import StringIO  # python3: from io import StringIO

def getlines(chunks, splitOn="\n"):
    r_buffer = StringIO()
    for chunk in chunks:
        r_buffer.write(chunk)
        pos = r_buffer.getvalue().find(splitOn)  # can't use rfind, see the next comment
        while pos != -1:  # A single chunk may have more than one separator
            line = r_buffer.getvalue()[:pos + len(splitOn)]
            yield line
            rest = r_buffer.getvalue().split(splitOn, 1)[1]
            r_buffer.seek(0)
            r_buffer.truncate()
            r_buffer.write(rest)
            pos = rest.find(splitOn)  # rest and r_buffer are equivalent at this point; use rest to avoid an extra call to getvalue
    line = r_buffer.getvalue()
    r_buffer.close()  # just for completeness
    yield line  # whatever is left over
for line in getlines(iter(queue.get, None)):  # break on queue.put(None)
    process(line)
If your specific use-case producer needs to put to the queue character by character, then I suppose I can't see anything wrong with getting them in a loop in the consumer. But you can probably get better performance by using a StringIO object as the buffer.
from cStringIO import StringIO
# python3: from io import StringIO
buf = StringIO()
The object is file-like, so you can write to it, seek it, and call getvalue() at any time to get the complete string value in the buffer. This will most likely give you much better performance than having to constantly grow a list, join it to a string, and clear it.
return_buffer = StringIO()
while True:
    rcv_data = queue.get(block=True)
    return_buffer.write(rcv_data)
    if rcv_data == "\n":
        ret = return_buffer.getvalue()
        return_buffer.seek(0)
        # truncate, unless you are counting bytes and
        # reading the data directly each time
        return_buffer.truncate()
        return ret
The queue returns exactly what you put in it. If you put fragments you get fragments. If you put lines you get lines.
To consume line by line when partial lines in the input are allowed and could be completed later, you need a buffer, either explicit or implicit, to store partial lines:
def getlines(fragments, linesep='\n'):
    buff = []
    for fragment in fragments:
        pos = fragment.rfind(linesep)
        if pos != -1:  # linesep in fragment
            lines = fragment[:pos].split(linesep)
            if buff:  # start of line from previous fragment
                lines[0] = ''.join(buff) + lines[0]  # prepend
                del buff[:]  # clear buffer
            rest = fragment[pos + len(linesep):]
            if rest:
                buff.append(rest)
            yield from lines
        elif fragment:  # linesep not in fragment, fragment is not empty
            buff.append(fragment)
    if buff:
        yield ''.join(buff)  # flush the rest
It allows fragments and a linesep of arbitrary length, but a linesep should not span several fragments.
Usage:
for line in getlines(iter(queue.get, None)):  # break on queue.put(None)
    process(line)
It's important to note that there could be multiple lines in the queue. This function will return (and optionally print) all the lines from a given queue:
def getQueueContents(queue, printContents=True):
    contents = ''
    # get the full queue contents, not just a single line
    while not queue.empty():
        line = queue.get_nowait()
        contents += line
        if printContents:
            # remove the newline at the end
            print line[:-1]
    return contents