s3fs read_block with delimiter lookahead? - python

I am using s3fs read_block to distribute a csv equally across multiple processes. Each process needs to be given a byte range to operate on and work independently of others. Every line in the csv needs to be processed without overlap.
The problem is that the beginning and ends of byte ranges are unlikely to be the beginning and ends of lines. So some lines may get chopped off.
For example, my csv looks like this:
beer\npizza\nwings
And I want to process this in chunks of 9 bytes. For byte range 0-9 I will get "beer", and for byte range 10-16 I will get "wings". I will never get "pizza", because the split falls in the middle of that line:
beer\npizza\nwings
__________^_______
What I need is some kind of lookahead, where I get the bytes between 0-9 plus any additional bytes required to complete the next line. Then my results would be beer\npizza and wings.
Is lookahead the right way of looking at this or is there another solution? If lookahead is the right way to do this, can this be done with s3fs or do I need a custom implementation to do this lookahead first to find the correct byte range?
Edit:
Custom implementation example:
if self._lookahead:
    self._logger.debug('Performing lookahead')
    # Use lookahead to find the next newline in the csv
    self._logger.debug(f'{end - 1}, {self._lookahead + 1}')
    r = s3.read_block(self._s3_path, end - 1, self._lookahead + 1)
    if b'\n' not in r[:2]:
        # Range ends in the middle of a line; look ahead for the next newline
        read_length = read_length + r.index(b'\n')
        self._logger.debug(f'New end found {read_length}')

From the s3fs documentation: you can simply pass the delimiter to the read_block() function.
Hope it helps:
s3.read_block(path, offset=1000, length=10, delimiter=b'\n')
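To split the whole file across workers, the same call can be made once per chunk; successive offsets then yield whole, non-overlapping lines. A minimal sketch (the bucket path and chunk size are placeholders, not from the question):

import s3fs

fs = s3fs.S3FileSystem()    # credentials picked up from the environment
path = 'mybucket/data.csv'  # hypothetical path
size = fs.info(path)['size']
chunk = 9                   # bytes per worker, as in the example above

# With delimiter=b'\n', read_block expands each offset/length window to
# whole lines, so every line is seen exactly once with no overlap.
for offset in range(0, size, chunk):
    block = fs.read_block(path, offset, chunk, delimiter=b'\n')
    for line in block.splitlines():
        print(line)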


How to save data to a file on separate items instead of one long string?

I am having trouble simply saving items into a file for later reading. When I save the file, instead of listing the items as single items, it appends the data together as one long string. According to my Google searches, write should not be concatenating the items like this.
What am I doing wrong?
Code:
with open('Ped.dta', 'w+') as p:
    p.write(str(recnum))  # Add record number to top of file
    for x in range(recnum):
        p.write(dte[x])       # Write date
        p.write(str(stp[x]))  # Write Steps number
Since you do not show your data or your output, I cannot be sure, but it seems you are trying to use the write method like the print function. There are important differences.
Most important, write does not follow its written characters with any separator (like the space print uses by default) or end (like the \n print uses by default).
Therefore there is no space between your date and steps number, and no newline between lines, because you did not write them and Python did not add them.
So add those. Try the lines
p.write(dte[x]) # Write date
p.write(' ') # space separator
p.write(str(stp[x])) # Write Steps number
p.write('\n') # line terminator
Note that I do not know the format of your "date" that is written, so you may need to convert that to text before writing it.
Now that I have the time, I'll implement @abarnert's suggestion (made in a comment) and show you how to get the advantages of the print function while still writing to a file. Just use the file= parameter in Python 3, or in Python 2 after executing the statement
from __future__ import print_function
Using print you can do my four lines above in one line, since print automatically adds the space separator and newline end:
print(dte[x], str(stp[x]), file=p)
This does assume that your date datum dte[x] is to be printed as text.
Try adding a newline ('\n') character at the end of your lines, as shown in the docs. This should solve the problem of 'listing the items as single items', though the file you create may still not be greatly structured.
For your further Google searches, you may want to check serialization, as well as the json and csv formats, all covered in the Python standard library.
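For instance, a minimal sketch with the csv module; the sample values for recnum, dte and stp are made up, since the question does not show its data:

import csv

# Hypothetical sample data mirroring the question's variable names
recnum = 2
dte = ['2018-05-01', '2018-05-02']
stp = [5231, 6480]

with open('Ped.dta', 'w', newline='') as p:
    writer = csv.writer(p, delimiter=' ')
    writer.writerow([recnum])              # record count on the first line
    for x in range(recnum):
        writer.writerow([dte[x], stp[x]])  # one "date steps" pair per line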
Your question would have benefited from a very small example of the recnum variable. Also, the original f.close() is not necessary since you have a with statement; see here on SO.

Iterate over bytes with findall

I'm trying to work with a settings file that is a binary file, to find out how it's structured, so that I might get some information (file locations etc.) from it.
As far as I can tell, the interesting data is either exactly after or near escape chars like b'\x03SETTING'. Here's an example with a setting I'm interested in, 'LQ':
\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x0
\x03HTAPp\x00\x00\x00\x02\x02\x00\x00\x01\x02L\x02\x00\x00\x00\x01
\x03LQ\x00\x00\x00\\\\Media\\Render_Drive\\mediafiles\\mxf\\k70255.2\\a08.56d829a7_56d82956d829a0.mxf
\x03HTAPp\x00\x00\x00\x02\x02\x00\x00\x01\x02L\x02\x00\x00\x00\x01
\x03LQ\x00\x00\x00\\\\Media\\Render_Drive\\mediafiles\\mxf\\k70255.2\\a07.56d829a6_56d82956d829a0.mxf
So it looks like each 'sentence' starts with \x03, and the path I'm looking for here starts at the 8th byte after the start of the LQ setting '\x03LQ'.
The file also has other settings that I want to capture, and each time the setting value sits directly after an escape char, prefixed by a short description of the setting and a number of padding bytes.
At the moment I am reading the binary and can find a specific path, but only if I already know how long it is:
import re

with open(file, "rb") as abin:
    abin.seek(0)
    data = abin.read()
    foo = re.search(b'\x03LQ', data)
    abin.seek(foo.start() + 8)  # cursor lands on the 8th byte after the match start
    eg = abin.read(32)
    # so I get the path of some length as eg.....
This is not what I want: I want to read the entire bytestring up to the next escape char, then find the next occurrence of the setting and read that path too.
I'm experimenting with findall(), but it just returns a list of identical bytes objects (it seems), and I don't understand how to locate each instance of the byte string and read from that cursor position in the data. E.g.
bar = re.findall(b'\x03LQ', data)
for bs in bar:
    foo = re.search(bs, data)
    abin.seek(foo.start() + 8)
    eg = abin.read(64)
    print('This is just the same path each time', eg)
Pointers anyone?
The key is to look at the result of your findall(), which is just going to be:
[b'\x03LQ', b'\x03LQ', b'\x03LQ', ...]
You're only telling it to find a static string, so that's all it's going to return. To make the results useful, you can tell it to instead capture what comes after the given string. Here's an example that will grab everything after the given string until the next \x03 byte:
re.findall(rb'\x03LQ([^\x03]*)', data)
The parens tell findall() what part of the match you want, and [^\x03]* means "match any number of bytes that are not \x03". The result from your example should be:
[b'\x00\x00\x00\\\\Media\\Render_Drive\\mediafiles\\mxf\\k70255.2\\a08.56d829a7_56d82956d829a0.mxf\n',
b'\x00\x00\x00\\\\Media\\Render_Drive\\mediafiles\\mxf\\k70255.2\\a07.56d829a6_56d82956d829a0.mxf']
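Putting it together, a minimal sketch; the file name is hypothetical, and the three \x00 padding bytes before each path are assumed from the dump above:

import re

with open('settings.bin', 'rb') as abin:  # hypothetical file name
    data = abin.read()

# Capture everything between each b'\x03LQ' marker and the next \x03 byte,
# then strip the assumed \x00 padding and any trailing newline.
for raw in re.findall(rb'\x03LQ([^\x03]*)', data):
    path = raw.lstrip(b'\x00').rstrip(b'\r\n')
    print(path.decode('ascii', errors='replace'))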

Search string with regex in large binary file (2 GB or more)

What's the best way to search for (multiple) strings in a large binary file (2 GB or more) using a regular expression?
The binary data is just 'raw' data (like a memory dump) and the string is not bounded.
I am able to do this in a large text file by reading the file line by line.
I suppose I need to read the file in chunks, but then there is a boundary risk (a match located on a chunk boundary).
And how can I search the binary data?
A short example is very much appreciated.
Edit:
I do not see the similarity; it's by no means clear to me.
read() takes a numeric argument indicating how many characters (bytes? multi-byte characters always confuse me) to read, so you could read the file in chunks, keeping as much as is reasonable, and checking each accumulated piece with your regex. As space becomes an issue, remove only the beginning of what you've read before reading in the next chunk. This relies on having at least some guess at the length of the match, or rather an upper bound on it. If the regex you want to match spans more than the amount you can hold in memory at a time, then I'm out of ideas.
s = b""
SOME_CHUNK_SIZE = 4096  ## 4kb, totally arbitrary
SOME_BIG_NUMBER = 2 * SOME_CHUNK_SIZE  ## keep enough overlap for a boundary match
with open("large_file", "rb") as fh:
    while True:
        chunk = fh.read(SOME_CHUNK_SIZE)
        if not chunk:  ## read() returns b'' at end of file
            break
        if len(s) > SOME_BIG_NUMBER:
            s = s[SOME_CHUNK_SIZE:]  ## drop the oldest data to bound memory
        s += chunk
        ## do regex test on s now
That should get you some of the way. You'll also need to know when you're at the end of the file: read() doesn't throw an error there, it just returns 0 bytes, which is what the if not chunk check above relies on. Alternatively, you could check the file stats and do arithmetic with SOME_CHUNK_SIZE.
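Another option that sidesteps chunk boundaries entirely is to mmap the file and let re scan it directly. A minimal sketch, assuming Python 3 on a 64-bit build (the search strings are placeholders):

import mmap
import re

pattern = re.compile(rb'string_one|string_two')  # hypothetical search strings

with open('large_file', 'rb') as fh:
    # The OS pages the file in on demand, so a 2 GB+ file can be
    # scanned without reading it all into memory at once.
    with mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        for m in re.finditer(pattern, mm):
            print(m.start(), m.group())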

How to exclude \n and \r from tell() count in Python 2.7

I want to keep track of the file pointer on a simple text file (just a few lines), after having used readline() on it. I observed that the tell() function also counts the line endings.
My questions:
How do I instruct the code to skip counting the line endings?
How do I do this regardless of the line-ending type (so it works the same whether the text file uses just \n, just \r, or both)?
You are navigating into trouble.
Don't do that: either use the number tell() gives you, or count what you have in memory, regardless of the file contents.
You won't be able to correlate a position in text read into memory with a physical place in a text file: text files are not meant for that. They are meant to be read one line at a time, or in whole: your program consumes the text and lets the OS worry about the file position.
You can open your file in binary mode, read its contents as they are into memory, and have some method of retrieving readable text from those contents as needed. Doing this with a proper class can make it not that messy.
Consider the problem you already have with line endings, which can be either "\n" or "\r\n" and still count as a single character, and now imagine that situation a hundredfold more complex if the file has a single utf-8 encoded character that takes more than one byte to encode.
And even in binary files, knowing the absolute file pointer position is only useful in a handful of situations where, usually, one would be better off using a database engine to start with.
tell is tell. It counts the number of bytes from the start of the file to the cursor. \n and \r are bytes, so they get counted. If you want to count the number of bytes, but not count certain characters, you will have to do it manually:
data_read = … # data you have already read
len([b for b in data_read if b not in '\r\n'])
The bad news is that it's far more annoying to do this than just looking at tell. The good news is that it answers both your questions.
or, I suppose you could do
yourfile.tell() - data_read.count('\r') - data_read.count('\n')
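A minimal sketch of that second approach (Python 2.7, as in the question; the file name is made up):

# Track a logical offset that ignores line endings, alongside the real tell()
with open('notes.txt', 'rb') as f:  # hypothetical file name
    logical = 0
    line = f.readline()
    while line:
        logical += len(line.rstrip('\r\n'))  # count only non-EOL bytes
        print('%d %d' % (f.tell(), logical))
        line = f.readline()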
result = re.sub("[\r\n]", "", subject)
http://regex101.com/r/kM6dA1
Match a single character present in the list below «[\r\n]»
A carriage return character «\r»
A line feed character «\n»

Formatted Input in Python

I have a peculiar problem. I need to read (from a txt file), using Python, only those substrings that are present at predefined ranges of offsets. Let's say 5-8 and 12-16.
For example, if a line in the file is something like:
abcdefghi akdhflskdhfhglskdjfhghsldk
then I would like to read the two words "efgh" and "kdhfl", because in the word "efgh" the offset of character "e" is 5 and that of "h" is 8, and similarly for the other word "kdhfl".
Please note that the whitespace also adds to the offset. In fact, the whitespace in my file is not consistently occurring in every line and cannot be depended upon to extract the words of interest. Which is why I have to bank on the offsets.
I hope I've been able to make the question clear.
Awaiting answers!
Edit:
Yes, the whitespace amount in each line can change and accounts for the offsets too. For example, consider these two lines:
abcz d
a bc d
In both cases, I view the offset of the final character "d" as the same. As I said, the white spaces in the file are not consistent and I cannot rely on them. I need to pick up the characters based on their offsets. Does your answer still hold?
Assuming it's a file:
for line in open("file"):
    print line[4:8], line[11:16]
To extract pieces from offsets simply read each line into a string and then access a substring with a slice ([from:to]).
It's unclear what you're saying about the inconsistent whitespace. If whitespace adds to the offset, it must be consistent to be meaningful. If the whitespace amount can change but actually accounts for the offsets, you can't reliably extract your data.
In your added example, as long as d's offset stays the same, you can extract it with slicing.
>>> s = 'a bc d'
>>> s[5:6]
'd'
>>> s = 'abcz d'
>>> s[5:6]
'd'
What's to stop you from using a regular expression? Besides the whitespace, do the offsets vary?
/.{4}(.{4}).{3}(.{5})/
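For example, a minimal sketch applying that idea in Python; the file name is a placeholder, and the quantifiers encode the 1-based offset ranges 5-8 and 12-16:

import re

pattern = re.compile(r'.{4}(.{4}).{3}(.{5})')  # skip 4, grab 4, skip 3, grab 5

with open('file') as fh:
    for line in fh:
        m = pattern.match(line)
        if m:
            print(m.group(1), m.group(2))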
