Pythonic way to write a large number of lines to a file - python

I need to auto-generate a somewhat large Makefile using a Python script. The number of lines is expected to be relatively large. The routine for writing to the file is composed of nested loops, a whole bunch of conditions, etc.
My options:
1. Start with an empty string and keep appending the lines to it, finally writing the huge string to the file using file.write (pro: only a single write operation; con: the huge string takes up memory).
2. Start with an empty list and keep appending the lines to it, finally using file.writelines (pro: a single write operation (?); con: the huge list takes up memory).
3. Write each line to the file as it is constructed (pro: no large memory consumption; con: a huge number of write operations).
What is the idiomatic/recommended way of writing large number of lines to a file?

Option 3: Write the lines as you generate them.
Writes are already buffered; you don't have to do it manually as in options 1 and 2.
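A minimal sketch of option 3, assuming a hypothetical generate_lines() generator that stands in for the real nested loops and conditions:
def generate_lines():
    ## hypothetical stand-in for the nested loops / conditions producing Makefile lines
    for target in ("all", "clean"):
        yield "%s:\n" % target
        yield "\t@echo %s\n" % target

with open("Makefile", "w") as f:
    for line in generate_lines():
        f.write(line)   ## goes into the file object's buffer, not straight to disk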

Option #3 is usually the best; normal file objects are buffered, so you won't be performing excessive system calls by writing as you receive data to write.
Alternatively, you can mix options #2 and #3: instead of building intermediate lists and calling .writelines on them, make the code that would produce those lists a generator function (having it yield values as it goes) or a generator expression, and pass that to .writelines. It's functionally equivalent to #3 in most cases, but it pushes the work of iterating the generator to the C layer, removing a lot of Python bytecode processing overhead. That overhead is usually negligible next to the cost of file I/O, though.
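A sketch of that mixed #2/#3 variant, reusing the same hypothetical generate_lines() generator from above:
with open("Makefile", "w") as f:
    ## writelines consumes the generator lazily; no intermediate list is built
    f.writelines(generate_lines())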

Related

Efficient extraction of anadromes

An anadrome is a proper sentence that when written in reverse constitutes a (possibly different) proper sentence up to a possible change of spacing. I have a file with 100 Million proper sentences and I would like to find all sub-sentences (divided by word boundaries) which are anadromes, by testing if their inverse is also in the file when ignoring internal spaces. My initial approach was to extract all sub-sentences and save them to a temporary file, create an in-memory set of their space-stripped inverses, and finally iterate over the temporary file and test if each line after space-stripping belongs to the set. This worked fine for smaller files but does not scale, as the set gets too large for memory. Other than replacing the in-memory set with an on-disk database, what could be done?
Edit: I ended up using an sqlite database with an index. On a smaller set of 5 million sentences, using a db instead of an in-memory set takes 2x the time. With the full set, this is the only method I found that could complete the computation.
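A rough sketch of the sqlite approach described in the edit; the database path, table, and column names are invented for illustration:
import sqlite3

conn = sqlite3.connect("inverses.db")
## PRIMARY KEY gives us the index used for the membership test
conn.execute("CREATE TABLE IF NOT EXISTS inv (s TEXT PRIMARY KEY)")

def add_inverse(subsentence):
    ## store the space-stripped inverse of each sub-sentence
    stripped_inverse = subsentence.replace(" ", "")[::-1]
    conn.execute("INSERT OR IGNORE INTO inv VALUES (?)", (stripped_inverse,))

def is_anadrome(subsentence):
    ## a sub-sentence is an anadrome if its space-stripped form equals some stored inverse
    stripped = subsentence.replace(" ", "")
    return conn.execute("SELECT 1 FROM inv WHERE s = ?", (stripped,)).fetchone() is not None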
For each proper sentence, you could try inverting it and finding all possible sub-sentences.
Then, for each inverted sub-sentence, you strip all the spaces.
You then do a regex search in the original file, using the space-stripped inverted sub-sentence with \s? allowed in between characters.
For example, d\s?l\s?r\s?o\s?w\s?o\s?l\s?l\s?e\s?h (inverted 'hello world') would match 'dlro woll eh' (inverted 'hello world' with crazy spacing), which would be in the original file if it were a proper sentence.
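A small sketch of that regex idea; the file name sentences.txt and the one-sentence-per-line layout are assumptions:
import re

def build_pattern(sentence):
    ## interleave \s? between the characters of the space-stripped inverse
    stripped_inverse = sentence.replace(" ", "")[::-1]
    return re.compile(r"\s?".join(re.escape(c) for c in stripped_inverse))

pattern = build_pattern("hello world")   ## d\s?l\s?r\s?o\s?w\s?o\s?l\s?l\s?e\s?h
with open("sentences.txt") as fh:
    for line in fh:
        if pattern.search(line):
            print(line.rstrip())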

Best way to chunk a large string by line

I have a large file (400+ MB) that I'm reading in from S3 using get_contents_as_string(), which means that I end up with the entire file in memory as a string. I'm running several other memory-intensive operations in parallel, so I need a memory-efficient way of splitting the resulting string into chunks by line number. Is split() efficient enough? Or is something like re.finditer() a better way to go?
I see three options here, from the most memory-consuming to the least:
1. split will create a copy of your file as a list of strings, meaning an additional 400 MB used. Easy to implement, but it takes RAM.
2. Use re, or simply iterate over the string and record the \n positions: for i, c in enumerate(s): if c == '\n': newlines.append(i+1).
3. The same as point 2, but with the string stored as a file on HDD. Slow but really memory efficient; it also sidesteps a disadvantage of Python strings: they're immutable, so if you want to make changes, the interpreter creates a copy. Files don't suffer from this, allowing in-place operations without loading the whole file at all.
I would also suggest encapsulating solution 2 or 3 in a separate class in order to keep the newline indexes and the string contents consistent. The Proxy pattern and the idea of lazy evaluation would fit here, I think.
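A minimal sketch of option 2; the sample string here stands in for the result of get_contents_as_string():
s = "line one\nline two\nline three\n"   ## stand-in for the 400 MB S3 string

## record the offset of the start of every line, once
newlines = [0]
for i, c in enumerate(s):
    if c == '\n':
        newlines.append(i + 1)

def chunk(start_line, end_line):
    ## lines [start_line, end_line) as one slice; only that portion is copied
    return s[newlines[start_line]:newlines[end_line]]

print(chunk(1, 3))   ## "line two\nline three\n"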
You could try to read the file line by line:
f = open(filename)
partialstring = f.readline()
See
https://docs.python.org/2/tutorial/inputoutput.html#reading-and-writing-files

Search string with regex in large binary file (2 GB or more)

What's the best way to search for (multiple) strings in a large binary file (2 GB or more) using a regular expression?
The binary data is just 'raw' data (like a memory dump), and the string is not bounded.
I am able to do this in a large text file by reading the file line by line.
I suppose I need to read the file in chunks, but then there is a boundary risk (a match may be located on a chunk boundary).
And how can I search the binary data?
A short example would be very much appreciated.
Edit: I do not see the similarity; it's by no means clear to me.
read() takes a value which is a numeric indication of how many characters to read (bytes? multi-byte characters always confuse me), so you could read the file in chunks, keeping as much as is reasonable in memory and checking it with your regex. As space becomes an issue, perhaps remove only the beginning of what you've read before you read in the next chunk. This relies on having at least some guess at the length of a match, or rather, an upper bound on it. If the regex you want to match encompasses more than the amount you can have in memory at a time, then I'm out of ideas.
s = b""
SOME_CHUNK_SIZE = 4096          ## 4 KB, totally arbitrary
SOME_BIG_NUMBER = 1024 * 1024   ## keep at most about 1 MB buffered
with open("large_file", "rb") as fh:
    while True:
        chunk = fh.read(SOME_CHUNK_SIZE)
        if not chunk:           ## read() returns b"" at end of file
            break
        if len(s) > SOME_BIG_NUMBER:
            s = s[SOME_CHUNK_SIZE:]   ## drop the oldest chunk
        s += chunk
        ## do regex test on s now
That should maybe get you some of the way. You'll also need to know when you're at the end of the file; read() doesn't raise an error there, it just returns zero bytes. You can either read into a temporary variable and check its length, or you could check the file's size and do arithmetic with SOME_CHUNK_SIZE.
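If you know an upper bound on the length of any possible match, one way to make the boundary handling explicit is a sliding buffer like the following sketch (the pattern, chunk size, and MAX_MATCH bound are all assumptions):
import re

pattern = re.compile(rb"PATTERN")   ## placeholder pattern; bytes, because the file is binary
CHUNK_SIZE = 1 << 20                ## 1 MB per read
MAX_MATCH = 256                     ## assumed upper bound on the length of a match

buf = b""
offset = 0                          ## absolute file position of buf[0]
with open("large_file", "rb") as fh:
    while True:
        chunk = fh.read(CHUNK_SIZE)
        if not chunk:
            break
        buf += chunk
        last_end = 0
        for m in pattern.finditer(buf):
            print(offset + m.start(), m.group())
            last_end = m.end()
        ## keep only the tail that could still hold the start of an unseen match
        keep = max(last_end, len(buf) - (MAX_MATCH - 1))
        offset += keep
        buf = buf[keep:]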

Compile and split a string from file using python

How can I compile a string from selected rows of a file, run some operations on the string, and then split that string back into its original rows in that same file?
I only need certain rows of the file; I cannot apply the operations to the other parts of the file. I have made a class that separates these rows from the file and runs the operations on them, but I'm thinking it would be even faster to run these operations on a single string containing just the parts of the file those operations apply to...
Or, if I could run these operations on a whole dictionary, that would help too. The operations are string replacements and regex replacements.
I am using Python 3.3.
Edit:
I'm going to explain this in greater detail here, since my original post was so vague (thank you Paolo for pointing that out).
For instance, if I wanted to fix a SubRip file (.srt), which is a common subtitle format, I would take something like this as input (this is from an actual srt file):
You can find a correct example here; submitting the file contents in this post messes up the newlines:
http://pastebin.com/ZdWUpNZ2
...And then I would fix only those rows which contain the actual subtitle text, not the ordering-number rows or the show/hide timecode rows of the subtitle file. So my compiled string might be:
"They're up on that ridge.|They got us pinned down."
Then I would run operations on that string, and afterwards I would have to save those rows back to the file. How can I get those subtitle rows back into my original file after they are fixed? I could split my compiled and fixed string using "|" as a row delimiter and put the rows back into the original file, but how can I be certain which row goes where?
You can use pysrt to edit SubRip files:
from pysrt import SubRipFile

subs = SubRipFile.open('some/file.srt')
for sub in subs:
    # do something with sub.text
    pass

# save changes to a new file
subs.save('other/path.srt', encoding='utf-8')
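As a hedged illustration of the "do something" step, the string and regex replacements mentioned in the question could be applied in place; the specific substitutions below are only examples:
import re
from pysrt import SubRipFile

subs = SubRipFile.open('some/file.srt')
for sub in subs:
    ## example edits only; swap in your own string / regex replacements
    sub.text = sub.text.replace('  ', ' ')
    sub.text = re.sub(r'<[^>]+>', '', sub.text)   ## e.g. strip markup tags
subs.save('some/file.srt', encoding='utf-8')      ## write back to the original file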

Fastest way of processing regexp

I have a Python script to process a log file; it parses the values and simply joins them with a tab.
import re
import sys

p = re.compile(
    "([0-9/]+) ([0-9]+):([0-9]+):([0-9]+) I.*"+
    "worker\\(([0-9]+)\\)(?:#([^]]*))?.*\\[([0-9]+)\\] "+
    "=RES= PS:([0-9]+) DW:([0-9]+) RT:([0-9]+) PRT:([0-9]+) IP:([^ ]*) "+
    "JOB:([^!]+)!([0-9]+) CS:([\\.0-9]+) CONV:([^ ]*) URL:[^ ]+ KEY:([^/]+)([^ ]*)"
)
for line in sys.stdin:
    line = line.strip()
    if len(line) == 0: continue
    result = p.match(line)
    if result is not None:
        print("\t".join([x if x is not None else "." for x in result.groups()]))
However, the script behaves quite slowly and it takes a long time to process the data.
How can I achieve the same behaviour in a faster way? Perl/sed/PHP/Bash/...?
Thanks
It is hard to know without seeing your input, but it looks like your log file is made up of fields that are separated by spaces and do not contain any spaces internally. If so, you could split on whitespace first to put the individual log fields into an array. i.e.
line.split() #Split based on whitespace
or
line.split(' ') #Split based on a single space character
After that, use a few small regexes or even simple string operations to extract the data from the fields that you want.
It would likely be much more efficient, because the bulk of the line processing is done with a simple rule. You wouldn't have the pitfalls of potential backtracking, and you would have more readable code that is less likely to contain mistakes.
I don't know Python, so I can't write out a full code example, but that is the approach I would take in Perl.
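A hedged sketch of that approach in Python; which fields to extract is an assumption loosely based on the regex in the question:
import sys

for line in sys.stdin:
    fields = line.split()              ## split on whitespace first
    if len(fields) < 2:
        continue
    ## hypothetical extraction: date and time are the first two fields,
    ## and one tagged field (PS:<n>) is pulled out with a simple prefix test
    date, time = fields[0], fields[1]
    ps = next((f[len("PS:"):] for f in fields if f.startswith("PS:")), ".")
    print("\t".join([date, time, ps]))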
I write Perl, not Python, but I recently used this technique to parse very big logs:
1. Divide the input file into chunks (for example, FileLen/NumProcessors bytes each).
2. Adjust the start and end of every chunk to a \n so that each worker gets full lines.
3. fork() to create NumProcessors workers, each of which reads its own byte range from the file and writes its own output file.
4. Merge the output files if needed (a rough Python sketch of the same idea follows below).
Sure, you should work on optimizing the regexp too; for example, use .* less, since it creates a lot of backtracking, which is slow. But in any case, you will almost certainly be CPU-bound on this regexp, so spreading the work across 8 CPUs should help.
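A rough Python translation of that chunking idea, using multiprocessing instead of fork(); the file names and worker count are placeholders:
import os
from multiprocessing import Pool

LOG_PATH = "input.log"     ## placeholder input file
NUM_WORKERS = 8

def chunk_ranges(path, n):
    ## split the file into n (start, end) byte ranges aligned on line boundaries
    size = os.path.getsize(path)
    step = size // n
    bounds = [0]
    with open(path, "rb") as fh:
        for i in range(1, n):
            fh.seek(i * step)
            fh.readline()          ## advance to the start of the next full line
            bounds.append(fh.tell())
    bounds.append(size)
    return list(zip(bounds[:-1], bounds[1:]))

def process_range(bounds):
    start, end = bounds
    out_path = "part-%d.out" % start
    with open(LOG_PATH, "rb") as fh, open(out_path, "wb") as out:
        fh.seek(start)
        while fh.tell() < end:
            line = fh.readline()
            if not line:
                break
            out.write(line)        ## apply the regexp / field extraction here instead
    return out_path

if __name__ == "__main__":
    pool = Pool(NUM_WORKERS)
    parts = pool.map(process_range, chunk_ranges(LOG_PATH, NUM_WORKERS))
    pool.close()
    pool.join()
    ## merge the per-worker output files if needed (map preserves chunk order)
    with open("merged.out", "wb") as merged:
        for part in parts:
            with open(part, "rb") as fh:
                merged.write(fh.read())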
In Perl it is possible to use precompiled regexps which are much faster if you are using them many times.
http://perldoc.perl.org/perlretut.html#Compiling-and-saving-regular-expressions
"The qr// operator showed up in perl 5.005. It compiles a regular expression, but doesn't apply it."
If the data is large, it is worth processing it in parallel by splitting the data into pieces. There are several modules on CPAN which make this easier.
