I have a large file (400+ MB) that I'm reading in from S3 using get_contents_as_string(), which means that I end up with the entire file in memory as a string. I'm running several other memory-intensive operations in parallel, so I need a memory-efficient way of splitting the resulting string into chunks by line number. Is split() efficient enough? Or is something like re.finditer() a better way to go?
I see three options here, from the most memory-consuming to the least:
split will create a copy of your file as a list of strings, meaning an additional 400 MB used. Easy to implement, but it costs RAM.
Use re, or simply iterate over the string and record the \n positions: for i, c in enumerate(s): if c == '\n': newlines.append(i + 1).
The same as point 2, but with the string stored as a file on the HDD. Slow, but really memory efficient, and it also addresses a disadvantage of Python strings: they're immutable, so any change makes the interpreter create a copy. Files don't suffer from this, allowing in-place operations without loading the whole file into memory at all.
I would also suggest encapsulating solution 2 or 3 in a separate class in order to keep the newline indexes and the string contents consistent. The Proxy pattern and the idea of lazy evaluation would fit here, I think.
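A minimal sketch of option 2, assuming the file contents are already in memory as a string s (the helper names are illustrative, not from any library): record the newline offsets once with str.find, then slice out a chunk of lines on demand, so only one chunk exists as a copy at any time.

def newline_offsets(s):
    # Offsets of the first character of each line.
    offsets = [0]
    pos = s.find('\n')
    while pos != -1:
        offsets.append(pos + 1)
        pos = s.find('\n', pos + 1)
    return offsets

def chunk_by_lines(s, offsets, start_line, end_line):
    # Return lines [start_line, end_line) as one slice; no list of lines is built.
    start = offsets[start_line]
    end = offsets[end_line] if end_line < len(offsets) else len(s)
    return s[start:end]

Peak memory stays at roughly the original string plus one chunk-sized slice, and str.find keeps the scanning in C rather than in a per-character Python loop.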
You could try reading the file line by line:
f = open(filename)
partialstring = f.readline()
see
https://docs.python.org/2/tutorial/inputoutput.html#reading-and-writing-files
Related
I'm doing a lot of file processing where I look for one of several substrings in each line. So I have code equivalent to this:
with open(file) as infile:
    for line in infile:
        for key in MY_SUBSTRINGS:
            if key in line:
                print(key, line)
MY_SUBSTRINGS is a list of 6-20 substrings. Substrings vary in length 10-30 chars and may contain spaces.
I'd really like to find a much faster way of doing this. Files have hundreds of thousands of lines in them, and lines are typically 150 chars. The user has to wait 30 seconds to a minute while a file processes. The above is not the only thing taking time, but it's taking quite a lot. I'm doing various other processing on a line-by-line basis, so it's not appropriate to search the whole file at once.
I've tried the regex and ahocorasick answers from here but they both come out slower in my tests:
Fastest way to check whether a string is a substring in a list of strings
Any suggestions for faster methods?
I'm not quite sure of the best way to share example datasets. A logcat off an Android phone would be an example. One that's at least 200k lines long.
Then search for 10 strings like:
(NL80211_CMD_TRIGGER_SCAN) received for
Trying to associate with
Request to deauthenticate
interface state UNINITIALIZED->ENABLED
I tried regexes like this:
match_str = "|".join(MY_SUBSTRINGS)
regex = re.compile(match_str)
with open(file) as infile:
    for line in infile:
        match = regex.search(line)
        if match:
            print(match.group(0))
I would build a regular expression to search through the file.
Make sure that you're not running each of the search terms in loops when you use regex.
If all of your expressions are in one regexp, it would look something like this:
import re
line = 'fsjdk abc def abc jkl'
re.findall(r'(abc|def)', line)
https://docs.python.org/3/library/re.html
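As a sketch of that single compiled alternation applied to the question's setup (MY_SUBSTRINGS and the filename are assumed; re.escape is added here because the question's substrings contain parentheses and spaces that should match literally):

import re

pattern = re.compile("|".join(re.escape(s) for s in MY_SUBSTRINGS))

with open(filename) as infile:
    for line in infile:
        match = pattern.search(line)
        if match:
            print(match.group(0), line, end="")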
If you need it to run still faster, consider processing the file concurrently with threads. This is a much broader topic, but one method is to first look at your problem and work out what the bottleneck might be.
If the issue is that your loop is starved for disk throughput on the read, you can run through the file once to split it into chunks, and then map those chunks onto worker threads that process the data like a queue, as in the sketch below.
We'd definitely need some more detail on your problem to understand exactly what kind of issue you're looking to solve, and there are people here who would definitely love to dig into a challenge.
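One rough way that chunk-and-queue idea could look (a sketch only; process_chunk, MY_SUBSTRINGS, and the filename are stand-ins, and for pure-Python string work a process pool may be needed instead of threads because of the GIL):

from concurrent.futures import ThreadPoolExecutor
from itertools import islice

def process_chunk(lines):
    # Stand-in for the real per-line work.
    return [line for line in lines if any(key in line for key in MY_SUBSTRINGS)]

def chunks_of(infile, size=50000):
    # Yield the file as lists of `size` lines.
    while True:
        chunk = list(islice(infile, size))
        if not chunk:
            return
        yield chunk

with open(filename) as infile, ThreadPoolExecutor() as pool:
    for hits in pool.map(process_chunk, chunks_of(infile)):
        for line in hits:
            print(line, end="")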
I have a 30MB .txt file containing random strings like:
416
abcd23
cd542
banana
bambam
There is one word per line; words are separated by newlines.
I need to search the file for my chosen substring and return every matched string in the file. To make it clearer:
Input: cd
Output: abcd23, cd542
Are Generalized suffix trees, suffix trees or suffix arrays suitable for this kind of problem or is there something faster? (time complexity is important)
P.S. My programming skills are a bit sketchy, so any kind of example would be much appreciated.
Assuming you are looking for the lines in the file that contain a given string, the fastest method is simply to iterate through the file and check each line with the string operator in (or the method find), as follows.
def find_matches(filename, txt):
    with open(filename, 'r') as f:
        return [line for line in f if txt in line]  # using 'in'
Example Usage:
matches = find_matches('myfile.txt', 'cd')
Simply reading the file avoids the overhead of structuring the data into fields, as methods such as Pandas do; Pandas is one of the slower ways of reading in a file. See also: What is the fastest way to search a CSV file.
The string operations in and find basically rely on an optimized fastsearch implemented in C, whose efficiency per string search is:
It looks like the implementation is in the worst case O(N*M) (the same as a naive approach), but can do O(N/M) in some cases and O(N) in frequent cases (where N and M are the lengths of the string and substring respectively).
I need to auto-generate a somewhat large Makefile using a Python script. The number of lines is expected to be relatively large. The routine for writing to the file is composed of nested loops, a whole bunch of conditions, etc.
My options:
Start with an empty string, keep appending the lines to it, and finally write the huge string to the file using file.write (pro: only a single write operation; con: the huge string takes up memory).
Start with an empty list, keep appending the lines to it, and finally use file.writelines (pro: a single write operation (?); con: the huge list takes up memory).
Write each line to the file as it is constructed (pro: no large amount of memory consumed; con: a huge number of write operations).
What is the idiomatic/recommended way of writing a large number of lines to a file?
Option 3: Write the lines as you generate them.
Writes are already buffered; you don't have to do it manually as in options 1 and 2.
Option #3 is usually the best; normal file objects are buffered, so you won't be performing excessive system calls by writing as you receive data to write.
Alternatively, you can mix options #2 and #3: don't build intermediate lists and call .writelines on them; instead, make the code that would produce those lists a generator function (having it yield values as it goes) or a generator expression, and pass that to .writelines. It's functionally equivalent to #3 in most cases, but it pushes the work of iterating the generator down to the C layer, removing a lot of Python bytecode processing overhead. That overhead is usually meaningless next to the cost of the file I/O itself, though.
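A small sketch of that generator-to-writelines variant (generate_lines and its output are placeholders for the real Makefile-producing logic):

def generate_lines(targets):
    # Stand-in for the real nested loops and conditions.
    for target in targets:
        yield "%s: %s.c\n" % (target, target)
        yield "\t$(CC) -o %s %s.c\n\n" % (target, target)

with open("Makefile", "w") as makefile:
    # writelines() consumes the generator one item at a time; the file object buffers the writes.
    makefile.writelines(generate_lines(["foo", "bar", "baz"]))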
I have a script in Python to process a log file; it parses the values and simply joins them with tabs.
import re
import sys

p = re.compile(
"([0-9/]+) ([0-9]+):([0-9]+):([0-9]+) I.*"+
"worker\\(([0-9]+)\\)(?:#([^]]*))?.*\\[([0-9]+)\\] "+
"=RES= PS:([0-9]+) DW:([0-9]+) RT:([0-9]+) PRT:([0-9]+) IP:([^ ]*) "+
"JOB:([^!]+)!([0-9]+) CS:([\\.0-9]+) CONV:([^ ]*) URL:[^ ]+ KEY:([^/]+)([^ ]*)"
)
for line in sys.stdin:
    line = line.strip()
    if len(line) == 0: continue
    result = p.match(line)
    if result != None:
        print "\t".join([x if x is not None else "." for x in result.groups()])
However, the script behaves quite slowly and it takes a long time to process the data.
How can I achieve the same behaviour in a faster way? Perl/sed/PHP/Bash/...?
Thanks
It is hard to know without seeing your input, but it looks like your log file is made up of fields that are separated by spaces and do not contain any spaces internally. If so, you could split on whitespace first to put the individual log fields into an array. i.e.
line.split() #Split based on whitespace
or
line.split(' ') #Split based on a single space character
After that, use a few small regexes or even simple string operations to extract the data from the fields that you want.
It would likely be much more efficient, because the bulk of the line processing is done with a simple rule. You wouldn't have the pitfalls of potential backtracking, and you would have more readable code that is less likely to contain mistakes.
I don't know Python, so I can't write out a full code example, but that is the approach I would take in Perl.
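For illustration, a rough Python sketch of that split-first approach (the field layout below is an assumption based on the PS/DW/RT/PRT fields visible in the question's regexp, not the real format):

import re
import sys

# Small anchored regex applied per field instead of one big pattern per line.
res_field = re.compile(r"^(PS|DW|RT|PRT):([0-9]+)$")

for line in sys.stdin:
    fields = line.split()          # cheap whitespace split first
    values = []
    for f in fields:
        m = res_field.match(f)
        if m:
            values.append(m.group(2))
    if values:
        print("\t".join(values))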
I write Perl, not Python, but I recently used this technique to parse very big logs:
Divide the input file into chunks (for example, FileLen/NumProcessors bytes each).
Adjust the start and end of every chunk to a \n boundary so each worker gets full lines.
fork() to create NumProcessors workers, each of which reads its own byte range from the file and writes its own output file.
Merge the output files if needed.
Sure, you should work on optimizing the regexp too; for example, avoid .* where you can, because it creates a lot of backtracking, and that is slow. But in any case, 99% of the time the bottleneck will be CPU time spent in this regexp, so spreading the work across 8 CPUs should help.
In Perl it is possible to use precompiled regexps which are much faster if you are using them many times.
http://perldoc.perl.org/perlretut.html#Compiling-and-saving-regular-expressions
"The qr// operator showed up in perl 5.005. It compiles a regular expression, but doesn't apply it."
If the data is large, then it is worth processing it in parallel by splitting it into pieces. There are several modules on CPAN that make this easier.
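Translated to a rough Python sketch (the answer is about Perl fork(), but the byte-range idea maps directly; the filename, worker count, and per-line work are all placeholders):

import os
from multiprocessing import Pool

def chunk_ranges(path, workers):
    # Split the file into byte ranges whose boundaries fall on '\n'.
    size = os.path.getsize(path)
    step = size // workers
    ranges, start = [], 0
    with open(path, "rb") as f:
        for _ in range(workers - 1):
            f.seek(start + step)
            f.readline()               # advance to the end of the current line
            end = min(f.tell(), size)
            ranges.append((path, start, end))
            start = end
    ranges.append((path, start, size))
    return ranges

def process_range(args):
    path, start, end = args
    count = 0
    with open(path, "rb") as f:
        f.seek(start)
        while f.tell() < end:
            line = f.readline()
            if not line:
                break
            # Placeholder: apply the real regexp / parsing to `line` here.
            count += 1
    return count

if __name__ == "__main__":
    with Pool(8) as pool:
        results = pool.map(process_range, chunk_ranges("big.log", 8))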
I'm fairly new to Python, and am writing a series of scripts to convert between some proprietary markup formats. I'm iterating line by line over files and then basically doing a large number (100-200) of substitutions that basically fall into 4 categories:
line = line.replace("-","<EMDASH>") # Replace single character with tag
line = line.replace("<\\#>","#") # tag with single character
line = line.replace("<\\n>","") # remove tag
line = line.replace("\xe1","•") # replace non-ascii character with entity
the str.replace() function seems to be pretty efficient (fairly low in the numbers when I examine profiling output), but is there a better way to do this? I've seen the re.sub() method with a function as an argument, but am unsure if this would be better? I guess it depends on what kind of optimizations Python does internally. Thought I would ask for some advice before creating a large dict that might not be very helpful!
Additionally I do some parsing of tags (that look somewhat like HTML, but are not HTML). I identify tags like this:
m = re.findall('(<[^>]+>)',line)
And then do ~100 search/replaces (mostly removing matches) within the matched tags as well, e.g.:
m = re.findall('(<[^>]+>)',line)
for tag in m:
    tag_new = re.sub("\*t\([^\)]*\)","",tag)
    tag_new = re.sub("\*p\([^\)]*\)","",tag_new)
    # do many more searches...
    if tag != tag_new:
        line = line.replace(tag,tag_new,1) # potentially problematic
Any thoughts of efficiency here?
Thanks!
str.replace() is more efficient if you're going to do basic search and replaces, and re.sub is (obviously) more efficient if you need complex pattern matching (because otherwise you'd have to use str.replace several times).
I'd recommend you use a combination of both. If you have several patterns that all get replaced by one thing, use re.sub. If you just have some cases where you just need to replace one specific tag with another, use str.replace.
You can also improve efficiency by working on larger strings (call re.sub once on a big block of text instead of once for each line). That increases memory use, which shouldn't be a problem unless the file is huge, and it improves execution time.
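If many of those substitutions are literal-for-literal, one sketch of combining them into a single pass (whether it actually beats chained str.replace() calls depends on your data, so profile it) is a dict plus one alternation regex:

import re

# Illustrative subset of the literal substitutions from the question.
REPLACEMENTS = {
    "-": "<EMDASH>",
    "<\\#>": "#",
    "<\\n>": "",
    "\xe1": "•",
}

# Longer keys first so the tags win over the single characters they contain.
_pattern = re.compile("|".join(
    re.escape(k) for k in sorted(REPLACEMENTS, key=len, reverse=True)))

def replace_all(line):
    return _pattern.sub(lambda m: REPLACEMENTS[m.group(0)], line)

For example, replace_all("a-b<\\n>c") returns "a<EMDASH>bc".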
If you don't actually need the regex and are just doing literal replacing, string.replace() will almost certainly be faster. But even so, your bottleneck here will be file input/output, not string manipulation.
The best solution, though, would probably be to use cStringIO.
Depending on the ratio of relevant-to-not-relevant portions of the text you're operating on (and whether or not the parts each substitution operates on overlap), it might be more efficient to try to break down the input into tokens and work on each token individually.
Since each replace() in your current implementation has to examine the entire input string, that can be slow. If you instead broke down that stream into something like...
[<normal text>, <tag>, <tag>, <normal text>, <tag>, <normal text>]
# from an original "<normal text><tag><tag><normal text><tag><normal text>"
...then you could simply look to see if a given token is a tag, and replace it in the list (and then ''.join() at the end).
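A sketch of that tokenize-then-rejoin idea, reusing the question's own tag pattern (the single re.sub inside stands in for the ~100 per-tag substitutions):

import re

TAG = re.compile(r'(<[^>]+>)')

def process_line(line):
    # The capturing group makes re.split keep the tags as tokens.
    tokens = TAG.split(line)
    for i, token in enumerate(tokens):
        if token.startswith('<') and token.endswith('>'):
            tokens[i] = re.sub(r'\*t\([^\)]*\)', '', token)   # ...plus the other per-tag rules
    return ''.join(tokens)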
You can pass a function object to re.sub instead of a substitution string; it takes the match object and returns the substitution. For example:
>>> r = re.compile(r'<(\w+)>|(-)')
>>> r.sub(lambda m: '(%s)' % (m.group(1) if m.group(1) else 'emdash'), '<atag>-<anothertag>')
'(atag)(emdash)(anothertag)'
Of course you can use a more complex function object, this lambda is just an example.
Using a single regex that does all the substitutions should be slightly faster than iterating over the string many times, but if a lot of substitutions are performed, the overhead of calling the function object that computes each substitution may be significant.