Read and copy specific chunks of text in Python

I have seen several similar questions on SO (copying trigger lines or chunks of definite sizes), but they don't quite fit what I'm trying to do. I have a very large text file (output from Valgrind) that I'd like to cut down to only the parts I need.
The structure of the file is as follows: there are blocks of lines, each starting with a title line containing the string 'in loss record'. I want to trigger only on those title lines that also contain the string 'definitely lost', then copy all the lines below until another title line is reached (at which point the decision process is repeated).
How can I implement such a select-and-copy script in Python?
Here's what I've tried so far. It works, but I don't think it's the most efficient (or most Pythonic) way of doing it, so I'd like to see faster approaches, as the files I'm working with are usually quite large. (This method takes 1.8 s for a 290 MB file.)
with open("in_file.txt","r") as fin:
with open("out_file.txt","w") as fout:
lines = fin.read().split("\n")
i=0
while i<len(lines):
if "blocks are definitely lost in loss record" in lines[i]:
fout.write(lines[i].rstrip()+"\n")
i+=1
while i<len(lines) and "loss record" not in lines[i]:
fout.write(lines[i].rstrip()+"\n")
i+=1
i+=1

You might try a regex together with mmap.
Something similar to:
import re, mmap

# create a regex that will define each block of text you want here:
pat = re.compile(rb'^([^\n]*?blocks are definitely lost in loss record.*?loss record)', re.S | re.M)
with open(fn, 'r+b') as f:
    mm = mmap.mmap(f.fileno(), 0)
    for i, m in enumerate(pat.finditer(mm)):
        # m is a match covering one block that you want
        print(m.group(1))
Given that you posted no input example, that regex certainly won't work as-is -- but you get the idea.
With mmap the entire file is treated like a string, but it is not necessarily all in memory, so large files can be searched and blocks of them selected in this way.
If your file comfortably fits in memory, you can just read the file and use a regex directly (pseudo Python):
with open(fn) as fo:
    pat = re.compile(r'^([^\n]*?blocks are definitely lost in loss record.*?loss record)', re.S | re.M)
    for i, m in enumerate(pat.finditer(fo.read())):
        block = m.group(1)
        # deal with each block
If you want a line-by-line, non-regex approach, read the file line by line (assuming it is a \n-delimited text file):
with open(fn) as fo:
    for line in fo:
        pass  # deal with each line here
    # DON'T do something like string = fo.read() and
    # then iterate over the lines of the string please...
    # unless you need random access to the lines out of order
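Applied to the question's Valgrind files, that line-by-line pattern becomes a small state machine. A minimal sketch, assuming (as described above) that every title line contains 'in loss record' and the interesting ones also contain 'definitely lost':
copying = False
with open("in_file.txt") as fin, open("out_file.txt", "w") as fout:
    for line in fin:
        if "in loss record" in line:
            # at each title line, decide whether to copy the block that follows
            copying = "definitely lost" in line
        if copying:
            fout.write(line)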

Another way to do this is to use itertools.groupby to identify header lines and switch between functions that either write or ignore the following lines. That way you can iterate over the file line by line and keep your memory footprint small.
import itertools

def megs(val):
    return val * (2**20)

def ignorelines(lines):
    for line in lines:
        pass

# assuming ascii or utf-8, you save a small amount of processing by avoiding decode/encode
# and make a few fewer trips to the disk with larger buffers
with open('test.log', 'rb', buffering=megs(4)) as infile,\
        open('out.log', 'wb', buffering=megs(4)) as outfile:
    dump_fctn = ignorelines  # ignore lines til we see a good header
    # group by header or contained lines
    for is_hdr, block in itertools.groupby(infile, lambda x: b'in loss record' in x):
        if is_hdr:
            for hdr in block:
                if b'definitely lost' in hdr:
                    outfile.write(hdr)
                    dump_fctn = outfile.writelines
                else:
                    dump_fctn = ignorelines
        else:
            # either writelines or ignorelines, depending on the last header seen
            dump_fctn(block)

print(open('out.log').read())

Related

How to prevent loss/miscount of data using File I/O with buffer/size hint Python

I am new to Python and am writing several methods to process large log files (bigger than 5 GB). Through the research I did, I saw a lot of people using "with open" and specifying a size hint/buffer, like so:
with open(filename, 'rb', buffering=102400) as f:
    time_data_count = 0
    logbinset = set()
    #def f(n):print('{:0b}'.format(n)) #check what non iteratable function means
    search_pattern = regex.compile(b'\d+\((.)+\)\s+\d+\((.)+\)')
    for line in f:
        if search_pattern.search(line):
            x = search_pattern.search(line)
            #print(x.group(1)+" "+ x.group(2))
            print((x.group(1)).decode())
            print((x.group(2)).decode())
Here is another method (this one always returns None for some reason; I could use some help finding out why):
with open(filename, 'rb') as f:
    #text = []
    while True:
        memcap = f.read(102400)
        if not memcap:
            break
        text = re.search(b'\d+\(.+\)\s+\d+\(.+\)', memcap)
        if text is None:
            print("none")
        print(text.group())
In these methods, I am trying to extract regex patterns from a 6 GB log file. My worry is that using buffers to chop the file into chunks could result in situations where a line containing the pattern is chopped in half, which would mean some data goes missing.
How do I make sure line integrity is kept? How do I make sure the file is only broken up at the end of a line? How do I make sure I don't lose data between chunks? Or do the "with open" and read(102400) methods already ensure lines are not split in half when breaking the file into chunks?
First of all, don't use 'rb'; use just 'r', which is for text. 'rb' is for binary data.
The read method reads as many characters as you specify, so you would end up with chopped lines. Use readline instead.
The first variant is the correct one: set a buffer size when opening the file to get fewer read operations without losing matches that span read blocks.
If you are concerned with runtime, it would be a good idea to search each line just once, instead of doing one search to determine whether there is a match and then the exact same search again to get at the values:
regex = re.compile(rb"\d+\((.)+\)\s+\d+\((.)+\)")
with open(filename, "rb", buffering=102400) as lines:
    for line in lines:
        match = regex.search(line)
        if match:
            print((match.group(1)).decode())
            print((match.group(2)).decode())
The for loop and the filtering can be moved into functions that are implemented in C (in CPython):
regex = re.compile(rb"\d+\((.)+\)\s+\d+\((.)+\)")
with open(filename, "rb", buffering=102400) as lines:
    for match in filter(bool, map(regex.search, lines)):
        print((match.group(1)).decode())
        print((match.group(2)).decode())
On a 64 bit Python you could also try the mmap module to map the file into memory and apply the regular expression to the whole content.
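For instance, a minimal sketch of that mmap approach, reusing the pattern from above (here with the + moved inside the capture groups so each group keeps the whole field rather than only its last character):
import mmap
import re

pattern = re.compile(rb"\d+\((.+?)\)\s+\d+\((.+?)\)")

with open(filename, "rb") as f:
    # length 0 maps the whole file; ACCESS_READ avoids accidental writes
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        for match in pattern.finditer(mm):
            print(match.group(1).decode())
            print(match.group(2).decode())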

For each line in a file, replace multiple-whitespace substring of variable length with line break

Using Python 2.7.1, I read in a file:
input = open(file, "rU")
tmp = input.readlines()
which looks like this:
>name -----meoidoad
>longname -lksowkdkfg
>nm --kdmknskoeoe---
>nmee dowdbnufignwwwwcds--
That is, each line has a short substring of whitespaces, but the length of this substring varies by line.
I would like to write a script that edits my tmp object such that when I write tmp to file, the result is
>name
-----meoidoad
>longname
-lksowkdkfg
>nm
--kdmknskoeoe---
>nmee
dowdbnufignwwwwcds--
I.e. I would like to break each line into two lines, at that substring of whitespaces (and get rid of the spaces in the process).
The starting position of the string after the whitespaces is always the same within a file, but may vary among a large batch of files I am working with. So, I need a solution that does not rely on positions.
I've seen many similar questions on here, with many well-liked answers that use short regex scripts to do so, so it is possible I am duplicating a previous question. However, none of what I've seen so far has worked for me.
import re
with open('path/to/input') as infile, open('path/to/output', 'w') as outfile:
    for line in infile:
        outfile.write(re.sub(r'\s\s+', '\n', line))
If the file isn't huge (i.e. hundreds of MB), you can do this concisely with split() and join():
with open(file, 'rU') as f, open(outfilename, 'w') as o:
    o.write('\n'.join(f.read().split()))
I would also recommend against naming anything input, as that will mask the built-in.

How do I write a simple, Python parsing script?

Most of what I do involves writing simple parsing scripts that read search terms from one file and search, line by line, another file. Once a search term is found, the line and sometimes the following line are written to another output file. The code I use is rudimentary and likely crude.
#!/usr/bin/env python
data = open("data.txt", "r")
search_terms = ids.read().splitlines()
data.close()
db = open("db.txt", "r")
output = open("output.txt", "w")

for term in search_terms:
    for line in db:
        if line.find(term) > -1:
            next_line = db.next()
            output.write(">" + head + "\n" + next_line)
            print("Found %s" % term)
There are a few problems here. First, I don't think it's the most efficient and fastest to search line by line, but I'm not exactly sure about that. Second, I often run into issues with cursor placement and the cursor doesn't reset to the beginning of the file when the search term is found. Third, while I am usually confident that all of the terms can be found in the db, there are rare times when I can't be sure, so I would like to write to another file whenever it iterates through the entire db and can't find the term. I've tried adding a snippet that counts the number of lines of the db so if the find() function gets to the last line and the term isn't found, then it outputs to another "not found" file, but I haven't been able to get my elif and else loops right.
Overall, I'd just like any hints or corrections that could make this sort of script more efficient and robust.
Thanks.
Unless it's a really big file, why not iterate line by line? If the input file's size is some significant portion of your machine's available resources (memory), then you might want to look into buffered input and other, more low-level abstractions of what the computer is doing. But if you're talking about a few hundred MB or less on a relatively modern machine, let the computer do the computing ;)
Off the bat you might want to get into the habit of using the built-in context manager with. For instance, in your snippet, you don't have a call to output.close().
with open('data.txt', 'r') as f_in:
    search_terms = f_in.read().splitlines()
Now search_terms is a handle to a list that has each line from data.txt as a string (but with the newline characters removed). And data.txt is closed thanks to with.
In fact, I would do that with the db.txt file, also.
with open('db.txt', 'r') as f_in:
    lines = f_in.read().splitlines()
Context managers are cool.
As a side note, you could open your destination file now, and do your parsing and results-tracking with it open the whole time, but I like leaving as many files closed as possible for as long as possible.
I would suggest putting the biggest object on the outside of your loop, which I'm guessing is the db.txt contents. The outermost loop usually only gets iterated once, so you might as well put the biggest thing there.
results = []
for i, line in enumerate(lines):
    for term in search_terms:
        if term in line:
            # Use something not likely to appear in your line as a separator
            # for these "second lines". I used three pipe characters, but
            # you could just as easily use something even more random
            results.append('{}|||{}'.format(line, lines[i+1]))

if results:
    with open('output.txt', 'w') as f_out:
        for result in results:
            # Don't forget to replace your custom field separator
            f_out.write('> {}\n'.format(result.replace('|||', '\n')))
else:
    with open('no_results.txt', 'w') as f_out:
        # This will write an empty file to disk
        pass
The nice thing about this approach is that each line in db.txt is checked once for each search_term in search_terms. However, the downside is that any line will be recorded once for each search term it contains, i.e., if it has three search terms in it, that line will appear in your output.txt three times.
And all the files are magically closed.
Context managers are cool.
Good luck!
search_terms keeps the whole of data.txt in memory. That's not good in general, but in this case it's not too bad.
Searching line by line is not the most efficient approach, but if the case is simple and the files are not too big, it's not a big deal. If you want more efficiency, you could sort the data.txt file and put it into some tree-like structure; it depends on the data inside.
You have to use seek to move the file pointer back after using next.
Probably the easiest way here is to generate two lists of lines and search using in, like:
db = open('db.txt').readlines()
db_words = [x.split() for x in db]
data = open('data.txt').readlines()
print('Lines in db {}'.format(len(db)))
for item in data:
    item = item.strip()
    for words in db_words:
        if item in words:
            print("Found {}".format(item))
Your key issue is that you may be looping in the wrong order -- in your code as posted, you'll always exhaust the db looking for the first term, so after the first pass of the outer for loop db will be at end, no more lines to read, no other term will ever be found.
Other improvements include using the with statement to guarantee file closure, and a set to track which search terms were not found. (There are also typos in your posted code, such as opening a file as data but then reading it as ids).
So, for example, something like:
with open("data.txt", "r") as data:
search_terms = data.read().splitlines()
missing_terms = set(search_terms)
with open("db.txt", "r") as db, open("output.txt", "w") as output:
for line in db:
for term in search_terms:
if term in line:
missing_terms.discard(term)
next_line = db.next()
output.write(">" + head + "\n" + next_line)
print("Found {}".format(term))
break
if missing_terms:
diagnose_not_found(missing_terms)
where the diagnose_not_found function does whatever you need to do to warn the user about missing terms.
There are assumptions embedded here, such as that you don't care whether some other search term is present in a line where you've found a previous one, or in the very next one; they might take substantial work to fix if they don't apply, and that would require you to edit your question with a very complete and unambiguous list of specifications.
If your db is actually small enough to comfortably fit in memory, slurping it all in as a list of lines once and for all would make it easier to accommodate more demanding specs (since in that case you can easily go back and forth, while iterating over a file means you can only go forward one line at a time). So if your specs are indeed more demanding, please also clarify whether this crucial condition holds, or whether you instead need this script to process potentially humongous db files (say gigabyte-plus sizes, so as to not "comfortably fit in memory", depending on your platform of course).
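If db.txt does comfortably fit in memory, a rough sketch of that slurped variant (keeping the question's file names) might look like:
with open("data.txt") as f:
    search_terms = f.read().splitlines()

with open("db.txt") as f:
    db_lines = f.read().splitlines()

missing_terms = set(search_terms)
with open("output.txt", "w") as output:
    for i, line in enumerate(db_lines):
        for term in search_terms:
            if term in line:
                missing_terms.discard(term)
                # random access: grab the following line directly (if there is one)
                next_line = db_lines[i + 1] if i + 1 < len(db_lines) else ""
                output.write(">" + line + "\n" + next_line + "\n")
                break

if missing_terms:
    print("Not found: {}".format(", ".join(sorted(missing_terms))))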

Update strings in a text file at a specific location

I would like to find a better solution to achieve the following three steps:
read strings at a given row
update strings
write the updated strings back
Below is my code, which works, but I am wondering whether there is a better (simpler) solution?
new='99999'
f=open('C:/Users/th/Dropbox/com/MS1Ctt-P-temp.INP','r+')
lines=f.readlines()
#the row number we want to update is given, so just load the content
x = lines[95]
print(x)
f.close()
#replace
f1=open('C:/Users/th/Dropbox/com/MS1Ctt-P-temp.INP')
con = f1.read()
print con
con1 = con.replace(x[2:8],new) #only certain columns in this row needs to be updated
print con1
f1.close()
#write
f2 = open('C:/Users/th/Dropbox/com/MS1Ctt-P-temp.INP', 'w')
f2.write(con1)
f2.close()
Thanks!
UPDATE: I got an idea from jtmoulia; this time it becomes easier:
def replace_line(file_name, line_num, col_s, col_e, text):
    lines = open(file_name, 'r').readlines()
    temp = lines[line_num]
    temp = temp.replace(temp[col_s:col_e], text)
    lines[line_num] = temp
    out = open(file_name, 'w')
    out.writelines(lines)
    out.close()
The problem with textual data, even when tabulated, is that the byte offsets are not predictable. For example, when representing numbers with strings you have one byte per digit, whereas with a binary representation (e.g. two's complement) you always need the same four or eight bytes, for small and large integers alike.
Nevertheless, if your text format is strict enough that you can get along by replacing bytes without changing the size of the file, you can try the standard mmap module. With it, you'll be able to treat the file as a mutable byte string, modify parts of it in place, and let the kernel do the file saving for you.
Otherwise, any of the other answers is much better suited to the problem.
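For illustration, a minimal sketch of that in-place mmap idea; the byte offset here is hypothetical (it would have to be known or computed in advance), and it only works because the replacement is exactly as long as the text it overwrites:
import mmap

with open('MS1Ctt-P-temp.INP', 'r+b') as f:
    with mmap.mmap(f.fileno(), 0) as mm:
        offset = 1234                       # assumed byte offset of the field to replace
        mm[offset:offset + 5] = b'99999'    # same length, so the file size is unchanged
        mm.flush()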
Well, to begin with you don't need to keep reopening and reading from the file every time. The r+ mode allows you to read and write to the given file.
Perhaps something like
with open('C:/Users/th/Dropbox/com/MS1Ctt-P-temp.INP', 'r+') as f:
    lines = f.readlines()
    # ... perform whatever replacement you'd like on lines
    f.seek(0)
    f.writelines(lines)
Also, Editing specific line in text file in python
When I had to do something similar (for a Webmin customization), I did it entirely in PERL because that's what the Webmin framework used, and I found it quite easy. I assume (but don't know for sure) there are equivalent things in Python.
First read the entire file into memory all at once (the PERL way to do this is probably called "slurp"). (This idea of holding the entire file in memory rather than just one line used to make little sense {or even be impossible}. But these days RAM is so large it's the only way to go.) Then use the split operator to divide the file into lines and put each line in a different element of a giant array. You can then use the desired line number as an index into the array (remember array indices usually start with 0). Finally, use "regular expression" processing to change the text of the line. Then change another line, and another, and another (or make another change to the same line).
When you're all done, use join to put all the lines in the array back together into one giant string. Then write the whole modified file out.
While I don't have the complete code handy, here's an approximate fragment of some of the PERL code so you can see what I mean:
our @filelines = ();
our $lineno = 43;
our $oldstring = 'foobar';
our $newstring = 'fee fie fo fum';
$filelines[$lineno-1] =~ s/$oldstring/$newstring/ig;
# "ig" modifiers for case-insensitivity and possible multiple occurrences in the line
# use different modifiers at the end of the s/// construct as needed
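For comparison, a rough Python version of that same slurp/split/modify/join sequence might look like this (the file name is a placeholder; the line number and strings are taken from the Perl fragment above):
import re

# slurp the whole file and split it into a list of lines
with open('somefile.txt') as f:
    filelines = f.read().splitlines()

lineno = 43
oldstring = 'foobar'
newstring = 'fee fie fo fum'

# case-insensitive replacement of every occurrence on that one line
filelines[lineno - 1] = re.sub(oldstring, newstring,
                               filelines[lineno - 1], flags=re.IGNORECASE)

# join the lines back together and write the whole file out
with open('somefile.txt', 'w') as f:
    f.write('\n'.join(filelines) + '\n')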
FILENAME = 'C:/Users/th/Dropbox/com/MS1Ctt-P-temp.INP'
lines = list(open(FILENAME))
# strings are immutable, so rebuild the line rather than assigning to a slice
lines[95] = lines[95][:2] + '99999' + lines[95][8:]
open(FILENAME, 'w').write(''.join(lines))

How do I modify the last line of a file?

The last line of my file is:
29-dez,40,
How can I modify that line so that it reads:
29-Dez,40,90,100,50
Note: I don't want to write a new line. I want to take the same line and put new values after 29-Dez,40,
I'm new at python. I'm having a lot of trouble manipulating files and for me every example I look at seems difficult.
Unless the file is huge, you'll probably find it easier to read the entire file into a data structure (which might just be a list of lines), and then modify the data structure in memory, and finally write it back to the file.
On the other hand, maybe your file is really huge - multiple GB at least. In which case: the last line is probably terminated with a newline character; if you seek to that position, you can overwrite it with the new text for the end of the last line.
So perhaps:
f = open("foo.file", "wb")
f.seek(-len(os.linesep), os.SEEK_END)
f.write("new text at end of last line" + os.linesep)
f.close()
(Modulo line endings on different platforms)
To expand on what Doug said, in order to read the file contents into a data structure you can use the readlines() method of the file object.
The below code sample reads the file into a list of "lines", edits the last line, then writes it back out to the file:
#!/usr/bin/python
MYFILE="file.txt"
# read the file into a list of lines
lines = open(MYFILE, 'r').readlines()
# now edit the last line of the list of lines
new_last_line = (lines[-1].rstrip() + ",90,100,50")
lines[-1] = new_last_line
# now write the modified list back out to the file
open(MYFILE, 'w').writelines(lines)
If the file is very large then this approach will not work well, because this reads all the file lines into memory each time and writes them back out to the file, which is very inefficient. For a small file however this will work fine.
Don't work with files directly; build a data structure that fits your needs in the form of a class, and give it read-from-file/write-to-file methods.
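For example, a minimal sketch of that idea for this question (the class and method names are made up for illustration):
class LineFile:
    """Holds a file's lines in memory; edit them, then save."""

    def __init__(self, path):
        self.path = path
        with open(path) as f:
            self.lines = f.read().splitlines()

    def append_to_last_line(self, text):
        self.lines[-1] += text

    def save(self):
        with open(self.path, 'w') as f:
            f.write('\n'.join(self.lines) + '\n')

data = LineFile('file.txt')
data.append_to_last_line('90,100,50')
data.save()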
I recently wrote a script to do something very similar to this. It would traverse a project, find all module dependencies and add any missing import statements. I won't clutter this post up with the entire script, but I'll show how I went about modifying my files.
import os
from mmap import mmap

def insert_import(filename, text):
    if len(text) < 1:
        return
    f = open(filename, 'r+')
    m = mmap(f.fileno(), os.path.getsize(filename))
    origSize = m.size()
    m.resize(origSize + len(text))
    pos = 0
    while True:
        l = m.readline()
        if l.startswith(('import', 'from')):
            continue
        else:
            pos = m.tell() - len(l)
            break
    m[pos+len(text):] = m[pos:origSize]
    m[pos:pos+len(text)] = text
    m.close()
    f.close()
Summary: This snippet takes a filename and a blob of text to insert. It finds the last import statement already present, and sticks the text in at that location.
The part I suggest paying most attention to is the use of mmap. It lets you work with files in the same manner you may work with a string. Very handy.
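As a tiny, hypothetical illustration of that string-like access (read-only, on an arbitrary file):
import mmap

with open('example.py', 'rb') as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
        print(m[:20])              # slice it like a bytestring
        print(m.find(b'import'))   # search it like a bytestring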
