Optimizing find and replace over large files in Python - python

I am a complete beginner to Python or any serious programming language for that matter. I finally got a prototype code to work but I think it will be too slow.
My goal is to find and replace some Chinese characters across all files (they are csv) in a directory with integers as per a csv file I have. The files are nicely numbered by year-month, for example 2000-01.csv, and will be the only files in that directory.
I will be looping across about 25 files that are in the neighborhood of 500mb each (and about a million lines). The dictionary I will be using will have about 300 elements and I will be changing unicode (Chinese character) to integers. I tried with a test run and, assuming everything scales up linearly (?), it looks like it would take about a week for this to run.
Thanks in advance. Here is my code (don't laugh!):
# -*- coding: utf-8 -*-
import os, codecs
dir = "C:/Users/Roy/Desktop/test/"
Dict = {'hello' : 'good', 'world' : 'bad'}
for dirs, subdirs, files in os.walk(dir):
for file in files:
inFile = codecs.open(dir + file, "r", "utf-8")
inFileStr = inFile.read()
inFile.close()
inFile = codecs.open(dir + file, "w", "utf-8")
for key in Dict:
inFileStr = inFileStr.replace(key, Dict[key])
inFile.write(inFileStr)
inFile.close()

In your current code, you're reading the whole file into memory at once. Since they're 500Mb files, that means 500Mb strings. And then you do repeated replacements of them, which means Python has to create a new 500Mb string with the first replacement, then destroy the first string, then create a second 500Mb string for the second replacement, then destroy the second string, et cetera, for each replacement. That turns out to be quite a lot of copying of data back and forth, not to mention using a lot of memory.
If you know the replacements will always be contained in a line, you can read the file line by line by iterating over it. Python will buffer the read, which means it will be fairly optimized. You should open a new file, under a new name, for writing the new file simultaneously. Perform the replacement on each line in turn, and write it out immediately. Doing this will greatly reduce the amount of memory used and the amount of memory copied back and forth as you do the replacements:
for file in files:
fname = os.path.join(dir, file)
inFile = codecs.open(fname, "r", "utf-8")
outFile = codecs.open(fname + ".new", "w", "utf-8")
for line in inFile:
newline = do_replacements_on(line)
outFile.write(newline)
inFile.close()
outFile.close()
os.rename(fname + ".new", fname)
If you can't be certain if they'll always be on one line, things get a little harder; you'd have to read in blocks manually, using inFile.read(blocksize), and keep careful track of whether there might be a partial match at the end of the block. Not as easy to do, but usually still worth it to avoid the 500Mb strings.
Another big improvement would be if you could do the replacements in one go, rather than trying a whole bunch of replacements in order. There are several ways of doing that, but which fits best depends entirely on what you're replacing and with what. For translating single characters into something else, the translate method of unicode objects may be convenient. You pass it a dict mapping unicode codepoints (as integers) to unicode strings:
>>> u"\xff and \ubd23".translate({0xff: u"255", 0xbd23: u"something else"})
u'255 and something else'
For replacing substrings (and not just single characters), you could use the re module. The re.sub function (and the sub method of compiled regexps) can take a callable (a function) as the first argument, which will then be called for each match:
>>> import re
>>> d = {u'spam': u'spam, ham, spam and eggs', u'eggs': u'saussages'}
>>> p = re.compile("|".join(re.escape(k) for k in d))
>>> def repl(m):
... return d[m.group(0)]
...
>>> p.sub(repl, u"spam, vikings, eggs and vikings")
u'spam, ham, spam and eggs, vikings, saussages and vikings'

I think you can lower memory use greatly (and thus limit swap use and make things faster) by reading a line at a time and writing it (after the regexp replacements already suggested) to a temporary file - then moving the file to replace the original.

A few things (unrelated to the optimization problem):
dir + file should be os.path.join(dir, file)
You might want to not reuse infile, but instead open (and write to) a separate outfile. This also won't increase performance, but is good practice.
I don't know if you're I/O bound or cpu bound, but if your cpu utilization is very high, you may want to use threading, with each thread operating on a different file (so with a quad core processor, you'd be reading/writing 4 different files simultaneously).

Open the files read/write ('r+') and avoid the double open/close (and likely associated buffer flush). Also, if possible, don't write back the entire file, seek and write back only the changed areas after doing the replace on the file's contents. Read, replace, write changed areas (if any).
That still won't help performance too much though: I'd profile and determine where the performance hit actually is and then move onto optimising it. It could just be the reading of the data from disk that's very slow, and there's not much you can do about that in Python.

Related

How do I write a simple, Python parsing script?

Most of what I do involves writing simple parsing scripts that reads search terms from one file and searches, line by line, another file. Once the search term is found, the line and sometimes the following line are written to another output file. The code I use is rudimentary and likely crude.
#!/usr/bin/env python
data = open("data.txt", "r")
search_terms = ids.read().splitlines()
data.close()
db = open("db.txt", "r")
output = open("output.txt", "w")
for term in search_terms:
for line in db:
if line.find(term) > -1:
next_line = db.next()
output.write(">" + head + "\n" + next_line)
print("Found %s" % term)
There are a few problems here. First, I don't think it's the most efficient and fastest to search line by line, but I'm not exactly sure about that. Second, I often run into issues with cursor placement and the cursor doesn't reset to the beginning of the file when the search term is found. Third, while I am usually confident that all of the terms can be found in the db, there are rare times when I can't be sure, so I would like to write to another file whenever it iterates through the entire db and can't find the term. I've tried adding a snippet that counts the number of lines of the db so if the find() function gets to the last line and the term isn't found, then it outputs to another "not found" file, but I haven't been able to get my elif and else loops right.
Overall, I'd just like any hints or corrections that could make this sort of script more efficient and robust.
Thanks.
Unless it's a really big file, why not iterate line by line? If the input file's size is some significant portion of your machine's available resources (memory), then you might want to look into buffered input and other, more low-level abstractions of what the computer is doing. But if you're talking about a few hundred MB or less on a relatively modern machine, let the computer do the computing ;)
Off the bat you might want to get into the habit of using the built-in context manager with. For instance, in your snippet, you don't have a call to output.close().
with open('data.txt', 'r') as f_in:
search_terms = f_in.read().splitlines()
Now search_terms is a handle to a list that has each line from data.txt as a string (but with the newline characters removed). And data.txt is closed thanks to with.
In fact, I would do that with the db.txt file, also.
with open('db.txt', 'r') as f_in:
lines = f_in.read().splitlines()
Context managers are cool.
As a side note, you could open your destination file now, and do your parsing and results-tracking with it open the whole time, but I like leaving as many files closed as possible for as long as possible.
I would suggest setting the biggest object on the outside of your loop, which I'm guessing is db.txt contents. The outermost loop only usually only gets iterated once, so might as well put the biggest thing there.
results = []
for i, line in enumerate(lines):
for term in search_terms:
if term in line:
# Use something not likely to appear in your line as a separator
# for these "second lines". I used three pipe characters, but
# you could just as easily use something even more random
results.append('{}|||{}'.format(line, lines[i+1]))
if results:
with open('output.txt', 'w') as f_out:
for result in results:
# Don't forget to replace your custom field separator
f_out.write('> {}\n'.format(result.replace('|||', '\n')))
else:
with open('no_results.txt', 'w') as f_out:
# This will write an empty file to disk
pass
The nice thing about this approach is each line in db.txt is checked once for each search_term in search_terms. However, the downside is that any line will be recorded for each search term it contains, ie., if it has three search terms in it, that line will appear in your output.txt three times.
And all the files are magically closed.
Context managers are cool.
Good luck!
search_terms keeps whole data.txt in memory. That it's not good in general but in this case it's not quite bad.
Looking line-by-line is not sufficient but if the case is simple and files are not too big it's not a big deal. If you want more efficiency you should sort data.txt file and put this to some tree-like structure. It depends on data which is inside.
You have to use seek to move pointer back after using next.
Propably the easiest way here is to generate two lists of lines and search using in like:
`db = open('db.txt').readlines()
db_words = [x.split() for x in db]
data = open('data.txt').readlines()
print('Lines in db {}'.format(len(db)))
for item in db:
for words in db_words:
if item in words:
print("Found {}".format(item))`
Your key issue is that you may be looping in the wrong order -- in your code as posted, you'll always exhaust the db looking for the first term, so after the first pass of the outer for loop db will be at end, no more lines to read, no other term will ever be found.
Other improvements include using the with statement to guarantee file closure, and a set to track which search terms were not found. (There are also typos in your posted code, such as opening a file as data but then reading it as ids).
So, for example, something like:
with open("data.txt", "r") as data:
search_terms = data.read().splitlines()
missing_terms = set(search_terms)
with open("db.txt", "r") as db, open("output.txt", "w") as output:
for line in db:
for term in search_terms:
if term in line:
missing_terms.discard(term)
next_line = db.next()
output.write(">" + head + "\n" + next_line)
print("Found {}".format(term))
break
if missing_terms:
diagnose_not_found(missing_terms)
where the diagnose_not_found function does whatever you need to do to warn the user about missing terms.
There are assumptions embedded here, such as the fact that you don't care if some other search term is present in a line where you've found a previous one, or the very next one; they might take substantial work to fix if not applicable and it will require that you edit your Q with a very complete and unambiguous list of specifications.
If your db is actually small enough to comfortably fit in memory, slurping it all in as a list of lines once and for all would allow easier accommodation for more demanding specs (as in that case you can easily go back and forth, while iterating on a file means you can only go forward one line at a time), so if your specs are indeed more demanding please also clarify if this crucial condition hold, or rather you need this script to process potentially humungous db files (say gigabyte-plus sizes, so as to not "comfortably fit in memory", depending on your platform of course).

How do i replace a specific value in a file in python

Im trying to replace the zero's with a value. So far this is my code, but what do i do next?
g = open("January.txt", "r+")
for i in range(3):
dat_month = g.readline()
Month: January
Item: Lawn
Total square metres purchased:
0
monthly value = 0
You could do that -
but that is not the usual approach, and certainly is not the correct approach for text files.
The correct way to do it is to write another file, with the information you want updated in place, and then rename the new file to the old one. That is the only sane way of doing this with text files, since the information size in bytes for the fields is variable.
As for the impresion that you are "writing 200 bytes to the disk" instead of a single byte, changing your value, don't let that fool you: at the Operating system level, all file access has to be done in blocks, which are usually a couple of kilobytes long (in special cases, and tunned filesystems it could be a couple hundred bytes). Anyway, you will never, in a user-space program, much less in a high level language like Python, trigger a diskwrite of less than a few hundred bytes.
Now, for the code:
import os
my_number = <number you want to place in the line you want to rewrite>
with open("January.txt", "r") as in_file, open("newfile.txt", "w") as out_file:
for line in in_file:
if line.strip() == "0":
out_file.write(str(my_number) + "\n")
else:
out_file.write(line)
os.unlink("January.txt")
os.rename("newfile.txt", "January.txt")
So - that is the general idea -
of course you should not write code with all values hardcoded in that way (i.e. the values to be checked and written fixed in the program code, as are the filenames).
As for the with statement - it is a special construct of the language wich is very appropriate to oppening files and manipulating then in a block, like in this case - but it is not needed.
Programing apart, the concept you have to keep in mind is this:
when you use an application that lets you edit a text file, a spreadsheet, an image, you, as user, may have the impression that after you are done and have saved your work, the updates are comitted to the same file. In the vast, vast majority of use cases, that is not what happens: the application uses internally a pattern like the one I presented above - a completly new file is written to disk and the old one is deleted, or renamed. The few exceptions could be simple database applications, which could replace fixed width fields inside the file itself on updates. Modern day databases certainly do not do that, resorting to appending the most recent, updated information, to the end of the file. PDF files are another kind that were not designed to be replaced entirely on each update, when being created: but also in that case, the updated information is written at the end of the file, even if the update is to take place in a page in the beginning of the rendered document.
dat_month = dat_month.replace("0", "45678")
To write to a file you do:
with open("Outfile.txt", "wt") as outfile:
And then
outfile.write(dat_month)
Try this:
import fileinput
import itertools
import sys
with fileinput.input('January.txt', inplace=True) as file:
beginning = tuple(itertools.islice(file, 3))
sys.stdout.writelines(beginning)
sys.stdout.write(next(file).replace('0', 'a value'))
sys.stdout.write(next(file).replace('0', 'a value'))
sys.stdout.writelines(file)

How to enhance the performance of finding the most common strings in Python?

I have about 30 files, the size of each is around 300MB. There are some information I'm interested in in each file, such as usernames. Now I want to find the usernames using regex, then find the most common usernames. Here's my code:
rList=[]
for files in os.listdir("."):
with open(files,'r') as f:
for line in f:
m=re.search('PATTERN TO FIND USERNAME',line)
if m:
rList.append(m.group())
c=Counter(rList)
print c.most_common(10)
Now as you can see, I add every username I find to a list and then call Counter(). This way it takes about several minutes to finish. I've tried removing the c=Counter(rList) and calling c.update() every time I finish reading a file, but it won't make any differnce, will it?
SO, is this the best practice? Are there any ways to improve the performance? Thanks!
Profiling will show you that there is significant overhead involved with looping over each line of the file one by one. If the files are always around the size you specified and you can spend the memory, get them into memory with a single call to .read() and then use a more complex, pre-compiled regexp (that takes line-breaks into account) to extract all usernames at once. Then .update() your counter-object with the groups from the matched regexp. This will be about as efficient as it can get.
If you have the memory then:
Use mmap
Use implicit loops as much as possible
The following fragment should be fast but needs memory:
# imports elided
patternString = rb'\b[a-zA-Z]+\b' # byte string creating a byte pattern
pattern = re.compile(patternString)
c = Counter()
for fname in os.listdir("."):
with open(fname, "r+b") as f:
mm = mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ)
c.update(pattern.findall(mm))
print(c.most_common(10))
patternString should be your pattern.

Update strings in a text file at a specific location

I would like to find a better solution to achieve the following three steps:
read strings at a given row
update strings
write the updated strings back
Below are my code which works but I am wondering is there any better (simple) solutions?
new='99999'
f=open('C:/Users/th/Dropbox/com/MS1Ctt-P-temp.INP','r+')
lines=f.readlines()
#the row number we want to update is given, so just load the content
x = lines[95]
print(x)
f.close()
#replace
f1=open('C:/Users/th/Dropbox/com/MS1Ctt-P-temp.INP')
con = f1.read()
print con
con1 = con.replace(x[2:8],new) #only certain columns in this row needs to be updated
print con1
f1.close()
#write
f2 = open('C:/Users/th/Dropbox/com/MS1Ctt-P-temp.INP', 'w')
f2.write(con1)
f2.close()
Thanks!
UPDATE: get an idea from jtmoulia this time it becomes easier
def replace_line(file_name, line_num, col_s, col_e, text):
lines = open(file_name, 'r').readlines()
temp=lines[line_num]
temp = temp.replace(temp[col_s:col_e],text)
lines[line_num]=temp
out = open(file_name, 'w')
out.writelines(lines)
out.close()
The problem with textual data, even when tabulated, is that the byte offsets are not predictable. For example, when representing numbers with strings you have one byte per digit, whereas when using binary (e.g. two's complement) you always need four or eight bytes either for small and large integers.
Nevertheless, if your text format is strict enough you can get along by replacing bytes without changing the size of the file, you can try using the standard mmap module. With it, you'll be able to treat a file as a mutable byte string and modify parts of it inplace and letting the kernel do the file saving for you.
Otherwise, whatever of the other answers are much better suited for the problem.
Well, to begin with you don't need to keep reopening and reading from the file every time. The r+ mode allows you to read and write to the given file.
Perhaps something like
with open('C:/Users/th/Dropbox/com/MS1Ctt-P-temp.INP', 'r+') as f:
lines = f.readlines()
#... Perform whatever replacement you'd like on lines
f.seek(0)
f.writelines(lines)
Also, Editing specific line in text file in python
When I had to do something similar (for a Webmin customization), I did it entirely in PERL because that's what the Webmin framework used, and I found it quite easy. I assume (but don't know for sure) there are equivalent things in Python. First read the entire file into memory all at once (the PERL way to do this is probably called "slurp"). (This idea of holding the entire file in memory rather than just one line used to make little sense {or even be impossible}. But these days RAM is so large it's the only way to go.) Then use the split operator to divide the file into lines and put each line in a different element of a giant array. You can then use the desired line number as an index into the array (remember array indices usually start with 0). Finally, use "regular expression" processing to change the text of the line. Then change another line, and another, and another (or make another change to the same line). When you're all done, use join to put all the lines in the array back together into one giant string. Then write the whole modified file out.
While I don't have the complete code handy, here's an approximate fragment of some of the PERL code so you can see what I mean:
our #filelines = ();
our $lineno = 43;
our $oldstring = 'foobar';
our $newstring = 'fee fie fo fum';
$filelines[$lineno-1] =~ s/$oldstring/$newstring/ig;
# "ig" modifiers for case-insensitivity and possible multiple occurences in the line
# use different modifiers at the end of the s/// construct as needed
FILENAME = 'C:/Users/th/Dropbox/com/MS1Ctt-P-temp.INP'
lines = list(open(FILENAME))
lines[95][2:8] = '99999'
open(FILENAME, 'w').write(''.join(lines))

How can I change a huge file into csv in python

I'm a beginner in python. I have a huge text file (hundreds of GB) and I want to convert the file into csv file. In my text file, I know the row delimiter is a string "<><><><><><><>". If a line contains that string, I want to replace it with ". Is there a way to do it without having to read the old file and rewriting a new file.
Normally I thought I need to do something like this:
fin = open("input", "r")
fout = open("outpout", "w")
line = f.readline
while line != "":
if line.contains("<><><><><><><>"):
fout.writeline("\"")
else:
fout.writeline(line)
line = f.readline
but copying hundreds of GB is wasteful. Also I don't know if open will eat lots of memory (does it treat file handler as a stream?)
Any help is greatly appreciated.
Note: an example of the file would be
file.txt
<><><><><><><>
abcdefeghsduai
asdjliwa
1231214 ""
<><><><><><><>
would be one row and one column in csv.
#richard-levasseur
I agree, sed seems like the right way to go. Here's a rough cut at what the OP describes:
sed -i -e's/<><><><><><><>/"/g' foo.txt
This will do the replacement in-place in the existing foo.txt. For that reason, I recommend having the original file under some sort of version control; any of the DVCS should fit the bill.
Yes, open() treats the file as a stream, as does readline(). It'll only read the next line. If you call read(), however, it'll read everything into memory.
Your example code looks ok at first glance. Almost every solution will require you to copy the file elsewhere. Its not exactly easy to modify the contents of a file inplace without a 1:1 replacement.
It may be faster to use some standard unix utilities (awk and sed most likely), but I lack the unix and bash-fu necessary to provide a full solution.
It's only wasteful if you don't have disk to spare. That is, fix it when it's a problem. Your solution looks ok as a first attempt.
It's not wasteful of memory because a file handler is a stream.
Reading lines is simply done using a file iterator:
for line in fin:
if line.contains("<><><><><><><>"):
fout.writeline("\"")
Also consider the CSV writer object to write CSV files, e.g:
import csv
writer = csv.writer(open("some.csv", "wb"))
writer.writerows(someiterable)
With python you will have to create a new file for safety sake, it will cause alot less headaches than trying to write in place.
The below listed reads your input 1 line at a time and buffers the columns (from what I understood of your test input file was 1 row) and then once the end of row delimiter is hit it will write that buffer to disk, flushing manually every 1000 lines of the original file. This will save some IO as well instead of writing every segment, 1000 writes of 32 bytes each will be faster than 4000 writes of 8 bytes.
fin = open(input_fn, "rb")
fout = open(output_fn, "wb")
row_delim = "<><><><><><><>"
write_buffer = []
for i, line in enumerate(fin):
if not i % 1000:
fout.flush()
if row_delim in line and i:
fout.write('"%s"\r\n'%'","'.join(write_buffer))
write_buffer = []
else:
write_buffer.append(line.strip())
Hope that helps.
EDIT: Forgot to mention, while using .readline() is not a bad thing don't use .readlines() which will go and read the entire content of the file into a list containing each line which is incredibly inefficient. Using the built in iterator that comes with a file object is the best memory usage and speed.
#Constatin suggests that if you would be satisfied with replacing '<><><><><><><>\n' by '" \n'
then the replacement string is the same length, and in that case you can craft a solution to in-place editing with mmap. You will need python 2.6. It's vital that the file is opened in the right mode!
import mmap, os
CHUNK = 2**20
oldStr = ''
newStr = '" '
strLen = len(oldStr)
assert strLen==len(newStr)
f = open("myfilename", "r+")
size = os.fstat(f.fileno()).st_size
for offset in range(0,size,CHUNK):
map = mmap.mmap(f.fileno(),
length=min(CHUNK+strLen,size-offset), # not beyond EOF
offset=offset)
index = 0 # start at beginning
while 1:
index = map.find(oldStr,index) # find next match
if index == -1: # no more matches in this map
break
map[index:index+strLen] = newStr
f.close()
This code is not debugged! It works for me on a 3 MB test case, but it may not work on a large ( > 2GB) file - the mmap module still seems a bit immature, so I wouldn't rely on it too much.
Looking at the bigger picture, from what you've posted it isn't clear that your file will end up as valid CSV. Also be aware that the tool you're planning to use to actually process the CSV may be flexible enough to deal with the file as it stands.
If you're delimiting fields with double quotes, it looks like you need to escape the double quotes you have occurring in your elements (for example 1231214 "" will need to be \n1231214 \"\").
Something like
fin = open("input", "r")
fout = open("output", "w")
for line in fin:
if line.contains("<><><><><><><>"):
fout.writeline("\"")
else:
fout.writeline(line.replace('"',r'\"')
fin.close()
fout.close()
[For the problem exactly as stated] There's no way that this can be done without copying the data, in python or any other language. If your processing always replaced substrings with new substrings of equal length, maybe you could do it in-place. But whenever you replace <><><><><><><> with " you are changing the position of all subsequent characters in the file. Copying from one place to another is the only way to handle this.
EDIT:
Note that the use of sed won't actually save any copying...sed doesn't really edit in-place either. From the GNU sed manual:
-i[SUFFIX]
--in-place[=SUFFIX]
This option specifies that files are to be edited in-place. GNU sed does this by creating a temporary file and sending output to this file rather than to the standard output.
(emphasis mine.)

Categories