I am working with a text file of about 12*10^6 rows which is stored on my hard disk.
The structure of the file is:
data|data|data|...|data\n
data|data|data|...|data\n
data|data|data|...|data\n
...
data|data|data|...|data\n
There's no header, and there's no id to uniquely identify the rows.
Since I want to use it for machine learning purposes, I need to make sure that there's no order in the text file which may affect the stochastic learning.
Usually I load this kind of file into memory, shuffle it, and rewrite it to disk. Unfortunately this time that is not possible due to the size of the file, so I have to manage the shuffling directly on disk (assume I don't have a problem with disk space). Any idea how to effectively (with the lowest possible complexity, i.e. number of writes to disk) manage such a task with Python?
All but one of these ideas use O(N) memory—but if you use an array.array or numpy.ndarray we're talking around N*4 bytes, which is significantly smaller than the whole file. (I'll use a plain list for simplicity; if you need help converting to a more compact type, I can show that too.)
Using a temporary database and an index list:
import contextlib
import dbm
import os
import random

# dbm stores values as bytes, so work in binary mode throughout
with contextlib.closing(dbm.open('temp.db', 'n')) as db:
    with open(path, 'rb') as f:
        linecount = 0
        for i, line in enumerate(f):
            db[str(i)] = line
            linecount = i + 1
    shuffled = list(range(linecount))
    random.shuffle(shuffled)
    with open(path + '.shuffled', 'wb') as f:
        for i in shuffled:
            f.write(db[str(i)])
os.remove('temp.db')
This is 2N single-line disk operations and 2N single-dbm-key disk operations, which should come to roughly 2N log N single-disk-operation equivalents, so the total complexity is O(N log N).
If you use a relational database like sqlite3 instead of a dbm, you don't even need the index list, because you can just do this:
SELECT * FROM Lines ORDER BY RANDOM()
This has the same time complexity as the above, and the space complexity is O(1) instead of O(N)—in theory. In practice, you need an RDBMS that can feed you a row at a time from a 100M row set without storing that 100M on either side.
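For illustration, a minimal sketch of that sqlite3 variant might look like the following (the table name Lines comes from the query above; the temporary database filename and function name are mine):
import sqlite3

def shuffle_via_sqlite(path):
    conn = sqlite3.connect('temp_shuffle.db')
    conn.execute('CREATE TABLE Lines (line TEXT)')
    with open(path) as f:
        # executemany with a generator keeps memory use at one row at a time
        conn.executemany('INSERT INTO Lines VALUES (?)', ((line,) for line in f))
    conn.commit()
    with open(path + '.shuffled', 'w') as out:
        # the cursor is iterated lazily, one row at a time
        for (line,) in conn.execute('SELECT line FROM Lines ORDER BY RANDOM()'):
            out.write(line)
    conn.close()
Keep in mind SQLite still needs scratch space to do the random ordering, which is exactly the caveat above.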
A different option, without using a temporary database—in theory O(N**2), but in practice maybe faster if you happen to have enough memory for the line cache to be helpful:
import linecache
import random

with open(path) as f:
    linecount = sum(1 for _ in f)
shuffled = list(range(1, linecount + 1))   # linecache.getline uses 1-based line numbers
random.shuffle(shuffled)
with open(path + '.shuffled', 'w') as f:
    for i in shuffled:
        f.write(linecache.getline(path, i))
Finally, by doubling the size of the index list, we can eliminate the temporary disk storage. But in practice, this might be a lot slower, because you're doing a lot more random-access reads, which drives aren't nearly as good at.
import random

linestarts = [0]
with open(path, 'rb') as f:        # binary mode, so tell() stays accurate while iterating
    for line in f:
        linestarts.append(f.tell())
lineranges = list(zip(linestarts[:-1], linestarts[1:]))
random.shuffle(lineranges)
with open(path, 'rb') as f, open(path + '.shuffled', 'wb') as f1:
    for start, stop in lineranges:
        f.seek(start)
        f1.write(f.read(stop - start))
This is a suggestion based on my comment above. It relies on the compressed lines still fitting into memory; if that is not the case, the other solutions will be required.
import zlib
from random import shuffle

def heavy_shuffle(filename_in, filename_out):
    with open(filename_in, 'rb') as f:
        # each line keeps its trailing newline, so no separator is needed on output
        zlines = [zlib.compress(line, 9) for line in f]
    shuffle(zlines)
    with open(filename_out, 'wb') as f:
        for zline in zlines:
            f.write(zlib.decompress(zline))
My experience has been that zlib is fast, while bz2 offers better compression, so you may want to compare.
Also, if you can get away with chunking, say, n lines together, doing so is likely to lift your compression ratio.
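For example, a rough variant of heavy_shuffle along those lines might compress every n lines together; note that shuffling then happens at the chunk level plus within each chunk, which is only an approximate shuffle (the chunk size n and the function name here are mine):
import zlib
from random import shuffle

def chunked_heavy_shuffle(filename_in, filename_out, n=1000):
    zchunks = []
    with open(filename_in, 'rb') as f:
        buf = []
        for line in f:
            buf.append(line)
            if len(buf) == n:
                zchunks.append(zlib.compress(b''.join(buf), 9))
                buf = []
        if buf:
            zchunks.append(zlib.compress(b''.join(buf), 9))
    shuffle(zchunks)                        # shuffle the chunk order...
    with open(filename_out, 'wb') as f:
        for zchunk in zchunks:
            lines = zlib.decompress(zchunk).splitlines(True)
            shuffle(lines)                  # ...and the line order within each chunk
            f.writelines(lines)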
I was wondering about the likelihood of useful compression, so here's an IPython experiment. I don't know what your data looks like, so I just went with floats (as strings) rounded to 3 places and strung together with pipes:
Best-case scenario (e.g. many rows have all the same digits):
In [38]: data = '0.000|'*200
In [39]: len(data)
Out[39]: 1200
In [40]: zdata = zlib.compress(data, 9)
In [41]: print 'zlib compression ratio: ',1.-1.*len(zdata)/len(data)
zlib compression ratio: 0.98
In [42]: bz2data = bz2.compress(data, 9)
In [43]: print 'bz2 compression ratio: ',1.-1.*len(bz2data)/len(data)
bz2 compression ratio: 0.959166666667
As expected, best-case is really good, >95% compression ratio.
Worst-case scenario (randomized data):
In [44]: randdata = '|'.join(['{:.3f}'.format(x) for x in np.random.randn(200)])
In [45]: zdata = zlib.compress(randdata, 9)
In [46]: print 'zlib compression ratio: ',1.-1.*len(zdata)/len(data)
zlib compression ratio: 0.5525
In [47]: bz2data = bz2.compress(randdata, 9)
In [48]: print 'bz2 compression ratio: ',1.-1.*len(bz2data)/len(data)
bz2 compression ratio: 0.5975
Surprisingly, the worst case is not too bad: a ~60% compression ratio. But it is likely to be problematic if you only have 8 GB of memory (the remaining ~40% of 15 GB is still about 6 GB, before any Python overhead).
Assuming that disk space is not a problem for you, I am creating multiple files to hold the data.
import random
import os

PMSize = 100  # Lower value means using more primary memory
bucket_names = ['file' + str(x) for x in range(PMSize)]
buckets = [open(name, 'w') for name in bucket_names]

with open('filename') as f:
    for line in f:
        i = random.randint(0, PMSize - 1)
        buckets[i].write(line)

for bucket in buckets:
    bucket.close()

with open('filename', 'w') as out:
    for name in bucket_names:
        with open(name) as bucket:
            lines = bucket.readlines()
            random.shuffle(lines)  # shuffle within each bucket so the final order is random
            out.writelines(lines)
        os.remove(name)
Your memory use will be controlled by PMSize: the smaller PMSize, the larger each bucket that has to be read back into memory. Time complexity will be around O(N + PMSize).
This problem can be thought of as a problem of efficient memory page management to reduce swap-file I/O. Let your buffer buf be a list of contiguous chunks of the file you would like to store in the output file, and let a contiguous chunk of the file be a list of a fixed number of whole lines.
Now, generate a random sequence and remap the returned values to appropriate chunk numbers and line offsets inside those chunks.
This operation leaves you with a sequence of numbers in [1..num of chunks] which can be described as a sequence of accesses to memory fragments contained in pages numbered [1..num of chunks]. For the online variation (as in a real OS) there is no optimal strategy for this problem, but since you know the actual sequence of page references, there is an optimal solution (Bélády's algorithm).
What's the gain from this approach? The pages that are used most often are reread from the HDD the least, meaning fewer I/O operations to read the data. Also, as long as your chunk size is big enough to minimize page swapping relative to the memory footprint, much of the time consecutive lines of the output file will be taken from a chunk already stored in memory (or from another one not yet swapped back to the drive) rather than reread from the drive.
Maybe it's not the easiest solution (though the optimal page-swapping algorithm is easy to write), but it could be a fun exercise to do, wouldn't it?
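For what it's worth, here is a rough sketch of that optimal (clairvoyant, a.k.a. Bélády) replacement policy applied to chunk reads; the chunk loading and line output are stubbed out, and all names are illustrative:
def shuffle_with_chunk_cache(access_sequence, capacity, load_chunk, emit_line):
    """access_sequence: list of (chunk_id, line_offset) pairs in shuffled output order."""
    cache = {}                                       # chunk_id -> list of lines
    for pos, (chunk_id, offset) in enumerate(access_sequence):
        if chunk_id not in cache:
            if len(cache) >= capacity:
                # evict the cached chunk whose next use lies farthest in the future
                def next_use(cid):
                    for j in range(pos + 1, len(access_sequence)):
                        if access_sequence[j][0] == cid:
                            return j
                    return float('inf')              # never referenced again: best victim
                victim = max(cache, key=next_use)
                del cache[victim]
            cache[chunk_id] = load_chunk(chunk_id)   # one disk read per cache miss
        emit_line(cache[chunk_id][offset])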
Related
I'm reading line by line from a text file and manipulating the string to then be written to a csv file.
I can think of two good ways to do this (and I welcome other ideas or modifications):
Read and process a single line into a list, then go straight to writing that line out.
import csv

linelist = []
with open('dirty.txt', 'r') as dirty_text:
    with open('clean.csv', 'w') as clean_csv:
        cleancsv_writer = csv.writer(clean_csv, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
        for line in dirty_text:
            # Parse fields into a list, replacing the previous list item with a new string that is a comma-separated row.
            # Write the list item into clean.csv.
            pass
Read and process the lines into a list (until reaching the size limit of the list), then write the list to the CSV in one big batch. Repeat until the end of the file (but I'm leaving the loop out of this example).
linelist = []
seekpos = 0
with open('dirty.txt', 'r') as dirty_text:
    for line in dirty_text:
        # Parse fields into the list until the end of the file or the end of the list's memory space,
        # such that each list item is a string that is a comma-separated row.
        # Update the seek position to come back to after this batch, if looping through multiple batches.
        pass

with open('clean.csv', 'a') as clean_csv:
    cleancsv_writer = csv.writer(clean_csv, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    # Write the list into clean.csv, each list item becoming a comma-separated row.
    # This would likely be a loop for bigger files, but for my project and for simplicity, it's not necessary.
Which process is the most efficient use of resources?
In this case, I'm assuming nobody (human or otherwise) needs access to either file during this process (though I would gladly hear discussion about efficiency in that case).
I'm also assuming a list demands fewer resources than a dictionary.
Memory use is my primary concern. My hunch is that the first process uses the least memory because the list never gets longer than one item, so the maximum memory it uses at any given moment is less than that of the second process, which maxes out the list memory. But I'm not sure how dynamic memory allocation works in Python, and you have two file objects open at the same time in the first process.
As for power usage and total time it takes, I'm not sure which process is more efficient. My hunch is that with multiple batches, the second option would use more power and take more time because it opens and closes the files at each batch.
As for code complexity and length, the first option seems like it will turn out simpler and shorter.
Other considerations?
Which process is best?
Is there a better way? Ten better ways?
Thanks in advance!
Reading all the data into memory is inefficient because it uses more memory than necessary.
You can trade some CPU for memory; the program to read everything into memory will have a single, very simple main loop; but the main bottleneck will be the I/O channel, so it really won't be faster. Regardless of how fast the code runs, any reasonable implementation will spend most of its running time waiting for the disk.
If you have enough memory, reading the entire file into memory will work fine. Once the data is bigger than your available memory, performance will degrade ungracefully (i.e. the OS will start swapping regions of memory out to disk and then swap them back in when they are needed again; in the worst case, this will basically grind the system to a halt, a situation called thrashing). The main reason to prefer reading and writing a line at a time is that the program will perform without degradation even when you scale up to larger amounts of data.
I/O is already buffered; just write what looks natural, and let the file-like objects and the operating system take care of the actual disk reads and writes.
import csv

with open('dirty.txt', 'r') as dirty_text:
    with open('clean.csv', 'w') as clean_csv:
        cleancsv_writer = csv.writer(clean_csv, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
        for line in dirty_text:
            row = some_function(line)
            cleancsv_writer.writerow(row)
If all the work of cleaning up a line is abstracted away by some_function, you don't even need the for loop.
with open('dirty.txt', 'r') as dirty_text, \
     open('clean.csv', 'w') as clean_csv:
    cleancsv_writer = csv.writer(clean_csv, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    cleancsv_writer.writerows(some_function(line) for line in dirty_text)
I have a problem that I have not been able to solve. I have four .txt files, each between 30 and 70 GB. Each file contains n-gram entries as follows:
blabla1/blabla2/blabla3
word1/word2/word3
...
What I'm trying to do is count how many times each item appears, and save this data to a new file, e.g.:
blabla1/blabla2/blabla3 : 1
word1/word2/word3 : 3
...
My attempt so far has been simply to save all entries in a dictionary and count them, i.e.
from collections import defaultdict

entry_count_dict = defaultdict(int)
with open(file) as f:
    for line in f:
        entry_count_dict[line] += 1
However, using this method I run into memory errors (I have 8 GB of RAM available). The data follows a Zipfian distribution: the majority of the items occur only once or twice.
The total number of entries is unclear, but a (very) rough estimate is that there are somewhere around 15,000,000 entries in total.
In addition to this, I've tried h5py, where all the entries are saved as an h5py dataset containing the array [1], which is then updated, e.g.:
import h5py
import numpy as np

entry_count_file = h5py.File(filename)
with open(file) as f:
    for line in f:
        if line in entry_count_file:
            entry_count_file[line][0] += 1
        else:
            entry_count_file.create_dataset(line,
                                            data=np.array([1]),
                                            compression="lzf")
However, this method is way too slow. The writing speed gets slower and slower, so unless it can be increased this approach is impractical. Also, processing the data in chunks and opening/closing the h5py file for each chunk did not show any significant difference in processing speed.
I've been thinking about saving entries which start with certain letters in separate files, i.e. all the entries which start with a are saved in a.txt, and so on (this should be doable using defaultdict(int)).
However, to do this the file has to be iterated over once for every letter, which is impractical given the file sizes (max = 69 GB).
Perhaps when iterating over the file, one could open a pickle and save the entry in a dict, and then close the pickle. But doing this for each item slows down the process quite a lot due to the time it takes to open, load and close the pickle file.
One way of solving this would be to sort all the entries during one pass, then iterate over the sorted file and count the entries alphabetically. However, even sorting the file is painfully slow using the Linux command:
sort file.txt > sorted_file.txt
And I don't really know how to solve this in Python, given that loading the whole file into memory for sorting would cause memory errors. I have some superficial knowledge of different sorting algorithms, but they all seem to require that the whole object to be sorted be loaded into memory.
Any tips on how to approach this would be much appreciated.
There are a number of algorithms for performing this type of operation. They all fall under the general heading of External Sorting.
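As a rough illustration of that idea under your constraints: split the input into sorted runs that fit in memory, then stream-merge the runs and count adjacent duplicates (the run size and all names below are mine, not from your code):
import heapq
import itertools
import os
import tempfile

def count_entries(infile, outfile, lines_per_run=5000000):
    run_paths = []
    with open(infile) as f:
        while True:
            run = sorted(itertools.islice(f, lines_per_run))   # one in-memory-sized sorted run
            if not run:
                break
            fd, path = tempfile.mkstemp(text=True)
            with os.fdopen(fd, 'w') as tmp:
                tmp.writelines(run)
            run_paths.append(path)
    runs = [open(p) for p in run_paths]
    with open(outfile, 'w') as out:
        # heapq.merge streams the sorted runs; groupby counts equal adjacent lines
        for line, group in itertools.groupby(heapq.merge(*runs)):
            out.write('%s : %d\n' % (line.rstrip('\n'), sum(1 for _ in group)))
    for handle, path in zip(runs, run_paths):
        handle.close()
        os.remove(path)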
What you did there with "saving entries which start with certain letters in separate files" is actually called bucket sort, which should, in theory, be faster. Try it with sliced data sets.
Or, try Dask, a DARPA- and Anaconda-backed distributed computing library with interfaces familiar from numpy and pandas, which works much like Apache Spark (and works on a single machine too). By the way, it scales.
I suggest trying dask.array, which cuts a large array into many small ones and implements the numpy ndarray interface with blocked algorithms to utilize all of your cores when computing on larger-than-memory data.
I've been thinking about saving entries which start with certain letters in separate files, i.e. all the entries which start with a are saved in a.txt, and so on (this should be doable using defaultdict(int)). However, to do this the file has to be iterated over once for every letter, which is impractical given the file sizes (max = 69 GB).
You are almost there with this line of thinking. What you want to do is to split the file based on a prefix - you don't have to iterate once for every letter. This is trivial in awk. Assuming your input files are in a directory called input:
mkdir output
awk '/./ {print $0 > ( "output/" substr($0,1,1))}' input/*
This will append each line to a file named with the first character of that line (note this will be weird if your lines can start with a space; since these are ngrams I assume that's not relevant). You could also do this in Python but managing the opening and closing of files is somewhat more tedious.
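For completeness, a rough Python equivalent of that awk one-liner (it keeps one output handle open per first character and assumes the input files live in a directory called input, as above):
import os

os.makedirs('output', exist_ok=True)
handles = {}
for name in os.listdir('input'):
    with open(os.path.join('input', name)) as f:
        for line in f:
            if not line.strip():
                continue                      # roughly mirror the awk /./ filter: skip empty lines
            first = line[0]
            if first not in handles:
                handles[first] = open(os.path.join('output', first), 'a')
            handles[first].write(line)
for handle in handles.values():
    handle.close()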
Because the files have been split up they should be much smaller now. You could sort them but there's really no need - you can read the files individually and get the counts with code like this:
from collections import Counter

ngrams = Counter()
with open(filename) as f:
    for line in f:
        ngrams[line.strip()] += 1

for key, val in ngrams.items():
    print(key, val, sep='\t')
If the files are still too large you can increase the length of the prefix used to bucket the lines until the files are small enough.
So recently I took on as a personal project to make my very own DB in Python, mainly because I hate messing around with most DBs and I needed something easy to set up, portable and simple for studying large data sets.
I now find myself stuck on a problem: an efficient way to delete a line from the DB file (which is really just a text file). The way I found to do it is to write all of the content that comes after the line back over the line itself, and then truncate the file (I'll take suggestions on better ways to do it). The problem arises when I need to write that trailing content back, because doing it all at once could load millions of lines into RAM. The code follows:
ln = 11  # Line to be deleted
with open("test.txt", "r+") as f:
    readlinef = f.readline
    for i in xrange(ln):
        line = readlinef()
    length, start = len(line), f.tell() - len(line)
    f.seek(0, 2)
    chunk = f.tell() - (start + length)   # bytes remaining after the deleted line
    f.seek(start + length, 0)
    # How to make this buffered?
    data = f.read(chunk)
    f.seek(start, 0)
    f.write(data)
    f.truncate()
Right now that's reading all of the data at once; how would I make that last code block work in a buffered fashion? The start position would shift every time a new chunk of data is written before it, and I was wondering what would be the most efficient and fastest (execution-time-wise) way to do this.
Thanks in advance.
edit
I've decided to follow the advice given here, but just for curiosity's sake I found a way to read and write in chunks. It follows:
with open("test.txt", "r+") as f:
readlinef = f.readline
for i in xrange(ln):
line = readlinef()
start, length = (f.tell()-len(line), len(line))
readf = f.read
BUFFER_SIZE = 1024 * 1024
x = 0
chunk = readf(BUFFER_SIZE)
while chunk:
f.seek(start, 0)
f.write(chunk)
start += BUFFER_SIZE
f.seek(start+length+(x*BUFFER_SIZE), 0)
chunk = readf(BUFFER_SIZE)
f.truncate()
Answering your question "How would I do that?" concerning indices and vacuuming.
Disclaimer: this is a very simple example and in no way compares to existing DBMSs, and I strongly advise against it.
Basic idea:
For each table in your DB, keep various files, some for your object ids (row ids, record ids) and some (page files) with the actual data. Let's suppose that each record is of variable length.
Each record has a table-unique OID. These are stored in the oid-files. Let's name the table "test" and the oid files "test.oidX". Each record in the oid file is of fixed length and each oid file is of fixed length.
Now if "test.oid1" reads:
0001:0001:0001:0015 #oid:pagefile:position:length
0002:0001:0016:0100
0004:0002:0001:0001
It means that record 1 is in page file 1, at position 1 and has length 15. Record 2 is in page file 1 at position 16 of length 100, etc.
Now when you want to delete a record, just touch the oid file. E.g. for deleting record 2, edit it to:
0001:0001:0001:0015
0000:0001:0016:0100 #0000 indicating empty cell
0004:0002:0001:0001
And don't even bother touching your page files.
This will create holes in your page files. Now you need to implement some "maintenance" routine which moves blocks in your page files around, etc, which could either run when requested by the user, or automatically when your DBMS has nothing else to do. Depending on which locking strategy you use, you might need to lock the concerned records or the whole table.
Also when you insert a new record, and you find a hole big enough, you can insert it there.
If your oid files should also function as an index (slow inserts, fast queries), you will need to rebuild them (certainly on insertion, maybe on deletion).
Operations on the oid files should be fast, as the files are of fixed length and contain fixed-length records.
This is just the very basic idea, not touching topics like search trees, hashing, etc, etc.
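To make the deletion step concrete, a toy sketch of "just touch the oid file" might look like this; the 4x4-digit record layout comes from the example above, and everything else is illustrative:
RECORD_LEN = len(b"0001:0001:0001:0015\n")     # oid records are fixed-length

def delete_record(oid_path, oid):
    with open(oid_path, 'rb+') as f:
        pos = 0
        while True:
            f.seek(pos)
            rec = f.read(RECORD_LEN)
            if len(rec) < RECORD_LEN:
                raise KeyError(oid)            # no such record
            if int(rec[:4]) == oid:
                f.seek(pos)
                f.write(b"0000")               # 0000 marks the cell as empty; the page file is untouched
                return
            pos += RECORD_LEN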
You can do this the same way that (effectively) memmove works: seek back and forth between the source range and the destination range:
count = (size + chunksize - 1) // chunksize
for chunk in range(count):
    f.seek(start + chunk * chunksize + deleted_line_size, 0)
    buf = f.read(chunksize)
    f.seek(start + chunk * chunksize, 0)
    f.write(buf)
Using a temporary file and shutil makes it a lot simpler, and, despite what you'd expect, it may actually be faster. (There's twice as much writing, but a whole lot less seeking, and mostly block-aligned writing.) For example:
import shutil
import tempfile

with tempfile.TemporaryFile('w+') as ftemp:
    f.seek(start + deleted_line_size, 0)
    shutil.copyfileobj(f, ftemp)      # stash everything after the deleted line
    ftemp.seek(0, 0)
    f.seek(start, 0)
    shutil.copyfileobj(ftemp, f)      # write it back over the deleted line
    f.truncate()
However, if your files are big enough to fit in your virtual memory space (which they probably are in 64-bit land, but may not be in 32-bit land), it may be simpler to just mmap the file and let the OS/libc take care of the work:
import mmap

m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_WRITE)
m[start:end-deleted_line_size] = m[start+deleted_line_size:end]
m.close()
f.seek(end-deleted_line_size)
f.truncate()
I have an ASCII table in a file from which I want to read a particular set of lines (e.g. lines 4003 to 4005). The issue is that this file could be very, very long (hundreds of thousands to millions of lines), and I'd like to do this as quickly as possible.
a) Bad Solution: Read in the entire file, and go to those lines,
f = open('filename')
lines = f.readlines()[4003:4005]
b) Better Solution: enumerate over each line so that it's not all in memory (a la https://stackoverflow.com/a/2081880/230468)
f = open('filename')
lines = []
for i, line in enumerate(f):
    if i >= 4003 and i <= 4005: lines.append(line)
    if i > 4005: break  # @Wooble
c) Best Solution?
But b) still requires going through each line.
Is there a better (in terms of speed/efficiency) method of accessing a particular line from a huge file?
Should I use linecache, even though I will typically only access the file once?
Using a binary file instead, in which case it might be easier to skip ahead, is an option, but I'd much rather avoid it.
I would probably just use itertools.islice. Using islice over an iterable like a file handle means the whole file is never read into memory, and the first 4002 lines are discarded as quickly as possible. You could even cast the two lines you need into a list pretty cheaply (assuming the lines themselves aren't very long). Then you can exit the with block, closing the filehandle.
from itertools import islice
with open('afile') as f:
    lines = list(islice(f, 4003, 4005))
do_something_with(lines)
Update
But holy cow is linecache faster for multiple accesses. I created a million-line file to compare islice and linecache and linecache blew it away.
>>> timeit("x=islice(open('afile'), 4003, 4005); print next(x) + next(x)", 'from itertools import islice', number=1)
4003
4004
0.00028586387634277344
>>> timeit("print getline('afile', 4003) + getline('afile', 4004)", 'from linecache import getline', number=1)
4002
4003
2.193450927734375e-05
>>> timeit("getline('afile', 4003) + getline('afile', 4004)", 'from linecache import getline', number=10**5)
0.14125394821166992
>>> timeit("''.join(islice(open('afile'), 4003, 4005))", 'from itertools import islice', number=10**5)
14.732316970825195
Constantly re-importing and re-reading the file:
This is not a practical test, but even re-importing linecache at each step it's only a second slower than islice.
>>> timeit("from linecache import getline; getline('afile', 4003) + getline('afile', 4004)", number=10**5)
15.613967180252075
Conclusion
Yes, linecache is faster than islice for everything but constantly re-creating the linecache, but who does that? For the likely scenarios (reading only a few lines once, and reading many lines once) linecache is faster and presents a terse syntax, but the islice syntax is quite clean and fast as well and never reads the whole file into memory. In a RAM-tight environment, the islice solution may be the right choice. For very high speed requirements, linecache may be the better choice. Practically, though, in most environments both times are small enough that it almost doesn't matter.
The main problem here is that line breaks are in no way different from any other character, so the OS has no way of skipping ahead to a given line.
That said, there are a few options, but for every one you have to make sacrifices in one way or another.
You did already state the first one: use a binary file. If you have a fixed line length, then you can seek ahead line * bytes_per_line bytes and jump directly to that line.
The next option would be using an index: create a second file and in every line of this index file write the byte index of the corresponding line in your data file. Accessing the data file now involves two seek operations (skip to the line of the index, then skip to index_value in the data file), but it will still be pretty fast. Plus: it will save disk space because the lines can have different lengths. Minus: you can't touch the data file with an editor.
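A small sketch of that index idea, assuming 8 bytes per offset (all names here are illustrative): one pass writes each line's byte offset to the index file, after which any line is two seeks away.
import struct

def build_index(data_path, index_path):
    with open(data_path, 'rb') as data, open(index_path, 'wb') as index:
        offset = 0
        for line in data:
            index.write(struct.pack('<Q', offset))   # fixed 8 bytes per line
            offset += len(line)

def get_line(data_path, index_path, lineno):         # lineno is 0-based
    with open(index_path, 'rb') as index, open(data_path, 'rb') as data:
        index.seek(lineno * 8)
        (offset,) = struct.unpack('<Q', index.read(8))
        data.seek(offset)
        return data.readline()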
One more option (I think I would go with this) is to use only one file, but begin every line with the line number and some kind of separator (e.g. 4005: My data line). Now you can use a modified version of binary search https://en.wikipedia.org/wiki/Binary_search_algorithm to seek to your line. This will take around log(n) seek operations, with n being the total number of lines. Plus: you can edit the file, and it saves space compared to fixed-length lines. And it's still very fast: even for one million lines that is only about 20 seek operations, which happen in no time. Minus: it's the most complex of these possibilities. (But fun to do ;)
EDIT: One more solution: split your file into many smaller ones. If you have very long 'lines' this could be as small as one line per file; then I would put them in groups in folders named e.g. 4/0/05. But even with shorter lines, divide your file into, say, roughly 1 MB chunks, name them 1000.txt, 2000.txt, etc., and reading the one (or two) matching your line completely should be pretty fast and very easy to implement.
I ran into a similar problem to the post above; however, the solutions posted above had problems in my particular scenario: the file was too big for linecache and islice was nowhere near fast enough. I would like to offer a third (or fourth) alternative solution.
My solution is based upon the fact that we can use mmap to access a particular point in the file. We need only know where lines begin and end in the file; then mmap can give them to us comparably fast to linecache. To optimize this code (see the updates):
We use the deque class from collections to create a dynamically sized collection of endpoints.
We then convert it to a list, which optimizes random access to that collection.
The following is a simple wrapper for the process:
from collections import deque
import mmap

class fast_file():
    def __init__(self, file):
        self.file = file
        self.linepoints = deque()
        self.linepoints.append(0)
        pos = 0
        # Scan in binary mode so the offsets match what mmap sees.
        with open(file, 'rb') as fp:
            while True:
                c = fp.read(1)
                if not c:
                    break
                pos += 1
                if c == b'\n':
                    self.linepoints.append(pos)   # offset just past each newline = start of the next line
        self.fp = open(self.file, 'r+b')
        self.mm = mmap.mmap(self.fp.fileno(), 0)
        self.linepoints.append(pos)               # end of the last line if the file lacks a final newline
        self.linepoints = list(self.linepoints)

    def getline(self, i):
        return self.mm[self.linepoints[i]:self.linepoints[i + 1]]

    def close(self):
        self.mm.close()
        self.fp.close()
The caveat is that the file and mmap need closing, and enumerating the line endpoints can take some time, but it is a one-off cost. The result is something that is both fast to instantiate and fast for random file access; however, the output is of type bytes.
I tested speed by looking at accesses to a sample of my large file for the first 1 million lines (out of 48 million). I ran the following to get an idea of the time taken to do 10 million accesses:
linecache.getline("sample.txt",0)
F = fast_file("sample.txt")
sleep(1)
start = time()
for i in range(10000000):
linecache.getline("sample.txt",1000)
print(time()-start)
>>> 6.914520740509033
sleep(1)
start = time()
for i in range(10000000):
F.getline(1000)
print(time()-start)
>>> 4.488042593002319
sleep(1)
start = time()
for i in range(10000000):
F.getline(1000).decode()
print(time()-start)
>>> 6.825756549835205
It's not that much faster, and it takes some time to initiate (longer, in fact); however, consider the fact that my original file was too large for linecache. This simple wrapper allowed me to do random accesses for lines that linecache was unable to handle on my computer (32 GB of RAM).
I think this might now be an optimal, faster alternative to linecache (speeds may depend on I/O and RAM speeds), but if you have a way to improve this, please add a comment and I will update the answer accordingly.
Update
I recently replaced the list with a collections.deque, which is faster.
Second Update
collections.deque is faster for appends, but a list is faster for random access; hence, the conversion here from a deque to a list optimizes both random access time and instantiation time. I've added sleeps in this test, and the decode call in the comparison, because mmap returns bytes, to make the comparison fair.
I have a number of very large text files which I need to process, the largest being about 60GB.
Each line has 54 characters in seven fields and I want to remove the last three characters from each of the first three fields - which should reduce the file size by about 20%.
I am brand new to Python and have code which will do what I want at about 3.4 GB per hour, but to be a worthwhile exercise I really need to be getting at least 10 GB/hr. Is there any way to speed this up? This code doesn't come close to challenging my processor, so I am making an uneducated guess that it is limited by the read and write speed of the internal hard drive.
def ProcessLargeTextFile():
    r = open("filepath", "r")
    w = open("filepath", "w")
    l = r.readline()
    while l:
        x = l.split(' ')[0]
        y = l.split(' ')[1]
        z = l.split(' ')[2]
        w.write(l.replace(x,x[:-3]).replace(y,y[:-3]).replace(z,z[:-3]))
        l = r.readline()
    r.close()
    w.close()
Any help would be really appreciated. I am using the IDLE Python GUI on Windows 7 and have 16 GB of memory - perhaps a different OS would be more efficient?
Edit: Here is an extract of the file to be processed.
70700.642014 31207.277115 -0.054123 -1585 255 255 255
70512.301468 31227.990799 -0.255600 -1655 155 158 158
70515.727097 31223.828659 -0.066727 -1734 191 187 180
70566.756699 31217.065598 -0.205673 -1727 254 255 255
70566.695938 31218.030807 -0.047928 -1689 249 251 249
70536.117874 31227.837662 -0.033096 -1548 251 252 252
70536.773270 31212.970322 -0.115891 -1434 155 158 163
70533.530777 31215.270828 -0.154770 -1550 148 152 156
70533.555923 31215.341599 -0.138809 -1480 150 154 158
It's more idiomatic to write your code like this:
def ProcessLargeTextFile():
    with open("filepath", "r") as r, open("outfilepath", "w") as w:
        for line in r:
            x, y, z = line.split(' ')[:3]
            w.write(line.replace(x,x[:-3]).replace(y,y[:-3]).replace(z,z[:-3]))
The main saving here is to do the split just once, but if the CPU is not being taxed, this is likely to make very little difference.
It may help to save up a few thousand lines at a time and write them in one hit to reduce thrashing of your hard drive. A million lines is only 54 MB of RAM!
def ProcessLargeTextFile():
    bunchsize = 1000000     # Experiment with different sizes
    bunch = []
    with open("filepath", "r") as r, open("outfilepath", "w") as w:
        for line in r:
            x, y, z = line.split(' ')[:3]
            bunch.append(line.replace(x,x[:-3]).replace(y,y[:-3]).replace(z,z[:-3]))
            if len(bunch) == bunchsize:
                w.writelines(bunch)
                bunch = []
        w.writelines(bunch)
As suggested by @Janne, an alternative way to generate the lines:
def ProcessLargeTextFile():
    bunchsize = 1000000     # Experiment with different sizes
    bunch = []
    with open("filepath", "r") as r, open("outfilepath", "w") as w:
        for line in r:
            x, y, z, rest = line.split(' ', 3)
            bunch.append(' '.join((x[:-3], y[:-3], z[:-3], rest)))
            if len(bunch) == bunchsize:
                w.writelines(bunch)
                bunch = []
        w.writelines(bunch)
Measure! You've got quite a few useful hints on how to improve your Python code, and I agree with them. But you should first figure out what your real problem is. My first steps to find your bottleneck would be:
Remove any processing from your code. Just read and write the data and measure the speed. If just reading and writing the files is too slow, it's not a problem with your code.
If just reading and writing is already slow, try using multiple disks. You are reading and writing at the same time - on the same disk? If yes, try using different disks and try again.
Some async I/O library (Twisted?) might help too.
Once you've figured out the exact problem, ask again for optimizations of that problem.
As you don't seem to be limited by CPU, but rather by I/O, have you tried varying the third parameter of open?
Indeed, this third parameter can be used to set the buffer size used for file operations!
Simply writing open("filepath", "r", 16777216) will use a 16 MB buffer when reading from the file. It should help.
Use the same for the output file, and measure/compare with an identical file for the rest.
Note: this is the same kind of optimization suggested by others, but you get it here for free, without changing your code and without having to buffer yourself.
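For instance, combined with the split-once loop from one of the earlier rewrites, that might look like this (16 MB is just a starting point to experiment with):
BUF = 16 * 1024 * 1024     # 16 MB buffers; experiment with the size

with open("filepath", "r", BUF) as r, open("outfilepath", "w", BUF) as w:
    for line in r:
        x, y, z, rest = line.split(' ', 3)
        w.write(' '.join((x[:-3], y[:-3], z[:-3], rest)))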
I'll add this answer to explain why buffering makes sense, and also to offer one more solution.
You are getting breathtakingly bad performance. This article Is it possible to speed-up python IO? shows that a 10 GB read should take in the neighborhood of 3 minutes. Sequential write is the same speed. So you're missing a factor of 30, and your performance target is still 10 times slower than what ought to be possible.
Almost certainly this kind of disparity lies in the number of head seeks the disk is doing. A head seek takes milliseconds. A single seek corresponds to several megabytes of sequential read/write. Enormously expensive. Copy operations on the same disk require seeking between input and output. As has been stated, one way to reduce seeks is to buffer in such a way that many megabytes are read before writing to disk and vice versa. If you can convince the Python I/O system to do this, great. Otherwise you can read and process lines into a string array and then write out after perhaps 50 MB of output are ready. This size means a seek will induce a <10% performance hit with respect to the data transfer itself.
The other very simple way to eliminate seeks between input and output files altogether is to use a machine with two physical disks and fully separate I/O channels for each. Input from one. Output to the other. If you're doing lots of big file transformations, it's good to have a machine with this feature.
Here's code for loading text files of any size without causing memory issues. It supports gigabyte-sized files and will run smoothly on any kind of machine; you just need to configure CHUNK_SIZE based on your system's RAM. The larger the CHUNK_SIZE, the more data is read at a time.
https://gist.github.com/iyvinjose/e6c1cb2821abd5f01fd1b9065cbc759d
Download the file data_loading_utils.py and import it into your code.
Usage:
import data_loading_utils

file_name = 'file_name.ext'
CHUNK_SIZE = 1000000

def process_lines(line, eof, file_name):
    # check if end of file reached
    if not eof:
        # process data; line is one single line of the file
        pass
    else:
        # end of file reached
        pass

data_loading_utils.read_lines_from_file_as_data_chunks(file_name, chunk_size=CHUNK_SIZE, callback=process_lines)
process_lines is the callback function. It will be called for all the lines, with the parameter line representing one single line of the file at a time.
You can configure the variable CHUNK_SIZE depending on your machine's hardware configuration.
def ProcessLargeTextFile():
    r = open("filepath", "r")
    w = open("filepath", "w")
    l = r.readline()
    while l:
As has been suggested already, you may want to use a for loop to make this more optimal.
        x = l.split(' ')[0]
        y = l.split(' ')[1]
        z = l.split(' ')[2]
You are performing the split operation three times here; depending on the size of each line this will have a detrimental impact on performance. You should split once and assign x, y, z to the entries of the list that comes back.
        w.write(l.replace(x,x[:-3]).replace(y,y[:-3]).replace(z,z[:-3]))
Each line you read, you write immediately to the file, which is very I/O intensive. You should consider buffering your output in memory and pushing it to disk periodically. Something like this:
BUFFER_SIZE_LINES = 1024  # Maximum number of lines to buffer in memory

def ProcessLargeTextFile():
    r = open("filepath", "r")
    w = open("outfilepath", "w")   # write to a different file than the one being read
    buf = ""
    bufLines = 0
    for lineIn in r:
        x, y, z = lineIn.split(' ')[:3]
        lineOut = lineIn.replace(x,x[:-3]).replace(y,y[:-3]).replace(z,z[:-3])
        buf += lineOut             # lineIn already ends with a newline
        bufLines += 1
        if bufLines >= BUFFER_SIZE_LINES:
            # Flush buffer to disk
            w.write(buf)
            buf = ""
            bufLines = 0
    # Flush remaining buffer to disk
    w.write(buf)
    r.close()
    w.close()
You can tweak BUFFER_SIZE_LINES to find an optimal balance between memory usage and speed.
Your code is rather un-idiomatic and makes far more function calls than needed. A simpler version is:
def ProcessLargeTextFile():
    with open("filepath") as r, open("output", "w") as w:
        for line in r:
            fields = line.split(' ')
            fields[0:3] = [fields[0][:-3],
                           fields[1][:-3],
                           fields[2][:-3]]
            w.write(' '.join(fields))
And I don't know of a modern filesystem that is slower than Windows'. Since it appears you are using these huge data files as databases, have you considered using a real database?
Finally, if you are just interested in reducing file size, have you considered compressing / zipping the files?
Read the file using for l in r: to benefit from buffering.
Those seem like very large files... Why are they so large? What processing are you doing per line? Why not use a database with some map-reduce calls (if appropriate) or simple operations on the data? The point of a database is to abstract away the handling and management of large amounts of data that can't all fit in memory.
You can start to play with the idea with sqlite3, which just uses flat files as databases. If you find the idea useful, then upgrade to something a little more robust and versatile like PostgreSQL.
Create a database:
import sqlite3

conn = sqlite3.connect('pts.db')
c = conn.cursor()
Create a table:
c.execute('''CREATE TABLE ptsdata (filename, line, x, y, z)''')
Then use one of the algorithms above to insert all the lines and points into the database by calling:
c.execute("INSERT INTO ptsdata VALUES (?, ?, ?, ?, ?)", (filename, lineNumber, x, y, z))
Now how you use it depends on what you want to do. For example, to work with all the points in a file, do a query:
c.execute("SELECT lineNumber, x, y, z FROM ptsdata WHERE filename='file.txt' ORDER BY lineNumber ASC")
And get n lines at a time from this query with
c.fetchmany(size=n)
I'm sure there is a better wrapper for the sql statements somewhere, but you get the idea.
You can try saving the result of your split the first time you do it, rather than splitting again every time you need a field. Maybe this will speed it up.
You can also try not running it in the GUI; run it from the command line instead.
Since you only mention saving space as a benefit, is there some reason you can't just store the files gzipped? That should save 70% and up on this data. Or consider getting NTFS to compress the files if random access is still important. You'll get much more dramatic savings on I/O time after either of those.
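If you do store them gzipped, the processing loop can read and write the compressed files directly; a minimal sketch (filenames are illustrative):
import gzip

with gzip.open("input.txt.gz", "rt") as r, gzip.open("output.txt.gz", "wt") as w:
    for line in r:
        x, y, z, rest = line.split(' ', 3)
        w.write(' '.join((x[:-3], y[:-3], z[:-3], rest)))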
More importantly, where is your data stored such that you're getting only 3.4 GB/hr? That's down around USB v1 speeds.