Process very large (>20GB) text file line by line - python

I have a number of very large text files which I need to process, the largest being about 60GB.
Each line has 54 characters in seven fields and I want to remove the last three characters from each of the first three fields - which should reduce the file size by about 20%.
I am brand new to Python and have code which will do what I want to do at about 3.4 GB per hour, but to be a worthwhile exercise I really need to be getting at least 10 GB/hr - is there any way to speed this up? This code doesn't come close to challenging my processor, so I am making an uneducated guess that it is limited by the read and write speed of the internal hard drive.
def ProcessLargeTextFile():
    r = open("filepath", "r")
    w = open("filepath", "w")
    l = r.readline()
    while l:
        x = l.split(' ')[0]
        y = l.split(' ')[1]
        z = l.split(' ')[2]
        w.write(l.replace(x,x[:-3]).replace(y,y[:-3]).replace(z,z[:-3]))
        l = r.readline()
    r.close()
    w.close()
Any help would be really appreciated. I am using the IDLE Python GUI on Windows 7 and have 16GB of memory - perhaps a different OS would be more efficient?
Edit: Here is an extract of the file to be processed.
70700.642014 31207.277115 -0.054123 -1585 255 255 255
70512.301468 31227.990799 -0.255600 -1655 155 158 158
70515.727097 31223.828659 -0.066727 -1734 191 187 180
70566.756699 31217.065598 -0.205673 -1727 254 255 255
70566.695938 31218.030807 -0.047928 -1689 249 251 249
70536.117874 31227.837662 -0.033096 -1548 251 252 252
70536.773270 31212.970322 -0.115891 -1434 155 158 163
70533.530777 31215.270828 -0.154770 -1550 148 152 156
70533.555923 31215.341599 -0.138809 -1480 150 154 158

It's more idiomatic to write your code like this
def ProcessLargeTextFile():
    with open("filepath", "r") as r, open("outfilepath", "w") as w:
        for line in r:
            x, y, z = line.split(' ')[:3]
            w.write(line.replace(x,x[:-3]).replace(y,y[:-3]).replace(z,z[:-3]))
The main saving here is to just do the split once, but if the CPU is not being taxed, this is likely to make very little difference.
It may help to save up a few thousand lines at a time and write them in one hit to reduce thrashing of your hard drive. A million lines is only 54MB of RAM!
def ProcessLargeTextFile():
    bunchsize = 1000000     # Experiment with different sizes
    bunch = []
    with open("filepath", "r") as r, open("outfilepath", "w") as w:
        for line in r:
            x, y, z = line.split(' ')[:3]
            bunch.append(line.replace(x,x[:-3]).replace(y,y[:-3]).replace(z,z[:-3]))
            if len(bunch) == bunchsize:
                w.writelines(bunch)
                bunch = []
        w.writelines(bunch)
As suggested by @Janne, an alternative way to generate the lines:
def ProcessLargeTextFile():
    bunchsize = 1000000     # Experiment with different sizes
    bunch = []
    with open("filepath", "r") as r, open("outfilepath", "w") as w:
        for line in r:
            x, y, z, rest = line.split(' ', 3)
            bunch.append(' '.join((x[:-3], y[:-3], z[:-3], rest)))
            if len(bunch) == bunchsize:
                w.writelines(bunch)
                bunch = []
        w.writelines(bunch)

Measure! You have received quite a few useful hints on how to improve your Python code, and I agree with them. But you should first figure out what your real problem is. My first steps to find your bottleneck would be:
Remove any processing from your code. Just read and write the data and measure the speed (see the sketch below). If just reading and writing the files is too slow, it's not a problem of your code.
If just reading and writing is already slow, try to use multiple disks. You are reading and writing at the same time. On the same disk? If yes, try different disks and measure again.
Some async I/O library (Twisted?) might help too.
Once you have figured out the exact problem, ask again for optimizations of that problem.
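For the first step, a minimal timing sketch might look like the following; the paths are placeholders and the block size is just something to experiment with:
import time

def measure_raw_copy(inpath, outpath, blocksize=16 * 1024 * 1024):
    """Copy inpath to outpath in large binary blocks and report throughput.
    If this alone is far below the 10 GB/h target, the disk is the bottleneck,
    not the line processing."""
    copied = 0
    start = time.time()
    with open(inpath, "rb") as r, open(outpath, "wb") as w:
        while True:
            block = r.read(blocksize)
            if not block:
                break
            w.write(block)
            copied += len(block)
    elapsed = time.time() - start
    print("%.0f MB in %.1f s -> %.1f MB/s" % (copied / 1e6, elapsed, copied / 1e6 / elapsed))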

As you don't seem to be limited by CPU, but rather by I/O, have you tried some variations on the third parameter of open?
Indeed, this third parameter can be used to give the buffer size to be used for file operations!
Simply writing open("filepath", "r", 16777216) will use 16 MB buffers when reading from the file. It should help.
Use the same for the output file, and measure/compare with an identical file for the rest.
Note: this is the same kind of optimization suggested by others, but you can gain it here for free, without changing your code and without having to buffer yourself.
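Combining this with the idiomatic loop from the earlier answer, a minimal sketch might look like this; the 16 MB value is only a starting point to experiment with, and the paths are placeholders:
def ProcessLargeTextFile():
    # 16 MB buffers on both the input and the output file handles.
    with open("filepath", "r", 16777216) as r, \
         open("outfilepath", "w", 16777216) as w:
        for line in r:
            x, y, z, rest = line.split(' ', 3)
            w.write(' '.join((x[:-3], y[:-3], z[:-3], rest)))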

I'll add this answer to explain why buffering makes sense and also offer one more solution.
You are getting breathtakingly bad performance. This article Is it possible to speed-up python IO? shows that a 10 GB read should take in the neighborhood of 3 minutes. Sequential write is the same speed. So you're missing a factor of 30, and your performance target is still 10 times slower than what ought to be possible.
Almost certainly this kind of disparity lies in the number of head seeks the disk is doing. A head seek takes milliseconds. A single seek corresponds to several megabytes of sequential read-write. Enormously expensive. Copy operations on the same disk require seeking between input and output. As has been stated, one way to reduce seeks is to buffer in such a way that many megabytes are read before writing to disk and vice versa. If you can convince the python io system to do this, great. Otherwise you can read and process lines into a string array and then write after perhaps 50 mb of output are ready. This size means a seek will induce a <10% performance hit with respect to the data transfer itself.
The other very simple way to eliminate seeks between input and output files altogether is to use a machine with two physical disks and fully separate io channels for each. Input from one. Output to other. If you're doing lots of big file transformations, it's good to have a machine with this feature.
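A minimal sketch of the "collect roughly 50 MB of output before each write" idea, reusing the single split from the earlier answers; the paths and the flush threshold are placeholders:
def ProcessLargeTextFile(inpath="filepath", outpath="outfilepath",
                         flush_bytes=50 * 1024 * 1024):
    """Buffer ~50 MB of processed lines so the drive does long sequential
    writes instead of alternating small reads and writes."""
    pending = []
    pending_size = 0
    with open(inpath, "r") as r, open(outpath, "w") as w:
        for line in r:
            x, y, z, rest = line.split(' ', 3)
            out = ' '.join((x[:-3], y[:-3], z[:-3], rest))
            pending.append(out)
            pending_size += len(out)
            if pending_size >= flush_bytes:
                w.writelines(pending)
                pending = []
                pending_size = 0
        w.writelines(pending)  # flush whatever is left at the end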

Here's code for loading text files of any size without causing memory issues. It supports gigabyte-sized files and will run smoothly on any kind of machine; you just need to configure CHUNK_SIZE based on your system RAM. The larger the CHUNK_SIZE, the more data is read at a time.
https://gist.github.com/iyvinjose/e6c1cb2821abd5f01fd1b9065cbc759d
Download the file data_loading_utils.py and import it into your code.
Usage:
import data_loading_utils

file_name = 'file_name.ext'
CHUNK_SIZE = 1000000

def process_lines(line, eof, file_name):
    # check if end of file reached
    if not eof:
        # process data, line is one single line of the file
        pass
    else:
        # end of file reached
        pass

data_loading_utils.read_lines_from_file_as_data_chunks(file_name, chunk_size=CHUNK_SIZE, callback=process_lines)
process_lines method is the callback function. It will be called for all the lines, with parameter line representing one single line of the file at a time.
You can configure the variable CHUNK_SIZE depending on your machine hardware configurations.
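If you would rather not depend on the gist, the idea behind such a chunked reader is easy to sketch yourself. The following is a rough guess at a compatible implementation, not the gist's actual code:
def read_lines_from_file_as_data_chunks(file_name, chunk_size, callback):
    """Read file_name in chunk_size blocks, split the blocks into lines and
    pass each complete line to callback(line, eof, file_name)."""
    with open(file_name, "r") as f:
        leftover = ""
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                if leftover:                          # last line without a trailing newline
                    callback(leftover, False, file_name)
                callback(None, True, file_name)       # signal end of file
                return
            lines = (leftover + chunk).split("\n")
            leftover = lines.pop()                    # possibly incomplete last line
            for line in lines:
                callback(line, False, file_name)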

def ProcessLargeTextFile():
    r = open("filepath", "r")
    w = open("filepath", "w")
    l = r.readline()
    while l:
As has been suggested already, you may want to use a for loop to make this more optimal.
        x = l.split(' ')[0]
        y = l.split(' ')[1]
        z = l.split(' ')[2]
You are performing a split operation 3 times here; depending on the size of each line this will have a detrimental impact on performance. You should split once and assign x, y, z to the entries in the list that comes back.
        w.write(l.replace(x,x[:-3]).replace(y,y[:-3]).replace(z,z[:-3]))
Each line you are reading, you are writing immediately to the file, which is very I/O intensive. You should consider buffering your output to memory and pushing to the disk periodically. Something like this:
BUFFER_SIZE_LINES = 1024  # Maximum number of lines to buffer in memory

def ProcessLargeTextFile():
    r = open("filepath", "r")
    w = open("outfilepath", "w")
    buf = ""
    bufLines = 0
    for lineIn in r:
        x, y, z = lineIn.split(' ')[:3]
        lineOut = lineIn.replace(x,x[:-3]).replace(y,y[:-3]).replace(z,z[:-3])
        buf += lineOut
        bufLines += 1
        if bufLines >= BUFFER_SIZE_LINES:
            # Flush buffer to disk
            w.write(buf)
            buf = ""
            bufLines = 0
    # Flush remaining buffer to disk
    w.write(buf)
    r.close()
    w.close()
You can tweak BUFFER_SIZE_LINES to find an optimal balance between memory usage and speed.

Your code is rather un-idiomatic and makes far more function calls than needed. A simpler version is:
def ProcessLargeTextFile():
    with open("filepath") as r, open("output", "w") as w:
        for line in r:
            fields = line.split(' ')
            fields[0:3] = [fields[0][:-3],
                           fields[1][:-3],
                           fields[2][:-3]]
            w.write(' '.join(fields))
and I don't know of a modern filesystem that is slower than Windows. Since it appears you are using these huge data files as databases, have you considered using a real database?
Finally, if you are just interested in reducing file size, have you considered compressing / zipping the files?

Read the file using for l in r: to benefit from buffering.

Those seem like very large files... Why are they so large? What processing are you doing per line? Why not use a database with some map reduce calls (if appropriate) or simple operations of the data? The point of a database is to abstract the handling and management large amounts of data that can't all fit in memory.
You can start to play with the idea with sqlite3 which just uses flat files as databases. If you find the idea useful then upgrade to something a little more robust and versatile like postgresql.
Create a database
conn = sqlite3.connect('pts.db')
c = conn.cursor()
Create a table:
c.execute('''CREATE TABLE ptsdata (filename, lineNumber, x, y, z)''')
Then use one of the algorithms above to insert all the lines and points in the database by calling
c.execute("INSERT INTO ptsdata VALUES (filename, lineNumber, x, y, z)")
Now how you use it depends on what you want to do. For example to work with all the points in a file by doing a query
c.execute("SELECT lineNumber, x, y, z FROM ptsdata WHERE filename=file.txt ORDER BY lineNumber ASC")
And get n lines at a time from this query with
c.fetchmany(size=n)
I'm sure there is a better wrapper for the sql statements somewhere, but you get the idea.
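For completeness, a minimal end-to-end sketch of the above using parameterized queries; the file name, line splitting and batch size are just placeholders:
import sqlite3

conn = sqlite3.connect('pts.db')
c = conn.cursor()
c.execute('CREATE TABLE IF NOT EXISTS ptsdata (filename, lineNumber, x, y, z)')

# Load the first three fields of every line, remembering the line number.
with open('file.txt') as f:
    for lineNumber, line in enumerate(f, 1):
        x, y, z = line.split(' ')[:3]
        c.execute('INSERT INTO ptsdata VALUES (?, ?, ?, ?, ?)',
                  ('file.txt', lineNumber, x, y, z))
conn.commit()

# Pull the points back out a batch at a time.
c.execute('SELECT lineNumber, x, y, z FROM ptsdata WHERE filename=? ORDER BY lineNumber ASC',
          ('file.txt',))
rows = c.fetchmany(size=1000)
while rows:
    # ... work with this batch of rows ...
    rows = c.fetchmany(size=1000)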

You can try to save your split result the first time you do it, rather than splitting every time you need a field. Maybe this will speed things up.
You can also try not to run it in the GUI. Run it in cmd.

Since you only mention saving space as a benefit, is there some reason you can't just store the files gzipped? That should save 70% and up on this data. Or consider getting NTFS to compress the files if random access is still important. You'll get much more dramatic savings on I/O time after either of those.
More importantly, where is your data that you're getting only 3.4GB/hr? That's down around USBv1 speeds.
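If you do go the gzip route, Python's gzip module keeps the line-by-line code unchanged, since gzip.open returns a file-like object. A minimal sketch follows; the file names are placeholders and the "rt"/"wt" text modes assume Python 3:
import gzip

def process_gzipped(inpath="data.txt.gz", outpath="out.txt.gz"):
    # The data stays compressed on disk; the loop never sees the difference.
    with gzip.open(inpath, "rt") as r, gzip.open(outpath, "wt") as w:
        for line in r:
            x, y, z, rest = line.split(' ', 3)
            w.write(' '.join((x[:-3], y[:-3], z[:-3], rest)))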

How to read a large file - line by line?

I want to iterate over each line of an entire file. One way to do this is by reading the entire file, saving it to a list, then going over the line of interest. This method uses a lot of memory, so I am looking for an alternative.
My code so far:
for each_line in fileinput.input(input_file):
do_something(each_line)
for each_line_again in fileinput.input(input_file):
do_something(each_line_again)
Executing this code gives an error message: device active.
Any suggestions?
The purpose is to calculate pair-wise string similarity, meaning for each line in file, I want to calculate the Levenshtein distance with every other line.
Nov. 2022 Edit: A related question that was asked 8 months after this question has many useful answers and comments. To get a deeper understanding of python logic, do also read this related question How should I read a file line-by-line in Python?
The correct, fully Pythonic way to read a file is the following:
with open(...) as f:
    for line in f:
        # Do something with 'line'
The with statement handles opening and closing the file, including if an exception is raised in the inner block. The for line in f treats the file object f as an iterable, which automatically uses buffered I/O and memory management so you don't have to worry about large files.
There should be one -- and preferably only one -- obvious way to do it.
Two memory efficient ways in ranked order (first is best) -
use of with - supported from python 2.5 and above
use of yield if you really want to have control over how much to read
1. use of with
with is the nice and efficient Pythonic way to read large files. Advantages: 1) the file object is automatically closed after exiting the with execution block; 2) the file is still closed if an exception is raised inside the with block; 3) the for loop iterates through the file object f line by line; internally it does buffered I/O (to optimize costly I/O operations) and memory management.
with open("x.txt") as f:
for line in f:
do something with data
2. use of yield
Sometimes one might want more fine-grained control over how much to read in each iteration. In that case use iter & yield. Note that with this method one explicitly needs to close the file at the end.
def readInChunks(fileObj, chunkSize=2048):
    """
    Lazy function to read a file piece by piece.
    Default chunk size: 2kB.
    """
    while True:
        data = fileObj.read(chunkSize)
        if not data:
            break
        yield data

f = open('bigFile')
for chunk in readInChunks(f):
    do_something(chunk)
f.close()
Pitfalls, and for the sake of completeness: the methods below are not as good or as elegant for reading large files, but please read them to get a rounded understanding.
In Python, the most common way to read lines from a file is to do the following:
for line in open('myfile','r').readlines():
    do_something(line)
When this is done, however, the readlines() function (same applies for read() function) loads the entire file into memory, then iterates over it. A slightly better approach (the first mentioned two methods above are the best) for large files is to use the fileinput module, as follows:
import fileinput

for line in fileinput.input(['myfile']):
    do_something(line)
The fileinput.input() call reads lines sequentially, but doesn't keep them in memory after they've been read. This works because a file object in Python is iterable.
References
Python with statement
To strip newlines:
with open(file_path, 'rU') as f:
    for line_terminated in f:
        line = line_terminated.rstrip('\n')
        ...
With universal newline support all text file lines will seem to be terminated with '\n', whatever the terminators in the file, '\r', '\n', or '\r\n'.
EDIT - To specify universal newline support:
Python 2 on Unix - open(file_path, mode='rU') - required [thanks #Dave]
Python 2 on Windows - open(file_path, mode='rU') - optional
Python 3 - open(file_path, newline=None) - optional
The newline parameter is only supported in Python 3 and defaults to None. The mode parameter defaults to 'r' in all cases. The U is deprecated in Python 3. In Python 2 on Windows some other mechanism appears to translate \r\n to \n.
Docs:
open() for Python 2
open() for Python 3
To preserve native line terminators:
with open(file_path, 'rb') as f:
    for line_native_terminated in f:
        ...
Binary mode can still parse the file into lines when iterating with in. Each line will have whatever terminators it has in the file.
Thanks to #katrielalex's answer, Python's open() doc, and iPython experiments.
This is a possible way of reading a file in Python:
f = open(input_file)
for line in f:
    do_stuff(line)
f.close()
It does not allocate a full list; it iterates over the lines.
Some context up front as to where I am coming from. Code snippets are at the end.
When I can, I prefer to use an open source tool like H2O to do super high performance parallel CSV file reads, but this tool is limited in feature set. I end up writing a lot of code to create data science pipelines before feeding to H2O cluster for the supervised learning proper.
I have been reading files like 8GB HIGGS dataset from UCI repo and even 40GB CSV files for data science purposes significantly faster by adding lots of parallelism with the multiprocessing library's pool object and map function. For example clustering with nearest neighbor searches and also DBSCAN and Markov clustering algorithms requires some parallel programming finesse to bypass some seriously challenging memory and wall clock time problems.
I usually like to break the file row-wise into parts using gnu tools first and then glob-filemask them all to find and read them in parallel in the python program. I use something like 1000+ partial files commonly. Doing these tricks helps immensely with processing speed and memory limits.
The pandas dataframe.read_csv is single threaded, so you can do these tricks to make pandas much faster by running a map() for parallel execution. You can use htop to see that with plain old sequential pandas dataframe.read_csv, 100% cpu on just one core is the actual bottleneck in pd.read_csv, not the disk at all.
I should add I'm using an SSD on fast video card bus, not a spinning HD on SATA6 bus, plus 16 CPU cores.
Also, another technique that I discovered works great in some applications is parallel CSV file reads all within one giant file, starting each worker at different offset into the file, rather than pre-splitting one big file into many part files. Use python's file seek() and tell() in each parallel worker to read the big text file in strips, at different byte offset start-byte and end-byte locations in the big file, all at the same time concurrently. You can do a regex findall on the bytes, and return the count of linefeeds. This is a partial sum. Finally sum up the partial sums to get the global sum when the map function returns after the workers finished.
Following is some example benchmarks using the parallel byte offset trick:
I use 2 files: HIGGS.csv is 8 GB. It is from the UCI machine learning repository. all_bin .csv is 40.4 GB and is from my current project.
I use 2 programs: GNU wc program which comes with Linux, and the pure python fastread.py program which I developed.
HP-Z820:/mnt/fastssd/fast_file_reader$ ls -l /mnt/fastssd/nzv/HIGGS.csv
-rw-rw-r-- 1 8035497980 Jan 24 16:00 /mnt/fastssd/nzv/HIGGS.csv
HP-Z820:/mnt/fastssd$ ls -l all_bin.csv
-rw-rw-r-- 1 40412077758 Feb 2 09:00 all_bin.csv
ga#ga-HP-Z820:/mnt/fastssd$ time python fastread.py --fileName="all_bin.csv" --numProcesses=32 --balanceFactor=2
2367496
real 0m8.920s
user 1m30.056s
sys 2m38.744s
In [1]: 40412077758. / 8.92
Out[1]: 4530501990.807175
That’s some 4.5 GB/s, or 45 Gb/s, file slurping speed. That ain’t no spinning hard disk, my friend. That’s actually a Samsung Pro 950 SSD.
Below is the speed benchmark for the same file being line-counted by gnu wc, a pure C compiled program.
What is cool is you can see my pure python program essentially matched the speed of the gnu wc compiled C program in this case. Python is interpreted but C is compiled, so this is a pretty interesting feat of speed, I think you would agree. Of course, wc really needs to be changed to a parallel program, and then it would really beat the socks off my python program. But as it stands today, gnu wc is just a sequential program. You do what you can, and python can do parallel today. Cython compiling might be able to help me (for some other time). Also memory mapped files was not explored yet.
HP-Z820:/mnt/fastssd$ time wc -l all_bin.csv
2367496 all_bin.csv
real 0m8.807s
user 0m1.168s
sys 0m7.636s
HP-Z820:/mnt/fastssd/fast_file_reader$ time python fastread.py --fileName="HIGGS.csv" --numProcesses=16 --balanceFactor=2
11000000
real 0m2.257s
user 0m12.088s
sys 0m20.512s
HP-Z820:/mnt/fastssd/fast_file_reader$ time wc -l HIGGS.csv
11000000 HIGGS.csv
real 0m1.820s
user 0m0.364s
sys 0m1.456s
Conclusion: The speed is good for a pure python program compared to a C program. However, it’s not good enough to use the pure python program over the C program, at least for linecounting purpose. Generally the technique can be used for other file processing, so this python code is still good.
Question: Will compiling the regex just one time and passing it to all workers improve speed? Answer: Regex pre-compiling does NOT help in this application. I suppose the reason is that the overhead of process serialization and creation for all the workers is dominating.
One more thing.
Does parallel CSV file reading even help? Is the disk the bottleneck, or is it the CPU? Many so-called top-rated answers on stackoverflow contain the common dev wisdom that you only need one thread to read a file, best you can do, they say. Are they sure, though?
Let’s find out:
HP-Z820:/mnt/fastssd/fast_file_reader$ time python fastread.py --fileName="HIGGS.csv" --numProcesses=16 --balanceFactor=2
11000000
real 0m2.256s
user 0m10.696s
sys 0m19.952s
HP-Z820:/mnt/fastssd/fast_file_reader$ time python fastread.py --fileName="HIGGS.csv" --numProcesses=1 --balanceFactor=1
11000000
real 0m17.380s
user 0m11.124s
sys 0m6.272s
Oh yes, yes it does. Parallel file reading works quite well. Well there you go!
Ps. In case some of you wanted to know, what if the balanceFactor was 2 when using a single worker process? Well, it’s horrible:
HP-Z820:/mnt/fastssd/fast_file_reader$ time python fastread.py --fileName="HIGGS.csv" --numProcesses=1 --balanceFactor=2
11000000
real 1m37.077s
user 0m12.432s
sys 1m24.700s
Key parts of the fastread.py python program:
from os import stat
from itertools import repeat
from multiprocessing import Pool
import re

fileBytes = stat(fileName).st_size  # Read quickly from OS how many bytes are in a text file
startByte, endByte = PartitionDataToWorkers(workers=numProcesses, items=fileBytes, balanceFactor=balanceFactor)
p = Pool(numProcesses)
partialSum = p.starmap(ReadFileSegment, zip(startByte, endByte, repeat(fileName)))  # startByte is already a list. fileName is made into a same-length list of duplicate values.
globalSum = sum(partialSum)
print(globalSum)

def ReadFileSegment(startByte, endByte, fileName, searchChar='\n'):  # counts number of searchChar appearing in the byte range
    with open(fileName, 'r') as f:
        f.seek(startByte-1)  # seek is initially at byte 0 and then moves forward the specified amount, so seek(5) points at the 6th byte.
        bytes = f.read(endByte - startByte + 1)
        cnt = len(re.findall(searchChar, bytes))  # findall with implicit compiling runs just as fast here as re.compile once + re.finditer many times.
        return cnt
The def for PartitionDataToWorkers is just ordinary sequential code. I left it out in case someone else wants to get some practice on what parallel programming is like. I gave away for free the harder parts: the tested and working parallel code, for your learning benefit.
Thanks to: The open-source H2O project, by Arno and Cliff and the H2O staff for their great software and instructional videos, which have provided me the inspiration for this pure python high performance parallel byte offset reader as shown above. H2O does parallel file reading using java, is callable by python and R programs, and is crazy fast, faster than anything on the planet at reading big CSV files.
Katrielalex provided the way to open & read one file.
However, the way your algorithm goes, it reads the whole file for each line of the file. That means the overall amount of reading the file - and computing the Levenshtein distance - will be N*N if N is the number of lines in the file. Since you're concerned about file size and don't want to keep it in memory, I am concerned about the resulting quadratic runtime. Your algorithm is in the O(n^2) class of algorithms, which can often be improved with specialization.
I suspect that you already know the tradeoff of memory versus runtime here, but maybe you would want to investigate if there's an efficient way to compute multiple Levenshtein distances in parallel. If so it would be interesting to share your solution here.
How many lines do your files have, and on what kind of machine (mem & cpu power) does your algorithm have to run, and what's the tolerated runtime?
Code would look like:
with open(input_file, 'r') as f_outer:
    for line_outer in f_outer:
        with open(input_file, 'r') as f_inner:
            for line_inner in f_inner:
                compute_distance(line_outer, line_inner)
But the questions are how do you store the distances (matrix?) and can you gain an advantage of preparing e.g. the outer_line for processing, or caching some intermediate results for reuse.
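One way to avoid holding an N x N matrix is to stream the results: write one row of distances per outer line. A minimal sketch, assuming a levenshtein() function is available (none is given in the question):
def pairwise_distances_to_file(input_file, output_file, levenshtein):
    """Re-read the file for the inner loop and write one row of distances
    per outer line, so neither the lines nor the full matrix stay in memory."""
    with open(input_file) as f_outer, open(output_file, 'w') as out:
        for line_outer in f_outer:
            row = []
            with open(input_file) as f_inner:
                for line_inner in f_inner:
                    row.append(str(levenshtein(line_outer, line_inner)))
            out.write(' '.join(row) + '\n')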
Need to frequently read a large file from the last read position?
I have created a script used to cut an Apache access.log file several times a day.
So I needed to set a position cursor on the last line parsed during the last execution.
To this end, I used the file.seek() and file.tell() methods, which allow storing the cursor position in a file.
My code:
import os

ENCODING = "utf8"
CURRENT_FILE_DIR = os.path.dirname(os.path.abspath(__file__))

# This file is used to store the last cursor position
cursor_position = os.path.join(CURRENT_FILE_DIR, "access_cursor_position.log")
# Log file with new lines
log_file_to_cut = os.path.join(CURRENT_FILE_DIR, "access.log")
cut_file = os.path.join(CURRENT_FILE_DIR, "cut_access", "cut.log")

# Set in from_line
from_position = 0
try:
    with open(cursor_position, "r", encoding=ENCODING) as f:
        from_position = int(f.read())
except Exception as e:
    pass

# We read log_file_to_cut to put new lines in cut_file
with open(log_file_to_cut, "r", encoding=ENCODING) as f:
    with open(cut_file, "w", encoding=ENCODING) as fw:
        # We set cursor to the last position used (during last run of script)
        f.seek(from_position)
        for line in f:
            fw.write("%s" % (line))
    # We save the last position of cursor for next usage
    with open(cursor_position, "w", encoding=ENCODING) as fw:
        fw.write(str(f.tell()))
From the python documentation for fileinput.input():
This iterates over the lines of all files listed in sys.argv[1:], defaulting to sys.stdin if the list is empty
further, the definition of the function is:
fileinput.FileInput([files[, inplace[, backup[, mode[, openhook]]]]])
reading between the lines, this tells me that files can be a list so you could have something like:
for each_line in fileinput.input([input_file, input_file]):
    do_something(each_line)
See here for more information
# Using a text file for the example
with open("yourFile.txt", "r") as f:
    text = f.readlines()

for line in text:
    print line
Open your file for reading (r)
Read the whole file and save each line into a list (text)
Loop through the list printing each line.
If you want, for example, to check a specific line for a length greater than 10, work with what you already have available.
for line in text:
    if len(line) > 10:
        print line
I would strongly recommend not using the default file loading as it is horrendously slow. You should look into the numpy functions and the IOpro functions (e.g. numpy.loadtxt()).
http://docs.scipy.org/doc/numpy/user/basics.io.genfromtxt.html
https://store.continuum.io/cshop/iopro/
Then you can break your pairwise operation into chunks:
import numpy as np
import math
lines_total = n
similarity = np.zeros(n,n)
lines_per_chunk = m
n_chunks = math.ceil(float(n)/m)
for i in xrange(n_chunks):
for j in xrange(n_chunks):
chunk_i = (function of your choice to read lines i*lines_per_chunk to (i+1)*lines_per_chunk)
chunk_j = (function of your choice to read lines j*lines_per_chunk to (j+1)*lines_per_chunk)
similarity[i*lines_per_chunk:(i+1)*lines_per_chunk,
j*lines_per_chunk:(j+1)*lines_per_chunk] = fast_operation(chunk_i, chunk_j)
It's almost always much faster to load data in chunks and then do matrix operations on it than to do it element by element!!
The best way to read a large file, line by line, is to use Python's enumerate function:
with open(file_name, "rU") as read_file:
    for i, row in enumerate(read_file, 1):
        # do something
        # i is the number of that line
        # row contains all the data of that line

Read specific lines of csv file

Hello guys,
so I have a huge CSV file (500K lines), and I want to process the file simultaneously with 4 processes (so each one will read approx. 100K lines).
What is the best way to do it using multiprocessing?
What I have up till now:
def csv_handler(path, processes=5):
    test_arr = []
    with open(path) as fd:
        reader = DictReader(fd)
        for row in reader:
            test_arr.append(row)
    current_line = 0
    equal_length = len(test_arr) / 5
    for i in range(5):
        process1 = multiprocessing.Process(target=get_data, args=(test_arr[current_line: current_line + equal_length],))
        current_line = current_line + equal_length
I know it's a bad idea to do the reading in a single pass like that, but I can't find another option.
I would be happy to get some ideas on how to do it in a better way!
CSV is a pretty tricky format to split the reads up with, and other file formats may be more ideal.
The basic problem is that, as lines may be different lengths, you can't easily know where a particular line starts in order to "fseek" to it. You would have to scan through the file counting newlines, which is basically reading it.
But you can get pretty close, which sounds like it is enough for your needs. Say, for two parts, take the file size and divide it by 2.
The first part starts at zero and stops after completing the record at file_size / 2.
For the second part, you seek to file_size / 2, look for the next newline, and start there.
This way, while the Python processes won't all get exactly the same amount, it will be pretty close, and it avoids too much inter-process message passing or multi-threading and, with CPython, probably the global interpreter lock.
Of course all the normal things for optimising either file IO, or Python code still apply (depending on where your bottleneck lies. You need to measure this.).
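A minimal sketch of the "seek to an approximate offset, then align to the next newline" idea; counting lines stands in for your real per-row work, and on Windows the Pool call needs to live under an if __name__ == '__main__': guard:
import os
import multiprocessing

def worker(args):
    path, start, end = args
    handled = 0
    with open(path, 'rb') as f:
        f.seek(start)
        if start != 0:
            f.readline()           # skip the partial line; the previous worker owns it
        while f.tell() <= end:
            line = f.readline()
            if not line:
                break
            handled += 1           # real row processing would go here
    return handled

def split_work(path, processes=4):
    size = os.path.getsize(path)
    step = size // processes
    ranges = [(path, i * step, (i + 1) * step if i < processes - 1 else size)
              for i in range(processes)]
    with multiprocessing.Pool(processes) as pool:
        return sum(pool.map(worker, ranges))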

How to make a buffered writer?

So recently I took on as a personal project to make my very own DB in Python, mainly because I hate messing around with most DBs and I needed something easy to set up, portable and simple to study large data sets with.
I now find myself stuck on a problem: an efficient way to delete a line from the DB file (which is really just a text file). The way I found to do it is to write all of the content that's after the line before it, and then truncate the file (I take suggestions on better ways to do it). The problem arises when I need to write the content after the line before it, because doing it all at once could possibly load millions of lines onto the RAM at once. The code follows:
ln = 11  # Line to be deleted
with open("test.txt", "r+") as f:
    readlinef = f.readline
    for i in xrange(ln):
        line = readlinef()
    length, start = (len(line), f.tell()-len(line))
    f.seek(0, 2)
    chunk = f.tell() - (start+length)
    f.seek(start+length, 0)
    # How to make this buffered?
    data = f.read(chunk)
    f.seek(start, 0)
    f.write(data)
    f.truncate()
Right now that's reading all of the data at once; how would I make that last code block work in a buffered fashion? The start position would switch every time a new chunk of data is written before it, and I was wondering what would be the most efficient and fast (execution-time wise) way to do this.
Thanks in advance.
Edit:
I've decided to follow the advice submitted here, but just for curiosity's sake I found a way to read and write in chunks. It follows:
with open("test.txt", "r+") as f:
readlinef = f.readline
for i in xrange(ln):
line = readlinef()
start, length = (f.tell()-len(line), len(line))
readf = f.read
BUFFER_SIZE = 1024 * 1024
x = 0
chunk = readf(BUFFER_SIZE)
while chunk:
f.seek(start, 0)
f.write(chunk)
start += BUFFER_SIZE
f.seek(start+length+(x*BUFFER_SIZE), 0)
chunk = readf(BUFFER_SIZE)
f.truncate()
Answering your question "How would I do that?" concerning indices and vacuum.
Disclaimer: This is a very simple example and does in no way compare to existing DBMS and I strongly advise against it.
Basic idea:
For each table in your DB, keep various files, some for your object ids (row ids, record ids) and some (page files) with the actual data. Let's suppose that each record is of variable length.
Each record has a table-unique OID. These are stored in the oid-files. Let's name the table "test" and the oid files "test.oidX". Each record in the oid file is of fixed length and each oid file is of fixed length.
Now if "test.oid1" reads:
0001:0001:0001:0015 #oid:pagefile:position:length
0002:0001:0016:0100
0004:0002:0001:0001
It means that record 1 is in page file 1, at position 1 and has length 15. Record 2 is in page file 1 at position 16 of length 100, etc.
Now when you want to delete a record, just touch the oid file. E.g. for deleting record 2, edit it to:
0001:0001:0001:0015
0000:0001:0016:0100 #0000 indicating empty cell
0004:0002:0001:0001
And don't even bother touching your page files.
This will create holes in your page files. Now you need to implement some "maintenance" routine which moves blocks in your page files around, etc, which could either run when requested by the user, or automatically when your DBMS has nothing else to do. Depending on which locking strategy you use, you might need to lock the concerned records or the whole table.
Also when you insert a new record, and you find a hole big enough, you can insert it there.
If your oid-files should also function as an index (slow inserts, fast queries), you will need to rebuild it (surely on insertion, maybe on deletion).
Operations on oid-files should be fast, as they are fixed-length and of fixed-length records.
This is just the very basic idea, not touching topics like search trees, hashing, etc, etc.
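As a toy illustration of "just touch the oid file": with the fixed-width records shown above (19 characters plus a newline), deleting a record is a single in-place overwrite of its oid field. The file name and record layout are just the example's:
RECORD_LEN = 20   # "oooo:pppp:ssss:llll" plus the trailing newline

def delete_record(oid_path, record_index):
    """Mark the record at record_index as deleted by zeroing its oid field.
    Records are fixed length, so we can seek straight to the right spot;
    the page files are never touched."""
    with open(oid_path, 'r+b') as f:
        f.seek(record_index * RECORD_LEN)
        f.write(b'0000')          # 0000 indicates an empty cell
For instance, delete_record("test.oid1", 1) would blank out the second entry in the example above.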
You can do this the same way that (effectively) memmove works: seek back and forth between the source range and the destination range:
count = (size + chunksize - 1) // chunksize
for chunk in range(count):
    f.seek(start + chunk * chunksize + deleted_line_size, 0)
    buf = f.read(chunksize)
    f.seek(start + chunk * chunksize, 0)
    f.write(buf)
Using a temporary file and shutil makes it a lot simpler, and, despite what you might expect, it may actually be faster. (There's twice as much writing, but a whole lot less seeking, and mostly block-aligned writing.) For example:
with tempfile.TemporaryFile('w+') as ftemp:
    f.seek(start + deleted_line_size, 0)
    shutil.copyfileobj(f, ftemp)
    ftemp.seek(0, 0)
    f.seek(start, 0)
    shutil.copyfileobj(ftemp, f)
    f.truncate()
However, if your files are big enough to fit in your virtual memory space (which they probably are in 64-bit land, but may not be in 32-bit land), it may be simpler to just mmap the file and let the OS/libc take care of the work:
m = mmap.mmap(f.fileno(), access=mmap.ACCESS_WRITE)
m[start:end-deleted_line_size] = m[start+deleted_line_size:end]
m.close()
f.seek(end-deleted_line_size)
f.truncate()

How to shuffle a text file on disk in Python

I am working with a text file of about 12*10^6 rows which is stored on my hard disk.
The structure of the file is:
data|data|data|...|data\n
data|data|data|...|data\n
data|data|data|...|data\n
...
data|data|data|...|data\n
There's no header, and there's no id to uniquely identify the rows.
Since I want to use it for machine learning purposes, I need to make sure that there's no order in the text file which may affect the stochastic learning.
Usually I upload such kind of files into memory, and I shuffle them before rewriting them to disk. Unfortunately this time it is not possible, due to the size of the file, so I have to manage the shuffling directly on disk (assume I don't have a problem with disk space). Any idea about how to effectively (with the lowest possible complexity, i.e. writing to the disk) manage such a task with Python?
All but one of these ideas use O(N) memory—but if you use an array.array or numpy.ndarray we're talking around N*4 bytes, which is significantly smaller than the whole file. (I'll use a plain list for simplicity; if you need help converting to a more compact type, I can show that too.)
Using a temporary database and an index list:
import contextlib
import dbm
import os
import random

with contextlib.closing(dbm.open('temp.db', 'n')) as db:
    with open(path) as f:
        for i, line in enumerate(f):
            db[str(i)] = line
        linecount = i + 1
    shuffled = list(range(linecount))
    random.shuffle(shuffled)
    with open(path + '.shuffled', 'w') as f:
        for i in shuffled:
            f.write(db[str(i)])
os.remove('temp.db')
This is 2N single-line disk operations and 2N single-dbm-key disk operations, which should be equivalent to roughly 2NlogN single disk operations, so the total complexity is O(NlogN).
If you use a relational database like sqlite3 instead of a dbm, you don't even need the index list, because you can just do this:
SELECT * FROM Lines ORDER BY RANDOM()
This has the same time complexity as the above, and the space complexity is O(1) instead of O(N)—in theory. In practice, you need an RDBMS that can feed you a row at a time from a 100M row set without storing that 100M on either side.
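A minimal sqlite3 sketch of that approach; the file and table names are placeholders, and whether sqlite actually streams the ORDER BY RANDOM() result without materializing it is exactly the practical caveat above:
import os
import sqlite3

def shuffle_with_sqlite(path, out_path, dbpath='temp_shuffle.db'):
    conn = sqlite3.connect(dbpath)
    conn.execute('CREATE TABLE Lines (line TEXT)')
    with open(path) as f:
        conn.executemany('INSERT INTO Lines VALUES (?)', ((line,) for line in f))
    conn.commit()
    with open(out_path, 'w') as out:
        # Iterate over the cursor so rows come back one at a time.
        for (line,) in conn.execute('SELECT line FROM Lines ORDER BY RANDOM()'):
            out.write(line)
    conn.close()
    os.remove(dbpath)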
A different option, without using a temporary database—in theory O(N**2), but in practice maybe faster if you happen to have enough memory for the line cache to be helpful:
import linecache
import random

with open(path) as f:
    linecount = sum(1 for _ in f)
shuffled = list(range(1, linecount + 1))  # linecache line numbers are 1-based
random.shuffle(shuffled)
with open(path + '.shuffled', 'w') as f:
    for i in shuffled:
        f.write(linecache.getline(path, i))
Finally, by doubling the size of the index list, we can eliminate the temporary disk storage. But in practice, this might be a lot slower, because you're doing a lot more random-access reads, which drives aren't nearly as good at.
with open(path) as f:
    linestarts = [0] + [f.tell() for line in f]
    lineranges = list(zip(linestarts, linestarts[1:]))
    random.shuffle(lineranges)
    with open(path + '.shuffled', 'w') as f1:
        for start, stop in lineranges:
            f.seek(start)
            f1.write(f.read(stop-start))
This is a suggestion based on my comment above. It relies on having the compressed lines still being able to fit into memory. If that is not the case, the other solutions will be required.
import zlib
from random import shuffle

def heavy_shuffle(filename_in, filename_out):
    with open(filename_in, 'r') as f:
        zlines = [zlib.compress(line, 9) for line in f]
    shuffle(zlines)
    with open(filename_out, 'w') as f:
        for zline in zlines:
            f.write(zlib.decompress(zline))  # each line already ends with '\n'
My experience has been that zlib is fast, while bz2 offers better compression, so you may want to compare.
Also, if you can get away with chunking, say, n lines together, doing so is likely to lift your compression ratio.
I was wondering about the likelihood of useful compression, so here's an IPython experiment. I don't know what your data looks like, so I just went with floats (as strings) rounded to 3 places and strung together with pipes:
Best-case scenario (e.g. many rows have all same digits):
In [38]: data = '0.000|'*200
In [39]: len(data)
Out[39]: 1200
In [40]: zdata = zlib.compress(data, 9)
In [41]: print 'zlib compression ratio: ',1.-1.*len(zdata)/len(data)
zlib compression ratio: 0.98
In [42]: bz2data = bz2.compress(data, 9)
In [43]: print 'bz2 compression ratio: ',1.-1.*len(bz2data)/len(data)
bz2 compression ratio: 0.959166666667
As expected, best-case is really good, >95% compression ratio.
Worst-case scenario (randomized data):
In [44]: randdata = '|'.join(['{:.3f}'.format(x) for x in np.random.randn(200)])
In [45]: zdata = zlib.compress(randdata, 9)
In [46]: print 'zlib compression ratio: ',1.-1.*len(zdata)/len(data)
zlib compression ratio: 0.5525
In [47]: bz2data = bz2.compress(randdata, 9)
In [48]: print 'bz2 compression ratio: ',1.-1.*len(bz2data)/len(data)
bz2 compression ratio: 0.5975
Surprisingly, worst-case is not too bad ~60% compression ratio, but likely to be problematic if you only have 8 GB of memory (60% of 15 GB is 9 GB).
Assuming that disk space is not a problem for you, I am creating multiple files to hold the data.
import random
import os

PMSize = 100  # Lesser value means using more primary memory
shufflers = [open('file' + str(x), 'w+') for x in range(PMSize)]

with open('filename') as infile:
    for line in infile:
        i = random.randint(0, len(shufflers)-1)
        shufflers[i].write(line)

with open('filename', 'w') as outfile:
    for shuffler in shufflers:
        shuffler.seek(0)
        outfile.write(shuffler.read())

for x, shuffler in enumerate(shufflers):
    shuffler.close()
    os.remove('file' + str(x))
Your Memory complexity will be controlled by PMSize. Time complexity will be around O(N + PMSize).
This problem can be thought of as a problem of efficient memory page management to reduce swap file I/O. Let your buffer buf be a list of contiguous chunks of the file you would like stored in the output file. Let a contiguous chunk of the file be a list of a fixed number of whole lines.
Now, generate a random sequence and remap the returned values to appropriate chunk numbers and line offsets inside that chunk.
This operation leaves you with a sequence of numbers [1..num of chunks] which can be described as a sequence of accesses to memory fragments contained in pages numbered [1..num of chunks]. For the online variation (like in a real OS), there is no optimal strategy to this problem, but since you know the actual sequence of page references there is an optimal solution which can be found here.
What's the gain from this approach? Pages which are going to be used most often are re-read from the HDD the least, meaning fewer I/O operations to read the data. Also, considering your chunk size is big enough to minimize page swapping compared to the memory footprint, a lot of the time successive lines of the output file would be taken from the same chunk stored in memory (or any other chunk not yet swapped to the drive) rather than re-read from the drive.
Maybe it's not the easiest solution (though the optimal page swapping algorithm is easy to write), but it could be a fun exercise to do, wouldn't it?
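To make the remapping step concrete, here is a small sketch that turns a shuffled sequence of line numbers into (chunk number, offset inside chunk) pairs for a fixed lines-per-chunk; the caching/page-replacement layer described above is deliberately left out:
import random

def chunked_access_sequence(total_lines, lines_per_chunk):
    """Return the shuffled output order as (chunk, offset) pairs.
    The resulting sequence of chunk numbers is what the page-replacement
    strategy would then try to serve with as few re-reads as possible."""
    order = list(range(total_lines))
    random.shuffle(order)
    return [(i // lines_per_chunk, i % lines_per_chunk) for i in order]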

How can I parallelize a pipeline of generators/iterators in Python?

Suppose I have some Python code like the following:
input = open("input.txt")
x = (process_line(line) for line in input)
y = (process_item(item) for item in x)
z = (generate_output_line(item) + "\n" for item in y)
output = open("output.txt", "w")
output.writelines(z)
This code reads each line from the input file, runs it through several functions, and writes the output to the output file. Now I know that the functions process_line, process_item, and generate_output_line will never interfere with each other, and let's assume that the input and output files are on separate disks, so that reading and writing will not interfere with each other.
But Python probably doesn't know any of this. My understanding is that Python will read one line, apply each function in turn, and write the result to the output, and then it will read the second line only after sending the first line to the output, so that the second line does not enter the pipeline until the first one has exited. Do I understand correctly how this program will flow? If this is how it works, is there any easy way to make it so that multiple lines can be in the pipeline at once, so that the program is reading, writing, and processing each step in parallel?
You can't really parallelize reading from or writing to files; these will be your bottleneck, ultimately. Are you sure your bottleneck here is CPU, and not I/O?
Since your processing contains no dependencies (according to you), it's trivially simple to use Python's multiprocessing.Pool class.
There are a couple ways to write this, but the easier w.r.t. debugging is to find independent critical paths (slowest part of the code), which we will make run parallel. Let's presume it's process_item.
…And that's it, actually. Code:
import multiprocessing

p = multiprocessing.Pool()  # use all available CPUs
input = open("input.txt")
x = (process_line(line) for line in input)
y = p.imap(process_item, x)
z = (generate_output_line(item) + "\n" for item in y)
output = open("output.txt", "w")
output.writelines(z)
I haven't tested it, but this is the basic idea. Pool's imap method makes sure results are returned in the right order.
is there any easy way to make it so that multiple lines can be in the pipeline at once
I wrote a library to do just this: https://github.com/michalc/threaded-buffered-pipeline, that iterates over each iterable in a separate thread.
So what was
input = open("input.txt")
x = (process_line(line) for line in input)
y = (process_item(item) for item in x)
z = (generate_output_line(item) + "\n" for item in y)
output = open("output.txt", "w")
output.writelines(z)
becomes
from threaded_buffered_pipeline import buffered_pipeline
input = open("input.txt")
buffer_iterable = buffered_pipeline()
x = buffer_iterable((process_line(line) for line in input))
y = buffer_iterable((process_item(item) for item in x))
z = buffer_iterable((generate_output_line(item) + "\n" for item in y))
output = open("output.txt", "w")
output.writelines(z)
How much actual parallelism this adds depends on what's actually happening in each iterable, and how many CPU cores you have/how busy they are.
The classic example is the Python GIL: if each step is fairly CPU heavy, and just uses Python, then not much parallelism would be added, and this might not be faster than the serial version. On the other hand, if each is network IO heavy, then I think it's likely to be faster.
