With the plain code below, everything is fine: I can read and write correctly.
for i in range(10):
    sleep(1)
    m = mmap.mmap(-1, 1024, access=mmap.ACCESS_WRITE, tagname='share_mmap')
    m.seek(0)
    cnt = m.read_byte()
    if cnt == 0:
        print("Load data to memory")
        m.seek(0)
        m.write(b"FFFFFFFFFFFFFFFFFF")
    else:
        m.seek(0)
        info_str = m.read().translate(None, b'\x00').decode()
        print("The data is in memory: ", info_str)
Result:
> Load data to memory
> The data is in memory: FFFFFFFFFFFFFFFFFF
> The data is in memory: FFFFFFFFFFFFFFFFFF
But if I wrap it with contextlib.closing, the way all the tutorials do, then I can't read the data back anymore.
for i in range(10):
    sleep(1)
    # !!!! LOOK AT HERE !!!! I only changed this line.
    with contextlib.closing(mmap.mmap(-1, 1024, access=mmap.ACCESS_WRITE,
                                      tagname='share_mmap')) as m:
        m.seek(0)
        cnt = m.read_byte()
        if cnt == 0:
            print("Load data to memory")
            m.seek(0)
            m.write(b"FFFFFFFFFFFFFFFFFF")
        else:
            m.seek(0)
            info_str = m.read().translate(None, b'\x00').decode()
            print("The data is in memory: ", info_str)
Result:
> Load data to memory
> Load data to memory
> Load data to memory
Why?
Also, why does everyone do it this way? Haven't they run into this problem?
You're mapping anonymous memory as opposed to a file, so the only thing that's making you get the same mapping each time is the tag name. If you already have a mapping with a given tag name, then mmap gives you the same mapping again, but if not, you get a fresh mapping. Your first snippet works because the old mappings are still there (since you never closed them), but in your second snippet, the old mapping is gone before the new one gets created, and since it's anonymous memory, that means the data is gone too.
To fix the problem, either do the mapping outside the loop or map file-backed memory instead.
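A minimal sketch of the first option, reusing your own loop body (Windows assumed, since tagname is used): create the anonymous mapping once, outside the loop, so it stays alive between iterations.
import contextlib
import mmap
from time import sleep

# Create the anonymous mapping once; it stays alive for all iterations,
# so the data written on the first pass is still there on later passes.
with contextlib.closing(mmap.mmap(-1, 1024, access=mmap.ACCESS_WRITE,
                                  tagname='share_mmap')) as m:
    for i in range(10):
        sleep(1)
        m.seek(0)
        if m.read_byte() == 0:
            print("Load data to memory")
            m.seek(0)
            m.write(b"FFFFFFFFFFFFFFFFFF")
        else:
            m.seek(0)
            info_str = m.read().translate(None, b'\x00').decode()
            print("The data is in memory: ", info_str)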
Apparently this only fails with anonymous memory (fileno -1), since it works with a real file, as illustrated below. Not sure if this is a bug or just an undocumented limitation.
import contextlib
import mmap
from time import sleep
import tempfile

with tempfile.TemporaryFile() as mmapfile:
    for i in range(5):
        sleep(1)
        # Works if you use a real file.
        with contextlib.closing(mmap.mmap(mmapfile.fileno(), 1024, access=mmap.ACCESS_WRITE,
                                          tagname='share_mmap')) as m:
            m.seek(0)
            cnt = m.read_byte()
            if cnt == 0:
                print("Load data to memory")
                m.seek(0)
                m.write(b"FFFFFFFFFFFFFFFFFF")
            else:
                m.seek(0)
                info_str = m.read().translate(None, b'\x00').decode()
                print("The data is in memory: ", info_str)
Output:
Load data to memory
The data is in memory: FFFFFFFFFFFFFFFFFF
The data is in memory: FFFFFFFFFFFFFFFFFF
The data is in memory: FFFFFFFFFFFFFFFFFF
The data is in memory: FFFFFFFFFFFFFFFFFF
Related
I am trying to read a large text file (> 20 GB) with Python.
The file contains the positions of atoms for 400 frames, and each frame is independent as far as the computations in this code are concerned. In theory I can split the job into 400 tasks without any need for communication. Each frame has 1,000,000 lines, so the file has 1,000,000 * 400 lines of text.
My initial approach is to use multiprocessing with a pool of workers:
def main():
    """ main function
    """
    filename = sys.argv[1]
    nump = int(sys.argv[2])
    f = open(filename)
    s = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    cursor = 0
    framelocs = []
    start = time.time()
    print(mp.cpu_count())
    chunks = []
    while True:
        initial = s.find(b'ITEM: TIMESTEP', cursor)
        if initial == -1:
            break
        cursor = initial + 14
        final = s.find(b'ITEM: TIMESTEP', cursor)
        framelocs.append([initial, final])
        #readchunk(s[initial:final])
        chunks.append(s[initial:final])
        if final == -1:
            break
Here I am basically scanning the file to find where each frame begins and ends, opening it with Python's mmap module to avoid reading everything into memory.
def readchunk(chunk):
    start = time.time()
    part = chunk.split(b'\n')
    timestep = int(part[1])
    print(timestep)
Now I would like to send chunks of file to pool of workers to process.
The reading part will be more complex, but those lines will be implemented later.
print('Seeking file took %8.6f' % (time.time() - start))
pool = mp.Pool(nump)
start = time.time()
results = pool.map(readchunk, chunks[0:16])
print('Reading file took %8.6f' % (time.time() - start))
If I send 8 chunks to 8 cores, reading takes 0.8 s.
However
If I send 16 chunks to 16 cores, it takes 1.7 s. The parallelization does not seem to speed things up. I am running this on Oak Ridge's Summit supercomputer, if that is relevant, using this command:
jsrun -n1 -c16 -a1 python -u ~/Developer/DipoleAnalyzer/AtomMan/readlargefile.py DW_SET6_NVT.lammpstrj 16
This is supposed to create 1 MPI task and assign 16 cores to 16 threads.
Am I missing something here?
Is there a better approach?
As others have said, there is some overhead in creating processes, so you could see a slowdown when testing with small samples.
Something like this might be neater. Make sure you understand what the generator function is doing.
import multiprocessing as mp
import sys
import mmap

def do_something_with_frame(frame):
    print("processing a frame:")
    return 100

def frame_supplier(filename):
    """A generator for frames"""
    f = open(filename)
    s = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    cursor = 0
    while True:
        initial = s.find(b'ITEM: TIMESTEP', cursor)
        if initial == -1:
            break
        cursor = initial + 14
        final = s.find(b'ITEM: TIMESTEP', cursor)
        yield s[initial:final]
        if final == -1:
            break

def main():
    """Process a file of atom frames
    Args:
        filename: the file to process
        processes: the size of the pool
    """
    filename = sys.argv[1]
    nump = int(sys.argv[2])
    frames = frame_supplier(filename)
    pool = mp.Pool(nump)
    # play around with the chunksize
    for result in pool.imap(do_something_with_frame, frames, chunksize=10):
        print(result)
Disclaimer: this is a suggestion. There may be some syntax errors. I haven't tested it.
EDIT:
It sounds like your script is becoming I/O limited (i.e. limited by the rate at which you can read from disk). You should be able to verify this by setting the body of do_something_with_frame to pass. If the program is I/O bound, it will still take nearly as long.
I don't think MPI is going to make any difference here. I think that file-read speed is probably a limiting factor and I don't see how MPI will help.
It's worth doing some profiling at this point to find out which function calls are taking the longest.
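For example, the stdlib profiler gives per-function timings with very little setup (assuming main() is the entry point as above):
import cProfile

# Print cumulative time per function, most expensive first.
cProfile.run('main()', sort='cumtime')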
It is also worth trying without mmap():
def frame_supplier(filename):
    frame = []
    with open(filename) as file:
        for line in file:
            if line.startswith('ITEM: TIMESTEP'):
                if frame:
                    yield frame
                frame = [line]
            else:
                frame.append(line)
        if frame:
            yield frame
I am currently trying to store some data into .h5 files. I quickly realised that I might have to store the data in parts, as it is not possible to process it all and keep it in RAM. I started out using numpy.array to compress the memory usage, but that resulted in days spent on formatting the data.
So I went back to using lists, but made the program monitor the memory usage: when it rises above a specified value, a part is stored in numpy format so that another process can load it and make use of it.
The problem is that what I thought would release the memory isn't releasing it. For some reason the memory usage stays the same even though I reset the variable and del the variable. Why isn't the memory being released here?
import numpy as np
import os
import resource
import sys
import gc
import math
import h5py
import SecureString
import objgraph
from numpy.lib.stride_tricks import as_strided as ast

total_frames = 15
total_frames_with_deltas = total_frames*3
dim = 40
window_height = 5

def store_file(file_name, data):
    with h5py.File(file_name, 'w') as f:
        f["train_input"] = np.concatenate(data, axis=1)

def load_data_overlap(saved):
    #os.chdir(numpy_train)
    print "Inside function!..."
    if saved == False:
        train_files = np.random.randint(255, size=(1, 40, 690, 4))
        train_input_data_interweawed_normalized = []
        print "Storing train pic to numpy"
        part = 0
        for i in xrange(100000):
            print resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
            if resource.getrusage(resource.RUSAGE_SELF).ru_maxrss > 2298842112/10:
                print "Max ram storing part: " + str(part) + " At entry: " + str(i)
                print "Storing Train input"
                file_name = 'train_input_'+'part_'+str(part)+'_'+str(dim)+'_'+str(total_frames_with_deltas)+'_window_height_'+str(window_height)+'.h5'
                store_file(file_name, train_input_data_interweawed_normalized)
                part = part + 1
                del train_input_data_interweawed_normalized
                gc.collect()
                train_input_data_interweawed_normalized = []
                raw_input("something")
            for plot in train_files:
                overlaps_reshaped = np.random.randint(10, size=(45, 200, 5, 3))
                for ind_plot in overlaps_reshaped.reshape(overlaps_reshaped.shape[1], overlaps_reshaped.shape[0], overlaps_reshaped.shape[2], overlaps_reshaped.shape[3]):
                    ind_plot_reshaped = ind_plot.reshape(ind_plot.shape[0], 1, ind_plot.shape[1], ind_plot.shape[2])
                    train_input_data_interweawed_normalized.append(ind_plot_reshaped)
            print len(train_input_data_interweawed_normalized)
    return train_input_data_interweawed_normalized

#------------------------------------------------------------------------------------------------------------------------------------------------------------
saved = False
train_input = load_data_overlap(saved)
output:
.....
223662080
224772096
225882112
226996224
228106240
229216256
230326272
Max ram storing part: 0 At entry: 135
Storing Train input
something
377118720
Max ram storing part: 1 At entry: 136
Storing Train input
something
377118720
Max ram storing part: 2 At entry: 137
Storing Train input
something
You need to explicitly force garbage collection, see here:
According to the official Python documentation, you can force the garbage collector to release unreferenced memory with gc.collect().
I have Python code with a continuous loop that does a pickle load inside. There are 200 pickle files in the loop, each about 80 MB, on an SSD drive.
When I run the code, the performance of the pickle load fluctuates continuously: mostly it is about 0.2 s, but at times it "pauses" for 4-6 s, dragging down the overall benchmark of the process.
What could be the problem?
def unpickle(filename):
    fo = open(filename, 'r')
    contents = cPickle.load(fo)
    fo.close()
    return contents

for xd in self.X:
    tt = time()
    xdf = unpickle(xd)
    tt = time() - tt
    print tt
OUT:
1.87527704239
4.30886101723
0.259668111801
0.234542131424
0.228765964508
0.214528799057
0.213661909103
0.215914011002
0.217473983765
0.225739002228
The way I created pickle files:
I have a pandas DataFrame with the columns: 'name', 'source', 'level', 'image', 'path', 'is_train'.
Most of the size comes from the 'image' column.
I pickle it with:
def pickle(filename, data):
    with open(filename, 'w') as fo:
        cPickle.dump(data, fo, protocol=cPickle.HIGHEST_PROTOCOL)
Your question is terribly unclear (in particular, you should be giving us enough information to reproduce your test case ourselves), but it feels like GC pauses or memory defragmentation.
Pickle is a terribly inefficient format, and processing 16 gigabytes' worth of it is bound to cause some serious memory thrashing.
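One quick experiment for the GC-pause hypothesis: disable the cyclic collector around the load and see whether the multi-second spikes disappear. A sketch against your unpickle above (unpickle_nogc is just an illustrative name):
import gc
import cPickle

def unpickle_nogc(filename):
    # cPickle.load creates huge numbers of objects, which can trigger long
    # collection passes; switch the collector off just for the load.
    gc.disable()
    try:
        with open(filename, 'rb') as fo:
            return cPickle.load(fo)
    finally:
        gc.enable()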
I have multiple 3 GB tab delimited files. There are 20 million rows in each file. All the rows have to be independently processed, no relation between any two rows. My question is, what will be faster?
Reading line-by-line?
with open() as infile:
    for line in infile:
Reading the file into memory in chunks and processing it, say 250 MB at a time?
The processing is not very complicated: I am just grabbing the value in column 1 into List1, column 2 into List2, etc. I might need to add some column values together.
I am using Python 2.7 on a Linux box that has 30 GB of memory. The files are ASCII text.
Is there any way to speed things up in parallel? Right now I am using the former method and the process is very slow. Would using a CSV reader module help?
I don't have to do it in python, any other language or database use ideas are welcome.
It sounds like your code is I/O bound. This means that multiprocessing isn't going to help—if you spend 90% of your time reading from disk, having an extra 7 processes waiting on the next read isn't going to help anything.
And, while using a CSV reading module (whether the stdlib's csv or something like NumPy or Pandas) may be a good idea for simplicity, it's unlikely to make much difference in performance.
Still, it's worth checking that you really are I/O bound, instead of just guessing. Run your program and see whether your CPU usage is close to 0% or close to 100% of a core. Do what Amadan suggested in a comment and run your program with just pass for the processing, and see whether that cuts the time by 5% or by 70%. You may even want to try comparing with a loop over os.open and os.read(1024*1024) or something and see if that's any faster.
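A rough sketch of that raw-read comparison might look like this (path being whatever file you are testing with):
import os

fd = os.open(path, os.O_RDONLY)
try:
    # Pull the file through in 1 MB raw reads with no processing at all,
    # just to measure the pure I/O floor.
    while True:
        buf = os.read(fd, 1024 * 1024)
        if not buf:
            break
finally:
    os.close(fd)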
Since you're using Python 2.x, Python is relying on the C stdio library to guess how much to buffer at a time, so it might be worth forcing it to buffer more. The simplest way to do that is to use readlines(bufsize) for some large bufsize. (You can try different numbers and measure them to see where the peak is. In my experience, usually anything from 64K-8MB is about the same, but depending on your system that may be different—especially if you're, e.g., reading off a network filesystem with great throughput but horrible latency that swamps the throughput-vs.-latency of the actual physical drive and the caching the OS does.)
So, for example:
bufsize = 65536
with open(path) as infile:
while True:
lines = infile.readlines(bufsize)
if not lines:
break
for line in lines:
process(line)
Meanwhile, assuming you're on a 64-bit system, you may want to try using mmap instead of reading the file in the first place. This certainly isn't guaranteed to be better, but it may be better, depending on your system. For example:
with open(path) as infile:
    m = mmap.mmap(infile.fileno(), 0, access=mmap.ACCESS_READ)
A Python mmap is sort of a weird object—it acts like a str and like a file at the same time, so you can, e.g., manually iterate scanning for newlines, or you can call readline on it as if it were a file. Both of those will take more processing from Python than iterating the file as lines or doing batch readlines (because a loop that would be in C is now in pure Python… although maybe you can get around that with re, or with a simple Cython extension?)… but the I/O advantage of the OS knowing what you're doing with the mapping may swamp the CPU disadvantage.
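For instance, manually scanning for newlines over the mapping might look roughly like this (Python 2, as above; process() stands in for the same per-line work):
pos = 0
while True:
    # find() works on the mapping just like on a str object
    nl = m.find('\n', pos)
    if nl == -1:
        break
    process(m[pos:nl])   # slicing copies just that line out of the mapping
    pos = nl + 1
if pos < len(m):
    process(m[pos:])     # trailing data with no final newline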
Unfortunately, Python doesn't expose the madvise call that you'd use to tweak things in an attempt to optimize this in C (e.g., explicitly setting MADV_SEQUENTIAL instead of making the kernel guess, or forcing transparent huge pages)—but you can actually ctypes the function out of libc.
I know this question is old, but I wanted to do a similar thing, so I created a simple framework which helps you read and process a large file in parallel. I'm leaving what I tried as an answer.
This is the code; I give an example at the end.
# imports used by the framework below
import os
import time
import gc
import multiprocessing as mp
import psutil

def chunkify_file(fname, size=1024*1024*1000, skiplines=-1):
    """
    function to divide a large text file into chunks each having size ~= size so that the chunks are line aligned

    Params :
        fname : path to the file to be chunked
        size : size of each chunk is ~> this
        skiplines : number of lines at the beginning to skip, -1 means don't skip any lines
    Returns :
        start and end position of chunks in Bytes
    """
    chunks = []
    fileEnd = os.path.getsize(fname)
    with open(fname, "rb") as f:
        if(skiplines > 0):
            for i in range(skiplines):
                f.readline()
        chunkEnd = f.tell()
        count = 0
        while True:
            chunkStart = chunkEnd
            f.seek(f.tell() + size, os.SEEK_SET)
            f.readline()  # make this chunk line aligned
            chunkEnd = f.tell()
            chunks.append((chunkStart, chunkEnd - chunkStart, fname))
            count += 1
            if chunkEnd > fileEnd:
                break
    return chunks
def parallel_apply_line_by_line_chunk(chunk_data):
    """
    function to apply a function to each line in a chunk

    Params :
        chunk_data : the data for this chunk
    Returns :
        list of the non-None results for this chunk
    """
    chunk_start, chunk_size, file_path, func_apply = chunk_data[:4]
    func_args = chunk_data[4:]
    t1 = time.time()
    chunk_res = []
    with open(file_path, "rb") as f:
        f.seek(chunk_start)
        cont = f.read(chunk_size).decode(encoding='utf-8')
        lines = cont.splitlines()
        for i, line in enumerate(lines):
            ret = func_apply(line, *func_args)
            if(ret != None):
                chunk_res.append(ret)
    return chunk_res
def parallel_apply_line_by_line(input_file_path, chunk_size_factor, num_procs, skiplines, func_apply, func_args, fout=None):
    """
    function to apply a supplied function line by line in parallel

    Params :
        input_file_path : path to input file
        chunk_size_factor : size of 1 chunk in MB
        num_procs : number of parallel processes to spawn, max used is num of available cores - 1
        skiplines : number of top lines to skip while processing
        func_apply : a function which expects a line and outputs None for lines we don't want processed
        func_args : arguments to function func_apply
        fout : do we want to output the processed lines to a file
    Returns :
        list of the non-None results obtained by processing each line
    """
    num_parallel = min(num_procs, psutil.cpu_count()) - 1
    jobs = chunkify_file(input_file_path, 1024 * 1024 * chunk_size_factor, skiplines)
    jobs = [list(x) + [func_apply] + func_args for x in jobs]
    print("Starting the parallel pool for {} jobs ".format(len(jobs)))

    lines_counter = 0
    # maxtasksperchild - if not supplied, something weird happens and memory
    # blows up as the processes keep on lingering
    pool = mp.Pool(num_parallel, maxtasksperchild=1000)

    outputs = []
    for i in range(0, len(jobs), num_parallel):
        print("Chunk start = ", i)
        t1 = time.time()
        chunk_outputs = pool.map(parallel_apply_line_by_line_chunk, jobs[i : i + num_parallel])
        for i, subl in enumerate(chunk_outputs):
            for x in subl:
                if(fout != None):
                    print(x, file=fout)
                else:
                    outputs.append(x)
                lines_counter += 1
        del(chunk_outputs)
        gc.collect()
        print("All Done in time ", time.time() - t1)

    print("Total lines we have = {}".format(lines_counter))
    pool.close()
    pool.terminate()
    return outputs
Say for example, I have a file in which I want to count the number of words in each line, then the processing of each line would look like
def count_words_line(line):
    return len(line.strip().split())
and then call the function like:
parallel_apply_line_by_line(input_file_path, 100, 8, 0, count_words_line, [], fout=None)
Using this, I get a speed up of ~8 times as compared to vanilla line by line reading on a sample file of size ~20GB in which I do some moderately complicated processing on each line.
Sample records in the data file (a SAM file):
M01383 0 chr4 66439384 255 31M * 0 0 AAGAGGA GFAFHGD MD:Z:31 NM:i:0
M01382 0 chr1 241995435 255 31M * 0 0 ATCCAAG AFHTTAG MD:Z:31 NM:i:0
......
The data files are line-based (one record per line).
The size of the data files varies from 1 GB to 5 GB.
I need to go through the records in the data file line by line, get a particular value (e.g. the 4th value, 66439384) from each line, and pass this value to another function for processing. Then some result counters will be updated.
The basic workflow is like this:
# global variables; the counters will be updated in the search function
# according to the value passed
counter_a = 0
counter_b = 0
counter_c = 0

open textfile:
    for line in textfile:
        value = line.split()[3]
        search_function(value)  # this function takes a long time to process

def search_function(value):
    some conditions checking:
        update the counter_a or counter_b or counter_c
With single-process code and a roughly 1.5 GB data file, it took about 20 hours to run through all the records in one data file. I need much faster code because there are more than 30 data files of this kind.
I was thinking of processing the data file in N chunks in parallel, with each chunk performing the above workflow and updating the global counters (counter_a, counter_b, counter_c) simultaneously. But I don't know how to achieve this in code, or whether it will work at all.
I have access to a server machine with: 24 processors and around 40G RAM.
Could anyone help with this? Thanks very much.
The simplest approach would probably be to do all 30 files at once with your existing code -- would still take all day, but you'd have all the files done at once. (ie, "9 babies in 9 months" is easy, "1 baby in 1 month" is hard)
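A minimal sketch of that, assuming the existing single-file workflow is wrapped in a hypothetical process_one_file(filename) that returns the three counters:
import multiprocessing as mp

# process_one_file is assumed to run the existing single-file workflow
# (read lines, call search_function, return counter_a, counter_b, counter_c);
# list_of_30_filenames is a placeholder for your actual file list.
if __name__ == '__main__':
    pool = mp.Pool(24)                                    # one worker per processor
    all_counts = pool.map(process_one_file, list_of_30_filenames)
    pool.close()
    pool.join()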
If you really want to get a single file done faster, it will depend on how your counters actually update. If almost all the work is just in analysing the value, you can offload that using the multiprocessing module:
import time
import multiprocessing

def slowfunc(value):
    time.sleep(0.01)
    return value**2 + 0.3*value + 1

counter_a = counter_b = counter_c = 0
def add_to_counter(res):
    global counter_a, counter_b, counter_c
    counter_a += res
    counter_b -= (res - 10)**2
    counter_c += (int(res) % 2)

pool = multiprocessing.Pool(50)
results = []

for value in range(100000):
    r = pool.apply_async(slowfunc, [value])
    results.append(r)

    # don't let the queue grow too long
    if len(results) == 1000:
        results[0].wait()

    while results and results[0].ready():
        r = results.pop(0)
        add_to_counter(r.get())

for r in results:
    r.wait()
    add_to_counter(r.get())

print counter_a, counter_b, counter_c
That will allow 50 slowfuncs to run in parallel, so instead of taking 1000 s (= 100k * 0.01 s), it takes about 20 s (= (100k/50) * 0.01 s) to complete. If you can restructure your function into "slowfunc" and "add_to_counter" like the above, you should be able to get a factor of 24 speedup.
Read one file at a time, use all CPUs to run search_function():
#!/usr/bin/env python
from multiprocessing import Array, Pool

def init(counters_):  # called for each child process
    global counters
    counters = counters_

def search_function(value):  # assume it is a CPU-intensive task
    some conditions checking:
        update the counter_a or counter_b or counter_c
    counters[0] += 1  # counter 'a'
    counters[1] += 1  # counter 'b'
    return value, result, error

if __name__ == '__main__':
    counters = Array('i', [0]*3)
    pool = Pool(initializer=init, initargs=[counters])
    values = (line.split()[3] for line in textfile)
    for value, result, error in pool.imap_unordered(search_function, values,
                                                    chunksize=1000):
        if error is not None:
            print('value: {value}, error: {error}'.format(**vars()))
    pool.close()
    pool.join()
    print(list(counters))
Make sure (for example, by writing wrappers) that exceptions do not escape next(values), search_function().
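A sketch of such a wrapper for search_function, preserving the (value, result, error) shape so the pool never sees a raised exception:
def safe_search_function(value):
    try:
        return search_function(value)
    except Exception as e:
        # report the failure through the normal result channel instead of
        # letting the exception escape and kill the imap_unordered loop
        return value, None, e
You would then pass safe_search_function to imap_unordered in place of search_function.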
This solution works on a set of files.
For each file, it divides it into a specified number of line-aligned chunks, solves each chunk in parallel, then combines the results.
It streams each chunk from disk; this is somewhat slower, but does not consume nearly so much memory. We depend on disk cache and buffered reads to prevent head thrashing.
Usage is like
python script.py -n 16 sam1.txt sam2.txt sam3.txt
and script.py is
import argparse
from io import SEEK_END
import multiprocessing as mp

#
# Worker process
#
def summarize(fname, start, stop):
    """
    Process file[start:stop]

    start and stop both point to first char of a line (or EOF)
    """
    a = 0
    b = 0
    c = 0
    with open(fname, newline='') as inf:
        # jump to start position
        pos = start
        inf.seek(pos)
        for line in inf:
            value = int(line.split()[3])
            # *** START EDIT HERE ***
            #
            # update a, b, c based on value
            #
            # *** END EDIT HERE ***
            pos += len(line)
            if pos >= stop:
                break
    return a, b, c

def main(num_workers, sam_files):
    print("{} workers".format(num_workers))
    pool = mp.Pool(processes=num_workers)

    # for each input file
    for fname in sam_files:
        print("Dividing {}".format(fname))
        # decide how to divide up the file
        with open(fname) as inf:
            # get file length
            inf.seek(0, SEEK_END)
            f_len = inf.tell()
            # find break-points
            starts = [0]
            for n in range(1, num_workers):
                # jump to approximate break-point
                inf.seek(n * f_len // num_workers)
                # find start of next full line
                inf.readline()
                # store offset
                starts.append(inf.tell())
        # do it!
        stops = starts[1:] + [f_len]
        start_stops = zip(starts, stops)
        print("Solving {}".format(fname))
        # apply_async so the chunks actually run in parallel
        result_objs = [pool.apply_async(summarize, args=(fname, start, stop)) for start, stop in start_stops]
        results = [r.get() for r in result_objs]
        # collect results
        results = [sum(col) for col in zip(*results)]
        print(results)

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description='Parallel text processor')
    parser.add_argument('--num_workers', '-n', default=8, type=int)
    parser.add_argument('sam_files', nargs='+')
    args = parser.parse_args()
    main(args.num_workers, args.sam_files)
What you don't want to do is hand whole files to individual CPUs. If you do that, the file opens/reads will likely cause the heads to bounce randomly all over the disk, because the files are likely to be scattered all over the disk.
Instead, break each file into chunks and process the chunks.
Open the file with one CPU. Read the whole thing into an array Text. You want to do this in one massive read to prevent the heads from thrashing around the disk, under the assumption that your file(s) are placed on the disk in relatively large sequential chunks.
Divide its size in bytes by N, giving a (global) value K, the approximate number of bytes each CPU should process. Fork N threads, and hand each thread i its index i, and a copied handle for each file.
Each thread i starts a thread-local scan pointer p into Text at offset i*K. It scans the text, incrementing p and ignoring the text until a newline is found. At that point, it starts processing lines (incrementing p as it scans them). It stops after processing a line, when its index into Text is greater than (i+1)*K.
If the amount of work per line is about equal, your N cores will all finish about the same time.
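A rough sketch of that per-thread scan, assuming Text is already in memory and process_line() is a hypothetical per-line worker; each thread i would call scan_region(i, Text, K, N, process_line) with K = len(Text) // N:
def scan_region(i, Text, K, N, process_line):
    # thread 0 starts at the beginning; everyone else skips to the next full line
    start = 0 if i == 0 else Text.find('\n', i * K) + 1
    if i != 0 and start == 0:          # no newline after i*K: nothing left for this thread
        return
    end = (i + 1) * K                  # soft boundary between thread i and thread i+1
    p = start
    while p < len(Text):
        nl = Text.find('\n', p)
        line_end = len(Text) if nl == -1 else nl
        process_line(Text[p:line_end])
        p = line_end + 1
        # stop after the line that carries us past our boundary
        # (the last thread keeps going to the end of Text)
        if p > end and i != N - 1:
            break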
(If you have more than one file, you can then start the next one).
If you know that the file sizes are smaller than memory, you might arrange the file reads to be pipelined, e.g., while the current file is being processed, a file-read thread is reading the next file.
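A rough sketch of that pipelining, assuming whole files fit in memory and process_text() is a hypothetical per-file processing step: a reader thread loads file i+1 while the workers chew on file i.
import threading
from queue import Queue

def reader(filenames, q):
    for name in filenames:
        with open(name, 'rb') as f:
            q.put((name, f.read()))   # one big sequential read per file
    q.put(None)                        # sentinel: no more files

def run(filenames):
    q = Queue(maxsize=1)               # buffer at most one file ahead
    threading.Thread(target=reader, args=(filenames, q), daemon=True).start()
    while True:
        item = q.get()
        if item is None:
            break
        name, text = item
        process_text(name, text)       # hypothetical: chunk and process as described above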