How to calculate the average of several .dat files using Python?

So I have 50-60 .dat files, all containing m rows and n columns of numbers. I need to take the average of all of the files and create a new file in the same format. I have to do this in Python. Can anyone help me with this?
I've written some code:
I realize I have some incompatible types here, but I can't think of an alternative, so I haven't changed anything yet.
#! /usr/bin/python
import os

CC = 1.96

average = []
total = []
count = 0
os.chdir("./")
for files in os.listdir("."):
    if files.endswith(".dat"):
        infile = open(files)
        cur = []
        cur = infile.readlines()
        for i in xrange(0, len(cur)):
            cur[i] = cur[i].split()
        total += cur
        count += 1
average = [x/count for x in total]

#calculate uncertainty
uncert = []
for files in os.listdir("."):
    if files.endswith(".dat"):
        infile = open(files)
        cur = []
        cur = infile.readlines()
        for i in xrange(0, len(cur)):
            cur[i] = cur[i].split()
        uncert += (cur - average)**2
uncert = uncert**.5
uncert = uncert*CC

Here's a fairly time- and resource-efficient approach which reads in the values and calculates their averages for all the files in parallel, yet only reads one line per file at a time. However, it does temporarily read the entire first .dat file into memory in order to determine how many rows and columns of numbers each file should contain.
You didn't say whether your "numbers" are integers, floats, or something else, so this reads them in as floating point (which will work even if they're not). Regardless, the averages are calculated and output as floating-point numbers.
Update
I've modified my original answer to also calculate a population standard deviation (sigma) of the values in each row and column, as per your comment. It does this right after it computes their mean value, so a second pass to re-read all the data isn't necessary. In addition, in response to a suggestion made in the comments, a context manager has been added to ensure that all the input files get closed.
Note that the standard deviations are only printed and are not written to the output file, but doing that to the same or a separate file should be easy enough to add.
from contextlib import contextmanager
from itertools import izip
from glob import iglob
from math import sqrt
from sys import exit

@contextmanager
def multi_file_manager(files, mode='rt'):
    files = [open(file, mode) for file in files]
    yield files
    for file in files:
        file.close()

# generator function to read, convert, and yield each value from a text file
def read_values(file, datatype=float):
    for line in file:
        for value in (datatype(word) for word in line.split()):
            yield value

# enumerate multiple equal-length iterables simultaneously as (i, n0, n1, ...)
def multi_enumerate(*iterables, **kwds):
    start = kwds.get('start', 0)
    return ((n,)+t for n, t in enumerate(izip(*iterables), start))

DATA_FILE_PATTERN = 'data*.dat'
MIN_DATA_FILES = 2

with multi_file_manager(iglob(DATA_FILE_PATTERN)) as datfiles:
    num_files = len(datfiles)
    if num_files < MIN_DATA_FILES:
        print('Less than {} .dat files were found to process, '
              'terminating.'.format(MIN_DATA_FILES))
        exit(1)

    # determine number of rows and cols from first file
    temp = [line.split() for line in datfiles[0]]
    num_rows = len(temp)
    num_cols = len(temp[0])
    datfiles[0].seek(0)  # rewind first file
    del temp  # no longer needed
    print '{} .dat files found, each must have {} rows x {} cols\n'.format(
        num_files, num_rows, num_cols)

    means = []
    std_devs = []
    divisor = float(num_files-1)  # Bessel's correction for sample standard dev
    generators = [read_values(file) for file in datfiles]
    for _ in xrange(num_rows):  # main processing loop
        for _ in xrange(num_cols):
            # create a sequence of next cell values from each file
            values = tuple(next(g) for g in generators)
            mean = float(sum(values)) / num_files
            means.append(mean)
            means_diff_sq = ((value-mean)**2 for value in values)
            std_dev = sqrt(sum(means_diff_sq) / divisor)
            std_devs.append(std_dev)

print 'Average and (standard deviation) of values:'
with open('means.txt', 'wt') as averages:
    for i, mean, std_dev in multi_enumerate(means, std_devs):
        print '{:.2f} ({:.2f})'.format(mean, std_dev),
        averages.write('{:.2f}'.format(mean))  # note std dev not written
        if i % num_cols != num_cols-1:  # not last column?
            averages.write(' ')  # delimiter between values on line
        else:
            print  # newline
            averages.write('\n')

I am not sure which aspect of the process is giving you the problem, but I will just answer specifically about getting the averages of all the dat files.
Assuming a data structure like this:
72 12 94 79 76 5 30 98 97 48
79 95 63 74 70 18 92 20 32 50
77 88 60 98 19 17 14 66 80 24
...
Getting averages of the files:
import glob
import itertools

avgs = []
for datpath in glob.iglob("*.dat"):
    with open(datpath, 'r') as f:
        str_nums = itertools.chain.from_iterable(i.strip().split() for i in f)
        nums = map(int, str_nums)
        avg = sum(nums) / len(nums)
        avgs.append(avg)
print avgs
It loops over each .dat file, reads and joins the lines, converts them to int (could be float if you want), and appends the average.
If these files are enormous and you are concerned about the amount of memory used when reading them in, you could loop over each line explicitly and only keep running counters, the way your original example was doing:
for datpath in glob.iglob("*.dat"):
    with open(datpath, 'r') as f:
        count = 0
        total = 0
        for line in f:
            nums = [int(i) for i in line.strip().split()]
            count += len(nums)
            total += sum(nums)
        avgs.append(total / count)
Note: I am not handling exceptional cases, such as the file being empty and producing a Divide By Zero situation.
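If what you actually need is an element-wise average across files (one output value per row/column position, so the result keeps the original m x n layout), numpy makes that compact, provided every file really does have the same shape. A minimal sketch, assuming numpy is available and using means.dat as a placeholder output name:
import glob
import numpy as np

paths = sorted(glob.glob("*.dat"))
if paths:
    # loadtxt parses whitespace-separated numbers as floats;
    # stacking gives one array of shape (num_files, m, n)
    stack = np.stack([np.loadtxt(p) for p in paths])
    means = stack.mean(axis=0)  # element-wise average, shape (m, n)
    np.savetxt("means.dat", means, fmt="%.6g")
This trades memory for simplicity: all files are loaded at once, which is fine for 50-60 modest files but not for huge ones.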

Related

How to create multiple files based on the size of the file in a nested for loop?

I am writing a Python script for creating multiple files based on the file size.
For example:
Create another file when the size reaches 10MB.
Regarding the sample script below, which contains what I have tried so far: it creates multiple files, but not based on size:
global fname
x = 1

def IP():
    limit = 1
    i = 255
    for j in range(1, 3):
        fname = "new_file" + str(x) + ".txt"
        global x
        x += 1
        with open(fname, "a") as new:
            for k in range(1, 200):
                for l in range(1, 200):
                    new.write("IP is: %d.%d.%d.%d\n" % (i, j, k, l))

IP()
Output:
new_file1.txt
new_file2.txt
You can simply test your file's size as you write. You might need to use .flush() to force the file's contents to be written physically before checking its size; otherwise the OS decides when to flush, which might be at 4k or 8k of data in the buffer.
import os
import random

def rnd():
    r = random.randint
    return [r(1, 255), r(1, 255), r(1, 255), r(1, 255)]

x = 1
while True:
    with open("myfile_{}.txt".format(x), "a") as f:
        x += 1
        # reads flushed sizes only
        while os.fstat(f.fileno()).st_size < 300:  # size criterion
            f.write("{}.{}.{}.{}\n".format(*rnd()))
            f.flush()  # persist to disk so size checking works - this degrades performance
    if x > 10:
        break

for f in os.listdir("./"):
    print(f, os.stat(f).st_size)
Output:
myfile_10.txt 304
myfile_9.txt 315
myfile_8.txt 313
myfile_4.txt 302
myfile_6.txt 305
myfile_2.txt 306
myfile_5.txt 300
main.py 447
myfile_7.txt 308
myfile_3.txt 303
myfile_1.txt 304
If you just need a guesstimate, you could also use the file handle's .tell() method to get your current position inside the stream you are writing. This is not the file size, though.
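For completeness, a tiny sketch of that .tell() idea (the file name is a placeholder); the value is the current stream position, which tracks the bytes written but is not guaranteed to match the flushed on-disk size:
with open("some_output.txt", "a") as f:
    f.write("192.168.0.1\n")
    approx_size = f.tell()  # position in the stream, not the flushed file size
    print(approx_size)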

How to get approximate line number of large files

I have CSV files with up to 10M+ rows. I am attempting to get the total line count of a file so I can split the processing of each file across multiple processes. To do this, I set a start and end line for each sub-process to handle. This cuts my processing time from 180s to 110s for a file size of 2GB. However, it requires knowing the line count. If I attempt to get the exact line count, it takes ~30 seconds. I feel like this time is wasted: an approximate count, with the final thread possibly having to read an extra hundred thousand lines or so, would only add a couple of seconds, as opposed to the 30 seconds it takes to get the exact line count.
How would I go about getting an approximate line count for files? I would like this estimate to be within 1 million lines (preferably within a couple hundred thousand lines). Would something like this be possible?
This will be fairly inaccurate, but it gets the size of one row and divides the size of the file by it.
import csv
import os

with open("example.csv", newline="") as f:
    reader = csv.reader(f)
    row1 = next(reader)

# approximate byte length of the first line (re-join the fields, add one for the newline)
_Size = len(",".join(row1)) + 1

print("Size of Line 1 >", _Size)
print("Size of File >", os.path.getsize("example.csv"))
print("Approx Lines >", os.path.getsize("example.csv") / _Size)
(Edit) If you change the last line to math.floor(os.path.getsize("example.csv") / _Size), it's actually quite accurate.
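A slightly more robust variant of the same idea is to sample the first few hundred lines, average their byte lengths, and divide the file size by that average; it is still only an estimate, but it is less sensitive to one unusually short or long first line. A sketch, with example.csv as a placeholder name and assuming the file is not empty:
import os

SAMPLE_LINES = 500

with open("example.csv", "rb") as f:
    # read at most SAMPLE_LINES lines; binary mode so len() counts bytes
    sample = [line for _, line in zip(range(SAMPLE_LINES), f)]

avg_line_bytes = sum(len(line) for line in sample) / len(sample)
approx_lines = int(os.path.getsize("example.csv") / avg_line_bytes)
print("Approx lines:", approx_lines)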
I'd suggest you split the file into chunks of similar size, before even parsing.
The example code below will split data.csv into 4 chunks of approximately equal size, by seeking and searching for the next line break. It'll then call launch_worker() for each chunk, indicating the start offset and length of the data that worker should handle.
Ideally you'd use a subprocess for each worker.
import os

n_workers = 4

# open the log file, and find out how long it is
f = open('data.csv', 'rb')
length_total = f.seek(0, os.SEEK_END)

# split the file evenly among n workers
length_worker = int(length_total / n_workers)

prev_worker_end = 0
for i in range(n_workers):
    # seek to the next worker's approximate start
    file_pos = f.seek(prev_worker_end + length_worker, os.SEEK_SET)
    # see if we tried to seek past the end of the file... the last worker probably will
    if file_pos >= length_total:  # <-- (3)
        # ... if so, this worker's chunk extends to the end of the file
        this_worker_end = length_total
    else:
        # ... otherwise, look for the next line break
        buf = f.read(256)  # <-- (1)
        next_line_end = buf.index(b'\n')  # <-- (2)
        this_worker_end = file_pos + next_line_end
    # calculate how long this worker's chunk is
    this_worker_length = this_worker_end - prev_worker_end
    if this_worker_length > 0:
        # if there is any data in the chunk, then try to launch a worker
        launch_worker(prev_worker_end, this_worker_length)
    # remember where the last worker got to in the file
    prev_worker_end = this_worker_end + 1
Some expansion on the markers in the code:
(1) You'll need to make sure that the read() will consume at least an entire line. Alternatively, you could loop to perform multiple read()s if you don't know up front how long a line could be.
(2) This assumes \n line endings... you may need to modify this for your data.
(3) The last worker will get slightly less data to handle than the others... this is because we always search forwards for the next line break. The more workers you have, the less data the final worker gets. It's not very significant (~200-500 bytes in my testing).
Make sure you always use binary mode, as text mode can give you wonky seek()s / read()s.
An example launch_worker() would look like this:
def launch_worker(offset, length):
    print('Starting a worker... using chunk %d - %d (%d bytes)...'
          % (offset, offset + length, length))
    with open('data.csv', 'rb') as f:
        f.seek(offset, os.SEEK_SET)
        worker_buf = f.read(length)
    lines = worker_buf.split(b'\n')
    print('First Line:')
    print('\t' + str(lines[0]))
    print('Last Line:')
    print('\t' + str(lines[-1]))

Python, process a large text file in parallel

Samples records in the data file (SAM file):
M01383 0 chr4 66439384 255 31M * 0 0 AAGAGGA GFAFHGD MD:Z:31 NM:i:0
M01382 0 chr1 241995435 255 31M * 0 0 ATCCAAG AFHTTAG MD:Z:31 NM:i:0
......
The data files are line-based.
The sizes of the data files vary from 1G to 5G.
I need to go through the records in the data file line by line, get a particular value (e.g. the 4th value, 66439384) from each line, and pass this value to another function for processing. Then some result counters will be updated.
the basic workflow is like this:
# global variables; counters will be updated in search_function according to the value passed
counter_a = 0
counter_b = 0
counter_c = 0

open textfile:
    for line in textfile:
        value = line.split()[3]
        search_function(value)  # this function takes a bit of a long time to process

def search_function(value):
    some conditions checking:
        update the counter_a or counter_b or counter_c
With single-process code and a ~1.5G data file, it took about 20 hours to run through all the records in one data file. I need much faster code because there are more than 30 data files of this kind.
I was thinking of processing the data file in N chunks in parallel, with each chunk performing the above workflow and updating the global counters (counter_a, counter_b, counter_c) simultaneously. But I don't know how to achieve this in code, or whether it will work.
I have access to a server machine with: 24 processors and around 40G RAM.
Anyone could help with this? Thanks very much.
The simplest approach would probably be to do all 30 files at once with your existing code -- would still take all day, but you'd have all the files done at once. (ie, "9 babies in 9 months" is easy, "1 baby in 1 month" is hard)
If you really want to get a single file done faster, it will depend on how your counters actually update. If almost all the work is just in analysing value you can offload that using the multiprocessing module:
import time
import multiprocessing

def slowfunc(value):
    time.sleep(0.01)
    return value**2 + 0.3*value + 1

counter_a = counter_b = counter_c = 0
def add_to_counter(res):
    global counter_a, counter_b, counter_c
    counter_a += res
    counter_b -= (res - 10)**2
    counter_c += (int(res) % 2)

pool = multiprocessing.Pool(50)
results = []

for value in range(100000):
    r = pool.apply_async(slowfunc, [value])
    results.append(r)

    # don't let the queue grow too long
    if len(results) == 1000:
        results[0].wait()

    while results and results[0].ready():
        r = results.pop(0)
        add_to_counter(r.get())

for r in results:
    r.wait()
    add_to_counter(r.get())

print counter_a, counter_b, counter_c
That will allow 50 slowfuncs to run in parallel, so instead of taking 1000s (=100k*0.01s), it takes 20s (100k/50)*0.01s to complete. If you can restructure your function into "slowfunc" and "add_to_counter" like the above, you should be able to get a factor of 24 speedup.
Read one file at a time, use all CPUs to run search_function():
#!/usr/bin/env python
from multiprocessing import Array, Pool

def init(counters_):  # called for each child process
    global counters
    counters = counters_

def search_function(value):  # assume it is a CPU-intensive task
    # some conditions checking:
    #     update counter 'a' or counter 'b' or counter 'c'
    counters[0] += 1  # counter 'a'
    counters[1] += 1  # counter 'b'
    return value, result, error

if __name__ == '__main__':
    counters = Array('i', [0]*3)
    pool = Pool(initializer=init, initargs=[counters])
    values = (line.split()[3] for line in textfile)
    for value, result, error in pool.imap_unordered(search_function, values,
                                                    chunksize=1000):
        if error is not None:
            print('value: {value}, error: {error}'.format(**vars()))
    pool.close()
    pool.join()
    print(list(counters))
Make sure (for example, by writing wrappers) that exceptions do not escape next(values), search_function().
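One way to do that wrapping, sketched below, is to catch exceptions inside the worker and hand them back as data, so the pool never sees an unhandled exception; do_search here is a placeholder for the real per-value check:
def safe_search_function(value):
    # return (value, result, error) so the parent can report failures
    try:
        result = do_search(value)  # placeholder for the real, CPU-intensive work
        return value, result, None
    except Exception as e:
        return value, None, e
The pool would then call imap_unordered(safe_search_function, values, ...) and errors arrive as ordinary results instead of crashing a worker.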
This solution works on a set of files.
For each file, it divides it into a specified number of line-aligned chunks, solves each chunk in parallel, then combines the results.
It streams each chunk from disk; this is somewhat slower, but does not consume nearly so much memory. We depend on disk cache and buffered reads to prevent head thrashing.
Usage is like
python script.py -n 16 sam1.txt sam2.txt sam3.txt
and script.py is
import argparse
from io import SEEK_END
import multiprocessing as mp

#
# Worker process
#
def summarize(fname, start, stop):
    """
    Process file[start:stop]

    start and stop both point to first char of a line (or EOF)
    """
    a = 0
    b = 0
    c = 0

    with open(fname, newline='') as inf:
        # jump to start position
        pos = start
        inf.seek(pos)

        for line in inf:
            value = int(line.split()[3])

            # *** START EDIT HERE ***
            #
            # update a, b, c based on value
            #
            # *** END EDIT HERE ***

            pos += len(line)
            if pos >= stop:
                break

    return a, b, c

def main(num_workers, sam_files):
    print("{} workers".format(num_workers))
    pool = mp.Pool(processes=num_workers)

    # for each input file
    for fname in sam_files:
        print("Dividing {}".format(fname))

        # decide how to divide up the file
        with open(fname) as inf:
            # get file length
            inf.seek(0, SEEK_END)
            f_len = inf.tell()

            # find break-points
            starts = [0]
            for n in range(1, num_workers):
                # jump to approximate break-point
                inf.seek(n * f_len // num_workers)
                # find start of next full line
                inf.readline()
                # store offset
                starts.append(inf.tell())

        # do it!
        stops = starts[1:] + [f_len]
        start_stops = zip(starts, stops)
        print("Solving {}".format(fname))
        # apply_async so the chunks actually run in parallel
        jobs = [pool.apply_async(summarize, args=(fname, start, stop)) for start, stop in start_stops]
        results = [job.get() for job in jobs]

        # collect results
        results = [sum(col) for col in zip(*results)]
        print(results)

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description='Parallel text processor')
    parser.add_argument('--num_workers', '-n', default=8, type=int)
    parser.add_argument('sam_files', nargs='+')
    args = parser.parse_args()
    main(args.num_workers, args.sam_files)
What you don't want to do is hand whole files to individual CPUs. If you do, the file opens/reads will likely cause the heads to bounce randomly all over the disk, because the files are likely to be scattered across the disk.
Instead, break each file into chunks and process the chunks.
Open the file with one CPU. Read the whole thing into an array Text. You want to do this in one massive read to prevent the heads from thrashing around the disk, under the assumption that your file(s) are placed on the disk in relatively large sequential chunks.
Divide its size in bytes by N, giving a (global) value K, the approximate number of bytes each CPU should process. Fork N threads, and hand each thread i its index i and a copied handle for each file.
Each thread i starts a thread-local scan pointer p into Text at offset i*K. It scans the text, incrementing p, ignoring the text until a newline is found. At this point, it starts processing lines (incrementing p as it scans them). It stops after processing a line when its index into the Text file is greater than (i+1)*K.
If the amount of work per line is about equal, your N cores will all finish about the same time.
(If you have more than one file, you can then start the next one).
If you know that the file sizes are smaller than memory, you might arrange the file reads to be pipelined, e.g., while the current file is being processed, a file-read thread is reading the next file.
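A minimal sketch of that pipelining idea, using a background reader thread and a one-slot queue so the next file is loaded while the current one is processed; process_text is a placeholder for whatever per-file work you do (on Python 2 the module is named Queue rather than queue):
import threading
import queue

def reader(paths, q):
    for path in paths:
        with open(path, "rb") as f:
            q.put((path, f.read()))  # one big sequential read per file
    q.put(None)  # sentinel: no more files

def process_all(paths, process_text):
    q = queue.Queue(maxsize=1)  # keep at most one file buffered ahead
    t = threading.Thread(target=reader, args=(paths, q))
    t.start()
    while True:
        item = q.get()
        if item is None:
            break
        path, text = item
        process_text(path, text)  # runs while the reader loads the next file
    t.join()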

Efficient file buffering & scanning methods for large files in python

The description of the problem I am having is a bit complicated, and I will err on the side of providing more complete information. For the impatient, here is the briefest way I can summarize it:
What is the fastest (least execution time) way to split a text file into ALL (overlapping) substrings of size N (bound N, e.g. 36), while throwing out newline characters?
I am writing a module which parses files in the FASTA ascii-based genome format. These files comprise what is known as the 'hg18' human reference genome, which you can download from the UCSC genome browser (go slugs!) if you like.
As you will notice, the genome files are composed of chr[1..22].fa and chr[XY].fa, as well as a set of other small files which are not used in this module.
Several modules already exist for parsing FASTA files, such as BioPython's SeqIO. (Sorry, I'd post a link, but I don't have the points to do so yet.) Unfortunately, every module I've been able to find doesn't do the specific operation I am trying to do.
My module needs to split the genome data ('CAGTACGTCAGACTATACGGAGCTA' could be a line, for instance) into every single overlapping N-length substring. Let me give an example using a very small file (the actual chromosome files are between 355 and 20 million characters long) and N=8:
>>>import cStringIO
>>>example_file = cStringIO.StringIO("""\
>header
CAGTcag
TFgcACF
""")
>>>for read in parse(example_file):
... print read
...
CAGTCAGTF
AGTCAGTFG
GTCAGTFGC
TCAGTFGCA
CAGTFGCAC
AGTFGCACF
The function I found to have the best performance among the methods I could think of is this:
def parse(file):
    size = 8  # of course in my code this is a function argument
    file.readline()  # skip past the header
    buffer = ''
    for line in file:
        buffer += line.rstrip().upper()
        while len(buffer) >= size:
            yield buffer[:size]
            buffer = buffer[1:]
This works, but unfortunately it still takes about 1.5 hours (see note below) to parse the human genome this way. Perhaps this is the very best I am going to see with this method (a complete code refactor might be in order, but I'd like to avoid it as this approach has some very specific advantages in other areas of the code), but I thought I would turn this over to the community.
Thanks!
Note, this time includes a lot of extra calculation, such as computing the opposing strand read and doing hashtable lookups on a hash of approximately 5G in size.
Post-answer conclusion: It turns out that using fileobj.read() and then manipulating the resulting string (string.replace(), etc.) took relatively little time and memory compared to the remainder of the program, and so I used that approach. Thanks everyone!
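For reference, a sketch of that whole-file approach; the extra bookkeeping the real code does (opposing strand, hashtable lookups) is not shown, only the read-and-flatten step:
def parse_whole(fileobj, size=8):
    fileobj.readline()  # skip past the FASTA header line
    whole = fileobj.read().replace("\n", "").upper()
    for offset in range(len(whole) - size + 1):
        yield whole[offset:offset + size]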
Could you mmap the file and start pecking through it with a sliding window? I wrote a stupid little program that runs pretty small:
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
sarnold 20919 0.0 0.0 33036 4960 pts/2 R+ 22:23 0:00 /usr/bin/python ./sliding_window.py
Working through a 636229 byte fasta file (found via http://biostar.stackexchange.com/questions/1759) took .383 seconds.
#!/usr/bin/python

import mmap
import os

def parse(string, size):
    stride = 8
    start = string.find("\n")
    while start < size - stride:
        print string[start:start+stride]
        start += 1

fasta = open("small.fasta", 'r')
fasta_size = os.stat("small.fasta").st_size

fasta_map = mmap.mmap(fasta.fileno(), 0, mmap.MAP_PRIVATE, mmap.PROT_READ)
parse(fasta_map, fasta_size)
Some classic IO-bound improvements:
Use a lower-level read operation like os.read and read into a large fixed buffer.
Use threading/multiprocessing where one thread reads and buffers and the other processes.
If you have multiple processors/machines, use multiprocessing/MQ to divvy up processing across CPUs, map-reduce style.
Using a lower-level read operation wouldn't be that much of a rewrite. The others would be pretty large rewrites.
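A sketch of the lower-level read idea mentioned above, pulling fixed-size blocks with os.read and carrying the tail of each block over so windows that straddle a block boundary are not lost. The block size is an arbitrary choice, header handling is omitted, and this is POSIX-flavoured (no os.O_BINARY):
import os

def windows_os_read(path, size=8, block_size=1 << 20):
    fd = os.open(path, os.O_RDONLY)
    try:
        carry = b""
        while True:
            block = os.read(fd, block_size)
            if not block:
                break
            buf = carry + block.replace(b"\n", b"").upper()
            for start in range(len(buf) - size + 1):
                yield buf[start:start + size]
            # keep the last size-1 bytes so boundary-straddling windows appear next round
            carry = buf[-(size - 1):] if len(buf) >= size else buf
    finally:
        os.close(fd)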
I suspect the problem is that you have so much data stored in string format, which is really wasteful for your use case, that you're running out of real memory and thrashing swap. 128 GB should be enough to avoid this... :)
Since you've indicated in comments that you need to store additional information anyway, a separate class which references a parent string would be my choice. I ran a short test using chr21.fa from chromFa.zip from hg18; the file is about 48MB and just under 1M lines. I only have 1GB of memory here, so I simply discard the objects afterwards. This test thus won't show problems with fragmentation, cache, or related, but I think it should be a good starting point for measuring parsing throughput:
import mmap
import os
import time
import sys

class Subseq(object):
    __slots__ = ("parent", "offset", "length")

    def __init__(self, parent, offset, length):
        self.parent = parent
        self.offset = offset
        self.length = length

    # these are discussed in comments:
    def __str__(self):
        return self.parent[self.offset:self.offset + self.length]

    def __hash__(self):
        return hash(str(self))

    def __getitem__(self, index):
        # doesn't currently handle slicing
        assert 0 <= index < self.length
        return self.parent[self.offset + index]

    # other methods

def parse(file, size=8):
    file.readline()  # skip header
    whole = "".join(line.rstrip().upper() for line in file)
    for offset in xrange(0, len(whole) - size + 1):
        yield Subseq(whole, offset, size)

class Seq(object):
    __slots__ = ("value", "offset")
    def __init__(self, value, offset):
        self.value = value
        self.offset = offset

def parse_sep_str(file, size=8):
    file.readline()  # skip header
    whole = "".join(line.rstrip().upper() for line in file)
    for offset in xrange(0, len(whole) - size + 1):
        yield Seq(whole[offset:offset + size], offset)

def parse_plain_str(file, size=8):
    file.readline()  # skip header
    whole = "".join(line.rstrip().upper() for line in file)
    for offset in xrange(0, len(whole) - size + 1):
        yield whole[offset:offset+size]

def parse_tuple(file, size=8):
    file.readline()  # skip header
    whole = "".join(line.rstrip().upper() for line in file)
    for offset in xrange(0, len(whole) - size + 1):
        yield (whole, offset, size)

def parse_orig(file, size=8):
    file.readline()  # skip header
    buffer = ''
    for line in file:
        buffer += line.rstrip().upper()
        while len(buffer) >= size:
            yield buffer[:size]
            buffer = buffer[1:]

def parse_os_read(file, size=8):
    file.readline()  # skip header
    file_size = os.fstat(file.fileno()).st_size
    whole = os.read(file.fileno(), file_size).replace("\n", "").upper()
    for offset in xrange(0, len(whole) - size + 1):
        yield whole[offset:offset+size]

def parse_mmap(file, size=8):
    file.readline()  # skip past the header
    buffer = ""
    for line in file:
        buffer += line
        if len(buffer) >= size:
            for start in xrange(0, len(buffer) - size + 1):
                yield buffer[start:start + size].upper()
            buffer = buffer[-(len(buffer) - size + 1):]
    for start in xrange(0, len(buffer) - size + 1):
        yield buffer[start:start + size]

def length(x):
    return sum(1 for _ in x)

def duration(secs):
    return "%dm %ds" % divmod(secs, 60)

def main(argv):
    tests = [parse, parse_sep_str, parse_tuple, parse_plain_str, parse_orig, parse_os_read]

    n = 0
    for fn in tests:
        n += 1
        with open(argv[1]) as f:
            start = time.time()
            length(fn(f))
            end = time.time()
        print "%d %-20s %s" % (n, fn.__name__, duration(end - start))

    fn = parse_mmap
    n += 1
    with open(argv[1]) as f:
        f = mmap.mmap(f.fileno(), 0, mmap.MAP_PRIVATE, mmap.PROT_READ)
        start = time.time()
        length(fn(f))
        end = time.time()
    print "%d %-20s %s" % (n, fn.__name__, duration(end - start))

if __name__ == "__main__":
    sys.exit(main(sys.argv))
1 parse 1m 42s
2 parse_sep_str 1m 42s
3 parse_tuple 0m 29s
4 parse_plain_str 0m 36s
5 parse_orig 0m 45s
6 parse_os_read 0m 34s
7 parse_mmap 0m 37s
The first four are my code, while orig is yours and the last two are from other answers here.
User-defined objects are much more costly to create and collect than tuples or plain strings! This shouldn't be that surprising, but I had not realized it would make this much of a difference (compare #1 and #3, which really only differ in a user-defined class vs tuple). You said you want to store additional information, like offset, with the string anyway (as in the parse and parse_sep_str cases), so you might consider implementing that type in a C extension module. Look at Cython and related if you don't want to write C directly.
Case #1 and #2 being identical is expected: by pointing to a parent string, I was trying to save memory rather than processing time, but this test doesn't measure that.
I have a function that processes a text file, using buffered reads and writes and parallel computing with an async pool of worker processes. I have an AMD with 2 cores and 8GB RAM, running GNU/Linux, and it can process 300,000 lines in less than 1 second, 1,000,000 lines in approximately 4 seconds, and approximately 4,500,000 lines (more than 220MB) in approximately 20 seconds:
# -*- coding: utf-8 -*-
import sys
from multiprocessing import Pool

def process_file(f, fo="result.txt", fi=sys.argv[1]):
    fi = open(fi, "r", 4096)
    fo = open(fo, "w", 4096)
    b = []
    x = 0
    result = None
    pool = None
    for line in fi:
        b.append(line)
        x += 1
        if (x % 200000) == 0:
            if pool == None:
                pool = Pool(processes=20)
            if result == None:
                result = pool.map_async(f, b)
            else:
                presult = result.get()
                result = pool.map_async(f, b)
                for l in presult:
                    fo.write(l)
            b = []
    if not result == None:
        for l in result.get():
            fo.write(l)
    if not b == []:
        for l in b:
            fo.write(f(l))
    fo.close()
    fi.close()
The first argument is a function that receives one line, processes it, and returns the result to be written to the output file; the next is the output file name; and the last is the input file name (you can omit the last argument if you take the input file as the first parameter of your script).
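A usage sketch under those assumptions, where uppercase_line is just a placeholder per-line worker and the input file name comes from sys.argv[1] as in the code above:
def uppercase_line(line):
    # placeholder: must accept one line and return the line to write out
    return line.upper()

if __name__ == "__main__":
    process_file(uppercase_line, fo="result.txt")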

Reading text files into a list, then storing in a dictionary, fills system memory? (What am I doing wrong?) Python

I have 43 text files that consume "232.2 MB on disk (232,129,355 bytes) for 43 items". I want to read them into memory (see code below). The problem I am having is that each file, which is about 5.3MB on disk, causes Python to use an additional 100MB of system memory. I check the size of the dict with getsizeof() (see sample output below). Even when Python is up to 3GB of system memory, getsizeof() on the dict reports only 6424 bytes of memory. I don't understand what is using the memory.
What is using up all the memory?
The related link is different in that the reported memory use by python was "correct"
related question
I am not very interested in other solutions (a DB, etc.); I am more interested in understanding what is happening so I know how to avoid it in the future. That said, using other Python built-ins such as array rather than lists is a great suggestion if it helps.
I have heard suggestions of using guppy to find out what is using the memory.
sample output:
Loading into memory: ME49_800.txt
ME49_800.txt has 228484 rows of data
ME49_800.txt has 0 rows of masked data
ME49_800.txt has 198 rows of outliers
ME49_800.txt has 0 modified rows of data
280bytes of memory used for ME49_800.txt
43 files of 43 using 12568 bytes of memory
120
Sample data:
CellHeader=X Y MEAN STDV NPIXELS
0 0 120.0 28.3 25
1 0 6924.0 1061.7 25
2 0 105.0 17.4 25
Code:
import csv, os, glob
import sys

def read_data_file(filename):
    reader = csv.reader(open(filename, "U"), delimiter='\t')
    fname = os.path.split(filename)[1]
    data = []
    mask = []
    outliers = []
    modified = []

    maskcount = 0
    outliercount = 0
    modifiedcount = 0

    for row in reader:
        if '[MASKS]' in row:
            maskcount = 1
        if '[OUTLIERS]' in row:
            outliercount = 1
        if '[MODIFIED]' in row:
            modifiedcount = 1

        if row:
            if not any((maskcount, outliercount, modifiedcount)):
                data.append(row)
            elif not any((not maskcount, outliercount, modifiedcount)):
                mask.append(row)
            elif not any((not maskcount, not outliercount, modifiedcount)):
                outliers.append(row)
            elif not any((not maskcount, not outliercount, not modifiedcount)):
                modified.append(row)
            else: print '***something went wrong***'

    data = data[1:]
    mask = mask[3:]
    outliers = outliers[3:]
    modified = modified[3:]

    filedata = dict(zip((fname + '_data', fname + '_mask', fname + '_outliers', fname + '_modified'),
                        (data, mask, outliers, modified)))
    return filedata

def ImportDataFrom(folder):
    alldata = {}
    infolder = glob.glob(os.path.join(folder, '*.txt'))
    numfiles = len(infolder)
    print 'Importing files from: ', folder
    print 'Importing ' + str(numfiles) + ' files from: ', folder

    for infile in infolder:
        fname = os.path.split(infile)[1]
        print "Loading into memory: " + fname

        filedata = read_data_file(infile)
        alldata.update(filedata)

        print fname + ' has ' + str(len(filedata[fname + '_data'])) + ' rows of data'
        print fname + ' has ' + str(len(filedata[fname + '_mask'])) + ' rows of masked data'
        print fname + ' has ' + str(len(filedata[fname + '_outliers'])) + ' rows of outliers'
        print fname + ' has ' + str(len(filedata[fname + '_modified'])) + ' modified rows of data'
        print str(sys.getsizeof(filedata)) + ' bytes of memory used for ' + fname

    print str(len(alldata)/4) + ' files of ' + str(numfiles) + ' using ' + str(sys.getsizeof(alldata)) + ' bytes of memory'
    #print alldata.keys()
    print str(sys.getsizeof(ImportDataFrom))
    print ' '
    return alldata

ImportDataFrom("/Users/vmd/Dropbox/dna/data/rawdata")
The dictionary itself is very small - the bulk of the data is the whole content of the files stored in lists, containing one tuple per line. The 20x size increase is bigger than I expected but seems to be real. Splitting a 27-bytes line from your example input into a tuple, gives me 309 bytes (counting recursively, on a 64-bit machine). Add to this some unknown overhead of memory allocation, and 20x is not impossible.
Alternatives: for a more compact representation, you want to convert the strings to integers/floats and pack them tightly (without all those pointers and separate objects). I'm talking about not just one row (although that's a start), but a whole list of rows together, so each file will be represented by just four 2D arrays of numbers. The array module is a start, but really what you need here are numpy arrays:
# Using explicit field types for compactness and access by name
# (e.g. data[i]['mean'] == data[i][2]).
fields = [('x', int), ('y', int), ('mean', float),
          ('stdv', float), ('npixels', int)]

# The simplest way is to build lists as you do now, and convert them
# to a numpy array when done.
data = numpy.array(data, dtype=fields)
mask = numpy.array(mask, dtype=fields)
...
This gives me 40 bytes spent per row (measured on the .data attribute; sys.getsizeof reports that the array has a constant overhead of 80 bytes but doesn't see the actual data used). This is still ~1.5x more than the original files, but should easily fit into RAM.
I see 2 of your fields are labeled "x" and "y" - if your data is dense, you could arrange it by them - data[x,y]==... - instead of just storing (x,y,...) records. Besides being slightly more compact, it would be the most sensible structure, allowing easier processing.
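A sketch of that dense layout, assuming x and y are small non-negative integers and the grid is fully populated; the field names follow the CellHeader in the sample data, and the two example rows are taken from it:
import numpy as np

# parsed (x, y, mean, stdv, npixels) tuples from one file
rows = [(0, 0, 120.0, 28.3, 25), (1, 0, 6924.0, 1061.7, 25)]

nx = max(r[0] for r in rows) + 1
ny = max(r[1] for r in rows) + 1
mean = np.zeros((nx, ny))
stdv = np.zeros((nx, ny))
npixels = np.zeros((nx, ny), dtype=int)

for x, y, m, s, n in rows:
    mean[x, y] = m
    stdv[x, y] = s
    npixels[x, y] = n

print(mean[1, 0])  # direct lookup by coordinates instead of scanning records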
If you need to handle even more data than your RAM will fit, pytables is a good library for efficient access to compact (even compressed) tabular data in files. (It's much better at this than general SQL DBs.)
This line specifically gets the size of the function object:
print str(sys.getsizeof(ImportDataFrom))
that's unlikely to be what you're interested in.
The size of a container does not include the size of the data it contains. Consider, for example:
>>> import sys
>>> d={}
>>> sys.getsizeof(d)
140
>>> d['foo'] = 'x'*99
>>> sys.getsizeof(d)
140
>>> d['foo'] = 'x'*9999
>>> sys.getsizeof(d)
140
If you want the size of the container plus the size of all contained things you have to write your own (presumably recursive) function that reaches inside containers and digs for every byte. Or, you can use third-party libraries such as Pympler or guppy.
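A rough sketch of such a recursive size function; it only descends into dicts, lists, tuples and sets, and uses a seen-set so shared objects are not counted twice:
import sys

def deep_getsizeof(obj, seen=None):
    # approximate total size of obj plus everything it contains
    seen = set() if seen is None else seen
    if id(obj) in seen:
        return 0
    seen.add(id(obj))
    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        size += sum(deep_getsizeof(k, seen) + deep_getsizeof(v, seen)
                    for k, v in obj.items())
    elif isinstance(obj, (list, tuple, set, frozenset)):
        size += sum(deep_getsizeof(item, seen) for item in obj)
    return size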
