Python MemoryError: cannot allocate array memory

I've got a 250 MB CSV file I need to read, with ~7000 rows and ~9000 columns. Each row represents an image, and each column is a pixel (greyscale value 0-255).
I started with a simple np.loadtxt("data/training_nohead.csv", delimiter=",") but this gave me a memory error. I thought this was strange since I'm running 64-bit Python with 8 GB of memory installed, and it died after using only around 512 MB.
I've since tried SEVERAL other tactics, including:
import fileinput and read one line at a time, appending them to an array
np.fromstring after reading in the entire file
np.genfromtxt
Manual parsing of the file (since all data is integers, this was fairly easy to code)
Every method gave me the same result. MemoryError around 512 MB. Wondering if there was something special about 512MB, I created a simple test program which filled up memory until python crashed:
str = " " * 511000000 # Start at 511 MB
while 1:
str = str + " " * 1000 # Add 1 KB at a time
Doing this didn't crash until around 1 gig. I also, just for fun, tried: str = " " * 2048000000 (fill 2 gigs) - this ran without a hitch. Filled the RAM and never complained. So the issue isn't the total amount of RAM I can allocate, but seems to be how many TIMES I can allocate memory...
I googled around fruitlessly until I found this post: Python out of memory on large CSV file (numpy)
I copied the code from the answer exactly:
def iter_loadtxt(filename, delimiter=',', skiprows=0, dtype=float):
    def iter_func():
        with open(filename, 'r') as infile:
            for _ in range(skiprows):
                next(infile)
            for line in infile:
                line = line.rstrip().split(delimiter)
                for item in line:
                    yield dtype(item)
        iter_loadtxt.rowlength = len(line)

    data = np.fromiter(iter_func(), dtype=dtype)
    data = data.reshape((-1, iter_loadtxt.rowlength))
    return data
Calling iter_loadtxt("data/training_nohead.csv") gave a slightly different error this time:
MemoryError: cannot allocate array memory
Googling this error I only found one, not so helpful, post: Memory error (MemoryError) when creating a boolean NumPy array (Python)
As I'm running Python 2.7, this was not my issue. Any help would be appreciated.

With some help from @J.F. Sebastian I developed the following answer:
train = np.empty([7049, 9246])
row = 0
for line in open("data/training_nohead.csv"):
    train[row] = np.fromstring(line, sep=",")
    row += 1
Of course this answer assumes prior knowledge of the number of rows and columns. If you don't have this information beforehand, the number of rows will always take a while to determine, as you have to read the entire file and count the \n characters. Something like this will suffice:
num_rows = 0
for line in open("data/training_nohead.csv"):
    num_rows += 1
For the number of columns: if every row has the same number of columns, you can just count the first row; otherwise you need to keep track of the maximum.
num_rows = 0
max_cols = 0
for line in open("data/training_nohead.csv"):
    num_rows += 1
    tmp = line.split(",")
    if len(tmp) > max_cols:
        max_cols = len(tmp)
This solution works best for numerical data, as a string containing a comma could really complicate things.
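If a field could contain a quoted comma, a simple split(",") will over-count the columns. A hedged alternative (not part of the original answer) is to let Python's csv module do the splitting; a minimal sketch, reusing the same file path:

import csv

num_rows = 0
max_cols = 0
with open("data/training_nohead.csv") as f:
    for row in csv.reader(f):        # csv.reader respects quoted fields
        num_rows += 1
        max_cols = max(max_cols, len(row))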

This is an old discussion, but it might help people in the present.
I think I know why str = str + " " * 1000 fails faster than str = " " * 2048000000.
When running the first one, I believe Python needs to allocate the new object str + " " * 1000 in memory, and only after that can it bind the name str to it. Before the name 'str' is rebound to the new object, it cannot free the old one.
This means two copies of the 'str' object have to exist in memory at the same time, so you can only reach about 1 gig instead of 2 gigs.
I believe the following code will get the same maximum memory out of your OS as a single allocation does:
str = " " * 511000000
while(1):
l = len(str)
str = " "
str = " " * (len + 1000)
Feel free to correct me if I am wrong.

Related

Python mmap - slow access to end of files [with test code]

I posted a similar question a few days ago but without any code, now I created a test code in hopes of getting some help.
Code is at the bottom.
I got some dataset where I have a bunch of large files (~100) and I want to extract specific lines from those files very efficiently (both in memory and in speed).
My code gets a list of relevant files and opens each one with [line 1], then maps the file to memory with [line 2]. For each file I receive a list of indices, and going over the indices I retrieve the relevant information (10 bytes for this example) like so: [lines 3-4]. Finally I close the handles with [lines 5-6].
binaryFile = open(path, "r+b")
binaryFile_mm = mmap.mmap(binaryFile.fileno(), 0)
for INDEX in INDEXES:
    information = binaryFile_mm[INDEX:INDEX + 10].decode("utf-8")
binaryFile_mm.close()
binaryFile.close()
This code runs in parallel, with thousands of indices for each file, and does that continuously several times a second for hours.
Now to the problem - The code runs well when I limit the indices to be small (meaning - when I ask the code to get information from the beginning of the file). But! when I increase the range of the indices, everything slows down to (almost) a halt AND the buff/cache memory gets full (I'm not sure if the memory issue is related to the slowdown).
So my question is why does it matter if I retrieve information from the beginning or the end of the file and how do I overcome this in order to get instant access to information from the end of the file without slowing down and increasing buff/cache memory use.
PS - some numbers and sizes: I've got ~100 files, each about 1 GB in size; when I limit the indices to the first 0%-10% of the file it runs fine, but when I allow the index to be anywhere in the file it stops working.
Code - tested on Linux and Windows with Python 3.5; requires 10 GB of storage (creates 3 files of random strings, 3 GB each)
import os, errno, sys
import random, time
import mmap

def create_binary_test_file():
    print("Creating files with 3,000,000,000 characters, takes a few seconds...")
    test_binary_file1 = open("test_binary_file1.testbin", "wb")
    test_binary_file2 = open("test_binary_file2.testbin", "wb")
    test_binary_file3 = open("test_binary_file3.testbin", "wb")
    for i in range(1000):
        if i % 100 == 0:
            print("progress - ", i / 10, " % ")
        # efficiently create random strings and write to files
        tbl = bytes.maketrans(bytearray(range(256)),
                              bytearray([ord(b'a') + b % 26 for b in range(256)]))
        random_string = os.urandom(3000000).translate(tbl)
        test_binary_file1.write(str(random_string).encode('utf-8'))
        test_binary_file2.write(str(random_string).encode('utf-8'))
        test_binary_file3.write(str(random_string).encode('utf-8'))
    test_binary_file1.close()
    test_binary_file2.close()
    test_binary_file3.close()
    print("Created binary file for testing. The file contains 3,000,000,000 characters")
# Opening binary test file
try:
    binary_file = open("test_binary_file1.testbin", "r+b")
except OSError as e:  # this would be "except OSError, e:" before Python 2.6
    if e.errno == errno.ENOENT:  # errno.ENOENT = no such file or directory
        create_binary_test_file()
        binary_file = open("test_binary_file1.testbin", "r+b")
## example of use - perform 100 times, in each iteration: open one of the binary files and retrieve 5,000 sample strings
## (if code runs fast and without a slowdown - increase the k or other numbers and it should reproduce the problem)

## Example 1 - getting information from start of file
print("Getting information from start of file")
etime = []
for i in range(100):
    start = time.time()
    binary_file_mm = mmap.mmap(binary_file.fileno(), 0)
    sample_index_list = random.sample(range(1, 100000 - 1000), k=50000)
    sampled_data = [[binary_file_mm[v:v + 1000].decode("utf-8")] for v in sample_index_list]
    binary_file_mm.close()
    binary_file.close()
    file_number = random.randint(1, 3)
    binary_file = open("test_binary_file" + str(file_number) + ".testbin", "r+b")
    etime.append(time.time() - start)
    if i % 10 == 9:
        print("Iter ", i, " \tAverage time - ", '%.5f' % (sum(etime[-9:]) / len(etime[-9:])))
binary_file.close()
## Example 2 - getting information from all of the file
print("Getting information from all of the file")
binary_file = open("test_binary_file1.testbin", "r+b")
etime = []
for i in range(100):
    start = time.time()
    binary_file_mm = mmap.mmap(binary_file.fileno(), 0)
    sample_index_list = random.sample(range(1, 3000000000 - 1000), k=50000)
    sampled_data = [[binary_file_mm[v:v + 1000].decode("utf-8")] for v in sample_index_list]
    binary_file_mm.close()
    binary_file.close()
    file_number = random.randint(1, 3)
    binary_file = open("test_binary_file" + str(file_number) + ".testbin", "r+b")
    etime.append(time.time() - start)
    if i % 10 == 9:
        print("Iter ", i, " \tAverage time - ", '%.5f' % (sum(etime[-9:]) / len(etime[-9:])))
binary_file.close()
My results: (The average time of getting information from all across the file is almost 4 times slower than getting it from the beginning; with ~100 files and parallel computing, this difference gets much bigger)
Getting information from start of file
Iter 9 Average time - 0.14790
Iter 19 Average time - 0.14590
Iter 29 Average time - 0.14456
Iter 39 Average time - 0.14279
Iter 49 Average time - 0.14256
Iter 59 Average time - 0.14312
Iter 69 Average time - 0.14145
Iter 79 Average time - 0.13867
Iter 89 Average time - 0.14079
Iter 99 Average time - 0.13979
Getting information from all of the file
Iter 9 Average time - 0.46114
Iter 19 Average time - 0.47547
Iter 29 Average time - 0.47936
Iter 39 Average time - 0.47469
Iter 49 Average time - 0.47158
Iter 59 Average time - 0.47114
Iter 69 Average time - 0.47247
Iter 79 Average time - 0.47881
Iter 89 Average time - 0.47792
Iter 99 Average time - 0.47681
The basic reason why you have this time difference is that you have to seek to where you need in the file. The further from position 0 you are, the longer it's going to take.
What might help: since you know the starting index you need, seek on the file descriptor to that point and then do the mmap. Or really, why bother with mmap in the first place? Just read the number of bytes you need from the seeked-to position and put that into your result variable.
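For illustration, a minimal sketch of that plain seek/read approach, assuming the path and INDEXES variables from the question (buffering=0 keeps Python's buffered I/O from reading ahead past the 10 bytes requested):

results = []
with open(path, "rb", buffering=0) as binaryFile:   # 'path' and 'INDEXES' as in the question
    for index in INDEXES:
        binaryFile.seek(index)                      # jump straight to the record
        results.append(binaryFile.read(10).decode("utf-8"))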
To determine if you're getting adequate performance, check the memory available for the buffer/page cache (free in Linux), I/O stats - the number of reads, their size and duration (iostat; compare with the specs of your hardware), and the CPU utilization of your process.
[edit] Assuming that you read from a locally attached SSD (without having the data you need in the cache):
When reading in a single thread, you should expect your batch of 50,000 reads to take more than 7 seconds (50000*0.000150). Probably longer because the 50k accesses of a mmap-ed file will trigger more or larger reads, as your accesses are not page-aligned - as I suggested in another Q&A I'd use simple seek/read instead (and open the file with buffering=0 to avoid unnecessary reads for Python buffered I/O).
With more threads/processes reading simultaneously, you can saturate your SSD throughput (how many 4 KB reads/s it can do - anywhere from 5,000 to 1,000,000), and then the individual reads will become even slower.
[/edit]
The first example only accesses 3*100KB of the files' data, so as you have much more than that available for the cache, all of the 300KB quickly end up in the cache, so you'll see no I/O, and your python process will be CPU-bound.
I'm 99.99% sure that if you test reading from the last 100KB of each file, it will perform as well as the first example - it's not about the location of the data, but about the size of the data accessed.
The second example accesses random portions from 9GB, so you can hope to see similar performance only if you have enough free RAM to cache all of the 9GB, and only after you preload the files into the cache, so that the testcase runs with zero I/O.
In realistic scenarios, the files will not be fully in the cache - so you'll see many I/O requests and much lower CPU utilization for python. As I/O is much slower than cached access, you should expect this example to run slower.

Python - Efficient way to flip bytes in a file?

I've got a folder full of very large files that need to be byte flipped by a power of 4. So essentially, I need to read the files as a binary, adjust the sequence of bits, and then write a new binary file with the bits adjusted.
In essence, what I'm trying to do is read a hex string hexString that looks like this:
"00112233AABBCCDD"
And write a file that looks like this:
"33221100DDCCBBAA"
(i.e. every two characters is one byte, and I need to reverse the byte order within each group of 4 bytes)
I am very new to python and coding in general, and the way I am currently accomplishing this task is extremely inefficient. My code currently looks like this:
import binascii
with open(myFile, 'rb') as f:
    content = f.read()
hexString = str(binascii.hexlify(content))
flippedBytes = ""
inc = 0
while inc < len(hexString):
    flippedBytes += hexString[inc + 6:inc + 8]
    flippedBytes += hexString[inc + 4:inc + 6]
    flippedBytes += hexString[inc + 2:inc + 4]
    flippedBytes += hexString[inc:inc + 2]
    inc += 8
# ... write the flippedBytes to file, etc.
The code I pasted above accurately accomplishes what I need (note, my actual code has a few extra lines of: "hexString.replace()" to remove unnecessary hex characters - but I've left those out to make the above easier to read). My ultimate problem is that it takes EXTREMELY long to run my code with larger files. Some of my files I need to flip are almost 2gb in size, and the code was going to take almost half a day to complete one single file. I've got dozens of files I need to run this on, so that timeframe simply isn't practical.
Is there a more efficient way to flip the HEX values in a file by a power of 4?
.... for what it's worth, there is a tool called WinHEX that can do this manually, and only takes a minute max to flip the whole file.... I was just hoping to automate this with python so we didn't have to manually use WinHEX each time
You want to convert your 4-byte integers from little-endian to big-endian, or vice-versa. You can use the struct module for that:
import struct

with open(myfile, 'rb') as infile, open(myoutput, 'wb') as of:
    while True:
        d = infile.read(4)
        if not d:
            break
        le = struct.unpack('<I', d)
        be = struct.pack('>I', *le)
        of.write(be)
Here is a little struct awesomeness to get you started:
>>> import struct
>>> s = b'\x00\x11\x22\x33\xAA\xBB\xCC\xDD'
>>> a, b = struct.unpack('<II', s)
>>> s = struct.pack('>II', a, b)
>>> ''.join([format(x, '02x') for x in s])
'33221100ddccbbaa'
To do this at full speed for a large input, use struct.iter_unpack (available since Python 3.4).
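A hedged sketch of that suggestion (Python 3.4+), reusing the myfile/myoutput names from the answer above and assuming the input size is an exact multiple of 4 bytes:

import struct

with open(myfile, 'rb') as infile, open(myoutput, 'wb') as of:
    out = bytearray()
    for (value,) in struct.iter_unpack('<I', infile.read()):
        out += struct.pack('>I', value)   # re-pack each little-endian int as big-endian
    of.write(out)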

Fastest way to process a large file?

I have multiple 3 GB tab delimited files. There are 20 million rows in each file. All the rows have to be independently processed, no relation between any two rows. My question is, what will be faster?
Reading line-by-line?
with open() as infile:
    for line in infile:
Reading the file into memory in chunks and processing it, say 250 MB at a time?
The processing is not very complicated; I am just grabbing the value in column 1 into List1, column 2 into List2, etc. Might need to add some column values together.
I am using Python 2.7 on a Linux box that has 30 GB of memory. The files are ASCII text.
Any way to speed things up in parallel? Right now I am using the former method and the process is very slow. Is using any CSVReader module going to help?
I don't have to do it in python, any other language or database use ideas are welcome.
It sounds like your code is I/O bound. This means that multiprocessing isn't going to help—if you spend 90% of your time reading from disk, having an extra 7 processes waiting on the next read isn't going to help anything.
And, while using a CSV reading module (whether the stdlib's csv or something like NumPy or Pandas) may be a good idea for simplicity, it's unlikely to make much difference in performance.
Still, it's worth checking that you really are I/O bound, instead of just guessing. Run your program and see whether your CPU usage is close to 0% or close to 100% of a core. Do what Amadan suggested in a comment, and run your program with just pass for the processing and see whether that cuts off 5% of the time or 70%. You may even want to try comparing with a loop over os.open and os.read(1024*1024) or something and see if that's any faster.
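For instance, a rough sketch of that raw-read timing check (the names here are illustrative, not from the original post): if this loop takes nearly as long as your full program, you are I/O bound.

import os, time

start = time.time()
fd = os.open(path, os.O_RDONLY)       # 'path' stands in for one of your 3 GB files
while True:
    chunk = os.read(fd, 1024 * 1024)  # read 1 MB at a time, no Python-level parsing
    if not chunk:
        break
os.close(fd)
print("raw read took %.2f seconds" % (time.time() - start))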
Since you're using Python 2.x, Python is relying on the C stdio library to guess how much to buffer at a time, so it might be worth forcing it to buffer more. The simplest way to do that is to use readlines(bufsize) for some large bufsize. (You can try different numbers and measure them to see where the peak is. In my experience, usually anything from 64K-8MB is about the same, but depending on your system that may be different—especially if you're, e.g., reading off a network filesystem with great throughput but horrible latency that swamps the throughput-vs.-latency of the actual physical drive and the caching the OS does.)
So, for example:
bufsize = 65536
with open(path) as infile:
    while True:
        lines = infile.readlines(bufsize)
        if not lines:
            break
        for line in lines:
            process(line)
Meanwhile, assuming you're on a 64-bit system, you may want to try using mmap instead of reading the file in the first place. This certainly isn't guaranteed to be better, but it may be better, depending on your system. For example:
import mmap

with open(path) as infile:
    m = mmap.mmap(infile.fileno(), 0, access=mmap.ACCESS_READ)
A Python mmap is sort of a weird object—it acts like a str and like a file at the same time, so you can, e.g., manually iterate scanning for newlines, or you can call readline on it as if it were a file. Both of those will take more processing from Python than iterating the file as lines or doing batch readlines (because a loop that would be in C is now in pure Python… although maybe you can get around that with re, or with a simple Cython extension?)… but the I/O advantage of the OS knowing what you're doing with the mapping may swamp the CPU disadvantage.
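For example, a minimal sketch of iterating an mmap line by line via readline (process() is a placeholder for your per-line work):

import mmap

with open(path, 'rb') as infile:
    m = mmap.mmap(infile.fileno(), 0, access=mmap.ACCESS_READ)
    while True:
        line = m.readline()   # behaves like file.readline(); returns b'' at EOF
        if not line:
            break
        process(line)
    m.close()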
Unfortunately, Python doesn't expose the madvise call that you'd use to tweak things in an attempt to optimize this in C (e.g., explicitly setting MADV_SEQUENTIAL instead of making the kernel guess, or forcing transparent huge pages)—but you can actually ctypes the function out of libc.
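A hedged, Linux-only sketch of that ctypes route; the MADV_SEQUENTIAL value and the address trick are assumptions to verify on your platform (newer Python, 3.8+, later added mmap.madvise natively):

import ctypes, ctypes.util, mmap

libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
MADV_SEQUENTIAL = 2   # Linux value from <sys/mman.h>; check your platform

with open(path, "r+b") as infile:
    m = mmap.mmap(infile.fileno(), 0)     # writable mapping so ctypes can take its address
    c_buf = ctypes.c_char.from_buffer(m)  # borrows the mapping's base address
    if libc.madvise(ctypes.c_void_p(ctypes.addressof(c_buf)),
                    ctypes.c_size_t(m.size()), MADV_SEQUENTIAL) != 0:
        raise OSError(ctypes.get_errno(), "madvise failed")
    del c_buf                             # release the exported buffer before closing
    # ... read from m as usual ...
    m.close()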
I know this question is old, but I wanted to do a similar thing, so I created a simple framework which helps you read and process a large file in parallel. I'm leaving what I tried as an answer.
This is the code; I give an example at the end.
import os
import time
import gc
import multiprocessing as mp
import psutil

def chunkify_file(fname, size=1024*1024*1000, skiplines=-1):
    """
    function to divide a large text file into chunks each having size ~= size so that the chunks are line aligned

    Params :
        fname : path to the file to be chunked
        size : size of each chunk is ~> this
        skiplines : number of lines in the beginning to skip, -1 means don't skip any lines
    Returns :
        start and end position of chunks in Bytes
    """
    chunks = []
    fileEnd = os.path.getsize(fname)
    with open(fname, "rb") as f:
        if skiplines > 0:
            for i in range(skiplines):
                f.readline()
        chunkEnd = f.tell()
        count = 0
        while True:
            chunkStart = chunkEnd
            f.seek(f.tell() + size, os.SEEK_SET)
            f.readline()  # make this chunk line aligned
            chunkEnd = f.tell()
            chunks.append((chunkStart, chunkEnd - chunkStart, fname))
            count += 1
            if chunkEnd > fileEnd:
                break
    return chunks

def parallel_apply_line_by_line_chunk(chunk_data):
    """
    function to apply a function to each line in a chunk

    Params :
        chunk_data : the data for this chunk
    Returns :
        list of the non-None results for this chunk
    """
    chunk_start, chunk_size, file_path, func_apply = chunk_data[:4]
    func_args = chunk_data[4:]
    t1 = time.time()
    chunk_res = []
    with open(file_path, "rb") as f:
        f.seek(chunk_start)
        cont = f.read(chunk_size).decode(encoding='utf-8')
        lines = cont.splitlines()
        for i, line in enumerate(lines):
            ret = func_apply(line, *func_args)
            if ret is not None:
                chunk_res.append(ret)
    return chunk_res

def parallel_apply_line_by_line(input_file_path, chunk_size_factor, num_procs, skiplines, func_apply, func_args, fout=None):
    """
    function to apply a supplied function line by line in parallel

    Params :
        input_file_path : path to input file
        chunk_size_factor : size of 1 chunk in MB
        num_procs : number of parallel processes to spawn, max used is num of available cores - 1
        skiplines : number of top lines to skip while processing
        func_apply : a function which expects a line and outputs None for lines we don't want processed
        func_args : arguments to function func_apply
        fout : do we want to output the processed lines to a file
    Returns :
        list of the non-None results obtained by processing each line
    """
    num_parallel = min(num_procs, psutil.cpu_count()) - 1
    jobs = chunkify_file(input_file_path, 1024 * 1024 * chunk_size_factor, skiplines)
    jobs = [list(x) + [func_apply] + func_args for x in jobs]
    print("Starting the parallel pool for {} jobs ".format(len(jobs)))
    lines_counter = 0
    pool = mp.Pool(num_parallel, maxtasksperchild=1000)  # without maxtasksperchild the workers linger and memory blows up
    outputs = []
    for i in range(0, len(jobs), num_parallel):
        print("Chunk start = ", i)
        t1 = time.time()
        chunk_outputs = pool.map(parallel_apply_line_by_line_chunk, jobs[i:i + num_parallel])
        for i, subl in enumerate(chunk_outputs):
            for x in subl:
                if fout is not None:
                    print(x, file=fout)
                else:
                    outputs.append(x)
                lines_counter += 1
        del chunk_outputs
        gc.collect()
        print("All Done in time ", time.time() - t1)
    print("Total lines we have = {}".format(lines_counter))
    pool.close()
    pool.terminate()
    return outputs
Say, for example, I have a file in which I want to count the number of words in each line; then the processing of each line would look like:
def count_words_line(line):
    return len(line.strip().split())
and then call the function like:
parallel_apply_line_by_line(input_file_path, 100, 8, 0, count_words_line, [], fout=None)
Using this, I get a speed up of ~8 times as compared to vanilla line by line reading on a sample file of size ~20GB in which I do some moderately complicated processing on each line.

Python MemoryError or ValueError in np.loadtxt and iter_loadtxt

My starting point was a problem with NumPy's function loadtxt:
X = np.loadtxt(filename, delimiter=",")
that gave a MemoryError in np.loadtxt(..). I googled it and came to this question on StackOverflow. That gave the following solution:
def iter_loadtxt(filename, delimiter=',', skiprows=0, dtype=float):
    def iter_func():
        with open(filename, 'r') as infile:
            for _ in range(skiprows):
                next(infile)
            for line in infile:
                line = line.rstrip().split(delimiter)
                for item in line:
                    yield dtype(item)
        iter_loadtxt.rowlength = len(line)

    data = np.fromiter(iter_func(), dtype=dtype)
    data = data.reshape((-1, iter_loadtxt.rowlength))
    return data

data = iter_loadtxt('your_file.ext')
So I tried that, but then encountered the following error message:
> data = data.reshape((-1, iter_loadtext.rowlength))
> ValueError: total size of new array must be unchanged
Then I tried to add the number of rows and maximum number of cols to the code with the code fragments down here, which I partly got from another question and partly wrote myself:
num_rows = 0
max_cols = 0
with open(filename, 'r') as infile:
    for line in infile:
        num_rows += 1
        tmp = line.split(",")
        if len(tmp) > max_cols:
            max_cols = len(tmp)

def iter_func():
    # didn't change

data = np.fromiter(iter_func(), dtype=dtype, count=num_rows)
data = data.reshape((num_rows, max_cols))
But this still gave the same error message though I thought it should have been solved. On the other hand I'm not sure if I'm calling data.reshape(..) in the correct manner.
I commented out the line where data.reshape(..) is called to see what would happen. That gave this error message:
> ValueError: need more than 1 value to unpack
This happened at the first point where something is done with X, the variable this whole problem is about.
I know this code can work on the input files I got, because I saw it in use with them. But I can't find why I can't solve this problem. My reasoning goes as far as that because I'm using a 32-bit Python version (on a 64-bit Windows machine), something goes wrong with memory that doesn't happen on other computers. But I'm not sure. For info: I'm having 8GB of RAM for a 1.2GB file but my RAM is not full according to Task Manager.
What I want to solve is that I'm using open source code that needs to read and parse the given file just like np.loadtxt(filename, delimiter=","), but then within my memory. I know the code originally worked in MacOsx and Linux, and to be more precise: "MacOsx 10.9.2 and Linux (version 2.6.18-194.26.1.el5 (brewbuilder#norob.fnal.gov) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-48)) 1 SMP Tue Nov 9 12:46:16 EST 2010)."
I don't care that much about time. My file contains ~200,000 lines, each with 100 or 1000 items (depending on the input file: one kind always has 100, the other always 1000). Each item is a floating-point number with 3 decimals, possibly negative, and items are separated by a comma and a space. E.g.: [..] 0.194, -0.007, 0.004, 0.243, [..], with 100 or 1000 of those items per line (of which you see 4 here), for ~200,000 lines.
I'm using Python 2.7 because the open source code needs that.
Does any of you have the solution for this? Thanks in advance.
On Windows a 32 bit process is only given a maximum of 2GB (or GiB?) memory and numpy.loadtxt is notorious for being heavy on memory, so that explains why the first approach doesn't work.
The second problem you appear to be facing is that the particular file you are testing with has missing data, i.e. not all lines have the same number of values. This is easy to check, for example:
import numpy as np

numbers_per_line = []
with open(filename) as infile:
    for line in infile:
        numbers_per_line.append(line.count(delimiter) + 1)

# Check where there might be problems
numbers_per_line = np.array(numbers_per_line)
expected_number = 100
print np.where(numbers_per_line != expected_number)

Reading text files into lists, then storing them in a dictionary, fills system memory? (What am I doing wrong?) Python

I have 43 text files that consume "232.2 MB on disk (232,129,355 bytes) for 43 items". I want to read them into memory (see code below). The problem I am having is that each file, which is about 5.3 MB on disk, causes Python to use an additional 100 MB of system memory. I check the size of the dict() with getsizeof() (see sample output below). When Python is up to 3 GB of system memory, getsizeof(dict()) reports only 6424 bytes of memory used. I don't understand what is using the memory.
What is using up all the memory?
The related link is different in that the reported memory use by python was "correct"
related question
I am not very interested in other solutions (DB, ...). I am more interested in understanding what is happening so I know how to avoid it in the future. That said, using other Python built-ins, e.g. array rather than lists, is a great suggestion if it helps.
I have heard suggestions of using guppy to find what is using the memory.
sample output:
Loading into memory: ME49_800.txt
ME49_800.txt has 228484 rows of data
ME49_800.txt has 0 rows of masked data
ME49_800.txt has 198 rows of outliers
ME49_800.txt has 0 modified rows of data
280bytes of memory used for ME49_800.txt
43 files of 43 using 12568 bytes of memory
120
Sample data:
CellHeader=X Y MEAN STDV NPIXELS
0 0 120.0 28.3 25
1 0 6924.0 1061.7 25
2 0 105.0 17.4 25
Code:
import csv, os, glob
import sys

def read_data_file(filename):
    reader = csv.reader(open(filename, "U"), delimiter='\t')
    fname = os.path.split(filename)[1]
    data = []
    mask = []
    outliers = []
    modified = []
    maskcount = 0
    outliercount = 0
    modifiedcount = 0
    for row in reader:
        if '[MASKS]' in row:
            maskcount = 1
        if '[OUTLIERS]' in row:
            outliercount = 1
        if '[MODIFIED]' in row:
            modifiedcount = 1
        if row:
            if not any((maskcount, outliercount, modifiedcount)):
                data.append(row)
            elif not any((not maskcount, outliercount, modifiedcount)):
                mask.append(row)
            elif not any((not maskcount, not outliercount, modifiedcount)):
                outliers.append(row)
            elif not any((not maskcount, not outliercount, not modifiedcount)):
                modified.append(row)
            else:
                print '***something went wrong***'
    data = data[1:]
    mask = mask[3:]
    outliers = outliers[3:]
    modified = modified[3:]
    filedata = dict(zip((fname + '_data', fname + '_mask', fname + '_outliers', fname + '_modified'),
                        (data, mask, outliers, modified)))
    return filedata

def ImportDataFrom(folder):
    alldata = dict()
    infolder = glob.glob(os.path.join(folder, '*.txt'))
    numfiles = len(infolder)
    print 'Importing files from: ', folder
    print 'Importing ' + str(numfiles) + ' files from: ', folder
    for infile in infolder:
        fname = os.path.split(infile)[1]
        print "Loading into memory: " + fname
        filedata = read_data_file(infile)
        alldata.update(filedata)
        print fname + ' has ' + str(len(filedata[fname + '_data'])) + ' rows of data'
        print fname + ' has ' + str(len(filedata[fname + '_mask'])) + ' rows of masked data'
        print fname + ' has ' + str(len(filedata[fname + '_outliers'])) + ' rows of outliers'
        print fname + ' has ' + str(len(filedata[fname + '_modified'])) + ' modified rows of data'
        print str(sys.getsizeof(filedata)) + 'bytes of memory used for ' + fname
        print str(len(alldata) / 4) + ' files of ' + str(numfiles) + ' using ' + str(sys.getsizeof(alldata)) + ' bytes of memory'
        #print alldata.keys()
        print str(sys.getsizeof(ImportDataFrom))
        print ' '
    return alldata

ImportDataFrom("/Users/vmd/Dropbox/dna/data/rawdata")
The dictionary itself is very small - the bulk of the data is the whole content of the files stored in lists, containing one tuple per line. The 20x size increase is bigger than I expected but seems to be real. Splitting a 27-bytes line from your example input into a tuple, gives me 309 bytes (counting recursively, on a 64-bit machine). Add to this some unknown overhead of memory allocation, and 20x is not impossible.
Alternatives: for more compact representation, you want to convert the strings to integers/floats and to tightly pack them (without all that pointers and separate objects). I'm talking not just one row (although that's a start), but a whole list of rows together - so each file will be represented by just four 2D arrays of numbers. The array module is a start, but really what you need here are numpy arrays:
# Using explicit field types for compactness and access by name
# (e.g. data[i]['mean'] == data[i][2]).
fields = [('x', int), ('y', int), ('mean', float),
          ('stdv', float), ('npixels', int)]
# The simplest way is to build lists as you do now, and convert them
# to numpy array when done.
data = numpy.array(data, dtype=fields)
mask = numpy.array(mask, dtype=fields)
...
This gives me 40 bytes spent per row (measured on the .data attribute; sys.getsizeof reports that the array has a constant overhead of 80 bytes, but doesn't see the actual data used). This is still ~1.5× the size of the original files, but it should easily fit into RAM.
I see 2 of your fields are labeled "x" and "y" - if your data is dense, you could arrange it by them - data[x,y]==... - instead of just storing (x,y,...) records. Besides being slightly more compact, it would be the most sensible structure, allowing easier processing.
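For illustration, a hypothetical sketch of that dense layout; the grid dimensions and the 'data' rows (as produced by read_data_file above) are assumptions:

import numpy as np

nx, ny = 512, 512                    # assumed grid size; in practice derive it from max x/y
mean = np.zeros((nx, ny))
stdv = np.zeros((nx, ny))
npixels = np.zeros((nx, ny), dtype=int)
for x, y, m, s, n in data:           # one (x, y, mean, stdv, npixels) record per row
    mean[int(x), int(y)] = float(m)
    stdv[int(x), int(y)] = float(s)
    npixels[int(x), int(y)] = int(n)
# e.g. mean[2, 0] is now the MEAN value of the cell at x=2, y=0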
If you need to handle even more data than your RAM will fit, pytables is a good library for efficient access to compact (even compressed) tabular data in files. (It's much better at this than general SQL DBs.)
This line specifically gets the size of the function object:
print str(sys.getsizeof(ImportDataFrom))
that's unlikely to be what you're interested in.
The size of a container does not include the size of the data it contains. Consider, for example:
>>> import sys
>>> d={}
>>> sys.getsizeof(d)
140
>>> d['foo'] = 'x'*99
>>> sys.getsizeof(d)
140
>>> d['foo'] = 'x'*9999
>>> sys.getsizeof(d)
140
If you want the size of the container plus the size of all contained things you have to write your own (presumably recursive) function that reaches inside containers and digs for every byte. Or, you can use third-party libraries such as Pympler or guppy.
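A rough sketch of such a recursive sizeof, for illustration only (Pympler's asizeof or guppy handle many more container types and corner cases):

import sys

def total_size(obj, seen=None):
    """Container size plus, recursively, everything it references."""
    if seen is None:
        seen = set()
    if id(obj) in seen:                 # don't double-count shared objects
        return 0
    seen.add(id(obj))
    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        size += sum(total_size(k, seen) + total_size(v, seen)
                    for k, v in obj.items())
    elif isinstance(obj, (list, tuple, set, frozenset)):
        size += sum(total_size(item, seen) for item in obj)
    return size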
