I need to get the numbers of one line at random, put each line into another array, and then get the numbers of one column.
I have a big file, more than 400 MB. It contains 13496*13496 numbers, meaning 13496 rows and 13496 columns. I want to read them into an array.
This is my code:
_L1 = [[0 for col in range(13496)] for row in range(13496)]
_L1file = open('distanceCMD.function.txt')
i = 0
while i < 13496:
    print "i=" + str(i)
    _strlf = _L1file.readline()
    _strlf = _strlf.split('\t')
    _strlf = _strlf[:-1]
    _L1[i] = _strlf
    i += 1
_L1file.close()
And this is my error message:
MemoryError:
File "D:\research\space-function\ART3.py", line 30, in <module>
_strlf = _strlf.split('\t')
You might want to approach your problem in another way: process the file line by line. I don't see a need to store the whole big file in an array. Otherwise, you might want to tell us what you are actually trying to do.
for line in open("400MB_file"):
    # do something with line.
Or
f=open("file")
for linenum,line in enumerate(f):
if linenum+1 in [2,3,10]:
print "there are ", len(line.split())," columns" #assuming you want to split on spaces
print "100th column value is: ", line.split()[99]
if linenum+1>10:
break # break if you want to stop after the 10th line
f.close()
This is a simple case of your program demanding more memory than is available to the computer. An array of 13496x13496 elements requires 182,142,016 'cells', where a cell is a minimum of one byte (if storing chars) and potentially several bytes (if storing floating-point numbers, for example). I'm not even taking your particular runtime's array metadata into account, though this would typically be a tiny overhead on a simple array.
Assuming each array element is just a single byte, your computer needs around 180 MB of RAM to hold it in memory in its entirety. Trying to process it all at once could be impractical.
You need to think about the problem in a different way: as has already been mentioned, a line-by-line approach might be a better option. Or perhaps process the grid in smaller units, say 10x10 or 100x100, and aggregate the results. Or maybe the problem itself can be expressed in a different form that avoids the need to process the entire dataset altogether?
If you give us a little more detail on the nature of the data and the objective, perhaps someone will have an idea to make the task more manageable.
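To make the line-by-line idea concrete, here is a minimal sketch under assumed requirements (the real objective isn't stated): it keeps only one aggregate per row, in this case the row sum, so just one line of the 13496x13496 grid is in memory at any time. The filename and tab-separated layout are taken from the question.
row_sums = []
with open('distanceCMD.function.txt') as f:
    for line in f:
        values = [float(v) for v in line.split('\t')[:-1]]   # drop the empty field after the trailing tab
        row_sums.append(sum(values))                         # only one number is kept per row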
Short answer: the Python object overhead is killing you. In Python 2.x on a 64-bit machine, a list of strings consumes 48 bytes per list entry even before accounting for the content of the strings. That's over 8.7 GB of overhead for the size of array you describe.
On a 32-bit machine it'll be a bit better: only 28 bytes per list entry.
Longer explanation: you should be aware that Python objects themselves can be quite large, even simple objects like ints, floats and strings. In your code you're ending up with a list of lists of strings. On my (64-bit) machine, even an empty string object takes up 40 bytes, and to that you need to add 8 bytes for the list pointer that's pointing to this string object in memory. So that's already 48 bytes per entry, or around 8.7 GB. Given that Python allocates memory in multiples of 8 bytes at a time, and that your strings are almost certainly non-empty, you're actually looking at 56 or 64 bytes per entry (I don't know how long your strings are).
Possible solutions:
(1) You might do (a little) better by converting your entries from strings to ints or floats as appropriate.
(2) You'd do much better by either using Python's array type (not the same as list!) or by using numpy: then your ints or floats would only take 4 or 8 bytes each.
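Here is a rough sketch of the numpy route (assuming the tab-separated layout with a trailing tab per line, as in the question, and that float32 precision is acceptable): the whole matrix then costs about 730 MB instead of several GB of Python objects.
import numpy as np

n = 13496
data = np.empty((n, n), dtype=np.float32)    # 13496*13496*4 bytes, roughly 730 MB
with open('distanceCMD.function.txt') as f:
    for i, line in enumerate(f):
        data[i, :] = [float(v) for v in line.split('\t')[:-1]]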
Since Python 2.6, you can get basic information about object sizes with the sys.getsizeof function. Note that if you apply it to a list (or other container) then the returned size doesn't include the size of the contained list objects; only of the structure used to hold those objects. Here are some values on my machine.
>>> import sys
>>> sys.getsizeof("")
40
>>> sys.getsizeof(5.0)
24
>>> sys.getsizeof(5)
24
>>> sys.getsizeof([])
72
>>> sys.getsizeof(range(10)) # 72 + 8 bytes for each pointer
152
MemoryError exception: Raised when an operation runs out of memory but the situation may still be rescued (by deleting some objects). The associated value is a string indicating what kind of (internal) operation ran out of memory. Note that because of the underlying memory management architecture (C's malloc() function), the interpreter may not always be able to completely recover from this situation; it nevertheless raises an exception so that a stack traceback can be printed, in case a run-away program was the cause.
It seems that, at least in your case, reading the entire file into memory is not a doable option.
Replace this:
_strlf = _strlf[:-1]
with this:
_strlf = [float(val) for val in _strlf[:-1]]
You are making a big array of strings. I can guarantee that the string "123.00123214213" takes a lot less memory when you convert it to floating point.
You might want to include some handling for null values.
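For example, something along these lines (just a sketch; using NaN as the placeholder for empty fields is an assumption):
def to_float(val):
    # Convert one field to float, falling back to NaN for empty/null entries.
    try:
        return float(val)
    except ValueError:
        return float('nan')

_strlf = [to_float(val) for val in _strlf[:-1]]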
You can also go to numpy's array type, but your problem may be too small to bother.
I've encountered a weird problem and could not solve it for days. I created a byte array containing values from 1 to 250 and wrote it to a binary file from C# using WriteAllBytes.
Later I read it from Python using np.fromfile(filename, dtype=np.ubyte). However, I realized this function was adding arbitrary commas (see the image). Interestingly, they are not visible in the array properties. And if I call numpy.array2string, the commas turn into '\n'. One solution is to replace the commas with nothing, but I have very long sequences and it would take forever to run a replace over 100 GB of data. I also rechecked the files by reading them with .NET Core; I'm quite sure the commas are not there.
What could I be missing?
Edit:
I was trying to read all byte values into an array and cast each member (or the entire array) to a string. I found out that the most reliable way to do this is:
list(map(str, ubyte_array))
The above code returns a list of strings whose elements contain no arbitrary commas or blank spaces.
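For illustration, a minimal sketch of that approach ('data.bin' is a placeholder filename; np.fromfile simply reads the raw unsigned bytes back):
import numpy as np

arr = np.fromfile('data.bin', dtype=np.ubyte)   # raw unsigned bytes; no separators are stored in the file
as_strings = list(map(str, arr))                # e.g. ['1', '2', '3', ...] with no commas or blanks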
As part of a bigger project, I want to save a sequence of bits in a file so that the file is as small as possible. I'm not talking about compression; I want to save the sequence as it is but using the least number of characters. The initial idea was to turn mini-sequences of 8 bits into chars using ASCII encoding and to save those chars, but due to some unknown problem with strange characters, the characters retrieved when reading the file are not the same as the ones originally written. I've tried opening the file with utf-8 encoding and with latin-1, but neither seems to work. I'm wondering if there's any other way, maybe by turning the sequence into a hexadecimal number?
Technically you cannot write less than a byte, because the OS organizes memory in bytes (see Write individual bits to a file in python), so this is binary file I/O; see https://docs.python.org/2/library/io.html. There are modules like struct for this.
Open the file with the 'b' switch, which indicates a binary read/write operation, then use e.g. the int.to_bytes() function (Writing bits to a binary file) or struct.pack() (How to write individual bits to a text file in python?):
import struct

with open('somefile.bin', 'wb') as f:
    f.write(struct.pack("h", 824))  # on a little-endian machine this writes the two bytes '8\x03'

>>> struct.pack("h", 824)
'8\x03'
>>> bits = "10111111111111111011110"
>>> int(bits[::-1], 2).to_bytes(4, 'little')
b'\xfd\xff=\x00'
If you want to get around the 8-bit (byte) structure of memory, you can use bit manipulation techniques like bitmasks and BitArrays;
see https://wiki.python.org/moin/BitManipulation and https://wiki.python.org/moin/BitArrays.
However the problem is, as you said, reading the data back if you use bit sequences of differing length: to store a decimal 7 you need 3 bits (0b111), while to store a decimal 2 you need 2 bits (0b10). Now the problem is reading this back.
How can your program know whether it has to read a value back as a 3-bit value or as a 2-bit value? In unstructured memory the sequence "7 then 2" looks like 11110, which splits as 111|10, so how can your program know where the | is?
In normal byte-ordered memory, "7 then 2" is 0000011100000010 -> 00000111|00000010, which has the advantage that it is clear where the | is.
This is why memory at its lowest level is organized in fixed clusters of 8 bits = 1 byte. If you want to access single bits inside a byte/8-bit cluster, you can use bitmasks in combination with logical operators (http://www.learncpp.com/cpp-tutorial/3-8a-bit-flags-and-bit-masks/); in Python the easiest way to do single-bit manipulation is the ctypes module.
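As an illustration of the bitmask idea, here is a small sketch using plain integer bitwise operators (rather than ctypes):
flags = 0b00000000                 # one byte worth of flags
flags |= 1 << 3                    # set bit 3
flags ^= 1 << 0                    # toggle bit 0
is_set = bool(flags & (1 << 3))    # test bit 3
flags &= ~(1 << 3)                 # clear bit 3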
If you know that your values are all 6 bits wide, maybe it is worth the effort; however, this is also tough...
(How do you set, clear, and toggle a single bit?)
(Why can't you do bitwise operations on pointer in C, and is there a way around this?)
Consider the following code:
arr = []
for (s, id_, flag) in some_data:   # renamed from (str, id, flag) to avoid shadowing builtins
    arr.append((s, id_, flag))
Imagine the input strings being 2 chars long on average and 5 chars max, and some_data having 1 million elements.
What will the memory requirement of such a structure be?
May it be that a lot of memory is wasted for the strings? If so, how can I avoid that?
In this case, because the strings are quite short, and there are so many of them, you stand to save a fair bit of memory by using intern on the strings. Assuming there are only lowercase letters in the strings, that's 26 * 26 = 676 possible strings, so there must be a lot of repetitions in this list; intern will ensure that those repetitions don't result in unique objects, but all refer to the same base object.
It's possible that Python already interns short strings; but looking at a number of different sources, it seems this is highly implementation-dependent. So calling intern in this case is probably the way to go; YMMV.
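A minimal sketch of what that looks like for the loop in the question (variable names renamed to avoid shadowing builtins; intern is a builtin in Python 2, and sys.intern in Python 3):
arr = []
for (s, id_, flag) in some_data:
    arr.append((intern(s), id_, flag))   # duplicate strings now share a single object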
As an elaboration on why this is very likely to save memory, consider the following:
>>> sys.getsizeof('')
40
>>> sys.getsizeof('a')
41
>>> sys.getsizeof('ab')
42
>>> sys.getsizeof('abc')
43
Adding single characters to a string adds only a byte to the size of the string itself, but every string takes up 40 bytes on its own.
In recent Python 3 (64-bit) versions, string instances take up 49+ bytes. But also keep in mind that if you use non-ASCII characters, the memory usage jumps up even more:
>>> sys.getsizeof('t')
50
>>> sys.getsizeof('я')
76
Notice how even if one character in a string is non-ASCII, all other characters will take up more space (2 or 4 bytes each):
>>> sys.getsizeof('t12345')
55 # +5 bytes, compared to 't'
>>> sys.getsizeof('я12345')
86 # +10 bytes, compared to 'я'
This has to do with the internal representation of strings since Python 3.3. See PEP 393 -- Flexible String Representation for more details.
Python, in general, is not very memory efficient, when it comes to having lots of small objects, not just for strings. See these examples:
>>> sys.getsizeof(1)
28
>>> sys.getsizeof(True)
28
>>> sys.getsizeof([])
56
>>> sys.getsizeof(dict())
232
>>> sys.getsizeof((1,1))
56
>>> sys.getsizeof([1,1])
72
Interning strings could help, but make sure you don't have too many unique values, as that could do more harm than good.
It's hard to tell how to optimize your specific case, as there is no single universal solution. You could save up a lot of memory if you somehow serialize data from multiple items into a single byte buffer, for example, but then that could complicate your code or affect performance too much. In many cases it won't be worth it, but if I were in a situation where I really needed to optimize memory usage, I would also consider writing that part in a language like Rust (it's not too hard to create a native Python module via PyO3 for example).
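As a sketch of the byte-buffer idea mentioned above, assuming a hypothetical fixed-width layout (strings of at most 5 ASCII characters, an id that fits in an unsigned 32-bit int, and a one-byte flag), each record takes 10 bytes instead of three Python objects plus a tuple:
import struct

record = struct.Struct('<5sIB')                 # 5-byte string, uint32 id, uint8 flag; no padding
buf = bytearray(record.size * len(some_data))
for i, (s, id_, flag) in enumerate(some_data):
    record.pack_into(buf, i * record.size, s.encode('ascii'), id_, flag)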
If your strings are so short, it is likely there will be a significant number of duplicates. Python interning will optimise this so that these strings are stored only once and a reference is used multiple times, rather than the string being stored multiple times...
Strings this short should be automatically interned as they are.
I usually perform things like this in C++, but I'm using python to write a quick script and I've run into a wall.
If I have a binary list (or whatever Python stores the result of an "fread" in), I can access the individual bytes in it with buffer[0], buffer[1], etc.
I need to change the bytes [8-11] to hold a new 32-bit file-size (read: there's already a filesize there, I need to update it). In C++ I would just get a pointer to the location and cast it to store the integer, but with python I suddenly realized I have no idea how to do something like this.
How can I update 4 bytes in my buffer at a specific location to hold the value of an integer in python?
EDIT
I'm going to add more because I can't seem to figure it out from the solutions (though I can see they're on the right track).
First of all, I'm on Python 2.4 (and can't upgrade, big corporation servers), so that apparently limits my options. Sorry for not mentioning that earlier; I wasn't aware it had so many fewer features.
Secondly, let's make this ultra-simple.
Let's say I have a binary file named 'myfile.binary' with the five-byte contents '4C53535353' in hex: this equates to the ASCII representations of the letter "L" followed by four "S" characters, alone in the file.
If I do:
f = open('myfile.binary', 'rb')
contents = f.read(5)
contents should (from Sven Marnach's answer) hold a five-byte immutable string.
Using Python 2.4 facilities only, how could I change the four S's held in 'contents' to an arbitrary integer value? I.e. give me a line of code that can make byte indices contents[1-4] contain the 32-bit integer 'myint' with value 12345678910.
What you need is this function:
struct.pack_into(fmt, buffer, offset, v1, v2, ...)
It's documented at http://docs.python.org/library/struct.html near the top.
Example code:
import struct
import ctypes

data = ctypes.create_string_buffer(10)       # a 10-byte mutable buffer
struct.pack_into(">i", data, 5, 0x12345678)  # write a big-endian 32-bit int at offset 5
print list(data)
Similar posting: Python: How to pack different types of data into a string buffer using struct.pack_into
EDIT: Added a Python 2.4 compatible example:
import struct

f = open('myfile.binary', 'rb')
contents = f.read(5)
data = list(contents)                        # a mutable list of one-character strings
data[0:4] = struct.pack(">i", 0x12345678)    # overwrites bytes 0-3; use data[1:5] for bytes 1-4
print data
Have a look at the struct module. You need the pack function.
EDIT:
The code:
import struct
s = "LSSSS" # your string
s = s[0] + struct.pack('<I', 1234567891) # note "shorter" constant than in your example
print s
Output:
L╙☻ЦI
struct.pack should be available in Python2.4.
Your number "12345678910" cannot be packed into 4 bytes, I shortened it a bit.
The result of file.read() is a string in Python, and it is immutable. Depending on the context of the task you are trying to accomplish, there are different solutions to the problem.
One is using the array module: Read the file directly as an array of 32-bit integers. You can modify this array and write it back to the file.
with open("filename") as f:
f.seek(0, 2)
size = f.tell()
f.seek(0)
data = array.array("i")
assert data.itemsize == 4
data.fromfile(f, size // 4)
data[2] = new_value
# use data.tofile(g) to write the data back to a new file g
You could install the numpy module, which is often used for scientific computing.
import numpy
read_data = numpy.fromfile(file='myfile.binary', dtype=numpy.uint32)  # accepts a filename or an open file object
Then access the data at the desired location and make your changes.
The following is just a demonstration for you to understand what really happens when the four bytes are converted into an integer.
Suppose you have a number: 15213
Decimal: 15213
Binary: 0011 1011 0110 1101
Hex: 3 B 6 D
On little-endian systems (i.e. x86 machines), this number can be represented using a length-4 bytearray as b"\x6d\x3b\x00\x00", or b"m;\x00\x00" when you print it on the screen. To convert the four bytes back into an integer, we simply do a bit of base conversion, which in this case is:
sum(n*(256**i) for i,n in enumerate(b"\x6d\x3b\x00\x00"))
This gives you the result: 15213
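The same conversion can be done with the struct module (a quick cross-check; '<i' is a little-endian signed 32-bit int):
import struct

struct.unpack('<i', b"\x6d\x3b\x00\x00")[0]   # also gives 15213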
Python allocates integers automatically based on the underlying system architecture. Unfortunately I have a huge dataset which needs to be fully loaded into memory.
So, is there a way to force Python to use only 2 bytes for some integers (equivalent of C++ 'short')?
Nope. But you can use short integers in arrays:
from array import array
a = array("h") # h = signed short, H = unsigned short
As long as the value stays in that array it will be a short integer.
documentation for the array module
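A quick usage sketch: one million signed shorts occupy roughly 2 MB in the array, far less than a list of one million Python int objects.
from array import array

a = array("h", [0] * 1000000)   # about 2 MB of payload
a[0] = 12345
assert a.itemsize == 2          # two bytes per entry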
Thanks to Armin for pointing out the 'array' module. I also found the 'struct' module that packs c-style structs in a string:
From the documentation (https://docs.python.org/library/struct.html):
>>> from struct import *
>>> pack('hhl', 1, 2, 3)
'\x00\x01\x00\x02\x00\x00\x00\x03'
>>> unpack('hhl', '\x00\x01\x00\x02\x00\x00\x00\x03')
(1, 2, 3)
>>> calcsize('hhl')
8
You can use NumPy's integer types, such as np.int8 or np.int16.
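For example (a small sketch; int16 matches a C short at 2 bytes per element):
import numpy as np

a = np.zeros(1000000, dtype=np.int16)   # about 2 MB of data
a[0] = -32768                           # the int16 range is -32768..32767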
Armin's suggestion of the array module is probably best. Two possible alternatives:
(1) You can create an extension module yourself that provides the data structure you're after. If it's really just something like a collection of shorts, then that's pretty simple to do.
(2) You can cheat and manipulate bits, so that you're storing one number in the lower half of the Python int, and another one in the upper half. You'd write some utility functions to convert to/from these within your data structure (a toy sketch follows below). Ugly, but it can be made to work.
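Here is a toy sketch of alternative (2), packing two unsigned 16-bit values into one Python int; the helper names are made up for illustration:
def pack_shorts(lo, hi):
    # store 'lo' in bits 0-15 and 'hi' in bits 16-31 of a single integer
    return ((hi & 0xFFFF) << 16) | (lo & 0xFFFF)

def unpack_shorts(packed):
    # recover the two 16-bit halves
    return packed & 0xFFFF, (packed >> 16) & 0xFFFF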
It's also worth realising that a Python integer object is not 4 bytes - there is additional overhead. So if you have a really large number of shorts, then you can save more than two bytes per number by using a C short in some way (e.g. the array module).
I had to keep a large set of integers in memory a while ago, and a dictionary with integer keys and values was too large (I had 1GB available for the data structure IIRC). I switched to using a IIBTree (from ZODB) and managed to fit it. (The ints in a IIBTree are real C ints, not Python integers, and I hacked up an automatic switch to a IOBTree when the number was larger than 32 bits).
You can also store multiple integers of any size in a single large integer.
For example, as seen below, in Python 3 on a 64-bit x86 system, 1024 bits take 164 bytes of memory. That means on average one byte can store around 6.24 bits. And if you go with even larger integers you can get an even higher bit-storage density, for example around 7.50 bits per byte with a 2**20-bit-wide integer.
Obviously you will need some wrapper logic to access the individual short numbers stored in the larger integer, which is easy to implement.
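A toy sketch of that wrapper logic, assuming 16-bit unsigned slots inside the large integer (helper names are made up):
def get16(big, k):
    # read the k-th 16-bit slot
    return (big >> (16 * k)) & 0xFFFF

def set16(big, k, value):
    # return a new large integer with the k-th 16-bit slot replaced
    mask = 0xFFFF << (16 * k)
    return (big & ~mask) | ((value & 0xFFFF) << (16 * k))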
One issue with this approach is that your data access will slow down due to the use of large-integer operations.
If you access a big batch of consecutively stored integers at once, to minimize the number of operations on the large integer, then the slower large-integer arithmetic won't be much of an issue.
I guess using numpy would be the easier approach.
>>> a = 2**1024
>>> sys.getsizeof(a)
164
>>> 1024/164
6.2439024390243905
>>> a = 2**(2**20)
>>> sys.getsizeof(a)
139836
>>> 2**20 / 139836
7.49861266054521
Using a bytearray in Python, which is basically a C unsigned char array under the hood, is a better solution than using large integers. There is no overhead for manipulating a byte array, and it has much less storage overhead compared to large integers. It's possible to get a storage density of 7.99+ bits per byte with bytearrays.
>>> import sys
>>> a = bytearray(2**32)
>>> sys.getsizeof(a)
4294967353
>>> 8 * 2**32 / 4294967353
7.999999893829228
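A small usage sketch for 16-bit values stored in a bytearray (Python 3; memoryview.cast gives typed access to the buffer without copying):
buf = bytearray(2 * 1000000)          # room for one million 16-bit values, about 2 MB
shorts = memoryview(buf).cast('H')    # 'H' = unsigned short, as in the struct module
shorts[0] = 12345
assert shorts[0] == 12345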