Python: Zero copy while truncating a byte buffer

This is a noob question on Python.
Is there a way in Python to truncate a few bytes from the beginning of a bytearray without copying the content to another memory location? This is what I am doing:
inbuffer = bytearray()
inbuffer.extend(someincomingbytedata)
x = inbuffer[0:10]
del inbuffer[0:10]
I need to retain the truncated bytes (referenced by x) and perform some operation on them.
Will x point to the same memory location as inbuffer[0], or will the third line in the above code make a copy of the data? Also, if no copy is made, will deleting in the last line also delete the data referenced by x? Since x still references that data, the GC should not reclaim it. Is that right?
Edit:
If this is not the right way to truncate a byte buffer and return the truncated bytes without copying, is there any other type that supports such operation safely?

In your example, x will be a new object holding a copy of the contents of inbuffer[0:10].
To get a view of the data without copying, you need to use a memoryview (available since Python 2.7):
inbuffer_view = memoryview(inbuffer)
prefix = inbuffer_view[0:10]
suffix = inbuffer_view[10:]
Now prefix will point to the first 10 bytes of inbuffer, and suffix will point to the remaining contents of inbuffer. Both objects keep an internal reference to inbuffer, so you do not need to explicitly keep references to inbuffer or inbuffer_view.
Note that both prefix and suffix will be memoryviews, not bytearrays or bytes. You can create bytes and bytearrays from them, but at that point the contents will be copied.
memoryviews can be passed to any function that works with objects that implement the buffer protocol. So, for example, you can write them directly into a file using fh.write(suffix).
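To make the behaviour concrete, here is a small sketch (the data is illustrative) of the zero-copy slicing described above, including the BufferError you get if you try to resize the bytearray while views of it still exist:

```python
data = bytearray(b"0123456789abcdef")
view = memoryview(data)
prefix = view[:10]   # no copy: a view of the first 10 bytes
suffix = view[10:]   # no copy: a view of the rest

# Mutations of the underlying bytearray are visible through the views:
data[0] = ord(b"X")
print(bytes(prefix))  # converting to bytes copies at this point: b'X123456789'

# While views are exported, the bytearray cannot be resized:
try:
    del data[:10]
except BufferError:
    print("cannot resize an exported bytearray")
```

Note the last point: the `del inbuffer[0:10]` from the question would fail with a BufferError as long as any memoryview of the buffer is alive, so with this approach you simply keep the whole buffer and work through the views.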

You can use the iterator protocol and itertools.islice to pull the first 10 values out of your someincomingbytedata iterable before putting the rest into inbuffer. This doesn't use the same memory for all the bytes, but it's about as good as you can get at avoiding unnecessary copying with a bytearray:
import itertools
it = iter(someincomingbytedata)
x = bytearray(itertools.islice(it, 10)) # consume the first 10 bytes
inbuffer = bytearray(it) # consume the rest
If you really do need to do your reading all up front and then efficiently view various slices of it without copying, you might consider using numpy. If you load your data into a numpy array, any slices you take later will be views into the same memory:
import numpy as np
inbuffer = np.array(someincomingbytedata, dtype=np.uint8) # load data into an array of bytes
x = inbuffer[:10] # grab a view of the first ten bytes, which does not require a copy
inbuffer = inbuffer[10:] # change inbuffer to reference a slice; no copying here either
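As a quick sanity check (assuming NumPy is installed), np.shares_memory confirms that such slices are views rather than copies; here the incoming data is a stand-in bytes object:

```python
import numpy as np

# np.frombuffer gives a zero-copy (read-only) array over an existing bytes object
inbuffer = np.frombuffer(bytes(range(20)), dtype=np.uint8)
x = inbuffer[:10]
rest = inbuffer[10:]

# Both slices are views into the same underlying buffer:
assert np.shares_memory(inbuffer, x)
assert np.shares_memory(inbuffer, rest)
assert x.base is not None  # a view keeps a reference to its base array
```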

It is very easy to check:
>>> inbuffer = bytearray([1, 2, 3, 4, 5])
>>> x = inbuffer[0:2]
>>> print id(x) == id(inbuffer)
False
So it is not the same object.
Also you are asking about x pointing at inbuffer[0]. You seem to misunderstand something. Arrays in Python don't work the same way as arrays in C. The address of inbuffer is not the address of inbuffer[0]:
>>> inbuffer = bytearray([1, 2, 3, 4, 5])
>>> print id(inbuffer) == id(inbuffer[0])
False
These are wrappers around C-level arrays.
Also, in Python everything is an object, and CPython caches all small integers from -5 up to 256 (which covers the full range of bytearray values). Therefore the only thing that is copied over is pointers:
>>> inbuffer = bytearray([1, 2, 3, 4, 5])
>>> print id(inbuffer[0]) == id(1)
True

Related

Why ndarray function as immutable object

I probably misunderstood the terms immutable/mutable and in-place change.
Creating an ndarray: x = np.arange(6). Reshaping the ndarray: x.reshape(3,2).
Now when I look at x, it is unchanged; the ndarray is still 1-dimensional.
When I do the same kind of operation with a built-in Python list, the list gets changed.
As @morhc mentioned, you're not changing your array x because x.reshape returns a new array instead of modifying the existing one. Some array operations can work in place (for example via an out parameter), but reshape is not one of them.
Mutability is a somewhat related, but more general concept.
A mutable object is one that can be changed after it has been created and stored in memory. Immutable objects, once created, cannot be changed; if you want to modify an immutable object, you have to create a new one. For example, lists in python are mutable: You can add or remove elements, and the list remains stored at the same place in memory. A string however is immutable: If you take the string "foobar" and call its replace method to replace some characters, then the modified string that you get as a result is a new object stored in a different place in memory.
You can use the id built-in function in python to check at what memory address an object is stored. So, to demonstrate:
>>> test_list = []
>>> id(test_list)
1696610990848  # memory address of the empty list object we just created
>>> test_list.append(1)
>>> id(test_list)
1696610990848  # the list with one item added to it is still the same object
>>> test_str = "foobar"
>>> id(test_str)
1696611256816  # memory address of the string object we just created
>>> test_str = test_str.replace("b", "x")
>>> id(test_str)
1696611251312  # the replace method returned a new object
In fact, NumPy arrays are in principle mutable:
>>> test_arr = np.zeros(4)
>>> id(test_arr)
1696611361200
>>> test_arr[0] = 1
>>> id(test_arr)
1696611361200  # after changing an entry in the array, it's still the same object
Reshaping in place is actually possible by assigning to the shape attribute (x.shape = (3, 2)), but the reshape method is deliberately designed to return a new array object instead of modifying its argument.
Note also that making an assignment like test_arr2 = test_arr does not make a copy; instead, test_arr2 points to the same object in memory. If you truly want to make a new copy, you should use test_arr.copy().
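To illustrate both points in one sketch (assuming import numpy as np): reshape returns a new array object, but for a contiguous array it is a view sharing memory with the original, and assigning to the shape attribute reshapes in place:

```python
import numpy as np

x = np.arange(6)
y = x.reshape(3, 2)            # a new array object...
assert y is not x
assert np.shares_memory(x, y)  # ...but a view onto the same data

y[0, 0] = 99                   # writing through the view changes x too
assert x[0] == 99

x.shape = (3, 2)               # assigning to .shape reshapes in place
assert x.shape == (3, 2)
```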
The issue is that you are not overwriting the array. You are currently writing:
x = np.arange(6)
x.reshape(3,2)
But you should rather be writing:
x = np.arange(6)
x = x.reshape(3,2)
Note that the array has to be of size six to rearrange to (3,2).

Load numpy array of strings python 3

I am converting code from python 2 into python 3. The array was originally saved in python 2. As part of some of my code I load an array of strings that I have saved. In python 2, I can simply load it as
arr = np.load("path_to_string.npy")
and it gives me
arr = ['str1','str2' etc...]
However, when I do the same in Python 3, it doesn't work; instead I get:
arr = [b'str1',b'str2' etc...]
which I take it means that the strings are stored as a different data type. I have tried to convert them using:
arr = [str(i) for i in arr]
but this just compounds the problem. Can someone explain why this happens and how to fix it? I'm sure it's trivial, but I am just drawing a blank.
To be clear, if they were strs in Python 2, then bytes in Python 3 is the "correct" type, in the sense that both of them store byte data; if you wanted arbitrary text data, you would use unicode in Python 2.
For numpy, this is really the correct behavior; numpy doesn't want to silently convert from bytes-oriented data to text-oriented data (among other issues, doing so will bloat the memory usage by a factor of 4x, since fixed width representations of all Unicode characters use four bytes per character). If you really want to change from bytes to str, you can explicitly cast it, though it's a little bit hacky:
>>> arr  # original version
array([[b'abc', b'123'],
       [b'foo', b'bar']], dtype='|S3')
>>> arr = arr.astype('U')  # cast from "[S]tring" to the "[U]nicode" equivalent
>>> arr
array([['abc', '123'],
       ['foo', 'bar']], dtype='<U3')
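If you'd rather control the codec, or only need plain Python strings, two equivalent sketches (assuming the data is valid UTF-8; the sample values are illustrative):

```python
import numpy as np

arr = np.array([b'str1', b'str2'])

# Option 1: cast the whole array to a Unicode dtype
unicode_arr = arr.astype('U')

# Option 2: decode element-wise with an explicit codec
decoded = np.char.decode(arr, 'utf-8')

# Option 3: a plain Python list of str
as_list = [b.decode('utf-8') for b in arr]

assert unicode_arr.tolist() == decoded.tolist() == as_list == ['str1', 'str2']
```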

Efficient way to make numpy object arrays intern strings

Consider numpy arrays of the object dtype. I can shove anything I want in there.
A common use case for me is to put strings in them. However, for very large arrays, this may use up a lot of memory, depending on how the array is constructed. For example, if you assign a long string (e.g. "1234567890123456789012345678901234567890") to a variable, and then assign that variable to each element in the array, everything is fine:
arr = np.zeros((100000,), dtype=object)
arr[:] = "1234567890123456789012345678901234567890"
The interpreter now has one large string in memory, and an array full of pointers to this one object.
However, we can also do it wrong:
arr2 = np.zeros((100000,), dtype=object)
for idx in range(100000):
    arr2[idx] = str(1234567890123456789012345678901234567890)
Now, the interpreter has a hundred thousand copies of my long string in memory. Not so great.
(Naturally, in the above example, the generation of a new string each time is contrived; in real life, imagine reading a string from each line of a file.)
What I want to do is, instead of assigning each element to the string, first check if it's already in the array, and if it is, use the same object as the previous entry, rather than the new object.
Something like:
arr = np.zeros((100000,), dtype=object)
seen = []
for idx, string in enumerate(file):  # length of file is exactly 100000
    if string in seen:
        arr[idx] = seen[seen.index(string)]
    else:
        arr[idx] = string
        seen.append(string)
(Apologies for not posting fully running code. Hopefully you get the idea.)
Unfortunately this requires a large number of superfluous operations on the seen list. I can't figure out how to make it work with sets either.
Suggestions?
Here's one way to do it, using a dictionary whose values are equal to its keys:
seen = {}
for idx, string in enumerate(file):
    arr[idx] = seen.setdefault(string, string)
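For plain str values there is also sys.intern, which achieves the same deduplication through the interpreter's own intern table. A runnable sketch, with a stand-in lines list in place of the file from the question:

```python
import sys
import numpy as np

lines = ["alpha", "beta", "alpha", "beta", "alpha"]  # stand-in for lines read from a file

arr = np.zeros(len(lines), dtype=object)
for idx, string in enumerate(lines):
    arr[idx] = sys.intern(string)

# Equal strings are now the very same object, not copies:
assert arr[0] is arr[2]
assert arr[1] is arr[3]
```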

How to read stdin to a 2d python array of integers?

I would like to read a 2d array of integers from stdin (or from a file) in Python.
Non-working code:
from StringIO import StringIO
from array import array
# fake stdin
stdin = StringIO("""1 2
3 4
5 6""")
a = array('i')
a.fromstring(stdin.read())
This gives me an error on the a.fromstring(stdin.read()) line:
ValueError: string length not a multiple of item size
Several approaches to accomplish this are available. Below are a few of the possibilities.
Using an array
From a list
Replace the last line of code in the question with the following.
a.fromlist([int(val) for val in stdin.read().split()])
Now:
>>> a
array('i', [1, 2, 3, 4, 5, 6])
Con: does not preserve 2d structure (see comments).
From a generator
Note: this option is incorporated from comments by eryksun.
A more efficient way to do this is to use a generator instead of the list. Replace the last two lines of the code in the question with:
a = array('i', (int(val) for row in stdin for val in row.split()))
This produces the same result as the option above, but avoids creating the intermediate list.
Using a NumPy array
If you want to preserve the 2d structure, you could use a NumPy array. Here's the whole example:
from StringIO import StringIO
import numpy as np
# fake stdin
stdin = StringIO("""1 2
3 4
5 6""")
a = np.loadtxt(stdin, dtype=np.int)
Now:
>>> a
array([[1, 2],
       [3, 4],
       [5, 6]])
Using standard lists
It is not clear from the question whether a plain Python list is acceptable. If it is, one way to accomplish the goal is to replace the last two lines of the code in the question with the following.
a = [map(int, row.split()) for row in stdin]
After running this, we have:
>>> a
[[1, 2], [3, 4], [5, 6]]
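The examples above are Python 2. For reference, a Python 3 version of the list approach (io.StringIO replaces StringIO.StringIO, and since map returns an iterator in Python 3, a nested comprehension is the most direct equivalent):

```python
import io

# fake stdin
stdin = io.StringIO("1 2\n3 4\n5 6")

a = [[int(v) for v in row.split()] for row in stdin]
print(a)  # [[1, 2], [3, 4], [5, 6]]
```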
I've never used array.array, so I had to do some digging around.
The answer is in the error message -
ValueError: string length not a multiple of item size
How do you determine the item size? Well, it depends on the type you initialized it with. In your case you initialized it with 'i', which is a signed int. Now, how big is an int? Ask your Python interpreter:
>>> a.itemsize
4
The value above provides insight into the problem. Your string is only 11 bytes wide, and 11 isn't a multiple of 4. But increasing the length of the string would not give you an array of [1, 2, 3, 4, 5, 6] either; I'm not sure what it would give you. Why the uncertainty? Well, read the excerpt from the documentation below.
array.fromfile(f, n)
Read n items (as machine values) from the file object f and append them to the end of the array. If less than n items are available, EOFError is raised, but the items that were available are still inserted into the array. f must be a real built-in file object; something else with a read() method won’t do.
array.fromstring reads data in the same manner as array.fromfile. Note the phrase "as machine values" above: it means the bytes are read as raw binary. So, to do what you want to do, you need to use the struct module. Check out the code below.
import struct
a = array.array('i')
binary_string = struct.pack('iiii', 1, 2, 3, 4)
a.fromstring(binary_string)
The code snippet above loads the array with the values 1, 2, 3, 4, as we expect.
Hope it helps.
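In Python 3, fromstring was renamed to frombytes (the old name lingered as a deprecated alias before being removed), so the same round trip looks like:

```python
import struct
from array import array

a = array('i')
packed = struct.pack('4i', 1, 2, 3, 4)  # four machine-sized ints as raw bytes
a.frombytes(packed)
print(a.tolist())  # [1, 2, 3, 4]
```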
To read a line of input as a string (Python 2):
arr = raw_input()
If you want to split the input by spaces:
arr = raw_input().split()

How do I declare an array in Python?

How do I declare an array in Python?
variable = []
Now variable refers to an empty list*.
Of course this is an assignment, not a declaration. There's no way to say in Python "this variable should never refer to anything other than a list", since Python is dynamically typed.
*The default built-in Python type is called a list, not an array. It is an ordered container of arbitrary length that can hold a heterogeneous collection of objects (their types do not matter and can be freely mixed). This should not be confused with the array module, which offers a type closer to the C array type; its contents must be homogeneous (all of the same type), but its length is still dynamic.
This is a surprisingly complex topic in Python.
Practical answer
Arrays are represented by the class list (see the reference; do not confuse them with generators).
Check out usage examples:
# empty array
arr = []
# init with values (can contain mixed types)
arr = [1, "eels"]
# get item by index (can be negative to access end of array)
arr = [1, 2, 3, 4, 5, 6]
arr[0] # 1
arr[-1] # 6
# get length
length = len(arr)
# supports append and insert
arr.append(8)
arr.insert(6, 7)
Theoretical answer
Under the hood, Python's list is a wrapper around a real array which contains references to the items. The underlying array is also created with some extra slack space.
The consequences of this are:
random access is really cheap (arr[6653] costs the same as arr[0])
the append operation is 'for free' (amortized constant time) while some extra space remains
the insert operation is expensive
Check this awesome table of operations complexity.
Also, keep in mind the most important differences between a plain array of values, an array of references (which is what list uses), and a linked list. [Diagram from the original answer omitted.]
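A tiny illustration of those costs in code:

```python
lst = list(range(5))

lst.append(5)       # amortized O(1): usually writes into pre-allocated slack space
lst.insert(0, -1)   # O(n): every existing element must shift one slot right
x = lst[3]          # O(1): random access is a pointer lookup at a computed offset

assert lst == [-1, 0, 1, 2, 3, 4, 5]
assert x == 2
```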
You don't actually declare things, but this is how you create an array in Python:
from array import array
intarray = array('i')
For more info see the array module: http://docs.python.org/library/array.html
Now possible you don't want an array, but a list, but others have answered that already. :)
I think you want a list with the first 30 cells already filled. So:
f = []
for i in range(30):
    f.append(0)
An example to where this could be used is in Fibonacci sequence.
See problem 2 in Project Euler
This is how:
my_array = [1, 'rebecca', 'allard', 15]
For calculations, use numpy arrays like this:
import numpy as np
a = np.ones((3,2)) # a 2D array with 3 rows, 2 columns, filled with ones
b = np.array([1,2,3]) # a 1D array initialised using a list [1,2,3]
c = np.linspace(2,3,100) # an array with 100 points between (and including) 2 and 3
print(a*1.5) # all elements of a times 1.5
print(a.T+b) # b added to the transpose of a
These NumPy arrays can be saved to and loaded from disk (even compressed), and complex calculations on large numbers of elements are C-like fast.
They are much used in scientific environments. See here for more.
JohnMachin's comment should be the real answer.
All the other answers are just workarounds in my opinion!
So:
array = [0] * element_count
A couple of contributions suggested that arrays in Python are represented by lists. This is incorrect: Python has an independent array implementation, array.array(), in the standard library module array, so it is wrong to confuse the two. Lists are lists in Python, so be careful with the nomenclature used.
list_01 = [4, 6.2, 7-2j, 'flo', 'cro']
list_01
Out[85]: [4, 6.2, (7-2j), 'flo', 'cro']
There is one very important difference between list and array.array(). While both of these objects are ordered sequences, array.array() is an ordered homogeneous sequence whereas a list is a non-homogeneous sequence.
You don't declare anything in Python. You just use it. I recommend you start out with something like http://diveintopython.net.
I would normally just do a = [1,2,3] which is actually a list but for arrays look at this formal definition
To add to Lennart's answer, an array may be created like this:
from array import array
float_array = array("f",values)
where values can take the form of a tuple, list, or np.array, but not array:
values = [1,2,3]
values = (1,2,3)
values = np.array([1,2,3],'f')
# 'i' will work here too, but if array is 'i' then values have to be int
wrong_values = array('f',[1,2,3])
# TypeError: 'array.array' object is not callable
and the output will still be the same:
print(float_array)
print(float_array[1])
print(isinstance(float_array[1],float))
# array('f', [1.0, 2.0, 3.0])
# 2.0
# True
Most methods for list work with array as well, common
ones being pop(), extend(), and append().
Judging from the answers and comments, it appears that the array
data structure isn't that popular. I like it though, the same
way as one might prefer a tuple over a list.
The array structure has stricter rules than a list or np.array, and this can
reduce errors and make debugging easier, especially when working with numerical
data.
Attempts to insert/append a float to an int array will throw a TypeError:
values = [1,2,3]
int_array = array("i",values)
int_array.append(float(1))
# or int_array.extend([float(1)])
# TypeError: integer argument expected, got float
Keeping values which are meant to be integers (e.g. list of indices) in the array
form may therefore prevent a "TypeError: list indices must be integers, not float", since arrays can be iterated over, similar to np.array and lists:
int_array = array('i',[1,2,3])
data = [11,22,33,44,55]
sample = []
for i in int_array:
    sample.append(data[i])
Annoyingly, appending an int to a float array will cause the int to become a float, without throwing an exception.
np.array retains the same data type for its entries too, but instead of giving an error it will change its data type to fit new entries (usually to double or str):
import numpy as np
numpy_int_array = np.array([1,2,3],'i')
for i in numpy_int_array:
    print(type(i))
# <class 'numpy.int32'>
numpy_int_array_2 = np.append(numpy_int_array,int(1))
# still <class 'numpy.int32'>
numpy_float_array = np.append(numpy_int_array,float(1))
# <class 'numpy.float64'> for all values
numpy_str_array = np.append(numpy_int_array,"1")
# <class 'numpy.str_'> for all values
data = [11,22,33,44,55]
sample = []
for i in numpy_int_array_2:
    sample.append(data[i])
# no problem here, but TypeError for the other two
This is true during assignment as well. If the data type is specified, np.array will, wherever possible, transform the entries to that data type:
int_numpy_array = np.array([1,2,float(3)],'i')
# 3 becomes an int
int_numpy_array_2 = np.array([1,2,3.9],'i')
# 3.9 gets truncated to 3 (same as int(3.9))
invalid_array = np.array([1,2,"string"],'i')
# ValueError: invalid literal for int() with base 10: 'string'
# Same error as int('string')
str_numpy_array = np.array([1,2,3],'str')
print(str_numpy_array)
print([type(i) for i in str_numpy_array])
# ['1' '2' '3']
# <class 'numpy.str_'>
or, in essence:
data = [1.2,3.4,5.6]
list_1 = np.array(data,'i').tolist()
list_2 = [int(i) for i in data]
print(list_1 == list_2)
# True
while array will simply give:
invalid_array = array([1,2,3.9],'i')
# TypeError: integer argument expected, got float
Because of this, it is not a good idea to use np.array for type-specific commands. The array structure is useful here. list preserves the data type of the values.
And for something I find rather pesky: the data type is specified as the first argument in array(), but (usually) the second in np.array(). :|
The relation to C is referred to here:
Python List vs. Array - when to use?
Have fun exploring!
Note: The typed and rather strict nature of array leans more towards C rather than Python, and by design Python does not have many type-specific constraints in its functions. Its unpopularity also creates a positive feedback in collaborative work, and replacing it mostly involves an additional [int(x) for x in file]. It is therefore entirely viable and reasonable to ignore the existence of array. It shouldn't hinder most of us in any way. :D
How about this...
>>> a = range(12)
>>> a
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
>>> a[7]
7
Following on from Lennart, there's also numpy which implements homogeneous multi-dimensional arrays.
Python calls them lists. You can write a list literal with square brackets and commas:
>>> [6,28,496,8128]
[6, 28, 496, 8128]
I had an array of strings and needed an array of the same length with booleans initialized to True. This is what I did:
strs = ["Hi","Bye"]
bools = [True for s in strs]
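For an immutable fill value like True, multiplying a one-element list is equivalent and a bit more direct:

```python
strs = ["Hi", "Bye"]
bools = [True for s in strs]
bools2 = [True] * len(strs)   # same result for an immutable fill value

assert bools == bools2 == [True, True]
```

(Be careful with the multiplication form for mutable fill values such as lists, since all cells would then reference the same object.)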
You can create lists and convert them into arrays, or you can create an array using the NumPy module. Below are a few examples to illustrate this. NumPy also makes it easier to work with multi-dimensional arrays.
import numpy as np
a = np.array([1, 2, 3, 4])
#For custom inputs
a = np.array([int(x) for x in input().split()])
You can also reshape this array into a 2x2 matrix using the reshape function, which takes the dimensions of the matrix as input.
mat = a.reshape(2, 2)
# This creates a list of 5000 zeros
a = [0] * 5000
You can read and write any element in this list with a[n] notation, the same way as you would with an array.
It seems to have the same random-access performance as an array. I cannot say how it allocates memory, because it also supports a mix of different types, including strings and objects, if you need it to.
