Referential Arrays in Python

I have started learning Data Structures and Algorithms.
Please help me with my doubt.
Is it ok to say that lists, tuples, and dictionaries are a type of referential array?
I was going through the example in the book where it was written that we want a medical information system to keep track of the patients currently assigned to beds in a certain hospital. If we assume that the hospital has 200 beds, and conveniently that those beds are numbered from 0 to 199, we might consider using an array-based structure to maintain the names of the patients currently assigned to those beds. For example, in Python, we might use a list of names, such as:
['Rene', 'Joseph', 'Janet', 'Jonas', 'Helen', 'Virginia', ...]
To represent such a list with an array, Python must adhere to the requirement that each cell of the array use the same number of bytes. Yet the elements are strings, and strings naturally have different lengths. Python could attempt to reserve enough space for each cell to hold the maximum length string (not just of currently stored strings, but of any string we might ever want to store), but that would be wasteful.
Instead, Python represents a list or tuple instance using an internal storage mechanism of an array of object references. At the lowest level, what is stored is a consecutive sequence of memory addresses at which the elements of the sequence reside. A high-level diagram of such a list is shown in the figure below:
https://i.stack.imgur.com/7FffP.jpg
What I understood is that Python stores the memory address of "Rene" or "Joseph". But the memory those names occupy will also vary with the number of characters in the name, since each Unicode character takes up space (say, 2 bytes).
It was also written that "Although the relative size of the individual elements may vary, the number of bits used to store the memory address of each element is fixed (e.g., 64 bits per address, i.e. 8 bytes)." What if the string is very long and its memory address can't be stored in 64 bits?

Is it ok to say that lists, tuples, and dictionaries are a type of referential array?
Lists and tuples are (a list adds dynamic resizing on top of its reference array), but dictionaries have a lot more going on behind the scenes: they are hash tables that happen to store references.
What I understood is that Python stores the memory address of "Rene" or "Joseph". But the memory those names occupy will also vary with the number of characters in the name, since each Unicode character takes up space (say, 2 bytes).
The memory address is an integer describing where to find the data, and the data itself is a sequence of bytes like b'Rene'. There are a few common ways a runtime can know where a data structure ends: a null terminator (given an address, every byte you find is part of the data until you hit a null byte), making every object the same size so that you can skip ahead (not feasible for strings), or having the first byte(s) record the length of the rest of the object; CPython strings take the last approach and keep their length in the object header. The address itself has a fixed size, e.g., 64 bits (8 bytes).
In the following illustration, the "address" used in the referential array would be 3, and you would know that the last byte of the string happened at address 6 by walking till you found the null byte (the other approaches I mentioned work similarly). The important thing is that 3 is the only number the referential array needs to store to be able to reference b'Rene'.
|3|4|5|6|7 |
|R|e|n|e|\0|
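If you want to poke at this from Python itself, id() and sys.getsizeof() make the same point. This is a CPython-specific sketch (id() returning the memory address is an implementation detail, and the exact byte counts vary by version):

import sys

beds = ['Rene', 'Joseph', 'Janet', 'Jonas']

# The string objects themselves have different sizes...
for name in beds:
    print(name, sys.getsizeof(name), id(name))

# ...but the list stores one fixed-size reference per element, so each extra
# element adds the same amount (8 bytes on a 64-bit build) to the list itself.
print(sys.getsizeof(['Rene']))            # e.g. 64
print(sys.getsizeof(['Rene', 'Joseph']))  # e.g. 72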
What if the string is very long and its memory address can't be stored in 64 bits?
This was actually the reason 32-bit systems couldn't use more than 4GB of RAM -- they only had 32-bit identifiers to access the RAM (without additional techniques like assigning each process a RAM offset so that each process has a 32-bit memory space). Modern 64-bit systems allow 16 exabytes of RAM to be indexed without needing any other special techniques -- one of the primary motivations for the 32 -> 64 bit migration.
In any case though, memory limits are a real thing, and if we start needing to address more memory than that we'll need to use more complicated strategies than fixed-width integer references, or we'll need to jump up again and use bigger integers.
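You can check the reference width of your own interpreter directly; the following assumes a typical 64-bit CPython build:

import struct
import sys

print(struct.calcsize('P') * 8)    # bits per C pointer: 64 on a 64-bit build
print(sys.maxsize == 2**63 - 1)    # True on a 64-bit build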

Related

Is there a way to have a list of 4 billion numbers in Python?

I made a binary search function and I'm curious what would happen if I used it on 4 billion numbers, but I get a MemoryError every time I use it. Is there a way to store the list without this issue?
Yes, but it's gonna be expensive.
Assuming the numbers are 1-4,000,000,000, each number takes up either 28 or 32 bytes as a Python int. Going with 32 bytes, storing 4,000,000,000 of them would take ~128GB. The list itself also takes memory, 8 bytes per reference, so it would require around 160GB of memory altogether.
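Here is a rough back-of-the-envelope version of that estimate, assuming 64-bit CPython (per-object sizes vary slightly between versions):

import sys

n = 4_000_000_000
int_size = sys.getsizeof(3_999_999_999)    # 32 bytes for ints of this magnitude
ref_size = 8                               # one reference per list slot

total_bytes = n * (int_size + ref_size)
print(f"~{total_bytes / 1e9:.0f} GB")      # roughly 160 GB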
If you're wanting to emulate a list, however, things become a lot more reasonable.
If your numbers follow some formula, you can make a class that "pretends" to be a list by following the answers here. This creates a custom class that works like a list (e.g. you can do foo[3] on it and it will return a number) but doesn't take up all that memory, because it isn't actually storing 4,000,000,000 numbers. In fact, this is pretty similar to how range() works, and is the reason why range(1, 4_000_000_000) doesn't take up 160GB of RAM.
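Here is a minimal sketch of such a "pretend list", assuming your numbers follow a formula (here simply element i = i + 1, so it behaves like the numbers 1 through 4,000,000,000 without storing any of them; the class name is made up for the example):

class FormulaList:
    """Acts like a read-only list of numbers without storing them."""

    def __init__(self, n):
        self._n = n

    def __len__(self):
        return self._n

    def __getitem__(self, i):
        if isinstance(i, slice):
            return [self[j] for j in range(*i.indices(self._n))]
        if i < 0:
            i += self._n
        if not 0 <= i < self._n:
            raise IndexError("list index out of range")
        return i + 1                 # the "formula": element i is i + 1

foo = FormulaList(4_000_000_000)
print(foo[3])                        # 4
print(len(foo))                      # 4000000000
print(foo[:5])                       # [1, 2, 3, 4, 5]

A binary search that only uses len() and indexing will work on an object like this unchanged.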

Different lists in Python with different sizes and contents still share the same id; does memory matter?

I have read the answer from this question as well as the related questions about different objects sharing the same id (which can be answered by the Python docs on id). However, in those questions the contents of the objects are the same (and thus the memory sizes are the same, too). I experimented with lists of different sizes and contents, on both the IPython shell and a .py file with CPython, and got the "same id" result, too:
print(id([1]), id([1,2,3]), id([1,2,3,4,1,1,1,1,1,1,1,1,1,1,1,1]))
# Result: 2067494928320 2067494928320 2067494928320
The result doesn't change no matter how many elements I add to the list or how big or small the numbers are.
So I have a question here: when an id is given, does the list size have any effect on whether the id can be reused or not? I thought that it could because according to the docs above,
CPython implementation detail: This is the address of the object in memory.
and if the address does not have enough space for the list, then a new id should be given. But I'm quite surprised about the result above.
Make a list, and add some items to it. The id remains the same:
In [21]: alist = []
In [22]: id(alist)
Out[22]: 139981602327808
In [23]: for i in range(29): alist.append(i)
In [24]: id(alist)
Out[24]: 139981602327808
But the memory used by this list comes in several parts. There's storage for the list instance itself (that's what the id references). Python is written in C, but all items are full objects (much as in C++).
The list also has a data buffer; think of it as a C array with a fixed size. It holds pointers to objects elsewhere in memory. That buffer has space for the current references plus some growth space. As you add items to the list, their references are inserted into the growth space. When that fills up, the list gets a new, larger buffer with more growth space. List append is relatively fast, with periodic slowdowns when it copies the references to the new buffer. All of that happens under the covers so that the Python programmer doesn't notice.
I suspect that in my example alist got a new buffer several times, though there's no direct way to observe those reallocations from Python.
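One indirect, CPython-specific way to see it (the exact sizes and the growth pattern vary by version) is to watch sys.getsizeof while appending: the reported size jumps whenever the buffer is reallocated, while the id never changes.

import sys

alist = []
last = sys.getsizeof(alist)
print(id(alist), last)

for i in range(29):
    alist.append(i)
    size = sys.getsizeof(alist)
    if size != last:                 # the data buffer was reallocated
        print(f"after append #{i}: size {last} -> {size}, id still {id(alist)}")
        last = size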
Storage for the objects referenced by the list is another matter. CPython creates the small integer objects (-5 through 256) at startup, so my list (and yours) holds references to those preexisting, unique integer objects. It also maintains a cache of certain strings (interning). But other things, such as larger numbers, other lists, dicts, and custom class instances, are created as needed, and identically valued objects may well have different ids.
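A quick illustration of that caching (the int(...) calls avoid compile-time constant folding, so the comparison is honest; the cached range, -5 through 256, is a CPython implementation detail):

x = int('256'); y = int('256')
print(x is y)       # True: values in the small-int cache share one object

x = int('257'); y = int('257')
print(x is y)       # False: larger ints are created as needed, so identically
                    # valued objects can have different ids
print(id(x), id(y))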
So while the data buffer of the list is contiguous, the items referenced in the buffer are not.
By and large, that memory management is unimportant to programmers. It's only when data structures get very large that we need to worry. And that seems to occur more with object classes like numpy arrays, which have a very different memory use.

Large NumPy array handling, NumPy data processing, memmap function mapping

Large numpy array (over 4GB) with .npy file and memmap function
I was using the numpy package for array calculations, and I read https://docs.scipy.org/doc/numpy/neps/npy-format.html
In "Format Specification: Version 2.0" it said that, for a .npy file, the "version 2.0 format extends the header size to 4 GiB".
My questions were:
What was the header size? Did that mean I could only save a numpy.array of at most 4GB into an .npy file? How large could a single array be?
I also read https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.memmap.html
where it stated that "Memory-mapped files cannot be larger than 2GB on 32-bit systems".
Did it mean numpy.memmap's limitation depended on the memory of the system? Was there any way to avoid that limitation?
Further, I read that we could choose the dtype of the array, where the best resolution was "complex128". Was there any way to "use" and "save" elements with more precision on a 64-bit computer (more precise than complex128 or float64)?
The previous header size field was 16 bits wide, allowing headers smaller than 64KiB. Because the header describes the structure of the data, and doesn't contain the data itself, this is not a huge concern for most people. Quoting the notes, "This can be exceeded by structured arrays with a large number of columns." So to answer the first question, header size was under 64KiB but the data came after, so this wasn't the array size limit. The format didn't specify a data size limit.
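A small sketch to make that concrete: the header is only the metadata block, and the array data follows it, so a multi-megabyte (or multi-gigabyte) array still fits comfortably in a version 1.0 file as long as its header stays under 64KiB. Treat np.lib.format.read_magic as an assumption to check against your NumPy version:

import numpy as np

arr = np.zeros((1000, 1000), dtype=np.float64)   # ~8 MB of data, tiny header
np.save('example.npy', arr)

with open('example.npy', 'rb') as f:
    print(np.lib.format.read_magic(f))           # e.g. (1, 0): the header fit easily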
Memory map capacity is dependent on operating system as well as machine architecture. Nowadays we've largely moved to flat but typically virtual address maps, so the program itself, stack, heap, and mapped files all compete for the same space, in total 4GiB for 32 bit pointers. Operating systems frequently partition this in quite large chunks, so some systems might only allow 2GiB total for user space, others 3GiB; and often you can map more memory than you can allocate otherwise. The memmap limitation is more closely tied to the operating system in use than the physical memory.
Non-flat address spaces, such as using distinct segments on OS/2, could allow larger usage. The cost is that a pointer is no longer a single word. PAE, for instance, supplies a way for the operating system to use more memory but still leaves processes with their own 32 bit limits. Typically it's easier nowadays to use a 64 bit system, allowing memory spaces up to 16 exabytes. Because data sizes have grown a lot, we also handle it in larger pieces, such as 4MiB or 16MiB allocations rather than the classic 4KiB pages or 512B sectors. Physical memory typically has more practical limits.
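For what it's worth, the numpy.memmap call itself is simple; the limits discussed above come from the process address space, not from this API, and a 64-bit Python lifts them. The file name and shape here are just placeholders:

import numpy as np

# Create (or open) a file-backed array; only the pages you touch are read or
# written, and the same call works for files far larger than RAM on 64-bit.
mm = np.memmap('big.dat', dtype=np.float64, mode='w+', shape=(1000, 1000))
mm[0, :5] = np.arange(5)
mm.flush()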
Yes, there are element types with more precision than 64-bit floating point; in particular, 64-bit integers. They effectively provide a larger mantissa by sacrificing all of the exponent. complex128 is two 64-bit floats; it doesn't have higher precision, just a second component. There are types that can grow arbitrarily precise, such as Python's long integers (long in Python 2, int in Python 3) and fractions, but numpy generally doesn't delve into those because they come with matching storage and computation costs. A basic property of numpy arrays is that elements can be addressed with index arithmetic, which requires a consistent element size.
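A small comparison of the available precisions (np.longdouble is platform-dependent, often 80-bit extended precision on x86 Linux and just an alias for float64 on Windows, so treat its figures as an assumption to check):

import numpy as np

print(np.finfo(np.float64).precision)      # ~15 decimal digits
print(np.finfo(np.longdouble).precision)   # 18+ on platforms with extended precision

# int64 is exact over its whole range, which float64 is not:
print(np.int64(2**53) + 1)                 # 9007199254740993, exact
print(np.float64(2**53) + 1)               # 9007199254740992.0, rounded

# complex128 is just two float64s; it adds a component, not precision:
print(np.finfo(np.complex128).precision)   # same as float64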

Memory needed for an item in a list

I have been learning Python for a year. In the lecture slides, I see that the compiler usually allocates 4 bytes to store each item in a list.
Why can't the compiler allocate memory dynamically? For example, if I have a small value, why doesn't the compiler just assign 1 or 2 bytes to store it? Wouldn't that be more memory efficient?
Also, what is meant by a 32-bit integer where one bit is a sign bit and 31 bits represent the number? One byte equals 8 bits and an integer is 32 bits, so why would it be represented with 8 bytes, wasting 4 bytes? (On some computers you see 64-bit; they can use 8 bytes, which only affects how large a number you can represent. The more bits, the larger the number. It comes down to the processor and its registers.)
It is not true that elements in a list or array take 4 bytes.
In languages with strong static types (that is, in which you have to declare the type of a variable before using it), an array takes exactly the number of elements times the size of each element of that type. In C, the char type usually uses one byte, and the int type uses a number of bytes that depends on the system's architecture.
In languages with dynamic types, like Python, all variables are implemented as objects containing pointers (to a location in memory where data is actually stored) and labels (indicating the type of the variable, which will be used in runtime to determine the variable's behavior when performing operations).
The 4-byte figure probably refers to the fact that many architectures use 4 bytes for integers, but this is coincidental, and it is a stretch to consider it a rule.
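You can see the dynamic-typing overhead directly in CPython (the sizes are implementation details and vary by version and platform):

import array
import sys

# In C, an int in an array would typically occupy 4 bytes. In CPython even a
# small int is a full object with a type pointer and reference count:
print(sys.getsizeof(7))            # ~28 bytes on a 64-bit build

# A list doesn't store the ints inline; it stores one 8-byte reference each,
# plus the list header:
nums = [1, 2, 3, 4]
print(sys.getsizeof(nums))         # header + 4 * 8 bytes

# For densely packed 4-byte integers, the array module (or numpy) stores raw
# values instead of object references:
packed = array.array('i', nums)    # 'i' is a C int, typically 4 bytes
print(packed.itemsize)             # 4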

Prefix search against half a billion strings

I have a list of 500 mil strings. The strings are alphanumeric, ASCII characters, of varying size (usually from 2-30 characters). Also, they're single words (or a combination of words without spaces like 'helloiamastring').
What I need is a fast way to check against a target, say 'hi'. The result should be all strings from the 500 mil list which start with 'hi' (e.g. 'hithere', 'hihowareyou', etc.). This needs to be fast because there will be a new query each time the user types something, so if he types "hi", all strings starting with "hi" from the 500 mil list will be shown; if he types "hey", all strings starting with "hey" will show, etc.
I've tried a trie, but the memory footprint to store 300 mil strings is just huge; it would require 100GB+ of RAM. And I'm pretty sure the list will grow up to a billion.
What is a fast algorithm for this use case?
P.S. In case there's no fast option, the best alternative would be to limit people to enter at least, say, 4 characters, before results show up. Is there a fast way to retrieve the results then?
You want a Directed Acyclic Word Graph or DAWG. This generalizes #greybeard's suggestion to use stemming.
See, for example, the discussion in section 3.2 of this.
If the strings are sorted then a binary search is reasonable. As a speedup, you could maintain a dictionary of all possible bigrams ("aa", "ab", etc.) where the corresponding values are the first and last index starting with that bigram (if any do) and so in O(1) time zero in on a much smaller sublist that contains the strings that you are looking for. Once you find a match, do a linear search to the right and left to get all other matches.
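Here is a minimal sketch of that sorted-list idea using the standard bisect module. It finds the end of the block with a second bisect instead of the linear scan (the bigram dictionary described above would simply narrow the range before these calls), and it relies on the strings being plain ASCII so that '\x7f' sorts after any character in the data; the function name is made up:

import bisect

words = sorted(['hello', 'helloworld', 'hithere', 'hihowareyou', 'hey', 'hi'])

def prefix_matches(sorted_words, prefix):
    # Everything sharing the prefix sorts into one contiguous block: it starts
    # at the first word >= prefix and ends before the first word >= prefix
    # followed by a character higher than any in the data.
    lo = bisect.bisect_left(sorted_words, prefix)
    hi = bisect.bisect_left(sorted_words, prefix + '\x7f')
    return sorted_words[lo:hi]

print(prefix_matches(words, 'hi'))   # ['hi', 'hihowareyou', 'hithere']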
If you want to force the user to type at least 4 letters, for example, you can keep a key-value map, in memory or on disk, where the keys are all combinations of 4 letters (not too many if matching is case-insensitive; otherwise limit it to three letters), and the values are lists of positions of all strings that begin with that combination.
After the user has typed the first three (or four) letters you immediately have all the possible strings; from this point on you just loop over this subset.
On average this subset is small enough: roughly 500M divided by 26^4, i.e. about 1,100 strings, just as an example. In practice each subset is bigger, because not every 4-letter combination actually occurs as a prefix of your strings.
Forgot to say: when you add a new string to the big list, you also update the list of indexes for the corresponding key in the map.
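A minimal in-memory sketch of that map, assuming case-insensitive ASCII words and a 3-letter key (all names here are illustrative):

from collections import defaultdict

words = ['helloiamastring', 'hellothere', 'heyyou', 'hithere', 'hihowareyou']

# Build once: key = first 3 letters, value = indexes of all words with that prefix.
prefix_index = defaultdict(list)
for i, w in enumerate(words):
    prefix_index[w[:3].lower()].append(i)

def lookup(query):
    # Jump straight to the small candidate subset, then filter it.
    candidates = prefix_index.get(query[:3].lower(), [])
    return [words[i] for i in candidates if words[i].startswith(query)]

print(lookup('hel'))      # ['helloiamastring', 'hellothere']
print(lookup('hellot'))   # ['hellothere']

# When a new word is appended to the big list, update the map as well:
def add_word(w):
    words.append(w)
    prefix_index[w[:3].lower()].append(len(words) - 1)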
If you don't want to use a database, you will have to recreate some of the data-handling routines that exist in every database engine:
Don't try to load all the data into memory.
Use a fixed length for every string. This increases storage consumption but significantly decreases seek time (the i-th string can be found at position L*i bytes in the file, where L is the fixed length; see the sketch after this list). Add a separate mechanism for extremely long strings: store them elsewhere and use special pointers.
Sort all of the strings. You can use a merge sort to do this without loading all the strings into memory at once.
Create indexes (the file offset of the first line starting with 'a', 'b', ...); indexes can also be created for 2-grams, 3-grams, etc., and can be kept in memory to increase search speed.
Use smarter strategies to avoid regenerating the full indexes on every update: split the data into a number of files by first letter and update only the affected indexes, leave empty gaps in the data to reduce the cost of read-modify-write operations, and keep a cache of new lines that is searched before they are merged into the main storage.
Use a query cache to speed up popular requests.
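As a sketch of the fixed-length idea from the second item, here is how the i-th string can be read with a single seek instead of a scan (file name, record length, and helper names are illustrative, and every string is assumed to fit in L bytes):

import os

L = 32                        # fixed record length in bytes, padded with b'\0'
DATA = 'strings.dat'

def write_sorted(strings, path=DATA):
    # Assumes the strings are already sorted and ASCII.
    with open(path, 'wb') as f:
        for s in strings:
            f.write(s.encode('ascii').ljust(L, b'\0'))

def read_record(f, i):
    # The i-th string lives at byte offset L * i, so no scanning is needed.
    f.seek(L * i)
    return f.read(L).rstrip(b'\0').decode('ascii')

write_sorted(sorted(['hello', 'hithere', 'hihowareyou', 'hey']))

with open(DATA, 'rb') as f:
    n = os.path.getsize(DATA) // L
    print(n, read_record(f, 2))   # 4 records; record 2 is 'hihowareyou'

With records laid out like this, a binary search only needs read_record and the record count, so nothing has to be loaded into memory up front.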
In this hypothetical, where the strings being indexed are not associated with any other information (e.g. other columns in the same row), there is relatively little difference between a complete index and keeping the strings sorted in the first place (as in, some difference, but not as much as you are hoping for). In light of the growing nature of the list and the cost of updating it, perhaps the opposite approach will better accomplish the performance tradeoffs that you are looking for.
For any given character at any given location in the string, your base case is that no string exists containing that letter. For example, once 'hello' has been typed, if the next letter typed is 't', then your base case is that there is no string beginning 'hellot'. There is a finite number of characters that could follow 'hello' at location 5 (say, 26), so you need 26 fixed-length spaces in which to store information about characters that follow 'hello' at location 5. Each space either says zero, if there is no string in which, e.g., 't' follows 'hello', or contains a number of data-storage addresses by which to advance to find the list of characters for which one or more strings involve that character following 'hellot' at location 6. (You could use absolute data-storage addresses instead, but only relative addresses let the algorithm I propose support an arbitrary number of strings of arbitrary length without having to enlarge the pointers as the list grows.)
The algorithm can then move forward through this data stored on disk, building a tree of string-beginnings in memory as it goes, and avoiding delays caused by random-access reads. For an in-memory index, simply store the part of the tree closest to the root in memory. After the user has typed 'hello' and the algorithm has tracked that information about one or more strings beginning 'hellot' exists at data-storage address X, the algorithm finds one of two types of lists at location X. Either it is another sequence of, e.g., 26 fixed-length spaces with information about characters following 'hellot' at location 6, or it is a pre-allocated block of space listing all post-fixes that follow 'hellot', depending on how many such post-fixes exist. Once there are enough post-fixes that using some traditional search and/or sort algorithm to both update and search the post-fix list fails to provide the performance benefits that you desire, it gets divided up and replaced with a sequence of, e.g., 26 fixed-length spaces.
This involves pre-allocating a relatively substantial amount of disk storage upfront, with the tradeoff that your tree can be maintained in sorted form without needing to move anything around for most updates, and your searches can be performed in full in a single sequential read. It also provides more flexibility and probably requires less storage space than a solution based on storing the strings themselves as fixed-length strings.
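Here is a minimal in-memory sketch of that fixed-fan-out idea. The real proposal keeps each block as a fixed-length record on disk and stores relative offsets; here the blocks live in one flat Python list just to show the shape of the structure, and all names are illustrative:

FANOUT = 26                         # one slot per letter 'a'-'z'
blocks = [[0] * FANOUT]             # block 0 is the root

def _slot(ch):
    return ord(ch) - ord('a')

def insert(word):
    node = 0
    for ch in word:
        s = _slot(ch)
        if blocks[node][s] == 0:    # no block yet for this continuation
            blocks.append([0] * FANOUT)
            blocks[node][s] = len(blocks) - 1
        node = blocks[node][s]

def has_prefix(prefix):
    node = 0
    for ch in prefix:
        node = blocks[node][_slot(ch)]
        if node == 0:               # base case: nothing starts with this prefix
            return False
    return True

for w in ['hello', 'hellothere', 'help']:
    insert(w)

print(has_prefix('hellot'))   # True
print(has_prefix('hellox'))   # False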
First of all I should say that the tag you should have added for your question is "Information Retrieval".
I think using Apache Lucene's PrefixQuery is the best way to handle wildcard queries. Apache has a Python version (PyLucene) if you are comfortable with Python. But to use Apache Lucene to solve your problem, you should first learn about indexing your data (the part where your data is compressed and saved in a more efficient form).
Also, looking at the indexing and wildcard-query sections of an IR book will give you a better picture.
