PEP 8 has naming conventions for, e.g., functions (lowercase), classes (CamelCase) and constants (uppercase).
It seems to me that distinguishing between numpy arrays and built-ins such as lists is probably more important, since the same operators, such as "+", actually mean something totally different.
Does anyone have any naming conventions to help with this?
You may use a prefix np_ for numpy arrays, thus distinguishing them from other variables.
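For instance, a minimal sketch of that convention (the variable names here are purely illustrative):

import numpy as np

np_heights = np.array([1.75, 1.62, 1.80])  # numpy array: np_ prefix
heights = [1.75, 1.62, 1.80]               # plain list: no prefix

np_heights + np_heights  # element-wise addition
heights + heights        # list concatenation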
numpy arrays and lists occupy similar syntactic roles in your code, and as such I wouldn't try to distinguish between them with naming conventions. Since everything in Python is an object, the usual naming conventions are there not to help distinguish type so much as usage. Data, whether represented in a list or a numpy.ndarray, has the same usage.
I agree that it's awkward that, e.g., + means different things for lists and arrays. I implicitly deal with this by never putting anything like numerical data in a list, but rather always in an array. That way I know that if I want to concatenate blocks of data I should be using numpy.hstack. That said, there are definitely cases where I want to build up a list through concatenation and turn it into a numpy array when I'm done. In those cases the code block is usually short enough that it's clear what's going on. Some comments in the code never hurt.
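As a hedged sketch of that build-then-convert pattern (names are illustrative):

import numpy as np

blocks = []
for i in range(3):
    blocks.append([i, i + 1])       # cheap list appends while accumulating
data = np.array(blocks)             # convert once at the end
combined = np.hstack([data, data])  # from here on, concatenate with numpy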
Actually, I am doing a sequential operation on a numpy array, so I want to know how to access a[i] quickly.
(Because I accessed a[i-1] in the previous loop iteration; in C++ we could simply access a[i] by adding 1 to the address of a[i-1], but I don't know whether that is possible in numpy. Thanks.)
I don't think this is possible; a[i] is the fastest way.
Python is a programming language that is easier to learn (and use) than C++. This of course comes at a cost, and one of those costs is that it's slower.
The raw pointers you're talking about can be "dangerous", hence Python does not make them (easily) available, to protect people from things they do not understand.
While pointer arithmetic is faster, you can't use it in Python (and since Python is slower anyway, the difference between using pointers or not wouldn't matter much).
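That said, the idiomatic way to make element access fast in numpy is to avoid the Python-level index loop entirely and operate on slices. A hedged sketch, assuming you are combining consecutive elements:

import numpy as np

a = np.arange(10, dtype=np.float64)

# instead of touching a[i] and a[i-1] one index at a time...
diffs_loop = [a[i] - a[i - 1] for i in range(1, len(a))]

# ...operate on whole slices at once; numpy does the pointer walk in C
diffs_vec = a[1:] - a[:-1]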
It's best not to think of a NumPy ndarray as a C++ array. They're very different. Python also offers its own native list objects and includes an array module in its standard library.
A Python list behaves mostly like a generic array (as found in many programming languages). It's an ordered sequence; elements can be accessed by integer index from 0 up through (but not including) the length of the list (len(mylist)); ranges of elements can be accessed using "slice" notation (mylist[start_offset:end_offset]), returning another list object; negative indexes are treated as offsets from the end of the list (mylist[-1] is the last item of the list); and so on.
Additionally, they support a number of methods such as .count(), .index() and .reverse().
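A quick sketch of those behaviours:

mylist = [10, 20, 30, 40]
mylist[0]         # 10: zero-based indexing
mylist[1:3]       # [20, 30]: slicing returns a new list
mylist[-1]        # 40: negative indexes count from the end
mylist.count(10)  # 1: occurrences of a value
mylist.index(30)  # 2: position of the first match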
Unlike the arrays in most programming languages, Python lists are heterogeneous. The elements can be any mixture of object types in Python, including references to nested lists, dictionaries, code, classes, generator objects and other callable first-class objects and, of course, instances of custom objects.
The Python array module allows one to instantiate homogeneous list-like objects. That is, you instantiate them using any of a dozen primitive data types (character or Unicode, signed or unsigned short or long integers, floating point or double-precision floating point). Indexing and slicing are identical to Python native lists.
The primary advantage of Python array.array() instances is that they can store large numbers of their elements far more compactly than the more generalized list objects. Various operations on these arrays are likely to be somewhat faster than similar operations performed by iterating over or otherwise referencing elements in a native Python list because there's greater locality of reference in the more compact array layout (in memory) and because the type constraint obviates some dispatch overhead that's incurred when handling generalized objects.
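A minimal sketch of the array module in action:

import array

a = array.array('d', [1.0, 2.0, 3.0])  # homogeneous: double-precision floats only
a[0]        # 1.0: indexing works like a list
a[1:]       # array('d', [2.0, 3.0]): so does slicing
a.itemsize  # 8: every element occupies exactly 8 bytes
# a.append('x') would raise TypeError: the type constraint is enforced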
NumPy, on the other hand, is far more sophisticated than the Python array module.
For one thing, the ndarray can be multi-dimensional and can be dynamically reshaped. It's common to start with a linear ndarray and to reshape it into a matrix or other higher-dimensional structure. Also, the ndarray supports a much richer set of data types than the Python array module. NumPy also implements some rather advanced fancy indexing features.
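For example:

import numpy as np

a = np.arange(12)    # linear ndarray: 0 through 11
m = a.reshape(3, 4)  # reshaped into a 3x4 matrix (a view, not a copy)
m[m > 5]             # boolean fancy indexing: array([ 6, 7, 8, 9, 10, 11])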
But the real performance advantages of NumPy relate to how it "vectorizes" most operations and broadcasts them across the data structures (possibly using any SIMD features supported by your CPU, or even your GPU, in the process). At the very least, many common matrix operations, when properly written in Python for NumPy, execute at native machine code speed. This performance edge goes well beyond the minor effects of locality of reference and obviating dispatch tables that one gains using the simple array module.
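A hedged sketch of what "vectorized" means in practice:

import numpy as np

a = np.arange(1000000, dtype=np.float64)

# Python-level loop: one interpreted iteration (and type dispatch) per element
total = 0.0
for x in a:
    total += x

# vectorized: a single call that loops over the whole buffer in compiled C
total = a.sum()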
As far as I understand it, tuples and strings are immutable to allow optimizations such as re-using memory that won't change. However, one obvious optimization, making slices of tuples refer to the same memory as the original tuple, is not included in Python.
I know that this optimization isn't included because when I time the following function, time taken goes like O(n^2) instead of O(n), so full copying is taking place:
def test(n):
    tup = tuple(range(n))
    for i in xrange(n):  # Python 2; use range() in Python 3
        tup[0:i]         # each slice copies i element pointers
Is there some behavior of python that would change if this optimization was implemented? Is there some performance benefit to copying even when the original is immutable?
By view, are you thinking of something equivalent to what numpy does? I'm familiar with how and why numpy does that.
A numpy array is an object with shape and dtype information, plus a data buffer. You can see this information in the __array_interface__ property. A view is a new numpy object, with its own shape attribute, but with a new data buffer pointer that points to someplace in the source buffer. It also has a flag that says "I don't own the buffer". numpy also maintains its own reference count, so the data buffer is not destroyed if the original (owner) array is deleted (and garbage collected).
This use of views can be a big time saver, especially with very large arrays (questions about memory errors are common on SO). Views also allow different dtypes, so a data buffer can be viewed as 4-byte integers, or 1-byte characters, etc.
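A small sketch of those mechanics:

import numpy as np

a = np.arange(8, dtype=np.int32)
v = a[2:6]           # a view: its own shape, same underlying buffer
v.base is a          # True: v does not own its data
v.flags.owndata      # False
b = a.view(np.int8)  # same buffer reinterpreted as 1-byte integers
len(b)               # 32: four bytes per original int32 element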
How would this apply to tuples? My guess is that it would require a lot of extra baggage. A tuple consists of a fixed set of object pointers - probably a C array. A view would use the same array, but with its own start and end markers (pointers and/or lengths). What about sharing flags? Garbage collection?
And what's the typical size and use of tuples? A common use of tuples is to pass arguments to a function. My guess is that a majority of tuples in a typical Python run are small - 0, 1 or 2 elements. Slices are allowed, but are they very common? On small tuples or very large ones?
Would there be any unintended consequences to making tuple slices views (in the numpy sense)? The distinction between views and copies is one of the harder things for numpy users to grasp. Since a tuple is supposed to be immutable - that is the pointers in the tuple cannot be changed - it is possible that implementing views would be invisible to users. But still I wonder.
It may make most sense to try this idea on a branch of the PyPy version, unless you really want to dig into CPython code. Or as a custom class with Cython.
This is a follow-up to this question
What are the benefits / drawbacks of a list of lists compared to a numpy array of OBJECTS with regard to MEMORY?
I'm interested in understanding the speed implications of using a numpy array vs a list of lists when the array is of type object.
If anyone is interested in the object I'm using:
import gmpy2 as gm
gm.mpfr('0') # <-- this is the object
The biggest usual benefits of numpy, as far as speed goes, come from being able to vectorize operations, which means you replace a Python loop around a Python function call with a C loop around some inlined C (or even custom SIMD assembly) code. There are probably no built-in vectorized operations for arrays of mpfr objects, so that main benefit vanishes.
However, there are some places where you'll still benefit:
Some operations that would require a copy in pure Python are essentially free in numpy—transposing a 2D array, slicing a column or a row, even reshaping the dimensions are all done by wrapping a pointer to the same underlying data with different striding information. Since your initial question specifically asked about A.T, yes, this is essentially free.
Many operations can be performed in-place more easily in numpy than in Python, which can save you some more copies.
Even when a copy is needed, it's faster to bulk-copy a big array of memory and then refcount all of the objects than to iterate through nested lists deep-copying them all the way down.
It's a lot easier to write your own custom Cython code to vectorize an arbitrary operation with numpy than with Python.
You can still get some benefit from using np.vectorize around a normal Python function, pretty much on the same order as the benefit you get from a list comprehension over a for statement.
Within certain size ranges, if you're careful to use the appropriate striding, numpy can allow you to optimize cache locality (or VM swapping, at larger sizes) relatively easily, while there's really no way to do that at all with lists of lists. This is much less of a win when you're dealing with an array of pointers to objects that could be scattered all over memory than when dealing with values that can be embedded directly in the array, but it's still something.
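A hedged sketch of the view and np.vectorize points above (using small toy data; the view behaviour is the same for object arrays holding mpfr instances):

import numpy as np

A = np.empty((2, 3), dtype=object)  # an array of pointers to Python objects
A.T.base is A                       # True: the transpose is a view, not a copy
A.T.shape                           # (3, 2): only the striding metadata changed

# np.vectorize wraps a plain Python function; it doesn't make it C-fast,
# but it's roughly comparable to a list comprehension in speed
double = np.vectorize(lambda x: x * 2)
double(np.array([1, 2, 3]))         # array([2, 4, 6])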
As for disadvantages… well, one obvious one is that using numpy restricts you to CPython or sometimes PyPy (hopefully in the future that "sometimes" will become "almost always", but it's not quite there as of 2014); if your code would run faster in Jython or IronPython or non-NumPyPy PyPy, that could be a good reason to stick with lists.
How do I define a constant list of namedtuples? I wish to define something like this in a module and use it throughout my code. I will then use filters on it when I'm only interested in ADULTS for example.
However, I don't want other coders to be able to corrupt this data, which they can do if I use a list. I realise that I could use a tuple, but my understanding is that there is a de facto standard that tuples are used for heterogeneous data, while I actually have homogeneous data:
http://jtauber.com/blog/2006/04/15/python_tuples_are_not_just_constant_lists/
If I have understood this article correctly.
from collections import namedtuple
Activity = namedtuple('Activity', 'person action frequency hours')  # field names assumed for illustration

human_map = [
    Activity(ADULT, EATING, OFTEN, 5),
    Activity(ADULT, SLEEPING, PERIODIC, 24),
    Activity(BABY, CRYING, OFTEN, 1)]
Lists and tuples may be used for heterogeneous data, but in Python there's nothing wrong with using them for homogeneous data.
Other languages provide basic collections (e.g. arrays in C/C++) that can only hold homogeneous data; more complicated collections must be used for heterogeneous data, and in those languages it is best practice to use the more efficient arrays for homogeneous data rather than incurring the overheads of the fancier types of collection. But that logic does not apply to Python.
Lists and tuples are both for arbitrary Python objects. You can use one or another: the only useful difference is that tuples are immutable.
There are only 3 types of immutable sequences in Python: strings, unicode and tuples (in Python 2), or strings, tuples and bytes (in Python 3).
So, if you need an immutable list of objects, use tuples.
Reference: https://docs.python.org/2/reference/datamodel.html#the-standard-type-hierarchy
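Applied to the question's example, a minimal sketch (reusing the assumed field names from the snippet above):

human_map = (
    Activity(ADULT, EATING, OFTEN, 5),
    Activity(ADULT, SLEEPING, PERIODIC, 24),
    Activity(BABY, CRYING, OFTEN, 1))

# filtering works the same way on a tuple as on a list
adults = [a for a in human_map if a.person == ADULT]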
No matter which method you use, there's very little to stop a developer from making your name human_map refer to an entirely different immutable container. As an example:
human_map = (some, immutable, elements,)
And then the malicious developer does something like the following:
import humans
humans.human_map = (something, different, entirely,)
At some point in Python, you just need to stop worrying and trust the clients of your module a little. If people want to shoot themselves in the foot, let them.
I feel like this must have been asked before (probably more than once), so potential apologies in advance, but I can't find it anywhere (here or through Google).
Anyway, when explaining the difference between lists and tuples in Python, the second thing mentioned, after tuples being immutable, is that lists are best for homogeneous data and tuples are best for heterogeneous data. But nobody seems to think to explain why that's the case. So why is that the case?
First of all, that guideline is only sort of true. You're free to use tuples for homogeneous data and lists for heterogeneous data, and there may be cases where that's a fine thing to do. One important case is if you need the collection to be hashable so you can use it as a dictionary key; in this case you must use a tuple, even if all the elements are homogeneous in nature.
Also note that the homogeneous/heterogeneous distinction is really about the semantics of the data, not just the types. A sequence of a name, an occupation, and an address would probably be considered heterogeneous, even though all three might be represented as strings. So it's more important to think about what you're going to do with the data (i.e., will you actually treat the elements the same?) than about what types they are.
That said, I think one reason lists are preferred for homogeneous data is because they're mutable. If you have a list of several things of the same kind, it may make sense to add another one to the list, or take one away; when you do that, you're still left with a list of things of the same kind.
By contrast, if you have a collection of things of heterogeneous kinds, it's usually because you have a fixed structure or "schema" to them (e.g., the first one is an ID number, the second is a name, the third is an address, or whatever). In this case, it doesn't make sense to add or remove an element from the collection, because the collection is an integrated whole with specified roles for each element. You can't add an element without changing your whole schema for what the elements represent.
In short, changes in size are more natural for homogeneous collections than for heterogeneous collections, so mutable types are more natural for homogeneous collections.
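A short sketch of that contrast:

names = ["Alice", "Bob"]
names.append("Carol")  # still a homogeneous list of names: growing is natural

record = ("42", "Alice", "Main St")  # ID, name, address: a fixed schema
# appending to record would be meaningless; a fourth position has no defined role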
The difference is philosophical more than anything.
A tuple is meant to be a shorthand for fixed and predetermined data meanings. For example:
person = ("John", "Doe")
So, this example is a person, who has a first name and a last name. The fixed nature of this is the critical factor, not the data type. Both "John" and "Doe" are strings, but that is not the point. The benefit of this is its unchangeable nature:
You are never surprised to find a value missing. person always has two values. Always.
You are never surprised to find something added. Unlike a dictionary, another bit of code can't "add a new key" or attribute.
This predictability is called immutability. It is just a fancy way of saying it has a fixed structure.
One of the direct benefits is that it can be used as a dictionary key. So:
some_dict = {person: "blah blah"}
works. But:
da_list = ["Larry", "Smith"]
some_dict = {da_list: "blah blah"}
does not work; it raises TypeError: unhashable type: 'list'.
Don't let the fact that element reference is similar (person[0] vs da_list[0]) throw you off. person[0] is a first name. da_list[0] is simply the first item in a list at this moment in time.
It's not a rule, it's just a tradition.
In many languages, lists must be homogeneous and tuples must be fixed-length. This is true of C++, C#, Haskell, Rust, etc. Tuples are used as anonymous structures. It is the same way in mathematics.
Python's type system, however, does not allow you to make these distinctions: you can make tuples of dynamic length, and you can make lists with heterogeneous data. So you are allowed to do whatever you want with lists and tuples in Python, it just might be surprising to other people reading your code. This is especially true if the people reading your code have a background in mathematics or are more familiar with other languages.
Lists are often used for iterating over them, and performing the same operation to every element in the list. Lots of list operations are based on that. For that reason, it's best to have every element be the same type, lest you get an exception because an item was the wrong type.
Tuples are more structured data; they're immutable so if you handle them correctly you won't run into type errors. That's the data structure you'd use if you specifically want to combine multiple types (like an on-the-fly struct).
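A closing sketch of the two roles:

scores = [88, 92, 75]  # homogeneous list: iterate and aggregate
average = sum(scores) / len(scores)

employee = ("Alice", 30, "Engineer")  # heterogeneous tuple: an on-the-fly struct
name, age, role = employee            # unpack by fixed position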