I'm constantly wrapping my str.join() arguments in a list, e.g.
'.'.join([str_one, str_two])
The extra list wrapper always seems superfluous to me. I'd like to do...
'.'.join(str_one, str_two, str_three, ...)
... or if I have a list ...
'.'.join(*list_of_strings)
Yes I'm a minimalist, yes I'm picky, but mostly I'm just curious about the history here, or whether I'm missing something. Maybe there was a time before splats?
Edit:
I'd just like to note that max() handles both versions:
max(iterable[, key])
max(arg1, arg2, *args[, key])
For short lists this won't matter, and it costs you exactly 2 extra characters to type. But the most common use case (I think) for str.join() is the following:
''.join(process(x) for x in some_input)
# or
result = []
for x in some_input:
    result.append(process(x))
''.join(result)
where some_input can have thousands of entries and you just want to generate the output string efficiently.
If join accepted variable arguments instead of an iterable, this would have to be spelled as:
''.join(*(process(x) for x in some_input))
# or
''.join(*result)
which would create a (possibly long) tuple, just to pass it as *args.
So that's 2 characters in a short case vs. being wasteful in large data case.
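To make the large-data case concrete, here is a small sketch; process() is a hypothetical stand-in for whatever per-element transformation you need:

```python
def process(x):
    # hypothetical per-element transformation
    return str(x).upper()

some_input = ["a", "b", "c"]
# join() consumes the generator directly -- no tuple has to be built
# at the call site just to splat it into the call
result = ".".join(process(x) for x in some_input)
print(result)  # A.B.C
```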
History note
(Second edit: based on the HISTORY file, which contains a release missing from the all-releases page. Thanks, Don.)
Support for *args in function definitions was added to Python a long time ago:
==> Release 0.9.8 (9 Jan 1993) <==
Case (a) was needed to accommodate variable-length argument lists;
there is now an explicit "varargs" feature (precede the last argument
with a '*'). Case (b) was needed for compatibility with old class
definitions: up to release 0.9.4 a method with more than one argument
had to be declared as "def meth(self, (arg1, arg2, ...)): ...".
The proper way to pass a list to such functions was the built-in function apply(callable, sequence). (Note that this doesn't mention **kwargs, which first appears in the docs for version 1.4.)
The ability to call a function with * syntax is first mentioned in release notes for 1.6:
There's now special syntax that you can use instead of the apply()
function. f(*args, **kwds) is equivalent to apply(f, args, kwds). You
can also use variations f(a1, a2, *args, **kwds) and you can leave one
or the other out: f(*args), f(**kwds).
But it's missing from grammar docs until version 2.2.
Before 2.0, str.join() did not even exist and you had to do from string import join.
You'd have to write your own function to do that.
>>> def my_join(separator, *args):
...     return separator.join(args)
...
>>> my_join('.', '1', '2', '3')
'1.2.3'
Note that this doesn't avoid the creation of an extra object, it just hides that an extra object is being created. If you inspect the type of args, you'll see that it's a tuple.
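Inspecting args confirms this; a quick sketch:

```python
# The "extra object" doesn't go away with *args -- the call machinery
# packs the positional arguments into a tuple for you.
def my_join(separator, *args):
    assert isinstance(args, tuple)
    return separator.join(args)

print(my_join(".", "1", "2", "3"))  # 1.2.3
```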
If you don't want to create a function and you have a fixed list of strings then it would be possible to use format instead of join:
'{}.{}.{}.{}'.format(str_one, str_two, str_three, str_four)
It's better to just stick with '.'.join((a, b, c)).
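For a fixed, small number of pieces the two spellings are equivalent:

```python
str_one, str_two, str_three, str_four = "a", "b", "c", "d"

# format() with positional placeholders...
via_format = "{}.{}.{}.{}".format(str_one, str_two, str_three, str_four)
# ...produces the same string as join() over a tuple
via_join = ".".join((str_one, str_two, str_three, str_four))

print(via_format, via_join)  # a.b.c.d a.b.c.d
assert via_format == via_join
```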
Argh, now this is a hard question! Try arguing which style is more minimalist... Hard to give a good answer without being too subjective, since it's all about convention.
The problem is: We have a function that accepts an ordered collection; should it accept it as a single argument or as a variable-length argument list?
Python usually answers: Single argument; VLAL if you really have a reason to. Let's see how Python libs reflect this:
The standard library has a couple examples for VLAL, most notably:
when the function can be called with an arbitrary number of separate sequences - like zip or map or itertools.chain,
when there's one sequence to pass, but you don't really expect the caller to have the whole of it as a single variable. This seems to fit str.format.
And the common case for using a single argument:
When you want to do some generic data processing on a single sequence. This fits the functional trio (map*, reduce, filter), and specialized offspring thereof, like sum or str.join. Also stateful transforms like enumerate.
The pattern is "consume an iterable, give another iterable" or "consume an iterable, give a result".
Hope this answers your question.
Note: map is technically var-arg, but the common use case is just map(func, sequence) -> sequence which falls into one bucket with reduce and filter.
The obscure case, map(func, *sequences), is conceptually like map(func, izip_longest(*sequences)) - and the reason for zips to follow the var-arg convention was explained before.
I hope you can follow my thinking here; after all, it's all a matter of programming style. I'm just pointing at some patterns in Python's library functions.
Related
Today, I saw a presentation from PyData 2017 where the presenter used Python's splat operator *. Imagine my surprise: I took it for a pointer until he actually used it. I thought Python's splat operator was something like an ellipsis ... No? A Google search turned up nothing for me. Did they change it at some point, or was it always *? If they did change it, why? Is there an implementation and/or speed difference if they changed it?
Edit: "unpacking argument lists" for the angry commenters.
No, Python's unpacking operator (sometimes called "splat" or "spread") never used the ... ellipsis symbol. Python does have an .../Ellipsis literal value, but it's just a singleton constant with no intrinsic behavior, used by convention for things like multidimensional slicing in libraries such as NumPy. It is not syntactically valid in the places where you would use the * unpacking operator.
We can see that the change log for Python 2.0 (released in 2000) describes the new functionality of being able to use the * unpacking operator to call a function, but using the * asterisk character to define a variadic function (sometimes called using "rest parameters") is older than that.
A new syntax makes it more convenient to call a given function with a tuple of arguments and/or a dictionary of keyword arguments. In Python 1.5 and earlier, you’d use the apply() built-in function: apply(f, args, kw) calls the function f() with the argument tuple args and the keyword arguments in the dictionary kw. apply() is the same in 2.0, but thanks to a patch from Greg Ewing, f(*args, **kw) is a shorter and clearer way to achieve the same effect. This syntax is symmetrical with the syntax for defining functions.
The source code for Python 1.0.1 (released in 1994) is still available from the Python website, and we can look at some of their examples to confirm that the use of the * asterisk character for variadic function definitions existed even then. From Demo/sockets/gopher.py:
# Browser main command, has default arguments
def browser(*args):
    selector = DEF_SELECTOR
    host = DEF_HOST
    port = DEF_PORT
    n = len(args)
    if n > 0 and args[0]:
        selector = args[0]
I've been playing for a bit with startswith() and I've discovered something interesting:
>>> tup = ('1', '2', '3')
>>> lis = ['1', '2', '3', '4']
>>> '1'.startswith(tup)
True
>>> '1'.startswith(lis)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: startswith first arg must be str or a tuple of str, not list
Now, the error is obvious and casting the list into a tuple will work just fine as it did in the first place:
>>> '1'.startswith(tuple(lis))
True
Now, my question is: why the first argument must be str or a tuple of str prefixes, but not a list of str prefixes?
AFAIK, the Python code for startswith() might look like this:
def startswith(src, prefix):
    return src[:len(prefix)] == prefix
But that just confuses me more, because even with it in mind, it still shouldn't make any difference whether prefix is a list or a tuple. What am I missing?
There is technically no reason to accept other sequence types, no. The source code roughly does this:
if isinstance(prefix, tuple):
    for substring in prefix:
        if not isinstance(substring, str):
            raise TypeError(...)
        if tailmatch(...):
            return True
    return False
elif not isinstance(prefix, str):
    raise TypeError(...)
return tailmatch(...)
(where tailmatch(...) does the actual matching work).
So yes, any iterable would do for that for loop. But, all the other string test APIs (as well as isinstance() and issubclass()) that take multiple values also only accept tuples, and this tells you as a user of the API that it is safe to assume that the value won't be mutated. You can't mutate a tuple but the method could in theory mutate the list.
Also note that you usually test for a fixed number of prefixes or suffixes or classes (in the case of isinstance() and issubclass()); the implementation is not suited for a large number of elements. A tuple implies that you have a limited number of elements, while lists can be arbitrarily large.
Next, if any iterable or sequence type would be acceptable, then that would include strings; a single string is also a sequence. Should then a single string argument be treated as separate characters, or as a single prefix?
So in other words, it's a limitation to self-document that the sequence won't be mutated, is consistent with other APIs, it carries an implication of a limited number of items to test against, and removes ambiguity as to how a single string argument should be treated.
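The single-string ambiguity is easy to demonstrate:

```python
# A string is itself an iterable of 1-character strings, so accepting
# arbitrary iterables would make a bare string argument ambiguous:
print(list("abc"))  # ['a', 'b', 'c'] -- three prefixes, or one?

# The tuple requirement removes the ambiguity:
print("apple".startswith(("abc",)))      # False: one 3-char prefix
print("apple".startswith(tuple("abc")))  # True: matches the 'a'
```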
Note that this was brought up before on the Python Ideas list; see this thread; Guido van Rossum's main argument there is that you either special case for single strings or for only accepting a tuple. He picked the latter and doesn't see a need to change this.
This has already been suggested on Python-ideas a couple of years back; see str.startswith taking any iterator instead of just tuple, where GvR had this to say:
The current behavior is intentional, and the ambiguity of strings
themselves being iterables is the main reason. Since startswith() is
almost always called with a literal or tuple of literals anyway, I see
little need to extend the semantics.
In addition to that, there seemed to be no real motivation for such a change.
The current approach keeps things simple and fast:
unicode_startswith (and endswith) check for a tuple argument and then for a string one. They then call tailmatch in the appropriate direction. This is, arguably, very easy to understand in its current state, even for strangers to C code.
Adding other cases will only lead to more bloated and complex code for little benefit while also requiring similar changes to any other parts of the unicode object.
On a similar note, here is an excerpt from a talk by core developer Raymond Hettinger discussing API design choices regarding certain string methods, including recent changes to the str.startswith signature. While he only briefly mentions the fact that str.startswith accepts a string or tuple of strings and does not expound on it, the talk is informative on the decisions and pain points both core developers and contributors have dealt with leading up to the present API.
I was wondering if there is a proper Python convention to distinguish between functions that alter their arguments in place and functions that leave their arguments intact and return an altered copy. For example, consider two functions that apply some permutation. Function f takes a list as an argument and shuffles the elements around, while function g takes a list, copies it, applies f and then returns the altered copy.
For the above example, I figured that f could be called permute(x), while g could be permutation(x), using verbs and nouns to distinguish. This is not always really optimal, though, and in some cases it might lead to confusion as to whether an argument is going to get changed along the way - especially if f were to have some other value to return.
Is there a standard way to deal with this problem?
There is no handy naming convention, not written down in places like PEP-8 at any rate.
The Python standard library does use such a convention to a certain extent. Compare:
listobj.sort()
listobj.reverse()
with
sorted_listobj = sorted(listobj)
reversed_iterator = reversed(listobj)
Using a verb when acting on the object directly, an adjective when returning a new object.
The convention isn't consistent. When enumerating an iterable, you use enumerate(), not enumerated(). filter() doesn't alter the input sequence in place. round() doesn't touch the input number object. compile() produces a new bytecode object. Etc. But none of those operations have in-place equivalents in the Python standard library anyway, and I am not aware of any use of adjectives where the input is altered in-place.
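The sort/sorted pair makes the convention (and its consequences) easy to see:

```python
nums = [3, 1, 2]

new = sorted(nums)   # "adjective": returns a new list...
print(new)           # [1, 2, 3]
print(nums)          # [3, 1, 2] -- ...and leaves the input untouched

ret = nums.sort()    # "verb": mutates in place...
print(nums)          # [1, 2, 3]
print(ret)           # None -- ...and returns None to signal the mutation
```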
I'm trying to learn Python, and, while I managed to stumble on the answer to my current problem, I'd like to know how I can better find answers in the future.
My goal was to take a list of strings as input, and return a string whose characters were the union of the characters in the strings, e.g.
unionStrings( ("ab", "bc"))
would return "abc".
I implemented it like this:
def unionStrings( strings ):
    # Input: A list of strings
    # Output: A string that is the (set) union of input strings
    all = set()
    for s in strings:
        all = all.union(set(s))
    return "".join(sorted(list(all)))
I felt the for loop was unnecessary, and searched for neater, more Pythonic(?) improvements.
First question: I stumbled on using the class method set.union(), instead of set1.union(set2). Should I have been able to find this in the standard Python docs? I've not been able to find it there.
So I tried using set.union() like this:
>>> set.union( [set(x) for x in ("ab","bc")] )
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: descriptor 'union' requires a 'set' object but received a 'list'
Again, I stumbled around and finally found that I should be calling it like this:
>>> set.union( *[set(x) for x in ("ab","bc")] )
set(['a', 'c', 'b'])
Second question: I think this means that set.union is (effectively) declared as
set.union( *sets)
and not
set.union( setsList )
Is that correct? (I'm still learning how to use splat '*'.)
Third question: Where could I find documentation on the signature of set.union()? I didn't see it in the set/frozenset docs, and I couldn't get the inspect module to give me anything. I'm not even sure set is a module; it seems to be a type. Is it defined in a module, or what?
Thanks for reading my complicated question. It's more "How do I navigate Python documentation?" than "How do I do this in Python code?".
Responding to jonrsharpe's comment:
Ohhhhh! I'm so used to C++ where you define separate static and instance methods. Now that you explain it I can really see what's happening.
The only thing I might do differently is write it as
t = set.union( *[set(x) for x in strings] )
return "".join(sorted(t))
because it bugs me to treat strings[0] differently from the strings in strings[1:] when, functionally, they don't play different roles. If I have to call set() on one of them, I'd rather call it on all of them, since union() is going to do it anyways. But that's just style, right?
There are several questions here. Firstly, you should know that:
Class.method(instance, arg)
is equivalent to:
instance.method(arg)
for instance methods. You can call the method on the class and explicitly provide the instance, or just call it on the instance.
For historical reasons, many of the standard library and built-in types don't follow the UppercaseWords convention for class names, but they are classes. Therefore
set.union(aset, anotherset)
is the same as
aset.union(anotherset)
set methods specifically can be tricky, because of the way they're often used. set.method(arg1, arg2, ...) requires arg1 to already be a set, the instance for the method, but all the other arguments will be converted (from 2.6 on).
This isn't directly covered in the set docs, because it's true for everything; Python is pretty consistent.
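The equivalence is easy to verify interactively:

```python
a = {1, 2}
b = {2, 3}

# Calling through the class with an explicit instance...
via_class = set.union(a, b)
# ...is the very same call as the familiar bound-method form:
via_instance = a.union(b)

print(via_class)  # {1, 2, 3}
assert via_class == via_instance
```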
In terms of needing to "splat", note that the docs say:
union(other, ...)
rather than
union(others)
i.e. each iterable is a separate argument, hence you need to unpack your list of iterables.
Your function could therefore be:
def union_strings(strings):
    if not strings:
        return ""
    return "".join(sorted(set(strings[0]).union(*strings[1:])))
or, avoiding the special-casing of strings[0]:
def union_strings(strings):
    if not strings:
        return ""
    return "".join(sorted(set.union(*map(set, strings))))
Is there a good rule of thumb as to when you should prefer varargs function signatures in your API over passing an iterable to a function? ("varargs" being short for "variadic" or "variable-number-of-arguments"; i.e. *args)
For example, os.path.join has a vararg signature:
os.path.join(first_component, *rest) -> str
Whereas min allows either:
min(iterable[, key=func]) -> val
min(a, b, c, ...[, key=func]) -> val
Whereas any/all only permit an iterable:
any(iterable) -> bool
Consider using varargs when you expect your users to specify the list of arguments as code at the callsite or having a single value is the common case. When you expect your users to get the arguments from somewhere else, don't use varargs. When in doubt, err on the side of not using varargs.
Using your examples, the most common use case for os.path.join is to have a path prefix and append a filename/relative path to it, so the call usually looks like os.path.join(prefix, some_file). On the other hand, any() is usually used to process a list of data; when you know all the elements, you don't write any([a, b, c]), you write a or b or c.
My rule of thumb is to use it when you might often switch between passing one and multiple parameters. Instead of having two functions (some GUI code for example):
def enable_tab(tab_name)
def enable_tabs(tabs_list)
or even worse, having just one function
def enable_tabs(tabs_list)
and using it as enable_tabs(['tab1']), I tend to use just: def enable_tabs(*tabs). Although seeing something like enable_tabs('tab1') looks kind of wrong (because of the plural), I prefer it over the alternatives.
You should use it when your parameter list is variable.
Yeah, I know the answer is kinda daft, but it's true. Maybe your question was a bit diffuse. :-)
Default arguments, like in min() above, are more useful when you want to allow different behaviours, or when you simply don't want to force the caller to send in all parameters.
The *args form is for when you have a variable list of arguments of the same type. Joining is a typical example. You can replace it with an argument that takes a list as well.
**kw is for when you have many arguments of different types, where each argument also is connected to a name. A typical example is when you want a generic function for handling form submission or similar.
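A minimal sketch of that form-handling idea; handle_submission and its fields are hypothetical names for illustration:

```python
# A generic form handler: each submitted field arrives as a named
# keyword argument, collected into the dict kw.
def handle_submission(**kw):
    return {name: value.strip() for name, value in kw.items()}

cleaned = handle_submission(username=" alice ", email="a@example.com ")
print(cleaned)  # {'username': 'alice', 'email': 'a@example.com'}
```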
They are completely different interfaces.
In one case, you have one parameter, in the other you have many.
>>> any(1, 2, 3)
TypeError: any() takes exactly one argument (3 given)
>>> os.path.join("1", "2", "3")
'1\\2\\3'
It really depends on what you want to emphasize: any works over a list (well, sort of), while os.path.join works over a set of strings.
Therefore, in the first case you request a list; in the second, you request directly the strings.
In other terms, the expressiveness of the interface should be the main guideline for choosing the way parameters should be passed.