Perl pattern matching in Python


I have been a Perl user for many years and started using Python recently.
I learned that there is supposed to be "one obvious way" to do things. I would like to check the "one" way to translate the Perl coding style below into Python. Thanks!
The objective is to:
Detect the existence of the pattern
If found, extract certain portions of the pattern
Perl:
if ($str =~ /my(pat1)and(pat2)/) {
    my ($var1, $var2) = ($1, $2);
}
As far as I have learned, below is how I am coding this in Python now. It seems to take more steps than Perl, which is why I have doubts about my Python code.
import re
mySearch = re.search(r'my(pat1)and(pat2)', str)
if mySearch:
    var1 = mySearch.group(1)
    var2 = mySearch.group(2)

Python doesn't prioritize pattern matching and string manipulation the way Perl does. These are analogous patterns, and yes, Python's is longer (it also has a lot of great things going for it, like being object-oriented and not relying on weird magical global variables).
For the record, though, you can use tuple unpacking to make this more succinct:
var1, var2 = mySearch.groups()
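Putting the pieces together, here is a minimal runnable sketch of the Perl snippet in Python. The input string is made up for illustration, and the second variant assumes Python 3.8 or later, where the := assignment expression is available.

```python
import re

text = "xx mypat1andpat2 yy"  # hypothetical input, chosen so the pattern matches

# Classic form: search first, then unpack both capture groups at once.
match = re.search(r'my(pat1)and(pat2)', text)
if match:
    var1, var2 = match.groups()

# Python 3.8+ form: bind and test the match in one expression,
# which reads closer to Perl's `if ($str =~ /.../)`.
if (m := re.search(r'my(pat1)and(pat2)', text)):
    first, second = m.groups()

print(var1, var2)  # pat1 pat2
```

In both forms the unpacking only runs when the search succeeded, so a failed match never reaches .groups().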
Update:
Tuple unpacking
Tuple unpacking is a useful feature in Python. To understand it, let's first ask, what is a tuple. A tuple is, at its heart, an immutable sequence -- unlike with a list, you cannot append or pop or any of that stuff. Syntactically, it's very simple to declare a tuple -- it's just a few values separated by commas.
my_tuple = "I", "am", "awesome"
my_tuple[0] # "I"
my_tuple[1] # "am"
my_tuple[2] # "awesome"
People often think a tuple is in fact defined by surrounding parentheses -- my_tuple = ("I", "am", "awesome") -- but this is wrong; the parentheses are only useful insofar as they clarify or enforce a certain order of operations.
Tuple unpacking is one of the sweetest features in Python. You define a tuple data structure containing undefined names on the left, and you unpack the iterable on the right into it. The right side can contain any kind of iterable, but the shape of its contained data must exactly match the tuple structure of names on the left.
# some_var and other_var are both undefined
print(some_var) # NameError: name 'some_var' is not defined
print(other_var) # NameError: name 'other_var' is not defined
my_iterable = ["so", "cool"]
# note that 'some_var, other_var' looks a whole lot like a tuple
some_var, other_var = my_iterable
print(some_var) # "so"
print(other_var) # "cool"
Again, we don't need a list on the right but any kind of iterable -- for example, a generator:
def some_generator():
    yield 1
    yield 2
    yield 3
a, b, c = some_generator()
print(a) # 1
print(b) # 2
print(c) # 3
You can even do tuple unpacking with nested data structures.
nested_list = [1, [2, 3], 4]
# note that parentheses are necessary here to delimit tuples
a, (b, c), d = nested_list
If the iterable on the right doesn't match the pattern on the left, things blow up:
# THESE EXAMPLES DON'T WORK
a, b = [1, 2, 3] # ValueError: too many values to unpack (expected 2)
a, b = [] # ValueError: not enough values to unpack (expected 2, got 0)
Actually, this noisy failure makes tuple unpacking my favorite way to get an item from an iterable when I think that iterable should only have one item in it and I want my code to fail if it has more than one.
# note that the left side below is how you define a tuple of one
bank_statement, = bank_statements # we def want to blow up if too many statements
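As a small sketch of that idiom (the only_item helper and the sample lists are hypothetical, not from the original code), the one-target unpacking either yields the single element or raises ValueError:

```python
def only_item(items):
    """Return the sole element of items; raise ValueError otherwise."""
    item, = items  # one-target unpacking: fails unless there is exactly one element
    return item

print(only_item(["march.pdf"]))  # march.pdf

# Both an empty list and a list with two items are rejected.
for bad in ([], ["march.pdf", "april.pdf"]):
    try:
        only_item(bad)
    except ValueError as exc:
        print("refused:", exc)
```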
Multiple Assignment
What people think of as multiple assignment is actually just plain tuple unpacking.
a, b = 1, 2
print(a) # 1
print(b) # 2
This is nothing special. The interpreter evaluates the right-hand side of the assignment as a tuple -- remember, a tuple is just values (literals, variables, evaluated function calls, whatever) separated by commas -- and then unpacks it into the names on the left side, just like it did in all the examples above.
Bringing it on home
I wrote this to explain the two different answers you were getting for this problem:
var1, var2 = mySearch.group(1), mySearch.group(2)
and
var1, var2 = mySearch.groups()
First, recognize that these two statements, for your situation -- where mySearch is a MatchObject resulting from a regex with two matching groups -- are entirely functionally equivalent.
They differ only very slightly in terms of the nature of the tuple unpacking. The first one declares a tuple on the right while the second uses the tuple returned by MatchObject.groups.
This does not really apply to your situation, but it might be useful to understand that MatchObject.group and MatchObject.groups have slightly different behavior. MatchObject.groups returns all the 'subgroups' -- i.e. capturing groups -- of the match as a tuple, while MatchObject.group returns an individual group and counts the entire match as a group accessible at 0.
In reality, for this situation, you should use whichever of these two you think is most expressive or clearest. I personally think mentioning groups 1 and 2 on the right side is redundant, and I am constantly annoyed by the fact that MatchObject.group(0) returns the string matched by the entire pattern, thus offsetting all the 'subgroups' to one-based indexing.
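A short sketch of the difference between group and groups, using a made-up input string:

```python
import re

m = re.search(r'my(pat1)and(pat2)', "mypat1andpat2")

whole = m.group(0)        # the entire matched text
first = m.group(1)        # first capturing group
pair = m.group(1, 2)      # several groups at once, returned as a tuple
all_groups = m.groups()   # every capturing group, never the whole match

print(whole)       # mypat1andpat2
print(first)       # pat1
print(pair)        # ('pat1', 'pat2')
print(all_groups)  # ('pat1', 'pat2')
```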

You can extract all groups at once and assign them to variables:
var1, var2 = mySearch.groups()

In Python you can assign more than one variable in a single line, using commas as delimiters.
var1, var2 = mySearch.group(1), mySearch.group(2)
Other answers mention tuple unpacking with groups(). That is the better choice if you want to extract the contents of all captured groups into variables. If you want only particular groups, or want them in a different order, use the method I mentioned:
var1, var2, var3 = mySearch.group(2), mySearch.group(3), mySearch.group(1)
Example:
>>> import re
>>> x = "foobarbuzzlorium"
>>> m = re.search(r'(foo)(bar.*)(lorium)', x)
>>> if m:
...     x, y = m.group(1), m.group(3)
...     print(x, y)
...
foo lorium

Related

Why do I inconsistently get a ValueError or IndexError when splitting a string and using the results?

I have some code that processes some input text by splitting it up:
text = get_data_from_internet() # or read it from a file, whatever
a, b, c = text.split('|')
Usually, this works fine, but occasionally I will get an error message that looks like
ValueError: not enough values to unpack (expected 3, got 1)
If I instead try to get a single result from the split, like so:
first = text.split()[0]
then similarly it seems to work sometimes, but other times I get
IndexError: list index out of range
What is going on? I assume it has something to do with the data, but how can I understand the problem and fix it?
This question is intended as a canonical for common debugging questions. It is meant to explain primarily what the error message means and specifically what about the input string causes the problem. Questions like this are usually not caused by a typo; they are asked by people who need something explained.

The problem is that the result from .split does not have enough items in it. .split will produce a list of strings, depending on the input; the length of that list depends on the input string, and not on any surrounding code.
When you write a, b, c = text.split('|'), the .split method does not know that three values are expected; it gives the appropriate number of results depending on how many |s there are, and then an error occurs. In this case, the error is ValueError, because there is something wrong with the value: the list doesn't have enough items.
When you write first = text.split()[0], there may be a problem with the index (the [0] in that code), causing an IndexError. The cause is the same: the list doesn't have enough items. (For an empty list, even [0] cannot be used as an index.) See also: Does "IndexError: list index out of range" when trying to access the N'th item mean that my list has less than N items?.
As a special note: when .split is called with a delimiter (like the first example), the resulting list will have at least one item in it. This happens even if the input is an empty string:
>>> ''.split(',')
['']
However, this is not the case when using .split without a delimiter (special-case behaviour to split on any sequence of whitespace):
>>> ''.split()
[]
To solve the problem:
Make sure you are using the right tools for the job. For example, if your input is a .csv file, you should not try to write the code yourself to split the data into cells. Instead, use the csv standard library module.
Carefully check the input to figure out where the problem is occurring. Code like this is usually inside a loop that handles a line of text at a time; check what lines appear in the data to cause the problem. You can do this by, for example, using exception handling and logging. See https://ericlippert.com/2014/03/05/how-to-debug-small-programs/ for more general advice on debugging problems.
Decide what should happen when the bad input appears. Often, it is appropriate to just skip a blank line in the input. Sometimes you can fill in dummy values for whatever is missing. Sometimes you should report your own exception, etc. It is up to you to think about the specific context of your code and decide on the right course of action.
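The three steps above can be sketched roughly as follows. The sample lines, the parse_lines name, and the choice to skip blank lines and log malformed ones are all assumptions for illustration; the right policy depends on your data.

```python
import logging

logging.basicConfig(level=logging.WARNING)
log = logging.getLogger(__name__)

def parse_lines(lines):
    """Split each 'a|b|c' line into a 3-tuple, skipping lines that don't fit."""
    records = []
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # policy choice: silently skip blank lines
        try:
            a, b, c = line.split('|')
        except ValueError:
            # policy choice: report the offending line and move on
            log.warning("line %d has the wrong number of fields: %r", number, line)
            continue
        records.append((a, b, c))
    return records

sample = ["x|y|z", "", "only-one-field", "1|2|3"]  # made-up input
print(parse_lines(sample))  # [('x', 'y', 'z'), ('1', '2', '3')]
```

Logging the line number and the raw line makes it much easier to find the input that triggers the error.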

Here are some techniques for solving the problem:
Avoiding IndexError
We can manually check that the output of .split is of the correct length:
out = 'a,b'.split(',')
# ['a', 'b']
# we want the third item
n = 2
if len(out) > n:
    out = out[n]
else:
    out = None
# checking we have None
out is None
# True
If you specifically want the first item of a list, but the list might be empty, you can use this trick:
out = ''.split()
# []
# get the first item or None
next(iter(out), None)
# None
See Python idiom to return first item or None for details.
Avoiding ValueError
Since we don't know in advance how many items will be returned by split, here is a way to pad them with a default value (None, in this example) and ignore any extra values:
from itertools import chain, repeat
a,b,c,d,*_ = chain('A,B'.split(','), repeat(None, 4))
print(a, b, c, d)
# A B None None
We use itertools.repeat to create as many None values as might be needed, and itertools.chain to append them to the .split result.
The extended unpacking technique absorbs any extra Nones into the _ variable (not a special name, just convention). This also ignores extra values from the input. For example:
a, b, c, *_ = 'A,B,C,D,E'.split(',')
print(a, b, c)
# A B C
We can use a variation of that technique to get only the last three elements:
*_, x, y, z = 'A,B,C,D,E'.split(',')
print(x, y, z)
# C D E
Or a combination of both leading and trailing items (note that only one starred value is allowed in the expression):
a, b, *_, z = 'A,B,C,D,E'.split(',')
print(a, b, z)
# A B E
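If this comes up often, the padding idea can be wrapped in a small helper. split_exactly is a hypothetical name, not a standard-library function; this sketch uses itertools.islice to cut the padded stream to the desired length instead of a starred target.

```python
from itertools import chain, islice, repeat

def split_exactly(text, sep, count, default=None):
    """Split text on sep and return exactly count items as a tuple.

    Missing items are filled with default; extra items are dropped.
    """
    padded = chain(text.split(sep), repeat(default))
    return tuple(islice(padded, count))

a, b, c = split_exactly('A,B', ',', 3)
print(a, b, c)  # A B None

x, y = split_exactly('A,B,C,D', ',', 2)
print(x, y)     # A B
```

Because the helper always returns count items, the unpacking on the left can never raise ValueError.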

How can I check whether a Const is contained in a List in Z3Py?

In Z3Py, I need to check whether a Const is contained in a List. I defined the List Datatype and the Const, and I tried to use the Contains function (https://z3prover.github.io/api/html/namespacez3py.html#ada058544efbae7608acaef3233aa4422), but it is only available for the string datatype and I need it for all types.
def funList(sort):
    List = Datatype('List')
    # insert: (Int, List) -> List
    List.declare('insert', ('head', sort), ('tail', List))
    List.declare('nil')
    return List.create()
IntList = funList(IntSort())
l1 = Const('l1', IntList)
I tried to write a loop in Z3, but it doesn't work.

Your question isn't quite clear. Are you trying to code up some sort of list membership? (Like finding an element in a list.)
If this is your goal, I'd recommend against using a Datatype-based encoding, because any function you write on it (like elem, length, etc.) will have to be recursive. SMT solvers don't do well with recursive definitions of that form, as they put the logic into the undecidable fragment. Furthermore, coding and debugging are rather difficult when recursive types like these are present, as the tool support isn't that great.
If your goal is to use list-like structures, I'd recommend using the built-in sequence logic instead. Here's an example:
from z3 import *
s = Solver()
# declare a sequence of integers
iseq = Const('iseq', SeqSort(IntSort()))
# assert 3 is somewhere in this sequence
s.add(Contains(iseq, Unit(IntVal(3))))
# assert the sequence has at least 5 elements
s.add(Length(iseq) >= 5)
# get a model and print it:
if s.check() == sat:
    print(s.model())
When I run this, I get:
$ python a.py
[iseq = Concat(Unit(9),
               Concat(Unit(10),
                      Concat(Unit(3),
                             Concat(Unit(16), Unit(17)))))]
This is a bit hard to read perhaps, but it's essentially saying iseq is a sequence that is a concatenation (Concat) of a bunch of singletons (Unit). In more convenient notation, this would be:
iseq = [9, 10, 3, 16, 17]
And as you can see, it has 5 elements and 3 is indeed in this sequence, as required by our constraints.
Z3's sequence logic is pretty rich, and it contains many functions that you'd want to perform. See here: https://z3prover.github.io/api/html/z3py_8py_source.html#l09944

What is an expression like `[variable_name]` called and how is it used?

I have been working through a tutorial and encountered unfamiliar Python syntax. Looking to identify what it's called and learn how it works.
Mystery Code
The line(s) in question involve a variable name surrounded by square braces before being set:
[pivot] = get_point_by_id(svg_tree, 'pivot')
Relevant Clue to Posted Answers
My initial thought was that it had something to do with being set to a list since the return value from get_point_by_id is a list.
Untouched, the code runs fine. But if I try to recreate the syntax like this:
[variable_name] = [1,2,3]
print(variable_name)
I get the following error:
ValueError: too many values to unpack (expected 1)

I'll take a guess from your posting: in your first example, [pivot] is a one-element target list that happens to have the same shape as the function's return value. In contrast, [variable_name] in the second example has a single target, while the list on the right has three values.
The list represents a sequence of values; you can assign the sequence to a matching sequence of variables. For instance:
>>> [a, b] = [2, 3]
>>> a
2
>>> b
3
We usually write this without brackets, simply:
a, b = [2, 3]

The left-hand-side of the assignment is the target list.
From 7.2. Assignment statements:
Assignment of an object to a target list, optionally enclosed in
parentheses or square brackets, is recursively defined as follows.
If the target list is a single target with no trailing comma,
optionally in parentheses, the object is assigned to that target.
Else: The object must be an iterable with the same number of items as
there are targets in the target list, and the items are assigned, from
left to right, to the corresponding targets.
The following works because the number of items in the target list is the same as the number of items in the object on the right-hand side of the assignment.
In [17]: [a] = ['foo']
In [18]: a
Out[18]: 'foo'
In your mystery code, get_point_by_id must return an iterable with one item.
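To illustrate, here is a sketch with a stand-in for get_point_by_id. The real function from the tutorial is not shown in the question, so this version, its return value, and the sample svg_tree are purely hypothetical.

```python
def get_point_by_id(tree, point_id):
    """Hypothetical stand-in: return a list of the points whose id matches."""
    return [point for point in tree if point["id"] == point_id]

svg_tree = [{"id": "pivot", "x": 10, "y": 20}, {"id": "tip", "x": 0, "y": 5}]

[pivot] = get_point_by_id(svg_tree, "pivot")   # one target, one item: OK
print(pivot)

pivot, = get_point_by_id(svg_tree, "pivot")    # same thing with tuple syntax

try:
    [missing] = get_point_by_id(svg_tree, "nope")  # one target, zero items
except ValueError as exc:
    print("unpacking failed:", exc)
```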

Starred expression in ternary operator python

I wrote a Python program that prints the ASCII character for each group of 3 digits in a string, or "ERROR" if the length is not divisible by three. I was golfing the code when I ran into a SyntaxError.
Code:
c=input()
p=len(c)
c=[int(c[i:i+3])for i in range(0,len(c),3)]
print("ERROR"if p%3else*map(chr,c),sep='')#SyntaxError here
But this works:
c=input()
p=len(c)
c=[int(c[i:i+3])for i in range(0,len(c),3)]
print(*map(chr,c),sep='')
Putting a space before the * or after the 3 doesn't work. I could just use ''.join but it's one character longer. My question is why can't I use a starred expression in a ternary operator?

Because the * has to apply to the whole expression that produces the set of arguments, not a part of it, and not conditionally. Internally, CPython uses different bytecodes for calling with unpacking vs. normal calls, so your conditional would require it to change the byte code to call print based on the arguments to print, essentially rewriting what you wrote into pseudocode like:
if p % 3:
    call(print, "ERROR", sep='')
else:
    call_varargs(print, *map(chr, c), sep='')
which is beyond the compiler's capability.
If you wanted to make this work, you could do the following:
print(*(("ERROR",) if p%3 else map(chr,c)), sep='')
which ensures the whole ternary evaluates to an unpackable sequence and unpacks whatever survives unconditionally, avoiding the confusion.
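For reference, an un-golfed sketch of the same fix. The pieces helper name and the sample inputs are made up for illustration; the point is that the conditional expression always yields one iterable, and the * is applied to that result afterwards.

```python
def pieces(digits):
    """Return the items to print: decoded characters, or a one-item error tuple."""
    # Lazy generator: nothing is converted unless the length check passes.
    codes = (int(digits[i:i + 3]) for i in range(0, len(digits), 3))
    # ("ERROR",) is a tuple so that * does not split the string into letters.
    return ("ERROR",) if len(digits) % 3 else tuple(map(chr, codes))

for sample in ("065066067", "0650"):
    print(*pieces(sample), sep='')
# ABC
# ERROR
```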

print(*(["ERROR"] if p%3 else map(chr,c)),sep="!")
keep it outside of the ternary

The * operator unpacks a single iterable into individual arguments. E.g.
li = [1,2,3]
print(*li)
produces: 1 2 3 instead of [1, 2, 3].
One value vs. multiple values
It appears to remove the brackets and pass a single string to print, but this is only an appearance, it actually replaces the single list variable by 3 variables and is actually equivalent to:
print(li[0], li[1], li[2])
It works because print accepts a variable number of arguments, so in our case it can deal with the single list or with these three integers.
The conditional expression is a one-value operator
However in your code you use the star operator within a conditional expression:
c = '065066067068'
p = len(c)
c = [int(c[i:i+3]) for i in range(0, p, 3)]
print('ERROR' if p%3 else *map(chr, c), sep='!')
print would be able to accept both evaluations of the expression, a single string value ('ERROR') or multiple char values from map.
But the conditional expression cannot return multiple values depending on the condition (p%3). Its result has to be a single value. So you have no choice but to return the result of map as an iterable, and unpack it only outside of the ternary operator, e.g. in the print call.
A string is an iterable, not a scalar value
However, this solution introduces another problem: unpacking will also split the constant string ERROR into single characters, as Python treats a string as an iterable of single characters (as you know, since you use this property for your input string). When the condition is true, the output would be:
E!R!R!O!R
Therefore the string must first be wrapped in an iterable of strings, e.g. a tuple.
Final solution
if p%3: s = ('ERROR',)
else: s = map(chr, c)
print(*s, sep='!')
The outputs will be:
A!B!C!D
ERROR

Python syntax: comma after function call [duplicate]

I had introduced a bug in a script of mine by putting a comma after a function call. Here I will illustrate with a silly example:
def uppered(my_string):
    return my_string.upper()
foo = uppered("foo")
print(foo)
This returns, as expected, a string:
FOO
However, if I change the script so that I call the function with a comma appended after the function call, like this:
foo = uppered("foo"),
Then I get a tuple back (!):
('FOO',)
This seems so weird to me - I would have expected the interpreter to return a SyntaxError (as it does if I have two commas instead of one). Can someone please enlighten me what the logic is here and where to find more information on this behavior. My google-fu let me down on this very specific detail.

Because you have created a tuple.
foo = uppered("foo"),
Is equivalent to:
foo = (uppered("foo"),)
Which creates a single-element tuple, obviously equivalent to
foo = ('FOO',)

Tuples in Python can be written as:
my_tuple = ()
or:
my_tuple = (1, 2)
or:
my_tuple = 1, 2
Don't believe me? Well:
>>> my_tuple = 1,
>>> my_tuple
(1,)

Well, it's how Python syntax works. A single comma after a value will create a tuple with this value as the first and only element. See the answers to this question for a broader discussion.
Imagine there is an omitted pair of parentheses:
foo = (uppered("foo"),)
Why so? It lets you write things like this:
stuff = foo(), bar(), baz()
# More code here
for thing in stuff:
    ...  # do something with each thing
You could have used a pair of parentheses (and you do need them in other languages), but the aesthetics of Python call for syntax that is as short and clear as possible. Commas are necessary as separators, but parentheses really aren't. Everything between the start of the expression and the end of the line (or a :, or a closing parenthesis/bracket in the case of generators or comprehensions) becomes an element of the created tuple. For a one-element tuple, this syntax should be read as "a tuple of this thing and nothing more", because there is nothing between the comma and the end of the line.
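A quick sketch that makes the difference visible, reusing the uppered function from the question:

```python
def uppered(my_string):
    return my_string.upper()

plain = uppered("foo")        # no trailing comma: a str
oops = uppered("foo"),        # trailing comma: a one-element tuple
explicit = (uppered("foo"),)  # the same tuple, with the parentheses spelled out
grouped = (uppered("foo"))    # parentheses alone do not make a tuple

print(type(plain).__name__, type(oops).__name__)  # str tuple
print(oops == explicit, grouped == plain)         # True True
```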

You can use the .join() method. It returns a string made by concatenating the elements of the tuple.
g = tuple('FOO')
k = ''
K = k.join(g)
print(K)
print(type(K))
FOO
<class 'str'>
