More elegant way to implement regexp-like quantifiers - python

I'm writing a simple string parser which allows regexp-like quantifiers. An input string might look like this:
s = "x y{1,2} z"
My parser function translates this string to a list of tuples:
list_of_tuples = [("x", 1, 1), ("y", 1, 2), ("z", 1, 1)]
Now, the tricky bit is that I need a list of all valid combinations that are specified by the quantification. The combinations all have to have the same number of elements, and the value None is used for padding. For the given example, the expected output is
[["x", "y", None, "z"], ["x", "y", "y", "z"]]
I do have a working solution, but I'm not really happy with it: it uses two nested for loops, and I find the code somewhat obscure, so there's something generally awkward and clumsy about it:
import itertools
def permute_input(lot):
    outer = []
    # is there something that replaces these nested loops?
    for val, start, end in lot:
        inner = []
        # For each tuple, create a list of constant length
        # Each element contains a different number of
        # repetitions of the value of the tuple, padded
        # by the value None if needed.
        for i in range(start, end + 1):
            x = [val] * i + [None] * (end - i)
            inner.append(x)
        outer.append(inner)
    # Outer is now a list of lists.
    final = []
    # use itertools.product to combine the elements in the
    # list of lists:
    for combination in itertools.product(*outer):
        # flatten the elements in the current combination,
        # and append them to the final list:
        final.append([x for x
                      in itertools.chain.from_iterable(combination)])
    return final
print(permute_input([("x", 1, 1), ("y", 1, 2), ("z", 1, 1)]))
[['x', 'y', None, 'z'], ['x', 'y', 'y', 'z']]
I suspect that there's a much more elegant way of doing this, possibly hidden somewhere in the itertools module?

One alternative way to approach the problem is to use pyparsing and this example regex parser that would expand a regular expression to possible matching strings. For your x y{1,2} z sample string it would generate two possible strings expanding the quantifier:
$ python -i regex_invert.py
>>> s = "x y{1,2} z"
>>> for item in invert(s):
...     print(item)
...
x y z
x yy z
The repetition itself supports both an open-ended range and a closed range and is defined as:
repetition = (
    (lbrace + Word(nums).setResultsName("count") + rbrace) |
    (lbrace + Word(nums).setResultsName("minCount") + "," + Word(nums).setResultsName("maxCount") + rbrace) |
    oneOf(list("*+?"))
)
To get to the desired result, we should modify the way the results are yielded from the recurseList generator and return lists instead of strings:
for s in elist[0].makeGenerator()():
    for s2 in recurseList(elist[1:]):
        yield [s] + [s2]  # instead of yield s + s2
Then, we need to only flatten the result:
$ ipython3 -i regex_invert.py
In [1]: import collections.abc
In [2]: def flatten(l):
...:     for el in l:
...:         if isinstance(el, collections.abc.Iterable) and not isinstance(el, (str, bytes)):
...:             yield from flatten(el)
...:         else:
...:             yield el
...:
In [3]: s = "x y{1,2} z"
In [4]: for option in invert(s):
...:     print(list(flatten(option)))
...:
['x', ' ', 'y', None, ' ', 'z']
['x', ' ', 'y', 'y', ' ', 'z']
Then, if needed, you can filter the whitespace characters:
In [5]: for option in invert(s):
...:     print([item for item in flatten(option) if item != ' '])
...:
['x', 'y', None, 'z']
['x', 'y', 'y', 'z']

Recursive solution (simple, good for up to roughly a thousand tuples):
def permutations(lot):
    if not lot:
        yield []
    else:
        item, start, end = lot[0]
        for prefix_length in range(start, end+1):
            for perm in permutations(lot[1:]):
                yield [item]*prefix_length + [None] * (end - prefix_length) + perm
It is limited by the recursion depth (~1000 by default). If that is not enough, there is a simple optimization for start == end cases. Depending on the expected size of list_of_tuples, it might be enough.
Test:
>>> list(permutations(list_of_tuples)) # list() because it's an iterator
[['x', 'y', None, 'z'], ['x', 'y', 'y', 'z']]
Without recursion (universal but less elegant):
def permutations(lot):
    source = []
    cnum = 1  # number of possible combinations
    for item, start, end in lot:  # create full list without Nones
        source += [item] * end  # every tuple occupies `end` slots in the output
        cnum *= (end-start+1)
    for i in range(cnum):
        bitmask = [True] * len(source)
        state = i
        pos = 0
        for _, start, end in lot:
            state, m = divmod(state, end-start+1)  # m - number of Nones to insert
            pos += end  # move to the end of this tuple's slots
            bitmask[pos-m:pos] = [None] * m
        yield [bitmask[j] and c for j, c in enumerate(source)]
The idea behind this solution: we are looking at the full string (xyyz) through a glass which adds a certain number of Nones. We can count the number of possible combinations by calculating the product of all (end-start+1). Then we can simply number all iterations (a simple range loop) and reconstruct the mask from the iteration number. Here we reconstruct the mask by repeatedly applying divmod to the state number and using the remainder as the number of Nones at the symbol position.
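The decoding step described above can be looked at in isolation. The following is only a sketch: none_counts is a hypothetical helper name (not part of the answer), and it treats the iteration number as a mixed-radix number whose digits are the per-tuple None counts.

```python
def none_counts(lot, index):
    """Decode `index` into one None-count per tuple (mixed-radix digits)."""
    counts = []
    state = index
    for _, start, end in lot:
        # The remainder is this tuple's digit: how many Nones pad its slots.
        state, m = divmod(state, end - start + 1)
        counts.append(m)
    return counts


# Illustrative input: 1 * 2 * 3 = 6 combinations in total.
lot = [("x", 1, 1), ("y", 1, 2), ("z", 0, 2)]
for index in range(6):
    print(index, none_counts(lot, index))
```

Each of the six indices decodes to a different list of counts; index 3, for example, gives [0, 1, 1].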

The part generating the different lists based on the tuple can be written using list comprehension:
outer = []
for val, start, end in lot:
    # For each tuple, create a list of constant length
    # Each element contains a different number of
    # repetitions of the value of the tuple, padded
    # by the value None if needed.
    outer.append([[val] * i + [None] * (end - i) for i in range(start, end + 1)])
(The whole thing could again be written as a single list comprehension, but that makes the code harder to read, IMHO.)
On the other hand, the list comprehension in [x for x in itertools.chain.from_iterable(combination)] could be written more concisely. Indeed, the whole point is to build an actual list out of an iterable. This could be done with list(itertools.chain.from_iterable(combination)). An alternative would be to use the sum builtin. I am not sure which is better.
Finally, the final.append part could be written with a list comprehension.
# use itertools.product to combine the elements in the list of lists:
# flatten the elements in the current combination,
return [sum(combination, []) for combination in itertools.product(*outer)]
The final code is your original code, slightly reorganised:
import itertools
def permute_input(lot):
    outer = []
    for val, start, end in lot:
        # For each tuple, create a list of constant length
        # Each element contains a different number of
        # repetitions of the value of the tuple, padded
        # by the value None if needed.
        outer.append([[val] * i + [None] * (end - i) for i in range(start, end + 1)])
    # use itertools.product to combine the elements in the list of lists:
    # flatten the elements in the current combination,
    return [sum(combination, []) for combination in itertools.product(*outer)]
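For comparison, here is a compact sketch of the same approach that flattens with list(itertools.chain.from_iterable(...)) rather than sum. This is one possible arrangement, not the only one; chain.from_iterable avoids the repeated list copying that sum(..., []) performs when there are many tuples.

```python
from itertools import chain, product


def permute_input(lot):
    # One list of padded alternatives per (value, start, end) tuple.
    outer = [
        [[val] * i + [None] * (end - i) for i in range(start, end + 1)]
        for val, start, end in lot
    ]
    # Pick one alternative per tuple and flatten each combination.
    return [list(chain.from_iterable(combo)) for combo in product(*outer)]


print(permute_input([("x", 1, 1), ("y", 1, 2), ("z", 1, 1)]))
# prints [['x', 'y', None, 'z'], ['x', 'y', 'y', 'z']]
```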

Related

Insert an item to a Python list without using insert()

How do you add an item at a specific position in a list? When you have an empty list and want to add 'z' at the 3rd position, insert() only inserts it at the last position, like this:
l.insert(3,'z')
l
['z']
I want the output to be
[None, None, None, 'z']
or
['','','','z']
Try this method using a list comprehension -
n = 5
s = 'z'
out = [None if i!=n-1 else s for i in range(n)]
print(out)
[None, None, None, None, 'z']
If you want to insert the string somewhere in the middle, then a more general way is to define m and n separately where n is length of the list and m is the position -
n = 5
m = 3
s = 'z'
out = [None if i!=m-1 else s for i in range(n)]
print(out)
[None, None, 'z', None, None]
Assuming you want to have it in the Nth index:
l = l[:N] + ['z'] + l[N:]
If you start with an empty list and want N None items in front of the new element, maybe this will help you (N is the number of None items you want):
l = [None] * N
l = l[:N] + ['z'] + l[N:]
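The two answers above can be combined into a small helper. This is a sketch only: insert_padded and its fill parameter are hypothetical names chosen for illustration, and it uses slice assignment instead of insert().

```python
def insert_padded(items, index, value, fill=None):
    """Return a copy of `items` with `value` at `index`, padding with `fill`."""
    result = list(items)
    # Pad with filler until position `index` exists.
    result.extend([fill] * max(0, index - len(result)))
    # Slice assignment places the value without calling insert().
    result[index:index] = [value]
    return result


print(insert_padded([], 3, 'z'))           # prints [None, None, None, 'z']
print(insert_padded([], 3, 'z', fill=''))  # prints ['', '', '', 'z']
print(insert_padded(['a', 'b'], 1, 'z'))   # prints ['a', 'z', 'b']
```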

Generate a sequence of number and alternating string in python

Aim
I would like to generate a sequence as list in python, such as:
['s1a', 's1b', 's2a', 's2b', ..., 's10a', 's10b']
Properties:
items contain a single prefix
numbers are sorted numerically
suffix is alternating per number
Approach
To get this, I applied the following code, using xrange and a list comprehension:
# prefix
p = 's'
# suffix
s = ['a', 'b']
# numbers
n = [ i + 1 for i in list(xrange(10))]
# result
[ p + str(i) + j for i, j in zip(sorted(n * len(s)), s * len(n)) ]
Question
Is there a simpler syntax to obtain the results, e.g. using itertools?
Similar to this question?
A doubled-for list comprehension can accomplish this:
['s'+str(x)+y for x in range(1,11) for y in 'ab']
itertools.product might be your friend:
all_combos = ["".join(map(str, x)) for x in itertools.product(p, n, s)]
returns:
['s1a', 's1b', 's2a', 's2b', 's3a', 's3b', 's4a', 's4b', 's5a', 's5b', 's6a', 's6b', 's7a', 's7b', 's8a', 's8b', 's9a', 's9b', 's10a', 's10b']
EDIT: as a one-liner:
all_combos = ["".join(map(str,x)) for x in itertools.product(['s'], range(1, 11), ['a', 'b'])]
EDIT 2: as pointed out in James' answer, we can change our listed string element in the product call to just strings, and itertools will still be able to iterate over them, selecting characters from each:
all_combos = ["".join(map(str,x)) for x in itertools.product('s', range(1, 11), 'ab')]
How about:
def func(prefix, suffixes, size):
    k = len(suffixes)
    # // keeps the division integral on both Python 2 and Python 3
    return [prefix + str(n // k + 1) + suffixes[n % k] for n in range(size * k)]
# usage example:
print(func('s', ['a', 'b'], 10))
This way you can alternate as many suffixes as you want.
And of course, each one of the suffixes can be as long as you want.
You can use a double-list comprehension, where you iterate on number and suffix. You don't need to load any
Below is a lambda function that takes 3 parameters, a prefix, a number of iterations, and a list of suffixes
foo = lambda prefix, n, suffix: [prefix + str(i) + s for i in range(1, n + 1) for s in suffix]
You can use it like this
foo('p',10,'abc')
Or like that, if your suffixes have more than one letter
foo('p',10,('a','bc','de'))
For maximum versatility I would do this as a generator. That way you can either create a list, or just produce the sequence items as they are needed.
Here's code that runs on Python 2 or Python 3.
def psrange(prefix, suffix, high):
    return ('%s%d%s' % (prefix, i, s) for i in range(1, 1 + high) for s in suffix)
res = list(psrange('s', ('a', 'b'), 10))
print(res)
for s in psrange('x', 'abc', 3):
    print(s)
output
['s1a', 's1b', 's2a', 's2b', 's3a', 's3b', 's4a', 's4b', 's5a', 's5b', 's6a', 's6b', 's7a', 's7b', 's8a', 's8b', 's9a', 's9b', 's10a', 's10b']
x1a
x1b
x1c
x2a
x2b
x2c
x3a
x3b
x3c

slice a list based on some values

Hi, I'm looking for a way to split a list based on some values, assuming the list's length equals the sum of those values, e.g.:
list: l = ['a','b','c','d','e','f']
values: v = (1,1,2,2)
so len(l) = sum(v)
and I'd like to have a function that returns a tuple or a list, like: (['a'], ['b'], ['c','d'], ['e','f'])
currently my code is like:
(list1,list2,list3,list4) = (
    l[0:v[0]],
    l[v[0]:v[0]+v[1]],
    l[v[0]+v[1]:v[0]+v[1]+v[2]],
    l[v[0]+v[1]+v[2]:v[0]+v[1]+v[2]+v[3]])
I'm thinking about making this clearer, but the closest I have so far is (note the results are incorrect, not what I wanted):
s=0
[list1,list2,list3,list4] = [l[s:s+i] for i in v]
The problem is that I can't increase s while iterating over the values in v. I'm hoping to get better code to do this; any suggestion is appreciated, thanks!
If you weren't stuck on ancient Python, I'd point you to itertools.accumulate. Of course, even on ancient Python, you could use the (roughly) equivalent code provided in the docs I linked to do it. Using either the Py3 code or equivalent, you could do:
from itertools import accumulate # Or copy accumulate equivalent Python code
from itertools import chain
# Calls could be inlined in listcomp, but easier to read here
starts = accumulate(chain((0,), v)) # Extra value from starts ignored when ends exhausted
ends = accumulate(v)
list1,list2,list3,list4 = [l[s:e] for s, e in zip(starts, ends)]
Maybe make a generator of the values in l?
def make_list(l, v):
    g = (x for x in l)
    if len(l) == sum(v):
        return [[next(g) for _ in range(val)] for val in v]
    return None
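A variant of the same idea, sketched with itertools.islice: one shared iterator is consumed chunk by chunk, so each chunk resumes where the previous one ended. The name split_by_sizes and the choice to raise ValueError on a length mismatch are assumptions made for this illustration.

```python
from itertools import islice


def split_by_sizes(items, sizes):
    """Split `items` into consecutive chunks whose lengths are given by `sizes`."""
    if len(items) != sum(sizes):
        raise ValueError("len(items) must equal sum(sizes)")
    it = iter(items)  # shared iterator: every islice call continues from the last position
    return [list(islice(it, size)) for size in sizes]


print(split_by_sizes(['a', 'b', 'c', 'd', 'e', 'f'], (1, 1, 2, 2)))
# prints [['a'], ['b'], ['c', 'd'], ['e', 'f']]
```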
You could just write a simple loop to iterate over v to generate a result:
l = ['a','b','c','d','e','f']
v = (1,1,2,2)
result = []
offset = 0
for size in v:
    result.append(l[offset:offset+size])
    offset += size
print(result)
Output:
[['a'], ['b'], ['c', 'd'], ['e', 'f']]
The idea here is to use a nested loop. Assuming that your condition always holds true, the logic is to run through v and pick up i elements from l, where i is a number from v.
data = []  # the result list
index = 0  # this is the start index
for num in v:
    temp = []  # this is a temp array, to hold individual elements in your result array.
    for j in range(index, index+num):  # this loop will pick up the next num elements from l
        temp.append(l[j])
    data.append(temp)
    index += num
Output:
[['a'], ['b'], ['c', 'd'], ['e', 'f']]
The first answer https://stackoverflow.com/a/39715361/5759063 is the most pythonic way to do it. This is just the algorithmic backbone.
Best I could find is a two line solution:
breaks=[0]+[sum(v[:i+1]) for i in range(len(v))] #build a list of section indices
result=[l[breaks[i]:breaks[i+1]] for i in range(len(breaks)-1)] #split array according to indices
print result

How to concatenate strings into parenthesis in Python

In Python I have this loop that e.g. prints some value:
for row in rows:
    toWrite = row[0]+","
    toWrite += row[1]
    toWrite += "\n"
Now this works just fine, and if I print "toWrite" it would print this:
print toWrite
#result:,
A,B
C,D
E,F
... etc
My question is, how would I concatenate these strings with parenthesis and separated with commas, so result of loop would be like this:
(A,B),(C,D),(E,F) <-- the last item in parenthesis, should not contain - end with comma
You'd group your items into pairs, then use string formatting and str.join():
','.join(['({},{})'.format(*pair) for pair in zip(*[iter(rows)] * 2)])
The zip(*[iter(rows)] * 2) expression produces elements from rows in pairs.
Each pair is formatted with '({},{})'.format(*pair); the two values in pair are slotted into each {} placeholder.
The (A,B) strings are joined together into one long string using ','.join(). Passing in a list comprehension is marginally faster than using a generator expression here as str.join() would otherwise convert it to a list anyway to be able to scan it twice (once for the output size calculation, once for building the output).
Demo:
>>> rows = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H']
>>> ','.join(['({},{})'.format(*pair) for pair in zip(*[iter(rows)] * 2)])
'(A,B),(C,D),(E,F),(G,H)'
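To see why zip(*[iter(rows)] * 2) yields pairs, it can help to spell the idiom out. This is a minimal sketch using the same kind of flat rows list as the demo; the variable names it and pairs are illustrative.

```python
rows = ['A', 'B', 'C', 'D', 'E', 'F']

it = iter(rows)
# [it] * 2 is a list holding the SAME iterator twice, so zip pulls two
# consecutive items from it for every tuple it builds.
pairs = list(zip(it, it))
print(pairs)  # prints [('A', 'B'), ('C', 'D'), ('E', 'F')]

print(','.join(['({},{})'.format(a, b) for a, b in pairs]))
# prints (A,B),(C,D),(E,F)
```

Note that with an odd number of items, zip silently drops the last unpaired one.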
Try this:
from itertools import islice, izip
','.join(('(%s, %s)' % (x, y) for x, y in izip(islice(rows, 0, None, 2), islice(rows, 1, None, 2))))
A generator expression and iterators are used here.
See itertools for a reference.

Given a list of slices, how do I split a sequence by them?

Given a list of slices, how do I separate a sequence based on them?
I have long amino-acid strings that I would like to split based on start-stop values in a list. An example is probably the most clear way of explaining it:
str = "MSEPAGDVRQNPCGSKAC"
split_points = [[1,3], [7,10], [12,13]]
output >> ['M', '(SEP)', 'AGD', '(VRQN)', 'P', '(CG)', 'SKAC']
The extra parentheses are to show which elements were selected from the split_points list. I don't expect the start-stop points to ever overlap.
I have a bunch of ideas that would work, but seem terribly inefficient (code-length wise), and it seems like there must be a nice pythonic way of doing this.
Strange way to split strings you have there:
def splitter( s, points ):
    c = 0
    for x,y in points:
        yield s[c:x]
        yield "(%s)" % s[x:y+1]
        c=y+1
    yield s[c:]
print(list(splitter(str, split_points)))
# => ['M', '(SEP)', 'AGD', '(VRQN)', 'P', '(CG)', 'SKAC']
# if some start and endpoints are the same remove empty strings.
print(list(x for x in splitter(str, split_points) if x != ''))
Here is a simple solution to grab each of the sets specified by the points.
In[4]: [str[p[0]:p[1]+1] for p in split_points]
Out[4]: ['SEP', 'VRQN', 'CG']
To get the parenthesis:
In[5]: ['(' + str[p[0]:p[1]+1] + ')' for p in split_points]
Out[5]: ['(SEP)', '(VRQN)', '(CG)']
Here is a cleaner way to do the whole thing:
results = []
for i in range(len(split_points)):
    start, stop = split_points[i]
    stop += 1
    last_stop = split_points[i-1][1] + 1 if i > 0 else 0
    results.append(string[last_stop:start])
    results.append('(' + string[start:stop] + ')')
results.append(string[split_points[-1][1]+1:])
All of the below solutions are bad, and more for fun than anything else, do not use them!
This is more of a WTF solution, but I figured I'd post it since it was asked for in comments:
split_points = [(x, y+1) for x, y in split_points]
split_points = [((split_points[i-1][1] if i > 0 else 0, p[0]), p) for i, p in zip(range(len(split_points)), split_points)]
results = [string[n[0]:n[1]] + '\n(' + string[m[0]:m[1]] + ')' for n, m in split_points] + [string[split_points[-1][1][1]:]]
results = '\n'.join(results).split()
still trying to figure out the one liner, here's a two:
split_points = [((split_points[i-1][1]+1 if i > 0 else 0, p[0]), (p[0], p[1]+1)) for i, p in zip(range(len(split_points)), split_points)]
print '\n'.join([string[n[0]:n[1]] + '\n(' + string[m[0]:m[1]] + ')' for n, m in split_points] + [string[split_points[-1][1][1]:]]).split()
And the one liner that should never be used:
print '\n'.join([string[n[0]:n[1]] + '\n(' + string[m[0]:m[1]] + ')' for n, m in (((split_points[i-1][1]+1 if i > 0 else 0, p[0]), (p[0], p[1]+1)) for i, p in zip(range(len(split_points)), split_points))] + [string[split_points[-1][1]+1:]]).split()
Here's some code that will work.
result = []
last_end = 0
for sp in split_points:
    result.append(str[last_end:sp[0]])
    result.append('(' + str[sp[0]:sp[1]+1] + ')')
    last_end = sp[1]+1
result.append(str[last_end:])
print(result)
If you just want the parts in the parenthesis it becomes a little simpler:
result = [str[sp[0]:sp[1]+1] for sp in split_points]
Here's a solution that converts your split_points to regular string slices and then prints out the appropriate slices:
str = "MSEPAGDVRQNPCGSKAC"
split_points = [[1, 3], [7, 10], [12, 13]]
adjust = [s for sp in [[x, y + 1] for x, y in split_points] for s in sp]
zipped = zip([None] + adjust, adjust + [None])
out = [('(%s)' if i % 2 else '%s') % str[x:y] for i, (x, y) in
       enumerate(zipped)]
print(out)
>>> ['M', '(SEP)', 'AGD', '(VRQN)', 'P', '(CG)', 'SKAC']
>>> str = "MSEPAGDVRQNPCGSKAC"
>>> split_points = [[1,3], [7,10], [12,13]]
>>>
>>> all_points = sum(split_points, [0]) + [len(str)-1]
>>> map(lambda i,j: str[i:j+1], all_points[:-1], all_points[1:])
['MS', 'SEP', 'PAGDV', 'VRQN', 'NPC', 'CG', 'GSKAC']
>>>
>>> str_out = map(lambda i,j: str[i:j+1], all_points[:-1:2], all_points[1::2])
>>> str_in = map(lambda i,j: str[i:j+1], all_points[1:-1:2], all_points[2::2])
>>> sum(map(list, zip(['(%s)' % s for s in str_in], str_out[1:])), [str_out[0]])
['MS', '(SEP)', 'PAGDV', '(VRQN)', 'NPC', '(CG)', 'GSKAC']
Probably not for elegance, but just because I can do it in a oneliner :)
>>> reduce(lambda a,ij:a[:-1]+[str[a[-1]:ij[0]],'('+str[ij[0]:ij[1]+1]+')',
    ij[1]+1], split_points, [0])[:-1] + [str[split_points[-1][-1]+1:]]
['M', '(SEP)', 'AGD', '(VRQN)', 'P', '(CG)', 'SKAC']
Maybe you like it. Here is some explanation:
In your question you pass one set of slices, and implicitly you want to have the complement set of slices as well (to generate the un-parenthesized [is that English?] slices). So basically, each slice [i,j] lacks the position just after the previous j, e.g. [7,10] lacks the 4 and [1,3] lacks the 0.
reduce processes lists and at each step passes the output so far (a) plus the next input element (ij). The trick is that apart from producing the plain output, we add each time an extra variable, a sort of memory, which is retrieved in the next step as a[-1]. In this particular example we store the last j value plus one, and hence at all times we have the full information to provide both the unparenthesized and the parenthesized substring.
Finally, the memory is stripped with [:-1] and replaced by the remainder of the original str in [str[split_points[-1][-1]+1:]].
