Build a combination of strings that matches the given regex [duplicate]

Build a combination of strings that matches the given regex [duplicate] - python

Suppose I have the following string:
trend = '(A|B|C)_STRING'
I want to expand this to:
A_STRING
B_STRING
C_STRING
The OR condition can be anywhere in the string. i.e STRING_(A|B)_STRING_(C|D)
would expand to
STRING_A_STRING_C
STRING_B_STRING C
STRING_A_STRING_D
STRING_B_STRING_D
I also want to cover the case of an empty conditional:
(|A_)STRING would expand to:
A_STRING
STRING
Here's what I've tried so far:
def expandOr(trend):
parenBegin = trend.index('(') + 1
parenEnd = trend.index(')')
orExpression = trend[parenBegin:parenEnd]
originalTrend = trend[0:parenBegin - 1]
expandedOrList = []
for oe in orExpression.split("|"):
expandedOrList.append(originalTrend + oe)
But this is obviously not working.
Is there any easy way to do this using regex?

Here's a pretty clean way. You'll have fun figuring out how it works :-)
def expander(s):
import re
from itertools import product
pat = r"\(([^)]*)\)"
pieces = re.split(pat, s)
pieces = [piece.split("|") for piece in pieces]
for p in product(*pieces):
yield "".join(p)
Then:
for s in ('(A|B|C)_STRING',
'(|A_)STRING',
'STRING_(A|B)_STRING_(C|D)'):
print s, "->"
for t in expander(s):
print " ", t
displays:
(A|B|C)_STRING ->
A_STRING
B_STRING
C_STRING
(|A_)STRING ->
STRING
A_STRING
STRING_(A|B)_STRING_(C|D) ->
STRING_A_STRING_C
STRING_A_STRING_D
STRING_B_STRING_C
STRING_B_STRING_D

import exrex
trend = '(A|B|C)_STRING'
trend2 = 'STRING_(A|B)_STRING_(C|D)'
>>> list(exrex.generate(trend))
[u'A_STRING', u'B_STRING', u'C_STRING']
>>> list(exrex.generate(trend2))
[u'STRING_A_STRING_C', u'STRING_A_STRING_D', u'STRING_B_STRING_C', u'STRING_B_STRING_D']

I would do this to extract the groups:
def extract_groups(trend):
l_parens = [i for i,c in enumerate(trend) if c == '(']
r_parens = [i for i,c in enumerate(trend) if c == ')']
assert len(l_parens) == len(r_parens)
return [trend[l+1:r].split('|') for l,r in zip(l_parens,r_parens)]
And then you can evaluate the product of those extracted groups using itertools.product:
expr = 'STRING_(A|B)_STRING_(C|D)'
from itertools import product
list(product(*extract_groups(expr)))
Out[92]: [('A', 'C'), ('A', 'D'), ('B', 'C'), ('B', 'D')]
Now it's just a question of splicing those back onto your original expression. I'll use re for that :)
#python3.3+
def _gen(it):
yield from it
p = re.compile('\(.*?\)')
for tup in product(*extract_groups(trend)):
gen = _gen(tup)
print(p.sub(lambda x: next(gen),trend))
STRING_A_STRING_C
STRING_A_STRING_D
STRING_B_STRING_C
STRING_B_STRING_D
There's probably a more readable way to get re.sub to sequentially substitute things from an iterable, but this is what came off the top of my head.

It is easy to achieve with sre_yield module:
>>> import sre_yield
>>> trend = '(A|B|C)_STRING'
>>> strings = list(sre_yield.AllStrings(trend))
>>> print(strings)
['A_STRING', 'B_STRING', 'C_STRING']
The goal of sre_yield is to efficiently generate all values that can match a given regular expression, or count possible matches efficiently... It does this by walking the tree as constructed by sre_parse (same thing used internally by the re module), and constructing chained/repeating iterators as appropriate. There may be duplicate results, depending on your input string though -- these are cases that sre_parse did not optimize.

Related

Get string of either side of a desired letter

How to get the string of either side of letter EGemail#gmail.com
If the desired letter was "." it would print "l" and "c" from "gmail" and "com"
I do not think that using [] to separate the letter would work as I think the algorithm is much more complicated

Use index().
def getNeighbors(string, desired):
index = mystring.index(desired)
return mystring[index-1], mystring[index+1]
mystring = 'email#gmail.com'
desired = '.'
print(getNeighbors(mystring, desired)) # >>> ('l', 'c')
A couple notes:
This will return the characters around the first instance of '.'. It also does not perform bounds checking. Finally, it does not check that character actually exists in the string.

One possible solution using re module:
s = 'email#gmail.com'
l = '.'
import re
print( re.findall(r'(.)?{}(.)?'.format(re.escape(l)), s))
Prints:
[('l', 'c')]
EDIT: To get only first match you can use re.search:
s = 'email#gmail.com'
l = 'l'
import re
print( re.search(r'(.)?{}(.)?'.format(re.escape(l)), s).groups() )
Prints:
('i', '#')

String tranformation into Clean list of int

I ideally want to turn this 100020630 into [100,020,630]
but so far i can only do this "100.020.630" into ["100","020","630"]
def fulltotriple(x):
X=x.split(".")
return X
print(fulltotriple("192.123.010"))
for some additionnal info my goal is no turn ip adresses into bin adresses using this as a first step =)
edit: i have not found any way of getting the list WITHOUT the " " in the list on stack overflow

Here's one approach using a list comprehension:
s = '100020630'
[s[i:i + 3] for i in range(0, len(s), 3)]
# ['100', '020', '630']

If you want to handle IP addresses, you are doing it totally wrong.
IP address is a 24-binary digit number, not a 9-decimal digit. It is splitted for 4 sub-blocks, like: 192.168.0.1. BUT. In decimal view they all can be 3-digit, or 2-digit, or any else combination. I recommend you to use ipaddress standard module:
import ipaddress
a = '192.168.0.1'
ip = ipaddress.ip_address(a)
ip.packed
will return you the packed binary format:
b'\xc0\xa8\x00\x01'
If you want to convert your IPv4 to binary format, you can use this command:
''.join(bin(i)[2:] for i in ip.packed)
It will return you this string:
'110000001010100001'

You could use the built-in wrap function:
In [3]: s = "100020630"
In [4]: import textwrap
In [6]: textwrap.wrap(s, 3)
Out[6]: ['100', '020', '630']
Wraps the single paragraph in text (a string) so every line is at most width characters long. Returns a list of output lines, without final newlines.
If you want a list of ints:
[int(num) for num in textwrap.wrap(s, 3)]
Outputs:
[100, 020, 630]

You could use wrap which is a inbuilt function in python
from textwrap import wrap
def fulltotriple(x):
x = wrap(x, 3)
return x
print(fulltotriple("100020630"))
Outputs:
['100', '020', '630']

You can use python built-ins for this:
text = '100020630'
# using wrap
from textwrap import wrap
wrap(text, 3)
>>> ['100', '020', '630']
# using map/zip
map(''.join, zip(*[iter(text)]*3))
>>> ['100', '020', '630']

Use regex to find all matches of triplets \d{3}
import re
str = "100020630"
def fulltotriple(x):
pattern = re.compile(r"\d{3}")
return [int(found_match) for found_match in pattern.findall(x)]
print(fulltotriple(str))
Outputting:
[100, 20, 630]

def fulltotriple(data):
result = []
for i in range(0, len(data), 3):
result.append(int(data[i:i + 3]))
return (result)
print(fulltotriple("192123010"))
output:
[192, 123, 10]

Finding element in a string and smarter manipulation

I have a list of characters
a = ["s", "a"]
I have some words.
b = "asp"
c= "lat"
d = "kasst"
I know that the characters in the list can appear only once or in linear order(or at most on small set can appear in the bigger one).
I would like to split my words by putting the elements in a in the middle, an the rest on the left or on the right (and put a "=" if there is nothing)
so b = ["*", "as", "p"]
If a bigger set of characters which contains
d = ["k", "ass", "t"]
I know that the combinations can be at most of length 4.
So I have divided the possible combinations depending on the length:
import itertools
c4 = [''.join(i) for i in itertools.product(a, repeat = 4)]
c3 = [''.join(i) for i in itertools.product(a, repeat = 3)]
c2 = [''.join(i) for i in itertools.product(a, repeat = 2)]
c1 = [''.join(i) for i in itertools.product(a, repeat = 1)]
For each c, starting with the greater
For simplicity, let's say I start with c3 in this case and not with length 4.
I have to do this with a lot of data.
Is there a way to simplify the code ?

You can do something similar using a regular expression:
>>> import re
>>> p = re.compile(r'([sa]{1,4})')
p matches the characters 's' or 'a' repeated between 1 and 4 times.
To split a given string at this pattern, use p.split. The use of capturing parentheses in the pattern leads to the pattern itself being included in the result.
>>> p.split('asp')
['', 'as', 'p']
>>> p.split('lat')
['l', 'a', 't']
>>> p.split('kasst')
['k', 'ass', 't']

Use regex ?
import re
a = ["s", "a"]
text = "kasst"
pattern = re.compile("[" + "".join(a) + "]{1,4}")
match = pattern.search(text)
parts = [text[:match.start()], text[match.start():match.end()], text[match.end():]]
parts = [part if part else "*" for part in parts]
However, note that this won't handle the case when there is no match on the elements in a

I would do a regular expression to simplify the matching.
import re
splitters = ''.join(a)
pattern = re.compile("([^%s]*)([%s]+)([^%s]*)" % (splitters, splitters, splitters))
words = [v if v else '=' for v in pattern.match(s).groups() ]
This doesn't allow the characters in the first or last group, so not all string will match correctly (and throw an exception). You can allow them if you want. Feel free to modify the regular expression to better match what you want it to do.
Also you only need to run the re.compile once, not for every string you are trying to match.

Most efficient way to split strings in Python

My current Python project will require a lot of string splitting to process incoming packages. Since I will be running it on a pretty slow system, I was wondering what the most efficient way to go about this would be. The strings would be formatted something like this:
Item 1 | Item 2 | Item 3 <> Item 4 <> Item 5
Explanation: This particular example would come from a list where the first two items are a title and a date, while item 3 to item 5 would be invited people (the number of those can be anything from zero to n, where n is the number of registered users on the server).
From what I see, I have the following options:
repeatedly use split()
Use a regular expression and regular expression functions
Some other Python functions I have not thought about yet (there are probably some)
Solution 1 would include splitting at | and then splitting the last element of the resulting list at <> for this example, while solution 2 would probably result in a regular expression like:
((.+)|)+((.+)(<>)?)+
Okay, this regular expression is horrible, I can see that myself. It is also untested. But you get the idea.
Now, I am looking for the way that a) takes the least amount of time and b) ideally uses the least amount of memory. If only one of the two is possible, I would prefer less time. The ideal solution would also work for strings that have more items separated with | and strings that completely lack the <>. At least the regular expression-based solution would do that.
My understanding would be that split() would use more memory (since you basically get two resulting lists, one that splits at | and the second one that splits at <>), but I don't know enough about Python's implementation of regular expressions to judge how the regular expression would perform. split() is also less dynamic than a regular expression if it comes to different numbers of items and the absence of the second separator. Still, I can't shake the impression that Python can do this better without regular expressions, and that's why I am asking.
Some notes:
Yes, I could just benchmark both solutions, but I'm trying to learn something about Python in general and how it works here, and if I just benchmark these two, I still don't know what Python functions I have missed.
Yes, optimizing at this level is only really required for high-performance stuff, but as I said, I am trying to learn things about Python.
Addition: in the original question, I completely forgot to mention that I need to be able to distinguish the parts that were separated by | from the parts with the separator <>, so a simple flat list as generated by re.split(\||<>,input) (as proposed by obmarg) will not work too well. Solutions fitting this criterium are much appreciated.
To sum the question up: Which solution would be the most efficient one, for what reasons?
Due to multiple requests, I have run some timeit on the split()-solution and the first proposed regular expression by obmarg, as well as the solutions by mgibsonbr and duncan:
import timeit
import re
def splitit(input):
res0 = input.split("|")
res = []
for element in res0:
t = element.split("<>")
if t != [element]:
res0.remove(element)
res.append(t)
return (res0, res)
def regexit(input):
return re.split( "\||<>", input )
def mgibsonbr(input): # Solution by mgibsonbr
items = re.split(r'\||<>', input) # Split input in items
offset = 0
result = [] # The result: strings for regular items, lists for <> separated ones
acc = None
for i in items:
delimiter = '|' if offset+len(i) < len(input) and input[offset+len(i)] == '|' else '<>'
offset += len(i) + len(delimiter)
if delimiter == '<>': # Will always put the item in a list
if acc is None:
acc = [i] # Create one if doesn't exist
result.append(acc)
else:
acc.append(i)
else:
if acc is not None: # If there was a list, put the last item in it
acc.append(i)
else:
result.append(i) # Add the regular items
acc = None # Clear the list, since what will come next is a regular item or a new list
return result
def split2(input): # Solution by duncan
res0 = input.split("|")
res1, res2 = [], []
for r in res0:
if "<>" in r:
res2.append(r.split("<>"))
else:
res1.append(r)
return res1, res2
print "mgibs:", timeit.Timer("mgibsonbr('a|b|c|de|f<>ge<>ah')","from __main__ import mgibsonbr").timeit()
print "split:", timeit.Timer("splitit('a|b|c|de|f<>ge<>ah')","from __main__ import splitit").timeit()
print "split2:", timeit.Timer("split2('a|b|c|de|f<>ge<>ah')","from __main__ import split2").timeit()
print "regex:", timeit.Timer("regexit('a|b|c|de|f<>ge<>ah')","from __main__ import regexit").timeit()
print "mgibs:", timeit.Timer("mgibsonbr('a|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>ah')","from __main__ import mgibsonbr").timeit()
print "split:", timeit.Timer("splitit('a|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>ah')","from __main__ import splitit").timeit()
print "split:", timeit.Timer("split2('a|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>ah')","from __main__ import split2").timeit()
print "regex:", timeit.Timer("regexit('a|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>ah')","from __main__ import regexit").timeit()
The results:
mgibs: 14.7349407408
split: 6.403942732
split2: 3.68306812233
regex: 5.28414318792
mgibs: 107.046683735
split: 46.0844590775
split2: 26.5595985591
regex: 28.6513302646
At the moment, it looks like split2 by duncan beats all other algorithms, regardless of length (with this limited dataset at least), and it also looks like mgibsonbr's solution has some performance issues (sorry about that, but thanks for the solution regardless).

I was slightly surprised that split() performed so badly in your code, so I looked at it a bit more closely and noticed that you're calling list.remove() in the inner loop. Also you're calling split() an extra time on each string. Get rid of those and a solution using split() beats the regex hands down on shorter strings and comes a pretty close second on the longer one.
import timeit
import re
def splitit(input):
res0 = input.split("|")
res = []
for element in res0:
t = element.split("<>")
if t != [element]:
res0.remove(element)
res.append(t)
return (res0, res)
def split2(input):
res0 = input.split("|")
res1, res2 = [], []
for r in res0:
if "<>" in r:
res2.append(r.split("<>"))
else:
res1.append(r)
return res1, res2
def regexit(input):
return re.split( "\||<>", input )
rSplitter = re.compile("\||<>")
def regexit2(input):
return rSplitter.split(input)
print("split: ", timeit.Timer("splitit( 'a|b|c|de|f<>ge<>ah')","from __main__ import splitit").timeit())
print("split2:", timeit.Timer("split2( 'a|b|c|de|f<>ge<>ah')","from __main__ import split2").timeit())
print("regex: ", timeit.Timer("regexit( 'a|b|c|de|f<>ge<>ah')","from __main__ import regexit").timeit())
print("regex2:", timeit.Timer("regexit2('a|b|c|de|f<>ge<>ah')","from __main__ import regexit2").timeit())
print("split: ", timeit.Timer("splitit( 'a|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>ah')","from __main__ import splitit").timeit())
print("split2:", timeit.Timer("split2( 'a|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>ah')","from __main__ import split2").timeit())
print("regex: ", timeit.Timer("regexit( 'a|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>ah')","from __main__ import regexit").timeit())
print("regex2:", timeit.Timer("regexit2('a|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>ah')","from __main__ import regexit2").timeit())
Which gives the following result:
split: 1.8427431439631619
split2: 1.0897291360306554
regex: 1.6694280610536225
regex2: 1.2277749050408602
split: 14.356198082969058
split2: 8.009285948995966
regex: 9.526430513011292
regex2: 9.083608677960001
And of course split2() gives the nested lists that you wanted whereas the regex solution doesn't.
Compiling the regex will improve performance. It does make a slight difference, but Python caches compiled regular expressions so the saving is not as much as you might expect. I think usually it isn't worth doing it for speed (though it can be in some cases), but it is often worthwhile to make the code clearer.

I'm not sure if it's the most efficient, but certainly the easiest to code seems to be something like this:
>>> input = "Item 1 | Item 2 | Item 3 <> Item 4 <> Item 5"
>>> re.split( "\||<>", input )
>>> ['Item 1 ', ' Item 2 ', ' Item 3 ', ' Item 4 ', ' Item 5']
I would think there's a fair chance of it being more efficient than a plain old split as well (depending on the input data) since you'd need to perform the second split operation on every string output from the first split, which doesn't seem likely to be efficient for either memory or time.
Though having said that I could easily be wrong, and the only way to be sure would be to time it.

Calling split multiple times is likely to be inneficient, because it might create unneeded intermediary strings. Using a regex like you proposed won't work, since the capturing group will only get the last item, not every of them. Splitting using a regex, like obmarg suggested, seems to be the best route, assuming a "flattened" list is what you're looking for.
If you don't want a flattened list, you can split using a regex first and then iterate over the results, checking the original input to see which delimiter was used:
items = re.split(r'\||<>', input)
offset = 0
for i in items:
delimiter = '|' if input[offset+len(i)] == '|' else '<>'
offset += len(i) + len(delimiter)
# Do something with i, depending on whether | or <> was the delimiter
At last, if you don't want the substrings created at all (using only the start and end indices to save space, for instance), re.finditer might do the job. Iterate over the delimiters, and do something to the text between them depending on which delimiter (| or <>) was found. It's a more complex operation, since you'll have to handle many corner cases, but might be worth it depending on your needs.
Update: for your particular case, where the input format is uniform, obmarg's solutions is the best one. If you must, post-process the result to have a nested list:
split_result = re.split( "\||<>", input )
result = [split_result[0], split_result[1], [i for i in split_result[2:] if i]]
(that last list comprehension is to ensure you'll get [] instead of [''] if there are no items after the last |)
Update 2: After reading the updated question, I finally understood what you're trying to achieve. Here's the full example, using the framework suggested earlier:
items = re.split(r'\||<>', input) # Split input in items
offset = 0
result = [] # The result: strings for regular itens, lists for <> separated ones
acc = None
for i in items:
delimiter = '|' if offset+len(i) < len(input) and input[offset+len(i)] == '|' else '<>'
offset += len(i) + len(delimiter)
if delimiter == '<>': # Will always put the item in a list
if acc is None:
acc = [i] # Create one if doesn't exist
result.append(acc)
else:
acc.append(i)
else:
if acc is not None: # If there was a list, put the last item in it
acc.append(i)
else:
result.append(i) # Add the regular items
acc = None # Clear the list, since what will come next is a regular item or a new list
print result
Tested it with your example, the result was:
['a', 'b', 'c', 'de', ['f', 'ge', 'aha'],
'b', 'c', 'de', ['f', 'ge', 'aha'],
'b', 'c', 'de', ['f', 'ge', 'aha'],
'b', 'c', 'de', ['f', 'ge', 'aha'],
'b', 'c','de', ['f', 'ge', 'aha'],
'b', 'c', 'de', ['f', 'ge', 'aha'],
'b', 'c', 'de', ['f', 'ge', 'aha'],
'b', 'c', 'de', ['f', 'ge', 'aha'],
'b', 'c', 'de', ['f', 'ge', 'aha'],
'b', 'c', 'de', ['f', 'ge', 'aha'],
'b', 'c', 'de', ['f', 'ge', 'ah']]

If you know that <> is not going to appear elsewhere in the string then you could replace '<>' with '|' followed by a single split:
>>> input = "Item 1 | Item 2 | Item 3 <> Item 4 <> Item 5"
>>> input.replace("<>", "|").split("|")
['Item 1 ', ' Item 2 ', ' Item 3 ', ' Item 4 ', ' Item 5']
This will almost certainly be faster than doing multiple splits. It may or may not be faster than using re.split - timeit is your friend.
On my system with the sample string you supplied, my version is more than three times faster than re.split:
>>> timeit input.replace("<>", "|").split("|")
1000000 loops, best of 3: 980 ns per loop
>>> import re
>>> timeit re.split(r"\||<>", input)
100000 loops, best of 3: 3.07 us per loop
(N.B.: This is using IPython, which has timeit as a built-in command).

You can make use of replace. First replace <> with |, and then split by |.
def replace_way(input):
return input.replace('<>','|').split('|')
Time performance:
import timeit
import re
def splitit(input):
res0 = input.split("|")
res = []
for element in res0:
t = element.split("<>")
if t != [element]:
res0.remove(element)
res.append(t)
return (res0, res)
def regexit(input):
return re.split( "\||<>", input )
def replace_way(input):
return input.replace('<>','|').split('|')
print "split:", timeit.Timer("splitit('a|b|c|de|f<>ge<>ah')","from __main__ import splitit").timeit()
print "regex:", timeit.Timer("regexit('a|b|c|de|f<>ge<>ah')","from __main__ import regexit").timeit()
print "replace:",timeit.Timer("replace_way('a|b|c|de|f<>ge<>ah')","from __main__ import replace_way").timeit()
print "split:", timeit.Timer("splitit('a|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>ah')","from __main__ import splitit").timeit()
print "regex:", timeit.Timer("regexit('a|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>ah')","from __main__ import regexit").timeit()
print "replace:",timeit.Timer("replace_way('a|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>ah')","from __main__ import replace_way").timeit()
Results on my machine:
split: 11.8682055461
regex: 12.7430856814
replace: 2.54299265006
split: 79.2124379066
regex: 68.6917008003
replace: 10.944842347

How to use re match objects in a list comprehension

I have a function to pick out lumps from a list of strings and return them as another list:
def filterPick(lines,regex):
result = []
for l in lines:
match = re.search(regex,l)
if match:
result += [match.group(1)]
return result
Is there a way to reformulate this as a list comprehension? Obviously it's fairly clear as is; just curious.
Thanks to those who contributed, special mention for #Alex. Here's a condensed version of what I ended up with; the regex match method is passed to filterPick as a "pre-hoisted" parameter:
import re
def filterPick(list,filter):
return [ ( l, m.group(1) ) for l in list for m in (filter(l),) if m]
theList = ["foo", "bar", "baz", "qurx", "bother"]
searchRegex = re.compile('(a|r$)').search
x = filterPick(theList,searchRegex)
>> [('bar', 'a'), ('baz', 'a'), ('bother', 'r')]

[m.group(1) for l in lines for m in [regex.search(l)] if m]
The "trick" is the for m in [regex.search(l)] part -- that's how you "assign" a value that you need to use more than once, within a list comprehension -- add just such a clause, where the object "iterates" over a single-item list containing the one value you want to "assign" to it. Some consider this stylistically dubious, but I find it practical sometimes.

return [m.group(1) for m in (re.search(regex, l) for l in lines) if m]

It could be shortened a little
def filterPick(lines, regex):
matches = map(re.compile(regex).match, lines)
return [m.group(1) for m in matches if m]
You could put it all in one line, but that would mean you would have to match every line twice which would be a bit less efficient.

Starting Python 3.8, and the introduction of assignment expressions (PEP 572) (:= operator), it's possible to use a local variable within a list comprehension in order to avoid calling multiple times the same expression:
# items = ["foo", "bar", "baz", "qurx", "bother"]
[(x, match.group(1)) for x in items if (match := re.compile('(a|r$)').search(x))]
# [('bar', 'a'), ('baz', 'a'), ('bother', 'r')]
This:
Names the evaluation of re.compile('(a|r$)').search(x) as a variable match (which is either None or a Match object)
Uses this match named expression in place (either None or a Match) to filter out non matching elements
And re-uses match in the mapped value by extracting the first group (match.group(1)).

>>> "a" in "a visit to the dentist"
True
>>> "a" not in "a visit to the dentist"
False
That also works with a search query you're hunting down in a list
`P='a', 'b', 'c'
'b' in P` returns true

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Build a combination of strings that matches the given regex [duplicate] - python

import exrex trend = '(A|B|C)_STRING' trend2 = 'STRING_(A|B)_STRING_(C|D)' >>> list(exrex.generate(trend)) [u'A_STRING', u'B_STRING', u'C_STRING'] >>> list(exrex.generate(trend2)) [u'STRING_A_STRING_C', u'STRING_A_STRING_D', u'STRING_B_STRING_C', u'STRING_B_STRING_D']

Related

Get string of either side of a desired letter

String tranformation into Clean list of int

Finding element in a string and smarter manipulation

Most efficient way to split strings in Python

How to use re match objects in a list comprehension

Categories

Resources