I am trying to split strings containing python functions, so that the resulting output keeps separate functions as list elements.
s='hello()there()' should be split into ['hello()', 'there()']
To do so I use a regex lookahead to split on the closing parenthesis, but not at the end of the string.
While the lookahead seems to work, I cannot keep the ) in the resulting strings as suggested in various posts. Simply splitting with the regex discards the separator:
import re
s='hello()there()'
t=re.split(r"\)(?!$)", s)
This results in: ['hello(', 'there()'].
s='hello()there()'
t=re.split(r"(\))(?!$)", s)
Wrapping the separator as a group results in the ) being retained as a separate element: ['hello(', ')', 'there()']
As does this approach using the filter() function:
s='hello()there()'
u = list(filter(None, re.split(r"(\))(?!$)", s)))
resulting again in the parenthesis as a separate element: ['hello(', ')', 'there()']
How can I split such a string so that the functions remain intact in the output?
Use re.findall()
\w+\(\) matches one or more word characters followed by an opening and a closing parenthesis. That part matches hello() and there().
t = re.findall(r"\w+\(\)", s)
['hello()', 'there()']
Edit:
.*? is a non-greedy match, meaning it will match as few characters as possible inside the parentheses.
s = "hello(x, ...)there(6L)now()"
t = re.findall(r"\w+\(.*?\)", s)
print(t)
['hello(x, ...)', 'there(6L)', 'now()']
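To see why the non-greedy quantifier matters here, this small sketch runs the greedy and non-greedy forms on the same input. The variable names greedy and lazy are only illustrative.

```python
import re

s = "hello(x, ...)there(6L)now()"

# Greedy: .* runs to the last ")" in the string, so everything is one match.
greedy = re.findall(r"\w+\(.*\)", s)

# Non-greedy: .*? stops at the first ")" after each "(".
lazy = re.findall(r"\w+\(.*?\)", s)

print(greedy)
print(lazy)
```

Neither form handles nested parentheses such as f(g(x)); the non-greedy match would stop at the first closing parenthesis.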
You can split on a lookbehind for () and negative lookahead for the end of the string.
t = re.split(r'(?<=\(\))(?!$)', s)
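A minimal runnable sketch of this split, assuming Python 3.7 or later (earlier versions of re.split do not split on zero-width matches):

```python
import re

s = 'hello()there()'

# The lookbehind finds every position right after "()";
# the negative lookahead rejects the position at the end of the string.
t = re.split(r'(?<=\(\))(?!$)', s)
print(t)
```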
Related
I'm dealing with equations like 'x_{t+1}+y_{t}=z_{t-1}'. My objective is to obtain all "variables", that is, a list with x_{t+1}, y_{t}, z_{t-1}.
I'd like to split the string by [+-=*/], but not if + or - are inside {}.
Something like this re.split('(?<!t)[\+\-\=]','x_{t+1}+y_{t}=z_{t-1}') partly does the job by not splitting if it observes t followed by a symbol. But I'd like to be more general. Assume there are no nested brackets.
How can I do this?
Instead of splitting at those characters, you could find sequences of all other characters (like x and _) and bracket parts (like {t+1}). The first such sequence in the example is x, _, {t+1}, i.e., the substring x_{t+1}.
import re
s = 'x_{t+1}+y_{t}=z_{t-1}'
print(re.findall(r'(?:\{.*?}|[^-+=*/])+', s))
Output:
['x_{t+1}', 'y_{t}', 'z_{t-1}']
Instead of re.split, consider using re.findall to match only the variables:
>>> re.findall(r"[a-z0-9]+(?:_\{[^\}]+\})?","x_{t+1}+y_{t}=z_{t-1}+pi", re.IGNORECASE)
['x_{t+1}', 'y_{t}', 'z_{t-1}', 'pi']
Explanation of regex:
[a-z0-9]+(?:_\{[^\}]+\})?
[a-z0-9]+ : One or more alphanumeric characters
(?: )?: A non-capturing group, optional
_\{ \} : Underscore, and opening/closing brackets
[^\}]+ : One or more non-close-bracket characters
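The same breakdown can be written directly into the pattern with re.VERBOSE. This is only a sketch of the pattern explained above; the name variable is made up for the example.

```python
import re

# Same pattern as above, spread over several lines so each part carries a comment.
variable = re.compile(r"""
    [a-z0-9]+        # one or more alphanumeric characters
    (?:              # start of the optional non-capturing group
        _\{          #   underscore and opening brace
        [^\}]+       #   one or more characters that are not a closing brace
        \}           #   closing brace
    )?               # the whole group may be absent
""", re.IGNORECASE | re.VERBOSE)

found = variable.findall("x_{t+1}+y_{t}=z_{t-1}+pi")
print(found)
```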
s = "[abc]abx[abc]b"
s = re.sub(r"\[([^\]]*)\]a", "ABC", s)
'ABCbx[abc]b'
In the string, s, I want to match 'abc' when it's enclosed in [], and followed by a 'a'. So in that string, the first [abc] will be replaced, and the second won't.
I wrote the pattern above, it matches:
match anything starting with a '[', followed by any number of characters which is not ']', then followed by the character 'a'.
However, in the replacement, I want the string to be like:
[ABC]abx[abc]b . // NOT ABCbx[abc]b
Namely, I don't want the whole matched pattern to be replaced, but only anything with the bracket []. How to achieve that?
match.group(1) will return the content in []. But how to take advantage of this in re.sub?
Why not simply include [ and ] in the substitution?
s = re.sub(r"\[([^\]]*)\]a", "[ABC]a", s)
There is more than one way to do this; one of them is exploiting groups.
import re
s = "[abc]abx[abc]b"
out = re.sub(r'(\[)([^\]]*)(\]a)', r'\1ABC\3', s)
print(out)
Output:
[ABC]abx[abc]b
Note that there are 3 groups (enclosed in parentheses) in the first argument of re.sub. I refer to the 1st and 3rd (indexing starts at 1) in the replacement so they remain unchanged, and put ABC in place of the 2nd group. The second argument of re.sub is a raw string, so I do not need to escape \.
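The question also asks how match.group(1) can be used inside re.sub. One possible way, sketched here with a hypothetical helper named upper_inside, is to pass a function as the replacement and rebuild the brackets around the captured text:

```python
import re

s = "[abc]abx[abc]b"

def upper_inside(match):
    # match.group(1) is the text between the brackets;
    # the "[" and "]a" around it are put back unchanged.
    return "[" + match.group(1).upper() + "]a"

out = re.sub(r"\[([^\]]*)\]a", upper_inside, s)
print(out)
```

This is handy when the new text depends on what was matched rather than being a fixed string like ABC.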
This regex uses lookarounds for the prefix/suffix assertions, so that the match text itself is only "abc":
(?<=\[)[^]]*(?=\]a)
Example: https://regex101.com/r/NDlhZf/1
So that's:
(?<=\[) - positive look-behind, asserting that a literal [ is directly before the start of the match
[^]]* - any number of non-] characters (the actual match)
(?=\]a) - positive look-ahead, asserting that the text ]a directly follows the match text.
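Applied with re.sub, the lookaround pattern might be used like this (a small sketch on the string from the question):

```python
import re

s = "[abc]abx[abc]b"

# Only the text between "[" and "]a" is part of the match,
# so the brackets and the trailing "a" are left untouched.
out = re.sub(r"(?<=\[)[^]]*(?=\]a)", "ABC", s)
print(out)
```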
I have the following string inputs:
"11A4B"
"5S6B"
And want the following outputs:
["11A", "4B"]
["5S", "6B"]
That is, split after each delimiter A, B or S and keep the delimiter.
I can do this with re.split (putting parentheses around the delimiter pattern also returns the delimiter used):
re.split("([ABS])", "11A4B")
#['11', 'A', '4', 'B', '']
I can post-process that to get the wanted result, but I wonder if there is a pure regex solution?
A solution that will work in all Python versions will be the one based on PyPi regex module with regex.split and regex.V1 flag:
import regex
ss = ["11A4B","5S6B"]
delimiters = "ABS"
for s in ss:
    print(regex.split(r'(?<=[{}])(?!$)'.format(regex.escape(delimiters)), s, flags=regex.V1))
Output:
['11A', '4B']
['5S', '6B']
Details
(?<=[ABS]) - a positive lookbehind that matches a location that is immediately preceded with A , B or S
(?!$) - and that is not immediately followed by the end of the string (so the position at the end of the string never matches).
The regex.escape is used just in case there may be special regex chars in the delimiter list, like ^, \, - or ].
Since Python 3.7, re.split can also split on zero-length matches, so the following will work, too:
re.split(r'(?<=[{}])(?!$)'.format(re.escape(delimiters)), s)
Else, you may use workarounds, like
re.findall(r'[^ABS]*[ABS]?', s) # May result in empty items, too
re.findall(r'(?s)(?=.)[^ABS]*[ABS]?', s) # no empty items due to the lookahead requiring at least 1 char
See the regex demo.
Details
(?s) - . matches newlines, too
(?=.) - one char should appear immediately to the right of the current location
[^ABS]* - any 0+ chars other than A, B and S
[ABS]? - 1 or 0 (=optional) A, B or S char.
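The difference between the two findall workarounds can be checked on the inputs from the question. This is just a sketch; with_empty and no_empty are illustrative names.

```python
import re

inputs = ["11A4B", "5S6B"]

# Without the lookahead, an empty match is reported at the end of the string.
with_empty = [re.findall(r'[^ABS]*[ABS]?', s) for s in inputs]

# (?=.) requires at least one remaining character, so no empty item appears.
no_empty = [re.findall(r'(?s)(?=.)[^ABS]*[ABS]?', s) for s in inputs]

print(with_empty)
print(no_empty)
```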
Use re.findall instead, and match digits followed by either A, B, or S:
re.findall(r'\d+[ABS]', '11A4B')
Output:
['11A', '4B']
If the input might have other alphabetical characters as well, then use a negated character set instead:
re.findall(r'[^ABS]+[ABS]', 'ZZZAYYYSXXXB')
Output:
['ZZZA', 'YYYS', 'XXXB']
You could use lookarounds:
(?<=[ABS])(?!$)
See a demo on regex101.com.
Use findall:
re.findall('(.*?(?:[ABS]|.$))', "11A4B5")
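As a quick check of that pattern, the sketch below runs it on the input above and on the original "11A4B". The `.$` alternative is what picks up a trailing piece that has no delimiter.

```python
import re

pattern = '(.*?(?:[ABS]|.$))'

# Trailing "5" has no delimiter after it, but is still returned.
with_tail = re.findall(pattern, "11A4B5")

# When the string ends in a delimiter, there is no extra item.
without_tail = re.findall(pattern, "11A4B")

print(with_tail)
print(without_tail)
```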
I have a large list of chemical data, that contains entries like the following:
1. 2,4-D, Benzo(a)pyrene, Dioxin, PCP, 2,4,5-TP
2. Lead,Paints/Pigments,Zinc
I have a function that is correctly splitting the 1st entry into:
['2,4-D', 'Benzo(a)pyrene', 'Dioxin', 'PCP', '2,4,5-TP']
based on ', ' as a separator. For the second entry, ', ' won't work. But if I could split any string on ',' only where it has two non-numeric characters on either side, I would be able to parse entries like the second one without splitting up the chemicals in entries like the first, which have numbers in their names separated by commas (e.g. 2,4,5-TP).
Is there an easy pythonic way to do this?
To explain a little, based on @eph's answer:
import re
data_list = ['2,4-D, Benzo(a)pyrene, Dioxin, PCP, 2,4,5-TP', 'Lead,Paints/Pigments,Zinc']
for d in data_list:
    print(re.split(r'(?<=\D),\s*|\s*,(?=\D)', d))
re.split(pattern, string) will split string by the occurrences of regex pattern.
(Please read Regex Quick Start if you are not familiar with regex.)
The (?<=\D),\s*|\s*,(?=\D) consists of two parts: (?<=\D),\s* and \s*,(?=\D). The meaning of each unit:
The middle | is the OR operator.
\D matches a single character that is not a digit.
\s matches a whitespace character (includes tabs and line breaks).
, matches character ",".
* attempts to match the preceding token zero or more times. Therefore, \s* means the whitespace can appear zero or more times. (see Repetition with Star and Plus)
(?<= ... ) and (?= ...) are the lookbehind and lookahead assertions.
For example, q(?=u) matches a q that is followed by a u, without making the u part of the match.
Therefore, \s*,(?=\D) matches a , that is preceded by zero or more whitespace characters and followed by a non-digit character. Similarly, (?<=\D),\s* matches a , that is preceded by a non-digit character and followed by zero or more whitespace characters. The whole regex finds every , that satisfies either case, that is, a comma with a non-digit character on at least one side. That covers your requirement while leaving commas between digits (as in 2,4,5-TP) alone.
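For reference, here is the same split as a Python 3 sketch that collects the results for both entries. The names pattern and results are only for illustration.

```python
import re

# Split on a comma that has a non-digit on at least one side,
# swallowing any whitespace next to it.
pattern = re.compile(r'(?<=\D),\s*|\s*,(?=\D)')

data_list = ['2,4-D, Benzo(a)pyrene, Dioxin, PCP, 2,4,5-TP',
             'Lead,Paints/Pigments,Zinc']

results = [pattern.split(d) for d in data_list]
for parts in results:
    print(parts)
```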
Some useful tools for regex:
Regex Cheat Sheet
Online regex tester: regex101 (with a tree structure explanation to your regex)
Use regex and lookbehind/lookahead assertions:
>>> import re
>>> s = "2,4-D, Benzo(a)pyrene, Dioxin, PCP, 2,4,5-TP"
>>> re.split(r'(?<=\D\D),\s*|,\s*(?=\D\D)', s)
['2,4-D', 'Benzo(a)pyrene', 'Dioxin', 'PCP', '2,4,5-TP']
>>> s1 = "2,4-D, Benzo(a)pyrene, Dioxin, PCP, 2,4,5-TP"
>>> s2 = "Lead,Paints/Pigments,Zinc"
>>> import re
>>> res1 = re.findall(r"\s*(.*?[A-Za-z])(?:,|$)", s1)
>>> res1
['2,4-D', 'Benzo(a)pyrene', 'Dioxin', 'PCP', '2,4,5-TP']
>>> res2 = re.findall(r"\s*(.*?[A-Za-z])(?:,|$)", s2)
>>> res2
['Lead', 'Paints/Pigments', 'Zinc']
I am trying to delete the single quotes surrounding regular text. For example, given the list:
alist = ["'ABC'", '(-inf-0.5]', '(4800-20800]', "'\\'(4.5-inf)\\''", "'\\'(2.75-3.25]\\''"]
I would like to turn "'ABC'" into "ABC", but keep other quotes, that is:
alist = ["ABC", '(-inf-0.5]', '(4800-20800]', "'\\'(4.5-inf)\\''", "'\\'(2.75-3.25]\\''"]
I tried to use a lookbehind and lookahead as below:
fixRepeatedQuotes = lambda text: re.sub(r'(?<!\\\'?)\'(?!\\)', r'', text)
print [fixRepeatedQuotes(str) for str in alist]
but received error message:
sre_constants.error: look-behind requires fixed-width pattern.
Any other workaround? Thanks a lot in advance!
Try this; it should work:
result = re.sub("""(?s)(?:')([^'"]+)(?:')""", r"\1", subject)
explanation
"""
(?: # Match the regular expression below
' # Match the character “'” literally (but the ? makes it a non-capturing group)
)
( # Match the regular expression below and capture its match into backreference number 1
[^'"] # Match a single character NOT present in the list “'"” from this character class (aka any character matches except a single and double quote)
+ # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
)
(?: # Match the regular expression below
' # Match the character “'” literally (but the ? makes it a non-capturing group)
)
"""
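re.sub accepts a function as the replacement. The function receives the match object, and match.group(1) is the text captured between the quotes. Therefore,
re.sub(r"'([A-Za-z]+)'", lambda match: match.group(1), "'ABC'")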
yields
"ABC"