re.sub match with first occurrence of bracketed characters - python

I'm trying to capture the first group of characters before one or more underscores or dashes in a string using re.sub in Python 3.7. My current function is:
re.sub(r'(\w+)[-_]?.*', r'\1', x).
Example strings:
x = 'CAM14_20190417121301_000'
x = 'CAM16-20190417121301_000'
Actual output:
CAM14_20190417121301_000
CAM16
Desired output:
CAM14
CAM16
Why is it working when there is a dash after the first group, but not an underscore? I also tried re.sub(r'(\w+)_?.*', r'\1', x) to try and force it to catch the underscore, but that returned the same result. I would like the code to be flexible enough to catch either.

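\w also matches underscores, so (\w+) greedily consumes all of 'CAM14_20190417121301_000', whereas in the second string it stops at the dash. Consider using this regex instead: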
re.sub(r'([a-zA-Z0-9]+)[-_]?.*', r'\1', x)
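The difference can be checked by running both patterns against the two sample strings. This is only a small sketch; the helper name camera_prefix is hypothetical and does not appear in the question.

```python
import re

# For ASCII text, \w is equivalent to [a-zA-Z0-9_], so (\w+) swallows underscores.
GREEDY = re.compile(r'(\w+)[-_]?.*')
# Leaving the underscore out of the class makes the group stop at the first '_' or '-'.
FIXED = re.compile(r'([a-zA-Z0-9]+)[-_]?.*')

def camera_prefix(name):
    """Return the leading run of letters and digits (hypothetical helper)."""
    return FIXED.sub(r'\1', name)

for x in ('CAM14_20190417121301_000', 'CAM16-20190417121301_000'):
    # First column: original pattern; second column: corrected pattern.
    print(GREEDY.sub(r'\1', x), camera_prefix(x))
```

If only the prefix is needed, re.match with group(1) would avoid the substitution entirely, but the re.sub form stays closest to the original code.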

Related

Regular expression match / split

I am having some trouble trying to figure out how to use regular expressions in python. Ultimately I am trying to do what sscanf does for me in C.
I am trying to match given strings that look like so:
12345_arbitrarystring_2020_05_20_10_10_10.dat
I (seem) to be able to validate this format by calling match on the following regular expression
regex = re.compile('[0-9]{5}_.+_[0-9]{4}([-_])[0-9]{2}([-_])[0-9]{2}([-_])[0-9]{2}([:_])[0-9]{2}([:_])[0-9]{2}\\.dat')
(Note that I do allow for a few other separators than just '_')
I would like to split the given string on these separators so I do:
regex = re.compile('[_\\-:.]+')
parts = regex.split(given_string)
This is all fine .. the problem is that I would like my 'arbitrarystring' part to include '-' and '_' and the last split currently, well, splits them.
Other than manually cutting the timestamp and the first 5 digits off that given string, what can I do to get that arbitrarystring part?
You could use a capturing group to get the arbitrarystring part and omit the other capturing groups.
You could for example use a character class to match 1+ word characters or a hyphen using [\w-]+
If you still want to use split, you could add capturing groups for the first and the second part, and split only those groups.
^[0-9]{5}_([\w-]+)_[0-9]{4}[-_][0-9]{2}[-_][0-9]{2}[-_][0-9]{2}[:_][0-9]{2}[:_][0-9]{2}\.dat$
          ^^^^^^^^
It seems to be possible to cut down your regex to validate the whole pattern to:
^\d{5}_(.+?)_\d{4}[-_](?:\d{2}[-_]){2}(?:\d{2}[:_]){2}\d{2}\.dat$
Refer to group 1 for your arbitrary string.
Quick reminder: you don't seem to have used raw strings, and instead escaped with a double backslash. Python has raw strings (r'...'), so you no longer have to escape the backslashes.
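Putting the capturing-group and raw-string suggestions together, a minimal sketch might look like the following. The function name middle_part and the second sample filename are made up for illustration; the pattern is the shortened one from the answer above.

```python
import re

# Raw string: backslashes reach the regex engine unchanged, no doubling needed.
FILENAME = re.compile(
    r'^\d{5}_(.+?)_\d{4}[-_](?:\d{2}[-_]){2}(?:\d{2}[:_]){2}\d{2}\.dat$'
)

def middle_part(filename):
    """Return the free-form middle part, or None if the name does not validate."""
    m = FILENAME.match(filename)
    return m.group(1) if m else None

print(middle_part('12345_arbitrarystring_2020_05_20_10_10_10.dat'))
# A hypothetical name whose middle part contains '-' and '_' itself.
print(middle_part('12345_my-arbitrary_string_2020-05-20_10:10:10.dat'))
```

Because the timestamp is anchored at the end of the name, the lazy group can contain separators without being cut short, so validation and extraction happen in a single match.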

Regular expressions: distinguish strings including/excluding a given word

I'm working in Python and try to handle StatsModel's GLM output. I'm relatively new to regular expressions.
I have strings such as
string_1 = "C(State)[T.Kansas]"
string_2 = "C(State, Treatment('Alaska'))[T.Kansas]"
I wrote the following regex:
pattern = re.compile(r'C\((.+?)\)\[T\.(.+?)\]')
print(pattern.search(string_1).group(1))
#State
print(pattern.search(string_2).group(1))
#State, Treatment('Alaska')
So both of these strings match the pattern. But we want to get State in both cases. Basically we want to get rid of everything after the comma (including it) inside the first parentheses.
How can we distinguish the string_2 pattern from string_1's and extract only State without , Treatment?
You can add an optional non-capturing group instead of just allowing all characters:
pattern = re.compile(r'C\((.+?)(?:, .+?)?\)\[T\.(.+?)\]')
(?:...) groups the contents together without capturing it. The trailing ? makes the group optional.
You may use this regex using negative character classes:
C\((\w+)[^[]*\[T\.([^]]+)\]
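Both suggestions can be tried side by side on the two sample strings. This is a sketch only; the constant names are made up, and the brackets inside the character classes are escaped here, which is an equivalent spelling of the pattern above.

```python
import re

# Optional non-capturing group absorbs ", Treatment('Alaska')" when it is present.
OPTIONAL_GROUP = re.compile(r'C\((.+?)(?:, .+?)?\)\[T\.(.+?)\]')
# Negated character classes: skip everything up to '[' and capture up to ']'.
NEGATED_CLASS = re.compile(r'C\((\w+)[^\[]*\[T\.([^\]]+)\]')

samples = ["C(State)[T.Kansas]", "C(State, Treatment('Alaska'))[T.Kansas]"]

for pattern in (OPTIONAL_GROUP, NEGATED_CLASS):
    for s in samples:
        # groups() returns (variable name, level) for each string.
        print(pattern.search(s).groups())
```

The negated-class version assumes the variable name consists of word characters only, while the optional-group version accepts any name that does not itself contain ", ".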

Strip punctuation with regular expression - python

I would like to strip all punctuation (except the dot) from the beginning and end of a string, but not from the middle of it.
For instance for an original string:
##%%.Hol$a.A.$%
I would like to get the word .Hol$a.A. with the punctuation removed from the beginning and end, but not from the middle of the word.
Another example could be for the string:
##%%...&Hol$a.A....$%
In this case the returned string should be ..&Hol$a.A.... because we do not care if the allowed characters are repeated.
The idea is to remove all of the punctuations( except the dot ) just at the beginning and end of the word. A word is defined as \w and/or a .
A practical example is the string 'Barnes&Nobles'. For text analysis it is important to recognize Barnes&Nobles as a single entity, but without the surrounding ' characters.
How to accomplish the goal using Regex?
Use this simple and easily adaptable regex:
[\w.].*[\w.]
It matches from the first allowed character to the last one, which is exactly your desired result.
[\w.] matches any word character (letter, digit or underscore) or a dot
.* matches any run of characters (except newlines by default), as long as possible
[\w.] matches any word character or a dot
To change the delimiters, simply change the set of allowed characters inside the [] brackets.
import re
data = '##%%.Hol$a.A.$%'
pattern = r'[\w.].*[\w.]'
print(re.search(pattern, data).group(0))
# Output: .Hol$a.A.
Depending on what you mean by stripping the punctuation, you can adapt the following code:
import re
res = re.search(r"^[^.]*(\.[^.]*\.([^.]*\.)*?)[^.]*$", "##%%.Hol$a.A.$%")
mystr = res.group(1)
This strips everything before the first dot and after the last dot in the expression.
Warning: re.search returns None when the string doesn't match, so check the result before calling group.
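The None check mentioned above can be wrapped in a small helper. The sketch below reuses the [\w.].*[\w.] idea from the first answer, with one variation that is not in the original: the tail is made optional so that a string with a single allowed character still matches. The function name is hypothetical, and single-line input is assumed because . does not match newlines by default.

```python
import re

# First allowed character, then optionally anything up to the last allowed one.
# The optional tail is a variation on the answer's pattern, not the original.
CORE = re.compile(r'[\w.](?:.*[\w.])?')

def strip_outer_punctuation(text):
    """Drop leading/trailing characters that are neither word characters nor dots."""
    m = CORE.search(text)
    # re.search returns None when no allowed character exists at all.
    return m.group(0) if m else ''

for sample in ('##%%.Hol$a.A.$%', "'Barnes&Nobles'", '#$%'):
    print(repr(strip_outer_punctuation(sample)))
```

Returning an empty string for input with no allowed characters is a design choice made here for convenience; returning None instead would keep the "no match" case distinguishable.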

Regex that considers custom escape characters in the string (not in the pattern)

I'm building a regex that must match a certain pattern that starts with a specific symbol, but at the same time it must not match a pattern that starts with two or more occurrences of that same specific symbol.
To elaborate better, this is my scenario. I have a string like this:
Hello %partials/footer/mail,
%no_slashes_here
%{using_braces}_here
%%should_not_be_matched
And I'm trying to match those substrings that start with exactly one % symbol (since in my case a double %% means "escaping" and should not be matched) and they could optionally be surrounded by curly braces. And at the end, I need to capture the matched substrings but without the % symbol.
So far my regular expression is:
%\{*([0-9a-zA-Z_/]+)\}*
And the captured matches result is:
partials/footer/mail
no_slashes_here
using_braces
should_not_be_matched
Which is very close to what I need, but I got stuck into the double %% escaping part. I don't know how to negate two or more % symbols at the beginning and at the same time allow exactly one occurrence at the beginning too.
EDIT:
Sorry that I missed that, I'm using python.
With negative lookbehind:
%(?<!%%)\{*([0-9a-zA-Z_\/]+)\}*
If this is line based -- you can do:
(?:^|[^%])%\{?([^%}]+)\}?
Python demo:
import re

txt = '''\
Hello %partials/footer/mail,
%no_slashes_here
%{using_braces}_here
%%should_not_be_matched
This %% neither'''

for line in txt.splitlines():
    # A single % at line start or after a non-% character, optional braces around the name
    m = re.search(r'(?:^|[^%])%\{?([^%}]+)\}?', line)
    if m:
        print(m.group(1))
It is unclear from your question how % this % should be treated
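For comparison, the negative-lookbehind pattern from the first answer can be run over the whole text in one pass with re.findall instead of line by line. This is a sketch using the sample text from the question; the names TEXT and PLACEHOLDER are made up for illustration.

```python
import re

TEXT = """Hello %partials/footer/mail,
%no_slashes_here
%{using_braces}_here
%%should_not_be_matched"""

# After consuming one '%', the lookbehind rejects the position if the two
# characters just before it are '%%', i.e. if this '%' follows another '%'.
PLACEHOLDER = re.compile(r'%(?<!%%)\{*([0-9a-zA-Z_/]+)\}*')

names = PLACEHOLDER.findall(TEXT)
print(names)
```

With a single capturing group, findall returns just the captured names, so the leading % and any braces are already stripped. The first % of a %% pair is rejected as well, because the character class cannot match the second %.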
What about
(?<=%)([^%]+)
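I've assumed PCRE, as you've not declared which flavour of regex you're using. Note that this pattern also captures the text after a doubled %%, so escaped sequences still have to be filtered out separately.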

Odd behavior on negative look behind in python

I am trying to do a re.split using a regex that is utilizing look-behinds. I want to split on newlines that aren't preceded by a \r. To complicate things, I also do NOT want to split on a \n if it's preceded by a certain substring: XYZ.
I can solve my problem by installing the regex module which lets me do variable width groups in my look behind. I'm trying to avoid installing anything, however.
My working regex looks like:
regex.split("(?<!(?:\r|XYZ))\n", s)
And an example string:
s = "DATA1\nDA\r\n \r\n \r\nTA2\nDA\r\nTA3\nDAXYZ\nTA4\nDATA5"
Which when split would look like:
['DATA1', 'DA\r\n \r\n \r\nTA2', 'DA\r\nTA3', 'DAXYZ\nTA4', 'DATA5']
My closest non-working expression without the regex module:
re.split("(?<!(?:..\r|XYZ))\n", s)
But this split results in:
['DATA1', 'DA\r\n \r', ' \r', 'TA2', 'DA\r\nTA3', 'DAXYZ\nTA4', 'DATA5']
And this I don't understand. From what I understand about look behinds, this last expression should work. Any idea how to accomplish this with the base re module?
You can use:
>>> re.split(r"(?<!\r)(?<!XYZ)\n", s)
['DATA1', 'DA\r\n \r\n \r\nTA2', 'DA\r\nTA3', 'DAXYZ\nTA4', 'DATA5']
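Your ..\r attempt fails because . does not match \n, so ..\r cannot match when a newline is one of the two characters before the \r (as in \n \r), and the split goes ahead there. Here we have broken your lookbehind into two separate assertions: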
(?<!\r) # previous char is not \r
(?<!XYZ) # previous text is not XYZ
Python's re module won't allow (?<!(?:\r|XYZ)) as a lookbehind because the two alternatives have different widths, which raises this error:
error: look-behind requires fixed-width pattern
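Both points can be reproduced with the standard re module alone. The sketch below uses the sample string from the question; the variable names message and parts are made up for illustration.

```python
import re

s = "DATA1\nDA\r\n \r\n \r\nTA2\nDA\r\nTA3\nDAXYZ\nTA4\nDATA5"

# Alternatives of width 1 (\r) and width 3 (XYZ) inside one lookbehind
# are rejected by the re module when the pattern is compiled.
message = None
try:
    re.compile(r"(?<!(?:\r|XYZ))\n")
except re.error as exc:
    message = str(exc)
print(message)

# Two separate fixed-width lookbehinds, both checked at the same position.
parts = re.split(r"(?<!\r)(?<!XYZ)\n", s)
print(parts)
```

Each lookbehind has its own fixed width, so the engine accepts them, and a newline is used as a split point only when both assertions pass.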
You could use re.findall
>>> s = "DATA1\nDA\r\n \r\n \r\nTA2\nDA\r\nTA3\nDAXYZ\nTA4\nDATA5"
>>> re.findall(r'(?:(?:XYZ|\r)\n|.)+', s)
['DATA1', 'DA\r\n \r\n \r\nTA2', 'DA\r\nTA3', 'DAXYZ\nTA4', 'DATA5']
Explanation:
(?:(?:XYZ|\r)\n|.)+ first tries to match XYZ\n or \r\n at the current position. If neither is there, control passes to the other alternative, ., which matches any character except a line break. The + after the non-capturing group repeats the whole pattern one or more times, so each match runs on until it reaches a newline that is not preceded by \r or XYZ.
