I have a string that looks like any of these three examples:
1: Name = astring Some comments
2: Typ = one two thee Must be "sand", "mud" or "bedload"
3: RDW = 0.02 [ - ] Some comment about RDW
I first split the variable name and rest like so:
re.findall(r'\s*([a-zA-z0-9_]+)\s*=\s*(.*)', line)
I then want to split the right-hand part of the string into a part containing the values and a part containing the comments (if there are any). I want to do this by counting consecutive whitespace characters: if a run exceeds, say, 4, I assume that the comments start there.
Any idea on how to do this?
I currently have
re.findall(r'(?:(\S+)\s{0,3})+', dataString)
However if I test this using the string:
'aa aa23r234rf2134213^$&$%& bb'
then it also selects 'bb'.
You may use a single regex with re.findall:
^\s*(\w+)\s*=\s*(.*?)(?:(?:\s{4,}|\[)(.*))?$
See the regex demo.
Details:
^ - start of string
\s* - 0+ whitespaces
(\w+) - capturing group #1 matching 1 or more letters/digits/underscores
\s*=\s* - = enclosed with 0+ whitespaces
(.*?) - capturing group #2 matching any 0+ chars, as few as possible, up to the first...
(?:(?:\s{4,}|\[)(.*))? - an optional group matching
(?:\s{4,}|\[) - 4 or more whitespaces or a [
(.*) - capturing group #3 matching 0+ chars up to
$ - the end of string.
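A minimal sketch of how the pattern above might be applied in Python. The helper name split_line is hypothetical, and the first sample line assumes the comment is separated from the value by at least four spaces (the exact spacing of the question's examples is not preserved here).

```python
import re

# Pattern from the answer: name, value, and an optional comment that starts
# after 4+ whitespace characters or at a "[".
LINE_RX = re.compile(r'^\s*(\w+)\s*=\s*(.*?)(?:(?:\s{4,}|\[)(.*))?$')

def split_line(line):
    """Return (name, value, comment), or None if the line does not match.

    comment is None when the line has no comment part.
    """
    match = LINE_RX.match(line)
    if match is None:
        return None
    name, value, comment = match.groups()
    # The lazy value group can keep a trailing space before "[", so strip it.
    if comment is not None:
        comment = comment.strip()
    return name, value.strip(), comment

print(split_line('Name = astring    Some comments'))
print(split_line('RDW = 0.02 [ - ] Some comment about RDW'))
print(split_line('Name = astring'))
```

Note that when the comment starts at a [, the bracket itself is consumed by the separator alternative and does not appear in the third group.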
I have examined a previous question relating to optional capture groups in Python, but it has not been helpful. Attempting to follow it, I have the code below:
import re
c = re.compile(r'(?P<Prelude>.*?)'
r'(?:Discussion:(?P<Discussion>.+?))?'
r'(?:References:(?P<References>.*?))?',
re.M|re.S)
test_text = r"""Prelude strings
Discussion: this is some
text.
References: My bad, I have none.
"""
test_text2 = r"""Prelude strings
Discussion: this is some
text.
"""
print(c.match(test_text).groups())
print(c.match(test_text2).groups())
Both print ('Prelude strings', None, None) instead of capturing the two groups. I am unable to determine why.
The expected result is ('Prelude strings', ' this is some\ntext.', ' My bad, I have none.') for the first, and the second the same but with None as the third capture group. It should also be possible to delete the Discussion lines and still capture References.
You can use
c = re.compile(r'^(?P<Prelude>.*?)'
r'(?:Discussion:\s*(?P<Discussion>.*?)\s*)?'
r'(?:References:\s*(?P<References>.*?))?$',
re.S)
One-line regex pattern as a string:
(?s)^(?P<Prelude>.*?)(?:Discussion:\s*(?P<Discussion>.*?)\s*)?(?:References:\s*(?P<References>.*?))?$
See the regex demo.
Details:
(?s) - same as re.S, makes . match line break chars
^ - start of the whole string (note that it no longer matches start of any line, since I removed the re.M flag)
(?P<Prelude>.*?) - Group "Prelude": any zero or more chars as few as possible
(?:Discussion:\s*(?P<Discussion>.*?)\s*)? - an optional non-capturing group matching one or zero occurrences of the following sequence:
Discussion: - a fixed string
\s* - zero or more whitespaces
(?P<Discussion>.*?) - Group "Discussion": zero or more chars as few as possible
\s* - zero or more whitespaces
(?:References:\s*(?P<References>.*?))? - an optional non-capturing group matching one or zero occurrences of the following sequence:
References: - a fixed string
\s* - zero or more whitespaces
(?P<References>.*?) - Group "References": any zero or more chars as few as possible
$ - end of the string.
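As a quick check of the pattern above, here is a small sketch that runs it against the two texts from the question plus a third one without the Discussion section. The helper name parse_sections and the variable names are illustrative only.

```python
import re

SECTIONS_RX = re.compile(
    r'^(?P<Prelude>.*?)'
    r'(?:Discussion:\s*(?P<Discussion>.*?)\s*)?'
    r'(?:References:\s*(?P<References>.*?))?$',
    re.S)

def parse_sections(text):
    """Return a dict of the three named groups, or None if nothing matches."""
    match = SECTIONS_RX.match(text)
    return match.groupdict() if match else None

full = "Prelude strings\nDiscussion: this is some\ntext.\nReferences: My bad, I have none.\n"
no_refs = "Prelude strings\nDiscussion: this is some\ntext.\n"
no_discussion = "Prelude strings\nReferences: My bad, I have none.\n"

for text in (full, no_refs, no_discussion):
    print(parse_sections(text))
```

The Prelude group keeps its trailing line break, since nothing in the pattern trims whitespace before Discussion: or References:; call .strip() on it if that matters.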
I'm starting to learn regex in order to match words in DataFrame columns in Python and replace them with other values.
df['col1']=df['col1'].str.replace(r'(?i)unlimi+\w*', 'Unlimited', regex=True)
This pattern serves to match different variations of the word Unlimited. But some values in the column contain not just one word, but two or more:
ex:
[Unlimited, Unlimited (on-net), Unlimited (on-off-net)]
I was wondering if there is a way to match all of the words in the previous example with a single regex line.
You can use
df['col1']=df['col1'].str.replace(r'(?i)unlimi\w*(?:\s*\([^()]*\))?', 'Unlimited', regex=True)
See the regex demo.
The (?i)unlimi\w*(?:\s*\([^()]*\))? regex matches
(?i) - the regex to the right is case insensitive
unlimi - a fixed string
\w* - zero or more word chars
(?:\s*\([^()]*\))? - an optional sequence of
\s* - zero or more whitespaces
\( - a ( char
[^()]* - zero or more chars other than ( and )
\) - a ) char.
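To try the pattern without a DataFrame, the same regex can be passed to re.sub from the standard library. This is only a sketch: the sample values are made up for illustration, and the answer itself uses the pandas str.replace call shown above.

```python
import re

# Same pattern as in the answer: "unlimi" + word chars + optional "(...)" part.
UNLIMITED_RX = re.compile(r'(?i)unlimi\w*(?:\s*\([^()]*\))?')

# Hypothetical sample values, including a misspelling and a non-matching entry.
values = ['Unlimited', 'unlimitted (on-net)', 'UNLIMITED (on-off-net)', '500 MB']
normalized = [UNLIMITED_RX.sub('Unlimited', v) for v in values]
print(normalized)
```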
I am trying to parse SQL code using regex in Python.
I need an expression that ends a group at a comma or at the end of the string, but only when the comma comes after any open brackets have been closed.
My current regexp matches second group only up to first occurrence of a comma, regardless of parentheses count:
(?m)^\s*'?([A-Za-z0-9_-]+)'?\s*=\s*((?s:.)*?)(?:\s*)(?=,|\Z)
For example, in the string below:
COL1 = DEF1,
COL2 = DEF(TEST,
TEST2),
COL3 = FUN(1, 2),
I get:
0: DEF1
1: DEF(TEST
2: FUN(1
And I would like it to match:
0: DEF1
1: DEF(TEST,
TEST2)
2: FUN(1, 2)
Thanks in advance!
You may use
(?sm)^\s*'?([\w-]+)'?\s*=\s*(.*?)(?=^\s*'?[\w-]+'?\s*=|\Z)
See the regex demo
Details
(?sm) - DOTALL and MULTILINE options on
^ - start of a line
\s* - 0+ whitespaces
'? - an optional '
([\w-]+) - Group 1: one or more word or - chars
'? - an optional '
\s*=\s* - a = enclosed with 0+ whitespaces
(.*?) - Group 2: any zero or more chars, including line break chars (since DOTALL is on), as few as possible
(?=^\s*'?[\w-]+'?\s*=|\Z) - a positive lookahead requiring the end of string (\Z) or ^\s*'?[\w-]+'?\s*= pattern immediately to the right of the current location.
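A short sketch of the pattern in use on the sample from the question. Because Group 2 runs up to the next assignment (or the end of the string), it still contains the trailing comma and line break; the strip step below is an addition for illustration, not part of the regex, and parse_assignments is a hypothetical helper name.

```python
import re

ASSIGN_RX = re.compile(
    r"(?sm)^\s*'?([\w-]+)'?\s*=\s*(.*?)(?=^\s*'?[\w-]+'?\s*=|\Z)")

sql = "COL1 = DEF1,\nCOL2 = DEF(TEST,\nTEST2),\nCOL3 = FUN(1, 2),"

def parse_assignments(text):
    """Return (name, definition) pairs with trailing commas/whitespace removed."""
    pairs = []
    for name, raw in ASSIGN_RX.findall(text):
        # Group 2 ends right before the next "NAME =", so trim what is left over.
        pairs.append((name, raw.strip().rstrip(',')))
    return pairs

for name, definition in parse_assignments(sql):
    print(name, '->', repr(definition))
```

The pattern does not count brackets; it relies on each assignment starting on its own line, so a definition that contains a line beginning with NAME = would be split there.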
I'm working on my regex skills and I find that some of my strings have a duplicated word at the start. I would like to remove the duplicate and keep just one occurrence of the word -
server_server_dev1_check_1233.zzz
server_server_qa1_run_1233.xyz
server_server_dev2_1233.qqa
server_dev1_1233.zzz
data_data_dev9_check_660.log
I used the regex below, but I get both words (server_server) in my output:
((.*?))_(?!\D)
How can I reduce my output to just one server_ if there are two or more, and if there is only one server_, take it as is?
The output doesn't have to contain the digits or the part after the dot, i.e. .zzz, .xyz etc.
Expected output -
server_dev1_check
server_qa1_run
server_dev2
server_dev1
data_dev9_check
You could use a backreference to the word in your search expression:
>>> s = "server_server_dev1_check_1233.zzz"
>>> re.sub(r"(.*_)\1",r"\1",s)
'server_dev1_check_1233.zzz'
and use the "many times" quantifier ({1,}) on the backreference so that it still works if there are more than 2 occurrences:
>>> s = 'server_server_server_dev1_check_1233.zzz'
>>> re.sub(r"(.*_)\1{1,}",r"\1",s)
'server_dev1_check_1233.zzz'
Getting rid of the suffix is not the hardest part; just capture the rest and discard the end:
>>> re.sub(r"(.*_)\1{1,}(.*)(_\d+\..*)",r"\1\2",s)
'server_dev1_check'
You may use a single re.sub call to match and remove what you do not need and match and capture what you need:
re.sub(r'^([^_]+)(?:_\1)*(.*)_\d+\.\w+$', r'\1\2', s)
See the regex demo
Details
^ - start of string
([^_]+) - Capturing group 1: any 1+ chars other than _
(?:_\1)* - zero or more repetitions of _ followed with the same substring as in Group 1 (thanks to the inline backreference \1 that retrieves the text from Group 1)
(.*) - Group 2: any 0+ chars, as many as possible
_ - an underscore
\d+ - 1+ digits
\. - a dot
\w+ - 1+ word chars ([^.]+ will also do, 1 or more chars other than .)
$ - end of string.
The replacement pattern is \1\2, i.e. the contents of Group 1 and 2 are concatenated and make up the resulting value.
Python demo:
import re
rx = r'^([^_]+)(?:_\1)*(.*)_\d+\.\w+$'
strs = ["server_server_dev1_check_1233.zzz", "server_server_qa1_run_1233.xyz", "server_server_dev2_1233.qqa", "server_dev1_1233.zzz", "data_data_dev9_check_660.log"]
for s in strs:
    print(re.sub(rx, r'\1\2', s))
Output:
server_dev1_check
server_qa1_run
server_dev2
server_dev1
data_dev9_check
Just getting to the next stage of understanding regex, hoping the community can help...
string = These.Final.Hours-AUSVERSION.2013-TEST-TESTAGAIN-YIFY.cp(tt123456).MiLLENiUM.mp4
There are multiple test names preceded by a '-' hyphen which I derive from regex
/(?<=-)\w+/g
Result:
AUSVERSION
TEST
TESTAGAIN
YIFY
I can parse the very last result using greediness with the regex /(?!.*-)(?<=-)\w+/g
Result:
YIFY (4th & last result)
Can you please help me parse either the 1st, 2nd, or 3rd result globally using the same string?
In Python, you can get these matches with a simple -\s*(\w+) regex and re.findall and then access any match with the appropriate index:
See IDEONE demo:
import re
s = 'These.Final.Hours-AUSVERSION.2013-TEST-TESTAGAIN-YIFY.cp(tt123456).MiLLENiUM.mp4'
r = re.findall(r'-\s*(\w+)', s)
print(r[0]) # => AUSVERSION
print(r[1]) # => TEST
print(r[2]) # => TESTAGAIN
print(r[3]) # => YIFY
The -\s*(\w+) pattern searches for a hyphen, followed by 0+ whitespaces, and then captures 1+ digits, letters or underscores. re.findall only returns the texts captured with capturing groups, so you only get those Group 1 values captured with (\w+).
To get these matches one by one, with re.search, you can use ^(?:.*?-\s*(\w+)){n}, where n is the match index you want. Here is a regex demo.
A quick Python demo (in real code, assign the result of re.search and only access Group 1 value after checking if there was a match):
import re
s = "These.Final.Hours-AUSVERSION.2013-TEST-TESTAGAIN- YIFY.cp(tt123456).MiLLENiUM.mp4"
print(re.search(r'^(?:.*?-\s*(\w+))', s).group(1))     # => AUSVERSION
print(re.search(r'^(?:.*?-\s*(\w+)){2}', s).group(1))  # => TEST
print(re.search(r'^(?:.*?-\s*(\w+)){3}', s).group(1))  # => TESTAGAIN
print(re.search(r'^(?:.*?-\s*(\w+)){4}', s).group(1))  # => YIFY
Explanation of the pattern:
^ - start of string
(?:.*?-\s*(\w+)){2} - a non-capturing group that matches (here) 2 sequences of:
.*? - 0+ any characters other than a newline (since no re.DOTALL modifier is used) up to the first...
- - hyphen
\s* - 0 or more whitespaces
(\w+) - Group 1 capturing 1+ word characters (letters, digits or underscores).
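Following the advice above about checking for a match before reading Group 1, the {n} pattern could be wrapped in a small helper. This is a sketch only; the function name nth_hyphen_word is hypothetical.

```python
import re

def nth_hyphen_word(text, n):
    """Return the n-th (1-based) word that follows a hyphen, or None."""
    if n < 1:
        raise ValueError("n must be 1 or greater")
    # Group 1 keeps the value captured in the last of the n repetitions.
    match = re.search(r'^(?:.*?-\s*(\w+)){%d}' % n, text)
    return match.group(1) if match else None

s = 'These.Final.Hours-AUSVERSION.2013-TEST-TESTAGAIN-YIFY.cp(tt123456).MiLLENiUM.mp4'
print(nth_hyphen_word(s, 2))
print(nth_hyphen_word(s, 5))  # only four hyphenated words exist, so no match
```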