Regex: Matching individual characters without matching characters in between - python

I have a simple regex query.
Here is the input:
DLWLALDYVASQASV
The desired output is the positions of the D, Y and S in the substring DYVAS of DLWLALDYVASQASV.
So it would be D:6, Y:7, S:10.
I am using Python, so I know I can use span() or start() to obtain the start positions of a match. But if I try something like DY.{2}S, it also matches the characters in between and only gives me the position of the first character of the match (and the last, in the case of span()).
Is there a function or a way to retrieve the position of each specified character, not including the characters in-between?

import re
match = re.search(r'(D)(Y)..(S)', 'DLWLALDYVASQASV')
print([match.group(i) for i in range(4)])
>>> ['DYVAS', 'D', 'Y', 'S']
print([match.span(i) for i in range(4)])
>>> [(6, 11), (6, 7), (7, 8), (10, 11)]
print([match.start(i) for i in range(4)])
>>> [6, 6, 7, 10]
You can put the subexpressions of the regular expression in parentheses (capturing groups) and then access the corresponding substrings and their positions via the match object. See the documentation of the Match object for more details.
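As a minimal sketch of the same idea, the loop below pairs each captured character with its start offset and skips group 0 (the whole match). The variable names are only illustrative.

```python
import re

sequence = "DLWLALDYVASQASV"
match = re.search(r"(D)(Y)..(S)", sequence)

# Groups are numbered from 1; group 0 is the whole match, so skip it.
positions = [(match.group(i), match.start(i))
             for i in range(1, match.re.groups + 1)]
print(positions)  # [('D', 6), ('Y', 7), ('S', 10)]
```

match.re.groups is the number of capturing groups in the compiled pattern, so the loop keeps working if more groups are added.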

Related

Python recognize part of string (position and length)

I have a file (.VAR) which gives me positions and lengths within strings, one per row; see the example below.
*STRING1 1L8:StringONE
*STRINGWITHVARIABLELENGTH2 *ABC 29L4:StringTWO
*STRINGWITHLENGTH3 *ABC 33L2:StringTHREE
How do I retrieve the " xxLxxx:" value, which is always preceded by a space and always ends with a colon, but is never at the same location within the string?
Preferably I would like to find the number before L as the position, and the number behind L as the length, but only searching for "L" would give me also the input from other values within the string. Therefore I think I have to use the space_number_L_number_colon to recognize this part, but I don't know how.
Any thoughts? TIA
You can use a regex here.
Example:
s='''*STRING1 1L8:StringONE
*STRINGWITHVARIABLELENGTH2 *ABC 29L4:StringTWO
*STRINGWITHLENGTH3 *ABC 33L2:StringTHREE'''
import re
out = re.findall(r'\s(\d+)L(\d+):', s)
output: [('1', '8'), ('29', '4'), ('33', '2')]
As integers:
out = [tuple(map(int, x)) for x in re.findall(r'\s(\d+)L(\d+):', s)]
output: [(1, 8), (29, 4), (33, 2)]
regex:
\s # space
(\d+) # capture one or more digits
L # literal L
(\d+) # capture one or more digits
: # literal :
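The same breakdown can be kept inside the pattern itself with re.VERBOSE. The sketch below is one possible way to write it; the group names pos and length and the variable names are my own, not part of the question.

```python
import re

FIELD = re.compile(r"""
    \s               # whitespace before the field
    (?P<pos>\d+)     # position: one or more digits
    L                # literal L
    (?P<length>\d+)  # length: one or more digits
    :                # literal colon
""", re.VERBOSE)

s = '''*STRING1 1L8:StringONE
*STRINGWITHVARIABLELENGTH2 *ABC 29L4:StringTWO
*STRINGWITHLENGTH3 *ABC 33L2:StringTHREE'''

# Convert each captured pair of digit strings to integers.
fields = [(int(m.group("pos")), int(m.group("length")))
          for m in FIELD.finditer(s)]
print(fields)  # [(1, 8), (29, 4), (33, 2)]
```

Named groups make it harder to mix up position and length later on, at the cost of a slightly longer pattern.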

Python re.findall() with pattern .* returns empty string along the matching pattern [duplicate]

The Python docs for findall() and finditer() state that:
Empty matches are included in the result unless they touch the
beginning of another match
This can be demonstrated as follows:
In [20]: [m.span() for m in re.finditer('.*', 'test')]
Out[20]: [(0, 4), (4, 4)]
Can anyone tell me though, why this pattern returns an empty match in the first place? Shouldn't .* consume the entire string and return a single match? And further, why is there no empty match at the end if I anchor the pattern to the beginning of the string? e.g.
In [22]: [m.span() for m in re.finditer('^.*', 'test')]
Out[22]: [(0, 4)]
.* means zero or more characters, so once the four characters are consumed, the zero-length string at the end (which doesn't touch the start of any other match) still remains and is matched as well.
With ^ in front, the empty string at the end no longer matches the pattern, because it doesn't start at the start of the string.
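A quick way to see this, as a sketch: .* can match the empty string left at position 4, ^.* cannot because ^ fails there, and .+ cannot because it needs at least one character. The .+ variant and the helper name spans are my additions, not part of the question.

```python
import re

def spans(pattern, text):
    """Return the (start, end) span of every match of pattern in text."""
    return [m.span() for m in re.finditer(pattern, text)]

print(spans(r".*", "test"))   # [(0, 4), (4, 4)]
print(spans(r"^.*", "test"))  # [(0, 4)]
print(spans(r".+", "test"))   # [(0, 4)]
```

If the trailing empty match is unwanted, switching from * to + is usually the simplest fix.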

Modify regex to include hyphenated words

I have this tokenizer I found on another stack question, however, I need to modify it and am struggling. It currently splits hyphenated words into separate tokens but I want them to be single tokens.
tokenizer:
[(m.start(0), m.end(0), m.group()) for m in re.finditer(r"\w+|\$[\d\.]+|\S+", target_sentence)]
given the following sentence: "half-life is a single token" it should give the following tokens (plus the character offset info):
['half-life', 'is', 'a', 'single', 'token']
Instead it gives:
[(0, 4, 'half'),
(4, 9, '-life'),
(10, 12, 'is'),
(13, 14, 'a'),
(15, 21, 'single'),
(22, 27, 'token')]
EDIT: I want the character info not just word tokens so string.split is not going to cut it
Your regex is matching half using \w+ and matching remaining -life using last alternate \S+.
You may use this regex to capture optional hyphenated words:
\w+(?:-\w+)*|\$[\d.]+|\S+
\w+(?:-\w+)* will match one or more words separated by hyphens.
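Plugged back into the original tokenizer, this might look like the sketch below, which keeps the character offsets the question asks for. The names TOKEN and tokens are illustrative.

```python
import re

# Hyphenated words first, then dollar amounts, then any other run of non-space.
TOKEN = re.compile(r"\w+(?:-\w+)*|\$[\d.]+|\S+")

target_sentence = "half-life is a single token"
tokens = [(m.start(), m.end(), m.group())
          for m in TOKEN.finditer(target_sentence)]
print(tokens)
# [(0, 9, 'half-life'), (10, 12, 'is'), (13, 14, 'a'),
#  (15, 21, 'single'), (22, 27, 'token')]
```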
Try this:
[m.group() for m in re.finditer(r"[\w-]+|\$[\d.-]+|\S+", target_sentence)]
>> ['half-life', 'is', 'a', 'single', 'token']
This only returns m.group() instead of the match offsets.
It includes - in the character classes.

Spaces in regular expression

I have this code to find :) and :( in a text:
for match in re.finditer(r':\)|:\(', ":) :):( :) :("):
    print(match.span())
and give me this answer:
(0, 2)
(3, 5)
(5, 7)
(8, 10)
(12, 14)
It works, but I need it to show me only the smileys that stand alone (next to no other character), so the answer would be:
(0, 2)
(8, 10)
(12, 14)
I tried adding \b but got no matches.
Update: I also need to add (x) to the pattern.
for match in re.finditer(r'(?<![\w()]):(?:\)|\()(?![\w:])', ":) :):( :) :( (x)"):
    print(match.span())
shows:
(0, 2)
(8, 10)
(12, 14)
and I want
(0, 2)
(8, 10)
(12, 14)
(16, 19)
If by no other character, you mean no other visible character, so that the only characters allowed around the smiley are space (including tabs), you could use something like this:
for match in re.finditer(r"(?:(?<=\s)|(?<=^)):[()](?=\s|$)", ":) :):( :) :("):
    print(match.span())
(?:(?<=\s)|(?<=^)) makes sure there's either a whitespace character or the beginning of the line before the smiley,
:[()] matches : followed by either ( or )
(?=\s|$) makes sure that there's either a whitespace character or the end of the line after the smiley.
If you additionally want to match the smiley x), you can use this:
r"(?:(?<=\s)|(?<=^))(?::[()]|x\))(?=\s|$)"
If you want to match x( as well, it becomes a little easier:
r"(?:(?<=\s)|(?<=^))[x:][()](?=\s|$)"
[ ... ] is a character class and you don't need to escape stuff in there. Be wary of the placements of - and ^ since those two have special meanings in a character class.
EDIT: It seems I got the additional smiley wrong and assumed x). For :), :( and (x), it would be something like this:
r"(?:(?<=\s)|(?<=^))(?::[()]|\(x\))(?=\s|$)"
reEDIT: Actually, the positive assertions can be shortened with negative ones, which makes it simpler:
r"(?<!\S)(?::[()]|\(x\))(?!\S)"
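Here is a small sketch of that last pattern in action. The sample string is made up for illustration and is not the one from the question.

```python
import re

# A smiley counts only if no non-whitespace character touches it on either side.
SMILEY = re.compile(r"(?<!\S)(?::[()]|\(x\))(?!\S)")

text = ":) :):( (x) :("
found = [(m.span(), m.group()) for m in SMILEY.finditer(text)]
print(found)  # [((0, 2), ':)'), ((8, 11), '(x)'), ((12, 14), ':(')]
```

The glued pair :):( in the middle is skipped because each half has a non-space neighbour.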
:, ( and ) are non-word characters, so \b won't work. You'd use the inverse, \B:
r'\B:(?:\)|\()\B'
Where \b matches on the boundary between \w and \W or vice-versa, \B only matches between two \w or two \W points. Since : and the parenthesis characters are both \W characters, this means they must sit next to another non-word character (or the start or end of the line).
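To illustrate the difference between \b and \B, here is a small sketch on a made-up string. Only :) is searched for, to keep it short.

```python
import re

text = ":) a:) :)"

# \b needs a word character on one side, so only the smiley glued to "a" matches.
with_b = [m.span() for m in re.finditer(r"\b:\)", text)]
# \B matches only where there is no word boundary, so the glued smiley is skipped.
with_B = [m.span() for m in re.finditer(r"\B:\)\B", text)]

print(with_b)  # [(4, 6)]
print(with_B)  # [(0, 2), (7, 9)]
```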
This will still match other smileys too however.
To completely exclude other smileys you need to use both a negative look-ahead and a negative look-behind:
r'(?<![\w()]):(?:\)|\()(?![\w:])'
This says:
(?<![\w()]): No word character or parentheses before the smiley (start of string is fine)
(?![\w:]): No word character or colon after the smiley (end of string is fine)
Demo:
>>> for match in re.finditer(r'(?<![\w()]):(?:\)|\()(?![\w:])', ":) :):( :) :("):
...     print(match.span())
...
(0, 2)
(8, 10)
(12, 14)
For your updated pattern version, you clearly don't mind if ( is in front, so we remove that from the excluded characters preceding the pattern, and update : to [x:] to match either an x or a colon:
r'(?<![\w)])[x:](?:\)|\()(?![\w:])'
Demo:
>>> for match in re.finditer(r'(?<![\w)])[x:](?:\)|\()(?![\w:])', ":) :):( :) :( (x)"):
...     print(match.span())
...
(0, 2)
(8, 10)
(12, 14)
(16, 18)

Regex to parse SDDL

I'm using python to parse out an SDDL using regex. The SDDL is always in the form of 'type:some text' repeated up to 4 times. The types can be either 'O', 'G', 'D', or 'S' followed by a colon. The 'some text' will be variable in length.
Here is a sample SDDL:
O:DAG:S-1-5-21-2021943911-1813009066-4215039422-1735D:(D;;0xf0007;;;AN)(D;;0xf0007;;;BG)S:NO_ACCESS_CONTROL
Here is what I have so far. Two of the tuples are returned just fine, but the other two - ('G','S-1-5-21-2021943911-1813009066-4215039422-1735') and ('S','NO_ACCESS_CONTROL') are not.
import re
sddl="O:DAG:S-1-5-21-2021943911-1813009066-4215039422-1735D:(D;;0xf0007;;;AN)(D;;0xf0007;;;BG)S:NO_ACCESS_CONTROL"
matches = re.findall('(.):(.*?).:',sddl)
print(matches)
[('O', 'DA'), ('D', '(D;;0xf0007;;;AN)(D;;0xf0007;;;BG)')]
what I'd like to have returned is
[('O', 'DA'), ('G','S-1-5-21-2021943911-1813009066-4215039422-1735'), ('D', '(D;;0xf0007;;;AN)(D;;0xf0007;;;BG)'),('S','NO_ACCESS_CONTROL')]
Try the following:
(.):(.*?)(?=.:|$)
Example:
>>> re.findall(r'(.):(.*?)(?=.:|$)', sddl)
[('O', 'DA'), ('G', 'S-1-5-21-2021943911-1813009066-4215039422-1735'), ('D', '(D;;0xf0007;;;AN)(D;;0xf0007;;;BG)'), ('S', 'NO_ACCESS_CONTROL')]
This regex starts out the same way as yours, but instead of including the .: at the end as a part of the match, a lookahead is used. This is necessary because re.findall() will not return overlapping matches, so you need each match to stop before the next match begins.
The lookahead (?=.:|$) essentially means "match only if the next characters are anything followed by a colon, or we are at the end of the string".
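If the sections are needed by type, the result can be turned into a dict. The sketch below also narrows . to [OGDS], based on the four types named in the question; that narrowing is an assumption of mine and not part of the answer above.

```python
import re

sddl = ("O:DAG:S-1-5-21-2021943911-1813009066-4215039422-1735"
        "D:(D;;0xf0007;;;AN)(D;;0xf0007;;;BG)S:NO_ACCESS_CONTROL")

# Assumption: only O, G, D and S can start a section, so the lookahead
# stops the lazy group right before the next "<type>:" or at end of string.
pairs = re.findall(r"([OGDS]):(.*?)(?=[OGDS]:|$)", sddl)
sections = dict(pairs)

print(sections["G"])  # S-1-5-21-2021943911-1813009066-4215039422-1735
print(sections["S"])  # NO_ACCESS_CONTROL
```

On this sample the narrower pattern returns the same four tuples as the . version.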
It seems like using regex isn't the best solution to this problem. Really, all you want to do is split across the colons and then do some transformations on the resulting list.
chunks = sddl.split(':')
pairs = [(chunks[i][-1],
          chunks[i+1][:-1] if i < len(chunks) - 2 else chunks[i+1])
         for i in range(len(chunks) - 1)]
