How to get the first number from span=(2494, 2516) here? - python

I want to cut a text from the point where my regex expression is found to the end of the text. The position may vary, so I need that number as a variable.
The position can already be seen in the result of studentnrRegex.search(text):
>>> studentnrRegex = re.compile(r'(Studentnr = 18\d\d\d\d\d\d\d\d)')
>>> start = studentnrRegex.search(text)
>>> start
<_sre.SRE_Match object; span=(2494, 2516), match='Studentnr = 1825010243'>
>>> myText = text[2494:]
>>> myText
'Studentnr = 1825010243\nTEXT = blablabla
Can I get the start position as a variable directly from my variable start, in this case 2494?

The match object returned by calling .search() has .start() and .end() methods that return the starting and ending positions of the match.
studentnrRegex = re.compile(r'(Studentnr = 18\d\d\d\d\d\d\d\d)')
m = studentnrRegex.search(text)
start = m.start()
print(text[start:])
You can accomplish the same thing with a different regex that matches the student number and everything after it. This will save you the trouble of doing the slice:
studentnrRegex = re.compile(r'(Studentnr = 18\d{8}).*', re.DOTALL)
m = studentnrRegex.search(text)
print(m.group())
The {8} matches 8 repeats of the \d and the .* matches all remaining characters until the end of the string (including newlines) as long as the re.DOTALL flag is specified. The full match is group 0, which is the default value for the .group() method of the match object. You can access the student number as m.group(1).
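Note that .search() returns None when the pattern is not found, so calling .start() on the result would fail in that case. A minimal sketch of the slicing approach with that guard follows; the helper name text_from_match and the sample text are made up for illustration.

```python
import re

# Pattern from the answer above: "Studentnr = 18" followed by eight digits.
studentnr_regex = re.compile(r"Studentnr = 18\d{8}")

def text_from_match(text):
    """Return text from the start of the first match to the end, or None."""
    m = studentnr_regex.search(text)
    if m is None:  # search() returns None when nothing matches
        return None
    return text[m.start():]

# Hypothetical sample input, not the asker's real data.
sample = "header line\nStudentnr = 1825010243\nTEXT = blablabla"
print(text_from_match(sample))
```

If only the two positions are needed, m.span() returns both as a (start, end) tuple.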


Python regex pattern building

I'm trying to incrementally build the following regex pattern in Python using reusable pattern components. I'd expect the pattern p to match the text in lines completely, but it ends up matching only the first line.
import re
nbr = re.compile(r'\d+')
string = re.compile(r'(\w+[ \t]+)*(\w+)')
p1 = re.compile(rf"{string.pattern}\s+{nbr.pattern}\s+{string.pattern}")
p2 = re.compile(rf"{nbr.pattern}\s+{string.pattern}")
p1orp2 = re.compile(rf"{p1.pattern}|{p2.pattern}")
p = re.compile(rf"({p1orp2.pattern}\n)+")
lines = (f"aaaa 100284 aaaa\n"
f"aaaa 365870 bbbb\n"
f"757166 cccc\n"
f"111054 cccc\n"
f"999657 dddd\n"
f"999 eeee\n"
f"2955 ffff\n")
match = p.search(lines)
print(match)
print(match.group(0))
Here's what gets printed:
<re.Match object; span=(0, 16), match='aaaa 100284 aaaa'>
aaaa 100284 aaaa
The problem is here:
p1orp2 = re.compile(rf"{p1.pattern}|{p2.pattern}")
p = re.compile(rf"({p1orp2.pattern}\n)+")
In p the \n is appended to p1orp2, but because | has the lowest precedence, the added \n belongs to the second alternative only, not to the first. It is the same as if you had attached that \n in the definition of p1orp2:
p1orp2 = re.compile(rf"{p1.pattern}|{p2.pattern}\n")
p = re.compile(rf"({p1orp2.pattern})+")
...while you really want to allow the p1 pattern to be followed by \n as well:
p1orp2 = re.compile(rf"{p1.pattern}\n|{p2.pattern}\n")
p = re.compile(rf"({p1orp2.pattern})+")
To achieve that with the \n where it was, you could use parentheses in the definition of p1orp2 so it limits the scope of the | operator:
p1orp2 = re.compile(rf"({p1.pattern}|{p2.pattern})")
p = re.compile(rf"({p1orp2.pattern}\n)+")
With this change it will work as you intended.
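The scoping rule is easier to see on a tiny pattern. The following sketch uses made-up two-letter alternatives rather than the patterns from the question, and a non-capturing group (?:...) in place of plain parentheses, which groups the same way without adding a capture group.

```python
import re

# Without grouping, "\n" belongs only to the second alternative.
ungrouped = re.compile(r"ab|cd\n")
# A non-capturing group limits the scope of "|", so "\n" must follow either alternative.
grouped = re.compile(r"(?:ab|cd)\n")

print(ungrouped.fullmatch("ab\n"))  # None: the first alternative is just "ab"
print(grouped.fullmatch("ab\n"))    # a match object
print(grouped.fullmatch("cd\n"))    # a match object
```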
A secondary issue is that string uses a repeated capturing group, (\w+[ \t]+)*, which only retains the last repetition it matched. That does not stop the pattern from matching, but it adds groups you probably don't need, so a non-capturing group is cleaner.
On its own this change does not fix the match: the alternation still has to be grouped so that the \n applies to both alternatives. Here's an updated version of your code with both changes:
import re
word_sequence = re.compile(r"\w+(?:[ \t]+\w+)*")
nbr = re.compile(r"\d+")
p1 = re.compile(rf"{word_sequence.pattern}\s+{nbr.pattern}\s+{word_sequence.pattern}")
p2 = re.compile(rf"{nbr.pattern}\s+{word_sequence.pattern}")
# Group the alternation so the trailing \n applies to both p1 and p2
p1orp2 = re.compile(rf"(?:{p1.pattern}|{p2.pattern})")
p = re.compile(rf"({p1orp2.pattern}\n)+")
lines = (
f"aaaa 1284 aaaa\n"
f"aaaa 3650 bbbb\n"
f"75071 cccc\n"
f"111872214054 cccc\n"
f"999 dddd\n"
f"999 eeee\n"
f"295255 ffff\n"
)
match = p.search(lines)
print(match)
print(match.group(0))

Finding the index of the second match of a regular expression in python

I am trying to rename files to match the naming convention for Plex Media Server (SxxEyy).
I have a ton of files that use e.g. 411 for S04E11. I have written a little function that searches for an occurrence of this pattern and replaces it with the correct convention, like this:
import re
pattern1 = re.compile(r'[Ss]\d+[Ee]\d+')
pattern2 = re.compile(r'[.\-]\d{3,4}')
def plexify_name(string):
    # If the file matches the pattern we want, don't change it
    if pattern1.search(string):
        return string
    elif pattern2.search(string):
        piece_to_change = pattern2.search(string)
        endpos = piece_to_change.end()
        startpos = piece_to_change.start()
        # Cut out the digits to change (skip the leading separator)
        cut = string[startpos+1:endpos]
        if len(cut) == 4:
            cut = 'S' + cut[0:2] + 'E' + cut[2:4]
        elif len(cut) == 3:
            cut = 'S0' + cut[0:1] + 'E' + cut[1:3]
        return string[0:startpos+1] + cut + string[endpos:]
And this works very well. But it turns out that some of the filenames have a year in them, e.g. the.flash.2014.118.mp4, in which case it changes the 2014.
I tried using
pattern2.findall(string)
This does return a list of strings, like ['.2014', '.118'], but what I want is a list of match objects so I can check whether there are two matches and, in that case, use the start/end of the second. I can't seem to find anything that does this in the re documentation. Am I missing something, or do I need to take a totally different approach?
You could try anchoring the match to the file extension:
pattern2 = re.compile(r'[.-]\d{3,4}(?=[.]mp4$)')
Here, (?= ... ) is a look-ahead assertion, meaning that the thing has to be there for the regex to match, but it's not part of the match:
>>> pattern2.findall('test.118.mp4')
['.118']
>>> pattern2.findall('test.2014.118.mp4')
['.118']
>>> pattern2.findall('test.123.mp4.118.mp4')
['.118']
Of course, you want it to work with all possible extensions:
>>> p2 = re.compile(r'[.-]\d{3,4}(?=[.][^.]+$)')
>>> p2.findall('test.2014.118.avi')
['.118']
>>> p2.findall('test.2014.118.mov')
['.118']
If there is more stuff between the episode number and the extension, regexes for matching that start to get tricky, so I would suggest a non-regex approach for dealing with that:
>>> f = 'test.123.castle.2014.118.x264.mp4'
>>> [p for p in f.split('.') if p.isdigit()][-1]
'118'
Or, alternatively, you can get match objects for all matches by using finditer and expanding the iterator by converting it to a list:
>>> p2 = re.compile(r'[.-]\d{3,4}')
>>> f = 'test.2014.712.x264.mp4'
>>> matches = list(p2.finditer(f))
>>> matches[-1].group(0)
'.712'
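Applied to the renaming function from the question, the finditer approach might look like the sketch below. It assumes the episode code is always the last 3- or 4-digit run preceded by a dot or dash, which will not hold if a later token such as a resolution also starts with digits. The names already_ok and episode are made up for this example.

```python
import re

already_ok = re.compile(r"[Ss]\d+[Ee]\d+")
episode = re.compile(r"[.\-](\d{3,4})")  # group 1 holds only the digits

def plexify_name(name):
    """Rewrite the last 3- or 4-digit run (e.g. 118) as S01E18."""
    if already_ok.search(name):
        return name  # already in SxxEyy form
    matches = list(episode.finditer(name))
    if not matches:
        return name  # nothing to rewrite
    last = matches[-1]  # assumption: the episode code is the last match
    digits = last.group(1)
    season, ep = digits[:-2], digits[-2:]
    tag = "S" + season.zfill(2) + "E" + ep
    # start(1) and end(1) are the positions of the digits, excluding the separator
    return name[:last.start(1)] + tag + name[last.end(1):]

print(plexify_name("the.flash.2014.118.mp4"))
```

Taking matches[-1] rather than matches[1] keeps the function working both when a year is present and when it is not.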

Python string regular expression

I need to do a string compare to see if 2 strings are equal, like:
>>> x = 'a1h3c'
>>> x == 'a__c'
>>> True
independent of the 3 characters in middle of the string.
You need to use anchors.
>>> import re
>>> x = 'a1h3c'
>>> pattern = re.compile(r'^a.*c$')
>>> pattern.match(x) is not None
True
This checks that the first and last characters are a and c, and it doesn't care about the characters in the middle.
If you want to check that exactly three characters are present in the middle, you could use this:
>>> pattern = re.compile(r'^a...c$')
>>> pattern.match(x) is not None
True
Note that the end-of-line anchor $ is important: without $, a...c would match afoocbarbuz.
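As an alternative to writing the anchors by hand, Python 3.4 and later provide fullmatch(), which only succeeds when the whole string matches the pattern. A short sketch, reusing the strings from this answer:

```python
import re

pattern = re.compile(r"a...c")  # no ^ or $ needed with fullmatch()

print(pattern.fullmatch("a1h3c") is not None)        # True
print(pattern.fullmatch("afoocbarbuz") is not None)  # False
```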
Your problem could be solved with string indexing, but if you want an intro to regex, here ya go.
import re
your_match_object = re.match(pattern, string)
The pattern in your case would be
pattern = re.compile("a...c")  # the dot denotes any char but a newline
From here, you can see if your string fits this pattern with
print(pattern.match("a1h3c") is not None)
https://docs.python.org/2/howto/regex.html
https://docs.python.org/2/library/re.html#search-vs-match
if str1[0] == str2[0]:
    # do something
    pass
You can repeat this statement as many times as you like.
This is indexing. We're getting the first character. To get the last character, use [-1].
I'll also mention that with indexing, the string can be of any size, as long as you know the relative position from the beginning or the end of the string.

Python re match last underscore in a string

I have some strings that look like this
S25m\S25m_16Q_-2dB.png
S25m\S25m_1_16Q_0dB.png
S25m\S25m_2_16Q_2dB.png
I want to get the string between slash and the last underscore, and also the string between last underscore and extension, so
Desired:
[S25m_16Q, S25m_1_16Q, S25m_2_16Q]
[-2dB, 0dB, 2dB]
I was able to get the whole thing between slash and extension by doing
import re
foo = r"S25m\S25m_16Q_-2dB.png"
match = re.search(r'([a-zA-Z0-9_-]*)\.(\w+)', foo)
match.group(1)
But I don't know how to make a pattern so I could split it by the last underscore.
Capture the groups you want to get.
>>> re.search(r'([-\w]*)_([-\w]+)\.\w+', r"S25m\S25m_16Q_-2dB.png").groups()
('S25m_16Q', '-2dB')
>>> re.search(r'([-\w]*)_([-\w]+)\.\w+', r"S25m\S25m_1_16Q_0dB.png").groups()
('S25m_1_16Q', '0dB')
>>> re.search(r'([-\w]*)_([-\w]+)\.\w+', r"S25m\S25m_2_16Q_2dB.png").groups()
('S25m_2_16Q', '2dB')
* matches the previous character set greedily (consumes as many as possible); it continues to the last _ since \w includes letters, numbers, and underscore.
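The greedy behaviour can be checked on a shorter, made-up string. In this sketch the first pattern is greedy and splits at the last underscore, while the second uses the lazy quantifier *? and splits at the first one.

```python
import re

# Greedy: (\w*) grabs as much as it can, then backs off to the last "_".
greedy = re.search(r"(\w*)_(\w+)", "a_b_c")
print(greedy.groups())  # ('a_b', 'c')

# Lazy: (\w*?) grabs as little as it can, so the split is at the first "_".
lazy = re.search(r"(\w*?)_(\w+)", "a_b_c")
print(lazy.groups())    # ('a', 'b_c')
```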
>>> list(zip(*[m.groups() for m in re.finditer(r'([-\w]*)_([-\w]+)\.\w+', r'''
... S25m\S25m_16Q_-2dB.png
... S25m\S25m_1_16Q_0dB.png
... S25m\S25m_2_16Q_2dB.png
... ''')]))
[('S25m_16Q', 'S25m_1_16Q', 'S25m_2_16Q'), ('-2dB', '0dB', '2dB')]
A non-regex solution (albeit rather messy):
>>> import os
>>> s = r"S25m\S25m_16Q_-2dB.png"
>>> first, _, last = s.partition("\\")[2].rpartition('_')
>>> print((first, os.path.splitext(last)[0]))
('S25m_16Q', '-2dB')
I know it says using re, but why not just use split?
strings = r"""S25m\S25m_16Q_-2dB.png
S25m\S25m_1_16Q_0dB.png
S25m\S25m_2_16Q_2dB.png"""
strings = strings.split("\n")
parts = []
for string in strings:
    string = string.split(".png")[0]  # Get rid of file extension
    string = string.split("\\")
    splitString = string[1].split("_")
    firstPart = "_".join(splitString[:-1])  # string between slash and last underscore
    parts.append([firstPart, splitString[-1]])
for line in parts:
    print(line)
['S25m_16Q', '-2dB']
['S25m_1_16Q', '0dB']
['S25m_2_16Q', '2dB']
Then just transpose the array,
for line in zip(*parts):
    print(line)
('S25m_16Q', 'S25m_1_16Q', 'S25m_2_16Q')
('-2dB', '0dB', '2dB')

Python Regex - How to Get Positions and Values of Matches

How can I get the start and end positions of all matches using the re module? For example given the pattern r'[a-z]' and the string 'a1b2c3d4' I'd want to get the positions where it finds each letter. Ideally, I'd like to get the text of the match back too.
import re
p = re.compile("[a-z]")
for m in p.finditer('a1b2c3d4'):
    print(m.start(), m.group())
Taken from
Regular Expression HOWTO
span() returns both start and end indexes in a single tuple. Since the
match method only checks if the RE matches at the start of a string,
start() will always be zero. However, the search method of RegexObject
instances scans through the string, so the match may not start at zero
in that case.
>>> p = re.compile('[a-z]+')
>>> print(p.match('::: message'))
None
>>> m = p.search('::: message'); print(m)
<re.Match object; span=(4, 11), match='message'>
>>> m.group()
'message'
>>> m.span()
(4, 11)
Combine that with:
In Python 2.2, the finditer() method is also available, returning a sequence of MatchObject instances as an iterator.
>>> p = re.compile( ... )
>>> iterator = p.finditer('12 drummers drumming, 11 ... 10 ...')
>>> iterator
<callable-iterator object at 0x401833ac>
>>> for match in iterator:
...     print(match.span())
...
(0, 2)
(22, 24)
(29, 31)
You should be able to do something along the lines of
for match in re.finditer(r'[a-z]', 'a1b2c3d4'):
    print(match.span())
For Python 3.x
from re import finditer
for match in finditer("pattern", "string"):
    print(match.span(), match.group())
For each hit in the string, this prints one line containing a tuple with the start and end positions of the match, followed by the matched text.
Note that span() and group() also accept a group index, which is useful when the regex has multiple capture groups:
import re
regex_with_3_groups = r"([a-z])([0-9]+)([A-Z])"
for match in re.finditer(regex_with_3_groups, string):  # string is the text being searched
    for idx in range(0, 4):  # 0 is the whole match, 1 to 3 are the capture groups
        print(match.span(idx), match.group(idx))
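A runnable variant of the same idea is sketched below. The sample string is made up, and the loop bound is read from the compiled pattern's groups attribute instead of being hard-coded to 4.

```python
import re

regex_with_3_groups = r"([a-z])([0-9]+)([A-Z])"
sample = "x12Y and z345W"  # hypothetical input for illustration

for match in re.finditer(regex_with_3_groups, sample):
    # match.re.groups is the number of capture groups in the pattern (3 here)
    for idx in range(match.re.groups + 1):
        # idx 0 is the whole match, 1..3 are the individual groups
        print(idx, match.span(idx), match.group(idx))
```

The end position in each span is exclusive, so sample[start:end] gives back exactly the text of that group.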
