How to fix my nonworking Python regex match? - python

I want to grab the whole number out of this string <some>some 344.3404.3 numbers<tag>.
Using the Pythex emulator website this works with [\d\.]* (a digit or point repeated zero or more times).
In Python I get back the whole string:
Input:
import re
re.match(r'[\d\.]*', '<some>some 344.3404.3 numbers<tag>').string
Output:
'<some>some 344.3404.3 numbers<tag>'
What am I missing?
Running Python 3.3.5 on 64-bit Windows 7.

The string attribute of a regex match object contains the input string of the match, not the matched content.
If you want the (first) matching part, you need to change three things:
use re.search() because re.match() will only find a match at the start of the string,
access the group() method of the match object,
use + instead of * or you'll get an empty (zero-length) match unless the match happens to be at the start of the string.
Therefore, use
>>> re.search(r'[\d.]+', '<some>some 344.3404.3 numbers<tag>').group()
'344.3404.3'
or
>>> re.findall(r'[\d.]+', '<some>some 344.3404.3 numbers more 234.432<tag>')
['344.3404.3', '234.432']
if you expect more than one match.
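The three points above can be checked side by side. This is a minimal sketch using the question's input string; the variable names are only illustrative.

```python
import re

text = '<some>some 344.3404.3 numbers<tag>'

# re.match only tries position 0; with * it succeeds there with an empty match.
m = re.match(r'[\d.]*', text)
print(repr(m.group()))   # ''
print(m.string == text)  # True: .string is the input, not the matched part

# re.search scans forward, and + requires at least one character.
found = re.search(r'[\d.]+', text)
print(found.group() if found else None)  # 344.3404.3
```

Checking the result for None before calling group() avoids an AttributeError when nothing matches.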

You can use this:
re.search(r'[\d.]+', '<some>some 344.3404.3 numbers<tag>').group()
Notes: Your pattern didn't work because [\d.]* will match the empty string at the first position. This is why I have replaced the quantifier with + and changed the method from match to search.
There is no need to escape the dot inside a character class, since it is seen by default as a literal character.

Related

Extracting a word between two path separators that comes after a specific word

I have the following path stored as a python string 'C:\ABC\DEF\GHI\App\Module\feature\src' and I would like to extract the word Module that is located between words \App\ and \feature\ in the path name. Note that there are file separators '\' in between which ought not to be extracted, but only the string Module has to be extracted.
I had a few ideas on how to do it:
Write a RegEx that matches a string between \App\ and \feature\
Write a RegEx that matches a string after \App\ --> App\\[A-Za-z0-9]*\\, and then split that matched string in order to find the Module.
I think the 1st solution is better, but unfortunately it goes beyond my RegEx knowledge and I am not sure how to do it.
I would much appreciate any help.
Thank you in advance!
The regex you want is:
(?<=\\App\\).*?(?=\\feature\\)
Explanation of the regex:
(?<=behind)rest matches all instances of rest if there is behind immediately before it. It's called a positive lookbehind
rest(?=ahead) matches all instances of rest where there is ahead immediately after it. This is a positive lookahead.
\ is a reserved character in regex patterns, so to use it as part of the pattern itself, we have to escape it; hence, \\
.* matches any character, zero or more times.
? specifies that the match is not greedy (so we are implicitly assuming here that \feature\ only shows up once after \App\).
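The pattern in general also assumes that there is exactly one path component between \App\ and \feature\; if there were several, .*? would return all of them, including the \ characters in between.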
The full code would be something like:
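import re

path = 'C:\\ABC\\DEF\\GHI\\App\\Module\\feature\\src'  # named path to avoid shadowing the built-in str
start = '\\App\\'
end = '\\feature\\'
# re.escape escapes the backslashes so they are matched literally
pattern = rf"(?<={re.escape(start)}).*?(?={re.escape(end)})"
print(pattern) # (?<=\\App\\).*?(?=\\feature\\)
print(re.search(pattern, path)[0]) # Module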
A link on regex lookarounds that may be helpful: https://www.regular-expressions.info/lookaround.html
This can also be done without a regex, using str.find and str.rfind:
path = 'C:\\ABC\\DEF\\GHI\\App\\Module\\feature\\src'
start = '\\App\\'
end = '\\feature\\'
# slice from just after the start marker up to the last occurrence of the end marker
print(path[path.find(start) + len(start):path.rfind(end)])
Output:
Module
You are looking for groups. With some small modifications you can extract only the part between App and feature.
(?:App\\\\)([A-Za-z0-9]*)(?:\\\\feature)
The parentheses ( ) define a capture group, which you can get with match.group(1). Using (?:foo) defines a non-capturing group, i.e. one that is not included in your result. Try the expression here: https://regex101.com/r/24mkLO/1
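A minimal sketch of the group approach on the path from the question. It assumes the pattern is written as a raw string, where \\ is enough to match one literal backslash; the four-backslash form is what you would type in a regular (non-raw) string literal.

```python
import re

path = r'C:\ABC\DEF\GHI\App\Module\feature\src'

# Raw string: \\ in the pattern matches one literal backslash.
pattern = r'(?:App\\)([A-Za-z0-9]*)(?:\\feature)'

m = re.search(pattern, path)
module = m.group(1) if m else None
print(module)  # Module
```

Only group 1 is returned here; the two (?:...) groups take part in the match but are not numbered.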

Modifying a regular expression by another by adding a something to it

I am trying to modify the text matched by my regular expression using replace. What I ultimately want to do is add 01/ to the front of whatever my existing pattern matches. It is literally replacing one pattern with another.
Here is what I am doing with replace:
df['found_d'].str.replace(pattern2, '1/'+pattern2)
#must be str, not _sre.SRE_Pattern
I would like to use sub, but it takes 3 arguments and I am not sure how to use it at this point.
Here is an expected input:
df['found_d']= 01/07/91 or 01/07/1991
I need to add a missing date to my pattern.
No need for callables, re provides dedicated means to access the matched text during replacement.
In order to prepend a literal 01/ to a pattern match, use the unambiguous \g<0> backreference to the whole match in the replacement string, rather than concatenating the regex pattern itself:
df['found_d'] = df['found_d'].str.replace(pattern2, r'01/\g<0>')
                                                    ^^^^^^^^^^^
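The same replacement string works with plain re, which makes it easy to try outside pandas. In this sketch the definition of pattern2 is an assumption (the question never shows it); it stands in for a month/year pattern such as 07/91 or 07/1991.

```python
import re

# Hypothetical pattern2: two-digit month, slash, two- or four-digit year.
pattern2 = re.compile(r'\b\d{2}/(?:\d{4}|\d{2})\b')

# \g<0> stands for the whole match, so 01/ is put in front of it.
fixed = pattern2.sub(r'01/\g<0>', 'seen 07/91 and 07/1991')
print(fixed)  # seen 01/07/91 and 01/07/1991
```

Writing \g<0> rather than \0 avoids ambiguity when the replacement text continues with digits.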
Starting from version 0.20, pandas str.replace can accept a callable that will receive a match object. For example, if a column holds values made of 2 uppercase letters followed by 2 decimal digits and you want to swap the two parts with a colon between them, you could use:
df['col'] = df['col'].str.replace(r'([A-Z]{2})([0-9]{2})',
                                  lambda m: "{}:{}".format(m.group(2), m.group(1)))
It gives you the full power of the re module inside pandas, here changing 'AB12' to '12:AB'.
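The callable form behaves the same way in re.sub, so it can be tried without a DataFrame. A small sketch; the function name swap and the sample text are only illustrative.

```python
import re

def swap(m):
    # m is the match object: group(1) holds the letters, group(2) the digits.
    return "{}:{}".format(m.group(2), m.group(1))

swapped = re.sub(r'([A-Z]{2})([0-9]{2})', swap, 'AB12 and CD34')
print(swapped)  # 12:AB and 34:CD
```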

Need a specific explanation of part of a regex code

I'm developing a calculator program in Python, and need to remove leading zeros from numbers so that calculations work as expected. For example, if the user enters "02+03" into the calculator, the result should be 5. In order to remove these leading zeros in front of digits, I asked a question on here and got the following answer.
self.answer = eval(re.sub(r"((?<=^)|(?<=[^\.\d]))0+(\d+)", r"\1\2", self.equation.get()))
I fully understand how the positive lookbehind to the beginning of the string and lookbehind to the non digit, non period character works. What I'm confused about is where in this regex code can I find the replacement for the matched patterns?
I found this online when researching regex expressions.
result = re.sub(pattern, repl, string, count=0, flags=0)
Where is the "repl" in the regex code above? If possible, could somebody please help to explain what the r"\1\2" is used for in this regex also?
Thanks for your help! :)
The "repl" part of the regex is this component:
r"\1\2"
In the "find" part of the regex, group capturing is taking place (ordinarily indicated by "()" characters around content, although a group can be made non-capturing by writing "(?:...)").
In Python regex, the syntax used to indicate a reference to a positional captured group (sometimes called a "backreference") is "\n" (where "n" is a digit referring to the position of the group in the "find" part of the regex).
So, this regex is returning a string in which the overall content is being replaced specifically by parts of the input string matched by numbered groups.
Note: I don't believe the "\1" part of the "repl" is actually required. I think:
r"\2"
...would work just as well.
Further reading: https://www.regular-expressions.info/brackets.html
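A short sketch that checks the claim about "\1" above. The input string is made up for illustration; the pattern is the one from the question.

```python
import re

pattern = r"((?<=^)|(?<=[^\.\d]))0+(\d+)"
equation = "02+03*0.05"

# Group 1 contains only lookbehinds, so it always captures an empty string.
with_both = re.sub(pattern, r"\1\2", equation)
only_two = re.sub(pattern, r"\2", equation)

print(with_both)              # 2+3*0.05
print(with_both == only_two)  # True
```

The zeros in 0.05 are left alone: the first is followed by a period rather than a digit, and the second is preceded by a period, which the lookbehind excludes.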
Firstly, repl is the text that each match is replaced with.
To understand \1\2 you need to know what capture groups are.
Check this video out for the basics of group capturing.
Every pair of parentheses () in your regex creates a numbered group (1, 2, and so on) that holds the part of the match it encloses.
In a Python replacement string these are referenced as \1, \2 (or \g<1>, \g<2>); some other regex flavors write $1, $2.
In this case: the regex replaces each number that has leading zeros with just the digits after those zeros (which are captured by group 2).
Note: \1 is not necessary, because group 1 contains only lookbehinds and always captures an empty string. It works fine without it.
See example:
>>> import re
>>> s = 'awd232frr2cr23'
>>> re.sub(r'\d', ' ', s)
'awd   frr cr  '
Explanation:
\d matches a single digit, so each digit in s is replaced with repl (in this case ' ').

find position of a substring in a string

I have a Python string of the format
mystr = "hi.this(is?my*string+"
Here I need to get the position of the 'is' that is surrounded by special characters or non-alphabetic characters (i.e. the second 'is' in this example). However, using
mystr.find('is')
will return the position of the 'is' that is part of 'this', which is not desired. How can I find the position of a substring that is surrounded by non-alphabetic characters in a string? I am using Python 2.7.
Here the best option is to use a regular expression. Python has the re module for working with regular expressions.
We use a simple search to find the position of the "is":
>>> import re
>>> match = re.search(r"[^a-zA-Z](is)[^a-zA-Z]", mystr)
This returns the first match as a match object. We then simply use MatchObject.start() to get the starting position:
>>> match.start(1)
8
Edit: A good point made, we make "is" a group and match that group to ensure we get the correct position.
As pointed out in the comments, this makes a few presumptions. One is that "surrounded" means "is" cannot be at the very beginning or end of the string. If it can, a different regular expression is needed, as this one only matches occurrences with a character on both sides.
Another is that this counts numbers as the special characters - you stated non-alphabetic, which I take to mean numbers included. If you don't want numbers to count, then using r"\b(is)\b" is the correct solution.
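If "is" should also be found at the very start or end of the string, one option is to swap the character classes for negative lookarounds. This is a sketch of that variant, not part of the original answer; it still treats digits as non-alphabetic.

```python
import re

mystr = "hi.this(is?my*string+"

# Lookarounds assert "no letter on either side" without consuming a character,
# so a match at position 0 or at the end of the string is accepted too.
pattern = re.compile(r"(?<![a-zA-Z])is(?![a-zA-Z])")

m = pattern.search(mystr)
position = m.start() if m else -1
print(position)  # 8

at_start = pattern.search("is it?")
print(at_start.start())  # 0
```

Because nothing around "is" is consumed, m.start() already points at the "i" and no capture group is needed.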

Python regular expression to match # followed by 0-7 followed by ##

I would like to intercept strings starting with *#*
followed by a number between 0 and 7
and ending with ##,
so something like *#*0##,
but I could not find a regex for this.
Assuming you want to allow only one # before and two after, I'd do it like this:
r'^(\#{1}([0-7])\#{2})'
It's important to note that Alex's regex will also match things like
###7######
########1###
which may or may not matter.
My regex above matches a string starting with #[0-7]## and ignores the end of the string. You could tack a $ onto the end if you wanted it to match only if that's the entire line.
The first group gives you the entire #<number>## string and the second group gives you the number inside it.
None of the above examples are taking into account the *#*
^\*#\*[0-7]##$
Pass : *#*7##
Fail : *#*22324324##
Fail : *#3232#
The ^ character will match the start of the string, \* will match a single asterisk, the # characters do not need to be escaped in this example, and finally the [0-7] will only match a single character between 0 and 7.
r'\#[0-7]\#\#'
The regular expression should be like ^#[0-7]##$
As I understand the question, the simplest regular expression you need is:
rex= re.compile(r'^\*#\*([0-7])##$')
The {1} constructs are redundant.
After doing rex.match (or rex.search, but it's not necessary here), .group(1) of the match object contains the digit given.
EDIT: The whole matched string is always available as match.group(0). If all you need is the complete string, drop any parentheses in the regular expression:
rex= re.compile(r'^\*#\*[0-7]##$')
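A quick check of the grouped pattern against the pass/fail strings listed in the earlier answer. The last candidate, '*#*8##', is an extra case added here to show a digit outside the 0 to 7 range.

```python
import re

rex = re.compile(r'^\*#\*([0-7])##$')

results = {}
for candidate in ['*#*7##', '*#*22324324##', '*#3232#', '*#*8##']:
    m = rex.match(candidate)
    # group(1) is the digit; None means the string was rejected.
    results[candidate] = m.group(1) if m else None

print(results)
```

Only '*#*7##' yields a digit; the other three candidates map to None.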
