replacement pattern in python(re.sub()) - python

The question
Can someone please explain the process of the following re.sub() to me.
I am thinking the process is as following:
look for a "." then look for a digit then look for another digit that is between 1 and 9. Now I am lost. What is the question mark for? What does the \d* do? Why do we need to use raw string regex in this case?

If you want to understand the process, I can simply explain it to you. I don't know if this regular expression is doing what you want or not..
At first, the . is a special character in regex which means any character. But, we here want to use the dot character. In regex, this can be done by using escaping character \ like so \.. So, using . means any character and using \. means a dot.
The \d represents any digit and acts exactly like [0-9]
When you used [1-9], by then you specified to get the numbers from 1 till 9 which means that zero is excluded.
We can use the asterisk * to choose zero or more characters. Unlike + which is used to choose one or more characters. So, using \d* means any consecutive digits from [0-9] or none.
The ? is used to indicate using just one character or none. So, using [1-9]? means try to find just one digit between 1 and 9 IF FOUND.
The Parenthesis () is used for grouping the whole regular expression in one output.
If you want to know more about regular expression, here is an awesome cheat sheet.
NOTE:
I think the regex you have written in the question is not correct. I think it should be as follows (\d*\.\d\d[1-9]?) to obtain the same result. I will try to explain this regular expression using this number 3.141500012. \d*\. means find any number of digits that could be found before the dot which would match the 3.. then after that \d\d matches two digits after the dot which are 14. Finally, the [1-9]? matches any digit between 1 and 9 if found which matches 1 in our example.

Related

Catch multiple pattern with regex [duplicate]

This is an example string:
123456#p654321
Currently, I am using this match to capture 123456 and 654321 in to two different groups:
([0-9].*)#p([0-9].*)
But on occasions, the #p654321 part of the string will not be there, so I will only want to capture the first group. I tried to make the second group "optional" by appending ? to it, which works, but only as long as there is a #p at the end of the remaining string.
What would be the best way to solve this problem?
You have the #p outside of the capturing group, which makes it a required piece of the result. You are also using the dot character (.) improperly. Dot (in most reg-ex variants) will match any character. Change it to:
([0-9]*)(?:#p([0-9]*))?
The (?:) syntax is how you get a non-capturing group. We then capture just the digits that you're interested in. Finally, we make the whole thing optional.
Also, most reg-ex variants have a \d character class for digits. So you could simplify even further:
(\d*)(?:#p(\d*))?
As another person has pointed out, the * operator could potentially match zero digits. To prevent this, use the + operator instead:
(\d+)(?:#p(\d+))?
Your regex will actually match no digits, because you've used * instead of +.
This is what (I think) you want:
(\d+)(?:#p(\d+))?

regex to match coordinates

I am trying to write a regex expression in python that can match the following lines - I am just able to match the very first number by doing something like this
re.compile(r'\d.\d{14}\s+')
but could not do rest. Also tried doing [^-\d] to catch the negative sign - does not seem working.
Any help? Thanks!
First, lets start by looking at the numbers. You've already got a decent expression for finding a single number (\d.\d{14}\s+), but there are a couple things wrong with it.
In regex, . indicates any single character. This means that your expression will accept any character after the first digit.
It's not taking into account the possibility that there could be a negative sign at the beginning.
Both of these problems are really easy to fix. The first can be fixed by simply escaping the period (\.). The second can be fixed by adding the negative sign to the pattern and giving it a quantifier. In this case, the ? quantifier will be the best option because it matches between 0 and 1 times. All this means is that it won't care if the symbol is there, but if it is it will match it. After these 2 changes, the pattern looks like this: -?\d\.\d{14}\s+.
Next, we need to tell it to match more than once. This can be done very easily by putting the pattern in a group and applying a quantifier to said group. Now the question is which quantifier should be used. In your example, there are only 3 numbers before the single character at the end of the line. You can match this pattern exactly 3 times by using the {3} quantifier. If you know there will be at least 1 but don't know how many in total there will be, you can use the + quantifier. For this example I will be using the {3} quantifier just so it's more specific to your question. After adding this, the pattern will look something like this: (-?\d\.\d{14}\s+){3}
Now all that's left is to match the character at the end. You can use \S to match any single word character. You can add a quantifier to it, but again, for the purposes of your question, I won't be since there's only a single character. The final expression would look like (-?\d\.\d{14}\s+){3}\S.

How to check if the whole input string (real numbers separated by a space) matches a regex in Python?

I have an input string consisting of a sequence of real numbers separated by a single space. It is also acceptable for the string to contain only one real number (no spaces). My goal is to check whether the string structure matches the following (in this order):
optional (0/1): minus (-)
1/more digits
optional (1+): a period and 1/more digits
optional (0+): a group consisting of a space and the first group (the first three bullet points)
It should describe the string completely. If not, it should print an error message and exit.
My current regular expression is ^(-?\d+(\.?\d)*)( \1)*$ which I thought would be okay, but even the first group doesn't match all the real numbers individually. And I need it to check the string from the beginning to the end, including the spaces.
My code for this function looks like this:
import re
def structure_check(string):
structure = r"^(-?\d+(\.?\d)*)( \1)*$"
if re.match(structure,string):
return("OK")
else:
print("Input error")
exit()
It should accept strings like: 15 35 -45 8 -2.3 4564.18 56 etc., but it doesn't correspond to changes in the input (doesn't match) at all. It shouldn't match if there is too many spaces, incorrectly placed . or -, or if there are other characters than digits, periods, dashes (-) and spaces.
I could also do this with just the first group while iterating over a list created by splitting the input string by space, but I would prefer to check it according to my main goal, since I wouldn't have to split the input in the validation function and also to save some more code lines by checking the input alltogether (eg. for excess spaces, or unsupported characters, which I'd have to otherwise check separately).
Sorry if I missed any answered questions, I couldn't find any appropriate for my problem in Python. If you know about any, feel free to link them, please. And thank you, I am a beginner and started learning regex for a project just about yesterday.
You can use:
^((?:[+-]?\d+(?:[.]\d+)?)(?:[ \t]|$))*$
Demo and explantation
I added + to the optional sign. If you only want to match with no sign or -, just remove that from the optional character class.
You could also use an unrolled version to prevent matching a space at the end.
^-?\d+(?:\.\d+)?(?: -?\d+(?:\.\d+)?)*$
Regex demo
The backreference \1 will match exactly what is matched in group 1 and for your pattern will match for example 123 123 123
If you want to repeat the group, you could recurse the first group using the PyPi regex module and (?1)
^(-?\d+(?:\.\d+)?)(?: (?1))*$
See a Python example
Problem is in your regexp, to be specific, in ( \1)* part.
This, described, means: space and string that was matched in group 1 zero or more times
Thus, your regexp will match for the following, for example:
15 15 15
-5.3 -5.3 -5.3 -5.3
And so on.
To fix the regexp, I would replace the group reference with the actual group, like so:
^(-?\d+(\.?\d)*)( -?\d+(\.?\d)*)*$
I would also point out that this regexp allows the numbers to have multiple decimal dots, (e.g. 1.2.3 passes) however I'm not sure if that's intended or not.
In JavaScript you can use the method .test of regex. The regex should work in python.
let ok = /^(([+\-]?\d+(\.\d+)?)( |$))+$/.test("15 35 -45 8 -2.3 4564.18 56");
console.log(ok);
Explanation: (.\d+)? You must make the whole group optional. The number can be followed by a space or the end of a string ( |$). The pattern is repeated throughout the string so I wrapped the entire expression in a group. Insert ^ at the beginning of the regex and $ at the end of the regex to force the regex to check the string completely.

regex by using "." to take one character [duplicate]

I want a regular expression to match a string that may or may not start with plus symbol and then contain any number of digits.
Those should be matched
+35423452354554
or
3423564564
This should work
\+?\d+
Matches an optional + at the beginning of the line and digits after it
EDIT:
As of OP's request of clarification: 3423kk55 is matched because so it is the first part (3423). To match a whole string only use this instead:
^\+?\d+$
It'll look something like this:
\+?\d+
The \+ means a literal plus sign, the ? means that the preceding group (the plus sign) can appear 0 or 1 times, \d indicates a digit character, and the final + requires that the preceding group (the digit) appears one or more times.
EDIT: When using regular expressions, bear in mind that there's a difference between find and matches (in Java at least, though most regex implementations have similar methods). find will find the substring somewhere in the owning string, and matches will try to match the entire string against the pattern, failing if there are extra characters before or after. Ensure you're using the right method, and remember that you can add a ^ to force the beginning of the line and a $ to force the end of the line (making the entire thing look like ^\+?\d+$.
Simple ^\+?\d+$
Start line, then 1 or 0 plus signs, followed by at least 1 digit, then end of lnie
A Perl regular expression for it could be: \+?\d+

Some Regex stuff

I'm just starting to figure out regex and would love some help trying to understand it. I've been using this to help me get started, but am still having some trouble figuring it out.
What I am trying to do is take this text:
<td>8.54/10 over 190 reviews</td>
And pull out the "8.54", so basically anything in between the first ">" and the "/"
Using my noob skills, I came up with this: [0-9].[0-9][0-9], which WILL match that 8.54, and will work for everything BUT 10.00, which I do need to account for.
Can anyone help me refine my expression to apply to that last case as well?
Use quantifiers.
You want one or more digits, followed by a dot, followed by one or more digits. A digit can also be written \d, and the "one or more" quantifier is +.
The dot needs to be escaped as it is a regex metacharacter which means "any character". Your regex therefore should be:
\d+\.\d+
Now, beware that a quantifier applies to atoms only. Character classes ([...]), complemented character classes ([^...]) and special character classes (\d, \w...) are atoms, however if you want to apply a quantifier to more than a simple atom, you'll need to group these atoms using the grouping operator, (). Ie, (ab)+ will look for one or more of ab.
Maybe answered my own question. Found this:
[0-9]+(?:.[0-9]*)
It seems to work, does anyone have any changes to this?
\d is often used instead of [0-9] (mnemonically, “digit”) and it's necessary to remember that sometimes fractional numbers are written without any digits before the decimal point. Thus:
(?<=>)(?:\d+(?:\.\d*)?|\.\d+)(?=/)
OK, that's a bit of a complex RE. Here's how it breaks down (in extended form).
(?<= > ) # With a “>” before (but not matched)…
(?: # … match either this
\d+ # at least one digit, followed by…
(?: # …match
\. \d* # a dot followed by any number of digits
) ? # optionally
| # … or this
\. \d+ # a dot followed by at least one digit
) #
(?= / ) # … and with a “/” afterwards (but not matched)
This might work:
\>(.*?)/
# (.*?) is a "non-greedy" group which maches as few characters as possible
Then access the actual value using
m.group(1)
where m is the match object returned by re.search or re.finditer
If you want to access the value directly (re.findall), use
(?>=\>)(.*?)(?=/)

Categories