I want a regular expression to match a string that may or may not start with plus symbol and then contain any number of digits.
Those should be matched
+35423452354554
or
3423564564
This should work
\+?\d+
Matches an optional + at the beginning of the line and digits after it
EDIT:
As of OP's request of clarification: 3423kk55 is matched because so it is the first part (3423). To match a whole string only use this instead:
^\+?\d+$
It'll look something like this:
\+?\d+
The \+ means a literal plus sign, the ? means that the preceding group (the plus sign) can appear 0 or 1 times, \d indicates a digit character, and the final + requires that the preceding group (the digit) appears one or more times.
EDIT: When using regular expressions, bear in mind that there's a difference between find and matches (in Java at least, though most regex implementations have similar methods). find will find the substring somewhere in the owning string, and matches will try to match the entire string against the pattern, failing if there are extra characters before or after. Ensure you're using the right method, and remember that you can add a ^ to force the beginning of the line and a $ to force the end of the line (making the entire thing look like ^\+?\d+$.
Simple ^\+?\d+$
Start line, then 1 or 0 plus signs, followed by at least 1 digit, then end of lnie
A Perl regular expression for it could be: \+?\d+
Related
I am trying to write a regex expression in python that can match the following lines - I am just able to match the very first number by doing something like this
re.compile(r'\d.\d{14}\s+')
but could not do rest. Also tried doing [^-\d] to catch the negative sign - does not seem working.
Any help? Thanks!
First, lets start by looking at the numbers. You've already got a decent expression for finding a single number (\d.\d{14}\s+), but there are a couple things wrong with it.
In regex, . indicates any single character. This means that your expression will accept any character after the first digit.
It's not taking into account the possibility that there could be a negative sign at the beginning.
Both of these problems are really easy to fix. The first can be fixed by simply escaping the period (\.). The second can be fixed by adding the negative sign to the pattern and giving it a quantifier. In this case, the ? quantifier will be the best option because it matches between 0 and 1 times. All this means is that it won't care if the symbol is there, but if it is it will match it. After these 2 changes, the pattern looks like this: -?\d\.\d{14}\s+.
Next, we need to tell it to match more than once. This can be done very easily by putting the pattern in a group and applying a quantifier to said group. Now the question is which quantifier should be used. In your example, there are only 3 numbers before the single character at the end of the line. You can match this pattern exactly 3 times by using the {3} quantifier. If you know there will be at least 1 but don't know how many in total there will be, you can use the + quantifier. For this example I will be using the {3} quantifier just so it's more specific to your question. After adding this, the pattern will look something like this: (-?\d\.\d{14}\s+){3}
Now all that's left is to match the character at the end. You can use \S to match any single word character. You can add a quantifier to it, but again, for the purposes of your question, I won't be since there's only a single character. The final expression would look like (-?\d\.\d{14}\s+){3}\S.
The question
Can someone please explain the process of the following re.sub() to me.
I am thinking the process is as following:
look for a "." then look for a digit then look for another digit that is between 1 and 9. Now I am lost. What is the question mark for? What does the \d* do? Why do we need to use raw string regex in this case?
If you want to understand the process, I can simply explain it to you. I don't know if this regular expression is doing what you want or not..
At first, the . is a special character in regex which means any character. But, we here want to use the dot character. In regex, this can be done by using escaping character \ like so \.. So, using . means any character and using \. means a dot.
The \d represents any digit and acts exactly like [0-9]
When you used [1-9], by then you specified to get the numbers from 1 till 9 which means that zero is excluded.
We can use the asterisk * to choose zero or more characters. Unlike + which is used to choose one or more characters. So, using \d* means any consecutive digits from [0-9] or none.
The ? is used to indicate using just one character or none. So, using [1-9]? means try to find just one digit between 1 and 9 IF FOUND.
The Parenthesis () is used for grouping the whole regular expression in one output.
If you want to know more about regular expression, here is an awesome cheat sheet.
NOTE:
I think the regex you have written in the question is not correct. I think it should be as follows (\d*\.\d\d[1-9]?) to obtain the same result. I will try to explain this regular expression using this number 3.141500012. \d*\. means find any number of digits that could be found before the dot which would match the 3.. then after that \d\d matches two digits after the dot which are 14. Finally, the [1-9]? matches any digit between 1 and 9 if found which matches 1 in our example.
Please note: I'm programming in Python (version 3.6), but would also like to port these regexes to SAS as well.
The large picture here is that I'm working with a SAS log, and I want to exclude lines printed to the log that are from %include statements. Basically, what I'm trying to accomplish would look like this:
54210 proc sort data=inds out=outds;
And the lines that I DON'T want will look like this:
33406 +%global var1 var2 var3;
The key is that the 11th character will be a '+', but there will always be a group of numbers to the left followed by a group of spaces, whose length will ultimately be 11 spaces - unless it's an %include line, which I want to exclude.
What I have so far is this:
^[0-9]{1,11} (?! {2,10}\+)
This has worked to grab what I want exactly from the logs that I've tested, but it's far from right. The easy way out would be using this expression:
^[0-9]{1,11} {3,10}
And then add an extra condition that will ignore the line if the 11th character is a '+', but can I do this in a single regex? I came across lookaheads/lookbehinds working on this, but the problem is that the first matched group can vary in length, which moves around where the '+' would be expected - so is there a way that I can match a group within a set length, and then negate the match if it's followed by a character?
You may use
^\d+ +(?<=.{11})
See the regex demo
Details
^ - start of string
\d+ + - 1+ digits and then 1+ spaces
(?<=.{11}) - a positive lookbehind check that requires exactly 11 chars immediately to the left of the current location.
You can use ^[0-9\s]{,11}\+ for discarding unwanted logs. It matches up to 11 digits and/or spaces followed by a + (which seems to be the pattern for unwanted items). In case you want to negate the match you can simply do not re.match(...).
Using a lookahead you can refuse strings that contain a + within their first 11 characters and then match the desired pattern: ^(?=[^+]{11})[0-9]{1,11} {3,10}.
(?= # Look ahead and assert equal that ...
[^+] # ... anything but a plus ...
{11} # ... matches the following 11 characters.
)
Rather than regex filtering, have you considered setting appropriate logging options in your SAS code so that lines from %include statements aren't logged in the first place? i.e. set option nosource2; at the start of your program.
Documentation:
http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a000279225.htm
I bumped into the problem while playing around in Python: when I create a random string, let's say "test 1981", the following Python call returns with an empty string.
>>> re.search('\d?', "test 1981").group()
''
I was wondering why this is. I was reading through some other posts, and it seems that it has to do with greedy vs. non-greedy operators. Is it that the '?' checks to see if the first value is a digit, and if it's not, it takes the easier, quicker path and just outputs nothing?
Any clarification would help. Thanks!
Your pattern matches a digit or the empty string. It starts at the first character and tries to match a digit, what it is doing next is trying to match the alternative, means the empty string, voilĂ a match is found before the first character.
I think you expected it to move on and try to match on the next character, but that is not done, first it tries to match what the quantifier allows on the first position. And that is 0 or one digit.
The use of the optional quantifier makes only sense in combination with a required part, say you want a digit followed by an optional one:
>>> re.search('\d\d?', "test 1981").group()
'19'
Otherwise your pattern is always true.
Regex
\d?
simply means that it should optionally (?) match single digit (\d).
If you use something like this, it will work as you expect (match single digit anywhere in the string):
\d
re.search('\d?', "test 1981").group() greedily matches the first match of the pattern (0 or 1 digits) it can find. In this case that's zero digits. Note that re.search('\d?', "1981 test").group() actually matches the string '1' at the beginning of the string. What you're probably looking for here is re.search('\d+', "test 1981").group(), which finds the whole string 1981 no matter where it is.
I would like to intercept string starting with \*#\*
followed by a number between 0 and 7
and ending with: ##
so something like \*#\*0##
but I could not find a regex for this
Assuming you want to allow only one # before and two after, I'd do it like this:
r'^(\#{1}([0-7])\#{2})'
It's important to note that Alex's regex will also match things like
###7######
########1###
which may or may not matter.
My regex above matches a string starting with #[0-7]## and ignores the end of the string. You could tack a $ onto the end if you wanted it to match only if that's the entire line.
The first backreference gives you the entire #<number>## string and the second backreference gives you the number inside the #.
None of the above examples are taking into account the *#*
^\*#\*[0-7]##$
Pass : *#*7##
Fail : *#*22324324##
Fail : *#3232#
The ^ character will match the start of the string, \* will match a single asterisk, the # characters do not need to be escape in this example, and finally the [0-7] will only match a single character between 0 and 7.
r'\#[0-7]\#\#'
The regular expression should be like ^#[0-7]##$
As I understand the question, the simplest regular expression you need is:
rex= re.compile(r'^\*#\*([0-7])##$')
The {1} constructs are redundant.
After doing rex.match (or rex.search, but it's not necessary here), .group(1) of the match object contains the digit given.
EDIT: The whole matched string is always available as match.group(0). If all you need is the complete string, drop any parentheses in the regular expression:
rex= re.compile(r'^\*#\*[0-7]##$')