I'd like to match the patterns digits.digits, digits.[digits], and [digits].digits with regex in Python.
Source for this: the Postgres docs state that a numeric constant can take any of these forms:
digits
digits.[digits][e[+-]digits]
[digits].digits[e[+-]digits]
digitse[+-]digits
Where brackets indicate optionality and digits is one or more digits, 0-9.
I'd like to match a small subset of this syntax,
digits.[digits]
[digits].digits
In other words, at least one digit must be before or after the decimal point. (Or, before and after.)
From the string numbers = '.42 5.42 5. .', the call to re.findall(regex, numbers) should return ['.42', '5.42', '5.'].
What I have tried is an if-then conditional, (?(id/name)yes-pattern|no-pattern):
regex = r'(\d+)?(?(1)\.\d*|\.\d+)'
The issue is that this mandates a capturing group, which (1) references, and re.findall(r'(\d+)?(?(1)\.\d*|\.\d+)', numbers) gives ['', '5', '5'] because it's grabbing the capture group.
Please ignore word boundaries, leading zeros, exponential notation, etc for now. A naive regex would be:
regex = r'\d+\.\d*|\d*\.\d+'
But as the complexity of the syntax grows, I'd prefer not to just |-together separate regexes.
How can I structure this to have re.findall(regex, numbers) return the list above?
While you may use your regex with re.finditer to get the first group with each whole match value ([x.group(0) for x in re.finditer(regex, numbers)]), you may also get the values you need with
re.findall(r'(?=\.?\d)\d*\.\d*', s)
See the regex demo
Details
(?=\.?\d) - a positive lookahead that requires an optional . followed with a digit immediately to the right of the current location
\d* - 0+ digits
\. - a dot
\d* - 0+ digits
So, even though \d* in the consuming pattern can match 0 digits, the lookahead requires at least one there.
Python demo:
import re
s=".42 5.42 5. ."
print(re.findall(r'(?=\.?\d)\d*\.\d*', s))
# => ['.42', '5.42', '5.']
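The re.finditer route mentioned at the top of this answer can be sketched as follows. It keeps the conditional pattern from the question unchanged and reads the whole match through group(0), so the capturing group that the conditional needs no longer affects the result. The variable names are the ones used in the question.

```python
import re

numbers = '.42 5.42 5. .'
# Conditional pattern from the question: group 1 is the optional integer part.
regex = r'(\d+)?(?(1)\.\d*|\.\d+)'

# group(0) is the whole match, regardless of what group 1 captured.
matches = [m.group(0) for m in re.finditer(regex, numbers)]
print(matches)
# => ['.42', '5.42', '5.']
```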
Related
I don't understand why it only gives 125, the first number only, why it does not give all positive numbers in that string? My goal is to extract all positive numbers.
import re
pattern = re.compile(r"^[+]?\d+")
text = "125 -898 8969 4788 -2 158 -947 599"
matches = pattern.finditer(text)
for match in matches:
    print(match)
Try using the regular expression
-\d+|(\d+)
Disregard the matches. The strings representing non-negative integers are saved in capture group 1.
Demo
The idea is to match but not save to a capture group what you don't want (negative numbers), and both match and save to a capture group what you do want (non-negative numbers).
The regex attempts to match -\d+. If that succeeds the regex engine's internal string pointer is moved to just after the last digit matched. If -\d+ is not matched an attempt is made to match the second part of the alternation (following |). If \d+ is matched the match is saved to capture group 1.
Any plus signs in the string can be disregarded.
For a fuller description of this technique see The Greatest Regex Trick Ever. (Search for "Tarzan"|(Tarzan) to get to the punch line.)
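In Python, one way to apply this is a short sketch with re.findall: because the pattern has exactly one capture group, findall returns the content of that group for each match, which is an empty string whenever the negative-number branch matched. Filtering out the empty strings leaves the non-negative numbers. The variable names groups and non_negative are only illustrative.

```python
import re

text = "125 -898 8969 4788 -2 158 -947 599"

# findall returns group 1 for every match; matches of -\d+ leave it empty.
groups = re.findall(r'-\d+|(\d+)', text)
non_negative = [g for g in groups if g]
print(non_negative)
# => ['125', '8969', '4788', '158', '599']
```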
The following pattern will only match non negative numbers:
pattern = re.compile(r"(?:^|[^\-\d])(\d+)")
pattern.findall(text)
OUTPUT
['125', '8969', '4788', '158', '599']
For the sake of completeness another idea by use of \b and a lookbehind.
\b(?<!-)\d+
See this demo at regex101
Your pattern ^[+]?\d+ is anchored at the start of the string, and will give only that match at the beginning.
Another option is to assert a whitespace boundary to the left, and match the optional + followed by 1 or more digits.
(?<!\S)\+?\d+\b
(?<!\S) Assert a whitespace boundary to the left
\+? Match an optional +
\d+\b Match 1 or more digits followed by a word boundary
Regex demo
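A quick sketch of the whitespace-boundary pattern in Python. The sample string here is hypothetical (it adds an explicit +42, which the question's text does not contain) to show that the optional plus sign is kept while the negative number is skipped.

```python
import re

text = "125 -898 +42 8969"  # hypothetical sample with an explicit plus sign

# (?<!\S) rejects any start position that is preceded by a non-whitespace char,
# so the digits of -898 cannot start a match.
found = re.findall(r'(?<!\S)\+?\d+\b', text)
print(found)
# => ['125', '+42', '8969']
```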
Use , to separate the numbers in the string.
I have a filename having numerals like test_20200331_2020041612345678.csv.
So I want to read only the first 8 characters of the number between the last underscore and .csv using a regex.
For example, from the file name test_20200331_2020041612345678.csv I want to read only 20200416 using regex.
Regex tried: (?<=_)(\d+)(?=\.)
But it returns the full number between the underscore and the period, i.e. 2020041612345678
Also, when I tried a quantifier like (?<=_)(\d{8})(?=\.), it does not match anything
The (?<=_)(\d{8})(?=\.) does not work because the (?=\.) positive lookahead requires the presence of a . char immediately to the right of the current location, i.e. right after the eighth digit, but there are more digits in between.
You may add \d* before \. to match any amount of digits after the required 8 digits, use
(?<=_)\d{8}(?=\d*\.)
Or, with a capturing group, you do not even need lookarounds (just make sure you access Group 1 when a match is obtained):
_(\d{8})\d*\.
See the regex demo
Python demo:
import re
s = "test_20200331_2020041612345678.csv"
m = re.search(r"(?<=_)\d{8}(?=\d*\.)", s)
# m = re.search(r"_(\d{8})\d*\.", s) # capturing group approach
if m:
    print(m.group()) # => 20200416
    # print(m.group(1)) # capturing group approach
I am trying to use regular expressions to identify 4 to 5 digit numbers. The code below is working effectively in all cases unless there are consecutive 0's preceding a one, two or 3 digit number. I don't want '0054','0008',or '0009' to be a match, but i would want '10354' or '10032', or '9005', or '9000' to all be matches. Is there a good way to implement this using regular expressions? Here is my current code that works for most cases except when there are preceding 0's to a series of digits less than 4 or 5 characters in length.
import re
line = 'US Machine Operations | 0054'
match = re.search(r'\d{4,5}', line)
if match is None:
    print(0)
else:
    print(int(match[0]))
You may use
(?<!\d)[1-9]\d{3,4}(?!\d)
See the regex demo.
NOTE: In Pandas str.extract, you must wrap the part you want to be returned with a capturing group, a pair of unescaped parentheses. So, you need to use
(?<!\d)([1-9]\d{3,4})(?!\d)
       ^            ^
Example:
df2['num_col'] = df2.Warehouse.str.extract(r'(?<!\d)([1-9]\d{3,4})(?!\d)', expand = False).astype(float)
Since you can simply use a capturing group, you may also use an equivalent regex without the lookbehind:
(?:^|\D)([1-9]\d{3,4})(?!\d)
Details
(?<!\d) - no digit immediately to the left
or (?:^|\D) - start of string or non-digit char (a non-capturing group is used so that only 1 capturing group could be accommodated in the pattern and let str.extract only extract what needs extracting)
[1-9] - a non-zero digit
\d{3,4} - three or four digits
(?!\d) - no digit immediately to the right is allowed
Python demo:
import re
s = "US Machine Operations | 0054 '0054','0008',or '0009' to be a match, but i would want '10354' or '10032', or '9005', or '9000'"
print(re.findall(r'(?<!\d)[1-9]\d{3,4}(?!\d)', s))
# => ['10354', '10032', '9005', '9000']
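To connect this back to the code in the question, here is a minimal sketch that keeps the original "print 0 when nothing matches" behaviour. The helper name first_code and the second and third sample lines are made up for illustration.

```python
import re

# 4 or 5 digits, first digit non-zero, not embedded in a longer digit run.
PATTERN = re.compile(r'(?<!\d)[1-9]\d{3,4}(?!\d)')

def first_code(line):
    """Return the first 4-5 digit number in line as an int, or 0 if none."""
    m = PATTERN.search(line)
    return int(m.group()) if m else 0

print(first_code('US Machine Operations | 0054'))   # => 0
print(first_code('US Machine Operations | 10354'))  # => 10354
print(first_code('US Machine Operations | 123456')) # => 0 (six digits)
```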
I have a pattern which looks like:
abc*_def(##)
and i want to look if this matches for some strings.
E.g. it matches for:
abc1_def23
abc10_def99
but does not match for:
abc9_def9
So the * stands for a number which can have one or more digits.
The # stands for a number with one digit
I want the value in the parenthesis as result
What would be the easiest and simplest solution for this problem?
Replace the * and # through regex expression and then look if they match?
Like this:
pattern = pattern.replace('*', '[0-9]*')
pattern = pattern.replace('#', '[0-9]')
pattern = '^' + pattern + '$'
Or program it myself?
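A minimal sketch of the replace-based idea, under two assumptions: * means one or more digits as described above (hence [0-9]+ rather than [0-9]*), and the pattern contains no regex metacharacters other than the parentheses, which then act as the capturing group. The helper name wildcard_to_regex is hypothetical.

```python
import re

def wildcard_to_regex(pattern):
    """Translate the custom syntax: '*' = one or more digits, '#' = one digit."""
    regex = pattern.replace('*', '[0-9]+').replace('#', '[0-9]')
    return re.compile(regex)

rx = wildcard_to_regex('abc*_def(##)')

results = {}
for s in ['abc1_def23', 'abc10_def99', 'abc9_def9']:
    m = rx.fullmatch(s)  # fullmatch plays the role of '^' + pattern + '$'
    results[s] = m.group(1) if m else None
print(results)
# => {'abc1_def23': '23', 'abc10_def99': '99', 'abc9_def9': None}
```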
Based on your requirements, I would go for a regex for the simple reason it's already available and tested, so it's easiest as you were asking.
The only "complicated" thing in your requirements is making sure the digits after def do not repeat the number you have after abc.
This can be done with a backreference inside a negative lookahead. The regex you can use is:
\babc(\d+)_def((?!\1)\d{1,2})\b
\b matches a word boundary; if you enclose your regex between two \b you will restrict your search to whole words, i.e. text delimited by spaces, punctuation, etc.
abc matches the string abc
\d+ matches one or more digits; if there is an upper limit to the number of digits you want, it has to be \d{1,MAX} where MAX is your maximum number of digits; anyway \d stands for a digit and + indicates 1 or more repetitions
(\d+) is a group: the use of parentheses defines \d+ as something you want to "remember" inside your regex; it's somewhat similar to defining a variable; in this case, (\d+) is your first group since you defined no other groups before it (i.e. to its left)
_def matches the string _def
(?!\1) is the part where you say "I don't want to repeat the first group after _def". \1 represents the first group, while (?!whatever) is a check that succeeds if what follows the current position is NOT (the negation is given by !) whatever you want to negate.
Live demo here.
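A short Python sketch of this regex against the strings from the question. The fourth string, abc1_def10, is an extra hypothetical case added to show a side effect: the lookahead only checks whether the text after _def starts with group 1, so it is rejected as well.

```python
import re

rx = re.compile(r'\babc(\d+)_def((?!\1)\d{1,2})\b')

results = {}
for s in ['abc1_def23', 'abc10_def99', 'abc9_def9', 'abc1_def10']:
    m = rx.search(s)
    # Group 2 holds the digits after _def when the string is accepted.
    results[s] = m.group(2) if m else None
print(results)
# => {'abc1_def23': '23', 'abc10_def99': '99', 'abc9_def9': None, 'abc1_def10': None}
```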
I had the hardest time getting this to work. The trick was the $
import re
yourlist = ['abc1_def23', 'abc10_def99', 'abc9_def9', 'abc955_def9', 'abc_def9', 'abc9_def9288', 'abc49_def9234']
for item in yourlist:
    # $ requires the two digits to be the last characters of the string
    if re.search(r'abc[0-9]+_def[0-9][0-9]$', item):
        print(item, 'is a match')
You could match your pattern like:
abc\d+_def(\d{2})
abc Match literally
\d+ Match 1 or more digits
_ Match underscore
def Match literally
( Capturing group (Your 2 digits will be in this group)
\d{2} Match 2 digits
) Close capturing group
Then you could for example use search to check for a match and use .group(1) to get the digits between the parentheses.
Demo Python
You could also add word boundaries:
\babc\d+_def(\d{2})\b
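As a rough illustration of the word-boundary version, the sketch below pulls the two digits out of a longer piece of text with re.findall. The sample sentence is made up; only the three identifiers in it come from the question.

```python
import re

text = "ids: abc1_def23, abc10_def99 and abc9_def9"  # hypothetical sample text

# With a single capture group, findall returns just the two captured digits.
digits = re.findall(r'\babc\d+_def(\d{2})\b', text)
print(digits)
# => ['23', '99']
```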
I wrote a little python script to parse all rows of a large data document.
I collected some type of rows:
LLNNNLL [Mixed Data and Numbers] 1.650,00
NNNNNN-LNN [Mixed Data and Numbers] 49,00
LLNNNL [Mixed Data and Numbers] 208,00
LLNNNLLL [Mixed Data and Numbers] 3,00
This is my regex pattern: pattern = "^([A-Z\-0-9]){4,10}.*\d+,\d{2}"
Is there a more accurate way to do that?
Eg.: how can I specify that each row must have at least numbers AND letter?
how can I specify that each row must have at least numbers AND letter?
That can be done with the help of positive lookaheads.
pattern = r"^(?=[^A-Z]*[A-Z])(?=\D*\d)[A-Z0-9-]{4,10}.*\d+,\d{2}"
The (?=[^A-Z]*[A-Z]) will be triggered at the start of the string and will require at least one A-Z letter in the string. The (?=\D*\d) will also be triggered (after the preceding lookahead returns true) and will require at least one digit. If there is no digit in the string, the match will fail (no match will be found).
Also, if the number must be at the end of the "row" add a $ anchor (end of string).
Besides, note that .* will "eat up" the digits (supposed to be matched with \d+,\d{2}) up to the one before a comma since the .* pattern is greedy. It makes no difference here unless you want to capture the float number. Then, use lazy matching .*?.
In case the pattern should be case insensitive, use a case insensitive flag re.I when compiling the pattern, or add (?i) inline modifier to the pattern start.
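The greedy versus lazy point can be seen in a small sketch. The row below is a hypothetical concrete instance of the LLNNNL style from the question, and the capturing group around the number is added here only to show what each variant captures.

```python
import re

row = 'LL123L data 208,00'  # hypothetical row in the LLNNNL style
prefix = r'^(?=[^A-Z]*[A-Z])(?=\D*\d)[A-Z0-9-]{4,10}'

# Greedy .* leaves only the minimum needed for \d+,\d{2} to succeed.
greedy = re.search(prefix + r'.*(\d+,\d{2})', row).group(1)
# Lazy .*? stops at the first place where the number can start.
lazy = re.search(prefix + r'.*?(\d+,\d{2})', row).group(1)
print(greedy)  # => 8,00
print(lazy)    # => 208,00
```

Note that a value with a thousands separator such as 1.650,00 would still be captured only from the last dot onward, so the number part of the pattern would need extending for that case.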
UPDATE
If you want to limit the condition to the first non-whitespace chunk, you can use
^(?=[0-9-]*[A-Z])(?=[A-Z-]*\d)[A-Z0-9-]{4,10}.*\d+,\d{2}
    ^^^^^^^         ^^^^^^^
where we check if there is a letter after optional 0+ digits/hyphen and a digit after 0+ letters or hyphen (see demo) or
^(?=\S*[A-Z])(?=\S*\d)[A-Z0-9-]{4,10}.*\d+,\d{2}
where we check for letters and digits after 0+ non-whitespace characters (\S*). See another demo
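A small sketch of the \S* variant in Python. The three rows are invented examples (the rows in the question are schematic), chosen so that only the first one has both a letter and a digit in its first non-whitespace chunk.

```python
import re

# Both lookaheads use \S*, so they only inspect the first whitespace-delimited chunk.
rx = re.compile(r'^(?=\S*[A-Z])(?=\S*\d)[A-Z0-9-]{4,10}.*\d+,\d{2}')

rows = [
    'AB123CD widget 42 1.650,00',  # letters and digits in the first chunk
    '123456-789 widget 49,00',     # no letter in the first chunk
    'ABCDEF widget 3,00',          # no digit in the first chunk
]
accepted = [row for row in rows if rx.match(row)]
print(accepted)
# => ['AB123CD widget 42 1.650,00']
```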