This will be really quick marks for someone...
Here's my string:
Jan 13.BIGGS.04222 ABC DMP 15
I'm looking to match:
the date at the front (mmm yy) format
the name in the second field
the digits at the end. There could be between one and three.
Here is what I have so far:
(\w{3} \d{2})\.(\w*)\..*(\d{1,3})$
Through a lot of playing around with http://www.pythonregex.com/ I can get to matching the '5', but not '15'.
What am I doing wrong?
Use .*? to match .* non-greedily:
In [9]: re.search(r'(\w{3} \d{2})\.(\w*)\..*?(\d{1,3})$', text).groups()
Out[9]: ('Jan 13', 'BIGGS', '15')
Without the question mark, .* matches as many characters as possible, including the digit you want to match with \d{1,3}.
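To see the difference side by side, here is a minimal sketch using the sample string from the question (the variable names greedy and lazy are only for illustration):

```python
import re

text = "Jan 13.BIGGS.04222 ABC DMP 15"

# Greedy: .* grabs as much as it can and gives back only a single digit.
greedy = re.search(r"(\w{3} \d{2})\.(\w*)\..*(\d{1,3})$", text)
# Non-greedy: .*? grows one character at a time until the rest matches.
lazy = re.search(r"(\w{3} \d{2})\.(\w*)\..*?(\d{1,3})$", text)

print(greedy.groups())  # ('Jan 13', 'BIGGS', '5')
print(lazy.groups())    # ('Jan 13', 'BIGGS', '15')
```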
As an alternative to what @unutbu proposed, you can also use a word boundary, \b, which matches the position between a word character and a non-word character:
(\w{3} \d{2})\.(\w*)\..*\b(\d{1,3})$
From the site you referred:
>>> regex = re.compile(r"(\w{3} \d{2})\.(\w*)\..*\b(\d{1,3})$")
>>> regex.findall('Jan 13.BIGGS.04222 ABC DMP 15')
[(u'Jan 13', u'BIGGS', u'15')]
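.* before the digits is greedy: it matches as much as it can, leaving as few digits as possible for the last group. You either need to make it non-greedy (with ?, as unutbu said) or stop it from swallowing the digits you want, for example by requiring a non-digit right before them with .*\D(\d{1,3})$. Simply replacing . with \D everywhere would not work for this string, because the third field (04222 ABC DMP) itself contains digits.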
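One way to read the \D suggestion is to keep the greedy .* but require a single non-digit immediately before the final group. This is only a sketch of that reading, and it assumes there is always at least one non-digit between the second dot and the trailing digits:

```python
import re

text = "Jan 13.BIGGS.04222 ABC DMP 15"

# \D forces the character just before the last group to be a non-digit,
# so the greedy .* cannot swallow the leading '1' of '15'.
pattern = re.compile(r"(\w{3} \d{2})\.(\w*)\..*\D(\d{1,3})$")
fields = pattern.search(text).groups()
print(fields)  # ('Jan 13', 'BIGGS', '15')

# Caveat: a third field made of digits only has no non-digit to anchor on.
print(pattern.search("Jan 13.BIGGS.15"))  # None
```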
I don't understand why it only gives 125, the first number. Why does it not give all the positive numbers in that string? My goal is to extract all positive numbers.
import re
pattern = re.compile(r"^[+]?\d+")
text = "125 -898 8969 4788 -2 158 -947 599"
matches = pattern.finditer(text)
for match in matches:
    print(match)
Try using the regular expression
-\d+|(\d+)
Disregard the matches. The strings representing non-negative integers are saved in capture group 1.
The idea is to match but not save to a capture group what you don't want (negative numbers), and both match and save to a capture group what you do want (non-negative numbers).
The regex attempts to match -\d+. If that succeeds the regex engine's internal string pointer is moved to just after the last digit matched. If -\d+ is not matched an attempt is made to match the second part of the alternation (following |). If \d+ is matched the match is saved to capture group 1.
Any plus signs in the string can be disregarded.
For a fuller description of this technique see The Greatest Regex Trick Ever. (Search for "Tarzan"|(Tarzan) to get to the punch line.)
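In Python, the technique could be sketched like this, using the text from the question. Note that re.findall returns an empty string for each match where group 1 did not take part, so those need to be filtered out (the variable names are illustrative):

```python
import re

text = "125 -898 8969 4788 -2 158 -947 599"
pattern = re.compile(r"-\d+|(\d+)")

# findall reports group 1 for every match; negative numbers show up as ''.
raw = pattern.findall(text)
print(raw)  # ['125', '', '8969', '4788', '', '158', '', '599']

# Keep only the matches where the capture group actually participated.
positives = [m.group(1) for m in pattern.finditer(text) if m.group(1) is not None]
print(positives)  # ['125', '8969', '4788', '158', '599']
```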
The following pattern will only match non negative numbers:
pattern = re.compile(r"(?:^|[^\-\d])(\d+)")
pattern.findall(text)
OUTPUT
['125', '8969', '4788', '158', '599']
For the sake of completeness, another idea uses \b and a negative lookbehind:
\b(?<!-)\d+
See this demo at regex101
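A quick way to try that pattern from Python, as a sketch against the question's sample text:

```python
import re

text = "125 -898 8969 4788 -2 158 -947 599"

# \b only allows a match to start at the beginning of a run of digits;
# (?<!-) then rejects runs that are directly preceded by a minus sign.
positives = re.findall(r"\b(?<!-)\d+", text)
print(positives)  # ['125', '8969', '4788', '158', '599']
```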
Your pattern ^[+]?\d+ is anchored at the start of the string, and will give only that match at the beginning.
Another option is to assert a whitespace boundary to the left, and match the optional + followed by 1 or more digits.
(?<!\S)\+?\d+\b
(?<!\S) Assert a whitespace boundary to the left
\+? Match an optional +
\d+\b Match 1 or more digits followed by a word boundary
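The breakdown above can be checked with a short sketch. The second input string is made up for illustration, to show the optional + and the rejection of digits glued to other characters:

```python
import re

pattern = re.compile(r"(?<!\S)\+?\d+\b")

text = "125 -898 8969 4788 -2 158 -947 599"
positives = pattern.findall(text)
print(positives)  # ['125', '8969', '4788', '158', '599']

# Hypothetical extra input: a signed number, a negative, and 'x9'.
extra = pattern.findall("+42 -7 x9 10")
print(extra)  # ['+42', '10']
```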
Use , to separate the numbers in the string.
I'm trying to write a regular expression to find a specific substring within a string.
I'm looking for dates in the following format:
"January 1, 2018"
I have already done some research but have not been able to figure out how to make a regular expression for my specific case.
The current version of my regular expression is
re.search("[\w*?\s\d\d\s\d\d\d\d]", my_string)
I'm fairly inexperienced with regular expression but from reading the documentation this is what I could come up with as far as matching the date format I'm working with.
Here is my thought process behind my regular expression:
\w should match any Unicode word character and * should repeat the previous match, so together these should match something like "January". ? makes * non-greedy, so it won't try to match anything of the form January 20; that is, it should stop at the first whitespace character.
\s should match white space.
\d\d and \d\d\d\d should match a two digit and four digit number respectively.
Here's a testable sample of my code:
import re
my_string = "January 01, 1990\n By SomeAuthor"
print(re.search("[\w*?\s\d\d\s\d\d\d\d]", my_string))
EDIT:
I have also tried :[A-Za-z]\s\d{1,2}\s\d{2, 4}
Your pattern may be a bit greedy in certain areas like in the month name. Also, you're missing the optional comma. Finally, you can use the ignore case flag to simplify your pattern. Here is an example using re in verbose mode.
import re
text = "New years day was on January 1, 2018, and boy was it a good time!"
pattern = re.compile(r"""
[a-z]+ # one or more ASCII letters (ignore case is used)
\s # one whitespace character
\d\d? # one or two digits
,? # an optional comma
\s # one whitespace character
\d{4} # four digits (year)
""", re.IGNORECASE | re.VERBOSE)
result = pattern.search(text).group()
print(result)
output
January 1, 2018
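One detail about the pattern in the question that is easy to miss: square brackets define a character class, so everything inside [...] describes a single character rather than a sequence. A small sketch, reusing the question's sample string, of what that attempt actually matches:

```python
import re

my_string = "January 01, 1990\n By SomeAuthor"

# The whole pattern is one character class, so it matches exactly one
# character: the first one that is a word char, '*', '?', or whitespace.
attempt = re.search(r"[\w*?\s\d\d\s\d\d\d\d]", my_string)
print(attempt.group())  # J
print(attempt.span())   # (0, 1)
```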
Try
In [992]: my_string = "January 01, 1990\n By SomeAuthor"
...: print(re.search(r"[A-Z][a-z]+\s+\d{1,2},\s+\d{4}", my_string))
...:
<_sre.SRE_Match object; span=(0, 16), match='January 01, 1990'>
[A-Z] is any uppercase letter
[a-z]+ is 1 or more lowercase letters
\s+ is 1 or more space characters
\d{1,2} is at least 1 and at most 2 digits
here:
re.search(r"\w+\s+\d\d?\s*,\s*\d{4}", date_string)
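As a runnable sketch of that call, assuming date_string holds the sample text from the question (the answer itself does not define it):

```python
import re

# Assumption: date_string is the question's sample input.
date_string = "January 01, 1990\n By SomeAuthor"

match = re.search(r"\w+\s+\d\d?\s*,\s*\d{4}", date_string)
found = match.group() if match else None
print(found)  # January 01, 1990
```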
import re
my_string = "January 01, 1990\n By SomeAuthor"
regex = re.compile(r'\w+\s+\d+, \d{4}')
result = regex.search(my_string)
result will contain the matched text and the character span.
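For instance, the matched text and the span can be read off the match object like this (a small sketch; the None check is added here because search returns None when nothing matches):

```python
import re

my_string = "January 01, 1990\n By SomeAuthor"
regex = re.compile(r"\w+\s+\d+, \d{4}")
result = regex.search(my_string)

if result is not None:
    print(result.group())  # January 01, 1990
    print(result.span())   # (0, 16)
```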
I'd like to match the patterns digits.digits, digits.[digits], and [digits].digits with regex in Python.
Source for this: the Postgres docs state that a numeric constant can take any of these forms:
digits
digits.[digits][e[+-]digits]
[digits].digits[e[+-]digits]
digitse[+-]digits
Where brackets indicate optionality and digits is one or more digits, 0-9.
I'd like to match a small subset of this syntax,
digits.[digits]
[digits].digits
In other words, at least one digit must be before or after the decimal point. (Or, before and after.)
From the string numbers = '.42 5.42 5. .', the call to re.findall(regex, numbers) should return ['.42', '5.42', '5.'].
What I have tried is an if-then conditional, (?(id/name)yes-pattern|no-pattern):
regex = r'(\d+)?(?(1)\.\d*|\.\d+)'
The issue is that this mandates a capturing group, which (1) references, and re.findall(r'(\d+)?(?(1)\.\d*|\.\d+)', numbers) gives ['', '5', '5'] because it's grabbing the capture group.
Please ignore word boundaries, leading zeros, exponential notation, etc for now. A naive regex would be:
regex = r'\d+\.\d*|\d*\.\d+'
But as the complexity of the syntax grows, I'd prefer not to just |-together separate regexes.
How can I structure this to have re.findall(regex, numbers) return the list above?
While you may use your regex with re.finditer and take each whole match value ([x.group(0) for x in re.finditer(regex, numbers)]), you may also get the values you need with
re.findall(r'(?=\.?\d)\d*\.\d*', s)
See the regex demo
Details
(?=\.?\d) - a positive lookahead that requires an optional . followed with a digit immediately to the right of the current location
\d* - 0+ digits
\. - a dot
\d* - 0+ digits
So, even though \d* in the consuming pattern can match 0 digits, the lookahead requires at least one there.
Python demo:
import re
s=".42 5.42 5. ."
print(re.findall(r'(?=\.?\d)\d*\.\d*', s))
# => ['.42', '5.42', '5.']
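The re.finditer alternative mentioned at the start of this answer keeps the conditional regex from the question unchanged and reads group(0), the whole match, instead of the capture group. A minimal sketch:

```python
import re

numbers = ".42 5.42 5. ."
regex = r"(\d+)?(?(1)\.\d*|\.\d+)"

# group(0) is the full match, so the capturing group no longer
# hijacks the result the way it does with re.findall.
whole_matches = [m.group(0) for m in re.finditer(regex, numbers)]
print(whole_matches)  # ['.42', '5.42', '5.']
```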
Let's say that I have a string that looks like this:
a = '1253abcd4567efgh8910ijkl'
I want to find all substrings that start with a digit and end with a letter.
I tried,
b = re.findall('\d.*\w',a)
but this gives me,
['1253abcd4567efgh8910ijkl']
I want to have something like,
['1253abcd','4567efgh','8910ijkl']
How can I do this? I'm pretty new to regex method, and would really appreciate it if anyone can show how to do this in different method within regex, and explain what's going on.
\w will match any word character, which means digits, letters, and the underscore. You need to use [a-zA-Z] to capture letters only. See this example.
import re
a = '1253abcd4567efgh8910ijkl'
b = re.findall(r'(\d+[A-Za-z]+)', a)
Output:
['1253abcd', '4567efgh', '8910ijkl']
\d will match digits. \d+ will match one or more consecutive digits. For example:
>>> re.findall(r'(\d+)', a)
['1253', '4567', '8910']
Similarly, [a-zA-Z]+ will match one or more letters.
>>> re.findall(r'([a-zA-Z]+)', a)
['abcd', 'efgh', 'ijkl']
Now put them together to match what you exactly want.
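From the Python manual on regular expressions, it tells us that \w:
matches any alphanumeric character and the underscore; this is equivalent to the set [a-zA-Z0-9_]
(That wording describes byte patterns or the re.ASCII flag; for ordinary str patterns in Python 3, \w also matches Unicode word characters.) So you are actually capturing more than you need. Refine your regular expression a bit: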
>>> re.findall(r'(\d+[a-z]+)', a, re.I)
['1253abcd', '4567efgh', '8910ijkl']
The re.I makes your expression case insensitive, so it will match upper and lower case letters as well:
>>> re.findall(r'(\d+[a-z]+)', '12124adbad13434AGDFDF434348888AAA')
['12124adbad']
>>> re.findall(r'(\d+[a-z]+)', '12124adbad13434AGDFDF434348888AAA', re.I)
['12124adbad', '13434AGDFDF', '434348888AAA']
In your pattern, \d matches one digit, .* then greedily matches as many characters as possible, and \w matches any alphanumeric character or underscore. So your code returns a single string that starts at the first digit and runs to the last word character in the input.
Solution:
>>>b=re.findall(r'\d*[A-Za-z]*', a)
>>>b
['1253abcd', '4567efgh', '8910ijkl', '']
You will get '' (an empty string) at the end of the list because both parts of the pattern are optional, so it also matches the empty string at the end of the input. You can remove it using
b.pop(-1)
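As a possible alternative to popping the last element, requiring at least one digit and at least one letter with + avoids the empty match in the first place. A small comparison sketch:

```python
import re

a = "1253abcd4567efgh8910ijkl"

# With *, both parts may be empty, so the pattern also matches '' at the end.
with_star = re.findall(r"\d*[A-Za-z]*", a)
print(with_star)  # ['1253abcd', '4567efgh', '8910ijkl', '']

# With +, each match needs one or more digits followed by one or more letters.
with_plus = re.findall(r"\d+[A-Za-z]+", a)
print(with_plus)  # ['1253abcd', '4567efgh', '8910ijkl']
```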
I don't know how to match a string with a regular expression. The format of the string is:
[ any symbol 0~n times any number 1~n times] 1~n times.
It seems like matching a phone number, but the difference is that any symbols and white space can be inserted between the numbers, for example
458###666###2##111####111
OR
(123)))444###555%%6222%%%%
I don't know if I explained the question clearly.
Anyway, thanks for your reply.
I think this represents the pattern you described
^(?:(\D?)\1*\d+)+$
See it here on Regexr
^ matches the start of the string
(\D?)\1* will match an optional non digit (\D), put it into a backreference and match this same character again 0 or more times using \1*
\d+ at least 1 digit
(?:(\D?)\1*\d+)+ the complete non capturing group is repeated 1 or more times
$ matches the end of the string
It will match
458###666###2##111####111
(123)))444###555%%6222%%%%1
(((((((((123)))444###555%%6222%%%%1
But not
s(123)))444###555%%6222%%%%1
(123)))444###555%%6222%%%%
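The examples above can be checked from Python with a short sketch (the samples dictionary and its expected values are just the cases listed in this answer):

```python
import re

pattern = re.compile(r"^(?:(\D?)\1*\d+)+$")

# Sample strings from this answer, mapped to whether they should match.
samples = {
    "458###666###2##111####111": True,
    "(123)))444###555%%6222%%%%1": True,
    "s(123)))444###555%%6222%%%%1": False,  # two different symbols in a row
    "(123)))444###555%%6222%%%%": False,    # does not end with a digit
}

results = {s: pattern.match(s) is not None for s in samples}
for s, matched in results.items():
    print(s, matched)
```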
Your statement:
[ any symbol 0~n times any number 1~n times] 1~n times.
does not fit your second example (123)))444###555%%6222%%%%, which does not end with a digit.
If you need to gather all the groups of digits from the string you can use \d+ regex:
>>> re.findall(r'\d+', '458###666###2##111####111 OR (123)))444###555%%6222%%%%')
['458', '666', '2', '111', '111', '123', '444', '555', '6222']
[ NOTE, I am ignoring the 'in python', opting instead for a more general 'build regular expressions' answer, in the hope that this will not only provide the desired answer but be something to take away for different RE-related problems ]
First, you want to match any symbol (or possibly any symbol, except a number), 0 or more times. That would be one of .* or [^0-9]* (the first is the 'anything wildcard', the second is a character class of everything except the digits 0 to 9). The * means 'match zero or more times'.
Second, you want to match one or more digits. That, too, is relatively easy: [0-9]+ (or, if you have a sufficiently old and pedantic RE library, [0-9][0-9]*, but that is highly unlikely to be the case outside a CS exam).
Third, you want to group that and repeat the grouping at least one time.
The general syntax for grouping is to enclose the group in parentheses (except in emacs, where you need \(, as a plain parenthesis there matches a literal parenthesis). So, something along the lines of ([^0-9]*[0-9]+)+ should do the trick.
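Although this answer is deliberately language-neutral, the resulting pattern can be tried in Python. The sketch below uses re.fullmatch so that the whole string has to fit the pattern, which is an assumption about the intent. Unlike a backreference-based pattern, this one accepts any mix of non-digit symbols between the digit groups:

```python
import re

pattern = re.compile(r"([^0-9]*[0-9]+)+")

ok = pattern.fullmatch("458###666###2##111####111")
mixed = pattern.fullmatch("s(123)))444###555%%6222%%%%1")
trailing = pattern.fullmatch("(123)))444###555%%6222%%%%")

print(ok is not None)        # True
print(mixed is not None)     # True: different symbols may be mixed
print(trailing is not None)  # False: the string must end with digits
```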