Suppose I have a sentence that contains both an age and a time:
import re
text = "I am 21 and work at 3:30"
answer= re.findall(r'\b\d{2}\b', text)
print(answer)
The issue is that it returns not only 21 but also 30 (since it looks for any 2 digits). How do I avoid this, so that it only matches standalone numbers and is not tripped up by non-alphanumeric characters such as the colon? I tried [0-99] instead of the {} braces, but that didn't help.
Using \s\d{2}\s will give you only 2 digit combinations with spaces around them (before and after).
Or if you want to match without trailing whitespace: \s\d{2}
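A quick check of what the first pattern actually returns may help here. This is only a sketch: the second sample string is made up for illustration. Note that the whitespace is part of the match unless a capturing group is added, and that a number at the very end of the string is missed because \s needs a real character to match.

```python
import re

text = "I am 21 and work at 3:30"

# \s\d{2}\s consumes the surrounding whitespace, so it shows up in the result.
with_spaces = re.findall(r"\s\d{2}\s", text)
print(with_spaces)  # [' 21 ']

# A capturing group keeps only the digits.
digits_only = re.findall(r"\s(\d{2})\s", text)
print(digits_only)  # ['21']

# \s needs an actual whitespace character, so a number at the end is missed.
at_end = re.findall(r"\s\d{2}\s", "I am 21")
print(at_end)  # []
```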
That's because : is a non-word character, so \b matches the empty string at the boundary between : and 30. In regex terms, a word for \b is \w+.
You can instead check for digits surrounded by whitespace or by the start/end of the input:
(?:^|\s)(\d{2})(?:\s|$)
Example:
In [85]: text = "I am 21 and work at 3:30"
...: re.findall(r'(?:^|\s)(\d{2})(?:\s|$)', text)
Out[85]: ['21']
You can use a negative lookbehind (?<!...) and a negative lookahead (?!...) to isolate and match only 2 (two) digits.
Regex: (?<!\S)\d{2}(?!\S)
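As a minimal sketch of how this lookaround pattern behaves in Python (the second sample sentence is an assumption added for illustration):

```python
import re

# Negative lookbehind/lookahead: no non-whitespace character on either side.
pattern = re.compile(r"(?<!\S)\d{2}(?!\S)")

text = "I am 21 and work at 3:30"
matches = pattern.findall(text)
print(matches)  # ['21']

# Because the lookarounds consume nothing, this also works at the very
# start or end of the string.
edge_matches = pattern.findall("21 is my age, my sister is 19")
print(edge_matches)  # ['21', '19']
```

Since lookarounds are zero-width, the result contains only the digits, with no surrounding whitespace to strip.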
You can use the following regex:
^\d{2}$|(?<=\s)\d{2}(?=\s)|(?<=\s)\d{2}$|^\d{2}(?=\s)
that will match every standalone 21 in the following strings (and nothing in abc 12:23 or 12345):
I am 21 and work at 3:30
21
abc 12:23
12345
I am 21
21 am I
demo: https://regex101.com/r/gP1KSf/1
Explanations:
^\d{2}$ match 2 digits only string or
(?<=\s)\d{2}(?=\s) 2 digits surrounded by space class char or
(?<=\s)\d{2}$ 2 digits at the end of the string, preceded by a space class char or
^\d{2}(?=\s) 2 digits at the beginning of the string and followed by a space class char
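The four alternatives can be checked against the sample strings listed above with a short script. This is just a sketch of one way to run the test locally instead of on regex101:

```python
import re

pattern = re.compile(r"^\d{2}$|(?<=\s)\d{2}(?=\s)|(?<=\s)\d{2}$|^\d{2}(?=\s)")

samples = [
    "I am 21 and work at 3:30",
    "21",
    "abc 12:23",
    "12345",
    "I am 21",
    "21 am I",
]

# One list of matches per sample string, in the same order as `samples`.
results = [pattern.findall(s) for s in samples]
for s, found in zip(samples, results):
    print(repr(s), found)
```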
Related
I am trying to match Australian phone numbers. The numbers can start with 0, +61 or 61, followed by one of 2, 3, 4, 5, 7 or 8, and then 8 more digits.
txt = "My phone number is 0412345677 or +61412345677 or 61412345677"
find_ph = re.find_all(r'(0|\+61|61)[234578]\d{8}', text)
find_ph
returns
['0', '61']
But I want it to return
['0412345677', '+61412345677' or '61412345677']
Can you please point me in the right direction?
>>> pattern = r'((?:0|\+61|61)[234578]\d{8})'
>>> find_ph = re.findall(pattern, txt)
>>> print(find_ph)
['0412345677', '+61412345677', '61412345677']
The problem you had was that the parentheses around just the prefix part were telling the findall function to only capture those characters, while matching all the rest. (Incidentally, it's findall not find_all, and your string was in the variable txt not text.)
Instead, make that a non-capturing group with (?:0|\+61|61). Now you capture the whole of the string that matches the entire pattern.
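To make the difference concrete, here is a small sketch comparing the two group styles on the string from the question. When a pattern has exactly one capturing group, re.findall returns the text of that group; with no capturing groups it returns the whole match.

```python
import re

txt = "My phone number is 0412345677 or +61412345677 or 61412345677"

# Capturing group around the prefix: findall returns only the prefix.
prefix_only = re.findall(r"(0|\+61|61)[234578]\d{8}", txt)
print(prefix_only)  # ['0', '+61', '61']

# Non-capturing group: findall returns the whole match.
full_numbers = re.findall(r"(?:0|\+61|61)[234578]\d{8}", txt)
print(full_numbers)  # ['0412345677', '+61412345677', '61412345677']
```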
You can use a non-capturing group:
import re
re.findall(r"(?:0|\+61|61)\d+", txt)
['0412345677', '+61412345677', '61412345677']
One Solution
re.findall(r'(?:0|61|\+61)[2345678]\d{8}', txt)
# ['0412345677', '+61412345677', '61412345677']
Explanation
(?:0|61|\+61) Non-capturing group for 0, 61 or +61
[2345678] followed by one digit except 0, 1, 9
\d{8} followed by 8 digits
I am trying to extract everything up to and including the first 5 characters (letters or digits) after the last hyphen.
Here is the example
String -- X008-TGa19-ER751QF7
Output -- X008-TGa19-ER751
String -- X002-KF13-ER782cPU80
Output -- X002-KF13-ER782
My attempt -- I managed to match the element after the last hyphen with (\w+)[^-.]*$
But how do I take only its first 5 characters, and then return the entire value as the output, as shown in the example?
You can optionally repeat a - and 1+ word chars from the start of the string. Then match the last - and match 5 word chars.
^\w+(?:-\w+)*-\w{5}
^ Start of string
\w+ Match 1+ word chars
(?:-\w+)* Optionally repeat - and 1+ word chars
-\w{5} Match - and 5 word chars
import re
regex = r"^\w+(?:-\w+)*-\w{5}"
s = ("X008-TGa19-ER751QF7\n"
"X002-KF13-ER782cPU80")
print(re.findall(regex, s, re.MULTILINE))
Output
['X008-TGa19-ER751', 'X002-KF13-ER782']
Note that \w can also match _.
If there can also be other characters in the string, to get the first 5 letters or digits (no _) after the last hyphen, you can match word characters without an underscore using a negated character class: [^\W_]
Repeat that 5 times while asserting that no more hyphens follow to the right.
^.*-[^\W_]{5}(?=[^-]*$)
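A short sketch of this variant in Python. The two input strings come from the question; the third, containing an underscore and a dot, is a made-up example to show that other characters are allowed before the last hyphen.

```python
import re

pattern = re.compile(r"^.*-[^\W_]{5}(?=[^-]*$)")

samples = [
    "X008-TGa19-ER751QF7",
    "X002-KF13-ER782cPU80",
    "X_08-TG.a19-ER751QF7",  # hypothetical input with other characters
]

trimmed = []
for s in samples:
    m = pattern.match(s)
    # m is None when fewer than 5 letters/digits follow the last hyphen.
    trimmed.append(m.group() if m else None)

print(trimmed)
# ['X008-TGa19-ER751', 'X002-KF13-ER782', 'X_08-TG.a19-ER751']
```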
(\w+-\w+-\w{5}) seems to capture what you're asking for.
Example:
https://regex101.com/r/PcPSim/1
If you are open to a non-regex solution, you can use this one, which is based on splitting, slicing and joining the strings:
>>> my_str = "X008-TGa19-ER751QF7"
>>> '-'.join(s[:5] for s in my_str.split('-'))
'X008-TGa19-ER751'
Here I am splitting the string on the hyphen -, slicing each sub-string to keep at most five chars, and joining the pieces back together using str.join() to get the string in your desired format.
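One caveat with the split/slice/join approach is that it truncates every segment, not only the last one. If the earlier segments can be longer than five characters, a variation using str.rpartition may be closer to the intent. This is a sketch only: the helper name trim_last_segment and the third sample string are assumptions for illustration.

```python
def trim_last_segment(value: str, keep: int = 5) -> str:
    """Keep everything up to the last hyphen plus `keep` chars after it."""
    # rpartition splits on the LAST hyphen; if there is none, head and sep
    # are empty and the whole string ends up in tail.
    head, sep, tail = value.rpartition("-")
    return head + sep + tail[:keep]


print(trim_last_segment("X008-TGa19-ER751QF7"))   # X008-TGa19-ER751
print(trim_last_segment("X002-KF13-ER782cPU80"))  # X002-KF13-ER782

# Hypothetical input with a long first segment:
long_prefix = "LONGPREFIX-AB-ER751QF7"
print("-".join(s[:5] for s in long_prefix.split("-")))  # LONGP-AB-ER751
print(trim_last_segment(long_prefix))                   # LONGPREFIX-AB-ER751
```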
^(.*-[^-]{5})[^-]*$
Capture group 1 is what you need
https://regex101.com/r/SYz9i5/1
Explanation
^(.*-[^-]{5})[^-]*$
^ Start of line
( Capture group 1 start
.* Any number of any character
- hyphen
[^-]{5} 5 non-hyphen characters
) Capture group 1 end
[^-]* Any number of non-hyphen characters
$ End of line
Another simpler one is
^(.*-.{5}).*$
This should be quite straight-forward.
This makes use of the greedy behaviour of the first .*, which tries to match as much as possible, so the - it settles on is the last one with at least 5 characters following it.
https://regex101.com/r/CFqgeF/1/
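For reference, a sketch of both capture-group patterns used from Python. The variable names strict and loose are mine, and the short-tail input is a made-up case to show where the two patterns differ: the stricter one refuses to match, while the simpler one falls back to an earlier hyphen.

```python
import re

strict = re.compile(r"^(.*-[^-]{5})[^-]*$")
loose = re.compile(r"^(.*-.{5}).*$")

s = "X008-TGa19-ER751QF7"
strict_result = strict.match(s).group(1)
loose_result = loose.match(s).group(1)
print(strict_result, loose_result)  # X008-TGa19-ER751 X008-TGa19-ER751

# Hypothetical input with fewer than 5 characters after the last hyphen.
short_tail = "X008-TGa19-ER7"
strict_short = strict.match(short_tail)        # no match at all
loose_short = loose.match(short_tail).group(1)  # backs off to the first hyphen
print(strict_short, loose_short)  # None X008-TGa19
```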
I want to remove the first 4 words from a paragraph.
Original : Mywebsite 21 12 34 have 10000 traffic
Result I want : have 10000 traffic
I have about 1000 lines in the same format as the original (Mywebsite 21 12 34 have 10000 traffic).
I have regex search-and-replace patterns that work like this:
The pattern below removes the first word from a sentence:
^\w+\s+(.*) = replace with $1
The following pattern removes all numbers from a line:
[0-9 ]+ = replace with space
I want to combine the two into a single regex that works as explained above, without affecting any other words on the same line.
If your lines are all in the exact same format, i.e. if you always need to remove the first 4 words, you can do something like this, which is much simpler to understand than a regex:
# Iterate through all your lines
for line in lines:
    # Split the line string on spaces to create a list of words.
    words = line.split(' ')
    # Exclude the first 4 words and re-join the string with the remaining words.
    line = ' '.join(words[4:])
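A self-contained variant of the same idea that also keeps the results, as a sketch. The lines list is an assumption (the second entry is invented), and split() without an argument is used so that repeated spaces do not produce empty words.

```python
lines = [
    "Mywebsite 21 12 34 have 10000 traffic",
    "Othersite 01 02 03 got 42 visits",  # hypothetical second line
]

# Drop the first 4 whitespace-separated words of every line.
cleaned = [" ".join(line.split()[4:]) for line in lines]

print(cleaned)  # ['have 10000 traffic', 'got 42 visits']
```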
The pattern that you tried ^\w+\s+(.*) matches 1+ word chars, 1+ whitespace chars and then any char except a newline until the end of the string so that will match the whole string.
To remove the first word and the three 2-digit numbers that follow it, you might use:
^\s*\w+(?: \d{2}){3}\s*
^ Start of string
\s* Match 0+ whitespace chars
\w+ Match 1+ word chars
(?: \d{2}){3} Repeat 3 times matching a space and 2 digits
\s* Match 0+ whitespace chars
Note that \s also matches a newline. If you only want to match spaces or tabs you could use [ \t] instead.
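In Python, this pattern could be applied with re.sub, replacing the match with an empty string. A minimal sketch follows; the second input line is invented to show that a line that does not fit the format is left untouched.

```python
import re

pattern = re.compile(r"^\s*\w+(?: \d{2}){3}\s*")

line = "Mywebsite 21 12 34 have 10000 traffic"
result = pattern.sub("", line)
print(result)  # have 10000 traffic

# A line without the three 2-digit numbers does not match and is unchanged.
other = "Mywebsite has 10000 traffic"
unchanged = pattern.sub("", other)
print(unchanged)  # Mywebsite has 10000 traffic
```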
You may use
re.sub(r'^(\w+\s)[\d\s]+', r'\1', text)
The pattern will match
^ - start of string
(\w+\s) - Capturing group 1: one or more word chars and a whitespace
[\d\s]+ - 1+ whitespace or digit chars.
Python demo:
import re
rx = re.compile(r"^(\w+\s)[\d\s]+")
s = "Mywebsite 21 12 34 have 10000 traffic"
print( rx.sub(r"\1", s) ) # => Mywebsite have 10000 traffic
I have the following data and want to match certain strings as commented below.
FTUS80 KWBC 081454 AAA\r\r TAF AMD #should match 'AAA'
LTUS41 KCTP 082111 RR3\r\r TMLLNS\r #should match 'RR3' and 'TMLLNS'
SRUS55 KSLC 082010\r\r HM5SLC\r\r #should match 'HM5SLC'
SRUS55 KSLC 082010\r\r SIGC \r\r #should match 'SIGC ' including whitespace
I need the following conditions met. But it doesn't work when I put it all together so I know I have mistakes. Thanks in advance.
Start match after 6 digit string: (?<=\d{6})
match if 3 character mixed uppercase/digits and before first 2 carriage returns: ([A-Z0-9]{3})(?=\r)
match if 6 characters mixed uppercase/digits after carriage returns: (?<=\r\r[A-Z0-9]{6})
match if 4 characters and two spaces: ([A-Z0-9]{4} )
There is probably a more elegant way, but you could do something like the following:
(?:\d{6}\s?)([A-Z\d]{3})?(?:[\r\n]{2}\s)([A-Z\d]{6}|[A-Z\d]{4}\s{2})?
(?:\d{6}\s?) non capture group of 6 digits followed by an optional space
([A-Z\d]{3})? optional capture group of 3 uppercase letters / digits
(?:[\r\n]{2}\s) non capture group of two line endings followed by 1 space
([A-Z\d]{6}|[A-Z\d]{4}\s{2})? optional capture group of either 6 uppercase letters / digits OR 4 uppercase letters / digits followed by 2 spaces
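Here is a sketch of how the pattern behaves on the four sample lines. Two assumptions are baked in: the \r sequences in the data are real carriage-return characters, and the last sample has two spaces after SIGC, as the stated condition requires (the sample as posted appears to show only one).

```python
import re

pattern = re.compile(
    r"(?:\d{6}\s?)([A-Z\d]{3})?(?:[\r\n]{2}\s)([A-Z\d]{6}|[A-Z\d]{4}\s{2})?"
)

samples = [
    "FTUS80 KWBC 081454 AAA\r\r TAF AMD",
    "LTUS41 KCTP 082111 RR3\r\r TMLLNS\r",
    "SRUS55 KSLC 082010\r\r HM5SLC\r\r",
    "SRUS55 KSLC 082010\r\r SIGC  \r\r",  # assumed: two spaces after SIGC
]

# groups() returns (group 1, group 2); a group that did not take part is None.
groups = [pattern.search(s).groups() for s in samples]
for g in groups:
    print(g)
# ('AAA', None)
# ('RR3', 'TMLLNS')
# (None, 'HM5SLC')
# (None, 'SIGC  ')
```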
It's not clear what the line ending is here, but assuming it's the Unix one, \n, the following expression captures the strings as requested (double quotes added to show white space):
sed -rne 's/^.{18} ?([A-Z0-9]{3,3})?\r{2}?([^\r]+)?\r.*$/"\1\2"/p' text.txt
Result
"AAA"
"RR3 TMLLNS"
" HM5SLC"
" SIGC "
.{18} first 18 characters
?([A-Z0-9]{3,3})? matches AAA or RR3 without leading space
\r{2}?([^\r]+)?\r matches TMLLNS, HM5SLC or SIGC preceded by 2 \r and followed by 1 \r characters.
I'm trying to write a regular expression to find a specific substring within a string.
I'm looking for dates in the following format:
"January 1, 2018"
I have already done some research but have not been able to figure out how to make a regular expression for my specific case.
The current version of my regular expression is
re.search("[\w*?\s\d\d\s\d\d\d\d]", my_string)
I'm fairly inexperienced with regular expression but from reading the documentation this is what I could come up with as far as matching the date format I'm working with.
Here is my thought process behind my regular expression:
\w should match any Unicode word character and * should repeat the previous match, so together these should match something like "January". ? makes * non-greedy, so it won't try to match anything of the form January 20; that is, it should stop at the first whitespace character.
\s should match white space.
\d\d and \d\d\d\d should match a two digit and four digit number respectively.
Here's a testable sample of my code:
import re
my_string = "January 01, 1990\n By SomeAuthor"
print(re.search("[\w*?\s\d\d\s\d\d\d\d]", my_string))
EDIT:
I have also tried :[A-Za-z]\s\d{1,2}\s\d{2, 4}
Your pattern may be a bit greedy in certain areas like in the month name. Also, you're missing the optional comma. Finally, you can use the ignore case flag to simplify your pattern. Here is an example using re in verbose mode.
import re
text = "New years day was on January 1, 2018, and boy was it a good time!"
pattern = re.compile(r"""
[a-z]+ # one or more ASCII letters (IGNORECASE is in use)
\s # one space after
\d\d? # one or two digits
,? # an optional comma
\s # one space after
\d{4} # four digits (year)
""",re.IGNORECASE|re.VERBOSE)
result = pattern.search(text).group()
print(result)
output
January 1, 2018
Try
In [992]: my_string = "January 01, 1990\n By SomeAuthor"
...: print(re.search(r"[A-Z][a-z]+\s+\d{1,2},\s+\d{4}", my_string))
...:
<_sre.SRE_Match object; span=(0, 16), match='January 01, 1990'>
[A-Z] is any uppercase letter
[a-z]+ is 1 or more lowercase letters
\s+ is 1 or more space characters
\d{1,2} is at least 1 and at most 2 digits
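It may also help to see why the pattern from the question falls short. Square brackets define a character class, so everything inside them is treated as a set of alternatives for a single character rather than as a sequence. A small sketch comparing the two:

```python
import re

my_string = "January 01, 1990\n By SomeAuthor"

# Inside [...] the tokens form one character class: exactly one character
# is matched, here the first word character in the string.
single = re.search(r"[\w*?\s\d\d\s\d\d\d\d]", my_string)
print(single.group())  # J

# Without the brackets, the tokens are matched one after another.
fixed = re.search(r"[A-Z][a-z]+\s+\d{1,2},\s+\d{4}", my_string)
print(fixed.group())  # January 01, 1990
```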
Here:
re.search(r"\w+\s+\d\d?\s*,\s*\d{4}", date_string)
import re
my_string = "January 01, 1990\n By SomeAuthor"
regex = re.compile(r'\w+\s+\d+, \d{4}')
result = regex.search(my_string)
result will contain the matched text and the character span.
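To pull the text and the span out of the match object, something like the following sketch should work. Note that search returns None when nothing matches, so a check is worth having before calling group().

```python
import re

my_string = "January 01, 1990\n By SomeAuthor"
regex = re.compile(r"\w+\s+\d+, \d{4}")
result = regex.search(my_string)

matched_text = None
span = None
if result is not None:
    matched_text = result.group()  # the matched substring
    span = result.span()           # (start, end) indices into my_string

print(matched_text)  # January 01, 1990
print(span)          # (0, 16)
```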