Dealing with spaces in regex - python

I'm a RegEx newbie and this has been driving me nuts for the past 48 hours. I tried everything I could while reading hundreds of examples and documents. I want to learn.
I need to extract the month name from strings like these, with the month being the word in the middle (multilingual):
10 july 2014
9 dicembre2014
1januar2011
18août2002 (note: non-[A-z] character in the month if it matters)
The closest I got was [\D]{3,}(?=.{4,}) yielding:
' july '
' dicembre'
'januar'
'août'
But it still matches the spaces around the name. I tried adding [^\s] but obviously it's not that simple.
What is the simplest RegEx way to find the right match?

With the re.UNICODE flag set (it is already the default for str patterns in Python 3), \w also matches letters from all scripts (including û, ñ, á, etc.). Then [^\W\d_] matches only letters, but from any script:
\w matches word characters (letters, digits or underscore "_")
\W is the negated shorthand, it matches non-word characters (same as [^\w])
\d matches digits
So [^\W\d_] will match anything EXCEPT non-word characters, digits or "_"... which means it will only match letters
Code:
#python 3.4.3
import re
text = "10 july 2014 \n 9 dicembre2014 \n 1januar2011\n 18août2002"  # avoid shadowing the built-in str
pattern = r'([0-3]?\d)\s*([^\W\d_]{3,})\s*((?:\d{2}){1,2})'
result = re.findall(pattern, text, re.UNICODE)
for date in result:
    print(date)
Output:
('10', 'july', '2014')
('9', 'dicembre', '2014')
('1', 'januar', '2011')
('18', 'août', '2002')
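If only the month name is needed, as the question asks, the letter-only class can be used on its own. A minimal sketch, assuming Python 3 (where str patterns are Unicode-aware by default) and one date per string; the names samples and month_only are illustrative and not part of the answer:

```python
import re

# Sample strings from the question, one date per string.
samples = ["10 july 2014", "9 dicembre2014", "1januar2011", "18août2002"]

# [^\W\d_]+ : one or more characters that are word characters
# but neither digits nor underscore, i.e. letters from any script.
month_only = re.compile(r"[^\W\d_]+")

months = [month_only.search(s).group() for s in samples]
print(months)  # ['july', 'dicembre', 'januar', 'août']
```

Because the class excludes whitespace as well as digits, the surrounding spaces never become part of the match.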

Related

How to avoid a specific pattern when using regular expression?

I want to match a pattern like '2 years', '4 days' in a text, and meanwhile want to avoid a pattern like '2 years old', i.e., I don't want a 'old' following 'years'. I thought a negative lookahead (?!old) would help. But I don't know how to do it. I tried
r=regex.compile(r'\b(\d+)\s*(years?|months?|days?)\s*(?!old)\b')
but it still matches '2 years' in '2 years old'.
If you only need the full match, you can omit the capture groups. If there must be at least one whitespace char between the digits and the words, use \s+ (one or more) instead of \s*.
To prevent partial matches, you can use word boundaries \b
\b\d+\s+(?:year|month|day)s?\b(?!\s+old\b)
The pattern matches
\b\d+\s+ A word boundary, match 1+ digits and 1+ whitespace chars
(?:year|month|day)s?\b Match any of the alternatives and optional s
(?!\s+old\b) Negative lookahead, assert that what follows is not 1+ whitespace chars, then old and a word boundary
Put \s* inside the lookahead:
r'\b(\d+)\s*(years?|months?|days?)(?!\s*old)\b'
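As far as I understand, in the 2 years old case your regexp backtracks until \s* matches zero whitespace characters. At that position the text ahead is a space followed by old, not old itself, so the (?!old) assertion lets the match through, and \b is satisfied because 2 years ends at a word boundary.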
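The difference between the two placements of the whitespace can be checked directly. This is a small sketch using the standard re module (the question uses the third-party regex module, but these patterns behave the same in both); the sample sentence is made up for illustration:

```python
import re

text = "He is 2 years old, was away for 4 days and waited 3 months."

# Original attempt: \s* can back off to zero spaces, so (?!old) sees " old"
# (a space first) and does not block the match.
original = re.compile(r'\b(\d+)\s*(years?|months?|days?)\s*(?!old)\b')

# Lookahead that skips the whitespace itself before checking for "old".
fixed = re.compile(r'\b\d+\s+(?:year|month|day)s?\b(?!\s+old\b)')

original_matches = [m.group() for m in original.finditer(text)]
fixed_matches = fixed.findall(text)

print(original_matches)  # ['2 years', '4 days ', '3 months']
print(fixed_matches)     # ['4 days', '3 months']
```

Note that the original pattern also drags a trailing space into '4 days ', since its \s* sits outside the lookahead.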

Cannot understand the code for removing words with numbers [duplicate]

I want to remove words that contain numbers. After some research I found this:
>>> s = "ABCD abcd AB55 55CD A55D 5555"
>>> re.sub("\S*\d\S*", "", s).strip()
This code solves my problem.
However, I do not understand how it works. I know regex basics: I know that \d matches any digit [0-9], that \S is for white spaces, and that * means 0 or more occurrences of the pattern to its left.
"\S*\d\S*"
This is the part I cannot follow: I am not sure how this code identifies AB55.
Can anyone please explain it to me? Thanks
This replaces a digit, together with any non-space symbols around it, with the empty string "". Because the first \S* is greedy, it takes as much as it can and leaves only the last digit of the word for \d:
AB55 : AB5 is \S*, 5 is \d, empty string is \S*
55CD : 5 is \S*, 5 is \d, CD is \S*
A55D : A5 is \S*, 5 is \d, D is \S*
5555 : 555 is \S*, 5 is \d, empty string is \S*
The re.sub("\S*\d\S*", "", s) replaces all these substrings with the empty string "", and .strip() then only removes the whitespace left at the beginning and end of that result
You misunderstand the code. \S is the opposite of \s: it matches any character except whitespace.
Since the Kleene star (*) is greedy, the pattern aims to match as many non-space characters as possible, followed by a digit, followed by as many non-space characters as possible. It will thus match a full word in which at least one character is a digit.
All these matches are then replaced by the empty string, and therefore removed from the original string.
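How the three sub-patterns divide up each word can be inspected with capture groups. A minimal sketch; the grouped pattern and the name split are added here for illustration only:

```python
import re

# Capture groups expose which part of each word the three sub-patterns take.
split = re.compile(r"(\S*)(\d)(\S*)")

for word in ["AB55", "55CD", "A55D", "5555"]:
    # e.g. AB55 -> ('AB5', '5', ''): the greedy first \S* keeps all but the last digit
    print(word, split.fullmatch(word).groups())

s = "ABCD abcd AB55 55CD A55D 5555"
without_strip = re.sub(r"\S*\d\S*", "", s)
print(repr(without_strip))          # 'ABCD abcd' followed by four spaces
print(repr(without_strip.strip()))  # 'ABCD abcd'
```

The separating spaces are not part of any match, which is why the result still needs .strip().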
Your code first matches non-whitespace chars 0+ times with \S* (as opposed to \s*, which matches whitespace chars), running all the way to the end of the "word". Then it backtracks to match a digit and again matches 0+ non-whitespace chars.
The pattern will for example also match a single digit.
You could slightly optimize the pattern by first matching anything that is neither a whitespace char nor a digit, [^\s\d]*, using a negated character class. This prevents the first \S* from consuming the whole word and then backtracking.
[^\s\d]*\d\S*
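As a quick check, the two patterns can be compared on the string from the question. This sketch only shows that they select the same words on this input; it does not measure the difference in backtracking:

```python
import re

s = "ABCD abcd AB55 55CD A55D 5555"

greedy = re.compile(r"\S*\d\S*")
negated = re.compile(r"[^\s\d]*\d\S*")

# Both patterns pick out the same whole words here.
greedy_words = greedy.findall(s)
negated_words = negated.findall(s)
print(greedy_words)   # ['AB55', '55CD', 'A55D', '5555']
print(negated_words)  # ['AB55', '55CD', 'A55D', '5555']

cleaned = negated.sub("", s).strip()
print(cleaned)  # ABCD abcd
```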
You say that \S is for white spaces, but it is not.
This is what the Python documentation says about \s and \S:
\s
Matches any whitespace character; this is equivalent to the class [ \t\n\r\f\v].
\S
Matches any non-whitespace character; this is equivalent to the class [^ \t\n\r\f\v].
If you use \s, which is for whitespace characters, instead, you get output like this:
>>> import re
>>>
>>> s = "ABCD abcd AB55 55CD A55D 5555"
>>> re.sub("\s*\d\s*", "", s).strip()
'ABCD abcd ABCD AD'

Regular expression to find a date substring Python 3.7

I'm trying to write a regular expression to find a specific substring within a string.
I'm looking for dates in the following format:
"January 1, 2018"
I have already done some research but have not been able to figure out how to make a regular expression for my specific case.
The current version of my regular expression is
re.search("[\w*?\s\d\d\s\d\d\d\d]", my_string)
I'm fairly inexperienced with regular expression but from reading the documentation this is what I could come up with as far as matching the date format I'm working with.
Here is my thought process behind my regular expression:
\w should match to any unicode word character and * should repeat the previous match so these together should match some thing like this "January". ? makes * not greedy so it won't try to match anything in the form of January 20 as in it should stop at the first whitespace character.
\s should match white space.
\d\d and \d\d\d\d should match a two digit and four digit number respectively.
Here's a testable sample of my code:
import re
my_string = "January 01, 1990\n By SomeAuthor"
print(re.search("[\w*?\s\d\d\s\d\d\d\d]", my_string))
EDIT:
I have also tried :[A-Za-z]\s\d{1,2}\s\d{2, 4}
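Your pattern is wrapped in [...], which turns it into a character class that matches a single character, so the brackets need to go. It may also be a bit greedy in certain areas, such as the month name, and it is missing the optional comma. Finally, you can use the ignore-case flag to simplify your pattern. Here is an example using re in verbose mode.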
import re
text = "New years day was on January 1, 2018, and boy was it a good time!"
pattern = re.compile(r"""
    [a-z]+   # one or more ASCII letters (ignore case is in use)
    \s       # one whitespace character after
    \d\d?    # one or two digits
    ,?       # an optional comma
    \s       # one whitespace character after
    \d{4}    # four digits (year)
    """, re.IGNORECASE | re.VERBOSE)
result = pattern.search(text).group()
print(result)
output
January 1, 2018
Try
In [992]: my_string = "January 01, 1990\n By SomeAuthor"
...: print(re.search("[A-Z][a-z]+\s+\d{1,2},\s+\d{4}", my_string))
...:
<_sre.SRE_Match object; span=(0, 16), match='January 01, 1990'>
[A-Z] is any uppercase letter
[a-z]+ is 1 or more lowercase letters
\s+ is 1 or more space characters
\d{1,2} is at least 1 and at most 2 digits
Another option, which also allows optional whitespace around the comma:
re.search(r"\w+\s+\d\d?\s*,\s*\d{4}", date_string)
import re
my_string = "January 01, 1990\n By SomeAuthor"
regex = re.compile(r'\w+\s+\d+, \d{4}')  # raw string, so the backslashes reach the regex engine unchanged
result = regex.search(my_string)
result is a match object that holds the matched text and the character span.
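To see why the attempt from the question misbehaves and how a sequence pattern differs, both can be run against the sample string. A sketch under the assumption that the month is written in ASCII letters; the names bracketed and date_pattern are illustrative:

```python
import re

my_string = "January 01, 1990\n By SomeAuthor"

# The question's pattern is wrapped in [...], so it is a character class
# and matches just one character from the set.
bracketed = re.search(r"[\w*?\s\d\d\s\d\d\d\d]", my_string)
print(bracketed.group())  # J

# Sequence version: month name, day, optional comma, four-digit year.
date_pattern = re.compile(r"[A-Za-z]+\s+\d{1,2},?\s+\d{4}")
found = date_pattern.search(my_string)
print(found.group())  # January 01, 1990
print(found.span())   # (0, 16)
```

The match object gives the text through .group() and the position through .span().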

Python regex with \w does not work

I want to have a regex to find a phrase and two words preceding it if there are two words.
For example I have the string (one sentence per line):
Chevy is my car and Rusty is my horse.
My car is very pretty my dog is red.
If i use the regex:
re.finditer(r'[\w+\b|^][\w+\b]my car',txt)
I do not get any match.
If I use the regex:
re.finditer(r'[\S+\s|^][\S+\s]my car',txt)
I am getting:
's my car' and '. My car' (I am ignoring case and using multi-line)
Why is the regex with \w+\b not finding anything? It should find two words and 'my car'
How can I get two complete words before 'my car' if there are two words. If there is only one word preceding my car, I should get it. If there are no words preceding it I should get only 'my car'. In my string example I should get: 'Chevy is my car' and 'My car' (no preceding words here)
In your r'[\w+\b|^][\w+\b]my car' regex, [\w+\b|^] matches 1 symbol that is either a word char, a +, a backspace, |, or ^, and [\w+\b] matches 1 symbol that is either a word char, a +, or a backspace.
The point is that inside a character class, quantifiers and a lot (but not all) special characters match literal symbols. E.g. [+] matches a plus symbol, [|^] matches either a | or ^. Since you want to match a sequence, you need to provide a sequence of subpatterns outside of a character class.
It seems as if you intended to use \b as a word boundary, however, \b inside a character class matches only a backspace character.
To find two words and 'my car', you can use, for example
\S+\s+\S+\s+my car
Here, \S+ matches one or more non-whitespace symbols and \s+ matches 1 or more whitespaces; the 2 occurrences of these 2 consecutive subpatterns match those symbols as a sequence.
To make the sequences before my car optional, just use a {0,2} quantifier like this:
(?:\S+[ \t]+){0,2}my car
Use it with the re.IGNORECASE flag, as in this Python demo:
import re
txt = 'Chevy is my car and Rusty is my horse.\nMy car is very pretty my dog is red.'
print(re.findall(r'(?:\S+[ \t]+){0,2}my car', txt, re.I))
Details:
(?:\S+[ \t]+){0,2} - 0 to 2 sequences of 1+ non-whitespaces followed with 1+ space or tab symbols (you may replace [ \t] with [^\S\r\n] to match any horizontal whitespace, or with \s if you also plan to match linebreaks).
my car - a literal text my car.
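The points above can be verified on the text from the question. A minimal sketch; the variable names question_pattern and matches are illustrative:

```python
import re

txt = 'Chevy is my car and Rusty is my horse.\nMy car is very pretty my dog is red.'

# Inside [...], \b is a backspace character (\x08), not a word boundary.
backspaces = re.findall(r'[\b]', 'a\x08b')
print(backspaces)  # ['\x08']

# The question's pattern: two single characters from the classes, then "my car".
question_pattern = r'[\w+\b|^][\w+\b]my car'
no_matches = re.findall(question_pattern, txt, re.I | re.M)
print(no_matches)  # []

# Up to two preceding words on the same line, then the phrase.
matches = re.findall(r'(?:\S+[ \t]+){0,2}my car', txt, re.I)
print(matches)  # ['Chevy is my car', 'My car']
```

The question's pattern finds nothing because the character right before "my car" is always a space or a newline, and neither is in [\w+\b].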

Python Regular Expression -- not matching digits at end of string

This will be really quick marks for someone...
Here's my string:
Jan 13.BIGGS.04222 ABC DMP 15
I'm looking to match:
the date at the front (mmm yy format)
the name in the second field
the digits at the end. There could be between one and three.
Here is what I have so far:
(\w{3} \d{2})\.(\w*)\..*(\d{1,3})$
Through a lot of playing around with http://www.pythonregex.com/ I can get to matching the '5', but not '15'.
What am I doing wrong?
Use .*? to match .* non-greedily:
In [9]: re.search(r'(\w{3} \d{2})\.(\w*)\..*?(\d{1,3})$', text).groups()
Out[9]: ('Jan 13', 'BIGGS', '15')
Without the question mark, .* matches as many characters as possible, including the digit you want to match with \d{1,3}.
As an alternative to what @unutbu has proposed, you can also use a word boundary \b, which matches at the edge of a word:
(\w{3} \d{2})\.(\w*)\..*\b(\d{1,3})$
From the site you referred to (note the raw string: in a normal string literal, \b would be a backspace character):
>>> regex = re.compile(r"(\w{3} \d{2})\.(\w*)\..*\b(\d{1,3})$")
>>> regex.findall('Jan 13.BIGGS.04222 ABC DMP 15')
[(u'Jan 13', u'BIGGS', u'15')]
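The .* before the final digits is greedy and matches as much as it can, leaving the fewest possible digits for the last group. You either need to make it non-greedy (with ?, as unutbu said) or stop it from swallowing the digits. Note that simply replacing . with \D does not work on this sample, because 04222 sits between the second dot and the final number; requiring a non-digit right before the last group, as in .*\D(\d{1,3})$, has the intended effect.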
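The three variants discussed in this thread can be compared side by side on the sample string. A short sketch; the variable names are illustrative, and the patterns are written as raw strings so that \b reaches the regex engine as a word boundary:

```python
import re

text = 'Jan 13.BIGGS.04222 ABC DMP 15'

greedy = r'(\w{3} \d{2})\.(\w*)\..*(\d{1,3})$'      # pattern from the question
lazy = r'(\w{3} \d{2})\.(\w*)\..*?(\d{1,3})$'       # non-greedy .*?
boundary = r'(\w{3} \d{2})\.(\w*)\..*\b(\d{1,3})$'  # word boundary before the digits

for name, pattern in [("greedy", greedy), ("lazy", lazy), ("boundary", boundary)]:
    print(name, re.search(pattern, text).groups())
# greedy ('Jan 13', 'BIGGS', '5')
# lazy ('Jan 13', 'BIGGS', '15')
# boundary ('Jan 13', 'BIGGS', '15')
```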
