Regular expression to find a date substring Python 3.7 - python

I'm trying to write a regular expression to find a specific substring within a string.
I'm looking for dates in the following format:
"January 1, 2018"
I have already done some research but have not been able to figure out how to make a regular expression for my specific case.
The current version of my regular expression is
re.search("[\w*?\s\d\d\s\d\d\d\d]", my_string)
I'm fairly inexperienced with regular expressions, but from reading the documentation this is what I could come up with as far as matching the date format I'm working with.
Here is my thought process behind my regular expression:
\w should match any Unicode word character and * should repeat the previous match, so together these should match something like "January". ? makes * non-greedy, so it won't try to match anything like "January 20"; it should stop at the first whitespace character.
\s should match white space.
\d\d and \d\d\d\d should match a two digit and four digit number respectively.
Here's a testable sample of my code:
import re
my_string = "January 01, 1990\n By SomeAuthor"
print(re.search("[\w*?\s\d\d\s\d\d\d\d]", my_string))
EDIT:
I have also tried: [A-Za-z]\s\d{1,2}\s\d{2, 4}

Your pattern may be a bit greedy in certain areas like in the month name. Also, you're missing the optional comma. Finally, you can use the ignore case flag to simplify your pattern. Here is an example using re in verbose mode.
import re
text = "New years day was on January 1, 2018, and boy was it a good time!"
pattern = re.compile(r"""
[a-z]+ # one or more ASCII letters (the ignore-case flag is used)
\s # one whitespace character
\d\d? # one or two digits
,? # an optional comma
\s # one whitespace character
\d{4} # four digits (year)
""",re.IGNORECASE|re.VERBOSE)
result = pattern.search(text).group()
print(result)
Output:
January 1, 2018
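As a side note on why the original attempt returns so little: square brackets create a character class, so the bracketed pattern matches a single character. The following is a minimal sketch, not the answerer's code, comparing it with a compact form of the verbose pattern above on the sample string from the question. The names `bracketed`, `date_pattern` and `found` are illustrative only.

```python
import re

my_string = "January 01, 1990\n By SomeAuthor"

# Square brackets build a character class: this matches ONE character
# that is a word character, '*', '?', or whitespace.
bracketed = re.search(r"[\w*?\s\d\d\s\d\d\d\d]", my_string)
print(bracketed.group())  # J

# Same pattern as the verbose version, written on one line.
date_pattern = re.compile(r"[a-z]+\s\d\d?,?\s\d{4}", re.IGNORECASE)
found = date_pattern.search(my_string)
print(found.group())  # January 01, 1990
```

Dropping the brackets is the key change; the comma and the quantifiers then apply to the sequence rather than to one character.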

Try
In [992]: my_string = "January 01, 1990\n By SomeAuthor"
...: print(re.search(r"[A-Z][a-z]+\s+\d{1,2},\s+\d{4}", my_string))
...:
<_sre.SRE_Match object; span=(0, 16), match='January 01, 1990'>
[A-Z] is any uppercase letter
[a-z]+ is 1 or more lowercase letters
\s+ is 1 or more space characters
\d{1,2} is at least 1 and at most 2 digits
, is a literal comma
\d{4} is exactly 4 digits
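If the matched text needs to become an actual date, one possible follow-up is to hand each match to `datetime.strptime`. This is a sketch under assumptions: the helper name `find_dates` and the sample sentence are made up for illustration, and `%B` relies on the process locale using English month names.

```python
import re
from datetime import datetime

# Same pattern as in the answer above, written as a raw string.
DATE_RE = re.compile(r"[A-Z][a-z]+\s+\d{1,2},\s+\d{4}")

def find_dates(text):
    """Return a date object for every 'Month D, YYYY' match in text."""
    dates = []
    for raw in DATE_RE.findall(text):
        try:
            dates.append(datetime.strptime(raw, "%B %d, %Y").date())
        except ValueError:
            # The regex accepts things like 'Foo 99, 2018'; skip those.
            continue
    return dates

sample = "Posted January 01, 1990 and updated March 5, 2018."
print(find_dates(sample))
```

The regex only checks the shape of the text; `strptime` is what rejects impossible months and days.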

Try this:
re.search(r"\w+\s+\d\d?\s*,\s*\d{4}", date_string)

import re
my_string = "January 01, 1990\n By SomeAuthor"
regex = re.compile(r'\w+\s+\d+, \d{4}')
result = regex.search(my_string)
result is a match object holding the matched text and its character span (or None if nothing matched).

Related

Extract date from inside a string with Python

I have the following string, while the first letters can differ and can also be sometimes two, sometimes three or four.
PR191030.213101.ABD
Sometimes it looks like this:
PZA191030_392001_USB
I want to extract the 191030 and convert that to a valid date. I tried:
filename_without_ending.split(".")[0][-6:]
This solution is not valid, since the rest of the name might also differ from time to time. The only REAL pattern is the first six digits.
How do I do this?
Thank you!
You could get the first 6 digits using a pattern and a capturing group
^[A-Z]{2,4}(\d{6})\.
^ Start of string
[A-Z]{2,4} Match 2, 3 or 4 uppercase chars
( Capture group 1
\d{6} Match 6 digits
)\. Close group and match trailing dot
For example
import re
regex = r"^[A-Z]{2,4}(\d{6})\."
test_str = "PR191030.213101.ABD"
matches = re.search(regex, test_str)
if matches:
    print(matches.group(1))
Output
191030
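The question also asks to turn the six digits into a valid date, which the regex alone does not do. A minimal sketch follows, assuming the digits are in YYMMDD order (so 191030 means 2019-10-30; the question does not state this) and dropping the trailing `\.` so that names with an underscore separator also match. The function name `extract_date` is hypothetical.

```python
import re
from datetime import datetime

def extract_date(name):
    """Return the six digits after a 2-4 letter prefix as a date, or None."""
    m = re.match(r"[A-Z]{2,4}(\d{6})", name)
    if m is None:
        return None
    try:
        # Assumption: the digits are YYMMDD.
        return datetime.strptime(m.group(1), "%y%m%d").date()
    except ValueError:
        # Six digits that do not form a real date, e.g. '199999'.
        return None

print(extract_date("PR191030.213101.ABD"))
print(extract_date("PZA191030_392001_USB"))
```

If the real files use a different digit order, only the format string passed to `strptime` needs to change.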
You can do:
a = 'PR191030.213101.ABD'
int(''.join([c for c in a if c.isdigit()][:6]))
Output:
191030
This can also be done by:
filename_without_ending.split(".")[0][2::]
This slices the first dot-separated field from the third character to the end, so it only works when the prefix is exactly two letters.
Since the first letters can differ, we have to ignore the alphabetic characters and extract the digits.
Using the re module (for regular expressions), apply a regex pattern to the string. It returns the part of the string that matches the pattern.
'\d' matches the digits [0-9], and the + operator matches one or more of them.
findall() finds all occurrences of the pattern in a given string, while search() finds only the first occurrence.
import re
text = "PR191030.213101.ABD"  # renamed from str to avoid shadowing the built-in
print(re.findall(r"\d+", text)[0])
print(re.search(r"\d+", text).group())

Dealing with spaces in regex

I'm a RegEx newbie and this has been driving me nuts for the past 48 hours. I tried everything I could while reading hundreds of examples and documents. I want to learn.
I need to extract the month name from strings like these, with the month being the word in the middle (multilingual):
10 july 2014
9 dicembre2014
1januar2011
18août2002 (note: non-[A-z] character in the month if it matters)
The closest I got was [\D]{3,}(?=.{4,}) yielding:
' july '
' dicembre'
'januar'
'août'
But it still matches the spaces around the name. I tried adding [^\s] but obviously it's not that simple.
What is the simplest RegEx way to find the right match?
If you set the re.UNICODE flag (already the default for str patterns in Python 3), \w also matches letters from all scripts (including û, ñ, á, etc.). Then [^\W\d_] matches only letters, but from any script:
\w matches word characters (letters, digits or underscore "_")
\W is the negated shorthand, it matches non-word characters (same as [^\w])
\d matches digits
So [^\W\d_] will match anything EXCEPT non-word characters, digits or "_"... which means it will only match letters
Code:
#python 3.4.3
import re
text = u"10 july 2014 \n 9 dicembre2014 \n 1januar2011\n 18août2002"
pattern = r'([0-3]?\d)\s*([^\W\d_]{3,})\s*((?:\d{2}){1,2})'
result = re.findall(pattern, text, re.UNICODE)
for date in result:
    print(date)
Output:
('10', 'july', '2014')
('9', 'dicembre', '2014')
('1', 'januar', '2011')
('18', 'août', '2002')
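Since the question only asks for the month name, the letters-only class can also be used on its own. A small sketch, assuming Python 3 (where str patterns are Unicode-aware without any flag); the list `samples` and the name `month_re` are illustrative.

```python
import re

# The four inputs from the question; \u00fb is the letter 'û'.
samples = ["10 july 2014", "9 dicembre2014", "1januar2011", "18ao\u00fbt2002"]

# [^\W\d_] = word characters minus digits and underscore, i.e. letters only.
month_re = re.compile(r"[^\W\d_]+")

months = [month_re.search(s).group() for s in samples]
print(months)
```

Because spaces are not word characters, they are never part of the match, which avoids the surrounding whitespace seen with [\D]{3,}.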

Why won't my for-loop match any of my defined regular expressions?

We have just started to learn about regular expressions, but I'm not able to find any matches in my string when I try to use the regex to search. What am I doing wrong?
I created six separate strings (just for the look of it), concatenated them into one, and then I try to loop through the words of the split string and search for one of the regexes I declared.
Below, I translated the string, on the right side, just so you know what it says.
myString1 = "Skal vi moetes neste torsdag?" # - Shall we meet next Thursday
myString2 = "Hva med aa heller moetes mandag?" # - How about Monday
myString3 = "Hvordan gikk moetet forrige mandag?" # - How did the meeting on monday go?
myString4 = "Det gikk bra, vi skal moetes igjen tirsdag onsdag fredag lørdag søndag 13. september." # - It went well, we are meeting again on Sunday September 13th.
myString5 = "Altsaa, 13/09/2014?"
myString6 = "Ja, Sunday 13. september 2014." # - Yes, Sunday, September 13th 2014
myStringAll = (myString1 + myString2 + myString3 + myString4 + myString5 + myString6)
myWords = myStringAll.split()
regWeekDays = re.compile(r'^(man|tirs|ons|tors|fre|lør|søn)dag$', re.IGNORECASE)
regNextLast = re.compile(r'^[neste].$', re.IGNORECASE)
regDay = re.compile(r'^([0-2][0-9]|3[0-1])$')
regYear = re.compile(r'^([1-2][0-9][0-9][0-9])$')
for words in myWords:
    matches = re.findall(regNextLast, words)
    if matches:
        print(words)
There are several problems with your regular expressions, but the main one is that you use ^ and $ at the beginning and end of each expression. ^ means to match the beginning of the string and $ matches the end of a string. Unless your strings are strictly the length of the expressions, findall won't match anything.
An Example:
In [55]: re.findall(r'^a$', 'abcdefghijkl')
Out[55]: [] # "a" is not matched!
^ and $ should only be used to explicitly match the beginning and end of a string, respectively (or the end of a line in some cases, see the documentation). Strip these out and your expressions should start matching.
Here are some more specific problems:
In ^(man|tirs|ons|tors|fre|lør|søn)dag$ only the first part (man|tirs|ons|tors|fre|lør|søn) will be captured and returned by findall. Change this to a non capturing group so that the entire expression is returned: (?:man|tirs|ons|tors|fre|lør|søn)dag
In ^[neste].$, I assume you want to capture the string "neste". Currently you have a set [neste], which matches exactly one of the following characters: n, e, s, or t. Change this to simply neste. See the section on sets in the re documentation.
^([0-2][0-9]|3[0-1])$ is mostly fine. Aside from the ^ and $, you can omit the hyphen between 0 and 1, drop the parentheses, and use the digit symbol \d (equivalent to [0-9]): [0-2]\d|3[01]
Finally in '^([1-2][0-9][0-9][0-9])$', (again, aside from the ^ and $) the expression should work as expected, but you can make it more succinct. You can use the curly bracket syntax to specify repeats. Thus a string matching any year from 1000-2999 becomes:
[12]\d{3}
I recommend that you peruse the HOWTO on Regular Expressions.
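Put together, the suggested corrections might look like the sketch below. It runs on a shortened version of the question's text rather than on split words. The `\b` word boundaries around the day and year patterns are an addition of this sketch, not part of the answer; they keep the day pattern from matching inside "2014".

```python
import re

text = ("Skal vi moetes neste torsdag? Hva med aa heller moetes mandag? "
        "Ja, Sunday 13. september 2014.")

# Non-capturing group so findall returns the whole weekday name.
weekday_re = re.compile(r"(?:man|tirs|ons|tors|fre|lør|søn)dag", re.IGNORECASE)
next_re = re.compile(r"neste", re.IGNORECASE)
# \b is an assumption of this sketch (see the note above).
day_re = re.compile(r"\b(?:[0-2]\d|3[01])\b")
year_re = re.compile(r"\b[12]\d{3}\b")

print(weekday_re.findall(text))  # ['torsdag', 'mandag']
print(next_re.findall(text))     # ['neste']
print(day_re.findall(text))      # ['13']
print(year_re.findall(text))     # ['2014']
```

Without ^ and $, punctuation attached to a word (as in "torsdag?") no longer prevents a match.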
r'^[neste].$' is probably not what you meant. This regular expression looks for a string two characters in length where the first character is in ('e', 'n', 's', 't') followed by any other single character. None of the two-character substrings you've split out of your larger string match this pattern.
Maybe you could benefit from a tutorial: http://www.regular-expressions.info/tutorial.html

Python Regular Expression -- not matching digits at end of string

This will be really quick marks for someone...
Here's my string:
Jan 13.BIGGS.04222 ABC DMP 15
I'm looking to match:
the date at the front (mmm yy) format
the name in the second field
the digits at the end. There could be between one and three.
Here is what I have so far:
(\w{3} \d{2})\.(\w*)\..*(\d{1,3})$
Through a lot of playing around with http://www.pythonregex.com/ I can get to matching the '5', but not '15'.
What am I doing wrong?
Use .*? to match .* non-greedily:
In [9]: re.search(r'(\w{3} \d{2})\.(\w*)\..*?(\d{1,3})$', text).groups()
Out[9]: ('Jan 13', 'BIGGS', '15')
Without the question mark, .* matches as many characters as possible, including the digit you want to match with \d{1,3}.
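The difference is easy to see side by side. A short sketch using the string from the question; the variable names `greedy` and `lazy` are illustrative.

```python
import re

text = "Jan 13.BIGGS.04222 ABC DMP 15"

# Greedy .* grabs everything, then gives back only one character.
greedy = re.search(r"(\w{3} \d{2})\.(\w*)\..*(\d{1,3})$", text)
# Lazy .*? grows one character at a time until the rest can match.
lazy = re.search(r"(\w{3} \d{2})\.(\w*)\..*?(\d{1,3})$", text)

print(greedy.groups())  # ('Jan 13', 'BIGGS', '5')
print(lazy.groups())    # ('Jan 13', 'BIGGS', '15')
```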
As an alternative to what @unutbu proposed, you can also use the word boundary \b, which matches at the border of a word:
(\w{3} \d{2})\.(\w*)\..*\b(\d{1,3})$
From the site you referred:
>>> regex = re.compile(r"(\w{3} \d{2})\.(\w*)\..*\b(\d{1,3})$")
>>> regex.findall('Jan 13.BIGGS.04222 ABC DMP 15')
[(u'Jan 13', u'BIGGS', u'15')]
The .* before the digits is greedy and matches as much as it can, leaving as few digits as possible for the last group. You either need to make it non-greedy (with ?, as unutbu said) or keep it from swallowing the final digits, for example by requiring a non-digit (\D) immediately before the last group.
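A sketch of the non-digit approach on the question's string. Note that swapping . for \D wholesale does not work on this particular input, because the digit field 04222 sits between the second dot and the final number; requiring a single \D right before the last group does. The names `head`, `all_non_digits` and `guarded` are illustrative.

```python
import re

text = "Jan 13.BIGGS.04222 ABC DMP 15"
head = r"(\w{3} \d{2})\.(\w*)\."

# \D* cannot step over the digits in '04222', so nothing matches.
all_non_digits = re.search(head + r"\D*(\d{1,3})$", text)
print(all_non_digits)  # None

# Greedy .* followed by one mandatory non-digit leaves all of '15' for the group.
guarded = re.search(head + r".*\D(\d{1,3})$", text)
print(guarded.groups())  # ('Jan 13', 'BIGGS', '15')
```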

python regex: get end digits from a string

I am quite new to python and regex (regex newbie here), and I have the following simple string:
s=r"""99-my-name-is-John-Smith-6376827-%^-1-2-767980716"""
I would like to extract only the last digits in the above string i.e 767980716 and I was wondering how I could achieve this using python regex.
I wanted to do something similar along the lines of:
re.compile(r"""-(.*?)""").search(str(s)).group(1)
indicating that I want to find the stuff in between (.*?) which starts with a "-" and ends at the end of string - but this returns nothing..
I was wondering if anyone could point me in the right direction..
Thanks.
You can use re.match to capture only the trailing digits:
>>> import re
>>> s=r"""99-my-name-is-John-Smith-6376827-%^-1-2-767980716"""
>>> re.match('.*?([0-9]+)$', s).group(1)
'767980716'
Alternatively, re.finditer works just as well:
>>> next(re.finditer(r'\d+$', s)).group(0)
'767980716'
Explanation of all regexp components:
.*? is a non-greedy match and consumes as little as possible (a greedy match would consume everything except for the last digit).
[0-9] and \d are two different ways of matching digits. Note that the latter also matches digits in other scripts, like ୪ or ൨.
Parentheses (()) make the content of the expression a group, which can be retrieved with group(1) (or 2 for the second group, 0 for the whole match).
+ means one or more repetitions (at least one digit at the end).
$ matches only the end of the input.
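The remark about [0-9] versus \d can be checked directly. A minimal sketch, assuming Python 3 str patterns; the sample string is invented and uses the Oriya digit four (U+0B6A) mentioned above.

```python
import re

s = "id-\u0b6a\u0b6a"  # ends in two non-ASCII decimal digits

unicode_digits = re.search(r"\d+$", s)          # \d is Unicode-aware for str
ascii_class = re.search(r"[0-9]+$", s)          # only the ten ASCII digits
ascii_flag = re.search(r"\d+$", s, re.ASCII)    # re.ASCII restricts \d to [0-9]

print(bool(unicode_digits))  # True
print(bool(ascii_class))     # False
print(bool(ascii_flag))      # False
```

If the extracted text is later passed to code that expects only 0-9, [0-9] or the re.ASCII flag is the more predictable choice.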
Nice and simple with findall:
import re
s=r"""99-my-name-is-John-Smith-6376827-%^-1-2-767980716"""
print(re.findall('^.*-([0-9]+)$', s))
Output: ['767980716']
Regex Explanation:
^ # Match the start of the string
.* # Followed by anything
- # Up to the last hyphen
([0-9]+) # Capture the digits after the hyphen
$ # Up to the end of the string
Or, more simply, just match the digits at the end of the string: '([0-9]+)$'
Your Regex should be (\d+)$.
\d+ matches one or more digits.
$ matches at the end of the string.
So your code should be:
>>> s = "99-my-name-is-John-Smith-6376827-%^-1-2-767980716"
>>> import re
>>> re.compile(r'(\d+)$').search(s).group(1)
'767980716'
And you don't need to use str function here, as s is already a string.
Use the regex below:
\d+$
$ marks the end of the string
\d is a digit
+ matches the preceding element one or more times
Save the regular expressions for something that requires more heavy lifting.
>>> def parse_last_digits(line): return line.split('-')[-1]
>>> s = parse_last_digits(r"99-my-name-is-John-Smith-6376827-%^-1-2-767980716")
>>> s
'767980716'
I have been playing around with several of these solutions, but many seem to fail if there are no numeric digits at the end of the string. The following code should work.
import re
W = input("Enter a string:")
if re.match('.*?([0-9]+)$', W) is None:
    last_digits = "None"
else:
    last_digits = re.match('.*?([0-9]+)$', W).group(1)
print("Last digits of "+W+" are "+last_digits)
Try using \d+$ instead. That matches one or more numeric characters followed by the end of the string.
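With re.search, the "no digits at the end" case can be handled without matching twice, since search returns None when nothing is found. A small sketch; the function name `last_digits` is hypothetical, and [0-9] is used instead of \d to stay with ASCII digits.

```python
import re
from typing import Optional

def last_digits(text: str) -> Optional[str]:
    """Return the run of digits at the end of text, or None if there is none."""
    m = re.search(r"[0-9]+$", text)
    return m.group() if m else None

print(last_digits("99-my-name-is-John-Smith-6376827-%^-1-2-767980716"))
print(last_digits("no digits here"))
```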
