Identify and Extract Date from String - Python

I want to identify and extract a date from a number of different strings. The dates may not all be formatted the same way. I have been using the datefinder package, but I am having trouble saving the output.
Goal: Extract the date from a string, which may be formatted in a number of different ways (e.g. April,22 or 4/22 or 22-Apr), and append it to a list of dates. If the string contains no date, append 'None' instead.
Please see the examples below.
Example 1: (This returns a date, but does not get appended to my list)
import datefinder
extracted_dates = []
sample_text = 'As of February 27, 2019 there were 28 dogs at the kennel.'
matches = datefinder.find_dates(sample_text)
for match in matches:
    if match == None:
        date = 'None'
        extracted_dates.append(date)
    else:
        date = str(match)
        extracted_dates.append(date)
Example 2: (This does not return a date, and does not get appended to my list)
import datefinder
extracted_dates = []
sample_text = 'As of the date, there were 28 dogs at the kennel.'
matches = datefinder.find_dates(sample_text)
for match in matches:
    if match == None:
        date = 'None'
        extracted_dates.append(date)
    else:
        date = str(match)
        extracted_dates.append(date)

I tried your package, but I could not find a fast and general way to extract the actual date from your examples with it.
Instead I used the dateparser package, and more specifically its search_dates function.
I have only briefly tested it, and only on your examples.
from dateparser.search import search_dates
sample_text = 'As of February 27, 2019 there were 28 dogs at the kennel.'
extracted_dates = []
# Returns a list of tuples of (substring containing the date, datetime.datetime object)
dates = search_dates(sample_text)
if dates is not None:
    for d in dates:
        extracted_dates.append(str(d[1]))
else:
    extracted_dates.append('None')
print(extracted_dates)
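A likely reason the loops in the question never append 'None': datefinder.find_dates returns a generator, so when no date is found the body of the for loop simply never runs, and the `match == None` branch is never reached. The sketch below illustrates the pattern of materialising the matches first and then checking for emptiness. To stay runnable with the standard library only, it uses find_dates_stub, a hypothetical stand-in that only recognises dates written like "February 27, 2019"; it is not datefinder's implementation.

```python
import re
from datetime import datetime

def find_dates_stub(text):
    """Hypothetical stand-in for datefinder.find_dates: a generator of datetimes."""
    for m in re.finditer(r"[A-Z][a-z]+ \d{1,2}, \d{4}", text):
        try:
            yield datetime.strptime(m.group(), "%B %d, %Y")
        except ValueError:
            continue  # looked like a date but was not a valid one

def extract_dates(text):
    # Turn the generator into a list so that "no matches" can be detected.
    matches = list(find_dates_stub(text))
    if not matches:
        return ['None']
    return [str(m) for m in matches]

extracted_dates = []
for sample_text in (
    'As of February 27, 2019 there were 28 dogs at the kennel.',
    'As of the date, there were 28 dogs at the kennel.',
):
    extracted_dates.extend(extract_dates(sample_text))
print(extracted_dates)
```

With the real package, the same shape should apply by replacing find_dates_stub with datefinder.find_dates.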

Related

How to identify multiple dates within a python string?

For a given string, I want to identify the dates in it.
import datefinder
string = "/plot 23/01/2023 24/02/2021 /cmd"
matches = list(datefinder.find_dates(string))
if matches:
    print(matches)
else:
    print("no date in string")
Output:
no date in string
However, there are clearly dates in the string. Ultimately I want to identify the oldest date and store it in a variable Date1, and the newest date and store it in a variable Date2.
I believe that if a string contains multiple dates, datefinder is unable to parse it. In your case, splitting the string using string.split() and applying the find_dates method should do the job.
You've only given 1 example, but based on that example, you can use regex.
import re
from datetime import datetime
string = "/plot 23/01/2023 24/02/2021 /cmd"
dates = [datetime.strptime(d, "%d/%m/%Y") for d in re.findall(r"\d{2}/\d{2}/\d{4}", string)]
print(f"earliest: {min(dates)}, latest: {max(dates)}")
Output
earliest: 2021-02-24 00:00:00, latest: 2023-01-23 00:00:00
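The splitting idea mentioned above can also be sketched without regex: split on whitespace, try to parse each token, and keep the ones that parse. This is a minimal sketch that assumes every date is a standalone token in DD/MM/YYYY form; dates_in is a hypothetical helper name, and Date1 and Date2 are the variable names from the question.

```python
from datetime import datetime

def dates_in(text, fmt="%d/%m/%Y"):
    """Return every whitespace-separated token of text that parses with fmt."""
    found = []
    for token in text.split():
        try:
            found.append(datetime.strptime(token, fmt))
        except ValueError:
            pass  # not a date in this format, skip the token
    return found

dates = dates_in("/plot 23/01/2023 24/02/2021 /cmd")
# Oldest date goes into Date1, newest into Date2; both are None if nothing parsed.
Date1, Date2 = (min(dates), max(dates)) if dates else (None, None)
print(Date1, Date2)
```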

replace the date section of a string in python

I have a string 'Tpsawd_20220320_default_economic_v5_0.xls'.
I want to replace the date part (20220320) with a date variable (i.e. if I define date = 20220410, it should replace 20220320 with this date). How should I do it with built-in Python packages? Please note that the location of the date in the string can vary: it might be 'Tpsawd_default_economic_v5_0_20220320.xls' or 'Tpsawd_default_economic_20220320_v5_0.xls'.
Yes, this can be done with regex fairly easily.
import re
s = 'Tpsawd_20220320_default_economic_v5_0.xls'
date = '20220410'
s = re.sub(r'\d{8}', date, s)
print(s)
Output:
Tpsawd_20220410_default_economic_v5_0.xls
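This replaces every run of 8 consecutive digits with the given string, in this case date. To replace only the first such run, pass count=1 to re.sub. Note that \d{8} also matches the first 8 digits of a longer number.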
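If the file names might also contain longer digit runs (an ID, for example), a slightly stricter pattern can help. The sketch below assumes the date is exactly 8 digits with no digit directly before or after it; replace_date is a hypothetical helper name.

```python
import re

def replace_date(name, new_date):
    """Replace the first standalone 8-digit run in name with new_date."""
    # (?<!\d) and (?!\d) make sure the 8 digits are not part of a longer number.
    return re.sub(r"(?<!\d)\d{8}(?!\d)", new_date, name, count=1)

for name in (
    'Tpsawd_20220320_default_economic_v5_0.xls',
    'Tpsawd_default_economic_v5_0_20220320.xls',
    'Tpsawd_default_economic_20220320_v5_0.xls',
):
    print(replace_date(name, '20220410'))
```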

How to filter a list based on a substring in each element?

I have an inconsistent list of strings that contain dates in different formats. I need to determine the date in each string.
My list/array looks like:
dates_list = []
my_array = [
    '5364345354_01.05.2019.pdf',
    '5364344354_ 01.05.2019.pdf',
    '5345453454 - 21.06.2019.pdf',
    '4675535643 - 19 June 2019.docx',
    '57467874 25.06.18.pdf',
    '6565653635_20 March 2019.txt',
    '252252452_31.1.2019.txt'
]
I've tried a for loop and splitting each string, but each string has a different delimiter before the date. What is a plausible way to find the date in each string of this inconsistent list? The only regularity I can see is that the date is always positioned at the end of the string.
Well, it is not the best way to do it, but it may solve your problem, and you can adapt it further:
import os
dates_list = []
my_array = [
    '5364345354_01.05.2019.pdf',
    '5364344354_ 01.05.2019.pdf',
    '5345453454 - 21.06.2019.pdf',
    '4675535643 - 19 June 2019.docx',
    '57467874 25.06.18.pdf',
    '6565653635_20 March 2019.txt',
    '252252452_31.1.2019.txt'
]
for i in my_array:
    # Drop the leading run of digits, one character at a time.
    for j in i:
        if '0' <= j <= '9':
            i = i.replace(j, "", 1)
        else:
            break
    # Remove the extension and the separators, then trim leftover spaces.
    print(os.path.splitext(i)[0].replace("_", "").replace("-", "").strip())
output :
01.05.2019
01.05.2019
21.06.2019
19 June 2019
25.06.18
20 March 2019
31.1.2019
It is still unclear what you want to do with the dates, or whether you want them in a consistent format; all your question states is that you want to extract the date from each file name. You can do this with a regex based on your samples, which you say are the only 7 formats you have.
my_array = [
    '5364345354_01.05.2019.pdf',
    '5364344354_ 01.05.2019.pdf',
    '5345453454 - 21.06.2019.pdf',
    '4675535643 - 19 June 2019.docx',
    '57467874 25.06.18.pdf',
    '6565653635_20 March 2019.txt',
    '252252452_31.1.2019.txt'
]
import re
for filename in my_array:
    date = re.search(r'(\d{1,2}([.\s])(?:\d{1,2}|\w+)\2\d{2,4})', filename).group()
    print(f"The date '{date}' was extracted from the file name '{filename}'")
OUTPUT
The date '01.05.2019' was extracted from the file name '5364345354_01.05.2019.pdf'
The date '01.05.2019' was extracted from the file name '5364344354_ 01.05.2019.pdf'
The date '21.06.2019' was extracted from the file name '5345453454 - 21.06.2019.pdf'
The date '19 June 2019' was extracted from the file name '4675535643 - 19 June 2019.docx'
The date '25.06.18' was extracted from the file name '57467874 25.06.18.pdf'
The date '20 March 2019' was extracted from the file name '6565653635_20 March 2019.txt'
The date '31.1.2019' was extracted from the file name '252252452_31.1.2019.txt'
The datetime module is helpful when working with dates and date formats, and may help you convert your various date formats to a single format.
Extra characters, like the number in front of the dates, still need to be stripped manually. Other answers have already pointed out several ways to do it; here I propose my own, which does not require regex. I assume the patterns are the ones shown in your example; any other patterns need to be added to the code.
Once the numbers at the beginning of the strings and the file extension are discarded, datetime.strptime() is used to read the date and create a datetime object.
Then datetime.strftime() is used to get back a string representing the date in a single, consistent format.
import datetime
my_array = [
    '5364345354_01.05.2019.pdf',
    '5364344354_ 01.05.2019.pdf',
    '5345453454 - 21.06.2019.pdf',
    '4675535643 - 19 June 2019.docx',
    '57467874 25.06.18.pdf',
    '6565653635_20 March 2019.txt',
    '252252452_31.1.2019.txt'
]
def multiformat(string, format_list, format_res):
    # Pick the delimiter that separates the leading number from the date.
    delim = None
    if '_' in string:
        delim = '_'
    elif '-' in string:
        delim = '-'
    else:
        delim = ' '
    # Keep the part after the delimiter and drop the file extension.
    strdate = string.split(delim)[1].strip().split('.')[:-1]
    txtdate = ' '.join(strdate)
    date = None
    # Try each known format until one parses.
    for frm in format_list:
        try:
            date = datetime.datetime.strptime(txtdate, frm)
            break
        except ValueError:
            pass
    # None signals that no format matched.
    return date.strftime(format_res) if date else None
format_dates = ['%d %m %Y', '%d %m %y', '%d %B %Y']
dates_list = list(map(lambda x : multiformat(x, format_dates, '%d-%m-%Y'), my_array))
print(dates_list)
This prints:
['01-05-2019', '01-05-2019', '21-06-2019', '19-06-2019', '25-06-2018', '20-03-2019', '31-01-2019']
This can be solved with regex. The pattern I'm using here works in this case, but it's not pretty.
import re
regex = re.compile(r'\d{1,2}(\.| )\w+\1\d{2,4}')
for f in my_array:
    print(regex.search(f).group())
Output:
01.05.2019
01.05.2019
21.06.2019
19 June 2019
25.06.18
20 March 2019
31.1.2019
Broken down:
\d{1,2} - One or two digits
(\.| ) ... \1 - A dot or a space, then the same again
\w+ - One or more letters, digits, or underscores
\d{2,4} - Two to four digits
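The pattern broken down above can be combined with datetime.strptime to bring every match into a single format. The following is a minimal sketch under the assumption that only the three layouts seen in the sample list occur; normalise, DATE_RE, FORMATS and the ISO output format are hypothetical choices, not part of the original answer.

```python
import re
from datetime import datetime

DATE_RE = re.compile(r"\d{1,2}([. ])\w+\1\d{2,4}")
# Layouts assumed from the sample file names; the 4-digit year is tried first.
FORMATS = ("%d.%m.%Y", "%d.%m.%y", "%d %B %Y")

def normalise(filename, out_fmt="%Y-%m-%d"):
    """Extract the date from filename and return it in out_fmt, or None."""
    match = DATE_RE.search(filename)
    if match is None:
        return None
    for fmt in FORMATS:
        try:
            return datetime.strptime(match.group(), fmt).strftime(out_fmt)
        except ValueError:
            continue  # wrong layout, try the next one
    return None

my_array = [
    '5364345354_01.05.2019.pdf',
    '5364344354_ 01.05.2019.pdf',
    '5345453454 - 21.06.2019.pdf',
    '4675535643 - 19 June 2019.docx',
    '57467874 25.06.18.pdf',
    '6565653635_20 March 2019.txt',
    '252252452_31.1.2019.txt',
]
dates_list = [normalise(name) for name in my_array]
print(dates_list)
```

Returning None rather than raising keeps one unparseable file name from stopping the whole run.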
You could try this. It is a little hacky, but you do have some variation in your date formats :)
import re
mons = {'January':'01','February':'02','March':'03','April':'04','May':'05','June':'06','July':'07','August':'08','September':'09','October':'10','November':'11','December':'12'}
unformatted = [re.sub(r'\d{5,}|\s-\s|_|\s','',d.rsplit('.',1)[0]).replace('.','-') for d in my_array]
output:
['01-05-2019', '01-05-2019', '21-06-2019', '19June2019', '25-06-18', '20March2019', '31-1-2019']
for i,d in enumerate(unformatted):
    if any(c.isalpha() for c in d):
        key = re.search('[a-zA-Z]+',d).group()
        unformatted[i] = d.replace(key,'-'+mons[key]+'-')
    if len(d.split('-')[-1])==2:
        yr = d.split('-')[-1]
        unformatted[i] = d[:-2]+'20'+yr
#was having issues getting this one to work in the same loop..but:
for i,d in enumerate(unformatted):
    if len(d.split('-')[1])==1:
        mnth = d.split('-')[1]
        unformatted[i] = d[:3]+'0'+mnth+d[-5:]
output:
['01-05-2019', '01-05-2019', '21-06-2019', '19-06-2019', '25-06-2018', '20-03-2019', '31-01-2019']
This not only extracts the date from each entry, but also puts them all into the same format, so you can use them in pandas or whatever you need to do with them afterwards.
If the provided example contains all variations of the dates, this should work; if not, some minor adaptations should get it working.

Sorting by month-year groups by month instead

I have a curious Python problem.
The script takes two CSV files: one with a date column and a text-snippet column, and another with a list of names (substrings).
All the code does is step through both lists, building up a matrix of name mentions per month.
FILE with dates and text: (Date, Snippet first column)
ENTRY 1 : Sun 21 nov 2014 etc, The release of the iphone 7 was...
-strings file
iphone 7
apple
apples
innovation etc.
The problem is that when I try to order it so that the columns follow in ascending order, e.g. oct-2014, nov-2014, dec-2014 and so on, it just groups the months together instead, which isn't what I want.
import csv
from datetime import datetime
file_1 = input('Enter first CSV name (one with the date and snippet): ')
file_2 = input('Enter second CSV name (one with the strings): ')
outp = input('Enter the output CSV name: ')
file_1_list = []
head = True
for row in csv.reader(open(file_1, encoding='utf-8', errors='ignore')):
    if head:
        head = False
        continue
    date = datetime.strptime(row[0].strip(), '%a %b %d %H:%M:%S %Z %Y')
    date_str = date.strftime('%b %Y')
    file_1_list.append([date_str, row[1].strip()])
file_2_dict = {}
for line in csv.reader(open(file_2, encoding='utf-8', errors='ignore')):
    s = line[0].strip()
    for d in file_1_list:
        if s.lower() in d[1].lower():
            if s in file_2_dict.keys():
                if d[0] in file_2_dict[s].keys():
                    file_2_dict[s][d[0]] += 1
                else:
                    file_2_dict[s][d[0]] = 1
            else:
                file_2_dict[s] = {
                    d[0]: 1
                }
months = []
for v in file_2_dict.values():
    for k in v.keys():
        if k not in months:
            months.append(k)
months.sort()
rows = [[''] + months]
for k in file_2_dict.keys():
    tmp = [k]
    for m in months:
        try:
            tmp.append(file_2_dict[k][m])
        except:
            tmp.append(0)
    rows.append(tmp)
print("still working on it be patient")
writer = csv.writer(open(outp, "w", encoding='utf-8', newline=''))
for r in rows:
    writer.writerow(r)
print('Done...')
As I understand it, months.sort() isn't doing what I expect?
I have looked at another answer, where they apply a different function to sort the data, using attrgetter,
from operator import attrgetter
>>> l = [date(2014, 4, 11), date(2014, 4, 2), date(2014, 4, 3), date(2014, 4, 8)]
and then
sorted(l, key=attrgetter('month'))
But I am not sure whether that would work for me.
As I understand it, I parse the dates on lines 12-13 of the script. Am I missing a step that orders the data first, like
data = sorted(data, key = lambda row: datetime.strptime(row[0], "%b-%y"))
I have only just started learning Python, and so many things are new to me that I don't know what is right and what isn't.
What I want(of course with the correctly sorted data):
This took a while because you had so much unrelated stuff about reading csv files and finding and counting tags. But you already have all that, and it should have been completely excluded from the question to avoid confusing people.
It looks like your actual question is "How do I sort dates?"
Of course "Apr-16" comes before "Oct-14", didn't they teach you the alphabet in school? A is the first letter! I'm just being silly to emphasize a point -- it's because they are simple strings, not dates.
You need to convert the string to a date with the datetime class method strptime, as you already noticed. Because the class has the same name as the module, you need to pay attention to how it is imported. You then go back to a string later with the member method strftime on the actual datetime (or date) instance.
Here's an example:
from datetime import datetime
unsorted_strings = ['Apr-16', 'Oct-14', 'Dec-15']
unsorted_dates = [datetime.strptime(value, '%b-%y') for value in unsorted_strings]
sorted_dates = sorted(unsorted_dates)
sorted_strings = [value.strftime('%b-%y') for value in sorted_dates]
print(sorted_strings)
['Oct-14', 'Dec-15', 'Apr-16']
or skipping to the end
from datetime import datetime
unsorted_strings = ['Apr-16', 'Oct-14', 'Dec-15']
print(sorted(unsorted_strings, key = lambda x: datetime.strptime(x, '%b-%y')))
['Oct-14', 'Dec-15', 'Apr-16']
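Applied to the script in the question, the same idea would replace the plain months.sort() call. The labels there are built with strftime('%b %Y'), so the sort key has to parse that format. A minimal sketch, with a made-up months list standing in for the one the script collects:

```python
from datetime import datetime

# Hypothetical labels in the '%b %Y' form the question's script produces.
months = ['Apr 2015', 'Dec 2014', 'Nov 2014', 'Oct 2014', 'Jan 2015']

# Sort chronologically by parsing each label, instead of alphabetically.
months.sort(key=lambda label: datetime.strptime(label, '%b %Y'))
print(months)
```

The labels stay strings, so the rest of the script (building the header row and looking up counts by label) would not need to change.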

Python: Parse String as Date with Formatting

A user can input a string and the string contains a date in the following formats MM/DD/YY or MM/DD/YYYY. Is there an efficient way to pull the date from the string? I was thinking of using RegEx for \d+\/\d+\/\d+. I also want the ability to be able to sort the dates. I.e. if the strings contain 8/17/15 and 08/16/2015, it would list the 8/16 date first and then 8/17
Have a look at datetime.strptime: it's a built-in class method that creates a datetime object from a string. It takes the string to convert and the format the date is written in.
from datetime import datetime
def str_to_date(string):
    # Pick the format from the length of the year part, so that
    # unpadded dates such as '8/6/2015' are handled too.
    year_part = string.rsplit('/', 1)[-1]
    pattern = '%m/%d/%Y' if len(year_part) == 4 else '%m/%d/%y'
    try:
        return datetime.strptime(string, pattern).date()
    except ValueError:
        raise  # TODO: handle invalid input
The function returns a date() object, which can be compared directly with other date() objects (e.g. when sorting them).
Usage:
>>> d1 = str_to_date('08/13/2015')
>>> d2 = str_to_date('08/12/15')
>>> d1
datetime.date(2015, 8, 13)
>>> d2
datetime.date(2015, 8, 12)
>>> d1 > d2
True
Update
OP explained in a comment that strings such as 'foo 08/13/2015 bar' should not be automatically thrown away, and that the date should be extracted from them.
To achieve that, we must first search for a candidate string in user's input:
import re
from datetime import date
user_string = input('Enter something')  # use raw_input() in Python 2.x
# One or two digits for month and day; the 4-digit year alternative is tried first.
pattern = re.compile(r'(\d{1,2})/(\d{1,2})/(\d{4}|\d{2})')
match = re.search(pattern, user_string)
if not match:
    d = None
else:
    month, day, year = map(int, match.groups())
    if year < 100:
        year += 2000  # assumption: two-digit years mean 20YY
    try:
        d = date(year, month, day)
    except ValueError:
        d = None  # or handle error in a different way
print(d)
The code reads user input and then tries to find a pattern in it that represents a date in MM/DD/YYYY or MM/DD/YY format. Note that the last capturing group (in parentheses, i.e. ()) checks for either four or two consecutive digits.
If it finds a candidate date, it unpacks the capturing groups in the match, converting them to integers at the same time. It then uses the three matched pieces to try to create a new date() object. If that fails, the candidate date was invalid, e.g. '02/31/2015'.
Footnotes:
the code will only catch the first date candidate in the input
the regular expression used will, in its current form, also match dates in inputs like '12308/13/2015123'. If this is not desirable it would have to be modified, probably adding some lookahead/lookbehind assertions.
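Both footnotes can be addressed together. The sketch below is one possible variant, not the answer's own code: it adds lookbehind/lookahead assertions so that dates glued to other digits are ignored, and it uses finditer to collect every candidate instead of only the first. find_dates is a hypothetical helper name, and mapping two-digit years to 20YY is an assumption.

```python
import re
from datetime import date

# (?<!\d) / (?!\d): the date must not be preceded or followed by another digit.
DATE_RE = re.compile(r'(?<!\d)(\d{1,2})/(\d{1,2})/(\d{4}|\d{2})(?!\d)')

def find_dates(text):
    """Return all valid MM/DD/YY or MM/DD/YYYY dates found in text."""
    found = []
    for m in DATE_RE.finditer(text):
        month, day, year = map(int, m.groups())
        if year < 100:
            year += 2000  # assumption: two-digit years mean 20YY
        try:
            found.append(date(year, month, day))
        except ValueError:
            pass  # e.g. 02/31/2015 is not a real date
    return found

dates = find_dates('first 8/17/15, then 08/16/2015, never 02/31/2015')
print(sorted(dates))
```

Because date objects compare chronologically, sorted() lists the 8/16 date before the 8/17 date, as the question asks.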
You could also try strptime:
import time
dates = ('08/17/15', '8/16/2015')
for date in dates:
    print(date)
    ret = None
    try:
        ret = time.strptime(date, "%m/%d/%Y")
    except ValueError:
        ret = time.strptime(date, "%m/%d/%y")
    print(ret)
UPDATE (after comments)
This way you get a valid date back, or None if the date cannot be parsed:
import time
dates = ('08/17/15', '8/16/2015', '02/31/15')
for date in dates:
    print(date)
    ret = None
    try:
        ret = time.strptime(date, "%m/%d/%Y")
    except ValueError:
        try:
            ret = time.strptime(date, "%m/%d/%y")
        except ValueError:
            pass
    print(ret)
UPDATE 2
One more update after the comments about the requirements.
This version only takes care of the dates, not the text before or after them, but that text can easily be extracted using the regex group:
import re
import time
dates = ('foo 1 08/17/15', '8/16/2015 bar 2', 'foo 3 02/31/15 bar 4')
for date in dates:
    print(date)
    match = re.search(r'(?P<date>[0-9]+/[0-9]+/[0-9]+)', date)
    date_str = match.group('date')
    ret = None
    try:
        ret = time.strptime(date_str, "%m/%d/%Y")
    except ValueError:
        try:
            ret = time.strptime(date_str, "%m/%d/%y")
        except ValueError:
            pass
    print(ret)
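The same approach can be wrapped in a function that also copes with input containing no date at all (where match would be None) and returns a datetime.date, which is easier to compare and sort than a struct_time. This is a sketch only; parse_embedded is a hypothetical name.

```python
import re
import time
from datetime import date

def parse_embedded(text):
    """Return the first MM/DD/YY(YY) date in text as a date, or None."""
    match = re.search(r'[0-9]+/[0-9]+/[0-9]+', text)
    if match is None:
        return None  # no date-like substring at all
    for fmt in ("%m/%d/%Y", "%m/%d/%y"):
        try:
            parsed = time.strptime(match.group(), fmt)
        except ValueError:
            continue  # wrong format or impossible date, try the next format
        return date(parsed.tm_year, parsed.tm_mon, parsed.tm_mday)
    return None

for text in ('foo 1 08/17/15', '8/16/2015 bar 2', 'foo 3 02/31/15 bar 4', 'no date'):
    print(text, '->', parse_embedded(text))
```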
Why not use strptime to store them as datetime objects? These objects can easily be compared and sorted.
import datetime
dateList = []
for text in ("08/04/15", "08/03/2015"):
    try:
        date = datetime.datetime.strptime(text, "%m/%d/%Y")
    except ValueError:
        # Fall back to the two-digit-year format.
        date = datetime.datetime.strptime(text, "%m/%d/%y")
    dateList.append(date)
Note the difference between %Y and %y. You can then just compare dates made this way to see which ones are greater or less. You can also sort it using dateList.sort()
If you want the date as a string again you can use:
>>> dateString = date.strftime("%Y-%m-%d")
>>> print(dateString)
2015-08-03
Why bother with regex when you can use datetime.strptime?
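You can use the date parser from pandas. Older releases exposed it as pd.datetools.parse; that module has since been removed, and pd.to_datetime is the current entry point.
import pandas as pd
timestr = ['8/8/95', '8/15/2014']
>>> [pd.to_datetime(d) for d in timestr]
[Timestamp('1995-08-08 00:00:00'), Timestamp('2014-08-15 00:00:00')]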
Using regex groups we'd get something like this:
import re
ddate = '08/16/2015'
reg = re.compile(r'(\d+)/(\d+)/(\d+)')
matching = reg.match(ddate)
if matching is not None:
    print(matching.groups())
Would yield
('08', '16', '2015')
You could parse this afterwards, but if you want to get rid of leading zeros in the first place you could use
reg = re.compile(r'0*(\d+)/0*(\d+)/(\d+)')
