Extract date from file name with import re in python - python

My file name looks like this:
show_data_paris_12112019.xlsx
I want to extract the date only and I have tried this script:
date = os.path.basename(xls)
pattern = r'(?<=_)_*(?=\.xlsx)'
re.search(pattern, date).group(0)
event_date = re.search(pattern, date).group(0)
event_date_obj = datetime.strptime (event_date, '%Y%m%d')
but it gives me errors. How can I fix this?
Thank you.

It looks to me like the regex you're using is also at fault: it matches nothing, so re.search returns None and calling group(0) on it fails.
Assuming all your dates are stored as digits, the following regex seems to work quite well:
(?!.+_)\d+(?=\.xlsx)
The next issue is the format string passed to strptime: '%Y%m%d' expects the year first, but to me 12112019 looks like 12/11/2019 (day, month, year). It could also be 11/12/2019 (month, day, year); either way, the format string has to change.
So for the day / month / year format we would use:
# %d%m%Y
event_date_obj = datetime.strptime(event_date, '%d%m%Y')
And we would simply swap %d and %m for the month / day / year format. So your complete code would look something like this:
import os
import re
from datetime import datetime
date = os.path.basename(xls)  # xls is the path to the Excel file
pattern = r"(?!.+_)\d+(?=\.xlsx)"
event_date = re.search(pattern, date).group(0)
event_date_obj = datetime.strptime(event_date, '%d%m%Y')
For further information on the format codes shared by strftime and strptime, see https://strftime.org/.
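To see why the day/month order matters, here is a small sketch that parses the same eight digits both ways. The digits come from the question's file name; the variable names are only illustrative.

```python
from datetime import datetime

digits = "12112019"  # date part of show_data_paris_12112019.xlsx

# Day first: read as 12 November 2019
day_first = datetime.strptime(digits, "%d%m%Y")
# Month first: read as 11 December 2019
month_first = datetime.strptime(digits, "%m%d%Y")

print(day_first.date())
print(month_first.date())
```

Both calls succeed, so only knowledge of how the files were named can tell you which format is the right one.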

_* matches a sequence of zero or more underscore characters.
(?<=_) means that it has to be preceded by an underscore.
(?=\.xlsx) means that it has to be followed by .xlsx.
So this will match the ___ in foo____.xlsx. But it doesn't match your filename, because the date digits sit between the underscore and .xlsx.
You should match \d+ rather than _* between the lookbehind and lookahead.
pattern = r'(?<=_)\d+(?=\.xlsx)'
And if the date is always 8 digits, use \d{8} to be more specific.
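Putting the corrected lookaround pattern together with the parsing step might look like the sketch below. The path is made up for illustration, and the DDMMYYYY reading of the digits is an assumption.

```python
import os
import re
from datetime import datetime

# Hypothetical path, used only for illustration
xls = "/tmp/reports/show_data_paris_12112019.xlsx"

name = os.path.basename(xls)
# Exactly eight digits, preceded by "_" and followed by ".xlsx"
match = re.search(r"(?<=_)\d{8}(?=\.xlsx)", name)

if match is None:
    raise ValueError(f"no date found in {name!r}")

# Assumes the digits are day, month, year (DDMMYYYY)
event_date_obj = datetime.strptime(match.group(0), "%d%m%Y")
print(event_date_obj.date())
```

Checking for None before calling group(0) avoids the AttributeError that the original pattern triggered.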

Related

Extract date from strings that contains names+dates

I need to extract the dates from a series of strings like this:
'MIHAI MĂD2Ă3.07.1958'
or
'CLAUDIU-MIHAI17.12.1999'
How to do this?
Tried this:
for index, row in DF.iterrows():
    try:
        if math.isnan(row['Data_Nasterii']):
            match = re.search(r'\d{2}.\d{2}.\d{4}', row['Prenume'])
            date = datetime.strptime(match.group(), '%d.%m.%Y').date()
            s = datetime.strftime(datetime.strptime(str(date), '%Y-%m-%d'), '%d-%m-%Y')
            row['Data_Nasterii'] = s
    except TypeError:
        pass
In a regex, an unescaped . (dot) does not mean a literal dot; it matches any character, so it needs to be escaped (\.) to match an actual dot. Other than that, your first group is \d{2}, but some of your dates have a single-digit day. I would use the following:
re.search(r'(\d+\.\d+\.\d+)', row['Prenume'])
which means at least one digit, followed by a dot, followed by at least one digit, and so on.
If you have some stray characters mixed into the day, you can try the following (subpar) solution:
''.join(re.search(r'(\d*)(?:[^0-9\.]*)(\d*\.\d+\.\d+)', row['Prenume']).groups())
This skips over at most one block of non-digit characters inside the day. It's not pretty, but it works (and returns a string).
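As a rough check, the two-group pattern can be run against both sample strings from the question and the cleaned result handed to strptime. This is only a sketch; the list names are illustrative.

```python
import re
from datetime import datetime

pattern = r'(\d*)(?:[^0-9\.]*)(\d*\.\d+\.\d+)'
samples = ['MIHAI MĂD2Ă3.07.1958', 'CLAUDIU-MIHAI17.12.1999']

cleaned = []
for text in samples:
    m = re.search(pattern, text)
    if m:
        # Join the leading day digit(s) with the rest of the date
        cleaned.append(''.join(m.groups()))

# Parse the cleaned strings as day.month.year
dates = [datetime.strptime(s, '%d.%m.%Y').date() for s in cleaned]
print(cleaned)
print(dates)
```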
You can use the str accessor along with a regex (str.extract needs at least one capture group):
DF['Prenume'].str.extract(r'(\d{1,2}\.\d{2}\.\d{4})')
You need to escape the dot (.) as \. or put it inside a character class, "[.]". An unescaped dot is a metacharacter in regex that matches any character.
e.g. r'[0-9]{2}[.][0-9]{2}[.][0-9]{4}' or r'\d{2}\.\d{2}\.\d{4}'
text = 'CLAUDIU-MIHAI17.12.1999'
pattern = r'\d{2}\.\d{2}\.\d{4}'
if re.search(pattern, text):
    print("yes")
Another good solution could be using dateutil.parser:
import pandas as pd
import dateutil.parser as dparser
df = pd.DataFrame({'A': ['MIHAI MĂD2Ă3.07.1958',
                         'CLAUDIU-MIHAI17.12.1999']})
df['userdate'] = df['A'].apply(lambda x: dparser.parse(x.encode('ascii',errors='ignore'),fuzzy=True))
Output:
                         A   userdate
0     MIHAI MĂD2Ă3.07.1958 1958-07-23
1  CLAUDIU-MIHAI17.12.1999 1999-12-17

How to detect dash or underscore in datetime string to use in strptime?

I have several thousand files which feature datetime in their file name.
Sadly, the divider between the datetime blocks is not always the same.
Example:
Data_trul-100A1-Berlin_2019-01-31_150430.dat
Data_tral-2000B2-Frankf-2018_02_27-190200.dat
Data_bash-300003_Hambrg_2017-04-12_210500.dat
I managed to find the datetime part in the string with a regular expression
import re
strings = ['Data_trul-100A1-Berlin_2019-01-31_150430.dat',
           'Data_tral-2000B2-Frankf-2018_02_27-190200.dat',
           'Data_bash-300003_Hambrg_2017-04-12_210500.dat']
for part_string in strings:
    match = re.search(r'\d{4}[-_]\d{2}[-_]\d{2}[-_]\d{6}', part_string)
    print(match.group())
However, I am now stuck converting the matched group to a datetime
from datetime import datetime
date = datetime.strptime(match.group(), "%Y-%m-%d_%H%M%S")
because I need to specify dashes or underscores.
I came up with the following solution to just replace it, but that feels like cheating.
for part_string in strings:
    part_string = part_string.replace('-', "_")
    match = re.search(r'\d{4}_\d{2}_\d{2}_\d{6}', part_string)
    date = datetime.strptime(match.group(), "%Y_%m_%d_%H%M%S")
    print(date)
Is there a more elegant way, such as using a regex to find the divider and pass it on to strptime?
You could change your regular expression to capture four separate elements:
match = re.search(r'(\d{4})[-_](\d{2})[-_](\d{2})[-_](\d{6})', part_string)
Then combine them into one standard string format:
fixedstring = "{}_{}_{}_{}".format(*match.groups())
date = datetime.strptime(fixedstring, "%Y_%m_%d_%H%M%S")
Of course, at this point you could just split the HHMMSS part of the time into its own elements and build the datetime object directly:
m = re.search(r'(\d{4})[-_](\d{2})[-_](\d{2})[-_](\d{2})(\d{2})(\d{2})', part_string)
# Groups are numbered from 1 (group(0) is the whole match) and hold strings
date = datetime(year=int(m.group(1)),
                month=int(m.group(2)),
                day=int(m.group(3)),
                hour=int(m.group(4)),
                minute=int(m.group(5)),
                second=int(m.group(6)))
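A compact, self-contained version of the direct-construction idea, run over the three file names from the question, could look like this. It is a sketch that assumes every name contains exactly one timestamp in this layout.

```python
import re
from datetime import datetime

strings = ['Data_trul-100A1-Berlin_2019-01-31_150430.dat',
           'Data_tral-2000B2-Frankf-2018_02_27-190200.dat',
           'Data_bash-300003_Hambrg_2017-04-12_210500.dat']

# Six capture groups: year, month, day, hour, minute, second
pattern = re.compile(r'(\d{4})[-_](\d{2})[-_](\d{2})[-_](\d{2})(\d{2})(\d{2})')

dates = []
for part_string in strings:
    m = pattern.search(part_string)
    if m:
        # Convert each captured string to int and pass them positionally
        dates.append(datetime(*map(int, m.groups())))

for d in dates:
    print(d)
```

Because the separators are never captured, strptime and its fixed format string are not needed at all.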

Matching multiple possibilities with regex in Python

I am trying to process a log file using Python and extract the date, time and log message of each entry and store it in a list of dicts. I am using the re.search() and group() methods for this purpose.
The problem is that the date/time takes various formats, such as:
dd/mm/yy, hh:mm AM - logs
dd/mm/yyyy, hh:mm a.m. - logs
dd/mm/yy HH:mm - logs
My program looks something like this:
import re
infile = open('logfile.txt', 'r')
loglist = []
logdict = {}
for aline in infile.readlines():
    line = re.search(r'^(\d?\d/\d?\d/\d\d), (\d?\d:\d?\d \w\w) - (.*?)', aline)
    if line:
        logdict['date'] = line.group(1)
        logdict['time'] = line.group(2)
        logdict['logmsg'] = line.group(3)
        loglist.append(logdict)
However, this matches only the first of the above-mentioned formats.
How can I match the other formats as well and also maintain the groups? Or is there an easier method of doing this?
You can use {m,n} after a pattern to indicate that there can be between m and n repetitions. So use \d{1,2} to indicate 1 or 2 digits. You can also use an alternation to indicate multiple possibilities, e.g. \d{2}|\d{4} for 2- or 4-digit years.
So the regexp can be:
^(\d{1,2}/\d{1,2}/(?:\d{2}|\d{4})),? (\d{1,2}:\d{1,2}(?: [AaPp]\.?[Mm]\.?)?) - (.*)
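Here is a quick sketch of that regexp applied to one line per format. The sample log lines are invented for illustration, and a new dict is built for each entry so that earlier entries are not overwritten.

```python
import re

# Pattern from the answer above, compiled once
LOG_RE = re.compile(
    r'^(\d{1,2}/\d{1,2}/(?:\d{2}|\d{4})),? '
    r'(\d{1,2}:\d{1,2}(?: [AaPp]\.?[Mm]\.?)?) - (.*)'
)

# Made-up sample lines, one per format from the question
lines = [
    "12/03/19, 9:05 AM - service started",
    "12/03/2019, 9:05 a.m. - service started",
    "12/03/19 21:05 - service stopped",
]

loglist = []
for aline in lines:
    m = LOG_RE.search(aline)
    if m:
        # Fresh dict per entry: date, time and message keep their groups
        loglist.append({'date': m.group(1),
                        'time': m.group(2),
                        'logmsg': m.group(3)})

for entry in loglist:
    print(entry)
```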
I would first extract the data with a regex and then validate it separately. I wouldn't use the regex for both validation and extraction.
For clarity, I would also give each sub-pattern a name and make sure that each one returns a single atom, such as a time, a date, or am_pm, and then string them together to form the full pattern.
Note: I have not assigned names to the groups. I think it's possible, but I'm not sure how.
In the end, you can take your date_time and split it, e.g. date_time.split("/"), which returns the day, month, and year, which you can then validate or use.
import re
log_records = ["10/10/1960, 10:50 AM - logs",
               "5/15/2001, 23:11 a.m. - logs",
               "50/100/1069 300:100 - logs"]
parsed_records = []
date_month_year_ptrn = r"((\d+/){2,2}\d+)"
time_ptrn = r"(\d+:\d+)"
morning_evening_ptrn = r"((\w+\.?)+)?"
everything_else_ptrn = r"(.*)"
log_record_ptrn = r"^{date_ptrn},?\s+{time_ptrn}\s+{morn_even_ptrn}\s*-\s+{log_msg}$"
log_record_ptrn = log_record_ptrn.format(date_ptrn=date_month_year_ptrn,
                                         time_ptrn=time_ptrn,
                                         morn_even_ptrn=morning_evening_ptrn,
                                         log_msg=everything_else_ptrn)
def extract_log_record_from_match(matcher):
    # Use the match object passed in, not a global
    if matcher:
        # I am pretty sure you can attach names to these numbers
        # but not sure how to do this
        date_time = matcher.group(1)
        time_ = matcher.group(3)
        am_pm = matcher.group(4)
        log_message = matcher.group(6)
        return date_time, time_, am_pm, log_message
    return None
def print_records(records):
    for record in records:
        if record:
            print(record)
for log_record in log_records:
    log_record_match = re.search(log_record_ptrn, log_record, re.IGNORECASE)
    parsed_records.append(extract_log_record_from_match(log_record_match))
print_records(parsed_records)
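On the open point about naming groups: Python's re module supports named groups with the (?P<name>...) syntax, and groupdict() returns them as a dict. The sketch below is one possible way to use them; the group names and the tightened am/pm sub-pattern are my own assumptions, not part of the code above.

```python
import re

# Named groups use the (?P<name>...) syntax
LOG_RECORD = re.compile(
    r"^(?P<date>\d+/\d+/\d+),?\s+"
    r"(?P<time>\d+:\d+)\s*"
    r"(?P<am_pm>[ap]\.?m\.?)?\s*-\s+"
    r"(?P<log_message>.*)$",
    re.IGNORECASE,
)

m = LOG_RECORD.search("5/15/2001, 23:11 a.m. - logs")
# groupdict() maps each group name to its matched text (None if absent)
record = m.groupdict() if m else None
print(record)
```

With names in place, m.group('date') can be used instead of counting parentheses to find the right group number.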

Date regex produce false output

I need to find the files whose names match a date format in Python. Could someone please help me with this? I have a regex, but it's not working as required.
date = '2012-01-15'
match = re.findall(r'^(19|20)\d\d[- /.](0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])$', date)
print match
Output:
[('20', '01','15')]
Seems like you just missed a pair of parentheses around the complete year match, and you probably want to suppress the century match with ?::
match = re.findall(r'^((?:19|20)\d\d)[- /.](0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])$', date)
#                     ^ ^^          ^
This gives [('2012', '01', '15')] for your example.
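The difference comes from how re.findall treats groups: when a pattern has capture groups, it returns only the groups. A small side-by-side sketch (the variable names are illustrative):

```python
import re

date = '2012-01-15'

# Capturing century group: findall returns '20' instead of the full year
capturing = r'^(19|20)\d\d[- /.](0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])$'
# Non-capturing century group nested inside a group around the whole year
non_capturing = r'^((?:19|20)\d\d)[- /.](0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])$'

before = re.findall(capturing, date)
after = re.findall(non_capturing, date)
print(before)
print(after)
```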

Number Trouble with Regex in Python

I'm trying to filter a date retrieved from a .csv file, but no combination I try seems to work. The date comes in as "2011-10-01 19:25:01" or "year-month-date hour:min:sec".
I want just the year, month and date, but I can't seem to get rid of the time from the string:
date = bug[2] # Column in which the date is located
date = date.replace('\"','') #getting rid of the quotations
mdate = date.replace(':','')
re.split('$[\d]+',mdate) # trying to get rid of the trailing set of numbers (from the time)
Thanks in advance for the advice.
If your source is a string, you'd probably be better off using strptime:
import datetime
string = "2011-10-01 19:25:01"
dt = datetime.datetime.strptime(string, "%Y-%m-%d %H:%M:%S")
After that, use
dt.year
dt.month
dt.day
to access the data you want.
Use datetime to parse your input as a datetime object, then output it in whatever format you like: http://docs.python.org/library/datetime.html
I think you're confusing the circumflex for start of line and dollar for end of line. Try ^[\d-]+.
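For completeness, here is a minimal sketch of the ^[\d-]+ suggestion applied to the quoted value from the question; the variable names are only illustrative.

```python
import re

raw = '"2011-10-01 19:25:01"'
date = raw.replace('"', '')  # drop the surrounding quotation marks

# ^ anchors at the start; [\d-]+ consumes digits and hyphens up to the space
m = re.match(r'^[\d-]+', date)
date_only = m.group(0) if m else None
print(date_only)
```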
If the format is always "YYYY-MM-DD HH:mm:ss", then try this:
date = date[1:11]
In a prompt:
>>> date = '"2012-01-12 15:13:20"'
>>> date[1:11]
'2012-01-12'
>>>
No need for regex
>>> date = '"2011-10-01 19:25:01"'
>>> date.strip('"').split()[0]
'2011-10-01'
One problem with your code is that in your last regular expression, $ matches the end of the string, so that regular expression will never match anything. You could do this much more simply by splitting by spaces and only taking the first result. After removing the quotation marks, the line
date.split()
will return ["2011-10-01", "19:25:01"], so the first element of that list is what you need.
