Reading from file and formatting into two dimensional array - python

I'm attempting to read data from multiple text files and move the data into a two-dimensional array. The data needs to remain in a specific order.
Could regex assist with this?
If you have any insight on how to improve this section of the code please let me know.

The datetime module provides (almost) everything date-related:
from datetime import datetime
date = "Sat 30-Mar-1996 7:40 PM"
fmt = "%a %d-%b-%Y %I:%M %p"
a = datetime.strptime(date, fmt)
print(a.year)  # prints 1996
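Applied to the original problem, each line can become one row of a two-dimensional list while the file order is preserved. This is only a sketch: the sample lines, the "|" delimiter, the position of the date as the last field, and the names sample_lines and rows are all assumptions, since the question does not show the actual file layout.

```python
from datetime import datetime

# Hypothetical input: in practice these lines would be read from the text files.
sample_lines = [
    "alpha | Sat 30-Mar-1996 7:40 PM",
    "beta | Mon 01-Apr-1996 9:05 AM",
]
fmt = "%a %d-%b-%Y %I:%M %p"

rows = []
for line in sample_lines:
    fields = [field.strip() for field in line.split("|")]
    # Assumption: the date-time string is the last field of each line.
    stamp = datetime.strptime(fields[-1], fmt)
    rows.append(fields[:-1] + [stamp.year, stamp.month, stamp.day,
                               stamp.hour, stamp.minute])

print(rows)
```

Appending to a list keeps the rows in the order the lines were read, which covers the ordering requirement as long as the files themselves are processed in a fixed order.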

You can parse the date-time string very easily by splitting its components and using iterable unpacking, e.g.,
def parse_date(d):
    day_of_week, date, hhmm, ampm = d.split()
    day_of_month, month, year = date.split('-')
    hour, minute = hhmm.split(':')
    # convert to 24-hour time: 12 AM becomes 0, 12 PM stays 12
    hour = str(int(hour) % 12 + (12 if ampm == 'PM' else 0))
    return (year, month, day_of_month,
            hour, minute,
            day_of_week)
and later, in the body of the loop
year, month, dom, hour, minute, dow = parse_date(fields[-1].strip())
or, if you are interested only in year
year, *_ = parse_date(fields[-1].strip())

You're probably looking for regular expressions (regex for short), which are a very powerful way to analyze and extract data from strings. For an intro to them, I'd check out this site or the Python docs, but in your case I think something like r'\| ([a-zA-Z]*) ([0-9]*)-([a-zA-Z]*)-([0-9]*) ([0-9:]* [a-zA-Z]*) \|' would work. Note that a literal | has to be escaped as \|, because an unescaped | means alternation. A more specific description of the time format would be necessary for a 100% correct regex.
To use regex in Python, you want the re library. First, create the pattern matcher with matcher = re.compile(your_regex_string_here). Then, find the match with result = matcher.search(file_contents). (You could also just do result = re.search(regex_string, file_contents).) Use search rather than match here, because match only succeeds if the pattern matches at the very start of the string. Whatever your regex, anything surrounded by parentheses is known as a "capturing group", which can be extracted from the result with result.group(); result.group(0) will return the full match, and result.group(n) will return the contents of the nth capturing group, that is, the nth set of parentheses. In the above example, result.group(4) would return the year, though you could get any of the day of the week, day, month, year, and time by using groups 1-5.
The datetime module, as mentioned in another answer, is also a great tool.
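Here is a minimal sketch of the compile / search / group steps described above. The sample line is hypothetical, because the real file layout isn't shown in the question. The pattern escapes the literal pipes and uses + instead of * so that empty fields are rejected.

```python
import re

# Hypothetical line; the real file layout is not shown in the question.
line = "some field | Sat 30-Mar-1996 7:40 PM |"

# Literal pipes are escaped; each pair of parentheses is a capturing group.
matcher = re.compile(
    r"\| ([a-zA-Z]+) ([0-9]+)-([a-zA-Z]+)-([0-9]+) ([0-9:]+ [a-zA-Z]+) \|"
)

# search() scans the whole string; match() would only try position 0.
result = matcher.search(line)
if result:
    day_of_week, day, month, year, time = result.groups()
    print(year, month, day, time)
```

Checking that result is not None before calling group() avoids an AttributeError on lines that don't contain a date.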

Related

Extract date from file name with import re in python

My file name looks as follows:
show_data_paris_12112019.xlsx
I want to extract the date only and I have tried this script:
date = os.path.basename(xls)
pattern = r'(?<=_)_*(?=\.xlsx)'
re.search(pattern, date).group(0)
event_date = re.search(pattern, date).group(0)
event_date_obj = datetime.strptime (event_date, '%Y%m%d')
but it gives me errors. How can I fix this?
Thank you.
It looks to me like the regex you're using is also at fault: it matches nothing, so re.search returns None and the call to group(0) fails.
Assuming all your dates are stored as digits, the following regex seems to work quite well.
(?!.+_)\d+(?=\.xlsx)
The next issue is the format string passed to strptime. To me, 12112019 looks like 12/11/2019 (day first), although it could also be 11/12/2019 (month first), so the format has to state the order explicitly.
For the day / month / year format we would use
# %d%m%Y
event_date_obj = datetime.strptime(event_date, '%d%m%Y')
And we would simply swap %d and %m for the month / day / year format. So your complete code would look something like this:
import os
import re
from datetime import datetime
date = os.path.basename(xls)
pattern = r"(?!.+_)\d+(?=\.xlsx)"
event_date = re.search(pattern, date).group(0)
event_date_obj = datetime.strptime(event_date, '%d%m%Y')
For further information on the strptime/strftime format codes, see https://strftime.org/.
_* matches a sequence of zero or more underscore characters.
(?<=_) means that it has to be preceded by an underscore.
(?=\.xlsx) means that it has to be followed by .xlsx.
So this will match the ___ in foo____.xlsx. But it doesn't match your filename, because the data is between the underscore and .xlsx.
You should match \d+ rather than _* between the lookbehind and lookahead.
pattern = r'(?<=_)\d+(?=\.xlsx)'
And if the data is always 8 digits, use \d{8} to be more specific.
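Putting the corrected pattern and the format string together, a self-contained sketch might look like this. The directory in xls is hypothetical, and the day-month-year order is an assumption, since the file name alone does not settle it.

```python
import os
import re
from datetime import datetime

# Hypothetical path; only the file name pattern comes from the question.
xls = "/tmp/reports/show_data_paris_12112019.xlsx"

name = os.path.basename(xls)
match = re.search(r"(?<=_)\d{8}(?=\.xlsx)", name)
if match is None:
    raise ValueError(f"no 8-digit date found in {name!r}")

# Assumes day-month-year order; use '%m%d%Y' if the month comes first.
event_date_obj = datetime.strptime(match.group(0), "%d%m%Y")
print(event_date_obj.date())
```

Checking for None before calling group(0) gives a readable error for file names that don't follow the pattern.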

extract hour from a string _ unclear format

This question may be a duplicate, but I didn't find an exact solution for it. I have this type of string that includes a date and time.
"check_in": "10/25/2019 14:30"
I need to extract the hour from it, but the string is not always in a valid format. I tried these patterns so far, but they include the ":" character.
\d+?(:)
(\d+:)
(\d+)*:
Regular expressions aren't always the best way to deal with strings representing dates, especially if you can't rely on the input format to be consistent. Use a specialized parser instead:
>>> from dateutil import parser
>>> parser.parse("10/25/2019 14:30").hour
14
>>> parser.parse("10/25/2019 2:30 PM").hour
14
>>> parser.parse("2019-10-25T143000").hour
14
The module dateutil isn't in the standard library but is well worth the trouble of downloading.
\d+(?=:)
You don't need to match the :, only to check that it is there, so use a positive lookahead (?=:).
First, this is what is wrong with your regexes:
\d+?(:) - finds a number and a colon (14:) and puts the colon into a group
(\d+:) - finds a number and a colon (14:) and puts all of it into a group
(\d+)*: - finds (optionally, because of *) a number and a colon (14:) and puts the number into a group
So, the last one could work:
>>> match = re.search(r'(\d+)*:', "10/25/2019 14:30")
>>> match.group(0) # whole result
'14:'
>>> match.group(1) # just the number
'14'
But then again, it would give a wrong result (instead of breaking) on something like "time: 14:30", making it difficult to debug the error later. What you want is a stricter search, e.g. matching the whole string and labelling all groups:
>>> regex = r'(?P<month>\d\d)/(?P<day>\d\d)/(?P<year>\d{4}) (?P<hour>\d\d):(?P<minute>\d\d)'
>>> re.search(regex, "10/25/2019 14:30").group('hour')
'14'
Another, easier and even safer way is to use strptime:
>>> import datetime
>>> datetime.datetime.strptime("10/25/2019 14:30", "%m/%d/%Y %H:%M")
datetime.datetime(2019, 10, 25, 14, 30)
Now you have the complete datetime object and you can extract the .hour if you want.
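Since the input is not always valid, the strptime approach can be wrapped in a small helper that returns None instead of raising. This is a sketch using only the standard library. The name extract_hour and the list of candidate formats are assumptions and should be adjusted to the formats that actually occur in the data.

```python
from datetime import datetime

# Assumed candidate formats; extend this tuple to match the real data.
KNOWN_FORMATS = ("%m/%d/%Y %H:%M", "%m/%d/%Y %I:%M %p")


def extract_hour(text):
    """Return the hour (0-23) from text, or None if no known format fits."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text, fmt).hour
        except ValueError:
            continue  # this format did not fit, try the next one
    return None


print(extract_hour("10/25/2019 14:30"))    # 14
print(extract_hour("10/25/2019 2:30 PM"))  # 14
print(extract_hour("not a date"))          # None
```

Unlike a loose regex, strptime rejects strings that merely contain something like digits followed by a colon, so malformed values show up as None rather than as a wrong hour.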

Matching multiple possibilities with regex in Python

I am trying to process a log file using Python and extract the date, time and log message of each entry and store it in a list of dicts. I am using the re.search() and group() methods for this purpose.
The problem is that the date/time takes various formats, such as:
dd/mm/yy, hh:mm AM - logs
dd/mm/yyyy, hh:mm a.m. - logs
dd/mm/yy HH:mm - logs
My program looks something like this:
import re
infile=open('logfile.txt', 'r')
loglist=[]
logdict={}
for aline in infile.readlines():
    line=re.search(r'^(\d?\d/\d?\d/\d\d), (\d?\d:\d?\d \w\w) - (.*?)',aline)
    if line:
        logdict['date'] = line.group(1)
        logdict['time'] = line.group(2)
        logdict['logmsg'] = line.group(3)
        loglist.append(logdict)
However, this matches only the first of the above-mentioned formats.
How can I match the other formats as well and also maintain the groups? Or is there an easier method of doing this?
You can use {m,n} after a pattern to indicate that there can be between m and n repetitions. So use \d{1,2} to indicate 1 or 2 digits. And you can use an alternation to indicate multiple possibilities, e.g. \d{2}|\d{4} for 2- or 4-digit years.
So the regexp can be:
^(\d{1,2}/\d{1,2}/(?:\d{2}|\d{4})),? (\d{1,2}:\d{1,2}(?: [AaPp]\.?[Mm]\.?)?) - (.*)
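As a sketch, here is the regexp above applied to one line per format from the question. The sample lines and the name LOG_PATTERN are made up for illustration. The groups are named with (?P<name>...) so that groupdict() can build each entry directly.

```python
import re

# Same pattern as above, with named groups instead of numbered ones.
LOG_PATTERN = re.compile(
    r"^(?P<date>\d{1,2}/\d{1,2}/(?:\d{2}|\d{4})),? "
    r"(?P<time>\d{1,2}:\d{1,2}(?: [AaPp]\.?[Mm]\.?)?) - "
    r"(?P<logmsg>.*)"
)

# Hypothetical log lines, one per format listed in the question.
sample_lines = [
    "12/03/19, 10:30 AM - first entry",
    "12/03/2019, 10:30 a.m. - second entry",
    "12/03/19 14:30 - third entry",
]

loglist = []
for aline in sample_lines:
    match = LOG_PATTERN.search(aline)
    if match:
        # groupdict() returns a fresh dict for every entry
        loglist.append(match.groupdict())

for entry in loglist:
    print(entry)
```

Building a new dict per line matters: appending the same dict object on every iteration would leave the list holding many references to the last entry.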
I would first extract the data with a regex and then validate it manually. I wouldn't use the regex for both validation and extraction.
For clarity I would also assign names to these regexes and make sure that each individual regex returns an atom such as a time, a date, or am_pm, and then string them together to form the sentence.
Note: I have not assigned names to the groups. I think it's possible, but I'm not sure how.
In the end, however, you could take your date_time and split it, e.g. date_time.split("/"), which would give you day, month, and year, which you can then validate or use.
import re
log_records = ["10/10/1960, 10:50 AM - logs",
               "5/15/2001, 23:11 a.m. - logs",
               "50/100/1069 300:100 - logs"]
parsed_records = []
date_month_year_ptrn = r"((\d+/){2,2}\d+)"
time_ptrn = r"(\d+:\d+)"
morning_evening_ptrn = r"((\w+\.?)+)?"
everything_else_ptrn = r"(.*)"
log_record_ptrn = r"^{date_ptrn},?\s+{time_ptrn}\s+{morn_even_ptrn}\s*-\s+{log_msg}$"
log_record_ptrn = log_record_ptrn.format(date_ptrn=date_month_year_ptrn,
                                         time_ptrn=time_ptrn,
                                         morn_even_ptrn=morning_evening_ptrn,
                                         log_msg=everything_else_ptrn)
def extract_log_record_from_match(log_record_match):
    if log_record_match:
        # I am pretty sure you can attach names to these numbers
        # but not sure how to do this
        date_time = log_record_match.group(1)
        time_ = log_record_match.group(3)
        am_pm = log_record_match.group(4)
        log_message = log_record_match.group(6)
        return date_time, time_, am_pm, log_message
    return None
def print_records(records):
    for record in records:
        if record:
            print(record)
for log_record in log_records:
    log_record_match = re.search(log_record_ptrn, log_record, re.IGNORECASE)
    parsed_records.append(extract_log_record_from_match(log_record_match))
print_records(parsed_records)

How to replace string with certain format in python

I am trying to do string manipulation based on a format. str.replace(old, new) allows replacing a specific substring. Is it possible to find and replace by format? For example,
I want to find all datetime-like values in a long string and replace them with another format,
assuming % is a wildcard for a digit and the datetime is %%/%%/%%T%%:%%
str.replace(%%/%%/%%T%%:%%, 'dummy value')
EDIT:
Sorry, I should have been clearer. It seems I can use re.sub, but how do I substitute each match with a converted date value? In this case, e.g.
YY/MM/DDTHH:MM to (YY/MM/DD HH:MM)+8 hours
The easiest way to do this is probably to combine regular expression syntax with re.sub, whose repl parameter can be a function that takes a match and returns the replacement string, and datetime's strptime and strftime:
>>> from datetime import datetime
>>> import re
>>> def replacer(match):
...     return datetime.strptime(
...         match.group(),  # matched text
...         '%y/%m/%dT%H:%M',  # source format in datetime syntax
...     ).strftime('%d %B %Y at %H.%M')  # destination format in datetime syntax
...
>>> re.sub(
...     r'\d{2}/\d{2}/\d{2}T\d{2}:\d{2}',  # source format in regex syntax
...     replacer,  # function to process match
...     'The date and time was 12/12/12T12:12 exactly.',  # string to process
... )
'The date and time was 12 December 2012 at 12.12 exactly.'
The only downside of this is that you need to define the source format in both datetime and re syntax, which isn't very DRY; if they don't match, you'll get nowhere.
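For the "+8 hours" conversion asked about in the edit, the same replacement-function technique can add a timedelta before formatting. This is a sketch: the sample text and the names shift_by_eight_hours, SOURCE_REGEX, SOURCE_FORMAT and TARGET_FORMAT are illustrative, and the two-digit year is assumed to follow strptime's %y rules.

```python
import re
from datetime import datetime, timedelta

SOURCE_REGEX = r"\d{2}/\d{2}/\d{2}T\d{2}:\d{2}"  # YY/MM/DDTHH:MM in regex syntax
SOURCE_FORMAT = "%y/%m/%dT%H:%M"                  # the same layout in datetime syntax
TARGET_FORMAT = "%y/%m/%d %H:%M"


def shift_by_eight_hours(match):
    """Parse the matched text, add 8 hours, and format the result."""
    moment = datetime.strptime(match.group(), SOURCE_FORMAT)
    return (moment + timedelta(hours=8)).strftime(TARGET_FORMAT)


text = "Started at 19/10/25T20:30 and ended at 19/10/26T01:05."
converted = re.sub(SOURCE_REGEX, shift_by_eight_hours, text)
print(converted)
```

Because the arithmetic is done on a datetime object, the shift rolls over into the next day (or month, or year) correctly, which a purely textual replacement could not do.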

Number Trouble with Regex in Python

I'm trying to filter a date retrieved from a .csv file, but no combination I try seems to work. The date comes in as "2011-10-01 19:25:01", i.e. "year-month-day hour:min:sec".
I want just the year, month and day, but I can't seem to get rid of the time in the string:
date = bug[2] # Column in which the date is located
date = date.replace('\"','') #getting rid of the quotations
mdate = date.replace(':','')
re.split('$[\d]+',mdate) # trying to get rid of the trailing set of number (from the time)
Thanks in advance for the advice.
If your source is a string, you'd be better off using strptime:
import datetime
string = "2011-10-01 19:25:01"
dt = datetime.datetime.strptime(string, "%Y-%m-%d %H:%M:%S")
After that, use
dt.year
dt.month
dt.day
to access the data you want.
Use datetime to parse your input as a datetime object, then output it in whatever format you like: http://docs.python.org/library/datetime.html
I think you're confusing the circumflex for start of line and dollar for end of line. Try ^[\d-]+.
If the format is always "YYYY-MM-DD HH:mm:ss", then try this:
date = date[1:11]
In a prompt:
>>> date = '"2012-01-12 15:13:20"'
>>> date[1:11]
'2012-01-12'
>>>
No need for regex
>>> date = '"2011-10-01 19:25:01"'
>>> date.strip('"').split()[0]
'2011-10-01'
One problem with your code is that in your last regular expression, $ matches the end of the string, so that regular expression will never match anything. You could do this much more simply by splitting by spaces and only taking the first result. After removing the quotation marks, the line
date.split()
will return ["2011-10-01", "19:25:01"], so the first element of that list is what you need.
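If a real date object is more useful than a substring, the parsing and splitting suggestions above can be combined. This is a minimal sketch that assumes the value still carries its surrounding quotation marks, as in the question.

```python
from datetime import datetime

raw = '"2011-10-01 19:25:01"'  # value as it appears in the question, quotes included

# Strip the quotes, parse the full timestamp, then keep only the date part.
parsed = datetime.strptime(raw.strip('"'), "%Y-%m-%d %H:%M:%S")
date_only = parsed.date()

print(date_only.isoformat())
print(date_only.year, date_only.month, date_only.day)
```

Parsing also validates the input: a malformed timestamp raises ValueError instead of silently producing a wrong substring.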
