Python Regex re.search - groupdict() - Date format matching - python

I need to extract the day and month from various strings such as '14th oct', '14oct', '14.10', '14 10' and '14/10'. For these cases my code below works fine.
query = '14.oct'
print(re.search(r'(?P<date>\b\d{1,2})(?:\b|st|nd|rd|th)?(?:[\s\.\-/_\\,]*)(?P<month>\d{1,2}|[a-z]{3,9})', query, re.I).groupdict())
Result:-
{'date': '14', 'month': 'oct'}
But for the case (1410), it still captures a date and month. I don't want that, since this is a different number format spanning the entire string and should not be treated as a date and month. The result should be None.
How to change the search pattern for this? (with groupdict() only)
Edited:-
The parentheses around the number above, (1410), are just there to set it apart from the other text. What I mean is 1410 only.
The solution below is what I want; I got the idea from the answer by #the-fourth-bird and added (?!\d{3,}\b) to the regex pattern.
Thanks🙏🏽
Final Solution
import re
queries = ['14 10', '14.10', '1410', '14-10', '14/10', '14,10', '17800', '14th oct', '14thoct', '14th-oct', '14th/oct', '14-oct', '14.oct', '14oct']
max_indent = len(max(queries, key = len)) + 1
for query in queries:
    if resp := re.search(r'(?P<date>\b(?!\d{3,}\b)\d{1,2})(?:\b|st|[nr]d|th)?(?:[\s.-/_\\,-]*)(?P<month>\d{1,2}|[a-z]{3,9})', query, re.I):
        print(f"{query:{max_indent}}- {resp.groupdict()}")
    else:
        print(f"{query:{max_indent}}- 'Not a date'")
Result:-
14 10 - {'date': '14', 'month': '10'}
14.10 - {'date': '14', 'month': '10'}
1410 - 'Not a date'
14-10 - {'date': '14', 'month': '10'}
14/10 - {'date': '14', 'month': '10'}
14,10 - {'date': '14', 'month': '10'}
17800 - 'Not a date'
14th oct - {'date': '14', 'month': 'oct'}
14thoct - {'date': '14', 'month': 'oct'}
14th-oct - {'date': '14', 'month': 'oct'}
14th/oct - {'date': '14', 'month': 'oct'}
14-oct - {'date': '14', 'month': 'oct'}
14.oct - {'date': '14', 'month': 'oct'}
14oct - {'date': '14', 'month': 'oct'}

Not sure if you don't want to match 1410 as in 4 digits only or (1410) with the parenthesis, but to exclude matching both you can make sure there are not 4 consecutive digits:
(?P<date>\b(?!\d{4}\b)\d{1,2})(?:st|[nr]d|th)?[\s./_\\,-]*(?P<month>\d{1,2}|[a-z]{3,9})
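As a quick check of how the length of the digit run in the lookahead matters, here is a minimal sketch comparing (?!\d{4}\b) with the (?!\d{3,}\b) variant from the question's final solution. The names build_pattern and day_month are made up for this illustration; only the pattern pieces come from the post.

```python
import re

def build_pattern(guard):
    # Hypothetical helper: the pattern above with the lookahead swapped in.
    return re.compile(
        r"(?P<date>\b" + guard + r"\d{1,2})(?:st|[nr]d|th)?"
        r"[\s./_\\,-]*(?P<month>\d{1,2}|[a-z]{3,9})",
        re.I,
    )

exactly_four = build_pattern(r"(?!\d{4}\b)")    # rejects runs of exactly 4 digits
three_or_more = build_pattern(r"(?!\d{3,}\b)")  # rejects any run of 3+ digits

def day_month(pattern, text):
    # Return the named groups, or None when nothing matches.
    m = pattern.search(text)
    return m.groupdict() if m else None

for text in ["14 oct", "1410", "17800"]:
    print(text, day_month(exactly_four, text), day_month(three_or_more, text))
```

With the four-digit guard, a five-digit string such as '17800' still yields a date and month ('17' and '80'), which is presumably why the question ended up with {3,}.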
To not match any date between parenthesis
\([^()]*\)|(?P<date>\b\d{1,2})(?:st|[nr]d|th)?[\s./_\\,-]*(?P<month>\d{1,2}|[a-z]{3,9})
\([^()]*\) Match from opening till closing parenthesis
| Or
(?P<date>\b\d{1,2}) Match 1-2 digits
(?:st|[nr]d|th)? Optionally match st nd rd th
[\s./_\\,-]* Optionally repeat matching any of the listed
(?P<month>\d{1,2}|[a-z]{3,9}) Match 1-2 digits or 3-9 chars a-z
For example
import re
pattern = r"\([^()]*\)|(?P<date>\b\d{1,2})(?:st|[nr]d|th)?(?:[\s./_\\,-]*)(?P<month>\d{1,2}|[a-z]{3,9})"
strings = ["14th oct", "14oct", "14.10", "14 10", "14/10", "1410", "(1410)"]
for s in strings:
    m = re.search(pattern, s, re.I)
    # The date group is only set when the second alternative matched
    if m and m.group("date"):
        print(m.groupdict())
    else:
        print(f"{s} --> Not valid")
Output
{'date': '14', 'month': 'oct'}
{'date': '14', 'month': 'oct'}
{'date': '14', 'month': '10'}
{'date': '14', 'month': '10'}
{'date': '14', 'month': '10'}
{'date': '14', 'month': '10'}
(1410) --> Not valid

How to change the search pattern for this?
You might try a negative lookbehind for a literal ( combined with a negative lookahead for a literal ), as follows
import re
query = '14.oct'
noquery = '(1410)'
print(re.search(r'(?<!\()(?P<date>\b\d{1,2})(?:\b|st|nd|rd|th)?(?:[\s\.\-/_\\,]*)(?P<month>\d{1,2}|[a-z]{3,9})(?!\))', query, re.I).groupdict())
print(re.search(r'(?<!\()(?P<date>\b\d{1,2})(?:\b|st|nd|rd|th)?(?:[\s\.\-/_\\,]*)(?P<month>\d{1,2}|[a-z]{3,9})(?!\))', noquery, re.I))
output
{'date': '14', 'month': 'oct'}
None
Beware that it does prevent all bracketed forms, i.e. not only (1410) but also (14 10), (14/10) and so on.
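To make that caveat concrete, here is a small sketch that runs the same pattern over a few bracketed and unbracketed inputs. The names GUARDED and day_month_outside_parens are invented for the example.

```python
import re

# Same pattern as in the answer above, compiled once.
GUARDED = re.compile(
    r"(?<!\()(?P<date>\b\d{1,2})(?:\b|st|nd|rd|th)?(?:[\s\.\-/_\\,]*)"
    r"(?P<month>\d{1,2}|[a-z]{3,9})(?!\))",
    re.I,
)

def day_month_outside_parens(text):
    # Return the named groups, or None when nothing matches.
    m = GUARDED.search(text)
    return m.groupdict() if m else None

for text in ["14/10", "(1410)", "(14 10)", "(14/10)"]:
    print(f"{text!r:10} -> {day_month_outside_parens(text)}")
```

Only the unbracketed '14/10' produces a result here; all three bracketed inputs come back as None.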

Related

How to convert format date from MM/DD/YYYY to YYYY-MM-DDT00:00:00.000Z in Python

I used pandas to create a list of dictionaries. The following code is how I create the list:
sheetwork = client.open('RMA Daily Workload').sheet1
list_of_work = sheetwork.get_all_records()
dfr = pd.DataFrame(list_of_work, columns = ['date' , 'value'])
rnow = dfr.to_dict('records')
The following is the output of my list:
rnow =
[{'date': '01/02/2020', 'value': 13},
{'date': '01/03/2020', 'value': 2},
{'date': '01/06/2020', 'value': 5},
...
{'date': '01/07/2020', 'value': 6}]
I want to change the date format from MM/DD/YYYY to YYYY-MM-DDT00:00:00.000Z, so that my data will be compatible with my javascript file where I want to add my data.
I want my list to be shown as:
rnow =
[{'date': '2020-01-02T00:00:00.000Z', 'value': 13},
{'date': '2020-01-03T00:00:00.000Z', 'value': 2},
{'date': '2020-01-06T00:00:00.000Z', 'value': 5},
...
{'date': '2020-01-07T00:00:00.000Z', 'value': 6}]
I tried many methods but could only get 2020-01-02 00:00:00, not 2020-01-02T00:00:00.000Z. Please advise what I should do.
If you need the exact T00:00:00.000Z suffix after the date, apply string formatting after the date conversion,
e.g.,
import datetime
# '2020-01-07T00:00:00.000Z'
datetime.datetime.strptime("01/07/2020", '%m/%d/%Y').strftime('%Y-%m-%dT00:00:00.000Z')
How to apply to pandas:
def func(x):
    # x is one row of the dataframe; the input is MM/DD/YYYY
    myDate = x['date']
    return datetime.datetime.strptime(myDate, '%m/%d/%Y').strftime('%Y-%m-%dT00:00:00.000Z')
df['new_date'] = df.apply(func, axis=1)
To keep it simple and stay in UTC, since you are using pandas:
import pandas as pd
rnow = [{'date': '01/02/2020', 'value': 13},
        {'date': '01/03/2020', 'value': 2},
        {'date': '01/06/2020', 'value': 5},
        {'date': '01/07/2020', 'value': 6}]
def get_isoformat(date):
    # dayfirst=False: the input is month-first (MM/DD/YYYY)
    return pd.to_datetime(date, dayfirst=False, utc=True).isoformat()
for i in range(len(rnow)):
    rnow[i]['date'] = get_isoformat(rnow[i]['date'])
rnow
which outputs:
[{'date': '2020-01-02T00:00:00+00:00', 'value': 13},
{'date': '2020-01-03T00:00:00+00:00', 'value': 2},
{'date': '2020-01-06T00:00:00+00:00', 'value': 5},
{'date': '2020-01-07T00:00:00+00:00', 'value': 6}]
In fact, you may want to apply get_isoformat() directly to your dataframe for simplicity. Also, passing utc=False (as in the edit below) drops the +00:00 part in case you don't want or need it.
Edit
To get specifically 2020-01-02T00:00:00Z, try:
pd.to_datetime(date, dayfirst=False, utc=False).isoformat()+'Z'
You can use the isoformat function of Python's builtin datetime package:
from datetime import datetime, timezone
formatted = datetime.strptime('01/02/2020', '%m/%d/%Y').replace(tzinfo=timezone.utc).isoformat()
formatted
# Output: '2020-01-02T00:00:00+00:00'
Note that isoformat() does not emit the Z suffix for the UTC timezone; it produces +00:00 instead, which is also valid ISO 8601 and should parse in other code just fine.
If this is a problem, you can omit the timezone and instead manually put a Z there:
from datetime import datetime
formatted = datetime.strptime('01/02/2020', '%m/%d/%Y').isoformat() + 'Z'
formatted
# Output: '2020-01-02T00:00:00Z'
Alternatively (in a more "manual" approach), you could format the date using strftime:
from datetime import datetime
formatted = datetime.strptime('01/02/2020', '%m/%d/%Y').strftime('%Y-%m-%dT00:00:00Z')
formatted
# Output: '2020-01-02T00:00:00Z'
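The question asked for a millisecond fraction (.000) as well, which the snippets above leave out. A minimal sketch for the whole list, using only the standard library, could look like this; the helper name to_js_iso is made up, and it assumes the dates are meant as midnight UTC.

```python
from datetime import datetime

rnow = [
    {'date': '01/02/2020', 'value': 13},
    {'date': '01/03/2020', 'value': 2},
]

def to_js_iso(mdy):
    """Convert 'MM/DD/YYYY' to 'YYYY-MM-DDT00:00:00.000Z' (assumed midnight UTC)."""
    parsed = datetime.strptime(mdy, '%m/%d/%Y')
    # timespec='milliseconds' forces the '.000' fraction; the 'Z' is appended
    # by hand because isoformat() does not emit it.
    return parsed.isoformat(timespec='milliseconds') + 'Z'

# Build new dicts rather than mutating the originals.
converted = [{**row, 'date': to_js_iso(row['date'])} for row in rnow]
print(converted)
```

The timespec argument needs Python 3.6 or later.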

Best way to break apart long string using redshift SQL (included in question)?

I'm looking for the best way to break apart this blob of information into columns:
DATE
AMOUNT
TYPE
UNDISCLOSED
INVESTORS
INVESTORS WEBSITES
[{'date': 'Mon Aug 07 00:00:00 UTC 2004', 'amount': '1900000', 'type': 'Series D', 'undisclosed': 'false', 'investor': [{'name': 'Jobius Venture', 'website': 'jobiusvc.com'}]}, {'date': 'Tues July 06 00:00:00 UTC 2010', 'amount': '12000000000', 'type': 'Series A1', 'undisclosed': 'false', 'investor': [{'name': 'Fatthead Partners', 'website': 'fpartnazs.com'}, {'name': 'Jobius Venture', 'website': 'jobiusvc.com'}, {'name': 'Pista Pentures ', 'website': 'pisptavc.com'}]}, {'date': 'Sat Jun 01 00:00:00 UTC 2015', 'amount': '10000000000', 'type': 'Series X', 'undisclosed': 'false', 'investor': [{'name': 'Fatthead Partners', 'website': 'fpartnazs.com'}, {'name': 'Jobius Venture', 'website': 'jobiusvc.com'}, {'name': 'Pista Pentures', 'website': 'vistavc.com'}]}, {'date': 'Sun Aug 31 00:00:00 UTC 2015', 'amount': '3913000', 'type': 'Unknown', 'undisclosed': 'false'}, {'date': 'Mon Aug 12 00:00:00 UTC 2023', 'amount': '40000', 'type': 'Series D34', 'undisclosed': 'false', 'investor': [{'name': 'Fatthead Partners', 'website': 'fpartnazs.com'}, {'name': 'Jobius Venture', 'website': 'jobiusvc.com'}]}]
Your output is almost in JSON format.
For JSON, you could use: JSON_EXTRACT_PATH_TEXT Function - Amazon Redshift
However, it seems that the quotation marks are not standard JSON. It should use double-quotes (") in JSON, not single quotes (').
Also, the string starts with a list ([...]), which JSON_EXTRACT_PATH_TEXT cannot address by key; it expects a JSON object in {..} braces. Individual elements of an array would first have to be pulled out with JSON_EXTRACT_ARRAY_ELEMENT_TEXT.
The output looks more like it came from a Python program. If so, and you have access to the Python program, it would be better to have it output in correct JSON format, so that you could use the above function. (Or just output the fields you actually want.)
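If the producing program is indeed Python, re-emitting the value as real JSON could be as small as the sketch below. It assumes the blob is a valid Python literal (as the sample appears to be) and uses a shortened copy of the first record as input; the variable names are illustrative.

```python
import ast
import json

# Shortened sample in the same single-quoted style as the question's data.
raw = ("[{'date': 'Mon Aug 07 00:00:00 UTC 2004', 'amount': '1900000', "
       "'type': 'Series D', 'undisclosed': 'false', "
       "'investor': [{'name': 'Jobius Venture', 'website': 'jobiusvc.com'}]}]")

# literal_eval only accepts Python literals, so it does not execute
# arbitrary code the way eval() would.
records = ast.literal_eval(raw)

# json.dumps writes standard double-quoted JSON.
as_json = json.dumps(records)
print(as_json)
```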
You could write a Python User-Defined Function to do the conversion, such as:
create or replace function f_parse (str varchar(2000))
returns varchar
stable
as $$
return eval(str)[0]['date']
$$ language plpythonu;
Then:
select f_parse(s) from table
Results in: Mon Aug 07 00:00:00 UTC 2004
However, it appears that multiple records are in that one line, so I really suggest that you get a better version of the input data rather than trying to parse that line.
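For pre-processing outside Redshift, a rough sketch of flattening the blob into one row per funding round might look like this. The function name flatten and the output column names are assumptions, and records without an 'investor' key (like the 'Unknown' round in the sample) get empty investor fields.

```python
import ast

def flatten(raw):
    """Turn the Python-literal blob into a list of flat row dicts."""
    rows = []
    for rec in ast.literal_eval(raw):
        investors = rec.get('investor', [])  # key is missing on some records
        rows.append({
            'date': rec['date'],
            'amount': int(rec['amount']),
            'type': rec['type'],
            'undisclosed': rec['undisclosed'] == 'true',
            'investors': ', '.join(i['name'].strip() for i in investors),
            'investor_websites': ', '.join(i['website'] for i in investors),
        })
    return rows

# Two shortened records in the same shape as the question's data.
sample = ("[{'date': 'Mon Aug 07 00:00:00 UTC 2004', 'amount': '1900000', "
          "'type': 'Series D', 'undisclosed': 'false', "
          "'investor': [{'name': 'Jobius Venture', 'website': 'jobiusvc.com'}]}, "
          "{'date': 'Sun Aug 31 00:00:00 UTC 2015', 'amount': '3913000', "
          "'type': 'Unknown', 'undisclosed': 'false'}]")

for row in flatten(sample):
    print(row)
```

The resulting rows could then be written out as CSV or JSON lines and loaded into a regular table.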

Format String Vertically

With Python, I am receiving data like the following, which comes as a string from a Firebase database. How can I format it into something more readable for the user?
Received Output
{'date': '07-Oct-2019', 'day': 'Monday', 'driver': 'John '}
Desired OutPut
date : 07-Oct-2019
day : Monday
driver : jop
A simple one-liner should do:
d={'date': '07-Oct-2019', 'day': 'Monday', 'driver': 'John '}
print("\n".join([k+":"+v for k,v in d.items()]))
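Since the question says the data arrives as a string and the desired output has aligned colons, a slightly longer sketch may be closer to what was asked. It assumes the string is a valid Python dict literal; the function name format_vertically is made up.

```python
import ast

received = "{'date': '07-Oct-2019', 'day': 'Monday', 'driver': 'John '}"

def format_vertically(data):
    # Accept either the raw string or an already-parsed dict.
    if isinstance(data, str):
        data = ast.literal_eval(data)
    # Pad every key to the longest key so the colons line up.
    width = max((len(key) for key in data), default=0)
    return "\n".join(
        f"{key:<{width}} : {str(value).strip()}" for key, value in data.items()
    )

print(format_vertically(received))
```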

Change dates in list with multiple dictionaries in Python

I have a list with multiple dictionaries, like the following:
[{'Date': '6-1-2017', 'Rate':'0.3', 'Type':'A'},
{'Date': '6-1-2017', 'Rate':'0.4', 'Type':'B'},
{'Date': '6-1-2017', 'Rate':'0.6', 'Type':'F'},
{'Date': '6-1-2017', 'Rate':'0.1', 'Type':'B'}
]
I would now like to change the dates, because they need to be in the format 'yyymmdd', which starts at 1900-01-01. In other words, I would like to change the '6-1-2017' to '1170106'.
As this has to be done every week (with the then current date), I do not want to change this by hand. So next week, '13-1-2017' has to be transformed into '1170113'.
Any ideas how to do this? I have tried several things, but I can't even get my code to select the date values of all dictionaries.
Many thanks!
You can use the datetime module, which provides a lot of functionality to manipulate datetime objects including converting datetime to string and the way back, accessing different components of the datetime object, etc:
from datetime import datetime
for l in lst:
    l['Date'] = datetime.strptime(l['Date'], "%d-%m-%Y")
    l['Date'] = str(l['Date'].year - 1900) + l['Date'].strftime("%m%d")
lst
#[{'Date': '1170106', 'Rate': '0.3', 'Type': 'A'},
# {'Date': '1170106', 'Rate': '0.4', 'Type': 'B'},
# {'Date': '1170106', 'Rate': '0.6', 'Type': 'F'},
# {'Date': '1170106', 'Rate': '0.1', 'Type': 'B'}]
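For the weekly run, the same conversion could be wrapped in a small function that leaves the original list untouched. This is only a sketch; the name to_cyymmdd is invented, and it assumes dates always arrive as day-month-year.

```python
from datetime import datetime

def to_cyymmdd(day_month_year):
    """'6-1-2017' -> '1170106': years since 1900, then zero-padded month and day."""
    parsed = datetime.strptime(day_month_year, "%d-%m-%Y")
    return f"{parsed.year - 1900}{parsed:%m%d}"

lst = [{'Date': '6-1-2017', 'Rate': '0.3', 'Type': 'A'},
       {'Date': '13-1-2017', 'Rate': '0.4', 'Type': 'B'}]

# New dicts with the converted date; lst itself is not modified.
converted = [{**row, 'Date': to_cyymmdd(row['Date'])} for row in lst]
print(converted)
```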

python parse java calendar to isodate

I've data like this.
startDateTime: {'timeZoneID': 'America/New_York', 'date': {'year': '2014', 'day': '29', 'month': '1'}, 'second': '0', 'hour': '12', 'minute': '0'}
This is just a representation for 1 attribute. Like this i've 5 other attributes. LastModified, created etc.
I want to derive the ISO date format yyyy-mm-dd hh:mi:ss from this. Is this the right way of doing it?
def parse_date(datecol):
    x=datecol;
    y=str(x.get('date').get('year'))+'-'+str(x.get('date').get('month')).zfill(2)+'-'+str(x.get('date').get('day')).zfill(2)+' '+str(x.get('hour')).zfill(2)+':'+str(x.get('minute')).zfill(2)+':'+str(x.get('second')).zfill(2)
    print y;
    return;
That works, but I'd say it's cleaner to use the string formatting operator here:
def parse_date(c):
    d = c["date"]
    # The fields arrive as strings, so convert them to int for the %d conversions
    print("%04d-%02d-%02d %02d:%02d:%02d" % tuple(map(int, (d["year"], d["month"], d["day"], c["hour"], c["minute"], c["second"]))))
Alternatively, you can use the time module to convert your fields into a Python time value, and then format that using strftime. Remember the time zone, though.
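A minimal sketch of that idea, using the datetime module rather than time, is shown below. The function name calendar_to_datetime is made up, and the result is a naive datetime: the 'timeZoneID' field is deliberately not applied here and would need separate handling.

```python
from datetime import datetime

start = {'timeZoneID': 'America/New_York',
         'date': {'year': '2014', 'day': '29', 'month': '1'},
         'second': '0', 'hour': '12', 'minute': '0'}

def calendar_to_datetime(c):
    # All fields arrive as strings, so convert each one to int first.
    d = c['date']
    return datetime(int(d['year']), int(d['month']), int(d['day']),
                    int(c['hour']), int(c['minute']), int(c['second']))

print(calendar_to_datetime(start).strftime('%Y-%m-%d %H:%M:%S'))
# prints: 2014-01-29 12:00:00
```

Going through a datetime object also rejects impossible values (for example month 13), which plain string concatenation would pass through silently.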
