I know there are many questions about this type of sorting. I have tried many times by referring to those questions and by going through the re topic in the Python documentation.
My question is:
class Example(models.Model):
    _inherit = 'sorting.example'
    def unable_to_sort(self):
        data_list = ['Abigail Peterson Jan 25','Paul Williams Feb 1','Anita Oliver Jan 24','Ernest Reed Jan 28']
        self.update({'list_of_birthday_week': ','.join(r for r in data_list)})
I need the list to be sorted by month and day, like:
data_list = ['Anita Oliver Jan 24','Abigail Peterson Jan 25','Ernest Reed Jan 28','Paul Williams Feb 1']
Is there any way to achieve this?
Use a regex to extract the date, then use it as the key for the sorted function.
import re
from datetime import datetime
pattern = r'(\b(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\D?(?:\d{1,2}\D?))'
sort_by_date = lambda x: datetime.strptime(re.search(pattern, x).group(0), '%b %d')
out = sorted(data_list, key=sort_by_date)
Output:
>>> out
['Anita Oliver Jan 24',
'Abigail Peterson Jan 25',
'Ernest Reed Jan 28',
'Paul Williams Feb 1']
Input:
>>> data_list
['Abigail Peterson Jan 25',
'Paul Williams Feb 1',
'Ernest Reed Jan 28',
'Anita Oliver Jan 24']
You need to extract the date part from the string, and then turn the date string into a comparable format. For the first task, regexen would be a decent choice here, and for the second part, datetime.strptime would be appropriate:
>>> import re
>>> from datetime import datetime
>>>
>>> re.search(r'\w+ \d+$', 'Abigail Peterson Jan 25')
<re.Match object; span=(17, 23), match='Jan 25'>
>>> re.search(r'\w+ \d+$', 'Abigail Peterson Jan 25')[0]
'Jan 25'
>>>
>>> datetime.strptime('Jan 25', '%b %d')
datetime.datetime(1900, 1, 25, 0, 0)
>>> datetime.strptime(re.search(r'\w+ \d+$', 'Abigail Peterson Jan 25')[0], '%b %d')
datetime.datetime(1900, 1, 25, 0, 0)
Then turn that into a callback for list.sort:
>>> data_list.sort(key=lambda i: datetime.strptime(re.search(r'\w+ \d+$', i)[0], '%b %d'))
>>> data_list
['Anita Oliver Jan 24', 'Abigail Peterson Jan 25', 'Ernest Reed Jan 28', 'Paul Williams Feb 1']
You can also use split() to accomplish that.
from datetime import datetime
...
def unable_to_sort(self):
    data_list = ['Abigail Peterson Jan 25','Paul Williams Feb 1','Anita Oliver Jan 24','Ernest Reed Jan 28']
    def get_date(data):
        # the last two space-separated tokens are the month and the day
        month, day = data.split(" ")[-2:]
        return datetime.strptime(f"{month} {day}", "%b %d")
    sorted_data_list = sorted(data_list, key=get_date)
    self.update({'list_of_birthday_week': ','.join(sorted_data_list)})
You can use the sorted function with a key built from datetime.strptime() for the month and the integer day value.
from datetime import datetime
data_list = ['Abigail Peterson Jan 25','Paul Williams Feb 1','Anita Oliver Jan 24','Ernest Reed Jan 28']
k=[x.split() for x in data_list]
# compare the day as an int; as a string, '3' would sort after '24'
days_sorted = sorted(k, key=lambda x: (datetime.strptime(x[2],'%b'), int(x[3])))
[['Anita', 'Oliver', 'Jan', '24'],
['Abigail', 'Peterson', 'Jan', '25'],
['Ernest', 'Reed', 'Jan', '28'],
['Paul', 'Williams', 'Feb', '1']]
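One caveat with the strptime-based keys above: without a year, strptime defaults to 1900, which is not a leap year, so an entry ending in 'Feb 29' raises a ValueError. Below is a minimal sketch that avoids this by mapping the month abbreviation to a number directly. The names MONTH_INDEX, TRAILING_DATE and month_day_key are illustrative, and it assumes every entry ends in an English three-letter month followed by the day:

```python
import re

MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
# Map 'Jan' -> 1, 'Feb' -> 2, ... so no datetime parsing is needed.
MONTH_INDEX = {name: number for number, name in enumerate(MONTHS, start=1)}
# Assumption: the date is always the last two tokens, e.g. '... Jan 25'.
TRAILING_DATE = re.compile(r"([A-Za-z]{3}) (\d{1,2})$")

def month_day_key(entry):
    """Return a (month_number, day) tuple usable as a sort key."""
    match = TRAILING_DATE.search(entry)
    if match is None:
        raise ValueError(f"no trailing date in {entry!r}")
    month, day = match.groups()
    return MONTH_INDEX[month], int(day)

data_list = ['Abigail Peterson Jan 25', 'Paul Williams Feb 1',
             'Anita Oliver Jan 24', 'Ernest Reed Jan 28']
sorted_list = sorted(data_list, key=month_day_key)
print(sorted_list)
```

Like the other answers, this orders entries within a single calendar year; a birthday week that spans December and January would need extra handling.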
I'm a bit stuck on writing a custom function that can parse dates in any given language, using a set of rules that identify and parse months and years, including some special cases (shown below).
Input data:
dates_eg = [["Issued Feb 2018 · No Expiration Date"], ["Oct 2021 - Present"],
["Jan 2019 - Oct 2021 · 2 yrs 10 mos"], ["1994 - 2000"], ["Sep 2010 – Sep 2015 • 5 yrs"],
["Nov 2019 – Present"], ["Sep 2015 – Mar 2017 • 1 yr 6 mos"], ["Apr 2019 – Present • 2 yrs 8 mos"],
["Issued: Aug 2018 · Expired Aug 2020"], ["Mar 2017 - Dec 2017 · 10 mos"],
["May 2021 - Present · 9 mos"]]
NOTE: No translation API should be used. Instead, I've created lists of FLAG words (in several languages) and STOPWORDS that should identify all the cases.
Present, Issued, Expired, No Expiration Date -> all of these four keywords/stopwords would be identified through the lists (see underneath).
Current workflow of the functions:
e.g. "Issued Feb 2018 · No Expiration Date"
First, the function raw_dates_extractor separates the start date ('Issued Feb 2018') from the end date ('No Expiration Date') at the separator ' · '.
Second, the function clean_extract_dates identifies 'Issued' and removes it.
All of the above results in:
{'Start_date_raw0': 'Feb 2018', 'Start_date_raw1': 'Oct 2021', 'Start_date_raw2': 'Jan 2019', 'Start_date_raw3': '1994', 'Start_date_raw4': 'Sep 2010', 'Start_date_raw5': 'Nov 2019', 'Start_date_raw6': 'Sep 2015', 'Start_date_raw7': 'Apr 2019', 'Start_date_raw8': 'Aug 2018', 'Start_date_raw9': 'Mar 2017', 'Start_date_raw10': 'May 2021', 'End_date_raw0': 'No Expiration Date', 'End_date_raw1': 'Present', 'End_date_raw2': 'Oct 2021', 'End_date_raw3': '2000', 'End_date_raw4': 'Sep 2015', 'End_date_raw5': 'Present', 'End_date_raw6': 'Mar 2017', 'End_date_raw7': 'Present', 'End_date_raw8': 'Aug 2020', 'End_date_raw9': 'Dec 2017', 'End_date_raw10': 'Present'}
However, the desired output would go further by identifying years, months, and stopwords/flag words: 'No Expiration Date' and 'Present' would be flagged as ONGOING = 1, anything else as 0.
Taking the first raw example:
"Issued Feb 2018 · No Expiration Date" would be:
Desired result:
{'start_year': 2018, 'start_month': "Feb", 'end_year': None,
'end_month': None, 'is_on_going': True}
Complete Code attempt:
Lists with Expired, Issued, No Expiration Date, Present
import re
import pandas as pd
# "Present" word in all language combination
present_all_lang = ["současnost", "I dag", "Heute", "Present", "actualidad", "aujourd’hui", "वर्तमान",
                    "Saat ini", "Presente",
                    "現在", "현재", "Kini", "heden", "Nå", "obecnie", "o momento",
                    "Prezent", "настоящее время", "nu", "ปัจจุบัน ",
                    "Kasalukuyan", "Halen", "至今", "現在"]
# "Issued" word in all language combination
issued_all_lang = ["Vydáno", "Udstedt", "Ausgestellt", "Issued", "Expedición", "Émise le", "नव॰", "Diterbitkan",
                   "Data di rilascio",
                   "発行日", "발행일", "Dikeluarkan", "Toegekend", "Utstedt", "Wydany", "Emitido em",
                   "Дата выдачи", "Utfärdat", "ออกเมื่อ",
                   "Inisyu", "tarihinde verildi", "颁发日期", "發照日期"]
# "No expiration date" in all languages
no_experience_all_lang = ["Bez data vypršení", "Ingen udløbsdato", "Kein Ablaufdatum", "No Expiration Date",
                          "Sin fecha de vencimiento", "Sans date d’expiration", "समाप्ति की कोई तारीख नहीं",
                          "Tidak Ada Tanggal Kedaluwarsa", "Nessuna data di scadenza", "有効期限なし",
                          "만료일 없음",
                          "Tiada Tarikh Tamat Tempoh", "Geen vervaldatum", "Ingen utløpsdato",
                          "Brak daty ważności", "Sem data de expiração", "Без истечения срока действия",
                          "Inget utgångsdatum", "ไม่มีวันหมดอายุ",
                          "Walang Petsa ng Pagtatapos",
                          "Sona Erme Tarihi Yok", "长期有效", "永久有效"]
# "Expired" in all lang
expires_all_lang = ["Vyprší", "Udløbet", "Gültig bis", "Expired", "Vencimiento", "Expire le",
"को समाप्त हो गया है", "Kedaluwarsa", "Data di scadenza", "有効期限", " 만료됨", "Tamat Tempoh",
"Verloopt op", "Utløper", "Wygasa", "Expira em", "A expirat la", "Срок действия истек",
"Upphörde att gälla", "หมดอายุเมื่อ", "Mag-e-expire sa", "tarihinde süresi bitiyor", "有效期至", "過期"]
# a colon may appear; get rid of it!
def raw_dates_extractor(list_of_examples: list) -> dict:
    raw_start_date, raw_end_date, merged_start_end_dates = {}, {}, {}
    for row, datum in enumerate(list_of_examples):
        # split on any of the separators used between start date, end date and duration
        date_splitter = re.split("[·•–-]", datum[0])
        raw_start_date[f'Start_date_raw{row}'] = date_splitter[0].strip(' ')
        raw_end_date[f'End_date_raw{row}'] = date_splitter[1].strip(' ')
    merged_start_end_dates = raw_start_date | raw_end_date
    return merged_start_end_dates
output:
raw_dates_extractor(dates_eg)
{'Start_date_raw0': 'Issued Feb 2018', 'Start_date_raw1': 'Oct 2021', 'Start_date_raw2': 'Jan 2019', 'Start_date_raw3': '1994', 'Start_date_raw4': 'Sep 2010', 'Start_date_raw5': 'Nov 2019', 'Start_date_raw6': 'Sep 2015', 'Start_date_raw7': 'Apr 2019', 'Start_date_raw8': 'Issued: Aug 2018', 'Start_date_raw9': 'Mar 2017', 'Start_date_raw10': 'May 2021', 'End_date_raw0': 'No Expiration Date', 'End_date_raw1': 'Present', 'End_date_raw2': 'Oct 2021', 'End_date_raw3': '2000', 'End_date_raw4': 'Sep 2015', 'End_date_raw5': 'Present', 'End_date_raw6': 'Mar 2017', 'End_date_raw7': 'Present', 'End_date_raw8': 'Expired Aug 2020', 'End_date_raw9': 'Dec 2017', 'End_date_raw10': 'Present'}
def clean_extract_dates(input_dictionary: dict, aggregate_stop_words: list):
    container_replica = {}
    for date_element in input_dictionary:
        query_req = input_dictionary[date_element]
        if ":" in query_req:
            query_req = query_req.replace(":", "")
        if "." in query_req:
            query_req = query_req.replace(".", "")
        stopwords = aggregate_stop_words
        query_words = query_req.split()
        result_words = [word for word in query_words if word not in stopwords]
        end_result = ' '.join(result_words)
        container_replica[date_element] = end_result
    return container_replica
output:
clean_extract_dates(raw_dates_extractor(dates_eg), aggregate_lists)
{'Start_date_raw0': 'Feb 2018', 'Start_date_raw1': 'Oct 2021', 'Start_date_raw2': 'Jan 2019', 'Start_date_raw3': '1994', 'Start_date_raw4': 'Sep 2010', 'Start_date_raw5': 'Nov 2019', 'Start_date_raw6': 'Sep 2015', 'Start_date_raw7': 'Apr 2019', 'Start_date_raw8': 'Aug 2018', 'Start_date_raw9': 'Mar 2017', 'Start_date_raw10': 'May 2021', 'End_date_raw0': 'No Expiration Date', 'End_date_raw1': 'Present', 'End_date_raw2': 'Oct 2021', 'End_date_raw3': '2000', 'End_date_raw4': 'Sep 2015', 'End_date_raw5': 'Present', 'End_date_raw6': 'Mar 2017', 'End_date_raw7': 'Present', 'End_date_raw8': 'Aug 2020', 'End_date_raw9': 'Dec 2017', 'End_date_raw10': 'Present'}
aggregate_lists = issued_all_lang + no_experience_all_lang + expires_all_lang
saved_dictionary = clean_extract_dates(raw_dates_extractor(dates_eg), aggregate_lists)
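The step from the cleaned strings to the desired result could be sketched roughly as follows. This is only an illustration under stated assumptions: ONGOING_FLAGS is a shortened, hypothetical stand-in for the full "Present" and "No Expiration Date" lists, years are assumed to be four digits starting with 19 or 20, and whatever remains after removing the year is treated as the month token without translating it. The names parse_date_part and parse_range are made up for this example.

```python
import re

# Hypothetical, shortened flag set; in practice it would hold every entry of
# the "Present" and "No Expiration Date" lists.
ONGOING_FLAGS = {"Present", "No Expiration Date"}
# Assumption: a year is a standalone four-digit number starting with 19 or 20.
YEAR_RE = re.compile(r"\b(?:19|20)\d{2}\b")

def parse_date_part(raw):
    """Return (year, month, is_ongoing) for one cleaned date string."""
    text = raw.strip()
    if text in ONGOING_FLAGS:
        return None, None, True
    year_match = YEAR_RE.search(text)
    year = int(year_match.group()) if year_match else None
    # Whatever is left once the year is removed is kept as the month token.
    month = YEAR_RE.sub("", text).strip() or None
    return year, month, False

def parse_range(start_raw, end_raw):
    """Combine a cleaned start string and end string into one record."""
    start_year, start_month, _ = parse_date_part(start_raw)
    end_year, end_month, ongoing = parse_date_part(end_raw)
    return {'start_year': start_year, 'start_month': start_month,
            'end_year': end_year, 'end_month': end_month,
            'is_on_going': ongoing}

print(parse_range('Feb 2018', 'No Expiration Date'))
print(parse_range('1994', '2000'))
```

Matching the whole string against the flag set, rather than word by word, keeps multi-word flags such as 'No Expiration Date' intact.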
I need to check whether a string is a valid date range. The string can include months as text and years as numbers, in no specific order (there is no fixed format like MM-YYYY-DD).
A valid string could be:
February 2016 - March 2019
September 2015 to August 2019
April 2015 to present
September 2018 - present
Invalid string:
George Mason University august 2019
Stratusburg university February 2018
Some text and month followed by year
I already looked into issues such as
a) Constructing Regular Expressions to match numeric ranges
b) Regex to match month name followed by year
and many others, but most of the input strings in those questions seem to have the luxury of a fixed pattern for the month and year, which I don't have.
I tried this regex in python:
import re
pat = r"(\b\d{1,2}\D{0,3})?\b(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(Nov|Dec)(?:ember)?)\D?(\d{1,2}(st|nd|rd|th)?)?(([,.\-\/])\D?)?((19[7-9]\d|20\d{2})|\d{2})*"
st = "University of Pennsylvania February 2018"
re.search(pat, st)
But that matches both the valid and the invalid strings from my example, and I want to keep invalid strings out of my eventual output.
For the input "University of Pennsylvania February 2018" the expected output is False.
For "February 2018 to Present" the output must be True.
This regex validates date ranges that follow the format MONTH YEAR (MONTH YEAR | PRESENT):
import re
# for extra complexity, the first line contains two valid ranges
text = """
February 2016 - March 2019 February 2017 - March 2019
September 2015 to August 2019
April 2015 to present
September 2018 - present
George Mason University august 2019
Stratusburg university February 2018
Some text and month followed by year
"""
# writing the regex on one line would make it very ugly
MONTHS_RE = ['Jan(?:uary)?', 'Feb(?:ruary)?', 'Mar(?:ch)?', 'Apr(?:il)?', 'May', 'Jun(?:e)?', 'Jul(?:y)?', 'Aug(?:ust)?',
             'Sep(?:tember)?', 'Oct(?:ober)?', '(?:Nov|Dec)(?:ember)?']
# matches a month name and captures it: (Jan(?:uary)?|Feb(?:ruary)?...|(?:Nov|Dec)(?:ember)?)
RE_MONTH = '({})'.format('|'.join(MONTHS_RE))
# matches a month followed by a 2- to 4-digit year; used twice in the final regex
RE_DATE = r'{RE_MONTH}(?:[\s]+)(\d{{2,4}})'.format(RE_MONTH=RE_MONTH)
# final regex
RE_VALID_RANGE = re.compile(r'{RE_DATE}.+?(?:{RE_DATE}|(present))'.format(RE_DATE=RE_DATE), flags=re.IGNORECASE)
# if you want to extract both valid and invalid lines
valid_ranges = []
invalid_ranges = []
for line in text.split('\n'):
    if line:
        groups = re.findall(RE_VALID_RANGE, line)
        if groups:
            # If you want to do something with the ranges:
            # all valid ranges are here; there may be 1 or 2, depending on how many valid ranges the line contains
            # every group has 5 elements because there are 5 capturing groups
            # if M2 and Y2 are not empty, present is empty, and vice versa (because of (?:{RE_DATE}|(present)))
            M1, Y1, M2, Y2, present = groups[0]  # use a loop here if you want to verify the values further
            valid_ranges.append(line)
        else:
            invalid_ranges.append(line)
print('VALID: ', valid_ranges)
print('INVALID:', invalid_ranges)
# this yields only valid ranges; if a line contains two, it yields two valid ranges
# if you are working line by line, this is not what you want
valid_ranges = []
for match in re.finditer(RE_VALID_RANGE, text):
    # if you want to check the ranges
    M1, Y1, M2, Y2, present = match.groups()
    valid_ranges.append(match.group(0))  # the matched text is returned here
print('VALID USING <finditer>: ', valid_ranges)
Output:
VALID: ['February 2016 - March 2019 February 2017 - March 2019', 'September 2015 to August 2019', 'April 2015 to present', 'September 2018 - present']
INVALID: ['George Mason University august 2019', 'Stratusburg university February 2018', 'Some text and month followed by year']
VALID USING <finditer>: ['February 2016 - March 2019', 'February 2017 - March 2019', 'September 2015 to August 2019', 'April 2015 to present', 'September 2018 - present']
I hate writing a long regular expression in a single str variable; I prefer to break it up so that I can still understand what it does when I read my code six months later. Note how finditer splits the first line into two valid range strings.
If you want just to extract ranges you can use this:
valid_ranges = re.findall(RE_VALID_RANGE, text)
But this returns the groups [(M1, Y1, M2, Y2, present), ...], not the text:
[('February', '2016', 'March', '2019', ''), ('February', '2017', 'March', '2019', ''), ('September', '2015', 'August', '2019', ''), ('April', '2015', '', '', 'present'), ('September', '2018', '', '', 'present')]
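Since the question asks for a True/False answer per string, the same idea can be wrapped in a predicate. This is a minimal sketch, assuming a range must span the whole string, years have four digits, and the separator is a hyphen, an en dash or the word "to". The names MONTH, DATE, RANGE_RE and is_valid_range are illustrative and not part of the answer above:

```python
import re

# Full or abbreviated English month names; 'Sept' is accepted as well.
MONTH = (r"(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|"
         r"Jul(?:y)?|Aug(?:ust)?|Sep(?:t(?:ember)?)?|Oct(?:ober)?|"
         r"Nov(?:ember)?|Dec(?:ember)?)")
# A month, an optional abbreviation dot, whitespace, then a four-digit year.
DATE = MONTH + r"\.?\s+\d{4}"
# start date, separator, then either an end date or the word 'present'
RANGE_RE = re.compile(rf"{DATE}\s*(?:-|–|to)\s*(?:{DATE}|present)", re.IGNORECASE)

def is_valid_range(text):
    """True only if the whole string is a 'MONTH YEAR sep MONTH YEAR|present' range."""
    return RANGE_RE.fullmatch(text.strip()) is not None

for sample in ["February 2018 to Present",
               "University of Pennsylvania February 2018"]:
    print(sample, "->", is_valid_range(sample))
```

Using fullmatch instead of search is what rejects strings with leading text such as a university name.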
Maybe, you could reduce the boundaries of your expression with some simple ones such as:
(?i)^\S+\s+(\d{2})?(\d{2})\s*(?:[-_]|to)\s*(present|\S+)\s*(\d{2})?(\d{2})?$
or maybe,
(?i)\S+\s+(\d{2})?(\d{2})\s*(?:[-_]|to)\s*(present|\S+)\s*(\d{2})?(\d{2})?
Test
import re
regex = r"(?i)^\S+\s+(\d{2})?(\d{2})\s*(?:[-_]|to)\s*(present|\S+)\s*(\d{2})?(\d{2})?$"
string = """
February 2016 - March 2019
September 2015 to August 2019
April 2015 to present
September 2018 - present
Feb. 2016 - March 2019
Sept 2015 to Aug. 2019
April 2015 to present
Nov. 2018 - present
Invalid string:
George Mason University august 2019
Stratusburg university February 2018
Some text and month followed by year
"""
print(re.findall(regex, string, re.M))
Output
[('20', '16', 'March', '20', '19'), ('20', '15', 'August', '20', '19'), ('20', '15', 'present', '', ''), ('20', '18', 'present', '', ''), ('20', '16', 'March', '20', '19'), ('20', '15', 'Aug.', '20', '19'), ('20', '15', 'present', '', ''), ('20', '18', 'present', '', '')]
If you wish to simplify, modify, or explore the expression, it is explained in the top-right panel of regex101.com, where you can also watch how it matches against some sample inputs.
Say I have two python lists as:
ListA = ['Jan 2018', 'Feb 2018', 'Mar 2018']
ListB = ['Sales Jan 2018','Units sold Jan 2018','Sales Feb 2018','Units sold Feb 2018','Sales Mar 2018','Units sold Mar 2018']
I need to get an output as:
List_op = ['Jan 2018 Sales Jan 2018 Units sold Jan 2018','Feb 2018 Sales Feb 2018 Units sold Feb 2018','Mar 2018 Sales Mar 2018 Units sold Mar 2018']
My approach so far:
res = set()
for i in ListB:
    for j in ListA:
        if j in i:
            res.add(f'{i} {j}')
print(res)
This gives me the result:
{'Units sold Jan 2018 Jan 2018', 'Sales Feb 2018 Feb 2018', 'Units sold Mar 2018 Mar 2018', 'Units sold Feb 2018 Feb 2018', 'Sales Jan 2018 Jan 2018', 'Sales Mar 2018 Mar 2018'}
which is definitely not the solution I'm looking for.
I think a regular expression could be handy here, but I'm not sure how to approach it. Any help in this regard is highly appreciated.
Thanks in advance.
Edit:
Values in ListA and ListB are not necessarily in order. Therefore, for a particular month/year value in ListA, the same month/year value has to be matched in ListB and picked for both the 'Sales' and the 'Units sold' component, and the results need to be concatenated.
My main goal here is to get the list which I can use later to generate a statement that I'll be using to write Hive query.
Added more explanation as suggested by @andrew_reece.
Assuming no additional edge cases that need taking care of, your original code is not bad, just needs a slight update:
List_op = []
for a in ListA:
    combined = a
    for b in ListB:
        if a in b:
            combined += " " + b
    List_op.append(combined)
List_op
['Jan 2018 Sales Jan 2018 Units sold Jan 2018',
'Feb 2018 Sales Feb 2018 Units sold Feb 2018',
'Mar 2018 Sales Mar 2018 Units sold Mar 2018']
Supposing ListA and ListB are sorted:
ListA = ['Jan 2018', 'Feb 2018', 'Mar 2018']
ListB = ['Sales Jan 2018','Units sold Jan 2018','Sales Feb 2018','Units sold Feb 2018','Sales Mar 2018','Units sold Mar 2018']
print([v1 + " " + v2 for v1, v2 in zip(ListA, [v1 + " " + v2 for v1, v2 in zip(ListB[::2], ListB[1::2])])])
This will print:
['Jan 2018 Sales Jan 2018 Units sold Jan 2018', 'Feb 2018 Sales Feb 2018 Units sold Feb 2018', 'Mar 2018 Sales Mar 2018 Units sold Mar 2018']
In my example I first concatenate consecutive pairs of ListB items and then join ListA with this new list.
String concatenation can become expensive. In Python 3.6+, you can use more efficient f-strings within a list comprehension:
res = [f'{i} {j} {k}' for i, j, k in zip(ListA, ListB[::2], ListB[1::2])]
print(res)
['Jan 2018 Sales Jan 2018 Units sold Jan 2018',
'Feb 2018 Sales Feb 2018 Units sold Feb 2018',
'Mar 2018 Sales Mar 2018 Units sold Mar 2018']
Using itertools.islice, you can avoid the expense of creating new lists:
from itertools import islice
zipper = zip(ListA, islice(ListB, 0, None, 2), islice(ListB, 1, None, 2))
res = [f'{i} {j} {k}' for i, j, k in zipper]
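The zip-based answers rely on both lists being in matching order, which the question's edit says is not guaranteed. A hedged sketch that groups ListB items by their trailing month/year in a single pass and then follows ListA's order; combine_by_period and PERIOD_RE are hypothetical names, and the pattern assumes every ListB item ends in a three-letter month and a four-digit year:

```python
import re
from collections import defaultdict

# Assumption: each labelled item ends with e.g. 'Jan 2018'.
PERIOD_RE = re.compile(r"[A-Z][a-z]{2} \d{4}$")

def combine_by_period(periods, labelled_items):
    """Join each period with all items that end in that period."""
    by_period = defaultdict(list)
    for item in labelled_items:
        match = PERIOD_RE.search(item)
        if match:
            by_period[match.group()].append(item)
    # sorted() puts 'Sales ...' before 'Units sold ...' regardless of input order
    return [" ".join([period] + sorted(by_period.get(period, [])))
            for period in periods]

ListA = ['Feb 2018', 'Mar 2018', 'Jan 2018']
ListB = ['Units sold Mar 2018', 'Sales Jan 2018', 'Units sold Feb 2018',
         'Sales Mar 2018', 'Sales Feb 2018', 'Units sold Jan 2018']
List_op = combine_by_period(ListA, ListB)
print(List_op)
```

The output follows the order of ListA, so sort ListA first if a chronological result is needed.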
I am trying to create a key: value1, value2 dictionary from a list of bytes objects and am having issues with it.
Here is my list
[b'Expected in April 2018',
b'Murder At Koh E Fiza',
b'34',
b'06 April 2018',
b'Subedar Joginder Singh',
b'0',
b'06 April 2018',
b'Blackmail',
b'86',
b'06 April 2018',
b'Missing',
b'0',
b'13 April 2018',
b'October',
b'59',
b'13 April 2018',
b'Mercury',
b'0',
b'20 April 2018',
b'Omerta',
b'50']
I have tried following code:
b = dict(zip(list[1::3],(list[2::3]+list[0::3])))
but I don't get the third value in the key-value pair.
I have also tried
b = dict(zip(list[1::3],list[2::3]+list[0::3]))
Same issue: I get the following output with both of these statements:
{b'Murder At Koh E Fiza': b'34', b'Subedar Joginder Singh': b'0',
b'Blackmail': b'86', b'Missing': b'0', b'October': b'59', b'Mercury': b'0',
b'Omerta': b'50'}
I am looking for the following output
b'Murder At Koh E Fiza': b'34',b'Expected in April 2018',
b'Subedar Joginder Singh': b'0',b'06 April 2018',
Please let me know
I think you are looking to associate a list or a tuple with each key in your dictionary. So something like this should work:
dict( zip(list[1::3], zip( list[2::3], list[0::3] ) ))
Which results in
{'Mercury': ('0', '13 April 2018'), 'Murder At Koh E Fiza': ('34', 'Expected in April 2018'), 'October': ('59', '13 April 2018'), 'Missing': ('0', '06 April 2018'), 'Blackmail': ('86', '06 April 2018'), 'Omerta': ('50', '20 April 2018'), 'Subedar Joginder Singh': ('0', '06 April 2018')}
You can use zip alongside with a dict comprehension
a = [b'Expected in April 2018',
b'Murder At Koh E Fiza',
b'34',
b'06 April 2018',
b'Subedar Joginder Singh',
b'0',
b'06 April 2018',
b'Blackmail',
b'86',
b'06 April 2018',
b'Missing',
b'0',
b'13 April 2018',
b'October',
b'59',
b'13 April 2018',
b'Mercury',
b'0',
b'20 April 2018',
b'Omerta',
b'50']
# step through the list three items at a time: (date, title, count)
final = {v: [m, k] for k, v, m in zip(a[::3], a[1::3], a[2::3])}
print(final.get(b'Murder At Koh E Fiza'))
print(final.get(b'Subedar Joginder Singh'))
output:
[b'34', b'Expected in April 2018']
[b'0', b'06 April 2018']
You can find that out without zipping as well:
my_list = ['Expected in April 2018', 'Murder At Koh E Fiza', '34',
           '06 April 2018', 'Subedar Joginder Singh', '0', '06 April 2018',
           'Blackmail', '86', '06 April 2018', 'Missing', '0', '13 April 2018',
           'October', '59', '13 April 2018', 'Mercury', '0', '20 April 2018',
           'Omerta', '50']
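n = 3
# split the flat list into chunks of three: [date, title, count]
composite_list = [my_list[x:x+n] for x in range(0, len(my_list), n)]
# title -> [count, date], the order requested in the question
d = {chunk[1]: [chunk[2], chunk[0]] for chunk in composite_list}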
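If plain strings are preferred over bytes, as in the output of the first answer, decoding and grouping can be combined. This is a sketch only: records_to_dict is a hypothetical helper, UTF-8 is an assumed encoding, and the list is assumed to be a flat sequence of (date, title, count) triples:

```python
def records_to_dict(items, encoding="utf-8"):
    """Map title -> (count, date) from a flat list of bytes triples."""
    if len(items) % 3 != 0:
        raise ValueError("expected a flat list of (date, title, count) triples")
    decoded = [item.decode(encoding) for item in items]
    return {
        title: (count, date)
        for date, title, count in zip(decoded[0::3], decoded[1::3], decoded[2::3])
    }

raw = [b'Expected in April 2018', b'Murder At Koh E Fiza', b'34',
       b'06 April 2018', b'Subedar Joginder Singh', b'0']
movies = records_to_dict(raw)
print(movies)
```

The length check makes a truncated list fail loudly instead of silently dropping the last, incomplete record.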