split element inside list python [duplicate] - python

I was trying to split the element inside a list into pieces of a fixed length. Here is the list: ['Mar 18 Mar 17 Mar 16 Mar 15 Mar 14']. Could anyone help me retrieve the values from the list in the following format:
['Mar 18','Mar 17','Mar 16','Mar 15','Mar 14']

A regular expression based approach that would handle cases like Apr 1 or Dec 31 as well as multiple elements in the initial list:
import re
lst = ['Mar 18 Mar 17 Mar 16 Mar 15 Mar 14']
[x for y in lst for x in re.findall(r'[A-Z][a-z]+ \d{1,2}', y)]
# ['Mar 18', 'Mar 17', 'Mar 16', 'Mar 15', 'Mar 14']

lst = ['Mar 18 Mar 17 Mar 16 Mar 15 Mar 14']
span = 2
words = lst[0].split(" ")
# Re-join every `span` consecutive words into one element.
s = [" ".join(words[i:i+span]) for i in range(0, len(words), span)]
print(s)
For me, this prints
['Mar 18', 'Mar 17', 'Mar 16', 'Mar 15', 'Mar 14']
Taken from this answer.

Try this code.
You can do it with a regular expression (just import the re library in Python):
import re
lst = ['Mar 18 Mar 17 Mar 16 Mar 15 Mar 14 Mar 2']
obj = re.findall(r'[A-Z][a-z]+[ ](?:\d{2}|\d{1})', lst[0])
print(obj)
Output :
['Mar 18', 'Mar 17', 'Mar 16', 'Mar 15', 'Mar 14', 'Mar 2']

You can also try this one
>>> lst = ['Mar 18 Mar 17 Mar 16 Mar 15 Mar 14']
>>> result = lst[0].split(" ")
>>> [i+' '+j for i,j in zip(result[::2], result[1::2])]
Output
['Mar 18', 'Mar 17', 'Mar 16', 'Mar 15', 'Mar 14']
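The split-and-rejoin answers above can be folded into one small helper. This is a minimal sketch, not taken from any of the answers: the name split_every and its parameters are hypothetical, and it assumes the tokens are separated by single spaces and come in complete groups of n.

```python
def split_every(text, n, sep=" "):
    """Split text on sep, then re-join every n consecutive tokens."""
    tokens = text.split(sep)
    return [sep.join(tokens[i:i + n]) for i in range(0, len(tokens), n)]


# A second element is added here only to show that multi-element lists work.
lst = ['Mar 18 Mar 17 Mar 16 Mar 15 Mar 14', 'Apr 1 Dec 31']
result = [chunk for item in lst for chunk in split_every(item, 2)]
print(result)
# ['Mar 18', 'Mar 17', 'Mar 16', 'Mar 15', 'Mar 14', 'Apr 1', 'Dec 31']
```

Unlike the regex answers, this does not check that each chunk looks like a date; it only counts separators.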

Related

sort a list of string numerically in python

I know there are many questions about this type of sorting. I tried many times by referring to those questions and by going through the re topic in Python, too.
My question is:
class Example(models.Model):
    _inherit = 'sorting.example'

    def unable_to_sort(self):
        data_list = ['Abigail Peterson Jan 25','Paul Williams Feb 1','Anita Oliver Jan 24','Ernest Reed Jan 28']
        self.update({'list_of_birthday_week': ','.join(r for r in data_list)})
I need it to be sorted by month and day, like:
data_list = ['Anita Oliver Jan 24','Abigail Peterson Jan 25','Ernest Reed Jan 28','Paul Williams Feb 1']
Is there any way to achieve this?
Use a regex to extract the date, then use it as the key of the sorted function.
import re
from datetime import datetime
pattern = r'(\b(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\D?(?:\d{1,2}\D?))'
sort_by_date = lambda x: datetime.strptime(re.search(pattern, x).group(0), '%b %d')
out = sorted(data_list, key=sort_by_date)
Output:
>>> out
['Anita Oliver Jan 24',
'Abigail Peterson Jan 25',
'Ernest Reed Jan 28',
'Paul Williams Feb 1']
Input:
>>> data_list
['Abigail Peterson Jan 25',
'Paul Williams Feb 1',
'Ernest Reed Jan 28',
'Anita Oliver Jan 24']
You need to extract the date part from the string, and then turn the date string into a comparable format. For the first task, regexen would be a decent choice here, and for the second part, datetime.strptime would be appropriate:
>>> import re
>>> from datetime import datetime
>>>
>>> re.search(r'\w+ \d+$', 'Abigail Peterson Jan 25')
<re.Match object; span=(17, 23), match='Jan 25'>
>>> re.search(r'\w+ \d+$', 'Abigail Peterson Jan 25')[0]
'Jan 25'
>>>
>>> datetime.strptime('Jan 25', '%b %d')
datetime.datetime(1900, 1, 25, 0, 0)
>>> datetime.strptime(re.search(r'\w+ \d+$', 'Abigail Peterson Jan 25')[0], '%b %d')
datetime.datetime(1900, 1, 25, 0, 0)
Then turn that into a callback for list.sort (which sorts in place and returns None):
>>> data_list.sort(key=lambda i: datetime.strptime(re.search(r'\w+ \d+$', i)[0], '%b %d'))
>>> data_list
['Anita Oliver Jan 24', 'Abigail Peterson Jan 25', 'Ernest Reed Jan 28', 'Paul Williams Feb 1']
You can also use split() to accomplish that.
from datetime import datetime
...
def unable_to_sort(self):
    data_list = ['Abigail Peterson Jan 25','Paul Williams Feb 1','Anita Oliver Jan 24','Ernest Reed Jan 28']

    def get_date(data):
        name, str_date = data.split(" ")[:-2], data.split(" ")[-2:]
        month, day = str_date
        return datetime.strptime(f"{month} {day}", "%b %d")

    sorted_data_list = sorted(data_list, key=get_date)
    self.update({'list_of_birthday_week': ','.join(r for r in sorted_data_list)})
You can use the sorted function with a key built from datetime.strptime() for the month and the day as an integer.
from datetime import datetime
data_list = ['Anita Oliver Jan 24','Abigail Peterson Jan 25','Ernest Reed Jan 28','Paul Williams Feb 1']
k = [x.split() for x in data_list]
# Compare the day as an int; as a string, '3' would sort after '24'.
days_sorted = sorted(k, key=lambda x: (datetime.strptime(x[2], '%b'), int(x[3])))
[['Anita', 'Oliver', 'Jan', '24'],
['Abigail', 'Peterson', 'Jan', '25'],
['Ernest', 'Reed', 'Jan', '28'],
['Paul', 'Williams', 'Feb', '1']]
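One caveat with the strptime-based keys above: parsing '%b %d' without a year defaults to the year 1900, which is not a leap year, so an entry ending in 'Feb 29' raises ValueError. A sketch that sidesteps this by sorting on a (month number, day) tuple instead; MONTHS and month_day_key are hypothetical names, and it assumes every entry ends with an English three-letter month abbreviation followed by a day.

```python
import re

# Month abbreviation -> month number (1-12); English names are an assumption.
MONTHS = {name: number for number, name in enumerate(
    "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split(), start=1)}


def month_day_key(entry):
    """Return (month, day) parsed from the end of a 'Name ... Mon D' string."""
    match = re.search(r'([A-Z][a-z]{2}) (\d{1,2})$', entry)
    if match is None:
        raise ValueError(f"no trailing date in {entry!r}")
    return MONTHS[match.group(1)], int(match.group(2))


data_list = ['Abigail Peterson Jan 25', 'Paul Williams Feb 1',
             'Anita Oliver Jan 24', 'Ernest Reed Jan 28']
out = sorted(data_list, key=month_day_key)
print(out)
# ['Anita Oliver Jan 24', 'Abigail Peterson Jan 25', 'Ernest Reed Jan 28', 'Paul Williams Feb 1']
```

Because the key works on the original strings, the sorted result keeps the full entries rather than split word lists.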

Text mining on string dates from fuzzy example

I'm a bit stuck writing a custom function that can parse dates in any given language, using a set of rules that identify and parse months and years, including some special cases (shown below).
Input data:
dates_eg = [["Issued Feb 2018 · No Expiration Date"], ["Oct 2021 - Present"],
["Jan 2019 - Oct 2021 · 2 yrs 10 mos"], ["1994 - 2000"], ["Sep 2010 – Sep 2015 • 5 yrs"],
["Nov 2019 – Present"], ["Sep 2015 – Mar 2017 • 1 yr 6 mos"], ["Apr 2019 – Present • 2 yrs 8 mos"],
["Issued: Aug 2018 · Expired Aug 2020"], ["Mar 2017 - Dec 2017 · 10 mos"],
["May 2021 - Present · 9 mos"]]
NOTE: No translation API should be used. Instead, I've created lists of FLAG words (in several languages) and STOPWORDS that should identify all the cases.
Present, Issued, Expired, No Expiration Date -> all four of these keywords/stopwords are identified through the lists (see below).
Current workflow of the functions:
e.g. "Issued Feb 2018 · No Expiration Date"
First, the function raw_dates_extractor separates the start date ('Issued Feb 2018') from the end date ('No Expiration Date') at the separator ' · '.
Second, the function clean_extract_dates identifies 'Issued' and removes it.
All of the above results in:
{'Start_date_raw0': 'Feb 2018', 'Start_date_raw1': 'Oct 2021', 'Start_date_raw2': 'Jan 2019', 'Start_date_raw3': '1994', 'Start_date_raw4': 'Sep 2010', 'Start_date_raw5': 'Nov 2019', 'Start_date_raw6': 'Sep 2015', 'Start_date_raw7': 'Apr 2019', 'Start_date_raw8': 'Aug 2018', 'Start_date_raw9': 'Mar 2017', 'Start_date_raw10': 'May 2021', 'End_date_raw0': 'No Expiration Date', 'End_date_raw1': 'Present', 'End_date_raw2': 'Oct 2021', 'End_date_raw3': '2000', 'End_date_raw4': 'Sep 2015', 'End_date_raw5': 'Present', 'End_date_raw6': 'Mar 2017', 'End_date_raw7': 'Present', 'End_date_raw8': 'Aug 2020', 'End_date_raw9': 'Dec 2017', 'End_date_raw10': 'Present'}
However, the desired output would go further by extracting/identifying years, months, and stopwords/flag words (e.g. 'No Expiration Date' or 'Present' would be flagged as 'ONGOING' = 1, otherwise 0).
From the first raw example:
"Issued Feb 2018 · No Expiration Date" would be:
Desired result:
{'start_year': 2018, 'start_month': "Feb", 'end_year': None,
'end_month': None, 'is_on_going': True}
Complete Code attempt:
Lists with Expired, Issued, No Expiration Date, Present
import re
import pandas as pd
# The word "Present" in all supported languages
present_all_lang = ["současnost", "I dag", "Heute", "Present", "actualidad", "aujourd’hui", "वर्तमान",
                    "Saat ini", "Presente",
                    "現在", "현재", "Kini", "heden", "Nå", "obecnie", "o momento",
                    "Prezent", "настоящее время", "nu", "ปัจจุบัน ",
                    "Kasalukuyan", "Halen", "至今", "現在"]
# The word "Issued" in all supported languages
issued_all_lang = ["Vydáno", "Udstedt", "Ausgestellt", "Issued", "Expedición", "Émise le", "नव॰", "Diterbitkan",
                   "Data di rilascio",
                   "発行日", "발행일", "Dikeluarkan", "Toegekend", "Utstedt", "Wydany", "Emitido em",
                   "Дата выдачи", "Utfärdat", "ออกเมื่อ",
                   "Inisyu", "tarihinde verildi", "颁发日期", "發照日期"]
# "No Expiration Date" in all supported languages
no_experience_all_lang = ["Bez data vypršení", "Ingen udløbsdato", "Kein Ablaufdatum", "No Expiration Date",
                          "Sin fecha de vencimiento", "Sans date d’expiration", "समाप्ति की कोई तारीख नहीं",
                          "Tidak Ada Tanggal Kedaluwarsa", "Nessuna data di scadenza", "有効期限なし",
                          "만료일 없음",
                          "Tiada Tarikh Tamat Tempoh", "Geen vervaldatum", "Ingen utløpsdato",
                          "Brak daty ważności", "Sem data de expiração", "Без истечения срока действия",
                          "Inget utgångsdatum", "ไม่มีวันหมดอายุ",
                          "Walang Petsa ng Pagtatapos",
                          "Sona Erme Tarihi Yok", "长期有效", "永久有效"]
# "Expired" in all supported languages
expires_all_lang = ["Vyprší", "Udløbet", "Gültig bis", "Expired", "Vencimiento", "Expire le",
                    "को समाप्त हो गया है", "Kedaluwarsa", "Data di scadenza", "有効期限", " 만료됨", "Tamat Tempoh",
                    "Verloopt op", "Utløper", "Wygasa", "Expira em", "A expirat la", "Срок действия истек",
                    "Upphörde att gälla", "หมดอายุเมื่อ", "Mag-e-expire sa", "tarihinde süresi bitiyor", "有效期至", "過期"]
# A colon may appear; get rid of it!
def raw_dates_extractor(list_of_examples: list) -> dict:
    raw_start_date, raw_end_date, merged_start_end_dates = {}, {}, {}
    for row, datum in enumerate(list_of_examples):
        # Split on any of the separators and keep the first two parts.
        date_splitter = re.split("[·•–-]", datum[0])
        raw_start_date[f'Start_date_raw{row}'] = date_splitter[0].strip(' ')
        raw_end_date[f'End_date_raw{row}'] = date_splitter[1].strip(' ')
    # Dict union with | needs Python 3.9+.
    merged_start_end_dates = raw_start_date | raw_end_date
    return merged_start_end_dates
output:
raw_dates_extractor(dates_eg)
{'Start_date_raw0': 'Issued Feb 2018', 'Start_date_raw1': 'Oct 2021', 'Start_date_raw2': 'Jan 2019', 'Start_date_raw3': '1994', 'Start_date_raw4': 'Sep 2010', 'Start_date_raw5': 'Nov 2019', 'Start_date_raw6': 'Sep 2015', 'Start_date_raw7': 'Apr 2019', 'Start_date_raw8': 'Issued: Aug 2018', 'Start_date_raw9': 'Mar 2017', 'Start_date_raw10': 'May 2021', 'End_date_raw0': 'No Expiration Date', 'End_date_raw1': 'Present', 'End_date_raw2': 'Oct 2021', 'End_date_raw3': '2000', 'End_date_raw4': 'Sep 2015', 'End_date_raw5': 'Present', 'End_date_raw6': 'Mar 2017', 'End_date_raw7': 'Present', 'End_date_raw8': 'Expired Aug 2020', 'End_date_raw9': 'Dec 2017', 'End_date_raw10': 'Present'}
def clean_extract_dates(input_dictionary: dict, aggregate_stop_words: list):
    container_replica = {}
    for date_element in input_dictionary:
        query_req = input_dictionary[f'{date_element}']
        if ":" in query_req:
            query_req = query_req.replace(":", "")
        if "." in query_req:
            query_req = query_req.replace(".", "")
        stopwords = aggregate_stop_words
        query_words = query_req.split()
        result_words = [word for word in query_words if word not in stopwords]
        end_result = ' '.join(result_words)
        container_replica[f'{date_element}'] = end_result
    return container_replica
output:
clean_extract_dates(raw_dates_extractor(dates_eg), aggregate_lists)
{'Start_date_raw0': 'Feb 2018', 'Start_date_raw1': 'Oct 2021', 'Start_date_raw2': 'Jan 2019', 'Start_date_raw3': '1994', 'Start_date_raw4': 'Sep 2010', 'Start_date_raw5': 'Nov 2019', 'Start_date_raw6': 'Sep 2015', 'Start_date_raw7': 'Apr 2019', 'Start_date_raw8': 'Aug 2018', 'Start_date_raw9': 'Mar 2017', 'Start_date_raw10': 'May 2021', 'End_date_raw0': 'No Expiration Date', 'End_date_raw1': 'Present', 'End_date_raw2': 'Oct 2021', 'End_date_raw3': '2000', 'End_date_raw4': 'Sep 2015', 'End_date_raw5': 'Present', 'End_date_raw6': 'Mar 2017', 'End_date_raw7': 'Present', 'End_date_raw8': 'Aug 2020', 'End_date_raw9': 'Dec 2017', 'End_date_raw10': 'Present'}
aggregate_lists = issued_all_lang + no_experience_all_lang + expires_all_lang
saved_dictionary = clean_extract_dates(raw_dates_extractor(dates_eg), aggregate_lists)
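One possible next step from the cleaned strings toward the desired dictionary is sketched below. Nothing here comes from the original post: ONGOING_FLAGS, parse_raw_date and build_record are hypothetical names, the flag set is a two-entry English subset standing in for the full multilingual lists, and the month is assumed to be a single alphabetic token in front of a four-digit year.

```python
import re

# Assumption: a tiny English subset of the "Present" / "No Expiration Date" lists.
ONGOING_FLAGS = {"Present", "No Expiration Date"}


def parse_raw_date(raw):
    """Return (year, month, is_ongoing) for one cleaned raw date string."""
    if raw in ONGOING_FLAGS:
        return None, None, True
    # Optional month token, then a four-digit year, e.g. 'Feb 2018' or '1994'.
    match = re.fullmatch(r'(?:([A-Za-z]+)\s+)?(\d{4})', raw)
    if match is None:
        return None, None, False
    month, year = match.groups()
    return int(year), month, False


def build_record(start_raw, end_raw):
    """Combine a cleaned start/end pair into one result dictionary."""
    start_year, start_month, _ = parse_raw_date(start_raw)
    end_year, end_month, ongoing = parse_raw_date(end_raw)
    return {'start_year': start_year, 'start_month': start_month,
            'end_year': end_year, 'end_month': end_month,
            'is_on_going': ongoing}


print(build_record('Feb 2018', 'No Expiration Date'))
# {'start_year': 2018, 'start_month': 'Feb', 'end_year': None, 'end_month': None, 'is_on_going': True}
```

Year-only ranges such as '1994' and '2000' come back with both months set to None and is_on_going set to False.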

Removing substring from list of strings

I have a column of values as below:
array(['Mar 2018', 'Jun 2018', 'Sep 2018', 'Dec 2018', 'Mar 2019',
'Jun 2019', 'Sep 2019', 'Dec 2019', 'Mar 2020', 'Jun 2020',
'Sep 2020', 'Dec 2020'], dtype=object)
From these values I need the following output:
array(["Mar'18", "Jun'18", "Sep'18", "Dec'18", "Mar'19",
"Jun'19", "Sep'19", "Dec'19", "Mar'20", "Jun'20",
"Sep'20", "Dec'20"], dtype=object)
I have tried the following code:
df['Period'] = df['Period'].replace({'20','''})
But it wasn't converting the values. How can I do this replacement?
Any help?
Thanks
With your shown samples, please try following.
df['Period'].replace(r" \d{2}", "'", regex=True)
Output will be as follows.
0 Mar'18
1 Jun'18
2 Sep'18
3 Dec'18
4 Mar'19
5 Jun'19
6 Sep'19
7 Dec'19
8 Mar'20
9 Jun'20
10 Sep'20
11 Dec'20
Try this regex:
df['Period'].str.replace(r"\s\d{2}(\d{2})", r"'\1", regex=True)
In the replacement part, \1 refers to the capturing group, which is the last two digits in this case.
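The same capture-group replacement can be tried without pandas. A minimal sketch using re.sub from the standard library on a plain list; the sample values are a subset of the question's data.

```python
import re

periods = ['Mar 2018', 'Jun 2019', 'Dec 2020']

# \s\d{2} consumes the space and the century digits; (\d{2}) captures the
# last two digits, which \1 puts back after the apostrophe.
short = [re.sub(r"\s\d{2}(\d{2})", r"'\1", period) for period in periods]
print(short)
# ["Mar'18", "Jun'19", "Dec'20"]
```

Note that 'Dec 2020' becomes "Dec'20" here, because only the first two digits after the space are dropped rather than every occurrence of '20'.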
Your code (slightly changed so that it runs) will not get you what you need, as it removes every '20':
>>> df['Period'] = df['Period'].str.replace('20','')
Out[179]:
Period
0 Mar 18
1 Jun 18
2 Sep 18
3 Dec 18
4 Mar 19
5 Jun 19
6 Sep 19
7 Dec 19
8 Mar
9 Jun
10 Sep
11 Dec
Another way, without regex, is to use the vectorized str methods:
df['Period_refined'] = df['Period'].str[:3] + "'" + df['Period'].str[-2:]
Output
df
Period Period_refined
0 Mar 2018 Mar'18
1 Jun 2018 Jun'18
2 Sep 2018 Sep'18
3 Dec 2018 Dec'18
4 Mar 2019 Mar'19
5 Jun 2019 Jun'19
6 Sep 2019 Sep'19
7 Dec 2019 Dec'19
8 Mar 2020 Mar'20
9 Jun 2020 Jun'20
10 Sep 2020 Sep'20
11 Dec 2020 Dec'20

Concatenate ListA elements with partially matching ListB elements

Say I have two python lists as:
ListA = ['Jan 2018', 'Feb 2018', 'Mar 2018']
ListB = ['Sales Jan 2018','Units sold Jan 2018','Sales Feb 2018','Units sold Feb 2018','Sales Mar 2018','Units sold Mar 2018']
I need to get an output as:
List_op = ['Jan 2018 Sales Jan 2018 Units sold Jan 2018','Feb 2018 Sales Feb 2018 Units sold Feb 2018','Mar 2018 Sales Mar 2018 Units sold Mar 2018']
My approach so far:
res = set()
for i in ListB:
    for j in ListA:
        if j in i:
            res.add(f'{i} {j}')
print(res)
this gives me result as:
{'Units sold Jan 2018 Jan 2018', 'Sales Feb 2018 Feb 2018', 'Units sold Mar 2018 Mar 2018', 'Units sold Feb 2018 Feb 2018', 'Sales Jan 2018 Jan 2018', 'Sales Mar 2018 Mar 2018'}
which is definitely not the solution I'm looking for.
I think a regular expression could be handy here, but I'm not sure how to approach it. Any help in this regard is highly appreciated.
Thanks in advance.
Edit:
Values in ListA and ListB are not necessarily in order. Therefore, for a particular month/year value in ListA, the same month/year value from ListB has to be matched and picked for both the 'Sales' and 'Units sold' components, and these need to be concatenated.
My main goal here is to get a list that I can use later to generate a statement for a Hive query.
Added more explanation as suggested by @andrew_reece
Assuming no additional edge cases that need taking care of, your original code is not bad, just needs a slight update:
List_op = []
for a in ListA:
    combined = a
    for b in ListB:
        if a in b:
            combined += " " + b
    List_op.append(combined)
List_op
['Jan 2018 Sales Jan 2018 Units sold Jan 2018',
'Feb 2018 Sales Feb 2018 Units sold Feb 2018',
'Mar 2018 Sales Mar 2018 Units sold Mar 2018']
Supposing ListA and ListB are sorted:
ListA = ['Jan 2018', 'Feb 2018', 'Mar 2018']
ListB = ['Sales Jan 2018','Units sold Jan 2018','Sales Feb 2018','Units sold Feb 2018','Sales Mar 2018','Units sold Mar 2018']
print([v1 + " " + v2 for v1, v2 in zip(ListA, [v1 + " " + v2 for v1, v2 in zip(ListB[::2], ListB[1::2])])])
This will print:
['Jan 2018 Sales Jan 2018 Units sold Jan 2018', 'Feb 2018 Sales Feb 2018 Units sold Feb 2018', 'Mar 2018 Sales Mar 2018 Units sold Mar 2018']
In my example I firstly concatenate ListB variables together and then join ListA with this new list.
String concatenation can become expensive. In Python 3.6+, you can use more efficient f-strings within a list comprehension:
res = [f'{i} {j} {k}' for i, j, k in zip(ListA, ListB[::2], ListB[1::2])]
print(res)
['Jan 2018 Sales Jan 2018 Units sold Jan 2018',
'Feb 2018 Sales Feb 2018 Units sold Feb 2018',
'Mar 2018 Sales Mar 2018 Units sold Mar 2018']
Using itertools.islice, you can avoid the expense of creating new lists:
from itertools import islice
zipper = zip(ListA, islice(ListB, 0, None, 2), islice(ListB, 1, None, 2))
res = [f'{i} {j} {k}' for i, j, k in zipper]
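The zip-based answers rely on both lists being in matching order, which the question's edit says is not guaranteed. A sketch that instead groups ListB by the ListA value each entry ends with; the shuffled lists are made up for illustration, and it assumes every ListB entry ends with exactly one ListA value. Sorting each group puts 'Sales' before 'Units sold' only because of alphabetical order.

```python
from collections import defaultdict

# Deliberately shuffled versions of the question's lists.
ListA = ['Feb 2018', 'Jan 2018', 'Mar 2018']
ListB = ['Units sold Mar 2018', 'Sales Jan 2018', 'Sales Mar 2018',
         'Units sold Jan 2018', 'Units sold Feb 2018', 'Sales Feb 2018']

# Map each month/year value to the ListB entries that end with it.
grouped = defaultdict(list)
for b in ListB:
    for a in ListA:
        if b.endswith(a):
            grouped[a].append(b)
            break

List_op = [' '.join([a] + sorted(grouped[a])) for a in ListA]
print(List_op)
# ['Feb 2018 Sales Feb 2018 Units sold Feb 2018',
#  'Jan 2018 Sales Jan 2018 Units sold Jan 2018',
#  'Mar 2018 Sales Mar 2018 Units sold Mar 2018']
```

The output follows the order of ListA, so sort ListA first if a chronological result is needed.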

Regex for extracting all complex dates formats from a string in python

I have the following string:
dateEntries = "04-20-2009; 04/20/09; 4/20/09; 4/3/09; Mar 20, 2009; March 20, 2009; Mar. 20, 2009; Mar 20 2009; 20 Mar 2009; 20 March 2009; 2 Mar. 2009; 20 March, 2009; Mar 20th, 2009; Mar 21st, 2009; Mar 22nd, 2009; Feb 2009; Sep 2009; Oct 2010; 6/2008; 12/2009; 2009; 2010"
I want to extract all of the dates in it using a regex. As an attempt, I have written the following regex:
import re
regEx = r'(?:\d{1,2}[-/th|st|nd|rd\s]*)?(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z\s,.]*(?:\d{1,2}[-/th|st|nd|rd)\s,]*)?(?:\d{2,4})'
re.findall(regEx, dateEntries)
I was expecting this to work, but it only returns a subset of the dates:
A = ['Mar 20, 2009',
'March 20, 2009',
'Mar. 20, 2009',
'Mar 20 2009',
'20 Mar 2009',
'20 March 2009',
'2 Mar. 2009',
'20 March, 2009',
'Mar 20th, 2009',
'Mar 21st, 2009',
'Mar 22nd, 2009',
'Feb 2009',
'Sep 2009',
'Oct 2010']
I don't understand why it's not returning these dates:
B = ['04-20-2009', '04/20/09', '4/20/09', '4/3/09', '6/2008', '12/2009', '2009', '2010']
I created regEx by extending r'(?:\d{1,2}[-\s\/])?(?:\d{1,2}[-\/\s])?(?:\d{2,4})', which works well for set B. But regEx is not able to produce A+B.
Can anyone help me build a regex that extracts all the dates mentioned in dateEntries?
NOTE: I want to solve this using regex only.
You are just missing a single ? after the (?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) group to mark it as optional. Additionally, I added a + after each of the last two groups to make sure the regex doesn't split dates like "20 March 2009" into two different dates.
The full code:
import re
regEx = r'(?:\d{1,2}[-/th|st|nd|rd\s]*)?(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)?[a-z\s,.]*(?:\d{1,2}[-/th|st|nd|rd)\s,]*)+(?:\d{2,4})+'
dateEntries = "04-20-2009; 04/20/09; 4/20/09; 4/3/09; Mar 20, 2009; March 20, 2009; Mar. 20, 2009; Mar 20 2009; 20 Mar 2009; 20 March 2009; 2 Mar. 2009; 20 March, 2009; Mar 20th, 2009; Mar 21st, 2009; Mar 22nd, 2009; Feb 2009; Sep 2009; Oct 2010; 6/2008; 12/2009; 2009; 2010"
result = re.findall(regEx, dateEntries)
print(result)
If a date has leading whitespace, the result will have it too. If you go on to use the date strings, you can remove it, for example with the .strip() method.
Your regex pattern is very hard to read. Please build your regex pattern from simple building blocks; that makes the code a lot more readable.
import re
import calendar
full_months = [month for month in calendar.month_name if month]
short_months = [d[:3] for d in full_months]
months = '|'.join(short_months + full_months)
sep = r'[.,]?\s+' # separator
day = r'\d+'
year = r'\d+'
day_or_year = r'\d+(?:\w+)?'
r = re.compile(rf'(?:{day}{sep})?(?:{months}){sep}{day_or_year}(?:{sep}{year})?')
r.findall(dateEntries)
# ['Mar 20, 2009', 'March 20, 2009', 'Mar. 20, 2009', 'Mar 20 2009', '20 Mar 2009', '20 March 2009', '2 Mar. 2009', '20 March, 2009', 'Mar 20th, 2009', 'Mar 21st, 2009', 'Mar 22nd, 2009', 'Feb 2009', 'Sep 2009', 'Oct 2010']
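The building-block pattern above only covers dates that contain a month name. A sketch, not part of the answer, that adds an alternative for the purely numeric forms from the question; textual, numeric and combined are hypothetical names, and the pattern has only been checked against the sample string, so treat it as a starting point.

```python
import calendar
import re

dateEntries = ("04-20-2009; 04/20/09; 4/20/09; 4/3/09; Mar 20, 2009; "
               "March 20, 2009; Mar. 20, 2009; Mar 20 2009; 20 Mar 2009; "
               "20 March 2009; 2 Mar. 2009; 20 March, 2009; Mar 20th, 2009; "
               "Mar 21st, 2009; Mar 22nd, 2009; Feb 2009; Sep 2009; Oct 2010; "
               "6/2008; 12/2009; 2009; 2010")

# Same building blocks as in the answer above.
full_months = [month for month in calendar.month_name if month]
short_months = [month[:3] for month in full_months]
months = '|'.join(short_months + full_months)
sep = r'[.,]?\s+'
day = r'\d+'
year = r'\d+'
day_or_year = r'\d+(?:\w+)?'
textual = rf'(?:{day}{sep})?(?:{months}){sep}{day_or_year}(?:{sep}{year})?'

# Numeric forms, most specific first: 04-20-2009 or 4/3/09, then 6/2008, then 2009.
numeric = r'\d{1,2}[-/]\d{1,2}[-/]\d{2,4}|\d{1,2}/\d{4}|\d{4}'

combined = re.compile(rf'(?:{textual})|(?:{numeric})')
found = combined.findall(dateEntries)
print(len(found))  # 22
```

The textual alternative is tried first at each position, so a bare year is only matched on its own when no month name precedes it.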
Try Regex:
^(?:\d{1,2}(?:(?:-|/)|(?:th|st|nd|rd)?\s))?(?:(?:(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)(?:(?:-|/)|(?:,|\.)?\s)?)?(?:\d{1,2}(?:(?:-|/)|(?:th|st|nd|rd)?\s))?)(?:\d{2,4})$
You can try the following regex
(?:\d{1,2}[-/th|st|nd|rd\s]*)?(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)?[a-z\s,.]*(?:\d{1,2}[-/th|st|nd|rd)\s,]*)+(?:\d{2,4})+
